Weird artifacts with SD controlnet inpaint pipeline #9922
Unanswered
SantiagoJN
asked this question in Q&A
Replies: 1 comment 4 replies
-
Hi, using the inpainting pipeline with a non-inpaint model is not optimal, so it's always better to use an inpainting model, especially when you're further restricting the generation with a ControlNet. The inpainting pipeline is just a way of using a mask with a source image; inpainting models are fine-tuned for this specific task, so they complement each other.
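For example, something along these lines (a minimal sketch; the checkpoint and ControlNet identifiers and the placeholder file names are only examples, not taken from your setup):

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline
from diffusers.utils import load_image

# Example depth ControlNet (use the normal-map one if that's what you need)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)

# A base model fine-tuned for inpainting, instead of a plain text-to-image checkpoint
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # example inpainting checkpoint
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Placeholder file names: your own source image, mask and conditioning image
source = load_image("source.png")
mask = load_image("mask.png")
depth = load_image("depth.png")

result = pipe(
    prompt="a pile of stones",          # example prompt
    image=source,                       # original (unmasked) source image
    mask_image=mask,                    # white = region to repaint
    control_image=depth,                # ControlNet conditioning image
    num_inference_steps=30,
    controlnet_conditioning_scale=0.9,
).images[0]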
-
Hi,
I'm using an IP-Adapter to perform image conditioning with Stable Diffusion 1.5 (using the RealisticVision 5.1 checkpoint) and the ControlNet weights from lllyasviel, more specifically the ones for depth and normal maps.
The thing is that, when using these ControlNets with the StableDiffusionControlNetPipeline, I manage to get decent results, but as soon as I switch to the StableDiffusionControlNetInpaintPipeline, all I get are weird artifacts. Below I include how I'm calling the pipeline and the results I'm talking about.
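(For context, my setup looks roughly like this. It's a simplified sketch that uses diffusers' built-in IP-Adapter loading rather than the IP-Adapter repo's wrapper, and the model identifiers are approximate:)

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Depth ControlNet from lllyasviel (identifier approximate)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)

# RealisticVision 5.1 as the SD 1.5 base model (identifier approximate)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# SD 1.5 IP-Adapter weights; the IP-Adapter repo's wrapper (the ip_model used
# below) plays the same role
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")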
ControlNet with depth
Using the conditioning images:
I get the following results:
By calling the pipeline with
images = ip_model.generate(pil_image=stones, image=depth_map, num_samples=4, num_inference_steps=30, guidance_scale=5, seed=42, controlnet_conditioning_scale=0.9)
where the generate function is the IP-Adapter's, pil_image=stones passes the stones image used for image conditioning, and depth_map is the strawberries' depth map.
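(With diffusers' native IP-Adapter support, I believe the equivalent call would look roughly like this, based on the simplified setup above; a sketch, not my exact code:)

generator = torch.Generator("cuda").manual_seed(42)
images = pipe(
    prompt="",                          # the image prompt carries the conditioning
    ip_adapter_image=stones,            # image prompt (the stones)
    image=depth_map,                    # ControlNet conditioning (strawberries' depth map)
    num_images_per_prompt=4,
    num_inference_steps=30,
    guidance_scale=5,
    controlnet_conditioning_scale=0.9,
    generator=generator,
).images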
ControlNet with normals
Using a similar setup, but conditioning with a normal map instead,
I get:
ControlNet inpaint with normals
Now we have more ingredients:
From left to right, the images show: stones (the image used for conditioning), the source image, masked_image (the source after applying the mask), normals, and mask. With this, I get the following results:
These pretty much ignore the normal map, even with a conditioning scale of 0.9. The code I used to obtain these results is the following:
images = ip_model.generate(pil_image=stones, num_samples=4, num_inference_steps=50, control_image=normals, guidance_scale=5.0, clip_skip=1, seed=42, image=masked_image, mask_image=mask, controlnet_conditioning_scale=0.9)
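(For reference, the rough diffusers-native equivalent, assuming an inpaint_pipe built as a StableDiffusionControlNetInpaintPipeline with the same base model, ControlNet and IP-Adapter; again a sketch, not my exact code:)

generator = torch.Generator("cuda").manual_seed(42)
images = inpaint_pipe(
    prompt="",
    ip_adapter_image=stones,            # image prompt
    image=masked_image,                 # source image passed to the inpaint pipeline
    mask_image=mask,                    # region to repaint
    control_image=normals,              # ControlNet conditioning (normal map)
    num_images_per_prompt=4,
    num_inference_steps=50,
    guidance_scale=5.0,
    clip_skip=1,
    controlnet_conditioning_scale=0.9,
    generator=generator,
).images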
ControlNet inpaint with depth
Similarly, I use the following setup to condition the diffusion for depth, using an inpainting pipeline:
Getting the results:
Using the code
images = ip_model.generate(pil_image=stones, num_samples=4, num_inference_steps=50, control_image=depth, seed=42, image=masked_image, mask_image=mask, controlnet_conditioning_scale=0.9)
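(In case it matters, a masked image like the one shown above can be produced from the source and the mask roughly like this; the file names are placeholders:)

import numpy as np
from PIL import Image

source = Image.open("source.png").convert("RGB")   # placeholder paths
mask = Image.open("mask.png").convert("L")          # white = region to repaint

# Blank out the masked region of the source to get the "masked image"
source_np = np.array(source)
mask_np = np.array(mask) > 127
source_np[mask_np] = 0
masked_image = Image.fromarray(source_np)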
As far as I've seen in the documentation and other discussions, using a checkpoint that isn't specialized in inpainting as the base model shouldn't be a problem, since the pipeline I'm using already handles that part.
I'm pretty new to this, so I wouldn't be surprised if I'm making a very basic mistake; any suggestion/correction would be greatly appreciated.
Thanks!