Outpainting I - Controlnet version #7482
Replies: 12 comments 44 replies
- Awesome! Thank you for sharing your knowledge! This is super educational and tremendously valuable to our community :)
- Hello, great guide. Is this different from what we have in the ControlNet extension for Automatic1111? Can we replicate this with Mikubill's ControlNet in Auto1111? Has anyone tested it?
- @asomoza soooo cool, with your pipeline I think it is possible to achieve the same quality as Fooocus or even better.
- Try to use this pipeline, which accepts a mask so you can keep the original area unchanged: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_inpaint.py
-
We should fuse them, this will have the best effect.In my opinion, they can coexist. |
Beta Was this translation helpful? Give feedback.
- Nice work! I noticed a problem: the source image gets significantly changed (especially the colors). After all, the IP adapter can't perfectly retain the image information, which makes the image look unnatural after the source image is pasted back. Fooocus does not have this problem.
- @asomoza This pipeline can use both SDXL inpainting and ControlNet: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint_sd_xl.py, and by inserting the code from https://github.com/exx8/differential-diffusion/blob/main/SDXL/diff_pipe.py#L976 you can use differential diffusion too.
- That's wonderful!!!! But I encountered a little bug: I used an image of a car with a white background to generate a new background, but I found it is not able to retain every detail of the car; in particular, the car's logo always gets changed and destroyed. Why does this happen? How can I solve it? Thank you!
- Here is part 2 of the outpainting guide: https://huggingface.co/blog/OzzyGT/outpainting-differential-diffusion
- I've been trying to add this to my personal project but it's eating up my VRAM like crazy. I am not able to figure out how I can streamline all 3 pipelines into a single one. This is my current schema:
Now, the thing is that the flow of the whole process is pipeline_1 to pipeline_3, and consecutive pipelines are just modifications of the previous ones. If I modify the pipelines in place, I won't be able to use them on the next image (since the image needs to pass through pipeline_1, which has now been modified into pipeline_3 by the previous pass). If I make 3 separate pipelines, it just eats up my VRAM (25 GB, RTX 3090). Is there a way I can make a single reusable block? @asomoza
- @asomoza Thanks for this amazing guide. I have gone through all three of your blog posts and I have observed that all of the workflows struggle to generate human faces and hands. For example, if I take an image of a piece of clothing and try to outpaint a human into it, it messes up the facial structure and even the anatomy. Any workaround for that?
- Amazing work! Thank you so much 🙏 Would it be possible to provide a similar guide to obtain the equivalent of Generative Fill from Photoshop? It would be super valuable!
-
Outpainting with controlnet
There are at least three methods that I know of to do outpainting, each with different variations and steps, so I'll post a series of outpainting articles and try to cover all of them.
Outpainting with controlnet requires using a mask, so this method only works when you can paint a white mask around the area you want to expand. With this method it is not necessary to prepare the area beforehand, but it has the limitation that the image can only be as big as your VRAM allows.
For this case I'll use a wolf image that was provided by @Laidawang in this comment:
This time I won't do a comparison with the popular applications, because the intention of this post is not to write about other web UIs but to show an example of how to do this with diffusers. I did use them to have a baseline though.
The starting prompt is "a wolf playing basketball" and I'll use the Juggernaut V9 model.
There's a controlnet for SDXL trained for inpainting by destitech: https://huggingface.co/destitech/controlnet-inpaint-dreamer-sdxl. It's an early alpha version but I think it works well for most cases.
This controlnet is really easy to use: you just need to paint white the parts you want to replace, so in this case I'm going to paint white the transparent part of the image. To paint the alpha channel of an image white, I use this code:
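A minimal Pillow sketch of that step (the filename is a placeholder, assuming an RGBA source image):

```python
# Minimal sketch: composite the RGBA image over a white background so the
# transparent pixels become white. "wolf.png" is a placeholder filename.
from PIL import Image

image = Image.open("wolf.png").convert("RGBA")
white_background = Image.new("RGBA", image.size, (255, 255, 255, 255))

# Compositing over white turns every transparent pixel white and leaves the wolf intact.
image = Image.alpha_composite(white_background, image).convert("RGB")
image.save("wolf_white_background.png")
```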
With this we get the same image but with a white background:
This image is 720x720 px. SDXL works better with 1024x1024 images, so the generation will be upscaled; you can use the smaller image, but you'll probably get a lower quality result.
The conditioning scale affects how much of the original image will be preserved. Since this is an outpaint, it's safe to use higher values; for inpainting and complex images it's better to use lower values, around 0.5.
With the same seed I get these results:
I always prefer to allow the model a little freedom so it can adjust tiny details to make the image more coherent, so for this case I'll use 0.9.
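As a rough sketch of this setup in diffusers, continuing from the white-background image above (the Juggernaut Hub ID, the fp16 variants and the seed are assumptions on my part):

```python
# Sketch of the controlnet setup; `image` is the white-background wolf from the
# previous snippet. Hub ID for Juggernaut V9, fp16 variants and seed are assumed.
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "destitech/controlnet-inpaint-dreamer-sdxl",
    torch_dtype=torch.float16,
    variant="fp16",
)

pipeline = StableDiffusionXLControlNetPipeline.from_pretrained(
    "RunDiffusion/Juggernaut-XL-v9",
    controlnet=controlnet,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Upscale the 720x720 source so the generation runs at SDXL's preferred 1024x1024.
control_image = image.resize((1024, 1024))

generated = pipeline(
    prompt="a wolf playing basketball",
    image=control_image,                # the white areas are the ones that get replaced
    controlnet_conditioning_scale=0.9,  # high enough to preserve the wolf, with a little freedom
    generator=torch.Generator(device="cuda").manual_seed(0),
).images[0]
```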
At this point I think we are at the level of other solutions, but let's say we want the wolf to look just like the original image, for that I want to give the model more context of the wolf and where I want it to be so I'll use an IP adapter for that.
There's a little trick that works for me: I use the generated image I want as a base, paint the mask of the wolf over it, and then use this as an attention mask for the IP Adapter.
The process I use to generate the mask is like this:
Doing this with something like GIMP takes me no more than a minute.
I'm using the IP Adapter Plus with a scale of 0.4, and with these settings we now get these results:
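A rough sketch of the masked IP Adapter setup, continuing from the controlnet pipeline above (the mask filename is a placeholder, and the exact arguments may differ slightly across diffusers versions):

```python
# Sketch: load the IP Adapter Plus weights and apply them only inside the
# hand-painted wolf mask. "wolf_ip_mask.png" is a placeholder filename.
from diffusers.image_processor import IPAdapterMaskProcessor
from diffusers.utils import load_image

pipeline.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter-plus_sdxl_vit-h.safetensors",
    image_encoder_folder="models/image_encoder",  # the "plus" ViT-H weights use this encoder
)
pipeline.set_ip_adapter_scale(0.4)

# The mask is white where the wolf should appear and black everywhere else;
# preprocess it to the output resolution (shape [1, 1, H, W], one entry per adapter).
wolf_mask = load_image("wolf_ip_mask.png")
ip_masks = IPAdapterMaskProcessor().preprocess(wolf_mask, height=1024, width=1024)

generated = pipeline(
    prompt="a wolf playing basketball",
    image=control_image,
    ip_adapter_image=image,             # the white-background wolf gives the model context
    controlnet_conditioning_scale=0.9,
    cross_attention_kwargs={"ip_adapter_masks": [ip_masks]},
    generator=torch.Generator(device="cuda").manual_seed(0),
).images[0]
```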
My guess as to why this works is that since we’re drawing a shape of a wolf and giving it an input image of a wolf, the model tries to maintain the shape, position, and coherence. In short, we’re giving the model more context for it to generate the final image, and now it looks a lot more like the original.
For example, if we don’t use a mask for this ip adapter we get this:
Now we're at the point where the image looks good, but I always want more, so let's improve it with the prompt.
I'll change the prompt to this:
"high quality photo of a wolf playing basketball, highly detailed, professional, dramatic ambient light, cinematic, dynamic background, focus"
I'll also give the controlnet a little more freedom with `control_guidance_end=0.9` so it can finish the details without restrictions.
Finally, if you've ever worked with compositing images or video before, you'll know that it is a common practice to apply a filter to the whole composition to unify the final look. This is true for stable diffusion too, and it sometimes hides the seams better, if there are any. For this, I'll do a final pass over the whole image with an image-to-image pipeline.
Diffusers allows you to change the pipeline while preserving the loaded models, which is what we need here.
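One way this can look (a sketch, not necessarily how it was done here) is to build the image-to-image pipeline directly from the components that are already in memory:

```python
# Sketch: reuse the already-loaded components for the img2img pass instead of
# loading the checkpoint a second time. The modules are already on the GPU.
from diffusers import StableDiffusionXLImg2ImgPipeline

image_pipeline = StableDiffusionXLImg2ImgPipeline(
    vae=pipeline.vae,
    text_encoder=pipeline.text_encoder,
    text_encoder_2=pipeline.text_encoder_2,
    tokenizer=pipeline.tokenizer,
    tokenizer_2=pipeline.tokenizer_2,
    unet=pipeline.unet,
    scheduler=pipeline.scheduler,
)
```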
The VAE decoding is a lossy process, so every time we encode or decode we're losing details and lowering the quality of the image. To prevent this, we need to stay in the latent space as much as possible.
Diffusers allows this if you pass `output_type="latent"` to the pipeline. We then feed the latents to the image-to-image pipeline, but before that I also want to give it a more cinematic look, so I'll change the prompt again:
prompt = "cinematic film still of a wolf playing basketball, highly detailed, high budget hollywood movie, cinemascope, epic, gorgeous, film grain"
This should be just a quick pass, so I'll set the steps to 30 and the strength to 0.2.
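A sketch of that latent handoff, continuing from the pipelines above (the prompts are the ones from this post; dropping the IP Adapter for the final pass is an assumption, you could also keep it and pass the IP image again):

```python
# First pass: generate with the controlnet + masked IP Adapter, keeping the
# result as latents instead of decoding it.
latents = pipeline(
    prompt="high quality photo of a wolf playing basketball, highly detailed, "
    "professional, dramatic ambient light, cinematic, dynamic background, focus",
    image=control_image,
    ip_adapter_image=image,
    controlnet_conditioning_scale=0.9,
    control_guidance_end=0.9,           # let the model finish the last steps unrestricted
    cross_attention_kwargs={"ip_adapter_masks": [ip_masks]},
    output_type="latent",               # stay in latent space, skip the VAE round trip
    generator=torch.Generator(device="cuda").manual_seed(0),
).images

# The UNet is shared, so drop the IP Adapter layers before the plain img2img pass
# (an assumption; alternatively keep them and pass ip_adapter_image here too).
pipeline.unload_ip_adapter()

# Final pass: a light image-to-image refinement over the latents to unify the look.
final_image = image_pipeline(
    prompt="cinematic film still of a wolf playing basketball, highly detailed, "
    "high budget hollywood movie, cinemascope, epic, gorgeous, film grain",
    image=latents,
    strength=0.2,
    num_inference_steps=30,
).images[0]
```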
These are the final results:
In my opinion these images are as good as or better than the outpainting done with other UIs, and hopefully they will help people better understand what you can do with diffusers.
As a bonus, there's a neat little trick. Earlier I used the IP Adapter with a mask to give more of the initial image to the generation. However, it can also be used without a mask. This has the effect of feeding the model more context, making it better able to guess the rest of the image, for example with just the prompt "high quality":
So if I do the same as before, but without the IP Adapter mask, this is the final result:
Don't ask me why the Eiffel Tower is there ^^
This is the code: