-
For reference, this is the code I used:

    import torch
    from diffusers import AutoencoderKL, DPMSolverMultistepScheduler, StableDiffusionXLPipeline
    from diffusers.models import ImageProjection
    from diffusers.utils import load_image


    def encode_image(
        image_encoder,
        feature_extractor,
        image,
        device,
        num_images_per_prompt,
        output_hidden_states=None,
        negative_image=None,
    ):
        dtype = next(image_encoder.parameters()).dtype

        if not isinstance(image, torch.Tensor):
            image = feature_extractor(image, return_tensors="pt").pixel_values
        image = image.to(device=device, dtype=dtype)

        if output_hidden_states:
            image_enc_hidden_states = image_encoder(image, output_hidden_states=True).hidden_states[-2]
            image_enc_hidden_states = image_enc_hidden_states.repeat_interleave(num_images_per_prompt, dim=0)

            if negative_image is None:
                uncond_image_enc_hidden_states = image_encoder(
                    torch.zeros_like(image), output_hidden_states=True
                ).hidden_states[-2]
            else:
                if not isinstance(negative_image, torch.Tensor):
                    negative_image = feature_extractor(negative_image, return_tensors="pt").pixel_values
                negative_image = negative_image.to(device=device, dtype=dtype)
                uncond_image_enc_hidden_states = image_encoder(negative_image, output_hidden_states=True).hidden_states[-2]

            uncond_image_enc_hidden_states = uncond_image_enc_hidden_states.repeat_interleave(num_images_per_prompt, dim=0)
            return image_enc_hidden_states, uncond_image_enc_hidden_states
        else:
            image_embeds = image_encoder(image).image_embeds
            image_embeds = image_embeds.repeat_interleave(num_images_per_prompt, dim=0)
            uncond_image_embeds = torch.zeros_like(image_embeds)
            return image_embeds, uncond_image_embeds


    @torch.no_grad()
    def prepare_ip_adapter_image_embeds(
        unet,
        image_encoder,
        feature_extractor,
        ip_adapter_image,
        do_classifier_free_guidance,
        device,
        num_images_per_prompt,
        ip_adapter_negative_image=None,
    ):
        if not isinstance(ip_adapter_image, list):
            ip_adapter_image = [ip_adapter_image]

        if len(ip_adapter_image) != len(unet.encoder_hid_proj.image_projection_layers):
            raise ValueError(
                f"`ip_adapter_image` must have same length as the number of IP Adapters. Got {len(ip_adapter_image)} images and {len(unet.encoder_hid_proj.image_projection_layers)} IP Adapters."
            )

        image_embeds = []
        for single_ip_adapter_image, image_proj_layer in zip(
            ip_adapter_image, unet.encoder_hid_proj.image_projection_layers
        ):
            output_hidden_state = not isinstance(image_proj_layer, ImageProjection)
            single_image_embeds, single_negative_image_embeds = encode_image(
                image_encoder,
                feature_extractor,
                single_ip_adapter_image,
                device,
                1,
                output_hidden_state,
                negative_image=ip_adapter_negative_image,
            )
            single_image_embeds = torch.stack([single_image_embeds] * num_images_per_prompt, dim=0)
            single_negative_image_embeds = torch.stack([single_negative_image_embeds] * num_images_per_prompt, dim=0)

            if do_classifier_free_guidance:
                single_image_embeds = torch.cat([single_negative_image_embeds, single_image_embeds])
                single_image_embeds = single_image_embeds.to(device)

            image_embeds.append(single_image_embeds)

        return image_embeds


    vae = AutoencoderKL.from_pretrained(
        "madebyollin/sdxl-vae-fp16-fix",
        torch_dtype=torch.float16,
    ).to("cuda")

    pipeline = StableDiffusionXLPipeline.from_pretrained(
        "RunDiffusion/Juggernaut-XL-v9",
        torch_dtype=torch.float16,
        vae=vae,
        variant="fp16",
    ).to("cuda")
    pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
    pipeline.scheduler.config.use_karras_sigmas = True

    pipeline.load_ip_adapter(
        "h94/IP-Adapter",
        subfolder="sdxl_models",
        weight_name="ip-adapter-plus_sdxl_vit-h.safetensors",
        image_encoder_folder="models/image_encoder",
    )
    pipeline.set_ip_adapter_scale(0.7)

    ip_image = load_image("source.png")
    negative_ip_image = load_image("noise.png")

    image_embeds = prepare_ip_adapter_image_embeds(
        unet=pipeline.unet,
        image_encoder=pipeline.image_encoder,
        feature_extractor=pipeline.feature_extractor,
        ip_adapter_image=[[ip_image]],
        do_classifier_free_guidance=True,
        device="cuda",
        num_images_per_prompt=1,
        ip_adapter_negative_image=negative_ip_image,
    )

    prompt = "cinematic photo of a cyborg in the city, 4k, high quality, intricate, highly detailed"
    negative_prompt = "blurry, smooth, plastic"

    image = pipeline(
        prompt=prompt,
        negative_prompt=negative_prompt,
        ip_adapter_image_embeds=image_embeds,
        guidance_scale=6.0,
        num_inference_steps=25,
        generator=torch.Generator(device="cpu").manual_seed(1556265306),
    ).images[0]
    image.save("result.png")
-
this is awesome :)
-
Monochrome noise doesn't work very well in my tests. An interesting way of adding noise might be as follows:
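One possible sketch of that idea, assuming a PIL input image, with a `factor` that controls how much seeded, colored noise gets blended in (the function, the default factor, and the seed below are my own illustrative choices):

```python
import numpy as np
import torch
from PIL import Image

def make_noisy_negative(image, factor=0.3, seed=42):
    """Blend seeded, per-channel (colored) noise into an image for use as an IP Adapter negative."""
    pixels = torch.from_numpy(np.array(image.convert("RGB"))).float() / 255.0
    generator = torch.Generator().manual_seed(seed)         # seed the noise for repeatable results
    noise = torch.rand(pixels.shape, generator=generator)   # colored noise, not monochrome
    blended = pixels * (1.0 - factor) + noise * factor      # mix the noise into the image by `factor`
    blended = (blended.clamp(0.0, 1.0) * 255).to(torch.uint8).numpy()
    return Image.fromarray(blended)

# e.g. negative_ip_image = make_noisy_negative(ip_image, factor=0.3, seed=42)
```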
You can further calibrate the effect by multiplying the noise by the factor again at the end, for an even more subtle result. P.S.: it is important to remember that the noise generation needs to be seeded if you want repeatable results.
-
what a great post! thank you for sharing this. I'm adding the link to the initial discussion on the IP Adapter negative image here: #6318 (comment). To summarize, based on my understanding: the negative image lets you generate images that vary more from the original IP Adapter image. You can also do that by lowering the `ip_adapter_scale`, but with the negative image you have more control over the generation, e.g. you can preserve more of the composition of the original image and only lose the details you want to modify. Super neat technique :)
-
I'm starting this discussion to document and share some examples of this technique with IP Adapters.
First of all, this wasn't my initial idea, so thanks to @cubiq and his repository https://github.com/cubiq/ComfyUI_IPAdapter_plus. It was from that repository and his comments that I started playing with this idea. AFAIK there's no mention of this in the official repository or the paper.
For this discussion I'm only applying noise and images that can be generated automatically; there's a lot more that can be done with manual intervention, but that would be better suited to a UI.
Let's start with an initial image without a prompt. I'm using IP Adapter Plus at a 1.0 scale, and all settings stay the same with a fixed seed, so this is the base I start from:
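For reference, that base generation can be reproduced roughly like this with the reference code shared earlier in the thread (the empty prompt and the reuse of the same fixed seed are assumptions on my part):

```python
pipeline.set_ip_adapter_scale(1.0)

base = pipeline(
    prompt="",
    ip_adapter_image=ip_image,  # no text prompt, no negative image
    num_inference_steps=25,
    generator=torch.Generator(device="cpu").manual_seed(1556265306),
).images[0]
base.save("base.png")
```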
The quality isn't that great right now. I'm using the latest Juggernaut model, and what I'm trying to achieve is a more cinematic result and a change to the initial setting of the image, so I'll add these prompts:
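These are the prompts from the reference code shared earlier in the thread:

```python
prompt = "cinematic photo of a cyborg in the city, 4k, high quality, intricate, highly detailed"
negative_prompt = "blurry, smooth, plastic"
```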
The result with these prompts looks like this:
The quality still isn't great, and even with the prompt there's almost no city in the image. So now I'll bring in the IP Adapter negative image to try to improve it, starting with Mandelbrot noise, one normal and one inverted.
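A minimal sketch of one way to produce a Mandelbrot image and its inverted version (the resolution, viewport, and iteration count here are arbitrary choices of mine, not necessarily what was used for the images above):

```python
import numpy as np
from PIL import Image, ImageOps

def mandelbrot_image(width=1024, height=1024, max_iter=100):
    # Escape-time Mandelbrot render mapped to grayscale, then converted to RGB
    x = np.linspace(-2.0, 1.0, width)
    y = np.linspace(-1.5, 1.5, height)
    c = x[np.newaxis, :] + 1j * y[:, np.newaxis]
    z = np.zeros_like(c)
    escape = np.zeros(c.shape, dtype=np.float32)
    for i in range(max_iter):
        mask = np.abs(z) <= 2.0
        z[mask] = z[mask] ** 2 + c[mask]
        escape[mask] = i
    escape = (escape / escape.max() * 255).astype(np.uint8)
    return Image.fromarray(escape).convert("RGB")

mandelbrot_negative = mandelbrot_image()
inverted_mandelbrot_negative = ImageOps.invert(mandelbrot_negative)
```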
I like the result with the inverted Mandelbrot better, but it still doesn't have much of a city, so I had to lower the IP Adapter scale to 0.5. With that, and without a ControlNet, I lose the composition, position, and pose of the cyborg.
Even without those, I still think it looks good, so here's the result:
To compare the results with another type of noise, I tried Gaussian:
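A sketch of how a Gaussian noise negative could be generated (the mean, sigma, and seed are my own illustrative choices; it reuses `ip_image` from the reference code for the size):

```python
import numpy as np
from PIL import Image

rng = np.random.default_rng(42)  # seeded so the negative image is reproducible
noise = rng.normal(loc=0.5, scale=0.25, size=(ip_image.height, ip_image.width, 3))
gaussian_negative = Image.fromarray((np.clip(noise, 0.0, 1.0) * 255).astype(np.uint8))
```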
Results:
All this got me thinking: what if I start feeding it other types of images as negatives, for example a blurred image:
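A sketch of that idea, assuming the negative is simply a blurred copy of the source IP Adapter image (the blur radius is an arbitrary choice):

```python
from PIL import ImageFilter

# Feed a blurred copy of the source image back in as the negative
blurred_negative = ip_image.filter(ImageFilter.GaussianBlur(radius=12))
```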
This gave me a really sharp image of the cyborg; at a scale of 0.5 nothing of the original image was left, so I used a scale of 1.0.
That again got me thinking: if a blurred image makes the result sharper, what about colors? So I tested passing the isolated color channels of the image as negatives:
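A sketch of isolating each color channel with PIL, with each result used as a separate negative image:

```python
from PIL import Image

r, g, b = ip_image.convert("RGB").split()
empty = Image.new("L", ip_image.size, 0)

# One negative per isolated channel
red_only = Image.merge("RGB", (r, empty, empty))
green_only = Image.merge("RGB", (empty, g, empty))
blue_only = Image.merge("RGB", (empty, empty, b))
```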
That wasn't the result I was expecting, but I like the blue and green ones. As a final test, since I liked those two and wanted a sharper image, I used a mix of those three:
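A sketch of one way such a mix could be built, assuming "those three" are the blue channel, the green channel, and the blurred image from above, and that the mix is a simple per-pixel average:

```python
import numpy as np
from PIL import Image

parts = [blue_only, green_only, blurred_negative]
stacked = np.stack([np.array(p, dtype=np.float32) for p in parts])
mixed_negative = Image.fromarray(stacked.mean(axis=0).astype(np.uint8))
```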
With this I got the image I was looking for. It still needs some inpainting to fix details, but IMO it looks really good for something generated with just a single IP Adapter image:
Without masks and ControlNet, the IP Adapter here works as a creative starting point (instead of prompts, you use an image to feed the model your initial idea). If you want to do the same while preserving more of the initial image, you'll need to use them.