Sample latents from VAE to generate close images #9327
Replies: 2 comments 11 replies
-
I tested your code with some modifications, updating the noise and the scale factor.

```python
import torch
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessor
from diffusers.models.autoencoders.vae import DiagonalGaussianDistribution
from PIL import Image
from torchvision import transforms
from torchvision.transforms.functional import pil_to_tensor

# 2: assume I'm interested in getting close variations of an image, using SDXL's VAE.
# Instead of inferring, e.g. doing an img2img or inpaint,
# I can use the latent distribution inferred by the VAE, sample from it and decode back to pixel space.

# Instantiate SDXL's VAE
with torch.no_grad():
    # vae: AutoencoderKL = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix")
    vae: AutoencoderKL = AutoencoderKL.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae"
    )
    vae.to(dtype=torch.float32)  # otherwise it produces NaNs, even madebyollin's VAE
    vae.to(device="cuda")
    assert vae.device == torch.device("cuda:0")
    assert vae.dtype == torch.float32

    # Load the image and map it to [-1, 1], the range the VAE expects
    img = Image.open("avenger.jpg")  # Replace with your actual image path
    img_tensor = pil_to_tensor(img).unsqueeze(0) / 255.0 * 2.0 - 1.0
    img_tensor = img_tensor.to(device=vae.device, dtype=vae.dtype)

    # Get the inferred latent distribution
    latent_dist: DiagonalGaussianDistribution = vae.encode(img_tensor, return_dict=False)[0]
    print(f"{latent_dist.mean.shape=} {latent_dist.std.shape=} {latent_dist.mean.mean()=} {latent_dist.std.mean()=}")
    assert not latent_dist.mean.isnan().any()
    assert not latent_dist.std.isnan().any()
    assert latent_dist.deterministic is False

    # -- Try with a scale factor and added noise --
    scale_factor = 5.0  # increases the variations
    noise_strength = 0.2  # added noise helps further perturb the latent space
    noise = noise_strength * torch.randn_like(latent_dist.mean)

    # Generate new latents with added noise and scaling
    sample_1 = latent_dist.mean + scale_factor * latent_dist.std * torch.randn_like(latent_dist.mean) + noise
    sample_2 = latent_dist.mean + scale_factor * latent_dist.std * torch.randn_like(latent_dist.mean) + noise
    assert not sample_1.isnan().any()
    assert not sample_2.isnan().any()
    assert (sample_1 != sample_2).any(), "samples should be different"
    print(f"{sample_1.shape=}")
    assert vae.dtype == sample_1.dtype
    assert vae.device == sample_1.device

    # Decode the sampled latents back to images
    img_1: torch.Tensor = vae.decode(sample_1).sample  # decode the first variation
    img_1 = (img_1.squeeze(0).cpu().detach() * 0.5 + 0.5).clamp(0, 1)  # back to [0, 1]
    assert (img_1 != img_tensor.cpu().detach()).any(), "generated image should be different from the input image"

    # Save first variation
    img_1_pil = transforms.ToPILImage()(img_1)
    img_1_pil.save("sample2_1.png")

    # Save second variation
    img_2: torch.Tensor = vae.decode(sample_2).sample
    img_2 = (img_2.squeeze(0).cpu().detach() * 0.5 + 0.5).clamp(0, 1)
    img_2_pil = transforms.ToPILImage()(img_2)
    img_2_pil.save("sample2_2.png")

    # -- Try interpolation between latent vectors --
    t = torch.rand(1).item()  # random interpolation factor between 0 and 1
    interpolated_sample = (1 - t) * sample_1 + t * sample_2

    # Decode the interpolated sample into an image
    img_interpolated = vae.decode(interpolated_sample).sample
    img_interpolated = (img_interpolated.squeeze(0).cpu().detach() * 0.5 + 0.5).clamp(0, 1)
    img_interpolated_pil = transforms.ToPILImage()(img_interpolated)
    img_interpolated_pil.save("interpolated_variation2.png")

print("Done")
```
It sure does work well: it changes eyes, smiles, etc. If you want to see more significant changes, you may need to consider the dimensionality of the latent space. Also, VAEs are good at generating smooth variations; they may not be as effective at producing large changes unless trained for that.
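If it helps, here is a minimal sketch of how the strength of the variations could be swept. It assumes `vae` and `latent_dist` are already built as in the snippet above, and the output filenames are just placeholders:

```python
import torch
from torchvision import transforms

# Sweep the std scale factor to control how far samples drift from the mean.
# Assumes `vae` and `latent_dist` exist as in the snippet above.
with torch.no_grad():
    eps = torch.randn_like(latent_dist.mean)  # reuse one noise draw so only the scale changes
    for scale in (1.0, 2.5, 5.0, 10.0):
        latent = latent_dist.mean + scale * latent_dist.std * eps
        decoded = vae.decode(latent).sample
        decoded = (decoded.squeeze(0).cpu() * 0.5 + 0.5).clamp(0, 1)  # back to [0, 1]
        transforms.ToPILImage()(decoded).save(f"variation_scale_{scale}.png")
```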
-
Here is an example of what I get when interpolating two images (smile vs. non-smile portraits). This is very cool as it doesn't require reverse diffusion, so it's pretty fast, but again my initial idea is to start from a single input image and generate variations of it.
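For reference, a minimal sketch of that two-image interpolation, assuming `vae` is the float32 SDXL VAE loaded as above and `smile.jpg` / `no_smile.jpg` are placeholder paths for two same-sized images:

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.transforms.functional import pil_to_tensor

def encode_mean(path: str) -> torch.Tensor:
    """Encode an image to the mean of its VAE latent distribution."""
    img = pil_to_tensor(Image.open(path)).unsqueeze(0) / 255.0 * 2.0 - 1.0  # map to [-1, 1]
    img = img.to(device=vae.device, dtype=vae.dtype)
    return vae.encode(img, return_dict=False)[0].mean

with torch.no_grad():
    latent_a = encode_mean("smile.jpg")     # placeholder path
    latent_b = encode_mean("no_smile.jpg")  # placeholder path, same image size as the first
    for i, t in enumerate(torch.linspace(0.0, 1.0, 5)):
        latent = (1 - t) * latent_a + t * latent_b  # linear interpolation in latent space
        decoded = vae.decode(latent).sample
        decoded = (decoded.squeeze(0).cpu() * 0.5 + 0.5).clamp(0, 1)
        transforms.ToPILImage()(decoded).save(f"interp_{i}.png")
```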
-
hi,
I would like to use a VAE (SDXL's VAE in the example below) in order to get close variations of a given image, in the spirit of what was demonstrated in that paper.
I could achieve this with inpainting or maybe img2img, but my point is to get these close images without having to go through reverse diffusion, for performance reasons.
I tried to implement the logic, and found out that diffusers already has nearly all the tools.
PROBLEM: It runs without errors and I do get images that differ pixel-wise from the initial image, but the differences across images are invisible to the human eye. I noticed that the std of the latent distribution is extremely low relative to the mean, which might explain why all the produced images look identical to the input. Or maybe I got a preprocessing step wrong?
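(For reference, a quick way to quantify that std-to-mean ratio, assuming `latent_dist` is the `DiagonalGaussianDistribution` returned by `vae.encode`:)

```python
# Compare the magnitude of the predicted std against the mean.
mean_mag = latent_dist.mean.abs().mean().item()
std_mag = latent_dist.std.mean().item()
print(f"mean magnitude: {mean_mag:.4f}, std magnitude: {std_mag:.4f}, ratio: {std_mag / mean_mag:.6f}")
```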
I'd be grateful if someone could review my logic and advise.
Here is my code:
Thanks a lot!
cc @asomoza