Unet conditioning per block #9368
Replies: 3 comments
-
Hi, your project sounds interesting. I'm especially curious about what happens with training for different timesteps; I like that kind of thinking, but I don't have that much experience with full finetunes of Stable Diffusion. We would probably need a proof of concept of the training first, but this could be added to the research projects if it produces good results.
-
Interesting idea. I don't have anything concrete to add yet off the top of my head, but I will update here if I run into something interesting. And yes, having this contributed to the research projects would be welcome.
-
This led me down the road to another idea: I removed the CLIP model from the process entirely and directly trained the conditioning vectors fed to the unet, using each comma-separated tag in a prompt as its own vector. It worked surprisingly well in a very naive test with an unbalanced dataset, far better than textual inversion, and combined with the trained-vector-per-layer idea it would presumably be even better. It also seems to handle multiple characters significantly better without bleeding them into each other, which suggests the bleeding may have been primarily a problem in CLIP rather than in the cross-attention.

Taking it even further, you could directly train a key and value per cross-attention module, with no need to craft a vector that works well with the projection, essentially a tiny dedicated LoRA per concept. That would also allow excluding concepts from published models if desired, merging concepts by directly merging their conditioning vectors (I haven't tried that yet, but presumably it would work), and mixing key/value vectors from different concepts (e.g. the key for Elephant with the value for Glass, perhaps only in some layers), so you could easily prompt for a glass elephant with no confusion from CLIP's attempt to interpret it. 5000 conditioning vectors is only about 20 MB.

This is my very quick and naive implementation to replace CLIP: https://github.com/CodeExplode/ConditioningExperiments/blob/main/conditioner.py
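For anyone curious, here is a minimal sketch of the idea (not the linked file; the class name, dimensions, and padding scheme are made up for illustration): each comma-separated tag maps to its own trained vector, and the stacked vectors are passed to the unet as `encoder_hidden_states` in place of CLIP's output.

```python
import torch
import torch.nn as nn

class TagConditioner(nn.Module):
    """Maps comma-separated prompt tags to directly trained conditioning vectors."""

    def __init__(self, tag_vocab, dim=768, max_tags=77):
        super().__init__()
        self.tag_to_id = {tag: i for i, tag in enumerate(tag_vocab)}
        self.max_tags = max_tags
        # One trainable vector per tag, replacing the CLIP text encoder output.
        self.embeddings = nn.Embedding(len(tag_vocab), dim)

    def forward(self, prompts):
        batch = []
        for prompt in prompts:
            tags = [t.strip() for t in prompt.split(",") if t.strip()]
            ids = [self.tag_to_id[t] for t in tags if t in self.tag_to_id][: self.max_tags]
            ids = torch.tensor(ids, dtype=torch.long, device=self.embeddings.weight.device)
            vecs = self.embeddings(ids)
            # Zero-pad to a fixed sequence length so the batch can be stacked.
            pad = torch.zeros(self.max_tags - vecs.shape[0], vecs.shape[1], device=vecs.device)
            batch.append(torch.cat([vecs, pad], dim=0))
        return torch.stack(batch)  # (batch, max_tags, dim) -> encoder_hidden_states
```

During training only `self.embeddings` (and optionally the unet) receives gradients; at inference the returned tensor goes wherever the CLIP hidden states would normally be passed.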
-
Recently I've been running into issues where my finetunes of Stable Diffusion learn a large number of concepts reasonably well, but then don't improve with significantly more training, even though each concept can be trained to high likeness in isolation. After thinking about it, I believe this is because transformer blocks can only be finetuned for so many concepts, given the various mappings and workarounds they have to learn wherever the conditioning doesn't accurately match the visual concepts. They're excellent for broad-spectrum interpretation of embedding features, but when you want highly accurate likeness on many concepts (a few thousand), the transformer blocks just don't seem capable of it, due to the practical limit of the concepts pushing against each other in the bottlenecked spaces.
It seems the best way around this is to focus as much of the finetuning as possible on the conditioning, i.e. the input embeddings, which can be trained in isolation without needing to update any of the other weights. I've been doing this to a large extent for the last two years by pretraining concepts as embeddings with standard textual inversion and inserting them into the CLIP model prior to full finetuning, though I seem to have neared the limits of what that can do.
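For reference, a rough sketch of that insertion step with the transformers API (the model path, the learned_embeds.bin layout, and the vector shapes are assumptions, not taken from my actual training code):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Load the SD 1.5 tokenizer and text encoder (path is a placeholder).
tokenizer = CLIPTokenizer.from_pretrained("path/to/stable-diffusion-1-5", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("path/to/stable-diffusion-1-5", subfolder="text_encoder")

# Assumed format: {placeholder_token: vector of shape (hidden_dim,)}
learned = torch.load("learned_embeds.bin")

for token, vector in learned.items():
    if tokenizer.add_tokens(token) == 0:
        continue  # token already exists in the vocab
    # Grow the input embedding table and copy the pretrained vector in.
    text_encoder.resize_token_embeddings(len(tokenizer))
    token_id = tokenizer.convert_tokens_to_ids(token)
    text_encoder.get_input_embeddings().weight.data[token_id] = vector
```

After this, the full finetune starts from a text encoder that already knows each concept's token.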
There was a paper a year back which suggested that training a separate embedding per layer of the unet is much more capable: https://arxiv.org/abs/2303.09522
That would make sense, as the embeddings currently have to encode multiple pieces of information that fit well with the cross-attention key/value projections in every layer of the unet, whereas being fit to just one layer would greatly reduce the confusion and the amount of information that needs to be packed into each embedding.
I've been thinking about the cleanest way to do this, and it seems that simply extending the CLIP input embeddings layer to hold n copies of the input embeddings, one per cross-attention block, would fit cleanly with the current model savers and loaders; the input embeddings layer would just be larger. When encoding a prompt, n versions would be encoded, using offsets of vocab_length * layer_index to look up each token's input embedding for that layer.
The issue then is how to condition the unet with these multiple versions of the hidden states. It doesn't seem like an incredibly difficult change: the encoder_hidden_states passed to each down/mid/up block would first be assigned to a temporary variable, which is either the single tensor provided, or the entry for the current block index if a list is provided, e.g. at:
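A rough sketch of that offset scheme (n_blocks, the model path, and the resize call are illustrative; only last_hidden_state is used, so the pooled-output/EOS handling inside CLIP is ignored here):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

n_blocks = 7  # cross-attention blocks in the SD 1.5 unet (3 down, 1 mid, 3 up)

tokenizer = CLIPTokenizer.from_pretrained("path/to/stable-diffusion-1-5", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("path/to/stable-diffusion-1-5", subfolder="text_encoder")

vocab_size = text_encoder.get_input_embeddings().weight.shape[0]
original = text_encoder.get_input_embeddings().weight.data.clone()

# Grow the input embedding table to hold one copy of the vocab per cross-attention block,
# initialising every copy from the original weights.
text_encoder.resize_token_embeddings(vocab_size * n_blocks)
for i in range(n_blocks):
    text_encoder.get_input_embeddings().weight.data[i * vocab_size:(i + 1) * vocab_size] = original

def encode_per_block(prompt):
    ids = tokenizer(prompt, padding="max_length", truncation=True,
                    max_length=tokenizer.model_max_length, return_tensors="pt").input_ids
    # Offset token ids by vocab_size * block_index to select that block's copy of each embedding.
    return [text_encoder(ids + i * vocab_size).last_hidden_state for i in range(n_blocks)]
```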
https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_condition.py#L1219
It would be changed to something like:
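(The following is an illustrative sketch rather than the exact diffusers code; `cross_attn_block_idx` is assumed to start at 0 before the down-block loop and keep counting through the mid and up blocks.)

```python
if hasattr(downsample_block, "has_cross_attention") and downsample_block.has_cross_attention:
    # Pick this block's conditioning: either the single tensor, or the list entry for this block.
    if isinstance(encoder_hidden_states, (list, tuple)):
        block_hidden_states = encoder_hidden_states[cross_attn_block_idx]
    else:
        block_hidden_states = encoder_hidden_states
    cross_attn_block_idx += 1

    sample, res_samples = downsample_block(
        hidden_states=sample,
        temb=emb,
        encoder_hidden_states=block_hidden_states,
        attention_mask=attention_mask,
        cross_attention_kwargs=cross_attention_kwargs,
    )
else:
    sample, res_samples = downsample_block(hidden_states=sample, temb=emb)
```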
In Stable Diffusion 1.5 I believe there are 7 cross-attention blocks (3 down, 1 mid, 3 up), so that would mean 7 variations of each input embedding.
I'm not sure how to correctly contribute to a project like Diffusers, whether this would be desired as an addition, or whether there's a better way to do this. But it does seem like the missing piece for giving these models much more capability at almost no extra cost, where a few extra tiny parameters in the input embeddings can make all the difference. It seems to me that transformer blocks and embeddings are intertwined: neither has much meaning without the other half to match it. But embeddings offer the only chance to finetune concepts without impacting other concepts, and the only chance of likeness on more than a small number of concepts, since they can be trained without stepping on each other's toes.
A similar idea would be to use different embeddings at different timesteps, perhaps blending between an embedding for timestep 0 and an embedding for timestep 1000. This would be easier to do in custom code without changing the unet implementation (simply encode a high-noise and a low-noise version of the prompt and blend them for the current timestep). I intend to investigate this soon; however, I think the real advantage would be in unique conditioning per layer, removing the need for so much compression in the embeddings, a big gain in flexibility with very little increase in parameter count.
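A quick sketch of that blending in a custom denoising loop (the 0..1000 timestep range and the linear blend are assumptions):

```python
import torch

# emb_high and emb_low are (batch, seq_len, dim) text-encoder outputs:
# one encoding intended for high-noise timesteps, one for low-noise timesteps.
def blended_conditioning(emb_high, emb_low, timestep, max_timestep=1000):
    w = float(timestep) / max_timestep  # 1.0 at the noisiest step, 0.0 at the last step
    return w * emb_high + (1.0 - w) * emb_low

# Inside the sampling loop:
# noise_pred = unet(latents, t, encoder_hidden_states=blended_conditioning(emb_high, emb_low, t)).sample
```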