Near zero loss / high capacity hypernetwork artstyle training #7011
-
This is all really interesting, though I think the goal of generating identical images to the training data is a bit weird. Isn't the point of training models to produce new images with the style and concepts of the training data (without replicating it)? Even if it can generalize, I'd be scared of accidentally plagiarizing something when using it. Either way, the stuff about the improper normalization layer implementation is neat and should probably be fixed if the current implementation is wrong. Also, how many steps did it take to train your example network? I'm curious how much of a speedup the correct normalization layer gives.
-
Interesting, but... those hypernetworks are probably way too overfitted to be useful.
-
If anyone is still interested in this ancient technology: https://huggingface.co/lmganon123/pochi_hypernet Here is the best I could do with it. It works, but I am not sure if it is better than a LoRA.
-
Training data:
Training preview:
I am gonna start writing it down now, since I may start to forget things. Basically, I think I finally made something very close to a perfect hypernetwork. By perfect I mean a hypernetwork that can make replicas of the training data if you use the same prompt used during training for each picture. See: #2670 (comment), but in this case it is not 1 training picture but 72. While, as I have shown in that 1-training-picture example, the network can be overpowering when you use the same prompt, changing the prompt shows that it can generalize and retain the style.
I am probably 20 commits behind, and there is a lot of stuff I modified in the file during my trial and error, so I will just attach my hypernetwork.py file.
hypernetwork.zip
If you just copy the file over and it works you can skip the next section.
Most important changes that allow you to actually get to near zero loss
Line 82:
if type(layer) == torch.nn.Linear:
This stops the change to the default initialization of the norm layer. People have been saying that norm layers just slow down training; in reality, the norm layer was initialized improperly. Norm layers not only reduce overfitting, they also significantly speed up training. With proper initialization (the default initialization, with weights set to 1) you can actually see the training loss per epoch start decreasing, and the outputs start to resemble the training data much faster. I also never had a gradient explosion with a norm layer and cosine annealing, regardless of how long I trained the network.
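For context, a minimal sketch of the intent behind that guard (the function name and init values here are assumptions for illustration, not the exact webui code): only re-initialize Linear layers and leave LayerNorm at PyTorch's default of weight = 1, bias = 0.

```python
import torch

def init_hypernetwork_weights(module):
    # Hypothetical illustration: custom init is applied only to Linear layers,
    # while LayerNorm keeps its PyTorch default initialization (weight=1, bias=0).
    for layer in module.modules():
        if type(layer) == torch.nn.Linear:
            torch.nn.init.normal_(layer.weight, mean=0.0, std=0.01)
            if layer.bias is not None:
                torch.nn.init.zeros_(layer.bias)
        # torch.nn.LayerNorm layers are deliberately left untouched.
```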
Line 463:
optimizer = torch.optim.AdamW(params=weights, lr=3e-4, weight_decay=0.05, amsgrad=True)
AMSGrad is helpful. I have seen it speed up the initial phase of learning, where you go from blobs to actual shapes. It also keeps the norms of your weights under control; without it the norms grow much faster, which can throw off the optimal learning rate. It unfortunately increases VRAM usage.
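If you want to check the weight-norm claim yourself, a small hypothetical helper (not part of the webui code) that reports the total L2 norm of the hypernetwork weights could look like this:

```python
import torch

def total_weight_norm(weights):
    # 'weights' is assumed to be the same list of tensors handed to AdamW above.
    # Logging this every few hundred steps makes the faster norm growth
    # without AMSGrad easy to see.
    return torch.sqrt(sum(w.detach().pow(2).sum() for w in weights)).item()
```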
Line 743:
Discussed here: #2670 (comment). I wasted a lot of time because of this. If you don't want to fork the RNG, not making previews is an option. I also think that previews with a -1 seed should be fine. Making previews with a fixed seed without forking the RNG breaks hypernetwork training.
https://textual-inversion.github.io/
My understanding based on the above (can be wrong, would love to be corrected) is that your hypernetwork modifies the noised picture at 4 steps during denoising to change the final denoised picture into your training data. If you don't reset the seed (the correct method of training), then each training step sees a random noised picture, the kind that would be generated with the prompt you train with for that picture, and your hypernetwork tries to modify those noised pictures so the denoising makes them look like the training data. If you set the seed at each step, you give it the exact same noised picture to modify every time. Your loss graph will look great with this, but your hypernetwork is unprepared for random noised pictures, so it will never create anything good.
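A minimal sketch of the RNG-forking idea (the function names here are assumptions, not the actual webui preview code): generate the fixed-seed preview inside a forked RNG scope so the training noise stream is left undisturbed.

```python
import torch

def make_preview(generate_image, preview_seed=12345):
    # torch.random.fork_rng() snapshots the global RNG state and restores it
    # on exit, so seeding a fixed preview here does not turn the next
    # training step's random noise into the same repeated noise.
    with torch.random.fork_rng():
        torch.manual_seed(preview_seed)
        return generate_image()
```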
Settings used in training to reach near zero loss
A first annealing period of roughly 1000 steps, with the period multiplied by 2 at each restart (change this with line 515 in my hypernetwork file); a sketch of such a schedule is below.
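Assuming the schedule is PyTorch's CosineAnnealingWarmRestarts (a guess at the mechanism, not a quote of the modified file), the settings above would translate to something like:

```python
import torch

# 'weights' stands in for the hypernetwork parameters, as in the AdamW line above.
weights = [torch.nn.Parameter(torch.zeros(8, 8))]
optimizer = torch.optim.AdamW(params=weights, lr=3e-4, weight_decay=0.05, amsgrad=True)

# First annealing period of ~1000 steps, doubled at each restart.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=1000, T_mult=2)

# scheduler.step() is then called once per training step.
```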
hypernetwork_loss.csv
Loss graph:
Purple is the learn rate, plotted on the right axis. Blue is the mean loss from the last 3 epochs, plotted on the left. In the end the loss oscillates around 0.02.
And here are data points.
Aaaaand I am still very unhappy, because it is 90% there but it is still not what I want. I tried a lot of things to lower the loss further, but none of them worked. And the results at this loss level are far from perfect.
It is very easy to tell which one is the original and which one is the copy. And look at that face:
Also, the shading and coloring are atrocious. I want that final 2% of perfection.
A few remarks and thoughts
Great paper: https://arxiv.org/pdf/1706.05350.pdf
It opened my eyes to a lot of things; page 7 is especially enlightening. Basically, layer normalization removes the effect of weight norms on generalization performance (overcooking). The problem with those neat graphs is that the general trend applies, but the optimal values are probably off: the graphs in the paper were done for batch size 128, and a high batch size requires a much bigger weight decay than the one I used.
I wish I knew the optimal structure of a hypernetwork for this. Bigger is obviously always better with layer norm, but I don't know what would be optimal. If you think about it, the network I made generates the same picture for different seeds and even with dropped-out tags, so it is hard for me to imagine you could contain all that information in a large feed-forward like 1, 8, 8, 1. You need more layers. But how should they be structured, and would some convolutional layers maybe be better?
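For what it's worth, a hypothetical sketch of a deeper webui-style structure (the multipliers, context dimension, and activation here are assumptions for illustration, not a recommendation from this post):

```python
import torch.nn as nn

def build_hypernetwork_module(dim=768, multipliers=(1, 8, 8, 1)):
    # Multipliers are relative to the context dimension, webui-style.
    layers = []
    for i in range(len(multipliers) - 1):
        layers.append(nn.Linear(int(dim * multipliers[i]), int(dim * multipliers[i + 1])))
        if i < len(multipliers) - 2:
            # Norm + activation between hidden layers only, not after the output layer.
            layers.append(nn.LayerNorm(int(dim * multipliers[i + 1])))
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)
```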
I am pretty sure you need to use tags, and deepdanbooru is great for this with NAI, given that NAI was trained on danbooru tags. Different tags generate differently noised pictures, which gives your hypernetwork a better chance to tell them apart. I just don't know whether more tags are better or not. Maybe the reason I can't push the loss any lower is that too many tags generate very random results.