Flux LoRA training tips · Discussion #1497
4 comments · 12 replies
-
The tip about captions makes no sense to me, to be honest. There are plenty of use cases, even in style LoRAs, where captions are needed to differentiate concepts, and a captioned style LoRA will have a stronger foundation for the base style because it better associates the correct words with the correct aspects of the style (or so goes my experience/broscience, both with SDXL and Flux so far). It is hard for me to believe that the caption-free LoRAs with supposedly superior output are not simply the result of poor captioning in the test group.

Personally, I have had success so far with style training using the wildcard ARG and a mix of Danbooru tags + LLM-generated long prompts. I am not sure going to such lengths is necessary, but it is still my current preference. Here are my latest tests from LoRA training on 12GB VRAM in the style of the artist Ibuki Satsuki: 30 epochs, Lion optimizer with a cosine scheduler, 8 dim / 8 alpha, 3e-4 LR.

Test training ARGs used:
Captioned with both local JoyCaption and booru tags, using the wildcard ARG to switch between them. Using dev-q4_0 & t5-v1_1-xxl-encoder-q5_k_m for genning (quantized model/T5 so I don't OOM).
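For anyone wanting to try this kind of mixed captioning, here is a minimal sketch of how such a dataset could be assembled, assuming sd-scripts' wildcard/multi-line caption handling (e.g. `enable_wildcard` in the dataset config) picks one caption line at random per step; the file names, tags, and paths below are hypothetical.

```python
from pathlib import Path

# Hypothetical inputs: one booru tag string and one long LLM caption per image.
booru_tags = {
    "0001.png": "1boy, long hair, hanfu, traditional media, ink wash",
    "0002.png": "1girl, flowing robes, flowers, muted colors",
}
llm_captions = {
    "0001.png": "An androgynous young man with long dark hair in flowing hanfu, painted in soft ink-wash tones.",
    "0002.png": "A young woman in layered robes surrounded by pale blossoms, rendered with delicate linework.",
}

dataset_dir = Path("dataset/ibuki_satsuki")  # hypothetical path
dataset_dir.mkdir(parents=True, exist_ok=True)

# With multi-line captions enabled (assumed behavior of enable_wildcard in the
# dataset config), one line is chosen per training step, so each image is seen
# sometimes with booru tags and sometimes with a long natural-language caption.
for name in booru_tags:
    caption_path = (dataset_dir / name).with_suffix(".txt")
    caption_path.write_text(booru_tags[name] + "\n" + llm_captions[name] + "\n")
```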
Based on my results, I am now running a test training at 1024x1024 with a higher LR to try to reduce the number of epochs needed (since training takes a substantial amount of time on my 12GB card). I theorize the wonky hands in the sample gens come from bucketing my images down to 512x512 during training, but they could also be from overcooking; I hope to find out in future test runs.

In my opinion, my Ibuki Satsuki style LoRA is a great example of a style LoRA that benefits from captioning. If you are familiar with Ibuki Satsuki's art style (or as you can probably surmise from my sample gens), the males are very androgynous. I imagine the training would confuse or blend the subtle feature differences between males and females if they were not properly captioned. However, if people find that caption-free training works for them, more power to them: it is less work, and if the results are good, that means more time to create more LoRAs.

Here is an example image of Ibuki Satsuki's that I used in my dataset: https://files.catbox.moe/8km0lp.jpg

After my 1024x1024 higher-LR test is done, I will repeat my previous training with no captions for a direct comparison, just to cover all my bases. Unfortunately, with low VRAM the time it takes to experiment with training settings is a bit of a hurdle, so I probably will not get too wild beyond this.
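On the bucketing guess above: a rough, generic illustration of how aspect-ratio bucketing scales an image down to fit a target resolution (not sd-scripts' exact code; the example dimensions are hypothetical):

```python
import math

def bucket_size(width: int, height: int, base_resolution: int = 512, step: int = 64):
    """Pick a training size the way aspect-ratio bucketing generally does:
    keep the aspect ratio, cap the pixel area at base_resolution**2, and
    snap both sides down to a multiple of `step`. Generic illustration only,
    not sd-scripts' exact bucketing code."""
    max_area = base_resolution * base_resolution
    scale = min(1.0, math.sqrt(max_area / (width * height)))  # never upscale
    return (int(width * scale) // step * step,
            int(height * scale) // step * step)

# A hypothetical 1500x2100 scan keeps only a fraction of its pixels in a 512 bucket:
print(bucket_size(1500, 2100, 512))   # (384, 576)  -> fine detail like hands gets crushed
print(bucket_size(1500, 2100, 1024))  # (832, 1152) -> roughly 4x the pixel area to learn from
```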
-
People really are only training and generating "1girl standing", aren't they? The only scenario where skipping captions works is when the only thing you care about is replicating your training data. Fortunately, the model is robust enough that even then it still manages to juggle things around and generate diverse images, but that's despite the error, not because of it. Either way, if you want to generate anything that isn't a bad copy of your training data, you need captions to disentangle the contents of the images; captions are what enable you to train on one thing and generate another.
-
Training without detailed captions produces a LoRA with a lot less flexibility. Sure, it may generate very accurate-looking images, as long as you don't deviate too much from the training data. To observe the differences, try the "clown test":
I have experimented with captioning using only a unique token, and with a unique token + detailed captions. With just the unique token, the clown makeup usually appears more faded and often looks messy; with detailed captions it usually comes out looking good, as if no LoRA were being used. Captioning helps the model learn the fundamentals of the concept better.
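To make the two captioning schemes and the test concrete, a tiny illustrative example follows; the token, captions, and prompt are hypothetical, not taken from the experiment above.

```python
# Two caption schemes for the same hypothetical training image.
# "sks" stands in for the unique/rare token; all wording here is illustrative.
unique_token_only = "sks woman"
unique_token_plus_details = (
    "sks woman, close-up portrait, neutral expression, "
    "plain grey background, soft natural lighting"
)

# The "clown test": prompt a concept that never appears in the training data and
# check whether it renders cleanly or gets washed out / overridden by the LoRA.
clown_test_prompt = (
    "sks woman in full clown makeup, white face paint, red nose, colorful wig"
)
```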
-
Is it just me, or does Flux learn better with higher batch sizes? For example, in some tests I am seeing better results at 500 steps with batch size 6 than at 1000 steps with batch size 3. In older SD models, increasing the batch size past 2 or 3 didn't seem to yield improved results.
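A quick sanity check on that comparison (my arithmetic, not from the post): both runs see the same number of training samples, so the difference really is the gradient batch size rather than extra data exposure.

```python
# Both configurations process the same number of samples; only the gradient
# batch size and number of updates differ. Numbers taken from the comparison above.
configs = {"500 steps @ batch 6": (500, 6), "1000 steps @ batch 3": (1000, 3)}
for name, (steps, batch) in configs.items():
    print(f"{name}: {steps * batch} samples seen, {steps} gradient updates")
# 500 steps @ batch 6: 3000 samples seen, 500 gradient updates
# 1000 steps @ batch 3: 3000 samples seen, 1000 gradient updates
```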
-
I read this article.
There are some incredible tips listed in it.
Apparently Flux LoRA training is very different from SDXL LoRA training.
I would like to discuss whether what Civitai is saying holds up, based on the Flux architecture and the sd-scripts training code.
Since there is no description of what type of LoRA it is, I assume for the moment that it is a style LoRA.
I believe the tip about not adding captions when training stems from the fact that sd-scripts doesn't adjust the text encoder. However, this might change if text encoder training is supported in the future.
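A toy PyTorch sketch of what a frozen text encoder means for captions (dummy modules, not sd-scripts code): captions still select which fixed embeddings condition each training step, but no gradients reach the encoder, so it never learns new word-to-concept associations; only the LoRA side updates.

```python
import torch
from torch import nn

# Toy stand-ins for the real components; only the freezing pattern matters here.
text_encoder = nn.Embedding(1000, 64)   # stand-in for CLIP-L / T5
transformer_lora = nn.Linear(64, 64)    # stand-in for the LoRA-injected layers

# LoRA training as described: the text encoder stays frozen...
for p in text_encoder.parameters():
    p.requires_grad_(False)

# ...so captions only decide which (fixed) embeddings condition each step,
# while gradients flow solely into the LoRA weights.
tokens = torch.tensor([3, 42, 7])       # hypothetical caption token ids
cond = text_encoder(tokens)             # fixed features, no gradient into the encoder
out = transformer_lora(cond)
out.sum().backward()

print(text_encoder.weight.grad)             # None: the encoder never learns new word meanings
print(transformer_lora.weight.grad.shape)   # torch.Size([64, 64]): only the LoRA side updates
```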
The tip about image size doesn't make sense to me. Images at 1024x should carry more detail than those at 512x.
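For what it's worth, a rough back-of-the-envelope on why 1024x gives the model more to learn from, assuming the usual 8x VAE downscale and 2x2 patching for Flux-style models (those factors are my assumption, not from the article):

```python
# Rough token/latent math, assuming an 8x VAE downscale and 2x2 patches
# (typical for Flux-style models; treat the exact factors as assumptions).
def flux_tokens(width: int, height: int, vae_downscale: int = 8, patch: int = 2) -> int:
    lw, lh = width // vae_downscale, height // vae_downscale  # latent resolution
    return (lw // patch) * (lh // patch)                      # image tokens per sample

print(flux_tokens(512, 512))    # 1024 tokens
print(flux_tokens(1024, 1024))  # 4096 tokens -> 4x more spatial detail for the model to learn from
```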