Flux LoRA training tips · Discussion #1497
4 comments · 12 replies
-
The tip about captions makes no sense to me, to be honest. There are plenty of use cases, even in style LoRAs, where captions are needed to differentiate concepts, and a captioned style LoRA will have a stronger foundation for the base style because it better associates the correct words with the correct aspects of the style (or so goes my experience/broscience, both with SDXL and Flux so far). It is hard for me to believe that the caption-free LoRAs with supposedly superior output are not simply the result of poor captioning in the test group.

Personally, I have had success so far with style training using the wildcard ARG and a mix of Danbooru tags + LLM-generated long prompts. I am not sure going to such lengths is necessary, but it is still my current preference. Here are my latest tests from LoRA training on 12GB VRAM in the style of the artist Ibuki Satsuki: 30 epochs, Lion optimizer with a cosine scheduler, 8 dim / 8 alpha, 3e-4 LR.

Test training ARGs used:
Captioned with both local JoyCaption and booru tags, using the wildcard ARG to switch between them. Using dev-q4_0 & t5-v1_1-xxl-encoder-q5_k_m for genning (quantized model/T5 so I don't OOM).
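For anyone wanting to try this kind of mixed captioning, here is a minimal sketch of how such a dataset could be assembled, assuming sd-scripts' wildcard/multi-line caption handling (e.g. `enable_wildcard` in the dataset config) picks one caption line at random per step; the file names, tags, and paths below are hypothetical.

```python
from pathlib import Path

# Hypothetical inputs: one booru tag string and one long LLM caption per image.
booru_tags = {
    "0001.png": "1boy, long hair, hanfu, traditional media, ink wash",
    "0002.png": "1girl, flowing robes, flowers, muted colors",
}
llm_captions = {
    "0001.png": "An androgynous young man with long dark hair in flowing hanfu, painted in soft ink-wash tones.",
    "0002.png": "A young woman in layered robes surrounded by pale blossoms, rendered with delicate linework.",
}

dataset_dir = Path("dataset/ibuki_satsuki")  # hypothetical path
dataset_dir.mkdir(parents=True, exist_ok=True)

# With multi-line captions enabled (assumed behavior of enable_wildcard in the
# dataset config), one line is chosen per training step, so each image is seen
# sometimes with booru tags and sometimes with a long natural-language caption.
for name in booru_tags:
    caption_path = (dataset_dir / name).with_suffix(".txt")
    caption_path.write_text(booru_tags[name] + "\n" + llm_captions[name] + "\n")
```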
Based on my results, I am now running a test training at 1024x1024 with a higher LR to try to reduce the number of epochs needed (since training takes a substantial amount of time on my 12GB card). I theorize the wonky hands in the sample gens come from bucketing my images down to 512x512 during training, but they could also be from overcooking; I hope to find out in future test runs.

In my opinion, my Ibuki Satsuki style LoRA is a great example of a style LoRA that benefits from captioning. If you are familiar with Ibuki Satsuki's art style (or as you can probably surmise from my sample gens), the males are very androgynous. I imagine the training would confuse or blend the subtle feature differences between males and females if they were not properly captioned. However, if people find that caption-free training works for them, more power to them: it is less work, and if the results are good, that means more time to create more LoRAs.

Here is an example image of Ibuki Satsuki's that I used in my dataset: https://files.catbox.moe/8km0lp.jpg

After my 1024x1024 higher-LR test is done, I will repeat my previous training with no captions for a direct comparison, just to cover all my bases. Unfortunately, with low VRAM the time it takes to experiment with training settings is a bit of a hurdle, so I probably will not get too wild beyond this.
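On the bucketing guess above: a rough, generic illustration of how aspect-ratio bucketing scales an image down to fit a target resolution (not sd-scripts' exact code; the example dimensions are hypothetical):

```python
import math

def bucket_size(width: int, height: int, base_resolution: int = 512, step: int = 64):
    """Pick a training size the way aspect-ratio bucketing generally does:
    keep the aspect ratio, cap the pixel area at base_resolution**2, and
    snap both sides down to a multiple of `step`. Generic illustration only,
    not sd-scripts' exact bucketing code."""
    max_area = base_resolution * base_resolution
    scale = min(1.0, math.sqrt(max_area / (width * height)))  # never upscale
    return (int(width * scale) // step * step,
            int(height * scale) // step * step)

# A hypothetical 1500x2100 scan keeps only a fraction of its pixels in a 512 bucket:
print(bucket_size(1500, 2100, 512))   # (384, 576)  -> fine detail like hands gets crushed
print(bucket_size(1500, 2100, 1024))  # (832, 1152) -> roughly 4x the pixel area to learn from
```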
-
People really are only training and generating "1girl standing", aren't they? The only scenario where skipping captions works is when the only thing you care about is replicating your training data. Fortunately, the model is robust enough that even then it still manages to juggle things around and generate diverse images, but that's despite the error, not because of it. Either way, if you want to generate anything that isn't a bad copy of your training data, you need captions to disentangle the contents of the images; captions are what enable you to train on one thing and generate another.
-
Training without detailed captions produces a LoRA with a lot less flexibility. Sure, it may generate very accurate-looking images, as long as you don't deviate too much from the training data. To observe the differences, try the "clown test":
I have experimented with captioning using only a unique token, and with a unique token + detailed captions. With just the unique token, the clown makeup usually appears more faded and often looks messy; with detailed captions it usually comes out looking good, as if no LoRA were being used. Captioning helps the model learn the fundamentals of the concept better.
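To make the two captioning schemes and the test concrete, a tiny illustrative example follows; the token, captions, and prompt are hypothetical, not taken from the experiment above.

```python
# Two caption schemes for the same hypothetical training image.
# "sks" stands in for the unique/rare token; all wording here is illustrative.
unique_token_only = "sks woman"
unique_token_plus_details = (
    "sks woman, close-up portrait, neutral expression, "
    "plain grey background, soft natural lighting"
)

# The "clown test": prompt a concept that never appears in the training data and
# check whether it renders cleanly or gets washed out / overridden by the LoRA.
clown_test_prompt = (
    "sks woman in full clown makeup, white face paint, red nose, colorful wig"
)
```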
-
Is it just me, or does Flux learn better with higher batch sizes? For example, in some tests I am seeing better results at 500 steps with batch size 6 than at 1000 steps with batch size 3. In older SD models, increasing the batch size past 2 or 3 didn't seem to yield improved results.
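A quick sanity check on that comparison (my arithmetic, not from the post): both runs see the same number of training samples, so the difference really is the gradient batch size rather than extra data exposure.

```python
# Both configurations process the same number of samples; only the gradient
# batch size and number of updates differ. Numbers taken from the comparison above.
configs = {"500 steps @ batch 6": (500, 6), "1000 steps @ batch 3": (1000, 3)}
for name, (steps, batch) in configs.items():
    print(f"{name}: {steps * batch} samples seen, {steps} gradient updates")
# 500 steps @ batch 6: 3000 samples seen, 500 gradient updates
# 1000 steps @ batch 3: 3000 samples seen, 1000 gradient updates
```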
-
I read this article.
There are some incredible tips listed in it.
Apparently Flux LoRA training is very different from SDXL LoRA training.
I would like to discuss whether what Civitai is saying holds up, based on the Flux architecture and the sd-scripts training code.
Since there is no description of what type of LoRA it is, I assume for the moment that it is a style LoRA.
I believe the tip about not adding captions when training stems from the fact that sd-scripts doesn't adjust the text encoder. However, this might change if text encoder training is supported in the future.
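A toy PyTorch sketch of what a frozen text encoder means for captions (dummy modules, not sd-scripts code): captions still select which fixed embeddings condition each training step, but no gradients reach the encoder, so it never learns new word-to-concept associations; only the LoRA side updates.

```python
import torch
from torch import nn

# Toy stand-ins for the real components; only the freezing pattern matters here.
text_encoder = nn.Embedding(1000, 64)   # stand-in for CLIP-L / T5
transformer_lora = nn.Linear(64, 64)    # stand-in for the LoRA-injected layers

# LoRA training as described: the text encoder stays frozen...
for p in text_encoder.parameters():
    p.requires_grad_(False)

# ...so captions only decide which (fixed) embeddings condition each step,
# while gradients flow solely into the LoRA weights.
tokens = torch.tensor([3, 42, 7])       # hypothetical caption token ids
cond = text_encoder(tokens)             # fixed features, no gradient into the encoder
out = transformer_lora(cond)
out.sum().backward()

print(text_encoder.weight.grad)             # None: the encoder never learns new word meanings
print(transformer_lora.weight.grad.shape)   # torch.Size([64, 64]): only the LoRA side updates
```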
The tip about image size doesn't make sense to me. Images at 1024x should carry more detail than those at 512x.
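For what it's worth, a rough back-of-the-envelope on why 1024x gives the model more to learn from, assuming the usual 8x VAE downscale and 2x2 patching for Flux-style models (those factors are my assumption, not from the article):

```python
# Rough token/latent math, assuming an 8x VAE downscale and 2x2 patches
# (typical for Flux-style models; treat the exact factors as assumptions).
def flux_tokens(width: int, height: int, vae_downscale: int = 8, patch: int = 2) -> int:
    lw, lh = width // vae_downscale, height // vae_downscale  # latent resolution
    return (lw // patch) * (lh // patch)                      # image tokens per sample

print(flux_tokens(512, 512))    # 1024 tokens
print(flux_tokens(1024, 1024))  # 4096 tokens -> 4x more spatial detail for the model to learn from
```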