by Haoming
Last Update: 2024/10/22
Corresponding Webui version: v24.1.7
Just like with generation, there is also a popular Webui for training: Kohya_SS. Simply go to the repository and follow the installation steps.
(The UI is maintained by bmaltais, and internally calls the sd-scripts written by kohya-ss)
Tip
There are also other training UIs, such as OneTrainer, AI Toolkit, and Flux Gym, though I personally do not have much experience with them.
Now comes the most important part: preparing the dataset. If you don't prepare the dataset properly, the trained model will not produce good results, just as the good ol’ saying goes:
"Garbage in, garbage out."
The dataset refers to 2 things: the images and the captions for said images.
You can train a decent LoRA with just a dozen images. The more important part is the variety of the images, such as different backgrounds, lighting, and poses, i.e., quality over quantity. Personally, I recommend using only “official” images, such as card arts, in-game screenshots, or promo posters.
The resolution of the images should be around `512 x 512` for SD1 models, and at least `1024 x 1024` for SDXL and Flux models. The aspect ratio of each image does not matter: simply enable the `Enable buckets` option, and the tool will handle it for you, so you no longer need to crop the images into perfect squares.
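Before training, it can be worth scanning the dataset for undersized images. The snippet below is just a convenience sketch (not part of Kohya_SS); the `Project` folder name, extension list, and threshold are assumptions you should adjust.

```python
# Sketch: flag images whose shorter side is below the recommended size.
# "Project" and MIN_SIDE are placeholders; adjust them for your dataset/model.
from pathlib import Path

from PIL import Image  # pip install pillow

MIN_SIDE = 1024  # use 512 for SD1 models

for path in sorted(Path("Project").rglob("*")):
    if path.suffix.lower() in {".png", ".jpg", ".jpeg", ".webp"}:
        with Image.open(path) as img:
            w, h = img.size
            if min(w, h) < MIN_SIDE:
                print(f"{path}: {w}x{h} is below {MIN_SIDE}px")
```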
Tip
See Architecture if you don't know what the versions mean
Once you finish preparing the images, proceed to the next step:
You can train a decent LoRA without any captions at all; however, having captions can drastically improve the flexibility of the model. You can write the descriptions of the images manually, or use tagging tools to automatically generate captions for an entire folder of images, such as the WD series for anime checkpoints, or Florence-2 for realistic checkpoints. Each caption file should be placed next to its corresponding image, with the same filename and a `.txt` extension.
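Because the trainer matches captions to images purely by filename, a quick check like the one below can catch images that were accidentally left uncaptioned. This is only a sketch; the folder name and extension list are illustrative.

```python
# Sketch: report images that have no matching .txt caption next to them.
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

for img in sorted(Path("Project").rglob("*")):
    if img.suffix.lower() in IMAGE_EXTS and not img.with_suffix(".txt").exists():
        print(f"Missing caption for: {img}")
```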
You’re not done yet!
Next, you need to go through every single caption file and manually prune the captions. You should remove every tag that describes your subject, keeping only the tags that are the “variables.” Think of it this way: the LoRA should learn to associate the features of your subject with its "trigger words."
Trigger Words are keywords that you add into the captions, so that the LoRA will learn to recreate the concept when the keywords are present in the prompt. Usually, the trigger word would be the name of the character or the style.
Tip
When training a LoRA for anime checkpoints, due to how Booru Tags are formatted, you should put the trigger words at the start of the captions. Additionally, the number of trigger words should be consistent across the entire dataset.
- TL;DR:
    - Take training a character LoRA as an example: only caption the background, the expression of the subject, and the poses of the subject; not the hair color, eye color, or other defining features (see the sketch after this list).
    - Personally, when I am training a character with multiple outfits, I put the character name first, the outfit name second, then the rest of the tags. Thus, I have 2 trigger words for every single caption.
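If your captions are Booru-style comma-separated tags, the pruning pass can be scripted instead of done by hand. The sketch below is only an illustration: the trigger words and the set of subject tags to drop are hypothetical and must be adapted to your own character.

```python
# Sketch: strip tags that describe the subject itself and keep the trigger
# words at the front of every caption. All tag names here are examples.
from pathlib import Path

TRIGGER_WORDS = ["foo_character", "summer_outfit"]      # hypothetical trigger words
SUBJECT_TAGS = {"red hair", "blue eyes", "long hair"}   # defining features to prune

for caption_file in Path("Project").rglob("*.txt"):
    tags = [t.strip() for t in caption_file.read_text(encoding="utf-8").split(",")]
    kept = [t for t in tags if t and t not in SUBJECT_TAGS and t not in TRIGGER_WORDS]
    caption_file.write_text(", ".join(TRIGGER_WORDS + kept), encoding="utf-8")
```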
This structure is specific to Kohya_SS
- Create a project folder to store your dataset. Inside it, create folders named in the format of `XXX_YYY`:
    - The `XXX` is the number of repeats per image. Generally, it takes around a thousand steps to train a concept, depending on its complexity. Divide that by the number of `epochs` and the number of images, and then multiply by the `batch size`, to get `XXX` (see the sketch after this list).
        - e.g. To train `1000` steps in `10` epochs using `10` images at a batch size of `2`, the `XXX` will be `1000 / 10 / 10 * 2` = `20`
    - The `YYY` is the "class" of the images, basically the broadest "category" of your subject.
        - e.g. `man` or `woman`
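For convenience, the repeat count from the formula above can be computed with a few lines of code (the function name is just for illustration):

```python
# XXX = target steps / epochs / images * batch size
def repeats_per_image(target_steps: int, epochs: int, images: int, batch_size: int) -> int:
    return round(target_steps / epochs / images * batch_size)

# The example from above: 1000 steps, 10 epochs, 10 images, batch size 2
print(repeats_per_image(1000, epochs=10, images=10, batch_size=2))  # -> 20, i.e. "20_YYY"
```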
In the end, it will be something like:

```
Project
|- 12_foo
|  |- 01.png
|  |- 01.txt
|  |- ...
|- 16_bar
|  |- 02.jpg
|  |- 02.txt
|  |- ...
```
Tip
Remember to switch to the LoRA tab at the top first
Note
Parameters that were not mentioned can just be left at default
- You can save the following settings into a `.json` file, and simply load it again in future training runs. (For reference, a rough sketch of how these settings map to the underlying sd-scripts command appears after this list.)
- Mixed precision: If you have an RTX 30 series or later GPU, select `bf16`; otherwise select `fp16`
- Dynamo backend: Try setting it to `inductor` and see if your GPU supports it; otherwise leave it at `no`
- Pretrained model name or path: Click the `📄` button, and select a checkpoint of choice. It is recommended to pick a more general checkpoint instead of a heavily finetuned one, so that the resulting LoRA stays flexible and works on more checkpoints.
    - If the model is SDXL, also enable the `SDXL` checkbox
- Image folder: Enter the path to the project folder (not the sub-folders)
- Trained Model output name: The name of the LoRA
- Save trained model as: `.safetensors`
- Save precision: Same as Mixed precision above
- Output directory for trained model: The folder where the LoRA should be saved (you can simply point this to the `~webui\models\Lora` folder)
- LoRA Type: You can choose between training a LoRA or a LyCORIS model
    - Keep it at `Standard` when you are starting out
- Train Batch Size: How many images are trained at once in each step
    - Increasing this will "smooth out" the changes between each Epoch
    - You may need to increase the Epoch count to compensate for it
- Epoch: How many times the training should go through the entire dataset
- Max train steps: I set this to `0` because my step count is determined by Epoch instead
- Caption file extension: `.txt`
- Learning Rate / Scheduler / Optimizer:
- Max resolution: For SDXL, set to `1024,1024`; for SD1, set to `768,768` if VRAM is sufficient, otherwise `512,512`.
Tip
See Architecture if you don't know what the versions mean
- Enable buckets: `true`
- Network Rank: How “complex” the concept is
    - You do not need more than `32` for the vast majority of cases. Do not blindly follow random YouTube tutorials where they set it to super high values, generating junk files that are hundreds of MB in size, wasting everyone's time and space...
    - I was able to train a LoRA for a single character with multiple outfits at just `4`; a pack LoRA with three characters at just `8`; and a pack LoRA with ten characters at `32`
- Network Alpha: Setting it to half of Network Rank is usually fine
- Keep n tokens: For realistic checkpoints, set it to `0`; for anime checkpoints, set it to the number of trigger words you have
    - See Captions
- Clip Skip: `1` for realistic checkpoints; `2` for anime checkpoints
- Gradient checkpointing: Enable this if you're getting Out of Memory errors
- Shuffle caption: `true` for anime checkpoints; `false` otherwise
    - Shuffling doesn't make sense for natural-language captions after all
- CrossAttention: `xformers`
- Debiased Estimation loss: `true`; helps with color
- Noise offset: Basically, this helps the model adapt to brighter and darker scenes
    - I set it to `0.02`
    - Do not set this too high, or you will get an extremely contrasty "AI" look...
- Adaptive noise scale: Set this to `1 / 10` of Noise offset
    - So `0.002` in my case
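For reference, the GUI ultimately launches kohya-ss's sd-scripts under the hood (as noted at the top of this page). The sketch below shows roughly how the settings above could map onto a `train_network.py` invocation. The flag names are based on my reading of sd-scripts and may differ between versions, and the paths, output name, and anime-style caption options are placeholders, so treat this as an illustration rather than a recipe.

```python
# Sketch: an approximate sd-scripts command matching the settings above.
# Paths, the output name, and several values are placeholders.
import subprocess

cmd = [
    "accelerate", "launch", "train_network.py",      # sdxl_train_network.py for SDXL
    "--pretrained_model_name_or_path", "/path/to/checkpoint.safetensors",
    "--train_data_dir", "/path/to/Project",          # the folder containing 20_foo etc.
    "--output_dir", "/path/to/webui/models/Lora",
    "--output_name", "my_lora",
    "--save_model_as", "safetensors",
    "--network_module", "networks.lora",             # the "Standard" LoRA type
    "--network_dim", "32", "--network_alpha", "16",
    "--train_batch_size", "2", "--max_train_epochs", "10",
    "--resolution", "768,768", "--enable_bucket",
    "--mixed_precision", "bf16", "--save_precision", "bf16",
    "--caption_extension", ".txt",
    "--keep_tokens", "2", "--shuffle_caption", "--clip_skip", "2",  # anime-checkpoint settings
    "--xformers", "--gradient_checkpointing",
    "--noise_offset", "0.02", "--adaptive_noise_scale", "0.002",
    "--debiased_estimation_loss",
]
subprocess.run(cmd, check=True)
```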
Pray that you don't waste an hour for nothing
- After the training is finished, you can use X/Y/Z Plot to evaluate the results