Help - When launching advanced training flux it gets stuck #9984
Unanswered
duchamps0305 asked this question in Q&A
Replies: 1 comment
-
Hi, if it worked a day before and now it doesn't, that probably means RunPod changed something on their side, so you should open an issue there so they can help you find the cause of your problem.
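One generic way to find out where a run like this is actually stuck (a sketch using only the Python standard library, not specific to this training script): register `faulthandler` near the top of the script, then send the process a `SIGUSR1` from another shell to make it print a traceback of every thread.

```python
import faulthandler
import os
import signal
import sys

# Print a traceback of all threads whenever the process receives SIGUSR1.
# Put this near the top of the training script, before the long-running work.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# From another shell you would then run:
#   kill -USR1 <pid of the training process>
# Demonstration only: here the process signals itself and dumps its own stack.
os.kill(os.getpid(), signal.SIGUSR1)
```

The traceback shows exactly which call the process is blocked in (e.g. a checkpoint load versus a network read), which narrows down whether the hang is on RunPod's side or in the script.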
-
Hi everyone, I'm trying to fine-tune a Flux DreamBooth LoRA using the advanced script, so that the embeddings are fine-tuned as well.
I followed all the steps; however, whether I paste the example script or my own, training always gets stuck at the same point after downloading the model. I've tried multiple times but can't understand why this happens.
I'm running on an A40 on RunPod; yesterday it went smoothly, but today it really doesn't seem to work.
This is my config:
(venv) root@da0c51e8dcb3:/workspace/diffusers/examples/advanced_diffusion_training# export MODEL_NAME="black-forest-labs/FLUX.1-dev"
export DATASET_NAME="linoyts/3d_icon"
export OUTPUT_DIR="3d-icon-Flux-LoRA"
accelerate launch train_dreambooth_lora_flux_advanced.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --instance_prompt="3d icon in the style of TOK" \
  --output_dir=$OUTPUT_DIR \
  --caption_column="prompt" \
  --mixed_precision="bf16" \
  --resolution=1024 \
  --train_batch_size=1 \
  --repeats=1 \
  --report_to="wandb" \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --learning_rate=1.0 \
  --text_encoder_lr=1.0 \
  --optimizer="prodigy" \
  --train_text_encoder_ti \
  --enable_t5_ti \
  --train_text_encoder_ti_frac=0.5 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --rank=8 \
  --max_train_steps=700 \
  --checkpointing_steps=2000 \
  --seed="0"
11/21/2024 15:44:31 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: bf16
tokenizer/tokenizer_config.json: 100%|█████████████████████████████████████████████████████████| 705/705 [00:00<00:00, 2.82MB/s]
tokenizer/vocab.json: 100%|████████████████████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 29.9MB/s]
tokenizer/merges.txt: 100%|██████████████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 19.8MB/s]
tokenizer/special_tokens_map.json: 100%|███████████████████████████████████████████████████████| 588/588 [00:00<00:00, 2.36MB/s]
tokenizer_2/tokenizer_config.json: 100%|███████████████████████████████████████████████████| 20.8k/20.8k [00:00<00:00, 97.8MB/s]
spiece.model: 100%|██████████████████████████████████████████████████████████████████████████| 792k/792k [00:00<00:00, 58.5MB/s]
tokenizer_2/tokenizer.json: 100%|██████████████████████████████████████████████████████████| 2.42M/2.42M [00:00<00:00, 30.9MB/s]
tokenizer_2/special_tokens_map.json: 100%|█████████████████████████████████████████████████| 2.54k/2.54k [00:00<00:00, 9.05MB/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
text_encoder/config.json: 100%|████████████████████████████████████████████████████████████████| 613/613 [00:00<00:00, 5.88MB/s]
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
text_encoder_2/config.json: 100%|██████████████████████████████████████████████████████████████| 782/782 [00:00<00:00, 4.04MB/s]
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
scheduler/scheduler_config.json: 100%|█████████████████████████████████████████████████████████| 273/273 [00:00<00:00, 1.48MB/s]
{'args'} was not found in config. Values will be initialized to default values.
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████| 246M/246M [00:05<00:00, 42.7MB/s]
(…)t_encoder_2/model.safetensors.index.json: 100%|█████████████████████████████████████████| 19.9k/19.9k [00:00<00:00, 46.5MB/s]
model-00001-of-00002.safetensors: 100%|████████████████████████████████████████████████████| 4.99G/4.99G [01:58<00:00, 42.1MB/s]
model-00002-of-00002.safetensors: 100%|████████████████████████████████████████████████████| 4.53G/4.53G [01:48<00:00, 41.8MB/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [03:47<00:00, 113.51s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.56s/it]
vae/config.json: 100%|█████████████████████████████████████████████████████████████████████████| 820/820 [00:00<00:00, 10.1MB/s]
diffusion_pytorch_model.safetensors: 100%|███████████████████████████████████████████████████| 168M/168M [00:04<00:00, 41.6MB/s]
transformer/config.json: 100%|█████████████████████████████████████████████████████████████████| 378/378 [00:00<00:00, 3.77MB/s]
(…)ion_pytorch_model.safetensors.index.json: 100%|███████████████████████████████████████████| 121k/121k [00:00<00:00, 31.6MB/s]
(…)pytorch_model-00003-of-00003.safetensors: 100%|█████████████████████████████████████████| 3.87G/3.87G [01:31<00:00, 42.3MB/s]
(…)pytorch_model-00002-of-00003.safetensors: 100%|█████████████████████████████████████████| 9.95G/9.95G [03:56<00:00, 42.1MB/s]
(…)pytorch_model-00001-of-00003.safetensors: 100%|█████████████████████████████████████████| 9.98G/9.98G [03:57<00:00, 42.1MB/s]
Fetching 3 files: 100%|███████████████████████████████████████████████████████████████████████████| 3/3 [03:57<00:00, 79.13s/it]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.