Help - When launching advanced training flux it gets stuck #9984
Unanswered
duchamps0305 asked this question in Q&A
Replies: 1 comment
-
Hi, if it worked a day before and now it doesn't, that probably means RunPod changed something on their side, so you should open an issue there so they can help you find the cause of your problem.
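One generic way to find out where a run like this is actually stuck (a sketch using only the Python standard library, not specific to this training script): register `faulthandler` near the top of the script, then send the process a `SIGUSR1` from another shell to make it print a traceback of every thread.

```python
import faulthandler
import os
import signal
import sys

# Print a traceback of all threads whenever the process receives SIGUSR1.
# Put this near the top of the training script, before the long-running work.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# From another shell you would then run:
#   kill -USR1 <pid of the training process>
# Demonstration only: here the process signals itself and dumps its own stack.
os.kill(os.getpid(), signal.SIGUSR1)
```

The traceback shows exactly which call the process is blocked in (e.g. a checkpoint load versus a network read), which narrows down whether the hang is on RunPod's side or in the script.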
-
Hi everyone, I'm trying to fine-tune a Flux DreamBooth LoRA using the advanced script, so that the embeddings are fine-tuned as well.
I followed all the steps; however, whether I paste the example script or my own, training always gets stuck at the same point after downloading the model. I've tried multiple times but can't understand why this happens.
I'm running on an A40 on RunPod; yesterday it went smoothly, but today it really doesn't seem to work.
This is my config:
(venv) root@da0c51e8dcb3:/workspace/diffusers/examples/advanced_diffusion_training# export MODEL_NAME="black-forest-labs/FLUX.1-dev"
export DATASET_NAME="linoyts/3d_icon"
export OUTPUT_DIR="3d-icon-Flux-LoRA"
accelerate launch train_dreambooth_lora_flux_advanced.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --instance_prompt="3d icon in the style of TOK" \
  --output_dir=$OUTPUT_DIR \
  --caption_column="prompt" \
  --mixed_precision="bf16" \
  --resolution=1024 \
  --train_batch_size=1 \
  --repeats=1 \
  --report_to="wandb" \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --learning_rate=1.0 \
  --text_encoder_lr=1.0 \
  --optimizer="prodigy" \
  --train_text_encoder_ti \
  --enable_t5_ti \
  --train_text_encoder_ti_frac=0.5 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --rank=8 \
  --max_train_steps=700 \
  --checkpointing_steps=2000 \
  --seed="0"
11/21/2024 15:44:31 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: bf16
tokenizer/tokenizer_config.json: 100%|█████████████████████████████████████████████████████████| 705/705 [00:00<00:00, 2.82MB/s]
tokenizer/vocab.json: 100%|████████████████████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 29.9MB/s]
tokenizer/merges.txt: 100%|██████████████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 19.8MB/s]
tokenizer/special_tokens_map.json: 100%|███████████████████████████████████████████████████████| 588/588 [00:00<00:00, 2.36MB/s]
tokenizer_2/tokenizer_config.json: 100%|███████████████████████████████████████████████████| 20.8k/20.8k [00:00<00:00, 97.8MB/s]
spiece.model: 100%|██████████████████████████████████████████████████████████████████████████| 792k/792k [00:00<00:00, 58.5MB/s]
tokenizer_2/tokenizer.json: 100%|██████████████████████████████████████████████████████████| 2.42M/2.42M [00:00<00:00, 30.9MB/s]
tokenizer_2/special_tokens_map.json: 100%|█████████████████████████████████████████████████| 2.54k/2.54k [00:00<00:00, 9.05MB/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
text_encoder/config.json: 100%|████████████████████████████████████████████████████████████████| 613/613 [00:00<00:00, 5.88MB/s]
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
text_encoder_2/config.json: 100%|██████████████████████████████████████████████████████████████| 782/782 [00:00<00:00, 4.04MB/s]
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
scheduler/scheduler_config.json: 100%|█████████████████████████████████████████████████████████| 273/273 [00:00<00:00, 1.48MB/s]
{'args'} was not found in config. Values will be initialized to default values.
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████| 246M/246M [00:05<00:00, 42.7MB/s]
(…)t_encoder_2/model.safetensors.index.json: 100%|█████████████████████████████████████████| 19.9k/19.9k [00:00<00:00, 46.5MB/s]
model-00001-of-00002.safetensors: 100%|████████████████████████████████████████████████████| 4.99G/4.99G [01:58<00:00, 42.1MB/s]
model-00002-of-00002.safetensors: 100%|████████████████████████████████████████████████████| 4.53G/4.53G [01:48<00:00, 41.8MB/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [03:47<00:00, 113.51s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.56s/it]
vae/config.json: 100%|█████████████████████████████████████████████████████████████████████████| 820/820 [00:00<00:00, 10.1MB/s]
diffusion_pytorch_model.safetensors: 100%|███████████████████████████████████████████████████| 168M/168M [00:04<00:00, 41.6MB/s]
transformer/config.json: 100%|█████████████████████████████████████████████████████████████████| 378/378 [00:00<00:00, 3.77MB/s]
(…)ion_pytorch_model.safetensors.index.json: 100%|███████████████████████████████████████████| 121k/121k [00:00<00:00, 31.6MB/s]
(…)pytorch_model-00003-of-00003.safetensors: 100%|█████████████████████████████████████████| 3.87G/3.87G [01:31<00:00, 42.3MB/s]
(…)pytorch_model-00002-of-00003.safetensors: 100%|█████████████████████████████████████████| 9.95G/9.95G [03:56<00:00, 42.1MB/s]
(…)pytorch_model-00001-of-00003.safetensors: 100%|█████████████████████████████████████████| 9.98G/9.98G [03:57<00:00, 42.1MB/s]
Fetching 3 files: 100%|███████████████████████████████████████████████████████████████████████████| 3/3 [03:57<00:00, 79.13s/it]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.