FLUX.1-dev DreamBooth training on multi-GPU #9277
Replies: 3 comments
-
Could you put the error log inside three backticks for better readability? Signal 9 (SIGKILL) is usually related to insufficient system RAM (not GPU VRAM).
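One quick way to confirm that (a minimal sketch, assuming you can read the kernel log on the training node) is to check whether the kernel OOM killer terminated the rank, and to watch free RAM while the run starts:

```bash
# Look for an OOM-killer entry around the time the rank died
sudo dmesg -T | grep -i -E "out of memory|oom-kill"

# Snapshot of total vs. available system RAM
free -h

# Keep this running in a second terminal while the training processes load
watch -n 2 free -h
```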
-
--resolution=1024 :) I have 128 GB of RAM.

Aug 26 13:49:37 ml-dev-worker-1 kernel: Out of memory: Killed process 678761 (python3.10) total-vm:81000448kB, anon-rss:63219524kB, file-rss:1816kB, shmem-rss:4kB, UID:100>

But I get the same error on a machine with 200 GB RAM.
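In case it helps others hitting this: each of the four ranks appears to load the full FLUX.1-dev weights into host RAM before moving them to its GPU (the ~63 GB anon-rss per process in the kernel log above points that way), so a rough way to test whether total system RAM is the bottleneck is to launch with fewer processes first. A minimal sketch, reusing the model path and flags from the original command, just to see whether a single rank fits in host RAM:

```bash
# Single-rank test run; the goal is only to check host RAM, not to train fully.
accelerate launch --num_processes=1 train_dreambooth_flux.py \
  --pretrained_model_name_or_path="/mnt/s1/flux_dev/FLUX.1-dev" \
  --instance_data_dir="/home/wrusr/flux_train_env/flux_dreambooth_train/dreambooth-datasets_/style_spiderverse" \
  --instance_prompt="A photo of spdrvrs" \
  --output_dir="/home/wrusr/flux_train_env/flux_dreambooth_train" \
  --resolution=1024 --train_batch_size=1 --mixed_precision="fp16" \
  --max_train_steps=10
# If this runs, increase --num_processes until the SIGKILL reappears:
# peak host RAM during model loading grows roughly linearly with the number of ranks.
```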
-
Hi, I have the same problem. Have you solved it?
-
I have four RTX 6000 Ada Generation cards, but I get an error when training FLUX.1-dev with DreamBooth (train_dreambooth_flux.py). Can anyone help?
pip list:
absl-py 2.1.0
accelerate 0.33.0
bitsandbytes 0.43.3
certifi 2024.7.4
charset-normalizer 3.3.2
click 8.1.7
cmake 3.25.0
diffusers 0.31.0.dev0
docker-pycreds 0.4.0
filelock 3.15.4
fsspec 2024.6.1
ftfy 6.2.3
gitdb 4.0.11
GitPython 3.1.43
grpcio 1.66.0
huggingface-hub 0.24.6
idna 3.8
importlib_metadata 8.4.0
Jinja2 3.1.4
lit 15.0.7
Markdown 3.7
MarkupSafe 2.1.5
mpmath 1.3.0
networkx 3.3
numpy 1.26.4
nvidia-cublas-cu11 11.11.3.6
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu11 11.8.87
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.8.89
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.7.0.84
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu11 10.9.0.58
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu11 10.3.0.86
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu11 11.4.1.48
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu11 11.7.5.86
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu11 2.19.3
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.20
nvidia-nvtx-cu11 11.8.86
nvidia-nvtx-cu12 12.1.105
packaging 24.1
peft 0.12.0
pillow 10.4.0
pip 22.0.2
platformdirs 4.2.2
protobuf 5.27.3
psutil 6.0.0
PyYAML 6.0.2
regex 2024.7.24
requests 2.32.3
safetensors 0.4.4
sentencepiece 0.2.0
sentry-sdk 2.13.0
setproctitle 1.3.3
setuptools 59.6.0
six 1.16.0
smmap 5.0.1
sympy 1.13.2
tensorboard 2.17.1
tensorboard-data-server 0.7.2
tokenizers 0.19.1
torch 2.2.1+cu118
torchaudio 2.2.1+cu118
torchvision 0.17.1+cu118
tqdm 4.66.5
transformers 4.44.2
triton 2.2.0
typing_extensions 4.12.2
urllib3 2.2.2
wandb 0.17.7
wcwidth 0.2.13
Werkzeug 3.0.4
zipp 3.20.0
nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX 6000 Ada Gene... On | 00000000:00:10.0 Off | Off |
| 30% 32C P8 28W / 300W | 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX 6000 Ada Gene... On | 00000000:00:11.0 Off | Off |
| 30% 32C P8 24W / 300W | 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX 6000 Ada Gene... On | 00000000:00:1B.0 Off | Off |
| 30% 30C P8 22W / 300W | 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX 6000 Ada Gene... On | 00000000:00:1C.0 Off | Off |
| 30% 28C P8 24W / 300W | 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
run command:
accelerate launch --num_processes=4 train_dreambooth_flux.py \
  --pretrained_model_name_or_path="/mnt/s1/flux_dev/FLUX.1-dev" \
  --instance_data_dir="/home/wrusr/flux_train_env/flux_dreambooth_train/dreambooth-datasets_/style_spiderverse" \
  --caption_column="/home/wrusr/flux_train_env/flux_dreambooth_train/dreambooth-datasets_/atext" \
  --instance_prompt="A photo of spdrvrs" \
  --output_dir="/home/wrusr/flux_train_env/flux_dreambooth_train" \
  --resolution=1024 \
  --train_batch_size=1 \
  --mixed_precision="fp16" \
  --max_train_steps=500 \
  --revision="fp16" \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --seed="0" \
  --push_to_hub
error:
08/26/2024 11:46:54 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
08/26/2024 11:46:54 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 3
Local process index: 3
Device: cuda:3
Mixed precision type: fp16
08/26/2024 11:46:54 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 2
Local process index: 2
Device: cuda:2
Mixed precision type: fp16
08/26/2024 11:46:54 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: fp16
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.75s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.91s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.92s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.92s/it]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
[2024-08-26 11:48:25,610] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 32699 closing signal SIGTERM
[2024-08-26 11:48:25,733] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 32701 closing signal SIGTERM
[2024-08-26 11:48:25,733] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 32702 closing signal SIGTERM
[2024-08-26 11:48:28,939] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 1 (pid: 32700) of binary: /home/wrusr/flux_train_env/bin/python3.10
Traceback (most recent call last):
File "/home/wrusr/flux_train_env/bin/accelerate", line 8, in
sys.exit(main())
File "/home/wrusr/flux_train_env/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/wrusr/flux_train_env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
multi_gpu_launcher(args)
File "/home/wrusr/flux_train_env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
distrib_run.run(args)
File "/home/wrusr/flux_train_env/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/wrusr/flux_train_env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/wrusr/flux_train_env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_dreambooth_flux.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-08-26_11:48:25
host : wr-dev-train-1
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 32700)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 32700
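For anyone debugging the same exitcode -9: a simple sketch (assuming `python3.10` is the only Python running on the node) to watch how much host RAM each rank consumes while the checkpoint shards load:

```bash
# Show the training processes sorted by resident memory (RSS, in kB),
# refreshed every 2 seconds; run this in a second terminal during startup.
watch -n 2 "ps -C python3.10 -o pid,rss,%mem,args --sort=-rss | head -n 6"
```

If the per-process RSS climbs toward (total RAM / number of ranks) before training even starts, the SIGKILL is coming from the kernel OOM killer rather than from anything inside the script.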