cogvideo training error #10315

Open
linwenzhao1 opened this issue Dec 20, 2024 · 3 comments
Labels
bug (Something isn't working) · training

Comments

@linwenzhao1

Describe the bug

Fine-tuning the model on both GPUs fails with the following error: RuntimeError: CUDA driver error: invalid argument
Do you know what the problem is?

Reproduction

rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank1: return self._call_impl(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
rank1: return forward_call(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/diffusers-0.32.0.dev0-py3.11.egg/diffusers/models/transformers/cogvideox_transformer_3d.py", line 148, in forward
rank1: ff_output = self.ff(norm_hidden_states)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank1: return self._call_impl(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
rank1: return forward_call(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/diffusers-0.32.0.dev0-py3.11.egg/diffusers/models/attention.py", line 1242, in forward
rank1: hidden_states = module(hidden_states)
rank1: ^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank1: return self._call_impl(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
rank1: return forward_call(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/diffusers-0.32.0.dev0-py3.11.egg/diffusers/models/activations.py", line 88, in forward
rank1: hidden_states = self.proj(hidden_states)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank1: return self._call_impl(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
rank1: return forward_call(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
rank1: return F.linear(input, self.weight, self.bias)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: RuntimeError: CUDA driver error: invalid argument
Steps: 0%| | 0/133600000 [00:12<?, ?it/s]
[rank0]:[W1220 14:39:33.155016577 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W1220 14:39:35.723000 381051 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 381224 closing signal SIGTERM
E1220 14:39:36.039000 381051 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 381223) of binary: /home/conda_env/controlnet/bin/python
Traceback (most recent call last):
File "/home/conda_env/controlnet/bin/accelerate", line 8, in
sys.exit(main())
^^^^^^
File "/home/conda_env/controlnet/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/conda_env/controlnet/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/home/conda_env/controlnet/lib/python3.11/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_controlnet.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-20_14:39:35
host : robot
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 381223)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Logs

No response

System Info

Ubuntu 20.04
CUDA 12.0
torch 2.5
diffusers 0.32.0.dev0

Who can help?

No response

linwenzhao1 added the bug label Dec 20, 2024
@hlky
Collaborator

hlky commented Dec 20, 2024

cc @linoytsaban @sayakpaul for training.

hlky added the training label Dec 20, 2024
@sayakpaul
Member

Cc: @a-r-r-o-w for training

@a-r-r-o-w
Member

I'm unable to deduce what exactly caused this error. I can see it happens in the feed-forward projection of the attention block, but there's nothing hinting at why. Could you run with CUDA_LAUNCH_BLOCKING=1 and share the results? Does it also happen with PyTorch nightly? I'm able to run the CogVideoX scripts in https://github.com/a-r-r-o-w/finetrainers just fine.
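
For reference, a minimal sketch of the debugging suggestion above, assuming the run is started with accelerate launch train_controlnet.py (the actual script arguments aren't shown in this issue). CUDA_LAUNCH_BLOCKING must be set before the CUDA context is created, so either export it in the shell that runs accelerate launch or set it at the very top of the training script:

# Hypothetical placement: the first lines of train_controlnet.py, before torch
# initializes CUDA. Roughly equivalent to running
#   CUDA_LAUNCH_BLOCKING=1 accelerate launch train_controlnet.py <args>
# from the shell.
import os

# Make kernel launches synchronous so the "invalid argument" error is raised at
# the actual failing call instead of at a later, unrelated op in the traceback.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the env var on purpose

With blocking launches enabled, the traceback should point at the operation that actually triggers the driver error, which makes it easier to tell whether this is a shape/dtype problem in the model or an environment issue (e.g. the CUDA 12.0 driver with torch 2.5).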
