Please check that this issue hasn't been reported before.
I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
max_steps should be correctly calculated
Current behaviour
I am running into dataloader exhaustion, followed by a crash when the HF Trainer tries to gather batch sizes from all ranks; a minimal sketch of the failing check follows the traceback below.
The dataset I am using is private, so it will take me some time to figure out how to reproduce this.
Relevant setup:
2 GPU devices
sample packing enabled
completion dataset - so using the axolotl.prompt_strategies.user_defined strategy
micro_batch_size 1
[rank0]: File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/axolotl/cli/train.py", line 34, in do_cli
[rank0]: return do_train(parsed_cfg, parsed_cli_args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/axolotl/cli/train.py", line 47, in do_train
[rank0]: model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/axolotl/train.py", line 191, in train
[rank0]: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank0]: File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/transformers/trainer.py", line 2473, in _inner_training_loop
[rank0]: batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/transformers/trainer.py", line 5142, in get_batch_samples
[rank0]: num_items_in_batch = self.accelerator.gather(num_items_in_batch).sum().item()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/accelerator.py", line 2458, in gather
[rank0]: return gather(tensor)
[rank0]: ^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 376, in wrapper
[rank0]: return function(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 437, in gather
[rank0]: return _gpu_gather(tensor)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 356, in _gpu_gather
[rank0]: return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 129, in recursively_apply
[rank0]: raise TypeError(
[rank0]: TypeError: Unsupported types (<class 'NoneType'>) passed to `_gpu_gather_one`. Only nested list/tuple/dicts of objects that are valid for `is_torch_tensor` should be passed.
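From the traceback, the crash appears to happen because `get_batch_samples` never fills in `num_items_in_batch` once the epoch iterator is exhausted, so a plain `None` reaches `accelerator.gather`. A minimal sketch of just that failing type check (simplified, not the actual trainer code path; `_identity` is a hypothetical stand-in for accelerate's `_gpu_gather_one`):

```python
# Sketch only: trigger the same type check that raises the TypeError above.
from accelerate.utils.operations import recursively_apply

def _identity(tensor):
    # hypothetical stand-in for accelerate's `_gpu_gather_one`
    return tensor

num_items_in_batch = None  # what an exhausted dataloader leaves behind
recursively_apply(_identity, num_items_in_batch, error_on_other_type=True)
# TypeError: Unsupported types (<class 'NoneType'>) passed to `_identity`. Only nested
# list/tuple/dicts of objects that are valid for `is_torch_tensor` should be passed.
```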
Steps to reproduce
TODO
Config yaml
{}
Possible solution
TODO
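Not a confirmed fix, but one hedged idea, assuming the immediate problem is just the `None` reaching the gather: skip the cross-rank gather when a rank never computed a token count. A self-contained sketch (`safe_gathered_item_count` is a hypothetical helper, not an existing transformers/axolotl API):

```python
import torch
from accelerate import Accelerator

def safe_gathered_item_count(accelerator, num_items_in_batch):
    # Hypothetical guard: don't pass None into accelerator.gather when this
    # rank's dataloader was already exhausted and no token count was computed.
    if num_items_in_batch is None:
        return None
    return accelerator.gather(num_items_in_batch).sum().item()

if __name__ == "__main__":
    accelerator = Accelerator()
    print(safe_gathered_item_count(accelerator, None))             # None, no crash
    print(safe_gathered_item_count(accelerator, torch.tensor(8)))  # 8 on a single process
```

In a real multi-GPU run such a guard would have to fire consistently on every rank (otherwise the ranks desync on the collective), so the more robust fix is probably making the max_steps calculation agree with what the packed dataloader can actually yield per rank, per the Expected Behavior above.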
Which Operating Systems are you using?
Linux
macOS
Windows
Python Version
3.11
axolotl branch-commit
main
Acknowledgements
My issue title is concise, descriptive, and in title casing.
I have searched the existing issues to make sure this bug has not been reported yet.
I am using the latest version of axolotl.
I have provided enough information for the maintainers to reproduce and diagnose the issue.