An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction (minimal, reproducible, runnable)
Change the model id to meta-llama/Meta-Llama-3-8B and set the environment variable with os.environ['XLA_USE_BF16'] = "1". The training loss then shows up as nan. Here is an example of the training log:
{'loss': nan, 'learning_rate': 4.166666666666667e-05, 'epoch': 0.48}
{'loss': nan, 'learning_rate': 3.3333333333333335e-05, 'epoch': 0.96}
{'loss': nan, 'learning_rate': 2.5e-05, 'epoch': 1.45}
{'loss': nan, 'learning_rate': 1.6666666666666667e-05, 'epoch': 1.93}
{'loss': nan, 'learning_rate': 8.333333333333334e-06, 'epoch': 2.41}
{'loss': nan, 'learning_rate': 0.0, 'epoch': 2.89}
100%|███████████████████████████████████████████| 60/60 [09:55<00:00, 9.09s/it]
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 595.4478, 'train_samples_per_second': 1.673, 'train_steps_per_second': 0.101, 'train_loss': nan, 'epoch': 2.89}
100%|███████████████████████████████████████████| 60/60 [09:55<00:00, 9.92s/it]
In addition, trying to run inference on this model creates the following error:
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
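The error message lists three conditions that make a probability vector unusable for sampling. As a rough illustration only (plain Python lists rather than tensors, and a hypothetical helper name that is not part of any library), a check for those same conditions could look like this:

```python
import math

def invalid_probability_indices(probs):
    """Return the indices of entries that are NaN, infinite, or negative.

    Hypothetical helper for illustration; it mirrors the three conditions
    named in the RuntimeError above, applied to a plain list of floats.
    """
    return [
        i for i, p in enumerate(probs)
        if math.isnan(p) or math.isinf(p) or p < 0
    ]

# Example: a distribution polluted by NaN weights upstream.
bad = invalid_probability_indices([0.2, float("nan"), 0.5, float("inf"), -0.1])
```

A non-empty result here would point at the same problem as the nan training loss: the model weights or logits already contain non-finite values before sampling starts.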
Expected behavior
Training and inference should run normally, as the notebook intends.
I did try without packing. With packing turned on, one potential problem is that if the packed length is shorter than the given prompt, some samples end up with no trainable data and cause the model to produce nothing valuable, hence the nan loss. With packing turned off, though, I'm still seeing this issue.
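To make the packing concern concrete, here is a minimal sketch in plain Python. It assumes prompt tokens are masked with a label of -100 (the usual ignore index for cross-entropy in PyTorch-style training; whether this notebook masks prompts that way is an assumption), and all function names are hypothetical. Averaging a loss over zero unmasked tokens has no defined value, which the sketch represents as nan:

```python
IGNORE_INDEX = -100  # assumption: label value that is excluded from the loss

def trainable_token_count(labels, ignore_index=IGNORE_INDEX):
    """Count label positions that actually contribute to the loss."""
    return sum(1 for t in labels if t != ignore_index)

def samples_without_targets(batch_labels, ignore_index=IGNORE_INDEX):
    """Return indices of samples whose labels are entirely masked."""
    return [
        i for i, labels in enumerate(batch_labels)
        if trainable_token_count(labels, ignore_index) == 0
    ]

def masked_mean(losses, labels, ignore_index=IGNORE_INDEX):
    """Mean of per-token losses over unmasked positions.

    Returns nan when every position is masked, standing in for the
    0/0 that a mean-reduced loss would compute in that case.
    """
    kept = [l for l, t in zip(losses, labels) if t != ignore_index]
    return sum(kept) / len(kept) if kept else float("nan")

# Second sample was truncated inside the prompt, so nothing is trainable.
batch = [[-100, -100, 42, 7], [-100, -100, -100, -100]]
empty = samples_without_targets(batch)
```

Running a check like this over the tokenized dataset before training would show whether any sample is fully masked. Since the nan loss also appears with packing turned off, this would only rule that cause in or out, not explain the issue by itself.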
System Info
Who can help?
@michaelbenayoun
Information