Llama 3 8B fine tuning shows nan value as loss #660

Open · 2 of 4 tasks
BaiqingL opened this issue Jul 20, 2024 · 4 comments
Labels
bug Something isn't working Stale

Comments

@BaiqingL

BaiqingL commented Jul 20, 2024

System Info

Platform:

- Platform: Linux-5.15.0-1056-aws-x86_64-with-glibc2.29
- Python version: 3.8.10


Python packages:

- `optimum-neuron` version: 0.0.24.dev0
- `neuron-sdk` version: 2.18.0
- `optimum` version: 1.20.0
- `transformers` version: 4.41.1
- `huggingface_hub` version: 0.23.2
- `torch` version: 2.1.2+cu121
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 2.0.965
- `neuronx-cc` version: 2.13.66.0+6dfecc895
- `neuronx-distributed` version: 0.7.0
- `neuronx-hwm` version: NA
- `torch-neuronx` version: 2.1.2.2.1.0
- `torch-xla` version: 2.1.2
- `transformers-neuronx` version: 0.10.0.21


Neuron Driver:


WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

aws-neuronx-collectives/unknown,now 2.20.22.0-c101c322e amd64 [installed,upgradable to: 2.21.46.0-69b77134b]
aws-neuronx-dkms/unknown,now 2.16.7.0 amd64 [installed,upgradable to: 2.17.17.0]
aws-neuronx-oci-hook/unknown,now 2.3.0.0 amd64 [installed,upgradable to: 2.4.4.0]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed,upgradable to: 2.21.41.0-fb1705f5f]
aws-neuronx-tools/unknown,now 2.17.1.0 amd64 [installed,upgradable to: 2.18.3.0]

Who can help?

@michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Change the model id to meta-llama/Meta-Llama-3-8B and add the environment variable via os.environ['XLA_USE_BF16'] = "1"; the training loss then shows up as nan. Here is an example of the training log (a minimal sketch of the two changes follows the log):

{'loss': nan, 'learning_rate': 4.166666666666667e-05, 'epoch': 0.48}            
{'loss': nan, 'learning_rate': 3.3333333333333335e-05, 'epoch': 0.96}           
{'loss': nan, 'learning_rate': 2.5e-05, 'epoch': 1.45}                          
{'loss': nan, 'learning_rate': 1.6666666666666667e-05, 'epoch': 1.93}           
{'loss': nan, 'learning_rate': 8.333333333333334e-06, 'epoch': 2.41}            
{'loss': nan, 'learning_rate': 0.0, 'epoch': 2.89}                              
100%|███████████████████████████████████████████| 60/60 [09:55<00:00,  9.09s/it]

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 595.4478, 'train_samples_per_second': 1.673, 'train_steps_per_second': 0.101, 'train_loss': nan, 'epoch': 2.89}
100%|███████████████████████████████████████████| 60/60 [09:55<00:00,  9.92s/it]
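
For reference, a minimal sketch of the two changes described above, assuming the notebook's training script loads the tokenizer with transformers (the dataset and trainer setup are unchanged and omitted here):

import os

# Force bf16 on XLA/Neuron; must be set before torch_xla is initialized.
os.environ["XLA_USE_BF16"] = "1"

from transformers import AutoTokenizer

# Swap the notebook's original model id for Llama 3 8B.
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# ... the rest of the notebook (dataset, trainer) is left as-is.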

In addition, trying to run inference on this model produces the following error:

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
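
A quick way to check whether the divergence is baked into the saved checkpoint (a sketch, assuming the fine-tuned weights can be loaded with transformers; the checkpoint path is hypothetical):

import torch
from transformers import AutoModelForCausalLM

checkpoint_dir = "./llama3-8b-finetuned"  # hypothetical output dir from the notebook
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir, torch_dtype=torch.bfloat16)

# If training diverged, the saved weights usually contain nan/inf themselves,
# which later surfaces at sampling time as the error above.
bad = [n for n, p in model.named_parameters()
       if torch.isnan(p).any() or torch.isinf(p).any()]
print(f"{len(bad)} parameter tensors contain nan/inf")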

Expected behavior

Normal training and normal inference, as the notebook intends.

@BaiqingL BaiqingL added the bug Something isn't working label Jul 20, 2024
@BaiqingL BaiqingL changed the title Llama 3B fine tuning shows nan value as loss Llama 3 8B fine tuning shows nan value as loss Jul 20, 2024
@jianyinglangaws

I saw the same.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Oct 14, 2024
@anilozlu

I have the same problem: when I try training with packing=True I get nan loss. You should try without packing and see if it changes anything.
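
The exact trainer API depends on the notebook and library versions, but with trl's SFTTrainer/SFTConfig the flag looks roughly like the sketch below (model and dataset are assumed to come from the notebook; depending on the trl / optimum-neuron version, packing may instead be passed directly to the trainer constructor):

from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    output_dir="./llama3-8b-finetuned",  # hypothetical
    packing=False,  # disable example packing to rule it out
)

trainer = SFTTrainer(
    model=model,            # assumed to be defined earlier in the notebook
    train_dataset=dataset,  # assumed to be defined earlier in the notebook
    args=args,
)
trainer.train()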

@BaiqingL
Author

I did try without packing. If you turn packing on, the potential problem is that when the packed length is shorter than the given prompt, some samples end up with no trainable tokens at all, so the model produces nothing useful from them, hence the nan loss. With packing turned off, though, I am still seeing this issue.
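
To make that failure mode concrete, a tiny sketch in plain PyTorch (no Neuron involved) showing that a sample whose labels are entirely masked out yields a nan cross-entropy loss:

import torch
import torch.nn.functional as F

vocab_size = 8
logits = torch.randn(1, 5, vocab_size)   # [batch, seq_len, vocab]

# A packed window filled entirely by the prompt: every label is masked
# with -100, so no token is left to compute the loss on.
labels = torch.full((1, 5), -100)

loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1),
                       ignore_index=-100)
print(loss)  # tensor(nan): 0 valid tokens, so the mean divides by zero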
