Llama 3 8B fine tuning shows nan value as loss #660

Open · 2 of 4 tasks
BaiqingL opened this issue Jul 20, 2024 · 4 comments
Labels
bug Something isn't working Stale

Comments

@BaiqingL

BaiqingL commented Jul 20, 2024

System Info

Platform:

- Platform: Linux-5.15.0-1056-aws-x86_64-with-glibc2.29
- Python version: 3.8.10


Python packages:

- `optimum-neuron` version: 0.0.24.dev0
- `neuron-sdk` version: 2.18.0
- `optimum` version: 1.20.0
- `transformers` version: 4.41.1
- `huggingface_hub` version: 0.23.2
- `torch` version: 2.1.2+cu121
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 2.0.965
- `neuronx-cc` version: 2.13.66.0+6dfecc895
- `neuronx-distributed` version: 0.7.0
- `neuronx-hwm` version: NA
- `torch-neuronx` version: 2.1.2.2.1.0
- `torch-xla` version: 2.1.2
- `transformers-neuronx` version: 0.10.0.21


Neuron Driver:


WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

aws-neuronx-collectives/unknown,now 2.20.22.0-c101c322e amd64 [installed,upgradable to: 2.21.46.0-69b77134b]
aws-neuronx-dkms/unknown,now 2.16.7.0 amd64 [installed,upgradable to: 2.17.17.0]
aws-neuronx-oci-hook/unknown,now 2.3.0.0 amd64 [installed,upgradable to: 2.4.4.0]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed,upgradable to: 2.21.41.0-fb1705f5f]
aws-neuronx-tools/unknown,now 2.17.1.0 amd64 [installed,upgradable to: 2.18.3.0]

Who can help?

@michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Change the model id to meta-llama/Meta-Llama-3-8B and add the environment variable via os.environ['XLA_USE_BF16'] = "1"; the training loss then shows up as nan. Here is an example of the training log (a minimal sketch of the two changes follows the log):

{'loss': nan, 'learning_rate': 4.166666666666667e-05, 'epoch': 0.48}            
{'loss': nan, 'learning_rate': 3.3333333333333335e-05, 'epoch': 0.96}           
{'loss': nan, 'learning_rate': 2.5e-05, 'epoch': 1.45}                          
{'loss': nan, 'learning_rate': 1.6666666666666667e-05, 'epoch': 1.93}           
{'loss': nan, 'learning_rate': 8.333333333333334e-06, 'epoch': 2.41}            
{'loss': nan, 'learning_rate': 0.0, 'epoch': 2.89}                              
100%|███████████████████████████████████████████| 60/60 [09:55<00:00,  9.09s/it]

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 595.4478, 'train_samples_per_second': 1.673, 'train_steps_per_second': 0.101, 'train_loss': nan, 'epoch': 2.89}
100%|███████████████████████████████████████████| 60/60 [09:55<00:00,  9.92s/it]
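
For reference, a minimal sketch of the two changes described above, assuming the notebook's training script loads the tokenizer with transformers (the dataset and trainer setup are unchanged and omitted here):

import os

# Force bf16 on XLA/Neuron; must be set before torch_xla is initialized.
os.environ["XLA_USE_BF16"] = "1"

from transformers import AutoTokenizer

# Swap the notebook's original model id for Llama 3 8B.
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# ... the rest of the notebook (dataset, trainer) is left as-is.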

In addition, trying to run inference on this model produces the following error:

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
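
A quick way to check whether the divergence is baked into the saved checkpoint (a sketch, assuming the fine-tuned weights can be loaded with transformers; the checkpoint path is hypothetical):

import torch
from transformers import AutoModelForCausalLM

checkpoint_dir = "./llama3-8b-finetuned"  # hypothetical output dir from the notebook
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir, torch_dtype=torch.bfloat16)

# If training diverged, the saved weights usually contain nan/inf themselves,
# which later surfaces at sampling time as the error above.
bad = [n for n, p in model.named_parameters()
       if torch.isnan(p).any() or torch.isinf(p).any()]
print(f"{len(bad)} parameter tensors contain nan/inf")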

Expected behavior

Normal training and normal inference, as the notebook intends.

@BaiqingL BaiqingL added the bug Something isn't working label Jul 20, 2024
@BaiqingL BaiqingL changed the title Llama 3B fine tuning shows nan value as loss Llama 3 8B fine tuning shows nan value as loss Jul 20, 2024
@jianyinglangaws

I saw the same.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Oct 14, 2024
@anilozlu

I have the same problem: when I try training with packing=True I get nan loss. You should try without packing and see if it changes anything.
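
The exact trainer API depends on the notebook and library versions, but with trl's SFTTrainer/SFTConfig the flag looks roughly like the sketch below (model and dataset are assumed to come from the notebook; depending on the trl / optimum-neuron version, packing may instead be passed directly to the trainer constructor):

from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    output_dir="./llama3-8b-finetuned",  # hypothetical
    packing=False,  # disable example packing to rule it out
)

trainer = SFTTrainer(
    model=model,            # assumed to be defined earlier in the notebook
    train_dataset=dataset,  # assumed to be defined earlier in the notebook
    args=args,
)
trainer.train()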

@BaiqingL
Author

I did try without packing. If you turn packing on, the potential problem is that when the packed length is shorter than the given prompt, some samples end up with no trainable tokens at all, so the model produces nothing useful from them, hence the nan loss. With packing turned off, though, I am still seeing this issue.
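
To make that failure mode concrete, a tiny sketch in plain PyTorch (no Neuron involved) showing that a sample whose labels are entirely masked out yields a nan cross-entropy loss:

import torch
import torch.nn.functional as F

vocab_size = 8
logits = torch.randn(1, 5, vocab_size)   # [batch, seq_len, vocab]

# A packed window filled entirely by the prompt: every label is masked
# with -100, so no token is left to compute the loss on.
labels = torch.full((1, 5), -100)

loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1),
                       ignore_index=-100)
print(loss)  # tensor(nan): 0 valid tokens, so the mean divides by zero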
