MPMD errors when enabling pipeline parallel for fine-tuning llama 3 8B model #674

Open
2 of 4 tasks
bingchen-liu opened this issue Jul 31, 2024 · 2 comments
Labels
bug Something isn't working Stale

Comments

@bingchen-liu

System Info

optimum-neuron==0.0.22
transformers==4.36.2
python==3.10
torch==2.1.2
optimum==1.18.*

Training image in SageMaker:
https://github.com/aws-neuron/deep-learning-containers/blob/2.19.1/docker/pytorch/training/2.1.2/Dockerfile.neuronx

Who can help?

@michaelbenayoun @JingyaHuang @philschmid

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

https://github.com/aws-samples/ml-specialized-hardware/blob/main/tutorials/06_FinetuneLLMs/01_Finetune_LLMs.ipynb

The training is based on the notebook above. I used tensor parallelism tp=8 and pipeline parallelism pp=2 on 2 trn1.32xlarge instances, with the official Llama 3 8B model.
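
For context, the parallel layout implied by these settings can be worked out with a few lines of arithmetic. This is only a sketch, assuming 32 NeuronCores per trn1.32xlarge instance and one worker per core; the helper name `parallel_layout` is hypothetical and not part of optimum-neuron.

```python
def parallel_layout(num_instances, cores_per_instance, tp, pp):
    """Return (world_size, model_parallel_size, data_parallel_size)."""
    world_size = num_instances * cores_per_instance
    # One model replica spans tp * pp cores.
    model_parallel_size = tp * pp
    if world_size % model_parallel_size != 0:
        raise ValueError("tp * pp must divide the total number of cores")
    return world_size, model_parallel_size, world_size // model_parallel_size


# Assumption: 32 NeuronCores per trn1.32xlarge instance, one worker per core.
world, mp, dp = parallel_layout(num_instances=2, cores_per_instance=32, tp=8, pp=2)
print(f"world_size={world}, cores per model replica={mp}, data-parallel replicas={dp}")
```

Under those assumptions the job runs 64 workers, with each model replica spread over 16 cores and 4 data-parallel replicas in total.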

Expected behavior

The following errors appeared while fine-tuning the model:

2024-Jul-31 14:47:33.095528
2024-Jul-31 14:47:33.095528
2024-Jul-31 14:47:33.095536
2024-Jul-31 14:47:33.095537
2024-Jul-31 14:47:33.095535
2024-Jul-31 14:47:33.095538
2024-Jul-31 14:47:33.095535
2024-Jul-31 14:47:33.095540
2024-Jul-31 14:47:33.095544 98:250 ERROR TDRV:v2_cc_execute
104:258 ERROR TDRV:v2_cc_execute
2024-Jul-31 14:47:33.095545 114:253 ERROR TDRV:v2_cc_execute
120:261 ERROR TDRV:v2_cc_execute
125:286 ERROR TDRV:v2_cc_execute
112:277 ERROR TDRV:v2_cc_execute [nec_dev 4, gid 4] MPMD detected but reload is not supported yet
96:263 ERROR TDRV:v2_cc_execute
117:274 ERROR TDRV:v2_cc_execute
115:273 ERROR TDRV:v2_cc_execute
2024-Jul-31 14:47:33.095550
[nec_dev 31, gid 31] MPMD detected but reload is not supported yet
[nec_dev 18, gid 18] MPMD detected but reload is not supported yet
[nec_dev 23, gid 23] MPMD detected but reload is not supported yet
2024-Jul-31 14:47:33.095558
2024-Jul-31 14:47:33.095544
2024-Jul-31 14:47:33.095567
[nec_dev 21, gid 21] MPMD detected but reload is not supported yet
2024-Jul-31 14:47:33.095543 116:256 ERROR TDRV:v2_cc_execute
2024-Jul-31 14:47:33.095573
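
Since the output of many workers is interleaved, it can help to pull out just the device ids that reported the error. Below is a rough sketch using only the Python standard library. The regular expression is based on the message format visible in the log above, the sample string is a short excerpt of that log, and the helper name `mpmd_devices` is made up for illustration.

```python
import re

# Pattern derived from the message text in the log above.
MPMD_RE = re.compile(
    r"\[nec_dev (\d+), gid (\d+)\] MPMD detected but reload is not supported yet"
)


def mpmd_devices(log_text):
    """Return the sorted, de-duplicated nec_dev ids that reported the MPMD error."""
    return sorted({int(m.group(1)) for m in MPMD_RE.finditer(log_text)})


# Short excerpt of the interleaved log, kept in its original order.
sample = (
    "112:277 ERROR TDRV:v2_cc_execute [nec_dev 4, gid 4] MPMD detected but reload "
    "is not supported yet 96:263 ERROR TDRV:v2_cc_execute "
    "[nec_dev 31, gid 31] MPMD detected but reload is not supported yet"
    "[nec_dev 18, gid 18] MPMD detected but reload is not supported yet"
)
print(mpmd_devices(sample))  # [4, 18, 31]
```

Matching on the message body rather than on whole lines keeps the sketch usable even when timestamps and messages from different workers run together.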

bingchen-liu added the bug label on Jul 31, 2024

jonetiz commented Jul 31, 2024

Try installing optimum-neuron from source. Recent changes have fixed several issues, including some MPMD errors.

pip install git+https://github.com/huggingface/optimum-neuron.git
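
After reinstalling, it may be worth confirming which versions are actually active in the training environment. The following is a minimal sketch using `importlib.metadata` from the standard library; the package list mirrors the System Info above, and the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the System Info section of this issue.
report = installed_versions(["optimum-neuron", "optimum", "transformers", "torch"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

If optimum-neuron still reports 0.0.22 here, the source install probably did not take effect in the environment that runs the training job.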


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Oct 14, 2024