Fix initial_lr when resuming training #243
base: main
Conversation
The argument initial_lr in lambda_lr() is initialized incorrectly from the learning rate of an optimizer's parameter groups. It should instead be initialized from the INITIAL learning rate of the optimizer's param groups.
This might break μ-Parametrization. In that case we should probably unbundle the logic for standard parametrization and μP.
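For background, here is a minimal PyTorch-only sketch (toy model and toy schedule, not nanotron code) of the two param-group keys involved: after some training, a group's "lr" holds the current, decayed value, while "initial_lr" (written by the scheduler when it is first built) still holds the value a rebuilt schedule should start from.

```python
import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Toy linear-decay schedule over 1500 steps, purely for illustration.
def decay(step, total_steps=1500):
    return max(0.0, 1.0 - step / total_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=decay)

for _ in range(1000):
    optimizer.step()
    scheduler.step()

group = optimizer.param_groups[0]
print(group["lr"])          # ~3.3e-05: the current, decayed learning rate
print(group["initial_lr"])  # 1e-04: what a rebuilt scheduler should treat as the initial LR
```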
This was fixed better in another pull request: #245.
My mistake. This was not fixed in #245. Original training run with the latest commit of Nanotron as of writing this; the learning rate at step 11001 is 0.000103.
Resuming training without my patch applied picks up at the incorrect learning rate (0.000145):
Resuming training with my hacky patch applied picks up at the correct learning rate (0.000103):
Initial LR and optimizer LR still aren't in sync under μ-Parametrization, and I'm not knowledgeable enough about μP to suggest a general fix for this. Perhaps a True/False flag somewhere could signal whether the current training run is resuming from a checkpoint or not, and if it is, initialize the LR scheduler accordingly (a rough sketch of the idea follows). Since most people, including me, are not using μP, this works as a temporary fix for us.
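Something along these lines, with hypothetical names (this is not nanotron's API, just the shape of the logic):

```python
def pick_initial_lr(param_group: dict, configured_lr: float, is_resuming: bool) -> float:
    """Hypothetical helper: decide where the scheduler's initial_lr should come from."""
    if is_resuming:
        # The original run's scheduler wrote "initial_lr" into the param group, so prefer
        # it over the decayed "lr"; fall back to the configured LR if it is missing.
        return param_group.get("initial_lr", configured_lr)
    # Fresh run: "lr" still equals the configured initial value, so either source works.
    return param_group["lr"]
```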
Thanks a lot for opening the PR and for the detailed explanation! I think we merged a fix for this recently: https://github.com/huggingface/nanotron/pull/245/files
Unfortunately, #245 did not fix this issue. Learning rates still don't match up when resuming on the latest commit of Nanotron with a clean environment. If you want to reproduce my test below, here is the yaml config file. The original training run at step 1401 out of 1500 shows lr: 4.37e-05:
The current Nanotron commit, after resuming at step 1401, is at lr: 0.000108:
Applying the patch in #256, where the LR scheduler builder is initialized before the optimizer is loaded, results in resuming at lr: 4.29e-05 (see the sketch of this ordering at the end of this comment):
My patch in this PR also results in resuming at the same LR as above, lr: 4.29e-05:
The reason we are slightly off from the value of the original training run is a line you've added here: nanotron/src/nanotron/serialize/optimizer.py, line 370 (commit fdd5151).
If we comment this line out, training resumes at the correct value, lr: 4.37e-05:
I would recommend taking another look at whether that line is needed.
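For reference, this is how I understand the ordering that #256 establishes, as a self-contained PyTorch sketch (toy optimizer and schedule, illustrative names only, not nanotron's actual builders):

```python
import torch

def build_from_config(lr=1e-4, total_steps=1500):
    # Stand-ins for nanotron's optimizer / LR scheduler builders; illustrative only.
    model = torch.nn.Linear(4, 4)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps)
    )
    return opt, sched

# "Original" run: step to 1401, then snapshot the state dicts as an in-memory checkpoint.
opt, sched = build_from_config()
for _ in range(1401):
    opt.step()
    sched.step()
ckpt = {"optimizer": opt.state_dict(), "lr_scheduler": sched.state_dict()}

# Resume: build the scheduler from the config FIRST, so it records the configured initial
# LR, and only then load the checkpointed (decayed) optimizer and scheduler state.
opt, sched = build_from_config()
opt.load_state_dict(ckpt["optimizer"])
sched.load_state_dict(ckpt["lr_scheduler"])
print(opt.param_groups[0]["lr"])  # same LR the original run had at step 1401
```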
You can close this pull request. The solution in #256 looks much better.
Perfect, ty @Lauler
The argument initial_lr in lambda_lr() is initialized incorrectly from the learning rate of an optimizer's parameter groups. This causes the learning rate to be set incorrectly when models are resumed from checkpoints trained with standard parametrization LR schedulers.
nanotron/src/nanotron/helpers.py, lines 167 to 168 (commit cfcdeae)
It should instead be initialized from the initial learning rate of the optimizer's param groups. However, the key "initial_lr" does not exist in the optimizer when training is started, only when training is resumed from a checkpoint. I've therefore set this argument to lr_scheduler_args.learning_rate, which seems to work in standard parametrization but almost certainly breaks something in μ-Parametrization. See this issue comment for context: #233 (comment)
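A simplified before/after sketch of the change (not the literal diff; `param_group` stands in for an optimizer param group, and `lr_scheduler_args.learning_rate` is the configured value mentioned above):

```python
def initial_lr_before(param_group: dict) -> float:
    # Old behaviour: the schedule starts from whatever "lr" currently holds, which is
    # the already-decayed value once an optimizer checkpoint has been loaded.
    return param_group["lr"]

def initial_lr_after(param_group: dict, lr_scheduler_args) -> float:
    # This PR: start from the configured learning rate instead. Fine under standard
    # parametrization, where all groups share that value, but μP groups, which
    # presumably carry distinct per-group LRs, would lose their individual starting points.
    return lr_scheduler_args.learning_rate
```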