`num_training_batches` is `inf` in `configure_optimizers` #16060
Comments
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
FWIW, I question the logic of "stale bots". Do you not want GitHub to be a place where users can inform you of potential problems? What does automatic closure achieve other than sweeping issues under the carpet? I'd much rather a human being apply the "won't fix" label; I'm OK with that.
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
I'm facing this one on v1.9.0. It seems to be a problem around DDP. One of the workers can get the correct value but the others fail and get `inf`. Are there any workarounds?
I found …
@davidgilbertson have you found a workaround for this?
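One possible workaround, not confirmed by anyone in this thread, is to size the scheduler from `self.trainer.estimated_stepping_batches` instead of `num_training_batches`, since that property is intended to be usable inside `configure_optimizers()`. A minimal sketch follows; the class name is hypothetical, and it assumes the `Trainer` was given a finite `max_epochs` so the estimate isn't `inf`:

```python
# Hypothetical workaround sketch (not confirmed in this thread): size the
# scheduler from Trainer.estimated_stepping_batches, which is populated by
# the time configure_optimizers() runs, instead of num_training_batches.
import torch
from torch import nn
import pytorch_lightning as pl


class WorkaroundModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.01)
        # Assumes Trainer(max_epochs=...) is set so the estimate is finite.
        total_steps = int(self.trainer.estimated_stepping_batches)
        scheduler = torch.optim.lr_scheduler.CyclicLR(
            optimizer,
            base_lr=1e-4,
            max_lr=1e-2,
            step_size_up=max(total_steps // 2, 1),
        )
        return [optimizer], [{"scheduler": scheduler, "interval": "step"}]
```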
Bug description

The value of `num_training_batches` is `inf` when referenced in `configure_optimizers()`. It seems that it doesn't actually get its correct value until some point later. This causes a very hard-to-find issue because the training runs without error, except the loss is `nan`. Something inside `optim.lr_scheduler.CyclicLR` actually sets the `lr` of the `optimizer` to `nan`.

It would be nice if:

- `num_training_batches` had its correct value by the time `configure_optimizers()` was called, or …

How to reproduce the bug
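No reproduction script is included in the report, so the following is only a minimal sketch of the setup described above, assuming `configure_optimizers()` sizes `CyclicLR` from `self.trainer.num_training_batches`; the module and dataset here are hypothetical stand-ins.

```python
# Minimal sketch of the described setup (hypothetical, not from the report):
# configure_optimizers() reads self.trainer.num_training_batches, which is
# still inf at that point, and the resulting CyclicLR drives the lr to nan.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class BoringRepro(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        loss = self.layer(x).sum()
        self.log("train_loss", loss)  # ends up logging nan once the lr is nan
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.01)
        # num_training_batches is inf here, so step_size_up becomes inf
        # and the scheduler's lr arithmetic produces nan.
        steps_per_epoch = self.trainer.num_training_batches
        scheduler = torch.optim.lr_scheduler.CyclicLR(
            optimizer,
            base_lr=1e-4,
            max_lr=1e-2,
            step_size_up=steps_per_epoch // 2,
        )
        return [optimizer], [{"scheduler": scheduler, "interval": "step"}]


if __name__ == "__main__":
    data = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)
    trainer = pl.Trainer(max_epochs=1, accelerator="cpu", logger=False)
    trainer.fit(BoringRepro(), train_dataloaders=data)
```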
Error messages and logs
The main hint that something is wrong is actually TensorBoard printing "NaN or Inf found in input tensor" - but even that doesn't come with a trace telling me who's printing it.
Environment
Current environment
More info
No response
cc @justusschock @awaelchli @carmocca