When running with DeepSpeed (configs below), Accelerate seems to apply the Adam update when calling loss.backward rather than when calling optimizer.step. Moreover, accelerator.no_sync doesn't prevent the Adam update.
This caused a hard-to-debug issue for me, because I was trying to set the learning rate between calling backward and calling step (as in the script below), but my learning rate wasn't applied until the next update.
import torch
from accelerate import Accelerator
from torch import nn
from torch.optim import Adam

# Simple model with just one parameter
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.variable = nn.Parameter(torch.tensor([1.0]))

    def forward(self):
        return self.variable

# Monkey patch Adam to print something before step
original_step = torch.optim.Adam.step

# Create new step method that prints lr
def patched_step(self, *args, **kwargs):
    for i, group in enumerate(self.param_groups):
        print(f"Group {i} learning rate before step: {group['lr']}", flush=True)
    return original_step(self, *args, **kwargs)

torch.optim.Adam.step = patched_step

# Setup
accelerator = Accelerator()
model = SimpleModel()
optimizer = Adam(model.parameters(), lr=1000)  # Start with a large LR

# Prepare with accelerator
model, optimizer = accelerator.prepare(model, optimizer)
print(f"Initial lr: {optimizer.param_groups[0]['lr']}")  # Should show 1000

loss = model()
print("before calling backward")
accelerator.backward(loss)
print("after calling backward")

# Now let's change the LR
new_lr = 0.01
for param_group in optimizer.param_groups:
    param_group["lr"] = new_lr
print(f"Set new lr to: {new_lr}")

# Call step
optimizer.step()
print("After step")
By looking at the printouts, you can see that the Adam step happens on accelerator.backward, and it's using the large LR.
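For contrast, the same sequence in plain PyTorch (no Accelerate or DeepSpeed involved) behaves the way I expected: the patched step prints nothing during backward, and the single printout at optimizer.step() shows the learning rate that was set just before the step. A minimal sketch:

import torch
from torch import nn
from torch.optim import Adam

# Same monkey patch as above: print the LR whenever Adam.step runs
original_step = torch.optim.Adam.step

def patched_step(self, *args, **kwargs):
    for i, group in enumerate(self.param_groups):
        print(f"Group {i} learning rate before step: {group['lr']}", flush=True)
    return original_step(self, *args, **kwargs)

torch.optim.Adam.step = patched_step

model = nn.Linear(1, 1)  # any tiny model works here
optimizer = Adam(model.parameters(), lr=1000)

loss = model(torch.tensor([1.0])).sum()
print("before calling backward")
loss.backward()  # nothing is printed here: no step happens during backward
print("after calling backward")

for param_group in optimizer.param_groups:  # change the LR between backward and step
    param_group["lr"] = 0.01

optimizer.step()  # the only printout appears here, and it shows lr 0.01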
I also tried replacing accelerator.backward(loss) with
with accelerator.no_sync(model):
accelerator.backward(loss)
but this didn't seem to have any effect.
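If the update really is being applied inside accelerator.backward, the only reordering that sidesteps the problem is to set the learning rate before the backward call. A rough sketch of that workaround (not a fix), reusing SimpleModel and the monkey patch from the script above:

from accelerate import Accelerator
from torch.optim import Adam

accelerator = Accelerator()
model = SimpleModel()
optimizer = Adam(model.parameters(), lr=1000)
model, optimizer = accelerator.prepare(model, optimizer)

loss = model()

# Set the intended LR *before* backward, since that is apparently where the
# update is applied under this DeepSpeed setup
for param_group in optimizer.param_groups:
    param_group["lr"] = 0.01

accelerator.backward(loss)  # update seems to be applied here, now with lr=0.01
optimizer.step()  # kept so the training loop still looks standard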
Here are my config files:
deepspeed.yaml:
deepspeed_config.json