[Feature request] Efficient stochastic path without unused layers #294
Comments
@Aceticia hello again Xujin/Chris! Stochastic depth is popular in some circles for sure. What do you think about just forcing the parameters to be used by sending in a single dummy token, multiplying the output by 0, and summing it to the stream? That should fix the DDP issue?
Hello again! I go by Chris :D Sounds like a good solution, I can't really think of any side effects.
@Aceticia ok Chris, I'll add it later this evening and you can let me know if that unused parameters issue persists.
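For reference, the dummy-token idea discussed above could look roughly like the sketch below. This is not the code that ended up in the repo; `SkippableLayer`, `fn`, `dim`, and `drop_prob` are hypothetical names, and the sketch assumes `fn` maps a `(batch, seq, dim)` tensor to the same shape and stays finite on an all-zero input.

```python
import torch
import torch.nn as nn


class SkippableLayer(nn.Module):
    """Hypothetical wrapper: randomly skips `fn` during training, but still
    touches its parameters so each one receives a (zero) gradient."""

    def __init__(self, fn: nn.Module, dim: int, drop_prob: float = 0.1):
        super().__init__()
        self.fn = fn
        self.dim = dim
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        if self.training and torch.rand(()).item() < self.drop_prob:
            # Layer is "dropped": run a single dummy token through it so the
            # parameters stay in the autograd graph, then zero the result.
            dummy = x.new_zeros(1, 1, self.dim)
            return x + self.fn(dummy).sum() * 0.0
        return x + self.fn(x)
```

Because the dummy output is multiplied by 0, the stream is unchanged and the dropped layer's parameters get zero gradients instead of none, which is what DDP needs when `find_unused_parameters` is left at its default. One caveat: if `fn` could produce `inf` or `NaN` on the dummy input, multiplying by 0 would not hide that.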
@Aceticia did you see anything interesting when splitting dimensions for ALiBi across heads?
I tried it out. I didn't have time for a complete run, but sadly I don't see much difference from just using ALiBi in time. We made the compromise of using consistent time ordering across samples, with rotary pos emb in time and a learned positional embedding across space, and it's the best we have yet. Can't spend forever on this - sorry to have wasted some of your time on this. Good knowledge though.
@Aceticia no problem! Just your sharing this makes it worth it, thanks!
DINO v2 finds that high values of stochastic depth are very helpful for larger models in terms of performance, and they also give an efficient implementation that only operates on the un-masked samples of a batch here, which is very simple:

In practice, with up to `stochastic_depth=0.4`, the memory usage almost halves.

In this repo, a stochastic depth is provided in which layers are dropped altogether. This achieves a similar effect to the DINO v2 implementation in that masked-out samples of a batch don't waste compute. However, it drops entire layers, so we are forced to use `find_unused_parameters=True` when training with DDP, which causes further overhead. Besides, dropping an entire layer across the whole batch feels kind of weird and might introduce biases.

I can contribute something and integrate this into the attention and MLP layers. What do you think? Are there any other reasons to keep the entire-layer drop (apart from the potential overhead when the drop rate is low)?
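To make the proposal concrete, here is a minimal sketch of per-sample stochastic depth in the spirit of what is described above. It is my own illustration, not the DINO v2 code and not this repo's API; `subset_residual`, `residual_fn`, and `drop_prob` are hypothetical names, and `residual_fn` is assumed to preserve the shape of its input.

```python
import torch
import torch.nn as nn


def subset_residual(x: torch.Tensor, residual_fn: nn.Module, drop_prob: float) -> torch.Tensor:
    """Apply `residual_fn` to a random subset of the batch and add the result
    back into the residual stream (intended for training only)."""
    batch = x.shape[0]
    # Number of samples that keep the branch; always at least one.
    keep = max(int(batch * (1.0 - drop_prob)), 1)
    idx = torch.randperm(batch, device=x.device)[:keep]

    # Only the kept samples go through the branch, so the dropped
    # samples cost neither compute nor activation memory.
    residual = residual_fn(x[idx])

    # Rescale so the expected residual matches the no-drop case.
    scale = batch / keep
    return torch.index_add(x, 0, idx, residual.to(x.dtype), alpha=scale)
```

Since the branch runs on at least one sample every step, all of its parameters receive gradients, so this form should not need `find_unused_parameters=True`. At eval time one would presumably fall back to the plain `x + residual_fn(x)`.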