
Magic constants #7

Answered by cloneofsimo
platers asked this question in Q&A
Mar 4, 2024 · 1 comments · 2 replies

Because in the muP paper, they swept hyperparameters for a transformer and found that a multiplicative scaling factor on the embedding works well at roughly 3 to 10x. Since we don't have a multiplicative factor here, we naturally scale the init and the lr by 3 to 10x each instead. It's a fairly arbitrary hyperparameter tbh, and the sweep results in the muP paper don't show that much difference.
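To illustrate why scaling both the init and the lr can stand in for a multiplicative factor, here is a minimal sketch on a single scalar "embedding" weight. It assumes an Adam-style optimizer (with `eps=0` so the update is exactly invariant to gradient scale) and a toy quadratic loss; `adam_steps`, `alpha`, `base_init`, `base_lr`, and `target` are hypothetical names and values, not taken from the repo. With plain SGD the lr correction would differ, so treat this as one reading of the answer rather than the definitive derivation.

```python
import math

def adam_steps(w0, lr, grad_fn, steps=5, b1=0.9, b2=0.999, eps=0.0):
    """Scalar Adam. eps=0 keeps the step exactly invariant to gradient scale
    (safe here only because the toy gradient never reaches zero)."""
    w, m, v = w0, 0.0, 0.0
    history = []
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - b2 ** t)  # bias-corrected second moment
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
        history.append(w)
    return history

# Hypothetical toy values, chosen only for illustration.
alpha = 5.0       # embedding multiplier, inside the 3 to 10x range mentioned above
base_init = 0.02  # initial weight
base_lr = 1e-3    # learning rate
target = 1.0      # toy loss: 0.5 * (y - target) ** 2

# (A) explicit multiplier: y = alpha * w, so dL/dw = alpha * (alpha * w - target)
with_multiplier = adam_steps(
    base_init, base_lr, lambda w: alpha * (alpha * w - target)
)
multiplier_run = [alpha * w for w in with_multiplier]  # effective embedding value

# (B) no multiplier: y = w, but init and lr are each scaled by alpha
rescaled_run = adam_steps(
    alpha * base_init, alpha * base_lr, lambda w: w - target
)

for y_a, y_b in zip(multiplier_run, rescaled_run):
    print(f"{y_a:.6f}  {y_b:.6f}")
```

Both columns should agree up to floating-point error: Adam normalizes away the extra factor of `alpha` in the gradient, so the only thing left to match is the size of the init and of each step.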

Answer selected by platers