Magic constants #7
This is a very clean implementation, thanks for sharing it! I have a few questions about some of the constants in the code. Why is the MLP init scaled like this?

```python
for name, param in self.mlp.named_parameters():
    if "weight" in name:
        init.normal_(param, mean=0, std=0.5 * (1 / config.n_embd) ** 0.5)
```

And for the embeddings, where does the 3.3 in this init come from?

```python
init.normal_(self.embed.weight, mean=0, std=alpha * 3.3)
```
Answered by cloneofsimo (Mar 6, 2024):
Because in the paper they swept this for a transformer and found that a multiplicative scaling factor on the embedding works well when it is something like 3~10x. Since we don't have a multiplicative factor here, the embedding init and learning rate are scaled by roughly 3~10x each instead. It's a fairly arbitrary hyperparameter, to be honest, and the sweep results in the muP paper don't show that much difference.
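To make that concrete, here is a minimal sketch (not the repository's actual code) of the two parameterizations the answer contrasts: an explicit multiplicative factor on the embedding output versus folding that factor into the init std and the embedding's per-parameter-group learning rate. The values `vocab_size`, `n_embd`, `alpha`, and `base_lr` are illustrative assumptions, not taken from the repo.

```python
import torch
import torch.nn as nn
from torch.nn import init

# Illustrative values only; not taken from the repository.
vocab_size, n_embd = 50257, 768
alpha, base_lr = 3.3, 3e-4   # hypothetical embedding multiplier and base learning rate

# Option A: standard init plus an explicit multiplicative factor on the
# embedding output, as swept in the muP paper.
embed_a = nn.Embedding(vocab_size, n_embd)
init.normal_(embed_a.weight, mean=0.0, std=1.0)

def forward_a(idx: torch.Tensor) -> torch.Tensor:
    return alpha * embed_a(idx)   # explicit multiplier in the forward pass

# Option B (what the answer describes): drop the multiplier and instead scale
# the init std and the embedding's learning rate by roughly the same factor.
embed_b = nn.Embedding(vocab_size, n_embd)
init.normal_(embed_b.weight, mean=0.0, std=alpha * 1.0)

optimizer = torch.optim.AdamW([
    {"params": embed_b.parameters(), "lr": base_lr * alpha},  # embeddings get a scaled LR
    # other parameter groups would stay at base_lr
])
```

One reading of why both knobs get scaled: with Adam-style optimizers the update size tracks the learning rate rather than the gradient magnitude, so scaling the init alone would not reproduce the effect of the multiplier over the course of training.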