FEAT: Adding 1.58bit LLMs training architecture in nanotron #180
Conversation
```python
) / w_scale / x_scale
else:
    w = self.weight
x_norm = normalize(x, self.in_features)
```
Shouldn't RMSNorm here have learnable weights?
@MekkCyber May I know why this is marked as resolved? From my understanding of the training tips handbook, the new RMSNorm should have learnable weights, just like the usual RMSNorm layers. Is there a reason you left it out here (as well as in the PR merged into huggingface/transformers)?
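For reference, here is a minimal sketch of an RMSNorm with a learnable per-channel scale, in the style of the usual (Llama-like) RMSNorm layers; the class name and details are illustrative, not the code in this PR:

```python
import torch
import torch.nn as nn

class RMSNormWithWeight(nn.Module):
    """Minimal RMSNorm with a learnable per-channel scale (illustrative only)."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        # Learnable scale, initialized to 1 so the layer starts as plain RMS normalization.
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square over the hidden dimension...
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        # ...then apply the learnable scale, which is the part discussed in this thread.
        return self.weight * x
```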
Hi, I fetched this PR and fine-tuned Llama 70B using tp=8 or pp=8. Before training, I converted Llama 70B into the nanotron format using this method: #174, with pp=1, dp=1, tp=1. But when I start training with pp=1, dp=1, tp=8, I get this error:
Hey @hjc3613, thanks for the report! You don't have to be consistent with the convert config; it should work. I will investigate that! Can you tell me, from your side, what the content of models/qwen2.5-72b-instruct-nanotron/model/model/decoder/0/pp_block/MLPBitNet is?
Thank you very much!
Hi @MekkCyber! So far I think I managed to correctly unpack the weights of the model using the functions you provided, but I am unsure whether this is expected and whether I should continue with fine-tuning with the model in this state. I would appreciate any help or guidance you could provide on this matter.
Implementation of a 1.58-bit LLM with Llama, following the paper & handbook released by Microsoft:
https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf
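For readers who have not gone through the handbook, here is a minimal sketch of the b1.58 recipe it describes (absmean ternary weight quantization, per-token 8-bit activation quantization, and a straight-through estimator); names and details are illustrative and this is not the exact nanotron implementation in this PR:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def activation_quant(x: torch.Tensor):
    """Per-token symmetric 8-bit (absmax) quantization of activations."""
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127), scale

def weight_quant(w: torch.Tensor):
    """Absmean quantization of weights to the ternary set {-1, 0, +1}."""
    scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    return (w * scale).round().clamp(-1, 1), scale

class BitLinear(nn.Linear):
    """Linear layer trained with quantized weights/activations; gradients flow
    to the latent full-precision weights via a straight-through estimator."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        x_q, x_scale = activation_quant(x)
        w_q, w_scale = weight_quant(w)
        # Straight-through estimator: forward uses the quantized values,
        # backward differentiates through the full-precision ones.
        x_ste = x + (x_q - x).detach()
        w_ste = w + (w_q - w).detach()
        # Rescale the quantized matmul back to the floating-point domain
        # (the "/ w_scale / x_scale" seen in the diff excerpt above).
        # The handbook also applies an RMSNorm to x before quantization;
        # that and the bias are omitted here for brevity.
        return F.linear(x_ste, w_ste) / (w_scale * x_scale)
```

With ternary weights, the inference-time multiplications reduce to additions and subtractions, which is where the efficiency gains claimed in the paper come from.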
Here are the training results on 25B tokens:
cc @NouamaneTazi @xrsrke @thomwolf