[NEW] Llama3.2 weight converters 🦙 #255

TJ-Solergibert · 2024-11-28T23:04:15Z

Hi!

In this branch I reintroduce & update to the current main branch the Llama model & conversion scripts to support Llama3.1 and Llama3.2 1B&3B models.
The main changes are the following:

Deleted Flash Attention RoPEs as they don't support rope scaling of Llama3
Copied + Cleaning transformers LlamaRotaryEmbedding layer. Now this will be the only class in llama.py. I think it shouldn't break generations for the inference case WITHOUT LlamaConfig.rope_interleaved = True in CausalSelfAttention.forward, are there any tests?
Added @eliebak fixes (resuming checkpoint without lr schedule or optimizer state #253) to just load weights from a checkpoint. I would suggest adding a config.optimizer.finetuning flag in order to (True) just load the weights or (False) Load weights, optimizer & LR Scheduler instead of config.checkpoints.load_optimizer & config.checkpoints.load_lr_scheduler
Switched from flash_attn_varlen_func to flash_attn_func as the later achieves greater performance. Keep in mind that we aren't using any feature of the varlen funct so it's recommended to stick with flash_attn_func
Should we depreciate LlamaConfig.rope_interleaved ? It was useful for training when using FlashAttention RoPEs and now seems to be used also in the inference code. IMO we should unify all 3 cases (Training, inference with rope_interleaved & inference without rope interleaved) within a single RoPE

Results

You can run the conversions & generations tests using the scripts in tools/converters. As I already mentioned in the previous PR (#174), despite we need at least 1 GPU (To init the ParallelContext) we are running the conversion with the CPU.

Generate HF predictions with the HF Hub model

torchrun --nproc-per-node 1 tools/converters/delete/generate_hf_predictions.py --pretrained-model-name-or-path meta-llama/Llama-3.2-3B

Convert the HF model to Nanotron

torchrun --nproc-per-node 1 tools/converters/convert_hf_to_nanotron.py --nanotron-checkpoint-path checkpoints/nanotron_pretrained_checkpoints/Nanotron-Llama-3.2-3B --pretrained-model-name-or-path meta-llama/Llama-3.2-3B

Generate predictions with the nanotron converted model

torchrun --nproc-per-node 1 tools/converters/delete/generate_nanotron_predictions.py --tp 1 --nanotron-checkpoint-path checkpoints/nanotron_pretrained_checkpoints/Nanotron-Llama-3.2-3B

Convert the model back to HF

torchrun --nproc-per-node 1 tools/converters/convert_nanotron_to_hf.py --nanotron-checkpoint-path checkpoints/nanotron_pretrained_checkpoints/Nanotron-Llama-3.2-3B --hugging-face-checkpoint-path checkpoints/huggingface_converted/Converted-Nanotron-Llama-3.2-3B

Generate predictions using the converted back HF model

torchrun --nproc-per-node 1 tools/converters/delete/generate_hf_predictions.py --pretrained-model-name-or-path checkpoints/huggingface_converted/Converted-Nanotron-Llama-3.2-3B

As can be seen from the following table, we observe slightly differences between the 2 backends. Those differences are produced by the QKV projections in the CausalSelfAttention layer (Nanotron computes them in a single GEMM vs 3 different GEMMs in HF) and the LayerNorm layer is different (Nanotron is using a optimized one from FlashAttention vs Basic PyTorch LayerNorm in HF). Also note that the differences increase if we use TP which is totally expected as the sizes of the GEMMs are different, triggering different GEMM algorithms.

Experiment	Backend	Size	TP	Accuracy
OG HF	HF	3	1	0.73046875
OG HF --> Nanotron	Nanotron	3	1	0.7265625
OG HF --> Nanotron --> HF	HF	3	1	0.73046875
OG HF --> Nanotron	Nanotron	3	2	0.703125
OG HF --> Nanotron	Nanotron	3	4	0.65234375

To run the Nanotron generations with different TP sizes:

torchrun --nproc-per-node 2 tools/converters/delete/generate_nanotron_predictions.py --tp 2 --nanotron-checkpoint-path checkpoints/nanotron_pretrained_checkpoints/Nanotron-Llama-3.2-3B
torchrun --nproc-per-node 4 tools/converters/delete/generate_nanotron_predictions.py --tp 4 --nanotron-checkpoint-path checkpoints/nanotron_pretrained_checkpoints/Nanotron-Llama-3.2-3B

TODO (Preferably in other PRs):

Add docs
Edit examples/config_llama3.2-3B.yaml data_stages.data.dataset.dataset_folder
Remove nanotron/tools/converters/delete/generate_hf_predictions.py & nanotron/tools/converters/delete/generate_nanotron_predictions.py scripts
Integrate Liger kernels for apply_rotary_pos_emb
Switch to SDPA instead of FA2
Reduce number of transposes in CausalSelfAttention.forward

Lauler · 2024-12-15T20:38:56Z

Have you managed to train with tp=4 after converting llama 3.2 from HF to Nanotron? From your earlier Llama3 PR you wrote the conversion could be done with tp=dp=pp=1, and that it was TP agnostic.

When using your conversion script above for Llama 3.2 3B model it works fine for tp=2, but runs into tensor size mismatch trying to train with tp=4:

[3236:3]:RuntimeError: The expanded size of the tensor (32128) must match the existing size (31872) at non-singleton dimension 0.  Target sizes: [32128, 3072].  Tensor sizes: [31872, 3072]

traceback.txt

(using your llama 3.2 yaml script in examples/ as a template for starting continued pretraining)

NouamaneTazi

Very nice PR @TJ-Solergibert! Thanks
Added some small qsts before merging

NouamaneTazi · 2024-12-23T13:39:13Z

tools/converters/delete/generate_hf_predictions.py

@@ -0,0 +1,73 @@
+"""


is this supposed to be pushed? 👀

NouamaneTazi · 2024-12-23T13:43:15Z

src/nanotron/models/llama.py


        # NOTE: this scale is for µTransfer,
        # in SP, we use sqrt(1/d_h)
        softmax_scale = 1 / query_states.shape[-1] if self.is_using_mup else None
-        attn_output = flash_attn_varlen_func(
+        attn_output = flash_attn_func(


yes this is faster but only for causal masks. How do you deal with kv cache in inference? Are generations the same with and without use_kv_cache?

Llama3.2 conversion updated

1e31cb9

NouamaneTazi requested changes Dec 23, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NEW] Llama3.2 weight converters 🦙 #255

[NEW] Llama3.2 weight converters 🦙 #255

TJ-Solergibert commented Nov 28, 2024 •

edited

Loading

Lauler commented Dec 15, 2024 •

edited

Loading

NouamaneTazi left a comment •

edited

Loading

NouamaneTazi Dec 23, 2024

NouamaneTazi Dec 23, 2024

[NEW] Llama3.2 weight converters 🦙 #255

Are you sure you want to change the base?

[NEW] Llama3.2 weight converters 🦙 #255

Conversation

TJ-Solergibert commented Nov 28, 2024 • edited Loading

Results

Lauler commented Dec 15, 2024 • edited Loading

NouamaneTazi left a comment • edited Loading

Choose a reason for hiding this comment

NouamaneTazi Dec 23, 2024

Choose a reason for hiding this comment

NouamaneTazi Dec 23, 2024

Choose a reason for hiding this comment

TJ-Solergibert commented Nov 28, 2024 •

edited

Loading

Lauler commented Dec 15, 2024 •

edited

Loading

NouamaneTazi left a comment •

edited

Loading