We use the nanotron library to train the SmolLM and SmolLM2 base models.
The scripts for training SmolLM v1 are in the smollm1
folder, and those for training SmolLM2 are in the smollm2
folder; we will add the details of the data mixtures soon. SmolLM2 uses a similar architecture to SmolLM but with an improved data mixture and significantly longer training (11 trillion tokens for the 1.7B, 4 trillion for the 360M, and 2 trillion for the 135M).
Please refer to nanotron for detailed instructions on setting up your training environment and launching jobs.
After setting up the environment and tokenizing the training datasets with datatrove (instructions available here), you can modify the configurations to match your number of nodes and local paths.
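For reference, the number of GPUs a run uses is set in the parallelism section of the YAML config (total GPUs = dp × tp × pp), and the dataset paths point to your tokenized data. A minimal sketch, assuming the key names used by recent nanotron configs (check the provided config files for the authoritative layout and the exact dataset path fields):
parallelism:
  dp: 8   # data parallelism, e.g. 8 for a single node with 8 GPUs
  pp: 1   # pipeline parallelism
  tp: 1   # tensor parallelism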
Below is an example of launching SmolLM1 135M training on 1 node: set the DP value to 8 in the config, adjust the batch size accordingly, and run:
git clone https://github.com/huggingface/nanotron
cd nanotron
# follow the installation instructions in the nanotron README
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file smollm1/config_smollm1_135M.yaml
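If you have several nodes but no scheduler, the same command extends with torchrun's standard multi-node flags; this is a generic sketch where the node count, NODE_RANK, MASTER_ADDR, and port are placeholders for your cluster (remember to raise dp in the config to match the total GPU count):
# run on every node, with NODE_RANK set to 0..NNODES-1
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nnodes=2 --node_rank=$NODE_RANK --nproc_per_node=8 \
    --master_addr=$MASTER_ADDR --master_port=29500 \
    run_train.py --config-file smollm1/config_smollm1_135M.yaml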
If you are working on a Slurm cluster, you can modify the launch.slurm script
and launch the training with:
sbatch launch.slurm
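If you don't have launch.slurm at hand, a minimal sketch of such a script looks roughly like this (job name, node and GPU counts, CPU count, and log paths are placeholders to adapt to your cluster):
#!/bin/bash
#SBATCH --job-name=smollm1-135M
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=32
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

# one torchrun per node; Slurm supplies the node list and count
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
export CUDA_DEVICE_MAX_CONNECTIONS=1

srun torchrun --nnodes=$SLURM_NNODES --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    run_train.py --config-file smollm1/config_smollm1_135M.yaml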
Note: Don't forget to create the logs directory before launching the job, for example:
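mkdir -p logs   # adjust the path if your launch.slurm writes logs elsewhere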
The nanotron checkpoints for SmolLM2 models are available at: https://huggingface.co/HuggingFaceTB/SmolLM2-nanotron-ckpt
You can find an example of continual pre-training in the continual-pretraining folder.