We use the nanotron library to train the SmolLM and SmolLM2 base models.
The scripts for training SmolLM v1 are in the smollm1
folder, and those for training SmolLM2 are in the smollm2
folder; we will add the details of the data mixtures soon. SmolLM2 uses a similar architecture to SmolLM but with an improved data mixture and significantly longer training (11 trillion tokens for the 1.7B, 4 trillion for the 360M, and 2 trillion for the 135M).
Please refer to nanotron for detailed instructions on setting up your training environment and launching jobs.
After setting up the environment and tokenizing the training datasets with datatrove (instructions available here), you can modify the configurations to match your number of nodes and local paths.
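For reference, the number of GPUs a run uses is set in the parallelism section of the YAML config (total GPUs = dp × tp × pp), and the dataset paths point to your tokenized data. A minimal sketch, assuming the key names used by recent nanotron configs (check the provided config files for the authoritative layout and the exact dataset path fields):
parallelism:
  dp: 8   # data parallelism, e.g. 8 for a single node with 8 GPUs
  pp: 1   # pipeline parallelism
  tp: 1   # tensor parallelism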
Below is an example of launching SmolLM1 135M training on 1 node: set the DP value to 8 in the config, adjust the batch size accordingly, and run:
git clone https://github.com/huggingface/nanotron
cd nanotron
# follow the installation instructions in the nanotron README
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file smollm1/config_smollm1_135M.yaml
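If you have several nodes but no scheduler, the same command extends with torchrun's standard multi-node flags; this is a generic sketch where the node count, NODE_RANK, MASTER_ADDR, and port are placeholders for your cluster (remember to raise dp in the config to match the total GPU count):
# run on every node, with NODE_RANK set to 0..NNODES-1
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nnodes=2 --node_rank=$NODE_RANK --nproc_per_node=8 \
    --master_addr=$MASTER_ADDR --master_port=29500 \
    run_train.py --config-file smollm1/config_smollm1_135M.yaml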
If you are working on a Slurm cluster, you can modify the launch.slurm script
and launch the training with:
sbatch launch.slurm
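If you don't have launch.slurm at hand, a minimal sketch of such a script looks roughly like this (job name, node and GPU counts, CPU count, and log paths are placeholders to adapt to your cluster):
#!/bin/bash
#SBATCH --job-name=smollm1-135M
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=32
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

# one torchrun per node; Slurm supplies the node list and count
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
export CUDA_DEVICE_MAX_CONNECTIONS=1

srun torchrun --nnodes=$SLURM_NNODES --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    run_train.py --config-file smollm1/config_smollm1_135M.yaml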
Note: Don't forget to create the logs directory before launching the job, for example:
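mkdir -p logs   # adjust the path if your launch.slurm writes logs elsewhere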
The nanotron checkpoints for SmolLM2 models are available at: https://huggingface.co/HuggingFaceTB/SmolLM2-nanotron-ckpt
You can find an example of continual pre-training in the continual-pretraining folder.