
SmolLM2 Pretrain Dataset #35

Open
lapp0 opened this issue Dec 18, 2024 · 1 comment
lapp0 commented Dec 18, 2024

The SmolLM2 model cards state that the pretraining dataset will be released soon:

"The 135M model was trained on 2 trillion tokens using a diverse dataset combination: FineWeb-Edu, DCLM, The Stack, along with new filtered datasets we curated and will release soon."

I'm quite interested in this dataset; are there still plans to release it?

lapp0 changed the title from SmolLM2 Dataset to SmolLM2 Pretrain Dataset on Dec 18, 2024
loubnabnl (Collaborator) commented
The datasets are all already public (FineWeb-Edu, DCLM, FineMath) except for the code dataset built on top of The Stack v2. We will release information on how we mixed them together in a tech report soon.
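Since the cited datasets are public on the Hugging Face Hub, a minimal sketch of sampling them with the `datasets` library might look like the following. The Hub IDs and config names are assumptions based on the official dataset pages, not something stated in this thread, and streaming mode is used so nothing is fully downloaded:

```python
# Minimal sketch: stream a few documents from the public SmolLM2
# pretraining sources. Hub IDs and config names below are assumptions,
# not confirmed in this issue thread.
from datasets import load_dataset

# FineWeb-Edu: use a small sample config to keep the stream light.
fineweb_edu = load_dataset(
    "HuggingFaceFW/fineweb-edu", name="sample-10BT",
    split="train", streaming=True,
)

# DCLM baseline (parquet mirror) and FineMath, also streamed.
dclm = load_dataset(
    "mlfoundations/dclm-baseline-1.0-parquet",
    split="train", streaming=True,
)
finemath = load_dataset(
    "HuggingFaceTB/finemath", name="finemath-4plus",
    split="train", streaming=True,
)

# Peek at the first document from each stream.
for name, ds in [("fineweb-edu", fineweb_edu), ("dclm", dclm), ("finemath", finemath)]:
    print(name, next(iter(ds))["text"][:120])
```

Note the mixing ratios between these sources are exactly what the thread says is not yet published, so this only shows access to the individual corpora, not the SmolLM2 training mixture.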
