The datasets are all already public (FineWeb-Edu, DCLM, FineMath) except for the code dataset built on top of The Stack v2. We will release information on how we mixed them together in a tech report soon.
The SmolLM2 model cards reference the fact that the pretraining dataset will be released soon:
"The 135M model was trained on 2 trillion tokens using a diverse dataset combination: FineWeb-Edu, DCLM, The Stack, along with new filtered datasets we curated and will release soon."
I'm quite interested in this dataset. Are there still plans to release it?