
Request: Evaluation Code for SmolLM-1 #34

Open
bg51717 opened this issue Dec 14, 2024 · 8 comments
Comments

@bg51717

bg51717 commented Dec 14, 2024

Hi,
I am very interested in reproducing the evaluation results for SmolLM-1 as mentioned in your blog. I noticed that while the model weights are available, I couldn't find the specific evaluation scripts and code used for benchmarking SmolLM-1.
Could you please share the evaluation scripts/code used for testing SmolLM-1?
Thank you for your time and consideration!

@HallerPatrick

The evaluation is here: https://github.com/huggingface/smollm/blob/main/evaluation/README.md

I think it is still missing the mmlu-cloze task

@bg51717
Author

bg51717 commented Dec 14, 2024

> The evaluation is here: https://github.com/huggingface/smollm/blob/main/evaluation/README.md
>
> I think it is still missing the mmlu-cloze task

@HallerPatrick Thanks very much for your reply.
I want to evaluate SmolLM-1, but I could only find the scripts for evaluating SmolLM-2. When I used that script to evaluate SmolLM-135M, I got results different from https://huggingface.co/blog/smollm#evaluation.
Do you have any idea about this? Looking forward to your reply.

@HallerPatrick

@bg51717 Sorry, I missed that you're looking for version 1.
In the blog post they reference the following scripts: https://github.com/huggingface/cosmopedia/tree/main/evaluation. Did they also yield different results?

@bg51717
Author

bg51717 commented Dec 14, 2024

> The evaluation is here: https://github.com/huggingface/smollm/blob/main/evaluation/README.md
>
> I think it is still missing the mmlu-cloze task

I don't know which subtasks MMLU is composed of.

@bg51717
Author

bg51717 commented Dec 16, 2024

The evaluation script I used is:

lighteval accelerate \
    --model_args "pretrained=HuggingFaceTB/SmolLM-135M,revision=main,dtype=bfloat16,vllm,gpu_memory_utilisation=0.8,max_model_length=2048,data_parallel_size=4" \
    --custom_tasks "tasks.py" \
    --tasks "custom|mmlu_pro|0|1" \
    --output_dir "../eval_results"

The model is https://huggingface.co/HuggingFaceTB/SmolLM-135M, at revision https://huggingface.co/HuggingFaceTB/SmolLM-135M/tree/1d461723eec654e65efdc40cf49301c89c0c92f4.

The tasks.py is from https://github.com/huggingface/smollm/blob/main/evaluation/tasks.py, at commit https://github.com/huggingface/smollm/tree/58c5cc63d2d993c441ae5e4ee5ed40cfc567332d.

The MMLU score I got is 11.15, while the blog https://huggingface.co/blog/smollm reports 30.23.

@loubnabnl
Collaborator

Hi, you're using MMLU-Pro, which is harder than MMLU. @anton-l let's also add the MMLU/SmolLM1 evals to this repo?

You can reproduce the results in the blog post using this code https://github.com/huggingface/cosmopedia/tree/main/evaluation with

accelerate launch --num_processes=1 --main_process_port=29600 "lighteval/run_evals_accelerate.py" --model_args="pretrained=$MODEL" \
      --custom_tasks "lighteval_tasks.py" --output_dir $OUTPUT_DIR --override_batch_size 16 \
      --tasks "custom|mmlu_cloze:abstract_algebra|0|1,custom|mmlu_cloze:anatomy|0|1,custom|mmlu_cloze:astronomy|0|1,custom|mmlu_cloze:business_ethics|0|1,custom|mmlu_cloze:clinical_knowledge|0|1,custom|mmlu_cloze:college_biology|0|1,custom|mmlu_cloze:college_chemistry|0|1,custom|mmlu_cloze:college_computer_science|0|1,custom|mmlu_cloze:college_mathematics|0|1,custom|mmlu_cloze:college_medicine|0|1,custom|mmlu_cloze:college_physics|0|1,custom|mmlu_cloze:computer_security|0|1,custom|mmlu_cloze:conceptual_physics|0|1,custom|mmlu_cloze:econometrics|0|1,custom|mmlu_cloze:electrical_engineering|0|1,custom|mmlu_cloze:elementary_mathematics|0|1,custom|mmlu_cloze:formal_logic|0|1,custom|mmlu_cloze:global_facts|0|1,custom|mmlu_cloze:high_school_biology|0|1,custom|mmlu_cloze:high_school_chemistry|0|1,custom|mmlu_cloze:high_school_computer_science|0|1,custom|mmlu_cloze:high_school_european_history|0|1,custom|mmlu_cloze:high_school_geography|0|1,custom|mmlu_cloze:high_school_government_and_politics|0|1,custom|mmlu_cloze:high_school_macroeconomics|0|1,custom|mmlu_cloze:high_school_mathematics|0|1,custom|mmlu_cloze:high_school_microeconomics|0|1,custom|mmlu_cloze:high_school_physics|0|1,custom|mmlu_cloze:high_school_psychology|0|1,custom|mmlu_cloze:high_school_statistics|0|1,custom|mmlu_cloze:high_school_us_history|0|1,custom|mmlu_cloze:high_school_world_history|0|1,custom|mmlu_cloze:human_aging|0|1,custom|mmlu_cloze:human_sexuality|0|1,custom|mmlu_cloze:international_law|0|1,custom|mmlu_cloze:jurisprudence|0|1,custom|mmlu_cloze:logical_fallacies|0|1,custom|mmlu_cloze:machine_learning|0|1,custom|mmlu_cloze:management|0|1,custom|mmlu_cloze:marketing|0|1,custom|mmlu_cloze:medical_genetics|0|1,custom|mmlu_cloze:miscellaneous|0|1,custom|mmlu_cloze:moral_disputes|0|1,custom|mmlu_cloze:moral_scenarios|0|1,custom|mmlu_cloze:nutrition|0|1,custom|mmlu_cloze:philosophy|0|1,custom|mmlu_cloze:prehistory|0|1,custom|mmlu_cloze:professional_accounting|0|1,custom|mmlu_cloze:professional_law|0|1,custom|mmlu_cloze:professional_medicine|0|1,custom|mmlu_cloze:professional_psychology|0|1,custom|mmlu_cloze:public_relations|0|1,custom|mmlu_cloze:security_studies|0|1,custom|mmlu_cloze:sociology|0|1,custom|mmlu_cloze:us_foreign_policy|0|1,custom|mmlu_cloze:virology|0|1,custom|mmlu_cloze:world_religions|0|1"
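Maintaining that long --tasks string by hand is error-prone; it can be generated from the subject list instead. A minimal sketch (the list below is only an excerpt of the 57 MMLU subjects, not the full set):

```python
# Build the lighteval --tasks string for the MMLU cloze subtasks from a
# subject list instead of editing one very long string by hand.
# Excerpt only; the real run enumerates all 57 MMLU subjects.
subjects = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "world_religions",
]

tasks = ",".join(f"custom|mmlu_cloze:{s}|0|1" for s in subjects)
print(tasks)
```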

@anton-l
Member

anton-l commented Dec 23, 2024

The MMLU task is now also available in the new suite with --tasks "custom|mmlu|0|1". Let me know if you have any issues!
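For anyone landing here later: combining the new task name with the lighteval command shared earlier in this thread would presumably look like the sketch below (model args and output path are copied from the earlier mmlu_pro run and may need adjusting for your setup):

```shell
lighteval accelerate \
    --model_args "pretrained=HuggingFaceTB/SmolLM-135M,revision=main,dtype=bfloat16,vllm,gpu_memory_utilisation=0.8,max_model_length=2048,data_parallel_size=4" \
    --custom_tasks "tasks.py" \
    --tasks "custom|mmlu|0|1" \
    --output_dir "../eval_results"
```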

@bg51717
Author

bg51717 commented Dec 24, 2024

> The MMLU task is now also available in the new suite with --tasks "custom|mmlu|0|1". Let me know if you have any issues!

@anton-l @loubnabnl
Thanks for your reply. I can now get the same results.
But I still have one issue.
I found that the checkpoint you released at https://huggingface.co/HuggingFaceTB/SmolLM-135M is not a nanotron checkpoint, so I want to know how to convert an HF checkpoint to a nanotron checkpoint. I tried https://github.com/huggingface/nanotron/blob/main/examples/llama/convert_hf_to_nanotron.py and then continued pretraining. The initial loss was around 7, and after training it dropped to 2.5, so I don't think I converted the checkpoint successfully.

Looking forward to your reply.
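One way to narrow down a suspected conversion failure like the one above, before spending GPU time on continued pretraining, is to diff the parameter names and shapes of the two checkpoints. This is only a sketch with toy dictionaries (the parameter names and shapes are made up); with real checkpoints you would build each map as {name: tuple(tensor.shape)} from the model's state dict:

```python
# Checkpoint-conversion sanity check: report parameters that were lost,
# invented, or reshaped by the conversion. Any non-empty field would
# explain a high initial loss after resuming training.
def diff_state_dicts(src: dict, dst: dict) -> dict:
    common = set(src) & set(dst)
    return {
        "missing": sorted(set(src) - set(dst)),   # params lost in conversion
        "extra": sorted(set(dst) - set(src)),     # params not in the source
        "shape_mismatch": sorted(n for n in common if src[n] != dst[n]),
    }

# Toy stand-ins for the HF and converted nanotron state dicts.
hf = {"embed.weight": (49152, 576), "layers.0.attn.q_proj.weight": (576, 576)}
nt = {"embed.weight": (49152, 576), "layers.0.attn.q_proj.weight": (576, 192)}

report = diff_state_dicts(hf, nt)
print(report)
```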
