
Investigate performance discrepancies in gte-Qwen and NV-embed models #1600

Open
isaac-chung opened this issue Dec 16, 2024 · 10 comments

@isaac-chung (Collaborator)

From #1436

@AlexeyVatolin (Contributor) commented Dec 17, 2024

Hello,

I conducted a comparison of the models using the examples provided in the readme.md file for each model. Here's a summary of my findings:

  • Alibaba-NLP/gte-Qwen2-7B-instruct

  • Alibaba-NLP/gte-Qwen1.5-7B-instruct

  • Alibaba-NLP/gte-Qwen2-1.5B-instruct

  • Linq-AI-Research/Linq-Embed-Mistral

    For these models, I found that all three implementations (i.e., Transformers AutoModel, sentence_transformers, and mteb) produce exactly the same embeddings. This consistency is great to see.

  • nvidia/NV-Embed-v2

  • nvidia/NV-Embed-v1

In these cases, the official implementation of Transformers AutoModel differs from the official sentence_transformers implementation, which is unexpected. The implementation in mteb aligns completely with sentence_transformers.

I also wanted to share the code I used for this comparison: View the Gist

Please note that questions regarding the correctness of prompt usage were outside the scope of this comparison. However, it does show that the models added to mteb are implemented correctly.
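For readers who want to reproduce this kind of check, below is a minimal sketch (not the gist above) of comparing the plain Transformers AutoModel path against the sentence_transformers path for one of the gte-Qwen models. The model name, example sentences, and pooling details are illustrative; the mteb wrapper can be compared the same way by encoding the same sentences through it.

```python
# Illustrative sketch, not the author's gist: compare embeddings from the
# Transformers AutoModel path (last-token pooling, assuming right padding)
# with the sentence_transformers path for the same inputs.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer

model_name = "Alibaba-NLP/gte-Qwen2-1.5B-instruct"  # illustrative choice
sentences = ["how much protein should a female eat", "summit define"]

# 1) Transformers AutoModel with last-token pooling
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
hf_model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
batch = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    last_hidden = hf_model(**batch).last_hidden_state
# take the hidden state of the last non-padding token of each sequence
seq_lens = batch["attention_mask"].sum(dim=1) - 1
hf_emb = F.normalize(last_hidden[torch.arange(len(sentences)), seq_lens], p=2, dim=1)

# 2) sentence_transformers
st_model = SentenceTransformer(model_name, trust_remote_code=True)
st_emb = torch.tensor(st_model.encode(sentences, normalize_embeddings=True))

# Cosine similarity per sentence; values near 1.0 mean the implementations agree
print((hf_emb * st_emb).sum(dim=1))
```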

P.S.
I created a discussion in the nvidia repository about this problem.

@AlexeyVatolin (Contributor) commented Dec 17, 2024

The Qwen model repository includes a script to calculate MTEB scores for their models. I ran this script on the same tasks covered in my pull request.

The results from the original script are, in most cases, worse than those reported on the leaderboard, and they also fall short of the results obtained with the mteb model implementations.
Here is the command I used to run the script:

OPENBLAS_NUM_THREADS=8 python scripts/eval_mteb.py -m Alibaba-NLP/gte-Qwen2-1.5B-instruct --output_dir results_qwen_2_1.5b_eval_mteb --task mteb

Additionally, there is an open discussion about this on the Qwen model repository.
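For comparison, the same tasks can also be run directly through the mteb package. A minimal sketch, assuming the standard MTEB(tasks=[...]) API with a SentenceTransformer model; the task list and output folder are illustrative:

```python
# Minimal sketch: run a subset of the tasks below through the mteb package.
# Assumes the standard MTEB(tasks=[...]) API; prompt/instruction handling as done
# by the registered mteb model implementation is not shown here.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)
tasks = ["AmazonCounterfactualClassification", "RedditClustering",
         "SciDocsRR", "SCIDOCS", "STS16", "SummEval"]
evaluation = MTEB(tasks=tasks)
evaluation.run(model, output_folder="results_qwen_2_1.5b_mteb")  # illustrative path
```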

Classification

| Model | Source | AmazonCounterfactualClassification | EmotionClassification | ToxicConversationsClassification |
| --- | --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 83.16 | 54.53 | 78.75 |
| gte-Qwen1.5-7B-instruct | Pull request | 81.78 | 54.91 | 77.25 |
| gte-Qwen1.5-7B-instruct | Original script | 67.87 | 46.08 | 59.06 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 83.99 | 61.37 | 82.66 |
| gte-Qwen2-1.5B-instruct | Pull request | 82.51 | 65.66 | 84.54 |
| gte-Qwen2-1.5B-instruct | Original script | 71.81 | 54.56 | 65.1 |

Clustering

| Model | Source | ArxivClusteringS2S | RedditClustering |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 51.45 | 73.37 |
| gte-Qwen1.5-7B-instruct | Pull request | 53.57 | 80.12 |
| gte-Qwen1.5-7B-instruct | Original script | 47.88 | 64.43 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 45.01 | 55.82 |
| gte-Qwen2-1.5B-instruct | Pull request | 44.61 | 51.36 |
| gte-Qwen2-1.5B-instruct | Original script | 41.1 | 52.53 |

PairClassification

| Model | Source | SprintDuplicateQuestions | TwitterSemEval2015 |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 96.07 | 79.36 |
| gte-Qwen1.5-7B-instruct | Pull request | 94.51 | 80.72 |
| gte-Qwen1.5-7B-instruct | Original script | 91.44 | 61.92 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 95.32 | 79.64 |
| gte-Qwen2-1.5B-instruct | Pull request | 91.19 | 75.93 |
| gte-Qwen2-1.5B-instruct | Original script | 93.87 | 74.59 |

Reranking

| Model | Source | SciDocsRR | AskUbuntuDupQuestions |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 87.89 | 66 |
| gte-Qwen1.5-7B-instruct | Pull request | 88.26 | 64.03 |
| gte-Qwen1.5-7B-instruct | Original script | 85.2 | 57.32 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 86.52 | 64.55 |
| gte-Qwen2-1.5B-instruct | Pull request | 85.67 | 62.33 |
| gte-Qwen2-1.5B-instruct | Original script | 83.51 | 60.47 |

Retrieval

| Model | Source | SCIDOCS | SciFact |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 27.69 | 75.31 |
| gte-Qwen1.5-7B-instruct | Pull request | 26.34 | 75.8 |
| gte-Qwen1.5-7B-instruct | Original script | 22.38 | 74.34 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 24.98 | 78.44 |
| gte-Qwen2-1.5B-instruct | Pull request | 23.4 | 77.47 |
| gte-Qwen2-1.5B-instruct | Original script | 21.92 | 75.81 |

STS

| Model | Source | STS16 | STSBenchmark |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 86.39 | 87.35 |
| gte-Qwen1.5-7B-instruct | Pull request | 85.98 | 86.86 |
| gte-Qwen1.5-7B-instruct | Original script | 81.33 | 83.65 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 85.45 | 86.38 |
| gte-Qwen2-1.5B-instruct | Pull request | 84.71 | 84.71 |
| gte-Qwen2-1.5B-instruct | Original script | 85.35 | 86.04 |

Summarization

| Model | Source | SummEval |
| --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 31.46 |
| gte-Qwen1.5-7B-instruct | Pull request | 31.22 |
| gte-Qwen1.5-7B-instruct | Original script | 30.07 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 31.17 |
| gte-Qwen2-1.5B-instruct | Pull request | 30.5 |
| gte-Qwen2-1.5B-instruct | Original script | 28.99 |

@KennethEnevoldsen (Contributor)

From this it seems like we should update the scores on the leaderboard with the new reproducible scores. Since the authors have been made aware (issues on NVIDIA and on Qwen), I believe this is a fair decision to make.

@AlexeyVatolin, have you run the models? Otherwise I will ask Niklas to rerun them.

@afalf commented Dec 24, 2024

> From this it seems like we should update the scores on the leaderboard with the new reproducible scores. Since the authors have been made aware (issues on NVIDIA and on Qwen), I believe this is a fair decision to make.
>
> @AlexeyVatolin, have you run the models? Otherwise I will ask Niklas to rerun them.

I'm a member of the gte-Qwen model team. Sorry, we checked and found some errors in the previous script. It has now been updated and verified to be consistent with the results on the leaderboard. Please try again with the latest script and check the results.

@AlexeyVatolin (Contributor)

@afalf, thanks a lot! I've run the gte-Qwen models with the updated script and will post the results as soon as I have them.

@AlexeyVatolin (Contributor)

@afalf, I have reviewed the updated script and noticed a few minor errors that were preventing it from running. I plan to submit a pull request to your Hugging Face repository later. After correcting these issues, the results are very promising. In fact, when applying normalization (which I regrettably forgot to include last time, despite it being used in the example), the metrics slightly surpass those on the leaderboard. Could you please clarify whether the intended execution is with or without normalization?

Additionally, I compared the script with the code in mteb/gritlm and identified some differences. I have adjusted the model in mteb to produce results almost identical to those of the original script. You will find the corrections in my pull request: #1637.
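For clarity, the normalization in question is plain L2 normalization of the sentence embeddings before computing similarity scores. A minimal illustrative sketch (the tensor shapes are arbitrary stand-ins):

```python
# Illustrative sketch of the normalization step discussed above:
# L2-normalize embeddings before cosine/dot-product scoring.
import torch
import torch.nn.functional as F

embeddings = torch.randn(4, 1536)               # stand-in for model outputs
normalized = F.normalize(embeddings, p=2, dim=1)
print(normalized.norm(dim=1))                   # all ~1.0 after normalization

# With sentence_transformers, the equivalent is typically the encode flag:
#   model.encode(sentences, normalize_embeddings=True)
```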

Here are the average scores:

| Model | Leaderboard | Original script | Original script normalized | Pull request |
| --- | --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | 69.9129 | 69.5629 | 70.1543 | 69.5436 |
| gte-Qwen2-1.5B-instruct | 68.6643 | 68.33 | 68.74 | 68.6293 |

Classification

| Model | Source | AmazonCounterfactualClassification | EmotionClassification | ToxicConversationsClassification |
| --- | --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 83.16 | 54.53 | 78.75 |
| gte-Qwen1.5-7B-instruct | Pull request | 81.51 | 55.34 | 76.44 |
| gte-Qwen1.5-7B-instruct | Original script | 81.79 | 49.3 | 73.88 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 81.49 | 55.35 | 76.46 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 83.99 | 61.37 | 82.66 |
| gte-Qwen2-1.5B-instruct | Pull request | 85.81 | 64.67 | 82.93 |
| gte-Qwen2-1.5B-instruct | Original script | 84.04 | 61.04 | 82.29 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 85.82 | 64.68 | 82.94 |

Clustering

| Model | Source | ArxivClusteringS2S | RedditClustering |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 51.45 | 73.37 |
| gte-Qwen1.5-7B-instruct | Pull request | 53.16 | 80.14 |
| gte-Qwen1.5-7B-instruct | Original script | 53.17 | 80.06 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 53.16 | 80.03 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 45.01 | 55.82 |
| gte-Qwen2-1.5B-instruct | Pull request | 44.96 | 55.78 |
| gte-Qwen2-1.5B-instruct | Original script | 45.05 | 56.06 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 45.02 | 55.72 |

PairClassification

| Model | Source | SprintDuplicateQuestions | TwitterSemEval2015 |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 96.07 | 79.36 |
| gte-Qwen1.5-7B-instruct | Pull request | 94.96 | 80.94 |
| gte-Qwen1.5-7B-instruct | Original script | 94.98 | 80.94 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 94.96 | 80.95 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 95.32 | 79.64 |
| gte-Qwen2-1.5B-instruct | Pull request | 95.77 | 79.61 |
| gte-Qwen2-1.5B-instruct | Original script | 95.64 | 79.61 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 95.77 | 79.61 |

Reranking

| Model | Source | SciDocsRR | AskUbuntuDupQuestions |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 87.89 | 66 |
| gte-Qwen1.5-7B-instruct | Pull request | 85.14 | 58.03 |
| gte-Qwen1.5-7B-instruct | Original script | 87.62 | 64.4 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 87.61 | 64.4 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 86.52 | 64.55 |
| gte-Qwen2-1.5B-instruct | Pull request | 83.27 | 62.27 |
| gte-Qwen2-1.5B-instruct | Original script | 86.85 | 64.02 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 86.85 | 64.02 |

Retrieval

| Model | Source | SCIDOCS | SciFact |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 27.69 | 75.31 |
| gte-Qwen1.5-7B-instruct | Pull request | 26.1 | 76.33 |
| gte-Qwen1.5-7B-instruct | Original script | 25.71 | 76.58 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 25.73 | 76.57 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 24.98 | 78.44 |
| gte-Qwen2-1.5B-instruct | Pull request | 24.79 | 79.12 |
| gte-Qwen2-1.5B-instruct | Original script | 23.69 | 76.23 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 23.69 | 76.14 |

STS

| Model | Source | STS16 | STSBenchmark |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 86.39 | 87.35 |
| gte-Qwen1.5-7B-instruct | Pull request | 86.38 | 87.63 |
| gte-Qwen1.5-7B-instruct | Original script | 86.44 | 87.64 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 86.44 | 87.64 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 85.45 | 86.38 |
| gte-Qwen2-1.5B-instruct | Pull request | 84.85 | 85.92 |
| gte-Qwen2-1.5B-instruct | Original script | 84.92 | 86.06 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 84.92 | 86.06 |

Summarization

| Model | Source | SummEval |
| --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 31.46 |
| gte-Qwen1.5-7B-instruct | Pull request | 31.51 |
| gte-Qwen1.5-7B-instruct | Original script | 31.37 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 31.37 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 31.17 |
| gte-Qwen2-1.5B-instruct | Pull request | 31.06 |
| gte-Qwen2-1.5B-instruct | Original script | 31.12 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 31.12 |

@afalf commented Dec 28, 2024

@AlexeyVatolin

Thanks a lot! We used execution with normalization. Sorry for these errors in our scripts.

@KennethEnevoldsen (Contributor)

I have reviewed the PR and everything looks good. @afalf, you might want to resubmit the results using the new scores, given the improvements.

@afalf commented Dec 30, 2024

> I have reviewed the PR and everything looks good. @afalf, you might want to resubmit the results using the new scores, given the improvements.

Okay, we will update the scores in our metadata and the results in https://github.com/embeddings-benchmark/results.

@AlexeyVatolin (Contributor)

@afalf, I noticed that the model results from the eval_mteb.py script do not match the results from the example in the readme (example in the gist). Perhaps the readme should be updated to match eval_mteb.py?
