
Investigate performance discrepancies in gte-Qwen and NV-embed models #1600

Open
isaac-chung opened this issue Dec 16, 2024 · 10 comments

@isaac-chung (Collaborator)

From #1436

@AlexeyVatolin (Contributor) commented Dec 17, 2024

Hello,

I conducted a comparison of the models using the examples provided in the readme.md file for each model. Here's a summary of my findings:

  • Alibaba-NLP/gte-Qwen2-7B-instruct

  • Alibaba-NLP/gte-Qwen1.5-7B-instruct

  • Alibaba-NLP/gte-Qwen2-1.5B-instruct

  • Linq-AI-Research/Linq-Embed-Mistral

    For these models, I found that all three implementations (i.e., Transformers AutoModel, sentence_transformers, and mteb) produce exactly the same embeddings. This consistency is great to see.

  • nvidia/NV-Embed-v2

  • nvidia/NV-Embed-v1

In these cases, the official implementation of Transformers AutoModel differs from the official sentence_transformers implementation, which is unexpected. The implementation in mteb aligns completely with sentence_transformers.

I also wanted to share the code I used for this comparison: View the Gist

Please note that questions regarding the correctness of prompt usage were outside the scope of this comparison. However, it does show that the models added to mteb are implemented correctly.
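For readers who want to reproduce this kind of check, below is a minimal sketch (not the gist above) of comparing the plain Transformers AutoModel path against the sentence_transformers path for one of the gte-Qwen models. The model name, example sentences, and pooling details are illustrative; the mteb wrapper can be compared the same way by encoding the same sentences through it.

```python
# Illustrative sketch, not the author's gist: compare embeddings from the
# Transformers AutoModel path (last-token pooling, assuming right padding)
# with the sentence_transformers path for the same inputs.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer

model_name = "Alibaba-NLP/gte-Qwen2-1.5B-instruct"  # illustrative choice
sentences = ["how much protein should a female eat", "summit define"]

# 1) Transformers AutoModel with last-token pooling
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
hf_model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
batch = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    last_hidden = hf_model(**batch).last_hidden_state
# take the hidden state of the last non-padding token of each sequence
seq_lens = batch["attention_mask"].sum(dim=1) - 1
hf_emb = F.normalize(last_hidden[torch.arange(len(sentences)), seq_lens], p=2, dim=1)

# 2) sentence_transformers
st_model = SentenceTransformer(model_name, trust_remote_code=True)
st_emb = torch.tensor(st_model.encode(sentences, normalize_embeddings=True))

# Cosine similarity per sentence; values near 1.0 mean the implementations agree
print((hf_emb * st_emb).sum(dim=1))
```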

P.S.
I created a discussion in the nvidia repository about this problem.

@AlexeyVatolin (Contributor) commented Dec 17, 2024

The Qwen model repository includes a script to calculate MTEB scores for their models. I ran this script on the same tasks covered in my pull request.

The results from the original script are, in most cases, worse than those reported on the leaderboard, and they also fall short of the results obtained with the mteb model implementations.
Here is the command I used to run the script:

OPENBLAS_NUM_THREADS=8 python scripts/eval_mteb.py -m Alibaba-NLP/gte-Qwen2-1.5B-instruct --output_dir results_qwen_2_1.5b_eval_mteb --task mteb

Additionally, there is an open discussion about this on the Qwen model repository.
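For comparison, the same tasks can also be run directly through the mteb package. A minimal sketch, assuming the standard MTEB(tasks=[...]) API with a SentenceTransformer model; the task list and output folder are illustrative:

```python
# Minimal sketch: run a subset of the tasks below through the mteb package.
# Assumes the standard MTEB(tasks=[...]) API; prompt/instruction handling as done
# by the registered mteb model implementation is not shown here.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)
tasks = ["AmazonCounterfactualClassification", "RedditClustering",
         "SciDocsRR", "SCIDOCS", "STS16", "SummEval"]
evaluation = MTEB(tasks=tasks)
evaluation.run(model, output_folder="results_qwen_2_1.5b_mteb")  # illustrative path
```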

Classification

| Model | Source | AmazonCounterfactualClassification | EmotionClassification | ToxicConversationsClassification |
| --- | --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 83.16 | 54.53 | 78.75 |
| gte-Qwen1.5-7B-instruct | Pull request | 81.78 | 54.91 | 77.25 |
| gte-Qwen1.5-7B-instruct | Original script | 67.87 | 46.08 | 59.06 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 83.99 | 61.37 | 82.66 |
| gte-Qwen2-1.5B-instruct | Pull request | 82.51 | 65.66 | 84.54 |
| gte-Qwen2-1.5B-instruct | Original script | 71.81 | 54.56 | 65.1 |

Clustering

| Model | Source | ArxivClusteringS2S | RedditClustering |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 51.45 | 73.37 |
| gte-Qwen1.5-7B-instruct | Pull request | 53.57 | 80.12 |
| gte-Qwen1.5-7B-instruct | Original script | 47.88 | 64.43 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 45.01 | 55.82 |
| gte-Qwen2-1.5B-instruct | Pull request | 44.61 | 51.36 |
| gte-Qwen2-1.5B-instruct | Original script | 41.1 | 52.53 |

PairClassification

| Model | Source | SprintDuplicateQuestions | TwitterSemEval2015 |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 96.07 | 79.36 |
| gte-Qwen1.5-7B-instruct | Pull request | 94.51 | 80.72 |
| gte-Qwen1.5-7B-instruct | Original script | 91.44 | 61.92 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 95.32 | 79.64 |
| gte-Qwen2-1.5B-instruct | Pull request | 91.19 | 75.93 |
| gte-Qwen2-1.5B-instruct | Original script | 93.87 | 74.59 |

Reranking

| Model | Source | SciDocsRR | AskUbuntuDupQuestions |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 87.89 | 66 |
| gte-Qwen1.5-7B-instruct | Pull request | 88.26 | 64.03 |
| gte-Qwen1.5-7B-instruct | Original script | 85.2 | 57.32 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 86.52 | 64.55 |
| gte-Qwen2-1.5B-instruct | Pull request | 85.67 | 62.33 |
| gte-Qwen2-1.5B-instruct | Original script | 83.51 | 60.47 |

Retrieval

| Model | Source | SCIDOCS | SciFact |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 27.69 | 75.31 |
| gte-Qwen1.5-7B-instruct | Pull request | 26.34 | 75.8 |
| gte-Qwen1.5-7B-instruct | Original script | 22.38 | 74.34 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 24.98 | 78.44 |
| gte-Qwen2-1.5B-instruct | Pull request | 23.4 | 77.47 |
| gte-Qwen2-1.5B-instruct | Original script | 21.92 | 75.81 |

STS

| Model | Source | STS16 | STSBenchmark |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 86.39 | 87.35 |
| gte-Qwen1.5-7B-instruct | Pull request | 85.98 | 86.86 |
| gte-Qwen1.5-7B-instruct | Original script | 81.33 | 83.65 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 85.45 | 86.38 |
| gte-Qwen2-1.5B-instruct | Pull request | 84.71 | 84.71 |
| gte-Qwen2-1.5B-instruct | Original script | 85.35 | 86.04 |

Summarization

| Model | Source | SummEval |
| --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 31.46 |
| gte-Qwen1.5-7B-instruct | Pull request | 31.22 |
| gte-Qwen1.5-7B-instruct | Original script | 30.07 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 31.17 |
| gte-Qwen2-1.5B-instruct | Pull request | 30.5 |
| gte-Qwen2-1.5B-instruct | Original script | 28.99 |

@KennethEnevoldsen (Contributor)

From this it seems like we should update the scores on the leaderboard with the new reproducible scores. Since the authors have been made aware (issues on NVIDIA and on Qwen), I believe this is a fair decision to make.

@AlexeyVatolin, have you run the models? Otherwise I will ask Niklas to rerun them.

@afalf commented Dec 24, 2024

> From this it seems like we should update the scores on the leaderboard with the new reproducible scores. Since the authors have been made aware (issues on NVIDIA and on Qwen), I believe this is a fair decision to make.
>
> @AlexeyVatolin, have you run the models? Otherwise I will ask Niklas to rerun them.

I'm a member of the gte-Qwen model team. Sorry, we checked and found some errors in the previous script. It has now been updated and verified to be consistent with the results on the leaderboard. Please try again with the latest script and check the results.

@AlexeyVatolin (Contributor)

@afalf, thanks a lot! I've run the gte-Qwen models with the updated script and will post the results as soon as I have them.

@AlexeyVatolin (Contributor)

@afalf, I have reviewed the updated script and noticed a few minor errors that were preventing it from running. I plan to submit a pull request to your Hugging Face repository later. After correcting these issues, the results are very promising. In fact, when applying normalization (which I regrettably forgot to include last time, despite it being used in the example), the metrics slightly surpass those on the leaderboard. Could you please clarify whether the intended execution is with or without normalization?

Additionally, I compared the script with the code in mteb/gritlm and identified some differences. I have adjusted the model in mteb to produce results almost identical to those of the original script. You will find the corrections in my pull request: #1637.
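For clarity, the normalization in question is plain L2 normalization of the sentence embeddings before computing similarity scores. A minimal illustrative sketch (the tensor shapes are arbitrary stand-ins):

```python
# Illustrative sketch of the normalization step discussed above:
# L2-normalize embeddings before cosine/dot-product scoring.
import torch
import torch.nn.functional as F

embeddings = torch.randn(4, 1536)               # stand-in for model outputs
normalized = F.normalize(embeddings, p=2, dim=1)
print(normalized.norm(dim=1))                   # all ~1.0 after normalization

# With sentence_transformers, the equivalent is typically the encode flag:
#   model.encode(sentences, normalize_embeddings=True)
```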

Here are the average scores:

| Model | Leaderboard | Original script | Original script normalized | Pull request |
| --- | --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | 69.9129 | 69.5629 | 70.1543 | 69.5436 |
| gte-Qwen2-1.5B-instruct | 68.6643 | 68.33 | 68.74 | 68.6293 |

Classification

| Model | Source | AmazonCounterfactualClassification | EmotionClassification | ToxicConversationsClassification |
| --- | --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 83.16 | 54.53 | 78.75 |
| gte-Qwen1.5-7B-instruct | Pull request | 81.51 | 55.34 | 76.44 |
| gte-Qwen1.5-7B-instruct | Original script | 81.79 | 49.3 | 73.88 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 81.49 | 55.35 | 76.46 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 83.99 | 61.37 | 82.66 |
| gte-Qwen2-1.5B-instruct | Pull request | 85.81 | 64.67 | 82.93 |
| gte-Qwen2-1.5B-instruct | Original script | 84.04 | 61.04 | 82.29 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 85.82 | 64.68 | 82.94 |

Clustering

| Model | Source | ArxivClusteringS2S | RedditClustering |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 51.45 | 73.37 |
| gte-Qwen1.5-7B-instruct | Pull request | 53.16 | 80.14 |
| gte-Qwen1.5-7B-instruct | Original script | 53.17 | 80.06 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 53.16 | 80.03 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 45.01 | 55.82 |
| gte-Qwen2-1.5B-instruct | Pull request | 44.96 | 55.78 |
| gte-Qwen2-1.5B-instruct | Original script | 45.05 | 56.06 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 45.02 | 55.72 |

PairClassification

| Model | Source | SprintDuplicateQuestions | TwitterSemEval2015 |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 96.07 | 79.36 |
| gte-Qwen1.5-7B-instruct | Pull request | 94.96 | 80.94 |
| gte-Qwen1.5-7B-instruct | Original script | 94.98 | 80.94 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 94.96 | 80.95 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 95.32 | 79.64 |
| gte-Qwen2-1.5B-instruct | Pull request | 95.77 | 79.61 |
| gte-Qwen2-1.5B-instruct | Original script | 95.64 | 79.61 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 95.77 | 79.61 |

Reranking

| Model | Source | SciDocsRR | AskUbuntuDupQuestions |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 87.89 | 66 |
| gte-Qwen1.5-7B-instruct | Pull request | 85.14 | 58.03 |
| gte-Qwen1.5-7B-instruct | Original script | 87.62 | 64.4 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 87.61 | 64.4 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 86.52 | 64.55 |
| gte-Qwen2-1.5B-instruct | Pull request | 83.27 | 62.27 |
| gte-Qwen2-1.5B-instruct | Original script | 86.85 | 64.02 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 86.85 | 64.02 |

Retrieval

| Model | Source | SCIDOCS | SciFact |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 27.69 | 75.31 |
| gte-Qwen1.5-7B-instruct | Pull request | 26.1 | 76.33 |
| gte-Qwen1.5-7B-instruct | Original script | 25.71 | 76.58 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 25.73 | 76.57 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 24.98 | 78.44 |
| gte-Qwen2-1.5B-instruct | Pull request | 24.79 | 79.12 |
| gte-Qwen2-1.5B-instruct | Original script | 23.69 | 76.23 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 23.69 | 76.14 |

STS

| Model | Source | STS16 | STSBenchmark |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 86.39 | 87.35 |
| gte-Qwen1.5-7B-instruct | Pull request | 86.38 | 87.63 |
| gte-Qwen1.5-7B-instruct | Original script | 86.44 | 87.64 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 86.44 | 87.64 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 85.45 | 86.38 |
| gte-Qwen2-1.5B-instruct | Pull request | 84.85 | 85.92 |
| gte-Qwen2-1.5B-instruct | Original script | 84.92 | 86.06 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 84.92 | 86.06 |

Summarization

| Model | Source | SummEval |
| --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 31.46 |
| gte-Qwen1.5-7B-instruct | Pull request | 31.51 |
| gte-Qwen1.5-7B-instruct | Original script | 31.37 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 31.37 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 31.17 |
| gte-Qwen2-1.5B-instruct | Pull request | 31.06 |
| gte-Qwen2-1.5B-instruct | Original script | 31.12 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 31.12 |

@afalf commented Dec 28, 2024

@AlexeyVatolin

Thanks a lot! We used execution with normalization. Sorry for these errors in our scripts.

@KennethEnevoldsen (Contributor)

I have reviewed the PR and everything looks good. @afalf, you might want to resubmit the results using the new scores, given the improvements.

@afalf commented Dec 30, 2024

> I have reviewed the PR and everything looks good. @afalf, you might want to resubmit the results using the new scores, given the improvements.

Okay, we will update the scores in our metadata and the results in https://github.com/embeddings-benchmark/results.

@AlexeyVatolin (Contributor)

@afalf, I noticed that the model results from the eval_mteb.py script do not match the results from the example in the readme (example in the gist). Perhaps the readme should be updated to match eval_mteb.py?
