[New blog post] Unified multimodal large model evaluation, accelerating multimodal intelligence emergence #1987

Open · wants to merge 54 commits into base: main

Changes from 9 commits

Commits (54)
b2a5ff0  Initial commit (kcz358, Apr 14, 2024)
b5ed7fc  Try to add image (kcz358, Apr 14, 2024)
929c502  See whether it works using huggingface dataset (kcz358, Apr 14, 2024)
37b377f  Nah (kcz358, Apr 14, 2024)
809268e  Add english version (kcz358, Apr 14, 2024)
28c44ae  Update lmms_eval.md (pufanyi, Apr 14, 2024)
44cef2f  Add author list (kcz358, Apr 14, 2024)
bfda318  Merge branch 'main' of https://github.com/kcz358/blog (kcz358, Apr 14, 2024)
bb4f141  Revise author list (kcz358, Apr 14, 2024)
6507e4f  Update lmms_eval.md (kcz358, Apr 20, 2024)
1aef8f8  Update lmms_eval.md (kcz358, Apr 20, 2024)
2fdda3f  Update lmms_eval.md (kcz358, Apr 20, 2024)
6aadd54  Update lmms_eval.md (kcz358, Apr 20, 2024)
2656012  Update lmms_eval.md (kcz358, Apr 20, 2024)
21fa476  Update lmms_eval.md (kcz358, Apr 20, 2024)
fb5a9c8  Update lmms_eval in _blog.yml (kcz358, Apr 20, 2024)
f1f8604  Add thumbnail image to assets (kcz358, Apr 20, 2024)
5a3f283  Update lmms_eval.md (kcz358, Apr 20, 2024)
f04f8ca  Update lmms_eval.md (kcz358, Apr 20, 2024)
2974a3d  Merge branch 'main' into main (Luodian, Apr 20, 2024)
18e888f  Update lmms_eval.md (kcz358, Apr 25, 2024)
549c968  Update lmms_eval.md (kcz358, Apr 25, 2024)
f288278  Update lmms_eval.md (kcz358, Apr 25, 2024)
0f1a208  Update lmms_eval.md (kcz358, Apr 25, 2024)
d3caff6  Update lmms_eval.md (kcz358, Apr 25, 2024)
d77e1c3  Update lmms_eval.md (kcz358, Apr 25, 2024)
f1a72a3  Update lmms_eval.md (kcz358, Apr 25, 2024)
6af100f  Fix title uppercase (kcz358, Apr 25, 2024)
96c20ac  move entry to last (kcz358, Apr 25, 2024)
b5f228d  Adding org name (kcz358, Apr 25, 2024)
e81b5c6  Update title (kcz358, Apr 25, 2024)
0782739  Update image src (kcz358, Apr 25, 2024)
62db0ce  Change image src (kcz358, Apr 25, 2024)
74fa630  Switch back to github link for image (kcz358, Apr 25, 2024)
d334208  Update image src (kcz358, Apr 25, 2024)
9cfde7d  Add link to lmms-eval (kcz358, Apr 25, 2024)
b8b6aef  Fix title issue (kcz358, Apr 25, 2024)
90a66f2  Fix upper title (kcz358, Apr 25, 2024)
3229476  Merge branch 'main' of https://github.com/huggingface/blog (kcz358, Apr 25, 2024)
d3486c4  Add images (kcz358, Apr 25, 2024)
782e690  Update lmms_eval.md (kcz358, Apr 25, 2024)
df74c0a  Merge remote-tracking branch 'upstream/main' (kcz358, May 2, 2024)
6dc20a5  Merge branch 'main' of https://github.com/kcz358/blog (kcz358, May 2, 2024)
5454271  Add chinese version (kcz358, May 2, 2024)
051df69  Update dates (kcz358, May 2, 2024)
49ecbac  Merge remote-tracking branch 'upstream/main' (kcz358, May 8, 2024)
12099b7  Merge remote-tracking branch 'upstream/main' (kcz358, May 15, 2024)
9636095  Update lmms_eval.md (kcz358, May 16, 2024)
dda7b65  Update lmms_eval.md (kcz358, May 16, 2024)
03cc232  Update lmms_eval.md (kcz358, May 16, 2024)
74220b5  Merge remote-tracking branch 'upstream/main' (kcz358, May 16, 2024)
cd70bc6  Remove duplicate (kcz358, May 16, 2024)
6e223b2  Add resources at the end of the blog (kcz358, May 16, 2024)
d020514  Merge branch 'main' into main (lewtun, May 30, 2024)
85 changes: 85 additions & 0 deletions lmms_eval.md
@@ -0,0 +1,85 @@
---
title: "Unified multimodal large model evaluation, accelerating multimodal intelligence emergence"
thumbnail: https://github.com/lmms-lab/lmms-eval-blog/blob/master/assets/img/lmms-eval-header.png
Member: I believe this should live in the blog repo directly to render on hf.co/blog. See here for an example: https://github.com/huggingface/blog/pull/2021/files#diff-a332b83464cf2b650715bacb6e3f07b994af0790acc88a4ea353883ba2ae751eR3853

Note that you also need to add the blog details to _blog.yml.

Author: Thank you! I also noticed that in _blog.yml we can only have one author on the list?

Member: Yes, that's just for the thumbnail, but the blog post itself will show all authors:

[Screenshot 2024-04-24 at 16:27:21]

authors:
- user: luodian
  guest: true
- user: PY007
  guest: true
- user: kcz358
  guest: true
- user: pufanyi
  guest: true
- user: JvThunder
  guest: true
- user: dododododo
  guest: true
- user: THUdyh
  guest: true
- user: liuhaotian
  guest: true
- user: ZhangYuanhan
  guest: true
- user: zhangysk
  guest: true
- user: Chunyuan24
  guest: true
- user: liuziwei7
  guest: true
---
# Unified multimodal large model evaluation, accelerating multimodal intelligence emergence
Member: Perhaps we should add a note about this being a guest post, like here? https://huggingface.co/blog/pollen-vision

Contributor: we can make it uppercase for h1 IMO


GitHub repo: https://github.com/EvolvingLMMs-Lab/lmms-eval

Official website: https://lmms-lab.github.io/
Member (on lines +41 to +43): I'd maybe move these links to the end of the intro (and, optionally, also to a "Resources" section at the end of the post). At this point, the reader knows nothing about what this is about, so they have little incentive to click, imo.


As artificial intelligence research deepens, multimodal large models such as GPT-4V and LLaVA have become hot topics in both academia and industry. However, these advanced models require an effective evaluation framework to measure their performance accurately, which is not an easy task. On the one hand, the diverse prompts and post-processing methods adopted by different models can lead to large differences in evaluation results: as Hugging Face illustrated with the "1001 flavors of MMLU" in their blog post, different implementations of the same evaluation dataset can produce significant score differences and even change a model's ranking on leaderboards.

On the other hand, data acquisition and processing during evaluation pose another challenge, especially for older datasets that are no longer widely available. Researchers often need to invest considerable time and effort in manually searching for, downloading, and processing them.

To address these issues, researchers from Nanyang Technological University, ByteDance, and other institutions have jointly open-sourced lmms-eval, which is an evaluation framework designed specifically for multimodal large models. Building upon lm-evaluation-harness, this framework has been improved and expanded to provide a unified interface for defining models, datasets, and evaluation metrics, offering a one-stop, efficient solution for evaluating multimodal models (LMMs). We hope that through this framework, we can collectively drive the iteration cycle of multimodal models and promote their broader application in academia and industry. We sincerely look forward to witnessing more breakthroughs and innovations in the field of multimodal AI, jointly advancing towards a more efficient and intelligent future development of artificial intelligence technology.
Member: Suggested change: "... offering a one-stop, efficient solution for evaluating multimodal models (LMMs)" → "... offering a one-stop, efficient solution for evaluating large multimodal models (LMMs)".


<img src="https://github.com/lmms-lab/lmms-eval-blog/blob/master/assets/img/teaser.png" alt="Pipeline"/>

## Overview of the main features
Member: Suggested change: "## Overview of the main features" → "## Main features". Maybe?

Contributor: again, maybe uppercase Main and Features


**One-click evaluation**: lmms-eval lets users evaluate model performance on multiple datasets with a single command, with no manual dataset preparation. One command yields comprehensive evaluation results within minutes, including detailed logs and per-sample analysis covering model parameters, inputs and outputs, correct answers, and more. This is useful in scenarios where advanced models like GPT-4 are needed for scoring.

```
accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" --tasks mme,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme_mmbenchen --output_path ./logs
```

Member: Suggested change:

```
# pip install git+https://github.com/huggingface/lmms-eval.git
accelerate launch --multi_gpu --num_processes=8 -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks mme,mmbench_en \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_v1.5_mme_mmbenchen \
    --output_path ./logs
```

Author: I think I will change the link to our current repo, since the HF fork is somewhat behind, and I will also add `pip install git+https://github.com/haotian-liu/LLaVA.git`.

**Parallel acceleration and task merging**: Built on Hugging Face's accelerate library, lmms-eval supports multi-GPU evaluation, model parallelism, and multi-batch processing, significantly improving evaluation efficiency. This is particularly advantageous when testing multiple datasets at once, greatly reducing overall evaluation time.

Here is the total runtime on different datasets using 4 x A100 40G:
Member: Should we use 4 for num_processes in the command-line invocation, or is it unrelated?

Member: This question still stands. The previous code snippet showed accelerate running on 8 GPUs.

Author: I think it is somewhat unrelated, because the code snippet is for demonstration only; we ran the baseline tests on different datasets.




| Dataset (#samples)      | LLaVA-v1.5-7b      | LLaVA-v1.5-13b     |
| :---------------------- | :----------------- | :----------------- |
| mme (2374)              | 2 mins 43 seconds  | 3 mins 27 seconds  |
| gqa (12578)             | 10 mins 43 seconds | 14 mins 23 seconds |
| scienceqa_img (2017)    | 1 min 58 seconds   | 2 mins 52 seconds  |
| ai2d (3088)             | 3 mins 17 seconds  | 4 mins 12 seconds  |
| coco2017_cap_val (5000) | 14 mins 13 seconds | 19 mins 58 seconds |
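
For reference, a command matching the 4 x A100 40G setup used for the timings above might look like the sketch below; the exact benchmark commands were not published with the table, so the task list and flags here are only an illustration built from the earlier example:

```
# Illustrative sketch only: same flags as the earlier example, with 4 processes
# to match the 4-GPU benchmark setup; adjust the task list as needed.
accelerate launch --multi_gpu --num_processes=4 -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks mme,gqa,scienceqa_img,ai2d,coco2017_cap_val \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_v1.5_benchmark \
    --output_path ./logs
```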


Additionally, in the 0.1.1.dev update, the team added support for tensor parallelism, making it possible to run larger models such as LLaVA-v1.6-34B on 4 x RTX 3090 GPUs for efficient inference.
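
As a rough sketch of what tensor-parallel usage looks like, a larger checkpoint is typically launched as a single process so that the weights can be sharded across all visible GPUs. The exact invocation depends on the lmms-eval version; in particular, passing `device_map` through `--model_args` is an assumption here, so check the repository README for the supported syntax:

```
# Hypothetical single-process launch that shards LLaVA-v1.6-34B across 4 GPUs.
# Whether the llava model wrapper accepts device_map via --model_args is an
# assumption; see the lmms-eval README for the exact tensor-parallel syntax.
python3 -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.6-34b",device_map="auto" \
    --tasks mme \
    --batch_size 1 \
    --output_path ./logs
```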

**Comprehensive dataset support**: The lmms-eval team hosts more than 40 diverse datasets (with the number continually increasing) under the lmms-lab organization on the Hugging Face Hub, covering a range of tasks from COCO Captions to MMMU and beyond. All datasets have been converted into a unified format, so users can view the details of the evaluation data and download and use them with a single click.
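
For example, any of the hosted datasets can also be pulled down directly with the Hugging Face CLI; the repository name below is illustrative, so browse the lmms-lab organization on the Hub for the exact names:

```
# Download one of the pre-formatted evaluation datasets hosted by lmms-lab.
# The repo name is illustrative; check https://huggingface.co/lmms-lab for the list.
huggingface-cli download lmms-lab/MME --repo-type dataset --local-dir ./data/MME
```

In normal use this step is not required, since lmms-eval fetches the selected task's data automatically.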

<img src="https://github.com/lmms-lab/lmms-eval-blog/blob/master/assets/img/org_dataset.png" alt="dataset on organization"/>

<img src="https://github.com/lmms-lab/lmms-eval-blog/blob/master/assets/img/viewer.png" alt="viewer" />
Member: Same comment about the image link.


**Easy to Extend**: Through a unified interface definition, lmms-eval not only simplifies the integration of different models and datasets but also makes it easy to introduce new ones. Users can add a new dataset through a simple YAML file and customize evaluation settings as needed by modifying the configuration, as sketched below.
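
To illustrate the YAML-based configuration, a new task definition might look roughly like the following sketch; the field names are hypothetical and only convey the idea, and the real schema is documented in the lmms-eval repository:

```
# Hypothetical task definition: the field names below are illustrative, not the
# exact lmms-eval schema; see the repository docs for the real format.
cat > my_new_task.yaml <<'EOF'
task: my_new_task
dataset_path: lmms-lab/my-new-dataset   # a dataset repo on the Hugging Face Hub
test_split: test
output_type: generate_until
metric_list:
  - metric: exact_match
EOF
```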

**Comparability**: We provide an environment in which authors can reproduce the scores reported in the original LLaVA-1.5 paper. Furthermore, we offer complete experimental results for the LLaVA series models on all evaluation datasets, along with the environment parameters, for reference (see the README on GitHub).

**Synchronized Online Logging**: We provide detailed logging tools to help you understand the evaluation process and results. Logs include model parameters, generation parameters, input questions, model responses, and ground-truth answers. Every detail can be recorded and visualized in Weights & Biases runs, so users can access results in real time from anywhere, which is convenient and efficient.

<img src="https://github.com/lmms-lab/lmms-eval-blog/blob/master/assets/img/wandb_table.jpg" alt="wandb_table" />
Member: I don't think these links will be embedded correctly as images (they are references to the GitHub tree).

Author: Hi, I tried to change the src to a link in a Hugging Face dataset repo, but I can't see the rendered image on GitHub. May I ask what the proper way is to include image links in the blog?

I have uploaded all the images here but have been unable to find a way to get GitHub Markdown to render them.


## Conclusion

In summary, this framework not only provides new tools for multimodal model evaluation but also paves the way for future research and development, including video multimodal evaluation, few-shot evaluation modes, and batch inference acceleration, showcasing its potential and foresight. The launch of lmms-eval marks the arrival of a new era in evaluation, opening up new paths for AI research and applications.