[New blog post] Unified multimodal large model evaluation, accelerating multimodal intelligence emergence #1987

Open · wants to merge 54 commits into base: main

Changes from 47 commits (54 commits total)
b2a5ff0
Initial commit
kcz358 Apr 14, 2024
b5ed7fc
Try to add image
kcz358 Apr 14, 2024
929c502
See whether it works using huggingface dataset
kcz358 Apr 14, 2024
37b377f
Nah
kcz358 Apr 14, 2024
809268e
Add english version
kcz358 Apr 14, 2024
28c44ae
Update lmms_eval.md
pufanyi Apr 14, 2024
44cef2f
Add author list
kcz358 Apr 14, 2024
bfda318
Merge branch 'main' of https://github.com/kcz358/blog
kcz358 Apr 14, 2024
bb4f141
Revise author list
kcz358 Apr 14, 2024
6507e4f
Update lmms_eval.md
kcz358 Apr 20, 2024
1aef8f8
Update lmms_eval.md
kcz358 Apr 20, 2024
2fdda3f
Update lmms_eval.md
kcz358 Apr 20, 2024
6aadd54
Update lmms_eval.md
kcz358 Apr 20, 2024
2656012
Update lmms_eval.md
kcz358 Apr 20, 2024
21fa476
Update lmms_eval.md
kcz358 Apr 20, 2024
fb5a9c8
Update lmms_eval in _blog.yml
kcz358 Apr 20, 2024
f1f8604
Add thumbnail image to assets
kcz358 Apr 20, 2024
5a3f283
Update lmms_eval.md
kcz358 Apr 20, 2024
f04f8ca
Update lmms_eval.md
kcz358 Apr 20, 2024
2974a3d
Merge branch 'main' into main
Luodian Apr 20, 2024
18e888f
Update lmms_eval.md
kcz358 Apr 25, 2024
549c968
Update lmms_eval.md
kcz358 Apr 25, 2024
f288278
Update lmms_eval.md
kcz358 Apr 25, 2024
0f1a208
Update lmms_eval.md
kcz358 Apr 25, 2024
d3caff6
Update lmms_eval.md
kcz358 Apr 25, 2024
d77e1c3
Update lmms_eval.md
kcz358 Apr 25, 2024
f1a72a3
Update lmms_eval.md
kcz358 Apr 25, 2024
6af100f
Fix title uppercase
kcz358 Apr 25, 2024
96c20ac
move entry to last
kcz358 Apr 25, 2024
b5f228d
Adding org name
kcz358 Apr 25, 2024
e81b5c6
Update title
kcz358 Apr 25, 2024
0782739
Update image src
kcz358 Apr 25, 2024
62db0ce
Change image src
kcz358 Apr 25, 2024
74fa630
Switch back to github link for image
kcz358 Apr 25, 2024
d334208
Update image src
kcz358 Apr 25, 2024
9cfde7d
Add link to lmms-eval
kcz358 Apr 25, 2024
b8b6aef
Fix title issue
kcz358 Apr 25, 2024
90a66f2
Fix upper title
kcz358 Apr 25, 2024
3229476
Merge branch 'main' of https://github.com/huggingface/blog
kcz358 Apr 25, 2024
d3486c4
Add images
kcz358 Apr 25, 2024
782e690
Update lmms_eval.md
kcz358 Apr 25, 2024
df74c0a
Merge remote-tracking branch 'upstream/main'
kcz358 May 2, 2024
6dc20a5
Merge branch 'main' of https://github.com/kcz358/blog
kcz358 May 2, 2024
5454271
Add chinese version
kcz358 May 2, 2024
051df69
Update dates
kcz358 May 2, 2024
49ecbac
Merge remote-tracking branch 'upstream/main'
kcz358 May 8, 2024
12099b7
Merge remote-tracking branch 'upstream/main'
kcz358 May 15, 2024
9636095
Update lmms_eval.md
kcz358 May 16, 2024
dda7b65
Update lmms_eval.md
kcz358 May 16, 2024
03cc232
Update lmms_eval.md
kcz358 May 16, 2024
74220b5
Merge remote-tracking branch 'upstream/main'
kcz358 May 16, 2024
cd70bc6
Remove duplicate
kcz358 May 16, 2024
6e223b2
Add resources at the end of the blog
kcz358 May 16, 2024
d020514
Merge branch 'main' into main
lewtun May 30, 2024
46 changes: 46 additions & 0 deletions _blog.yml
@@ -3912,6 +3912,40 @@
- asr
- inference

- local: sc2-instruct
title: "StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation"
thumbnail: /blog/assets/sc2-instruct/sc2-instruct-banner.png
author: yuxiang630
guest: true
date: Apr 29, 2024
tags:
- nlp
- community
- research
- LLM

- local: evaluation-structured-outputs
title: "Improving Prompt Consistency with Structured Generations"
author: willkurt
guest: true
thumbnail: /blog/assets/evaluating-mmlu-leaderboard/thumbnail.png
date: Apr 30, 2024
tags:
- evaluation
- collaboration
- research
- leaderboard

- local: asr-diarization
title: "Powerful ASR + diarization + speculative decoding with Hugging Face Inference Endpoints"
author: sergeipetrov
thumbnail: /blog/assets/asr-diarization/thumbnail.png
date: May 1, 2024
tags:
- audio
- asr
- inference

Member:
hmmm these entries shouldn't be here. Can you try to merge main again and ensure there are no duplicates?

Author:
Thank you for spotting the issue! I have merged main again and deleted the duplicates.

- local: leaderboard-artificial-analysis
title: "Bringing the Artificial Analysis LLM Performance Leaderboard to Hugging Face"
thumbnail: /blog/assets/leaderboards-on-the-hub/thumbnail_artificialanalysis.png
@@ -4005,3 +4039,15 @@
- multimodal
- LLM
- vision

- local: lmms_eval
title: "Unified multimodal large model evaluation, accelerating multimodal intelligence emergence"
author: kcz358
thumbnail: /blog/assets/lmms_eval/thumbnail.png
date: May 15, 2024
tags:
- vlm
- multimodal
- evaluation
- community
- research
Binary file added assets/lmms_eval/thumbnail.png
111 changes: 111 additions & 0 deletions lmms_eval.md
@@ -0,0 +1,111 @@
---
title: "Unified Multimodal Large Model Evaluation, Accelerating Multimodal Intelligence Emergence"
thumbnail: /blog/assets/lmms_eval/thumbnail.png
authors:
- user: luodian
guest: true
org: lmms-lab
- user: PY007
guest: true
org: lmms-lab
- user: kcz358
guest: true
org: lmms-lab
- user: pufanyi
guest: true
org: lmms-lab
- user: JvThunder
guest: true
org: lmms-lab
- user: dododododo
guest: true
- user: THUdyh
guest: true
org: lmms-lab
- user: liuhaotian
guest: true
org: lmms-lab
- user: ZhangYuanhan
guest: true
org: lmms-lab
- user: zhangysk
guest: true
- user: Chunyuan24
guest: true
org: lmms-lab
- user: liuziwei7
guest: true
---
# Unified Multimodal Large Model Evaluation, Accelerating Multimodal Intelligence Emergence

GitHub repo: https://github.com/EvolvingLMMs-Lab/lmms-eval

Official website: https://lmms-lab.github.io/
Comment on lines +41 to +43
Member:
I'd maybe move these links to the end of the intro (and, optionally, also to a "Resources" section at the end of the post). At this point, the reader knows nothing about what this is about so they have little incentive to click imo.


With the rapid progress of artificial intelligence research, large multimodal models such as GPT-4V and LLaVA have become hot topics in both academia and industry. However, these advanced models need an effective evaluation framework to measure their performance accurately, which is not an easy task. On the one hand, the diverse prompts and post-processing methods adopted by different models can lead to significant differences in evaluation results. As Hugging Face's blog post on the "1001 flavors of MMLU" illustrates, different implementations of the same evaluation dataset can produce large score differences and even change a model's ranking on leaderboards.

Another challenge lies in data acquisition and processing during evaluation, especially for older datasets that are not widely available; researchers often have to spend considerable time and effort manually searching for, downloading, and processing them.

To address these issues, researchers from Nanyang Technological University, ByteDance, and other institutions have jointly open-sourced [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval), an evaluation framework designed specifically for large multimodal models. Building upon EleutherAI's [`lm-evaluation-harness`](https://github.com/EleutherAI/lm-evaluation-harness) and [🤗 Accelerate](https://github.com/huggingface/accelerate), the framework provides a unified interface for defining models, datasets, and evaluation metrics, offering a one-stop, efficient solution for evaluating large multimodal models (LMMs). We hope this framework helps the community shorten the iteration cycle of multimodal models and promotes their broader adoption in academia and industry, and we look forward to more breakthroughs and innovations in multimodal AI.

![pipeline.jpg](https://huggingface.co/datasets/kcz358/lmms-eval-blog/resolve/main/pipeline.png)

## Overview of the Main Features

**One-click evaluation**: lmms-eval lets users evaluate their model on multiple datasets with a single command, without any manual dataset preparation. With just one line of code, users obtain comprehensive evaluation results within minutes, including detailed logs and per-sample analysis covering model parameters, inputs and outputs, ground-truth answers, and more. This is also useful for scenarios where an advanced model such as GPT-4 is needed for scoring.

Here's an example of evaluating a LLaVA model on the [MME](https://arxiv.org/abs/2306.13394) and [MMBench](https://arxiv.org/abs/2307.06281) benchmarks:

```bash
# Install from source
# pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git

# Install from PyPI
# pip install lmms-eval

# Install LLaVA
# pip install git+https://github.com/haotian-liu/LLaVA.git

# Run your evaluation with a single accelerate launch command!
accelerate launch --multi_gpu --num_processes=8 -m lmms_eval \
--model llava \
--model_args pretrained="liuhaotian/llava-v1.5-7b" \
--tasks mme,mmbench_en \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_v1.5_mme_mmbenchen \
--output_path ./logs
```

**Parallel acceleration and task merging**: Built on Hugging Face's Accelerate, lmms-eval supports multi-GPU, model-parallel, and multi-batch evaluation, significantly improving efficiency. This is particularly advantageous when testing multiple datasets in the same run, greatly reducing total evaluation time.
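
For example, several of the benchmarks from the runtime table below can be merged into a single run simply by listing them under `--tasks`. The following is a sketch based on the command shown above; the number of processes is illustrative and should match the GPUs you actually have:

```bash
# Evaluate one model on several benchmarks in a single run;
# lmms-eval reports the metrics for each task separately at the end.
accelerate launch --multi_gpu --num_processes=4 -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks mme,gqa,scienceqa_img,ai2d \
    --batch_size 1 \
    --output_path ./logs
```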

Here is the total runtime on different datasets using 4 x A100 40G:
Member:
Should we use 4 for num_processes in the command line invocation, or is it unrelated?

Member:
This question still stands. The previous code snippet showed accelerate running on 8 GPUs.

Author:
I think it is somewhat unrelated, because the code snippet is for demonstration purposes only; we ran the baseline tests on the different datasets separately.



| Dataset (# samples)     | LLaVA-v1.5-7b      | LLaVA-v1.5-13b     |
| :---------------------- | :----------------- | :----------------- |
| mme (2374)              | 2 mins 43 seconds  | 3 mins 27 seconds  |
| gqa (12578)             | 10 mins 43 seconds | 14 mins 23 seconds |
| scienceqa_img (2017)    | 1 min 58 seconds   | 2 mins 52 seconds  |
| ai2d (3088)             | 3 mins 17 seconds  | 4 mins 12 seconds  |
| coco2017_cap_val (5000) | 14 mins 13 seconds | 19 mins 58 seconds |

Additionally, the 0.1.1.dev update added support for tensor parallelism, making it possible to run larger models such as LLaVA-v1.6-34B on 4 x RTX 3090 GPUs with efficient inference.

**Comprehensive dataset support:** The `lmms-eval` team hosts more than 40 diverse datasets (with the number still growing) under the lmms-lab organization on the Hugging Face Hub, covering tasks ranging from COCO Captions to MMMU. All datasets have been converted to a unified format for archiving and can be browsed, downloaded, and used directly from the Hub with a single click. You can find all the datasets supported by the framework in [this collection](https://huggingface.co/collections/lmms-lab/lmms-eval-661d51f70a9d678b6f43f272).
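
Because each benchmark is an ordinary dataset repository on the Hub, you can also fetch one outside the evaluation loop and inspect it directly. Below is a minimal sketch, assuming the `lmms-lab/MME` repository id; any dataset from the collection above can be substituted:

```bash
# Download one of the lmms-lab benchmark datasets for local inspection
# (requires the Hugging Face Hub CLI: pip install -U huggingface_hub)
huggingface-cli download lmms-lab/MME --repo-type dataset --local-dir ./lmms-lab-MME
```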

![org_dataset.png](https://huggingface.co/datasets/kcz358/lmms-eval-blog/resolve/main/org_dataset.png)

![viewer.png](https://huggingface.co/datasets/kcz358/lmms-eval-blog/resolve/main/viewer.png)

**Easy to Extend**: Through a unified interface definition, `lmms-eval` not only simplifies the integration of different models and datasets but also makes it easy to introduce new ones. New datasets can be added through a simple YAML file configuration, and evaluation settings can be customized as needed by editing that configuration file.

**Comparability**: We provide an environment in which authors can reproduce the scores reported in the original LLaVA 1.5 paper. We also provide complete experimental results for the LLaVA series of models on all supported evaluation datasets, along with the environment parameters used, for reference (see the README on GitHub).

**Synchronized Online Logging**: We provide detailed logging tools to help you understand the evaluation process and results. Logs include model parameters, generation parameters, input questions, model responses, and ground-truth answers, and every detail can be recorded and visualized in Weights & Biases runs, so users can check results in real time from anywhere.

![wandb_table.jpg](https://huggingface.co/datasets/kcz358/lmms-eval-blog/resolve/main/wandb_table.jpg)

## Conclusion

In summary, this framework not only provides a new tool for multimodal model evaluation but also paves the way for future work, including video multimodal evaluation, few-shot evaluation modes, and batched inference acceleration. The launch of `lmms-eval` opens new paths for AI research and applications, and we hope the community finds it useful for benchmarking their own models in this fast-moving field!
12 changes: 12 additions & 0 deletions zh/_blog.yml
@@ -1601,3 +1601,15 @@
- community
- research
- LLM

- local: lmms_eval
title: "统一多模态大模型评估,加速多模态智能涌现"
author: kcz358
thumbnail: /blog/assets/lmms_eval/thumbnail.png
date: May 15, 2024
tags:
- vlm
- multimodal
- evaluation
- community
- research
115 changes: 115 additions & 0 deletions zh/lmms_eval.md
@@ -0,0 +1,115 @@
---
title: "统一多模态大模型评估,加速多模态智能涌现"
thumbnail: /blog/assets/lmms_eval/thumbnail.png
authors:
- user: luodian
guest: true
org: lmms-lab
- user: PY007
guest: true
org: lmms-lab
- user: kcz358
guest: true
org: lmms-lab
- user: pufanyi
guest: true
org: lmms-lab
- user: JvThunder
guest: true
org: lmms-lab
- user: dododododo
guest: true
- user: THUdyh
guest: true
org: lmms-lab
- user: liuhaotian
guest: true
org: lmms-lab
- user: ZhangYuanhan
guest: true
org: lmms-lab
- user: zhangysk
guest: true
- user: Chunyuan24
guest: true
org: lmms-lab
- user: liuziwei7
guest: true
translators:
- user: kcz358
guest: true
---
# 统一多模态大模型评估,加速多模态智能涌现

**代码仓库** : https://github.com/EvolvingLMMs-Lab/lmms-eval

**官方主页** : https://lmms-lab.github.io/

随着人工智能研究的深入发展,多模态大模型,如GPT-4V和LLaVA等模型,已经成为了学术界和产业界的热点。但是,这些先进的模型需要一个有效的评估框架来准确衡量其性能,而这并非易事。一方面,不同模型采用的提示(prompt)和答案后处理方式多种多样,可能导致性能评估结果大相径庭,正如HuggingFace在其博客中提及的“1001 flavors of MMLU” 所示,即同一评测数据集的不同实现可能会造成极大的分数差异,甚至改变模型在排行榜上的排序。

另一方面,评估过程中的数据集获取与处理也充满挑战,尤其是当面对尚未广泛可用的旧数据集时,研究人员往往需要投入大量时间和精力进行手动搜索、下载和处理。

为解决以上问题,南洋理工大学、字节跳动等机构的研究人员联合开源了[`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval),这是一个专为多模态大型模型设计的评估框架。该框架在[`lm-evaluation-harness`](https://github.com/EleutherAI/lm-evaluation-harness) 和 [🤗 Accelerate](https://github.com/huggingface/accelerate)的基础上改进和扩展,提供了一个统一的界面来定义模型、数据集和评估指标,为评估大型多模态模型(LMMs)提供了一个高效的解决方案。我们希望通过这个框架共同推动多模态模型的迭代周期,并促进它们在学术界和工业界的更广泛应用。我们真诚期待在多模态人工智能领域见证更多的突破和创新,共同推进人工智能技术向更高效、更智能的未来发展。

![pipeline.jpg](https://huggingface.co/datasets/kcz358/lmms-eval-blog/resolve/main/pipeline.png)

## 主要功能概览

**一键式评估**: lmms-eval让用户能够通过单一命令轻松在多个数据集上评估其模型性能,无需手动准备数据集。只需一行代码,用户便能在几分钟内获得综合评估结果,包括详尽的日志和样本分析,涵盖模型参数、输入输出、正确答案等,适用于需要使用GPT4等高级模型进行评分的场景。

以下是一个使用 LLaVa 模型在 [MME](https://arxiv.org/abs/2306.13394) 和 [MMBench](https://arxiv.org/abs/2307.06281) 上进行评测的一个例子:

```bash
# 从源代码安装
# pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git

# 从pypi进行安装
# pip install lmms-eval

# 安装llava
# pip install git+https://github.com/haotian-liu/LLaVA.git

# 一行代码运行评测!
accelerate launch --multi_gpu --num_processes=8 -m lmms_eval \
--model llava \
--model_args pretrained="liuhaotian/llava-v1.5-7b" \
--tasks mme,mmbench_en \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_v1.5_mme_mmbenchen \
--output_path ./logs
```

**并行加速与任务合并**: 利用 Hugging Face 的 Accelerate,lmms-eval支持多GPU、模型并行及多batch处理,显著提高评估效率。这一特点尤其在同时测试多个数据集时体现出其优势,大大缩短了评估时间。

以下是使用4 x A100 40G在不同数据集上的总运行时间。


| 数据集 (#数目) | LLaVA-v1.5-7b | LLaVA-v1.5-13b |
| :---------------------- | :----------------- | :----------------- |
| mme (2374) | 2 分 43 秒 | 3 分 27 秒 |
| gqa (12578) | 10 分 43 秒 | 14 分 23 秒 |
| scienceqa_img (2017) | 1 分 58 秒 | 2 分 52 秒 |
| ai2d (3088) | 3 分 17 秒 | 4 分 12 秒 |
| coco2017_cap_val (5000) | 14 分 13 秒 | 19 分 58 秒 |

此外,在 0.1.1.dev 的更新中,团队支持了 tensor parallelism,能够在 4 x 3090 上运行 LLaVA-v1.6-34B 这样更大的模型,并支持高效推理。

**全面的数据集支持:** `lmms-eval` 团队在 Huggingface 的 lmms-lab 上托管了超过 40 个多样化的数据集(数量持续增加),涵盖了从 COCO Captions 到 MMMU 等一系列任务。所有数据集都已经转换为统一的格式进行存档,在团队的 lmms-lab 官方 Huggingface Hub 上直接获取。用户可以查看评估数据的具体细节,并且只需点击一次即可轻松下载和使用。您可以在[此集合](https://huggingface.co/collections/lmms-lab/lmms-eval-661d51f70a9d678b6f43f272)下找到我们支持的所有数据集。


![org_dataset.png](https://huggingface.co/datasets/kcz358/lmms-eval-blog/resolve/main/org_dataset.png)

![viewer.png](https://huggingface.co/datasets/kcz358/lmms-eval-blog/resolve/main/viewer.png)

**易于扩展**: 通过统一的接口定义, `lmms-eval` 不仅简化了不同模型和数据集的整合过程,也为引入新的数据集和模型提供了便利。同时,它还支持简便的个性化设置,通过简单的 yaml 文件的配置即可增加新的数据集,也允许用户根据需要简单的修改配置文件来自定义评测配置。

**可对比性**: 我们提供了环境以便于作者能够复现 LLaVA 1.5 模型原本的在论文里 report 的分数。除此之外,我们也完整的提供了 LLaVA 系列模型在所有的评测数据集上的实验结果以及环境参数作为参考(见 Github 内 Readme 部分)。

**可在线同步的日志**: 我们提供详细的日志工具,帮助您理解评估过程和结果。日志包括模型参数、生成参数、输入问题、模型响应和真实答案。您还可以记录每一个细节,并在Weights & Biases的运行中进行可视化展示。用户无论何时何地都可以实时查阅结果,方便快捷。

![wandb_table.jpg](https://huggingface.co/datasets/kcz358/lmms-eval-blog/resolve/main/wandb_table.jpg)

## 结论

总而言之,该框架的实施不仅为多模态模型评估提供了新工具,还为未来的研究和开发铺平了道路,包括视频多模态评估、少样本评估模式和批量推理加速等。`lmms-eval` 的推出为人工智能研究和应用开辟了新的道路,我们希望它能帮助社区在这个快速发展的领域中评测自己的模型!