Multimodal discussion QQ group: 237976286
- 2024.12 MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
- 2024.12 Apollo: An Exploration of Video Understanding in Large Multimodal Models A Video-LLM from Meta.
- 2024.12 DeepSeek-VL2
- 2024.12 FastVLM: Efficient Vision Encoding for Vision Language Models
- 2024.12 POINTS1.5: Building a Vision-Language Model towards Real World Applications From WeChat (Tencent).
- 2024.12 InternVL 2.5 Model sizes range from 1B to 78B.
- 2024.12 Qwen2-VL-72B Model
- 2024.12 Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
- 2024.12 NVILA: Efficient Frontier Visual Language Models From NVIDIA; a VLM family that optimizes both efficiency and accuracy.
- 2024.12 PaliGemma 2: A Family of Versatile VLMs for Transfer
- 2024.11 Multimodal Autoregressive Pre-training of Large Vision Encoders Apple proposes a new way of training vision encoders with multimodal support.
- 2024.11 Pixtral Large Mistral's 124B multimodal large model.
- 2024.11 OmniVision-968M: World's Smallest Vision Language Model
- 2024.11 LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation From Microsoft; replaces the text encoder in CLIP with an LLM, supporting longer context and more complex text and yielding better top-k retrieval (a rough sketch of the idea appears at the end of this list).
- 2024.11 HourVideo: 1-Hour Video-Language Understanding A long-video understanding benchmark from Fei-Fei Li's team.
- 2024.11 Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
- 2024.11 MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs NVIDIA proposes universal multimodal retrieval built on MLLMs.
- 2024.11 Attacking Vision-Language Computer Agents via Pop-ups
- 2024.11 Know Where You're Uncertain When Planning with Multimodal Foundation Models: A Formal Framework Improves how multimodal foundation models handle uncertainty, making robots more reliable in planning tasks.
- 2024.10 Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
- 2024.10 LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding Meta's method for long-video understanding.
- 2024.10 Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data BAAI open-sources 40 million multimodal instruction samples.
- 2024.10 Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
- 2024.10 VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation A unified understanding-and-generation model from the VILA team.
- 2024.10 Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation DeepSeek's first unified multimodal understanding-and-generation model.
- 2024.10 ARIA: An Open Multimodal Native Mixture-of-Experts Model A 3.9B (activated-parameter) MoE model claimed to outperform Pixtral-12B and Llama3.2-11B.
- 2024.10 Baichuan-Omni Technical Report Baichuan's first multimodal model (7B).
- 2024.10 Pixtral 12B From Mistral.
- 2024.10 Movie Gen: A Cast of Media Foundation Models From Meta.
- 2024.10 LEOPARD: A Vision Language Model for Text-Rich Multi-Image Tasks
- 2024.10 Video Instruction Tuning with Synthetic Data Video instruction data open-sourced by the LLaVA team together with ByteDance.
- 2024.09 MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning Apple's follow-up to MM1.
- 2024.09 Emu3: Next-Token Prediction is All You Need From BAAI.
- 2024.09 Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models From the Allen Institute for AI; both the models and the data are open-sourced.
- 2024.09 MIO: A Foundation Model on Multimodal Tokens
- 2024.09 Phantom of Latent for Large Language and Vision Models
- 2024.09 Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution
- 2024.09 Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
- 2024.09 NVLM: Open Frontier-Class Multimodal LLMs From NVIDIA.
- 2024.09 Viper: Open Mamba-based Vision-Language Models The first Mamba-based VLM family.
- 2024.09 MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
- 2024.09 General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
- 2024.08 Law of Vision Representation in MLLMs Proposes the AC score metric: the higher the AC score, the better the vision representation.
- 2024.08 CogVLM2: Visual Language Models for Image and Video Understanding
- 2024.08 EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
- 2024.08 A Practitioner's Guide to Continual Multimodal Pretraining
- 2024.08 Building and better understanding vision-language models: insights and future directions
- 2024.08 LongVILA: Scaling Long-Context Visual Language Models for Long Videos
- 2024.08 UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
- 2024.08 xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
- 2024.08 LLaVA-OneVision: Easy Visual Task Transfer The culmination of the LLaVA-NeXT series.
- 2024.08 MiniCPM-V: A GPT-4V Level MLLM on Your Phone A remarkably capable pocket-sized MLLM.
- 2024.08 SAM 2: Segment Anything in Images and Videos
- 2021.02 Learning Transferable Visual Models From Natural Language Supervision CLIP
- 2022.04 Flamingo: a Visual Language Model for Few-Shot Learning From DeepMind; a pioneering MLLM.
- 2023.01 BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models Introduces the Q-Former (a minimal sketch appears at the end of this list).
- 2023.03 Sigmoid Loss for Language Image Pre-Training SigLIP; an alternative to CLIP that replaces the softmax contrastive loss with a sigmoid loss (see the loss sketch at the end of this list).
- 2023.04 MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models Attracted a great deal of attention.
- 2023.04 Visual Instruction Tuning The first paper in the LLaVA series.
- 2023.05 InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
- 2023.05 Segment Anything SAM
- 2023.12 Gemini: A Family of Highly Capable Multimodal Models
- 2024.01 Agent AI: Surveying the Horizons of Multimodal Interaction From Fei-Fei Li's team.
- 2024.04 MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training From Apple.
- 2024.05 An Introduction to Vision-Language Modeling From Meta; short and to the point.
- 2024.05 DeepSeek-VL: Towards Real-World Vision-Language Understanding
- 2024.06 Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Proposes the vision-centric benchmark CV-Bench, runs experiments on how various design choices affect VLM performance, and trains the Cambrian-1 models.
- 2024.09 Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
- 2024.09 Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution
- 2024.09 Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models From the Allen Institute for AI; both the models and the data are open-sourced.
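
The SigLIP entry above replaces CLIP's softmax contrastive objective with a per-pair sigmoid loss. Below is a minimal sketch contrasting the two objectives; it is not the papers' official code, and the temperature, bias initialisation, and loss normalisation are simplifying assumptions.

```python
# Minimal sketch: CLIP-style softmax contrastive loss vs. SigLIP-style sigmoid loss.
# Not official code; temperature/bias values and the mean reduction are assumptions.
import torch
import torch.nn.functional as F

def clip_softmax_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP: symmetric cross-entropy over the batch's image-text similarity matrix."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                      # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)      # matching pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def siglip_sigmoid_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """SigLIP: each image-text pair is an independent binary classification,
    so no batch-wide softmax normalisation is required."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * t + b                            # (B, B)
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1  # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

# Usage: both losses consume a batch of paired image/text embeddings.
img, txt = torch.randn(8, 512), torch.randn(8, 512)
print(clip_softmax_loss(img, txt).item(), siglip_sigmoid_loss(img, txt).item())
```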
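
The BLIP-2 entry points to the Q-Former, which compresses a long sequence of frozen image features into a small, fixed number of query tokens for the LLM. The sketch below illustrates the idea only: a standard Transformer decoder stands in for the actual BERT-based Q-Former, and the layer count, dimensions, and query count are illustrative assumptions.

```python
# Illustrative Q-Former-style module: learnable queries cross-attend to frozen image features.
# Not the BLIP-2 implementation; the architecture hyperparameters here are assumptions.
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    def __init__(self, num_queries=32, dim=768, num_layers=2, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        # Decoder layers give self-attention over the queries plus cross-attention to the image features.
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, image_feats):                     # image_feats: (B, num_patches, dim), kept frozen
        q = self.queries.expand(image_feats.size(0), -1, -1)
        return self.blocks(q, image_feats)              # (B, num_queries, dim), fed to the LLM

# Usage: compress 257 frozen ViT patch tokens into 32 query tokens.
feats = torch.randn(2, 257, 768)
print(QFormerSketch()(feats).shape)                     # torch.Size([2, 32, 768])
```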
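
The LLM2CLIP entry describes swapping CLIP's text encoder for an LLM. The sketch below shows only that swap, assuming mean pooling over the LLM's hidden states and a linear projection into the shared embedding space; the model name is a placeholder, and the paper's additional contrastive fine-tuning of the LLM is not shown.

```python
# Rough sketch of an LLM-based text tower in place of CLIP's text encoder.
# Not the LLM2CLIP recipe; model name, pooling choice, and projection size are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class LLMTextTower(nn.Module):
    def __init__(self, llm_name="Qwen/Qwen2-0.5B", embed_dim=768):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(llm_name)
        if self.tokenizer.pad_token is None:               # some LLM tokenizers ship without a pad token
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.llm = AutoModel.from_pretrained(llm_name)     # frozen or lightly tuned in practice
        self.proj = nn.Linear(self.llm.config.hidden_size, embed_dim)

    def forward(self, texts):
        batch = self.tokenizer(texts, padding=True, return_tensors="pt")
        hidden = self.llm(**batch).last_hidden_state       # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)        # ignore padding when pooling
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return F.normalize(self.proj(pooled), dim=-1)       # embeddings in the shared image-text space

# Usage: these embeddings stand in for CLIP's text embeddings in the losses sketched above.
tower = LLMTextTower()
print(tower(["a photo of a cat", "a diagram of a transformer"]).shape)  # torch.Size([2, 768])
```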