| Diversity-Guided MLP Reduction for Efficient Large Vision Transformers (DGMR) | arXiv | 2025-06-10 |  |  |
| Learning Compact Vision Tokens for Efficient Large Multimodal Models (LLaVA-STF) | arXiv | 2025-06-08 |  |  |
| ✨ InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models | arXiv | 2025-04-14 |  |  |
| InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression (Baidu) | arXiv | 2025-03-27 |  | - |
| M-LLM Based Video Frame Selection for Efficient Video Understanding (CMU) | arXiv | 2025-02-27 | - | - |
| MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs | ICLR 2025 | 2025-02-24 |  | - |
| ✨ Qwen2.5 VL | - | 2025-01-26 |  |  |
| InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling | arXiv | 2025-01-21 |  | - |
| LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding (Spatial-Temporal Compression) | arXiv | 2025-01-14 |  | - |
| LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding | arXiv | 2025-01-09 |  | - |
| Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos | arXiv | 2025-01-07 |  |  |
| FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models (Video Token Compression) | arXiv | 2024-12-30 |  |  |
| ✨ Apollo: An Exploration of Video Understanding in Large Multimodal Models (Exploration) (Meta) | arXiv | 2024-12-13 |  |  |
| CompCap: Improving Multimodal Large Language Models with Composite Captions (Meta) | arXiv | 2024-12-09 | - | - |
| ✨ Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (InternVL 2.5) | arXiv | 2024-12-06 |  |  |
| xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs | arXiv | 2024-10-21 | - |  |
| [Model, Dataset] Personalized Visual Instruction Tuning (PVIT, PVIT-3M) | arXiv | 2024-10-09 |  |  |
| ✨ Video Instruction Tuning With Synthetic Data (LLaVA-Video, LLaVA-NeXT Series) | arXiv | 2024-10-03 |  |  |
| EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | arXiv | 2024-09-26 | - |  |
| Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model (MGLMM, Alibaba) | arXiv | 2024-09-20 |  |  |
| POINTS: Improving Your Vision-language Model with Affordable Strategies (WeChat) | arXiv | 2024-09-07 |  | - |
| ✨ xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | arXiv | 2024-08-16 |  |  |
| ✨ LLaVA-OneVision: Easy Visual Task Transfer (LLaVA-NeXT Series) | arXiv | 2024-08-06 |  |  |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models (Tarsier, Dream1k, by ByteDance) | arXiv | 2024-07-30 |  |  |
| ✨ InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | arXiv | 2024-07-03 |  | - |
| TokenPacker: Efficient Visual Projector for Multimodal LLM | arXiv | 2024-07-02 |  | - |
| ✨ Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (Cambrian, Data Rationing) | arXiv | 2024-06-24 |  |  |
| ✨ Long Context Transfer from Language to Vision (LongVA, by Ziwei Liu, Chunyuan Li) | arXiv | 2024-06-24 |  |  |
| Generative Visual Instruction Tuning | arXiv | 2024-06-17 |  | - |
| ✨ VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | arXiv | 2024-06-13 |  |  |
| ✨ 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (Apple) | arXiv | 2024-06-13 |  |  |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | arXiv | 2024-06-11 |  | - |
| Wings: Learning Multimodal LLMs without Text-only Forgetting | arXiv | 2024-06-05 | - | - |
| Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment (MIVPG) | arXiv | 2024-06-05 | - | - |
| PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM | arXiv | 2024-06-05 |  | - |
| OLIVE: Object Level In-Context Visual Embeddings | ACL 2024 | 2024-06-02 |  | - |
| X-VILA: Cross-Modality Alignment for Large Language Model (by NVIDIA) | arXiv | 2024-05-29 | - |  |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | arXiv | 2024-05-24 |  | - |
| Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models | arXiv | 2024-05-24 | - | - |
| LOVA3: Learning to Visual Question Answering, Asking and Assessment | arXiv | 2024-05-23 |  | - |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | arXiv | 2024-05-23 |  |  |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta) | arXiv | 2024-05-16 |  |  |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024-05-09 |  |  |
| ImageInWords: Unlocking Hyper-Detailed Image Descriptions (Google) | arXiv | 2024-05-05 |  |  |
| ✨ What matters when building vision-language models? (Idefics2) | arXiv | 2024-05-03 | - |  |
| MANTIS: Interleaved Multi-Image Instruction Tuning | arXiv | 2024-05-02 |  |  |
| Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs | CVPR 2024 Workshop | 2024-04-23 | - |  |
| Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | arXiv | 2024-04-19 |  |  |
| MoVA: Adapting Mixture of Vision Experts to Multimodal Context | arXiv | 2024-04-19 |  | - |
| Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models | arXiv | 2024-04-18 | - |  |
| LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? (LaDiC) | NAACL 2024 | 2024-04-16 |  | - |
| AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (AesExpert, AesMMIT Dataset) | arXiv | 2024-04-15 |  | - |
| Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (Ferret-v2) | arXiv | 2024-04-11 | - | - |
| MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (MiniCPM series) | arXiv | 2024-04-09 |  |  |
| Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Ferret-UI) | arXiv | 2024-04-08 | - | - |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | CVPR 2024 | 2024-04-08 |  |  |
| Koala: Key frame-conditioned long video-LLM | CVPR 2024 | 2024-04-05 |  |  |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | arXiv | 2024-04-04 |  |  |
| LongVLM: Efficient Long Video Understanding via Large Language Models | arXiv | 2024-04-04 |  | - |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ECCV 2024 | 2024-03-22 |  | - |
| VideoAgent: Long-form Video Understanding with Large Language Model as Agent (key frame) | arXiv | 2024-03-15 | - | - |
| ✨ MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (Apple) | arXiv | 2024-03-14 | - | - |
| UniCode: Learning a Unified Codebook for Multimodal Large Language Models | arXiv | 2024-03-14 | - | - |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | arXiv | 2024-03-08 | - |  |
| Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | arXiv | 2024-03-05 |  | - |
| RegionGPT: Towards Region Understanding Vision Language Model | CVPR 2024 | 2024-03-04 | - |  |
| All in an Aggregated Image for In-Image Learning | arXiv | 2024-02-28 |  | - |
| Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners | CVPR 2024 | 2024-02-27 |  |  |
| TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages | arXiv | 2024-02-25 | - | - |
| LLMBind: A Unified Modality-Task Integration Framework | arXiv | 2024-02-22 | - | - |
| ✨ ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model (ALLaVA) | arXiv | 2024-02-18 |  |  |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | arXiv | 2024-02-06 |  | - |
| MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices | arXiv | 2023-12-28 |  | - |
| Gemini: A Family of Highly Capable Multimodal Models | arXiv | 2023-12-19 | - |  |
| ✨ Osprey: Pixel Understanding with Visual Instruction Tuning | CVPR 2024 | 2023-12-15 |  | - |
| ✨ VILA: On Pre-training for Visual Language Models (NVIDIA, MIT) | CVPR 2024 | 2023-12-12 |  | - |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | arXiv | 2023-12-11 |  |  |
| Prompt Highlighter: Interactive Control for Multi-Modal LLMs | CVPR 2024 | 2023-12-07 |  |  |
| PixelLM: Pixel Reasoning with Large Multimodal Model | CVPR 2024 | 2023-12-04 |  |  |
| APoLLo: Unified Adapter and Prompt Learning for Vision Language Models | EMNLP 2023 | 2023-12-04 |  |  |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv | 2023-11-28 |  |  |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | arXiv | 2023-11-22 |  | - |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arXiv | 2023-11-21 |  |  |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | CVPR 2024 | 2023-11-20 |  |  |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | arXiv | 2023-11-16 |  | - |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arXiv | 2023-11-07 |  | - |
| MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning | arXiv | 2023-10-14 |  |  |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret) | ICLR 2024 | 2023-10-11 |  | - |
| ✨ Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) | arXiv | 2023-10-05 |  |  |
| Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination)) | arXiv | 2023-09-25 |  |  |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | ICLR 2024 | 2023-09-14 |  | - |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | arXiv | 2023-08-24 |  |  |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages (VisCPM-Chat/Paint) | ICLR 2024 | 2023-08-23 |  | - |
| SVIT: Scaling up Visual Instruction Tuning | arXiv | 2023-07-09 |  |  |
| Kosmos-2: Grounding Multimodal Large Language Models to the World (Kosmos-2, GrIT Dataset) | arXiv | 2023-06-26 |  |  |
| M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | arXiv | 2023-06-07 | - |  |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | NeurIPS 2023 | 2023-05-11 |  | - |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | arXiv | 2023-05-08 |  | - |
| VPGTrans: Transfer Visual Prompt Generator across LLMs | NeurIPS 2023 | 2023-05-02 |  |  |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2023-04-27 |  | - |
| ✨ MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | ICLR 2024 | 2023-04-20 |  |  |
| ✨ Visual Instruction Tuning (LLaVA) | NeurIPS 2023 | 2023-04-17 |  |  |
| Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1) | NeurIPS 2023 | 2023-02-27 |  | - |
| Multimodal Chain-of-Thought Reasoning in Language Models | arXiv | 2023-02-02 |  | - |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ICML 2023 | 2023-01-30 |  | - |
| Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS 2022 | 2022-04-29 |  | - |