| Diversity-Guided MLP Reduction for Efficient Large Vision Transformers (DGMR) | arXiv | 2025-06-10 |  |  |
| Learning Compact Vision Tokens for Efficient Large Multimodal Models (LLaVA-STF) | arXiv | 2025-06-08 |  |  |
| ✨ InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models | arXiv | 2025-04-14 |  |  |
| InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression (Baidu) | arXiv | 2025-03-27 |  | - |
| M-LLM Based Video Frame Selection for Efficient Video Understanding (CMU) | arXiv | 2025-02-27 | - | - |
| MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs | ICLR 2025 | 2025-02-24 |  | - |
| ✨ Qwen2.5 VL | - | 2025-01-26 |  |  |
| InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling | arXiv | 2025-01-21 |  | - |
| LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding (Spatial-Temporal Compression) | arXiv | 2025-01-14 |  | - |
| LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding | arXiv | 2025-01-09 |  | - |
| Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos | arXiv | 2025-01-07 |  |  |
| FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models (Video Token Compression) | arXiv | 2024-12-30 |  |  |
| ✨ Apollo: An Exploration of Video Understanding in Large Multimodal Models (Exploration) (Meta) | arXiv | 2024-12-13 |  |  |
| CompCap: Improving Multimodal Large Language Models with Composite Captions (Meta) | arXiv | 2024-12-09 | - | - |
| ✨ Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (InternVL 2.5) | arXiv | 2024-12-06 |  |  |
| xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs | arXiv | 2024-10-21 | - |  |
| [Model, Dataset] Personalized Visual Instruction Tuning (PVIT, PVIT-3M) | arXiv | 2024-10-09 |  |  |
| ✨ Video Instruction Tuning With Synthetic Data (LLaVA-Video, LLaVA-NeXT Series) | arXiv | 2024-10-03 |  |  |
| EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | arXiv | 2024-09-26 | - |  |
| Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model (MGLMM, Alibaba) | arXiv | 2024-09-20 |  |  |
| POINTS: Improving Your Vision-language Model with Affordable Strategies (WeChat) | arXiv | 2024-09-07 |  | - |
| ✨ xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | arXiv | 2024-08-16 |  |  |
| ✨ LLaVA-OneVision: Easy Visual Task Transfer (LLaVA-NeXT Series) | arXiv | 2024-08-06 |  |  |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models (Tarsier, Dream1k, by ByteDance) | arXiv | 2024-07-30 |  |  |
| ✨ InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | arXiv | 2024-07-03 |  | - |
| TokenPacker: Efficient Visual Projector for Multimodal LLM | arXiv | 2024-07-02 |  | - |
| ✨ Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (Cambrian, Data Rationing) | arXiv | 2024-06-24 |  |  |
| ✨ Long Context Transfer from Language to Vision (LongVA, by Ziwei Liu, Chunyuan Li) | arXiv | 2024-06-24 |  |  |
| Generative Visual Instruction Tuning | arXiv | 2024-06-17 |  | - |
| ✨ VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | arXiv | 2024-06-13 |  |  |
| ✨ 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (Apple) | arXiv | 2024-06-13 |  |  |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | arXiv | 2024-06-11 |  | - |
| Wings: Learning Multimodal LLMs without Text-only Forgetting | arXiv | 2024-06-05 | - | - |
| Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment (MIVPG) | arXiv | 2024-06-05 | - | - |
| PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM | arXiv | 2024-06-05 |  | - |
| OLIVE: Object Level In-Context Visual Embeddings | ACL 2024 | 2024-06-02 |  | - |
| X-VILA: Cross-Modality Alignment for Large Language Model (by NVIDIA) | arXiv | 2024-05-29 | - |  |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | arXiv | 2024-05-24 |  | - |
| Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models | arXiv | 2024-05-24 | - | - |
| LOVA3: Learning to Visual Question Answering, Asking and Assessment | arXiv | 2024-05-23 |  | - |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | arXiv | 2024-05-23 |  |  |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta) | arXiv | 2024-05-16 |  |  |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024-05-09 |  |  |
| ImageInWords: Unlocking Hyper-Detailed Image Descriptions (Google) | arXiv | 2024-05-05 |  |  |
| ✨ What matters when building vision-language models? (Idefics2) | arXiv | 2024-05-03 | - |  |
| MANTIS: Interleaved Multi-Image Instruction Tuning | arXiv | 2024-05-02 |  |  |
| Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs | CVPR 2024 Workshop | 2024-04-23 | - |  |
| Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | arXiv | 2024-04-19 |  |  |
| MoVA: Adapting Mixture of Vision Experts to Multimodal Context | arXiv | 2024-04-19 |  | - |
| Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models | arXiv | 2024-04-18 | - |  |
| LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? (LaDiC) | NAACL 2024 | 2024-04-16 |  | - |
| AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (AesExpert, AesMMIT Dataset) | arXiv | 2024-04-15 |  | - |
| Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (Ferret-v2) | arXiv | 2024-04-11 | - | - |
| MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (MiniCPM series) | arXiv | 2024-04-09 |  |  |
| Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Ferret-UI) | arXiv | 2024-04-08 | - | - |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | CVPR 2024 | 2024-04-08 |  |  |
| Koala: Key frame-conditioned long video-LLM | CVPR 2024 | 2024-04-05 |  |  |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | arXiv | 2024-04-04 |  |  |
| LongVLM: Efficient Long Video Understanding via Large Language Models | arXiv | 2024-04-04 |  | - |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ECCV 2024 | 2024-03-22 |  | - |
| VideoAgent: Long-form Video Understanding with Large Language Model as Agent (key frame) | arXiv | 2024-03-15 | - | - |
| ✨ MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (Apple) | arXiv | 2024-03-14 | - | - |
| UniCode: Learning a Unified Codebook for Multimodal Large Language Models | arXiv | 2024-03-14 | - | - |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | arXiv | 2024-03-08 | - |  |
| Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | arXiv | 2024-03-05 |  | - |
| RegionGPT: Towards Region Understanding Vision Language Model | CVPR 2024 | 2024-03-04 | - |  |
| All in an Aggregated Image for In-Image Learning | arXiv | 2024-02-28 |  | - |
| Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners | CVPR 2024 | 2024-02-27 |  |  |
| TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages | arXiv | 2024-02-25 | - | - |
| LLMBind: A Unified Modality-Task Integration Framework | arXiv | 2024-02-22 | - | - |
| ✨ ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model (ALLaVA) | arXiv | 2024-02-18 |  |  |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | arXiv | 2024-02-06 |  | - |
| MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices | arXiv | 2023-12-28 |  | - |
| Gemini: A Family of Highly Capable Multimodal Models | arXiv | 2023-12-19 | - |  |
| ✨ Osprey: Pixel Understanding with Visual Instruction Tuning | CVPR 2024 | 2023-12-15 |  | - |
| ✨ VILA: On Pre-training for Visual Language Models (NVIDIA, MIT) | CVPR 2024 | 2023-12-12 |  | - |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | arXiv | 2023-12-11 |  |  |
| Prompt Highlighter: Interactive Control for Multi-Modal LLMs | CVPR 2024 | 2023-12-07 |  |  |
| PixelLM: Pixel Reasoning with Large Multimodal Model | CVPR 2024 | 2023-12-04 |  |  |
| APoLLo: Unified Adapter and Prompt Learning for Vision Language Models | EMNLP 2023 | 2023-12-04 |  |  |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv | 2023-11-28 |  |  |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | arXiv | 2023-11-22 |  | - |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arXiv | 2023-11-21 |  |  |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | CVPR 2024 | 2023-11-20 |  |  |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | arXiv | 2023-11-16 |  | - |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arXiv | 2023-11-07 |  | - |
| MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning | arXiv | 2023-10-14 |  |  |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret) | ICLR 2024 | 2023-10-11 |  | - |
| ✨ Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) | arXiv | 2023-10-05 |  |  |
| Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination)) | arXiv | 2023-09-25 |  |  |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | ICLR 2024 | 2023-09-14 |  | - |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | arXiv | 2023-08-24 |  |  |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages (VisCPM-Chat/Paint) | ICLR 2024 | 2023-08-23 |  | - |
| SVIT: Scaling up Visual Instruction Tuning | arXiv | 2023-07-09 |  |  |
| Kosmos-2: Grounding Multimodal Large Language Models to the World (Kosmos-2, GrIT Dataset) | arXiv | 2023-06-26 |  |  |
| M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | arXiv | 2023-06-07 | - |  |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | NeurIPS 2023 | 2023-05-11 |  | - |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | arXiv | 2023-05-08 |  | - |
| VPGTrans: Transfer Visual Prompt Generator across LLMs | NeurIPS 2023 | 2023-05-02 |  |  |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2023-04-27 |  | - |
| ✨ MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | ICLR 2024 | 2023-04-20 |  |  |
| ✨ Visual Instruction Tuning (LLaVA) | NeurIPS 2023 | 2023-04-17 |  |  |
| Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1) | NeurIPS 2023 | 2023-02-27 |  | - |
| Multimodal Chain-of-Thought Reasoning in Language Models | arXiv | 2023-02-02 |  | - |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ICML 2023 | 2023-01-30 |  | - |
| Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS 2022 | 2022-04-29 |  | - |