| InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression (Baidu) | arXiv | 2025-03-27 | | - |
| M-LLM Based Video Frame Selection for Efficient Video Understanding (CMU) | arXiv | 2025-02-27 | - | - |
| ✨ Qwen2.5-VL | - | 2025-01-26 | | |
| InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling | arXiv | 2025-01-21 | | - |
| LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding (Spatial-Temporal Compression) | arXiv | 2025-01-14 | | - |
| LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding | arXiv | 2025-01-09 | | - |
| Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos | arXiv | 2025-01-07 | | |
| FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models (Video Token Compression) | arXiv | 2024-12-30 | | |
| ✨ Apollo: An Exploration of Video Understanding in Large Multimodal Models (Exploration) (Meta) | arXiv | 2024-12-13 | | |
| CompCap: Improving Multimodal Large Language Models with Composite Captions (Meta) | arXiv | 2024-12-09 | - | - |
| ✨ Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (InternVL 2.5) | arXiv | 2024-12-06 | | |
| xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs | arXiv | 2024-10-21 | - | |
| [Model, Dataset] Personalized Visual Instruction Tuning (PVIT, PVIT-3M) | arXiv | 2024-10-09 | | |
| ✨ Video Instruction Tuning With Synthetic Data (LLaVA-Video, LLaVA-NeXT Series) | arXiv | 2024-10-03 | | |
| EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | arXiv | 2024-09-26 | - | |
| Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model (MGLMM, Alibaba) | arXiv | 2024-09-20 | | |
| POINTS: Improving Your Vision-language Model with Affordable Strategies (WeChat) | arXiv | 2024-09-07 | | - |
| ✨ xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | arXiv | 2024-08-16 | | |
| ✨ LLaVA-OneVision: Easy Visual Task Transfer (LLaVA-NeXT Series) | arXiv | 2024-08-06 | | |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models (Tarsier, Dream1k, by ByteDance) | arXiv | 2024-07-30 | | |
| ✨ InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | arXiv | 2024-07-03 | | - |
| TokenPacker: Efficient Visual Projector for Multimodal LLM | arXiv | 2024-07-02 | | - |
| ✨ Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (Cambrian, Data Rationing) | arXiv | 2024-06-24 | | |
| ✨ Long Context Transfer from Language to Vision (LongVA, by Ziwei Liu, Chunyuan Li) | arXiv | 2024-06-24 | | |
| Generative Visual Instruction Tuning | arXiv | 2024-06-17 | | - |
| ✨ VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | arXiv | 2024-06-13 | | |
| ✨ 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (Apple) | arXiv | 2024-06-13 | | |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | arXiv | 2024-06-11 | | - |
| Wings: Learning Multimodal LLMs without Text-only Forgetting | arXiv | 2024-06-05 | - | - |
| Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment (MIVPG) | arXiv | 2024-06-05 | - | - |
| PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM | arXiv | 2024-06-05 | | - |
| OLIVE: Object Level In-Context Visual Embeddings | ACL 2024 | 2024-06-02 | | - |
| X-VILA: Cross-Modality Alignment for Large Language Model (by NVIDIA) | arXiv | 2024-05-29 | - | |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | arXiv | 2024-05-24 | | - |
| Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models | arXiv | 2024-05-24 | - | - |
| LOVA3: Learning to Visual Question Answering, Asking and Assessment | arXiv | 2024-05-23 | | - |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | arXiv | 2024-05-23 | | |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta) | arXiv | 2024-05-16 | | |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024-05-09 | | |
| ImageInWords: Unlocking Hyper-Detailed Image Descriptions (Google) | arXiv | 2024-05-05 | | |
| ✨ What matters when building vision-language models? (Idefics2) | arXiv | 2024-05-03 | - | |
| MANTIS: Interleaved Multi-Image Instruction Tuning | arXiv | 2024-05-02 | | |
| Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs | CVPR 2024 Workshop | 2024-04-23 | - | |
| Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | arXiv | 2024-04-19 | | |
| MoVA: Adapting Mixture of Vision Experts to Multimodal Context | arXiv | 2024-04-19 | | - |
| Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models | arXiv | 2024-04-18 | - | |
| LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? (LaDiC) | NAACL 2024 | 2024-04-16 | | - |
| AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (AesExpert, AesMMIT Dataset) | arXiv | 2024-04-15 | | - |
| Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (Ferret-v2) | arXiv | 2024-04-11 | - | - |
| MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (MiniCPM series) | arXiv | 2024-04-09 | | |
| Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Ferret-UI) | arXiv | 2024-04-08 | - | - |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | CVPR 2024 | 2024-04-08 | | |
| Koala: Key frame-conditioned long video-LLM | CVPR 2024 | 2024-04-05 | | |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | arXiv | 2024-04-04 | | |
| LongVLM: Efficient Long Video Understanding via Large Language Models | arXiv | 2024-04-04 | | - |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ECCV 2024 | 2024-03-22 | | - |
| VideoAgent: Long-form Video Understanding with Large Language Model as Agent (key frame) | arXiv | 2024-03-15 | - | - |
| ✨ MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (Apple) | arXiv | 2024-03-14 | - | - |
| UniCode: Learning a Unified Codebook for Multimodal Large Language Models | arXiv | 2024-03-14 | - | - |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | arXiv | 2024-03-08 | - | |
| Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | arXiv | 2024-03-05 | | - |
| RegionGPT: Towards Region Understanding Vision Language Model | CVPR 2024 | 2024-03-04 | - | |
| All in an Aggregated Image for In-Image Learning | arXiv | 2024-02-28 | | - |
| Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners | CVPR 2024 | 2024-02-27 | | |
| TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages | arXiv | 2024-02-25 | - | - |
| LLMBind: A Unified Modality-Task Integration Framework | arXiv | 2024-02-22 | - | - |
| ✨ ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model (ALLaVA) | arXiv | 2024-02-18 | | |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | arXiv | 2024-02-06 | | - |
| MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices | arXiv | 2023-12-28 | | - |
| Gemini: A Family of Highly Capable Multimodal Models | arXiv | 2023-12-19 | - | |
| ✨ Osprey: Pixel Understanding with Visual Instruction Tuning | CVPR 2024 | 2023-12-15 | | - |
| ✨ VILA: On Pre-training for Visual Language Models (NVIDIA, MIT) | CVPR 2024 | 2023-12-12 | | - |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | arXiv | 2023-12-11 | | |
| Prompt Highlighter: Interactive Control for Multi-Modal LLMs | CVPR 2024 | 2023-12-07 | | |
| PixelLM: Pixel Reasoning with Large Multimodal Model | CVPR 2024 | 2023-12-04 | | |
| APoLLo: Unified Adapter and Prompt Learning for Vision Language Models | EMNLP 2023 | 2023-12-04 | | |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv | 2023-11-28 | | |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | arXiv | 2023-11-22 | | - |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arXiv | 2023-11-21 | | |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | CVPR 2024 | 2023-11-20 | | |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | arXiv | 2023-11-16 | | - |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arXiv | 2023-11-07 | | - |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | arXiv | 2023-10-14 | | |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret) | ICLR 2024 | 2023-10-11 | | - |
| ✨ Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) | arXiv | 2023-10-05 | | |
| Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination)) | arXiv | 2023-09-25 | | |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | ICLR 2024 | 2023-09-14 | | - |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | arXiv | 2023-08-24 | | |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages (VisCPM-Chat/Paint) | ICLR 2024 | 2023-08-23 | | - |
| SVIT: Scaling up Visual Instruction Tuning | arXiv | 2023-07-09 | | |
| Kosmos-2: Grounding Multimodal Large Language Models to the World (Kosmos-2, GrIT Dataset) | arXiv | 2023-06-26 | | |
| M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | arXiv | 2023-06-07 | - | |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | NeurIPS 2023 | 2023-05-11 | | - |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | arXiv | 2023-05-08 | | - |
| VPGTrans: Transfer Visual Prompt Generator across LLMs | NeurIPS 2023 | 2023-05-02 | | |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2023-04-27 | | - |
| ✨ MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | ICLR 2024 | 2023-04-20 | | |
| ✨ Visual Instruction Tuning (LLaVA) | NeurIPS 2023 | 2023-04-17 | | |
| Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1) | NeurIPS 2023 | 2023-02-27 | | - |
| Multimodal Chain-of-Thought Reasoning in Language Models | arXiv | 2023-02-02 | | - |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ICML 2023 | 2023-01-30 | | - |
| Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS 2022 | 2022-04-29 | | - |