We develop UNITE, a universal multimodal embedder that provides a unified representation for arbitrary multimodal content.
Overview of UNITE: (a) Model architecture using an LMM backbone, supporting multimodal inputs (text, images, videos, and their combinations). (b) Similarity matrix after applying MAMCL, which enables focused contrastive learning by restricting comparisons to samples that share the same target modality, thereby reducing inter-modal interference.
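To make the masking idea in (b) concrete, below is a minimal PyTorch sketch of a modality-aware masked contrastive loss, where in-batch negatives are restricted to samples whose target shares the query's target modality. The function name, modality codes, and temperature are illustrative assumptions, not the paper's actual MAMCL implementation.

```python
import torch
import torch.nn.functional as F

def modality_masked_contrastive_loss(query_emb, target_emb, target_modality, temperature=0.05):
    """Sketch of a modality-aware masked contrastive loss (illustrative, not the official MAMCL code).

    query_emb, target_emb: (B, D) embeddings of paired queries and targets.
    target_modality: (B,) integer codes for each target's modality
        (e.g., 0 = text, 1 = image, 2 = video, 3 = fused; codes are assumptions).
    Negatives are drawn only from in-batch samples whose target has the same
    modality as the query's target, reducing inter-modal interference.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    target_emb = F.normalize(target_emb, dim=-1)

    # (B, B) similarity matrix between every query and every in-batch target.
    logits = query_emb @ target_emb.T / temperature

    # Boolean mask: True where query i's target and target j share a modality.
    same_modality = target_modality.unsqueeze(0) == target_modality.unsqueeze(1)

    # Mask out cross-modality pairs so they never act as negatives.
    logits = logits.masked_fill(~same_modality, float("-inf"))

    # Positives lie on the diagonal (each query is paired with its own target).
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    # Toy usage with random embeddings and random modality codes.
    B, D = 8, 256
    q, t = torch.randn(B, D), torch.randn(B, D)
    mods = torch.randint(0, 3, (B,))
    print(modality_masked_contrastive_loss(q, t, mods).item())
```

The diagonal entries are never masked, since a sample trivially shares its own target modality, so every query keeps its positive while cross-modality negatives are excluded from the softmax.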
Performance comparison on the fine-grained video-text benchmark (CaReBench) and image-text benchmarks (ShareGPT4V, Urban1K, DOCCI). UNITE achieves the best overall performance.
Performance comparison on instruction-based retrieval benchmarks (left: MMEB; right: WebVid-CoVR). UNITE achieves leading performance across tasks, even surpassing models with larger parameter counts.
@article{kong2025modality,
  title={Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval},
  author={Kong, Fanheng and Zhang, Jingyuan and Liu, Yahui and Zhang, Hongzhi and Feng, Shi and Yang, Xiaocui and Wang, Daling and Tian, Yu and W, Victoria and Zhang, Fuzheng and Zhou, Guorui},
  journal={arXiv preprint arXiv:2505.19650},
  year={2025}
}