TUNA

Comprehensive Fine-grained Temporal
Understanding Evaluation on Dense Dynamic Videos

1Northeastern University, 2Kuaishou Technology

ACL 2025 Main

Abstract

Videos are unique in their integration of temporal elements, including camera, scene, action, and attribute, along with their dynamic relationships over time. However, existing benchmarks for video understanding often treat these properties separately or focus narrowly on specific aspects, overlooking the holistic nature of video content. To address this, we introduce TUNA, a temporal-oriented benchmark for fine-grained understanding of dense dynamic videos, with two complementary tasks: captioning and QA. TUNA features diverse video scenarios and dynamics, supported by interpretable and robust evaluation criteria. We evaluate several leading models on our benchmark, providing fine-grained performance assessments across various dimensions. This evaluation reveals key challenges in video temporal understanding, such as limited action description, inadequate multi-subject understanding, and insensitivity to camera motion, offering valuable insights for improving video understanding models.

Leaderboard (TUNA-CAP)

All scores listed are F1 (%).
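For reference, the F1 reported here appears to be the standard harmonic mean of element-level precision and recall; the following is a minimal sketch under that assumption (the definition's application to caption elements, and the example numbers, are illustrative and not taken from this page):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the standard F1).

    Assumption: the leaderboard's F1 is computed this way from
    element-level precision/recall; the page does not spell it out.
    """
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Illustrative values only:
print(round(f1_score(0.62, 0.55), 3))  # 0.583
```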

Low-Dy.: Low-Dynamic    High-Dy.: High-Dynamic    Multi-Sc.: Multi-Scene    Multi-Su.: Multi-Subject

By default, the leaderboard is sorted by overall F1 score, with overall Recall score as a secondary sort key.

Columns Camera through Attribute fall under Dynamic Element Type (%); columns Low-Dy. through Multi-Su. fall under Visual Characteristic (%).

| Model | Organization | LLM Params | Frames | Date | Overall | Camera | Scene | Action | Attribute | Low-Dy. | High-Dy. | Multi-Sc. | Multi-Su. |
|-------|--------------|------------|--------|------|---------|--------|-------|--------|-----------|---------|----------|-----------|-----------|
| GPT-4o-0806 | OpenAI | - | 1/2 fps (1*) | 2024-08-06 | 58.5 | 61.3 | 66.4 | 48.0 | 57.8 | 58.2 | 58.7 | 58.1 | 55.5 |
| Gemini 1.5 Pro 002 | Google | - | 1/2 fps (1*) | 2024-05-24 | 57.4 | 60.7 | 63.3 | 46.3 | 56.0 | 58.7 | 56.7 | 57.0 | 53.3 |
| Gemini 1.5 Flash 002 | Google | - | 1/2 fps (1*) | 2024-05-24 | 55.7 | 59.6 | 65.1 | 42.9 | 55.2 | 56.0 | 55.5 | 55.9 | 55.9 |
| InternVL2-76B | Shanghai AI Lab | 72B | 32 | 2024-07-04 | 51.9 | 53.9 | 61.4 | 41.2 | 50.9 | 52.8 | 51.5 | 51.1 | 49.3 |
| Qwen2-VL-72B | Alibaba | 72B | 2 fps (2*) | 2024-08-30 | 51.7 | 54.0 | 52.8 | 42.6 | 48.5 | 55.7 | 49.7 | 48.0 | 43.3 |
| InternVL2-40B | Shanghai AI Lab | 34B | 32 | 2024-07-04 | 51.7 | 55.1 | 59.0 | 39.3 | 52.3 | 53.9 | 50.5 | 50.5 | 48.0 |
| MiniCPM-V 2.6 | OpenBMB | 8B | 32 | 2024-08-06 | 51.7 | 56.0 | 60.6 | 38.8 | 50.2 | 53.0 | 51.0 | 51.7 | 49.0 |
| LLaVA-Video-7B | Bytedance & NTU S-Lab | 7B | 32 | 2024-09-30 | 51.0 | 50.4 | 58.9 | 37.8 | 53.1 | 52.2 | 50.3 | 50.0 | 45.8 |
| LLaVA-Video-72B (SlowFast) | Bytedance & NTU S-Lab | 72B | 32 | 2024-09-30 | 50.2 | 50.3 | 56.4 | 39.3 | 50.8 | 50.6 | 50.0 | 49.3 | 45.7 |
| LLaVA-OneVision-72B | Bytedance & NTU S-Lab | 72B | 32 | 2024-08-05 | 49.6 | 51.9 | 57.7 | 36.0 | 48.8 | 48.6 | 45.9 | 50.1 | 49.4 |
| LLaVA-OneVision-7B | Bytedance & NTU S-Lab | 7B | 32 | 2024-08-05 | 49.3 | 51.0 | 57.6 | 36.8 | 49.3 | 50.0 | 48.9 | 48.4 | 43.8 |
| InternVL2-26B | Shanghai AI Lab | 20B | 32 | 2024-07-04 | 49.0 | 51.6 | 58.7 | 37.0 | 49.1 | 49.4 | 48.9 | 48.4 | 45.8 |
| Qwen2-VL-7B | Alibaba | 7B | 2 fps (2*) | 2024-08-30 | 48.9 | 49.0 | 56.7 | 37.0 | 46.7 | 53.8 | 46.4 | 44.4 | 39.9 |
| Tarsier-34B | Bytedance | 34B | 32 | 2024-07-04 | 48.2 | 42.3 | 44.4 | 47.6 | 42.2 | 49.1 | 47.8 | 49.6 | 47.3 |
| Kangaroo | Meituan & UCAS | 8B | 32 | 2024-07-17 | 42.7 | 44.1 | 51.9 | 31.9 | 39.5 | 45.6 | 41.1 | 39.3 | 35.7 |
| InternVL2-8B | Shanghai AI Lab | 7B | 32 | 2024-07-04 | 40.8 | 41.7 | 44.7 | 30.0 | 42.3 | 44.5 | 38.9 | 38.4 | 35.2 |
| Tarsier-7B | Bytedance | 7B | 32 | 2024-07-04 | 38.6 | 34.8 | 33.1 | 36.2 | 33.3 | 46.5 | 34.5 | 35.8 | 33.2 |
| PLLaVA-34B | NUS & NYU & Bytedance | 34B | 16 | 2024-04-24 | 34.2 | 37.4 | 39.9 | 22.3 | 33.2 | 38.9 | 31.8 | 30.2 | 27.6 |
| LongVA | NTU S-Lab | 7B | 32 | 2024-06-24 | 31.8 | 32.5 | 40.6 | 22.0 | 28.4 | 37.3 | 29.0 | 27.6 | 23.7 |
| PLLaVA-13B | NUS & NYU & Bytedance | 13B | 16 | 2024-04-24 | 30.6 | 33.0 | 40.3 | 18.5 | 29.8 | 36.0 | 27.8 | 26.0 | 24.3 |
| PLLaVA-7B | NUS & NYU & Bytedance | 7B | 16 | 2024-04-24 | 27.4 | 28.9 | 36.6 | 16.5 | 25.3 | 32.7 | 24.7 | 22.8 | 22.5 |

Date indicates the publication date of open-source models; a "-" in the Params column indicates a closed-source model.

1* Videos are sampled at 2 fps when the video duration is under 16 s, and at 1 fps otherwise.

2* Videos are sampled at 2 fps, capped at 64 frames.
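The default ordering described above (overall F1 descending, with overall Recall as a tiebreaker) can be sketched as follows; the entries and Recall values are made up for illustration, since Recall is not shown in the table:

```python
def rank(entries):
    """Sort leaderboard entries by overall F1, then overall Recall,
    both descending (the page's stated default ordering)."""
    return sorted(entries, key=lambda e: (e["f1"], e["recall"]), reverse=True)

# Hypothetical entries; note the F1 tie broken by Recall.
entries = [
    {"model": "A", "f1": 51.7, "recall": 49.0},
    {"model": "B", "f1": 58.5, "recall": 56.2},
    {"model": "C", "f1": 51.7, "recall": 50.3},
]

print([e["model"] for e in rank(entries)])  # ['B', 'C', 'A']
```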

Benchmark

Data Examples

All data are newly annotated by humans rather than drawn from any existing video dataset.

Dataset Statistics


Detailed statistics for TUNA-1K, including: number of videos (#Videos), video duration (Duration), number of events (#Events), number of visual elements in captions (#Elements (Narrative-level)), number of visual elements in events (#Elements (Event-level)), and number of caption tokens (#Tokens).

Benchmark Comparison

Method

Analysis

Video Complexity / Input Frames

More Data Examples

Citation


    @article{kong2025tuna,
      title={TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos},
      author={Kong, Fanheng and Zhang, Jingyuan and Zhang, Hongzhi and Feng, Shi and Wang, Daling and Tian, Yu and Yu, Linhao and Ji, Xingguang and W, Victoria and Zhang, Fuzheng},
      journal={arXiv preprint arXiv:2505.20124},
      year={2025}
    }