TUNA

Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos

Fanheng Kong¹,², Jingyuan Zhang², Hongzhi Zhang², Shi Feng¹†,
Daling Wang¹, Linhao Yu², Xingguang Ji², Yu Tian², Victoria W., Fuzheng Zhang²

†Corresponding Author

¹Northeastern University, ²Kuaishou Technology

ACL 2025 Main

Abstract

Videos are unique in their integration of temporal elements, including camera, scene, action, and attribute, along with their dynamic relationships over time. However, existing benchmarks for video understanding often treat these properties separately or focus narrowly on specific aspects, overlooking the holistic nature of video content. To address this, we introduce TUNA, a temporal-oriented benchmark for fine-grained understanding of dense dynamic videos, with two complementary tasks: captioning and QA. TUNA features diverse video scenarios and dynamics, supported by interpretable and robust evaluation criteria. We evaluate several leading models on our benchmark, providing fine-grained performance assessments across various dimensions. This evaluation reveals key challenges in video temporal understanding, such as limited action description, inadequate multi-subject understanding, and insensitivity to camera motion, offering valuable insights for improving video understanding models.

Leaderboard (TUNA-CAP)

All scores listed here are F1 scores.

Abbreviations: Low-Dy. = Low-Dynamic; High-Dy. = High-Dynamic; Multi-Sc. = Multi-Scene; Multi-Su. = Multi-Subject.

By default, this leaderboard is sorted by overall F1 score, with overall Recall score as a secondary sort key.
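The ordering rule above can be sketched in a few lines of Python. This is only an illustration: the `Entry` class, the `rank` helper, and the precision/recall values are hypothetical, and it assumes F1 is the usual harmonic mean of precision and recall and that ties are judged at the one-decimal precision shown in the table.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One leaderboard row (hypothetical structure, values in percent)."""
    model: str
    precision: float
    recall: float

    @property
    def f1(self) -> float:
        # Standard F1: harmonic mean of precision and recall.
        total = self.precision + self.recall
        return 0.0 if total == 0 else 2 * self.precision * self.recall / total


def rank(entries: list[Entry]) -> list[Entry]:
    # Primary key: F1 rounded to one decimal (assumed, to mirror the display).
    # Secondary key: recall. Both descending.
    return sorted(entries, key=lambda e: (round(e.f1, 1), e.recall), reverse=True)


# Made-up numbers, not taken from the leaderboard.
entries = [
    Entry("model-b", precision=60.0, recall=40.0),  # F1 = 48.0
    Entry("model-a", precision=40.0, recall=60.0),  # F1 = 48.0, higher recall
    Entry("model-c", precision=50.0, recall=50.0),  # F1 = 50.0
]
ranking = [e.model for e in rank(entries)]
```

With these made-up values, `model-c` ranks first on F1, and `model-a` is placed above `model-b` because the two tie on F1 and `model-a` has the higher recall.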

Column groups: Camera, Scene, Action, and Attribute belong to Dynamic Element Type (%); Low-Dy., High-Dy., Multi-Sc., and Multi-Su. belong to Visual Characteristic (%).

Model / Organization	LLM Params	Frames	Date	Overall (%)	Camera	Scene	Action	Attribute	Low-Dy.	High-Dy.	Multi-Sc.	Multi-Su.
GPT-4o-0806 OpenAI	-	1/2 fps^1*	2024-08-06	58.5	61.3	66.4	48.0	57.8	58.2	58.7	58.1	55.5
Gemini 1.5 Pro 002 Google	-	1/2 fps^1*	2024-05-24	57.4	60.7	63.3	46.3	56.0	58.7	56.7	57.0	53.3
Gemini 1.5 Flash 002 Google	-	1/2 fps^1*	2024-05-24	55.7	59.6	65.1	42.9	55.2	56.0	55.5	55.9	55.9
InternVL2-76B Shanghai AI Lab	72B	32	2024-07-04	51.9	53.9	61.4	41.2	50.9	52.8	51.5	51.1	49.3
Qwen2-VL-72B Alibaba	72B	2 fps^2*	2024-08-30	51.7	54.0	52.8	42.6	48.5	55.7	49.7	48.0	43.3
InternVL2-40B Shanghai AI Lab	34B	32	2024-07-04	51.7	55.1	59.0	39.3	52.3	53.9	50.5	50.5	48.0
MiniCPM-V 2.6 OpenBMB	8B	32	2024-08-06	51.7	56.0	60.6	38.8	50.2	53.0	51.0	51.7	49.0
LLaVA-Video-7B Bytedance & NTU S-Lab	7B	32	2024-09-30	51.0	50.4	58.9	37.8	53.1	52.2	50.3	50.0	45.8
LLaVA-Video-72B_SlowFast Bytedance & NTU S-Lab	72B	32	2024-09-30	50.2	50.3	56.4	39.3	50.8	50.6	50.0	49.3	45.7
LLaVA-OneVision-72B Bytedance & NTU S-Lab	72B	32	2024-08-05	49.6	51.9	57.7	36.0	48.8	48.6	45.9	50.1	49.4
LLaVA-OneVision-7B Bytedance & NTU S-Lab	7B	32	2024-08-05	49.3	51.0	57.6	36.8	49.3	50.0	48.9	48.4	43.8
InternVL2-26B Shanghai AI Lab	20B	32	2024-07-04	49.0	51.6	58.7	37.0	49.1	49.4	48.9	48.4	45.8
Qwen2-VL-7B Alibaba	7B	2 fps^2*	2024-08-30	48.9	49.0	56.7	37.0	46.7	53.8	46.4	44.4	39.9
Tarsier-34B Bytedance	34B	32	2024-07-04	48.2	42.3	44.4	47.6	42.2	49.1	47.8	49.6	47.3
Kangaroo Meituan & UCAS	8B	32	2024-07-17	42.7	44.1	51.9	31.9	39.5	45.6	41.1	39.3	35.7
InternVL2-8B Shanghai AI Lab	7B	32	2024-07-04	40.8	41.7	44.7	30.0	42.3	44.5	38.9	38.4	35.2
Tarsier-7B Bytedance	7B	32	2024-07-04	38.6	34.8	33.1	36.2	33.3	46.5	34.5	35.8	33.2
PLLaVA-34B NUS & NYU & Bytedance	34B	16	2024-04-24	34.2	37.4	39.9	22.3	33.2	38.9	31.8	30.2	27.6
LongVA NTU S-Lab	7B	32	2024-06-24	31.8	32.5	40.6	22.0	28.4	37.3	29.0	27.6	23.7
PLLaVA-13B NUS & NYU & Bytedance	13B	16	2024-04-24	30.6	33.0	40.3	18.5	29.8	36.0	27.8	26.0	24.3
PLLaVA-7B NUS & NYU & Bytedance	7B	16	2024-04-24	27.4	28.9	36.6	16.5	25.3	32.7	24.7	22.8	22.5

Date indicates the publication date of open-source models; "-" indicates closed-source models.

1* Videos are sampled at 2 fps when the video duration is under 16 s, and at 1 fps otherwise.

2* Videos are sampled at 2 fps, with an upper limit of 64 frames.
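As a rough illustration of the two sampling notes, the sketch below computes how many frames each rule would yield for a given video duration. The function names are hypothetical, and the rounding behavior (floor, with at least one frame) is an assumption; the notes do not specify it.

```python
import math


def frames_note_1(duration_s: float) -> int:
    """Note 1*: 2 fps for videos shorter than 16 s, otherwise 1 fps."""
    fps = 2 if duration_s < 16 else 1
    # Flooring and the minimum of one frame are assumptions.
    return max(1, math.floor(duration_s * fps))


def frames_note_2(duration_s: float, fps: int = 2, max_frames: int = 64) -> int:
    """Note 2*: sample at 2 fps, capped at 64 frames."""
    return max(1, min(max_frames, math.floor(duration_s * fps)))


short_clip = frames_note_1(10)   # 10 s at 2 fps
long_clip = frames_note_1(20)    # 20 s at 1 fps
capped = frames_note_2(40)       # 80 frames requested, capped at 64
```

Under these assumptions, a 10-second and a 20-second video both yield 20 frames under note 1*, while note 2* stops adding frames once a video exceeds 32 seconds.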

Benchmark

Data Examples

All data are newly annotated by humans, not from any existing video dataset.

Visual Characteristic: High-Dynamic, Multi-Scene, Multi-Subject | Domain: Film

The video begins with the camera focused on a wooden decorative piece, behind which a man is watching through it.

Then, the camera cuts to an outdoor scene with blurred edges and a clear center. A white news van with the logo “KXBD 6 News at 6” is visible by the roadside. Next to the van is a set-up camera, and a military green vehicle passes in front of the lens. In the background, greenery and a pedestrian path are visible. A woman with a bag on her right shoulder and a bag in her left hand walks along the sidewalk. The camera moves to the right, where a person is standing by the front passenger door of the news van, making a phone call.

Next, the camera cuts back indoors, where a man in a black suit suddenly turns to look inside. Behind him is an ornately decorated wall. The man in the suit turns again to look outside, then steps back while closing the door in front of him. He then turns and walks further into the room.

Visual Characteristic: High-Dynamic, Multi-Scene | Domain: Cooking

At the beginning of the video, the camera focuses on a transparent glass bowl containing liquid and some ingredients. A hand holding an egg whisk with a black handle stirs the liquid inside, which appears milky white, and bubbles can be seen forming on the surface during the stirring process. Next, the camera cuts to a hand placing several raw pieces of meat into the liquid. The meat pieces are submerged in the milky white liquid.

Then, the camera shifts to a wooden cutting board, showing a piece of raw chicken. One hand holds the chicken while the other holds a knife, slicing the chicken in half. The cut can be seen revealing the meat and bone. Both hands pick up the two halves of the chicken and turn them left and right to show the camera.

Subsequently, the camera returns to the glass bowl, where fingers gently stir the liquid. The fingers press down on a chicken piece, fully immersing it in the liquid. Afterward, the text prompt “Make Sure It's Submerged” appears on the screen.

Visual Characteristic: High-Dynamic | Domain: Daily Life

At the beginning of the video, the camera focuses on a black table, which has no items on it. In the background, a laptop is placed on the sofa, with some text displayed on the screen. There is a geometric-patterned cushion on the sofa.

Next, a man dressed in dark clothes enters from the left side of the scene, walks up to the table, and faces the camera without revealing his face. He draws three and a half circles on the table with his left index finger, finally stopping at the center of the table directly in front of him.

The man lowers his left hand, retrieves a slice of bread from under the table, and places it in the top-right corner of the table. Then, he places a transparent small jar at the center of the table with his right hand. Next, he places a thick book on the right side of the table, directly in front of the slice of bread. Then, he places a white plug behind the jar. Finally, he places a black data cable behind the jar, in front of the plug.

The man first walks to the left side of the scene, then walks towards the front of the scene, and finally disappears from the scene.

Visual Characteristic: High-Dynamic | Domain: Sports Activity

At the beginning of the video, the camera focuses on an indoor skateboarding park. The venue is spacious with a smooth floor, and the walls are adorned with skateboards of various colors and designs. On the left side of the skateboarding park, there are small ramps and a set of stairs. In the middle, there is a larger platform where a person dressed in all black stands. On the wall behind the scene, there is a black circular logo on the right and a slogan with a black background and white letters on the left.

A male skateboarder wearing a gray T-shirt and light-colored pants appears on the right side of the scene. He is wearing a white hat and has a skateboard under his feet as he glides forward, gradually approaching the camera. The camera follows his movements and rotates to the left. After skating over a small ramp, the skateboarder quickly jumps into the air, flipping the skateboard in mid-air. As a result, the skateboard lands with its wheels facing upwards. The skateboarder loses balance and steps away from the skateboard, taking a few steps forward.

The camera continues to rotate to the left, revealing two males on the left side of the platform. One male, wearing a black wool cap and a gray short-sleeved shirt, is seated at the edge of the platform. Behind the male in the gray short-sleeved shirt, another male stands and is looking down at an electronic device he is holding.

Visual Characteristic: Low-Dynamic, Multi-Scene | Domain: Driving

The video begins with the camera focused on a black Mercedes-Benz ML 350 4MATIC SUV. To the right of the Mercedes is a metal fence with weeds growing underneath it. To the right of the fence is the opposing lane, where vehicles are continuously driving. Ahead is a traffic intersection with a yellow traffic light. In the background, tall buildings and a pedestrian overpass can be seen, with the words "In Front of Qingshan College" written on it. The weather is sunny, with some white clouds in the sky.

Subsequently, the traffic light changes from yellow to red, and the brake lights of the black Mercedes SUV illuminate as it begins to move forward slowly. Below the red light on the traffic signal ahead, the right-turn arrow is green, and the camera starts to move forward.

The Mercedes turns right at the intersection, and the camera pans to the right along with the movement of the Mercedes. The Mercedes ahead enters a narrower road. A yellow solid line is painted in the middle of the road. A long line of vehicles is waiting to pass on the right lane, and a speed limit sign of 40 is painted on the ground of the left lane. There is a row of trees and shrubs on the right side of the street.

The camera continues to move forward, and ahead is another traffic intersection with a red traffic light. The Mercedes ahead begins to slow down. There is already a car stopped in front of the Mercedes. The camera pans to the right front, shifting the focus to the adjacent lane on the right of the Mercedes. Ahead is a white van, and on the right opposing lane, cars and motorcycles are lined up waiting to move forward.

Dataset Statistics

Detailed statistics for TUNA-1K, including: number of videos (#Videos), video duration (Duration), number of events (#Events), number of visual elements in captions (#Elements (Narrative-level)), number of visual elements in events (#Elements (Event-level)), and number of tokens per caption (#Tokens).

Benchmark Comparison

Comparison with various video understanding benchmarks across several aspects: number of videos (#Videos); number of samples (#Samp.); annotation method (Anno., with M/A denoting manual/automatic); domain (Domain); temporal orientation (Temporal Orientated); presence of scene transitions (Scene Trans.); consideration of camera (Camera) and scene (Scene); use of keypoints (Key.) for controllability and interpretability; judgement of semantically identical yet diverse representations (Sem.); availability of multi-dimensional scores (M.D.); and whether global (Global) and fine-grained (Fine.) understanding are considered.

Several video understanding benchmark examples and analysis.

Method

Dataset Construction

Overview of TUNA-1K construction. We collect and filter high-quality short videos featuring dynamic temporal content from various sources. Each video is then categorized by its visual characteristics and domain. Trained annotators provide temporally dense descriptions, which are then cross-validated. Video experts continuously review the annotations and guide annotators in refining their work, ensuring annotation quality.

Automatic Captioning Evaluation

Overview of the evaluation workflow for TUNA-CAP. We first split the candidate caption into multiple events and match them to reference events in TUNA-1K. We then discard the mismatched events (useless content or inconsistent chronology) and merge the candidate events matched to the same reference event, taking the temporal sequence of the captions into account. Finally, we classify the relationship of each visual element to the candidate event.
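The bookkeeping between matching and element classification can be sketched as follows. This is a minimal illustration under stated assumptions, not the TUNA-CAP implementation: the function name `group_matches` and the input format are hypothetical, the event splitting, matching, and element classification steps are assumed to happen elsewhere, and "inconsistent chronology" is interpreted here as a match that points to an earlier reference event than one already accepted.

```python
from collections import defaultdict
from typing import Optional


def group_matches(matches: list[Optional[int]]) -> dict[int, list[int]]:
    """Group candidate events by the reference event they were matched to.

    matches[i] is the index of the reference event matched to candidate
    event i, or None if no reference event matched (hypothetical format).
    """
    groups: dict[int, list[int]] = defaultdict(list)
    last_ref = -1
    for cand_idx, ref_idx in enumerate(matches):
        if ref_idx is None:
            continue  # mismatched: content with no reference counterpart
        if ref_idx < last_ref:
            continue  # mismatched: breaks chronological order (assumed rule)
        groups[ref_idx].append(cand_idx)  # merge with others on the same reference
        last_ref = ref_idx
    return dict(groups)


# Six candidate events; candidate 2 has no match and candidate 4 jumps
# back to an earlier reference event, so both are discarded.
grouped = group_matches([0, 0, None, 2, 1, 2])
```

Here `grouped` maps reference event 0 to candidates 0 and 1, and reference event 2 to candidates 3 and 5. Each group would then be passed on as one merged candidate event for the final element-level classification.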

Analysis

Video Complexity / Input Frames

Video Complexity

Performance comparison across different video complexities.

Enrichment of Visual Inputs

Performance comparison across different numbers of input frames.

Video Complexity & Enrichment of Visual Inputs

Performance comparison across different numbers of input frames and video complexities for models trained with long contexts (over 8K tokens). The horizontal axis is the number of input frames.

More Data Examples

TUNA-1K

A detailed example in TUNA-1K.

TUNA-MCQ

Several examples in TUNA-MCQ, involving Camera Motion, Camera Transition, Scene Description and Scene Transition tasks.

Several examples in TUNA-MCQ, involving Action Recognition, Action Sequence, and Action-Subject Matching tasks.

Several examples in TUNA-MCQ, involving Object Recognition, Object Appearance, and Object Location tasks.

Citation


    @article{kong2025tuna,
      title={TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos},
      author={Kong, Fanheng and Zhang, Jingyuan and Zhang, Hongzhi and Feng, Shi and Wang, Daling and Tian, Yu and Yu, Linhao and Ji, Xingguang and W., Victoria and Zhang, Fuzheng},
      journal={arXiv preprint arXiv:2505.20124},
      year={2025}
    }