Jiwook Han¹*, Geo Ahn¹*, Youngrae Kim²*, Jinwoo Choi¹†
¹ Kyung Hee University · ² University of Southern California
* Equally contributed first authors · † Corresponding author
Naïvely fine-tuned MLLMs memorize visual shortcuts. We inject object-centric visual representations through a lightweight Slot Adapter — boosting OOD robustness while training only 0.25% of the parameters.
Per frame, the Slot Adapter runs iterative slot attention to compress visual tokens into K=4 competing slots. We never tell the model which slot should be which; DINOv2 objectness priors gently nudge the four slots toward semantically coherent regions. The LLM decoder then reasons over the reconstructed entity-aware tokens to produce a temporal window [t_start, t_end].
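The competition among slots can be illustrated with a minimal NumPy sketch of the generic slot-attention update. It is not the SlotVTG implementation: the function name `slot_attention`, the random slot initialization, and `num_iters=3` are assumptions for illustration, and the learned query/key/value projections, GRU, and MLP of the usual formulation are left out. Only K=4 comes from the text above.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(tokens, num_slots=4, num_iters=3, seed=0, eps=1e-8):
    """Simplified slot attention over one frame's visual tokens.

    tokens: (N, D) array of visual tokens.
    Returns slots (K, D) and the attention map (K, N) from the last iteration.
    Assumption: no learned projections, GRU, or MLP (illustration only).
    """
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    slots = rng.normal(size=(num_slots, d))  # hypothetical random init
    for _ in range(num_iters):
        logits = slots @ tokens.T / np.sqrt(d)  # (K, N) similarity scores
        # Softmax over the slot axis: slots compete for each token.
        attn = softmax(logits, axis=0)
        # Normalize over tokens so each slot becomes a weighted mean of tokens.
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
        slots = weights @ tokens  # (K, D)
    return slots, attn

tokens = np.random.default_rng(1).normal(size=(16, 8))  # 16 toy tokens, D=8
slots, attn = slot_attention(tokens)
print(slots.shape, attn.shape)  # (4, 8) (4, 16)
```

Because the softmax runs over slots rather than over tokens, every token distributes a total weight of 1 across the four slots, which is what makes them compete.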
Abstract
Fine-tuning MLLMs for Video Temporal Grounding (VTG) makes them memorize dataset-specific shortcuts rather than ground in actual visual content, hurting Out-of-Domain (OOD) generalization. Object-centric learning is a promising fix, but existing approaches require re-running the full vision–language alignment and instruction-tuning pipeline from scratch.
SlotVTG is a lightweight slot adapter that plugs into a pre-trained MLLM and residually adds entity-level structure: it decomposes visual tokens into slots via slot attention, then reconstructs the original sequence. A Slot Alignment loss distills objectness priors from DINOv2 into the slot attention map. Cross-domain evaluation shows consistent OOD gains while preserving In-Domain accuracy — with only 0.25% trainable parameters.
The Problem
A naïvely fine-tuned Qwen2.5-VL-3B drops 31% off-domain. Four diagnostics tell us why — and why object-centric decomposition fixes it.
Approach
SlotVTG has two ingredients: a Slot Adapter that decomposes visual tokens into entity-level slots, and a Slot Alignment Loss that distills objectness priors from DINOv2 into the slot attention map.
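One way to picture the adapter's "decompose, then reconstruct, then add residually" path is the sketch below. It is a hedged illustration rather than the authors' code: the function `slot_adapter_residual`, the output projection `w_out`, and its zero initialization are hypothetical choices, and the slots and attention map are random stand-ins for what slot attention would produce.

```python
import numpy as np

def slot_adapter_residual(tokens, slots, attn, w_out):
    """Residual reconstruction of visual tokens from slots (illustrative only).

    tokens: (N, D) original visual tokens
    slots:  (K, D) slot vectors
    attn:   (K, N) attention map whose columns sum to 1 over slots
    w_out:  (D, D) hypothetical output projection
    """
    # Each token is rebuilt as a mixture of slots, weighted by its attention column.
    recon = attn.T @ slots  # (N, D)
    # Residual add: the pre-trained token features are kept and only adjusted.
    return tokens + recon @ w_out

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
slots = rng.normal(size=(4, 8))           # K=4 slots, as in the text
attn = rng.random(size=(4, 6))
attn /= attn.sum(axis=0, keepdims=True)   # make columns sum to 1 over slots

# Assumption: a zero-initialized projection makes the adapter start as an identity map.
w_zero = np.zeros((8, 8))
out = slot_adapter_residual(tokens, slots, attn, w_zero)
print(np.allclose(out, tokens))  # True
```

With a zero-initialized projection the output equals the input, so under this assumption the pre-trained MLLM would see unchanged tokens at the start of training.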
Slot Visualization
We visualize the slot attention maps by masking each frame region with its highest-attending slot. Across both In-Domain (Charades-STA) and Out-of-Domain (QVHighlights, ActivityNet) samples, the slots decompose scenes into semantically coherent regions (people, objects, and backgrounds), though the specific slot-to-entity mapping varies across frames.
Importantly, this decomposition generalizes to unseen domains (QVHighlights, ActivityNet) without any domain-specific supervision, confirming that the Slot Adapter learns transferable entity-level representations rather than dataset-specific patterns.
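The masking step behind these visualizations amounts to an argmax over slots for every patch. A minimal sketch, assuming the attention map is stored as a (K, H·W) array with patches in row-major order; the function name `slot_mask` and the toy 2×2 grid are illustrative, not taken from the project code.

```python
import numpy as np

def slot_mask(attn, grid_h, grid_w):
    """Assign each patch to its highest-attending slot.

    attn: (K, grid_h * grid_w) attention of K slots over patches
          (assumed row-major patch order).
    Returns an integer map of shape (grid_h, grid_w) holding slot indices.
    """
    k, n = attn.shape
    if n != grid_h * grid_w:
        raise ValueError("attention map does not match the patch grid")
    return attn.argmax(axis=0).reshape(grid_h, grid_w)

# Toy attention for K=4 slots over a 2x2 patch grid (columns sum to 1).
attn = np.array([
    [0.7, 0.1, 0.2, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.2, 0.5, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
print(slot_mask(attn, 2, 2).tolist())  # [[0, 1], [2, 3]]
```

The resulting index map can then be colored per slot and overlaid on the frame.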
Cross-Domain Results
Compared against the strong Chrono-Qwen baseline, SlotVTG improves OOD R1@0.5 by up to +4.3 while staying on par with ID accuracy — all with a 0.25%-parameter adapter.
| Source Dataset | LLM | Method | Charades-STA R1@0.3 | Charades-STA R1@0.5 | Charades-STA R1@0.7 | Charades-STA mIoU | ActivityNet R1@0.3 | ActivityNet R1@0.5 | ActivityNet R1@0.7 | ActivityNet mIoU | QVHighlights R1@0.3 | QVHighlights R1@0.5 | QVHighlights R1@0.7 | QVHighlights mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero-Shot | 7B | HawkEye | 50.6 | 31.4 | 14.5 | 33.7 | 49.1 | 29.3 | 10.7 | 32.7 | – | – | – | – |
| Zero-Shot | 7B | TimeSuite | 69.9 | 48.7 | 24.0 | – | – | 16.6 | 9.3 | 22.0 | – | 12.3 | 9.2 | 21.3 |
| Zero-Shot | 7B | UniTime | – | 59.1 | 31.9 | 52.2 | – | 22.8 | 14.1 | 27.3 | – | 41.0 | 31.5 | 43.7 |
| Zero-Shot | 2B | VideoMind | 67.6 | 51.1 | 26.0 | 45.2 | 44.0 | 26.5 | 12.6 | 30.1 | – | – | – | – |
| Zero-Shot | 7B | VideoMind | 73.5 | 59.1 | 31.2 | 50.2 | 48.4 | 30.3 | 15.7 | 33.3 | – | – | – | – |
| Charades-STA | – | EaTR | 67.7 | 55.2 | 33.1 | 47.7 | 36.9 | 18.8 | 7.3 | 24.1 | 31.7 | 17.0 | 6.4 | 21.5 |
| Charades-STA | – | CG-DETR | 69.7 | 57.6 | 35.1 | 49.5 | 32.6 | 16.8 | 6.8 | 22.1 | 37.4 | 22.8 | 10.5 | 25.2 |
| Charades-STA | 4B | Chrono-BLIP | 77.5 | 68.8 | 48.5 | 57.2 | 41.8 | 22.4 | 9.7 | 27.7 | 66.6 | 43.9 | 23.7 | 43.9 |
| Charades-STA | 3B | Chrono-Qwen | 77.2 | 63.4 | 40.3 | 55.2 | 44.4 | 26.3 | 13.1 | 30.1 | 63.3 | 43.6 | 23.3 | 42.7 |
| Charades-STA | 3B | SlotVTG (Ours) | 77.2 | 64.0 | 41.2 | 55.4 | 47.7 | 28.7 | 14.4 | 32.2 | 66.0 | 47.9 | 26.2 | 45.0 |
| Charades-STA | 7B | Chrono-Qwen | 79.1 | 67.8 | 46.9 | 58.1 | 46.5 | 29.2 | 14.6 | 32.6 | 70.3 | 53.5 | 29.6 | 49.0 |
| Charades-STA | 7B | SlotVTG (Ours) | 79.5 | 67.6 | 46.7 | 58.3 | 52.0 | 33.2 | 16.7 | 35.5 | 74.0 | 57.6 | 32.2 | 51.3 |
| QVHighlights | – | EaTR | 40.8 | 27.2 | 13.0 | 28.0 | 36.7 | 20.9 | 9.7 | 25.3 | 70.3 | 59.6 | 40.3 | 53.1 |
| QVHighlights | – | CG-DETR | 42.8 | 25.5 | 12.2 | 28.5 | 37.7 | 21.5 | 10.4 | 26.0 | 77.5 | 65.6 | 52.1 | 61.3 |
| QVHighlights | 4B | Chrono-BLIP | 61.5 | 37.0 | 19.8 | 41.4 | 41.8 | 22.4 | 9.7 | 27.7 | 86.1 | 76.8 | 62.8 | 70.8 |
| QVHighlights | 3B | Chrono-Qwen | 70.6 | 45.7 | 21.8 | 45.7 | 55.2 | 35.3 | 20.8 | 39.2 | 87.6 | 79.1 | 64.8 | 71.7 |
| QVHighlights | 3B | SlotVTG (Ours) | 70.7 | 46.6 | 22.6 | 46.1 | 56.1 | 35.7 | 21.1 | 40.0 | 87.3 | 79.5 | 64.6 | 71.7 |
| QVHighlights | 7B | Chrono-Qwen | 75.2 | 53.3 | 27.4 | 49.9 | 60.7 | 41.4 | 24.8 | 43.4 | 90.7 | 81.8 | 67.6 | 74.9 |
| QVHighlights | 7B | SlotVTG (Ours) | 76.0 | 53.7 | 28.2 | 50.4 | 61.7 | 42.0 | 25.3 | 44.1 | 91.3 | 82.9 | 69.3 | 76.0 |
Bold: best within each (source · LLM size) group on each target. DETR-based methods (EaTR, CG-DETR) use pre-extracted CLIP + SlowFast features at 0.5 fps. Zero-shot rows are reference numbers without source fine-tuning.
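For readers unfamiliar with the column names, the sketch below shows how R1@m and mIoU are commonly computed in video temporal grounding. The page does not spell out its evaluation code, so treat this as an assumption: R1@m is taken as the percentage of queries whose top-1 predicted window reaches temporal IoU of at least m with the ground truth, and mIoU as the mean IoU in percent. The function names and the two toy windows are illustrative.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) windows given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, threshold):
    """Percentage of queries whose top-1 window has IoU >= threshold."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return 100.0 * sum(iou >= threshold for iou in ious) / len(ious)

def mean_iou(preds, gts):
    """Mean temporal IoU over all queries, in percent."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return 100.0 * sum(ious) / len(ious)

# Two toy queries: the first overlaps half of the union, the second is exact.
preds = [(2.0, 8.0), (0.0, 5.0)]
gts = [(4.0, 10.0), (0.0, 5.0)]
print(recall_at_1(preds, gts, 0.5))  # 100.0
print(recall_at_1(preds, gts, 0.7))  # 50.0
print(mean_iou(preds, gts))          # 75.0
```

Under this reading, the +4.3 OOD gain quoted above corresponds to the Charades-STA to QVHighlights R1@0.5 entry for the 3B models (43.6 to 47.9).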
If you find this useful
@inproceedings{han2026slotvtg,
title = {{SlotVTG}: Object-Centric Adapter for Generalizable Video Temporal Grounding},
author = {Han, Jiwook and Ahn, Geo and Kim, Youngrae and Choi, Jinwoo},
booktitle = {CVPR 2026 Workshop on Grounded Retrieval and Agentic Intelligence for Vision-Language (GRAIL-V)},
year = {2026}
}