SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding

Jiwook Han¹*, Geo Ahn¹*, Youngrae Kim²*, Jinwoo Choi¹†

¹ Kyung Hee University   ·   ² University of Southern California

* Equally contributed first authors  ·  † Corresponding author

Naïvely fine-tuned MLLMs memorize visual shortcuts. We inject object-centric visual representations through a lightweight Slot Adapter — boosting OOD robustness while training only 0.25% of the parameters.

our philosophy, animated

Each frame, decomposed into entity-level slots, emergently.

Per frame, the Slot Adapter runs iterative slot attention to compress visual tokens into K = 4 competing slots. We never tell the model which slot should be which; DINOv2 objectness priors gently nudge the four slots toward semantically coherent regions. The LLM decoder then reasons over the reconstructed entity-aware tokens to produce a temporal window [t_start, t_end].

1 Video & query
2 Slot attention
3 Reconstruction
4 Residual addition
5 LLM → [ts, te]
Slot 1 Slot 2 Slot 3 Slot 4 · assignment is emergent; not labeled

Abstract

Bringing object-centric learning into MLLMs — cheaply.

Fine-tuning MLLMs for Video Temporal Grounding (VTG) makes them memorize dataset-specific shortcuts rather than ground in actual visual content, hurting Out-of-Domain (OOD) generalization. Object-centric learning is a promising fix, but existing approaches require re-running the full vision–language alignment and instruction-tuning pipeline from scratch.

SlotVTG is a lightweight slot adapter that plugs into a pre-trained MLLM and residually adds entity-level structure: it decomposes visual tokens into slots via slot attention, then reconstructs the original sequence. A Slot Alignment loss distills objectness priors from DINOv2 into the slot attention map. Cross-domain evaluation shows consistent OOD gains while preserving In-Domain accuracy — with only 0.25% trainable parameters.

The Problem

Fine-tuned MLLMs don't see — they memorize.

A naïvely fine-tuned Qwen2.5-VL-3B drops 31% off-domain. Four diagnostics tell us why — and why object-centric decomposition fixes it.

Four diagnostic observations: (a) ID vs OOD gap, (b) visual similarity, (c) noise perturbation, (d) MMD comparison
OOD performance drops 31% (relative). R1@0.5 falls from 63.4 in-domain (Cha. → Cha.) to 43.6 on Cha. → QVH.
The drop tracks visual distance. Most-similar OOD: 52.8; most-dissimilar: 39.1.
The model stops looking. On OOD, perturbing the GT segment vs. a random one yields nearly the same drop (gap = 0.5 percentage points).
Slots shrink the gap by 49.6%. Source–target MMD: 0.192 → 0.097.
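The two headline percentages above are relative changes, and they can be re-derived from the quoted numbers. The snippet below is only an arithmetic check; `relative_drop` is a helper name introduced here for illustration.

```python
def relative_drop(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# R1@0.5: in-domain Charades-STA vs. Charades-STA -> QVHighlights.
r1_drop = relative_drop(63.4, 43.6)

# Source-target MMD without slots vs. with slots.
mmd_reduction = relative_drop(0.192, 0.097)

print(f"R1@0.5 relative drop:   {r1_drop:.1f}%")
print(f"MMD relative reduction: {mmd_reduction:.1f}%")
```

With the rounded MMD values shown on this page the reduction comes out at roughly 49.5%, so the reported 49.6% was presumably computed from unrounded values.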

Approach

A 7.6M-parameter bottleneck, plugged into the LLM.

SlotVTG has two ingredients: a Slot Adapter that decomposes visual tokens into entity-level slots, and a Slot Alignment Loss that distills objectness priors from DINOv2 into the slot attention map.
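To make the first ingredient concrete, here is a minimal NumPy sketch of a slot-attention bottleneck with a residual connection, using the K = 4 slots, 3 iterations, and N = 64 tokens per frame quoted on this page. It is a simplification under stated assumptions, not the authors' implementation: the weights are random stand-ins, the slot update is a plain weighted mean (the usual GRU and MLP refinements are omitted), the reconstruction rule is a guess, and the names `init_params`, `slot_adapter`, and the toy dimensions are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, token_dim, slot_dim, num_slots):
    """Random stand-ins for the adapter's learned weights (hypothetical names)."""
    def w(rows, cols):
        return rng.normal(scale=rows ** -0.5, size=(rows, cols))
    return {
        "slots_init": rng.normal(size=(num_slots, slot_dim)),
        "W_q": w(slot_dim, slot_dim),
        "W_k": w(token_dim, slot_dim),
        "W_v": w(token_dim, slot_dim),
        "W_out": w(slot_dim, token_dim),
    }


def slot_adapter(tokens, params, num_iters=3, eps=1e-8):
    """tokens: (N, D) visual tokens of one frame.

    Returns the residually updated tokens (N, D) and the attention map (N, K).
    """
    slots = params["slots_init"]                      # (K, d)
    keys = tokens @ params["W_k"]                     # (N, d)
    values = tokens @ params["W_v"]                   # (N, d)
    scale = keys.shape[1] ** -0.5
    for _ in range(num_iters):
        queries = slots @ params["W_q"]               # (K, d)
        # Softmax over the slot axis: slots compete for each token.
        attn = softmax(keys @ queries.T * scale, axis=1)          # (N, K)
        # Normalise per slot so each slot becomes a weighted mean of values.
        weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
        slots = weights.T @ values                    # (K, d)
    # Assumed reconstruction: rebuild every token from the slots it attends to.
    recon = (attn @ slots) @ params["W_out"]          # (N, D)
    return tokens + recon, attn                       # residual addition


rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 32))                    # N = 64, toy D = 32
params = init_params(rng, token_dim=32, slot_dim=16, num_slots=4)
out, attn = slot_adapter(tokens, params)
print(out.shape, attn.shape)                          # (64, 32) (64, 4)
```

Because the softmax runs over slots rather than tokens, each token's attention sums to 1 across the four slots, which is what makes the slots compete for image regions.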

Visual tokens (per frame): N × D, with N = 64
Slot attention: K = 4 slots, I = 3 iterations
Reconstruction: entity-aware tokens
Residual addition, then back to the LLM
Fig. 3 from the paper. (a) Slot Adapter is inserted into the early LLM-decoder layers and adds residually to the token stream. (b) Slot Alignment Loss aligns slot-attention similarity with DINOv2 affinity.
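One plausible reading of "aligning slot-attention similarity with DINOv2 affinity" is sketched below. Everything specific here is an assumption: the slot-attention similarity is taken as A·Aᵀ for an attention map A of shape (N, K), the teacher affinity as clipped cosine similarity between patch features, and the penalty as a mean squared error. The exact loss used in the paper is not given on this page. The teacher features are toy arrays standing in for DINOv2 patch embeddings, and the function names are hypothetical.

```python
import numpy as np


def cosine_affinity(feats, eps=1e-8):
    """Pairwise cosine similarity of (N, C) patch features, clipped to [0, 1]."""
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)
    return np.clip(normed @ normed.T, 0.0, 1.0)


def slot_alignment_loss(attn, teacher_feats):
    """Assumed form: MSE between slot co-assignment and teacher affinity.

    attn:          (N, K) slot attention, each row sums to 1.
    teacher_feats: (N, C) patch features from a frozen teacher (toy data here).
    """
    slot_sim = attn @ attn.T        # (N, N): high when two tokens share a slot
    target = cosine_affinity(teacher_feats)
    return float(np.mean((slot_sim - target) ** 2))


# Toy check: four "objects", encoded as one-hot teacher features.
labels = np.arange(64) % 4
teacher = np.eye(4)[labels]

attn_match = np.eye(4)[labels]                  # slots follow the teacher grouping
attn_off = np.eye(4)[np.arange(64) // 16]       # slots group tokens differently

loss_match = slot_alignment_loss(attn_match, teacher)
loss_off = slot_alignment_loss(attn_off, teacher)
print(loss_match < loss_off)                    # True
```

The sketch also assumes the teacher's patch grid has already been resized to match the N visual tokens of the adapter; in practice the two grids would need to be brought to the same resolution first.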

Slot Visualization

What do the slots actually learn?

Fig. 4 from the paper. Slot visualization for one Charades-STA (ID), one QVHighlights (OOD), and one ActivityNet (OOD) sample. Each row shows the original frame and its four slot masks, where every region is assigned to its highest-attending slot.

Decomposition generalizes — without any domain-specific supervision.

We visualize the slot attention maps by masking each frame region with its highest-attending slot. Across both In-Domain (Charades-STA) and Out-of-Domain (QVHighlights, ActivityNet) samples, the slots decompose scenes into semantically coherent regions (people, objects, and backgrounds), though the specific slot-to-entity mapping varies across frames.

Importantly, this decomposition generalizes to unseen domains (QVH, ANet) without any domain-specific supervision, confirming that the Slot Adapter learns transferable entity-level representations rather than dataset-specific patterns.
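The masking step described above amounts to an argmax over the slot axis. The sketch below assumes the N = 64 tokens of a frame form an 8 × 8 grid and uses a hypothetical patch size of 14 pixels for upsampling; both are illustrative choices, and the attention map is toy data rather than model output.

```python
import numpy as np


def slot_mask(attn, grid_hw):
    """attn: (N, K) slot attention. Returns an (H, W) map of winning slot ids."""
    h, w = grid_hw
    assert attn.shape[0] == h * w, "token count must match the grid"
    return attn.argmax(axis=1).reshape(h, w)


def upsample(mask, patch):
    """Nearest-neighbour upsampling of a patch-level mask to pixel resolution."""
    return np.repeat(np.repeat(mask, patch, axis=0), patch, axis=1)


# Toy attention: left half of the frame prefers slot 0, right half slot 1.
attn = np.full((64, 4), 0.1)
cols = np.arange(64) % 8
attn[cols < 4, 0] = 0.7
attn[cols >= 4, 1] = 0.7

mask = slot_mask(attn, (8, 8))
pixel_mask = upsample(mask, patch=14)       # hypothetical 14-pixel patches
print(mask[0])                              # [0 0 0 0 1 1 1 1]
print(pixel_mask.shape)                     # (112, 112)
```

Colouring `pixel_mask` by slot id and overlaying it on the frame gives the kind of per-slot visualization described in this section.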

Cross-Domain Results

Consistent OOD gains. Preserved ID performance.

Compared against the strong Chrono-Qwen baseline, SlotVTG improves OOD R1@0.5 by up to +4.3 while staying on par with ID accuracy — all with a 0.25%-parameter adapter.

Source: Charades-STA · 3B: +4.3 R1@0.5 on QVHighlights (43.6 → 47.9)
Source: Charades-STA · 7B: +4.0 R1@0.5 on ActivityNet (29.2 → 33.2)
Source        LLM Method          | Target: Charades-STA        | Target: ActivityNet         | Target: QVHighlights
                                  | R1@0.3 R1@0.5 R1@0.7   mIoU | R1@0.3 R1@0.5 R1@0.7   mIoU | R1@0.3 R1@0.5 R1@0.7   mIoU
Zero-Shot     7B  HawkEye         |   50.6   31.4   14.5   33.7 |   49.1   29.3   10.7   32.7 |      -      -      -      -
Zero-Shot     7B  TimeSuite       |   69.9   48.7   24.0      - |      -   16.6    9.3   22.0 |      -   12.3    9.2   21.3
Zero-Shot     7B  UniTime         |      -   59.1   31.9   52.2 |      -   22.8   14.1   27.3 |      -   41.0   31.5   43.7
Zero-Shot     2B  VideoMind       |   67.6   51.1   26.0   45.2 |   44.0   26.5   12.6   30.1 |      -      -      -      -
Zero-Shot     7B  VideoMind       |   73.5   59.1   31.2   50.2 |   48.4   30.3   15.7   33.3 |      -      -      -      -
Charades-STA  -   EaTR            |   67.7   55.2   33.1   47.7 |   36.9   18.8    7.3   24.1 |   31.7   17.0    6.4   21.5
Charades-STA  -   CG-DETR         |   69.7   57.6   35.1   49.5 |   32.6   16.8    6.8   22.1 |   37.4   22.8   10.5   25.2
Charades-STA  4B  Chrono-BLIP     |   77.5   68.8   48.5   57.2 |   41.8   22.4    9.7   27.7 |   66.6   43.9   23.7   43.9
Charades-STA  3B  Chrono-Qwen     |   77.2   63.4   40.3   55.2 |   44.4   26.3   13.1   30.1 |   63.3   43.6   23.3   42.7
Charades-STA  3B  SlotVTG (Ours)  |   77.2   64.0   41.2   55.4 |   47.7   28.7   14.4   32.2 |   66.0   47.9   26.2   45.0
Charades-STA  7B  Chrono-Qwen     |   79.1   67.8   46.9   58.1 |   46.5   29.2   14.6   32.6 |   70.3   53.5   29.6   49.0
Charades-STA  7B  SlotVTG (Ours)  |   79.5   67.6   46.7   58.3 |   52.0   33.2   16.7   35.5 |   74.0   57.6   32.2   51.3
QVHighlights  -   EaTR            |   40.8   27.2   13.0   28.0 |   36.7   20.9    9.7   25.3 |   70.3   59.6   40.3   53.1
QVHighlights  -   CG-DETR         |   42.8   25.5   12.2   28.5 |   37.7   21.5   10.4   26.0 |   77.5   65.6   52.1   61.3
QVHighlights  4B  Chrono-BLIP     |   61.5   37.0   19.8   41.4 |   41.8   22.4    9.7   27.7 |   86.1   76.8   62.8   70.8
QVHighlights  3B  Chrono-Qwen     |   70.6   45.7   21.8   45.7 |   55.2   35.3   20.8   39.2 |   87.6   79.1   64.8   71.7
QVHighlights  3B  SlotVTG (Ours)  |   70.7   46.6   22.6   46.1 |   56.1   35.7   21.1   40.0 |   87.3   79.5   64.6   71.7
QVHighlights  7B  Chrono-Qwen     |   75.2   53.3   27.4   49.9 |   60.7   41.4   24.8   43.4 |   90.7   81.8   67.6   74.9
QVHighlights  7B  SlotVTG (Ours)  |   76.0   53.7   28.2   50.4 |   61.7   42.0   25.3   44.1 |   91.3   82.9   69.3   76.0

- = not reported.

Bold: best within each (source · LLM size) group on each target. DETR-based methods (EaTR, CG-DETR) use pre-extracted CLIP + SlowFast features at 0.5 fps. Zero-shot rows are reference numbers without source fine-tuning.
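For readers unfamiliar with the column headers, the sketch below shows how R1@m and mIoU are conventionally computed for temporal grounding: the top-1 predicted window is compared with the ground-truth window by temporal IoU, R1@m is the percentage of queries with IoU at or above m, and mIoU is the mean IoU. The `>=` comparison and the toy windows are assumptions for illustration; the page does not specify the authors' evaluation script.

```python
def temporal_iou(pred, gt):
    """IoU between two time windows given as (start, end) in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def evaluate(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Top-1 recall at each IoU threshold and mean IoU, all in percent."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    report = {f"R1@{t}": 100.0 * sum(iou >= t for iou in ious) / n
              for t in thresholds}
    report["mIoU"] = 100.0 * sum(ious) / n
    return report


# Toy predictions and ground truths (seconds), not data from the paper.
preds = [(2.0, 8.0), (0.0, 5.0), (10.0, 12.0)]
gts = [(2.0, 10.0), (4.0, 9.0), (20.0, 25.0)]

report = evaluate(preds, gts)
for name, value in report.items():
    print(f"{name}: {value:.1f}")
```

In this toy case only the first prediction (IoU 0.75) clears any threshold, so all three recall values are one in three, while mIoU also credits the partial overlap of the second prediction.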

If you find this useful

Cite SlotVTG

@inproceedings{han2026slotvtg,
  title     = {{SlotVTG}: Object-Centric Adapter for Generalizable Video Temporal Grounding},
  author    = {Han, Jiwook and Ahn, Geo and Kim, Youngrae and Choi, Jinwoo},
  booktitle = {CVPR 2026 Workshop on Grounded Retrieval and Agentic Intelligence for Vision-Language (GRAIL-V)},
  year      = {2026}
}