SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding

Our philosophy, animated

Each frame, decomposed into entity-level slots — emergently.

Per frame, the Slot Adapter runs iterative slot attention to compress visual tokens into K=4 competing slots. We never tell the model which slot should be which — DINOv2 objectness priors gently nudge the four slots toward semantically coherent regions. The LLM decoder then reasons over the reconstructed entity-aware tokens to produce a temporal window [t_start, t_end].

1 Video & query

2 Slot attention

3 Reconstruction

4 Residual addition

5 LLM → [t_s, t_e]

Slot 1 Slot 2 Slot 3 Slot 4 · assignment is emergent; not labeled

Abstract

Bringing object-centric learning into MLLMs — cheaply.

Fine-tuning MLLMs for Video Temporal Grounding (VTG) makes them memorize dataset-specific shortcuts rather than grounding their predictions in actual visual content, which hurts Out-of-Domain (OOD) generalization. Object-centric learning is a promising fix, but existing approaches require re-running the full vision–language alignment and instruction-tuning pipeline from scratch.

SlotVTG is a lightweight slot adapter that plugs into a pre-trained MLLM and residually adds entity-level structure: it decomposes visual tokens into slots via slot attention, then reconstructs the original sequence. A Slot Alignment loss distills objectness priors from DINOv2 into the slot attention map. Cross-domain evaluation shows consistent OOD gains while preserving In-Domain accuracy — with only 0.25% trainable parameters.

The Problem

Fine-tuned MLLMs don't see — they memorize.

A naïvely fine-tuned Qwen2.5-VL-3B loses 31% of its R1@0.5 out of domain. Four diagnostics explain why, and why object-centric decomposition helps.

Four diagnostic observations: (a) ID vs OOD gap, (b) visual similarity, (c) noise perturbation, (d) MMD comparison

OOD performance drops 31%. R1@0.5 for Charades-STA → QVHighlights falls from 63.4 to 43.6, a 31% relative drop.

The drop tracks visual distance. Most-similar OOD: 52.8; most-dissimilar: 39.1.

The model stops looking. On OOD data, perturbing the ground-truth (GT) segment and perturbing a random segment yield nearly the same performance drop (a gap of only 0.5 percentage points).
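The perturbation diagnostic above can be sketched as follows. This is a minimal illustration, not the authors' protocol: Gaussian noise, an equal-length random segment, and the helper names `perturb_segment`, `random_segment`, and `sensitivity_gap` are all assumptions. The scores passed to `sensitivity_gap` would come from running the grounding model, which is outside this sketch.

```python
import numpy as np

def perturb_segment(frames, start, end, sigma=1.0, seed=0):
    """Add Gaussian noise to frames[start:end]; all other frames stay untouched.

    The noise type and scale are assumptions for illustration.
    """
    rng = np.random.default_rng(seed)
    out = frames.astype(np.float64).copy()
    out[start:end] += rng.normal(scale=sigma, size=out[start:end].shape)
    return out

def random_segment(num_frames, length, seed=0):
    """Pick a random (start, end) window of the given length inside the video."""
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, num_frames - length + 1))
    return start, start + length

def sensitivity_gap(score_clean, score_gt_perturbed, score_random_perturbed):
    """Difference between the drop from perturbing the GT segment and the
    drop from perturbing a random segment, in the same units as the scores."""
    drop_gt = score_clean - score_gt_perturbed
    drop_random = score_clean - score_random_perturbed
    return drop_gt - drop_random
```

A gap near zero means the model is no more sensitive to the ground-truth segment than to an arbitrary one, which is the symptom described above.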

Slots shrink the gap by 49.6%. Source–target MMD: 0.192 → 0.097.
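For readers who want to reproduce this kind of comparison, here is a minimal sketch of a squared-MMD estimate between source and target feature sets, plus the relative-reduction arithmetic. The RBF kernel, the `gamma` value, and the biased estimator are assumptions; the page does not state which kernel or estimator was used, nor whether the reported numbers are MMD or squared MMD.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2) for rows of a and b."""
    sq_dist = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def mmd2(x, y, gamma=0.1):
    """Biased estimate of squared MMD between samples x (n, d) and y (m, d)."""
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)

def relative_reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before
```

Applied to the rounded figures quoted here, `relative_reduction(0.192, 0.097)` gives roughly 0.495, in line with the reported 49.6% (the small difference presumably comes from rounding of the displayed MMD values). The same function gives about 0.31 for the R1@0.5 drop from 63.4 to 43.6.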

Approach

A 7.6M-parameter bottleneck, plugged into the LLM.

SlotVTG has two ingredients: a Slot Adapter that decomposes visual tokens into entity-level slots, and a Slot Alignment Loss that distills objectness priors from DINOv2 into the slot attention map.

Visual tokens · per frame

N × D (N = 64)

→

Slot attention

K = 4, I = 3 iters

→

Reconstruction

entity-aware tokens
residual + back to LLM
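The flow above (N × D tokens, K = 4 slots, 3 iterations, reconstruction, residual addition) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the slot update is a plain attention-weighted mean where the original slot attention formulation uses a GRU and MLP, the reconstruction broadcasts slots back to tokens through the attention map, and the weight names (`w_k`, `w_v`, `w_q`, `w_out`), the slot width, and the random initialization are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(d_model, d_slot, seed=0):
    """Random projection weights; names and shapes are illustrative only."""
    rng = np.random.default_rng(seed)
    return {
        "w_k": rng.normal(scale=0.02, size=(d_model, d_slot)),
        "w_v": rng.normal(scale=0.02, size=(d_model, d_slot)),
        "w_q": rng.normal(scale=0.02, size=(d_slot, d_slot)),
        "w_out": rng.normal(scale=0.02, size=(d_slot, d_model)),
    }

def slot_adapter(tokens, params, num_slots=4, num_iters=3, seed=0, eps=1e-8):
    """tokens: (N, D) visual tokens of one frame.

    Returns (tokens + reconstruction, attention map of shape (N, K)).
    """
    rng = np.random.default_rng(seed)
    d_slot = params["w_q"].shape[0]
    keys = tokens @ params["w_k"]      # (N, d_slot)
    values = tokens @ params["w_v"]    # (N, d_slot)
    slots = rng.normal(size=(num_slots, d_slot))

    def attend(current_slots):
        queries = current_slots @ params["w_q"]       # (K, d_slot)
        logits = keys @ queries.T / np.sqrt(d_slot)   # (N, K)
        return softmax(logits, axis=1)                # slots compete per token

    for _ in range(num_iters):
        attn = attend(slots)
        # Normalize over tokens so each slot takes a weighted mean of values.
        weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
        slots = weights.T @ values                    # simplified update (no GRU/MLP)

    attn = attend(slots)
    recon = (attn @ slots) @ params["w_out"]          # back to (N, D)
    return tokens + recon, attn                       # residual addition
```

The softmax runs over the slot axis, which is what makes the slots compete for each token; nothing in the code assigns a fixed role to any slot.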

Fig. 3 from the paper: the full SlotVTG pipeline. (a) The Slot Adapter is inserted into the early LLM-decoder layers and adds residually to the token stream. (b) The Slot Alignment Loss aligns slot-attention similarity with DINOv2 affinity.
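One way to read "aligning slot-attention similarity with DINOv2 affinity" is sketched below. The exact form of the loss is not given on this page, so everything here is an assumption: the token-token similarity is taken as the product of the attention map with its transpose, the affinity is clipped cosine similarity between patch features, and the two are compared with a mean squared error. The `feats` argument stands in for DINOv2 patch features, which are not computed here.

```python
import numpy as np

def slot_similarity(attn):
    """attn: (N, K) per-token distribution over slots.

    Entry (i, j) is high when tokens i and j attend to the same slot.
    """
    return attn @ attn.T

def feature_affinity(feats, eps=1e-8):
    """Cosine similarity between patch features, clipped to [0, 1]."""
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)
    return np.clip(normed @ normed.T, 0.0, 1.0)

def slot_alignment_loss(attn, feats):
    """Mean squared difference between the two (N, N) matrices (assumed form)."""
    diff = slot_similarity(attn) - feature_affinity(feats)
    return float(np.mean(diff ** 2))
```

Under this reading, the loss is small when tokens that look alike to the frozen feature extractor are also routed to the same slot, without ever naming which slot that should be.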

Slot Visualization

What do the slots actually learn?

Decomposition generalizes — without any domain-specific supervision.

We visualize the slot attention maps by masking each frame region with its highest-attending slot. Across both In-Domain (Charades-STA) and Out-of-Domain (QVHighlights, ActivityNet) samples, the slots decompose scenes into semantically coherent regions (people, objects, and backgrounds), though the specific slot-to-entity mapping varies across frames.

Importantly, this decomposition generalizes to unseen domains (QVHighlights, ActivityNet) without any domain-specific supervision, confirming that the Slot Adapter learns transferable entity-level representations rather than dataset-specific patterns.
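The visualization step described above (assigning each frame region to its highest-attending slot) amounts to an argmax over the slot axis. A minimal sketch, assuming the attention map is stored as an (N, K) array and that the N tokens of a frame lie on a regular grid (for example 8 × 8 when N = 64, which is an assumption about the token layout):

```python
import numpy as np

def slot_assignment_map(attn, grid_hw):
    """attn: (N, K) attention of each token over K slots.

    Returns an (H, W) integer map with the index of the winning slot per region.
    """
    h, w = grid_hw
    if attn.shape[0] != h * w:
        raise ValueError("number of tokens must equal H * W")
    return attn.argmax(axis=1).reshape(h, w)

def slot_masks(assignment, num_slots):
    """One boolean (H, W) mask per slot; the masks partition the frame."""
    return np.stack([assignment == k for k in range(num_slots)])
```

Each mask can then be upsampled to the frame resolution and overlaid in a distinct color; slot indices carry no fixed meaning, so the same index may cover a person in one frame and the background in another.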

Cross-Domain Results

Consistent OOD gains. Preserved ID performance.

Compared against the strong Chrono-Qwen baseline, SlotVTG improves OOD R1@0.5 by up to +4.3 while staying on par with ID accuracy — all with a 0.25%-parameter adapter.

Charades-STA, source · 3B

+4.3

R1@0.5 on QVHighlights · 43.6 → 47.9

Charades-STA, source · 7B

+4.0

R1@0.5 on ActivityNet · 29.2 → 33.2

Source Dataset	LLM	Method	Target Dataset
			Charades-STA				ActivityNet				QVHighlights
			R1@0.3	R1@0.5	R1@0.7	mIoU	R1@0.3	R1@0.5	R1@0.7	mIoU	R1@0.3	R1@0.5	R1@0.7	mIoU
Zero-Shot	7B	HawkEye	50.6	31.4	14.5	33.7	49.1	29.3	10.7	32.7	–	–	–	–
	7B	TimeSuite	69.9	48.7	24.0	–	–	16.6	9.3	22.0	–	12.3	9.2	21.3
	7B	UniTime	–	59.1	31.9	52.2	–	22.8	14.1	27.3	–	41.0	31.5	43.7
	2B	VideoMind	67.6	51.1	26.0	45.2	44.0	26.5	12.6	30.1	–	–	–	–
	7B	VideoMind	73.5	59.1	31.2	50.2	48.4	30.3	15.7	33.3	–	–	–	–
Charades-STA	–	EaTR	67.7	55.2	33.1	47.7	36.9	18.8	7.3	24.1	31.7	17.0	6.4	21.5
	–	CG-DETR	69.7	57.6	35.1	49.5	32.6	16.8	6.8	22.1	37.4	22.8	10.5	25.2
	4B	Chrono-BLIP	77.5	68.8	48.5	57.2	41.8	22.4	9.7	27.7	66.6	43.9	23.7	43.9
	3B	Chrono-Qwen	77.2	63.4	40.3	55.2	44.4	26.3	13.1	30.1	63.3	43.6	23.3	42.7
	3B	SlotVTG (Ours)	77.2	64.0	41.2	55.4	47.7	28.7	14.4	32.2	66.0	47.9	26.2	45.0
	7B	Chrono-Qwen	79.1	67.8	46.9	58.1	46.5	29.2	14.6	32.6	70.3	53.5	29.6	49.0
	7B	SlotVTG (Ours)	79.5	67.6	46.7	58.3	52.0	33.2	16.7	35.5	74.0	57.6	32.2	51.3
QVHighlights	–	EaTR	40.8	27.2	13.0	28.0	36.7	20.9	9.7	25.3	70.3	59.6	40.3	53.1
	–	CG-DETR	42.8	25.5	12.2	28.5	37.7	21.5	10.4	26.0	77.5	65.6	52.1	61.3
	4B	Chrono-BLIP	61.5	37.0	19.8	41.4	41.8	22.4	9.7	27.7	86.1	76.8	62.8	70.8
	3B	Chrono-Qwen	70.6	45.7	21.8	45.7	55.2	35.3	20.8	39.2	87.6	79.1	64.8	71.7
	3B	SlotVTG (Ours)	70.7	46.6	22.6	46.1	56.1	35.7	21.1	40.0	87.3	79.5	64.6	71.7
	7B	Chrono-Qwen	75.2	53.3	27.4	49.9	60.7	41.4	24.8	43.4	90.7	81.8	67.6	74.9
	7B	SlotVTG (Ours)	76.0	53.7	28.2	50.4	61.7	42.0	25.3	44.1	91.3	82.9	69.3	76.0

Bold: best within each (source · LLM size) group on each target. DETR-based methods (EaTR, CG-DETR) use pre-extracted CLIP + SlowFast features at 0.5 fps. Zero-shot rows are reference numbers without source fine-tuning.
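For reference, the metrics in the table can be computed as sketched below, using the usual definitions in temporal grounding: R1@τ is the percentage of queries whose top-1 predicted window has temporal IoU of at least τ with the ground truth, and mIoU is the mean IoU in percent. The function names are illustrative, and the use of `>=` at the threshold is an assumption; evaluation scripts may differ in such details.

```python
def temporal_iou(pred, gt):
    """IoU of two time windows given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, threshold):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return 100.0 * sum(iou >= threshold for iou in ious) / len(ious)

def mean_iou(preds, gts):
    """Mean temporal IoU over all queries, in percent."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return 100.0 * sum(ious) / len(ious)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, so it counts toward R1@0.3 but not toward R1@0.5.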
