Paper breakdown

June 23, 2026

8 min read

Meet PEEK: A Tiny Model for Smarter Frame Selection

Get an overview of PEEK: a tiny model for smarter frame selection

Resources: paper (arXiv), GitHub, Hugging Face weights, Hugging Face Space demo. Code license: Apache-2.0. Weights license: CC-BY-NC-SA-4.0.

Introduction

Vision-language models (VLMs) have made strong progress on image-language tasks, but video understanding remains expensive because videos are long, redundant, and often require sparse relevant cues to be extracted from a frame sequence.

For example, even a 30-second video contains 720 images at 24 frames per second. Feeding every image to a captioning model is computationally expensive, and for longer videos it becomes intractable.

Before processing a video, most videoLLMs use a simple strategy to process fewer frames: they select a few of them uniformly from the video (e.g. a maximum of 768 regardless of the video length), and hope they will not miss an important event. Uniform sampling is deterministic, model-free, and often produces good results, outperforming adaptive strategies on some benchmarks. This makes uniform sampling a good baseline rather than a naive approach. However, it is still fundamentally content-blind: a short clip where the key event happens in a single instant and a clip where useful evidence is spread across the whole duration are treated identically.

Figure 1: Uniform sampling of a video sequence of T frames. One frame is selected every k frames, producing a fixed number of uniformly spaced samples that cover the full temporal extent of the video, independent of content.
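To make the frame counts concrete, here is a minimal sketch of content-blind uniform sampling. The function name and the choice of evenly spaced indices starting at frame 0 are assumptions for illustration; real videoLLM samplers differ in their details.

```python
def uniform_indices(num_frames: int, budget: int) -> list[int]:
    """Return at most `budget` evenly spaced frame indices in [0, num_frames)."""
    if num_frames <= 0 or budget <= 0:
        return []
    if num_frames <= budget:
        return list(range(num_frames))  # short video: keep every frame
    stride = num_frames / budget  # fractional stride spreads picks over the full duration
    return [int(i * stride) for i in range(budget)]


fps, seconds = 24, 30
total = fps * seconds  # number of frames in a 30-second clip at 24 fps
picked = uniform_indices(total, budget=8)
print(total, picked)
```

The selected indices depend only on the video length and the budget, never on what the frames show.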

To address this, we propose PEEK, a tiny model that scores each frame in a video, so that a small number of high-scoring frames can be selected. These frames tend to be more relevant than uniformly sampled ones.

Figure 2: Stratified argmax selection with budget k = 4. The video is partitioned into k equal temporal segments; within each segment, the frame with the highest predicted score is selected (teal). Unlike uniform sampling, selected frames are not evenly spaced; they are pulled toward informative moments within each segment.

Method

Our method is based on teacher-student knowledge distillation: we train a small model (the student) to reproduce the outputs of a larger one (the teacher).

We use SigLIP2 (400M parameters) as our teacher, acting as an Oracle: for each video-caption pair in our training data, the teacher encodes the caption with its text encoder and encodes each frame independently with its vision encoder. We then compute a similarity score between each frame embedding and the caption embedding to obtain the target scores that the student will learn to reproduce.
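The teacher scoring step can be sketched as follows. This is an illustration only: the embeddings are toy vectors standing in for real encoder outputs, and cosine similarity is an assumption, since the article only says "a similarity score". The function names are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def teacher_scores(frame_embs: list[list[float]], caption_emb: list[float]) -> list[float]:
    """One relevance score per frame: similarity to the caption embedding."""
    return [cosine(f, caption_emb) for f in frame_embs]


# Toy 3-d embeddings (hypothetical values, not real SigLIP2 outputs).
caption = [1.0, 0.0, 0.0]
frames = [[0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [2.0, 0.0, 0.0]]
scores = teacher_scores(frames, caption)
print(scores)
```

Each frame is scored independently of the others, which is why the teacher's score curve can vary sharply from one frame to the next.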

Our student is a tiny model (13.1M parameters) that sees only the frames of the video. It is trained to predict a score for each frame without access to the caption, which is unavailable at inference time since our goal is to select relevant frames for video captioning. We use a ListMLE loss so that the student learns to rank frames as the teacher does. Figure 2 summarizes our approach visually.
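For readers unfamiliar with ListMLE, the sketch below computes the loss value for one video in plain Python. It follows the standard ListMLE definition (negative Plackett-Luce log-likelihood of the teacher's ranking under the student's scores). The function name is ours, and an actual training loop would use an autodiff framework rather than this scalar version.

```python
import math


def list_mle_loss(student_scores: list[float], teacher_scores: list[float]) -> float:
    """Negative log-likelihood of the teacher's ranking under the student's scores."""
    # Order frames from most to least relevant according to the teacher.
    order = sorted(range(len(teacher_scores)), key=lambda i: teacher_scores[i], reverse=True)
    s = [student_scores[i] for i in order]
    loss = 0.0
    for i in range(len(s)):
        tail = s[i:]  # frames not yet placed in the ranking
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in tail))
        loss += log_sum_exp - s[i]
    return loss


teacher = [0.1, 0.9, 0.5]
agreeing = list_mle_loss([0.0, 2.0, 1.0], teacher)  # same ordering as the teacher
reversed_order = list_mle_loss([2.0, 0.0, 1.0], teacher)  # opposite ordering
print(agreeing, reversed_order)
```

The loss depends only on the teacher's ordering of the frames, not on the teacher's raw score values, so the student is free to produce scores on its own scale.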

At inference time, PEEK predicts a score for each frame in the input video. Since neighboring video frames can be highly redundant, directly selecting the frames with the highest scores would concentrate the selection around a few temporal points. To ensure good temporal coverage, we split the input video into several equal temporal windows and select the highest-scoring frame from each window, as shown in Figure 3.

Figure 3: Stratified argmax selection with budget k = 4. The video is partitioned into k equal temporal segments; within each segment, the frame with the highest predicted score is selected (teal). Unlike uniform sampling, selected frames are not evenly spaced; they are pulled toward informative moments within each segment.

This policy combines two priors: the scorer chooses content-rich frames locally, while the segments preserve temporal coverage. For k = 1, stratified argmax reduces to selecting the single highest-scoring frame in the video. The selected frames are sorted in temporal order before being forwarded to the downstream captioning model.
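The selection policy described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the function name and the integer rule used to place segment boundaries are assumptions.

```python
def stratified_argmax(scores: list[float], k: int) -> list[int]:
    """Pick the highest-scoring frame index from each of k equal temporal segments."""
    n = len(scores)
    if n == 0 or k <= 0:
        return []
    k = min(k, n)  # never ask for more segments than frames
    selected = []
    for seg in range(k):
        start = seg * n // k
        end = (seg + 1) * n // k  # exclusive; segments tile [0, n) without gaps
        best = max(range(start, end), key=lambda i: scores[i])
        selected.append(best)
    return selected  # already in temporal order because segments are visited in order


frame_scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.7, 0.4, 0.6]
print(stratified_argmax(frame_scores, 4))
print(stratified_argmax(frame_scores, 1))
```

In this toy example, a plain top-2 selection would pick frames 1 and 4, as would k = 2 here. With k = 4 the policy is forced to also cover the second and fourth segments.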

Experiments

Data

We train our model on ActivityNet Captions using the official splits, and report all metrics on the test set. We also evaluate on the MSR-VTT test split without re-training to assess zero-shot transfer to clip-level captioning. Table 1 summarizes the splits used for training and evaluation.

Dataset   Split   #Videos   #Segments   Avg. segment duration (s)   Avg. video duration (s)   Avg. words / caption
ANC       train   10,009    37,421      35.5                        117.3                     13.5
ANC       val     4,917     17,031      37.7                        118.2                     13.6
ANC       test    4,885     17,505      40.2                        118.2                     12.0
MSR-VTT   test    2,990     2,990       15.2                        15.2                      9.3

Table 1: Statistics of the splits used for training and evaluation.

ActivityNet Captions

ActivityNet Captions (ANC) consists of untrimmed YouTube videos drawn from the ActivityNet dataset, each densely annotated with multiple temporally-localized natural language descriptions. A single video contains, on average, between 3 and 4 overlapping or sequential events, with a typical total duration of about two minutes. Every annotated event is described by a free-form English sentence.

MSR-VTT

MSR-VTT contains short web video clips paired with 20 crowd-sourced English captions per clip. Unlike ANC, captions describe the entire clip rather than localized events, so each test video contributes a single “segment” whose temporal extent coincides with the clip itself. We use MSR-VTT exclusively for evaluation, in order to probe whether our model trained on ANC videos generalizes zero-shot to clips with a different caption distribution.

Training and Evaluation

We train PEEK on the ANC dataset, and evaluate our trained selector on video captioning, on both ANC and MSR-VTT test sets, selecting 1, 2, 4 or 8 frames that are fed to a downstream VLM. We compare PEEK against five training-free frame selection methods:

  • Oracle is the teacher model which has access to ground-truth captions. Although it cannot be used at inference time, we evaluate it to estimate an approximate upper bound on the sampler’s achievable performance.
  • Uniform splits the (densely) sampled frames into k equal temporal sub-segments and selects the center frame of each. 
  • Random uses the same temporal sub-segments as Uniform and samples one frame at random from each sub-segment, using a fixed seed shared across all VLMs.
  • MaxInfo selects a diverse, high-information subset by applying a maximum volume criterion to CLIP image embeddings; we use its fixed-cardinality mode with exactly k selected frames.
  • CSTA is originally a video summarization method that predicts frame-importance scores and selects a summary under a length budget. Since our evaluation requires a fixed number of frames, we adapt only its scoring stage: frames are scored with CSTA, then one highest-scoring frame is selected from each of the k temporal sub-segments.
  • PEEK is our student model trained on ANC with stratified argmax selection.
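The two simplest baselines in this list, Uniform and Random, can be sketched as below. The function names, the lower-middle convention for the "center" frame, and the default seed value are assumptions for illustration. Oracle, MaxInfo and CSTA require their respective models and are not sketched.

```python
import random


def segment_bounds(n: int, k: int) -> list[tuple[int, int]]:
    """Split frame indices [0, n) into k contiguous, near-equal sub-segments."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n so that every sub-segment is non-empty")
    return [(s * n // k, (s + 1) * n // k) for s in range(k)]


def uniform_centers(n: int, k: int) -> list[int]:
    """Uniform baseline: the center frame of each sub-segment (lower middle if even)."""
    return [(start + end - 1) // 2 for start, end in segment_bounds(n, k)]


def random_per_segment(n: int, k: int, seed: int = 0) -> list[int]:
    """Random baseline: one frame drawn from each sub-segment with a fixed seed."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return [rng.randrange(start, end) for start, end in segment_bounds(n, k)]


print(uniform_centers(16, 4))
print(random_per_segment(16, 4))
```

Both baselines share the same sub-segments as the stratified selection used for CSTA and PEEK, so the methods differ only in how a frame is chosen inside each sub-segment.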

We fix the parameters and seed for all evaluated downstream VLMs for fair comparison, and use the same candidate frames for all methods.

For each selected frame budget, we generate captions conditioned only on the k chosen frames, in temporal order, and a short captioning prompt. We evaluate four downstream VLMs of various sizes: SmolVLM2-2.2B-Instruct, Qwen2.5-VL-3B, Qwen3.5-4B and Qwen2.5-VL-7B. We report CIDEr, which remains the primary metric for discussion because it is the most commonly reported metric for video captioning.

Results

ActivityNet Captions

VLM             Selector      CIDEr (k=1)   CIDEr (k=2)   CIDEr (k=4)   CIDEr (k=8)
SmolVLM2-2.2B   Oracle        37.36         38.19         39.41         39.05
SmolVLM2-2.2B   Uniform       29.79         31.23         32.76         33.85
SmolVLM2-2.2B   Random        28.40         31.06         32.35         33.47
SmolVLM2-2.2B   CSTA          28.37         30.38         32.62         33.69
SmolVLM2-2.2B   MaxInfo       27.07         29.49         31.91         33.06
SmolVLM2-2.2B   PEEK (ours)   31.53         32.98         33.45         34.33
Qwen2.5-VL-3B   Oracle        37.31         41.67         45.23         46.58
Qwen2.5-VL-3B   Uniform       30.05         35.36         39.51         42.33
Qwen2.5-VL-3B   Random        29.03         34.99         39.54         41.91
Qwen2.5-VL-3B   CSTA          28.56         34.53         39.16         42.06
Qwen2.5-VL-3B   MaxInfo       26.47         35.10         39.47         41.91
Qwen2.5-VL-3B   PEEK (ours)   32.39         37.02         40.55         42.42
Qwen3.5-4B      Oracle        38.4          40.64         40.19         40.30
Qwen3.5-4B      Uniform       29.55         33.35         34.66         35.42
Qwen3.5-4B      Random        28.94         32.93         34.72         34.93
Qwen3.5-4B      CSTA          28.51         33.47         34.54         35.39
Qwen3.5-4B      MaxInfo       27.01         32.97         34.44         35.22
Qwen3.5-4B      PEEK (ours)   31.73         34.51         34.93         35.31
Qwen2.5-VL-7B   Oracle        36.60         40.10         38.05         33.88
Qwen2.5-VL-7B   Uniform       28.54         34.43         33.54         29.40
Qwen2.5-VL-7B   Random        28.44         33.68         33.55         29.58
Qwen2.5-VL-7B   CSTA          28.02         33.63         33.02         29.77
Qwen2.5-VL-7B   MaxInfo       26.44         33.56         33.92         29.43
Qwen2.5-VL-7B   PEEK (ours)   31.54         35.04         33.15         30.01

Table 2: ActivityNet Captions test captioning metrics with different downstream VLMs and frame budgets. PEEK uses the same ActivityNet-trained checkpoint for all downstream VLMs. Oracle scores frames against the ground-truth caption.

Table 2 reports the results on ActivityNet Captions. PEEK is the strongest query-free selector on this benchmark, obtaining the best CIDEr in 14 out of 16 model/budget settings. The gains are most pronounced at k=1, where PEEK improves over the strongest query-free baseline by +1.74 CIDEr points for SmolVLM2-2.2B, +2.34 for Qwen2.5-VL-3B, +2.18 for Qwen3.5-4B, and +3.00 for Qwen2.5-VL-7B. The same conclusion holds at k=2, where PEEK is again best for all four VLMs, with gains ranging from +0.61 to +1.75 CIDEr points.

Compared with the adaptive baselines (CSTA and MaxInfo), PEEK is consistently stronger in the low-budget regime (k=1 and k=2).
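The k=1 margins quoted above can be recomputed directly from Table 2. The snippet below copies the k=1 CIDEr values of the query-free selectors (the Oracle is excluded because it sees the ground-truth caption). The helper name is ours.

```python
# CIDEr at k=1 copied from Table 2 (query-free selectors only).
cider_k1 = {
    "SmolVLM2-2.2B": {"Uniform": 29.79, "Random": 28.40, "CSTA": 28.37, "MaxInfo": 27.07, "PEEK": 31.53},
    "Qwen2.5-VL-3B": {"Uniform": 30.05, "Random": 29.03, "CSTA": 28.56, "MaxInfo": 26.47, "PEEK": 32.39},
    "Qwen3.5-4B": {"Uniform": 29.55, "Random": 28.94, "CSTA": 28.51, "MaxInfo": 27.01, "PEEK": 31.73},
    "Qwen2.5-VL-7B": {"Uniform": 28.54, "Random": 28.44, "CSTA": 28.02, "MaxInfo": 26.44, "PEEK": 31.54},
}


def margin_over_best_baseline(row: dict[str, float], ours: str = "PEEK") -> float:
    """Difference between our selector and the strongest other selector in the row."""
    best_baseline = max(v for name, v in row.items() if name != ours)
    return round(row[ours] - best_baseline, 2)


margins = {vlm: margin_over_best_baseline(row) for vlm, row in cider_k1.items()}
print(margins)
```

In all four rows the strongest baseline at k=1 is Uniform, which is consistent with the earlier remark that uniform sampling is a good baseline rather than a naive approach.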

Zero-shot Transfer on MSR-VTT

VLM             Selector      CIDEr (k=1)   CIDEr (k=2)   CIDEr (k=4)   CIDEr (k=8)
SmolVLM2-2.2B   Oracle        48.33         49.35         48.99         33.99
SmolVLM2-2.2B   Uniform       42.15         43.87         46.76         31.42
SmolVLM2-2.2B   Random        41.77         44.53         46.18         32.52
SmolVLM2-2.2B   CSTA          39.99         43.47         45.49         32.99
SmolVLM2-2.2B   MaxInfo       34.36         43.01         44.96         31.49
SmolVLM2-2.2B   PEEK (ours)   44.83         46.28         46.67         33.71
Qwen2.5-VL-3B   Oracle        33.67         38.10         42.45         44.75
Qwen2.5-VL-3B   Uniform       29.18         34.57         40.79         42.68
Qwen2.5-VL-3B   Random        28.34         35.04         39.42         43.37
Qwen2.5-VL-3B   CSTA          28.20         34.19         40.34         43.09
Qwen2.5-VL-3B   MaxInfo       21.82         33.76         39.56         43.18
Qwen2.5-VL-3B   PEEK (ours)   30.64         34.85         40.36         42.94
Qwen3.5-4B      Oracle        38.80         36.67         35.62         34.26
Qwen3.5-4B      Uniform       32.63         33.32         33.62         32.77
Qwen3.5-4B      Random        32.21         32.75         33.84         32.88
Qwen3.5-4B      CSTA          31.23         32.65         33.63         32.79
Qwen3.5-4B      MaxInfo       26.10         33.30         33.90         33.03
Qwen3.5-4B      PEEK (ours)   34.89         33.92         33.84         32.89
Qwen2.5-VL-7B   Oracle        36.63         52.42         53.46         47.08
Qwen2.5-VL-7B   Uniform       31.08         48.57         51.42         45.04
Qwen2.5-VL-7B   Random        31.91         47.13         48.95         42.10
Qwen2.5-VL-7B   CSTA          31.54         46.85         49.12         41.71
Qwen2.5-VL-7B   MaxInfo       25.60         46.53         49.01         42.30
Qwen2.5-VL-7B   PEEK (ours)   33.16         48.69         50.99         45.29

Table 3: Zero-shot MSR-VTT test captioning metrics with different downstream VLMs and frame budgets. PEEK uses the same query-free ActivityNet-trained selector for all downstream VLMs. Oracle scores frames against the ground-truth caption.

Table 3 evaluates the same ActivityNet-trained selector on MSR-VTT, without re-training. This setting tests whether PEEK learns a transferable visual relevance prior rather than an ANC-specific domain distribution. The strongest transfer result is again obtained at k=1: PEEK is the best query-free method for all four downstream VLMs and all reported metrics in the one-frame setting.

Efficiency

Selector         Text at inference   Selector time     Per segment   Full pipeline
Uniform/Random   No                  negligible        -             33h35m
PEEK (ours)      No                  1h44m (26m)       0.36s         35h20m (+5.2%)
CSTA             No                  21h58m (5h31m)    4.52s         55h33m (+65.4%)
MaxInfo          No                  71h04m (17h49m)   14.62s        104h4m (+211.9%)
Oracle           Yes                 9h52m (2h28m)     2.03s         43h27m (+29.4%)

Table 4: Selection and end-to-end captioning time on the full ActivityNet Captions evaluation split with 17,505 segments, with SmolVLM2-2.2B-Instruct. Timings are measured on 4×NVIDIA A10G GPUs. We report total GPU time, with 4-GPU wall-clock estimates in parentheses. The full pipeline evaluates k ∈ {1, 2, 4, 8}.

Table 4 reports the selection and end-to-end captioning time on the full ANC evaluation split. Uniform and Random sampling have negligible selection cost, while all content-aware methods require an additional scoring pass over the candidate frames. On the ANC evaluation split, PEEK scores all 17,505 segments in 1h44m of GPU time, corresponding to 0.36s per segment. By contrast, CSTA requires 21h58m of GPU time, or 4.52s per segment, while MaxInfo requires 71h04m of GPU time, or 14.62s per segment. The Oracle is also more expensive than PEEK, requiring 9h52m of GPU time, or 2.03s per segment, and is not deployable because it uses the ground-truth caption.

When frame scores are reused for the full k ∈ {1, 2, 4, 8} captioning pipeline, PEEK increases total GPU time by only 5.2% over Uniform. In comparison, CSTA increases the total time by 65.4%, MaxInfo by 211.9%, and the Oracle by 29.4%. Thus, PEEK is not free, but it is a lot cheaper than the other content-aware selectors evaluated here. This efficiency is central to its practical value: PEEK recovers part of the Oracle’s caption-relevance signal while remaining query-free and lightweight enough to be used as a practical preprocessing stage.
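The per-segment times and the relative overhead follow from simple arithmetic on the Table 4 entries, as the sketch below shows. The helper names are ours. Because the table reports times rounded to the minute, recomputed percentages can differ slightly from the printed ones.

```python
def to_seconds(hours: int, minutes: int) -> int:
    """Convert an h/m duration from the table into seconds."""
    return hours * 3600 + minutes * 60


NUM_SEGMENTS = 17_505  # ANC evaluation segments, from Table 4

# Selector GPU time from Table 4.
selector_time = {
    "PEEK": to_seconds(1, 44),
    "CSTA": to_seconds(21, 58),
    "MaxInfo": to_seconds(71, 4),
    "Oracle": to_seconds(9, 52),
}
per_segment = {name: round(t / NUM_SEGMENTS, 2) for name, t in selector_time.items()}


def overhead_percent(pipeline_s: int, baseline_s: int) -> float:
    """Relative increase of a full pipeline over the Uniform pipeline, in percent."""
    return round(100 * (pipeline_s - baseline_s) / baseline_s, 1)


uniform_pipeline = to_seconds(33, 35)
peek_overhead = overhead_percent(to_seconds(35, 20), uniform_pipeline)
print(per_segment, peek_overhead)
```

The selector is run once per segment and its scores are reused for every budget k, so the scoring cost is paid once per segment rather than once per budget.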

Qualitative analysis

Figure 4: Top frames selected on a short video segment in which a basketball player dunks. From left to right: (a) the uniform (center) frame, (b) the top-ranked frame from PEEK, and (c) the top-ranked frame from the SigLIP2 teacher. The one-sentence caption below each frame is generated by Qwen2.5-VL-3B; the ground-truth caption is shown at the bottom. PEEK selects the actual dunk above the hoop, while the center frame shows a preparatory moment before the dunk and the teacher selects the landing.

To complement the quantitative results, Figure 5 compares PEEK and SigLIP2 scores on ANC test segments. The two methods agree on globally salient regions but often differ locally, with PEEK producing smoother temporal profiles than the frame-wise Oracle. Figure 4 illustrates PEEK's strength: only PEEK selects the frame in which the player is dunking, while the uniform center frame and the teacher both miss it.

Figure 5: Per-frame relevance scores on three test segments. Curves are min-max normalized per video and markers indicate the argmax frame for each method. PEEK (red) and SigLIP2 (blue) agree on the global temporal structure but disagree locally, and their top-frame choices differ.
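The normalization used for these curves is straightforward. Here is a minimal sketch, with hypothetical function names and toy score values. Mapping a constant curve to all zeros is our convention, not something the article specifies.

```python
def min_max_normalize(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; a constant curve maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def argmax(scores: list[float]) -> int:
    """Index of the first maximum."""
    return max(range(len(scores)), key=lambda i: scores[i])


raw = [2.0, 4.0, 3.0, 6.0]  # toy per-frame scores
normalized = min_max_normalize(raw)
print(normalized, argmax(raw), argmax(normalized))
```

Min-max normalization is monotonic, so it changes the scale of each curve but not the position of its argmax marker. This makes curves from two models with different score ranges comparable on one plot.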

Discussion and limitations

The results indicate that learned frame selection is most useful when the visual budget is tight. Across both benchmarks, PEEK is the best query-free selector in all one-frame CIDEr settings and in most two-frame settings. This supports the central hypothesis of the paper: part of the caption-conditioned relevance signal produced by an Oracle teacher can be recovered from visual evidence alone. The comparison with CSTA and MaxInfo suggests that PEEK captures something other than generic video summarization or visual diversity alone. Instead, PEEK learns a caption-oriented notion of visual relevance that is particularly useful when only one or two frames can be passed to the captioner.

At the same time, the results should not be interpreted as showing that learned frame selection is universally preferable to uniform sampling. Uniform remains a strong baseline, especially when several frames can be forwarded to the captioner. This is particularly visible on MSR-VTT at k=4, where Uniform often obtains the best CIDEr. The likely reason is the evaluation setting: ANC segments and MSR-VTT clips are relatively short, so a few uniformly spaced frames often cover the main event. As the frame budget increases, the value of selecting the single most relevant frame decreases, while temporal coverage and diversity become more important.

Another limitation is that the teacher signal is derived from ground-truth captions. This makes it useful as Oracle supervision, but it also ties the learned notion of relevance to reference-caption alignment rather than to all visually meaningful events in the video. A frame that supports a correct but non-reference caption may receive a weak teacher score. This limitation is also related to the use of reference-based captioning metrics, which can penalize correct captions that differ from the reference and can behave non-monotonically as more visual context is added. Extending this analysis to longer videos, adaptive frame budgets, and human or model-based factuality judgments would give a more complete picture of when learned frame selection is preferable.

A final limitation is that our evaluation is restricted to short-caption generation. Both ANC segments and MSR-VTT clips are associated with relatively compact descriptions, while long-form video captioning may require preserving multiple events, fine-grained temporal order, and details that are not all captured by a single caption-conditioned relevance ranking. In such settings, selecting only the most caption-aligned frames could overemphasize the dominant event and discard secondary but still important visual cues. Moreover, although PEEK is query-free by design, other video understanding tasks such as video question answering or retrieval may benefit from task- or query-specific frame selection. Our method could still be useful as a lightweight first-stage selector or as a transferable initialization, but evaluating this requires dedicated experiments. Extending the distillation framework to longer descriptions, adaptive frame budgets, and query-conditioned supervision is therefore an important direction for future work.

Conclusion

  • We introduced PEEK, a state-of-the-art query-free frame selector for video captioning.
  • PEEK also provides a favorable efficiency trade-off. It is much faster than CSTA and MaxInfo, while consistently outperforming them in the low-frame regime. This makes it a practical selector for efficient video captioning and a natural candidate for related applications such as thumbnail or preview-frame selection.

Paper: https://arxiv.org/abs/2605.31029

Project page: https://www.killian-steunou.com/peek/

Code: https://github.com/momentslab/peek

Moments Lab Research Team