Papers
arxiv:2604.11177

Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding

Published on Apr 13 · Submitted by Ashish Choithani on Apr 15
Abstract

AI-generated summary

Research examines how internal reasoning traces affect video scene understanding in vision-language models, revealing that quality improvements from extended reasoning plateau quickly and that different model variants produce distinct reasoning patterns.

We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google's Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content versus meta-commentary. Thought-Final Coverage measures how faithfully the thought stream translates into the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Tight reasoning budgets cause the model to add content in the final output that it never reasoned about, a form of compression-step hallucination. Despite being different model tiers, Flash and Flash Lite produce similar thought streams, though they differ in style: Flash discusses its reasoning process, while Lite focuses on describing the scene.
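The abstract's Thought-Final Coverage metric is scored by a GPT-5 judge in the paper; as a rough illustration only, the idea can be sketched with a simple lexical proxy. The function name, the stopword list, and the sample thought/output strings below are all invented for this sketch and are not the authors' method.

```python
# Hypothetical lexical proxy for the Thought-Final Coverage idea: what
# fraction of the final output's content words were already present in the
# thought stream? The paper uses an LLM judge instead of word overlap.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "and", "to", "it"}

def content_words(text: str) -> set[str]:
    """Lowercased alphabetic tokens minus a small stopword list."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def coverage_proxy(thought_stream: str, final_output: str) -> float:
    """Fraction of the output's content words that appear in the thought
    stream. 1.0 means fully grounded; low values flag output details the
    model never reasoned about (compression-step hallucination)."""
    out_words = content_words(final_output)
    if not out_words:
        return 1.0
    return len(out_words & content_words(thought_stream)) / len(out_words)

thought = "A cyclist rides along a coastal road at sunset; waves crash nearby."
output = "A cyclist rides along a coastal road at sunset."
print(coverage_proxy(thought, output))
```

Under this proxy, an output sentence fully anchored in the thought stream scores 1.0, while invented details pull the score toward 0 — the same directional signal the judged metric is described as capturing.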

Community

Paper author · Paper submitter

Conclusions:

  1. More thinking helps, but gains plateau quickly in our setup. Most quality improvement happens in the first few hundred thought tokens. Beyond about 700 tokens, additional thinking adds cost with smaller gains in this dataset.
  2. Lite 1024 is the quality leader. It achieves the best F1, Thought Coverage, Output Grounding, and perfect-score rate while using 30% fewer thought tokens than Flash Dynamic.
  3. Tight budgets increase compression-step hallucination. Flash 128 more often outputs details that were not explicitly present in its thought stream.
  4. Flash and Lite think about the same things. Cross-tier thought stream similarity is nearly as high as same-model determinism, suggesting the two tiers share underlying reasoning patterns.
  5. Flash Lite is more token-efficient in this setup. It tends to spend less on process narration and more on scene content.
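The cross-tier comparison in conclusion 4 can be sketched as follows. The paper's actual similarity measure is not specified in this summary, so a plain bag-of-words cosine similarity stands in as an assumed proxy, and the two sample thought strings are invented:

```python
# Hypothetical stand-in for the cross-tier thought-stream similarity in
# conclusion 4: cosine similarity between word-count vectors of two texts.
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between lowercase word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Invented examples mimicking the style difference the abstract describes:
# Flash narrates its process, Lite describes the scene directly.
flash_thoughts = "I should describe the cyclist on the coastal road at sunset"
lite_thoughts = "The cyclist rides on the coastal road at sunset near the waves"
print(cosine_similarity(flash_thoughts, lite_thoughts))
```

Comparing such scores across model pairs (Flash vs. Lite) against repeated runs of the same model would reproduce the shape of the determinism-versus-cross-tier comparison the conclusion draws, though the paper's real measure may differ.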



Get this paper in your agent:

    hf papers read 2604.11177

Don't have the latest CLI? Install it with:

    curl -LsSf https://hf.co/cli/install.sh | bash
