Papers
arxiv:2604.11177

Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding

Published on Apr 13 · Submitted by Ashish Choithani on Apr 15
Abstract

AI-generated summary

Research examines how internal reasoning traces affect video scene understanding in vision-language models, revealing that quality improvements from extended reasoning plateau quickly and that different model variants produce distinct reasoning patterns.

We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google's Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content versus meta-commentary. Thought-Final Coverage measures how faithfully the thought stream translates into the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Tight reasoning budgets cause the model to add content in the final output that it never reasoned about, a form of compression-step hallucination. Despite being different model tiers, Flash and Flash Lite produce similar thought streams, though they differ in style: Flash discusses its reasoning process, while Lite focuses on describing the scene.
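The abstract's Thought-Final Coverage metric is scored by a GPT-5 judge in the paper; as a rough illustration only, the idea can be sketched with a simple lexical proxy. The function name, the stopword list, and the sample thought/output strings below are all invented for this sketch and are not the authors' method.

```python
# Hypothetical lexical proxy for the Thought-Final Coverage idea: what
# fraction of the final output's content words were already present in the
# thought stream? The paper uses an LLM judge instead of word overlap.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "and", "to", "it"}

def content_words(text: str) -> set[str]:
    """Lowercased alphabetic tokens minus a small stopword list."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def coverage_proxy(thought_stream: str, final_output: str) -> float:
    """Fraction of the output's content words that appear in the thought
    stream. 1.0 means fully grounded; low values flag output details the
    model never reasoned about (compression-step hallucination)."""
    out_words = content_words(final_output)
    if not out_words:
        return 1.0
    return len(out_words & content_words(thought_stream)) / len(out_words)

thought = "A cyclist rides along a coastal road at sunset; waves crash nearby."
output = "A cyclist rides along a coastal road at sunset."
print(coverage_proxy(thought, output))
```

Under this proxy, an output sentence fully anchored in the thought stream scores 1.0, while invented details pull the score toward 0 — the same directional signal the judged metric is described as capturing.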

Community

Paper author · Paper submitter

Conclusions:

  1. More thinking helps, but gains plateau quickly in our setup. Most quality improvement happens in the first few hundred thought tokens. Beyond about 700 tokens, additional thinking adds cost with smaller gains in this dataset.
  2. Lite 1024 is the quality leader. It achieves the best F1, Thought Coverage, Output Grounding, and perfect-score rate while using 30% fewer thought tokens than Flash Dynamic.
  3. Tight budgets increase compression-step hallucination. Flash 128 more often outputs details that were not explicitly present in its thought stream.
  4. Flash and Lite think about the same things. Cross-tier thought stream similarity is nearly as high as same-model determinism, suggesting the two tiers share underlying reasoning patterns.
  5. Flash Lite is more token-efficient in this setup. It tends to spend less on process narration and more on scene content.
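The cross-tier comparison in conclusion 4 can be sketched as follows. The paper's actual similarity measure is not specified in this summary, so a plain bag-of-words cosine similarity stands in as an assumed proxy, and the two sample thought strings are invented:

```python
# Hypothetical stand-in for the cross-tier thought-stream similarity in
# conclusion 4: cosine similarity between word-count vectors of two texts.
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between lowercase word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Invented examples mimicking the style difference the abstract describes:
# Flash narrates its process, Lite describes the scene directly.
flash_thoughts = "I should describe the cyclist on the coastal road at sunset"
lite_thoughts = "The cyclist rides on the coastal road at sunset near the waves"
print(cosine_similarity(flash_thoughts, lite_thoughts))
```

Comparing such scores across model pairs (Flash vs. Lite) against repeated runs of the same model would reproduce the shape of the determinism-versus-cross-tier comparison the conclusion draws, though the paper's real measure may differ.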



Get this paper in your agent:

    hf papers read 2604.11177

Don't have the latest CLI? Install it with:

    curl -LsSf https://hf.co/cli/install.sh | bash
