Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models Paper • 2606.03748 • Published 4 days ago • 5
3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code Paper • 2606.01057 • Published 6 days ago • 7
Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini Paper • 2605.27295 • Published 11 days ago • 23
LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs Paper • 2605.17260 • Published 20 days ago • 25
Geometric Context Transformer for Streaming 3D Reconstruction Paper • 2604.14141 • Published Apr 15 • 21
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens Paper • 2604.04913 • Published Apr 6 • 12
MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios Paper • 2603.28130 • Published Mar 30 • 11
Understanding the Challenges in Iterative Generative Optimization with LLMs Paper • 2603.23994 • Published Mar 25 • 29
Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos Paper • 2603.22529 • Published Mar 23 • 7
Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders Paper • 2603.19209 • Published Mar 19 • 6
Versatile Editing of Video Content, Actions, and Dynamics without Training Paper • 2603.17989 • Published Mar 18 • 18
Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD Paper • 2603.20155 • Published Mar 20 • 10
V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning Paper • 2603.14482 • Published Mar 15 • 36