VisualLLM
How Far Are We from Intelligent Visual Deductive Reasoning? (arXiv:2403.04732)
MoAI: Mixture of All Intelligence for Large Language and Vision Models (arXiv:2403.07508)
DragAnything: Motion Control for Anything using Entity Representation (arXiv:2403.07420)
Learning and Leveraging World Models in Visual Representation Learning (arXiv:2403.00504)
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework (arXiv:2403.13248)
Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos (arXiv:2403.13044)
Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers (arXiv:2403.12943)
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (arXiv:2403.11703)
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (arXiv:2403.11481)
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer (arXiv:2403.10301)
RAFT: Adapting Language Model to Domain Specific RAG (arXiv:2403.10131)
VideoAgent: Long-form Video Understanding with Large Language Model as Agent (arXiv:2403.10517)
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (arXiv:2403.18814)
Improving Text-to-Image Consistency via Automatic Prompt Optimization (arXiv:2403.17804)
Can large language models explore in-context? (arXiv:2403.15371)
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement (arXiv:2403.15042)
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding (arXiv:2403.15377)
DragAPart: Learning a Part-Level Motion Prior for Articulated Objects (arXiv:2403.15382)
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (arXiv:2404.07972)
Rho-1: Not All Tokens Are What You Need (arXiv:2404.07965)
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents (arXiv:2404.05902)
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (arXiv:2404.07973)
Best Practices and Lessons Learned on Synthetic Data for Language Models (arXiv:2404.07503)
OmniFusion Technical Report (arXiv:2404.06212)
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (arXiv:2404.05719)
ByteEdit: Boost, Comply and Accelerate Generative Image Editing (arXiv:2404.04860)
AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent (arXiv:2404.03648)
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens (arXiv:2404.03413)
Scaling Instructable Agents Across Many Simulated Worlds (arXiv:2404.10179)
What matters when building vision-language models? (arXiv:2405.02246)
iVideoGPT: Interactive VideoGPTs are Scalable World Models (arXiv:2405.15223)
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding (arXiv:2406.14515)
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (arXiv:2406.11896)
VideoLLM-online: Online Video Large Language Model for Streaming Video (arXiv:2406.11816)
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds (arXiv:2407.01494)