VisualLLM
How Far Are We from Intelligent Visual Deductive Reasoning? (arXiv:2403.04732)
MoAI: Mixture of All Intelligence for Large Language and Vision Models (arXiv:2403.07508)
DragAnything: Motion Control for Anything using Entity Representation (arXiv:2403.07420)
Learning and Leveraging World Models in Visual Representation Learning (arXiv:2403.00504)
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework (arXiv:2403.13248)
Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos (arXiv:2403.13044)
Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers (arXiv:2403.12943)
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (arXiv:2403.11703)
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (arXiv:2403.11481)
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer (arXiv:2403.10301)
RAFT: Adapting Language Model to Domain Specific RAG (arXiv:2403.10131)
VideoAgent: Long-form Video Understanding with Large Language Model as Agent (arXiv:2403.10517)
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (arXiv:2403.18814)
Improving Text-to-Image Consistency via Automatic Prompt Optimization (arXiv:2403.17804)
Can large language models explore in-context? (arXiv:2403.15371)
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement (arXiv:2403.15042)
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding (arXiv:2403.15377)
DragAPart: Learning a Part-Level Motion Prior for Articulated Objects (arXiv:2403.15382)
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (arXiv:2404.07972)
Rho-1: Not All Tokens Are What You Need (arXiv:2404.07965)
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents (arXiv:2404.05902)
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (arXiv:2404.07973)
Best Practices and Lessons Learned on Synthetic Data for Language Models (arXiv:2404.07503)
OmniFusion Technical Report (arXiv:2404.06212)
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (arXiv:2404.05719)
ByteEdit: Boost, Comply and Accelerate Generative Image Editing (arXiv:2404.04860)
AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent (arXiv:2404.03648)
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens (arXiv:2404.03413)
Scaling Instructable Agents Across Many Simulated Worlds (arXiv:2404.10179)
What matters when building vision-language models? (arXiv:2405.02246)
iVideoGPT: Interactive VideoGPTs are Scalable World Models (arXiv:2405.15223)
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding (arXiv:2406.14515)
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (arXiv:2406.11896)
VideoLLM-online: Online Video Large Language Model for Streaming Video (arXiv:2406.11816)
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds (arXiv:2407.01494)