| ---
|
| pipeline_tag: robotics
|
| library_name: transformers
|
| license: cc-by-nc-sa-4.0
|
| tags:
|
| - vision-language-model
|
| - manipulation
|
| - robotics
|
| ---
|
|
|
| <div align="center">
|
| <video src="https://huggingface.co/proxy/cdn-uploads.huggingface.co/production/uploads/678123194248fde89e4fc9bf/_cbIWKHPzffRxIpfmqdFG.mp4"
|
| controls autoplay muted playsinline loop width="720"></video>
|
|
|
| <p><em>🔊 Best viewed with sound on</em></p>
|
| </div>
|
|
|
|
|
| # F1: A Vision Language Action Model Bridging<br>Understanding and Generation to Actions
|
| [📄 Paper](https://arxiv.org/abs/2509.06951)
|
| [💻 Code](https://github.com/InternRobotics/F1-VLA)
|
| [🌐 Project Page](https://aopolin-lv.github.io/F1-VLA)
|
|
|
|
|
|
|
| ## 🔥 Key Innovations
|
|
|
| - **🧠 Predictive Inverse Dynamics**: Visual foresight generation for planning-based control
|
| - **🏗️ Mixture-of-Transformer**: Three specialized experts (Understanding, Generation, Action)
|
| - **📈 Three-Stage Training**: Progressive alignment, pretraining, and adaptation
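The predictive-inverse-dynamics idea above can be illustrated with a toy control loop: a generation step imagines a future observation (visual foresight), and an action step infers the action that moves the robot toward that prediction. This is a minimal sketch for intuition only; the function names and the scalar-state dynamics are hypothetical and do not reflect the F1 model's API.

```python
# Toy illustration of predictive inverse dynamics (NOT the F1 API).
# State and goal are simple 2-D vectors standing in for visual observations.

def foresight(obs, goal):
    """Hypothetical generation expert: predict the next observation,
    moving halfway from the current observation toward the goal."""
    return [o + 0.5 * (g - o) for o, g in zip(obs, goal)]

def inverse_dynamics(obs, predicted_obs):
    """Hypothetical action expert: infer the action (state delta)
    that realizes the predicted observation."""
    return [p - o for o, p in zip(obs, predicted_obs)]

obs, goal = [0.0, 0.0], [1.0, 2.0]
for _ in range(5):
    pred = foresight(obs, goal)           # plan: imagine the next state
    action = inverse_dynamics(obs, pred)  # act: infer the motion to reach it
    obs = [o + a for o, a in zip(obs, action)]
# After a few plan-act cycles, the state converges toward the goal.
```

The design point the sketch captures is that the action is conditioned on a *predicted* future observation rather than mapped directly from the current one, which is what enables planning-based control.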
|
|
|
| ## 🤖 Real-World Robot Experiments
|
|
|
|
|
|
| <div style="display: flex; flex-direction: column; align-items: center; gap: 10px;">
|
| <!-- First Row -->
|
| <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
|
| <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
|
| <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/arx_v2_long.mp4" type="video/mp4">
|
| </video>
|
| <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
|
| <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/arx_v1_dyna.mp4" type="video/mp4">
|
| </video>
|
| <video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
|
| <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/franka_v1_sweep.mp4" type="video/mp4">
|
| </video>
|
| </div>
|
| <!-- Second Row -->
|
| <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
|
| <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
|
| <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/genie_v2_handover.mp4" type="video/mp4">
|
| </video>
|
| <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
|
| <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/genie_v3_tea.mp4" type="video/mp4">
|
| </video>
|
| <video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
|
| <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/genie_v1_flower.mp4" type="video/mp4">
|
| </video>
|
| </div>
|
| <p><em>Diverse manipulation tasks across multiple robot platforms.</em></p>
|
| </div>
|
|
|
|
|
| ## 📊 Performance Summary
|
|
|
| | Task | Platform | F1 | π0 | Improvement |
|
| |:--------:|:------------:|:------------------:|:------------:|:---------------:|
|
| | Multi-task | Genie-1 | 82.2% | 65.2% | +17.0% |
|
| | Adaptation | Franka | 66.7% | 53.3% | +13.4% |
|
| | Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0% |
|
| | Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4% |
|
|
|
| ## Usage
|
| Please refer to our official repo [F1-VLA](https://github.com/InternRobotics/F1-VLA).
|
|
|
| ## 📝 Citation
|
|
|
| If you find our work helpful, please cite:
|
|
|
| ```bibtex
|
| @article{f1_vla_2025,
|
| title={F1: A Vision Language Action Model Bridging Understanding and Generation to Actions},
|
| author={Qi Lv and Weijie Kong and Hao Li and Jia Zeng and Zherui Qiu and Delin Qu and Haoming Song and Qizhi Chen and Xiang Deng and Jiangmiao Pang},
|
| journal={arXiv preprint arXiv:2509.06951},
|
| year={2025},
|
| url={https://arxiv.org/abs/2509.06951}
|
| }
|
| ```
|
|
|
| ## License
|
| This work is licensed under [CC BY-NC-SA 4.0](LICENSE).
|
|
|
| ## Acknowledgements
|
| This repository builds on [LeRobot](https://github.com/huggingface/lerobot), [Any4LeRobot](https://github.com/Tavish9/any4lerobot/), and [VAR](https://github.com/FoundationVision/VAR).
|
|
|