| --- |
| license: apache-2.0 |
| language: |
| - en |
| tags: |
| - computer-use |
| - gui-agent |
| - vision-language-model |
| - screen-understanding |
| - vla |
| datasets: |
| - TESS-Computer/tess-agentnet |
| base_model: HuggingFaceTB/SmolVLM2-500M-Instruct |
| pipeline_tag: image-text-to-text |
| --- |
| |
| # TESS-500M |
|
|
| **TESS** is a Vision-Language-Action (VLA) model for computer use, inspired by robotic VLAs. Given a screenshot and a natural-language instruction, it predicts either a mouse action (normalized click coordinates plus a click type) or a keyboard action (typed text, a key press, or a hotkey). |
|
|
| ## Model Description |
|
|
| - **Base Model**: SmolVLM2-500M-Instruct |
| - **Architecture**: SmolVLM + Router + Mouse/Keyboard heads |
| - **Parameters**: 508M total, 48M trainable |
| - **Training Data**: [tess-agentnet](https://huggingface.co/datasets/TESS-Computer/tess-agentnet) (~312K samples) |
|
|
| ## Usage |
|
|
| ```python |
| import torch |
| from PIL import Image |
| |
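| # Prerequisite (run in a shell first): clone the TESS repo and work from |
| # TESS/model so that test_checkpoint is importable: |
| #   git clone https://github.com/husseinlezzaik/TESS.git |
| #   cd TESS/model |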
| |
| from test_checkpoint import load_model, predict |
| |
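| # Load model (uses the GPU when one is available, otherwise the CPU) |
| device = "cuda" if torch.cuda.is_available() else "cpu" |
| model, processor = load_model("path/to/checkpoint.pt", device=device) |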
| |
| # Run inference |
| image = Image.open("screenshot.png") |
| result = predict(model, processor, image, "Click the search button") |
| |
| print(result) |
| # Mouse action: {'action_type': 'mouse', 'xy': array([0.45, 0.32]), 'click_type': 'LEFT_CLICK'} |
| # Keyboard action: {'action_type': 'keyboard', 'action': 'type', 'value': 'hello world'} |
| ``` |
|
|
| ## Output Format |
|
|
| **Mouse actions:** |
| ```python |
| { |
| 'action_type': 'mouse', |
| 'xy': [x, y], # Normalized coordinates (0-1) |
| 'click_type': 'LEFT_CLICK' | 'RIGHT_CLICK' | 'DOUBLE_CLICK' | ... |
| } |
| ``` |
|
|
| **Keyboard actions:** |
| ```python |
| { |
| 'action_type': 'keyboard', |
| 'action': 'type' | 'press' | 'hotkey', |
| 'value': 'text to type' | '<ENTER>' | '<SUPER+C>' |
| } |
| ``` |
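The normalized coordinates and bracketed key strings above usually need a small amount of post-processing before they can drive a real screen. The following is a minimal sketch, not part of the TESS repo: `to_pixels`, `parse_keys`, and `describe` are hypothetical helper names, and the assumption that hotkey parts are joined with `+` inside angle brackets is taken only from the `'<SUPER+C>'` example.

```python
def to_pixels(xy, width, height):
    """Map normalized (0-1) coordinates to integer pixel coordinates."""
    # Clamp first so slightly out-of-range predictions stay on screen.
    x = min(max(float(xy[0]), 0.0), 1.0)
    y = min(max(float(xy[1]), 0.0), 1.0)
    return min(round(x * width), width - 1), min(round(y * height), height - 1)


def parse_keys(value):
    """Split a bracketed key string such as '<SUPER+C>' into key names."""
    return value.strip("<>").split("+")


def describe(result, width, height):
    """Turn a prediction dict into a short human-readable description."""
    if result["action_type"] == "mouse":
        px, py = to_pixels(result["xy"], width, height)
        return f"{result['click_type']} at ({px}, {py})"
    if result["action"] == "type":
        return f"type {result['value']!r}"
    # 'press' and 'hotkey' both carry bracketed key names.
    return f"{result['action']} {' + '.join(parse_keys(result['value']))}"


mouse = {"action_type": "mouse", "xy": [0.45, 0.32], "click_type": "LEFT_CLICK"}
print(describe(mouse, 1920, 1080))
```

Passing the screenshot's own width and height keeps the pixel coordinates aligned with the image the model actually saw.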
|
|
| ## Architecture |
|
|
| ``` |
| Screenshot + Instruction → SmolVLM2 → Shared MLP → Router |
|                                                      │ |
|                                      ┌───────────────┴───────────────┐ |
|                                      │                               │ |
|                                Mouse Branch                   Keyboard Branch |
|                             (XY + Click heads)             (VLM text generation) |
| ``` |
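To make the diagram concrete, here is a rough PyTorch sketch of how a shared MLP, a router, and the mouse heads could sit on top of a pooled VLM feature vector. It is an illustration under assumptions, not the actual TESS implementation: the class name `RouterHeads`, the layer sizes (`hidden_dim=960`, `mlp_dim=512`), the number of click types, and the use of a sigmoid to keep coordinates in the 0-1 range are all hypothetical. The keyboard branch is not modeled here, since the diagram indicates it reuses the VLM's own text generation.

```python
import torch
from torch import nn


class RouterHeads(nn.Module):
    """Hypothetical router and mouse heads over a pooled VLM feature."""

    def __init__(self, hidden_dim=960, mlp_dim=512, num_click_types=4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(hidden_dim, mlp_dim), nn.GELU())
        self.router = nn.Linear(mlp_dim, 2)  # index 0 = mouse, 1 = keyboard
        self.xy_head = nn.Linear(mlp_dim, 2)  # normalized (x, y)
        self.click_head = nn.Linear(mlp_dim, num_click_types)

    def forward(self, pooled):
        h = self.shared(pooled)
        return {
            "route_logits": self.router(h),
            "xy": torch.sigmoid(self.xy_head(h)),  # squashed into (0, 1)
            "click_logits": self.click_head(h),
        }


heads = RouterHeads()
out = heads(torch.randn(1, 960))  # random stand-in for a VLM feature
use_mouse = out["route_logits"].argmax(dim=-1).item() == 0
```

When `use_mouse` is false, a caller would fall through to text generation with the VLM instead of reading the `xy` and `click_logits` outputs.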
|
|
| ## Training |
|
|
| - **Epochs**: 3 |
| - **Batch Size**: 48 |
| - **Optimizer**: AdamW (learning rate 2e-4 for the heads, 5e-4 for the embeddings) |
| - **Hardware**: NVIDIA H100 80GB |
| - **Training Time**: ~8 hours |
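Two learning rates under one optimizer are typically expressed in PyTorch as parameter groups. The sketch below shows that pattern with the rates listed above. The two modules are small stand-ins with made-up sizes, not the real TESS heads or embeddings, and other AdamW settings (weight decay, betas, schedule) are left at their defaults because the list above does not state them.

```python
import torch
from torch import nn

# Stand-in modules; the real trainable components live in the TESS repo.
heads = nn.Linear(512, 2)
embeddings = nn.Embedding(8, 960)

# One AdamW instance, two parameter groups with different learning rates.
optimizer = torch.optim.AdamW(
    [
        {"params": heads.parameters(), "lr": 2e-4},
        {"params": embeddings.parameters(), "lr": 5e-4},
    ]
)
```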
|
|
| ## Limitations |
|
|
| - Trained primarily on desktop/web screenshots |
| - English instructions only |
| - May struggle with unusual UI layouts not seen in training |
|
|
| ## License |
|
|
| Apache 2.0 |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{tess2025, |
| title={TESS: A Vision-Language-Action Model for Computer Use}, |
| author={Hussein Lezzaik}, |
| year={2025}, |
| url={https://github.com/husseinlezzaik/TESS} |
| } |
| ``` |
|
|