Sapiens2-5B

Sapiens2 is a family of high-resolution vision transformers pretrained on 1 billion human images, designed for human-centric tasks such as pose estimation, body-part segmentation, surface normals, and pointmaps.

This repository contains the 5B parameter pretrained backbone. It produces dense per-patch features suitable for fine-tuning downstream task heads.

Model Details

  • Developed by: Meta
  • Model type: Vision Transformer
  • License: Sapiens2 License
  • Task: pretrain
  • Format: safetensors
  • File: sapiens2_5b_pretrain.safetensors

Quick Start

First clone the Sapiens2 repo and install it in editable mode (pip install -e .), then run:

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from sapiens.backbones.standalone.sapiens2 import Sapiens2

# Build the model and load the pretrained checkpoint
model = Sapiens2(arch="sapiens2_5b", img_size=(1024, 768), patch_size=16).eval().cuda()  # img_size is (H, W)
ckpt_path = hf_hub_download(repo_id="facebook/sapiens2-pretrain-5b", filename="sapiens2_5b_pretrain.safetensors")
model.load_state_dict(load_file(ckpt_path))

# Forward pass on a single image (RGB; ImageNet normalization recommended)
x = torch.randn(1, 3, 1024, 768).cuda()
with torch.no_grad():
    features = model(x)[0]  # dense backbone features: (B, num_tokens, embed_dim)
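The snippet above feeds a random tensor; for real images, the comment recommends ImageNet normalization. A minimal preprocessing sketch using plain PyTorch, assuming standard ImageNet channel statistics (verify against the Sapiens2 repo's own transforms before fine-tuning):

```python
import torch

# Standard ImageNet channel statistics; an assumption here, not confirmed
# by this card. Check the Sapiens2 repo's preprocessing before relying on it.
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def preprocess(img: torch.Tensor) -> torch.Tensor:
    """Normalize a uint8 RGB image tensor of shape (3, H, W) to model input."""
    x = img.float() / 255.0  # scale to [0, 1]
    x = x.unsqueeze(0)       # add batch dim -> (1, 3, H, W)
    return (x - IMAGENET_MEAN) / IMAGENET_STD

# Dummy image at the pretraining resolution (H=1024, W=768)
img = torch.randint(0, 256, (3, 1024, 768), dtype=torch.uint8)
x = preprocess(img)
print(x.shape)  # torch.Size([1, 3, 1024, 768])
```

The resulting tensor can be moved to GPU with `x.cuda()` and passed to the model as in the Quick Start snippet.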

Model Card

| Field | Value |
| --- | --- |
| Architecture | Sapiens2 ViT (RoPE, GQA, SwiGLU, RMSNorm, QK-norm) |
| Parameters | 5.071 B |
| FLOPs | 15.722 T |
| Embedding dim | 2432 |
| Layers | 56 |
| Attention heads | 32 |
| Pretraining resolution | 1024 × 768 (H × W) |
| Patch size | 16 |
| Pretraining data | 1B human images |
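The token count of the dense output follows directly from the resolution and patch size: a 1024 × 768 input with 16 × 16 patches yields a 64 × 48 grid, i.e. 3072 patch tokens. A short sketch (assuming no extra class token in the output; check the actual feature shape against your model):

```python
import torch

H, W, patch = 1024, 768, 16
embed_dim = 2432

grid_h, grid_w = H // patch, W // patch  # 64 x 48 patch grid
num_tokens = grid_h * grid_w             # 3072 patch tokens
print(grid_h, grid_w, num_tokens)        # 64 48 3072

# Dense features of shape (B, num_tokens, embed_dim) can be reshaped
# back into a spatial map for dense prediction heads:
features = torch.randn(1, num_tokens, embed_dim)
fmap = features.transpose(1, 2).reshape(1, embed_dim, grid_h, grid_w)
print(fmap.shape)  # torch.Size([1, 2432, 64, 48])
```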

Sapiens2 Family

| Model | Params | FLOPs | Embed dim | Layers | Heads |
| --- | --- | --- | --- | --- | --- |
| Sapiens2-0.1B | 0.114 B | 0.342 T | 768 | 12 | 12 |
| Sapiens2-0.4B | 0.398 B | 1.260 T | 1024 | 24 | 16 |
| Sapiens2-0.8B | 0.818 B | 2.592 T | 1280 | 32 | 16 |
| Sapiens2-1B | 1.462 B | 4.715 T | 1536 | 40 | 24 |
| Sapiens2-1B-4K | 1.607 B | – | 1536 | 40 | 24 |
| Sapiens2-5B (this model) | 5.071 B | 15.722 T | 2432 | 56 | 32 |
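When scripting experiments across the family, it can help to keep these specs in code, e.g. to pick the largest variant that fits a parameter budget. A sketch with the numbers copied from the table above (Sapiens2-1B-4K is omitted because its FLOPs entry is blank; the variant keys here are local names, not repository IDs):

```python
# Sapiens2 family specs copied from the model card table
# (params in billions, FLOPs in tera). Keys are local labels only.
FAMILY = {
    "sapiens2_0.1b": dict(params_b=0.114, flops_t=0.342, embed_dim=768, layers=12, heads=12),
    "sapiens2_0.4b": dict(params_b=0.398, flops_t=1.260, embed_dim=1024, layers=24, heads=16),
    "sapiens2_0.8b": dict(params_b=0.818, flops_t=2.592, embed_dim=1280, layers=32, heads=16),
    "sapiens2_1b":   dict(params_b=1.462, flops_t=4.715, embed_dim=1536, layers=40, heads=24),
    "sapiens2_5b":   dict(params_b=5.071, flops_t=15.722, embed_dim=2432, layers=56, heads=32),
}

def largest_under(budget_b: float) -> str:
    """Return the largest variant whose parameter count fits the budget."""
    fits = {k: v for k, v in FAMILY.items() if v["params_b"] <= budget_b}
    return max(fits, key=lambda k: fits[k]["params_b"])

print(largest_under(2.0))  # sapiens2_1b
```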

See the Sapiens2 Collection for all variants and downstream task checkpoints (pose, segmentation, normals, pointmaps).

Intended Use

  • Feature extraction for human-centric downstream tasks
  • Initialization for fine-tuning task heads (pose, segmentation, normals, pointmap)
  • Research on human-centric vision
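For the feature-extraction and fine-tuning uses above, a common pattern is a lightweight head on top of the frozen backbone features. A minimal sketch with a hypothetical per-token linear head (the class count and the dummy feature tensor are illustrative stand-ins for real backbone output):

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Hypothetical per-token head over frozen backbone features."""
    def __init__(self, embed_dim: int = 2432, num_classes: int = 28):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, num_tokens, embed_dim), e.g. model(x)[0] from Quick Start
        return self.fc(self.norm(feats))

# Dummy features standing in for the backbone output at 1024 x 768 / patch 16.
feats = torch.randn(2, 3072, 2432)
head = LinearProbe()
with torch.no_grad():
    logits = head(feats)
print(logits.shape)  # torch.Size([2, 3072, 28])
```

Per-token logits like these can then be reshaped to the 64 × 48 patch grid and upsampled for dense tasks such as segmentation.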

License

Released under the Sapiens2 License.

Citation

@article{khirodkarsapiens2,
  title={Sapiens2},
  author={Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Su, Zhaoen and Saito, Shunsuke},
  journal={arXiv preprint arXiv:2604.21681},
  year={2026}
}