Any Plans for an Instruct Model?

#15

by Ashacorporation - opened Feb 15

Discussion

Ashacorporation

Feb 15

This is a very capable reasoning model. I only did some light fine-tuning, but the performance improvement for my use case has been significant. In my experience, it feels comparable to GPT-OSS 20B (medium setting). Really impressive work.

By the way, is there an instruct version available? If it performs this “magically” in instruct mode, it could potentially become a strong alternative to GPT-4.1 nano or Gemini 2.5 Flash-Lite.

phi0112358

Feb 16

I wouldn't expect an Instruct version to perform as well, since the extended test-time computation (spending a lot of thinking tokens) is very likely the secret sauce behind the model's performance.

leran1995

Nanbeige LLM Lab org Feb 16

Thanks for the feedback — really glad it’s working well for you!

Yes, Nanbeige4.2 will include an instruct version. We’re also working on making the model smart without excessive thinking tokens.

Why-T

Feb 16

You got us hooked now, can't wait for the release of the 4.2 version. Could you please provide an ETA or an approximation of when it MIGHT release?

cob05

Feb 17

It is a really exciting model! I am honestly surprised by how capable it is for such a small model. It really is comparable to MUCH larger models and if you get the insane amount of overthinking under control, it could be a true challenger for edge applications. Really great job! Congrats to you and your team.

akumaburn

Feb 20

Have you tried Chain of Draft? https://arxiv.org/abs/2502.18600

This came out a while ago; apparently these models don't actually need full reasoning chains to improve their performance, but for some reason it fell out of favor.
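
For readers who have not seen the idea, here is a minimal sketch of what a Chain of Draft style prompt might look like. The system prompt wording is a paraphrase of the instruction described in the paper, and the helper names (`build_cod_messages`, `extract_final_answer`) and the `####` separator handling are illustrative assumptions, not something the Nanbeige team has said they use.

```python
# Sketch of Chain of Draft (CoD) prompting: ask the model for terse
# per-step drafts instead of a full reasoning chain.
# Assumption: the prompt wording below is paraphrased, not an official template.
COD_SYSTEM_PROMPT = (
    "Think step by step, but only keep a minimum draft for each thinking "
    "step, with 5 words at most. Return the answer at the end of the "
    "response after a separator ####."
)


def build_cod_messages(question: str) -> list[dict[str, str]]:
    """Build a role/content message list with the CoD instruction as system prompt."""
    return [
        {"role": "system", "content": COD_SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]


def extract_final_answer(response: str, separator: str = "####") -> str:
    """Return the text after the last separator, or the whole response if it is absent."""
    _, found, tail = response.rpartition(separator)
    return tail.strip() if found else response.strip()


# Hypothetical usage: build the prompt, then parse a made-up model reply.
messages = build_cod_messages("What is 2 + 3?")
answer = extract_final_answer("2 + 3 = 5\n#### 5")
```

The `messages` list uses the usual role/content chat format, so it could be passed to a chat template or an OpenAI-compatible endpoint. Whether a given reasoning model actually follows the draft-length instruction, or honors a system prompt at all, would need to be checked empirically.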

Narutoouz

Mar 6 • edited Mar 6

Wow, this research paper is superb. They solved efficiency without compromising accuracy, even slightly increasing it. Why are none of the big labs using this? Nanbeige 4.2, Qwen 4B, OLMo, Falcon, Jan code, or whoever does this will definitely get lots of attention and downloads.
