Instructions for using AXCXEPT/Qwen3-EZO-8B-beta with libraries and local apps.

Libraries

How to use AXCXEPT/Qwen3-EZO-8B-beta with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="AXCXEPT/Qwen3-EZO-8B-beta")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
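
The pipeline call above returns its result rather than printing it. As a minimal sketch, assuming the chat-style output that recent Transformers versions produce for message-list input (a list with one dict whose `generated_text` holds the whole conversation), the reply can be pulled out as follows. The helper name `last_assistant_reply` is hypothetical, and the sample structure is a hand-written stand-in, not real model output.

```python
def last_assistant_reply(result):
    """Return the content of the last assistant message in a chat pipeline result."""
    conversation = result[0]["generated_text"]
    # Plain-string prompts yield a string here instead of a message list.
    if isinstance(conversation, str):
        return conversation
    for message in reversed(conversation):
        if message.get("role") == "assistant":
            return message["content"]
    raise ValueError("no assistant message found in pipeline output")


# Hand-written stand-in for what `pipe(messages)` might return.
sample_result = [
    {
        "generated_text": [
            {"role": "user", "content": "Who are you?"},
            {"role": "assistant", "content": "(model reply goes here)"},
        ]
    }
]
print(last_assistant_reply(sample_result))
```

With a loaded model, the same helper would be called as `last_assistant_reply(pipe(messages))`.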

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("AXCXEPT/Qwen3-EZO-8B-beta")
model = AutoModelForCausalLM.from_pretrained("AXCXEPT/Qwen3-EZO-8B-beta", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
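
Qwen3-family chat models may wrap their reasoning in `<think>...</think>` tags before the final answer, and 40 new tokens can be too few to get past that block. The following is a minimal sketch, assuming this model keeps the base model's tag format, for separating the two parts of decoded text. `split_thinking` is a hypothetical helper, not part of Transformers.

```python
import re

_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(decoded):
    """Split decoded text into (reasoning, answer)."""
    match = _THINK_BLOCK.search(decoded)
    if match is None:
        # No complete think block (absent or cut off): treat it all as the answer.
        return "", decoded.strip()
    reasoning = match.group(1).strip()
    answer = decoded[match.end():].strip()
    return reasoning, answer


reasoning, answer = split_thinking("<think>\nShort plan.\n</think>\n\nHello!")
print(repr(reasoning), repr(answer))
```

If the answer comes back empty or the reasoning is cut off, raising `max_new_tokens` in `model.generate` is the first thing to try.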

Local Apps

vLLM

How to use AXCXEPT/Qwen3-EZO-8B-beta with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AXCXEPT/Qwen3-EZO-8B-beta"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AXCXEPT/Qwen3-EZO-8B-beta",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
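
The same request can be made from Python without extra dependencies. This is a minimal sketch using only the standard library; `build_chat_request` is a hypothetical helper, and the base URL assumes the default port used in the curl example. The request is only built here, since sending it requires the server to be running.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_content):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",
    "AXCXEPT/Qwen3-EZO-8B-beta",
    "What is the capital of France?",
)

# Sending requires the server from the previous step to be running:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
#     print(body["choices"][0]["message"]["content"])
```

Passing a different `base_url` (for example one ending in port 30000) would target any other OpenAI-compatible server in the same way.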

Use Docker

docker model run hf.co/AXCXEPT/Qwen3-EZO-8B-beta

SGLang

How to use AXCXEPT/Qwen3-EZO-8B-beta with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AXCXEPT/Qwen3-EZO-8B-beta" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AXCXEPT/Qwen3-EZO-8B-beta",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AXCXEPT/Qwen3-EZO-8B-beta" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AXCXEPT/Qwen3-EZO-8B-beta",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use AXCXEPT/Qwen3-EZO-8B-beta with Docker Model Runner:
```
docker model run hf.co/AXCXEPT/Qwen3-EZO-8B-beta
```
Quantized versions are available for use with llama.cpp, Ollama, LM Studio, or any compatible app.

Model Card for Qwen3-EZO-8B-beta

We are releasing Qwen3-EZO-8B-beta, an 8B-parameter LLM based on Qwen3-8B.

Although its size places it in the small language model (SLM) class, it performs comparably to Gemini 2.5 Flash and GPT-4o on multi-turn tasks. It significantly improves on the original Qwen3-8B, scoring 9.08 on MT-Bench and 8.87 on JMT-Bench.

It supports parallel processing of deep-thinking prompts using our Deep-Think technique and is compatible with the OpenAI API via vLLM deployment.
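
The Deep-Think technique itself is not described here, so the following is only a sketch of the general idea of issuing several prompts concurrently against an OpenAI-compatible endpoint. It is not the authors' method. The `ask` callable is a hypothetical stand-in for a real request function (for instance one wrapping `client.chat.completions.create`), and a stub is used so the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor


def ask_all(ask, prompts, max_workers=4):
    """Send every prompt through `ask` concurrently; results keep prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order even if calls finish out of order.
        return list(pool.map(ask, prompts))


def stub_ask(prompt):
    # Hypothetical stand-in for a network call to the served model.
    return f"answer to: {prompt}"


answers = ask_all(stub_ask, ["prompt A", "prompt B", "prompt C"])
print(answers)
```

Threads suit this case because each call spends most of its time waiting on the server, which batches the concurrent requests on its side.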

Although it was initially planned as a closed model for API-based access, we have decided to release it as an open model in light of our new policy to monetize only after further accuracy improvements.

Benchmark

Scores are based on repeated evaluations of frequently produced outputs at temperatures 0.2 and 0.6, conducted on May 13, 2025, with GPT-4o and Gemini 2.5 Flash as judges. All tests were run internally on a single A40 GPU. Results may differ under external or official benchmark conditions.

How to use:

Runs on a single A40 GPU.

vllm serve AXCXEPT/Qwen3-EZO-8b-beta --enable-reasoning --reasoning-parser deepseek_r1

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

prompt = r"""Every morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s+2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop."""
completion = client.chat.completions.create(
  model="AXCXEPT/Qwen3-EZO-8b-beta",
  messages=[
    {"role": "user", "content": prompt}
  ]
)

print(completion.choices[0].message)
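
With a reasoning parser enabled, vLLM typically returns the model's reasoning in a separate field of the message rather than inside `content`. The field name has varied across vLLM versions, so this sketch checks `reasoning_content` and `reasoning` in turn; treat both names as assumptions to verify against the installed version. `split_message` is a hypothetical helper, and a `SimpleNamespace` stands in for the real message object.

```python
from types import SimpleNamespace


def split_message(message):
    """Return (reasoning, answer) from a chat completion message object."""
    reasoning = None
    # Assumed field names; check which one your vLLM version emits.
    for field in ("reasoning_content", "reasoning"):
        reasoning = getattr(message, field, None)
        if reasoning:
            break
    return reasoning or "", message.content or ""


# Stand-in for `completion.choices[0].message`; values are placeholders.
fake_message = SimpleNamespace(
    content="(final answer)",
    reasoning_content="(reasoning steps)",
)
reasoning, answer = split_message(fake_message)
print("reasoning:", reasoning)
print("answer:", answer)
```

Against a live server the call would be `split_message(completion.choices[0].message)`.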

Special Thanks

We would like to express our sincere respect and appreciation to Alibaba Cloud and the Qwen development team for their work in creating the base model for this project.


Safetensors

Model size: 8B params
Tensor type: BF16

Model tree for AXCXEPT/Qwen3-EZO-8B-beta

Base model: Qwen/Qwen3-8B-Base
Finetuned: Qwen/Qwen3-8B
Finetuned (1971): this model
Merges: 14 models
Quantizations: 5 models