Instructions to use moonshotai/Kimi-VL-A3B-Thinking with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use moonshotai/Kimi-VL-A3B-Thinking with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="moonshotai/Kimi-VL-A3B-Thinking", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("moonshotai/Kimi-VL-A3B-Thinking", trust_remote_code=True, dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use moonshotai/Kimi-VL-A3B-Thinking with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "moonshotai/Kimi-VL-A3B-Thinking"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-VL-A3B-Thinking",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/moonshotai/Kimi-VL-A3B-Thinking

SGLang

How to use moonshotai/Kimi-VL-A3B-Thinking with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "moonshotai/Kimi-VL-A3B-Thinking" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-VL-A3B-Thinking",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "moonshotai/Kimi-VL-A3B-Thinking" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-VL-A3B-Thinking",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use moonshotai/Kimi-VL-A3B-Thinking with Docker Model Runner:
```
docker model run hf.co/moonshotai/Kimi-VL-A3B-Thinking
```

gguf?

by AlgorithmicKing - opened Apr 10, 2025

Discussion

AlgorithmicKing

Apr 10, 2025

please

angsuman

Apr 15, 2025

Unable to run even with 4090 and 3090, running out of memory.

AlgorithmicKing

Apr 16, 2025

Unable to run even with 4090 and 3090, running out of memory.

a 16b model running out of memory on a 24gb 4090? that's interesting, well I run models on the cloud (because I have a 3060 6gb locally) so I can't complain.

angsuman

Apr 16, 2025

It's MoE model. I am unable to even run the sample code provided and both GPU's are close to maxing out in terms of RAM. 4090 has less than 1GB left and it asks for 1 Gb.

CHNtentes

Apr 16, 2025

Unable to run even with 4090 and 3090, running out of memory.

That's expected... the model files in total already > 24GB... you need quantized version

evilperson068

Apr 16, 2025

Unable to run even with 4090 and 3090, running out of memory.

That's expected... the model files in total already > 24GB... you need quantized version

My dual-4090 rig also went poooooff, OOM. Any ideas? xD

Utochi

Apr 20, 2025

Yes, id like GGUF. id like to test out this model

Lynnspyre

Apr 26, 2025

Same if you got it. The safetensors compiling or assembling thing is prone to failure, and I'm not sure if I'm getting it right.

KeilahElla

Jun 21, 2025

where is the GGUF?

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment