THIS IS A WIP (Work In Progress)
DeepSeek V4 Flash · GGUF
GGUF quantizations of deepseek-ai/DeepSeek-V4-Flash for use with the V4-aware llama.cpp fork at cchuter/llama.cpp @ feat/v4-port-cuda.
📦 Required: V4-aware llama.cpp fork. These quants don't load on upstream `ggml-org/llama.cpp`; V4 architecture support (compressor decode, hyperconnection, lightning indexer, FP8 KV simulation, NextN heads) lives only in the fork: `git clone -b feat/v4-port-cuda https://github.com/cchuter/llama.cpp`. Full build + run instructions in Loading below.
🖥️ Supported backends: Apple Silicon (Metal), NVIDIA CUDA (Ada/Blackwell), and CPU. All 5 V4 custom ops (`ggml_dsv4_rope_tail`, `ggml_dsv4_hc_split_sinkhorn`, `ggml_dsv4_hc_weighted_sum`, `ggml_dsv4_hc_expand`, `ggml_dsv4_fp8_kv_quantize`) have Metal kernels AND CUDA kernels in this fork (validated 19/19 on RTX 5090, CUDA 12.8, SM_120 native). The CUDA FP8 path is gated behind `__CUDA_ARCH__ >= 890`; older NVIDIA hardware (Volta/Turing/Ampere) uses a software-emulated FP8 path that builds cleanly under `-DCMAKE_CUDA_ARCHITECTURES=70` but hasn't been runtime-validated yet. CUDA testers wanted: file issues at the fork if you hit problems. ROCm / Vulkan / Metal-on-AMD have no V4 kernels and will fail at the first dsv4 op.
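To make the FP8 gating above concrete, here is a minimal Python sketch of the same decision. The function name `fp8_kv_path` is hypothetical and not part of the fork; the only fact it encodes is the `__CUDA_ARCH__ >= 890` gate quoted above, relying on `__CUDA_ARCH__` being the SM number times ten.

```python
def fp8_kv_path(compute_capability: int) -> str:
    """Return which FP8 path a CUDA build would take for a given SM number.

    compute_capability is the SM number as used in CMAKE_CUDA_ARCHITECTURES,
    e.g. 70, 86, 89, 90 or 120. Hypothetical helper, for illustration only.
    """
    # __CUDA_ARCH__ is the SM number times ten, so ">= 890" means sm_89 or newer.
    if compute_capability * 10 >= 890:
        return "native"
    # Older architectures take the software-emulated path, which the note
    # above says builds cleanly but has not been runtime-validated yet.
    return "software-emulated"


for sm in (70, 86, 89, 120):
    print(f"sm_{sm}: {fp8_kv_path(sm)}")
```

This only mirrors the compile-time gate; it says nothing about whether a particular card has actually been tested.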
Available quants
| Quant | Size | BPW | Decode (M3 Ultra) | gate-tools | Notes |
|---|---|---|---|---|---|
| Q8_0 | ~282 GiB (7 shards) | 8.50 | 21.69 t/s | ✓ pass | Reference. Full-fidelity baseline. |
| Q4_K_M-XL | ~163 GiB (4 shards) | 4.92 | 22.85 t/s | ✓ pass | Recommended. K-quant body, non-expert tensors and embedding/output pinned at Q8_0. Matches Q8 on tool calling at half the size. |
| Q2_K-XL | ~100 GiB (3 shards) | 3.01 | 23.38 t/s | ✓ pass | Smaller-footprint K-quant alternative to Q4_K_M-XL with the same XL pin recipe. |
| IQ2_XS-XL | ~81 GiB (2 shards) | 2.45 | 23.73 t/s | ✓ pass † | IQ2 body with XL pins. |
| IQ2_XXS-XL | ~73 GiB (2 shards) | 2.21 | 23.75 t/s | ✓ pass † | IQ2 body with XL pins. |
| IQ1_M-XL | ~63 GiB (2 shards) | 1.91 | 23.29 t/s | ✓ pass † | IQ1_M body with XL pins. |
| IQ1_M | ~60 GiB (2 shards) | 1.81 | 15.15 t/s | ✓ pass † | IQ1_M without XL pins. Below the 16 t/s decode floor on M3 Ultra; use the -XL variant unless disk is tight. |
| IQ1_S-XL | ~57 GiB (2 shards) | 1.73 | 23.28 t/s | ✓ pass † | IQ1_S body with XL pins. Smallest variant clearing the decode floor. |
| imatrix/imatrix-v4-flash.dat | ~449 MiB | n/a | n/a | n/a | wikitext-103 1000-chunk imatrix calibration produced by v4-port-I-imatrix. Reproducibility seed for downstream IQ-class builds. |
| imatrix/dsml.jinja | ~5 KiB | n/a | n/a | n/a | DSML chat template, also baked into every GGUF in this repo. Published here for reference and downstream tooling. |
† All quants in this repo ship with the DSML chat template baked into the GGUF metadata, so llama-server --jinja does the right thing without any extra flags. The imatrix/dsml.jinja file is also published in this repo for reference and downstream tooling.
-XL suffix means non-expert tensors (output_tensor, token_embd, attention projections, attention compressors, hyper-connection mixers, lightning indexer, NextN heads) are pinned at Q8_0; only the routed and shared experts use the named quant body. Without that pinning, IQ-class quants fall below the 16 t/s decode floor on M3 Ultra.
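As a sanity check on the Size and BPW columns, the sketch below scales the Q8_0 reference row linearly by bits per weight. It assumes file size is proportional to BPW and that GGUF metadata overhead is negligible; `estimate_size_gib` is a hypothetical helper, not something shipped in this repo.

```python
# Reference row from the table above: Q8_0 is ~282 GiB at 8.50 bits per weight.
REF_SIZE_GIB = 282.0
REF_BPW = 8.50

# BPW values copied from the "Available quants" table.
TABLE_BPW = {
    "Q4_K_M-XL": 4.92,
    "Q2_K-XL": 3.01,
    "IQ2_XS-XL": 2.45,
    "IQ2_XXS-XL": 2.21,
    "IQ1_M-XL": 1.91,
    "IQ1_M": 1.81,
    "IQ1_S-XL": 1.73,
}


def estimate_size_gib(bpw: float) -> float:
    """Scale the Q8_0 reference size linearly by bits per weight (assumption)."""
    return REF_SIZE_GIB * bpw / REF_BPW


for name, bpw in TABLE_BPW.items():
    print(f"{name}: ~{estimate_size_gib(bpw):.0f} GiB")
```

Under that assumption the estimates round to the sizes listed in the table, which makes the same formula a reasonable first guess for a BPW that isn't published here.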
Recommended use by quant
| Use case | Recommended | Notes |
|---|---|---|
| General agent / Claude Code workloads | Q4_K_M-XL | Top decode speed at 4-bit body, full tool-calling support, half the disk of Q8 |
| Reference / "is this a quant artifact?" debugging | Q8_0 | Full-fidelity baseline |
| Smaller VRAM / disk budget | Q2_K-XL | Same XL recipe at lower BPW |
| Maximum throughput, tighter VRAM | IQ2_XS-XL | Fastest IQ-class quant |
All quants in this repo ship with V4's DSML chat template baked in, so llama-server --jinja does the right thing without any extra flags — no --chat-template-file needed. Tool calls return as proper tool_calls JSON in the response object.
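For readers wiring this into an agent, here is a hedged sketch of sending a tool-enabled request and reading `tool_calls` back, using only the Python standard library. It assumes `llama-server` is listening on its default `http://localhost:8080` and that the response follows the OpenAI chat completions schema (arguments delivered as a JSON-encoded string). The `get_weather` tool and all function names are hypothetical.

```python
import json
import urllib.request


def build_tool_request(prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload declaring one tool."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, illustration only
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from an OpenAI-style response object."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    # In the OpenAI schema, arguments arrive as a JSON-encoded string.
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]


def ask(prompt: str, base_url: str = "http://localhost:8080") -> list:
    """POST the request to a running server and return any tool calls."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_tool_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return extract_tool_calls(json.load(resp))
```

With the server running, `ask("What's the weather in Paris?")` would return a list of `(name, arguments)` pairs, or an empty list if the model answered in plain text.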
Loading
# Clone the V4-aware fork
git clone -b feat/v4-port-cuda https://github.com/cchuter/llama.cpp
cd llama.cpp
# Build for Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON && cmake --build build -j
# OR build for NVIDIA CUDA. Pick your GPU's compute capability:
# sm_70 V100 | sm_75 T4 | sm_80 A100 | sm_86 RTX 3090/3080
# sm_89 RTX 4090/6000 Ada/L40 | sm_90 H100/H200 | sm_120 RTX 5090/5080
# (List multiple if you ship to mixed hardware, e.g. "86;89".)
# FP8 native path needs SM_89+ (Ada/Hopper/Blackwell) AND CUDA toolkit >= 11.8;
# older arches use the software-emulated FP8 path automatically. SM_120 native
# additionally needs toolkit >= 12.8 (older toolkits fall back to PTX JIT).
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES="<your-sm>" && cmake --build build -j
# Multi-GPU CUDA (2+ devices): pass the SCHED flag to BOTH compiler groups so
# the macro propagates to .cu translation units. CXX-only is silently no-op
# on the CUDA side. V4's dense per-layer inputs (hyperconnection + indexer +
# multiple KV caches) exceed the upstream scheduler default of 30 at
# multi-device split boundaries. Cost: ~200 MB extra scheduler memory; only
# needed on multi-GPU. Single-GPU runs do not need this flag.
# cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release \
# -DCMAKE_CUDA_ARCHITECTURES="<your-sm>" \
# -DCMAKE_CXX_FLAGS=-DGGML_SCHED_MAX_SPLIT_INPUTS=128 \
# -DCMAKE_CUDA_FLAGS=-DGGML_SCHED_MAX_SPLIT_INPUTS=128
# Download the recommended Q4_K_M-XL shards
hf download teamblobfish/DeepSeek-V4-Flash-GGUF \
--include "Q4_K_M-XL/*" \
--local-dir ~/models/DeepSeek-V4-Flash-GGUF
# Run the server (point at the first shard; llama.cpp auto-loads the rest)
./build/bin/llama-server \
--model ~/models/DeepSeek-V4-Flash-GGUF/Q4_K_M-XL/DeepSeek-V4-Flash-Q4_K_M-XL-00001-of-00004.gguf \
--jinja \
--reasoning off \
--ctx-size 393216 \
--n-gpu-layers 999 \
--flash-attn on \
--no-repack \
--temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0
Sampling values match the model card recommendation (temperature=1.0, top_p=1.0); --reasoning off is the cleanest baseline for agent workloads.
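Because the server is pointed at the first shard and loads the rest itself, a partial download only surfaces at load time. The following sketch checks for missing shards beforehand. It assumes the `-00001-of-0000N.gguf` naming seen in the command above; `expected_shards` and `missing_shards` are hypothetical helpers.

```python
import re
from pathlib import Path

# Matches names like DeepSeek-V4-Flash-Q4_K_M-XL-00001-of-00004.gguf
SHARD_RE = re.compile(r"^(?P<stem>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.gguf$")


def expected_shards(first_shard: str) -> list:
    """List every shard path implied by the first shard's file name."""
    path = Path(first_shard).expanduser()
    match = SHARD_RE.match(path.name)
    if match is None:
        raise ValueError(f"not a sharded GGUF name: {path.name}")
    total = int(match["total"])
    stem = match["stem"]
    return [
        path.with_name(f"{stem}-{i:05d}-of-{total:05d}.gguf")
        for i in range(1, total + 1)
    ]


def missing_shards(first_shard: str) -> list:
    """Return the shard paths that are not present on disk."""
    return [p for p in expected_shards(first_shard) if not p.exists()]
```

Calling `missing_shards()` with the `--model` path from the command above returns an empty list once all four Q4_K_M-XL shards are in place.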
Multi-GPU CUDA (work in progress)
⚠️ Status: WIP. Multi-GPU CUDA via `--split-mode layer` (the default) is working end-to-end and validated on 2× RTX 6000 Ada (sm_89, 96 GB total) at the speeds in the table below, with an external tester also reporting it working on 8× A100. Tensor-parallel (`--split-mode row`) is implemented but currently slower than layer split for V4 decode and not recommended yet. Expect quirks; please file issues at the fork.
Recommended config for fastest t/s on multi-GPU:
./build/bin/llama-server \
--model ~/models/DeepSeek-V4-Flash-GGUF/<quant>/<first-shard>.gguf \
--jinja --reasoning off \
--ctx-size 8192 \
--n-gpu-layers 999 \
--split-mode layer \
--flash-attn on \
--no-repack \
--temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0
Pick a quant that fits your combined VRAM (e.g. IQ2_XS-XL at 81 GiB fully fits 96 GiB across 2× 48 GB). If the quant doesn't fully fit, add -cmoe -ub 128 to offload routed experts to CPU — fits much larger quants at a generation-speed cost.
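The selection rule above can be sketched as a small lookup over the published sizes. The 8 GiB headroom for KV cache and compute buffers is an assumption, not a measured figure, and the sketch treats the card's quoted memory as GiB; `pick_quant` is a hypothetical helper.

```python
from typing import Optional

# Sizes in GiB copied from the "Available quants" table.
QUANT_SIZES_GIB = {
    "Q8_0": 282,
    "Q4_K_M-XL": 163,
    "Q2_K-XL": 100,
    "IQ2_XS-XL": 81,
    "IQ2_XXS-XL": 73,
    "IQ1_M-XL": 63,
    "IQ1_M": 60,
    "IQ1_S-XL": 57,
}


def pick_quant(total_vram_gib: float, headroom_gib: float = 8.0) -> Optional[str]:
    """Return the largest quant whose weights fit in VRAM, or None.

    headroom_gib reserves space for KV cache and buffers (assumed value).
    """
    budget = total_vram_gib - headroom_gib
    fitting = {name: size for name, size in QUANT_SIZES_GIB.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


choice = pick_quant(96)  # e.g. 2x 48 GB cards
print(choice or "nothing fits fully; consider -cmoe -ub 128")
```

When the function returns `None`, that is the case where offloading routed experts with `-cmoe -ub 128` applies.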
Validated speeds (IQ2_XS-XL, 2× RTX 6000 Ada):
| Config | Prompt eval | Generation |
|---|---|---|
| `-ngl 999 --flash-attn on` (full VRAM, layer split) | 35.9 t/s | 19.4 t/s |
| `-ngl 999 -cmoe -ub 128 --flash-attn on` (single GPU, experts on CPU) | 18.3 t/s | 11.8 t/s |
| `-ngl 999 --flash-attn on --split-mode row` (tensor parallel, WIP) | n/a | ≤9.7 t/s |
Why the -XL recipe (and why no vanilla Q4_K_M)
V4 decode is compute-bound on the indexer / sinkhorn / expert-routing kernels — not on memory bandwidth. That makes the choice of dequant codepath matter as much as the bit-count: Q8_0's int8 × per-block-scale unpack is dramatically simpler than Q4_K_M's super-block path, so on this hardware Q8_0 actually decoded faster than vanilla Q4_K_M in our earlier benchmarks (write-up).
The -XL recipe published here threads that needle: leave the discrimination-critical non-expert tensors at Q8_0 (so attention, embedding, output, etc. all use the fast dequant path) and only compress the routed and shared experts. The result is the best of both: Q4_K_M-XL takes roughly half the disk of Q8_0 (~163 vs ~282 GiB) with essentially identical decode speed (22.85 vs 21.69 t/s), because the experts barely touch the hot decode path while the bandwidth-heavy non-expert tensors stay on the fast codepath. The same trick applies to all the IQ-class -XL variants in the table above.
We don't publish vanilla Q4_K_M (no XL pins) — it would be both larger and slower than Q4_K_M-XL on this hardware.
Quirks worth knowing
- `--cache-type-k|v q8_0` is silently overridden to f16 on V4. V4's K is already FP8-quantized at write time, so q8_0's per-block stationarity assumption breaks. The fork emits a `LLAMA_LOG_WARN` on first override.
- `llama-imatrix` originally segfaulted on V4 during activation collection. Fixed in `v4-port-I-imatrix`; the calibration data published alongside these quants (`imatrix/imatrix-v4-flash.dat`) was produced by the patched binary.
- `--no-repack` is required for V4 quants in CPU mode on hosts smaller than ~600 GiB RAM. The repack codepath in `ggml/src/ggml-cpu/repack.cpp` doesn't release the source mmap, so V4's 282-GiB Q8 source needs ~575 GiB peak RAM at load without the flag. The fork's gates pass `--no-repack` by default.
- Validation gates: `tests/v4-port/run-all-gates.sh` in the fork. Each row in the Available quants table documents the result of that gate suite at the listed BPW.
Provenance
- Source: `deepseek-ai/DeepSeek-V4-Flash` HF safetensors (FP8 e4m3 weights, FP4 routed experts).
- Q8_0: built via `convert_hf_to_gguf.py --outtype q8_0 --deepseek4-expert-outtypes q8_0` (M3 Ultra, ~30–60 min wall), split into 50 GiB shards with `llama-gguf-split`.
- bf16-experts-Q8 staging GGUF (not published): built via `convert_hf_to_gguf.py --outtype bf16 --deepseek4-expert-outtypes q8_0`. Used as the source for IQ1/IQ2/Q2_K-XL/Q4_K_M-XL builds below to preserve `embed.weight` and `output.weight` BF16 source precision (other discrimination-critical tensors are FP8-native in the source so Q8 staging is essentially lossless for them).
- IQ1/IQ2/Q2_K-XL/Q4_K_M-XL builds: produced via `llama-quantize --imatrix imatrix-v4-flash.dat` with the `v4-port` fork's V4-tensor pin recipe (`output_hc`, `attn_compressor`, `attn_q_a/b`, `attn_kv`, `attn_output_a/b`, `hc_attn`, `hc_ffn`, `indexer`, `nextn` all at Q8_0 in `-XL` variants).
- imatrix: wikitext-103 test split, 1000 chunks, ~1M tokens. Per-class layer coverage verified by `tests/v4-port/gate-imatrix.sh`.
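To illustrate the pin recipe in the list above, here is a rough Python sketch of the per-tensor decision. It is not the fork's implementation: matching by substring is an assumption, the shorthand `attn_q_a/b` and `attn_output_a/b` is expanded into separate entries, and the example tensor names are hypothetical.

```python
# Pinned classes taken from the provenance list above; "a/b" shorthand expanded.
PINNED_CLASSES = (
    "output_hc",
    "attn_compressor",
    "attn_q_a",
    "attn_q_b",
    "attn_kv",
    "attn_output_a",
    "attn_output_b",
    "hc_attn",
    "hc_ffn",
    "indexer",
    "nextn",
)


def target_type(tensor_name: str, body_quant: str) -> str:
    """Pick a quant type for one tensor in an -XL build.

    Substring matching is an assumption made for illustration; the fork may
    match tensor names differently.
    """
    if any(cls in tensor_name for cls in PINNED_CLASSES):
        return "Q8_0"
    return body_quant


# Hypothetical tensor names, not read from a real GGUF.
for name in ("blk.0.attn_kv.weight", "blk.0.ffn_down_exps.weight"):
    print(name, "->", target_type(name, "IQ2_XS"))
```

The point is only the shape of the rule: anything in a pinned class stays at Q8_0, and everything else (the routed and shared experts) takes the named body quant.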
License
MIT, matching the upstream DeepSeek V4 Flash license.