# Qwen3.6-27B for hipfire
Pre-quantized Qwen3.6-27B (DeltaNet hybrid) for hipfire, a Rust-native LLM inference engine for AMD RDNA GPUs.
Refresh of Qwen3.5-27B with newer training. Same architecture (DeltaNet + FullAttention hybrid, arch_id=5, 32 layers, 16 attention heads, 4 KV heads, head_dim=256) and same kernel paths, so no engine changes are needed.
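For orientation, those architecture parameters map onto a plain config record. The struct below is an illustrative sketch only; the field and type names are assumptions, not hipfire's actual config API.

```rust
/// Illustrative sketch -- names are assumptions, not hipfire's real types.
#[derive(Debug)]
struct ArchConfig {
    arch_id: u32,    // 5 = DeltaNet + FullAttention hybrid
    layers: usize,   // 32 layers
    heads: usize,    // 16 attention heads
    kv_heads: usize, // 4 KV heads (grouped-query attention)
    head_dim: usize, // 256
}

const QWEN36_27B: ArchConfig = ArchConfig {
    arch_id: 5,
    layers: 32,
    heads: 16,
    kv_heads: 4,
    head_dim: 256,
};

fn main() {
    // Same arch_id and shapes as Qwen3.5-27B, hence no engine changes.
    println!("{QWEN36_27B:?}");
}
```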
## Files

| File | Role | Size | Min VRAM | RX 7900 XTX (gfx1100) |
|---|---|---|---|---|
| `qwen3.6-27b.mq4` | target | 14.0 GB | 16 GB | 44 tok/s AR / 185 tok/s w/ draft on code |
| `qwen36-27b-dflash-mq4.hfq` | DFlash draft | 0.92 GB | (paired with target) | – |
Decode tok/s figures are from steady-state greedy decode on a 7900 XTX with asym3 KV.
## Usage

```bash
# Install hipfire (Linux + ROCm 6+)
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

# Pull target + paired draft (DFlash speculative decode on by default)
hipfire pull qwen3.6:27b
hipfire pull qwen3.6:27b-draft

# Run
hipfire run qwen3.6:27b "Write a one-line Python function named square."
```
The engine auto-discovers the draft when both files are in `~/.hipfire/models/`. Filename matters: do not rename `qwen36-27b-dflash-mq4.hfq`.
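A minimal sketch of what filename-keyed discovery could look like; `find_draft_for` and the pairing convention below are assumptions for illustration, not hipfire's actual code.

```rust
use std::path::PathBuf;

/// Hypothetical sketch of filename-based draft discovery; hipfire's real
/// logic may differ. Pairing is keyed on the draft's fixed filename,
/// which is why renaming qwen36-27b-dflash-mq4.hfq breaks it.
fn find_draft_for(target: &str) -> Option<PathBuf> {
    let models = home_dir().join(".hipfire/models");
    let draft_name = match target {
        "qwen3.6-27b.mq4" => "qwen36-27b-dflash-mq4.hfq",
        _ => return None,
    };
    let path = models.join(draft_name);
    path.exists().then_some(path)
}

fn home_dir() -> PathBuf {
    std::env::var_os("HOME").map(PathBuf::from).unwrap_or_default()
}

fn main() {
    match find_draft_for("qwen3.6-27b.mq4") {
        Some(p) => println!("DFlash draft found: {}", p.display()),
        None => println!("no draft paired; falling back to AR decode"),
    }
}
```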
## DFlash draft

DFlash is hipfire's speculative-decode path: a small auxiliary draft network proposes a block of B candidate tokens, which the target model verifies in a single batched forward pass. The acceptance ratio τ (committed tokens per cycle) is what determines wall-clock speedup; typical 27B τ on code prompts is 4–5.
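A sketch of the verify-and-commit cycle described above, under greedy decoding. The function signatures are hypothetical stand-ins for illustration; hipfire's real draft/target interfaces are not shown here.

```rust
/// One speculative-decode cycle (hypothetical interfaces, greedy decoding).
/// `propose` drafts `block` tokens; `verify` is one batched target forward
/// returning the target's greedy token at each drafted position plus one
/// past the end (block + 1 tokens total).
fn speculative_step(
    ctx: &mut Vec<u32>,
    propose: impl Fn(&[u32], usize) -> Vec<u32>,
    verify: impl Fn(&[u32], &[u32]) -> Vec<u32>,
    block: usize,
) -> usize {
    let proposal = propose(ctx, block);
    let verified = verify(ctx, &proposal);

    // Commit the longest prefix where the target agrees with the draft.
    let mut accepted = 0;
    while accepted < block && verified[accepted] == proposal[accepted] {
        accepted += 1;
    }
    ctx.extend_from_slice(&proposal[..accepted]);

    // The same target forward yields the correct token at the first
    // mismatch (or a bonus token if the whole block was accepted).
    ctx.push(verified[accepted]);

    accepted + 1 // committed tokens this cycle; tau is the average of this
}
```

Each cycle costs one draft pass plus one target forward regardless of how many tokens are accepted, which is why τ translates almost directly into the tok/s multiplier.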
The draft is converted from z-lab/Qwen3.6-27B-DFlash via hipfire's `dflash_convert --mq4`. It is a 1.73B-param hybrid (sliding_attention + full_attention) with block_size=16, target hidden extraction at layers [1, 16, 31, 46, 61], and mask_token_id=248070.
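As a rough sketch of what those parameters mean at decode time, under the assumption (suggested by the fields above) that the draft fills a block of mask tokens in one pass while conditioning on target hidden states tapped from the listed layers; the actual DFlash input layout is defined upstream, not here.

```rust
// Assumed interpretation of the converted draft's config; the real layout
// belongs to upstream z-lab/Qwen3.6-27B-DFlash, not to this sketch.
const BLOCK_SIZE: usize = 16;
const EXTRACT_LAYERS: [usize; 5] = [1, 16, 31, 46, 61]; // target hidden taps
const MASK_TOKEN_ID: u32 = 248_070;

/// Input ids for one speculative block: every slot starts as the mask
/// token, and the draft predicts all 16 positions in a single pass.
fn masked_block() -> [u32; BLOCK_SIZE] {
    [MASK_TOKEN_ID; BLOCK_SIZE]
}

fn main() {
    println!("block: {:?}, taps: {:?}", masked_block(), EXTRACT_LAYERS);
}
```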
## 2026-04-27 refresh

Re-quantized from the latest z-lab safetensors revision (sha 0919688658996800f86b895034249700e9481106, upstream mtime 2026-04-27 04:19 UTC), replacing the prior Apr 24 conversion (sha 1dbb59a5...). Bench delta on a 7900 XTX vs the prior draft, both on the qwen3.6-27b.mq4 target, single run per genre:
| Genre | tok/s (old → new) | τ (old → new) |
|---|---|---|
| code (fibonacci) | 101.13 → 105.32 | 4.40 → 4.64 |
| prose (Roman empire) | 41.82 → 46.45 | 1.06 → 1.30 |
| instruct (sky blue) | 85.13 → 84.06 | 3.50 → 3.44 |
Prose shows the largest gain (+11% tok/s, +23% τ). Code gains +4% tok/s. Instruct is tied within run-to-run noise.
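For context on how τ maps to throughput, the usual speculative-decoding accounting is sketched below; hipfire's real per-cycle costs are not published here, so treat the terms as assumptions.

```latex
% One cycle = one draft pass + one batched target verify, committing tau tokens.
\[
  \text{tok/s} \;\approx\; \frac{\tau}{t_{\text{target}} + t_{\text{draft}}},
  \qquad
  \text{speedup over AR} \;\approx\; \frac{\tau}{1 + t_{\text{draft}}/t_{\text{target}}}.
\]
```

Plugging in the code numbers from the Files table, 185 / 44 ≈ 4.2 against τ ≈ 4.6 implies roughly 10% draft overhead per cycle, if this simple model holds.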
## Pairing rule

The draft only accelerates its own target. Don't pair the 3.6 draft with the 3.5-27B target or vice versa: vocabulary and hidden-state projections differ across the refresh. Pair behavior:

- `hipfire pull qwen3.6:27b` (alone) → AR decode, ~44 tok/s.
- `hipfire pull qwen3.6:27b-draft` after the target → DFlash on by default, ~185 tok/s on code prompts when draft+target alignment is good.
## About hipfire

Rust + HIP inference engine for AMD consumer GPUs (RDNA1–RDNA4). No Python in the hot path. Single-binary install. Source: Kaden-Schutt/hipfire.
## License

MIT for the hipfire packaging. Underlying weights inherit the upstream Qwen / z-lab licenses; see those repos for terms.
## Model tree for schuttdev/hipfire-qwen3.6-27b

Base model: Qwen/Qwen3.5-27B