Qwen3.6-27B for hipfire

Pre-quantized Qwen3.6-27B (DeltaNet hybrid) for hipfire, a Rust-native LLM inference engine for AMD RDNA GPUs.

Refresh of Qwen3.5-27B with newer training. Same architecture (DeltaNet + FullAttention hybrid, arch_id=5, 32 layers, 16 attention heads, 4 KV heads, head_dim=256) and same kernel paths, so no engine changes are needed.
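
For reference, those hyperparameters written out as a small config sketch. The struct and field names are illustrative assumptions for readability, not hipfire's actual config types:

// Illustrative only: the hyperparameters listed above as a Rust struct.
// Names are assumptions, not hipfire's real config format.
struct HybridConfig {
    arch_id: u32,      // 5 = DeltaNet + FullAttention hybrid
    n_layers: usize,   // 32 layers
    n_heads: usize,    // 16 attention heads
    n_kv_heads: usize, // 4 KV heads (grouped-query attention)
    head_dim: usize,   // 256
}

const QWEN36_27B: HybridConfig = HybridConfig {
    arch_id: 5,
    n_layers: 32,
    n_heads: 16,
    n_kv_heads: 4,
    head_dim: 256,
};

fn main() {
    // 16 query heads over 4 KV heads: 4 query heads share each KV head.
    println!("query heads per KV head: {}", QWEN36_27B.n_heads / QWEN36_27B.n_kv_heads);
}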

Files

File                       Role          Size     Min VRAM              RX 7900 XTX (gfx1100)
qwen3.6-27b.mq4            target        14.0 GB  16 GB                 44 tok/s AR / 185 tok/s w/ draft on code
qwen36-27b-dflash-mq4.hfq  DFlash draft  0.92 GB  (paired with target)  –

Decode tok/s figures are steady-state greedy decode on a 7900 XTX with asym3 KV.

Usage

# Install hipfire (Linux + ROCm 6+)
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

# Pull target + paired draft (DFlash speculative decode on by default)
hipfire pull qwen3.6:27b
hipfire pull qwen3.6:27b-draft

# Run
hipfire run qwen3.6:27b "Write a one-line Python function named square."

The engine auto-discovers the draft when both files are in ~/.hipfire/models/. Filename matters: do not rename qwen36-27b-dflash-mq4.hfq.
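
Concretely, the pairing condition amounts to a check like the following. Paths and filenames are the ones above; the function itself is an illustrative sketch, not hipfire's actual code:

use std::path::PathBuf;

// Illustrative sketch: speculative decode is enabled when the target and its
// paired DFlash draft both sit in ~/.hipfire/models/ under these exact names.
fn dflash_draft_paired(home: &str) -> bool {
    let models = PathBuf::from(home).join(".hipfire").join("models");
    models.join("qwen3.6-27b.mq4").exists()
        && models.join("qwen36-27b-dflash-mq4.hfq").exists()
}

fn main() {
    let home = std::env::var("HOME").unwrap_or_else(|_| ".".to_string());
    println!("DFlash pairing active: {}", dflash_draft_paired(&home));
}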

DFlash draft

DFlash is hipfire's speculative-decode path: a small auxiliary draft network proposes blocks of B candidate tokens that the target model verifies in a single batched forward pass. The acceptance ratio τ (tokens committed per verify cycle) determines the wall-clock speedup; a typical 27B τ on code prompts is 4-5.

The draft is converted from z-lab/Qwen3.6-27B-DFlash via hipfire's dflash_convert --mq4. It is a 1.73B-param hybrid (sliding_attention + full_attention) with block_size=16, target hidden-state extraction at layers [1, 16, 31, 46, 61], and mask_token_id=248070.
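
To make the cycle concrete, here is a minimal toy sketch of block speculative decoding with B=16 and a greedy-match acceptance rule. The stand-in model functions and all names are assumptions for illustration, not hipfire's kernels or DFlash's actual verification rule:

// Toy block speculative decode. The draft proposes a block of B tokens; the
// target then scores the block (in the real engine this is one batched
// forward pass) and the longest prefix matching the target's own greedy
// choice is committed, plus one token from the target itself.
const B: usize = 16; // block_size of the DFlash draft

// Stand-in "models": deterministic next-token functions over a token prefix.
fn draft_next(prefix: &[u32]) -> u32 {
    (prefix.len() as u32 * 7 + 3) % 100
}
fn target_next(prefix: &[u32]) -> u32 {
    let t = (prefix.len() as u32 * 7 + 3) % 100;
    if prefix.len() % 6 == 0 { t + 1 } else { t } // disagrees with the draft occasionally
}

// Runs one draft-propose / target-verify cycle and returns the number of
// tokens committed, i.e. this cycle's contribution to the acceptance ratio τ.
fn speculative_cycle(ctx: &mut Vec<u32>) -> usize {
    // 1. Draft proposes B candidate tokens autoregressively (cheap).
    let mut tmp = ctx.clone();
    let mut proposal = Vec::with_capacity(B);
    for _ in 0..B {
        let t = draft_next(&tmp);
        proposal.push(t);
        tmp.push(t);
    }
    // 2. Target verifies: accept the longest prefix it would have produced itself.
    let mut committed = 0;
    for &t in &proposal {
        if target_next(ctx) == t {
            ctx.push(t);
            committed += 1;
        } else {
            break;
        }
    }
    // 3. The target's own next token at the first mismatch comes for free.
    ctx.push(target_next(ctx));
    committed + 1
}

fn main() {
    let mut ctx = vec![1, 2, 3];
    let tau = speculative_cycle(&mut ctx);
    println!("committed {tau} tokens this cycle"); // 4 with these toy models
}

Higher τ means more tokens committed per (expensive) target pass, which is why the per-genre tok/s numbers below track their τ values.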

2026-04-27 refresh

Re-quantized from the latest z-lab safetensors revision (sha 0919688658996800f86b895034249700e9481106, upstream mtime 2026-04-27 04:19 UTC), replacing the prior Apr 24 conversion (sha 1dbb59a5...). Bench delta on a 7900 XTX, new draft vs. prior draft on the qwen3.6-27b.mq4 target, single run per genre:

genre                 tok/s (old → new)  τ (old → new)
code (fibonacci)      101.13 → 105.32    4.40 → 4.64
prose (Roman empire)  41.82 → 46.45      1.06 → 1.30
instruct (sky blue)   85.13 ↔ 84.06      3.50 ↔ 3.44

Prose shows the largest gain (+11% tok/s, +23% τ). Code gains +4% tok/s. Instruct is tied within run-to-run noise.

Pairing rule

The draft only accelerates its own target. Don't pair the 3.6 draft with the 3.5-27B target or vice versa: vocabulary and hidden-state projections differ across the refresh. Pairing behavior:

  • hipfire pull qwen3.6:27b (alone) → AR decode, ~44 tok/s.
  • hipfire pull qwen3.6:27b-draft after the target → DFlash on by default, ~185 tok/s on code prompts when draft+target alignment is good.

About hipfire

Rust + HIP inference engine for AMD consumer GPUs (RDNA1–RDNA4). No Python in the hot path. Single binary install. Source: Kaden-Schutt/hipfire.

License

MIT for the hipfire packaging. Underlying weights inherit the upstream Qwen / z-lab licenses; see those repos for terms.
