Qwen3.6-27B for hipfire

Pre-quantized Qwen3.6-27B (DeltaNet hybrid) for hipfire, a Rust-native LLM inference engine for AMD RDNA GPUs.

Refresh of Qwen3.5-27B with newer training. Same architecture (DeltaNet + FullAttention hybrid, arch_id=5, 32 layers, 16 attention heads, 4 KV heads, head_dim=256) and same kernel paths, so no engine changes are needed.
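
For reference, those hyperparameters written out as a small config sketch. The struct and field names are illustrative assumptions for readability, not hipfire's actual config types:

// Illustrative only: the hyperparameters listed above as a Rust struct.
// Names are assumptions, not hipfire's real config format.
struct HybridConfig {
    arch_id: u32,      // 5 = DeltaNet + FullAttention hybrid
    n_layers: usize,   // 32 layers
    n_heads: usize,    // 16 attention heads
    n_kv_heads: usize, // 4 KV heads (grouped-query attention)
    head_dim: usize,   // 256
}

const QWEN36_27B: HybridConfig = HybridConfig {
    arch_id: 5,
    n_layers: 32,
    n_heads: 16,
    n_kv_heads: 4,
    head_dim: 256,
};

fn main() {
    // 16 query heads over 4 KV heads: 4 query heads share each KV head.
    println!("query heads per KV head: {}", QWEN36_27B.n_heads / QWEN36_27B.n_kv_heads);
}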

Files

File                       Role          Size     Min VRAM              RX 7900 XTX (gfx1100)
qwen3.6-27b.mq4            target        14.0 GB  16 GB                 44 tok/s AR / 185 tok/s w/ draft on code
qwen36-27b-dflash-mq4.hfq  DFlash draft  0.92 GB  (paired with target)  –

Decode tok/s figures are steady-state greedy decode on a 7900 XTX with asym3 KV.

Usage

# Install hipfire (Linux + ROCm 6+)
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

# Pull target + paired draft (DFlash speculative decode on by default)
hipfire pull qwen3.6:27b
hipfire pull qwen3.6:27b-draft

# Run
hipfire run qwen3.6:27b "Write a one-line Python function named square."

The engine auto-discovers the draft when both files are in ~/.hipfire/models/. Filename matters: do not rename qwen36-27b-dflash-mq4.hfq.
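
Concretely, the pairing condition amounts to a check like the following. Paths and filenames are the ones above; the function itself is an illustrative sketch, not hipfire's actual code:

use std::path::PathBuf;

// Illustrative sketch: speculative decode is enabled when the target and its
// paired DFlash draft both sit in ~/.hipfire/models/ under these exact names.
fn dflash_draft_paired(home: &str) -> bool {
    let models = PathBuf::from(home).join(".hipfire").join("models");
    models.join("qwen3.6-27b.mq4").exists()
        && models.join("qwen36-27b-dflash-mq4.hfq").exists()
}

fn main() {
    let home = std::env::var("HOME").unwrap_or_else(|_| ".".to_string());
    println!("DFlash pairing active: {}", dflash_draft_paired(&home));
}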

DFlash draft

DFlash is hipfire's speculative-decode path: a small auxiliary draft network proposes blocks of B candidate tokens that the target model verifies in a single batched forward pass. The acceptance ratio τ (tokens committed per verify cycle) determines the wall-clock speedup; a typical 27B τ on code prompts is 4-5.

The draft is converted from z-lab/Qwen3.6-27B-DFlash via hipfire's dflash_convert --mq4. It is a 1.73B-param hybrid (sliding_attention + full_attention) with block_size=16, target hidden-state extraction at layers [1, 16, 31, 46, 61], and mask_token_id=248070.
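
To make the cycle concrete, here is a minimal toy sketch of block speculative decoding with B=16 and a greedy-match acceptance rule. The stand-in model functions and all names are assumptions for illustration, not hipfire's kernels or DFlash's actual verification rule:

// Toy block speculative decode. The draft proposes a block of B tokens; the
// target then scores the block (in the real engine this is one batched
// forward pass) and the longest prefix matching the target's own greedy
// choice is committed, plus one token from the target itself.
const B: usize = 16; // block_size of the DFlash draft

// Stand-in "models": deterministic next-token functions over a token prefix.
fn draft_next(prefix: &[u32]) -> u32 {
    (prefix.len() as u32 * 7 + 3) % 100
}
fn target_next(prefix: &[u32]) -> u32 {
    let t = (prefix.len() as u32 * 7 + 3) % 100;
    if prefix.len() % 6 == 0 { t + 1 } else { t } // disagrees with the draft occasionally
}

// Runs one draft-propose / target-verify cycle and returns the number of
// tokens committed, i.e. this cycle's contribution to the acceptance ratio τ.
fn speculative_cycle(ctx: &mut Vec<u32>) -> usize {
    // 1. Draft proposes B candidate tokens autoregressively (cheap).
    let mut tmp = ctx.clone();
    let mut proposal = Vec::with_capacity(B);
    for _ in 0..B {
        let t = draft_next(&tmp);
        proposal.push(t);
        tmp.push(t);
    }
    // 2. Target verifies: accept the longest prefix it would have produced itself.
    let mut committed = 0;
    for &t in &proposal {
        if target_next(ctx) == t {
            ctx.push(t);
            committed += 1;
        } else {
            break;
        }
    }
    // 3. The target's own next token at the first mismatch comes for free.
    ctx.push(target_next(ctx));
    committed + 1
}

fn main() {
    let mut ctx = vec![1, 2, 3];
    let tau = speculative_cycle(&mut ctx);
    println!("committed {tau} tokens this cycle"); // 4 with these toy models
}

Higher τ means more tokens committed per (expensive) target pass, which is why the per-genre tok/s numbers below track their τ values.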

2026-04-27 refresh

Re-quantized from the latest z-lab safetensors revision (sha 0919688658996800f86b895034249700e9481106, upstream mtime 2026-04-27 04:19 UTC), replacing the prior Apr 24 conversion (sha 1dbb59a5...). Bench delta on a 7900 XTX, new draft vs. prior draft on the qwen3.6-27b.mq4 target, single run per genre:

genre                 tok/s (old → new)  τ (old → new)
code (fibonacci)      101.13 → 105.32    4.40 → 4.64
prose (Roman empire)  41.82 → 46.45      1.06 → 1.30
instruct (sky blue)   85.13 ↔ 84.06      3.50 ↔ 3.44

Prose shows the largest gain (+11% tok/s, +23% τ). Code gains +4% tok/s. Instruct is tied within run-to-run noise.

Pairing rule

The draft only accelerates its own target. Don't pair the 3.6 draft with the 3.5-27B target or vice versa: vocabulary and hidden-state projections differ across the refresh. Pairing behavior:

  • hipfire pull qwen3.6:27b (alone) → AR decode, ~44 tok/s.
  • hipfire pull qwen3.6:27b-draft after the target → DFlash on by default, ~185 tok/s on code prompts when draft+target alignment is good.

About hipfire

Rust + HIP inference engine for AMD consumer GPUs (RDNA1–RDNA4). No Python in the hot path. Single binary install. Source: Kaden-Schutt/hipfire.

License

MIT for the hipfire packaging. Underlying weights inherit the upstream Qwen / z-lab licenses; see those repos for terms.
