Qwen3-ForcedAligner-0.6B-4bit (MLX)

4-bit quantized version of Qwen/Qwen3-ForcedAligner-0.6B for Apple Silicon inference via MLX.

Predicts word-level timestamps for audio+text pairs in a single non-autoregressive forward pass.

Model Details

| Component | Config |
|---|---|
| Audio encoder | 24 layers, d_model=1024, 16 heads, FFN=4096, float16 |
| Text decoder | 28 layers, hidden=1024, 16Q/8KV heads, 4-bit quantized (group_size=64) |
| Classify head | Linear(1024, 5000), float16 |
| Timestamp resolution | 80ms per class (5000 classes = 400s max) |
| Total size | 979 MB (vs 1.84 GB bf16) |

How It Works

Audio + Text → Audio Encoder → Text Decoder (single pass) → Classify Head → argmax at <timestamp> positions → word timestamps

Unlike ASR (autoregressive, token-by-token), the forced aligner runs the entire sequence in one forward pass through the decoder. The classify head predicts a timestamp class (0–4999) at each <timestamp> token position, which maps to time via class_index × 80ms.
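
The decoding step described above can be sketched in a few lines of NumPy. This is an illustration only, not the code used by speech-swift: the function names, the `(seq_len, num_classes)` logits layout, and the pairing of consecutive `<timestamp>` positions into (start, end) per word are all assumptions made for the example.

```python
import numpy as np

MS_PER_CLASS = 80  # timestamp resolution stated in Model Details


def decode_timestamps(logits, timestamp_positions):
    """Map classify-head logits to seconds at the given <timestamp> positions.

    logits: array of shape (seq_len, num_classes), assumed layout.
    timestamp_positions: indices of <timestamp> tokens in the sequence.
    """
    logits = np.asarray(logits)
    class_ids = logits[timestamp_positions].argmax(axis=-1)
    return class_ids * MS_PER_CLASS / 1000.0


def pair_word_times(words, seconds):
    """Hypothetical pairing: assumes two <timestamp> tokens (start, end) per word."""
    if len(seconds) != 2 * len(words):
        raise ValueError("expected one start and one end timestamp per word")
    return [(w, float(seconds[2 * i]), float(seconds[2 * i + 1]))
            for i, w in enumerate(words)]
```

With 5000 classes, the largest representable timestamp under this mapping is class 4999, i.e. 399.92 s, which is consistent with the 400s limit in the table.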

Usage with Swift (MLX)

This model is designed for use with speech-swift:

import Qwen3ASR

let aligner = try await Qwen3ForcedAligner.fromPretrained()

let aligned = aligner.align(
    audio: audioSamples,
    text: "Can you guarantee that the replacement part will be shipped tomorrow?",
    sampleRate: 24000
)

for word in aligned {
    print("[\(String(format: "%.2f", word.startTime))s - \(String(format: "%.2f", word.endTime))s] \(word.text)")
}

CLI

# Align with provided text
qwen3-asr-cli --align --text "Hello world" audio.wav

# Transcribe first, then align
qwen3-asr-cli --align audio.wav

Output:

[0.12s - 0.45s] Can
[0.45s - 0.72s] you
[0.72s - 1.20s] guarantee
[1.20s - 1.48s] that
...
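
If the alignment needs to be consumed by another tool, the printed lines can be parsed back into tuples. The sketch below assumes the output always follows the `[start s - end s] word` pattern shown in the sample above; `parse_alignment` is a hypothetical helper, not part of the CLI.

```python
import re

# Pattern inferred from the sample output above, e.g. "[0.12s - 0.45s] Can"
LINE_RE = re.compile(r"^\[(\d+(?:\.\d+)?)s - (\d+(?:\.\d+)?)s\] (.+)$")


def parse_alignment(text):
    """Return a list of (start_seconds, end_seconds, word); skip non-matching lines."""
    words = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            start, end, word = match.groups()
            words.append((float(start), float(end), word))
    return words
```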

Quantization

The text decoder (attention projections, MLP, embeddings) is quantized to 4-bit using group quantization (group_size=64). The audio encoder and classify head are kept in float16 to preserve accuracy.
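
As a rough illustration of what group quantization means, the NumPy sketch below quantizes each run of 64 consecutive weights to 4-bit integers with its own scale and offset. It is a generic affine scheme written for this example; the function names are hypothetical and it is not claimed to match the exact packing or rounding that MLX or the conversion script uses.

```python
import numpy as np


def quantize_groups(w, group_size=64, bits=4):
    """Affine-quantize w in groups of `group_size` consecutive values.

    Assumes w.size is divisible by group_size.
    Returns (q, scale, bias) with w ≈ q * scale + bias per group.
    """
    groups = np.asarray(w, dtype=np.float32).reshape(-1, group_size)
    levels = 2 ** bits - 1  # 15 for 4-bit
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)  # constant groups: avoid divide by zero
    q = np.clip(np.round((groups - w_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, w_min


def dequantize_groups(q, scale, bias, shape):
    """Reconstruct an approximation of the original weights."""
    return (q.astype(np.float32) * scale + bias).reshape(shape)
```

Each reconstructed weight differs from the original by at most half a quantization step of its group, so smaller groups track local weight ranges more closely at the cost of storing more scales and offsets.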

Converted with:

python scripts/convert_forced_aligner.py \
    --source Qwen/Qwen3-ForcedAligner-0.6B \
    --upload --repo-id aufklarer/Qwen3-ForcedAligner-0.6B-4bit

Links

