hy-mt2-1.8b-4bit-mlx

Quantized version of tencent/Hy-MT2-1.8B for Apple Silicon using MLX.

Hy-MT2-1.8B is Tencent's multilingual translation model covering 40+ languages.

Quantization: Affine integer quantization
Precision: 4-bit (~4.5 bits/weight avg)
Group size: 64
Disk size: 970 MB
Quantized by: sahilchachra

About this variant

Standard affine (integer) quantization at 4 bits with group size 64. It offers the largest compression ratio and is recommended when memory is tight or you want the fastest decode throughput.
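To make the scheme concrete, here is a minimal pure-Python sketch of per-group affine quantization. It assumes each group of 64 weights stores one scale and one bias at 16 bits each, which is consistent with the ~4.5 bits/weight figure above. The function names are illustrative and this is not MLX's packed implementation.

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group of floats to unsigned integers in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1              # 15 steps for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels
    if scale == 0.0:                      # constant group: any non-zero scale works
        scale = 1.0
    bias = lo
    q = [round((w - bias) / scale) for w in weights]
    return q, scale, bias


def dequantize_group(q, scale, bias):
    """Map integer codes back to approximate float weights."""
    return [qi * scale + bias for qi in q]


def bits_per_weight(bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Effective storage cost: payload bits plus per-group scale/bias overhead."""
    return bits + (scale_bits + bias_bits) / group_size


# Round-trip one synthetic group of 64 weights (made-up values, not model weights).
group = [i / 63 - 0.5 for i in range(64)]
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"max abs error: {max_err:.4f}, bits/weight: {bits_per_weight()}")
```

The round-trip error per weight is bounded by half a quantization step (scale / 2), which is why smaller groups or more bits reduce error at the cost of size.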

Benchmark results

Evaluated on Apple M5 Pro with MLX. Model loaded once; performance and quality measured in a single pass.

Performance

Metric             This model   FP16 baseline
Prefill (tok/s)    1486.53      1269.81
Decode (tok/s)     220.63       77.12
Peak memory (GB)   1.28         3.72
Disk size (MB)     970          3897
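The table is easier to read as ratios against the FP16 baseline. The sketch below uses only the figures above; the dictionary keys and the `improvement` helper are illustrative names. It works out to roughly 2.9x faster decode, 1.2x faster prefill, 2.9x lower peak memory, and 4.0x smaller on disk.

```python
# (this model, FP16 baseline), copied from the performance table
perf = {
    "prefill_tok_s": (1486.53, 1269.81),
    "decode_tok_s": (220.63, 77.12),
    "peak_memory_gb": (1.28, 3.72),
    "disk_mb": (970, 3897),
}
# For throughput, higher is better; for memory and disk, lower is better.
higher_is_better = {"prefill_tok_s", "decode_tok_s"}


def improvement(metric):
    """Return the improvement factor of the quantized model over FP16 (>1 is better)."""
    quantized, fp16 = perf[metric]
    return quantized / fp16 if metric in higher_is_better else fp16 / quantized


for metric in perf:
    print(f"{metric}: {improvement(metric):.2f}x")
```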

Translation quality (FLORES-200 devtest)

Scores are chrF++ (higher is better). The sample size n is listed per direction.

Direction            This model   FP16 baseline   n
eng_Latn→fra_Latn    65.07        63.81           20
eng_Latn→deu_Latn    58.02        57.66           20
eng_Latn→zho_Hans    27.74        29.09           20
eng_Latn→jpn_Jpan    31.90        34.19           20
eng_Latn→spa_Latn    56.19        56.50           20
fra_Latn→eng_Latn    65.10        64.58           20
zho_Hans→eng_Latn    55.34        55.17           20
jpn_Jpan→eng_Latn    54.30        55.29           20

Avg chrF++: 56.9 vs FP16 56.95
Avg BLEU: 30.98 vs FP16 30.71
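The averages can be cross-checked against the per-direction rows. Averaging the eight listed directions gives about 51.7 for this model and 52.0 for FP16, which is lower than the 56.9 and 56.95 reported above. The reported averages may therefore cover a different set of pairs or samples; that is an assumption, since the card does not say how they were computed. A minimal sketch of the check, with illustrative variable names:

```python
# (this model, FP16 baseline) chrF++ per direction, copied from the table above
scores = {
    "eng_Latn→fra_Latn": (65.07, 63.81),
    "eng_Latn→deu_Latn": (58.02, 57.66),
    "eng_Latn→zho_Hans": (27.74, 29.09),
    "eng_Latn→jpn_Jpan": (31.90, 34.19),
    "eng_Latn→spa_Latn": (56.19, 56.50),
    "fra_Latn→eng_Latn": (65.10, 64.58),
    "zho_Hans→eng_Latn": (55.34, 55.17),
    "jpn_Jpan→eng_Latn": (54.30, 55.29),
}

# Unweighted means; every direction has the same sample size (n = 20).
quant_mean = sum(q for q, _ in scores.values()) / len(scores)
fp16_mean = sum(f for _, f in scores.values()) / len(scores)

# Per-direction change relative to FP16 (negative means the quantized model is worse).
deltas = {pair: q - f for pair, (q, f) in scores.items()}
worst_pair = min(deltas, key=deltas.get)

print(f"mean chrF++: {quant_mean:.1f} (this model) vs {fp16_mean:.1f} (FP16)")
print(f"largest drop: {worst_pair} ({deltas[worst_pair]:+.2f})")
```

With only 20 sentences per direction, per-pair differences of one or two chrF++ points are within the range one would expect from sampling noise.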

Context scaling (decode tok/s)

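Context length   Decode tok/s
~128 tokens      97163.0
~256 tokens      214.0
~512 tokens      213.7
~1024 tokens     119402.9

The ~128 and ~1024 readings are orders of magnitude above both the ~214 tok/s measured at the other lengths and the 220.63 tok/s single-pass decode rate, so they look like measurement artifacts and should not be read as real throughput.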

Usage

Install

pip install mlx-lm

Translate

from mlx_lm import load, generate

model, tokenizer = load("sahilchachra/hy-mt2-1.8b-4bit-mlx")

prompt = (
    "Translate the following text from English to French.\n"
    "English: The early bird catches the worm.\n"
    "French:"
)
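# generate() returns the completion as a string. Passing verbose=True would also
# stream it to stdout, so printing the return value as well would show it twice.
text = generate(model, tokenizer, prompt=prompt, max_tokens=128)
print(text)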
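For other language pairs, the prompt can be built from a small helper. This sketch reproduces the prompt layout of the example above. `build_prompt` is a hypothetical name, and the layout is taken from this card's example; it is an assumption that it suits every pair, so check tencent/Hy-MT2-1.8B for the prompt format the original model recommends.

```python
def build_prompt(text: str, source: str = "English", target: str = "French") -> str:
    """Build a translation prompt in the same layout as the example in this card."""
    return (
        f"Translate the following text from {source} to {target}.\n"
        f"{source}: {text}\n"
        f"{target}:"
    )


prompt = build_prompt("The early bird catches the worm.")
print(prompt)
```

The returned string can be passed as `prompt=` to `generate` or `stream_generate`.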

Stream

from mlx_lm import load, stream_generate

model, tokenizer = load("sahilchachra/hy-mt2-1.8b-4bit-mlx")
for chunk in stream_generate(model, tokenizer, prompt="Translate \"Hello world\" to Japanese:", max_tokens=64):
    print(chunk.text, end="", flush=True)

All variants in this collection

Model Method
sahilchachra/hy-mt2-1.8b-4bit-mlx Affine int4 (group 64) ← this model
sahilchachra/hy-mt2-1.8b-8bit-mlx Affine int8 (group 64)
sahilchachra/hy-mt2-1.8b-mxfp4-mlx Block float MX FP4
sahilchachra/hy-mt2-1.8b-mxfp8-mlx Block float MX FP8

Notes

  • Requires Apple Silicon (M1 or later) with MLX
  • Benchmarks run on Apple M5 Pro, 24 GB unified memory
  • FLORES-200 sample sizes are small (n = 20 per direction), so treat chrF++/BLEU figures as indicative, not definitive
  • License: see tencent/Hy-MT2-1.8B for the original model's license terms

Original model

See tencent/Hy-MT2-1.8B for full model details, supported languages, and intended use.
