A 37 billion parameter mixture-of-experts model. Built to serve state-of-the-art reasoning, instruction following, and code generation at a fraction of the memory cost of its base model.
The standard approach of quantizing everything uniformly trades correctness for simplicity. We take the opposite position and go a step further: compress aggressively where it is safe to do so, and preserve precision exactly where the architecture is sensitive.
How the Compression Works
From BF16 to F8_E4M3
Every weight in the original model lives in BF16: 16 bits per number, with 8 exponent bits and 7 mantissa bits. The compressed layers in axe-veloce use F8_E4M3: 8 bits per number, with 4 exponent bits and 3 mantissa bits. Half the storage. And in practice, almost none of the loss.
Here is why that works.
The simple version. A floating point number has two parts: the exponent, which sets the order of magnitude, and the mantissa, which sets the precision within that magnitude. BF16 gives you 7 bits of mantissa precision. F8_E4M3 gives you 3. That sounds severe. But neural network weights are not uniformly distributed across all magnitudes. They cluster near zero, with a smooth falloff in both directions. The E4M3 format is designed for exactly this: it packs more representational steps near zero, where the weights actually live, and fewer at the extremes, where they rarely go.
The precise version. A BF16 number is represented as:

$$v = (-1)^s \times 2^{\,e-127} \times \left(1 + \frac{m}{2^7}\right)$$
Where $s$ is the sign bit, $e$ is an 8-bit biased exponent, and $m$ is a 7-bit mantissa giving $2^7 = 128$ discrete steps per octave.
An F8_E4M3 number follows the same structure but with a 4-bit exponent and 3-bit mantissa:

$$v = (-1)^s \times 2^{\,e-7} \times \left(1 + \frac{m}{2^3}\right)$$

This gives only $2^3 = 8$ discrete steps per octave and a dynamic range of $\pm 448$. The reduced step count increases the maximum rounding error within any given octave. But because weights are concentrated near zero, most weights fall into the densely packed low-magnitude octaves where F8_E4M3 and BF16 produce nearly identical values.
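To make the rounding behavior concrete, here is a minimal sketch, assuming PyTorch 2.1+ (which exposes this format as `torch.float8_e4m3fn`), that round-trips a weight-like tensor through F8_E4M3 and measures the error:

```python
import torch

# Neural network weights cluster near zero; a small-scale normal draw mimics that.
w = torch.randn(1_000_000, dtype=torch.bfloat16) * 0.02

# Round-trip through F8_E4M3 (torch names it float8_e4m3fn) and back to BF16.
w_f8 = w.to(torch.float8_e4m3fn).to(torch.bfloat16)

# Relative rounding error introduced by the 3-bit mantissa.
rel_err = ((w - w_f8).abs() / w.abs().clamp_min(1e-6)).float()
print(f"median relative error: {rel_err.median().item():.4f}")   # typically a few percent
print(f"p99 relative error:    {rel_err.quantile(0.99).item():.4f}")
```

The worst relative errors come from values so close to zero that they land on or below F8_E4M3's smallest representable steps; the bulk of the distribution rounds with only a few percent of drift.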
How This Changes the Matrix Multiply
The core computation in every linear layer is a matrix multiply. For an input activation matrix $X$ and a weight matrix $W$, the output is:

$$Y = X W^{\top}$$
In BF16, both $X$ and $W$ are 16-bit values. The multiply-accumulate happens in FP32 accumulator registers on the GPU, and the output is written back in BF16. Memory bandwidth cost per element: 2 bytes for weights, 2 bytes for activations.
In F8_E4M3, the same operation runs differently. Weights are stored as 8-bit values. Before the matrix multiply, each channel is rescaled by a learned per-channel scale factor $s_c$ so that the full dynamic range of that channel's weights maps onto the F8_E4M3 grid as efficiently as possible:

$$\hat{W}_c = \mathrm{quant}_{\mathrm{F8}}\!\left(\frac{W_c}{s_c}\right)$$

At inference, the multiply-accumulate runs on compressed weights, and the result is rescaled back:

$$Y_c \approx s_c \,\bigl(X\,\hat{W}_c\bigr)$$

For activations, the scale is not precomputed. It is derived token by token at runtime. For each token vector $x_t$, the scale is:

$$s_t = \frac{\max_i |x_{t,i}|}{448}$$
The activation is quantized, the matrix multiply executes, and the result is dequantized before passing to the next operation. This all happens within the same kernel. From the outside, it is invisible.
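The production path runs fused inside a single FP8 kernel, but the numerics can be sketched in plain PyTorch. The reference below is a minimal, unoptimized emulation of the quantize-matmul-dequantize flow under the assumptions above (per-channel weight scales taken from the weight range, per-token activation scales); it is not the serving kernel.

```python
import torch

F8_MAX = 448.0  # largest finite F8_E4M3 magnitude

def fp8_linear_reference(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Emulated FP8 linear layer: F8_E4M3 storage, BF16 compute, rescale at the end."""
    # Per-output-channel weight scale maps each row of W onto the F8 grid.
    s_c = w.abs().amax(dim=1, keepdim=True) / F8_MAX              # [out, 1]
    w_f8 = (w / s_c).to(torch.float8_e4m3fn)

    # Per-token activation scale, derived at runtime from the token itself.
    s_t = x.abs().amax(dim=-1, keepdim=True) / F8_MAX             # [tokens, 1]
    x_f8 = (x / s_t).to(torch.float8_e4m3fn)

    # Matrix multiply on the quantized values (emulated in BF16 here),
    # then rescale the result back by both scale factors.
    y = x_f8.to(torch.bfloat16) @ w_f8.to(torch.bfloat16).t()
    return y * s_t * s_c.t()

x = torch.randn(4, 512, dtype=torch.bfloat16)
w = torch.randn(1024, 512, dtype=torch.bfloat16)
rel_err = (x @ w.t() - fp8_linear_reference(x, w)).abs().mean() / (x @ w.t()).abs().mean()
print(f"mean relative error: {rel_err.item():.4f}")
```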
What this means for throughput. GPU memory bandwidth is the primary bottleneck for autoregressive inference. At BF16, loading a weight matrix costs 2 bytes per parameter. At F8_E4M3, it costs 1 byte. The matrix multiply itself runs on the same tensor cores, but the time spent moving data from VRAM to compute units is halved. For large batch serving where compute is the bottleneck, modern GPUs also expose native F8 tensor core paths with higher theoretical throughput than BF16.
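A back-of-the-envelope decode step shows why this matters. The figures below are illustrative assumptions only (roughly 3B parameters active per token per the A3B designation, and H100-class memory bandwidth), not published specifications:

```python
# Rough lower bound on a batch-size-1 decode step: time to stream the active weights.
active_params = 3e9        # assumption: ~3B parameters touched per token (the "A3B" active set)
hbm_bandwidth = 3.35e12    # bytes/s, H100-class HBM (illustrative)

for label, bytes_per_param in [("BF16", 2), ("F8_E4M3", 1)]:
    traffic = active_params * bytes_per_param
    print(f"{label}: {traffic / 1e9:.1f} GB of weight traffic per token "
          f"-> at least {traffic / hbm_bandwidth * 1e3:.2f} ms per decode step")
```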
Precision Mapping Across the Architecture
Through our own layer-by-layer profiling of activation distributions, routing sensitivity, and accumulated rounding error across the full architecture, we identified exactly which components can absorb 8-bit compression without behavioral change.
Quantized to F8_E4M3
All standard linear projections within the transformer blocks: Q, K, V, and output projections in attention, and the up, gate, and down projections in the routed expert MLPs. These layers represent the overwhelming majority of parameter count and memory bandwidth in the model.
Preserved at BF16
| Component | Reason |
|---|---|
| Visual encoder | Vision features have distributions that are structurally unlike language activations. Compressing them introduces grounding errors that propagate into cross-modal attention. |
| Gated DeltaNet / linear attention | Recurrent state is carried forward across every token in the sequence. Rounding errors here do not stay local. They accumulate. |
| MoE router gates | Routing decisions are discrete. A small numerical error can send a token to the wrong expert entirely, with effects that are not recoverable downstream. |
| Shared expert gate | The gate controls whether the shared expert fires at all. Same sensitivity as the router, applied every forward pass. |
| Shared expert MLP | Unlike routed experts, this layer is active for every token without exception. Its contribution compounds across the full sequence. |
| Token embeddings | A lookup table. Quantizing it saves almost nothing and introduces a fixed error floor on every single token representation before any computation begins. |
| Language model head | The final projection onto vocabulary logits. Precision here determines the shape of the output distribution. Errors at this layer affect sampling, greedy decoding, and low-probability token generation. |
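This split can be written down as a quantization recipe. The sketch below uses llm-compressor's FP8 dynamic scheme with an ignore list; the module names follow common Qwen-style MoE naming and are illustrative, not a confirmed dump of this checkpoint, and the card does not state which tool produced the released weights.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Hypothetical recipe mirroring the table above: quantize every Linear to FP8
# (per-channel weight scales, per-token dynamic activation scales) except the
# components preserved at BF16. Module names are illustrative.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",                       # language model head
        "re:.*visual.*",                 # visual encoder
        "re:.*linear_attn.*",            # gated DeltaNet / linear attention
        "re:.*mlp\\.gate$",              # MoE router gates
        "re:.*shared_expert_gate$",      # shared expert gate
        "re:.*shared_expert\\..*",       # shared expert MLP
        "model.embed_tokens",            # token embeddings
    ],
)
```

Applied through llm-compressor's oneshot entry point, a recipe like this yields static per-channel weight scales and leaves activation scales to be computed per token at serve time, which matches the behavior described earlier.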
Memory and KV Cache
Every quantized weight drops from 2 bytes to 1 byte. For the layers that are quantized, this is a direct 2x reduction in the memory required to hold the model.
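In rough numbers (a sketch only; the exact fraction of parameters left in BF16 is not broken out here):

```python
total_params = 37e9
quantized_fraction = 0.92   # assumption: the vast majority of weights sit in the quantized projections

bf16_gb  = total_params * 2 / 1e9
mixed_gb = (total_params * quantized_fraction * 1
            + total_params * (1 - quantized_fraction) * 2) / 1e9
print(f"all-BF16 weights: ~{bf16_gb:.0f} GB")
print(f"mixed FP8/BF16:   ~{mixed_gb:.0f} GB")
```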
The KV cache savings compound on top. During inference, every processed token writes a key vector and a value vector into a cache that persists for the duration of the request. The size of that cache is:

$$\text{cache size} = 2 \cdot L \cdot H \cdot d \cdot T \cdot b$$
Where $L$ is the number of layers, $H$ is the number of KV heads, $d$ is the head dimension, $T$ is the sequence length, and $b$ is bytes per element. Halving $b$ from 2 (BF16) to 1 (F8) halves the KV cache at every sequence length. At 32K tokens, this frees several gigabytes per active request. That headroom goes directly toward concurrent capacity. Same hardware, more users.
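The formula is easy to turn into a sizing helper. The architecture numbers below are placeholders for illustration, not the published configuration of this model:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    """Factor of 2 covers the key vector and the value vector written per token per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Placeholder architecture numbers, for illustration only.
L, H, d, T = 48, 8, 128, 32_768

for label, b in [("BF16", 2), ("F8", 1)]:
    gib = kv_cache_bytes(L, H, d, T, b) / 2**30
    print(f"{label}: {gib:.1f} GiB of KV cache for one {T:,}-token request")
```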
Benchmarks
Base model: Qwen/Qwen3-35B-A3B. All evaluations were run 0-shot using lm-evaluation-harness and lighteval, served with vLLM under --language-model-only, with seed values varying between runs.
| Category | Benchmark | Qwen3-35B-A3B | Axe veloce 37B | Recovery |
|---|---|---|---|---|
| Reasoning | GSM8K-Platinum (0-shot) | 91.98 | 91.12 | 100.1% |
| | MMLU-Pro (0-shot) | 80.65 | 81.62 | 100.0% |
| | Math 500 (0-shot) | 82.93 | 79.33 | 99.3% |
| | AIME 25 (0-shot) | 88.25 | 88.30 | 100.0% |
| Instruction Following | IFEval prompt-level strict (0-shot) | 88.50 | 88.45 | 99.4% |
| | IFEval inst-level strict (0-shot) | 90.69 | 90.29 | 99.6% |
| Coding | LiveCodeBench v6 (0-shot) | 71.43 | 72.38 | 101.3% |
On four of the seven benchmarks, Axe veloce matches or exceeds the base model score. The compressed model outperforms its uncompressed counterpart on coding, a result consistent with per-channel weight scaling producing a tighter effective dynamic range on the neurons most active during code generation tasks.
Deployment via vLLM
Axe veloce is fully compatible with vLLM and loads natively without additional configuration.
Text only -- skip the vision encoder to free VRAM for additional KV cache:
vllm serve srswti/axe-veloce-37b --reasoning-parser qwen3 --language-model-only
Multimodal -- full vision and language support:
vllm serve srswti/axe-veloce-37b --reasoning-parser qwen3
Tool use:
vllm serve srswti/axe-veloce-37b --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder
Speculative decoding via Multi-Token Prediction:
vllm serve srswti/axe-veloce-37b --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
Send requests using the OpenAI-compatible endpoint:
```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key is unused but the client requires one.
client = OpenAI(
    api_key="EMPTY",
    base_url="http://<your-server-host>:8000/v1",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

# The model name must match the one passed to `vllm serve`.
response = client.chat.completions.create(
    model="srswti/axe-veloce-37b",
    messages=messages,
)

print(response.choices[0].message.content)
```
Developed by SRSWTI Inc. - Building the world's fastest retrieval and inference engines.