vllm-patched-calib

vLLM 0.21.0 (commit ad7125a) with calibration-v2 hooks patch applied.

  • Source repo: https://github.com/lucaspirola/moe_compress (branch feat/calibration-v2, immutable tag calib-v2-max-layer-early-exit)
  • Patch artifact (10101 lines, MD5 a8da5e321ac7fb30f1648fba3476bea6): also uploaded to this repo as vllm_calibration_hooks.patch
  • Architectures: sm_80 (A100), sm_90a (H100/H200), sm_100 (B200), sm_120 (RTX 6000 Pro Blackwell)
  • Build host: HF Jobs (cpu-performance)
  • torch: 2.11.0+cu130
  • CUDA toolkit: 13.0
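
The MD5 listed for the patch artifact can be checked after downloading vllm_calibration_hooks.patch. The following is a minimal sketch, not part of the repo: the helper names md5_of_file and patch_matches are hypothetical, and the local path is whatever location the file was downloaded to.

```python
import hashlib

# MD5 of the patch artifact as stated in the list above.
EXPECTED_MD5 = "a8da5e321ac7fb30f1648fba3476bea6"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def patch_matches(path, expected=EXPECTED_MD5):
    """True if the file at `path` hashes to the expected MD5 (hypothetical helper)."""
    return md5_of_file(path) == expected.lower()
```

Calling patch_matches("vllm_calibration_hooks.patch") should return True for an intact download. MD5 only guards against corruption here, not against tampering.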

Install on a fresh GPU host

hf download pirola/vllm-patched-calib --include "*.whl" --local-dir /tmp/wheels
pip install /tmp/wheels/vllm-*.whl

Calibration capture flags

The patched vLLM accepts new env vars to enable calibration data capture:

  • VLLM_CALIB_CAPTURE_ROUTER=1 - per-layer router logits + topk
  • VLLM_CALIB_CAPTURE_EXPERT=1 - per-expert inputs + weighted outputs
  • VLLM_CALIB_CAPTURE_EXPERT_UNWEIGHTED=1 - kernel-level pre-weight per-expert outputs (Triton backend; forces VLLM_USE_FLASHINFER_MOE_FP16=0)
  • VLLM_CALIB_CAPTURE_EXPERT_MID=1 - silu(gate)·up intermediate (input to down_proj; Triton backend)
  • VLLM_CALIB_CAPTURE_BLOCK=1 - MoE block pre-residual output
  • VLLM_CALIB_CAPTURE_IMATRIX=1 - per-input-channel sum-of-squares for every linear layer (writes llama.cpp-compatible .imatrix.dat)
  • VLLM_CALIB_CAPTURE_INPUT_COV=1 - per-(layer, expert, "gate_proj") teacher input covariance Σ_in (requires VLLM_CALIB_CAPTURE_EXPERT=1; writes dict-shaped sidecars/covariance.pt, schema v2)
  • VLLM_CALIB_MAX_LAYER=<N> - L2 early-exit gate: truncate Qwen3MoeModel.forward after decoder layer N (inclusive); skips N+1..end. Default -1 / unset = disabled. Orthogonal to the capture gates above. Useful as the foundation for L1 (sequential REAP+REAM per-layer profiling) and as a standalone optimisation for any writer whose payload comes from layer L or earlier.
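
The flags above are plain environment variables, so one way to use them is to assemble an environment dict before launching the process that imports vLLM. The sketch below is illustrative only: build_calib_env is a hypothetical helper, not part of the patch, and the only rule it enforces is the one stated above (INPUT_COV requires EXPERT).

```python
import os

# Capture gates listed above; each is enabled by setting it to "1".
CAPTURE_FLAGS = frozenset({
    "VLLM_CALIB_CAPTURE_ROUTER",
    "VLLM_CALIB_CAPTURE_EXPERT",
    "VLLM_CALIB_CAPTURE_EXPERT_UNWEIGHTED",
    "VLLM_CALIB_CAPTURE_EXPERT_MID",
    "VLLM_CALIB_CAPTURE_BLOCK",
    "VLLM_CALIB_CAPTURE_IMATRIX",
    "VLLM_CALIB_CAPTURE_INPUT_COV",
})


def build_calib_env(capture, max_layer=None, base_env=None):
    """Return a copy of the environment with the requested gates set.

    Hypothetical helper: `capture` is an iterable of flag names,
    `max_layer` is the optional early-exit layer index N.
    """
    capture = set(capture)
    unknown = capture - CAPTURE_FLAGS
    if unknown:
        raise ValueError(f"unknown capture flags: {sorted(unknown)}")
    # Documented dependency: input covariance needs expert capture.
    if ("VLLM_CALIB_CAPTURE_INPUT_COV" in capture
            and "VLLM_CALIB_CAPTURE_EXPERT" not in capture):
        raise ValueError(
            "VLLM_CALIB_CAPTURE_INPUT_COV requires VLLM_CALIB_CAPTURE_EXPERT=1")

    env = dict(os.environ if base_env is None else base_env)
    for name in sorted(capture):
        env[name] = "1"
    # Unset or negative means the early-exit gate stays disabled.
    if max_layer is not None and max_layer >= 0:
        env["VLLM_CALIB_MAX_LAYER"] = str(max_layer)
    return env
```

The returned dict can be passed as the env argument of subprocess.run when starting the calibration driver, which keeps the parent shell's environment untouched.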

Spot-preemption resumability

The driver build_self_traces_calib_vllm.py writes a periodic <jsonl>.imatrix.ckpt checkpoint at every chunk boundary (CLI: --imatrix-checkpoint-every-chunks=1 by default). On --resume, the checkpoint is hydrated into the live accumulators in place and the cumulative prompt counter is restored. Both the final .imatrix.dat and the periodic .imatrix.ckpt are written with the temp-file + os.replace atomic-rename pattern, so a kill mid-write leaves the previous file intact. The .npz logit sidecars are also written atomically.
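
The temp-file + os.replace pattern mentioned above can be sketched as follows. This is a generic illustration, not the driver's actual code: atomic_write_bytes is a hypothetical name, and the fsync call is an assumption about how durable the write needs to be.

```python
import os
import tempfile


def atomic_write_bytes(path, payload):
    """Write `payload` to `path` so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # for os.replace to be an atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".part")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # atomically swap the new file into place
    except BaseException:
        # On any failure, drop the partial temp file and keep the old target.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process is killed before os.replace runs, the previous checkpoint at path is untouched and only a stray temp file remains.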

JSONL resume is hardened against trailing partial lines: each line is JSON-validated on resume, and the first parse failure truncates the file to the last good byte offset. Counting resumes only after that truncation.
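
A minimal sketch of this kind of validation pass is shown below. It is not the driver's implementation: truncate_to_last_valid_line is a hypothetical name, and treating a final line without a trailing newline as partial is a conservative assumption made here.

```python
import json


def truncate_to_last_valid_line(path):
    """Cut a JSONL file back to its last fully valid line.

    Returns the number of valid lines kept.
    """
    good_offset = 0  # byte offset just past the last valid line
    valid_lines = 0
    with open(path, "rb") as fh:
        for raw in fh:
            # Assumption: a line with no trailing newline was cut off mid-write.
            if not raw.endswith(b"\n"):
                break
            try:
                json.loads(raw)
            except ValueError:  # covers JSONDecodeError and bad UTF-8
                break
            good_offset += len(raw)
            valid_lines += 1
    # Drop everything after the last good line, including the bad one.
    with open(path, "r+b") as fh:
        fh.truncate(good_offset)
    return valid_lines
```

The returned count is what a resume step would use as its starting point, since every line before the offset has been parsed successfully.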
