vllm-patched-calib
vLLM 0.21.0 (commit ad7125a) with calibration-v2 hooks patch applied.
- Source repo: https://github.com/lucaspirola/moe_compress (branch feat/calibration-v2, immutable tag calib-v2-max-layer-early-exit)
- Patch artifact (10101 lines, MD5 a8da5e321ac7fb30f1648fba3476bea6): also uploaded to this repo as vllm_calibration_hooks.patch
- Architectures: sm_80 (A100), sm_90a (H100/H200), sm_100 (B200), sm_120 (RTX 6000 Pro Blackwell)
- Build host: HF Jobs (cpu-performance)
- torch: 2.11.0+cu130
- CUDA toolkit: 13.0
Install on a fresh GPU host
hf download pirola/vllm-patched-calib --include "*.whl" --local-dir /tmp/wheels
pip install /tmp/wheels/vllm-*.whl
Calibration capture flags
The patched vLLM accepts new env vars to enable calibration data capture:
- VLLM_CALIB_CAPTURE_ROUTER=1: per-layer router logits + topk
- VLLM_CALIB_CAPTURE_EXPERT=1: per-expert inputs + weighted outputs
- VLLM_CALIB_CAPTURE_EXPERT_UNWEIGHTED=1: kernel-level pre-weight per-expert outputs (Triton backend; forces VLLM_USE_FLASHINFER_MOE_FP16=0)
- VLLM_CALIB_CAPTURE_EXPERT_MID=1: silu(gate)·up intermediate (input to down_proj; Triton backend)
- VLLM_CALIB_CAPTURE_BLOCK=1: MoE block pre-residual output
- VLLM_CALIB_CAPTURE_IMATRIX=1: per-input-channel sum-of-squares for every linear layer (writes llama.cpp-compatible .imatrix.dat)
- VLLM_CALIB_CAPTURE_INPUT_COV=1: per-(layer, expert, "gate_proj") teacher input covariance Σ_in (requires VLLM_CALIB_CAPTURE_EXPERT=1; writes dict-shaped sidecars/covariance.pt, schema v2)
- VLLM_CALIB_MAX_LAYER=<N>: L2 early-exit gate. Truncates Qwen3MoeModel.forward after decoder layer N (inclusive); skips N+1..end. Default -1 / unset = disabled. Orthogonal to the capture gates above. Useful as the foundation for L1 (sequential REAP+REAM per-layer profiling) and as a standalone optimisation for any writer whose payload comes from layer L or earlier.
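The flags above interact in two documented ways: VLLM_CALIB_CAPTURE_INPUT_COV=1 requires VLLM_CALIB_CAPTURE_EXPERT=1, and VLLM_CALIB_CAPTURE_EXPERT_UNWEIGHTED=1 forces VLLM_USE_FLASHINFER_MOE_FP16=0. A minimal sketch of assembling a consistent environment before launching a process could look like the following. The helper name build_calib_env and the short capture keys are hypothetical and not part of the patch; only the environment variable names come from the list above.

```python
import os

# Short, hypothetical keys mapped to the env var names documented above.
CAPTURE_FLAGS = {
    "router": "VLLM_CALIB_CAPTURE_ROUTER",
    "expert": "VLLM_CALIB_CAPTURE_EXPERT",
    "expert_unweighted": "VLLM_CALIB_CAPTURE_EXPERT_UNWEIGHTED",
    "expert_mid": "VLLM_CALIB_CAPTURE_EXPERT_MID",
    "block": "VLLM_CALIB_CAPTURE_BLOCK",
    "imatrix": "VLLM_CALIB_CAPTURE_IMATRIX",
    "input_cov": "VLLM_CALIB_CAPTURE_INPUT_COV",
}


def build_calib_env(captures, max_layer=None, base_env=None):
    """Return a copy of the environment with the requested capture flags set."""
    env = dict(os.environ if base_env is None else base_env)
    enabled = set(captures)

    unknown = enabled - set(CAPTURE_FLAGS)
    if unknown:
        raise ValueError(f"unknown capture keys: {sorted(unknown)}")

    # Documented dependency: input covariance needs the expert capture.
    if "input_cov" in enabled:
        enabled.add("expert")

    for key in enabled:
        env[CAPTURE_FLAGS[key]] = "1"

    # The patch is documented to force this itself; setting it here only
    # makes the effective configuration explicit to the caller.
    if "expert_unweighted" in enabled:
        env["VLLM_USE_FLASHINFER_MOE_FP16"] = "0"

    # Unset (or -1) leaves the early-exit gate disabled.
    if max_layer is not None:
        env["VLLM_CALIB_MAX_LAYER"] = str(max_layer)

    return env
```

The returned dict can be passed as the env argument of subprocess.run when starting the driver, which keeps the parent shell's environment untouched.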
Spot-preemption resumability
The driver build_self_traces_calib_vllm.py writes a periodic
<jsonl>.imatrix.ckpt checkpoint at every chunk boundary (CLI:
--imatrix-checkpoint-every-chunks=1 by default). On --resume, the
checkpoint is hydrated into the live accumulators in-place and the
cumulative prompt counter is restored. The final .imatrix.dat and
the periodic .imatrix.ckpt both use the temp-file + os.replace
atomic-rename pattern so a kill mid-write leaves the previous file
intact. .npz logit sidecars are also written atomically.
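The temp-file + os.replace pattern mentioned above can be sketched as follows. This is an illustration of the general technique, not the driver's actual code; the function name atomic_write_bytes is hypothetical, and the fsync call is an extra durability step assumed here rather than something the text above states.

```python
import os
import tempfile


def atomic_write_bytes(path, payload):
    """Write payload to path so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) so
    # that os.replace is a rename rather than a copy.
    fd, tmp_path = tempfile.mkstemp(
        dir=directory, prefix=os.path.basename(path) + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # assumption: push bytes to disk first
        os.replace(tmp_path, path)  # atomically swap in the new file
    except BaseException:
        # On any failure, drop the temp file and leave the old file as-is.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process is killed before os.replace runs, the destination still holds the previous checkpoint and only a stray temp file is left behind.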
JSONL resume is hardened against trailing partial lines: each line is JSON-validated on resume, and the first parse failure truncates the file to the last good byte offset before counting resumes.
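A minimal sketch of that validate-and-truncate step is shown below. The function name repair_jsonl is hypothetical, and the sketch makes one assumption beyond the description above: a final line with no trailing newline is also treated as partial, so that later appends cannot fuse with it.

```python
import json


def repair_jsonl(path):
    """Truncate path after its last valid JSON line; return the line count."""
    valid_lines = 0
    good_offset = 0  # byte offset just past the last validated line
    with open(path, "rb+") as fh:
        while True:
            raw = fh.readline()
            if not raw:
                break  # clean end of file
            try:
                if not raw.endswith(b"\n"):
                    # Assumption: an unterminated line is a partial write.
                    raise ValueError("unterminated line")
                json.loads(raw.decode("utf-8"))
            except ValueError:
                # JSONDecodeError and UnicodeDecodeError both derive from
                # ValueError; cut the file back to the last good byte.
                fh.truncate(good_offset)
                break
            good_offset = fh.tell()
            valid_lines += 1
    return valid_lines
```

The returned count can then seed the resumed prompt counter, since every line that survives truncation has been parsed successfully.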