Any chances for A100?

#1
by traphix - opened

Can A100 run this model?

We haven't tested it on A100 yet.

We haven't tested it on A100 yet.

Looking forward to your A100 test results

Can A100 run this model?

YES, 8xA100 CAN run this model with limited context length!

you need to do:

  1. apply this PR, SM80 gpus do not support Sparse MLA. There is a Triton Sparse MLA for sm80. https://github.com/vllm-project/vllm/pull/38476
  2. download the INT4 model.

here is my command for 8*A100 gpus:

        --served-model-name GLM-5.2
        --dtype bfloat16
        --quantization compressed-tensors
        # --kv-cache-dtype fp8 # Triton Sparse MLA do not support fp8
        --tensor-parallel-size 8
        --enable-expert-parallel
        --max-model-len auto
        --gpu-memory-utilization 0.96
        --max-num-seqs 4
        --tool-call-parser glm47
        --reasoning-parser glm45
        # --speculative-config '{"method":"mtp","num_speculative_tokens":1}' # do not enable Speculative Decoding, it will reduce the context length to 100k.
        --disable-uvicorn-access-log
        --safetensors-load-strategy prefetch

my actual settings & results:

  • vllm: 0.23.1rc1.dev255+g435f82d61 (nightly version)
  • decoding speed: Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.6 tokens/s. this speed is okey. but not as fast as Minimax-M3 and Nemotron3 Ultra.
  • actual context length: [kv_cache_utils.py:1943] Auto-fit max_model_len: reduced from 1048576 to 236736 to fit in available GPU memory (20.43 GiB available for KV cache)
  • attention backend: (Worker_TP0_EP0 pid=3508478) INFO 06-23 17:11:11 [cuda.py:458] Using TRITON_MLA_SPARSE attention backend out of potential backends: ['TRITON_MLA_SPARSE'].
  • actual experience: worked on Roo code for a night, all thing is okey.

My full command:

argv: vllm serve model_hub/GLM-5.2-Int4-Int8Mix --trust-remote-code --port 1080 --root-path /aiforward1039931975797833728 --enable-log-requests --enable-auto-tool-choice --served-model-name GLM-5.2 --dtype bfloat16 --quantization compressed-tensors --tensor-parallel-size 8 --enable-expert-parallel --max-model-len auto --gpu-memory-utilization 0.96 --max-num-seqs 4 --tool-call-parser glm47 --reasoning-parser glm45 --disable-uvicorn-access-log --safetensors-load-strategy prefetch

Sign up or log in to comment