Libraries

How to use RedHatAI/gemma-4-31B-it-FP8-block with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="RedHatAI/gemma-4-31B-it-FP8-block")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("RedHatAI/gemma-4-31B-it-FP8-block")
model = AutoModelForImageTextToText.from_pretrained("RedHatAI/gemma-4-31B-it-FP8-block")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use RedHatAI/gemma-4-31B-it-FP8-block with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "RedHatAI/gemma-4-31B-it-FP8-block"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RedHatAI/gemma-4-31B-it-FP8-block",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/RedHatAI/gemma-4-31B-it-FP8-block

SGLang

How to use RedHatAI/gemma-4-31B-it-FP8-block with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "RedHatAI/gemma-4-31B-it-FP8-block" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RedHatAI/gemma-4-31B-it-FP8-block",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "RedHatAI/gemma-4-31B-it-FP8-block" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RedHatAI/gemma-4-31B-it-FP8-block",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use RedHatAI/gemma-4-31B-it-FP8-block with Docker Model Runner:
```
docker model run hf.co/RedHatAI/gemma-4-31B-it-FP8-block
```

gemma-4-31B-it-FP8-block

Model Overview

Model Architecture: google/gemma-4-31B-it
- Input: Text / Image
- Output: Text
Model Optimizations:
- Weight quantization: FP8
- Activation quantization: FP8
Release Date: 2026-04-04
Version: 1.0
Model Developers: RedHatAI

This model is a quantized version of google/gemma-4-31B-it. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

Model Optimizations

This model was obtained by quantizing the weights and activations of google/gemma-4-31B-it to the FP8 data type, ready for inference with vLLM. This optimization reduces the number of bits per parameter from 16 to 8, which cuts disk size and GPU memory requirements by approximately 50%.

Only the weights and activations of the linear operators within the transformer blocks are quantized, using LLM Compressor. The vision tower, embedding, and output head layers are kept in their original precision.
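As a rough check of the 50% figure, the weight footprint can be estimated from the parameter count and the bits per parameter. The sketch below assumes the nominal 31B parameter count and, for simplicity, treats every parameter as quantized. Because the vision tower, embeddings, and output head stay in their original precision, and block quantization stores extra scale factors, the real saving is somewhat smaller. The function name is illustrative.

```python
def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter count (assumption: every parameter is counted as quantized).
NUM_PARAMS = 31e9

bf16_gb = weight_footprint_gb(NUM_PARAMS, 16)
fp8_gb = weight_footprint_gb(NUM_PARAMS, 8)

print(f"16-bit weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:    ~{fp8_gb:.0f} GB")
print(f"Reduction:      ~{1 - fp8_gb / bf16_gb:.0%}")
```

This counts weights only; KV cache and activation memory at inference time come on top of it.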

Deployment

Use with vLLM

This model can be deployed using vLLM. For detailed instructions including multi-GPU deployment, multimodal inference, thinking mode, function calling, and benchmarking, see the Gemma 4 vLLM usage guide.

Start the vLLM server:

vllm serve RedHatAI/gemma-4-31B-it-FP8-block --max-model-len 32768

To enable thinking/reasoning and tool calling:

vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --max-model-len 32768 \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice
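
With the tool-call parser and auto tool choice enabled, a request can offer tools to the model through the `tools` field of the OpenAI-compatible chat completions API. The following is a minimal sketch of such a request body. The `get_weather` tool and the helper name `build_tool_request` are hypothetical and exist only for illustration.

```python
import json


def build_tool_request(model: str, prompt: str) -> dict:
    """Build a chat-completions request body that offers one tool to the model."""
    # Hypothetical tool definition, used only for illustration.
    get_weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [get_weather_tool],
        "tool_choice": "auto",
    }


request_body = build_tool_request(
    "RedHatAI/gemma-4-31B-it-FP8-block",
    "What is the weather in Paris right now?",
)
print(json.dumps(request_body, indent=2))
```

The body can be sent as JSON to the server's `/v1/chat/completions` endpoint, or unpacked as keyword arguments to an OpenAI client call such as `client.chat.completions.create(**request_body)`.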

Tip: For text-only workloads, pass --limit-mm-per-prompt image=0 to skip vision encoder memory allocation. Set --gpu-memory-utilization 0.90 to maximize KV cache capacity.

Send requests to the server:

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/gemma-4-31B-it-FP8-block"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
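
Because the model accepts image input, the same client can also send multimodal messages. The sketch below only builds the message list, using the content-part layout of the OpenAI-compatible API (`text` and `image_url` parts). The helper name `build_image_messages` is illustrative, not part of any library.

```python
def build_image_messages(prompt: str, image_url: str) -> list:
    """Return a single-turn message list with one text part and one image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


image_messages = build_image_messages(
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
print(image_messages[0]["content"][1]["image_url"]["url"])
```

The resulting list can be passed as `messages` to `client.chat.completions.create` in place of the text-only list, provided the server was not started with images disabled.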

Creation

This model was created by applying data-free FP8 block quantization with LLM Compressor, as presented in the code snippet below.

from llmcompressor import model_free_ptq

MODEL_ID = "google/gemma-4-31B-it"
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-block"

model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="FP8_BLOCK",
    ignore=["re:.*vision.*", "lm_head", "re:.*embed_tokens.*"],
    max_workers=8,
    device="cuda:0",
)
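
To illustrate how the `ignore` list selects the layers that stay in their original precision, the sketch below applies the same three entries to a few module names. It assumes that entries prefixed with `re:` are regular expressions matched from the start of the name and that other entries must match the name exactly. This mirrors the convention as understood here and is not LLM Compressor's own code. The module names are hypothetical.

```python
import re

IGNORE = ["re:.*vision.*", "lm_head", "re:.*embed_tokens.*"]


def is_ignored(module_name: str, patterns=IGNORE) -> bool:
    """Return True if a module name matches any ignore entry.

    Assumption: entries prefixed with "re:" are regular expressions matched
    from the start of the name; other entries must equal the name exactly.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
for name in [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.vision_tower.encoder.layers.0.mlp.fc1",
    "model.language_model.embed_tokens",
    "lm_head",
]:
    status = "kept in original precision" if is_ignored(name) else "quantized"
    print(f"{name}: {status}")
```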

Evaluation

This model was evaluated on GSM8k-Platinum, MMLU-CoT, MMLU-Pro, and IFEval using lm-evaluation-harness, with the model served by vLLM through its OpenAI-compatible API. All evaluations were performed with thinking turned off.

Accuracy

| Category | Benchmark | google/gemma-4-31B-it | RedHatAI/gemma-4-31B-it-FP8-block | Recovery |
|---|---|---|---|---|
| Instruction Following | GSM8k-Platinum (5-shot, strict-match) | 97.60 | 97.82 | 100.2% |
| | MMLU-CoT (5-shot, strict_match) | 90.53 | 90.70 | 100.2% |
| | MMLU-Pro (5-shot, custom-extract) | 85.03 | 84.92 | 99.9% |
| | IFEval (0-shot, prompt-level strict) | 91.07 | 91.31 | 100.3% |
| | IFEval (0-shot, inst-level strict) | 93.76 | 93.84 | 100.1% |
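
The Recovery column is consistent with the quantized score expressed as a percentage of the baseline score. That reading is an assumption, since the card does not define the column, but the sketch below recomputes it from the reported scores and the rounded results agree with the table.

```python
# (benchmark, baseline score, quantized score) copied from the accuracy table.
RESULTS = [
    ("GSM8k-Platinum", 97.60, 97.82),
    ("MMLU-CoT", 90.53, 90.70),
    ("MMLU-Pro", 85.03, 84.92),
    ("IFEval (prompt-level strict)", 91.07, 91.31),
    ("IFEval (inst-level strict)", 93.76, 93.84),
]


def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


for name, base, quant in RESULTS:
    print(f"{name}: {recovery(base, quant):.1f}%")
```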

Reproduction

Each benchmark was run 3 times with different random seeds (42, 1234, 4158), and the scores were averaged. The results were obtained using the following commands, shown here with seed 1234:

vLLM server:

vllm serve RedHatAI/gemma-4-31B-it-FP8-block --max-model-len 96000

GSM8k-Platinum (lm-eval, 5-shot, 3 repetitions)

lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/gemma-4-31B-it-FP8-block,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --output_path results_gsm8k_platinum.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=1234"

MMLU-CoT (lm-eval, 5-shot, 3 repetitions)

lm_eval --model local-chat-completions \
  --tasks mmlu_cot_llama \
  --model_args "model=RedHatAI/gemma-4-31B-it-FP8-block,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --output_path results_mmlu_cot.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=1234"

MMLU-Pro (lm-eval, 5-shot, 3 repetitions)

lm_eval --model local-chat-completions \
  --tasks mmlu_pro_chat \
  --model_args "model=RedHatAI/gemma-4-31B-it-FP8-block,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --output_path results_mmlu_pro.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=1234"

IFEval (lm-eval, 0-shot, 3 repetitions)

lm_eval --model local-chat-completions \
  --tasks ifeval \
  --model_args "model=RedHatAI/gemma-4-31B-it-FP8-block,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --output_path results_ifeval.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=1234"
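
The commands above use seed 1234. The following sketch builds the equivalent invocation for each of the three seeds without running anything. It assumes the other runs differed only in the seed, set both through `--seed` and inside `--gen_kwargs`. The per-seed output file names and the helper `lm_eval_command` are hypothetical.

```python
import shlex

SEEDS = [42, 1234, 4158]
MODEL = "RedHatAI/gemma-4-31B-it-FP8-block"


def lm_eval_command(task: str, seed: int, num_fewshot=None) -> list:
    """Build the lm_eval argument list for one task and one seed."""
    model_args = (
        f"model={MODEL},max_length=96000,"
        "base_url=http://0.0.0.0:8000/v1/chat/completions,"
        "num_concurrent=128,max_retries=3,tokenized_requests=False,"
        "tokenizer_backend=None,timeout=2400"
    )
    gen_kwargs = (
        "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,"
        f"max_gen_toks=64000,seed={seed}"
    )
    cmd = ["lm_eval", "--model", "local-chat-completions",
           "--tasks", task, "--model_args", model_args]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    cmd += [
        "--apply_chat_template", "--fewshot_as_multiturn",
        # Hypothetical per-seed file name so runs do not overwrite each other.
        "--output_path", f"results_{task}_seed{seed}.json",
        "--seed", str(seed),
        "--gen_kwargs", gen_kwargs,
    ]
    return cmd


commands = [
    lm_eval_command("gsm8k_platinum_cot_llama", seed, num_fewshot=5)
    for seed in SEEDS
]
for cmd in commands:
    print(shlex.join(cmd))
```

For IFEval, which is run 0-shot, `num_fewshot` can be left at its default so that the flag is omitted, matching the command above.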

Safetensors

Model size: 31B params
Tensor type: BF16, F8_E4M3

Model tree for RedHatAI/gemma-4-31B-it-FP8-block

Base model: google/gemma-4-31B
Finetuned: google/gemma-4-31B-it
Quantized (209): this model