ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response
Paper: [arXiv:2604.21199](https://arxiv.org/abs/2604.21199)
Toto-1.0-QA-Experimental is a hybrid time-series foundation model (TSFM) and vision-language model (VLM) for ARFBench. It achieves comparable macro F1 and accuracy to top frontier models on ARFBench:
*Figure: overall accuracy and macro F1 on the ARFBench time series question-answering benchmark, as of paper release. Toto-1.0-QA-Experimental achieves the top accuracy and comparable F1 to top frontier models.*
It combines:

- a vision-language model ([Qwen/Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct)) for image-conditioned question answering,
- a time-series foundation model ([Datadog/Toto-Open-Base-1.0](https://huggingface.co/Datadog/Toto-Open-Base-1.0)).

This model repository stores inference artifacts, including:

- `vlm/` (merged vision-language model weights),
- `ts_modules.pt` (time-series modules),
- `config.json` and processor files.

The example below assumes you already have the `arfbench` repository code (which provides `model.toto_vlm_components`) and its dependencies, including `qwen-vl-utils`, installed:
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info
# From our Github repository (https://github.com/DataDog/arfbench)
from model.toto_vlm_components import TotoAnomalyQAModel, TimeSeriesData
repo_id = "Datadog/Toto-1.0-QA-Experimental"
# Load model + processor from Hub artifact
model = TotoAnomalyQAModel.from_pretrained(
repo_id,
device_map="auto",
torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(repo_id)
model.eval()
# -----------------------------------------------------------------------------
# Example input data (replace with your real tensors and inputs)
# -----------------------------------------------------------------------------
series = ... # torch.Tensor, shape: [n_channels, n_timesteps], float32
padding_mask = ... # torch.Tensor, shape: [n_channels, n_timesteps], bool
id_mask = ... # torch.Tensor, shape: [n_channels, n_timesteps], float/bool
timestamp_seconds = ... # torch.Tensor, shape: [n_channels, n_timesteps]
time_interval_seconds = ... # torch.Tensor, shape: [n_channels]
group_names = ... # list[str], length n_channels
question = "In the following time series, does the anomaly in this series correlate with the anomaly in the other series, if anomalies exist?"
image_paths = ["./image_1.png", "./image_2.png"]
ts_data = TimeSeriesData(
series=series,
padding_mask=padding_mask,
id_mask=id_mask,
timestamp_seconds=timestamp_seconds,
time_interval_seconds=time_interval_seconds,
num_groups=series.shape[0],
query_group="custom-query",
group_names=group_names,
)
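For a quick smoke test, the `...` placeholders above can be filled with synthetic tensors whose shapes match the comments. The values below (two channels, 256 timesteps, 60-second intervals) are arbitrary choices, and the `id_mask`/group semantics are assumptions; consult the arfbench repository for the exact conventions:

```python
import torch

n_channels, n_timesteps = 2, 256

# Random-walk series, one row per channel
series = torch.randn(n_channels, n_timesteps).cumsum(dim=-1)

# All timesteps are real observations (no padding)
padding_mask = torch.ones(n_channels, n_timesteps, dtype=torch.bool)

# Assumption: all channels share a single id/group (id 0)
id_mask = torch.zeros(n_channels, n_timesteps)

# One reading every 60 seconds, shared across channels
time_interval_seconds = torch.full((n_channels,), 60)
timestamp_seconds = (
    torch.arange(n_timesteps).repeat(n_channels, 1)
    * time_interval_seconds.unsqueeze(-1)
)

group_names = [f"metric_{i}" for i in range(n_channels)]
```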
# Build multimodal chat input (images + text)
messages = [
{
"role": "system",
"content": "You are an expert observability anomaly analyst.",
},
{
"role": "user",
"content": (
[{"type": "image", "image": p} for p in image_paths]
+ [{"type": "text", "text": question}]
),
},
]
text_prompt = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
processed_images, _ = process_vision_info(messages)
inputs = processor(
text=[text_prompt],
images=[processed_images],
return_tensors="pt",
padding=True,
)
device = next(model.parameters()).device
inputs = {
k: v.to(device) if isinstance(v, torch.Tensor) else v
for k, v in inputs.items()
}
# Generate answer
with torch.no_grad():
output_ids = model.generate(
input_ids=inputs["input_ids"],
attention_mask=inputs.get("attention_mask"),
pixel_values=inputs.get("pixel_values"),
image_grid_thw=inputs.get("image_grid_thw"),
ts_data=[ts_data], # batch of 1
max_new_tokens=512,
do_sample=False,
)
prompt_len = inputs["input_ids"].shape[1]
answer = processor.decode(
output_ids[0, prompt_len:],
skip_special_tokens=True,
).strip()
print("Answer:", answer)
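ARFBench-style questions often expect a short categorical verdict, while the model may generate a full sentence. A small post-processing helper (hypothetical, not part of the released code) can normalize the decoded answer by its leading token:

```python
def normalize_answer(text: str) -> str:
    """Map a free-form model answer onto 'yes' / 'no' / 'unknown'
    based on the first word of the response."""
    words = text.strip().lower().split()
    first = words[0].strip(".,!?") if words else ""
    if first == "yes":
        return "yes"
    if first == "no":
        return "no"
    return "unknown"

print(normalize_answer("Yes, the anomalies are correlated."))  # yes
```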
Running Toto-1.0-QA-Experimental typically requires a multi-GPU setup (tested on 4x A100 40GB). If memory is limited, reduce `--max-ts-length` and/or use quantization flags.
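One simple memory mitigation, in the spirit of `--max-ts-length`, is to truncate each series to its most recent timesteps before building `TimeSeriesData`. A minimal sketch (the 2048-step cap is an arbitrary assumption, not a value from the repository):

```python
import torch

def truncate_series(series: torch.Tensor, max_ts_length: int = 2048) -> torch.Tensor:
    """Keep only the most recent max_ts_length timesteps of a
    [n_channels, n_timesteps] tensor."""
    if series.shape[-1] <= max_ts_length:
        return series
    return series[..., -max_ts_length:]

x = torch.randn(3, 5000)
print(truncate_series(x).shape)  # torch.Size([3, 2048])
```

The same slicing would need to be applied consistently to `padding_mask`, `id_mask`, and `timestamp_seconds` so all per-timestep tensors stay aligned.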
@misc{xie2026arfbenchbenchmarkingtimeseries,
title={ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response},
author={Stephan Xie and Ben Cohen and Mononito Goswami and Junhong Shen and Emaad Khwaja and Chenghao Liu and David Asker and Othmane Abou-Amal and Ameet Talwalkar},
year={2026},
eprint={2604.21199},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2604.21199},
}
Base model: [Datadog/Toto-Open-Base-1.0](https://huggingface.co/Datadog/Toto-Open-Base-1.0)