Toto-1.0-QA-Experimental

Toto-1.0-QA-Experimental is a hybrid time-series foundation model (TSFM) and vision-language model (VLM) for ARFBench, where it achieves accuracy and macro F1 comparable to top frontier models:

[Figure: arfbench-accuracy-f1-combined] Overall accuracy and F1 on the ARFBench time-series question-answering benchmark, as of paper release. Toto-1.0-QA-Experimental achieves the top accuracy and F1 comparable to top frontier models.

It combines:

  • a vision-language backbone (Qwen/Qwen3-VL-32B-Instruct) for image-conditioned question answering,
  • Toto time-series representations (Datadog/Toto-Open-Base-1.0),
  • lightweight projection modules that inject time-series signals into VLM inference.
[Figure: toto-vlm-arch] Overview of the Toto-1.0-QA-Experimental architecture.
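The repository does not spell out how the projection modules are built, but their role is to map Toto's time-series embeddings into the VLM's hidden space. As an illustrative sketch only (the class name `TimeSeriesProjector`, the two-layer MLP design, and the dimensions 768 and 5120 are all assumptions, not the repository's actual implementation), such an adapter might look like:

```python
import torch
from torch import nn


class TimeSeriesProjector(nn.Module):
    """Hypothetical adapter mapping time-series embeddings into a VLM hidden space.

    Dimensions below are illustrative placeholders, not the real model's sizes.
    """

    def __init__(self, ts_dim: int, vlm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(ts_dim, vlm_dim),
            nn.GELU(),
            nn.Linear(vlm_dim, vlm_dim),
        )

    def forward(self, ts_embeddings: torch.Tensor) -> torch.Tensor:
        # ts_embeddings: [batch, n_tokens, ts_dim] -> [batch, n_tokens, vlm_dim]
        return self.proj(ts_embeddings)


projector = TimeSeriesProjector(ts_dim=768, vlm_dim=5120)
dummy = torch.randn(1, 16, 768)
print(projector(dummy).shape)  # torch.Size([1, 16, 5120])
```

The projected time-series tokens can then be injected into the VLM's input sequence alongside image and text tokens during inference.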

This model repository stores inference artifacts, including:

  • vlm/ (merged vision-language model weights),
  • ts_modules.pt (time-series modules),
  • config.json and processor files.

Basic Inference Example

The example below assumes you already have:

  • time-series tensors,
  • one or more image paths,
  • a text question.
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

# From our GitHub repository (https://github.com/DataDog/arfbench)
from model.toto_vlm_components import TotoAnomalyQAModel, TimeSeriesData

repo_id = "Datadog/Toto-1.0-QA-Experimental"

# Load model + processor from Hub artifact
model = TotoAnomalyQAModel.from_pretrained(
    repo_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(repo_id)
model.eval()

# -----------------------------------------------------------------------------
# Example input data (replace with your real tensors and inputs)
# -----------------------------------------------------------------------------
series = ...  # torch.Tensor, shape: [n_channels, n_timesteps], float32
padding_mask = ...  # torch.Tensor, shape: [n_channels, n_timesteps], bool
id_mask = ...  # torch.Tensor, shape: [n_channels, n_timesteps], float/bool
timestamp_seconds = ...  # torch.Tensor, shape: [n_channels, n_timesteps]
time_interval_seconds = ...  # torch.Tensor, shape: [n_channels]
group_names = ...  # list[str], length n_channels
question = "In the following time-series, does the anomaly in this time-series correlate with the anomaly in the other time-series, if anomalies exist?"
image_paths = ["./image_1.png", "./image_2.png"]

ts_data = TimeSeriesData(
    series=series,
    padding_mask=padding_mask,
    id_mask=id_mask,
    timestamp_seconds=timestamp_seconds,
    time_interval_seconds=time_interval_seconds,
    num_groups=series.shape[0],
    query_group="custom-query",
    group_names=group_names,
)

# Build multimodal chat input (images + text)
messages = [
    {
        "role": "system",
        "content": "You are an expert observability anomaly analyst.",
    },
    {
        "role": "user",
        "content": (
            [{"type": "image", "image": p} for p in image_paths]
            + [{"type": "text", "text": question}]
        ),
    },
]

text_prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
processed_images, _ = process_vision_info(messages)

inputs = processor(
    text=[text_prompt],
    images=[processed_images],
    return_tensors="pt",
    padding=True,
)

device = next(model.parameters()).device
inputs = {
    k: v.to(device) if isinstance(v, torch.Tensor) else v
    for k, v in inputs.items()
}

# Generate answer
with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs.get("attention_mask"),
        pixel_values=inputs.get("pixel_values"),
        image_grid_thw=inputs.get("image_grid_thw"),
        ts_data=[ts_data],  # batch of 1
        max_new_tokens=512,
        do_sample=False,
    )

prompt_len = inputs["input_ids"].shape[1]
answer = processor.decode(
    output_ids[0, prompt_len:],
    skip_special_tokens=True,
).strip()

print("Answer:", answer)
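For a quick smoke test of the pipeline, the placeholder tensors above can be filled with synthetic data. The shapes and dtypes follow the comments in the example; the values themselves are meaningless, and the 60-second interval is an arbitrary choice:

```python
import torch

n_channels, n_timesteps = 2, 512

series = torch.randn(n_channels, n_timesteps, dtype=torch.float32)
# All timesteps valid (no padding)
padding_mask = torch.ones(n_channels, n_timesteps, dtype=torch.bool)
id_mask = torch.zeros(n_channels, n_timesteps, dtype=torch.float32)
# One timestamp per step, shared across channels, spaced 60 s apart
timestamp_seconds = (
    torch.arange(n_timesteps, dtype=torch.float32) * 60.0
).expand(n_channels, n_timesteps)
time_interval_seconds = torch.full((n_channels,), 60.0)
group_names = [f"series_{i}" for i in range(n_channels)]
```

With real data, replace the random values with your actual metrics and set the timestamps and intervals to match your collection frequency.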

Minimum Requirements

Running Toto-1.0-QA-Experimental typically requires a multi-GPU setup (tested on 4x A100 40GB GPUs). If memory is limited, reduce --max-ts-length and/or use quantization flags.
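A rough back-of-the-envelope check (ignoring activations, the KV cache, and the time-series modules) shows why a single 40 GB GPU is not enough: the 32B-parameter backbone alone in bfloat16 needs about 64 GB just for weights.

```python
# Rough weight-memory estimate for a 32B-parameter model in bfloat16.
params = 32e9
bytes_per_param = 2  # bfloat16 uses 2 bytes per parameter
weight_gb = params * bytes_per_param / 1e9
print(f"{weight_gb:.0f} GB")  # 64 GB
```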


Citation

@misc{xie2026arfbenchbenchmarkingtimeseries,
      title={ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response}, 
      author={Stephan Xie and Ben Cohen and Mononito Goswami and Junhong Shen and Emaad Khwaja and Chenghao Liu and David Asker and Othmane Abou-Amal and Ameet Talwalkar},
      year={2026},
      eprint={2604.21199},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2604.21199}, 
}