CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
Abstract
A healthcare workflow benchmark challenges agents with policy-dense, multi-role, and multilateral-interaction requirements, revealing significant performance gaps in automated enterprise applications.
End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density: decisions must be grounded in a large library of medical, insurance, and operational rules; multi-role composition: a single task requires the agent to play multiple roles with handoffs; and multilateral interaction: intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach. We introduce χ-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status by calling tools and writing the role's artifacts, guided by a managed-care operations handbook skill of 1,290+ documents. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.
Community
Today, we introduce CHI-Bench (Clinical Healthcare In-situ Benchmark), the first long-horizon healthcare benchmark for AI agents.
We built high-fidelity simulators for three live domains: Provider Prior Authorization, Payer Utilization Management, and Population Health Care Management, each instantiated as MCP servers that operate on patient, clinician, and insurer records.
Each trial in CHI-Bench runs an agent for 60-80 steps across four to six clinical stages, exposing 21 healthcare apps through 200+ MCP tools and a 1,279-document operations handbook. It evaluates the trajectory, every artifact, and the world state using deterministic unit tests and an LLM judge for evidence grounding, consent, and cross-stage consistency.
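To make the evaluation description concrete, here is a minimal Python sketch of how deterministic checks and LLM-judge verdicts might be combined into one trial result. It is not the CHI-Bench harness: the `Trial` fields, the tool names (`search_handbook`, `submit_decision`), the status values, the artifact name, and the rule that a trial passes only when every check passes are all assumptions made for illustration. Only the three judge rubric names come from the text above, and the judge itself is stubbed as precomputed booleans.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trial:
    """Hypothetical record of one agent run; all field names are illustrative."""
    trajectory: List[str]        # tool names in call order
    artifacts: Dict[str, str]    # artifact name -> written text
    world_state: Dict[str, str]  # simulator key -> final value
    judge_verdicts: Dict[str, bool] = field(default_factory=dict)


# Rubric names taken from the announcement; the LLM judge is stubbed out.
JUDGE_RUBRICS = ("evidence_grounding", "consent", "cross_stage_consistency")


def reached_terminal_status(trial: Trial) -> bool:
    # World-state check: the case must end in a final status (assumed values).
    return trial.world_state.get("case_status") in {"approved", "denied"}


def wrote_determination_letter(trial: Trial) -> bool:
    # Artifact check: the (assumed) letter exists and is not blank.
    return bool(trial.artifacts.get("determination_letter", "").strip())


def consulted_handbook_before_decision(trial: Trial) -> bool:
    # Trajectory check: a handbook lookup precedes the decision call.
    calls = trial.trajectory
    if "search_handbook" not in calls or "submit_decision" not in calls:
        return False
    return calls.index("search_handbook") < calls.index("submit_decision")


DETERMINISTIC_CHECKS: List[Callable[[Trial], bool]] = [
    reached_terminal_status,
    wrote_determination_letter,
    consulted_handbook_before_decision,
]


def evaluate(trial: Trial) -> Dict[str, bool]:
    """Run every check; a missing judge verdict counts as a failure."""
    report = {check.__name__: check(trial) for check in DETERMINISTIC_CHECKS}
    for rubric in JUDGE_RUBRICS:
        report[rubric] = trial.judge_verdicts.get(rubric, False)
    # Assumed aggregation rule: conjunctive, so one failure fails the trial.
    report["passed"] = all(report.values())
    return report


example = Trial(
    trajectory=["search_handbook", "get_patient_record", "submit_decision"],
    artifacts={"determination_letter": "Approved per policy section 4.2."},
    world_state={"case_status": "approved"},
    judge_verdicts={rubric: True for rubric in JUDGE_RUBRICS},
)
report = evaluate(example)
```

Under this assumed conjunctive rule, flipping any single entry (for example, a failed `consent` verdict) fails the whole trial, which is one way a long-horizon task can score low even when most steps are correct.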
Results from 30 frontier agents on the leaderboard
- Best overall: Anthropic's Claude Code with Opus 4.6 — 28% pass@1.
- Runner-up: OpenAI's Codex with GPT-5.5 — 21%.
- By domain: utilization review 41%; care management 32%; prior-authorization paperwork 29%.
- Reliability: no agent clears 20% when the same case is run three times.
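
The gap between pass@1 and the three-run reliability number can be illustrated with a short Python sketch. It assumes that each case is run exactly three times and that the strict metric counts a case as solved only when all three runs succeed; the function names and the toy results are hypothetical and are not taken from the benchmark.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean per-run success rate, averaged over cases."""
    per_case = [sum(runs) / len(runs) for runs in results.values()]
    return sum(per_case) / len(per_case)


def strict_pass_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of cases whose first k runs all succeeded."""
    for case_id, runs in results.items():
        if len(runs) < k:
            raise ValueError(f"{case_id} has fewer than {k} runs")
    solved = sum(all(runs[:k]) for runs in results.values())
    return solved / len(results)


# Toy data (made up): four cases, three runs each.
toy_results = {
    "case-a": [True, True, True],
    "case-b": [True, False, True],
    "case-c": [False, False, False],
    "case-d": [True, True, False],
}

p1 = pass_at_1(toy_results)          # 7/12, about 0.583
p3 = strict_pass_k(toy_results, 3)   # 0.25
```

In the toy data, cases that succeed on two of three runs still contribute to pass@1 but count as failures under the strict metric, so the strict score can never exceed pass@1.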
CHI-Bench is open under Apache 2.0; the leaderboard accepts community submissions today.
🤖GitHub: https://github.com/actava-ai/chi-bench
🤗HuggingFace: https://huggingface.co/datasets/actava/chi-bench
🏆Leaderboard: https://actava.ai/benchmarks