Artificial Analysis and IBM have introduced ITBench-AA, a new benchmark that evaluates AI models on agentic enterprise IT tasks, starting with challenges in Site Reliability Engineering (SRE).
Benchmark Overview
ITBench-AA focuses on the performance of models in handling Kubernetes incident response. The benchmark assesses how well these models can diagnose issues in live systems by analyzing logs, tracing dependencies, and identifying root-cause entities within complex infrastructures. The underlying dataset, crafted by IBM, draws from extensive expertise in enterprise IT operations.
Initial Findings
In the inaugural evaluation, even the leading frontier models scored below 50%. The top performer, Claude Opus 4.7 (Adaptive Reasoning, Max Effort), scored 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%. This places ITBench-AA SRE among the least saturated benchmarks for agentic tasks, in contrast to benchmarks such as Terminal-Bench, where scores are higher.
Task Structure and Methodology
The benchmark comprises 59 SRE tasks: 40 public tasks and 19 new, held-out tasks. Each task presents a Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology. Models must identify the minimal set of independent root-cause entities responsible for the incident. Incidents cover failure modes such as resource quota exhaustion and network partitions.
Models operate within the Stirrup reference harness, which provides shell access to a sandboxed file system for task execution. Each task is capped at 100 turns, with three repeats per task to ensure robust evaluation. Scoring is based on average precision at full recall, where a model must identify all ground-truth root causes to score above zero.
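The scoring rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's actual scoring code: it assumes root causes are compared as exact-match entity names, that precision is the share of predicted entities that are true root causes, and that the final score is a plain mean over all runs. The function names and entity names are hypothetical.

```python
from statistics import mean


def precision_at_full_recall(predicted, ground_truth):
    """Precision of a prediction, or 0.0 if any true root cause is missing.

    Assumption: entities are matched by exact name.
    """
    predicted, ground_truth = set(predicted), set(ground_truth)
    # Full recall is required: every ground-truth entity must be predicted.
    if not predicted or not ground_truth <= predicted:
        return 0.0
    # With full recall, the correct predictions are exactly the ground truth.
    return len(ground_truth) / len(predicted)


def benchmark_score(runs):
    """Mean score over (predicted, ground_truth) pairs (assumed aggregation)."""
    return mean(precision_at_full_recall(p, g) for p, g in runs)


# Hypothetical task with two root causes, run three times.
truth = {"deployment/checkout", "resourcequota/payments"}
runs = [
    (truth, truth),                                # exact match
    (truth | {"service/cart"}, truth),             # one extra entity
    ({"deployment/checkout"}, truth),              # one root cause missed
]
score = benchmark_score(runs)
```

In this example the three runs score 1.0, about 0.67, and 0.0, for a mean of about 0.56. The third run shows why the metric is strict: a partially correct diagnosis earns nothing, while over-predicting to guarantee recall lowers precision.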
Cost and Performance Insights
Cost-effectiveness is also a key consideration in the benchmark. For instance, Gemma 4 31B (Reasoning) scored 37% at a cost of $0.14 per task, outperforming Gemini 3.1 Pro Preview, which scored 30% at $2.23 per task. Meanwhile, GLM-5.1 (Reasoning) scored 40% at $1.23 per task, beating Gemini 3.1 Pro Preview by 10 percentage points at a little over half the cost.
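One rough way to compare these figures is score points per dollar of per-task cost. The sketch below uses only the numbers quoted above. The ratio itself is an illustrative assumption rather than a metric reported by the benchmark, and the function name is hypothetical.

```python
# (score in percent, cost in USD per task), as quoted in this article.
results = {
    "Gemma 4 31B (Reasoning)": (37, 0.14),
    "GLM-5.1 (Reasoning)": (40, 1.23),
    "Gemini 3.1 Pro Preview": (30, 2.23),
}


def points_per_dollar(score_pct, cost_usd):
    """Score points obtained per dollar of per-task cost (illustrative ratio)."""
    return score_pct / cost_usd


# Rank models from most to least cost-efficient under this ratio.
ranking = sorted(
    results,
    key=lambda name: points_per_dollar(*results[name]),
    reverse=True,
)

for name in ranking:
    score_pct, cost_usd = results[name]
    ratio = points_per_dollar(score_pct, cost_usd)
    print(f"{name}: {score_pct}% at ${cost_usd:.2f} -> {ratio:.1f} points per dollar")
```

A single ratio hides the trade-off between absolute score and cost, so it is best read alongside the raw numbers: GLM-5.1 has the highest score of the three but is far less cost-efficient than Gemma 4 31B by this measure.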
With ITBench-AA, enterprise IT teams have a structured framework for assessing model performance on realistic incident-response scenarios and for tracking improvements over time.
This article was produced by NeonPulse.today using human and AI-assisted editorial processes, based on publicly available information. Content may be edited for clarity and style.








