IBM and UC Berkeley Unravel the Failures of Enterprise Agents with ITBench and MAST

A collaboration between IBM Research and UC Berkeley has led to significant insights into the failures of agentic systems in IT automation, utilizing the ITBench benchmark and the MAST taxonomy.

IBM Research and UC Berkeley have examined why agentic systems fail in real-world IT automation tasks. Their study covers incident triage, log queries, and Kubernetes actions, and uses the ITBench benchmark together with the Multi-Agent System Failure Taxonomy (MAST) to identify the underlying issues.

Understanding the Challenge

Traditional benchmarks, such as ITBench, often reduce performance to a single success rate, leaving developers in the dark about the reasons behind failures. To address this, the researchers applied MAST, which transforms raw execution traces into structured failure signatures, revealing specific breakdowns and potential fixes.

Key Findings from the Analysis

The team annotated 310 execution traces across three model classes: Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B. Their findings indicate that frontier models like Gemini-3-Flash typically fail at isolated bottlenecks, averaging 2.6 failure modes per trace, while large open models like GPT-OSS-120B experience cascading failures, averaging 5.3 failure modes per trace.
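To make the "failure modes per trace" figure concrete, here is a minimal sketch of how such an average could be computed from annotated traces. The trace IDs, the annotation format, and the descriptive labels other than FM-3.3 are assumptions for illustration only; they are not the study's data or tooling.

```python
from collections import Counter
from statistics import mean

# Hypothetical annotations: each trace ID maps to the set of failure modes
# an annotator assigned to it. These values are invented for illustration
# and are NOT data from the IBM Research / UC Berkeley study.
annotations = {
    "trace-001": {"FM-3.3"},
    "trace-002": {"FM-3.3", "premature-termination"},
    "trace-003": {
        "FM-3.3",
        "premature-termination",
        "unaware-of-termination-conditions",
    },
}


def modes_per_trace(annotated):
    """Average number of distinct failure modes per trace."""
    return mean(len(modes) for modes in annotated.values())


def mode_frequency(annotated):
    """Count how many traces exhibit each failure mode."""
    return Counter(mode for modes in annotated.values() for mode in modes)


if __name__ == "__main__":
    print("average failure modes per trace:", modes_per_trace(annotations))
    for mode, count in mode_frequency(annotations).most_common():
        print(f"{mode}: {count} trace(s)")
```

A low average suggests a few isolated bottlenecks, while a high average is consistent with failures compounding within a single run.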

One critical insight is that the strongest predictor of failure across all models is FM-3.3 (Incorrect Verification), where agents often declare success without validating against ground truth. Kimi-K2, on the other hand, struggles with task completion, showing significant spikes in premature termination and a lack of awareness regarding termination conditions.

Implications for Future Agent Development

The analysis offers valuable takeaways for building more reliable agents. For models like Gemini-3-Flash, it is essential to externalize verification processes and implement strict termination controls. Kimi-K2’s issues highlight the need for better handling of task completion and clarity in ambiguous situations.
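One way to read "externalize verification" and "strict termination controls" is to move both decisions out of the model and into the harness that drives it. The sketch below assumes a hypothetical `step` callable (one agent step that returns whether the agent claims to be done) and a hypothetical `verify` callable (an independent check against the environment); neither name comes from ITBench or MAST.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    finished: bool
    verified: bool
    steps: int
    reason: str


def run_with_external_checks(
    step: Callable[[int], bool],
    verify: Callable[[], bool],
    max_steps: int = 10,
) -> RunResult:
    """Drive an agent loop in which success is decided outside the agent.

    step(i)  -- hypothetical agent step; returns True when the agent
                claims the task is complete.
    verify() -- hypothetical independent check of the environment state.
    """
    for i in range(1, max_steps + 1):
        claims_done = step(i)
        if not claims_done:
            continue
        if verify():
            # The self-reported success was confirmed externally.
            return RunResult(True, True, i, "verified")
        # Claim rejected: keep working instead of trusting the self-report.
    # Hard stop: the harness, not the model, enforces the step budget.
    return RunResult(False, False, max_steps, "step budget exhausted")
```

In this design the agent can never end a run merely by declaring success, and it can never loop past the step budget, which targets the incorrect-verification and termination problems described above.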

Ultimately, this research emphasizes the importance of understanding not just whether an agent succeeds or fails, but the specific reasons behind those outcomes. By adopting MAST, developers can gain actionable insights that lead to more robust and reliable agentic systems in enterprise IT workflows.

This article was produced by NeonPulse.today using human and AI-assisted editorial processes, based on publicly available information. Content may be edited for clarity and style.

LYRA-9