IBM Research and UC Berkeley have studied why agentic systems fail in real-world IT automation tasks. The study covers incident triage, log queries, and Kubernetes actions, and pairs the ITBench benchmark with the Multi-Agent System Failure Taxonomy (MAST) to identify where and how agents break down.
Understanding the Challenge
Benchmarks such as ITBench typically reduce performance to a single success rate, which tells developers that an agent failed but not why. To address this, the researchers applied MAST, which turns raw execution traces into structured failure signatures that point to specific breakdowns and potential fixes.
Key Findings from the Analysis
The team annotated 310 execution traces across three model classes: Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B. Frontier models like Gemini-3-Flash typically fail at isolated bottlenecks, averaging 2.6 failure modes per trace, while large open-weight models like GPT-OSS-120B suffer cascading failures, averaging 5.3 failure modes per trace.
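To make the "failure modes per trace" figure concrete, here is a minimal Python sketch of how such a number could be computed from annotated traces. It is not the researchers' tooling: the `AnnotatedTrace` record, the `failure_profile` function, the model names, and the sample data are all hypothetical, and the failure-mode codes other than FM-3.3 are placeholders.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class AnnotatedTrace:
    """One execution trace with the distinct failure modes an annotator assigned.

    Hypothetical record layout, assumed for illustration only.
    """
    trace_id: str
    model: str
    failure_modes: frozenset[str]


def failure_profile(traces):
    """Group traces by model and summarize their failure signatures."""
    by_model = {}
    for trace in traces:
        by_model.setdefault(trace.model, []).append(trace)

    profile = {}
    for model, group in by_model.items():
        # Count how many traces of this model exhibit each failure mode.
        counts = Counter(code for t in group for code in t.failure_modes)
        profile[model] = {
            "traces": len(group),
            "mean_modes_per_trace": mean(len(t.failure_modes) for t in group),
            "mode_frequency": {code: n / len(group) for code, n in counts.items()},
        }
    return profile


# Made-up sample annotations; these are not figures from the study.
sample = [
    AnnotatedTrace("t1", "model-a", frozenset({"FM-3.3"})),
    AnnotatedTrace("t2", "model-a", frozenset({"FM-3.3", "FM-X"})),
    AnnotatedTrace("t3", "model-b", frozenset({"FM-3.3", "FM-X", "FM-Y"})),
]
profile = failure_profile(sample)
```

In this toy data, `model-a` shows an isolated pattern (few modes per trace) and `model-b` a more cascading one, which is the kind of contrast the per-trace average is meant to expose.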
The strongest predictor of failure across all models is FM-3.3 (Incorrect Verification): agents declare success without validating the result against ground truth. Kimi-K2, by contrast, struggles most with task completion, showing marked spikes in premature termination and in unawareness of termination conditions.
Implications for Future Agent Development
The analysis offers practical takeaways for building more reliable agents. For models like Gemini-3-Flash, the priority is to externalize verification, so that success is checked outside the agent rather than self-reported, and to enforce strict termination controls. Kimi-K2’s issues point to the need for explicit completion criteria and clearer handling of ambiguous situations.
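One way to picture externalized verification with a strict termination control is the Python sketch below. It is an illustration under stated assumptions, not the design proposed in the study: `run_with_external_checks`, `RunResult`, the toy agent, and the `pod_healthy` flag are all hypothetical names. The idea is that the agent's own "done" claim is never trusted; an independent check decides success, and a hard step budget bounds the run.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    status: str           # "verified" or "budget_exhausted"
    steps: int            # number of agent steps executed
    rejected_claims: int  # "done" claims that the external check refused


def run_with_external_checks(
    agent_step: Callable[[int], bool],
    verify: Callable[[], bool],
    max_steps: int,
) -> RunResult:
    """Run an agent loop where success is decided outside the agent.

    agent_step(step) performs one action and returns True if the agent
    claims the task is finished. verify() inspects the real system state.
    """
    rejected = 0
    for step in range(1, max_steps + 1):
        claims_done = agent_step(step)
        if not claims_done:
            continue
        if verify():
            return RunResult("verified", step, rejected)
        # The agent declared success, but the external check disagrees.
        rejected += 1
    # Strict termination control: stop once the step budget is spent.
    return RunResult("budget_exhausted", max_steps, rejected)


# Toy scenario: the agent claims success from step 2 onward,
# but the (simulated) system is only actually fixed at step 4.
state = {"pod_healthy": False}


def toy_agent(step: int) -> bool:
    if step == 4:
        state["pod_healthy"] = True
    return step >= 2


def toy_verify() -> bool:
    return state["pod_healthy"]


result = run_with_external_checks(toy_agent, toy_verify, max_steps=6)
```

Here the two premature claims at steps 2 and 3 are rejected, and the run is accepted only at step 4, once the external check passes. With a smaller budget the loop would stop and report `budget_exhausted` instead of accepting an unverified claim.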
Ultimately, this research emphasizes the importance of understanding not just whether an agent succeeds or fails, but the specific reasons behind those outcomes. By adopting MAST, developers can gain actionable insights that lead to more robust and reliable agentic systems in enterprise IT workflows.
This article was produced by NeonPulse.today using human and AI-assisted editorial processes, based on publicly available information. Content may be edited for clarity and style.







