The landscape of AI evaluation is shifting as costs soar, presenting new challenges for researchers and developers. A recent analysis from the Holistic Agent Leaderboard (HAL) illustrates the scale of the problem: evaluating AI agents has become an expensive undertaking.
Costly Evaluations
HAL reported spending approximately $40,000 to conduct 21,730 agent rollouts across 9 models and 9 benchmarks. In one striking example, a single frontier-model run on the GAIA benchmark can cost $2,829 before caching. Exgentic’s evaluation of agent configurations revealed a staggering 33× cost spread for identical tasks, pinpointing scaffold choice as a primary cost driver.
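The arithmetic behind those headline figures is straightforward. The short sketch below derives the average per-rollout cost from HAL’s reported totals; the per-rollout average and the cheapest-configuration price are our own illustrative estimates, not numbers reported by HAL.

```python
# Back-of-the-envelope check on HAL's reported totals.
total_spend_usd = 40_000   # approximate spend reported by HAL
rollouts = 21_730          # agent rollouts across 9 models and 9 benchmarks

avg_cost_per_rollout = total_spend_usd / rollouts
print(f"Average cost per rollout: ${avg_cost_per_rollout:.2f}")  # ~$1.84

# A 33x spread for identical tasks means scaffold choice alone can move
# the price of the same work by more than an order of magnitude.
cheapest_config_usd = 1.00  # hypothetical cheapest configuration
print(f"33x spread: ${cheapest_config_usd:.2f} vs ${cheapest_config_usd * 33:.2f}")
```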
Static vs. Dynamic Benchmarks
The cost challenges predate the rise of agent evaluations. The HELM framework, released by Stanford’s CRFM in 2022, incurred substantial API costs, ranging from $85 to evaluate OpenAI’s code-cushman-001 up to $10,926 for AI21’s J1-Jumbo (178B). Evaluating a single model like Granite-13B through HELM can consume up to 1,000 GPU hours, and across HELM’s 30 models and 42 scenarios, total costs approached $100,000.
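The spread within HELM alone is instructive: a quick calculation from the per-model figures above (the ratio is our own derived estimate) shows the most expensive model cost over a hundred times more to evaluate than the cheapest.

```python
# Ratio of HELM's reported per-model API costs (figures from the article).
cheapest_model_usd = 85       # OpenAI code-cushman-001
priciest_model_usd = 10_926   # AI21 J1-Jumbo (178B)

spread = priciest_model_usd / cheapest_model_usd
print(f"Most expensive model: {spread:.0f}x the cheapest")  # ~129x
```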
Complexities of Agent Evaluations
Agent evaluations are inherently more complex and costly than static benchmarks. HAL’s standardized evaluations reveal wide cost variance, with some benchmarks exceeding $1,000 per run. Cost is driven by the choice of model, scaffold, and token budget, and the resulting spread is enormous: the cost of a single benchmark run can vary by four orders of magnitude across configurations, and higher spending does not reliably buy better accuracy.
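To see how a four-order-of-magnitude spread can emerge, consider a minimal cost model in which spend is simply tokens consumed times per-token price. Every number in the sketch below, prices, token counts, and step counts alike, is hypothetical, chosen to show the multiplicative structure rather than any specific HAL configuration.

```python
# Minimal sketch of why agent-run costs vary so widely. All prices and
# token counts below are hypothetical, for illustration only.
def run_cost(price_per_mtok: float, tokens_per_step: int, steps: int) -> float:
    """Rough cost of one agent run: total tokens consumed x per-token price."""
    total_tokens = tokens_per_step * steps
    return total_tokens / 1_000_000 * price_per_mtok

# Lean config: small model, terse scaffold, short horizon.
cheap = run_cost(price_per_mtok=0.15, tokens_per_step=2_000, steps=10)

# Heavy config: frontier model, verbose scaffold, long horizon.
pricey = run_cost(price_per_mtok=15.00, tokens_per_step=50_000, steps=100)

# The three factors multiply, so modest per-factor gaps compound into a
# spread of roughly four orders of magnitude.
print(f"${cheap:.3f} vs ${pricey:,.2f} ({pricey / cheap:,.0f}x)")
```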
Training and Evaluation Costs
Some benchmarks, like The Well, illustrate the extreme costs that can accompany training and evaluation. Evaluating a new architecture can consume about 960 H100-hours, roughly $2,400 at typical cloud rates. More strikingly, evaluation compute can exceed training compute by two orders of magnitude, upending the traditional assumption that training dominates the compute budget.
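That dollar figure lines up with standard cloud pricing; the quick check below backs out the implied hourly H100 rate (our inference from the article’s totals, not a quoted price).

```python
# Sanity check on The Well's evaluation figures as reported above.
h100_hours = 960
reported_cost_usd = 2_400

implied_hourly_rate = reported_cost_usd / h100_hours
print(f"Implied H100 rate: ${implied_hourly_rate:.2f}/hour")  # $2.50, typical cloud pricing
```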
As the field progresses, the implications of these rising costs are profound: researchers and developers will need new strategies to manage evaluation expenses while preserving the integrity and reliability of AI assessments.
This article was produced by NeonPulse.today using human and AI-assisted editorial processes, based on publicly available information. Content may be edited for clarity and style.