AI datacenters are increasingly described as factories, a term that captures the essence of their operation: power goes in, tokens come out. But this framing oversimplifies the economics of AI inference at scale.
Token Generation and Revenue Dynamics
The fundamental principle is straightforward: the more tokens generated per unit of power, the better the financial outcome. As Nvidia CEO Jensen Huang noted, the number of inference tokens produced per watt directly correlates with the revenues of cloud service providers (CSPs). Achieving a balance where token generation covers operational costs while yielding profit is crucial.
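The relationship between tokens per watt and revenue can be made concrete with a back-of-envelope calculation. All figures below (tokens per second per megawatt, token pricing, power cost) are illustrative assumptions, not measured values from any provider.

```python
# Back-of-envelope sketch: how tokens-per-watt translates into revenue.
# Every number here is a hypothetical assumption for illustration.

def revenue_per_mw_hour(tokens_per_sec_per_mw: float,
                        price_per_million_tokens: float) -> float:
    """Gross revenue generated by one megawatt of capacity in one hour."""
    tokens_per_hour = tokens_per_sec_per_mw * 3600
    return tokens_per_hour / 1_000_000 * price_per_million_tokens

# Hypothetical operating point: 1M tokens/s per MW at $0.50 per million tokens.
revenue = revenue_per_mw_hour(1_000_000, 0.50)   # $1800 per MW-hour

# Hypothetical all-in power cost of $80 per MWh.
power_cost = 80.0
margin = revenue - power_cost

print(f"revenue/MWh: ${revenue:.2f}, margin: ${margin:.2f}")
```

Under these assumed numbers, power is a small fraction of gross revenue; the real constraint is sustaining that token rate while meeting latency targets, as the next sections discuss.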
Challenges in Scaling Inference
However, scaling inference isn’t merely about increasing the number of GPUs or tokens. The quality of tokens matters significantly. According to Dave Salvator, director of accelerated computing products at Nvidia, maximizing throughput can compromise user experience. The focus shifts to generating tokens efficiently while meeting service-level agreements (SLAs) and application-specific requirements.
The SemiAnalysis InferenceX benchmark illustrates this complexity, showing that while total token throughput can exceed 3.5 million tokens per second per megawatt, this often comes at the expense of user interactivity. The challenge lies in optimizing for both throughput and user experience.
Software’s Role in Performance
Goodput, the portion of throughput that is actually delivered within latency targets, is shaped by the interplay of hardware, software, and the model being served. The right software stack can significantly enhance the efficiency of the same hardware. For instance, Nvidia's TensorRT-LLM has shown superior performance compared to alternatives such as SGLang when serving specific models.
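The distinction between raw throughput and goodput can be sketched in a few lines: only tokens from requests that meet the latency SLA count toward goodput. The request data and SLA threshold below are invented for illustration.

```python
# Minimal sketch of goodput vs raw throughput: tokens from requests that
# miss the latency SLA are excluded. All request data here is invented.

from dataclasses import dataclass

@dataclass
class Request:
    tokens: int         # tokens generated for this request
    latency_ms: float   # end-to-end latency (or any SLA metric)

def goodput(requests: list[Request], sla_ms: float, window_s: float) -> float:
    """Tokens per second, counting only requests served within the SLA."""
    good_tokens = sum(r.tokens for r in requests if r.latency_ms <= sla_ms)
    return good_tokens / window_s

reqs = [Request(500, 800.0), Request(500, 2500.0), Request(400, 1200.0)]
raw = sum(r.tokens for r in reqs) / 10.0             # 140 tokens/s over 10 s
good = goodput(reqs, sla_ms=1500.0, window_s=10.0)   # 90 tokens/s: one request missed the SLA
```

The gap between `raw` and `good` is exactly the tradeoff the InferenceX results highlight: pushing raw throughput higher is counterproductive if more requests blow past the SLA.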
Disaggregated Compute and Rack-Scale Architectures
Recent advancements in disaggregated compute frameworks, such as Nvidia's Dynamo, allow for more efficient distribution of workloads across GPUs. This approach runs the different phases of inference, prefill (prompt processing) and decode (token-by-token generation), on separate pools of GPUs, each tuned to that phase's demands.
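The core idea of disaggregated serving can be sketched as a router that sends each phase of a request to its specialized pool. The pool names and round-robin routing rule below are illustrative assumptions, not Dynamo's actual API; a real system would also migrate the KV cache between pools.

```python
# Illustrative sketch of disaggregated serving: prefill and decode phases
# land on separate GPU pools. Pool names and the routing rule are
# assumptions for illustration, not Dynamo's actual API.

from collections import deque

prefill_pool = deque(["gpu-p0", "gpu-p1"])            # compute-bound phase
decode_pool = deque(["gpu-d0", "gpu-d1", "gpu-d2"])   # memory-bandwidth-bound phase

def assign(phase: str) -> str:
    """Round-robin a request phase onto its specialized pool."""
    pool = prefill_pool if phase == "prefill" else decode_pool
    gpu = pool.popleft()
    pool.append(gpu)  # rotate the pool for round-robin
    return gpu

# A request first prefills its prompt on one pool, then streams decode
# steps on the other (KV cache transfer omitted in this sketch).
first = assign("prefill")
steps = [assign("decode") for _ in range(3)]
```

Separating the phases lets operators size each pool independently: prefill benefits from raw compute, while decode is dominated by memory bandwidth.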
The shift towards rack-scale architectures, exemplified by Nvidia’s GB200 and GB300 systems, aims to improve efficiency further. These systems connect multiple GPUs with high-speed fabrics, reducing latency and increasing throughput. While Nvidia currently leads in this space, AMD is expected to introduce competitive rack-scale systems later in 2026.
As the AI landscape evolves, the interplay of hardware capabilities and software optimizations will continue to shape the economics of AI inference, making it a dynamic area for CSPs and technology providers alike.
This article was produced by NeonPulse.today using human and AI-assisted editorial processes, based on publicly available information. Content may be edited for clarity and style.