Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute
- Felipe Oviedo,
- Fiodar Kazhamiaka,
- Esha Choukse,
- Allen Kim,
- Amy Luers,
- Melanie Nakagawa,
- Ricardo Bianchini,
- Juan M. Lavista Ferres
As AI inference scales to billions of queries, and as emerging reasoning and agentic workflows substantially increase token demand, reliable estimates of per-query energy use are increasingly important for capacity planning, emissions accounting, and efficiency prioritization. Yet public estimates are often inconsistent and systematically overstate energy use, because they extrapolate from limited benchmarks and fail to reflect the efficiency gains achievable in at-scale deployments. In this perspective, we introduce a bottom-up methodology to estimate the per-query energy of large-scale LLM systems from token throughput. For models running on an H100 node under realistic workload, GPU utilization, and PUE constraints, we estimate a median energy per query of 0.34 Wh (IQR: 0.18–0.67) for frontier-scale models (>200 billion parameters). These results are consistent with measurements from production-scale configurations and show that non-production estimates and assumptions can overstate energy use by 4–20×. Extending to test-time scaling scenarios with 15× more tokens than a typical query, the median energy rises 13-fold to 4.32 Wh, indicating that targeting efficiency in this regime will likely deliver the largest fleetwide savings. We then examine how to target efficiency interventions to maximize impact. By quantifying achievable efficiency gains at the model, serving-platform, and hardware levels, we find that individual levers yield median reductions of 1.5–3.5× in energy per query, while combined advances can plausibly deliver 8–20× reductions. To illustrate the system-level impact, we estimate the baseline daily energy use of a deployment serving 1 billion queries at 0.8 GWh/day. If 10% of queries are long, demand could grow to 1.8 GWh/day; with targeted efficiency interventions, it falls to 0.9 GWh/day, similar to the energy footprint of web search at that scale. This echoes how data centers historically tempered energy growth through efficiency gains during the internet and cloud build-out.
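To make the token-throughput accounting concrete, the sketch below works through the bottom-up arithmetic implied by the abstract: a query's energy is the node time it consumes (output tokens divided by node throughput) multiplied by node power and PUE. All numeric inputs are illustrative placeholders, not the paper's measured parameters; they are round numbers chosen only so the short-query case lands near the reported ~0.34 Wh median.

```python
# Illustrative sketch of the bottom-up per-query energy estimate.
# Every numeric input below is a placeholder assumption, not a value
# taken from the paper's measurements.

def energy_per_query_wh(tokens_per_query: float,
                        node_throughput_tok_s: float,
                        node_power_w: float,
                        pue: float) -> float:
    """Energy (Wh) attributed to one query on a busy serving node."""
    node_seconds = tokens_per_query / node_throughput_tok_s  # time on node
    return node_seconds * node_power_w * pue / 3600.0        # W*s -> Wh

# Placeholder inputs: ~1,000 tokens per typical query, ~10,000 tok/s
# aggregate node throughput, ~10.2 kW for an 8x H100 node, PUE 1.2.
e_short = energy_per_query_wh(1_000, 10_000, 10_200, 1.2)  # ~0.34 Wh

# Naive linear scaling to a 15x-token test-time-compute query; the paper
# reports a ~13x (not 15x) energy increase, i.e. real scaling is somewhat
# sub-linear, which this simple model does not capture.
e_long = 15 * e_short  # ~5.1 Wh

# Fleet-level aggregation: 1e9 queries/day with a 10% long-query mix.
# The paper's 0.8 / 1.8 / 0.9 GWh/day headline figures come from its full
# query distribution, so this point estimate will not match them exactly.
queries_per_day = 1e9
fleet_gwh_day = (0.9 * e_short + 0.1 * e_long) * queries_per_day / 1e9

print(f"short query ~{e_short:.2f} Wh, long query ~{e_long:.2f} Wh, "
      f"mixed fleet ~{fleet_gwh_day:.2f} GWh/day")
```

Under these assumptions the same four inputs (tokens, throughput, power, PUE) drive both the per-query medians and the fleet totals, which is why the efficiency levers discussed above compound multiplicatively across model, serving platform, and hardware.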