Energy use of AI inference - estimates and efficiency opportunities

New estimates put frontier-model inference at 0.31 Wh per query, rising to 3.91 Wh for long reasoning and 0.73 GWh/day at 1B queries, but agentic workloads complicate the picture.

May 18, 2026

Over the last year, OpenAI, Mistral, and Google have each published numbers for AI energy use or emissions. The problem is that they are hard to compare.

Sam Altman cited 0.34 Wh per ChatGPT query. Google reported 0.24 Wh per typical Gemini Apps text prompt. Mistral published lifecycle emissions for a short chat session. These are useful disclosures, but they use different boundaries, workloads, and units.

Microsoft’s new paper, Oviedo et al. (2026), takes a different approach. Rather than adding another headline number, it proposes a bottom-up framework for estimating inference energy under production-like conditions: optimized serving, batching, concurrency, node power, token throughput, and data-center overhead.

The recurring problem is boundaries: what is included, what is excluded, what workload is being measured, and whether the result is an average, median, or representative case.

The paper’s result is a more nuanced answer. Standard frontier-model inference uses less energy than many public estimates suggest. Its key critique is that many estimates ignore optimizations already in place, such as batching, concurrency, steady-state serving, and optimized inference engines, and that omission can make small-scale measurements misleading.

Long reasoning and agentic workloads, however, can change the picture quickly. Let’s dig in.

AI model energy consumption: simple query vs long inference

The median energy consumption for AI models over 200B parameters - such as GPT-4o or Claude 3.5 Sonnet - is 0.31 Wh per query.

The figure comes from the paper’s baseline for models above 200B parameters: DeepSeek-R1 671B, Llama 3.1 405B, and Llama 3.1 Nemotron Ultra 253B. The broader model comparison also includes Mixtral 8x22B and Llama 3.1 70B, which show much lower per-query energy because they are smaller or more efficient models.

The important caveat is output length. The paper’s standard query assumes a median of 300 output tokens. Its test-time scaling scenario assumes a median of 5,000 output tokens. Under that longer reasoning workload, median energy rises from 0.31 Wh to 3.91 Wh per query, roughly a 13x increase.
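To make the scaling concrete, here is a minimal sketch of the arithmetic using only the four medians quoted above. The per-token figure is my own back-of-the-envelope ratio of medians, not a value reported in the paper, and the function names are illustrative.

```python
# Medians quoted above from the paper.
STANDARD_WH = 0.31         # standard query, ~300 median output tokens
REASONING_WH = 3.91        # test-time scaling query, ~5,000 median output tokens
STANDARD_TOKENS = 300
REASONING_TOKENS = 5_000


def ratio(a: float, b: float) -> float:
    """How many times larger a is than b."""
    return a / b


def mwh_per_output_token(wh_per_query: float, output_tokens: int) -> float:
    """Rough milliwatt-hours per output token (a ratio of medians, not a measurement)."""
    return wh_per_query / output_tokens * 1_000


energy_ratio = ratio(REASONING_WH, STANDARD_WH)
token_ratio = ratio(REASONING_TOKENS, STANDARD_TOKENS)
standard_per_token = mwh_per_output_token(STANDARD_WH, STANDARD_TOKENS)
reasoning_per_token = mwh_per_output_token(REASONING_WH, REASONING_TOKENS)

print(f"energy: {energy_ratio:.1f}x, output tokens: {token_ratio:.1f}x")
print(f"mWh per output token: {standard_per_token:.2f} vs {reasoning_per_token:.2f}")
```

Energy grows by about 12.6x while output tokens grow by about 16.7x, so energy per output token is somewhat lower in the long-reasoning case. One plausible reading is that fixed per-query costs are spread over more tokens, but that interpretation is mine rather than the paper’s.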

Energy per query for several LLMs across 10,000 queries. A = Standard query, with 300 median output tokens. B = Test-time scaling query, with 5,000 median output tokens. Source: Oviedo et al. (2026).

This is the same shift I described in my 2026 data center energy update: chat made inference visible, reasoning increased tokens per task, and agents turn one visible request into a chain of model calls, tool calls, retrieval, and verification. A user may experience one “request,” but behind the scenes the system may also run code searches and long reasoning traces. Per-query numbers can hide that expansion.

There are also important boundaries. The paper focuses on autoregressive text generation, not image or video generation. It also models single-node inference under high-utilization production conditions, and the authors note that very long-context workloads, agentic coding, multi-document summarization, and retrieval-heavy workflows can have extra overhead from prefill, orchestration, and tool use. The paper is useful for correcting bad per-query estimates, but it is not a universal model for every AI workload.

Efficiency opportunities

The paper groups near-term efficiency opportunities into three layers: model improvements, serving and workload management, and hardware/data center improvements. The largest improvements are likely to come from doing less work, not just from better GPUs: using smaller models, generating fewer unnecessary tokens, routing queries more intelligently, and avoiding expensive reasoning when it is not needed.

Algorithm and architecture improvements

The biggest lever is the model itself. The paper estimates that algorithm and architecture improvements could reduce energy use by 1.5-10x. This includes distillation, where larger models are compressed into smaller specialist models; quantization, which allows more efficient inference and higher throughput; mixture-of-experts architectures, where only part of the model is active for each token; and more efficient reasoning architectures such as Llama-Nemotron or MiniMax-M1.

“Frontier model” does not have to mean “largest possible dense model for every request.” Smaller distilled models, specialist models, and efficient reasoning models can provide much of the capability at lower energy cost, especially for narrow or repeated tasks.

The serving system

The second layer is the serving system. This is where the paper’s production focus matters most.

Inference is not just a model running on a GPU - it is a serving stack that manages batching, concurrency, prefill, decoding, caching, routing, and latency targets. The paper estimates 1.5-5x efficiency improvements from serving and workload management.

Disaggregated serving separates prefill from decoding so each stage can be optimized independently. KV-cache management, using techniques such as LServe, KVQuant, and LMCache, improves long-context and long-generation workloads. Speculative decoding can increase throughput by drafting tokens with a smaller model before verifying them with a larger one. Routing can send simple queries to smaller models and reserve expensive reasoning models for tasks that need them.
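The routing point can be illustrated with a small expected-value sketch. It assumes, purely for illustration, that the cheap path costs the 0.31 Wh standard-query median and the expensive path costs the 3.91 Wh long-reasoning median; real routers choose between different models, and the function name is hypothetical.

```python
# Assumption: the two medians quoted earlier stand in for the two routing paths.
CHEAP_PATH_WH = 0.31       # standard query median
REASONING_PATH_WH = 3.91   # long-reasoning (test-time scaling) median


def expected_wh_per_query(reasoning_share: float) -> float:
    """Average energy per query when a given share is routed to the reasoning path."""
    if not 0.0 <= reasoning_share <= 1.0:
        raise ValueError("reasoning_share must be between 0 and 1")
    return (1.0 - reasoning_share) * CHEAP_PATH_WH + reasoning_share * REASONING_PATH_WH


for share in (0.0, 0.1, 0.5, 1.0):
    print(f"{share:>4.0%} routed to reasoning: {expected_wh_per_query(share):.2f} Wh/query")
```

Under these assumptions, routing 10% of traffic to the reasoning path already roughly doubles the average (about 0.67 Wh versus 0.31 Wh), which is why keeping simple queries off the expensive path matters. This is a per-query illustration built on medians, not the paper’s fleet-level model.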

Google’s Gemini disclosure made the same point from measurement: only 58% of energy went to TPUs/GPUs, with the rest in CPU/memory, redundancy, and data-center overhead.
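As a rough sketch of what that 58% share implies, the snippet below converts it into a multiplier from accelerator-only energy to total energy. Applying the share to the 0.24 Wh Gemini figure quoted earlier is my assumption about how the two numbers relate; the names are illustrative.

```python
ACCELERATOR_SHARE = 0.58      # share of energy going to TPUs/GPUs, as quoted above
GEMINI_TOTAL_WH = 0.24        # per typical text prompt, as quoted earlier


def total_from_accelerator_wh(accelerator_wh: float,
                              accelerator_share: float = ACCELERATOR_SHARE) -> float:
    """Scale an accelerator-only measurement up to a full-system estimate."""
    return accelerator_wh / accelerator_share


# How much an accelerator-only number understates the total.
overhead_multiplier = total_from_accelerator_wh(1.0)

# Assumption: the 58% share applies to the 0.24 Wh figure.
accelerator_wh = GEMINI_TOTAL_WH * ACCELERATOR_SHARE
everything_else_wh = GEMINI_TOTAL_WH - accelerator_wh

print(f"total is about {overhead_multiplier:.2f}x the accelerator-only energy")
print(f"accelerators: {accelerator_wh:.2f} Wh, everything else: {everything_else_wh:.2f} Wh")
```

In other words, an estimate that measures only the GPU or TPU would need to be scaled up by roughly 1.7x to match this boundary.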

This is important as we move into the era of agentic systems. A user sees one request, but the system may execute multiple model calls, retrieval steps, tool calls, code searches, and reasoning traces. The energy cost is therefore not just the visible answer - it is the whole workflow. Routing, caching, and reasoning controls become energy features, not just cost or latency optimizations.

Hardware and data centers

The third layer is hardware and data centers. The paper estimates 1.5-2.5x improvements from this category, excluding the more speculative gains from custom inference hardware. Blackwell-class GPUs improve tokens per second compared with H100s, and custom ASICs or FPGAs may offer large gains for specific inference workloads.

Power management, higher utilization, and better cooling can also help, but this is the least surprising category: hyperscale AI is already deployed in highly optimized data centers, so the more interesting gains are likely to come from model and serving choices rather than from PUE improvements alone.

These opportunities can stack. The paper gives a conservative example of roughly 1.5x from hardware/data-center improvements, 2x from serving/workflow optimization, and 2.5-3x from model improvements, for a combined 8-9x reduction. A more optimistic stack - 2x hardware, 2.5x serving, and 4x model improvement - gives about 20x.
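The stacking here is multiplicative, which a few lines make explicit. This is a minimal sketch that assumes the three layers are independent, so their factors simply multiply; the function name is illustrative.

```python
from math import prod


def combined_reduction(*factors: float) -> float:
    """Combined reduction factor when independent improvements multiply."""
    return prod(factors)


# (hardware/data center, serving/workflow, model)
conservative_low = combined_reduction(1.5, 2.0, 2.5)
conservative_high = combined_reduction(1.5, 2.0, 3.0)
optimistic = combined_reduction(2.0, 2.5, 4.0)

print(f"conservative: {conservative_low:.1f}x to {conservative_high:.1f}x")
print(f"optimistic:   {optimistic:.1f}x")
```

The conservative stack works out to 7.5-9x, which lines up with the roughly 8-9x quoted above once rounded, and the optimistic stack gives exactly 20x. If the layers interact (for example, a smaller model leaving less to gain from serving optimizations), the real combined figure would be lower than the simple product.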

The caveat is that these are not automatic gains. They depend on maintaining model quality, deploying the optimizations in production, and avoiding rebound effects where cheaper tokens simply lead to many more tokens.

Real production deployments

At 1 billion queries per day, the paper estimates 0.73 GWh/day for standard inference. If 10% of queries are long test-time scaling queries, energy demand rises to about 1.7 GWh/day - more than double the standard-query baseline. With conservative efficiency improvements, the paper’s model brings that mixed workload back down to about 0.8 GWh/day.
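To put those fleet numbers in more familiar units, here is a small sketch that converts them to an implied average energy per query and to average continuous power. It takes the GWh/day figures above as given; the conversions are plain unit arithmetic and the function names are illustrative.

```python
QUERIES_PER_DAY = 1_000_000_000

# Fleet-level figures quoted above, in GWh per day.
SCENARIOS_GWH_PER_DAY = {
    "standard only": 0.73,
    "10% long reasoning": 1.7,
    "10% long reasoning, conservative efficiency": 0.8,
}


def implied_wh_per_query(gwh_per_day: float, queries_per_day: int = QUERIES_PER_DAY) -> float:
    """Average Wh per query implied by a daily fleet total (1 GWh = 1e9 Wh)."""
    return gwh_per_day * 1e9 / queries_per_day


def average_power_mw(gwh_per_day: float) -> float:
    """Average continuous power in MW (GWh/day -> MWh/day -> MW over 24 hours)."""
    return gwh_per_day * 1_000 / 24


for name, gwh in SCENARIOS_GWH_PER_DAY.items():
    print(f"{name}: {implied_wh_per_query(gwh):.2f} Wh/query, "
          f"{average_power_mw(gwh):.1f} MW average")

mixed_vs_standard = SCENARIOS_GWH_PER_DAY["10% long reasoning"] / SCENARIOS_GWH_PER_DAY["standard only"]
print(f"mixed workload is {mixed_vs_standard:.2f}x the standard baseline")
```

That is roughly 30 MW of average draw for the standard workload and about 71 MW for the mixed one. Note that 0.73 Wh per query is above the 0.31 Wh median quoted earlier; the sketch takes the fleet figures as given and does not try to reconcile the two.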

Lower energy per query helps, but it does not remove the infrastructure problem. The binding constraint is often whether power can be delivered to the right place at the right time.

As more agentic systems come online and “thinking” mode becomes the default, the share of long reasoning queries is likely to rise well beyond that 10% scenario. These systems also make more use of CPU-driven compute for tool calls, retrieval, code execution, coordination, and orchestration, so the energy question cannot focus only on GPUs.

This is why “energy per query” is becoming a weak unit of analysis. It was useful when the product was a chatbot returning a short answer. It is less useful when the product is an agent that plans, searches, writes, checks, retries, and calls tools. A short chatbot answer, a long reasoning trace, and an agentic coding session are all “AI queries” in the product sense, but they have very different energy profiles.

The useful metric is shifting from energy per prompt to energy per completed task. That means accounting for model routing, tool calls, context length, retries, cache behavior, orchestration overhead, and the amount of reasoning generated. This is where the next round of energy estimates will need to improve.
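A minimal sketch of what per-task accounting could look like follows. The workflow is hypothetical: the model-call energies reuse the paper’s medians quoted earlier, while the step counts and the 0.05 Wh tool-call overhead are placeholders I made up for illustration, not measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    wh: float        # energy for one execution of this step
    count: int = 1   # number of executions, including retries


def task_energy_wh(steps: list[Step]) -> float:
    """Total energy for one completed task: sum over every step and repetition."""
    return sum(step.wh * step.count for step in steps)


# Hypothetical agentic task; counts and tool overhead are placeholder assumptions.
agent_task = [
    Step("planning (long reasoning call)", wh=3.91),
    Step("standard model call", wh=0.31, count=6),
    Step("tool call / retrieval overhead", wh=0.05, count=8),
]

total_wh = task_energy_wh(agent_task)
single_query_wh = 0.31

print(f"energy per completed task: {total_wh:.2f} Wh")
print(f"equivalent to {total_wh / single_query_wh:.1f} standard queries")
```

With these placeholder inputs, one visible request costs about as much as 20 standard queries. The specific number is not meaningful; the point is that the unit of accounting is the whole list of steps, not the single prompt the user typed.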

/dev/sustainability
