Glossary / Deployment & Ops

AI Throughput

The capacity math behind 'can our system actually handle a million users.' Throughput is what breaks when latency looks fine in a demo.

The Technical Definition

AI throughput is how much work your system can complete per unit of time. The two units that matter: tokens per second per GPU (for self-hosted models) and requests per second per cluster (for everything). A single high-end GPU running a 70-billion-parameter model produces somewhere between 30 and 200 tokens per second for a single user, but can serve 5 to 20 concurrent users at lower per-user speeds through batching.

Throughput trades against latency. Batching multiple requests onto the same GPU pass raises throughput but also raises per-request latency. Running each request alone minimizes latency but can leave as much as 90% of GPU capacity idle. Every production AI system is making this tradeoff, whether the team building it knows it or not.
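The tradeoff is easier to see with numbers. Below is a minimal sketch of a toy model, not a benchmark: it assumes each decoding step emits one token for every request in the batch, and that a step gets slightly slower as the batch grows. The function name and both timing constants are invented for illustration.

```python
def batch_tradeoff(batch_size, base_step_ms=20.0, per_seq_ms=1.0):
    """Toy model: one decoding step emits one token per request in the batch.

    base_step_ms and per_seq_ms are made-up constants, not measurements.
    """
    step_ms = base_step_ms + per_seq_ms * batch_size  # bigger batch, slower step
    per_user_tps = 1000.0 / step_ms                   # tokens/sec each user sees
    total_tps = batch_size * per_user_tps             # tokens/sec the GPU delivers
    return per_user_tps, total_tps


for batch_size in (1, 4, 16):
    per_user, total = batch_tradeoff(batch_size)
    print(f"batch={batch_size:>2}  per-user={per_user:5.1f} tok/s  total={total:6.1f} tok/s")
```

Under these assumed constants, each user gets slower output as the batch grows while the GPU as a whole delivers more tokens per second. Real curves depend on the model, the hardware, and the serving stack, so measure before trusting any shape like this.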

What This Actually Means for Your Business

The pilot worked. Twenty users, response time under two seconds, everyone happy. You roll it out to 5,000 employees on Monday. By 9:15 AM the system is timing out, response times have climbed to 30 seconds, and your SaaS provider is sending you a rate-limit notice.

That’s a throughput problem, not a model problem. The model didn’t get worse. The infrastructure ran out of capacity to serve concurrent requests, and every additional user joined a queue that’s growing faster than it’s draining.
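The "growing faster than it's draining" dynamic can be sketched in a few lines. This is a deliberately simple minute-by-minute model with constant arrival and service rates; the function name and the example numbers are hypothetical.

```python
def queue_depth_over_time(arrivals_per_min, served_per_min, minutes):
    """Return the backlog at the end of each minute for constant rates."""
    depth = 0
    history = []
    for _ in range(minutes):
        # The backlog grows by the gap between demand and capacity, never below zero.
        depth = max(0, depth + arrivals_per_min - served_per_min)
        history.append(depth)
    return history


# Hypothetical Monday morning: 100 requests/min arriving, capacity for 84/min.
backlog = queue_depth_over_time(arrivals_per_min=100, served_per_min=84, minutes=15)
print(backlog[-1])  # requests still waiting after 15 minutes
```

A shortfall of 16 requests per minute looks small, but it compounds: the backlog never clears while demand stays above capacity, and every new request waits behind everything already queued.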

Throughput is the dimension CEOs underestimate because it doesn’t show up in demos. A demo serves one person at a time, with effectively unlimited GPU capacity behind it, and looks great. Production serves 5,000 people at once, with finite capacity, and the math gets ugly fast. If your model produces 100 tokens per second per GPU and your average response is 500 tokens, each response takes 5 seconds, so that GPU can serve roughly 12 requests per minute, or 720 per hour. If 5,000 users each want one response in the same hour, that is 5,000 / 720 ≈ 6.9, so you need 7 GPUs minimum, assuming requests arrive perfectly evenly, and that’s before any safety margin for spikes.
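The arithmetic above fits in one small function. This is a back-of-the-envelope sketch that assumes requests arrive perfectly evenly across the hour; the function name and the `headroom` parameter are illustrative additions, not part of any tool.

```python
import math


def gpus_needed(requests_per_hour, tokens_per_response, tokens_per_sec_per_gpu, headroom=0.0):
    """Minimum GPU count for an hourly demand, assuming perfectly even arrivals.

    headroom is an assumed safety margin, e.g. 0.5 plans for 50% above demand.
    """
    seconds_per_response = tokens_per_response / tokens_per_sec_per_gpu
    responses_per_gpu_hour = 3600 / seconds_per_response
    return math.ceil(requests_per_hour * (1 + headroom) / responses_per_gpu_hour)


print(gpus_needed(5000, 500, 100))                # the example from the text
print(gpus_needed(5000, 500, 100, headroom=0.5))  # same demand with 50% margin
```

Real traffic is bursty, so the no-headroom figure is a floor, not a plan.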

The hidden multiplier is request shape. Long prompts and long outputs don’t just cost more; they consume more throughput. A workflow that pulls 4,000 tokens of retrieval context into every request spends roughly 5x as much GPU time on prompt processing as one that pulls 800 tokens. When the prompt dominates the request, the same hardware serves close to one-fifth as many users. Most “we need more GPUs” conversations are actually “our prompts are too long” conversations in disguise.

For commercial APIs, throughput shows up as rate limits. OpenAI, Anthropic, and Google all throttle requests per minute and tokens per minute by tier. Your pilot fits inside the default tier. Your production deployment doesn’t. The negotiation for higher rate limits — and the deposits required to get them — is a procurement conversation that needs to happen before launch, not after the first outage.
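A quick pre-launch check is to compare projected traffic against both limits at once. The sketch below is a plain calculation, not a call to any provider's API; the function name and the limit values in the example are hypothetical and do not reflect any vendor's actual tiers.

```python
def fits_rate_limit(requests_per_min, tokens_per_request, rpm_limit, tpm_limit):
    """Compare projected traffic with requests-per-minute and tokens-per-minute caps."""
    tokens_per_min = requests_per_min * tokens_per_request
    return {
        "rpm_used": requests_per_min / rpm_limit,  # fraction of the request cap
        "tpm_used": tokens_per_min / tpm_limit,    # fraction of the token cap
        "ok": requests_per_min <= rpm_limit and tokens_per_min <= tpm_limit,
    }


# Hypothetical tier: 500 requests/min and 200,000 tokens/min.
pilot = fits_rate_limit(20, 1500, rpm_limit=500, tpm_limit=200_000)
production = fits_rate_limit(400, 1500, rpm_limit=500, tpm_limit=200_000)
print(pilot["ok"], production["ok"])
```

In this made-up case production stays under the request cap but blows through the token cap, which is why both numbers need checking.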

Reality Check

What the vendor says: “Our platform scales to enterprise volumes.”

What that means in practice: It scales if you pay for the capacity. The default tier handles a few thousand requests per day. The enterprise tier handles a few million, costs 20x more, and requires a contract. “Scales” is not the same as “scales for free.”

What Operators Actually Do

The teams running AI at real scale do load testing before launch the same way they would for any other production system. They simulate peak concurrent load — not average, peak — and measure tokens per second, requests per second, and the queue depth that builds up under stress. If the system breaks at 60% of projected peak load, it isn’t ready.
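Here is a minimal sketch of what such a test measures, using only Python's standard library. `fake_model_call` is a stand-in that sleeps instead of calling a real model, and the semaphore plays the role of finite serving capacity; all names and numbers are assumptions for illustration.

```python
import asyncio
import time


async def fake_model_call(capacity, service_s):
    """Stand-in for a model request: wait for a free slot, then 'generate'."""
    start = time.perf_counter()
    async with capacity:                # queue here if every slot is busy
        await asyncio.sleep(service_s)  # simulated generation time
    return time.perf_counter() - start


async def load_test(concurrent_users, capacity_slots, service_s=0.01):
    """Fire all requests at once and report throughput and tail latency."""
    capacity = asyncio.Semaphore(capacity_slots)
    t0 = time.perf_counter()
    latencies = await asyncio.gather(
        *(fake_model_call(capacity, service_s) for _ in range(concurrent_users))
    )
    elapsed = time.perf_counter() - t0
    ordered = sorted(latencies)
    return {
        "requests_per_sec": concurrent_users / elapsed,
        "p95_latency_s": ordered[int(0.95 * (len(ordered) - 1))],
    }


light = asyncio.run(load_test(concurrent_users=4, capacity_slots=4))
heavy = asyncio.run(load_test(concurrent_users=40, capacity_slots=4))
print(light, heavy)
```

With the same four slots, the heavy run's tail latency is several times the light run's, even though each individual call takes the same time. A real load test would replace the stand-in with calls to the actual endpoint and record tokens per second and queue depth as well.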

They also tune the throughput vs. latency tradeoff explicitly. For batch workloads (overnight document processing, async classification), they max out batching and accept higher per-request latency to get more throughput per dollar. For real-time workloads (chatbots, search), they limit batching to keep latency under target, even though it costs more per request. Same model, different settings, very different unit economics.

The third move: shrink the request. Every token cut from the prompt is throughput recovered. Teams with disciplined prompt engineering get 2-3x more throughput out of the same hardware than teams that let prompts sprawl. That’s not a vendor feature. That’s an internal engineering practice.

The Questions to Ask

  1. What’s our system’s tokens per second and requests per second under peak concurrent load? If the team has only measured single-user latency, they haven’t measured throughput. Those are different numbers.

  2. What’s our rate limit on the underlying API, and have we tested against it? Most production outages on commercial APIs are rate-limit hits, not model failures. The rate limit is the real ceiling.

  3. What’s our plan when throughput maxes out — degrade, queue, or fail? Every system hits its capacity ceiling eventually. The teams that planned for it serve a “high demand, please retry in 30 seconds” message. The teams that didn’t serve a 504 error and lose the user.
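The degrade, queue, or fail decision from question 3 can be made explicit in code. Below is a minimal sketch of an admission check, assuming the service tracks how many requests are in flight and how many are queued; the names, thresholds, and the 30-second retry hint are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Admission:
    action: str             # "serve", "queue", or "reject"
    retry_after_s: int = 0  # only meaningful when action == "reject"


def admit(in_flight, queued, max_in_flight, max_queue, retry_after_s=30):
    """Decide what to do with a new request given current load."""
    if in_flight < max_in_flight:
        return Admission("serve")
    if queued < max_queue:
        return Admission("queue")
    # Over capacity: refuse quickly with a retry hint instead of timing out.
    return Admission("reject", retry_after_s)


print(admit(in_flight=3, queued=0, max_in_flight=8, max_queue=50))
print(admit(in_flight=8, queued=50, max_in_flight=8, max_queue=50))
```

A fast, deliberate rejection lets the front end show the "high demand, please retry in 30 seconds" message. In an HTTP service that would typically map to a 429 or 503 response with a Retry-After header rather than a 504 after a long wait.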
