Inference

Running the model in production. The cost center most CEOs don't see until the bill arrives — and the place where good pilots quietly die at scale.

The Technical Definition

Inference is what happens every time the model is actually used. Training is the one-time process of building the model. Inference is the recurring cost of running it: every prompt your employees send, every customer chat message, every document the system processes. You are billed per token (a token is a small chunk of text, often a word or part of one), so the bill is tokens per request multiplied by the number of requests.

If training is buying the factory, inference is the electricity bill. The factory is paid for once. The electricity comes every month, forever, and scales with usage.

What This Actually Means for Your Business

Most AI pilots look cheap because the volume is small. A team of 20 people running 50 prompts a day each sends 1,000 prompts a day; at a cent or two per prompt, that produces a monthly bill of roughly $400. Easy to approve. Easy to ignore.

Then you scale. Five thousand employees, customer-facing chat, document processing pipelines, agentic workflows that send three or four follow-up requests for every user request. The same architecture that cost $400 a month at pilot now costs $40,000 — and the finance team is asking questions you don’t have answers to.

The math that kills pilots at scale: if your system uses 4,000 tokens of context per request (your data, the user question, the prompt instructions) and generates 1,000 tokens of output, that’s 5,000 tokens per call. At $5 per million input tokens and $15 per million output tokens, the input costs $0.02 and the output $0.015, so you’re at $0.035 per call. Sounds tiny. Multiply by 100,000 customer interactions a day, and you’re at $3,500 daily. That is about $1.3 million a year on inference alone, for one use case.

This is before you factor in retries, agentic loops that call the model multiple times per task, or the bigger context windows enterprise vendors love to sell you on.
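The arithmetic is easy to keep in a small script so it can be rerun whenever prices or volumes change. A minimal sketch in Python: the token counts, prices, and daily volume are the ones from the example above, while the function names, the figure of four calls per task, and the 5% retry rate are assumptions for illustration, not measurements.

```python
def cost_per_call(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one model call; prices are quoted per million tokens."""
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


def cost_per_task(call_cost, calls_per_task=1, retry_rate=0.0):
    """Expected cost of one user-facing task.

    calls_per_task: model calls made per task (agentic loops raise this).
    retry_rate: assumed fraction of calls that get sent a second time.
    """
    return call_cost * calls_per_task * (1 + retry_rate)


# Figures from the example in the text.
per_call = cost_per_call(4_000, 1_000, input_price_per_m=5.0, output_price_per_m=15.0)
daily = per_call * 100_000
annual = daily * 365

# Hypothetical agentic version: 4 calls per task, 5% of calls retried.
agentic_task = cost_per_task(per_call, calls_per_task=4, retry_rate=0.05)

print(f"per call:  ${per_call:.3f}")
print(f"per day:   ${daily:,.0f}")
print(f"per year:  ${annual:,.0f}")
print(f"per agentic task (assumed): ${agentic_task:.3f}")
```

With the example's numbers this comes to about $0.035 per call, $3,500 per day, and just under $1.28 million per year. Under the assumed agentic settings, each task costs a little more than four times a single call.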

Latency is the other quiet killer. Inference takes time. A simple prompt might return in 800 milliseconds. A complex agentic workflow might take 30 seconds. Cold starts on self-hosted models can take minutes if no one warmed up the GPU. Customers will not wait. Employees will route around the system. The use case dies — not because the AI was bad, but because the inference layer wasn’t engineered for the actual workload.

Reality Check

What the vendor says: “Inference is included in your subscription.”

What that means in practice: It’s included up to a usage cap they wrote into the contract in fine print. Past that cap, you’re on overage pricing — which is often 3–5x the rate you’d pay going direct to the model provider. Read the fair-use clause before you sign.
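To see how a usage cap changes the bill, it can help to model the contract directly. A rough sketch in Python: every number here (the fee, the cap, the direct rate, the 4x multiplier) is a hypothetical placeholder, to be replaced with the terms in your own contract.

```python
def monthly_bill(calls, included_calls, subscription_fee, direct_rate, overage_multiplier):
    """Subscription fee plus overage charges for calls beyond the cap."""
    overage_calls = max(0, calls - included_calls)
    overage_rate = direct_rate * overage_multiplier
    return subscription_fee + overage_calls * overage_rate


# Hypothetical contract: $5,000/month covers 100,000 calls; overage is billed
# at 4x an assumed direct rate of $0.035 per call.
for calls in (80_000, 100_000, 300_000):
    bill = monthly_bill(calls, 100_000, 5_000.0, 0.035, 4)
    print(f"{calls:>7,} calls -> ${bill:,.0f}")
```

Under these assumed terms the bill stays flat until the cap, then climbs quickly: tripling usage takes the bill from $5,000 to about $33,000.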

What Operators Actually Do

The teams that survive scale-up build a unit economic model before they commit to architecture. They calculate the inference cost per user, per transaction, per workflow — and they stress-test what happens at 10x volume. If the math doesn’t work at scale, they redesign before they deploy.
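A unit economic model does not need to be elaborate. The sketch below, in Python, is one possible shape for it: the workflow names, volumes, per-call costs, and the budget ceiling are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    calls_per_day: int
    cost_per_call: float  # dollars, e.g. taken from the vendor's price sheet


def monthly_cost(workflows, volume_multiplier=1, days=30):
    """Monthly inference cost per workflow at a given multiple of current volume."""
    return {
        w.name: w.calls_per_day * volume_multiplier * w.cost_per_call * days
        for w in workflows
    }


# Hypothetical pilot workloads.
pilot = [
    Workflow("support_chat", calls_per_day=2_000, cost_per_call=0.035),
    Workflow("doc_processing", calls_per_day=500, cost_per_call=0.060),
]
MONTHLY_BUDGET = 20_000  # assumed budget ceiling in dollars

for multiplier in (1, 10):
    total = sum(monthly_cost(pilot, multiplier).values())
    status = "within budget" if total <= MONTHLY_BUDGET else "over budget"
    print(f"{multiplier:>2}x volume: ${total:,.0f}/month ({status})")
```

In this made-up case the pilot costs about $3,000 a month and fits the budget, while the same design at 10x volume costs about $30,000 and does not. That is the signal to redesign before deploying.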

They also tier their model usage. The most capable (and expensive) model handles only the requests where reliability matters. Cheaper models handle bulk work: summarizing internal documents, drafting first passes, classifying tickets. A well-designed inference stack might use four different models depending on the task, with a routing layer in front. This kind of engineering routinely cuts inference cost by 60–80% versus sending everything to the flagship model.
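A routing layer can start as a simple lookup. The Python sketch below uses two tiers rather than four to stay short; the per-call costs, the task labels, and the traffic mix are all assumptions, and a production router would usually classify requests with more than a fixed set of labels.

```python
# Hypothetical per-call costs in dollars for two tiers.
COST_PER_CALL = {"flagship": 0.035, "small": 0.004}

# Task types treated as bulk work (assumed labels).
BULK_TASKS = {"summarize", "draft", "classify"}


def route(task_type):
    """Send bulk work to the cheap tier, everything else to the flagship."""
    return "small" if task_type in BULK_TASKS else "flagship"


def routed_cost(task_counts):
    """Total cost when each task type goes to the tier chosen by route()."""
    return sum(COST_PER_CALL[route(t)] * n for t, n in task_counts.items())


def flagship_only_cost(task_counts):
    """Total cost when every request goes to the flagship model."""
    return COST_PER_CALL["flagship"] * sum(task_counts.values())


# Hypothetical daily mix: 80% bulk work, 20% high-stakes requests.
mix = {"classify": 600, "summarize": 200, "contract_review": 200}
savings = 1 - routed_cost(mix) / flagship_only_cost(mix)
print(f"savings vs flagship-only: {savings:.0%}")
```

With this assumed mix and these assumed prices, routing saves roughly 70%. The result depends entirely on how much of your traffic is truly bulk work and on the price gap between tiers.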

The other operational discipline: caching. If your system answers the same question 500 times a day, cache the answer. If your prompts are mostly identical with small variations, use prompt caching (most providers offer it now). Done well, caching can pay for itself in the first month.
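The first kind of caching, storing answers to repeated questions, can be sketched in a few lines of Python. This is an illustrative in-memory version with invented names: `call_model` stands in for whatever function actually calls your provider, and a real deployment would add expiry and a shared store. Provider-side prompt caching is a separate feature configured through each provider's API and is not shown here.

```python
import hashlib


class ResponseCache:
    """Answer repeated questions from memory instead of calling the model again."""

    def __init__(self, call_model):
        self._call_model = call_model  # hypothetical function: prompt -> answer
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prompt):
        # Normalize case and whitespace so trivially different prompts match.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def ask(self, prompt):
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._call_model(prompt)
        self._store[key] = answer
        return answer


def fake_model(prompt):
    """Stand-in for a paid model call."""
    return f"answer to: {prompt}"


cache = ResponseCache(fake_model)
questions = [
    "What is our refund policy?",
    "what is our  refund policy?",
    "How do I reset my password?",
]
for question in questions:
    cache.ask(question)
print(f"hits={cache.hits} misses={cache.misses}")
```

Every hit is a model call you did not pay for. The trade-off is staleness: cached answers need to be invalidated when the underlying policy or data changes.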

The Questions to Ask

  1. What’s the inference cost per transaction at our projected scale? Not at pilot. At full deployment. If the vendor can’t model this, they haven’t thought about it — and you’re going to find out the hard way.

  2. What’s the p95 latency under realistic load? Average latency is a marketing number. The 95th percentile is what your customers will actually experience when the system is busy. Get the number with concurrent traffic, not in a demo.

  3. What’s the contingency if the model provider has an outage? OpenAI, Anthropic, and the rest go down occasionally. Is there a fallback model? A graceful degradation path? Or does your business stop when the API does?
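Questions 2 and 3 can both be made concrete with very little code. The Python sketch below shows a nearest-rank percentile over measured latencies and a simple fallback chain; the sample latencies, the provider functions, and the degraded-mode message are all made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def call_with_fallback(prompt, providers):
    """Try each (name, function) pair in order; return the first success."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception:
            # Sketch only: a real system would catch narrower errors and log them.
            continue
    return "degraded", "Service is temporarily unavailable. Please try again shortly."


# Hypothetical latencies in seconds: mostly fast, with a slow tail.
latencies = [0.8] * 18 + [6.0, 30.0]
mean = sum(latencies) / len(latencies)
p95 = percentile(latencies, 95)
print(f"mean={mean:.2f}s p95={p95:.1f}s")


def primary(prompt):
    raise TimeoutError("primary provider unavailable")


def backup(prompt):
    return f"backup answer to: {prompt}"


used, reply = call_with_fallback("Where is my order?", [("primary", primary), ("backup", backup)])
print(used, "->", reply)
```

In this invented sample the mean is about 2.5 seconds while the p95 is 6 seconds, which is why the average alone is a poor guide. The fallback chain returns a degraded-mode message only when every provider in the list fails.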
