Inference Cost

The math that decides whether your $50K pilot becomes a $5M production system or a budget meeting you don't want to have.

Deployment & Ops

The Technical Definition

Inference cost is what you pay to run a trained model on real requests in production. With commercial APIs (OpenAI, Anthropic, Google), you pay per token, a unit of roughly three-quarters of an English word. The bill is split into input tokens (what you send the model) and output tokens (what the model generates back). Output tokens typically cost three to five times as much as input tokens, depending on the provider and model. With self-hosted models, you pay for GPU time, which means a fixed hourly cost whether the GPUs are busy or idle.
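To make the two billing models concrete, here is a minimal sketch of the arithmetic. The prices, the GPU rate, and the function names are illustrative assumptions, not any vendor's actual price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """Hypothetical prices in USD per million tokens (assumed, not a real price list)."""
    input_per_million: float
    output_per_million: float


def api_request_cost(input_tokens: int, output_tokens: int, pricing: TokenPricing) -> float:
    """Cost of one API request: input and output tokens are billed at separate rates."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def self_hosted_request_cost(gpu_hourly_usd: float, requests_per_hour: float) -> float:
    """Fixed hourly GPU cost spread over the requests actually served in that hour."""
    return gpu_hourly_usd / requests_per_hour


# Assumed prices with a 5x output-to-input ratio.
pricing = TokenPricing(input_per_million=3.0, output_per_million=15.0)
api_cost = api_request_cost(input_tokens=2_000, output_tokens=500, pricing=pricing)

# Same assumed $4/hour GPU, busy versus mostly idle.
busy_cost = self_hosted_request_cost(gpu_hourly_usd=4.0, requests_per_hour=1_000)
idle_cost = self_hosted_request_cost(gpu_hourly_usd=4.0, requests_per_hour=100)

print(f"API request: ${api_cost:.4f}")
print(f"Self-hosted, busy: ${busy_cost:.4f}  mostly idle: ${idle_cost:.4f}")
```

Notice that in this example the 500 output tokens cost more than the 2,000 input tokens, and that the self-hosted cost per request rises tenfold when utilization drops tenfold.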

Two operational levers move the bill: prompt caching, which can discount repeated prompt prefixes by roughly 75–90% depending on the provider, and output length, which compounds because every generated token is billed at the higher output rate.
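A rough sketch of what caching does to the input side of the bill. The discount, hit rate, and price are assumed values, and the model is simplified: it ignores any premium a provider may charge for writing to the cache.

```python
def input_cost_with_cache(
    prefix_tokens: int,
    fresh_tokens: int,
    input_price_per_million: float,
    cache_discount: float,
    cache_hit_rate: float,
) -> float:
    """Expected input cost per request when a repeated prefix may be served from cache.

    cache_discount: fraction taken off the input price for cached tokens (e.g. 0.9).
    cache_hit_rate: fraction of requests whose prefix is found in the cache.
    """
    full_rate = input_price_per_million / 1_000_000
    cached_rate = full_rate * (1 - cache_discount)
    # The prefix is billed at the cached rate on a hit and the full rate on a miss.
    expected_prefix_rate = cache_hit_rate * cached_rate + (1 - cache_hit_rate) * full_rate
    return prefix_tokens * expected_prefix_rate + fresh_tokens * full_rate


# Assumed: 3,000-token shared system prompt, 500 tokens unique to each request.
without_cache = input_cost_with_cache(3_000, 500, 3.0, cache_discount=0.0, cache_hit_rate=0.0)
with_cache = input_cost_with_cache(3_000, 500, 3.0, cache_discount=0.9, cache_hit_rate=0.8)

savings = 1 - with_cache / without_cache
print(f"Input cost per request: ${without_cache:.5f} -> ${with_cache:.5f} ({savings:.0%} lower)")
```

The headline discount only applies to the repeated prefix and only on cache hits, so the saving on the whole input bill is smaller than the advertised number.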

What This Actually Means for Your Business

The pilot looks cheap. A demo with 100 employees, hitting the model a few times a day, runs you a few hundred dollars a month. You sign off, the team moves to production, and four months later finance forwards an $80,000 invoice and asks who approved it.

Here’s what changed. In production, request volume is 100 to 1,000 times the pilot. Prompts get longer because real workflows need real context — your retrieval system stuffs 20 documents into every request “just in case.” Output length drifts up because nobody constrained it, and the model defaults to verbose. The same use case that cost $400/month at pilot scale costs $40,000/month at production scale. Same model, same architecture, different math.

The teams that survive this transition do the cost math before they build, not after. They estimate token volume per request, multiply by request volume per day, multiply by 30 days, and look at the number. If the number is uncomfortable, they redesign the prompt before they ship — not after they get the invoice.
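That estimate fits in a dozen lines. The sketch below uses assumed token counts, request volumes, and prices purely for illustration; swap in your own numbers.

```python
def monthly_cost(
    input_tokens: int,
    output_tokens: int,
    requests_per_day: int,
    input_price_per_million: float,
    output_price_per_million: float,
    days: int = 30,
) -> float:
    """Tokens per request x requests per day x days, priced per million tokens."""
    per_request = (input_tokens * input_price_per_million
                   + output_tokens * output_price_per_million) / 1_000_000
    return per_request * requests_per_day * days


# Assumed pilot: 100 employees x 3 requests a day, short prompts, short answers.
pilot = monthly_cost(1_500, 300, requests_per_day=300,
                     input_price_per_million=3.0, output_price_per_million=15.0)

# Assumed production: 100x the volume, plus longer prompts and longer outputs.
production = monthly_cost(8_000, 800, requests_per_day=30_000,
                          input_price_per_million=3.0, output_price_per_million=15.0)

print(f"Pilot: ${pilot:,.0f}/month")
print(f"Production: ${production:,.0f}/month ({production / pilot:.0f}x)")
```

In this example volume grows 100x but the bill grows 400x, because the per-request cost quadrupled along the way. That second multiplier is the one a prompt redesign can remove.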

The other thing nobody tells you: model choice matters less than prompt design. A well-engineered prompt on a cheap model often beats a sloppy prompt on an expensive one. Most production cost overruns come from prompts that retrieve too much context, generate too much output, or call the model when a cached result would do.

Reality Check

What the vendor says: “Our platform optimizes inference costs automatically.”

What that means in practice: They route some traffic to cheaper models and cache some prompts. They do not control how long your prompts are, how much context your retrieval pulls, or how verbose your outputs run. The bulk of your bill, the part driven by prompt design, is still your problem.

What Operators Actually Do

The companies running AI at production scale treat inference cost as a unit economics problem, not a tech bill. They calculate cost per request, cost per user, cost per outcome — the same way they’d model a SaaS feature or a fulfillment channel. If the use case is “summarize a document for a customer,” they know what each summary costs and whether the value created justifies the spend.
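A minimal sketch of that unit economics view, with made-up monthly figures. The field names and the value-per-outcome number are assumptions for illustration only.

```python
def unit_economics(
    monthly_spend: float,
    requests: int,
    active_users: int,
    successful_outcomes: int,
    value_per_outcome: float,
) -> dict[str, float]:
    """Break one month's inference spend into per-request, per-user, and per-outcome costs."""
    cost_per_outcome = monthly_spend / successful_outcomes
    return {
        "cost_per_request": monthly_spend / requests,
        "cost_per_user": monthly_spend / active_users,
        "cost_per_outcome": cost_per_outcome,
        # Positive margin means each outcome creates more value than it costs to produce.
        "margin_per_outcome": value_per_outcome - cost_per_outcome,
    }


# Assumed month: $40,000 spend, 1M requests, 5,000 users,
# 800,000 summaries actually delivered, each judged to be worth $0.25.
report = unit_economics(40_000, 1_000_000, 5_000, 800_000, value_per_outcome=0.25)

for name, value in report.items():
    print(f"{name}: ${value:.2f}")
```

Cost per outcome is higher than cost per request here because some requests fail, get retried, or get discarded, and the outcome is the thing the customer actually values.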

They also enforce three operational disciplines. First, they set max output tokens on every API call — the model will fill whatever space you give it. Second, they turn on prompt caching for any system prompt or document context that repeats across requests, which is most of them. Third, they tier their model usage: cheap model for the easy 80% of requests, expensive model only when the cheap one fails a quality check.
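The first and third disciplines can be sketched as a small routing function. Everything here is hypothetical: the model functions are stand-ins for real API calls, the quality check is a placeholder, and your provider's SDK will have its own name for the output cap.

```python
from typing import Callable

# A model is any function (prompt, max_output_tokens) -> response text.
ModelFn = Callable[[str, int], str]


def tiered_call(
    prompt: str,
    cheap_model: ModelFn,
    expensive_model: ModelFn,
    passes_quality_check: Callable[[str], bool],
    max_output_tokens: int = 300,
) -> tuple[str, str]:
    """Try the cheap model first; escalate only if its answer fails the check.

    The output cap is passed on every call, so neither tier can run long.
    Returns (response_text, tier_used).
    """
    draft = cheap_model(prompt, max_output_tokens)
    if passes_quality_check(draft):
        return draft, "cheap"
    return expensive_model(prompt, max_output_tokens), "expensive"


# Stand-in models for illustration only; real ones would call an API.
def cheap_stub(prompt: str, max_output_tokens: int) -> str:
    return "" if "contract" in prompt else "short answer"


def expensive_stub(prompt: str, max_output_tokens: int) -> str:
    return "careful answer"


def non_empty(text: str) -> bool:
    # Placeholder check; a real one might validate format, length, or citations.
    return bool(text.strip())


easy = tiered_call("summarize this memo", cheap_stub, expensive_stub, non_empty)
hard = tiered_call("summarize this contract", cheap_stub, expensive_stub, non_empty)
print(easy, hard)
```

One design note: an escalated request pays for both models, so tiering only saves money when the cheap model passes the check most of the time.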

The teams that ignore all three end up writing checks they didn’t budget for. The teams that run all three end up with bills that scale linearly with revenue, which is the whole point.

The Questions to Ask

What does a single representative request cost us, end to end? Input tokens, output tokens, retrieval overhead, retry rate. If your team can’t produce that number on demand, they don’t have control of the system.
What’s our cap on output tokens, and who set it? A model with no output cap will eventually generate a 4,000-token response when 200 would have done. That’s a 20x multiplier on the output cost of a single bad request type.
What happens to our bill if usage grows 10x next quarter? Linear scaling is the answer you want. If the answer involves “we’d need to renegotiate” or “we’re not sure,” the system isn’t ready for production load.
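The first two questions come down to arithmetic your team should be able to run on demand. A minimal sketch follows, with assumed token counts and prices, and a simplifying assumption that each failed request is retried once at full cost.

```python
def end_to_end_cost(
    prompt_tokens: int,
    retrieved_tokens: int,
    output_tokens: int,
    retry_rate: float,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Expected cost of one request including retrieval context and retries.

    retry_rate: fraction of requests that are sent a second time (assumed single retry).
    """
    input_tokens = prompt_tokens + retrieved_tokens
    one_attempt = (input_tokens * input_price_per_million
                   + output_tokens * output_price_per_million) / 1_000_000
    return one_attempt * (1 + retry_rate)


# Assumed request: 500-token prompt, 3,000 tokens of retrieved context, 5% retries.
capped = end_to_end_cost(500, 3_000, 200, 0.05, 3.0, 15.0)
uncapped = end_to_end_cost(500, 3_000, 4_000, 0.05, 3.0, 15.0)

print(f"Capped at 200 output tokens: ${capped:.4f}")
print(f"Uncapped at 4,000 output tokens: ${uncapped:.4f} ({uncapped / capped:.1f}x)")
```

With these numbers the output portion grows 20x and the full request grows roughly 5x, because the input side stays the same. If per-request cost holds steady as volume grows, a 10x jump in usage means 10x this figure and nothing worse.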

