AI Latency

The reason your chatbot gets abandoned at 3 seconds. Time-to-first-token is the metric your CX team will care about long before your CFO does.

Deployment & Ops

The Technical Definition

AI latency is the time between a user submitting a request and the model finishing its response. It splits into two numbers that matter separately. Time-to-first-token (TTFT) is how long the user waits before any output appears — typically 200ms to 2,000ms for commercial APIs. Total generation time is TTFT plus the time it takes to produce every additional token, which scales linearly with output length at roughly 30 to 100 tokens per second depending on the model.
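The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope model, not a measurement: the function name `estimate_total_latency` and the sample numbers (0.5 s TTFT, 400 output tokens) are illustrative assumptions, and it treats decode speed as constant, which real systems only approximate.

```python
def estimate_total_latency(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Estimate total generation time in seconds: TTFT plus decode time.

    Assumes a constant decode rate, which is a simplification.
    """
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return ttft_s + output_tokens / tokens_per_s


if __name__ == "__main__":
    # Illustrative inputs only: 0.5 s TTFT and a 400-token answer,
    # evaluated at both ends of the 30-100 tokens/second range.
    for rate in (30, 100):
        total = estimate_total_latency(0.5, 400, rate)
        print(f"{rate} tok/s -> {total:.1f} s total")
```

Under these assumptions the same 400-token answer takes roughly 4.5 seconds at 100 tokens per second and close to 14 seconds at 30, which is why output length and model speed matter as much as TTFT.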

Latency is dominated by three things: model size (bigger models are slower), prompt length (longer inputs take longer to process), and infrastructure (geographic distance, network hops, GPU contention).

What This Actually Means for Your Business

Here’s what your CX team will tell you the first month after launch: users abandoned the chatbot. Here’s what they won’t tell you: users abandoned it at the three-second mark, every time, in patterns that match the specific queries where your retrieval system pulls the most context and your model takes the longest to respond.

Latency is a UX problem disguised as an infrastructure problem. Jakob Nielsen published the classic response-time limits in 1993, and the Nielsen Norman Group still cites them: about 0.1 second feels instantaneous, about 1 second keeps the user’s train of thought intact, and about 10 seconds is the limit for holding attention at all. A 3-second wait is already well past the point where a response feels fluid. AI systems regularly take 5 to 15 seconds to produce a complete answer. If you make the user stare at a spinner for that long, they leave.

The trick the good teams use is streaming. Instead of waiting for the full response before rendering anything, you start showing tokens the moment the model produces them. The user sees text appearing in, say, 400ms instead of waiting 8 seconds for a complete answer. The total generation time hasn’t changed, but the perceived latency has collapsed. That single UX choice is the difference between a chatbot that feels fast and one that feels broken.
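Here is a minimal sketch of what streaming looks like in code, and how to measure both numbers while you do it. `fake_token_stream` is a hypothetical stand-in for a real streaming API (it just sleeps and yields), and `render_stream` is an illustrative name, not part of any library. Swap in your provider’s streaming iterator where the fake one is used.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def fake_token_stream(tokens: Iterable[str], first_delay_s: float,
                      per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model API."""
    time.sleep(first_delay_s)      # simulated time-to-first-token
    for tok in tokens:
        yield tok
        time.sleep(per_token_s)    # simulated per-token decode time


def render_stream(stream: Iterator[str],
                  emit: Callable[[str], None]) -> Tuple[float, float]:
    """Emit each token as it arrives; return (ttft_s, total_s)."""
    start = time.perf_counter()
    ttft = None
    for tok in stream:
        if ttft is None:
            # First token arrived: this is what the user perceives.
            ttft = time.perf_counter() - start
        emit(tok)
    total = time.perf_counter() - start
    return (ttft if ttft is not None else total), total


if __name__ == "__main__":
    words = ["Latency ", "is ", "a ", "UX ", "problem."]
    ttft, total = render_stream(
        fake_token_stream(words, first_delay_s=0.4, per_token_s=0.1),
        lambda tok: print(tok, end="", flush=True),
    )
    print(f"\nTTFT {ttft:.2f}s, total {total:.2f}s")
```

The total time is the same whether or not you stream. What changes is that the user starts reading at the TTFT mark instead of at the end.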

The other place latency hits is workflow design. If your AI agent makes five sequential model calls, you’re not waiting on one response. You’re waiting on five. Latency stacks. A workflow that calls the model three times for “research, draft, and refine” looks elegant on a whiteboard and feels unbearable to a human waiting for it. Operators in production parallelize the calls that don’t depend on each other and cache the calls whose answers repeat, because five sequential calls at two seconds each is ten seconds, and ten seconds is too long.
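The stacking effect is easy to see in a toy example. In this sketch `call_model` is a hypothetical stand-in that sleeps instead of calling an API, and the in-memory dictionary cache is the simplest possible version of the idea. It assumes the five calls are independent of each other; calls that need a previous call’s output still have to run in order.

```python
import asyncio
import time

_cache = {}  # prompt -> answer; illustrative in-memory cache


async def call_model(prompt: str, latency_s: float = 0.1) -> str:
    """Hypothetical model call: sleeps instead of hitting a real API."""
    await asyncio.sleep(latency_s)
    return f"answer to: {prompt}"


async def cached_call(prompt: str) -> str:
    """Only pay the latency the first time a prompt is seen."""
    if prompt not in _cache:
        _cache[prompt] = await call_model(prompt)
    return _cache[prompt]


async def run_sequential(prompts):
    # Each call waits for the previous one, so latencies add up.
    return [await call_model(p) for p in prompts]


async def run_parallel(prompts):
    # Independent calls run concurrently, so the wait is about one call.
    return list(await asyncio.gather(*(call_model(p) for p in prompts)))


async def timed(coro):
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


async def main():
    prompts = [f"question {i}" for i in range(5)]
    seq, seq_s = await timed(run_sequential(prompts))
    par, par_s = await timed(run_parallel(prompts))
    return seq, par, seq_s, par_s


if __name__ == "__main__":
    seq, par, seq_s, par_s = asyncio.run(main())
    print(f"sequential {seq_s:.2f}s, parallel {par_s:.2f}s")
```

Both versions return the same answers. The sequential run takes about five times the single-call latency, while the parallel run takes about one.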

Reality Check

What the vendor says: “Our model has industry-leading response times.”

What that means in practice: Their benchmark used a 200-token prompt and a 50-token output. Your production prompt is 4,000 tokens after retrieval, and your output runs 800 tokens. That is 20x the input and 16x the output, and generation time grows with output length, so expect your latency to be many times their demo number. Test on your own prompts, not theirs.

What Operators Actually Do

The teams shipping AI to real customers measure two latencies at the same time: TTFT and full-response time. They set targets for each — TTFT under 800ms, full response under 5 seconds for chatbot use cases — and they enforce those targets the same way a web team enforces page load times.
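One way to enforce targets like these is to check a percentile of measured latencies against each budget, the same way a page-load check would run. This is a sketch under assumptions: the 800ms and 5-second budgets come from the targets above, but using the 95th percentile, the nearest-rank method, and the names `percentile` and `check_budgets` are illustrative choices, not a standard.

```python
import math
from typing import Dict, Sequence

TTFT_BUDGET_S = 0.8   # TTFT target from the text: under 800ms
FULL_BUDGET_S = 5.0   # full-response target from the text: under 5 seconds


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_budgets(ttft_samples: Sequence[float],
                  full_samples: Sequence[float],
                  pct: float = 95.0) -> Dict[str, object]:
    """Compare the chosen percentile of each latency against its budget."""
    ttft_p = percentile(ttft_samples, pct)
    full_p = percentile(full_samples, pct)
    return {
        "ttft_p": ttft_p,
        "full_p": full_p,
        "ttft_ok": ttft_p < TTFT_BUDGET_S,
        "full_ok": full_p < FULL_BUDGET_S,
    }


if __name__ == "__main__":
    # Made-up measurements in seconds, for illustration only.
    report = check_budgets(
        ttft_samples=[0.3, 0.4, 0.5, 0.6, 0.9],
        full_samples=[2.0, 3.0, 4.0, 4.5, 4.8],
    )
    print(report)
```

Checking a high percentile instead of the average matters here because the slow requests are the ones where users abandon the session.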

They also use streaming everywhere a human is waiting. Not just chatbots — internal tools, support assistants, anything where a person is staring at a screen. The cost is a small amount of frontend engineering. The benefit is the difference between a tool people use and a tool they avoid.

For workflows that can’t hide behind streaming (batch processing, agents acting on the user’s behalf, anything async), they redesign the workflow before they accept the latency. Shorter prompts, smaller models for sub-tasks, parallel calls instead of sequential ones. Latency is a budget, and you spend it on the parts of the workflow where the user actually cares.

The Questions to Ask

  1. What’s our TTFT and full-response latency on real production prompts? Not benchmarks, not demos — actual user requests with actual context lengths. If your team only has vendor benchmarks, you don’t know your latency.

  2. Are we streaming responses to every user-facing surface? If the answer is “we plan to” or “for some of them,” users are abandoning sessions you don’t even know about.

  3. What happens to latency at peak load? Hosted APIs can slow down noticeably under heavy traffic, sometimes by 2-3x. Is your peak-hour latency budget the same as your off-hours number, or do users get a different product on Monday morning?
