
Tokens & Tokenization

What vendors mean: a technical detail you don't need to worry about. What it actually means: the unit you're being billed in, and the reason your AI bill jumped 4x last month.


The Technical Definition

A token is a chunk of text, usually a short word, part of a word, or a punctuation mark. Tokenization is the process of cutting text into those chunks before an LLM reads it. As a rough rule, 1,000 tokens is about 750 English words, or roughly three paragraphs. Exact counts depend on the model’s tokenizer, but typically the word “tokenization” itself is two tokens, the word “the” is one, and the word “antidisestablishmentarianism” is several.
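The rule of thumb above can be turned into a quick estimator. This is a minimal sketch that assumes the 750-words-per-1,000-tokens ratio holds for your text; it counts whitespace-separated words and is not a real tokenizer. The function names are illustrative.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text: 1,000 tokens ~ 750 English words


def estimate_tokens(text: str, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Estimate a token count from whitespace-separated words (not a real tokenizer)."""
    word_count = len(text.split())
    return round(word_count / words_per_token)


def estimate_words(token_count: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Convert a token count back to an approximate English word count."""
    return round(token_count * words_per_token)


if __name__ == "__main__":
    sample = "word " * 750          # a 750-word stand-in document
    print(estimate_tokens(sample))  # estimated tokens for 750 words
    print(estimate_words(1000))     # estimated words for 1,000 tokens
```

For billing or capacity decisions, count tokens with the tokenizer of the model you actually use; the estimate is only good for back-of-envelope planning on English prose.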

Every interaction with an LLM is metered in tokens — input tokens (what you send the model) and output tokens (what it sends back). Pricing, speed, and the model’s working memory are all measured in tokens. If you only learn one technical concept about LLMs, learn this one.

What This Actually Means for Your Business

Tokens are the unit of cost. Every vendor pricing page, every API bill, every “AI infrastructure” line item ultimately resolves to a token count multiplied by a per-token rate. A frontier model might cost a few cents per thousand input tokens and a bit more per thousand output tokens. That sounds cheap. It is cheap, until your application starts processing customer documents at scale, and then it isn’t.

Here is the math that surprises CEOs. A single customer service conversation might run 5,000 tokens. A long document analysis might run 50,000. A RAG system that pulls 20 documents into context per query is sending 30,000+ tokens on every request. Multiply by ten thousand queries a day, and the difference between a model that costs five cents per query and one that costs fifty cents per query is the difference between a roughly $15K monthly bill and a roughly $150K monthly bill. The pricing pages don’t show you that. Your CFO will.
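A minimal sketch of that multiplication, assuming the five-cent and fifty-cent figures are costs per query and that a billing month has 30 days (both are assumptions, and the function and parameter names are illustrative). Under those assumptions, ten thousand queries a day comes to about $15K versus $150K a month. A different billing period or a per-token reading of the prices changes the totals, but the tenfold gap stays.

```python
def monthly_bill_usd(queries_per_day: int, cost_per_query_cents: int,
                     days_per_month: int = 30) -> float:
    """Project a monthly bill from daily volume and a per-query cost in cents."""
    total_cents = queries_per_day * cost_per_query_cents * days_per_month
    return total_cents / 100


if __name__ == "__main__":
    cheap = monthly_bill_usd(10_000, 5)    # five cents per query
    pricey = monthly_bill_usd(10_000, 50)  # fifty cents per query
    print(f"cheap model:  ${cheap:,.0f}/month")
    print(f"pricey model: ${pricey:,.0f}/month")
    print(f"ratio: {pricey / cheap:.0f}x")
```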

The other thing tokens determine is what the model can actually see. LLMs have a context window measured in tokens — the maximum amount of text they can read at once. If your contract is 80,000 tokens long and the model’s context window is 32,000 tokens, the model is not reading your contract. It’s reading a slice of your contract, and the vendor’s wrapper is making decisions about which slice. Those decisions are usually invisible to you.
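The contract example can be checked with simple arithmetic. This sketch assumes you already know the document's token count and the model's context window; `reserved_tokens` is a hypothetical allowance for the system prompt and the model's answer.

```python
import math


def usable_window(context_window: int, reserved_tokens: int = 0) -> int:
    """Tokens left for the document after reserving room for prompt and output."""
    usable = context_window - reserved_tokens
    if usable <= 0:
        raise ValueError("reserved_tokens leaves no room for the document")
    return usable


def visible_fraction(document_tokens: int, context_window: int,
                     reserved_tokens: int = 0) -> float:
    """Share of the document the model can read in a single request."""
    usable = usable_window(context_window, reserved_tokens)
    return min(1.0, usable / document_tokens)


def chunks_needed(document_tokens: int, context_window: int,
                  reserved_tokens: int = 0) -> int:
    """Number of separate requests needed to cover the whole document."""
    usable = usable_window(context_window, reserved_tokens)
    return math.ceil(document_tokens / usable)


if __name__ == "__main__":
    # The example from the text: an 80,000-token contract, a 32,000-token window.
    print(visible_fraction(80_000, 32_000))  # share of the contract seen at once
    print(chunks_needed(80_000, 32_000))     # requests needed to cover all of it
```

With no tokens reserved, the model sees 40% of the contract per request and needs three requests to cover it. Any real deployment reserves some of the window, so the visible share is smaller.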

There’s also a hidden tokenization tax that operators discover late. Different models tokenize the same text differently. Code, numbers, foreign languages, JSON, and domain-specific terminology often produce more tokens than plain English. A query about financial data might consume 20% more tokens than a query of the same length about consumer products, because numerals and ticker symbols tokenize inefficiently in models trained primarily on prose. If your business is heavy on structured data, your token costs run higher than the marketing examples suggest.

The pricing implication is straightforward but rarely modeled correctly: token costs scale linearly with input length and roughly linearly with output length. They do not scale with value. A 1,000-token query about a $10 product costs the same as a 1,000-token query about a $10M deal. Designing your AI features means deciding where the tokens are worth spending and where they aren’t.
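The linear relationship can be written down directly. In this sketch the per-thousand-token rates are hypothetical placeholders, not any vendor's actual prices. The function takes token counts and rates only, so nothing about the value of the underlying deal enters the cost.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_rate_per_1k: float, output_rate_per_1k: float) -> float:
    """Cost of one request: linear in input tokens and in output tokens."""
    input_cost = input_tokens / 1000 * input_rate_per_1k
    output_cost = output_tokens / 1000 * output_rate_per_1k
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical rates: $0.03 per 1K input tokens, $0.06 per 1K output tokens.
    ten_dollar_product = request_cost_usd(1_000, 500, 0.03, 0.06)
    ten_million_deal = request_cost_usd(1_000, 500, 0.03, 0.06)
    print(f"${ten_dollar_product:.4f} vs ${ten_million_deal:.4f}")  # identical costs
```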

Reality Check

What the vendor says: “Pricing is simple — pennies per query.”

What that means in practice: Pricing is simple at demo scale. At production scale, with full system prompts, retrieved context, conversation history, and chain-of-thought reasoning, a single “query” can run tens of thousands of tokens. Get the vendor to walk you through the token cost of a real customer interaction, end to end. If they can’t, they don’t know what their own product costs.

What Operators Actually Do

The companies running LLMs profitably treat token efficiency as a first-class engineering discipline. They measure tokens per query the way SaaS companies measure infrastructure cost per user. They prune system prompts. They cap conversation history. They use cheaper models for routing and reserve expensive models for high-value steps. They cache common responses so frequent queries don’t pay full token cost twice.
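Two of those practices, routing and caching, fit in a few lines. This is a sketch under stated assumptions: `cheap_model`, `expensive_model`, and `is_high_value` are hypothetical callables you would supply, and the cache is an in-memory dictionary keyed on normalized query text. A production system would also need expiry and per-customer isolation.

```python
from typing import Callable, Dict


class CachedRouter:
    """Serve repeated queries from a cache; route the rest to a cheap or expensive model."""

    def __init__(self, cheap_model: Callable[[str], str],
                 expensive_model: Callable[[str], str],
                 is_high_value: Callable[[str], bool]) -> None:
        self.cheap_model = cheap_model
        self.expensive_model = expensive_model
        self.is_high_value = is_high_value
        self._cache: Dict[str, str] = {}
        self.calls = {"cheap": 0, "expensive": 0, "cache_hits": 0}

    def answer(self, query: str) -> str:
        # Normalize case and whitespace so trivially different queries share a cache entry.
        key = " ".join(query.lower().split())
        if key in self._cache:
            self.calls["cache_hits"] += 1
            return self._cache[key]
        if self.is_high_value(query):
            self.calls["expensive"] += 1
            response = self.expensive_model(query)
        else:
            self.calls["cheap"] += 1
            response = self.cheap_model(query)
        self._cache[key] = response
        return response


if __name__ == "__main__":
    router = CachedRouter(
        cheap_model=lambda q: "cheap: " + q,         # stand-in for a small model
        expensive_model=lambda q: "frontier: " + q,  # stand-in for a frontier model
        is_high_value=lambda q: "contract" in q.lower(),
    )
    router.answer("Reset my password")
    router.answer("reset my  PASSWORD")  # served from cache, no model call
    router.answer("Review this contract")
    print(router.calls)
```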

Smart teams also build a token budget into the product itself. A customer-facing assistant has a hard limit on how many tokens it can spend per conversation, and the limit is set by what the conversation is actually worth to the business. An internal research tool used by twenty analysts gets a different budget than a consumer chatbot serving a million users. The budget is a product decision, not an infrastructure decision.
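A per-conversation budget can be as simple as a counter with a hard ceiling. In this sketch the class name and the limits are illustrative; the actual ceiling is the product decision described above.

```python
class ConversationBudget:
    """Track token spend for one conversation against a hard limit."""

    def __init__(self, max_tokens: int) -> None:
        if max_tokens < 0:
            raise ValueError("max_tokens must be non-negative")
        self.max_tokens = max_tokens
        self.spent = 0

    @property
    def remaining(self) -> int:
        return self.max_tokens - self.spent

    def try_spend(self, tokens: int) -> bool:
        """Record the spend and return True if it fits; otherwise refuse and return False."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if tokens > self.remaining:
            return False
        self.spent += tokens
        return True


if __name__ == "__main__":
    # Hypothetical limits: a consumer chatbot gets less headroom than an analyst tool.
    chatbot = ConversationBudget(max_tokens=8_000)
    print(chatbot.try_spend(5_000))  # fits within the budget
    print(chatbot.try_spend(5_000))  # would exceed the budget, so it is refused
    print(chatbot.remaining)
```

When `try_spend` refuses, the product decides what happens next: summarize the history, switch to a cheaper model, or end the conversation.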

The other thing operators do: they monitor tokens like a P&L line. Daily dashboards by feature, by customer, by cohort. When token consumption spikes, somebody’s usage pattern changed, or somebody’s prompt got longer, or somebody added a feature without modeling the cost. Catching it in the dashboard the next day is cheap. Catching it in the invoice next month is not.
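The dashboard check reduces to comparing today's consumption with a recent baseline. This sketch assumes you already log daily token totals per feature; the 1.5x threshold and the feature names are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def find_spikes(daily_tokens_by_feature: Dict[str, List[int]],
                threshold: float = 1.5) -> Dict[str, float]:
    """Return features whose latest daily total is >= threshold x their prior average."""
    spikes: Dict[str, float] = {}
    for feature, series in daily_tokens_by_feature.items():
        if len(series) < 2:
            continue  # need at least one prior day to form a baseline
        *history, today = series
        baseline = mean(history)
        if baseline > 0 and today / baseline >= threshold:
            spikes[feature] = today / baseline
    return spikes


if __name__ == "__main__":
    usage = {
        "search": [100_000, 100_000, 100_000, 300_000],     # latest day tripled
        "assistant": [200_000, 210_000, 190_000, 205_000],  # steady
    }
    print(find_spikes(usage))
```

The same comparison can be run by customer or by cohort by changing what the dictionary keys represent.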

The Questions to Ask

  1. What’s the average token cost of a real production interaction with this product? Not the demo. The real one — system prompt, retrieved context, user message, conversation history, model output, all of it. If they can’t quantify it, you’ll find out from the bill.

  2. How does token usage scale as we grow? If we 10x usage, do costs go up linearly, sublinearly because of caching, or worse than linearly because of longer conversations? Get a real model.

  3. What controls do we have on token consumption? Per-user caps, per-conversation caps, model routing, caching — what knobs exist to keep costs predictable when usage patterns change?
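The three growth scenarios in the second question can be compared with a small cost model. This is a sketch under simplifying assumptions: cached responses are treated as costing nothing, the rate per thousand tokens is a hypothetical placeholder, and input and output tokens are lumped together.

```python
def monthly_cost_usd(queries: int, tokens_per_query: int, rate_per_1k_usd: float,
                     cache_hit_rate: float = 0.0) -> float:
    """Monthly cost when only cache misses are billed (assumes cache hits are free)."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    billable_queries = queries * (1.0 - cache_hit_rate)
    return billable_queries * tokens_per_query / 1000 * rate_per_1k_usd


if __name__ == "__main__":
    rate = 0.01  # hypothetical dollars per 1K tokens
    baseline = monthly_cost_usd(100_000, 5_000, rate)
    scenarios = {
        "10x usage, nothing else changes": monthly_cost_usd(1_000_000, 5_000, rate),
        "10x usage, 40% cache hits": monthly_cost_usd(1_000_000, 5_000, rate, 0.4),
        "10x usage, longer conversations": monthly_cost_usd(1_000_000, 8_000, rate),
    }
    for name, cost in scenarios.items():
        print(f"{name}: {cost / baseline:.1f}x baseline cost")
```

With these inputs the three scenarios come out to roughly 10x, 6x, and 16x the baseline cost: linear, sublinear, and worse than linear.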
