Context Window

What vendors mean: our model has a 1 million token context, so it can read everything. What it actually means: the model technically accepts that much, but its attention thins out fast — and the long-context demos rarely survive contact with your real documents.

The Technical Definition

A context window is the maximum amount of text — measured in tokens — that an LLM can read at once when generating a response. Everything inside the window is what the model is “thinking about” for that query: the system prompt, the user’s question, retrieved documents, conversation history, and any other input. Anything outside the window does not exist to the model.

Frontier models in 2026 advertise context windows from 128,000 tokens (about 300 pages) up to 1-2 million tokens (a small library). Bigger windows let the model process more material in a single query. They also cost more, run slower, and degrade in quality more than the marketing suggests.

What This Actually Means for Your Business

The context window is the model’s working memory. It is the single most marketed spec on a frontier model launch, and it is one of the most misunderstood. Vendors love to pitch “1 million token context” because it sounds like the model can now read your entire knowledge base, your entire contract repository, your entire financial filing history. Technically true. Operationally, much narrower.

The first thing CEOs need to understand: long context does not mean uniform attention. A model with a 1M token window can accept that much input, but its ability to find and use information degrades the longer the input gets. A fact buried in the middle of a 500,000-token document is much more likely to be missed than the same fact placed near the start or the end. Researchers call this “lost in the middle.” Vendors do not put it on the spec sheet. Your team will discover it the first time the model confidently misses a clause buried on page 200 of a contract.

The second thing: cost scales with what’s in the window, not just what you ask. Every token the model reads on every query is billed. If your application stuffs 100,000 tokens of context into every query “just in case,” you are paying for that on every single interaction, whether or not the answer required all of it. This is the most common pricing surprise in enterprise LLM deployments.
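The arithmetic is worth running before you sign anything. The sketch below compares a "stuff everything in" query against a lean one. The price, query volume, and the function name `monthly_context_cost` are all assumptions for illustration, not any vendor's actual rates, and the calculation ignores output tokens and any caching discounts.

```python
def monthly_context_cost(tokens_per_query: int, queries_per_day: int,
                         price_per_million_input_tokens: float,
                         days: int = 30) -> float:
    """Input-token spend for one month (output tokens and discounts ignored)."""
    total_tokens = tokens_per_query * queries_per_day * days
    return total_tokens / 1_000_000 * price_per_million_input_tokens


# Hypothetical numbers for illustration only.
PRICE = 3.00            # assumed dollars per million input tokens
QUERIES_PER_DAY = 10_000

stuffed = monthly_context_cost(100_000, QUERIES_PER_DAY, PRICE)  # 100K tokens "just in case"
lean = monthly_context_cost(5_000, QUERIES_PER_DAY, PRICE)       # 5K tokens of retrieved context

print(f"stuffed: ${stuffed:,.0f}/month, lean: ${lean:,.0f}/month")
```

Under these assumed numbers the stuffed version costs twenty times the lean one, because the ratio of tokens per query carries straight through to the bill.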

The third thing: a bigger context window often becomes a substitute for thinking, and it does not replace engineering. The pitch “just put all your documents in context” usually loses to a properly built RAG system that retrieves only the relevant slice. A 1M token context is impressive at a conference. A 4,000 token retrieval that pulls the exact right paragraph is faster, cheaper, more accurate, and more debuggable. Many enterprise problems being sold as “we need a bigger context window” are actually retrieval problems wearing a costume.

There are real reasons to want a large context window. The first is single-document tasks where the document is genuinely large and the model has to reason across all of it: long contracts, multi-quarter financial filings, full code repositories. The second is long-running agent workflows where the conversation history matters. The third is cases where retrieval is genuinely hard because the relevant information is distributed across the document in ways no chunking strategy captures cleanly. In those cases, a larger context window is a real upgrade. In most others, it isn’t.

Reality Check

What the vendor says: “Our model supports a 1 million token context window — so you can analyze entire contracts, codebases, or document sets in a single query.”

What that means in practice: It accepts a million tokens. It does not pay equal attention to all of them. On long-context tasks, frontier models routinely miss facts placed deep in the input, especially in the middle. Get the vendor to run their model on your actual longest document and ask it five questions whose answers are scattered through the file. Score the answers. That number — not the spec sheet — is your real context window.

What Operators Actually Do

Smart teams treat context window as a budget, not a maximum. They aim to put the smallest amount of relevant context into every query, not the largest. They benchmark long-context performance on their own documents using a “needle in a haystack” test: hide a specific fact in a long document, ask the model to find it, repeat across many positions, and chart the accuracy curve. Vendors won’t show you that curve. You can build it in a week.
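The needle test described above can be sketched in a few lines. Everything here is a minimal illustration under stated assumptions: `ask_model` is a placeholder for whatever API call your team uses, and `fake_model` is a deliberately crude stand-in that only "sees" the start and end of the document, so the harness can run without a vendor account.

```python
from typing import Callable


def build_haystack(filler: list[str], needle: str, position: float) -> str:
    """Insert the needle at a relative position (0.0 = start, 1.0 = end)."""
    index = round(position * len(filler))
    return " ".join(filler[:index] + [needle] + filler[index:])


def needle_accuracy(ask_model: Callable[[str, str], str], filler: list[str],
                    needle: str, question: str, expected: str,
                    positions: list[float], trials: int = 1) -> dict[float, float]:
    """Fraction of trials at each position where the answer contains `expected`."""
    results = {}
    for position in positions:
        document = build_haystack(filler, needle, position)
        hits = sum(
            expected.lower() in ask_model(document, question).lower()
            for _ in range(trials)
        )
        results[position] = hits / trials
    return results


def fake_model(document: str, question: str) -> str:
    """Stand-in for a real model call: reads only the first and last 200 characters."""
    visible = document[:200] + document[-200:]
    if "4.2 million" in visible:
        return "The termination fee is 4.2 million dollars."
    return "Not found."


filler = ["This paragraph is routine boilerplate."] * 100
curve = needle_accuracy(
    fake_model, filler,
    needle="The termination fee is 4.2 million dollars.",
    question="What is the termination fee?",
    expected="4.2 million",
    positions=[0.0, 0.5, 1.0],
)
print(curve)
```

Swap `fake_model` for a real call, use your own documents as filler, and test more positions and document lengths. The dictionary that comes back is the accuracy curve to chart.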

The other pattern that’s working: hybrid retrieval and long context. Use RAG to get the right 5,000 tokens. Use long context only when the query genuinely requires reasoning across the full document. The decision of which pattern to use is made per query type by your engineering team, not blanket-applied by the vendor’s default settings.
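One way that per-query-type decision might look in code is a plain routing table. The query types, token budgets, and names (`Route`, `ROUTES`, `route_query`) are hypothetical. The point of the sketch is that the choice is explicit, reviewable, and defaults to the cheaper path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    strategy: str      # "rag" or "long_context"
    token_budget: int  # maximum context tokens to send for this query type


# Hypothetical query types and budgets; yours come from benchmarking.
ROUTES = {
    "clause_lookup": Route("rag", 5_000),
    "policy_faq": Route("rag", 5_000),
    "full_contract_review": Route("long_context", 200_000),
}

# Unknown query types fall back to the small, cheap path.
DEFAULT_ROUTE = Route("rag", 5_000)


def route_query(query_type: str) -> Route:
    """Pick the context strategy for a query type, defaulting to retrieval."""
    return ROUTES.get(query_type, DEFAULT_ROUTE)


print(route_query("clause_lookup"))
print(route_query("full_contract_review"))
```

Defaulting to retrieval means a new query type has to earn its way onto the expensive path rather than landing there by accident.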

Operators in regulated industries also treat context windows as an audit problem. If a model gave a wrong answer about a contract, was the relevant clause in the context window? If it was, where in the window? Logging the full context for every consequential query is now standard practice in financial services and healthcare deployments. Without it, you cannot answer “why did the model say that” — and that question is going to come up.
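A minimal sketch of what such a log record could contain, assuming the context is assembled from labeled parts (system prompt, retrieved chunks, history) in the order they are sent. The function names and record fields are illustrative, not a standard schema. Positions are measured in characters rather than tokens to keep the example free of any tokenizer dependency.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def build_audit_record(query: str, context_parts: list[tuple[str, str]],
                       answer: str) -> dict:
    """Log the exact context sent, with the character span of each labeled part."""
    full_context = "".join(text for _, text in context_parts)
    segments = []
    offset = 0
    for label, text in context_parts:
        segments.append({"label": label, "start": offset, "end": offset + len(text)})
        offset += len(text)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "answer": answer,
        "context_sha256": hashlib.sha256(full_context.encode("utf-8")).hexdigest(),
        "context": full_context,
        "segments": segments,
    }


def locate(record: dict, snippet: str) -> Optional[dict]:
    """Where did the snippet sit in the logged context? None if it was never there."""
    pos = record["context"].find(snippet)
    if pos == -1:
        return None
    for seg in record["segments"]:
        if seg["start"] <= pos < seg["end"]:
            return {
                "segment": seg["label"],
                "relative_position": pos / len(record["context"]),
            }
    return None


record = build_audit_record(
    query="What is the late fee?",
    context_parts=[
        ("system", "You are a contracts assistant. "),
        ("chunk_1", "Clause 14: the late fee is 2% per month."),
    ],
    answer="2% per month.",
)
print(locate(record, "Clause 14"))
print(locate(record, "Clause 99"))
```

With a record like this, the two audit questions have direct answers: `locate` returns `None` when the clause was never in the window, and a segment label plus relative position when it was.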

The Questions to Ask

What does this model’s accuracy look like on our documents at full context length? The spec sheet says 1M tokens. Show me a needle-in-a-haystack test on our contracts, our policies, our filings. What’s the accuracy curve at 100K, 500K, 1M?
What’s actually in the context window on every production query? The system prompt, the retrieved chunks, the conversation history, the user input. Get the breakdown, because that’s what we’re paying for on every call.
Where would a smaller context plus better retrieval beat this? If retrieval can solve 80% of our queries with 5,000 tokens, the long-context model is overkill for those queries. What’s the routing strategy?
