Chain-of-Thought
The technique that makes reasoning models work — and the buzzword every vendor has now claimed, whether they actually do it or not.
The Technical Definition
Chain-of-thought (CoT) is a prompting and training technique that gets a language model to produce step-by-step reasoning before it produces a final answer. The term comes from a 2022 Google paper showing that a few worked examples with written-out reasoning, placed in the prompt, sharply improved accuracy on math word problems. A second 2022 paper showed that simply adding “Let’s think step by step” to a prompt, with no examples at all, could produce large gains on the same kind of problem. Since then, CoT has graduated from a prompting trick to a training method. Modern reasoning models are trained with reinforcement learning specifically to produce long internal chains of thought before answering.
CoT comes in two flavors. Visible CoT shows the user every step the model takes, which is useful for transparency, debugging, and trust. Hidden CoT keeps the raw reasoning internal and shows only the final answer, sometimes with a short summary of the reasoning. This is what OpenAI does with o1 and o3, partly to protect their training methods, partly because the raw reasoning is verbose and not always useful to display.
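As a rough illustration of the prompting version and of the visible/hidden split, here is a minimal Python sketch. The suffix wording, the "Final answer:" marker, and the function names are all assumptions made for this example, not any vendor's API, and the model call itself is left out.

```python
from __future__ import annotations

# Hypothetical instruction appended to every prompt (an assumption for this sketch).
COT_SUFFIX = (
    "Let's think step by step, then give the result on a last line "
    "starting with 'Final answer:'."
)
ANSWER_MARKER = "Final answer:"


def build_cot_prompt(question: str) -> str:
    """Wrap a plain question in a step-by-step instruction."""
    return f"{question.strip()}\n\n{COT_SUFFIX}"


def split_reasoning(model_output: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, final_answer)."""
    reasoning, marker, answer = model_output.rpartition(ANSWER_MARKER)
    if not marker:
        # No marker found: treat the whole output as the answer.
        return "", model_output.strip()
    return reasoning.strip(), answer.strip()


def render(model_output: str, show_reasoning: bool) -> str:
    """Visible CoT returns reasoning plus answer; hidden CoT returns the answer only."""
    reasoning, answer = split_reasoning(model_output)
    if show_reasoning and reasoning:
        return f"{reasoning}\n\n{answer}"
    return answer
```

Calling `render(output, show_reasoning=False)` gives the hidden-CoT behavior, and `True` gives the visible one. The reasoning is generated and paid for either way; only the display changes.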
What This Actually Means for Your Business
CoT is the reason reasoning models exist. It’s also the reason your team’s prompts are getting longer. Almost every prompt-engineering best practice now includes some version of “ask the model to reason first, then answer.” Done right, it can cut hallucination rates and improve accuracy on anything that requires more than a one-step lookup. Done wrong, it can double your token costs, or worse, without improving anything.
Vendors love to claim “advanced chain-of-thought reasoning” as a feature. In practice, this means one of three things, and the difference matters. First: they’re using a reasoning model under the hood (real CoT, expensive, useful for hard problems). Second: they’re prompting a standard model to reason step-by-step before answering (cheap CoT, decent quality bump on medium-hard problems, easy to replicate). Third: they’re putting the words “step-by-step thinking” in their marketing copy and doing nothing of the sort.
The third category is more common than you’d think. The technique is well-understood and freely available — there’s no moat in claiming you do it. The moat is in how you tune it for your specific workflow, what data you train on, and how you measure whether it actually helps.
CoT also has a dark side worth understanding. A model producing visible reasoning can sound dramatically more confident than its underlying accuracy warrants. Five paragraphs of careful-looking step-by-step logic ending in a wrong answer is harder for a human reviewer to catch than a one-line wrong answer. The reasoning becomes a credibility halo.
Reality Check
What the vendor says: “Our platform uses chain-of-thought reasoning to deliver explainable, transparent AI decisions.”
What that means in practice: The model produces a paragraph of reasoning before each answer. That paragraph might be the actual reasoning the model used, or it might be a plausible-sounding rationalization generated after the fact. There’s no way to tell from the output alone, and “explainability” gets used loosely. Ask whether the visible reasoning is provably the same path that produced the answer.
What Operators Actually Do
Teams that get real lift from CoT treat it as a tool, not a feature. They use it for problems that benefit — multi-step analysis, code review, financial calculations, anything where one missed step changes the answer. They skip it for simple lookups, classification, and short-form content where the extra tokens just slow things down and inflate the bill.
The other pattern: CoT as an audit layer, not a customer-facing display. The model reasons internally, produces an answer, and an operator can pull the chain of thought when something goes wrong. That gives you the debugging benefit without the customer-facing risk of long, slow, sometimes-wrong-looking reasoning paragraphs.
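One way to picture the audit-layer pattern is the sketch below: the reasoning goes into an internal log keyed by a request ID, and the customer-facing response carries only the answer and that ID. The class and function names are hypothetical, and the in-memory dictionary stands in for whatever storage a real system would use.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ReasoningAuditLog:
    """In-memory stand-in for an internal store of chains of thought."""

    _records: Dict[str, str] = field(default_factory=dict)

    def record(self, reasoning: str) -> str:
        """Store the reasoning and return the ID an operator can look it up by."""
        request_id = uuid.uuid4().hex
        self._records[request_id] = reasoning
        return request_id

    def lookup(self, request_id: str) -> Optional[str]:
        """Return the stored reasoning, or None if the ID is unknown."""
        return self._records.get(request_id)


def customer_response(log: ReasoningAuditLog, reasoning: str, answer: str) -> dict:
    """Build the customer-facing payload: the answer and an ID, never the reasoning."""
    request_id = log.record(reasoning)
    return {"request_id": request_id, "answer": answer}
```

When something goes wrong, an operator calls `log.lookup(request_id)` with the ID from the customer's response and reads the chain of thought behind that answer.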
Smart teams also test CoT empirically. They run their prompts with and without it on 50 representative tasks and measure whether accuracy actually improved enough to justify the cost. Often it does. Often it doesn’t. The only way to know is to measure on your work, not on someone else’s benchmark.
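The with-and-without comparison can be sketched in a few lines of Python. This is a minimal harness under stated assumptions: each variant is a function that takes a prompt and returns an answer plus the tokens it used, and answers are scored by exact match, which will be too strict for free-form tasks. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

Task = Tuple[str, str]                      # (prompt, expected answer)
ModelFn = Callable[[str], Tuple[str, int]]  # returns (answer, tokens used)


@dataclass
class Result:
    accuracy: float
    avg_tokens: float


def evaluate(model_fn: ModelFn, tasks: Sequence[Task]) -> Result:
    """Run one variant over the task set and score it by exact match."""
    if not tasks:
        raise ValueError("need at least one task")
    correct = 0
    tokens = 0
    for prompt, expected in tasks:
        answer, used = model_fn(prompt)
        if answer.strip().lower() == expected.strip().lower():
            correct += 1
        tokens += used
    return Result(correct / len(tasks), tokens / len(tasks))


def compare(plain_fn: ModelFn, cot_fn: ModelFn, tasks: Sequence[Task]) -> Dict[str, float]:
    """Report the accuracy gain from CoT next to its token-cost multiplier."""
    plain = evaluate(plain_fn, tasks)
    cot = evaluate(cot_fn, tasks)
    ratio = cot.avg_tokens / plain.avg_tokens if plain.avg_tokens else float("inf")
    return {
        "plain_accuracy": plain.accuracy,
        "cot_accuracy": cot.accuracy,
        "accuracy_gain": cot.accuracy - plain.accuracy,
        "token_ratio": ratio,
    }
```

Feed it your own representative tasks and read the two numbers together: an accuracy gain of a few points at three times the tokens is a different decision from a large gain at the same multiplier.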
The Questions to Ask
- Are you using a reasoning model, or just prompting a standard model to think step by step? Both are valid. They cost very different amounts and perform differently on hard problems. You should know which you’re paying for.
- Is the visible reasoning the actual reasoning, or a post-hoc rationalization? If the vendor claims explainability, ask how they verify the displayed chain of thought matches the path the model actually took.
- On our specific tasks, does CoT measurably improve accuracy enough to justify the extra cost? If the vendor can’t show you that comparison on your work, they’re selling a feature, not an outcome.