Model Distillation

Training a small model to copy a big one. The technique behind every cheap, fast tier you're using — and the reason closed-source pricing keeps falling.

Models & Architecture

The Technical Definition

Model distillation is a training technique where a smaller “student” model is trained to mimic the outputs of a larger “teacher” model. Instead of training the small model from scratch on raw text — which is expensive and produces mediocre results at small scale — you generate vast amounts of high-quality output from the large model and use that as training data for the small one.

The student doesn’t learn the world directly. It learns to imitate a teacher that already learned the world. The result, when it works, is a small model that punches well above its weight class on the kinds of tasks the teacher was good at, at a fraction of the inference cost.

GPT-3.5-turbo is widely understood to be a distilled descendant of larger OpenAI models. Claude Haiku and Claude Sonnet are distilled from larger Anthropic models. Most of the small open-source models that benchmark surprisingly well — Phi, Gemma, the smaller Llamas — got there via heavy use of distillation from frontier teachers.

What This Actually Means for Your Business

Distillation is the reason your AI costs keep falling without you doing anything.

When OpenAI launched GPT-4, the per-token price was high enough that any high-volume use case had to be triaged carefully. Two years later, GPT-4-class quality is available at a tenth of the price, sometimes a hundredth, in models specifically engineered to be small and fast. Anthropic, Google, and the open-source ecosystem followed the same arc. The technique that made it possible is distillation.

What this means commercially: the closed-source pricing premium is being eaten from below. A frontier lab charges a high price for its top model. A few months later, that same lab releases a cheaper distilled version. Months after that, an open-source team releases a distilled model that runs on your own hardware for the cost of electricity. The premium tier keeps moving, but the floor keeps rising.

For a CEO making multi-year commitments, this changes the math on long-term vendor lock-in. Whatever you are paying per token today, you will be paying meaningfully less in twelve months — either from the same vendor, a competitor, or a self-hosted distilled model that didn’t exist when you signed the contract. The ROI math on AI projects should assume the unit cost of intelligence is declining, not flat.

Reality Check

What the vendor says: “We’ve developed a proprietary small model optimized for your use case.”

What that means in practice: They almost certainly distilled it from a larger commercial model, possibly without permission. The capability is real. The “proprietary” framing usually isn’t. Ask what the teacher model was, and whether they have rights to use it that way.

What Operators Actually Do

Teams that have been running AI in production for more than a year stopped picking a model and standing still. They pick a model for now, and they re-evaluate every six months. The distilled version of last year’s frontier model is often this year’s right answer for production workloads — same quality on the tasks you care about, a fraction of the cost.

They also separate the workload by criticality. The hardest queries — novel reasoning, edge cases, anything customer-facing where mistakes hurt — go to the frontier model. The high-volume, well-understood queries — classification, summarization, routine Q&A — go to a distilled model. You pay frontier prices only where frontier capability earns its keep.

For internal tooling especially, distilled open-source models hosted on your own infrastructure are now a serious option. A Llama-class distilled model running on a single GPU node can handle most internal workflows at a per-query cost approaching zero. The capability gap to frontier closed-source has narrowed enough that the data control, cost, and latency benefits often win.

The Questions to Ask

What was the teacher model, and what is the licensing position? A distilled model is only as defensible as the rights to its teacher. If your vendor distilled from a model whose terms of service prohibit it, you may be inheriting a legal exposure.
How was the distillation evaluated against the tasks we actually run? Distilled models keep most of the teacher’s capability, not all of it. Make the vendor show you performance on your workload, not on public benchmarks.
What’s our re-evaluation cadence? The economic case for any given model has a shelf life of about six to twelve months right now. If you don’t have a re-evaluation discipline, you will be paying last year’s prices for last year’s capability long after the market has moved.

The Technical Definition

What This Actually Means for Your Business

Reality Check

What Operators Actually Do

The Questions to Ask

One operator. Every other Wednesday.