Scaling Laws

The empirical observation that justified $100B+ in AI capex. Why the curves are bending — not breaking — and what that does to your vendor pricing forecasts.

The Technical Definition

Scaling laws are the empirical observation that large language model performance improves predictably as you increase three inputs: compute (FLOPs spent on training), data (tokens seen during training), and parameters (the size of the model). The first formal version came from Kaplan et al. at OpenAI in 2020. Two years later, the Chinchilla paper from DeepMind corrected the recipe — showing that earlier models were under-trained on data relative to their size, and that compute-optimal models need roughly 20 tokens of training data per parameter.

The practical implication: if you spend 10x more on training compute and allocate it correctly between model size and data, the resulting model gets measurably better at a wide range of tasks — not because anyone hand-coded the improvement, but because the curve says it will.
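Here is a rough sketch of what "allocate it correctly" looks like in numbers. It combines the Chinchilla rule of thumb of about 20 tokens per parameter with a widely used approximation that training costs about 6 FLOPs per parameter per token. That approximation, the function name, and the example budget are assumptions added for illustration, not values fitted in either paper.

```python
import math

TOKENS_PER_PARAM = 20      # Chinchilla rule of thumb cited above
FLOPS_PER_PARAM_TOKEN = 6  # assumed approximation: compute ~ 6 * params * tokens


def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Return (parameters, tokens) for a training budget under the two assumptions above."""
    # compute = 6 * N * D and D = 20 * N, so compute = 120 * N^2
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


if __name__ == "__main__":
    base = 1e23  # hypothetical budget in FLOPs
    for budget in (base, 10 * base):
        n, d = compute_optimal_split(budget)
        print(f"{budget:.0e} FLOPs -> {n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens")
```

Under these assumptions, 10x more compute buys a model about 3.2x larger, trained on about 3.2x more data. It does not buy a model 10x the size.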

What This Actually Means for Your Business

Scaling laws are the reason your AI vendor’s pricing has been falling and the reason it might stop falling soon.

For five years, every major lab bet that throwing more compute at the problem would keep producing better models. That bet was right. GPT-4, Claude, Gemini — all of them are products of the scaling-laws thesis. The capex commitments you’ve read about — Microsoft’s $80B, Meta’s $65B, the trillion-dollar collective spend through 2030 — are not speculation. They’re the labs running the curve forward.

What’s changed in the last 18 months: the pretraining curve is bending. Adding more training data and parameters is producing smaller gains than it did between GPT-3 and GPT-4. That doesn’t mean scaling is dead. It means the labs have shifted compute from pretraining to two new axes: post-training (reinforcement learning, fine-tuning) and inference-time compute (letting the model “think longer” at runtime). Both axes still scale. The cost curve you care about hasn’t broken; it has been redistributed.

For you, this matters for three reasons. First, the deflation in token pricing you’ve seen since 2023, a drop of roughly 90% on equivalent capability, is slowing. Don’t model another 10x price drop in the next 24 months. Second, the most capable models now spend significant compute at inference time, which means your per-query cost on hard tasks goes up, not down. Third, the labs are increasingly differentiated by post-training expertise rather than raw model size, which makes “we use the best model” a harder claim to verify.
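A back-of-the-envelope sketch of the first two points. Every price and token count below is a hypothetical placeholder. The assumption that a model's runtime "thinking" is billed as output tokens is also illustrative, so check it against your vendor's actual pricing.

```python
def query_cost(input_tokens: int, output_tokens: int,
               price_in_per_m: float, price_out_per_m: float,
               reasoning_tokens: int = 0) -> float:
    """Dollar cost of one query; reasoning tokens are assumed to bill as output."""
    billed_out = output_tokens + reasoning_tokens
    return (input_tokens * price_in_per_m + billed_out * price_out_per_m) / 1_000_000


# Hypothetical prices in dollars per million tokens.
PRICE_IN, PRICE_OUT = 3.00, 15.00

plain = query_cost(2_000, 500, PRICE_IN, PRICE_OUT)
thinking = query_cost(2_000, 500, PRICE_IN, PRICE_OUT, reasoning_tokens=10_000)

print(f"plain answer:    ${plain:.4f}")
print(f"with 'thinking': ${thinking:.4f} ({thinking / plain:.1f}x)")

# A 90% price drop and a 10x price drop are the same thing.
print(f"plain answer after a 90% drop: ${plain * (1 - 0.90):.4f}")
```

The list price per token is the same in both calls. The bill differs because the hard query consumes far more tokens, which is why a falling token price does not guarantee a falling cost per task.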

Reality Check

What the vendor says: “AI is going to keep getting cheaper and better forever — just sign the multi-year contract.”

What that means in practice: Token costs for current-tier models will keep falling. Costs for the frontier — the models you actually want for hard reasoning — will not fall the same way, because the frontier now consumes inference compute, not just training compute. Build your contract assumptions around what frontier capability costs, not what last year’s model costs.

What Operators Actually Do

The companies that read scaling laws correctly don’t bet on a single model tier. They build for substitution. They run their workloads against three models (typically a frontier model, a mid-tier model, and a cheap commodity model) and route each task to the cheapest model that clears the quality bar. As the curve moves, the routing rules get updated. The capability that needed GPT-4 in 2024 runs on a mid-tier model in 2026 for a tenth of the cost. If the pattern holds, the capability that needs three minutes of inference compute in 2026 will run near-instantly by 2028.
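The routing rule can be sketched in a few lines. The tier names, costs, and quality scores below are hypothetical. In practice the scores would come from your own evaluation set, and `cheapest_passing` is an illustrative helper, not any vendor's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelTier:
    name: str             # hypothetical tier label, not a real product
    cost_per_task: float  # dollars per task, measured on your own workload
    quality: dict         # task type -> score from your own evaluation set


def cheapest_passing(tiers: list, task_type: str, quality_bar: float) -> Optional[ModelTier]:
    """Return the cheapest tier whose measured score meets the bar, or None."""
    passing = [t for t in tiers if t.quality.get(task_type, 0.0) >= quality_bar]
    return min(passing, key=lambda t: t.cost_per_task, default=None)


TIERS = [
    ModelTier("frontier", 0.40, {"contract_review": 0.93, "ticket_tagging": 0.97}),
    ModelTier("mid", 0.04, {"contract_review": 0.81, "ticket_tagging": 0.96}),
    ModelTier("commodity", 0.004, {"contract_review": 0.55, "ticket_tagging": 0.91}),
]

for task, bar in [("contract_review", 0.90), ("ticket_tagging", 0.90)]:
    choice = cheapest_passing(TIERS, task, bar)
    print(task, "->", choice.name if choice else "no tier clears the bar")
```

Updating the routing rules as the curve moves then means re-running the evaluation and refreshing the scores and costs. The calling code stays the same.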

Smart finance teams also separate two budget lines: a capability budget (what the frontier costs for the tasks that need it) and a volume budget (what commodity inference costs across the rest of the business). The capability line stays roughly flat. The volume line keeps falling. Treating them as one line gets your forecasts wrong in both directions.
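To see why one line misleads, here is a small sketch with made-up numbers. The starting budgets, the flat capability line, the volume line halving each year, and constant usage are all assumptions chosen for illustration.

```python
def forecast(start: float, annual_change: float, years: int) -> list:
    """Yearly spend for a budget line that changes by a fixed rate each year."""
    return [start * (1 + annual_change) ** y for y in range(years + 1)]


# Hypothetical starting budgets (dollars per year) and rates of change.
capability = forecast(1_000_000, 0.00, years=3)  # assumed flat
volume = forecast(1_000_000, -0.50, years=3)     # assumed to halve each year
separate = [c + v for c, v in zip(capability, volume)]

# Single-line forecast: apply the volume deflation to the whole budget.
blended = forecast(2_000_000, -0.50, years=3)

for year, (s, b) in enumerate(zip(separate, blended)):
    print(f"year {year}: two lines ${s:,.0f} vs one line ${b:,.0f}")
```

With these inputs the single deflating line undershoots the two-line forecast by year three. Holding the whole budget flat instead would overshoot it. That is the "wrong in both directions" problem.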

The pattern that fails: signing a three-year exclusive with one vendor based on this year’s pricing. The curve will move. Your contract should let you move with it.

The Questions to Ask

Which axis is your vendor scaling on? Pretraining, post-training, or inference-time compute? The cost behavior is different on each. If they can’t articulate this, they don’t know what they’re charging you for.
What does your task actually need? Frontier reasoning, or commodity classification? The price gap between the two is now roughly 50x. Most enterprise workloads don’t need frontier capability, but the ones that do can’t be served by anything else.
How portable is your stack? If a competing model gets 30% better or 50% cheaper next quarter, can you switch in a sprint, or are you locked in for two more years?
