Glossary / Models & Architecture

Quantization

Compressing a model's weights from 32-bit numbers to 4-bit numbers. Four to eight times smaller, two to four times faster, and the reason a 70-billion-parameter model now fits on a laptop.

Models & Architecture

The Technical Definition

Quantization is the process of reducing the numerical precision of a model’s weights. A standard model stores each weight as a 32-bit floating point number. Quantization rewrites those weights using fewer bits — 16-bit, 8-bit, 4-bit, sometimes lower — at the cost of some precision.

The math is straightforward. A 70-billion-parameter model at 32-bit precision needs roughly 280 gigabytes of memory to load. At 8-bit, it needs 70 gigabytes. At 4-bit, it needs 35 gigabytes — small enough to run on a single high-end consumer GPU, or on a well-specced laptop. The same model, the same parameters, just stored more compactly.

The surprise, and it is genuinely surprising, is that the accuracy loss is small. A well-quantized 4-bit model typically retains 95 to 99 percent of the original’s benchmark performance. For most enterprise workloads, the difference is undetectable.

What This Actually Means for Your Business

Quantization is one of two techniques (the other being distillation) that broke the assumption that frontier-class AI requires data center infrastructure.

In 2023, running a 70-billion-parameter model required a multi-GPU cluster and serious cloud spend. In 2026, a quantized version of that same model runs on a single workstation. The model didn’t get smaller. The storage of the model got more efficient.

The commercial implications stack up in three places. First, cost: self-hosted quantized models can serve internal workloads at a per-query cost approaching electricity. The cloud-API line item that used to dominate AI budgets becomes optional for a growing share of use cases. Second, control: quantized models can run inside your own network, which moves the data residency, compliance, and audit conversation from “trust the vendor” to “we control the substrate.” Third, latency: an on-premises quantized model can respond in milliseconds instead of the half-second round trip to a cloud API, which matters for any embedded or real-time use case.

The catch is that quantization is a tradeoff dial, not a free lunch. Aggressive quantization (2-bit, sub-2-bit) starts to degrade performance on hard reasoning tasks in ways that don’t show up on simple benchmarks but do show up in production. The right precision for a customer-facing legal Q&A system is not the right precision for an internal log-summarization tool. Treating quantization as one-size-fits-all is the most common mistake.

Reality Check

What the vendor says: “We can run a 70B model on your existing hardware.”

What that means in practice: They will quantize it to 4-bit or lower, and it will run. Whether the quantized version performs well enough on your specific task is a separate question, and one you should answer empirically before signing anything.

What Operators Actually Do

Teams that have moved past pilot stage on self-hosted AI quantize aggressively for internal tooling and conservatively for anything customer-facing. A 4-bit model is fine for an engineer asking questions about an internal codebase. A 4-bit model is not fine for a regulated financial advisory chatbot, where the small accuracy loss can compound into a compliance issue.

They benchmark the quantized model against the full-precision version on their own workload, not on public datasets. Public benchmarks are designed to be hard but representative; your workload is neither. The only meaningful test is “does the quantized model get the answers right on the tasks our employees and customers actually run.”

They also pay attention to which quantization technique was used. GPTQ, AWQ, GGUF, and various proprietary methods produce meaningfully different accuracy profiles at the same bit-width. A 4-bit model quantized one way can score noticeably better than the same model quantized another way. This is an area where the team running the deployment needs actual technical judgment, not just procurement instinct.

The pattern that works: quantize to the most aggressive level your task can tolerate, measured against full-precision baseline on your actual data. Re-test whenever the underlying model is upgraded. The cost-and-latency benefits compound across thousands of queries per day.

The Questions to Ask

  1. At what bit-width was the model quantized, and how was that level chosen? If the answer is “the default” or “whatever fit on the hardware,” you are buying a configuration nobody validated against your workload.

  2. What’s the measured accuracy delta between the quantized and full-precision versions on our specific tasks? Public benchmarks are not your tasks. Make the team show you side-by-side outputs on real examples.

  3. What’s the failure mode when the quantization is too aggressive? Quantized models don’t crash gracefully — they get subtly worse at hard reasoning before they fail outright. Someone needs to be watching for that drift, and it should not be the same person who deployed the model.

Get the next Brief

One operator. Every other Wednesday.

Plus the AI Glossary and the Failure Museum.
Real names. Real numbers. Honest analysis.