Mixture of Experts (MoE)
An architecture where most of the model sits idle on every query. Cheaper to run, harder to deploy, and quietly behind half the frontier models you've heard of.
The Technical Definition
Mixture of Experts (MoE) is a model architecture where the network is split into many smaller sub-networks, called experts, and only a small subset of them activates for any given input token. A router inside the model decides which experts handle which token.
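To make the router concrete, here is a minimal sketch of top-k routing for a single token, in plain Python. The scores, the expert count of 8, and the choice of k = 2 are illustrative assumptions, and `route_token` is a hypothetical helper, not any specific model's implementation. Production routers differ in details such as how weights are normalized and whether some experts are always on.

```python
import math

def softmax(scores):
    """Convert raw router scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k=2):
    """Pick the top_k experts for one token.

    Returns a list of (expert_index, weight) pairs, with weights
    renormalized so the chosen experts' weights sum to 1.
    """
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

# Hypothetical router scores for one token across 8 experts (assumed values).
scores = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
selected = route_token(scores, top_k=2)

for expert, weight in selected:
    print(f"expert {expert} handles this token with weight {weight:.2f}")
```

Only the two selected experts run for this token. The other six are skipped entirely, which is where the compute savings come from.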
A 400-billion-parameter MoE model might only activate 30 billion parameters per token. The other 370 billion sit idle for that token, then a different subset wakes up for the next one. From the user’s perspective, you get the knowledge and capability of a very large model. From a compute perspective, you pay only for the parameters that ran: the selected experts plus the shared layers, such as attention, that every token passes through.
Mixtral 8x7B, DeepSeek-V3, and Databricks’ DBRX are open implementations. GPT-4 is widely believed to be a Mixture of Experts, though OpenAI has never confirmed it. Anthropic’s Claude family is not publicly characterized as MoE, but the boundary between dense and sparse architectures is increasingly fuzzy at the frontier.
What This Actually Means for Your Business
The reason MoE matters to anyone running AI at scale: it breaks the linear relationship between model size and inference cost. A dense 400B model costs roughly thirteen times as much per token to run as a dense 30B model (400 ÷ 30 ≈ 13), because every parameter participates in every token. An MoE model with 400B total parameters but 30B active per token costs closer to the 30B model at inference time, while approaching the quality of the 400B model on tasks where the right experts are available.
That is the entire pitch. More capability, less spend per query, at the cost of more complexity in training, deployment, and serving.
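The arithmetic behind that pitch can be sketched in a few lines. This assumes the common rule of thumb that a forward pass costs about 2 floating-point operations per active parameter per token. It ignores attention over long contexts, memory bandwidth, and batching effects, so treat the output as a rough ratio, not a price quote. The function name is hypothetical.

```python
def forward_flops_per_token(active_params):
    """Rough forward-pass compute per token: ~2 FLOPs per active parameter (rule of thumb)."""
    return 2 * active_params

dense_400b = forward_flops_per_token(400e9)  # dense: all 400B parameters run
dense_30b = forward_flops_per_token(30e9)    # dense: all 30B parameters run
moe_400b = forward_flops_per_token(30e9)     # MoE: 400B total, only 30B active

dense_ratio = dense_400b / dense_30b  # how much more the big dense model costs
moe_ratio = moe_400b / dense_30b      # how the MoE compares to the small dense model

print(f"dense 400B vs dense 30B: {dense_ratio:.1f}x compute per token")
print(f"MoE 400B (30B active) vs dense 30B: {moe_ratio:.1f}x compute per token")
```

Under these assumptions the MoE lands at the same per-token compute as the dense 30B model. In practice routing overhead and communication between devices push it somewhat higher, which is why the text says "closer to" and not "equal to".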
The complexity is real. MoE models are harder to train stably: the router has to learn to balance load across experts, and a router that funnels most tokens to a few favorites leaves the other experts undertrained and drags down quality. They are harder to host because the full parameter count still has to live in memory somewhere, even if only a fraction is active per token. They are harder to fine-tune because techniques developed for dense models do not always transfer cleanly once a router decides which experts see which tokens.
For a CEO, the practical translation is this: when your team or vendor talks about model selection, MoE explains why a model that is “huge” can still be cheap to query, and why a model that is “small” can still be expensive to host. Total parameters tell you about hosting cost. Active parameters tell you about inference cost. They are not the same number anymore.
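A back-of-the-envelope sketch shows why the two numbers diverge. It assumes 16-bit weights (2 bytes per parameter) and counts only the weights themselves. Real deployments also need memory for the KV cache and activations, and quantization can shrink the footprint, so the figures are a floor under stated assumptions. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(total_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9

# Assumed example models: (total parameters, active parameters per token).
models = {
    "MoE 400B total / 30B active": (400e9, 30e9),
    "Dense 30B": (30e9, 30e9),
}

footprints = {}
for name, (total, active) in models.items():
    footprints[name] = weight_memory_gb(total)
    print(f"{name}: ~{footprints[name]:.0f} GB of weights, "
          f"{active / 1e9:.0f}B parameters of compute per token")
```

Both models do about the same work per token, but the MoE needs more than thirteen times the memory to stay loaded. That gap is the hosting bill.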
Reality Check
What the vendor says: “We use a 400-billion-parameter model — the largest in the industry.”
What that means in practice: Probably an MoE model where 30 to 50 billion parameters actually run per token. The 400B figure is real but partially marketing. What matters is active parameters per token, latency, and cost per million tokens, not the headline size.
What Operators Actually Do
Teams making serious model decisions stopped quoting raw parameter counts about a year ago. They quote three numbers: total parameters (drives hosting), active parameters per token (drives inference cost and latency), and benchmark performance on tasks that match their actual workload.
They also pay attention to deployment options. Some MoE models are practical to self-host on a single node. Others require multi-node setups that exceed the operational appetite of a mid-market IT function, which means you are committing to a managed inference provider whether you wanted to or not. That choice has implications for data residency, latency, and lock-in that most pilots skip past.
The pattern that works: treat MoE as a cost-and-capability story, not a status story. If your workload has predictable patterns — customer service, contract review, document summarization — an MoE model can deliver near-frontier quality at roughly mid-tier cost, which is exactly what most enterprise budgets need. If your workload is irregular or low-volume, the operational complexity of hosting MoE may not be worth the savings, and a smaller dense model is the cleaner answer.
The Questions to Ask
- What are the active parameters per token, and what does that translate to in cost per million tokens for our expected volume? Total parameter count is a vanity metric. Active parameters and per-token cost are the operating metrics.
- Are we self-hosting this, or are we committed to a managed inference provider? MoE models often push you toward managed hosting whether the pilot acknowledged it or not. Get clarity before you scale.
- How does the router behave under our specific workload? MoE quality depends on whether the right experts get activated for your domain. If the router was trained on general web text and your queries are pharmaceutical compliance, performance can degrade in ways that don’t show up on public benchmarks.
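For teams that self-host an open-weight model and can log which experts the router picks, the last question can be given a first-pass number. The sketch below is one possible check, not a standard metric: it assumes you already have a list of expert indices recorded for a sample of your own traffic (the log here is made up), and it reports how far the busiest expert sits above an even split. A skewed result is a prompt to evaluate quality on your own tasks, not proof of a problem.

```python
from collections import Counter

def expert_load_shares(assignments, num_experts):
    """Fraction of routed tokens that went to each expert."""
    if not assignments:
        raise ValueError("need at least one routing decision")
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]

def max_over_uniform(shares):
    """Busiest expert's share divided by an even split (1.0 means perfectly balanced)."""
    return max(shares) * len(shares)

# Hypothetical routing log: the expert chosen for each of 10 tokens, 4 experts total.
routing_log = [0, 0, 0, 1, 0, 2, 0, 3, 0, 0]

shares = expert_load_shares(routing_log, num_experts=4)
skew = max_over_uniform(shares)

print("share per expert:", shares)
print(f"busiest expert carries {skew:.1f}x its even share")
```

Managed inference providers generally do not expose routing decisions, so with a hosted model the practical substitute is to benchmark on a sample of your own workload.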