Model Serving
Taking a trained model and actually making it answer requests in production. Harder than training it in most cases.
The Technical Definition
Model serving is the infrastructure that takes a trained machine learning model and makes it available to answer requests in production. It’s the bridge between a model that exists in a Jupyter notebook and a model that processes thousands of predictions per second for your customers.
A model serving system typically handles model versioning, request routing, scaling based on traffic, A/B testing different model versions, and monitoring prediction quality in real time. Popular frameworks include TensorFlow Serving, KServe, Seldon, and Ray Serve. But the framework is the least important part — the architecture decisions you make around batching, caching, and fallback logic are what determine whether your inference infrastructure is reliable or a liability.
What This Actually Means for Your Business
Here’s what you hear: “Train a model. Serve it to production. Done.”
Here’s what actually happens: Training a model is the easy part. Serving it reliably at scale is where everything gets complicated.
The moment your model goes live, you inherit a new set of problems that don’t exist during development. You need traffic-dependent scaling, because inference load isn’t steady; it spikes. A marketing campaign drives 10x normal traffic to your recommendation model. Your serving infrastructure needs to add capacity in seconds, then scale back down when traffic drops. Overprovisioning is expensive. Underprovisioning causes timeouts.
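As a rough illustration of the scaling decision, here is a minimal sketch of a replica calculation. The function name, the 25% headroom, and the replica bounds are hypothetical values chosen for the example, not settings from any particular autoscaler.

```python
import math


def desired_replicas(current_rps: float, rps_per_replica: float,
                     headroom: float = 0.25, min_replicas: int = 2,
                     max_replicas: int = 50) -> int:
    """Replicas needed to serve current_rps with spare headroom.

    All defaults are illustrative assumptions; tune them per workload.
    """
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    needed = math.ceil(current_rps * (1 + headroom) / rps_per_replica)
    # Clamp so the service never scales to zero or grows without bound.
    return max(min_replicas, min(max_replicas, needed))


# Normal traffic versus a 10x campaign spike, assuming 100 req/s per replica.
normal = desired_replicas(current_rps=200, rps_per_replica=100)
spike = desired_replicas(current_rps=2000, rps_per_replica=100)
print(normal, spike)
```

The floor matters as much as the ceiling: keeping a couple of warm replicas is what lets the system absorb the first seconds of a spike while new capacity starts.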
You need version management and rollout safety. You want to deploy a new model because it’s 2% more accurate. But deploying to 100% of traffic at once means if something goes wrong, everything breaks. Smart serving infrastructure lets you gradually roll out the new model, monitor its performance in production, and instantly roll back if metrics drift. Most teams don’t have this, and they pay for it when a “better” model tanks production.
You need inference monitoring that actually works. In production, you can’t wait for monthly performance reviews. If your model starts making bad predictions (or if request patterns shift), you need to know in minutes. That means tracking prediction latency, error rates, data distribution drift, and business metrics (did the recommendation actually get clicked?) continuously.
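One way to put a number on data distribution drift is the population stability index (PSI), which compares a live feature sample against a training-time baseline. The sketch below is a plain-Python version under simple assumptions (equal-width bins taken from the baseline range, a small floor to avoid log of zero); the 0.25 alert threshold is a commonly cited rule of thumb, not a guarantee, and should be tuned.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population stability index between a baseline and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range land in the first or last bin.
            counts[max(0, min(bins - 1, idx))] += 1
        # Floor each fraction so empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]        # feature values seen in training
shifted = [0.5 + i / 200 for i in range(100)]   # live traffic skewed upward

DRIFT_THRESHOLD = 0.25  # assumed alert level
print(psi(baseline, baseline), psi(baseline, shifted) > DRIFT_THRESHOLD)
```

Running a check like this per feature on a short schedule is what turns "we found out at the monthly review" into "we found out in minutes".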
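You also need to handle the economics of inference. A cloud GPU costs $2-5 per hour. A GPU that sustains 100 requests per second handles 360,000 predictions per hour, which works out to roughly $0.000006 to $0.000014 per prediction. That figure only holds at full utilization: at 10 requests per second, the same GPU costs roughly $0.00006 to $0.00014 per prediction. If your business model charges $0.0001 per prediction, an underutilized GPU means you’re losing money. Smart serving infrastructure batches requests, caches results, and quantizes models to run on cheaper CPU-based hardware. Model serving is actually a cost optimization problem disguised as a technical one.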
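The per-prediction arithmetic is easy to check directly: hourly GPU cost divided by predictions served per hour. The sketch below uses the $2-5 hourly range from the text; the function name and the traffic levels are illustrative assumptions.

```python
def cost_per_prediction(gpu_cost_per_hour: float, requests_per_second: float) -> float:
    """Hardware cost of one prediction at a sustained request rate."""
    predictions_per_hour = requests_per_second * 3600
    return gpu_cost_per_hour / predictions_per_hour


PRICE_PER_PREDICTION = 0.0001  # what the business charges, from the text

for rps in (100, 10, 1):
    low = cost_per_prediction(2.0, rps)
    high = cost_per_prediction(5.0, rps)
    verdict = "profitable" if high < PRICE_PER_PREDICTION else "at risk"
    print(f"{rps:>4} req/s: ${low:.7f} to ${high:.7f} per prediction ({verdict})")
```

The same GPU swings from comfortably profitable to loss-making purely on utilization, which is why batching and caching belong in the cost conversation.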
Reality Check
What the vendor says: “Deploy your model with one click and serve unlimited inference scale.”
What that means in practice: The model runs fine for the first hour. When traffic spikes at 8 AM, latency goes from 50ms to 2 seconds. At noon, a new model version deploys and breaks something subtle — predictions are technically correct but systematically biased in a way that breaks downstream logic. Rollback takes 15 minutes. The “one click” deployment cost you 8 hours of lost revenue.
What Operators Actually Do
The serving patterns that work start with clear tiering of inference load. High-volume, low-latency predictions (real-time recommendations) go to dedicated, optimized infrastructure. Medium-volume, latency-tolerant work (batch scoring overnight) runs on cheaper batch hardware. One-off, complex inferences (audit requests, retrospective analysis) run on-demand without dedicated resources. Most teams try to serve everything on the same infrastructure and end up with nothing performing well.
For traffic-critical models, the second pattern is canary deployments and feature flags. New model versions deploy to 5% of traffic, not 100%. Monitoring watches for any degradation in latency, error rate, or business metrics. Only if everything looks good does the rollout gradually proceed to 100%. If anything goes wrong, the flag flips back in seconds. This feels slow compared to the “deploy everything immediately” approach, but it prevents catastrophic failures.
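A canary rollout needs two pieces: a stable way to split traffic and a rule for advancing or reverting. The following is a minimal sketch; the function names, metric keys, rollout steps, and thresholds are all hypothetical and would normally come from your own monitoring and flag systems.

```python
import hashlib
from typing import Dict


def route_to_canary(request_key: str, canary_percent: float) -> bool:
    """Deterministically assign a user or request key to the canary model."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < canary_percent * 100


def next_canary_percent(current: float, canary: Dict[str, float],
                        baseline: Dict[str, float],
                        max_latency_ratio: float = 1.2,
                        max_error_delta: float = 0.005) -> float:
    """Advance the rollout one step, or return 0.0 to flip the flag back."""
    latency_ok = canary["p95_latency_ms"] <= baseline["p95_latency_ms"] * max_latency_ratio
    errors_ok = canary["error_rate"] <= baseline["error_rate"] + max_error_delta
    if not (latency_ok and errors_ok):
        return 0.0  # roll back: all traffic returns to the old model
    for step in (5.0, 25.0, 50.0, 100.0):  # assumed rollout schedule
        if step > current:
            return step
    return 100.0


baseline_metrics = {"p95_latency_ms": 50.0, "error_rate": 0.001}
healthy = {"p95_latency_ms": 55.0, "error_rate": 0.002}
degraded = {"p95_latency_ms": 400.0, "error_rate": 0.002}
print(next_canary_percent(5.0, healthy, baseline_metrics))
print(next_canary_percent(5.0, degraded, baseline_metrics))
```

Hashing the key rather than sampling randomly keeps each user on the same model version across requests, which makes business metrics for the two groups comparable.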
The third pattern is aggressive optimization before scaling. Most teams try to solve performance problems by adding hardware. The companies that win optimize first: quantize the model (reduce precision from float32 to int8, often with little or no measurable accuracy loss, though that has to be verified per model), use model distillation (train a smaller model to mimic the large one), batch requests aggressively (waiting up to 100ms to batch 32 requests usually beats running 32 separate GPU passes, as long as the latency budget allows the wait), and cache results (if the same request appears twice in an hour, serve the cached prediction). These often move the needle more than adding a second GPU.
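Batching and caching are the two optimizations above that are simplest to sketch in code. The example below uses only the standard library; `collect_batch` and `TTLCache` are hypothetical names, and the 32-request, 100ms, and one-hour values simply mirror the figures in the text.

```python
import queue
import time
from typing import Any, Callable, Dict, List, Tuple


def collect_batch(requests: "queue.Queue[Any]", max_batch_size: int = 32,
                  max_wait_s: float = 0.1) -> List[Any]:
    """Block for the first request, then gather more until the batch is
    full or max_wait_s has elapsed, whichever comes first."""
    batch = [requests.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


class TTLCache:
    """Serve a stored prediction while it is younger than ttl_s seconds."""

    def __init__(self, ttl_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[Any, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Any, compute: Callable[[Any], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]  # cache hit: no model call
        value = compute(key)
        self._entries[key] = (now, value)
        return value
```

In a real server the batch would go to a single model call and the results would be fanned back out to the waiting requests. This sketch has no eviction, so a production cache would also need a size bound.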
The Questions to Ask
- What’s your model deployment frequency, and how long does a rollback take? If you deploy monthly and rollback takes hours, you can’t iterate fast enough to catch problems. If you deploy daily with 5-minute rollbacks, you can afford to learn in production.
- How do you monitor inference quality in real time without ground truth? You can measure latency and error rates easily. You can’t measure accuracy until users tell you the predictions were wrong. Smart teams monitor proxy signals: are users clicking the recommendations? Are they converting? Is fraud detection catching known-bad patterns? If you’re not monitoring these, a model can degrade without you knowing.
- What’s your inference cost per prediction, and is it economically viable? Calculate total infrastructure cost divided by monthly predictions. If that number exceeds your margin per prediction, your serving architecture is killing your business. Most teams don’t calculate this until it’s too late.
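For the question about monitoring without ground truth, a proxy signal such as click-through rate can be tracked over a rolling window and compared with a known baseline. This is a minimal sketch; the class name, window size, minimum sample count, and 20% drop threshold are hypothetical choices.

```python
from collections import deque
from typing import Deque, Optional


class ProxySignalMonitor:
    """Rolling success rate (e.g. clicks per recommendation) vs. a baseline."""

    def __init__(self, baseline_rate: float, window: int = 1000,
                 max_relative_drop: float = 0.2, min_samples: int = 200) -> None:
        self._baseline = baseline_rate
        self._events: Deque[bool] = deque(maxlen=window)  # oldest events fall off
        self._max_drop = max_relative_drop
        self._min_samples = min_samples

    def record(self, success: bool) -> None:
        self._events.append(success)

    def rate(self) -> Optional[float]:
        if not self._events:
            return None
        return sum(self._events) / len(self._events)

    def degraded(self) -> bool:
        """True once the rolling rate falls more than max_relative_drop
        below the baseline, with enough samples to trust the estimate."""
        if len(self._events) < self._min_samples:
            return False
        return self.rate() < self._baseline * (1 - self._max_drop)


monitor = ProxySignalMonitor(baseline_rate=0.10)
for i in range(1000):
    monitor.record(i % 10 == 0)   # 10% click-through: healthy
print(monitor.degraded())
for i in range(1000):
    monitor.record(i % 20 == 0)   # 5% click-through: the model has slipped
print(monitor.degraded())
```

A proxy like this does not prove the model is wrong, since a UI change or seasonal shift can move it too, but it gives an alert within one window of traffic instead of waiting for labels.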