Glossary / Deployment & Ops

MLOps (Machine Learning Operations)

The infrastructure to manage models in production. It's DevOps meets data engineering, and it's the cost nobody budgets for.


The Technical Definition

MLOps (Machine Learning Operations) is the practice of applying DevOps principles to machine learning systems. It encompasses the infrastructure, tools, and processes required to take a model from development to production and keep it running reliably. This includes data pipeline management, model versioning, continuous training, monitoring for model drift, retraining triggers, and automated testing for model performance degradation.

What This Actually Means for Your Business

You built an AI model that works great in your notebook. It predicts customer churn with 88% accuracy. Now you need to run it 10,000 times a day in production against live customer data. That’s when MLOps stops being theoretical and becomes mandatory.

Here’s what most companies underestimate: models degrade in production. Your churn model was trained on historical data from 2024. Now it’s 2026 and customer behavior has changed. The model’s accuracy has dropped from 88% to 82%, but nobody noticed because you didn’t build monitoring to alert you. Your AI system is now confidently making wrong decisions at scale.
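A minimal sketch of the kind of check that would have caught that drop. It assumes you eventually learn the true outcome for each prediction (for churn, that can take weeks), and the class name, window size, and tolerated drop are illustrative choices, not a standard.

```python
from collections import deque
from typing import Optional


class AccuracyMonitor:
    """Tracks accuracy over the most recent predictions with known outcomes."""

    def __init__(self, baseline: float, max_drop: float, window: int = 500):
        self.baseline = baseline    # accuracy measured before launch, e.g. 0.88
        self.max_drop = max_drop    # tolerated drop before alerting (assumed value)
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def record(self, predicted: bool, actual: bool) -> None:
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self) -> Optional[float]:
        if not self.outcomes:
            return None  # nothing labeled yet, so nothing to report
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        acc = self.accuracy()
        return acc is not None and (self.baseline - acc) > self.max_drop


# Usage: 82 correct predictions out of the last 100, against an 88% baseline.
monitor = AccuracyMonitor(baseline=0.88, max_drop=0.03, window=100)
for i in range(100):
    monitor.record(predicted=True, actual=(i < 82))
print(monitor.accuracy(), monitor.should_alert())
```

The rolling window is a deliberate choice here: an all-time average would hide a recent decline behind two years of good results.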

MLOps is the infrastructure that prevents this. It includes automated retraining pipelines so your model gets fresh data regularly. It includes monitoring dashboards so you see when model performance drops. It includes data quality checks so you catch corrupted input data before it reaches the model. It includes version control so you know which model version is running in production and can roll back if something breaks.
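As one illustration of a data quality check, the sketch below validates a single input record before it reaches the model. The field names and ranges are hypothetical; a real check would come from your own feature definitions.

```python
import math

# Hypothetical input expectations for a churn model: field -> (min, max).
EXPECTED_RANGES = {
    "tenure_months": (0, 600),
    "monthly_spend": (0.0, 100_000.0),
}


def check_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    for field, (low, high) in EXPECTED_RANGES.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field}: not a number")
        elif math.isnan(value):
            problems.append(f"{field}: NaN")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems


# Usage: a clean record, then one with a negative tenure and no spend field.
print(check_record({"tenure_months": 12, "monthly_spend": 49.9}))
print(check_record({"tenure_months": -3}))
```

Records that fail can be dropped, quarantined, or counted; the count itself is worth tracking, because a sudden spike usually means something upstream changed.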

The operational cost is substantial. You need data engineers to build pipelines. You need monitoring infrastructure. You need tooling to manage model versions. You need testing frameworks to validate models before deployment. Most teams discover they need 1-2 FTEs dedicated to MLOps per production AI system. For a company running three models, that could mean three to six FTEs, a significant ongoing cost.

There’s also the catch-22: you can’t know what you need until your model is in production and failing. Your first year is usually spent building the infrastructure you should have built before launch. Budget for that. The companies that do this well treat MLOps as a first-class engineering discipline, not an afterthought.

Many teams also underestimate data dependencies. Your model’s training pipeline depends on data from three different source systems. When one of them changes its schema or goes down, your retraining breaks. You didn’t cause the problem, but now your model hasn’t been updated in weeks and you don’t know because nobody is monitoring the pipeline. MLOps means you see that break and respond before it impacts production.
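Both failure modes described above can be checked cheaply. The sketch below assumes three hypothetical source systems and column names, and a 14-day freshness limit picked only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract: columns the retraining job expects from each source.
EXPECTED_COLUMNS = {
    "crm": {"customer_id", "plan", "signup_date"},
    "billing": {"customer_id", "monthly_spend"},
    "support": {"customer_id", "open_tickets"},
}


def missing_columns(source: str, actual_columns: set) -> set:
    """Columns the pipeline needs that the source no longer provides."""
    return EXPECTED_COLUMNS[source] - actual_columns


def is_stale(last_retrained: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when the model has gone longer than max_age without retraining."""
    return now - last_retrained > max_age


# Usage: billing renamed a column, and the last retrain was three weeks ago.
print(missing_columns("billing", {"customer_id", "amount"}))
print(is_stale(
    last_retrained=datetime(2026, 1, 1, tzinfo=timezone.utc),
    now=datetime(2026, 1, 22, tzinfo=timezone.utc),
    max_age=timedelta(days=14),
))
```

Running the schema check before each retraining job and the staleness check on a schedule covers both the loud failure (a broken job) and the quiet one (a job that silently stopped running).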

Reality Check

What the vendor says: “Deploy your model to production in one click. We handle the infrastructure.”

What that means in practice: They handle the easy part (putting the model behind an API). You still own monitoring when it degrades, retraining when it drifts, debugging when the input data changes unexpectedly, and managing the costs of running it continuously.

What Operators Actually Do

Companies getting this right create clear ownership: who owns the data pipeline feeding the model? Who owns the monitoring? Who responds when the model degrades? These questions need answers before you go to production.

They also build monitoring early. They don’t wait until the model is failing to add dashboards. They track prediction output distribution, input data quality, model performance against holdout test sets, latency, and cost per prediction. They set thresholds so they get alerted before customers are impacted.
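Tracking the prediction output distribution does not require waiting for true outcomes. One common way to do it is the Population Stability Index (PSI), which compares today's scores against a reference sample. The sketch below assumes scores are probabilities in [0, 1]. The 0.1 and 0.25 cut-offs in the comments are a widely used rule of thumb, not something specific to any tool.

```python
import math


def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population Stability Index between two samples of scores in [0, 1]."""

    def proportions(scores):
        counts = [0] * bins
        for score in scores:
            index = min(int(score * bins), bins - 1)  # a score of 1.0 goes in the last bin
            counts[index] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(count / len(scores), 1e-6) for count in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Usage: reference scores spread evenly, live scores bunched in the upper half.
reference = [i / 100 for i in range(100)]
live = [0.5 + i / 200 for i in range(100)]
value = psi(reference, live)
# Rule of thumb: below 0.1 stable, 0.1 to 0.25 worth a look, above 0.25 investigate.
print("investigate" if value > 0.25 else "ok")
```

A high PSI does not prove the model is wrong, only that it is behaving differently from when you validated it, which is usually reason enough to look.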

The pattern that works: run a pre-production pilot where you treat it like production. Put real monitoring in place. Discover what breaks. Fix those things. Then expand to full production. That pilot often reveals that you need another data engineer or a feature store or a better data quality framework.

Smart teams also build model registries and versioning so they can trace which version was running when something went wrong and reproduce the failure. When a model fails in unexpected ways, you need to roll back to a known-good version quickly.
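To make the idea concrete, here is a toy in-memory registry. It is a sketch of the concept only: real registries persist this information and store the model artifacts themselves, and every name below is hypothetical.

```python
class ModelRegistry:
    """Records model versions and which one is serving production."""

    def __init__(self):
        self._versions = {}  # version -> metadata needed to reproduce it
        self._history = []   # versions promoted to production, oldest first

    def register(self, version: str, data_snapshot: str, metrics: dict) -> None:
        if version in self._versions:
            raise ValueError(f"version {version} already registered")
        self._versions[version] = {"data_snapshot": data_snapshot, "metrics": metrics}

    def promote(self, version: str) -> None:
        if version not in self._versions:
            raise KeyError(version)
        self._history.append(version)

    @property
    def production(self):
        return self._history[-1] if self._history else None

    def rollback(self) -> str:
        """Return production to the previously promoted version."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]


# Usage: v2 misbehaves in production, so serving goes back to v1.
registry = ModelRegistry()
registry.register("v1", data_snapshot="2024-12", metrics={"accuracy": 0.88})
registry.register("v2", data_snapshot="2025-12", metrics={"accuracy": 0.90})
registry.promote("v1")
registry.promote("v2")
registry.rollback()
print(registry.production)
```

Storing the training data snapshot alongside each version is what makes a failure reproducible rather than just reversible.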

The Questions to Ask

  1. How will you monitor this model in production? What metrics will you track? What threshold triggers a retraining or rollback? Who gets alerted?

  2. What’s the data dependency map? This model depends on data from which systems? If one of them breaks or changes, what’s the impact? How quickly will you detect it?

  3. What’s the plan when the model degrades? Models always degrade eventually. What triggers retraining? How long does retraining take? Can you roll back while you’re retraining?
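One way to force answers to these questions is to write them down as a policy before launch. The sketch below is an assumed shape for such a policy; the thresholds and the owner name are placeholders you would replace with your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringPolicy:
    metric: str            # what you track
    retrain_below: float   # threshold that triggers retraining
    rollback_below: float  # threshold that triggers an immediate rollback
    alert_owner: str       # who gets alerted

    def action(self, value: float) -> str:
        if value < self.rollback_below:
            return "rollback"
        if value < self.retrain_below:
            return "retrain"
        return "ok"


# Usage: placeholder thresholds for a churn model launched at 88% accuracy.
policy = MonitoringPolicy(
    metric="accuracy",
    retrain_below=0.85,
    rollback_below=0.80,
    alert_owner="ml-oncall",
)
print(policy.action(0.82), "->", policy.alert_owner)
```

If nobody on the team can fill in these four fields, that is the finding.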
