Data Quality (for AI)

Five dimensions everyone tracks, one nobody does. Why 'we'll clean the data later' is the single most expensive sentence in enterprise AI.

Data & Infrastructure

The Technical Definition

Data quality, in the traditional view, is measured across five dimensions. Accuracy — does the data match reality? Completeness — are required values present? Consistency — does the same fact appear the same way across systems? Timeliness — is the data current enough for the decision it informs? Validity — does the data conform to defined formats and ranges?
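Three of these dimensions can be scored from the data alone; accuracy and consistency need an outside reference (reality, or a second system) to compare against. A minimal sketch of the first three, assuming a hypothetical customer table with `email`, `age`, and `updated_at` fields and illustrative thresholds (none of these names or limits come from a real system):

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical records; field names and values are illustrative only.
NOW = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    {"id": 1, "email": "a@example.com", "age": 34, "updated_at": NOW - timedelta(days=2)},
    {"id": 2, "email": None, "age": 51, "updated_at": NOW - timedelta(days=40)},
    {"id": 3, "email": "not-an-email", "age": 212, "updated_at": NOW - timedelta(days=1)},
    {"id": 4, "email": "d@example.com", "age": 28, "updated_at": NOW - timedelta(days=400)},
]

# Deliberately loose pattern: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Share of rows where the required field is present."""
    return sum(r[field] is not None for r in rows) / len(rows)


def validity(rows):
    """Share of rows whose email matches the format and whose age is in range."""
    def ok(r):
        email_ok = r["email"] is not None and EMAIL_RE.match(r["email"]) is not None
        age_ok = 0 <= r["age"] <= 120  # assumed plausible range
        return email_ok and age_ok
    return sum(ok(r) for r in rows) / len(rows)


def timeliness(rows, now, max_age):
    """Share of rows updated within the window the decision can tolerate."""
    return sum(now - r["updated_at"] <= max_age for r in rows) / len(rows)


email_completeness = completeness(records, "email")
valid_share = validity(records)
fresh_share = timeliness(records, NOW, timedelta(days=30))
```

Each function returns a share between 0 and 1, which is what makes it usable as a tracked metric rather than a one-off audit finding.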

For AI specifically, there’s a sixth dimension most quality programs don’t measure: representation. Does the dataset reflect the real distribution of the cases the model will encounter in production? A customer dataset that’s 95% accurate but drawn entirely from one geography will train a model that is likely to fail as soon as it sees customers from anywhere else. Accuracy was high. The model is still useless.
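Representation can be measured the same way the other dimensions are: as a number with a threshold. A rough sketch, assuming you can state the segment mix you expect in production (the regions, shares, and 10-point tolerance below are made up for illustration):

```python
from collections import Counter


def shares(values):
    """Turn a list of segment labels into a {segment: share} mapping."""
    counts = Counter(values)
    total = sum(counts.values())
    return {segment: n / total for segment, n in counts.items()}


def representation_gaps(train_values, production_shares, tolerance=0.10):
    """Return segments whose training share differs from the expected
    production share by more than `tolerance` (absolute difference)."""
    train = shares(train_values)
    gaps = {}
    for segment, expected in production_shares.items():
        actual = train.get(segment, 0.0)  # a missing segment counts as 0%
        if abs(actual - expected) > tolerance:
            gaps[segment] = round(actual - expected, 3)
    return gaps


# Hypothetical case from the text: training data drawn from one geography.
train_regions = ["EMEA"] * 100
expected_mix = {"EMEA": 0.40, "AMER": 0.35, "APAC": 0.25}  # assumed production mix
gaps = representation_gaps(train_regions, expected_mix)
```

Here every segment is flagged: one is heavily over-represented and the other two are absent. No accuracy metric computed on the training rows would surface that.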

What This Actually Means for Your Business

For 30 years, data quality has been treated as a hygiene issue — something the data engineering team handles, periodically, when someone complains. The investment was modest. The accountability was diffuse. The cost of a quality issue was a wrong number on a report and an embarrassing meeting.

AI changes the math. A quality issue in the data layer doesn’t just produce a wrong number — it produces a model that confidently makes wrong decisions at scale, every minute, until someone notices. The cost is no longer one bad meeting. It’s six weeks of customer impact, a regulator wanting to understand how it happened, and a board asking why the AI initiative is in the news.

Here’s what shifts at the executive level. Data quality stops being a hygiene issue and becomes a strategic capability. The companies deploying AI that works invest in quality programs that look more like manufacturing process control than IT housekeeping — continuous monitoring, statistical baselines, automated alerts on drift, named owners for every critical dataset, and quality SLAs that the business actually signs off on.

The sentence “we’ll clean the data later” is the most expensive sentence in enterprise AI. Companies that say it during the strategy phase pay for it during deployment, when the model is technically working but producing outputs nobody trusts because the underlying data has issues nobody owns. A cleanup project that might have cost $400K up front can run $4M once it has to happen with the model in production, with auditors watching, and with a board that wants weekly updates.

Reality Check

What the vendor says: “Our AI handles data quality issues automatically — it’s robust to noisy inputs.”

What that means in practice: It tolerates some noise during inference. It does not protect you from systematic bias in your training data, missing fields that correlate with specific customer segments, or stale source systems that drift over time. “Robust to noise” is a property of well-designed models. It is not a substitute for a quality program. The model will absorb your data problems and amplify them.

What Operators Actually Do

The companies running AI well have built quality into the data layer as continuous monitoring, not periodic audits. They use observability tooling (platforms such as Monte Carlo, Bigeye, Soda, and Anomalo, or dbt tests in CI) that watches every critical pipeline for freshness, volume, distribution shifts, and schema drift. They alert before the dashboard goes wrong, not after.
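Stripped of the vendor layer, three of those checks are small. The sketch below is not how any of the named tools work internally; it is a hand-rolled illustration, and the batch structure, the 6-hour staleness limit, and the 50% volume tolerance are all assumptions:

```python
from datetime import datetime, timedelta, timezone


def check_batch(batch, expected_schema, baseline_rows, now,
                max_staleness=timedelta(hours=6), volume_tolerance=0.5):
    """Return the names of the checks a pipeline batch fails."""
    alerts = []
    # Freshness: did the data land recently enough?
    if now - batch["loaded_at"] > max_staleness:
        alerts.append("freshness")
    # Volume: is the row count within tolerance of the usual size?
    if abs(batch["row_count"] - baseline_rows) > volume_tolerance * baseline_rows:
        alerts.append("volume")
    # Schema drift: did columns or types change since the last agreed schema?
    if batch["schema"] != expected_schema:
        alerts.append("schema")
    return alerts


# Hypothetical batch that is late, small, and missing a column.
now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
expected_schema = {"customer_id": "int", "region": "str", "spend": "float"}
batch = {
    "loaded_at": now - timedelta(hours=9),
    "row_count": 4_000,
    "schema": {"customer_id": "int", "region": "str"},
}
alerts = check_batch(batch, expected_schema, baseline_rows=10_000, now=now)
```

The value of a platform is running checks like these on every pipeline, on a schedule, with learned baselines in place of hard-coded ones.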

For AI specifically, they do three things differently. They monitor training data quality with the same rigor as production data — because the model is only as good as what it learned. They monitor inference inputs for distribution shift — because real-world data changes and a model trained on last year’s customer mix degrades silently. And they monitor outputs for plausibility — flagging when the model starts producing unusual answers, which is often the first sign that an input pipeline broke.
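One common way to put a number on distribution shift for a single numeric feature is the Population Stability Index (PSI), which compares binned shares between a reference sample and a recent one. The text does not prescribe a method; PSI is swapped in here as an example, and the bin count and sample data are assumptions:

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample (`expected`)
    and a recent sample (`actual`), using equal-width bins on the
    reference range. Values outside that range fall into the edge bins."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def binned_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = binned_shares(expected), binned_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature: training saw the full 0-99 range,
# production inputs now come only from the upper half.
training = [i % 100 for i in range(1000)]
production = [50 + i % 50 for i in range(1000)]

no_shift = psi(training, training)
shift = psi(training, production)
```

A frequently cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cutoffs are conventions, not guarantees. Set thresholds per feature from your own history.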

The other shift: quality ownership moves from the central data team to the domain that produces the data. The supply chain team owns supply chain data quality. The CRM team owns CRM data quality. The central team builds the platform. This sounds like data mesh because it is — quality and ownership are the same problem.

The Questions to Ask

  1. What’s the quality baseline for every dataset feeding the AI, and who’s accountable when it slips? “We checked it once at the start of the project” is not a baseline. A baseline is a continuous metric with thresholds, alerts, and a named owner.

  2. How do you detect distribution shift in production inputs? A model trained on six months of historical data will see new patterns within weeks of deployment. What’s the monitoring that catches the drift before the business does?

  3. Is your representation honest, or just your accuracy? A quality program can report 99% accuracy on a dataset that under-represents half the population the model will serve. The accuracy number is fine. The representation gap is the failure mode. Who’s measuring it?
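For the first question, it helps to see how little a real baseline needs: a metric, two thresholds, and a name. A minimal sketch, assuming a hypothetical `QualityBaseline` record and made-up dataset, metric, and owner names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityBaseline:
    """A continuous quality metric with thresholds and a named owner.
    All names here are hypothetical, for illustration only."""
    dataset: str
    metric: str
    warn_below: float
    fail_below: float
    owner: str

    def evaluate(self, value: float) -> str:
        """Classify the latest measurement and say who is accountable."""
        label = f"{self.dataset}.{self.metric}={value:.2f}"
        if value < self.fail_below:
            return f"FAIL {label}: page {self.owner}"
        if value < self.warn_below:
            return f"WARN {label}: notify {self.owner}"
        return "OK"


baseline = QualityBaseline(
    dataset="crm.customers",
    metric="email_completeness",
    warn_below=0.98,
    fail_below=0.95,
    owner="crm-team",
)

status_good = baseline.evaluate(0.99)
status_warn = baseline.evaluate(0.97)
status_fail = baseline.evaluate(0.90)
```

The point of the `owner` field is that every alert arrives with an answer to “who fixes this?” already attached. If nobody can fill it in for a dataset, that is the finding.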
