Data Lineage
The audit trail that lets your team — and your regulator — answer 'where did this number come from?' The thing that breaks when AI starts producing derived data.
The Technical Definition
Data lineage is the documented record of data’s full path through your systems: where it originated, every transformation it passed through, every system it touched, and every output that depends on it. It answers two questions. Forward: if I change this source field, what breaks downstream? Backward: this number on the executive dashboard — where did it come from, and through what calculations?
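The forward and backward questions are both graph traversals over the same set of edges. As a minimal sketch, assuming lineage is already available as a simple adjacency map (the dataset names below are hypothetical, not from any real system):

```python
from collections import deque

# Hypothetical lineage edges: each key feeds every node in its list.
EDGES = {
    "crm.accounts": ["staging.accounts_clean"],
    "billing.invoices": ["staging.invoices_clean"],
    "staging.accounts_clean": ["marts.revenue_by_account"],
    "staging.invoices_clean": ["marts.revenue_by_account"],
    "marts.revenue_by_account": ["dashboard.exec_revenue"],
}


def reverse(edges):
    """Flip every edge so the graph can be walked from outputs back to sources."""
    flipped = {}
    for src, targets in edges.items():
        for dst in targets:
            flipped.setdefault(dst, []).append(src)
    return flipped


def reachable(edges, start):
    """Breadth-first walk returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def downstream(node):
    """Forward question: what could break if this node changes?"""
    return reachable(EDGES, node)


def upstream(node):
    """Backward question: where did this node's data come from?"""
    return reachable(reverse(EDGES), node)
```

Calling `downstream("crm.accounts")` lists everything that depends on the CRM source, and `upstream("dashboard.exec_revenue")` lists everything the dashboard number was built from.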
Modern lineage tooling (the OpenLineage standard and platforms such as Atlan, Collibra, Alation, DataHub, and Monte Carlo) captures this largely automatically by instrumenting pipelines, parsing SQL, and reading schema metadata. The output is a graph: source systems on one side, dashboards and applications on the other, every join and aggregation in between.
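To illustrate how instrumented pipelines turn into a graph, here is a rough sketch. The event shape is an assumption: it loosely follows the idea of a job with input and output datasets and is not the full OpenLineage specification. The job and dataset names are hypothetical.

```python
# Simplified run events, one per pipeline run. Field names are illustrative
# and deliberately much smaller than any real lineage event format.
EVENTS = [
    {
        "job": "clean_accounts",
        "inputs": ["crm.accounts"],
        "outputs": ["staging.accounts_clean"],
    },
    {
        "job": "build_revenue",
        "inputs": ["staging.accounts_clean", "billing.invoices"],
        "outputs": ["marts.revenue_by_account"],
    },
]


def edges_from_events(events):
    """Return sorted (source, job, target) triples, one per input/output pair."""
    edges = set()
    for event in events:
        for src in event["inputs"]:
            for dst in event["outputs"]:
                edges.add((src, event["job"], dst))
    return sorted(edges)
```

Each triple records which job connected which two datasets, which is the raw material for the graph described above.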
What This Actually Means for Your Business
For most of the last 20 years, lineage was a compliance checkbox. SOX, GDPR, HIPAA, BCBS 239 — the regulators wanted to know you could trace a number back to its source. Most companies satisfied this with documentation that was 60% accurate on the day it was written and 30% accurate six months later.
AI changed the cost of being wrong about lineage. Three things happened simultaneously. First, AI models started consuming data from dozens of sources at once, so the lineage graph got an order of magnitude more complex. Second, AI models started producing derived data — predictions, scores, classifications, generated text — that other systems then consumed, creating lineage chains where part of the path runs through a model whose internals nobody can fully document. Third, regulators started asking specific questions about AI outputs that require provenance: was this customer denied credit because of a protected attribute that leaked into a feature six joins upstream?
The companies that didn’t invest in lineage earlier are now finding out what it costs to reconstruct it under audit pressure. The answer is usually six to nine months of forensic work and a remediation plan written under regulatory observation.
Reality Check
What the vendor says: “Our platform gives you full lineage across your data estate.”
What that means in practice: It captures lineage for the pipelines that run on the platform. The Excel exports your finance team uses, the ad hoc SQL an analyst writes in a notebook, the API call your customer success tool makes to your warehouse: none of those show up in the lineage graph unless someone explicitly instruments them. Lineage tools cover what you connect them to. The blind spots are exactly where the next compliance issue lives.
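One way to make the blind spots visible is to compare who actually reads from the warehouse with who emits lineage. This is a sketch under stated assumptions: both sets below are hypothetical, and in practice the observed readers might come from a warehouse query log while the instrumented set comes from the lineage tool.

```python
# Hypothetical: every consumer seen reading from the warehouse.
OBSERVED_READERS = {
    "airflow.build_revenue",
    "dbt.marts",
    "finance.excel_export",
    "analyst.notebook",
    "cs_tool.api",
}

# Hypothetical: consumers that emit lineage events today.
INSTRUMENTED = {"airflow.build_revenue", "dbt.marts"}


def blind_spots(observed, instrumented):
    """Return consumers that read data but never appear in the lineage graph."""
    return sorted(observed - instrumented)
```

Here `blind_spots(OBSERVED_READERS, INSTRUMENTED)` surfaces the Excel export, the notebook, and the customer success API as paths the lineage graph does not cover.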
What Operators Actually Do
The companies handling lineage well treat it as infrastructure, not documentation. They use OpenLineage or a comparable open standard so lineage events flow from pipelines automatically, not from someone updating a spreadsheet. They tag every data product with its sensitivity classification (PII, PHI, financial, public) and let lineage propagate the classification downstream — so a dashboard built on PII data inherits the PII classification without anyone having to remember.
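Classification propagation can be sketched as a walk over the graph in dependency order. The graph, the tags, and the `propagate` helper below are illustrative assumptions, not the behavior of any particular lineage product.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical graph: each key maps to the set of nodes it is built from.
BUILT_FROM = {
    "staging.customers": {"crm.contacts"},
    "marts.churn_features": {"staging.customers", "web.page_views"},
    "dashboard.churn": {"marts.churn_features"},
    "dashboard.traffic": {"web.page_views"},
}

# Classifications assigned by hand, at the sources only.
SOURCE_TAGS = {"crm.contacts": {"PII"}, "web.page_views": {"public"}}


def propagate(built_from, source_tags):
    """Give every node the union of its own tags and all of its upstream tags."""
    tags = {node: set(t) for node, t in source_tags.items()}
    # static_order() yields each node only after the nodes it is built from.
    for node in TopologicalSorter(built_from).static_order():
        tags.setdefault(node, set())
        for parent in built_from.get(node, ()):
            tags[node] |= tags[parent]
    return tags
```

With this input, `dashboard.churn` ends up carrying the PII tag because it is built on `crm.contacts` two steps upstream, while `dashboard.traffic` stays public. The sketch takes a plain union of tags; a real policy would likely resolve mixed tags to the most restrictive one.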
For AI specifically, the operators ahead of the curve treat the model itself as a node in the lineage graph. The training dataset, the model version, the inference outputs, and the downstream systems consuming those outputs are all linked. When the model is retrained, the lineage records it. When an AI-generated field gets used in a regulatory report, the lineage shows it. This is the only way to answer regulator questions about AI provenance with anything other than a panicked all-hands.
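Treating the model as a node means a backward trace from a report can report which model versions sit behind it. A minimal sketch, assuming a typed node and hypothetical names for the datasets, model, and report:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # "dataset", "model", or "report"


training = Node("features.credit_2024q1", "dataset")
model_v3 = Node("credit_risk_model:v3", "model")
scores = Node("scoring.credit_scores", "dataset")
ledger = Node("finance.general_ledger", "dataset")
report = Node("reports.regulatory_capital", "report")

# Each node maps to the nodes it was directly produced from.
PRODUCED_FROM = {
    model_v3: [training],
    scores: [model_v3],
    report: [scores, ledger],
}


def models_behind(node, produced_from):
    """Return the names of every model found anywhere upstream of node."""
    found, seen, stack = set(), set(), [node]
    while stack:
        current = stack.pop()
        for parent in produced_from.get(current, []):
            if parent in seen:
                continue
            seen.add(parent)
            if parent.kind == "model":
                found.add(parent.name)
            stack.append(parent)
    return found
```

Because the model version is part of the node name, a retrain shows up as a new node, and `models_behind(report, PRODUCED_FROM)` names the exact version that fed the regulatory report.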
The other pattern: they build lineage in before the AI deployment, not after. Adding lineage to an AI system that’s already in production is technically possible and operationally miserable. Adding it on day one costs a fraction of that.
The Questions to Ask
- Can your team produce a complete lineage graph for the top 20 numbers on the executive dashboard, in under a day? If the answer is no, your lineage is documentation, not infrastructure. The first regulatory ask will become a fire drill.
- What happens to lineage when AI is in the path? Model outputs are derived data. If a pricing recommendation from an AI model feeds into a customer-facing quote, lineage has to capture the model version, the input features, and the version of the training data. Most lineage tools handle this poorly without explicit instrumentation.
- How do you handle the unmanaged paths? Spreadsheet exports, ad-hoc notebooks, vendor APIs pulling from your warehouse. Every one of these is a lineage break. What’s your policy for either bringing them into the managed graph or eliminating them?
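For the question about AI in the path, the explicit instrumentation can be as small as one record written alongside each model output. The sketch below is an assumption about what such a record might look like; the class, field names, and values are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InferenceLineage:
    output_id: str               # the quote or score this record describes
    model_name: str
    model_version: str
    training_data_version: str
    input_features: dict         # feature values the model actually saw

    def fingerprint(self):
        """Deterministic SHA-256 of the record, useful for comparing records."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Hypothetical record for one customer-facing quote.
record = InferenceLineage(
    output_id="quote-0001",
    model_name="pricing_recommender",
    model_version="2.4.1",
    training_data_version="2024-03-snapshot",
    input_features={"segment": "smb", "seats": 25},
)
```

Storing the record with the quote means the model version, the input features, and the training data version can be read back later without reconstructing them from logs.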