Data Lake
The 'we'll structure it later' promise. Why most data lakes become data swamps within 18 months — and what separates the ones that don't.
The Technical Definition
A data lake is a central repository that stores raw data in its native format: structured tables, semi-structured JSON, unstructured text, images, audio, video, log files, sensor readings, anything. There’s no schema enforced on write. You dump the data in and decide later what to do with it; structure is applied only when the data is read (schema-on-read). The storage layer is usually cheap object storage (Amazon S3, Azure Data Lake Storage, Google Cloud Storage), with metadata catalogs and query engines layered on top.
The pitch is “store everything, figure it out later.” The architecture works because storage is cheap and compute is decoupled — you only pay to process the data you actually use.
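A minimal sketch of what “no schema enforced on write” looks like in practice, assuming a plain Python list stands in for object storage. The function names, field names, and sample records are hypothetical and exist only for illustration.

```python
import json

# Stand-in for object storage: raw payloads are kept exactly as they arrived.
raw_store = []

def write_raw(payload: str) -> None:
    """A warehouse would validate here; a lake just appends."""
    raw_store.append(payload)

def read_with_schema(fields: dict) -> list:
    """Apply a schema at read time: keep only records that parse and coerce."""
    rows = []
    for payload in raw_store:
        try:
            record = json.loads(payload)
            rows.append({name: cast(record[name]) for name, cast in fields.items()})
        except (ValueError, KeyError, TypeError):
            # Malformed or mismatched records surface at read time, not write time.
            continue
    return rows

write_raw('{"customer_id": "42", "amount": "19.99"}')
write_raw('{"cust": 42, "total": 19.99}')  # same fact, different field names
write_raw('not json at all')               # accepted on write anyway

orders = read_with_schema({"customer_id": int, "amount": float})
```

All three writes succeed, but only the first record survives the read. That gap between what was stored and what is usable is the part the pitch leaves out.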
What This Actually Means for Your Business
Every consultant pitching an AI strategy will tell you the first step is “consolidate your data into a lake.” That sentence hides about three years of work and roughly forty open questions about ownership, governance, and quality.
Here’s what actually happens. Year one: every team starts dumping data into the lake because it’s cheap and the central team said to. Year two: nobody can find anything. Multiple teams have ingested the same data three different ways with three different field names. Year three: someone tries to build an AI model on top of it and discovers that the customer ID field has seven distinct formats across the 400 datasets that mention customers. The lake is now a swamp. The data is technically there. None of it is usable.
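One way to spot the customer ID problem before year three is to profile the shape of a key field across datasets. The sketch below is an illustration under assumptions, not a prescribed tool: the sample IDs are made up, and the shape rule (digits become 9, letters become A) is just one common profiling convention.

```python
import re
from collections import Counter

def id_shape(value) -> str:
    """Reduce a value to its shape: digits become 9, letters become A."""
    text = str(value)
    text = re.sub(r"[0-9]", "9", text)
    text = re.sub(r"[A-Za-z]", "A", text)
    return text

def profile_shapes(values) -> Counter:
    """Count how many values share each shape."""
    return Counter(id_shape(v) for v in values)

# Hypothetical customer IDs pulled from several datasets in the lake.
sample_ids = ["C-00123", "C-00456", 123, "cust_123", "00123", "c-00789"]
shapes = profile_shapes(sample_ids)

for shape, count in shapes.most_common():
    print(f"{shape}: {count}")
```

More than one shape for the same logical field is a signal that the datasets were ingested independently and will need reconciling before anything is built on top of them.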
The companies that get this right treat the lake as a tier, not a destination. Raw data lands in the lake. Curated, documented, owned datasets get promoted to a warehouse or a managed data product where they’re actually queryable. The lake is the inbox. The warehouse is the filing cabinet. Confusing the two is how you end up with a $4M annual storage bill and a CFO who wants to know why the AI initiative is still in pilot.
Reality Check
What the vendor says: “Build a data lake to give your AI access to all your enterprise data.”
What that means in practice: You’ll spend 18 months ingesting data and another 12 months figuring out which of it is trustworthy enough for an AI to consume. The lake is the easy part. The metadata catalog, the access controls, the quality monitoring, and the team that owns curation are the hard parts — and they’re not in the vendor’s quote.
What Operators Actually Do
The companies running data lakes that don’t rot follow a few patterns. They enforce a medallion structure (bronze for raw, silver for cleaned, gold for production-ready) so consumers know what tier they’re querying against. They require every dataset to have a named owner before it gets ingested — no orphans. They invest in a metadata catalog (Unity Catalog, Atlan, Collibra, DataHub) on day one, not year three. And they treat data ingestion as a product engineering discipline, not a one-time migration project.
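The “no orphans” rule and the tier labels can be sketched as a small registration gate. This is a simplified illustration, not how Unity Catalog, Atlan, Collibra, or DataHub work internally; the `Catalog` class, its methods, and the dataset names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Tier(Enum):
    BRONZE = "bronze"  # raw, as ingested
    SILVER = "silver"  # cleaned
    GOLD = "gold"      # production-ready

@dataclass
class Dataset:
    name: str
    tier: Tier
    owner: Optional[str] = None

class Catalog:
    """Toy in-memory catalog that refuses datasets without a named owner."""

    def __init__(self) -> None:
        self._datasets = {}

    def register(self, dataset: Dataset) -> None:
        if not dataset.owner or not dataset.owner.strip():
            raise ValueError(f"{dataset.name}: no named owner, refusing to ingest")
        self._datasets[dataset.name] = dataset

    def tier_of(self, name: str) -> Tier:
        """Let consumers check which tier they are querying against."""
        return self._datasets[name].tier

catalog = Catalog()
catalog.register(Dataset("customer_transactions", Tier.BRONZE, owner="payments-data-team"))

try:
    catalog.register(Dataset("mystery_export", Tier.BRONZE))  # no owner
    rejected = False
except ValueError:
    rejected = True
```

The point of the design is that the ownership check runs at ingestion, so an orphaned dataset never gets into the lake in the first place.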
The operators getting AI value out of their lake don’t point models at raw data. They point models at the curated tier, where someone has signed their name to “this dataset is correct, current, and documented.” Everything else stays in the lake for exploratory work, not production.
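A rough sketch of that sign-off, assuming three tiers named bronze, silver, and gold and a set of named checks that must all pass before promotion. The function names, the check names, and the approver value are placeholders chosen for illustration.

```python
from dataclasses import dataclass, field

TIERS = ("bronze", "silver", "gold")

@dataclass
class Dataset:
    name: str
    tier: str = "bronze"
    signoffs: dict = field(default_factory=dict)  # tier -> who signed for it

def promote(dataset: Dataset, approver: str, checks: dict) -> None:
    """Move a dataset up one tier if every check passes and someone signs."""
    if dataset.tier == TIERS[-1]:
        raise ValueError(f"{dataset.name} is already in the top tier")
    failed = sorted(name for name, passed in checks.items() if not passed)
    if failed:
        raise ValueError(f"{dataset.name} failed checks: {', '.join(failed)}")
    if not approver:
        raise ValueError(f"{dataset.name}: promotion needs a named approver")
    next_tier = TIERS[TIERS.index(dataset.tier) + 1]
    dataset.signoffs[next_tier] = approver
    dataset.tier = next_tier

def load_for_production(dataset: Dataset) -> str:
    """Production consumers, models included, only get gold-tier data."""
    if dataset.tier != "gold":
        raise PermissionError(f"{dataset.name} is {dataset.tier}, not gold")
    return dataset.name

feed = Dataset("customer_transactions")
checks = {"correct": True, "current": True, "documented": True}
promote(feed, "a.owner", checks)  # bronze -> silver
promote(feed, "a.owner", checks)  # silver -> gold
```

Recording who approved each promotion is what turns “we trust it” from a feeling into something a later reader can audit.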
The Questions to Ask
- Who owns each dataset, and what’s the named contract for keeping it current? A lake without ownership is a swamp by year two. If your team can’t tell you who owns the customer transaction feed, the AI built on top of it has no foundation.
- What’s the promotion path from raw to production-ready? Raw data should never feed a customer-facing model directly. What’s your process for moving a dataset from “we have it” to “we trust it”?
- What’s the cost of the data you’re not using? Most lakes accumulate datasets that nobody has queried in 18 months. Storage is cheap, but compliance exposure on data you forgot you had is not. What’s your retention and deletion policy?
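The last question can be turned into a recurring audit. The sketch below assumes you can export a last-queried date per dataset from your catalog or query logs; the dataset names and dates are invented, and 548 days is only an approximation of 18 months.

```python
from datetime import date, timedelta

def stale_datasets(last_queried: dict, today: date, max_idle_days: int = 548) -> list:
    """Return datasets never queried, or not queried within max_idle_days."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        name
        for name, last in last_queried.items()
        if last is None or last < cutoff
    )

# Hypothetical export: dataset name -> date of the most recent query (None = never).
last_queried = {
    "clickstream_2021": date(2022, 3, 1),
    "orders": date(2024, 12, 15),
    "legacy_export": None,
}

review_list = stale_datasets(last_queried, today=date(2025, 1, 1))
```

The output is a review list, not a delete list: each entry still needs its owner to decide whether retention rules require keeping it, archiving it, or removing it.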