
Unstructured Data

The 80% of your enterprise data that didn't fit in a database. LLMs made it usable. That's the opportunity and the trap.

Data & Infrastructure

The Technical Definition

Unstructured data is information that doesn’t live in the rows and columns of a relational database. Documents, emails, contracts, PDFs, slide decks, meeting transcripts, support call recordings, images, video, scanned forms, chat logs, voicemails. It carries meaning, but no schema. A traditional query engine can’t filter it, join it, or aggregate it without significant pre-processing.

By most commonly cited industry estimates, unstructured data accounts for 80 to 90 percent of the information a company holds. Until recently, most of it was effectively dead: searchable by keyword if you were lucky, ignored otherwise.

Large language models changed the mechanics. An LLM can read a contract, summarize a transcript, classify a support ticket, and extract entities from an email. Paired with vector embeddings and a retrieval system that pulls the relevant passages into its context, it can answer questions across thousands of documents, and unstructured content becomes queryable in something close to natural language.
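The retrieval step described above can be sketched in a few lines. This is a toy illustration, not a production design: the `embed` function here is a bag-of-words counter standing in for a real embedding model, and the document IDs and texts are made up. The shape is what matters: turn the question and each document into vectors, rank by similarity, and hand the top results to the LLM.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts.
    A real system would call a learned embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, docs: dict[str, str], k: int = 1) -> list[str]:
    """Return the IDs of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda doc_id: cosine(q, embed(docs[doc_id])), reverse=True)
    return ranked[:k]


# Hypothetical corpus, for illustration only.
DOCS = {
    "contract_acme": "This agreement renews automatically for one year unless cancelled in writing.",
    "ticket_1042": "Customer reports the login page times out after password reset.",
    "transcript_q3": "The team discussed hiring plans and the office move.",
}

top = retrieve("which contract renews automatically", DOCS)
print(top)
```

Note what the sketch does not do: it has no idea whether `contract_acme` is the executed version or a draft. Retrieval ranks by similarity, not by authority.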

What This Actually Means for Your Business

Every executive deck on AI strategy in 2025 contains some version of the line: “We have enormous amounts of unstructured data we’re not using.” It’s true. It’s also not a strategy.

Here’s what that line usually hides. The unstructured data is scattered across SharePoint sites nobody owns, email archives gated by retention policies, recorded calls on three different platforms, contracts in a DMS that requires VPN, and a shared drive last reorganized in 2019. The “we have a lot of it” framing assumes the volume is the asset. In practice, the volume is the problem.

The companies producing real value from unstructured data narrow the question hard. Instead of “how do we use our unstructured data,” they ask “what decision is being made badly today because the relevant document isn’t being read.” That reframe surfaces the use cases worth funding: a renewal team that can’t read every contract before a customer call, a claims adjuster summarizing 200-page medical files by hand, a compliance officer reviewing call recordings for keyword triggers, a procurement team comparing supplier quotes line by line.

Each of those is a specific job with a measurable cycle time. Each one is scoped tightly enough to survive the move from pilot to production. “Make our unstructured data useful” is not.

Reality Check

What the vendor says: “Our platform indexes all your unstructured data and makes it searchable with AI.”

What that means in practice: It indexes whatever you point it at. It will happily ingest the four-year-old policy that was superseded last quarter, the contract draft that was never executed, the meeting notes from the failed initiative, and the customer email written by an employee who left in anger. The AI will cite all of them with equal confidence.

What Operators Actually Do

The pattern in companies getting return on unstructured data starts with curation, not ingestion. Before anything is indexed, somebody — usually a domain expert paired with a data engineer — decides what is in scope, what is authoritative, what is historical reference, and what is noise to exclude. That decision gets versioned and reviewed.
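One way to make that scoping decision concrete is a versioned manifest that the indexer must consult before it touches a file. The sketch below is an assumption about how such a manifest could look, not a description of any particular product: the `Manifest`, `ManifestEntry`, and `Status` names, the file paths, and the reviewer names are all hypothetical. The design choice worth noting is the default: anything not listed in the manifest is not indexed.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    AUTHORITATIVE = "authoritative"  # current source of truth
    HISTORICAL = "historical"        # kept for reference, labeled as such
    EXCLUDED = "excluded"            # noise: drafts, superseded junk


@dataclass(frozen=True)
class ManifestEntry:
    path: str
    status: Status
    decided_by: str  # the domain expert who made the call


@dataclass(frozen=True)
class Manifest:
    version: int  # bumped on every reviewed change
    entries: tuple[ManifestEntry, ...]


def indexable(manifest: Manifest, candidates: list[str]) -> list[tuple[str, str]]:
    """Return (path, status) for candidates the manifest puts in scope.
    Unlisted paths and EXCLUDED paths are dropped."""
    in_scope = {
        e.path: e.status.value
        for e in manifest.entries
        if e.status is not Status.EXCLUDED
    }
    return [(p, in_scope[p]) for p in candidates if p in in_scope]


# Hypothetical manifest and crawl results, for illustration only.
MANIFEST = Manifest(
    version=3,
    entries=(
        ManifestEntry("policies/supplier_risk_v7.pdf", Status.AUTHORITATIVE, "j.ortiz"),
        ManifestEntry("policies/supplier_risk_v6.pdf", Status.HISTORICAL, "j.ortiz"),
        ManifestEntry("contracts/acme_draft_unsigned.docx", Status.EXCLUDED, "legal-ops"),
    ),
)
CANDIDATES = [
    "policies/supplier_risk_v7.pdf",
    "policies/supplier_risk_v6.pdf",
    "contracts/acme_draft_unsigned.docx",
    "notes/failed_initiative.md",  # never reviewed, so never indexed
]

to_index = indexable(MANIFEST, CANDIDATES)
print(to_index)
```

Carrying the status alongside the path lets the retrieval layer label historical material as historical instead of citing it with the same confidence as the current policy.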

They also distinguish between read-once and read-often content. A 2019 board deck does not need to be in a vector database. The current version of the supplier risk policy does. The discipline: index narrowly, expand deliberately, and assume every shortcut on the front end produces a hallucination on the back end.

The other thing they do, and the one most companies skip, is fund a content owner for the corpus. Somebody whose job is to keep the indexed material current, retire stale content, and audit what the AI is citing. That role usually didn’t exist before. It does now.
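Part of that owner's routine can be automated. The sketch below assumes each indexed document carries an owner, a last-reviewed date, and an optional pointer to whatever superseded it; the field names, the 180-day review window, and the sample documents are illustrative assumptions, not a standard. It produces two worklists: documents to retire and documents overdue for review.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    owner: str
    last_reviewed: date
    superseded_by: Optional[str] = None  # doc_id of the replacement, if any


def audit(docs: list[IndexedDoc], today: date, max_age_days: int = 180) -> dict[str, list[str]]:
    """Split the corpus into documents to retire and documents to re-review.
    The 180-day default is an arbitrary example threshold."""
    limit = timedelta(days=max_age_days)
    retire = [d.doc_id for d in docs if d.superseded_by is not None]
    review = [
        d.doc_id
        for d in docs
        if d.superseded_by is None and today - d.last_reviewed > limit
    ]
    return {"retire": retire, "review": review}


# Hypothetical corpus, for illustration only.
CORPUS = [
    IndexedDoc("supplier_risk_v7", "j.ortiz", date(2025, 3, 1)),
    IndexedDoc("supplier_risk_v6", "j.ortiz", date(2024, 1, 15), superseded_by="supplier_risk_v7"),
    IndexedDoc("pricing_faq", "sales-ops", date(2024, 9, 1)),
]

worklist = audit(CORPUS, today=date(2025, 6, 1))
print(worklist)
```

The script only surfaces candidates. Deciding whether `pricing_faq` is still correct is the human part of the job.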

The Questions to Ask

  1. What specific decision gets better when this unstructured data is readable by AI? If the answer is general — “better insights,” “more visibility” — there is no use case. If it’s specific — “renewal reps see the auto-renewal clause before the call” — there is.

  2. Who decides what gets indexed, and how often is that decision revisited? A corpus that ingests automatically is a corpus that hallucinates automatically. The curation role is the deliverable, not the index.

  3. What’s the audit trail when the AI cites a document? Can a human click through to the source, see the version, see when it was last reviewed, and see who owns it? If not, the AI is generating answers your legal team cannot defend.
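The third question can be turned into a gate. As a minimal sketch, assume every citation the AI returns is a record with a source link, a version, a last-reviewed date, and an owner, and that an answer is released only when every citation is complete. The `Citation` fields, the `dms.example.com` URLs, and the release rule are hypothetical choices for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Citation:
    source_url: Optional[str]     # where a human can click through
    version: Optional[str]        # which version of the document was read
    last_reviewed: Optional[str]  # ISO date of the last content review
    owner: Optional[str]          # who is accountable for the document


def missing_fields(c: Citation) -> list[str]:
    """Names of audit-trail fields that are absent or empty."""
    return [f.name for f in fields(c) if not getattr(c, f.name)]


def can_release(citations: list[Citation]) -> bool:
    """An answer is releasable only if it cites something and every
    citation has a complete audit trail."""
    return bool(citations) and all(not missing_fields(c) for c in citations)


# Hypothetical citations, for illustration only.
good = Citation("https://dms.example.com/contracts/acme-msa", "v4", "2025-02-10", "legal-ops")
bad = Citation("https://dms.example.com/policies/travel", None, None, "")

print(missing_fields(good))
print(missing_fields(bad))
print(can_release([good]), can_release([good, bad]))
```

An answer with no citations fails the gate too. That is deliberate in this sketch: an uncited answer is the least defensible kind.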
