Optical Character Recognition (OCR)
What vendors mean: we can read text in images. What it actually means: a 50-year-old technology that just got rebuilt by multimodal LLMs and now actually works.
The Technical Definition
Optical Character Recognition converts images of text — scanned documents, photos, screenshots, PDFs — into machine-readable text. The underlying technology has gone through three eras: rule-based pattern matching (1970s–1990s), deep-learning models trained on labeled text images (2010s), and multimodal large language models that read documents the way a person does (2023 onward). The third era didn’t replace the second; the two now coexist in most production systems.
What This Actually Means for Your Business
If your business runs on paper, scanned forms, faxes, photographs, or any document that started life outside a database, OCR is somewhere in your stack — whether you bought it directly or it’s inside an IDP, claims, lending, or RPA platform.
For most of the last two decades, OCR was a solved-but-frustrating technology. Tesseract, ABBYY, Google Cloud Vision, and AWS Textract did the basics well: clean printed text on a clean page got recognized at 95%+ accuracy. The trouble started at the edges: handwriting, rotated scans, low-contrast checks, faxed copies that had been faxed three times, and structured layouts where the model needed to know which number was the invoice total versus the line-item subtotal.
What changed between 2024 and 2026 is multimodal LLMs. GPT-4V, Claude with vision, and Gemini can read a document, understand its structure, extract specific fields, and answer questions about it in a single API call. The handwriting that used to require a specialized model now gets read by a general-purpose model with reasonable accuracy. The structured layout that used to require template-matching is handled by the model’s understanding of “this is an invoice, the total is in the bottom right.”
This doesn’t mean legacy OCR is dead. Multimodal LLMs are slower, more expensive per page, and less predictable than purpose-built OCR engines. For high-volume, narrow document types (millions of bank deposits per day, retail receipt processing, license plate recognition), specialized OCR still wins on cost and latency. The pattern that’s emerging in 2026 is hybrid: cheap OCR runs first, and the LLM picks up only the hard cases. When most pages clear the cheap tier, cost per document can drop by around 80% versus LLM-only.
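The hybrid pattern can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's API: `OcrResult`, `read_page`, and the two engine callables are hypothetical names, and the 0.90 confidence cutoff is an assumed value you would tune on your own documents.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OcrResult:
    text: str
    confidence: float  # engine-reported score in [0, 1] (assumed to be available)
    engine: str        # which tier produced this result


def read_page(
    image: bytes,
    cheap_ocr: Callable[[bytes], OcrResult],
    llm_read: Callable[[bytes], OcrResult],
    min_confidence: float = 0.90,  # assumed cutoff, tune per document type
) -> OcrResult:
    """Run the cheap engine first; escalate to the LLM only for hard pages."""
    first = cheap_ocr(image)
    # Accept the cheap result when it is confident and actually found text.
    if first.confidence >= min_confidence and first.text.strip():
        return first
    # Otherwise pay for the slower, more expensive read.
    return llm_read(image)
```

The share of pages that fall through to `llm_read` is what drives the bill, so it is worth logging the `engine` field on every result.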
The other shift is field-level extraction. Old OCR returned a wall of text; the application code did the work of finding “invoice number.” New multimodal models return structured JSON directly. The downstream code is dramatically simpler. The companies that haven’t refactored their document pipelines for this yet are doing four times the engineering work for the same outcome.
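A rough sketch of the difference in downstream code, using only the Python standard library. The `Invoice` record, the field names `invoice_number` and `total`, and the regular expressions are assumptions for illustration; a real model's JSON schema is whatever you ask it to return.

```python
import json
import re
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Invoice:
    invoice_number: str
    total: Decimal


def from_raw_text(text: str) -> Invoice:
    """Old style: the application hunts for fields in a wall of text."""
    number = re.search(r"Invoice\s*(?:No\.?|#)\s*(\S+)", text, re.IGNORECASE)
    # Anchor at line start so "Subtotal" is not mistaken for "Total".
    total = re.search(
        r"^Total\s*:?\s*\$?([\d,]+\.\d{2})", text, re.IGNORECASE | re.MULTILINE
    )
    if not number or not total:
        raise ValueError("could not locate required fields")
    return Invoice(number.group(1), Decimal(total.group(1).replace(",", "")))


def from_model_json(payload: str) -> Invoice:
    """New style: the model returns named fields; the code only validates types."""
    data = json.loads(payload)
    # A missing key raises KeyError, which is the signal to retry or escalate.
    return Invoice(str(data["invoice_number"]), Decimal(str(data["total"])))
```

The regex version has to be rewritten for every new layout. The JSON version only changes when the list of fields changes.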
Reality Check
What the vendor says: “Our OCR uses cutting-edge AI to achieve 99.9% accuracy on any document.”
What that means in practice: They’re either using a wrapper around AWS Textract or Google Document AI, or they’re using a multimodal LLM and not telling you. The “99.9%” is on a clean benchmark, not on your fax-of-a-handwritten-claim-form. Ask which model is doing the reading and what the accuracy is on documents like yours.
What Operators Actually Do
Companies running OCR at scale benchmark on their own documents before they sign. They take 200 representative documents — including the messy ones — and run them through three or four OCR engines. The results are almost always different from the vendor’s marketing numbers and almost always reorder the vendor ranking.
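A bake-off like this needs a scoring rule. One common choice is character error rate: the edit distance between the engine output and a hand-checked transcript, divided by the transcript length. The sketch below assumes you already have ground-truth text per document and each engine's output keyed by the same document ID; the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the chars match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(truth: str, predicted: str) -> float:
    if not truth:
        return 0.0 if not predicted else 1.0
    return edit_distance(truth, predicted) / len(truth)


def rank_engines(
    truth: dict[str, str],               # doc_id -> hand-checked transcript
    outputs: dict[str, dict[str, str]],  # engine -> doc_id -> engine output
) -> list[tuple[str, float]]:
    """Return (engine, mean error rate) pairs, best engine first."""
    if not truth:
        raise ValueError("need at least one ground-truth document")
    scores = []
    for engine, pages in outputs.items():
        # A document the engine failed to return counts as fully wrong.
        rates = [character_error_rate(t, pages.get(doc_id, ""))
                 for doc_id, t in truth.items()]
        scores.append((engine, sum(rates) / len(rates)))
    return sorted(scores, key=lambda s: s[1])
```

Running the ranking separately on the messy subset (handwriting, rotated, low-contrast) is usually where the vendor order changes.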
They also pay attention to the cost-per-page math. A multimodal LLM at $0.01 to $0.05 per page is fine for low-volume specialty documents but ruinous at 10 million pages per month, where it works out to $100,000 to $500,000 monthly. Specialized OCR at $0.001 per page costs about $10,000 at the same volume and is the right answer at that scale. The right architecture is usually a tiered pipeline: cheap first, expensive only when needed.
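The arithmetic is simple enough to keep in a script and rerun whenever a quote changes. The sketch below uses the per-page prices from the paragraph above; the $0.03 LLM price (a midpoint of the quoted range) and the 15% escalation rate are assumptions for illustration, not measured figures.

```python
from decimal import Decimal


def monthly_cost(pages: int, price_per_page: Decimal) -> Decimal:
    """Cost of sending every page through a single engine."""
    return pages * price_per_page


def tiered_cost(
    pages: int,
    cheap_price: Decimal,
    llm_price: Decimal,
    escalation_rate: Decimal,  # fraction of pages the cheap tier hands to the LLM
) -> Decimal:
    """Every page pays the cheap price; escalated pages also pay the LLM price."""
    return pages * (cheap_price + escalation_rate * llm_price)


pages = 10_000_000
cheap = Decimal("0.001")
llm = Decimal("0.03")          # assumed midpoint of the $0.01 to $0.05 range
escalation = Decimal("0.15")   # assumed: 15% of pages need the LLM

llm_only = monthly_cost(pages, llm)
tiered = tiered_cost(pages, cheap, llm, escalation)
saving = (llm_only - tiered) / llm_only

print(f"LLM-only: ${llm_only:,.0f}  tiered: ${tiered:,.0f}  saving: {saving:.0%}")
```

Under those assumptions the tiered pipeline costs $55,000 a month against $300,000 for LLM-only, a saving of about 82%. The escalation rate is the number to measure on your own documents, because the saving shrinks quickly as it rises.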
The teams getting real value treat OCR as one component in a larger document understanding system, not a standalone product. The OCR output flows into a layout-aware model that understands tables, key-value pairs, and document type. That flows into an LLM that reasons over the extracted content. Each stage is replaceable as the technology improves, which it will.
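One way to keep each stage replaceable is to give every stage the same shape: a function that takes a document and returns it with more filled in. The sketch below is an assumed design, not a specific product's architecture; the `Document` fields and the three stub stages are placeholders standing in for a real OCR engine, layout model, and LLM.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    image: bytes
    text: str = ""                               # filled by the OCR stage
    layout: dict = field(default_factory=dict)   # filled by the layout stage
    fields: dict = field(default_factory=dict)   # filled by the reasoning stage


Stage = Callable[[Document], Document]


def run_pipeline(doc: Document, stages: list[Stage]) -> Document:
    """Pass the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


# Stub stages: each one stands in for a real component.
def stub_ocr(doc: Document) -> Document:
    doc.text = "Invoice INV-1042\nTotal 1234.50"
    return doc


def stub_layout(doc: Document) -> Document:
    # Treat each line as "key value", split on the first space.
    pairs = (line.split(" ", 1) for line in doc.text.splitlines() if " " in line)
    doc.layout = {"key_values": dict(pairs)}
    return doc


def stub_reasoner(doc: Document) -> Document:
    kv = doc.layout.get("key_values", {})
    doc.fields = {"invoice_number": kv.get("Invoice"), "total": kv.get("Total")}
    return doc


result = run_pipeline(Document(image=b""), [stub_ocr, stub_layout, stub_reasoner])
```

Swapping the OCR engine means replacing one entry in the list. The layout and reasoning stages do not change as long as the new engine still fills in `text`.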
The Questions to Ask
- What’s the accuracy on documents that look like ours? Don’t accept benchmark numbers. Send 200 representative documents and ask for the actual rate, including the breakdown on hard cases (handwriting, rotated, low-contrast).
- What’s the cost per page at our volume? OCR pricing is wildly variable across providers and document types. Get a real quote at your projected volume, not a “starting at” number.
- What happens when the OCR engine gets retired or repriced? Cloud providers deprecate models on their schedule, not yours. If your application is tightly coupled to AWS Textract or a specific Google API, what’s the migration path when it changes?