Re-ranking

A second-pass model that reorders search results by relevance. A strong reranker can recover from a weak retriever — and it's one of the cheapest quality wins in any RAG system.

The Technical Definition

A re-ranker is a second-pass model that takes a list of candidate documents from a first-pass retriever and reorders them by how well they actually answer the query. The first-pass retriever (semantic search, BM25, or hybrid) is optimized for speed and recall — it has to score every document in the index, so it uses cheap methods that scan thousands of candidates fast. A re-ranker runs only on the top 50–100 candidates that survive that first pass, and uses a more expensive model to score each query-document pair directly.

The technical name is “cross-encoder.” Where the first-pass retriever embeds the query and documents separately and compares vectors, the re-ranker feeds the query and each candidate document into the same model together, producing a relevance score that reflects deeper interaction between the two. The standard production options in 2026 are Cohere Rerank, BGE rerankers (open-source), and Voyage rerank.
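The difference between the two scoring styles can be shown with a toy sketch. Nothing here is a real model: `embed`, `bi_encoder_score`, and `cross_encoder_score` are hypothetical stand-ins (a bag-of-words vector and a word-order check) chosen only to show where each approach gets to look at the query and the document together.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def bi_encoder_score(query: str, doc: str) -> float:
    """First-pass style: query and document are encoded separately.

    The document vector never sees the query, so it can be precomputed
    and stored in an index.
    """
    return cosine(embed(query), embed(doc))


def cross_encoder_score(query: str, doc: str) -> float:
    """Re-ranker style (toy): the scorer sees both texts at once.

    Assumption for illustration only: the fraction of adjacent query word
    pairs that also appear adjacent in the document. A real cross-encoder
    is a neural model, not a bigram check.
    """
    q = query.lower().split()
    d = doc.lower().split()
    query_bigrams = list(zip(q, q[1:]))
    if not query_bigrams:
        return 0.0
    doc_bigrams = set(zip(d, d[1:]))
    return sum(bg in doc_bigrams for bg in query_bigrams) / len(query_bigrams)


query = "dog bites man"
docs = ["man bites dog", "dog bites man"]
for doc in docs:
    print(doc, round(bi_encoder_score(query, doc), 3), cross_encoder_score(query, doc))
```

In this toy setup both documents tie under the separate-encoding comparison because they contain the same words, while the joint scorer separates them because it can compare word order across the pair.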

What This Actually Means for Your Business

Most enterprise RAG systems are running with a weak retriever and no re-ranker, and they don’t know it. The first-pass retrieval gets the right document into the top 30 most of the time. It gets it into the top 5 — where the LLM actually sees it as context — much less reliably. Without re-ranking, the LLM is generating answers from a context window full of plausible-but-wrong documents, and the quality of the output reflects that.

A re-ranker fixes that top-5 problem cheaply. The cost is a few hundred milliseconds of additional latency per query and a per-query API charge of a fraction of a cent. The benefit, in published benchmarks and in the production systems we’ve audited, is a 15–30% improvement in answer quality, measured by relevance to the query. That’s the cheapest quality lever in the RAG stack.

The deeper point: a strong re-ranker partially compensates for a weak retriever. If your first-pass retrieval is mediocre but you pull a top-50 candidate set and re-rank it, you can recover most of the quality you’d otherwise lose. This is why teams that can’t easily fix their chunking or upgrade their embeddings still see meaningful gains from adding re-ranking. It’s the most forgiving lever in the pipeline.

The cost-quality tradeoff is straightforward. Re-ranking cost scales linearly with the number of candidates, because the model scores each query-document pair separately: 50 candidates means 50 scoring passes per query. Re-ranking 200 candidates is mostly wasted, since the right document is almost always in the top 50 if first-pass retrieval is competent. So the standard pattern is: retrieve the top 50–100 with hybrid search, re-rank down to the top 5–10, and send those to the LLM.
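The retrieve-then-re-rank pattern can be sketched in a few lines. This is a minimal outline, not a production implementation: `keyword_overlap` and `phrase_match` are hypothetical stand-ins for a real first-pass retriever and a real re-ranker, and the brute-force sort over the corpus stands in for an index lookup.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    first_pass: Scorer,
    reranker: Scorer,
    candidates: int = 50,
    final_k: int = 5,
) -> List[Tuple[float, str]]:
    """Two-stage retrieval: cheap scoring on everything, expensive on a shortlist."""
    # Stage 1: the cheap scorer ranks the whole corpus; keep the shortlist.
    shortlist = sorted(
        corpus, key=lambda doc: first_pass(query, doc), reverse=True
    )[:candidates]
    # Stage 2: the expensive scorer runs only on the shortlist.
    scored = [(reranker(query, doc), doc) for doc in shortlist]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:final_k]


def keyword_overlap(query: str, doc: str) -> float:
    """Hypothetical cheap first-pass scorer: number of shared terms."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    """Hypothetical re-ranker stand-in: 1.0 if the query appears as a phrase."""
    return 1.0 if query.lower() in doc.lower() else 0.0


corpus = [
    "password reset policy",
    "how to reset a password",
    "reset password in three steps",
    "office lunch menu",
]
top = retrieve_then_rerank(
    "reset password", corpus, keyword_overlap, phrase_match, candidates=3, final_k=2
)
print(top)
```

The `candidates` and `final_k` defaults mirror the 50 and 5 figures above. The re-ranker's cost is bounded by `candidates`, not by the size of the corpus, which is the reason the pattern stays cheap.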

Reality Check

What the vendor says: “Our search returns the most relevant results.”

What that means in practice: Probably their first-pass retrieval is OK, and the right document is somewhere in the top 30. Without re-ranking, “somewhere in the top 30” doesn’t help, because the LLM only sees the top 5. Ask whether they re-rank, and if not, why not.

What Operators Actually Do

The teams running production RAG systems with measurable quality almost always have a re-ranking step in the pipeline. The standard architecture: hybrid retrieval to top 50, re-rank to top 5–10, those become the LLM context. This stack outperforms anything without re-ranking on every published benchmark and in every internal eval we’ve seen.

They also evaluate the re-ranker on their own data. Generic re-rankers (Cohere Rerank v3, BGE-reranker-v2) work well on general English. Domain-specific corpora — legal contracts, clinical notes, parts catalogs — sometimes benefit from fine-tuning a smaller open-source re-ranker on labeled query-document pairs. This is engineering work, not a config switch, but the quality lift can be substantial.
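Evaluating a re-ranker on your own data can start small. The sketch below assumes you have, for each test query, a ranked list of document IDs and the ID of the one document a human labeled as correct; the function names and the choice of metrics (mean reciprocal rank and hit rate at k) are illustrative assumptions, not something the teams described here are known to use.

```python
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """1 / position of the labeled document, or 0.0 if it is missing."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / position
    return 0.0


def hit_at_k(ranked_ids: Sequence[str], relevant_id: str, k: int) -> bool:
    """True if the labeled document is within the first k results."""
    return relevant_id in ranked_ids[:k]


def evaluate(runs: List[Tuple[Sequence[str], str]], k: int = 5) -> Dict[str, float]:
    """Average both metrics over (ranked_ids, relevant_id) pairs."""
    if not runs:
        return {"mrr": 0.0, "hit_rate": 0.0}
    n = len(runs)
    return {
        "mrr": sum(reciprocal_rank(ids, rel) for ids, rel in runs) / n,
        "hit_rate": sum(hit_at_k(ids, rel, k) for ids, rel in runs) / n,
    }


# Hypothetical labeled data: the same two queries, before and after re-ranking.
before = [(["d7", "d2", "d9"], "d9"), (["d4", "d1", "d3"], "d1")]
after = [(["d9", "d7", "d2"], "d9"), (["d1", "d4", "d3"], "d1")]
print("first pass:", evaluate(before, k=1))
print("re-ranked: ", evaluate(after, k=1))
```

Running the same labeled set through the pipeline with and without the re-ranker, or with two candidate re-rankers, gives a like-for-like comparison on your corpus instead of on general English.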

The other operator move: they monitor re-rank scores in production. When the top re-rank score for a query is low, the system probably doesn’t have a good answer in the corpus, and the LLM should refuse rather than fabricate. Wiring re-rank scores into an “I don’t know” gate is one of the simplest hallucination defenses available.
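A gate like this can be very small. The sketch below is one possible shape, under stated assumptions: `answer_or_refuse` and `generate` are hypothetical names, the `0.3` threshold is a placeholder that would need tuning on your own data, and score scales differ between re-rankers.

```python
from typing import Callable, List, Sequence, Tuple

REFUSAL = "I don't know: nothing sufficiently relevant was found."


def answer_or_refuse(
    reranked: Sequence[Tuple[float, str]],
    generate: Callable[[List[str]], str],
    threshold: float = 0.3,  # placeholder value; tune per re-ranker and corpus
) -> str:
    """Call the LLM only when the best re-rank score clears the threshold.

    `reranked` is a list of (score, document) pairs sorted best-first.
    """
    if not reranked or reranked[0][0] < threshold:
        # Low top score: the corpus probably has no good answer, so refuse.
        return REFUSAL
    # Pass along only the documents that individually clear the threshold.
    context = [doc for score, doc in reranked if score >= threshold]
    return generate(context)


def fake_llm(context: List[str]) -> str:
    """Stand-in for a real LLM call, used here only to make the sketch runnable."""
    return f"Answer based on {len(context)} document(s)."


print(answer_or_refuse([(0.82, "doc A"), (0.41, "doc B"), (0.05, "doc C")], fake_llm))
print(answer_or_refuse([(0.12, "doc D")], fake_llm))
```

Logging the top score alongside each refusal gives you the data needed to adjust the threshold later.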

The Questions to Ask

Does your retrieval pipeline include a re-ranker? If not, ask why. Re-ranking is one of the cheapest quality wins available, and the absence of it usually signals a system that wasn’t optimized past the demo.
Which re-ranker, and how was it evaluated on our data? Cohere Rerank is the default commercial choice. BGE rerankers are the default open-source choice. The right one depends on data sensitivity (self-hosted vs. API), domain match, and budget.
Are re-rank scores being used to gate LLM responses? When the top re-rank score is low, the corpus probably doesn’t have a good answer. Using that signal to refuse instead of fabricate is one of the strongest hallucination defenses in production today.
