24.7 Rerankers: Cross-Encoder Models for Precision

Right, so you’ve got your initial set of documents from your vector store. You’re feeling pretty good. You typed in “best practices for pruning apple trees,” and your retriever dutifully came back with 20 documents about fruit, shears, and branches. But let’s be honest: some of those are probably about Apple stock options or, god forbid, a recipe for apple pie. This is where the coarse, one-vector-per-text approximation made by your bi-encoder (the thing that powered your initial search) starts to show its limits.

Enter the reranker, our precision scalpel. Think of your initial retriever as a wide-net fisherman; it’s great at getting a bunch of potentially relevant fish in the boat. The reranker is the crew that sorts through the catch, throws back the junk, and lines up the prize tuna for the chef. Its job is not to find more documents, but to take your existing list and reorder it based on a much more sophisticated, computationally expensive understanding of your query relative to each text chunk.

Why a Separate Reranking Step?

You might ask, “Why didn’t we just use this super-smart model for the search in the first place?” Excellent question. The answer is speed and scale.

Your initial vector search uses a bi-encoder architecture. It pre-computes embeddings for all your documents and stores them in an index. When you query, it quickly converts your query into an embedding and finds the nearest neighbors in that pre-computed space. It’s fast (milliseconds fast, with a suitable index) even over millions of documents. But it’s a bit of a blunt instrument: each document is squeezed into a single vector before the model has ever seen your query, so fine-grained interactions between query terms and document terms get lost.

A reranker typically uses a cross-encoder architecture. This thing is a different beast. It doesn’t pre-compute anything. It takes your query and a single document text, concatenates them into one input sequence, and feeds that sequence through the transformer model. This allows the model’s attention mechanism to perform deep, token-level comparisons between the query and the document. It’s typically far more accurate than comparing two independent embeddings, but it’s also slow because it needs a full forward pass for every single (query, document) pair you give it. You’d never want to run this over your entire corpus for every query. But running it on the top 20 or 50 results from your fast first-stage retriever? That’s a game-changer.
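To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. The per-pair latency and the corpus size are made-up assumptions chosen only to show the scaling, not benchmarks; measure your own model and hardware before trusting any figure.

```python
def cross_encoder_seconds(num_pairs: int, ms_per_pair: float) -> float:
    """Estimated time to score num_pairs (query, document) pairs one after another."""
    return num_pairs * ms_per_pair / 1000.0

# Hypothetical figures, chosen only to illustrate the scaling.
MS_PER_PAIR = 5.0        # assumed cost of one cross-encoder forward pass
CORPUS_SIZE = 1_000_000  # assumed number of chunks in the index
TOP_K = 50               # candidates handed over by the first-stage retriever

full_corpus = cross_encoder_seconds(CORPUS_SIZE, MS_PER_PAIR)
top_k_only = cross_encoder_seconds(TOP_K, MS_PER_PAIR)

print(f"Score every chunk:  {full_corpus:,.0f} s per query")
print(f"Score top {TOP_K} only: {top_k_only:.2f} s per query")
```

Under these assumed numbers, scoring the whole corpus costs over an hour per query, while scoring only the top 50 candidates costs a fraction of a second. Batching on a GPU changes the constants, but not the fact that the work grows with the number of pairs.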

The Mechanics of a Cross-Encoder

Let’s get concrete. We’ll use the sentence-transformers library, which provides a fantastic and easy-to-use interface for these models. A popular choice is the ms-marco-MiniLM-L-6-v2 model, a tiny but mighty model trained on the massive MS MARCO dataset specifically for ranking.

from sentence_transformers import CrossEncoder

# Load the model. This will download it the first time.
# Note: This is a CrossEncoder class, not a SentenceTransformer class.
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512)

# Your user's query
query = "best practices for pruning apple trees"

# Let's say these are four documents your initial retriever found.
# One is good, one is okay, and two are junk.
documents = [
    "A detailed guide on the correct time of year and techniques for pruning apple trees to maximize fruit yield and tree health.",  # Relevant
    "The stock price for Apple Inc. (AAPL) has seen significant growth after their latest earnings call.",  # Irrelevant
    "Using sharp shears is important for any pruning job to avoid damaging the tree's branches.",  # Somewhat relevant
    "This recipe for classic apple pie requires six cups of thinly sliced apples and a dash of cinnamon."  # Irrelevant
]

# So we create our list of (query, document) pairs
pairs = [[query, doc] for doc in documents]

# Get the scores for each pair
scores = model.predict(pairs)

# Now, let's zip the scores with the documents and sort them from best to worst
ranked_results = sorted(zip(scores, documents), key=lambda x: x[0], reverse=True)

print("Ranked Results (Highest score first):")
for rank, (score, doc) in enumerate(ranked_results, start=1):
    print(f"\nRank {rank}, Score: {score:.4f}")
    print(f"Text: {doc}")

The output will look something like this (the exact numbers are illustrative and will vary). Notice how the scores are not probabilities between 0 and 1, but raw, unbounded relevance scores (the model’s output logits), where a higher number means a better match.

Ranked Results (Highest score first):

Rank 1, Score: 7.8871
Text: A detailed guide on the correct time of year and techniques for pruning apple trees to maximize fruit yield and tree health.

Rank 2, Score: 5.2342
Text: Using sharp shears is important for any pruning job to avoid damaging the tree's branches.

Rank 3, Score: -9.1243
Text: The stock price for Apple Inc. (AAPL) has seen significant growth after their latest earnings call.

Rank 4, Score: -10.9981
Text: This recipe for classic apple pie requires six cups of thinly sliced apples and a dash of cinnamon.

Boom. In this example, the reranker has reordered the list exactly as we’d hope, pushing the financial and culinary disasters to the bottom with strongly negative scores. The somewhat relevant document about shears gets a middling positive score, and the most relevant document shoots to the top with a very high score.
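If you’d rather work with values between 0 and 1 (say, to apply a cutoff threshold), one common option is to pass the raw scores through a logistic sigmoid. The sketch below uses the illustrative scores from the sample output above. Sigmoid is monotonic, so the ranking doesn’t change; whether the result behaves like a well-calibrated probability is an assumption you would need to check on your own data.

```python
import numpy as np

# Illustrative raw scores copied from the sample output above.
raw_scores = np.array([7.8871, 5.2342, -9.1243, -10.9981])

def sigmoid(x: np.ndarray) -> np.ndarray:
    """Map raw scores (logits) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

normalized = sigmoid(raw_scores)

for raw, norm in zip(raw_scores, normalized):
    print(f"raw={raw:8.4f}  sigmoid={norm:.6f}")
```

The two relevant documents land very close to 1 and the two irrelevant ones very close to 0, which makes a simple threshold (for example, dropping anything below 0.5) easy to reason about.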

Best Practices and Pitfalls

Don’t Go Overboard: The biggest mistake is reranking too many documents. The cost grows roughly linearly with the number of candidates: reranking 100 documents takes about 10x longer than reranking 10. Your initial retriever should be good enough that the truly relevant stuff is in the top k (e.g., 20-50). Reranking 200 documents is usually a waste of time and money. Find the sweet spot by measuring latency and ranking quality on your own data.

Mind the Context Window: Cross-encoders have a maximum sequence length (often 512 tokens), and that budget is shared by the query and the document together. If your document chunks are long, the input gets truncated, potentially cutting off the most relevant part. Your chunking strategy is paramount. If you have long documents, consider smaller chunks with some overlap to avoid losing context.
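As a minimal sketch of overlapping chunks, the hypothetical helper below (chunk_words is not part of any library) splits text into sliding windows of whitespace-separated words. Words are only a stand-in for model tokens: a single word can become several tokens, so in practice you would pick a chunk size with headroom for subword splitting and for the query.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks

# Tiny demonstration with a toy "document" of 10 numbered words.
toy_document = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(toy_document, chunk_size=4, overlap=1):
    print(chunk)
```

Each window repeats the last word of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.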

The Goldilocks Zone of Reranking: You need to decide where in your pipeline reranking happens. The most common pattern is:

Retrieve: Use your vector store to get k candidates (e.g., 50).
Rerank: Use the cross-encoder to score and reorder all 50.
Select: Take the top n reranked results (e.g., 5) to feed into your LLM for answer synthesis.

This ensures the final LLM only sees the crème de la crème, drastically reducing the chance of it getting distracted by irrelevant information and hallucinating an answer about stock tips based on a pie recipe.
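The rerank and select steps can be sketched as one small function. To keep the example runnable without downloading a model, it takes the scorer as a parameter and demonstrates it with a toy word-overlap scorer, which is emphatically not a cross-encoder. The names rerank_and_select and toy_overlap_score are hypothetical; in a real pipeline you would pass in something like the predict method of the CrossEncoder loaded earlier.

```python
from typing import Callable, Sequence

def rerank_and_select(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[tuple[float, str]]:
    """Score every (query, candidate) pair, then keep the top_n best-scoring candidates."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda item: item[0], reverse=True)
    return [(float(score), doc) for score, doc in ranked[:top_n]]

def toy_overlap_score(pairs):
    """Stand-in scorer (NOT a cross-encoder): counts words shared by query and document."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]

candidates = [  # pretend these came back from the vector store
    "apple pie recipe with cinnamon",
    "pruning apple trees in late winter",
    "best practices for pruning fruit trees",
]
top = rerank_and_select(
    "best practices for pruning apple trees", candidates, toy_overlap_score, top_n=2
)
for score, doc in top:
    print(f"{score:.1f}  {doc}")
```

Keeping the scorer behind a plain callable also makes it easy to swap rerankers, or to compare a reranked pipeline against an unreranked baseline, without touching the rest of the code.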

It’s Not a Silver Bullet: A reranker can’t find information that isn’t there. If your initial retriever completely whiffed and didn’t bring back any relevant documents, the reranker can only tell you which of the irrelevant documents is the least bad. Garbage in, garbage out still applies. Your first priority is always to make your initial retrieval as strong as possible; the reranker is there to polish the results, not save a doomed search.