Right, so you’ve got BM25, the grizzled veteran of keyword search, and you’ve got your shiny new dense retrieval model that’s all about semantic meaning. They’re both good at their jobs, but they’re also hilariously bad at each other’s jobs. BM25 will completely whiff on a query for “canine companion” if your document only says “dog.” Your dense retriever, on the other hand, might decide that a document about the planet Saturn is highly relevant to a query for “best car for a family” because, hey, Saturn made a car. It’s a mess.

This is why we don’t pick one. We use both and let them vote. This is Hybrid Search, and the specific voting system we’re going to talk about is Reciprocal Rank Fusion (RRF). It’s stupidly simple, remarkably effective, and will make you look like a wizard.

Why Fusion Beats Picking a Winner

Think of it like asking two experts for advice. One is a literal-minded linguist (BM25), the other is a big-picture conceptual thinker (dense retriever). If you only listen to one, you get a biased view. But if you ask both for their top 10 suggestions and then combine those lists, you get the best of both worlds: precise keyword matches and deep semantic understanding.

The naive way to do this would be to just average their scores. This is a terrible idea. BM25 scores are unbounded and shift with query length and corpus statistics, while your dense model’s similarity scores are probably between 0 and 1. They’re on completely different scales. You can’t average them without some gnarly normalization, which is its own can of worms. RRF cuts through all this nonsense by ignoring the scores entirely and just looking at the ranks. A document ranked #1 by BM25 and #100 by the dense retriever is probably more relevant than a document ranked #50 by both.
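To see the scale problem in action, here’s a minimal sketch with two documents. The scores are made-up numbers for illustration, not output from any real retriever, and the doc IDs and the to_ranks helper are hypothetical.

```python
# Hypothetical raw scores for two documents (made-up numbers for illustration).
bm25_scores = {"doc_a": 24.0, "doc_b": 21.5}   # unbounded, corpus-dependent
dense_scores = {"doc_a": 0.31, "doc_b": 0.92}  # similarity, roughly 0 to 1

# Naive fusion: average the raw scores from both systems.
naive_avg = {d: (bm25_scores[d] + dense_scores[d]) / 2 for d in bm25_scores}
naive_winner = max(naive_avg, key=naive_avg.get)


def to_ranks(scores):
    """Turn a {doc: score} dict into {doc: rank}, where rank 1 is the best."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc: rank for rank, doc in enumerate(ordered, start=1)}


bm25_ranks = to_ranks(bm25_scores)
dense_ranks = to_ranks(dense_scores)

print("Naive average winner:", naive_winner)
print("BM25 ranks:", bm25_ranks)
print("Dense ranks:", dense_ranks)
```

The average crowns doc_a even though the dense retriever strongly prefers doc_b, because the BM25 numbers are simply bigger. Once you convert to ranks, each system gets one equally loud vote.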

How Reciprocal Rank Fusion Works (The Math, It’s Easy)

The genius of RRF is its simplicity. For each document, you calculate a single RRF score based on its rank in each individual list. The formula is:

score = sum( 1 / (rank + k) )

For each list the document appears in, you take its rank in that list (counting from 1), add a constant k, take the reciprocal, and sum that value across all lists. That’s it. k is a smoothing constant that keeps the very top ranks from dominating the sum. It’s usually set to somewhere between 50 and 100, with 60 as the conventional default, and it controls how much weight we give to lower ranks. A higher k value flattens the curve, which makes the tail of the list relatively more important.

Let’s say we have two lists: one from BM25, one from a dense retriever. Document A is rank 1 in BM25 and rank 3 in the dense list. Document B is rank 2 in BM25 and rank 1 in the dense list. With k=60:

  • Document A RRF score = (1/(1+60)) + (1/(3+60)) = (1/61) + (1/63) ≈ 0.0164 + 0.0159 = 0.0323
  • Document B RRF score = (1/(2+60)) + (1/(1+60)) = (1/62) + (1/61) ≈ 0.0161 + 0.0164 = 0.0325

Document B wins! It had a stronger combined rank presence. Notice that we never once cared what their actual BM25 or cosine similarity scores were.
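If you’d rather not trust hand arithmetic, the same calculation fits in a few lines. This is just a sketch: rrf_score is a hypothetical helper name, and it assumes ranks are 1-based.

```python
def rrf_score(ranks, k=60):
    """Sum 1 / (rank + k) over the 1-based ranks a document holds in each list."""
    return sum(1.0 / (rank + k) for rank in ranks)


score_a = rrf_score([1, 3])  # Document A: rank 1 in BM25, rank 3 in dense
score_b = rrf_score([2, 1])  # Document B: rank 2 in BM25, rank 1 in dense

print(f"Document A: {score_a:.4f}")  # 0.0323
print(f"Document B: {score_b:.4f}")  # 0.0325
```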

A Practical Implementation with Python

Let’s make this concrete. Assume you have two lists of document IDs from your two retrieval systems. Here’s how you’d fuse them.

# Define your two ranked lists (these would come from your separate retrieval systems)
list_bm25 = ["doc_3", "doc_1", "doc_4", "doc_2"]  # BM25's top 4, best first
list_dense = ["doc_2", "doc_3", "doc_1", "doc_5"] # Dense retriever's top 4, best first

k = 60  # Standard smoothing constant

# Create dictionaries to store the ranks for each document in each list
ranks_bm25 = {doc_id: rank for rank, doc_id in enumerate(list_bm25, start=1)}
ranks_dense = {doc_id: rank for rank, doc_id in enumerate(list_dense, start=1)}

# Get the union of all unique documents from both lists
all_docs = set(list_bm25 + list_dense)

# Calculate the RRF score for each document
rrf_scores = {}
for doc in all_docs:
    score = 0.0
    # If the doc is in the BM25 list, add its reciprocal rank
    if doc in ranks_bm25:
        score += 1 / (ranks_bm25[doc] + k)
    # If the doc is in the dense list, add its reciprocal rank
    if doc in ranks_dense:
        score += 1 / (ranks_dense[doc] + k)
    rrf_scores[doc] = score

# Sort the documents by their RRF score, descending
sorted_docs = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

print("Final fused ranking:")
for rank, (doc_id, score) in enumerate(sorted_docs, start=1):
    print(f"Rank {rank}: {doc_id} (RRF score: {score:.4f})")

This would output something like:

Final fused ranking:
Rank 1: doc_3 (RRF score: 0.0325)
Rank 2: doc_2 (RRF score: 0.0320)
Rank 3: doc_1 (RRF score: 0.0320)
Rank 4: doc_4 (RRF score: 0.0159)
Rank 5: doc_5 (RRF score: 0.0156)

Best Practices and Pitfalls

First, the k value. Don’t overthink it. 60 is a great starting point. With k=60 the contribution falls off gently: a rank-1 hit adds 1/61 ≈ 0.0164 and a rank-100 hit adds 1/160 ≈ 0.0063, so deep results still count, just a lot less than the top ones. If you set k very low (e.g., 1), the top few ranks swamp everything else (rank 1 adds 1/2, rank 100 adds 1/101), which can make the fusion too brittle.
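A quick way to get a feel for k is to compare how much a rank-1 hit is worth relative to a rank-100 hit. The sketch below does that for a few values. The k values are picked for illustration only, and contribution is a hypothetical helper.

```python
def contribution(rank, k):
    """The amount a single list adds to the RRF score for a doc at this rank."""
    return 1.0 / (rank + k)


# Ratio of a rank-1 contribution to a rank-100 contribution for each k.
ratios = {k: contribution(1, k) / contribution(100, k) for k in (1, 60, 1000)}

for k, ratio in ratios.items():
    print(f"k={k}: rank 1 is worth {ratio:.1f}x rank 100")
```

With k=1 the top spot is worth about 50 times a rank-100 spot, with k=60 about 2.6 times, and with k=1000 the two are nearly equal. That’s the whole tuning story: low k is winner-takes-all, high k is closer to counting appearances.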

Second, how deep do you go? How many results should you pull from each system before fusing? This is crucial. If you only get the top 10 from BM25 and the top 10 from your dense retriever, the fused list can only contain documents from that pool. If a critical document is ranked #11 in the dense list and BM25’s top 10 missed it too, it’s lost forever. My advice? Pull a deep candidate set from each model, something like the top 100 or 200. RRF is cheap to compute, so the overhead is negligible, and it saves you from missing crucial results that were just outside the arbitrary cutoff of a single model.
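Here’s a toy version of the cutoff problem. It’s a sketch under assumptions: rrf_fuse and its depth parameter are hypothetical names, and the doc IDs are invented. Document d3 sits at rank 3 in both lists, so a depth of 2 never sees it.

```python
def rrf_fuse(ranked_lists, k=60, depth=100):
    """Fuse ranked lists of doc IDs with RRF, using only the top `depth` of each."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked[:depth], start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


bm25 = ["d1", "d2", "d3", "d4"]   # best first
dense = ["d5", "d6", "d3", "d1"]  # best first

shallow = rrf_fuse([bm25, dense], depth=2)
deep = rrf_fuse([bm25, dense], depth=4)

print("d3 in shallow fusion:", "d3" in shallow)  # False
print("Top 2 of deep fusion:", deep[:2])         # ['d1', 'd3']
```

With the shallow cutoff, d3 never enters the pool. With the deeper one, it lands in second place because both systems agree on it, which is exactly the kind of document fusion is supposed to surface.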

The biggest pitfall is assuming this is a magic bullet. It’s not. If your BM25 strategy is broken or your dense embeddings are poorly tuned, RRF just gives you a more robustly bad result. It combines the strengths of both models, but it also combines their weaknesses. You still need to get the fundamentals right. But once you do, RRF is the final piece that makes the whole system sing, ensuring you catch both the literal matches and the intelligent semantic ones. It’s the mediator between our two bickering expert friends, and it gets the best possible answer out of them.