24.8 Advanced RAG: HyDE, Multi-Query, and RAPTOR

Right, so you’ve got the basics of RAG down. You chuck a query at a retriever, it finds some relevant docs in a vector store, and you hand those to an LLM to synthesize an answer. It’s a game-changer, but let’s be honest, the vanilla version can be a bit… dumb. It’s a glorified “CTRL-F” on steroids. The retriever ranks chunks by how close their embeddings sit to the embedding of your query, and a short question often doesn’t land anywhere near the long, jargon-heavy passage that answers it. If your query is phrased very differently from your documents? Tough luck. If the answer requires synthesizing information from ten different places? Goodnight.

This is where we level up. We stop just retrieving and start orchestrating. We’re going to make the system think harder before it even starts its search. Strap in.

HyDE: The Art of the Imaginary Document

Here’s the problem: your user asks, “What’s the best way to optimize a neural network for sentiment analysis on short texts?” Your vector store is filled with brilliant papers titled “Stochastic Gradient Descent Optimization Techniques” and “BERT for Micro-Expression Classification.” The question and the papers share almost no vocabulary, and a one-line question embeds very differently from a dense paragraph of a paper. The retriever returns garbage.

HyDE (Hypothetical Document Embeddings) fixes this with a brilliantly simple trick. Before you even touch your vector database, you ask an LLM to imagine what a perfect answer to the query would look like. You’re basically having the LLM write the ideal document you wish you had.

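You then take that hypothetical document, generated purely from the LLM’s parametric knowledge, and use it as the query for your vector store. A made-up answer looks a lot more like your real documents (similar length, vocabulary, and register) than a one-line question does, so its embedding tends to land closer to them. You’re not searching for what the user asked; you’re searching for what the user meant. It’s a semantic bridge.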

from openai import OpenAI
import os

# Initialize one client; it is used for both chat completions and embeddings
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

def hyde_retrieval(user_query, embedding_model="text-embedding-3-small", chat_model="gpt-4"):
    """Return (hypothetical_document, hyde_embedding) for a HyDE-style lookup."""
    # Step 1: Generate the hypothetical document
    prompt = f"""You are a helpful AI. Based on your knowledge, generate a detailed, hypothetical paragraph that would be the ideal answer to the following query. The paragraph should be informative and written in a professional tone.

Query: {user_query}
"""
    response = client.chat.completions.create(
        model=chat_model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300
    )
    hypothetical_document = response.choices[0].message.content
    print(f"HyDE Document: {hypothetical_document}")

    # Step 2: Embed the hypothetical document, NOT the original query
    hyde_embedding = client.embeddings.create(
        input=hypothetical_document,
        model=embedding_model
    ).data[0].embedding

    # Step 3: Use this new embedding to query your vector database (e.g., Pinecone, Chroma)
    # The exact call depends on your vector DB client; this line is a placeholder.
    # results = vector_db_index.query(hyde_embedding, top_k=5)
    # return results

    # Until Step 3 is wired up, hand back what we have so the caller can run the search
    return hypothetical_document, hyde_embedding

# Example usage (makes two API calls, so it needs OPENAI_API_KEY to be set)
if __name__ == "__main__":
    user_question = "What's the best way to optimize a neural network for sentiment analysis on short texts?"
    hypothetical_document, hyde_embedding = hyde_retrieval(user_question)
    print(f"Embedding dimensions: {len(hyde_embedding)}")
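
Step 3 is left as a placeholder in the listing, because the call depends on which vector database you use. To make the idea concrete, here is a minimal in-memory sketch of what that lookup does: rank stored document embeddings by cosine similarity to the HyDE embedding. Everything in it is an assumption for illustration: the names `cosine_similarity` and `top_k_by_cosine`, the document IDs, and the three-dimensional toy vectors standing in for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_by_cosine(query_embedding, doc_embeddings, k=5):
    """doc_embeddings maps doc_id -> embedding. Returns [(doc_id, score)], best first."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, embedding))
        for doc_id, embedding in doc_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# ASSUMPTION: toy 3-dimensional vectors standing in for real embeddings
toy_index = {
    "sgd_paper": [0.9, 0.1, 0.0],
    "bert_paper": [0.7, 0.7, 0.1],
    "cooking_blog": [0.0, 0.1, 0.9],
}
toy_hyde_embedding = [0.8, 0.5, 0.0]

for doc_id, score in top_k_by_cosine(toy_hyde_embedding, toy_index, k=2):
    print(f"{doc_id}: {score:.3f}")
```

With these toy vectors, `bert_paper` ranks first and `sgd_paper` second. A real vector database does the same ranking, typically with an approximate nearest-neighbor index so it scales past a few thousand documents.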

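Pitfall: The quality of your HyDE document matters a lot. It doesn’t have to be factually right (the HyDE authors argue that the embedding step filters out many of the made-up details), but it does have to be on topic and written in the style of your corpus. A vague or off-topic document will lead you astray, so use an LLM that knows the domain reasonably well. This also adds latency and cost (an extra LLM call per query), so you have to decide if the accuracy boost is worth it. For conceptual queries phrased nothing like your documents, it often is; for queries that already match your corpus, measure before you commit.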

Multi-Query: For When One Question Isn’t Enough

The designers of your average retriever clearly never had a conversation with a curious human. A single question can have five angles. “How does RAG work?” could mean its architecture, its benefits, its drawbacks, how to implement it, or who invented it.

Multi-Query automates this line of thinking. You use an LLM to take the user’s original query and generate multiple (typically 3-5) diverse interpretations or rephrasings of it. You then retrieve documents for each of these queries and aggregate the results. This casts a much wider net, making it far less likely that you miss a crucial document just because it was phrased differently from the question.

import re

def generate_queries(original_query, num_queries=3):
    prompt = f"""You are a helpful research assistant. Your task is to generate {num_queries} different versions of the given query to improve semantic search results. Make each version distinct, using synonyms and varying the phrasing while preserving the core meaning. Return a numbered list with one query per line and no other text.

Original Query: {original_query}
"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # Perfectly adequate for this task
        messages=[{"role": "user", "content": prompt}],
        max_tokens=150
    )
    generated_text = response.choices[0].message.content or ""
    queries = [original_query]  # Always include the original!
    for line in generated_text.splitlines():
        # Strip a leading list marker such as "1.", "2)" or "-", if there is one.
        # Splitting on '. ' would break on queries that contain a full stop themselves.
        query = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if query:
            queries.append(query)
    # Remove duplicates while keeping order, so the original query stays first
    return list(dict.fromkeys(queries))

def multi_query_retrieval(user_query):
    all_retrieved_docs = []
    expanded_queries = generate_queries(user_query)

    for query in expanded_queries:
        print(f"Querying for: {query}")
        # Embed and query your vector DB for each generated query
        # embedding = get_embedding(query)
        # results = vector_db_index.query(embedding, top_k=3)
        # all_retrieved_docs.extend(results['matches'])
        # ... your retrieval logic ...

    # Now, you have a large pool of documents from all queries, with overlaps.
    # The key next step is to de-duplicate them (e.g., by document ID) and then
    # re-rank what is left against the ORIGINAL query to find the most relevant.
    # This is where a cross-encoder re-ranker (like from SentenceTransformers) becomes essential.
    # reranked_docs = my_reranker(user_query, all_retrieved_docs)
    # return reranked_docs

    # Stays empty until the retrieval placeholders above are filled in
    return all_retrieved_docs

multi_query_retrieval("How does RAG work?")

Best Practice: De-duplicate and re-rank the aggregated results. Several of your query variants will pull back the same chunk, and others will pull back marginally relevant docs. Drop the duplicates first, then let a re-ranker model score each remaining doc against the original query, pushing the truly best ones to the top. Without this step, you’re just dumping a messy pile of context onto your poor LLM.
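
The de-duplicate-then-re-rank step can be sketched without any model at all. The version below is a sketch under stated assumptions: documents are dicts with hypothetical `id` and `text` keys, and `toy_overlap_score` is a deliberately crude word-overlap stand-in for a real re-ranker. In practice you would pass in a function that wraps a cross-encoder (for example, the `CrossEncoder` class in SentenceTransformers) in place of the toy scorer.

```python
def dedupe_by_id(docs):
    """Keep the first occurrence of each document ID, preserving order."""
    seen = {}
    for doc in docs:
        seen.setdefault(doc["id"], doc)
    return list(seen.values())

def rerank(query, docs, score_fn, top_n=3):
    """Score every doc against the ORIGINAL query and return the top_n, best first."""
    scored = [(score_fn(query, doc["text"]), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def toy_overlap_score(query, text):
    """ASSUMPTION: stand-in scorer, the fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / max(len(query_words), 1)

# Pooled results from several query variants; "a" was retrieved twice
pooled = [
    {"id": "a", "text": "RAG retrieves documents and feeds them to an LLM"},
    {"id": "b", "text": "Baking sourdough bread at home"},
    {"id": "a", "text": "RAG retrieves documents and feeds them to an LLM"},
    {"id": "c", "text": "How RAG systems work under the hood"},
]

unique_docs = dedupe_by_id(pooled)
best = rerank("how does RAG work", unique_docs, toy_overlap_score, top_n=2)
print([doc["id"] for doc in best])
```

Keeping the scorer as a parameter means the pooling logic never changes when you swap the toy function for a real model.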

RAPTOR: Thinking in Hierarchies

This is the big one. Standard RAG is flat. Your document is a chunk of text, end of story. But knowledge is hierarchical! A research paper has an abstract, an introduction, sections, subsections, and a conclusion. The key insight might be buried in a subsection, but your stupid retriever is just looking at a flat chunk that contains the words “related work.”

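RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) is a beast that builds a tree of summaries. You chunk your document and embed the chunks. You cluster chunks whose embeddings are similar (not just chunks that happen to sit next to each other on the page) and have an LLM summarize each cluster. Then you embed those summaries, cluster them, and summarize again. You end up with a multi-level tree where the leaves are the original chunks and the top is a high-level summary of the entire document.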
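
To show the shape of that recursion, here is a minimal sketch of the bottom-up build. It is not the RAPTOR implementation, and it makes two loud simplifications: it groups neighboring nodes in fixed-size batches instead of clustering by embedding, and `toy_summarize` is a placeholder that just stitches together the first few words of each text where a real system would call an LLM. The `Node` class and all function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    level: int                      # 0 = original chunk, higher = more abstract
    children: list = field(default_factory=list)

def toy_summarize(texts):
    """ASSUMPTION: stand-in for an LLM summary; keeps the first three words of each text."""
    return " / ".join(" ".join(t.split()[:3]) for t in texts)

def build_summary_tree(chunks, group_size=2, summarize=toy_summarize):
    """Summarize groups of nodes, layer by layer, until one root remains."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if group_size < 2:
        raise ValueError("group_size must be at least 2 or the tree never shrinks")
    layer = [Node(text=chunk, level=0) for chunk in chunks]
    level = 0
    while len(layer) > 1:
        level += 1
        next_layer = []
        # ASSUMPTION: fixed-size neighbor groups; a faithful build would cluster embeddings here
        for start in range(0, len(layer), group_size):
            group = layer[start:start + group_size]
            summary = summarize([node.text for node in group])
            next_layer.append(Node(text=summary, level=level, children=group))
        layer = next_layer
    return layer[0]

def leaves(node):
    """Collect the original chunks under a node, left to right."""
    if not node.children:
        return [node.text]
    return [text for child in node.children for text in leaves(child)]

chunks = [
    "Gradient descent updates weights iteratively",
    "Learning rate schedules control step size",
    "Tokenizers split text into subword units",
    "Embeddings map tokens to dense vectors",
]
root = build_summary_tree(chunks)
print(root.level, len(root.children))
print(root.text)
```

Four chunks with a group size of two give two summaries at level 1 and a single root at level 2. In a real build, each node would also store an embedding of its text so that it can be searched.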

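When a query comes in, you start at the root. Score the summaries at that level against the query, keep the best few, and traverse down into their children. You keep going until you hit the leaf nodes (the original chunks). The RAPTOR paper also describes a simpler “collapsed tree” mode that ignores the hierarchy at query time and searches every node, summaries and leaves alike, as one flat pool; the authors report that this worked better in their experiments. Either way, broad questions can be answered from high-level summaries and narrow ones from specific chunks, without getting lost in the weeds. It’s computationally expensive to set up, since every summary is an LLM call, but it’s arguably one of the most sophisticated retrieval methods out there right now for long documents. Implementing it fully is a blog post in itself, but the core idea is to structure your data not as a flat list, but as a tree of embeddings.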