26.5 Query Engines and Retrievers

Right, so you’ve loaded your data into LlamaIndex. Congratulations, you’ve performed the digital equivalent of moving boxes into a new house. Now comes the fun part: actually finding anything. This is where Query Engines and Retrievers come in—they’re your movers who, instead of just pointing at the pile of boxes, actually open them, find your favorite coffee mug, and hand it to you. But some movers are better than others, and you need to know which to hire for the job.

At a high level, think of it this way: a Retriever is the part that sifts through your data and finds the relevant bits (the “nodes”). It’s your search function. A Query Engine, then, takes those retrieved nodes and sends them to the LLM to synthesize a coherent, human-readable answer. A query engine uses a retriever. It’s a simple but crucial distinction.

The Retriever: Your Digital Bloodhound

The retriever’s one job is to fetch. You give it a query string, and it returns a list of NodeWithScore objects, which are essentially your text snippets with a relevance score attached. The most common type is the VectorIndexRetriever, which uses cosine similarity in a vector space to find text that’s semantically close to your query. It’s shockingly effective, until it isn’t.

from llama_index.core import VectorStoreIndex

# Build the index from `documents`, assumed to be the list of Document
# objects you loaded earlier
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=3)

# Now use the retriever directly
nodes = retriever.retrieve("What was the company's revenue in 2023?")
for node in nodes:
    print(f"Score: {node.score:.3f} | Text: {node.text[:100]}...")
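
For intuition about what that retrieve call is doing, here’s a toy sketch of cosine-similarity top-k in plain Python. It is not LlamaIndex’s implementation: the three-dimensional “embeddings” are made up, and cosine_similarity and top_k are hypothetical helper names. The idea is the same, though: score every chunk against the query vector, sort, keep the best k.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions.
chunks = [
    ("2023 revenue summary", [0.9, 0.1, 0.0]),
    ("Office party photos", [0.0, 0.2, 0.9]),
    ("2022 revenue summary", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # stand-in for the embedded question

results = top_k(query_vec, chunks, k=2)
for score, text in results:
    print(f"Score: {score:.3f} | Text: {text}")
```

Note that the 2022 summary lands in the top two right behind the 2023 one. Semantically it is close to the question, even though it is the wrong year.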

Why similarity_top_k=3? LlamaIndex’s default is 2, which is often a little thin, but swinging the other way and blindly fetching ten or twenty nodes is a great way to waste money and confuse the LLM with irrelevant context. Start small. You can always increase it later if the answers are too thin. The most common pitfall here is assuming the top-k nodes are perfectly relevant. Vector search is fantastic for semantic similarity, but it can be hilariously bad at precise, keyword-driven lookup (an exact product code or invoice number, say). For that, you might want a hybrid approach that combines vector search with keyword search.
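
One common way to build a hybrid is to run a vector search and a keyword search separately and then merge the two ranked lists. The sketch below uses reciprocal rank fusion (RRF), a technique swapped in here for illustration and not something the code above does. The function name, the document ids, and the two input rankings are all hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists of document ids into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    with rank starting at 1. k=60 is a commonly used smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from two separate searches over the same corpus
vector_ranking = ["doc_a", "doc_b", "doc_c"]   # semantic similarity order
keyword_ranking = ["doc_c", "doc_a", "doc_d"]  # keyword match order

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Documents that both searches like float to the top, and a document only one search found (doc_d here) still gets a seat at the table.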

The Query Engine: The Synthesizer

The query engine takes the retriever’s work and completes the job. It stuffs those retrieved nodes into a prompt and sends the whole mess to the LLM with instructions like, “Here’s some context, now answer this question.” The default RetrieverQueryEngine is your workhorse.

from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

# Create a response synthesizer
response_synthesizer = get_response_synthesizer(response_mode="compact")

# Assemble the query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer
)

# Now get a real, synthesized answer
response = query_engine.query("What was the company's revenue in 2023?")
print(str(response))

Notice the response_mode="compact". This is a critical choice. The “compact” mode packs as many retrieved chunks as will fit into each prompt, so it answers in as few LLM calls as the token limit allows, which is generally what you want. The other modes are more involved: “refine” makes a separate LLM call for each retrieved chunk, and “tree_summarize” summarizes chunks recursively. They are useful for truly massive documents, but they burn through API credits faster because they require more LLM calls. My advice? Use “compact” until you have a specific reason not to.
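
To see why the mode matters for your bill, here’s a toy call counter. It is a rough model under stated assumptions, not LlamaIndex’s code: word count stands in for token count, pack_chunks and count_llm_calls are hypothetical helpers, and the chunk texts are made up. It models “refine” as one call per chunk and “compact” as one call per packed prompt.

```python
def pack_chunks(chunks, max_words):
    """Greedily concatenate chunks so each packed prompt stays within max_words.

    A single chunk longer than max_words still gets its own prompt;
    this toy does not split chunks.
    """
    packed, current, current_len = [], [], 0
    for chunk in chunks:
        n = len(chunk.split())
        if current and current_len + n > max_words:
            packed.append(" ".join(current))
            current, current_len = [], 0
        current.append(chunk)
        current_len += n
    if current:
        packed.append(" ".join(current))
    return packed

def count_llm_calls(chunks, mode, max_words=8):
    """Number of LLM calls each strategy makes in this toy model."""
    if mode == "refine":
        return len(chunks)  # one call per retrieved chunk
    if mode == "compact":
        return len(pack_chunks(chunks, max_words))  # one call per packed prompt
    raise ValueError(f"mode not modeled in this sketch: {mode}")

chunks = [
    "revenue rose in 2023",
    "costs fell slightly",
    "headcount stayed flat",
    "outlook remains positive",
]
refine_calls = count_llm_calls(chunks, "refine")
compact_calls = count_llm_calls(chunks, "compact")
print(f"refine: {refine_calls} calls, compact: {compact_calls} calls")
```

With a realistic context window, a handful of short chunks will usually pack into a single prompt, which is where “compact” earns its keep.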

When Your Bloodhound Gets Distracted: Postprocessors

Sometimes your retriever fetches pure garbage. A node might have a high similarity score but be completely irrelevant to the specific question. This is where node postprocessors save you. You can chain them after your retriever to rerank, filter, or deduplicate results once the initial vector search is done.

A classic one is the SimilarityPostprocessor, which filters out nodes with a similarity score below a threshold. But my personal favorite is the LLMRerank postprocessor. It asks an LLM to score each retrieved node for true relevance to the query; it uses whichever model you configure, and a smaller, cheaper one (like GPT-3.5) is a common choice. The extra LLM calls make retrieval slower and more expensive than plain vector search, but reranking can dramatically improve answer quality by cutting the cruft.

from llama_index.core.postprocessor import LLMRerank

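# Fetch a wider candidate set so the reranker has something to cut
wide_retriever = index.as_retriever(similarity_top_k=10)

# Create the postprocessor: score candidates in batches of 3, keep the best 3
rerank = LLMRerank(top_n=3, choice_batch_size=3)

# Pass it to the query engine; postprocessors run after retrieval
# and before synthesis
custom_query_engine = RetrieverQueryEngine(
    retriever=wide_retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[rerank],  # This is the key
)

response = custom_query_engine.query("What was the company's revenue in 2023?")
print(str(response))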

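The top_n here tells the postprocessor how many nodes survive: it takes whatever the retriever hands over, reranks all of it for true relevance, and passes only the top 3 to the synthesizer. That only pays off when the retriever fetches more candidates than top_n, so give the reranker a wide initial fetch (similarity_top_k=10, say); a retriever capped at 3 leaves it nothing to cut. It’s a best-of-both-worlds approach: the speed of vector search for the initial broad fetch, and the precision of an LLM for the final cut. It’s often the single biggest upgrade you can make to a naive RAG system.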
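
The wide-fetch, filter, rerank, trim flow can be sketched in plain Python. Everything here is an assumption made for illustration: the nodes are (text, score) tuples with invented text and scores, similarity_cutoff and rerank are hypothetical helpers, and word_overlap is a crude stand-in for the relevance judgment an LLM would make.

```python
def similarity_cutoff(nodes, cutoff):
    """Drop (text, score) pairs whose retrieval score is below the cutoff."""
    return [(text, score) for text, score in nodes if score >= cutoff]

def rerank(nodes, query, scorer, top_n):
    """Rescore every node against the query and keep the top_n, best first."""
    rescored = [(text, scorer(query, text)) for text, _ in nodes]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_n]

def word_overlap(query, text):
    """Stand-in for an LLM relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

# Made-up retriever output: note the top vector hit is not the best answer
retrieved = [
    ("Our mission and company values", 0.82),
    ("Total revenue in 2023 was up", 0.80),
    ("Revenue in 2022 by region", 0.78),
    ("Cafeteria menu for March", 0.40),
]
query = "company revenue in 2023"

kept = similarity_cutoff(retrieved, cutoff=0.5)       # cheap filter first
final = rerank(kept, query, word_overlap, top_n=2)    # then the careful pass
for text, score in final:
    print(f"{score} | {text}")
```

The order of the chain matters: run the cheap filter first so the expensive reranker sees fewer candidates.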

The takeaway? Don’t just use the default index.as_query_engine() and call it a day. That’s like buying a sports car and never taking it out of first gear. Think about your retriever’s configuration, your synthesizer’s mode, and for the love of all that is good, use postprocessors. They are the difference between a system that kinda works and one that works brilliantly.