26.6 Response Synthesizers: Compact, Refine, Tree Summarize
Alright, let’s talk about the part of LlamaIndex that actually gets the words on the page: the Response Synthesizer. You’ve done the hard part—you’ve ingested a mountain of data, chunked it up, indexed it, and retrieved the most relevant nodes with a query. Now what? You don’t just want to shove a pile of raw text nodes at the LLM and say “good luck.” That’s like handing a brilliant chef a bin of pre-chopped ingredients without a recipe. The synthesizer is the recipe. It’s the strategy for combining those retrieved “ingredients” (your text nodes) into a coherent, final answer.
The default choice is often fine, but knowing your options here is the difference between a passable answer and a brilliant one. You have three main modes: compact, refine, and tree_summarize. They represent a classic trade-off: cost (number of LLM calls) vs. quality (coherence and completeness of the final answer).
The Default: compact Mode (The Pragmatist)
This is the workhorse and the default for a reason. It’s smart about balancing cost and quality. Here’s the deal: LLMs have a context window limit. If you retrieve 10 nodes that, when combined, exceed that limit, you can’t just send them all. compact handles this by stuffing as many nodes as possible into a single prompt, up to the context limit. If everything fits, that’s one LLM call and you’re done. If there are leftover nodes, it packs the next batch into a second prompt and asks the LLM to refine the answer it already has, and so on until the batches run out. In other words, compact is refine with the chunks concatenated first, which is why it makes far fewer calls.
It’s efficient, but when a retrieval spills past one prompt, the LLM never sees all the raw context in one go. Each later call sees only the running answer plus the next batch, so nuance from an earlier batch can get lost along the way.
from llama_index.core import VectorStoreIndex
from llama_index.core.response_synthesizers import get_response_synthesizer
from llama_index.llms.openai import OpenAI
# Create the LLM first so the synthesizer can be built around it
llm = OpenAI(model="gpt-4-turbo")
# Let's get a synthesizer object we can configure.
# Pass the LLM here: a synthesizer you build yourself does not pick up
# an llm handed to as_query_engine() later on.
synth = get_response_synthesizer(response_mode="compact", llm=llm)
# Assuming `documents` is a list of Document objects you loaded earlier
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(response_synthesizer=synth)
response = query_engine.query("Explain the concept of quantum entanglement.")
print(response)
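To make the "stuffing" step concrete, here is a minimal sketch of greedy packing in plain Python. It is not LlamaIndex's implementation: pack_chunks is a hypothetical helper, and a word count stands in for the real token budget, which the library works out from the model's context window and the prompt template.

```python
from typing import List


def pack_chunks(chunks: List[str], max_words: int) -> List[str]:
    """Greedily merge consecutive chunks into prompts of at most max_words words.

    Hypothetical helper: words are a crude stand-in for tokens. A single chunk
    larger than the budget is kept whole here, whereas a real system would split it.
    """
    packed: List[str] = []
    current: List[str] = []
    used = 0
    for chunk in chunks:
        size = len(chunk.split())
        # Start a new prompt when this chunk would overflow the current one.
        if current and used + size > max_words:
            packed.append("\n\n".join(current))
            current, used = [], 0
        current.append(chunk)
        used += size
    if current:
        packed.append("\n\n".join(current))
    return packed


chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota", "kappa"]
packed = pack_chunks(chunks, max_words=5)
print(len(chunks), "chunks ->", len(packed), "prompts")  # 4 chunks -> 2 prompts
```

Four retrieved chunks collapse into two prompts, so the synthesizer needs two LLM calls instead of four. That saving is the whole point of compacting.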
The Thorough: refine Mode (The Perfectionist)
This is for when you absolutely, positively need the best possible answer and are willing to pay the LLM token tax for it. refine is iterative and meticulous. It takes the first node and generates an initial answer. Then, for each subsequent node, it says to the LLM: “Here’s the answer I have so far. Now, here’s some new context. Refine and improve your existing answer based on this new piece.”
The huge advantage is that every new piece of context gets to directly influence and improve the evolving answer. The disadvantage? A separate LLM call for every single node you retrieved, run strictly one after another. If you retrieved 15 nodes, that’s at least 15 calls, and more if a node is too big for one prompt and has to be split. This gets expensive and slow very quickly, but my goodness, the results can be comprehensive. Use this for critical tasks where accuracy is paramount and you’re not retrieving a huge number of nodes.
# The code is identical, just change the mode. The cost... is not.
synth = get_response_synthesizer(response_mode="refine", llm=llm)
query_engine = index.as_query_engine(response_synthesizer=synth)
# This will likely make multiple LLM calls. You've been warned.
response = query_engine.query("Give me a detailed, point-by-point breakdown of the causes of the 2008 financial crisis.")
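If the refine loop still feels abstract, the sketch below spells it out with a stand-in LLM. Treat it as an illustration under assumptions: refine_answer and fake_llm are hypothetical names, and the prompt wording is invented for the example rather than copied from the library's templates.

```python
from typing import Callable, List


def refine_answer(question: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Answer from the first chunk, then refine once per remaining chunk."""
    answer = ""
    for i, chunk in enumerate(chunks):
        if i == 0:
            # First call: a plain question-answering prompt.
            prompt = f"Context:\n{chunk}\n\nQuestion: {question}\nAnswer:"
        else:
            # Later calls: show the running answer plus one new chunk.
            prompt = (
                f"Question: {question}\n"
                f"Existing answer: {answer}\n"
                f"New context:\n{chunk}\n"
                "Refine the existing answer if the new context helps; "
                "otherwise repeat it."
            )
        answer = llm(prompt)
    return answer


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: records the prompt and returns a numbered draft."""
    calls.append(prompt)
    return f"draft {len(calls)}"


final = refine_answer(
    "What caused the crisis?", ["chunk A", "chunk B", "chunk C"], fake_llm
)
print(final, "after", len(calls), "calls")  # draft 3 after 3 calls
```

Three chunks, three calls, and each call has to wait for the previous draft. That dependency is why refine cannot be parallelized.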
The Hierarchical: tree_summarize Mode (The Scalable Compromise)
This one is clever and often overlooked. Instead of processing nodes sequentially, it builds the answer bottom-up, like a tournament bracket. It packs the retrieved chunks into as many prompts as it needs, asks the LLM to answer from each one, then treats those answers as the new chunks and repeats until a single answer is left. If everything fits in one prompt, that’s a single call.
Why bother? The calls within a level don’t depend on each other, so they can run in parallel (LlamaIndex does this when the synthesizer is created with use_async=True), which can be significantly faster than the strictly sequential refine approach. Every chunk also enters on equal footing instead of being folded into an answer anchored on the first one, which makes this mode excellent for synthesizing a large number of disparate opinions or facts into a unified whole. The number of levels grows logarithmically with the number of chunks. The total call count does not, because the first level alone needs one call per packed prompt, but for large retrievals it is usually far below refine’s one call per node.
# Optional: add use_async=True to run the calls within each level concurrently
synth = get_response_synthesizer(response_mode="tree_summarize", llm=llm)
query_engine = index.as_query_engine(response_synthesizer=synth)
# Great for synthesizing many viewpoints or facts.
response = query_engine.query("Summarize the critical reception of the film 'Avatar' based on these reviews.")
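Here is a small sketch of the bracket idea, again with a stand-in LLM. The names tree_summarize_sketch, fake_llm, and fan_out are assumptions for illustration. In particular, fan_out is a fixed number of chunks per prompt, whereas the real mode packs by token budget, so read this as the shape of the recursion rather than the library's code.

```python
from typing import Callable, List, Tuple


def tree_summarize_sketch(
    chunks: List[str], llm: Callable[[str], str], fan_out: int
) -> Tuple[str, int, int]:
    """Summarize groups of chunks level by level until one answer remains.

    Returns (final_answer, total_llm_calls, sequential_rounds).
    """
    if not chunks:
        raise ValueError("need at least one chunk")
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2 or the tree never shrinks")
    calls = 0
    rounds = 0
    level = list(chunks)
    while True:
        # Calls within one level are independent, so they could run in parallel.
        groups = [level[i:i + fan_out] for i in range(0, len(level), fan_out)]
        level = [llm("\n\n".join(group)) for group in groups]
        calls += len(groups)
        rounds += 1
        if len(level) == 1:
            return level[0], calls, rounds


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: reports how many parts it was given."""
    return f"summary of {len(prompt.split(chr(10) * 2))} parts"


reviews = [f"review {i}" for i in range(9)]
answer, calls, rounds = tree_summarize_sketch(reviews, fake_llm, fan_out=3)
print(f"{calls} calls in {rounds} rounds")  # 4 calls in 2 rounds
```

Nine chunks cost four calls here, but only two rounds have to happen in sequence. The round count, not the call count, is what grows logarithmically.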
Best Practices and Pitfalls
- Don’t Blindly Use refine: Its cost scales linearly. If your query retrieves 50 nodes, you’re making 50 LLM calls. That’s a great way to turn a $20 experiment into a $200 bill. Monitor your retrieval and know what you’re working with.
- compact Can Lose the Plot: When a large retrieval spills across several packed prompts, each later call sees only the running answer plus the next batch, so a crucial detail from an earlier batch can get rewritten away before the final answer. If your answers feel like they’re missing key points, check your retrieval first, then consider switching to refine or tree_summarize.
- The Secret Weapon: simple_summarize: There are more modes than the big three, and one of them is simple_summarize (ResponseMode.SIMPLE_SUMMARIZE). It truncates the retrieved chunks so they all fit into a single prompt and makes exactly one call. It’s cheap and fast… but whatever gets cut by truncation is silently gone. Use it only when you’re confident your retrievals fit comfortably. It’s the “hold my beer” of response modes.
- Synthesizers Are for Query Engines: Remember, this is about the query method. Chat engines and other interfaces might use them differently under the hood. Always know what component you’re actually configuring.
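Before committing to a mode, it can help to run the numbers. The following is a back-of-the-envelope sketch, not a billing tool: estimate_calls is a hypothetical helper, and it assumes every node is the same size, with nodes_per_prompt of them fitting into one prompt. Real counts depend on token lengths and prompt templates.

```python
import math
from typing import Dict


def estimate_calls(num_nodes: int, nodes_per_prompt: int) -> Dict[str, int]:
    """Rough LLM call counts per mode, assuming equally sized nodes."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if nodes_per_prompt < 2:
        raise ValueError("nodes_per_prompt must be at least 2")
    # refine: one call per node.
    refine_calls = num_nodes
    # compact: one call per packed prompt.
    compact_calls = math.ceil(num_nodes / nodes_per_prompt)
    # tree_summarize: one call per packed prompt at every level of the tree.
    tree_calls = 0
    level = num_nodes
    while True:
        level = math.ceil(level / nodes_per_prompt)
        tree_calls += level
        if level == 1:
            break
    return {
        "refine": refine_calls,
        "compact": compact_calls,
        "tree_summarize": tree_calls,
    }


print(estimate_calls(50, 10))
# {'refine': 50, 'compact': 5, 'tree_summarize': 6}
```

Under these assumptions, 50 nodes cost 50 calls with refine but only a handful with the other two, which is a useful sanity check before you point an expensive model at a large retrieval.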
Choose your synthesizer like you’d choose a tool: compact for most jobs, refine for precision craftsmanship, and tree_summarize when you need to wrangle a large, messy set of ideas into something coherent without going bankrupt.