24.2 RAG Architecture: Indexing, Retrieval, and Generation

Right, so you want to build a RAG system. Good choice. It’s the duct tape and WD-40 of the AI world—a shockingly effective way to stop your LLM from confidently hallucinating facts straight out of its own digital derriere. The core idea is gloriously simple: instead of asking the model to pull answers from its static, pre-trained memory (which is like asking a friend for movie trivia they last studied in 2022), you first go find the relevant information in your own trusted data, then shove that context into the prompt. The model’s job shifts from “knowing” to “synthesizing,” which is what it’s actually good at.

The entire RAG pipeline breaks down into three distinct phases, and if you screw up any one of them, the whole elegant contraption falls apart. Let’s walk through them.

The Indexing Phase: Turning Your Data into a Library the LLM Can Actually Browse

Before you can retrieve anything, you need something to retrieve from. This is where you take your raw data (PDFs, docs, wiki pages, whatever) and turn it into a queryable format. The key concept here is the vector embedding. We’re converting text into high-dimensional numerical vectors (think a list of hundreds or thousands of numbers) that represent its semantic meaning. The magic is that sentences with similar meanings will have vectors that are close to each other in this mathematical space. It’s like plotting all your documents on a map where “cat” and “feline” are right next to each other, while “tax law” is in a different country altogether.
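
To make “close to each other” concrete, here is a minimal sketch of cosine similarity, one common way to measure how near two embeddings are. The three-dimensional vectors are hand-made, hypothetical values chosen purely for illustration; real embeddings come from a model and have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (not produced by any real model).
toy_vectors = {
    "cat": [0.90, 0.80, 0.10],
    "feline": [0.85, 0.90, 0.15],
    "tax law": [0.10, 0.05, 0.95],
}

cat_vs_feline = cosine_similarity(toy_vectors["cat"], toy_vectors["feline"])
cat_vs_tax = cosine_similarity(toy_vectors["cat"], toy_vectors["tax law"])

print(f"cat vs feline:  {cat_vs_feline:.3f}")
print(f"cat vs tax law: {cat_vs_tax:.3f}")
```

With these made-up numbers, “cat” and “feline” score far higher than “cat” and “tax law”, which is the map analogy in numeric form.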

Here’s the basic workflow, which you’ll typically run offline to build your knowledge base:

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# 1. Load your data from a source
loader = WebBaseLoader(["https://example.com/my-knowledge-base"])
documents = loader.load()

# 2. Split it into manageable chunks. This is CRITICAL.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Size of each chunk in characters
    chunk_overlap=200     # Avoids splitting ideas awkwardly in half
)
chunks = text_splitter.split_documents(documents)

# 3. Generate embeddings and store them in a vector database
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./my_rag_db"
)
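# With persist_directory set, Chroma writes the collection to disk automatically;
# the langchain_chroma wrapper has no separate persist() call.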

Why the chunking? Because you can’t just embed a 100-page PDF as one blob. Embedding models have input limits, and a single vector for that much text smears every topic in it together, so retrieval can’t point at the passage that actually answers the question. You need smaller, focused chunks. The size and overlap are art as much as science; too small, you lose context; too large, you drag irrelevant paragraphs into the prompt. This is your first major pitfall. Tune this carefully.
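
To see what size and overlap actually do, here is a deliberately simplified sketch: a fixed-width sliding window over characters. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and sentences); the function name `chunk_text` is made up for this example.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each chunk repeats the last four characters of the one before it, so an idea that straddles a boundary still appears whole in at least one chunk. A bigger overlap buys more of that safety at the cost of more chunks to embed and store.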

The Retrieval Phase: Finding the Right Needle in the Haystack

Now, when a user asks a question (a “query”), we don’t just blindly dump the whole database into the prompt. That would be expensive and, frankly, stupid. We perform a similarity search.

We take the user’s query, convert it into an embedding using the same model we used for indexing, and then ask our vector database: “Which chunks in your collection have vectors most similar to this query vector?”

# ... continuing from the previous setup
query = "What's the company's policy on remote work?"

# This performs the similarity search for us
relevant_docs = vectorstore.similarity_search(
    query,
    k=3  # Retrieve the top 3 most relevant chunks
)

# 'relevant_docs' is now a list of documents that we can pass to the LLM
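
If you want to see what the vector store is doing conceptually, here is a brute-force sketch of top-k retrieval: score every stored vector against the query vector and keep the best k. The chunk texts and three-dimensional vectors are hypothetical stand-ins, and real vector databases usually rely on approximate nearest-neighbor indexes rather than scanning everything.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in indexed_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical chunks with made-up embeddings, for illustration only.
toy_index = [
    ("Employees may work remotely up to three days a week.", [0.9, 0.1, 0.2]),
    ("Expense reports are due by the fifth of each month.", [0.1, 0.9, 0.3]),
    ("Remote workers must use the company VPN.", [0.8, 0.2, 0.4]),
]
toy_query_vector = [0.85, 0.15, 0.25]  # stands in for the embedded user query

results = top_k(toy_query_vector, toy_index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```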

But here’s the rub: naive similarity search (what we just did) is often dumb as a bag of hammers. It ranks purely by embedding proximity and does no exact keyword matching. If a query hinges on a specific token, say an internal acronym like “WFH,” a policy number, or a product code, the embedding may not land near the chunk that literally contains it, and you miss the one passage you needed. This is why hybrid search (combining vector and keyword-based retrieval) is becoming the gold standard for anything serious. Also, always set k based on your context window; retrieving 10 chunks is useless if you can only fit 2 in the prompt.
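
One common way to combine the two result lists is reciprocal rank fusion (RRF), sketched below. This is an illustrative choice, not the only way to do hybrid search. The chunk IDs and both rankings are hypothetical; in practice one list would come from your vector store and the other from a keyword engine such as BM25. The constant 60 is a conventional default for RRF, not a tuned value.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], rrf_k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of IDs; each hit contributes 1 / (rrf_k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results from two retrievers for the same query.
vector_ranking = ["chunk_remote_policy", "chunk_vpn", "chunk_office_hours"]
keyword_ranking = ["chunk_wfh_faq", "chunk_remote_policy", "chunk_vpn"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
for doc_id, score in fused:
    print(f"{score:.4f}  {doc_id}")
```

Chunks that both retrievers like float to the top, while a chunk only the keyword side found (here the hypothetical `chunk_wfh_faq`) still makes the list instead of being dropped.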

The Generation Phase: The LLM’s Crowning Moment of Synthesis

This is the payoff. We take the user’s original query and the retrieved relevant documents and stitch them together into a well-structured prompt for the LLM. The model’s instruction is essentially: “Here’s the user’s question, and here are some excerpts from our internal documentation. Based only on this provided context, write a coherent answer. If the answer isn’t in there, say so.”

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Define a prompt template that restricts the LLM to the provided context
template = """
You are a helpful assistant. Answer the user's question using *only* the context provided below. Do not use any prior knowledge.

If you cannot find the answer in the context, simply state "I cannot find an answer in the provided documentation."

Context:
{context}

Question:
{question}

Answer:
"""
prompt = ChatPromptTemplate.from_template(template)

# Set up the LLM
llm = ChatOpenAI(model="gpt-4-turbo")

# Chain it all together: prompt | LLM
rag_chain = prompt | llm

# Invoke the chain with our retrieved documents and the user query
# We have to format the list of docs into a single text block for the context
context_text = "\n\n".join([doc.page_content for doc in relevant_docs])
response = rag_chain.invoke({
    "context": context_text,
    "question": query
})

print(response.content)

The beauty of this final step is that it keeps the LLM far more honest. You’ve sharply reduced its room to make things up by tethering it to the context you provided, though it can still misread or embellish that context. The most common failure mode here is the model getting “creative” and ignoring the context if the prompt isn’t stern enough. Be direct in your instructions. The other failure mode is overstuffing the context with irrelevant chunks, which just confuses the model. Good retrieval is the absolute foundation of good generation. Without it, you’re just building a very expensive, very well-read stochastic parrot.
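
One simple guard against overstuffing is to cap how much retrieved text goes into the prompt. The sketch below assumes the chunks arrive already ranked best-first and uses a character budget as a rough proxy for tokens; the function name `pack_context` and the budget value are made up for illustration, and a real system would more likely count tokens with the model’s tokenizer.

```python
def pack_context(chunks: list[str], max_chars: int, separator: str = "\n\n") -> str:
    """Join ranked chunks in order, stopping before the character budget is exceeded."""
    selected: list[str] = []
    used = 0
    for chunk in chunks:
        # The separator only costs something once there is a previous chunk.
        cost = len(chunk) + (len(separator) if selected else 0)
        if used + cost > max_chars:
            break  # lower-ranked chunks are dropped rather than truncated
        selected.append(chunk)
        used += cost
    return separator.join(selected)


# Hypothetical ranked chunks, 400 characters each.
ranked_chunks = ["A" * 400, "B" * 400, "C" * 400]
context = pack_context(ranked_chunks, max_chars=1000)
print(len(context))
```

Dropping whole chunks instead of cutting one off mid-sentence is a design choice here: a truncated chunk can mislead the model as easily as an irrelevant one.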