Vector-Search | mikePietsch.com

24.9 Evaluating RAG: RAGAS Framework

Right, so you’ve built your RAG pipeline. You’ve got your vector store humming, your embeddings are pristine, and your LLM isn’t hallucinating nearly as much. You pat yourself on the back. But then the terrifying question hits: How good is it, actually? You can’t just eyeball a few responses and call it a day. That’s like testing a parachute by jumping out of a plane and saying “Seemed fine!” on the way down. We need metrics. We need a framework. Enter RAGAS.

24.8 Advanced RAG: HyDE, Multi-Query, and RAPTOR

Right, so you’ve got the basics of RAG down. You chuck a query at a retriever, it finds some relevant docs from a vector store, and you hand those to an LLM to synthesize an answer. It’s a game-changer, but let’s be honest, the vanilla version can be a bit… dumb. It’s a glorified “CTRL-F” on steroids. The retriever is matching on surface-level embedding similarity between one query and one chunk, not on any real conceptual understanding. If your query is phrased very differently from your documents? Tough luck. If the answer requires synthesizing information from ten different places? Goodnight.

24.7 Rerankers: Cross-Encoder Models for Precision

Right, so you’ve got your initial set of documents from your vector store. You’re feeling pretty good. You typed in “best practices for pruning apple trees,” and your retriever dutifully came back with 20 documents that mostly look like they’re about fruit, shears, and branches. But let’s be honest: some of those are probably about Apple stock options or, god forbid, a recipe for apple pie. This is where the fast-but-coarse approximation of your bi-encoder (the thing that powered your initial search) starts to show its limits.

24.6 Hybrid Search: BM25 + Dense Retrieval with Reciprocal Rank Fusion

Right, so you’ve got BM25, the grizzled veteran of keyword search, and you’ve got your shiny new dense retrieval model that’s all about semantic meaning. They’re both good at their jobs, but they’re also hilariously bad at each other’s jobs. BM25 will completely whiff on a query for “canine companion” if your document only says “dog.” Your dense retriever, on the other hand, might decide that a document about the planet Saturn is highly relevant to a query for “best car for a family” because, hey, Saturn made a car. It’s a mess.
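
One common way to combine the two result lists is Reciprocal Rank Fusion, which only looks at rank positions and so sidesteps the fact that BM25 and dense scores live on different scales. What follows is a minimal sketch, not a production implementation: the function name, the document IDs, and the two example rankings are all made up for illustration, and `k=60` is an assumed default that is commonly used but worth tuning.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a requirement.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a keyword retriever and a dense retriever.
bm25_ranking = ["doc_dog", "doc_cat", "doc_saturn"]
dense_ranking = ["doc_puppy", "doc_dog", "doc_cat"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that both retrievers agree on float to the top, while a document that only one retriever liked has to rank very highly in that list to compete.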

24.5 Vector Databases: Chroma, Pinecone, Weaviate, Qdrant, pgvector

Right, let’s talk about where your AI’s brain gets an external hard drive: the vector database. This isn’t just some fancy storage locker; it’s the core of making RAG actually work. Without it, your large language model is just a brilliant, know-it-all savant with severe amnesia. It knows its training data but has no clue about your company’s latest Q3 report or the fact that your API documentation was updated yesterday.

24.4 Embedding Models: OpenAI, Sentence Transformers, and BGE

Alright, let’s talk about the unsung hero of the RAG pipeline: the embedding model. This is the part that takes your brilliant, messy, human-language queries and documents and squishes them down into a list of numbers—a vector—that a computer can actually reason about. Get this right, and your RAG system sings. Get it wrong, and you’re just doing a very expensive, very slow keyword search. We’re not here for that.

24.3 Document Chunking Strategies: Fixed Size, Semantic, and Recursive

Alright, let’s get our hands dirty. You’ve got your documents, you’ve got your embedding model, and you’re ready to build a RAG system. But if you think you can just shove a 300-page PDF into a vector database in one go and call it a day, you’re in for a rude awakening. The single biggest lever you have to pull for RAG performance isn’t your fancy LLM or your hyper-optimized embeddings—it’s how you chunk your documents. Get this wrong, and your brilliant retrieval system will be about as useful as a chocolate teapot.
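
To make the simplest of these strategies concrete, here is a rough sketch of fixed-size chunking with overlap. It works on raw characters purely for illustration; a real pipeline would more likely count tokens and respect sentence boundaries. The function name, the default sizes, and the overlap parameter are assumptions, not recommendations.

```python
def chunk_fixed(text, chunk_size=200, overlap=50):
    """Split text into character windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text, so we never
        # emit a trailing chunk that is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


example_chunks = chunk_fixed("abcdefghij", chunk_size=4, overlap=2)
print(example_chunks)
```

The overlap is there so that a sentence cut in half at one boundary still appears whole in a neighbouring chunk.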

24.2 RAG Architecture: Indexing, Retrieval, and Generation

Right, so you want to build a RAG system. Good choice. It’s the duct tape and WD-40 of the AI world—a shockingly effective way to stop your LLM from confidently hallucinating facts straight out of its own digital derriere. The core idea is gloriously simple: instead of asking the model to pull answers from its static, pre-trained memory (which is like asking a friend for movie trivia they last studied in 2022), you first go find the relevant information in your own trusted data, then shove that context into the prompt. The model’s job shifts from “knowing” to “synthesizing,” which is what it’s actually good at.
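
The retrieve-then-prompt flow can be sketched in a few lines. This is a toy under loud assumptions: the "retriever" is a word-overlap counter standing in for a real vector search, the documents are invented sample data, all function names are hypothetical, and the final call to an LLM is left out entirely.

```python
def overlap_score(query, doc):
    """Toy relevance score: number of lowercase words shared by query and doc."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def retrieve(query, docs, top_k=2):
    """Return the top_k docs by toy score (stand-in for a vector store lookup)."""
    ranked = sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_docs):
    """Put the retrieved context into the prompt so the model synthesizes from it."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


# Invented sample corpus, for illustration only.
docs = [
    "The Q3 report shows revenue grew 12 percent.",
    "The API documentation was updated yesterday.",
    "Apple pie needs cinnamon.",
]
query = "What does the Q3 report say about revenue?"
prompt = build_prompt(query, retrieve(query, docs, top_k=1))
print(prompt)
```

The resulting string is what you would hand to the model; swapping the toy scorer for an embedding lookup changes `retrieve` but leaves the rest of the flow alone.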

24.1 Why RAG: Overcoming Knowledge Cutoffs and Hallucination

Right, let’s talk about why we’re even bothering with this RAG nonsense. You’ve probably seen the demos: a chatbot that can perfectly answer questions about your company’s internal docs, a research assistant that cites actual papers. It feels like magic, but the problem it solves is one of the most fundamental flaws of the Large Language Models (LLMs) you’re used to: they’re brilliant idiots. They have two crippling weaknesses. First, they have a knowledge cutoff. Ask GPT-4 who won a tournament played after its training data ends and it’ll either plead ignorance or politely make something up. It’s like hiring a world-class historian who hasn’t read a newspaper since 2023. Second, and far more dangerously, they hallucinate. When they don’t know something, their primary directive (generate plausible-sounding text) takes over, and they confidently present fiction as fact. I’ve seen them invent academic papers with real-sounding titles and fake authors, create entirely non-existent API endpoints, and cite legal cases that never happened. This isn’t a bug; it’s an inherent byproduct of how they work. They’re probabilistic text generators, not databases.