25.6 Document Loaders and Text Splitters
Right, let’s talk about the part of the job that feels most like actual work: getting your text out of its comfy little files and into your LLM’s brain in a way it can actually digest. This isn’t just busywork; doing this poorly is the single fastest way to make your multi-billion-parameter AI model dumber than a bag of hammers. We’re going to fix that.
The core problem is simple: LLMs have a painfully short-term memory, called a ‘context window’. You can’t just shove the complete works of Shakespeare into the prompt and ask for a sonnet about your cat. You have to break your documents into smaller, semantically meaningful chunks. This is a two-step dance: first, you load the documents (the DocumentLoader), and then you split them (the TextSplitter). Mess up either step, and you’re building a Rube Goldberg machine of failure.
The Art of the Loader: More Than Just open()
Document loaders are your data gateways. Their job is to take a source (a PDF, a URL, a directory of text files) and return a list of LangChain Document objects. A Document is a small object with two fields: page_content (the text) and metadata (a dictionary recording where it came from, page number, etc.). This metadata is your best friend later on for citations.
The key insight here is that loading is rarely just about reading bytes. It’s about parsing. A TextLoader is trivial. But a PyPDFLoader? It’s wrestling a greased pig. PDFs are for printing, not for programmatic extraction. The text is often placed absolutely on the page, leading to weird word ordering, and forget about tables.
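# Note: newer LangChain releases ship these loaders in langchain_community.document_loaders.
# If this import warns or fails, install langchain-community and import from there.
from langchain.document_loaders import TextLoader, PyPDFLoader, WebBaseLoader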
# Simple, reliable, boring. Your favorite.
text_loader = TextLoader("./my_novel.txt")
docs = text_loader.load()
# Enter the thunderdome. Hope your PDF is mostly text.
pdf_loader = PyPDFLoader("./specification.pdf")
docs = pdf_loader.load() # Returns a list of Docs, one per page.
# Scraping the web. A beautiful nightmare of inconsistent HTML.
web_loader = WebBaseLoader("https://example.com/some_blog_post")
docs = web_loader.load()
The pro move is to always check the metadata after loading. That PyPDFLoader gives you the page number. Keep that. When your chain later hallucinates an answer and you need to figure out which page it misread, you’ll thank me.
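To make the "keep the metadata" advice concrete, here is a minimal sketch of turning metadata into a citation string. The Doc class is a hypothetical stand-in for a LangChain Document so the snippet runs without any dependencies, and the "source" and "page" keys are assumptions about what your loader writes; print doc.metadata on your own data to confirm.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def cite(doc: Doc) -> str:
    """Build a human-readable citation from whatever metadata is present."""
    source = doc.metadata.get("source", "unknown source")
    page = doc.metadata.get("page")
    return source if page is None else f"{source}, page {page}"


# Assumed metadata keys; a plain text file has no page number.
docs = [
    Doc("Voltage must not exceed 5 V.", {"source": "./specification.pdf", "page": 3}),
    Doc("Chapter one.", {"source": "./my_novel.txt"}),
]
citations = [cite(d) for d in docs]
for d, c in zip(docs, citations):
    print(f"{c}: {d.page_content[:40]!r}")
```

The same cite helper should work on real Document objects, since it only touches the metadata attribute.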
Splitting Text: Where the Real Magic Happens
This is the most important concept in this chapter. You don’t just split text by character count. If you break a sentence in half, you’ve created two chunks of grammatical nonsense. The goal is to create chunks that are self-contained and make sense on their own.
This is why we use RecursiveCharacterTextSplitter. It’s less intimidating than it sounds. It doesn’t recursively call itself until your stack overflows; it simply tries to split your text along a hierarchy of separators. First, it tries to split by double newlines ("\n\n"). If that creates chunks that are too big, it tries single newlines ("\n"). If those are still too big, it tries spaces (" "), and finally, if all else fails, it splits by character.
# Note: newer LangChain releases ship the splitters in the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter
# This is a great starting configuration. Tweak these like a chef seasoning a soup.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Upper limit of ~1000 characters per chunk.
    chunk_overlap=200,    # The most important parameter. Up to 200 chars shared between neighboring chunks.
    length_function=len,  # Measure chunk size in characters with the standard `len` function.
)
# 'docs' is the list of Documents we loaded above.
chunks = text_splitter.split_documents(docs)
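If the separator hierarchy still feels abstract, here is a dependency-free sketch of the idea. It is a simplification written for illustration, not LangChain's actual implementation: recursive_split is a hypothetical helper, it ignores overlap entirely, and it merges pieces greedily.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text so that every chunk is at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Out of separators (or reached the empty one): cut at fixed character offsets.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # Keep merging small pieces together.
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too big: retry it with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = (
    "Para one is short.\n\n"
    "Para two is a bit longer and will not fit in one chunk.\n\n"
    "End."
)
chunks = recursive_split(text, chunk_size=30)
for c in chunks:
    print(len(c), repr(c))
```

Notice that the short paragraphs survive intact and only the oversized one gets broken at spaces. That is the whole point of trying the coarse separators first.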
The chunk_overlap is your secret weapon against context death. Imagine a key piece of information sits right on the split between two chunks. Without overlap, chunk #2 has no idea what chunk #1 was talking about. The overlap acts as a buffer, ensuring that critical context is carried over from one chunk to the next. Not using it is the most common rookie mistake.
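Here is a small sketch of that boundary problem. It uses plain fixed-size character windows, which is an assumption made to keep the demo short; the real splitter builds its overlap out of separator-delimited pieces, so its chunks will look different. The sliding_chunks helper and the sample text are made up for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size character windows; each one re-reads the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reaches the end of the text.
        start += step
    return chunks


text = "Backups run nightly. The master key is kept in vault seven. Rotate it yearly."
fact = "vault seven"

without_overlap = sliding_chunks(text, chunk_size=25, chunk_overlap=0)
with_overlap = sliding_chunks(text, chunk_size=25, chunk_overlap=12)

print("no overlap:  ", any(fact in c for c in without_overlap))
print("with overlap:", any(fact in c for c in with_overlap))
```

With these fixed windows, any span no longer than the overlap is guaranteed to land whole in at least one chunk, which is a handy way to think about how much overlap you need.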
Tuning Your Splits: It’s Not Magic, It’s a Knob
Those parameters aren’t set in stone. You need to think about your use case.
- chunk_size: Think about your model’s context window. If you’re using a model that only handles 4k tokens, sending a 3k token chunk for retrieval leaves almost no room for the question and the answer. Smaller chunks are often better for precision.
- chunk_overlap: A good rule of thumb is 10-20% of your chunk_size. More overlap is cheaper than missing crucial context.
- The Separators: For code, you might want to add Python-specific separators (like class or def) to the list to keep functions and classes intact in a single chunk. This is where you graduate from user to wizard.
# A more advanced splitter for Python code
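from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,  # from_language expects a Language enum member.
    chunk_size=800,
    chunk_overlap=150,
)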
# This will do a much better job of keeping function definitions whole.
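Back to the chunk_size and chunk_overlap knobs for a second: the arithmetic is easy to get wrong because chunk_size here is measured in characters while context windows are measured in tokens. The sketch below is back-of-the-envelope only. The 4-characters-per-token ratio is a rough assumption for English text, and both helper functions and all the example numbers are hypothetical; use your model's real tokenizer if you need exact counts.

```python
def retrieval_budget(context_tokens, chunk_chars, chunks_retrieved,
                     reserved_for_answer, chars_per_token=4):
    """Estimate the tokens left for the question and prompt template."""
    chunk_tokens = -(-chunk_chars // chars_per_token)  # Ceiling division.
    used = chunk_tokens * chunks_retrieved + reserved_for_answer
    return context_tokens - used


def overlap_range(chunk_size, low=0.10, high=0.20):
    """Apply the 10-20% rule of thumb to a given chunk_size."""
    return round(chunk_size * low), round(chunk_size * high)


# Assumed scenario: 4096-token window, four 1000-character chunks, 512 tokens for the answer.
remaining = retrieval_budget(4096, 1000, chunks_retrieved=4, reserved_for_answer=512)
print("tokens left for the question and template:", remaining)
print("suggested overlap range for chunk_size=1000:", overlap_range(1000))
```

A negative result means your retrieved chunks alone blow the window, which is your cue to shrink chunk_size or retrieve fewer chunks.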
The absolute best practice? Look at your chunks. No, seriously. After you split your documents, print out a few chunks. You will immediately spot if your overlap is too small, if your splits are breaking sentences weirdly, or if your PDF loader created a chunk that’s just a page footer repeated 50 times. This isn’t a “set and forget” process; it’s the first and most important tuning step for any performant RAG application. Get this right, and everything that follows becomes infinitely easier.
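A minimal sketch of what "look at your chunks" can mean in practice. Both helpers are hypothetical and operate on plain strings; with LangChain you would pass something like [c.page_content for c in chunks]. The sample chunks and footer text are invented for the demo.

```python
from collections import Counter


def preview(chunks, n=3, width=80):
    """Print the start and end of the first n chunks so bad splits are easy to spot."""
    for i, text in enumerate(chunks[:n]):
        print(f"--- chunk {i} ({len(text)} chars) ---")
        print(f"starts: {text[:width]!r}")
        print(f"ends:   {text[-width:]!r}")


def repeated_lines(chunks, min_count=3):
    """Return lines that show up in at least min_count chunks, e.g. page footers."""
    counts = Counter()
    for text in chunks:
        # Count each distinct non-empty line once per chunk.
        counts.update({line.strip() for line in text.splitlines() if line.strip()})
    return {line: n for line, n in counts.items() if n >= min_count}


sample_chunks = [
    "Intro text.\nACME Corp - Confidential",
    "More text here.\nACME Corp - Confidential",
    "Final words.\nACME Corp - Confidential",
]
preview(sample_chunks, n=2)
suspects = repeated_lines(sample_chunks)
print("possible boilerplate:", suspects)
```

Anything repeated_lines flags is a candidate for stripping before you split, though it is worth eyeballing the list first in case a legitimately repeated heading sneaks in.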