24.3 Document Chunking Strategies: Fixed Size, Semantic, and Recursive

Alright, let’s get our hands dirty. You’ve got your documents, you’ve got your embedding model, and you’re ready to build a RAG system. But if you think you can just shove a 300-page PDF into a vector database in one go and call it a day, you’re in for a rude awakening. One of the biggest levers you can pull for RAG performance isn’t your fancy LLM or your hyper-optimized embeddings; it’s how you chunk your documents. Get this wrong, and your brilliant retrieval system will be about as useful as a chocolate teapot.

Think of it this way: chunking is how you create the “memory cards” for your system. Make them too big, and the relevant info is buried in a sea of irrelevant text. Make them too small, and they lack the necessary context to be meaningful. Your goal is the Goldilocks zone: chunks that are self-contained enough to be accurately retrieved and understood.

The Trusty (and Dumb) Workhorse: Fixed-Size Chunking

This is the default, the classic, the “let’s not overcomplicate this” approach. You pick a number of characters or tokens, and you slice your text into pieces that size, usually with a small overlap to prevent context loss at the boundaries.
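
Before reaching for a library, it helps to see how little is going on. The following is a minimal plain-Python sketch of character-based slicing with overlap; the function name fixed_size_chunks and its parameters are made up for illustration and are not part of any library.

```python
def fixed_size_chunks(text, chunk_size, overlap=0):
    """Slice text into chunk_size-character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap  # how far the window advances each time
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each window starts chunk_size - overlap characters after the previous one, which is why the last character of one chunk shows up again as the first character of the next.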

Why would you use this? Because it’s stupidly simple and incredibly robust. It makes no assumptions about your text structure, so it works on everything from legal code to Shakespearean sonnets. It’s your go-to for a first pass.

The obvious pitfall? It will happily slice a sentence—or even a word—right in half. The overlap is your Band-Aid for this, but it’s still a crude solution. Let’s see it in action with LangChain.

from langchain_text_splitters import CharacterTextSplitter

# Let's be honest, you probably found this in a tutorial.
text = "Your long document text goes here. It has multiple sentences. Some are short. Others, however, can be quite long and meandering, containing numerous clauses and perhaps even a semicolon or two; which this splitter will blindly ignore."

# 100 characters is too small for real use, but it makes the demo obvious.
text_splitter = CharacterTextSplitter(
    separator=" ",  # Split on spaces? Better than nothing, I guess.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

chunks = text_splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk}\n")

You’ll get chunks of at most 100 characters. Because this demo splits on spaces, words stay intact, but the chunks still begin and end mid-sentence. It’s fast, it’s predictable, and it’s often good enough for a proof-of-concept. But we can do better.

The Smart(er) Cookie: Semantic Chunking

This is where we try to actually respect the natural boundaries of language. Instead of counting characters, we split on meaningful separators: paragraphs, sentences, bullet points, chapter headings, etc. The RecursiveCharacterTextSplitter in LangChain is the king here, and it’s what you should probably be using by default. It’s called “recursive” because it works through a hierarchy of separators. It first splits on its top choice (like "\n\n" for paragraph breaks); any piece that is still larger than the desired size gets split again on the next separator (e.g., "\n"), and so on, down to individual words and, as a last resort, individual characters.

It’s less likely to create monstrosities that split mid-sentence. It’s the difference between using a scalpel and a chainsaw.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# This is the one you actually want.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,  # A more generous overlap is wise here.
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],  # Default is ["\n\n", "\n", " ", ""]; ". " is added to prefer sentence breaks
)

chunks = text_splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk}\n")

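This will try to keep paragraphs together, then sentences, then words. It’s a massive upgrade in coherence. The best practice? Tailor your separators to your data source. Is it Markdown? Try something like ["\n# ", "\n## ", "\n### ", "\n```", "\n\n", "\n", " ", ""]. Anchoring the headings to the start of a line matters: a bare "#" would fire on every stray hash in the text and would also match every "##" and "###" before they get their turn. LangChain also ships per-language presets via RecursiveCharacterTextSplitter.from_language. Is it code? Well, that’s a whole other can of worms.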
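
To see what the separator hierarchy is actually doing, here is a stripped-down plain-Python sketch of the idea. It is not LangChain’s implementation: the name recursive_split is made up, there is no overlap, and a separator that falls on a chunk boundary is simply dropped.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Split on the coarsest separator first; recurse on pieces that are still too big."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard character slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur here, so try the next, finer one.
        return recursive_split(text, chunk_size, finer)
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbours while they fit
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) > chunk_size:
            # Too big even on its own: split it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current.strip():
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer. It has two sentences.\n\nTiny."
for chunk in recursive_split(sample, chunk_size=40):
    print(repr(chunk))
```

The short first paragraph survives whole, while the second one is too long for a 40-character budget and gets broken at its sentence boundary instead of at an arbitrary character.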

When Your Documents are Hierarchical: Recursive Chunking

This is the most sophisticated approach, and frankly, it’s where you separate the RAG rookies from the veterans. The idea is that documents have a structure—chapters, sections, subsections—and you want to preserve that context in your chunks.

You don’t just split; you build a tree. A chunk for a section might include the title of its parent chapter. This provides incredible context to the embedding model and the LLM later, drastically reducing the chance of a “hanging reference” (e.g., retrieving a chunk that says “as we discussed in the previous chapter” with no clue what that chapter was).

Implementing this from scratch is more art than science, often requiring custom parsing. But here’s a conceptual blueprint:

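import re

# A minimal sketch of a Markdown-aware recursive chunker (not production code).
def chunk_recursively(content, max_size, current_header="", level=1):
    content = content.strip()
    if not content:
        return []
    # If the content is small enough on its own, chunk it with its header context.
    if len(content) <= max_size:
        return [f"{current_header}\n\n{content}" if current_header else content]
    if level > 6:
        # No header levels left to split on: fall back to fixed-size slices.
        pieces = [content[i:i + max_size] for i in range(0, len(content), max_size)]
        return [f"{current_header}\n\n{p}" if current_header else p for p in pieces]
    chunks = []
    # Split just before each header of the current level (e.g., "## " when level=2).
    for section in re.split(rf"(?m)^(?={'#' * level} )", content):
        # Extract the new header from this section and extend the breadcrumb.
        section_header, body = extract_header(section, level)
        full_header = " > ".join(h for h in (current_header, section_header) if h)
        # Recurse! Split this smaller section further on the next header level.
        chunks.extend(chunk_recursively(body, max_size, full_header, level + 1))
    return chunks

def extract_header(section, level):
    # Return (header_text, body); header_text is "" when the section has no header.
    match = re.match(rf"{'#' * level} (.*)\n?", section)
    if match:
        return match.group(1).strip(), section[match.end():]
    return "", section

# Caveat: the regexes ignore fenced code blocks, so a "# comment" inside one looks like a header.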

This creates chunks that know their place in the world. A retrieved chunk might look like “Chapter 3 > Safety Guidelines > Proper Handling…”. This is pure, uncut context, and your LLM will thank you for it. The downside? It’s complex and requires deep understanding of your document format.
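
If you only need the breadcrumb part and not the size-driven recursion, a single pass with a header stack gets you there. The sketch below is one possible way to do it, assuming ATX-style Markdown headers ("#" through "######"); the name sections_with_breadcrumbs is hypothetical, it does not enforce a maximum chunk size, and it will misread "#" lines inside fenced code blocks.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6}) +(.*)$")


def sections_with_breadcrumbs(markdown):
    """Return one chunk per section, each prefixed with its 'A > B > C' header path."""
    stack = []   # (level, title) pairs for the headers currently in scope
    body = []    # lines collected for the section being read
    chunks = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            path = " > ".join(title for _, title in stack)
            chunks.append(f"{path}\n\n{text}" if path else text)
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the previous section before changing the path
            level = len(match.group(1))
            # Drop headers at the same or a deeper level, then push the new one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = (
    "# Chapter 3\nIntro text.\n"
    "## Safety Guidelines\n### Proper Handling\nWear gloves.\n"
    "## Storage\nKeep dry."
)
for chunk in sections_with_breadcrumbs(doc):
    print(chunk, end="\n---\n")
```

In practice you would feed each returned section through a size-aware splitter afterwards, re-attaching the same header path to every piece.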

The Uncomfortable Truth: There is No Perfect Chunk Size

I can’t give you a magic number. Anyone who does is lying. The “best” chunk size is a function of your documents, your embedding model, and your query type.

Your Documents: Legal contracts need large chunks to capture entire clauses. Slack messages can be tiny.
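Your Embedding Model: Most embedding models have a maximum input length (512 tokens is a common limit). Text past that limit is typically truncated before embedding, so an oversized chunk means part of it never makes it into the vector. Keep in mind that the splitters above measure chunk_size in characters (length_function=len), not tokens.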
Your Query Type: Factual, “what is X” queries do well with small, precise chunks. Analytical, “summarize the arguments for Y” queries need much larger chunks to contain the necessary narrative.
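
For the embedding-model constraint in particular, a quick sanity check can catch chunks that are likely to blow past the limit. This is a rough sketch only: the four-characters-per-token ratio is a common rule of thumb for English text and is an assumption here, as are the function names. For real numbers, count tokens with your embedding model’s own tokenizer.

```python
def approx_token_count(text, chars_per_token=4):
    """Rough token estimate; assumes ~4 characters per token (a rule of thumb)."""
    return -(-len(text) // chars_per_token)  # ceiling division


def oversized_chunks(chunks, max_tokens=512, chars_per_token=4):
    """Return (index, estimated_tokens) for every chunk likely to exceed max_tokens."""
    flagged = []
    for index, chunk in enumerate(chunks):
        estimate = approx_token_count(chunk, chars_per_token)
        if estimate > max_tokens:
            flagged.append((index, estimate))
    return flagged


chunks = ["a" * 100, "b" * 4000]
print(oversized_chunks(chunks))
# → [(1, 1000)]
```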

The only best practice that matters: experiment. Build your pipeline, run a battery of test queries, and evaluate the results. Is it retrieving the right stuff? Tweak your chunk size and strategy, and try again. It’s not glamorous, but it’s the work. Welcome to the trenches.
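
What that experiment loop might look like, in miniature: the sketch below compares a few chunk sizes by hit rate on a handful of test queries. Everything in it is a stand-in. The keyword-overlap retrieve function is a toy replacement for your real embedding search, and the document, the queries, and the names fixed_size_chunks and hit_rate are invented for the example.

```python
def fixed_size_chunks(text, chunk_size, overlap=0):
    """Plain character slicing with overlap (stand-in for your real splitter)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def retrieve(query, chunks, k=1):
    """Toy retriever: rank chunks by how many query words they contain."""
    words = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: sum(w in c.lower() for w in words), reverse=True)
    return ranked[:k]


def hit_rate(chunks, test_cases, k=1):
    """Fraction of (query, expected_text) pairs whose expected text is in the top-k chunks."""
    hits = sum(
        any(expected in chunk for chunk in retrieve(query, chunks, k))
        for query, expected in test_cases
    )
    return hits / len(test_cases)


document = (
    "Gloves must be worn when handling solvents. "
    "Store solvents in a ventilated cabinet away from heat. "
    "Report any spill to the shift supervisor within ten minutes."
)
test_cases = [
    ("what must be worn when handling solvents", "Gloves must be worn"),
    ("where to store solvents", "ventilated cabinet"),
    ("who to report a spill to", "shift supervisor"),
]

results = {}
for size in (40, 80, 160):
    chunks = fixed_size_chunks(document, size, overlap=size // 5)
    results[size] = hit_rate(chunks, test_cases, k=1)
    print(f"chunk_size={size}: hit rate {results[size]:.2f}")
```

The useful part is the shape of the loop, not the numbers: keep the test set fixed, change one chunking knob at a time, and compare the scores.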
