26.3 Node Parsing: Chunking and Metadata Extraction
Right, let’s talk about node parsing, which is a fancy term for the gloriously tedious but utterly critical task of taking your data and chopping it into pieces an LLM can actually swallow. Think of it as pre-chewing food for a baby bird with a context window. You can’t just shove a whole PDF into GPT-4 and say “figure it out.” It’ll choke, you’ll waste money, and the results will be nonsense. Our job is to be the responsible parent here.
The core idea is simple: a Document is your raw data (a PDF, a blog post, a database record). A Node is a chunk of that document, enriched with metadata that gives it meaning and context. A node parser automates the chunking and metadata extraction. Get this right, and your RAG system sings. Get it wrong, and you’re just building a very expensive, very confident nonsense generator.
The Default Parser: Simple, Stupid, and Sometimes Sufficient
LlamaIndex provides a default node parser because, well, it has to. It’s the culinary equivalent of boiling pasta. It works, but it’s not exactly inspiring. It uses a chunk_size and a chunk_overlap (both measured in tokens) to split text. The overlap is crucial: it repeats a little text on both sides of each boundary, so an idea that straddles the boundary still shows up whole in at least one chunk. Without it, ideas get cleaved in half, which can absolutely murder retrieval accuracy.
from llama_index.core.node_parser import SimpleNodeParser
# The default. chunk_size=1024; the default chunk_overlap varies by release
# (20 in some, 200 in others), so check yours or set it explicitly.
parser = SimpleNodeParser.from_defaults()
# Or, be specific. Let's get a bit more overlap for denser concepts.
custom_parser = SimpleNodeParser.from_defaults(
    chunk_size=512,
    chunk_overlap=64,
)
documents = []  # ... your loaded documents here ...
nodes = custom_parser.get_nodes_from_documents(documents)
This is your starting point. It’s fine for a quick prototype, but you’ll quickly run into its limitations. It splits on sentences and whitespace, which means a massive JSON object or a markdown table will get brutally massacred mid-line. It’s dumb, and we have to accept that.
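To make chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch of a sliding window with overlap. It is not how LlamaIndex does it (the real parser counts tokenizer tokens and prefers sentence boundaries); it counts whitespace-separated words instead, and chunk_words is a hypothetical helper invented for this illustration.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share chunk_overlap words. Words stand in for
    tokens here, which is a simplifying assumption.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


text = "one two three four five six seven eight nine ten"
chunks = chunk_words(text, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last word of the previous one. That repeated word is the overlap doing its job: anything sitting right on a boundary appears in both neighbors.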
Smarter Chunking: Enter the Sentence Window
The SentenceWindowNodeParser is a massive step up for prose-heavy content. Instead of arbitrary chunks, it first splits everything into individual sentences. Each sentence becomes its own node. The genius part is the metadata: it also stores the surrounding sentences as a “window.” This means that during retrieval you match on the one precise sentence that answers the query, and at generation time you can swap in the window (LlamaIndex’s MetadataReplacementPostProcessor does this) so the LLM gets the surrounding context for full comprehension. It’s retrieval precision with generation context. Brilliant.
from llama_index.core.node_parser import SentenceWindowNodeParser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # 3 sentences on each side for context
    window_metadata_key="window",
    original_text_metadata_key="original_sentence",
)
nodes = node_parser.get_nodes_from_documents(documents)
# Now, each node has .text (the single sentence) and .metadata["window"] (the surrounding context)
This is one of my go-to parsers for anything that looks like an article, report, or book. The only downside is it creates a lot of nodes, which means your vector store gets bigger. It’s a trade-off: storage cost for accuracy.
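If the window idea still feels abstract, this rough pure-Python sketch shows the data shape it produces. It assumes a naive regex sentence splitter (the real parser is more careful), and sentence_windows is a hypothetical helper; only the metadata keys mirror the ones used above.

```python
import re


def sentence_windows(text: str, window_size: int = 3) -> list[dict]:
    """Return one dict per sentence, with neighbors stored as metadata."""
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        first = max(0, i - window_size)
        last = min(len(sentences), i + window_size + 1)
        nodes.append({
            "text": sentence,  # what gets embedded and matched
            "metadata": {
                "window": " ".join(sentences[first:last]),  # context for the LLM
                "original_sentence": sentence,
            },
        })
    return nodes


text = "Cats purr. Dogs bark. Birds sing. Fish swim."
window_nodes = sentence_windows(text, window_size=1)
for node in window_nodes:
    print(node["text"], "|", node["metadata"]["window"])
```

Notice the windows at the edges are shorter: the first sentence has no left neighbor, so its window only reaches to the right.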
The Semantic Chunker: Let the Embeddings Decide Where to Split
This one feels like black magic. The SemanticSplitterNodeParser embeds each sentence together with a few of its neighbors, measures how far apart adjacent embeddings are, and splits wherever that distance spikes, which is to say wherever the topic meaningfully shifts. It doesn’t care about sentence count; it cares about ideas.
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding()
node_parser = SemanticSplitterNodeParser(
    buffer_size=1,  # Neighboring sentences grouped together when comparing similarity
    breakpoint_percentile_threshold=95,  # Distance percentile a gap must exceed to trigger a split
    embed_model=embed_model,
)
nodes = node_parser.get_nodes_from_documents(documents)
This is incredibly powerful for heterogeneous documents. A textbook chapter that moves from “Introduction to Calculus” to “Practical Examples” should be split right around that conceptual boundary. The pitfall? It’s the most computationally expensive option (every sentence group has to be embedded before any splitting happens) and the parameters (buffer_size, breakpoint_percentile_threshold) require tuning. It’s not a default; it’s a scalpel.
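To demystify the percentile threshold, here is a toy sketch of the breakpoint logic. It is an illustration under loud assumptions, not the library’s code: the two-dimensional “embeddings” are hand-written stand-ins for a real model’s output, there is no buffer grouping, and semantic_split, cosine_distance, and percentile are hypothetical helpers.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_split(sentences, embeddings, threshold_pct):
    """Start a new chunk wherever the distance to the previous sentence
    exceeds the given percentile of all adjacent distances."""
    if len(sentences) < 2:
        return list(sentences)
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    cutoff = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > cutoff:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "A derivative is a slope.",
    "An integral is an area.",
    "Let's price a loan.",
    "Now compute the interest.",
]
# Hand-made toy vectors: the first two point one way, the last two another.
embeddings = [(1.0, 0.0), (0.9, 0.1), (0.1, 0.9), (0.0, 1.0)]
chunks = semantic_split(sentences, embeddings, threshold_pct=95)
for chunk in chunks:
    print(chunk)
```

Lowering threshold_pct makes more gaps count as breakpoints, so you get more, smaller chunks. That is the knob you end up tuning.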
Metadata: The Secret Sauce
Chunking is only half the battle. Metadata is what makes your nodes smart. A node without metadata is a ghost in the machine. Between them, the document loader and the parser can automatically attach crucial details like the node’s position in the document (page_label, if it came from a PDF), the file name, the document type, or the section header it was under.
This metadata is your lifeline later for filtering. You can tell your retriever, “Only get nodes from the PDF ‘2024 Annual Report.pdf’ that are under the ‘Risks’ section header.” This is how you move from dumb search to precise data retrieval.
# This happens automatically with the default parsers for supported file types.
for node in nodes:
    print(f"Text: {node.text[:100]}...")
    print(f"Metadata: {node.metadata}")
    print("---")
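The filtering described above boils down to exact-match checks on metadata. Here is a minimal sketch over plain dicts, assuming invented toy data and a hypothetical "section" key (your loader may name it differently, or not produce it at all). In a real pipeline you would hand the filters to the retriever rather than loop yourself; recent LlamaIndex versions expose MetadataFilters for that, so check the docs for your version.

```python
# Toy nodes as plain dicts; texts and metadata values are made up.
toy_nodes = [
    {"text": "Currency swings may hurt margins.",
     "metadata": {"file_name": "2024 Annual Report.pdf", "section": "Risks"}},
    {"text": "Revenue grew in every region.",
     "metadata": {"file_name": "2024 Annual Report.pdf", "section": "Highlights"}},
    {"text": "Supply shortages remain a risk.",
     "metadata": {"file_name": "2023 Annual Report.pdf", "section": "Risks"}},
]


def filter_nodes(nodes, **required):
    """Keep nodes whose metadata matches every required key/value pair."""
    return [
        node for node in nodes
        if all(node["metadata"].get(key) == value for key, value in required.items())
    ]


hits = filter_nodes(toy_nodes, file_name="2024 Annual Report.pdf", section="Risks")
for node in hits:
    print(node["text"])
```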
Best Practice: Always add custom metadata before parsing if you can. Got a user ID associated with these documents? A product category? Add it to the Document.metadata before you run the parser. The node parser will automatically carry it forward to every node created from that document, saving you a world of pain later.
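Here is what “carry it forward” means mechanically, as a toy sketch. ToyDocument and ToyNode are stand-ins invented for this example, not LlamaIndex classes, and the paragraph splitter is deliberately simplistic; the user_id and product_category values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class ToyNode:
    text: str
    metadata: dict = field(default_factory=dict)


def split_paragraphs(doc: ToyDocument) -> list[ToyNode]:
    """One node per non-empty paragraph, each with its own copy of the
    document's metadata."""
    return [
        ToyNode(text=paragraph.strip(), metadata=dict(doc.metadata))
        for paragraph in doc.text.split("\n\n")
        if paragraph.strip()
    ]


doc = ToyDocument(text="Returns are free.\n\nShipping takes 3 days.")
# Attach custom fields BEFORE parsing so every node inherits them.
doc.metadata["user_id"] = "u-123"
doc.metadata["product_category"] = "shoes"

toy_nodes = split_paragraphs(doc)
for node in toy_nodes:
    print(node.text, node.metadata)
```

Tagging once at the document level beats patching thousands of nodes after the fact, which is the whole point of doing it before you parse.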
The rough edge? Not all parsers handle all file types perfectly. A complex PDF with multi-column layouts might have its text extracted out-of-order, scrambling your chunks and making the page_label attached to them misleading. You have to validate. This isn’t a LlamaIndex problem; it’s a “the world of unstructured data is messy” problem. Your job is to choose the right parser for your specific data and test, test, test. Don’t assume it worked. Open up the resulting nodes and look at them. It’s the only way to be sure.
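Eyeballing nodes scales better with a little help. This sketch flags the obviously suspicious ones so you know where to look first. It assumes nodes as plain dicts; audit_nodes is a hypothetical helper, and the character thresholds are arbitrary starting points rather than recommendations.

```python
def audit_nodes(nodes, min_chars=20, max_chars=2000):
    """Return (index, problem) pairs for nodes that deserve a manual look."""
    problems = []
    for i, node in enumerate(nodes):
        text = node["text"]
        if not text.strip():
            problems.append((i, "empty"))
        elif len(text) < min_chars:
            problems.append((i, "too short"))  # often a stray caption or header
        elif len(text) > max_chars:
            problems.append((i, "too long"))   # the splitter may have given up
        if not node.get("metadata"):
            problems.append((i, "no metadata"))
    return problems


sample_nodes = [
    {"text": "A reasonable chunk of prose about risk.", "metadata": {"page_label": "3"}},
    {"text": "  ", "metadata": {"page_label": "4"}},
    {"text": "Fig. 2", "metadata": {}},
]
problems = audit_nodes(sample_nodes)
for index, problem in problems:
    print(index, problem)
```

A clean report proves nothing about reading order or meaning, so treat this as a way to prioritize your manual inspection, not a replacement for it.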