26.2 Data Connectors: Loading from Files, Databases, and APIs

Right, let’s talk about getting your data into LlamaIndex. This is the part where we stop admiring the shiny LLM from a distance and actually make it useful. The entire premise of this framework is that your LLM application is only as good as the data you feed it. You can’t just whisper a SQL query into ChatGPT’s ear and hope for the best. You need structure. You need Data Connectors.

Think of these as the bouncers for your data’s exclusive nightclub. They don’t just let any raw, messy data waltz in. They check it, format it, and hand it off to the bouncer inside (the NodeParser) who breaks it into manageable chunks. We’ll start with the simplest, most common bouncer: the file loader.

The Simple, Glorious File Loader

LlamaIndex has a small army of dedicated loaders for different file types. You don’t just have a “file loader”; you have a PDFReader, a DocxReader, even an IPYNBReader for Jupyter notebooks. This is brilliant because a PDF’s text is buried under a mountain of layout information, a Word doc has styling metadata, and a Jupyter notebook is a JSON structure with code and markdown cells. Using the right loader means the tool, not you, does the grunt work of extracting the actual text.
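To make the notebook case concrete, here is a minimal standard-library sketch of the kind of extraction a notebook loader has to do. It is not LlamaIndex’s implementation; the function name notebook_to_text is hypothetical, and it assumes the usual .ipynb layout (a top-level "cells" list whose entries carry "cell_type" and "source").

```python
import json


def notebook_to_text(raw_notebook: str) -> str:
    """Pull the human-readable text out of a raw .ipynb JSON string."""
    notebook = json.loads(raw_notebook)
    parts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") not in ("markdown", "code"):
            continue  # skip raw cells and anything unrecognized
        source = cell.get("source", "")
        # "source" may be a single string or a list of lines
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            parts.append(source.strip())
    return "\n\n".join(parts)


# A tiny hand-built notebook, just for illustration.
sample = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Some notes."]},
        {"cell_type": "code", "source": "print('hi')", "outputs": []},
        {"cell_type": "raw", "source": "ignored"},
    ]
})
print(notebook_to_text(sample))
```

Everything except the cell sources (outputs, execution counts, kernel metadata) is dropped, which is exactly the noise you don’t want in front of your LLM.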

Here’s how you load a straightforward text file. It’s almost comically simple.

from llama_index.core import SimpleDirectoryReader

# Point it at a directory. It will try to load everything it recognizes.
documents = SimpleDirectoryReader("./data").load_data()

Notice the shape of the call. SimpleDirectoryReader is a class: you instantiate it with its configuration, then call load_data() once to do the actual work. The load_data() method returns a list of Document objects. Each file typically becomes one Document, though some formats split further (a PDF, for example, usually yields one Document per page). This is your first point of control.

But what if your directory is a mess with images and random binaries? Set filename_as_id so each Document’s ID is derived from its file path, which helps with debugging later, and use required_exts to load only the file extensions you actually want.

documents = SimpleDirectoryReader(
    "./data",
    filename_as_id=True,
    required_exts=[".txt", ".pdf", ".md"]
).load_data()
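Before loading a messy directory, it can help to preview which files an extension filter would keep. The sketch below uses only the standard library; preview_candidates is a hypothetical helper, it looks only at the top level of the directory, and it matches extensions case-insensitively, so treat it as an approximation of the reader’s own filtering rather than a copy of it.

```python
import tempfile
from pathlib import Path


def preview_candidates(directory, exts):
    """List the top-level files whose extension is in exts (case-insensitive)."""
    wanted = {ext.lower() for ext in exts}
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in wanted
    )


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("notes.txt", "report.PDF", "logo.png", "README.md"):
        (Path(tmp) / name).write_bytes(b"")
    matches = preview_candidates(tmp, [".txt", ".pdf", ".md"])
    print(matches)
```

If the list contains surprises, fix the directory or the extension list before you pay for parsing and embedding.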

The Database Crawler

Loading from a database is where you move from simple scripts to serious engineering. You’re not just reading bytes; you’re executing queries and translating relational data into a flat document structure. This is where most people face-plant, because they don’t think about the context loss when a row becomes a paragraph of text.

Let’s use the DatabaseReader to connect to a SQLite database (the same principles apply to Postgres or MySQL via SQLAlchemy).

from llama_index.readers.database import DatabaseReader

# DatabaseReader sits on top of SQLAlchemy, so hand it a SQLAlchemy-style URI
# (or an Engine) rather than a raw sqlite3 connection.
db_reader = DatabaseReader(uri="sqlite:///example.db")

# This query defines everything. Be SPECIFIC.
query = "SELECT id, title, content FROM articles WHERE category = 'tech'"
documents = db_reader.load_data(query=query)

What just happened? The reader executed your query and created one Document object per row in the result set. By default, it just jams the row’s column values into a single string, which looks something like "id: 123, title: AI is Cool, content: Lots of text here..." (the exact format depends on the version you have installed).

This is functional but often dumb. The LLM has no idea what id: 123 means. It’s just noise. This is your first pitfall: schema awareness. You must craft your query to return only the data that provides meaningful context. Sometimes, that means writing a more complex query that joins tables to get a complete narrative for each document.

-- A better query for an article with author context
SELECT
    a.id,
    a.title,
    a.content,
    u.name as author_name
FROM articles a
JOIN users u ON a.author_id = u.id
WHERE a.category = 'tech'

Now your document text will include the author’s name, which is valuable context for the LLM.
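One way to act on this is to decide, per column, whether a value belongs in the text the LLM reads or in metadata you keep for bookkeeping. The sketch below runs the join above against an in-memory SQLite database using only the standard library. The helper rows_to_records, the sample rows, and the "article_id" metadata key are all assumptions made for illustration.

```python
import sqlite3


def rows_to_records(connection, query):
    """Run query and return one {"text", "metadata"} dict per row."""
    connection.row_factory = sqlite3.Row  # access columns by name
    records = []
    for row in connection.execute(query):
        # Title, author and content are context; the numeric id is not.
        text = f"{row['title']} (by {row['author_name']})\n\n{row['content']}"
        records.append({"text": text, "metadata": {"article_id": row["id"]}})
    return records


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE articles (
        id INTEGER PRIMARY KEY, title TEXT, content TEXT,
        category TEXT, author_id INTEGER
    );
    INSERT INTO users VALUES (1, 'Ada');
    INSERT INTO articles VALUES (123, 'AI is Cool', 'Lots of text here.', 'tech', 1);
    INSERT INTO articles VALUES (124, 'Sourdough', 'Flour and water.', 'food', 1);
""")

query = """
    SELECT a.id, a.title, a.content, u.name AS author_name
    FROM articles a
    JOIN users u ON a.author_id = u.id
    WHERE a.category = 'tech'
"""
records = rows_to_records(conn, query)
print(records[0]["text"])
```

Each record maps directly onto Document(text=..., metadata=...), so the id stays available for tracing a result back to its row without cluttering the text.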

The API Dance

Loading from an API is the wild west. Every API is a unique snowflake of pagination, authentication, and data formatting. LlamaIndex can’t possibly pre-build a loader for all of them, so it gives you two general-purpose tools instead: a generic SimpleWebPageReader for public URLs and a lower-level JSONReader for when you need to get your hands dirty with JSON payloads.

For a simple public API (no auth):

from llama_index.readers.web import SimpleWebPageReader

# This is great for a known, specific URL
urls = ["https://my-api.com/endpoint?id=123"]
documents = SimpleWebPageReader().load_data(urls)

But most real APIs aren’t that simple. They require tokens, pagination, and processing the JSON response. For this, you’ll often write a custom function. This isn’t a LlamaIndex limitation; it’s just the reality of working with APIs.

import os

import requests
from llama_index.core import Document

# Hypothetical environment variable name; never hard-code the token.
API_KEY = os.environ["MY_API_KEY"]

def load_from_my_api():
    all_documents = []
    headers = {"Authorization": f"Bearer {API_KEY}"}
    url = "https://api.example.com/v1/posts"

    # Keep fetching until the API stops handing back a next-page URL.
    while url:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()  # Always check for errors!
        data = response.json()

        # Transform each item in the response into a Document
        for item in data["results"]:
            text = f"Title: {item['title']}\nContent: {item['body']}"
            all_documents.append(Document(text=text))

        # Handle pagination. This logic is always API-specific.
        url = data.get("next_page_url")

    return all_documents

documents = load_from_my_api()

The key here is the transformation step. You’re taking the raw JSON and crafting a text representation that makes sense for your LLM. You are the context engineer here. Don’t just dump the JSON string; turn it into coherent prose.
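Real payloads tend to have missing or empty fields, so the transformation usually deserves its own small function. Here is one possible sketch; item_to_text is a hypothetical helper, and the "author" and "tags" fields are assumptions added for illustration on top of the "title" and "body" fields used above.

```python
def item_to_text(item: dict) -> str:
    """Render one API item as short labeled prose, skipping empty fields."""
    title = (item.get("title") or "").strip()
    body = (item.get("body") or "").strip()
    author = (item.get("author") or "").strip()
    tags = [tag for tag in item.get("tags") or [] if tag]

    lines = []
    if title:
        lines.append(f"Title: {title}")
    if author:
        lines.append(f"Author: {author}")
    if tags:
        lines.append("Topics: " + ", ".join(tags))
    if body:
        if lines:
            lines.append("")  # blank line between the header and the body
        lines.append(body)
    return "\n".join(lines)


item = {
    "title": " Vector search 101 ",
    "body": "Embeddings map text to vectors.",
    "author": None,
    "tags": ["search", "", "embeddings"],
}
print(item_to_text(item))
```

Because missing values are skipped instead of rendered as "None" or empty labels, the LLM only ever sees fields that carry information.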

The Universal Pitfall: Assuming Clean Data

Here’s the brutal truth none of the documentation shouts loudly enough: Garbage in, garbage out. Your loader fetches bytes. It doesn’t magically fix encoding errors, strip out useless HTML boilerplate, or correct the fact that your PDF is a scanned image with OCR artifacts. A Document object filled with “Lorem Ipsum” layout text from a PDF footer is worse than useless; it actively confuses your model.

Your best practice is to load a small sample first and look at the .text attribute of your Document objects. Inspect them. Are they clean? Is the information relevant? If not, you need to pre-process your data before it gets to the loader or clean it up immediately after loading. This isn’t optional. It’s the most important step in the entire pipeline.
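A quick way to make that inspection habitual is a small checker you run over a sample of loaded text. The sketch below works on plain strings, so with real Document objects you would pass in each one’s .text. The function name inspect_text, the thresholds, and the specific checks are assumptions to tune for your own data, not a standard recipe.

```python
import re


def inspect_text(text: str, min_chars: int = 40) -> list:
    """Return a list of human-readable warnings for one document's text."""
    warnings = []
    stripped = text.strip()
    if len(stripped) < min_chars:
        warnings.append("very short")
    if "\ufffd" in text:
        warnings.append("replacement characters (likely an encoding problem)")
    if re.search(r"<[a-zA-Z][^>]*>", text):
        warnings.append("HTML tags present")
    letters = sum(ch.isalpha() for ch in stripped)
    if stripped and letters / len(stripped) < 0.5:
        warnings.append("mostly non-letter characters")
    return warnings


samples = {
    "clean": "Data connectors load raw sources and turn them into documents for indexing.",
    "footer": "Page 12",
    "html": "<div class='nav'>Home</div> Welcome to our site, where we explain data loading in detail.",
    "mojibake": "The caf\ufffd serves data pipelines and other long-form technical content daily.",
}
report = {name: inspect_text(text) for name, text in samples.items()}
for name, warnings in report.items():
    print(name, warnings or "ok")
```

Heuristics like these won’t catch everything, but they turn “look at your data” from a good intention into something you can run on every new source.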

You’ve now got the data loaded into Document objects. Pat yourself on the back. Then immediately start worrying about how you’re going to split them all up. But that’s a problem for the NodeParser, which is a story for another section.