27.4 Chroma: Lightweight Embedded Vector Store

Let’s be honest, you don’t always need a distributed, planet-scale vector database humming away in its own Kubernetes cluster. Sometimes you just need to stash some vectors on your local machine for a prototype, a personal project, or to avoid the sheer overhead of a full-blown database server. That’s where Chroma comes in. It’s the vector database equivalent of a trusty, lightweight backpack—not built for a cross-continental expedition, but perfect for the day hike. It’s an open-source embedded store, meaning it runs right in your Python application, no external servers or complex setup required.

The Core Triad: IDs, Embeddings, and Documents

Chroma organizes your data into Collections, which are essentially named buckets for your vectors and their associated metadata. The magic of Chroma is that it can handle the embedding step for you, but only if you understand its data model. Every item in a collection is built from three pieces:

IDs: Unique strings you provide for each item. Treat them as required rather than counting on Chroma to generate them for you, and pick meaningful ones so you can easily update or reference things later.
Embeddings: The actual vectors. You can either pass in raw text and let Chroma’s built-in default Sentence Transformers model handle the embedding, or you can pass in your own pre-computed vectors for more control.
Documents: The original text that an embedding represents. This is crucial because when you get a search result back, you probably want to see the actual text “The company reported strong quarterly earnings,” not just a list of numbers that vaguely smell like profit.

Here’s how you bring this triad to life:

import chromadb
from chromadb.utils import embedding_functions

# Persistent, serverless mode. Your data gets saved to the `./my_chroma_db` directory.
client = chromadb.PersistentClient(path='./my_chroma_db')

# Create a collection. `embedding_function` is optional; if you omit it, Chroma falls back
# to its default embedding function, so passing it here just makes that choice visible.
collection = client.get_or_create_collection(
    name="my_articles",
    embedding_function=embedding_functions.DefaultEmbeddingFunction()
)

# Add your data: ids and documents, plus (optional) metadatas.
# No embeddings are passed here, so Chroma generates them from the documents.
collection.add(
    documents=[
        "The latest smartphone model features a revolutionary camera.",
        "Financial markets reacted positively to the new legislation.",
        "The soccer team secured a place in the finals after a thrilling match.",
        "A breakthrough in renewable energy promises cheaper solar power."
    ],
    metadatas=[
        {"category": "technology", "word_count": 98},
        {"category": "finance", "word_count": 120},
        {"category": "sports", "word_count": 87},
        {"category": "science", "word_count": 105}
    ],
    ids=["id1", "id2", "id3", "id4"]
)

Querying: The Whole Point of This Exercise

Querying is where you see the payoff. You ask Chroma for the n most similar items to your query text or vector. The beauty is that it handles the embedding of your query for you, ensuring it’s comparable to the vectors in the collection.

# Let's find documents similar to a query about technology.
results = collection.query(
    query_texts=["innovative tech gadgets"],
    n_results=2
)

print(results['documents'])
# Example output (the second hit can vary with the embedding model):
# [['The latest smartphone model features a revolutionary camera.',
#   'A breakthrough in renewable energy promises cheaper solar power.']]

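Notice what happened? Our query was about “tech gadgets,” but the second result is the document about renewable energy. That’s not a mistake. We asked for two results, so Chroma returns the two nearest documents no matter how distant the runner-up is, and of the three that remain, the embedding model judges “breakthrough in renewable energy” closest to “innovative tech gadgets” because both relate to technological innovation. Nearest does not have to mean near, which is why the distances returned alongside each result are worth a look.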
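
To see how strong each match actually is, you can pair every hit with its distance. The sketch below assumes the usual shape of a query result: a dictionary whose 'ids', 'documents', and 'distances' entries each hold one inner list per query. The helper name top_hits is hypothetical, and the sample dictionary is hand-written with invented distances so the snippet runs without a model.

```python
from typing import Any


def top_hits(results: dict[str, Any], query_index: int = 0) -> list[tuple[str, str, float]]:
    """Pair each hit's id, document, and distance for a single query."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    dists = results["distances"][query_index]
    return list(zip(ids, docs, dists))


# Hand-written stand-in for a collection.query(...) result.
# The distance values are invented for illustration, not real model output.
sample_results = {
    "ids": [["id1", "id4"]],
    "documents": [[
        "The latest smartphone model features a revolutionary camera.",
        "A breakthrough in renewable energy promises cheaper solar power.",
    ]],
    "distances": [[0.9, 1.4]],
}

for item_id, doc, dist in top_hits(sample_results):
    # Smaller distance means a closer match.
    print(f"{dist:.2f}  {item_id}  {doc}")
```

With real results, a large jump in distance between the first and second hit is a hint that the runner-up is only there because you asked for two.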

Where the Shoe Pinches: Common Pitfalls and Best Practices

Chroma is brilliant for what it is, but it has rough edges. The designers made some… interesting choices.

Pitfall 1: The Silent Default Embedding Function. DefaultEmbeddingFunction(), which is also what you get when you leave embedding_function out entirely, uses the all-MiniLM-L6-v2 model from Sentence Transformers. It’s a good general-purpose model, but its weights (~80MB) are downloaded the first time you use it. The “silent” part is the issue: nothing in your code hints that the first call needs network access and a writable cache. For a production application, you absolutely must define your embedding function explicitly. Relying on a magical default that downloads things is a recipe for disaster at 3 AM.

Best Practice: Explicit is Better Than Implicit. Always specify your embedding function, even if it’s the default one. This makes your code self-documenting and stable.

from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

# Be explicit. This names the same model (it needs the `sentence-transformers` package installed),
# and now everyone (including Future You) knows what's happening.
embedding_function = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
collection = client.get_or_create_collection(name="my_explicit_collection", embedding_function=embedding_function)

Pitfall 2: The get Method’s Bait-and-Switch. collection.get() exists. You might think, “Great, I’ll get my data back!” But what you actually get is a dictionary of parallel lists keyed by field (ids, documents, metadatas, and so on), and the embeddings are left out unless you ask for them with include. If you just want the text, you have to pluck it out yourself. This isn’t a bug, it’s just an API quirk you need to know.

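# Gets ids, documents, and metadatas. Embeddings are skipped unless you request them.
all_data = collection.get()
print(all_data.keys()) # includes 'ids', 'embeddings', 'documents', 'metadatas' (exact keys vary by version)
print(all_data['embeddings']) # None by default, because embeddings were not requested

# If you just want the text, narrow the include list and index into the result.
just_the_docs = collection.get(include=["documents"])["documents"]

# Ask for the vectors explicitly when you do need them.
with_vectors = collection.get(include=["embeddings", "documents"])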
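
If you would rather work with one record per item than with parallel lists, a small helper can zip them together. This is a sketch that assumes the result is a dictionary with an 'ids' list and optional 'documents' and 'metadatas' lists, where fields that were not requested are None or absent. The name rows_from_get is hypothetical, and the sample dictionary is hand-written so the snippet runs on its own.

```python
from typing import Any


def rows_from_get(result: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn the parallel lists of a get() result into one dict per item."""
    ids = result["ids"]
    # Fields that were not requested may be None or missing; pad them out.
    docs = result.get("documents") or [None] * len(ids)
    metas = result.get("metadatas") or [None] * len(ids)
    return [
        {"id": item_id, "document": doc, "metadata": meta}
        for item_id, doc, meta in zip(ids, docs, metas)
    ]


# Hand-written stand-in for collection.get(include=["documents"]).
sample = {
    "ids": ["id1", "id2"],
    "embeddings": None,
    "documents": [
        "The latest smartphone model features a revolutionary camera.",
        "Financial markets reacted positively to the new legislation.",
    ],
    "metadatas": None,
}

for row in rows_from_get(sample):
    print(row["id"], "->", row["document"])
```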

Pitfall 3: It’s Not a Battle-Tested Beast. Chroma is relatively young. It’s fantastic for local development and medium-sized projects, but I wouldn’t bet the entire fortune of a publicly traded company on its persistence layer just yet. For that, you’d want a more mature system like Qdrant or Pinecone. Chroma’s strength is its simplicity and speed of setup, not its ability to handle petabytes of data with five-nines uptime.

The key takeaway? Use Chroma when you need to get something working now without the friction of external infrastructure. It’s the duct tape and WD-40 of the vector database world—unbeatable for quick fixes and prototypes, even if you’ll eventually swap it out for something more heavy-duty.