27.6 Weaviate: Vector Database with Hybrid Search

Right, so you’ve got your data embedded into these beautiful, high-dimensional vectors. You can find similar stuff, which is magical. But let’s be real, you’re not just going to search by vector similarity. You’re going to want to say, “Hey, find me sci-fi books from the 80s that are similar to Dune.” That’s where Weaviate struts onto the stage. It’s a vector database that doesn’t force you to choose between the fuzzy magic of vector search (“like this”), the lexical precision of keyword search (“mentions this”), and the hard rules of structured filters (“from the 80s”). You can have all three in one query. Blending the keyword and vector rankings is called hybrid search, and it’s the main reason I reach for Weaviate for so many projects.

Think of it this way: a keyword search scores documents by how well their exact terms match your query, so a document that never mentions the word gets nothing. A vector search scores things in a continuous, “how close is it in meaning?” way. Weaviate’s party trick is fusing these two scores, the keyword BM25F score and the vector similarity score, into a single ranked list. It’s like having a librarian who’s both a relentless categorizer and a deeply intuitive curator.
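To make the fusion idea concrete, here is a minimal sketch of one way two score lists can be blended: rescale each to the 0..1 range, then take a weighted sum. This is a simplification for intuition, not Weaviate’s actual implementation. The function names and all the scores are made up, and the weight is called alpha only because that is the name Weaviate uses for its blend parameter.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant (or empty) input maps to zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse(keyword_scores, vector_scores, alpha=0.5):
    """Blend two score dicts: alpha=0 is keyword only, alpha=1 is vector only."""
    kw = min_max(keyword_scores)
    vec = min_max(vector_scores)
    fused = {
        doc: (1 - alpha) * kw.get(doc, 0.0) + alpha * vec.get(doc, 0.0)
        for doc in set(kw) | set(vec)
    }
    # Highest fused score first; ties broken by title for a stable order
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores, invented purely for illustration
keyword_scores = {"Dune": 2.1, "Neuromancer": 0.4, "The Shining": 0.0}
vector_scores = {"Dune": 0.82, "Neuromancer": 0.88, "The Left Hand of Darkness": 0.70}

ranked = fuse(keyword_scores, vector_scores, alpha=0.5)
for title, score in ranked:
    print(f"{title}: {score:.3f}")
```

Note that a document missing from one list still gets ranked; it simply contributes zero from that side. That is why a book with no keyword overlap can still surface on the strength of its vector score.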

The Core Architecture: Classes, Properties, and Vectors

Weaviate structures data into “Classes” (think of them as tables; the v4 Python client used below calls them collections). Each data object belongs to a class and has “Properties” (the fields, like title, year, genre). The crucial bit is that you define which property (or properties) gets vectorized. This is where the embedding happens.

Let’s get our hands dirty. First, we define a schema. This is where you tell Weaviate exactly what you’re storing and how to handle it.

import weaviate
from weaviate.classes.config import Configure, Property, DataType

client = weaviate.connect_to_local()

# Check if the class exists and delete it to start fresh (for this example)
if client.collections.exists("Book"):
    client.collections.delete("Book")

# Define the schema for our Book class
client.collections.create(
    name="Book",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="author", data_type=DataType.TEXT),
        Property(name="year", data_type=DataType.INT),
        Property(name="genre", data_type=DataType.TEXT),
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_transformers(), # Use a local model
    generative_config=Configure.Generative.openai() # Optional, for generative search
)

client.close()

See that vectorizer_config? I’m telling Weaviate, “Hey, use this text2vec-transformers model to automatically vectorize the text you ingest.” You could also use OpenAI, Cohere, or others, or even pass in your own pre-computed vectors. The generative_config is a bonus—it lets us do cool RAG stuff later, which we’ll get to.

Ingesting Data: It’s Just JSON

Now, let’s shove some data in. Weaviate’s data model is JSON-based, which is a breath of fresh air.

import weaviate

client = weaviate.connect_to_local()
books = client.collections.get("Book")

# Queue objects in a batch; the context manager flushes anything pending on exit
with books.batch.dynamic() as batch:
    batch.add_object(properties={"title": "Dune", "author": "Frank Herbert", "year": 1965, "genre": "sci-fi"})
    batch.add_object(properties={"title": "Neuromancer", "author": "William Gibson", "year": 1984, "genre": "sci-fi"})
    batch.add_object(properties={"title": "The Left Hand of Darkness", "author": "Ursula K. Le Guin", "year": 1969, "genre": "sci-fi"})
    batch.add_object(properties={"title": "The Shining", "author": "Stephen King", "year": 1977, "genre": "horror"})

# Batch errors are collected rather than raised, so check for them explicitly
failed = books.batch.failed_objects
if failed:
    print(f"{len(failed)} objects failed to import; first failure: {failed[0]}")

client.close()

The batch inserter is your friend. Use it. Don’t hammer the database with individual inserts unless you enjoy watching paint dry.

The Main Event: Hybrid Querying

This is what you came for. The hybrid query is your gateway. You pass it a query string, which Weaviate uses twice: as the BM25 keyword query and, after running it through the vectorizer, as the vector query. The alpha parameter controls the blend. alpha=1 is pure vector search. alpha=0 is pure keyword search. Anything in between is a weighted fusion. It’s a dial, not a switch.

import weaviate
from weaviate.classes.query import Filter, MetadataQuery

client = weaviate.connect_to_local()
books = client.collections.get("Book")

# The magic line: a hybrid search for 80s sci-fi similar to "space opera"
response = books.query.hybrid(
    query="space opera",  # Used for both the BM25 keyword match and the vector search
    filters=(
        Filter.by_property("year").greater_or_equal(1980)
        & Filter.by_property("year").less_than(1990)  # Upper bound keeps it to the 80s
        & Filter.by_property("genre").equal("sci-fi")
    ),
    alpha=0.5,  # A 50/50 blend of keyword and vector relevance
    limit=3,
    return_metadata=MetadataQuery(score=True),  # Without this, metadata.score is None
)

for obj in response.objects:
    print(f"{obj.properties['title']} ({obj.properties['year']}) - by {obj.properties['author']}")
    print(f"Hybrid score: {obj.metadata.score:.3f}\n")

client.close()

This query is the whole point. It’s filtering on year and genre with hard rules (no horror novels from 1999 are getting in here), while the hybrid search is looking for items whose text properties are both keyword-relevant and vector-similar to “space opera”. The alpha=0.5 means we’re giving equal weight to both strategies.
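It helps to see how hard the “hard rules” are. The sketch below mimics the filter in plain Python against the four books ingested earlier, with no Weaviate involved. The passes_filter helper is hypothetical, and the upper bound on year is an assumption that pins the range to the 1980s.

```python
# The same four books that were batch-inserted earlier
books = [
    {"title": "Dune", "author": "Frank Herbert", "year": 1965, "genre": "sci-fi"},
    {"title": "Neuromancer", "author": "William Gibson", "year": 1984, "genre": "sci-fi"},
    {"title": "The Left Hand of Darkness", "author": "Ursula K. Le Guin", "year": 1969, "genre": "sci-fi"},
    {"title": "The Shining", "author": "Stephen King", "year": 1977, "genre": "horror"},
]


def passes_filter(book):
    """Hypothetical stand-in for the query's filter: 1980s and sci-fi."""
    return 1980 <= book["year"] < 1990 and book["genre"] == "sci-fi"


# Only books that pass the filter are eligible to be ranked at all
candidates = [book["title"] for book in books if passes_filter(book)]
print(candidates)  # ['Neuromancer']
```

So even with limit=3, this sample data can return at most one result. Dune may be the most “space opera” book in the set, but a 1965 publication date keeps it out no matter how well it scores.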

Pitfalls and The Alpha Dial

The biggest mistake I see? People just set alpha=0.5 and forget it. Don’t. The right value is entirely dependent on your data and use case. If your keywords are highly precise (e.g., product SKUs, unique IDs), lean towards keyword (alpha near 0). If you’re searching based on conceptual similarity with messy, natural language queries (e.g., “sad songs about rain”), lean on the vectors (alpha near 1). You must test and tune this. It’s not a “set it and forget it” thing; it’s the core of your search’s personality.
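“Test and tune” can be as simple as sweeping alpha over a handful of queries where you already know the right answer. Here is a rough sketch using mean reciprocal rank as the yardstick. Everything in it is an assumption for illustration: the two queries, the already-normalized scores, the blend and mean_reciprocal_rank helpers, and the grid of alpha values. In a real project the scores would come from running each query against your own data.

```python
def blend(kw, vec, alpha):
    """Rank doc ids by a weighted sum of keyword and vector scores."""
    ids = set(kw) | set(vec)

    def score(doc):
        return (1 - alpha) * kw.get(doc, 0.0) + alpha * vec.get(doc, 0.0)

    # Sort by descending score, then by id so ties are deterministic
    return sorted(ids, key=lambda doc: (-score(doc), doc))


def mean_reciprocal_rank(queries, alpha):
    """Average of 1/rank of the known-relevant doc across all queries."""
    total = 0.0
    for q in queries:
        ranking = blend(q["kw"], q["vec"], alpha)
        total += 1.0 / (ranking.index(q["relevant"]) + 1)
    return total / len(queries)


# Hypothetical labeled queries with made-up, pre-normalized scores
queries = [
    {   # SKU-style lookup: the keyword side knows the answer
        "kw": {"A": 1.0, "B": 0.4, "C": 0.0},
        "vec": {"A": 0.1, "B": 1.0, "C": 0.0},
        "relevant": "A",
    },
    {   # Conceptual query: the vector side knows the answer
        "kw": {"X": 0.25, "Y": 0.0, "Z": 0.1},
        "vec": {"X": 0.0, "Y": 1.0, "Z": 0.2},
        "relevant": "Y",
    },
]

alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
results = {alpha: mean_reciprocal_rank(queries, alpha) for alpha in alphas}
best_alpha = max(results, key=results.get)

for alpha, mrr in results.items():
    print(f"alpha={alpha:.2f}  MRR={mrr:.3f}")
print(f"best alpha on this toy set: {best_alpha}")
```

On this toy set the best value is not 0.5, which is the whole point: the sweet spot falls wherever your mix of queries puts it.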

Another gotcha: Weaviate’s BM25 implementation is powerful, but it’s still lexical. It won’t find “SF” if you search for “sci-fi” unless you’ve configured synonyms. This is where the hybrid approach saves you—the vector search will make that conceptual connection, even if the keywords don’t match.
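The lexical gap is easy to demonstrate. The sketch below uses a deliberately naive tokenizer (lowercase, split on anything that isn’t a letter or digit). It is an assumption for illustration and not how Weaviate tokenizes text, but the underlying problem is the same: no shared terms means no keyword match.

```python
import re


def tokenize(text):
    """Naive tokenizer: lowercase, then split on non-alphanumeric characters."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shared_terms(query, document):
    """Terms that appear in both the query and the document."""
    return tokenize(query) & tokenize(document)


# "SF" and "sci-fi" mean the same thing but share no tokens
miss = shared_terms("sci-fi", "A classic SF novel about desert politics")
hit = shared_terms("sci-fi", "The best sci-fi of the decade")

print(sorted(miss))  # []
print(sorted(hit))   # ['fi', 'sci']
```

A purely lexical scorer has nothing to work with in the first case, so the document’s keyword score is zero and only the vector side can rescue it.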

Weaviate isn’t perfect. The learning curve for its module system (like setting up a custom vectorizer) can be steep. But its raw power for building semantically-aware applications without forcing you to abandon the tried-and-true tools of database filtering is, frankly, a game-changer. It’s the database that finally gets that meaning and metadata are two sides of the same coin.