27.7 Qdrant: Rust-Powered Vector Search

Right, so you’ve got your embeddings. A beautiful, high-dimensional representation of your data, probably spat out by some massive transformer model that’s costing you a small fortune in cloud credits. Wonderful. But a list of numbers is useless unless you can do something with it—specifically, find the other lists of numbers that are closest to it. That’s your vector search, and that’s where Qdrant comes in.

Think of it like this: if your embedding model is the brilliant, slightly unhinged artist who sees the world in 11-dimensional space, Qdrant is the ruthlessly efficient, hyper-organized librarian who can find you a near-identical painting in a gallery of billions before you’ve finished your coffee. And this librarian is built in Rust, which means it’s fast, memory-safe, and doesn’t crash when you look at it funny.

Why Rust? It’s Not Just for Hype

You might be wondering why you should care about the implementation language. It’s a database; it should just work, right? Well, yes, but the “how” dictates the “how well.” Vector search is a brutally performance-sensitive problem. You’re calculating distances (Euclidean, Cosine, Dot Product) across potentially billions of vectors, and you’re doing it in real-time. This requires fine-grained control over memory, CPU cache lines, and thread management.
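To make those three distance functions concrete, here is a minimal pure-Python sketch. The helper names (`dot`, `euclidean`, `cosine_similarity`, `nearest_by_cosine`) and the toy vectors are made up for illustration; a real engine does the same arithmetic with SIMD over millions of vectors rather than in a Python loop.

```python
import math

def dot(a, b):
    """Dot product: larger means more similar (for comparable magnitudes)."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine similarity: dot product of the two vectors after normalization."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def nearest_by_cosine(query, candidates):
    """Brute-force search: score every candidate, keep the best one."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))

# Hypothetical toy data: three 2-dimensional "embeddings".
query = [1.0, 0.0]
candidates = {"a": [2.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
best = nearest_by_cosine(query, candidates)
```

Every one of these is a loop over all dimensions, repeated for every candidate, which is why the per-vector cost matters so much at scale.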

C++ has traditionally owned this space, but it lets you shoot yourself in the foot with a cannon. Rust gives you that same low-level control but with a compiler that acts like a supremely pedantic but brilliant co-pilot. It won’t let you have data races, dangling pointers, or a whole class of memory-related bugs. The result? You get the raw speed of a systems language without the existential dread of a midnight call that your production database segfaulted. Qdrant leverages this to do things like performant SIMD operations and clever memory-mapping, which brings me to…

The Storage Smarts: In-Memory vs. Memmap

This is a classic trade-off: speed vs. resource usage. Qdrant is smart about it.

from qdrant_client import QdrantClient
from qdrant_client.http import models

# For all-out, hold-nothing-back speed, you keep everything in RAM.
# Note that ":memory:" is the client's local mode: ephemeral and meant
# for tests and prototyping, not a benchmark of the real engine.
client = QdrantClient(":memory:")

# For a more realistic production setup, you'd connect to a server.
# But let's talk storage config on the collection itself.
client.create_collection(
    collection_name="my_embeddings",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    # This is the key bit: where vector data lives as the collection grows
    optimizers_config=models.OptimizersConfigDiff(
        # Threshold in kilobytes, applied per segment: segments whose vectors
        # exceed ~20,000 KB are stored as memory-mapped files instead of RAM.
        memmap_threshold=20000
    )
)

The memmap_threshold is your dial. It is measured in kilobytes and applied per segment (the chunks Qdrant splits a collection into), not as a count of vectors. Segments below the threshold keep their vectors entirely in RAM, which is blazingly fast. Once a segment grows past it, Qdrant stores that segment as a memory-mapped file. This uses the OS’s virtual memory system to load only the parts of the data it needs into RAM as it searches. It’s slower than pure RAM when pages have to come from disk, but it’s a graceful degradation that lets you work with datasets vastly larger than your available memory without your server melting into a puddle of slag.
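If memory-mapping feels abstract, this sketch shows the underlying OS mechanism using only Python’s standard library. It is not how Qdrant lays out its files; the file name, the toy dimensionality `DIM`, and the helpers `write_vectors` and `read_vector` are all assumptions for illustration. The point is that reading vector 742 touches only the bytes at that offset, and the OS pages them in on demand.

```python
import mmap
import os
import struct
import tempfile

DIM = 4                 # toy dimensionality (hypothetical)
VECTOR_BYTES = DIM * 4  # float32 = 4 bytes per component

def write_vectors(path, vectors):
    """Write vectors back to back as little-endian float32 values."""
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(struct.pack(f"<{DIM}f", *vec))

def read_vector(mapped, index):
    """Decode one vector straight from the mapping, by byte offset."""
    return struct.unpack_from(f"<{DIM}f", mapped, index * VECTOR_BYTES)

vectors = [[float(i + j) for j in range(DIM)] for i in range(1000)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.bin")
    write_vectors(path, vectors)
    with open(path, "rb") as f:
        # Map the whole file; nothing is read until a page is accessed.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            vector_742 = read_vector(mapped, 742)
```

The trade-off is the one described above: a page that is already cached costs about the same as RAM, and a page that is not costs a disk read.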

The Real Magic: Scalar Quantization

Throwing hardware at the problem is expensive and boring. Being clever is cheap and cool. Qdrant’s killer feature for massive datasets is quantization. It sounds complex, but the concept is simple: instead of using 32-bit floats for every single number in your vectors, you can downsize them to 8-bit integers. This shrinks the vector data roughly 4x, so more of it fits in RAM and in your CPU’s fast caches, and distance calculations on the smaller integers get considerably faster.

The “scalar” part means it does this on a per-dimension basis, which is simpler and faster than more complex methods. The best part? You can often enable this with almost no perceptible loss in accuracy.
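Here is a minimal sketch of the general idea, assuming a simple linear mapping from a fixed range onto 256 integer codes. It is not Qdrant’s internal encoding; `quantize`, `dequantize`, and the range `[-1.0, 1.0]` are illustrative choices.

```python
def quantize(vector, lo, hi):
    """Map each float in [lo, hi] to an integer code in 0..255 (one byte)."""
    scale = (hi - lo) / 255
    codes = []
    for x in vector:
        clipped = min(max(x, lo), hi)  # values outside the range are clamped
        codes.append(round((clipped - lo) / scale))
    return bytes(codes), scale

def dequantize(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [lo + c * scale for c in codes]

vector = [-0.8, -0.1, 0.0, 0.35, 0.9]
lo, hi = -1.0, 1.0

codes, scale = quantize(vector, lo, hi)
restored = dequantize(codes, lo, scale)

# Worst-case rounding error is half a quantization step.
max_error = max(abs(a - b) for a, b in zip(vector, restored))
```

Five floats that would take 20 bytes as float32 now take 5 bytes, and each component is off by at most half a step (about 0.004 with this range), which is usually far below the noise in the embeddings themselves.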

# Continuing from the previous example, let's create a collection with quantization
client.create_collection(
    collection_name="my_quantized_embeddings",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    optimizers_config=models.OptimizersConfigDiff(memmap_threshold=20000),
    # Here's the quantization config
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,  # Exclude the most extreme 1% of values when computing the scaling bounds
            always_ram=True,  # Keep the quantized vectors always in RAM for speed
        ),
    ),
)

The quantile parameter is the real insight here. You don’t want a handful of extreme outlier values to dictate the scaling for your entire dataset, making everything else less precise. With a quantile of 0.99, the most extreme 1% of values are excluded when the quantization bounds are computed, leading to a much more representative and accurate scaling transformation. This is the kind of thoughtful design that shows the Qdrant team actually runs this stuff themselves.
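A quick experiment makes the effect visible. The sketch below compares plain min/max bounds against quantile-based bounds on toy data with one wild outlier. Everything in it is an assumption for illustration: `quantile_bounds` splits the excluded fraction evenly across both tails, which may not match how Qdrant computes its bounds, and the data is synthetic.

```python
def quantile_bounds(values, quantile):
    """Bounds that exclude the most extreme (1 - quantile) share of values,
    split evenly between the low and high tails (an assumption)."""
    ordered = sorted(values)
    tail = (1.0 - quantile) / 2
    lo_idx = int(tail * (len(ordered) - 1))
    hi_idx = len(ordered) - 1 - lo_idx
    return ordered[lo_idx], ordered[hi_idx]

def quantize_roundtrip(values, lo, hi):
    """Quantize to 256 levels within [lo, hi], then reconstruct the floats."""
    scale = (hi - lo) / 255
    out = []
    for x in values:
        clipped = min(max(x, lo), hi)
        out.append(lo + round((clipped - lo) / scale) * scale)
    return out

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# 999 well-behaved values in [-1, 1], plus a single outlier at 50.
typical = [-1.0 + 2.0 * i / 998 for i in range(999)]
values = typical + [50.0]

minmax_lo, minmax_hi = min(values), max(values)
q_lo, q_hi = quantile_bounds(values, 0.99)

# Measure precision on the typical values only.
minmax_error = mean_abs_error(typical, quantize_roundtrip(typical, minmax_lo, minmax_hi))
quantile_error = mean_abs_error(typical, quantize_roundtrip(typical, q_lo, q_hi))
```

With min/max bounds the single outlier stretches the range to 51 units, so the 256 levels are spread thin and the typical values land on only a handful of them. With quantile bounds the range hugs the bulk of the data and the error on typical values drops by more than an order of magnitude.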

The Pitfall: Filtered Search and the Curse of Cardinality

Here’s where everyone stumbles. You don’t just want similar vectors; you want similar vectors that also have a specific user_id, have is_published set to True, and were created in the last week. This is filtered search.

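The pitfall is filter cardinality: how many points survive the filter. Applying the filter after a full vector search (post-filtering) is inefficient, and with a strict filter it can hand you fewer results than you asked for, or none at all. Applying it before the search (pre-filtering) has its own problem: the ANN index was built over the whole collection, so the engine either brute-forces the surviving points or walks a graph whose useful neighbours have mostly been filtered out.

Imagine you have a billion vectors, but your filter user_id = 'user_123' matches only 10 of them. Post-filter, and the 10 nearest neighbours out of a billion almost certainly contain none of those 10. Pre-filter, and your super-optimized vector index does no work at all: the search degenerates into comparing the query against 10 vectors. That is cheap at 10 points, but the same strategy with a filter that matches 100 million points is a brute-force scan. A query planner that misjudges the cardinality picks the wrong strategy.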
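The two naive strategies are easy to compare on toy data. This sketch uses brute-force search over 100 one-dimensional points; the helper names and the data are hypothetical, and it models neither an ANN index nor Qdrant’s planner. It only shows how differently the two orderings behave when the filter is strict.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query, candidates, k):
    """Exact nearest neighbours by brute force."""
    return sorted(candidates, key=lambda p: squared_distance(query, p["vector"]))[:k]

def post_filter_search(query, points, predicate, k):
    """Search everything first, then drop results that fail the filter."""
    return [p for p in top_k(query, points, k) if predicate(p)]

def pre_filter_search(query, points, predicate, k):
    """Filter first, then search only the survivors (no index involved)."""
    return top_k(query, [p for p in points if predicate(p)], k)

# 100 toy points; only ids 0, 25, 50, 75 belong to "user_123".
points = [
    {"id": i, "vector": [float(i)], "user_id": "user_123" if i % 25 == 0 else "other"}
    for i in range(100)
]

def is_user_123(p):
    return p["user_id"] == "user_123"

query = [0.0]
post_ids = [p["id"] for p in post_filter_search(query, points, is_user_123, k=3)]
pre_ids = [p["id"] for p in pre_filter_search(query, points, is_user_123, k=3)]
```

Post-filtering asks for three results and gets one, because two of the three nearest points belong to someone else. Pre-filtering returns the right three, but only by scanning every surviving point, which stops being cheap once the filter matches millions of them.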

Qdrant handles this by taking the filter and the query vector together in a single search call, so the engine can decide how to combine them:

from qdrant_client.models import Filter, FieldCondition, MatchValue

# This is how you do it right.
# Note: newer qdrant-client releases deprecate search() in favour of query_points().
results = client.search(
    collection_name="my_quantized_embeddings",
    query_vector=[0.1, 0.2, 0.3, ...],  # your query embedding
    query_filter=Filter(  # Apply a structured filter
        must=[
            FieldCondition(key="user_id", match=MatchValue(value="user_123")),
            FieldCondition(key="is_published", match=MatchValue(value=True))
        ]
    ),
    limit=10,
)

The trick is that Qdrant’s engine is designed to interleave the filtering and search processes efficiently. It uses the filter to narrow down the candidate list as it’s searching, not just before or after. It can only plan well if it can estimate how selective a filter is, so create payload indexes on the fields you filter by. And you still need to be mindful. If your filter is extremely selective (matching < 0.1% of points), you might be better off storing a separate, smaller collection for that user, though every extra collection carries its own overhead. It’s a trade-off between storage duplication and query latency, and you have to measure it for your own use case. There’s no free lunch, but Qdrant at least gives you a well-stocked kitchen to cook in.