27.9 Choosing Embedding Dimensions and Distance Metrics
Alright, let’s get into the weeds. You’ve got your data ready to be vectorized, and now you’re staring at the configuration for your embedding model. Two questions immediately slap you in the face: “How many dimensions?” and “Which distance metric?” These aren’t just academic preferences; they’re fundamental choices that will dictate your system’s performance, cost, and sanity. Get them wrong, and you’ll be chasing weird accuracy issues for weeks. Let’s get them right.
The Goldilocks Problem of Dimensionality
More dimensions sound better, right? More information, more nuance. Sure, but only up to a point. Then you hit the law of diminishing returns followed by a brick wall of computational pain.
Low dimensions (e.g., 128) are fast and cheap. Your vectors are tiny, so storage and memory are minimal, and distance calculations are blisteringly fast. The problem? They might not have enough “room” to capture the intricate relationships in your data. It’s like trying to describe the plot of Inception using only ten words. You’ll lose the good stuff.
High dimensions (e.g., 2048) can capture incredible nuance. But now you’re flirting with the Curse of Dimensionality. In super high-dimensional space, everything starts to become equidistant. It’s counterintuitive, but the math checks out. This makes it harder for your database to find meaningful neighbors because the concept of “closeness” gets blurry. It’s also expensive. Every extra dimension adds to your index size, memory footprint, and query latency.
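If the “everything becomes equidistant” claim sounds like hand-waving, you can watch it happen. The snippet below is a minimal sketch, not a benchmark: it assumes random Gaussian points rather than real embeddings, and `relative_contrast` is a name made up for this illustration. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest for one random query point.

    Points are drawn from a standard Gaussian, which is an assumption of
    this sketch; real embeddings are more structured than this.
    """
    rng = random.Random(seed)
    query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dists.append(math.dist(query, point))
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest

# The contrast shrinks as the dimension grows: "near" and "far" blur together.
for dim in (2, 32, 512, 2048):
    print(f"dim={dim:>4}  relative contrast={relative_contrast(dim):.2f}")
```

Real embeddings hold up better than random noise, so treat the trend, not the exact numbers, as the takeaway.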
So what’s the sweet spot? You don’t choose it. I’m serious. The model designers did. For most modern pre-trained models (think OpenAI’s text-embedding-ada-002, Sentence Transformers, etc.), the dimension size is baked in. ada-002 gives you 1536 dimensions. Full stop. That’s the product. The choice, therefore, is which model to use, and you base that on benchmarks for your specific type of data (text, image, etc.). The model’s performance is the validation of its dimensionality.
# This is what you'll actually do. You don't set the dims, you choose the model.
from sentence_transformers import SentenceTransformer
# This model outputs 384-dimensional vectors. I didn't choose 384, the all-MiniLM-L6-v2 model did.
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["Your text here", "More text here"])
print(f"Embedding dimension: {embeddings.shape[1]}")
# Output: Embedding dimension: 384
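Your job is to ensure your chosen model’s dimensionality is supported by your vector database. Most handle vectors of a couple thousand dimensions without breaking a sweat, but limits vary, so check the documented maximum before you commit to a model.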
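Dimensionality also sets a floor on your storage bill, and it’s worth doing the arithmetic before you pick a model. Here’s a back-of-the-envelope sketch. It assumes vectors are stored as float32 (4 bytes per value) and counts only the raw vectors; index structures and metadata come on top. `raw_vector_bytes` is a hypothetical helper, not part of any library.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for the vectors alone, assuming float32 by default."""
    return num_vectors * dim * bytes_per_value

# Compare a 384-dim model against a 1536-dim model at one million vectors.
for dim in (384, 1536):
    gib = raw_vector_bytes(1_000_000, dim) / 2**30
    print(f"{dim} dims x 1M vectors = {gib:.2f} GiB before index overhead")
```

At a million vectors that’s roughly 1.43 GiB versus 5.72 GiB, and the gap scales linearly with your corpus.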
Picking Your Poison: Distance Metrics
This is where you do have a choice, and it matters. The distance metric is the rulebook for how your database measures “similarity.” Get it wrong, and your results will be nonsense.
Cosine Similarity is the default for a reason, especially for text. It measures the angle between vectors, ignoring their magnitude. This is brilliant because it focuses on direction, which corresponds to semantic meaning. A long article and a short tweet about the same topic can end up with embeddings of quite different magnitudes (vector lengths, not word counts) but similar directions. Cosine similarity nails this.
Dot Product is related to cosine similarity but is influenced by magnitude: Cosine Similarity = Dot Product / (Magnitude(A) * Magnitude(B)). If your embeddings are normalized (length of 1), Dot Product and Cosine Similarity are identical. Some models are trained to produce normalized embeddings, which makes Dot Product a slightly cheaper option to compute. But if they’re not normalized, a vector with a large magnitude (say, from a huge article) will tend to look “more similar” to everything than a small one, regardless of direction, which is usually bad.
Euclidean Distance (L2) is your classic “as-the-crow-flies” distance. It cares about magnitude. It’s often the go-to for CV (computer vision) and other non-text data where the actual vector position is intrinsically meaningful.
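The three metrics are easier to compare side by side. Below is a minimal pure-Python sketch with toy 3-dimensional vectors; the helper names and the vectors are made up for illustration, and in practice your database or NumPy does this for you. It shows that cosine ignores magnitude, that Dot Product and L2 don’t, and that for unit-length vectors Dot Product equals cosine.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.dist(a, b)

def unit(a):
    """Scale a vector to length 1."""
    n = norm(a)
    return [x / n for x in a]

# Same direction, very different magnitudes (3 vs. 30).
small = [1.0, 2.0, 2.0]
large = [10.0, 20.0, 20.0]
print("cosine:", cosine_similarity(small, large))   # direction only
print("dot:   ", dot(small, large))                 # grows with magnitude
print("L2:    ", euclidean_distance(small, large))  # sees them as far apart

# With unit vectors, dot product and cosine agree, and squared L2
# is just 2 - 2 * cosine, so all three rank neighbors the same way.
u, v = unit([1.0, 2.0, 2.0]), unit([2.0, 1.0, 2.0])
print("dot == cosine:", math.isclose(dot(u, v), cosine_similarity(u, v)))
print("L2^2 == 2 - 2cos:",
      math.isclose(euclidean_distance(u, v) ** 2, 2 - 2 * cosine_similarity(u, v)))
```

That last identity is why the metric choice matters much less once everything is normalized: the three metrics stop disagreeing about who the nearest neighbor is.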
Here’s the pragmatic choice tree:
- Working with text? Use Cosine Similarity. Just do it. It’s the default for a reason.
- Using a model that produces normalized embeddings? You can use Dot Product for a tiny speed boost.
- Working with image or audio data where Euclidean distance is standard? Use Euclidean (L2).
Never assume. Check your model’s documentation. The biggest pitfall here is mismatching the metric your model was optimized for. Some models are trained explicitly with a cosine objective. Using L2 with them will give you subpar results.
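To see what a mismatch actually costs you, here’s a toy sketch with hand-picked 2-dimensional vectors (the candidate names and values are invented for illustration). The same query returns a different “best match” depending on which metric does the ranking.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

query = [1.0, 1.0]
candidates = {
    "same_direction_far": [10.0, 10.0],  # identical direction, large magnitude
    "nearby_off_axis": [1.2, 0.8],       # close in space, slightly different direction
    "forty_five_degrees_off": [1.0, 0.0],
}

# Cosine: higher is better. L2: lower is better.
best_by_cosine = max(candidates, key=lambda k: cosine_similarity(query, candidates[k]))
best_by_l2 = min(candidates, key=lambda k: math.dist(query, candidates[k]))

print("best by cosine:", best_by_cosine)
print("best by L2:    ", best_by_l2)
```

Neither answer is wrong in the abstract. One of them is wrong for your model, and only the model’s documentation can tell you which.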
# Important: You must configure your index to use the right metric!
# This is for Weaviate (v4 Python client), but the concept is universal across all DBs.
import weaviate
from weaviate.classes.config import Configure, VectorDistances
# Assumes a Weaviate instance running locally; adjust the connection for your setup.
client = weaviate.connect_to_local()
# Creating a collection with cosine distance configured
client.collections.create(
    name="MyCollection",
    properties=[...],  # your data properties here
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),  # model choice
    vector_index_config=Configure.VectorIndex.hnsw(
        distance_metric=VectorDistances.COSINE  # <- This is the critical choice
    ),
)
client.close()
The Normalization Trap
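Here’s a fun “gotcha.” Let’s say you pick Cosine Similarity. On paper, cosine divides out the magnitudes, and many vector databases do that normalization for you. But some engines implement “cosine” as a plain dot product and simply assume unit-length vectors, and Dot Product itself always carries that assumption. If your model doesn’t output normalized vectors and your database doesn’t normalize them for you, you must normalize them yourself before ingestion. If you don’t, the math is wrong. It’s like trying to measure an angle with a ruler.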
import numpy as np
from sklearn.preprocessing import normalize
# Suppose your model outputs non-normalized vectors
raw_embeddings = model.encode(["Your text here"])
# Normalize each row to unit length before putting them in the DB
normalized_embeddings = normalize(raw_embeddings, norm='l2')
# Take the norm per row (axis=1) so the check still works with more than one vector
print(f"Original magnitude: {np.linalg.norm(raw_embeddings, axis=1)[0]:.2f}")
print(f"Normalized magnitude: {np.linalg.norm(normalized_embeddings, axis=1)[0]:.2f}")
# Illustrative output for a model that does not normalize its vectors:
# Original magnitude: 5.32
# Normalized magnitude: 1.00
Always, always check if your embedding model outputs normalized vectors. It’s a one-line check (np.linalg.norm(vector)). If the magnitude is ~1.0, you’re golden. If not, normalize. It saves you from a world of confusing, terrible search results.
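If you’d rather not rely on remembering to eyeball it, you can wrap the check in a small guard that runs before ingestion. This is a minimal pure-Python sketch: `is_unit_length` and `ensure_unit_length` are hypothetical helpers, and the tolerance of 1e-3 is an arbitrary assumption you should tune to your model’s numeric precision.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def is_unit_length(vec, tol=1e-3):
    """True if the vector's magnitude is within tol of 1.0."""
    return math.isclose(l2_norm(vec), 1.0, abs_tol=tol)

def ensure_unit_length(vec, tol=1e-3):
    """Return vec unchanged if already unit length, otherwise a normalized copy."""
    n = l2_norm(vec)
    if n == 0.0:
        # A zero vector has no direction, so there is nothing to normalize.
        raise ValueError("cannot normalize a zero vector")
    if math.isclose(n, 1.0, abs_tol=tol):
        return list(vec)
    return [x / n for x in vec]

print(is_unit_length([0.6, 0.8]))      # magnitude is exactly 1
print(is_unit_length([3.0, 4.0]))      # magnitude is 5
print(ensure_unit_length([3.0, 4.0]))  # scaled down to unit length
```

Failing loudly on a zero vector is a deliberate choice here: silently ingesting one would give you a row that matches nothing sensibly under any metric.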
The bottom line: Your distance metric isn’t a preference; it’s a contract between how your model represents data and how your database queries it. Honor the contract.