24.4 Embedding Models: OpenAI, Sentence Transformers, and BGE

Alright, let’s talk about the unsung hero of the RAG pipeline: the embedding model. This is the part that takes your brilliant, messy, human-language queries and documents and squishes them down into a list of numbers—a vector—that a computer can actually reason about. Get this right, and your RAG system sings. Get it wrong, and you’re just doing a very expensive, very slow keyword search. We’re not here for that.

The core idea is simple: we want sentences or phrases with similar meanings to be close together in this high-dimensional number space (what we call the “vector space”). Dissimilar things should be far apart. It’s like a nightclub: the goths are in one corner, the tech bros are by the crypto punch, and the indie kids are… well, they’re too cool to be in any specific corner. Your embedding model is the bouncer with a PhD in semantics, deciding who belongs where.
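
To make "close together" concrete, here's a minimal sketch of cosine similarity, the usual closeness measure for embeddings. The three-dimensional vectors are made-up stand-ins (real models return hundreds of dimensions), and the function name is my own, not any library's.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 = same direction, near 0.0 = unrelated."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy, hand-written "embeddings" purely for illustration (not real model output).
goth_1 = [0.9, 0.1, 0.0]
goth_2 = [0.8, 0.2, 0.1]
tech_bro = [0.0, 0.2, 0.9]

print(cosine_similarity(goth_1, goth_2))    # high: same corner of the club
print(cosine_similarity(goth_1, tech_bro))  # low: opposite corners
```

A real embedding model's whole job is to produce vectors for which this score tracks meaning.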

Your Three Main Contenders

You’ve essentially got three choices here, ranging from “easy but costs money” to “free and brilliant but needs a bit more work.”

OpenAI’s text-embedding-ada-002: This is the default for a reason. It’s the convenience store of embeddings: always open, pretty good, and you just pay with an API call. It’s a solid, general-purpose model that does a lot of things well without any fuss. The biggest perk? You don’t host it. The biggest downside? You pay per token, and your data gets sent over the wire to OpenAI. For prototypes, for production systems where you don’t want the ops headache, or for data you’re comfortable sending to a third party, it’s a fantastic starting point. (OpenAI has since released the newer text-embedding-3 models; the calling pattern below stays the same, only the model name changes.)

```
# First: pip install openai
import os

from openai import OpenAI

# Read the key from the environment rather than hard-coding it in source.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def get_ada_embedding(text):
    """Get an embedding from OpenAI's ada-002 model."""
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text
    )
    return response.data[0].embedding

# Usage
vector = get_ada_embedding("Why did the chicken cross the road?")
print(f"Vector dimension: {len(vector)}") # Spoiler: it's 1536
```

Hugging Face’s Sentence Transformers (all-MiniLM-L6-v2): This is the workhorse of the open-source world. You run it on your own hardware (even your laptop!), it’s free, and it’s blazingly fast. The all-MiniLM-L6-v2 model is a masterpiece of distillation: its MiniLM backbone was trained to mimic a much larger, slower model and then fine-tuned on sentence pairs, so it stays tiny and efficient. It’s the go-to for getting started locally. The dimension is only 384, which is both its strength (speed, smaller vector database) and its weakness (somewhat less nuanced than larger models).

```
# First: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Load the model (downloads on first run)
model = SentenceTransformer('all-MiniLM-L6-v2')

def get_local_embedding(text):
    """Get an embedding using the local Sentence Transformer model."""
    return model.encode(text).tolist()

# Usage
vector = get_local_embedding("To get to the other side.")
print(f"Vector dimension: {len(vector)}") # 384
```
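
To put a number on "smaller vector database," here's a back-of-the-envelope sketch. It assumes vectors are stored as raw 32-bit floats with no index overhead or compression, which a real vector database will not match exactly; the function name is my own.

```python
def raw_vector_storage_gb(num_vectors, dim, bytes_per_value=4):
    """Raw storage for float32 vectors, in gigabytes (1 GB = 10**9 bytes)."""
    return num_vectors * dim * bytes_per_value / 1e9

one_million = 1_000_000
print(raw_vector_storage_gb(one_million, 384))   # 384-dim (MiniLM):   1.536
print(raw_vector_storage_gb(one_million, 1536))  # 1536-dim (ada-002): 6.144
```

Four times the dimensions means four times the raw storage, before any index overhead.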

BAAI’s BGE Models (e.g., BAAI/bge-base-en-v1.5): This is where things get spicy. The Beijing Academy of Artificial Intelligence (BAAI) has been pumping out models that have repeatedly ranked at or near the top of leaderboards like MTEB (the Massive Text Embedding Benchmark). They are powerhouses. If you need top-tier retrieval quality and you’re willing to use a slightly larger, slightly slower model than all-MiniLM, you use BGE. For serious production RAG that you host yourself, this is often the best choice.

```
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Note the specific model name
model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Instruction to prepend to search queries (never to passages/documents).
QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

def get_bge_embedding(text, is_query=False):
    """Embed text with BGE; pass is_query=True for search queries."""
    # BGE models work best when queries carry the retrieval instruction.
    # This is their secret sauce and a common pitfall. Passages go in as-is.
    if is_query:
        text = QUERY_INSTRUCTION + text
    return model.encode(text).tolist()

vector = get_bge_embedding("Chicken road crossing motivations", is_query=True)
print(f"Vector dimension: {len(vector)}") # 768
```

The Devil’s in the Details: Pitfalls & Best Practices

Normalization is Non-Negotiable: I don’t care which model you use, for the love of all that is good in vector search, normalize your embeddings. This means scaling each vector to have a length of 1. Why? Because cosine similarity between unit-length vectors is just a dot product, which is ludicrously simple and fast. Unnormalized vectors can have different magnitudes, which skews dot-product and Euclidean-distance scores. It’s like trying to compare the brightness of lightbulbs without controlling for distance. (OpenAI’s embeddings already come back at unit length, so normalizing them again is harmless. With Sentence Transformers it depends on the model, and passing normalize_embeddings=True to encode guarantees it.)
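```
import numpy as np

def normalize_vector(vector):
    """Scale a vector to unit (L2) length."""
    vector = np.asarray(vector, dtype=np.float32)
    norm = np.linalg.norm(vector)
    if norm == 0:
        raise ValueError("Cannot normalize a zero vector")
    return vector / norm

# `vector` is any embedding returned by one of the functions above
normalized_vector = normalize_vector(vector)
# Now you can efficiently compare it to other normalized vectors.
```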
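
Here's a minimal sketch of why that matters in practice: once every row is unit length, ranking documents against a query is a single matrix-vector product. The tiny 2-D vectors and the helper names are illustrative assumptions, not part of any library.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row of a 2-D array to unit L2 length."""
    matrix = np.asarray(matrix, dtype=np.float32)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)  # guard against divide-by-zero

def top_k(query_vec, doc_matrix, k=2):
    """Return indices of the k most similar rows, best first (inputs must be normalized)."""
    scores = doc_matrix @ query_vec  # dot product == cosine similarity for unit vectors
    return np.argsort(-scores)[:k].tolist()

# Made-up document vectors with very different magnitudes.
docs = normalize_rows([[10.0, 0.0], [0.0, 0.5], [3.0, 3.0]])
query = normalize_rows([[1.0, 0.2]])[0]

print(top_k(query, docs))  # ranks by direction, not by vector length
```
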

The Instruction Quirk (BGE): Notice that quirk in the BGE code? These models were trained with a retrieval instruction. For search, you should prepend a specific instruction to your query string ("Represent this sentence for searching relevant passages: ") to get the best possible representation; the documents you index go in without it. It’s a bit weird, but it works shockingly well. Forgetting to do this is like using a race car but never getting out of first gear.

Dimension Mismatch Mayhem: You cannot compare a 384-dim vector from MiniLM to a 1536-dim vector from Ada-002. They live in completely different universes. Even two models that share a dimension produce incompatible spaces, so matching sizes doesn’t save you. Pick one model and use it for everything: embedding your documents and your queries. Consistency is key. Your vector database is built for one specific geometry; don’t try to change the rules of physics halfway through.
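
One cheap way to enforce that consistency is to record the model name and dimension next to your vectors and refuse anything that doesn't match. The class below is a hypothetical in-memory stand-in for a real vector database, sketched only to show the check.

```python
import numpy as np

class TinyVectorStore:
    """Hypothetical in-memory store that pins one embedding model and dimension."""

    def __init__(self, model_name, dim):
        self.model_name = model_name
        self.dim = dim
        self.vectors = []

    def _check(self, vector, model_name):
        # Check the model as well as the size: a matching dimension alone proves nothing.
        if model_name != self.model_name:
            raise ValueError(f"Store uses {self.model_name!r}, got vector from {model_name!r}")
        if len(vector) != self.dim:
            raise ValueError(f"Expected {self.dim} dimensions, got {len(vector)}")

    def add(self, vector, model_name):
        self._check(vector, model_name)
        self.vectors.append(np.asarray(vector, dtype=np.float32))

store = TinyVectorStore("all-MiniLM-L6-v2", dim=384)
store.add(np.zeros(384), model_name="all-MiniLM-L6-v2")  # accepted

try:
    store.add(np.zeros(1536), model_name="text-embedding-ada-002")
except ValueError as err:
    print(f"Rejected: {err}")
```
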

Batch Everything: Never, ever call an embedding model one sentence at a time in a loop. It’s horrifically inefficient. Both the Sentence Transformers library and the OpenAI API accept lists of texts. Process them in batches to maximize throughput and minimize cost/latency.
```
# `model` is a SentenceTransformer loaded as in the examples above
texts = ["sentence 1", "sentence 2", "sentence 3"]

# GOOD
all_embeddings = model.encode(texts, batch_size=32) # 32 is the default; raise it if your hardware has room

# BAD
for text in texts:
    vector = model.encode(text) # Painfully slow
```
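
The same idea applies to any hosted API: send lists, not single strings, and chunk a large corpus so each request stays a manageable size. Below is a minimal sketch. The names embed_in_batches and fake_embed are hypothetical, and fake_embed stands in for a real call (such as model.encode or an OpenAI request) so the example runs offline.

```python
def embed_in_batches(texts, embed_fn, batch_size=32):
    """Call embed_fn on slices of `texts` and return one flat list of vectors."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))  # one call per batch, not per text
    return vectors

calls = []

def fake_embed(batch):
    """Stand-in embedder: records the batch size and returns a dummy 2-D vector per text."""
    calls.append(len(batch))
    return [[float(len(text)), 0.0] for text in batch]

texts = [f"sentence {i}" for i in range(70)]
vectors = embed_in_batches(texts, fake_embed, batch_size=32)

print(len(vectors))  # 70
print(calls)         # [32, 32, 6]
```

Seventy texts become three calls instead of seventy, and the order of the returned vectors still matches the order of the inputs.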

So, which one should you use? Start with all-MiniLM-L6-v2 for prototyping on your machine. Ramp up to BGE-base or BGE-large for a serious, self-hosted production system. Use text-embedding-ada-002 if you value convenience over cost and data privacy. There’s no single right answer, only the right answer for your specific problem. Now go make some vectors.