36.1 CLIP: Contrastive Language-Image Pretraining

Alright, let’s talk about CLIP. You know how most AI models are specialists? The image guy only labels images, the text guy only generates text. They’re like savants at a party who can only talk about one thing. CLIP (Contrastive Language-Image Pre-training) from OpenAI is the charming polymath who can actually connect the two. It’s the model that made “multimodal” a buzzword you couldn’t escape, and for good reason.

The core idea is so brilliantly simple you’ll kick yourself for not thinking of it first: instead of training a model to predict captions from images (or vice versa) directly, we train it to simply understand which pieces of text go with which images. It’s the ultimate “match the caption to the picture” game, played with hundreds of millions of examples scraped from the internet. This is called contrastive learning. The model isn’t learning to generate a description; it’s learning a shared representation space where paired images and text are close together, and unpaired ones are far apart. Think of it as teaching the model the concept of “this goes with that” at a fundamental level.

The Core Idea: Contrast is Key

Here’s the magic trick. CLIP consists of two separate encoders: a Vision Encoder (a modified ResNet or a Vision Transformer) and a Text Encoder (a Transformer with a GPT-2-style architecture). They don’t share weights, and neither one ever looks at the other’s input. The only place they meet is the loss function, which trains both of them jointly.

The image encoder takes an image and spits out a feature vector. The text encoder takes a caption and spits out a feature vector. These vectors are projected into a shared embedding space and normalized onto a unit hypersphere (fancy talk for scaling each one to length 1, so that a dot product between two of them is their cosine similarity). The training objective is to maximize the cosine similarity between the vector of an image and the vector of its true, matching text caption, while simultaneously minimizing the similarity between that image’s vector and the vectors of all the other text captions in the batch (the “negatives”). The same game is played in the other direction: each caption is pulled toward its own image and pushed away from every other image in the batch. The model gets really, really good at pulling the right pairs together and shoving the wrong ones apart.
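To make the objective above concrete, here is a minimal NumPy sketch of a symmetric contrastive loss in the spirit of the pseudocode in the CLIP paper. It is an illustration under stated assumptions, not OpenAI's training code: the function names are made up for this example, the embeddings are plain arrays rather than encoder outputs, and the temperature is held fixed at its initial value (in CLIP it is a learned parameter).

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, logit_scale=1.0 / 0.07):
    """Symmetric cross-entropy over a batch of N matched (image, text) pairs.

    image_emb, text_emb: arrays of shape (N, D); row i of each is a true pair.
    logit_scale: temperature multiplier. Assumption: fixed here, learned in CLIP.
    """
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    # (N, N) matrix of scaled cosine similarities; the diagonal holds true pairs.
    logits = logit_scale * image_emb @ text_emb.T
    idx = np.arange(logits.shape[0])
    # Image -> text: each row should put its probability mass on the diagonal.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text -> image: each column should put its probability mass on the diagonal.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2.0

# Toy batch: four perfectly separated pairs, then the same captions mismatched.
toy_emb = np.eye(4)
aligned_loss = clip_contrastive_loss(toy_emb, toy_emb)
mismatched_loss = clip_contrastive_loss(toy_emb, np.roll(toy_emb, 1, axis=0))
print(f"aligned: {aligned_loss:.4f}  mismatched: {mismatched_loss:.4f}")
```

The aligned batch should score a much lower loss than the mismatched one, which is the whole point: the loss only goes down when true pairs sit on the diagonal of the similarity matrix.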

This is why it’s so powerful. It learns a rich, joint embedding space. The vector for a picture of a corgi sitting in a field and the vector for the text “a corgi sitting in a field” end up being very close neighbors in this high-dimensional space.

Putting CLIP to Work: Zero-Shot Classification

This is where CLIP flexes its muscles and makes traditional computer vision models look a bit old-fashioned. Since CLIP understands images and text in the same space, you can perform classification without ever training on your specific labels.

Here’s how it works. Instead of having a fixed set of class outputs like “cat” or “dog”, you define your classes as text prompts. You then let CLIP’s text encoder generate feature vectors for each of these prompts. For your input image, you get a feature vector from the image encoder. You then calculate the cosine similarity between the image vector and every single text prompt vector. The prompt with the highest similarity wins. It’s a nearest-neighbor search in the embedding space you just learned about.

Let’s see it in action. First, install the darn thing: pip install ftfy regex tqdm torch torchvision and pip install git+https://github.com/openai/CLIP.git.

import torch
import clip
from PIL import Image

# Load the model and the preprocessing pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # 'ViT-B/32' is a good default

# Load and preprocess your image
image = Image.open("your_corgi_photo.jpg")
image_input = preprocess(image).unsqueeze(0).to(device)  # Add a batch dimension

# Define your candidate labels. Be creative, this is the magic.
text_labels = ["a photo of a corgi", "a photo of a cat", "a photo of a dog", "a diagram", "a painting of a landscape"]
# Tokenize them and move to device
text_inputs = clip.tokenize(text_labels).to(device)

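# Get features from both encoders, then score them (all inside no_grad,
# otherwise .numpy() fails on a tensor that still requires grad)
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

    # Normalize to unit length so the dot product is a cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Scale by the learned temperature, then softmax over the candidate labels
    logit_scale = model.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()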

# Print results
print("Label probabilities:")
for label, prob in zip(text_labels, probs[0]):
    print(f"{label}: {prob:.4f}")

The Art of the Prompt: Your New Superpower

You’ll notice I didn’t just use “corgi”, I used “a photo of a corgi”. This is the single most important trick with CLIP. The text encoder was trained on natural language captions from the web, not on single words. “a photo of a corgi” is a much better match for its training data than just “corgi”. The choice of prompt has a massive impact on accuracy.

This is both a feature and a rough edge. You have to prompt-engineer, almost like you would for a language model. For ImageNet, the paper found that wrapping the label in a template like “a photo of a {label}.” beats the bare label, and for a pets dataset, adding context such as “a photo of a {label}, a type of pet.” helped further. It’s absurd that we have to guess this, but it works. Best practice? Ensemble your prompts. Try several related prompts (“a photo of a corgi”, “a cute corgi”, “a pembroke welsh corgi”) and combine the results; the paper does this by averaging text embeddings across 80 templates for ImageNet. An ensemble is usually more robust than any single prompt.
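One common way to combine prompts is to ensemble in embedding space: average the normalized text embeddings of all prompts for a class, re-normalize, and use that single vector as the class representative. The sketch below shows only the arithmetic, on toy 3-dimensional vectors. The function names are hypothetical, and in a real pipeline the input array would come from running the text encoder over every (class, prompt) combination.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def ensemble_class_embeddings(prompt_embeddings):
    """Collapse several prompt embeddings per class into one vector per class.

    prompt_embeddings: array of shape (num_classes, num_prompts, dim).
    Returns an array of shape (num_classes, dim) with unit-length rows.
    """
    normalized = l2_normalize(prompt_embeddings)  # unit length per prompt
    mean = normalized.mean(axis=1)                # average over the prompts
    return l2_normalize(mean)                     # re-normalize the average

def zero_shot_predict(image_embedding, class_embeddings):
    """Return (index of the best class, cosine similarity to every class)."""
    sims = l2_normalize(class_embeddings) @ l2_normalize(image_embedding)
    return int(np.argmax(sims)), sims

# Toy stand-ins for text-encoder outputs: 2 classes x 2 prompt variants x 3 dims.
toy_prompts = np.array([
    [[1.0, 0.1, 0.0], [0.9, 0.0, 0.2]],  # class 0, two phrasings
    [[0.0, 1.0, 0.1], [0.1, 0.8, 0.0]],  # class 1, two phrasings
])
class_vectors = ensemble_class_embeddings(toy_prompts)
predicted, sims = zero_shot_predict(np.array([0.1, 2.0, 0.0]), class_vectors)
print(predicted, sims)
```

Averaging embeddings rather than probabilities has a practical upside: the ensemble costs nothing extra at inference time, because each class still ends up with exactly one vector that you can compute once and cache.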

Common Pitfalls and Where CLIP Stumbles

CLIP is not a panacea. It has biases, blind spots, and can be hilariously wrong.

Data Biases: It was trained on internet data. The internet is a messed-up place. CLIP will inherit all its biases, from over-representing Western concepts to reinforcing gender and racial stereotypes. You must test your application thoroughly for these failures.
Abstract Concepts: It’s great at concrete objects (“dog”, “car”) but can struggle with abstract concepts (“freedom”, “melancholy”) or complex relational tasks (“is the person pointing at the dog?”). Its understanding is often superficial.
Fine-Grained Details: It’s not great at counting things or reading super fine text in images. Don’t use it as an OCR engine.
The “Unknown Unknown”: It will always give you an answer, even if it’s completely wrong. The softmax forces the probabilities over your labels to sum to 1, so the distribution might look confident even when the image is nothing like any of them. There’s no inherent “I don’t know” signal. You might need to set a minimum threshold on the raw cosine similarity (not on the softmax output) to handle this.
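The “I don’t know” workaround from the last item can be sketched as follows. This is a minimal illustration, not a recommended recipe: the function name is hypothetical, the vectors are toy stand-ins for encoder outputs, and the 0.2 cutoff is a placeholder you would have to tune on held-out data for your own model and prompts.

```python
import numpy as np

def classify_with_rejection(image_embedding, text_embeddings, labels, min_similarity=0.2):
    """Return (label, similarity), or (None, similarity) if nothing is close enough.

    min_similarity is a hypothetical cutoff on the raw cosine similarity.
    It is applied before any softmax, because a softmax always sums to 1
    and can look confident even when every candidate is a poor match.
    """
    image_unit = image_embedding / np.linalg.norm(image_embedding)
    text_unit = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    sims = text_unit @ image_unit          # cosine similarity to every label
    best = int(np.argmax(sims))
    best_sim = float(sims[best])
    if best_sim < min_similarity:
        return None, best_sim              # abstain instead of guessing
    return labels[best], best_sim

# Toy stand-ins for text-encoder outputs, one row per label.
toy_labels = ["a photo of a corgi", "a photo of a cat"]
toy_text = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])

match = classify_with_rejection(np.array([0.9, 0.1, 0.0]), toy_text, toy_labels)
reject = classify_with_rejection(np.array([0.0, 0.0, 1.0]), toy_text, toy_labels)
print(match)
print(reject)
```

The first toy image points almost exactly along the corgi direction and is accepted; the second is orthogonal to both labels, so the function abstains rather than picking the least-bad option.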

Despite its quirks, CLIP is a foundational model. It shows up all over image generation (the original DALL-E used it to rerank generated samples, DALL-E 2 is built on CLIP embeddings, and Stable Diffusion conditions on the text encoder of a CLIP model), and it powers zero-shot classifiers and image retrieval systems. It taught us that pairing vision and language through contrastive learning is one of the most effective ways to build a general-purpose visual understanding machine. Now go prompt it wisely.