34.6 Segment Anything Model (SAM): Zero-Shot Segmentation

Alright, let’s talk about the Segment Anything Model, or SAM. You’re going to hear a lot of hype about this one, and for once, a lot of it is actually justified. Think of SAM as that incredibly talented, slightly eccentric artist friend who can look at a canvas they’ve never seen before and immediately start painting perfect outlines of whatever you point to. It’s a zero-shot segmentation monster.

What makes SAM so bizarrely powerful is its training data. Meta built a segmentation data engine and used it to produce SA-1B, a dataset of over 1 billion masks spread across roughly 11 million images. Let that number sink in. That’s not a typo. That scale is a big part of why it can often segment objects it has never seen during training. It’s not recognizing a “cat” or a “car”; it’s recognizing the fundamental concept of a “coherent, separate thing” based on patterns and boundaries. It’s less about semantics and more about geometry.

The Three Inputs: Prompting a Segmentation AI

SAM doesn’t just magically know what to segment. You have to prompt it. The released model accepts points, boxes, and low-resolution mask inputs (typically the logits from a previous prediction, fed back in for refinement). Points and boxes are the ones you’ll use most; text is on the list below mainly because so many people assume it works. This is the core of its interactivity.

Spatial prompts: These are your clicks. A foreground click says, “Segment this thing.” A background click says, “No, not that thing, you idiot.” It’s remarkably good at resolving ambiguity from just a few clicks.
Box prompts: Draw a bounding box, given as [x0, y0, x1, y1] pixel coordinates. SAM will then segment the object within that box. It’s like giving it a much clearer search area. When you already have a rough location (say, from a detector), this is often the most reliable option.
Text prompts: Okay, this one is a bit of a letdown. The SAM paper explores text prompts only as a preliminary proof of concept, and the released code and checkpoints do not accept text at all. Everyone gets this wrong. The text-to-segmentation capability you might be thinking of came later from other projects that paired SAM with a text-aware model, such as a CLIP-like encoder or an open-vocabulary detector. Don’t blame SAM for this; it’s just a case of its fame overshadowing the details.
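
To make the prompt formats concrete, here is a minimal sketch of packing clicks and a box into the array shapes that the prompt-based API expects. The helper names pack_clicks and xywh_to_xyxy are made up for this example and are not part of segment_anything; the coordinates are arbitrary.

```python
import numpy as np


def pack_clicks(foreground, background=()):
    """Pack (x, y) clicks into a coordinate array and a label array.

    Returns point_coords with shape (N, 2) and point_labels with shape (N,),
    where 1 marks a foreground click and 0 marks a background click.
    """
    foreground = list(foreground)
    background = list(background)
    clicks = foreground + background
    if not clicks:
        raise ValueError("need at least one click")
    point_coords = np.asarray(clicks, dtype=np.float32).reshape(-1, 2)
    point_labels = np.array(
        [1] * len(foreground) + [0] * len(background), dtype=np.int32
    )
    return point_coords, point_labels


def xywh_to_xyxy(box_xywh):
    """Convert an (x, y, width, height) box to [x0, y0, x1, y1] corners."""
    x, y, w, h = box_xywh
    return np.array([x, y, x + w, y + h], dtype=np.float32)


# One "segment this" click, one "not that" click, and a box around the target.
coords, labels = pack_clicks(foreground=[(500, 375)], background=[(120, 40)])
box = xywh_to_xyxy((425, 300, 150, 150))
print(coords.shape, labels.tolist(), box.tolist())
```

With a SamPredictor that already has an image set, these arrays would go in as predictor.predict(point_coords=coords, point_labels=labels, box=box). The XYWH conversion is there because many annotation tools and detectors emit that layout, while the box prompt wants corner coordinates.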

Getting Your Hands on the Model

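First, you need the model weights and the library. The segment_anything Python package is your gateway. It expects PyTorch and torchvision to be installed already, and the examples here also use OpenCV (opencv-python) for image loading. You’ll also want a decent GPU if you plan on running the larger models without watching paint dry.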

pip install git+https://github.com/facebookresearch/segment-anything.git

Now, download the model weights. There are three variants: ViT-H (Huge), ViT-L (Large), and ViT-B (Base). The trade-off is simple: accuracy vs. speed. For most experimentation, ViT-B is a good start.

import torch
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Check for GPU, because you really, really want one.
device = "cuda" if torch.cuda.is_available() else "cpu"
model_type = "vit_b"
sam = sam_model_registry[model_type](checkpoint="path/to/sam_vit_b_01ec64.pth")
sam.to(device)

predictor = SamPredictor(sam)
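
The loading code above hard-codes one checkpoint path. If you expect to switch between variants, a small lookup keeps the model type and its weights file in sync. This is only a sketch: the file names are the ones published in the official repository at the time of writing, while checkpoint_path and the "checkpoints" directory are hypothetical names chosen for illustration.

```python
from pathlib import Path

# Checkpoint file names as published in the official repository
# (assumed current; verify against the README before relying on them).
CHECKPOINT_FILES = {
    "vit_h": "sam_vit_h_4b8939.pth",
    "vit_l": "sam_vit_l_0b3195.pth",
    "vit_b": "sam_vit_b_01ec64.pth",
}


def checkpoint_path(model_type, weights_dir="checkpoints"):
    """Return the expected local path of the weights file for a model type."""
    try:
        filename = CHECKPOINT_FILES[model_type]
    except KeyError:
        raise ValueError(
            f"unknown model type {model_type!r}; "
            f"expected one of {sorted(CHECKPOINT_FILES)}"
        ) from None
    return Path(weights_dir) / filename


print(checkpoint_path("vit_b"))
```

The returned path can be passed (as a string) to sam_model_registry[model_type](checkpoint=...), so changing model_type in one place is enough to swap variants.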

The Two Workflows: Automatic and Prompt-Based

SAM offers two main ways to generate masks. Confusing them is a common pitfall.

1. The SamPredictor and set_image method: This is for prompt-based segmentation. You first “set” an image, which runs the image through the ViT encoder once, caching the embedding. Then, you can provide prompts (points, boxes) and get masks in real-time. This is incredibly fast for interactive use.

import cv2
image = cv2.imread('your_image.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) # OpenCV uses BGR, SAM expects RGB
predictor.set_image(image)

# Define a prompt (e.g., a single point click at coordinates [x, y])
input_point = np.array([[500, 375]]) # Format as a 2D array
input_label = np.array([1]) # 1 = foreground, 0 = background

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True, # Let SAM offer multiple options
)

print(f"Generated {len(masks)} mask options with scores: {scores}")
# The mask with the highest score is usually masks[scores.argmax()]
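
Picking the top-scoring candidate is a one-liner, but it is worth wrapping so the mask, its score, and its index stay together. The sketch below uses synthetic arrays as stand-ins for the output of predictor.predict, so it runs without a model; pick_best_mask is a name invented for this example.

```python
import numpy as np


def pick_best_mask(masks, scores):
    """Return (mask, score, index) for the highest-scoring candidate."""
    masks = np.asarray(masks)
    scores = np.asarray(scores)
    if masks.shape[0] != scores.shape[0]:
        raise ValueError("masks and scores must have the same length")
    best = int(scores.argmax())
    return masks[best], float(scores[best]), best


# Synthetic stand-ins: three boolean masks of shape (H, W) and their scores.
fake_masks = np.zeros((3, 4, 6), dtype=bool)
fake_masks[0, 1:2, 1:2] = True  # tiny part
fake_masks[1, 1:3, 1:4] = True  # mid-sized object
fake_masks[2, :, :] = True      # whole frame
fake_scores = np.array([0.71, 0.93, 0.88])

mask, score, idx = pick_best_mask(fake_masks, fake_scores)
print(idx, round(score, 2), int(mask.sum()))
```

The highest score is not always the mask you want: the three candidates often correspond to different granularities (part, object, larger group), so in an interactive tool it can be better to show all of them and let the user choose.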

2. The SamAutomaticMaskGenerator: This thing goes brrrrr. It generates a segmentation mask for everything in the image without any prompts. It’s what people use for that “segment the whole world” demo. It works by sampling a grid of points over the image and prompting the model with each one. It’s computationally expensive and can be overkill, but it’s fantastic for discovering all possible objects.

from segment_anything import SamAutomaticMaskGenerator

mask_generator = SamAutomaticMaskGenerator(sam)
masks = mask_generator.generate(image)
# `masks` is a list of dicts, one per mask, with keys such as
# 'segmentation' (boolean array), 'area', 'bbox', 'predicted_iou', and 'stability_score'
print(f"Found {len(masks)} masks. That's a lot of stuff.")
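
To get a feel for the “grid of points” idea, here is a rough sketch of how such a grid could be laid out. It is an illustration of the concept rather than the library’s own code; uniform_point_grid is a hypothetical helper, and the half-cell inset from the image edges is an assumption of this sketch.

```python
import numpy as np


def uniform_point_grid(n_per_side, width, height):
    """Return an (n_per_side**2, 2) array of (x, y) pixel coordinates.

    Points are evenly spaced and inset by half a grid cell from each edge.
    """
    offset = 1.0 / (2 * n_per_side)
    ticks = np.linspace(offset, 1.0 - offset, n_per_side)
    xs, ys = np.meshgrid(ticks * width, ticks * height)
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)


grid = uniform_point_grid(n_per_side=4, width=640, height=480)
print(grid.shape)          # one row per point prompt
print(grid[0], grid[-1])   # first and last grid points
```

Each of these points acts like a single foreground click, which is why the cost grows with the square of the grid density. SamAutomaticMaskGenerator exposes this density through its points_per_side argument, so lowering it is the first knob to try when generation is too slow.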

Best Practices and Pitfalls

multimask_output=True is your friend: For an ambiguous prompt such as a single click, set this to True. SAM will return 3 candidate masks, and you can choose among them using the returned quality scores. It handles the inherent ambiguity of a single click brilliantly. Once the prompt is no longer ambiguous (several points, or a box plus points), multimask_output=False often gives a better single mask.
The Box is King: For a clean, precise mask, a bounding box prompt is hard to beat. It removes most of the ambiguity. If the box still contains more than one plausible object, adding a single foreground point inside it usually settles the matter.
Beware the Blob: On complex, textured backgrounds or images with many overlapping objects, the automatic mask generator can turn into a paranoid artist, seeing “things” in the noise. You’ll get a million tiny, meaningless segments. Always inspect its output before using it for anything serious.
It’s not a magician: SAM struggles with things that have fuzzy or non-existent boundaries (think fire, smoke, hair against a similar background) because the concept of a “thing” becomes ill-defined. It’s a geometry model, not a physics simulator.
Post-processing is usually worth it: The masks SAM returns are often… “noisy.” They can have disjointed parts, holes, and jagged edges. For production use, plan on a cleanup pass such as a morphological operation (like cv2.morphologyEx) or connected component analysis. SamAutomaticMaskGenerator also has a min_mask_region_area option that removes small disconnected regions and holes for you (it requires OpenCV).
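
As a rough illustration of the cleanup step, the sketch below does two things: it drops tiny segments from a list of automatic-generator records using their 'area' field, and it keeps only the largest connected region of a single mask. It is deliberately dependency-free (NumPy plus the standard library) so it runs anywhere. The function names are invented for this example, the data is synthetic, and the pure-Python flood fill is far too slow for full-size images.

```python
from collections import deque

import numpy as np


def drop_tiny_segments(records, min_area):
    """Keep only records whose 'area' (in pixels) is at least min_area."""
    return [r for r in records if r["area"] >= min_area]


def keep_largest_component(mask):
    """Return a 2D boolean mask with only its largest 4-connected region."""
    mask = np.asarray(mask, dtype=bool)
    height, width = mask.shape
    seen = np.zeros_like(mask)
    best = []
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        # Breadth-first flood fill from this unvisited foreground pixel.
        seen[sy, sx] = True
        queue = deque([(sy, sx)])
        component = []
        while queue:
            y, x = queue.popleft()
            component.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                inside = 0 <= ny < height and 0 <= nx < width
                if inside and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        if len(component) > len(best):
            best = component
    cleaned = np.zeros_like(mask)
    if best:
        ys, xs = zip(*best)
        cleaned[list(ys), list(xs)] = True
    return cleaned


noisy = np.zeros((6, 8), dtype=bool)
noisy[1:4, 1:5] = True  # main region, 12 pixels
noisy[5, 7] = True      # stray speck
cleaned = keep_largest_component(noisy)
print(int(noisy.sum()), int(cleaned.sum()))

records = [{"area": 12}, {"area": 1}, {"area": 300}]
print([r["area"] for r in drop_tiny_segments(records, min_area=10)])
```

In real code you would reach for cv2.connectedComponentsWithStats or cv2.morphologyEx instead of a hand-rolled flood fill; the logic is the same, just much faster.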

SAM is a foundational tool, not a finished product. It’s the raw, powerful engine upon which you’ll build your specific application. Its true power isn’t in what it segments, but in how easily you can tell it what to segment. Now go point at some stuff.