34.7 SAM 2: Video Segmentation

Alright, let’s talk about SAM 2. You remember the original Segment Anything Model (SAM), right? That glorious, promptable image segmentation engine that felt like magic? Well, Meta decided it was too much fun to leave in the static image world and dropped SAM 2 on us. The core idea is as brilliant as it is obvious: extend that promptable segmentation magic to video. The results are, frankly, both impressive and occasionally a bit unhinged. This isn’t just running SAM frame-by-frame; that would be computationally suicidal and give you a jittery mess that would induce migraines. SAM 2 is smarter than that, and we’re going to tear into how.

The Core Architecture: It’s All About the Tracks

The fundamental shift from SAM to SAM 2 is the addition of memory. SAM was a genius at understanding spatial relationships within a single frame. SAM 2 adds a temporal dimension: as it moves through a video, it remembers what the target object looked like in earlier frames and uses that to find it in the current one.

Under the hood, SAM 2 processes the video as a stream, one frame at a time. An image encoder turns each frame into features, and a memory attention module conditions those features on a memory bank that holds encoded features and mask predictions from recent frames, plus any frames you prompted. Think of it as the model asking, for every new frame: “Given what I remember about this object, where is it now?” Regions of the current frame that closely match the remembered object get segmented, and the result is encoded and written back into memory. That is how a mask gets carried, or “tracked,” from one frame to the next.
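As a rough illustration of conditioning the current frame on earlier ones, here is a toy NumPy sketch in which the current frame’s features attend over a small, bounded store of past-frame features. It is not SAM 2’s code: the names `MemoryBank` and `attend_to_memory`, the feature sizes, the memory length, and the single-head attention without learned projections are all simplifying assumptions.

```python
from collections import deque

import numpy as np


class MemoryBank:
    """Toy FIFO memory: keeps features from the most recent `max_frames` frames."""

    def __init__(self, max_frames: int = 6):  # 6 is an arbitrary illustrative size
        self.frames = deque(maxlen=max_frames)  # oldest entry is dropped automatically

    def add(self, features: np.ndarray) -> None:
        self.frames.append(features)

    def as_array(self) -> np.ndarray:
        # Stack all remembered tokens into one (num_memory_tokens, dim) array.
        return np.concatenate(list(self.frames), axis=0)


def attend_to_memory(current: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Single-head cross-attention: current-frame tokens query the memory tokens."""
    dim = current.shape[-1]
    scores = current @ memory.T / np.sqrt(dim)      # (n_current, n_memory) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over memory tokens
    return current + weights @ memory               # residual, memory-conditioned features


rng = np.random.default_rng(0)
bank = MemoryBank(max_frames=3)
for _ in range(5):  # stream five frames, one at a time
    frame_tokens = rng.normal(size=(16, 32))  # 16 tokens of dimension 32 per frame
    if bank.frames:
        conditioned = attend_to_memory(frame_tokens, bank.as_array())
    else:
        conditioned = frame_tokens  # first frame: nothing to remember yet
    bank.add(conditioned)
```

The bounded deque is the point of the sketch: each new frame only ever looks at a fixed-size window of memory, so the per-frame cost of the attention step does not grow with video length.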

This is why it’s so much better than a naive frame-by-frame approach. That approach would segment a dog in frame 1, then segment a slightly different-looking dog in frame 2, and have no idea they’re the same entity. SAM 2 knows they’re the same dog, so it produces a single, consistent mask track for the entire sequence. It’s the difference between a flipbook and a movie.
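If you want to put a number on “jittery,” one simple proxy is the average overlap between masks on consecutive frames. The sketch below is an assumption of mine rather than a metric SAM 2 reports, and the helper names are hypothetical. Keep in mind that a fast-moving object legitimately lowers the score, so it is only meaningful when comparing methods on the same clip.

```python
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two binary masks of the same shape."""
    a = a.astype(bool)
    b = b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfectly consistent
    return float(np.logical_and(a, b).sum() / union)


def temporal_consistency(masks: np.ndarray) -> float:
    """Mean IoU between each pair of consecutive masks in a (T, H, W) stack."""
    ious = [mask_iou(masks[t], masks[t + 1]) for t in range(len(masks) - 1)]
    return float(np.mean(ious))


# A mask that stays put across three frames...
stable = np.zeros((3, 8, 8), dtype=bool)
stable[:, 2:6, 2:6] = True

# ...versus one that vanishes in the middle frame, as a flickering tracker might.
flicker = stable.copy()
flicker[1] = False

print(temporal_consistency(stable), temporal_consistency(flicker))
```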

Prompting in Time: The Real Magic

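The promptable interface is SAM’s killer feature, and SAM 2 keeps it intact. You can prompt with clicks, boxes, or masks on any frame, and it will propagate that segmentation through the rest of the video. If the track drifts, you can add a corrective click on a later frame and propagate again. The released SAM 2 models have no built-in text prompt; if you want to ask for “the cat” by name, a separate detector has to turn that text into a box first. Point-and-click propagation is where it feels like pure sorcery.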

Let’s look at some code. Imagine you have a video of your cat, Professor Fluffypants, menacing a potted plant. You want to track the cat.

import os

import cv2
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Load the model. Both paths are placeholders: point them at the checkpoint you
# downloaded and at the config file that matches it.
sam2_checkpoint = "PATH_TO/sam2.pt"
model_cfg = "PATH_TO/sam2_config.yaml"
predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint)
# Note: the automatic mask generator (SAM2AutomaticMaskGenerator) works on single
# images and is built separately; prompted video tracking does not need it.

# Dump the video to a folder of numbered JPEG frames, the input format the video
# predictor reads (newer releases may also accept a video file directly).
video_path = "professor_fluffypants.mp4"
frames_dir = "professor_fluffypants_frames"
os.makedirs(frames_dir, exist_ok=True)
cap = cv2.VideoCapture(video_path)
frame_count = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # cap.read() returns BGR, which is what cv2.imwrite expects: no conversion.
    cv2.imwrite(os.path.join(frames_dir, f"{frame_count:05d}.jpg"), frame)
    frame_count += 1
cap.release()

# Let's say you've identified a point on the cat in the first frame.
# Coordinates are (x, y) in pixels; label 1 means "foreground", 0 "background".
input_point = np.array([[550, 300]], dtype=np.float32)  # Point on the cat's head
input_label = np.array([1], dtype=np.int32)

video_masks = {}
with torch.inference_mode():
    # Load the frames and set up the per-video tracking state.
    inference_state = predictor.init_state(video_path=frames_dir)

    # Register the prompt on frame 0 for an object we choose to call ID 1.
    predictor.add_new_points_or_box(
        inference_state=inference_state,
        frame_idx=0,  # The frame we're prompting on
        obj_id=1,
        points=input_point,
        labels=input_label,
    )

    # Propagate through the video. Each step yields mask logits for one frame;
    # thresholding at 0 turns them into a binary mask of shape (1, H, W).
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(inference_state):
        video_masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()

# `video_masks` now maps each frame index to a boolean mask for the cat. Boom.
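
Once you have a binary mask for a frame, the quickest sanity check is to paint it over the frame and look at it. The helper below is a minimal sketch; `overlay_mask`, its default color, and its blending weight are my own illustrative choices, not part of any SAM 2 API. It assumes an RGB `uint8` frame and a mask with the same height and width (a leading singleton dimension is squeezed away).

```python
import numpy as np


def overlay_mask(frame_rgb: np.ndarray, mask: np.ndarray,
                 color=(255, 0, 0), alpha: float = 0.5) -> np.ndarray:
    """Blend `color` into `frame_rgb` wherever `mask` is True; returns a new uint8 image."""
    mask = np.squeeze(mask).astype(bool)  # accept (1, H, W) as well as (H, W)
    if mask.shape != frame_rgb.shape[:2]:
        raise ValueError(f"mask shape {mask.shape} does not match frame {frame_rgb.shape[:2]}")
    out = frame_rgb.astype(np.float32)    # astype makes a copy, so the input is untouched
    out[mask] = (1.0 - alpha) * out[mask] + alpha * np.array(color, dtype=np.float32)
    return out.round().astype(np.uint8)


# Tiny demo: a 4x4 black frame with a single masked pixel.
demo_frame = np.zeros((4, 4, 3), dtype=np.uint8)
demo_mask = np.zeros((1, 4, 4), dtype=bool)
demo_mask[0, 1, 1] = True
demo_out = overlay_mask(demo_frame, demo_mask, alpha=1.0)
```

Writing the overlaid frames back out with `cv2.imwrite` (after converting RGB to BGR) gives you a flipbook you can scrub through to spot where a track goes wrong.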

The Rough Edges and Questionable Choices

Now, let’s be the brilliant friend who tells you the unvarnished truth. SAM 2 is not perfect. The designers made some choices, bless their hearts.

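First, the computational cost. It’s a beast. Processing long, high-resolution videos requires serious GPU memory, because by default the reference predictor keeps the loaded frames and the per-frame tracking state around for the whole video. You’ll quickly run into “CUDA out of memory” errors if you’re not careful. The predictor has options to offload frames and state to CPU memory, which buys headroom at the cost of speed. The other common workaround is to process videos in chunks, which introduces its own headaches about stitching tracks together at the boundaries. That stitching is left for you to solve.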
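One way to tame the boundary problem is to make consecutive chunks share a frame or two, so the last mask from one chunk can seed the next chunk as a prompt. The sketch below only computes the frame ranges; `chunk_ranges` is a hypothetical helper, and how you feed the seed mask back in depends on the predictor you are using.

```python
def chunk_ranges(num_frames: int, chunk_size: int, overlap: int = 1):
    """Split [0, num_frames) into (start, end) ranges, end exclusive.

    Consecutive ranges share `overlap` frames so a mask from the end of one
    chunk can be reused as the prompt at the start of the next.
    """
    if overlap < 0 or chunk_size <= overlap:
        raise ValueError("need 0 <= overlap < chunk_size")
    if num_frames <= 0:
        return []
    step = chunk_size - overlap
    ranges = []
    start = 0
    while True:
        end = min(start + chunk_size, num_frames)
        ranges.append((start, end))
        if end == num_frames:
            break
        start += step
    return ranges


print(chunk_ranges(10, 4, overlap=1))  # [(0, 4), (3, 7), (6, 10)]
```

With an overlap of one, frame 3 is the last frame of the first chunk and the first frame of the second, so the mask predicted for it is the natural hand-off point.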

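Second, long occlusions are its kryptonite. SAM 2 does predict whether the object is visible in each frame, so a brief trip behind the couch often produces a few empty masks followed by a clean recovery. But if Professor Fluffypants stays hidden for a long stretch, or a second, similar-looking cat wanders in, the model may fail to pick him up again or latch onto the wrong animal. The memory isn’t infinitely persistent. You can usually fix this by adding a prompt on the frame where he reappears, but that breaks the automation.

Third, there is no native text prompting. To go from a phrase to a mask you have to bolt on a separate text-conditioned detector that produces a box, then hand that box to SAM 2. That combination can be… whimsical. It tends to work well for simple nouns (“cat”, “plant”), but be prepared for some truly absurd failures of vision-language understanding with more complex queries, usually because the detector boxed the wrong thing rather than because the tracking failed. It’s a reminder that we’re still very much in the early days of this tech.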
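A cheap way to find where a track may have broken is to look for runs of frames whose mask is empty or nearly empty. The sketch below is an assumption of mine about a useful diagnostic, not something SAM 2 provides; `lost_spans` and its `min_area` threshold are hypothetical. An empty span can mean the object really was hidden, so treat the output as a list of places to inspect and, if needed, re-prompt on the frame after the span ends.

```python
import numpy as np


def lost_spans(masks: np.ndarray, min_area: int = 1):
    """Return inclusive (first, last) frame spans where the mask area is below `min_area`.

    `masks` is a boolean array of shape (T, H, W), one mask per frame.
    """
    areas = masks.reshape(len(masks), -1).sum(axis=1)  # foreground pixels per frame
    lost = areas < min_area
    spans = []
    start = None
    for t, is_lost in enumerate(lost):
        if is_lost and start is None:
            start = t                      # a lost run begins here
        elif not is_lost and start is not None:
            spans.append((start, t - 1))   # the run ended on the previous frame
            start = None
    if start is not None:
        spans.append((start, len(lost) - 1))  # run extends to the end of the video
    return spans


# Demo: object visible on frames 0, 1 and 4; missing on frames 2, 3 and 5.
demo_masks = np.zeros((6, 8, 8), dtype=bool)
for t in (0, 1, 4):
    demo_masks[t, 2:5, 2:5] = True
print(lost_spans(demo_masks))  # [(2, 3), (5, 5)]
```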

Best Practices from the Trenches

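Pre-process Your Video: Downsample it. Seriously. Unless you need pixel-perfect accuracy, reduce the resolution. How much you gain depends on your pipeline: the model works at a fixed internal resolution, so the savings come mostly from decoding, storing, and post-processing smaller frames and masks, usually for a very small, often imperceptible, loss in quality. cv2.resize() is your friend. Just remember to scale your prompt coordinates by the same factor.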
Choose Your Prompt Wisely: A prompt in a cluttered, ambiguous frame will lead to a bad time. Prompt on a frame where your target object is clear, isolated, and well-defined. A good box is often more stable than a single click.
Mind the Background: SAM 2 can get distractingly good at segmenting everything. If you only care about one object, use a precise prompt, and add a background click on anything it wrongly swallows. If you want everything, one approach is to run the automatic mask generator on a single frame, use its masks as prompts for tracking, and be prepared to handle a mountain of mask data.
Post-processing is Non-optional: The raw output masks might be noisy. Use simple morphological operations via cv2.morphologyEx(): an opening (cv2.MORPH_OPEN) removes stray specks and a closing (cv2.MORPH_CLOSE) fills tiny holes. It makes a world of difference.

SAM 2 is a powerful tool that feels like it’s from the near future. It’s not without its quirks, but understanding its temporal-tracking core and its practical limitations is the key to making it sing. Now go forth and segment all the videos. Just maybe don’t try to run it on a 4K feature film with your laptop GPU. I’ve made that mistake so you don’t have to.