19.2 Feature Extraction: Freezing a Pretrained Backbone

Right, let’s talk about the most civilized form of digital cannibalism: feature extraction. You’ve got this model, probably some hulking behemoth like ResNet or VGG, that was trained for days on more than a million ImageNet images. It learned to recognize edges, textures, cat noses, dog ears, and eventually whole concepts. It’s brilliant at what it does. Your new task, however, is to identify whether a plant is diseased or to classify different types of vintage teapots. You don’t have a million images of teapots. You have, like, two hundred. This is where we get smart, steal all those beautiful, pre-learned feature detectors, and just slap a new head on top. We’re not going to mess with the genius backbone; we’re just going to use its brain.

The core idea is simple: you take a pretrained model, you freeze its weights (meaning you tell your optimizer “hands off these parameters!”), and you train only the new classifier layers you stick on the end. The frozen part becomes a fixed feature extractor. It’s like using a world-class chef to perfectly prep all your ingredients (chop, sear, season) and then you, the line cook, just decide how to arrange them on the plate.
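One way to make "fixed feature extractor" concrete: because the frozen part never changes, you could run it once over your data, cache the outputs, and train a small classifier on those cached features. The sketch below assumes a tiny hypothetical backbone and random toy data (neither comes from a real pretrained model), and it only works as written when there is no random augmentation, since augmented images would need fresh features every epoch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained backbone. A real one would be
# something like ResNet-18 with its final layer removed.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),  # -> 16-dimensional feature vector per image
)
backbone.eval()  # the extractor is used in inference mode only

# Toy data (assumption): 20 random "images" with labels from 5 classes.
images = torch.randn(20, 3, 32, 32)
labels = torch.randint(0, 5, (20,))

# Run the frozen extractor once. no_grad() means no autograd graph is built.
with torch.no_grad():
    features = backbone(images)  # shape: (20, 16)

# Train only a small linear head on the cached features.
head = nn.Linear(features.shape[1], 5)
optimizer = torch.optim.Adam(head.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

losses = []
for _ in range(50):
    optimizer.zero_grad()
    loss = criterion(head(features), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

The backbone never sees a gradient here; all the learning happens in `head`. Freezing the backbone inside a single model, as the rest of this section does, is the same idea without the explicit cache.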

Why Freeze the Backbone?

Two reasons, one obvious and one subtle. The obvious one: you protect the pretrained features. If you started tweaking the backbone with your piddly little dataset, the large, noisy gradients coming back from a randomly initialized head would rapidly overwrite the useful general-purpose features the network spent so long learning (a small-scale version of catastrophic forgetting), and the network would overfit to your handful of teapots. We don’t want that.

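The more subtle reason is computational efficiency and memory savings. When you freeze layers, autograd no longer has to compute or store gradients for their parameters, and the optimizer keeps no state (such as Adam’s moment estimates) for them. If the whole backbone is frozen, the backward pass stops at the new head, and the backbone’s intermediate activations don’t need to be kept around for it. The forward pass still costs the same, but each training step gets noticeably cheaper and the GPU memory footprint shrinks, which can be the difference between needing a beefy GPU and getting by on modest hardware. We do this by setting requires_grad = False for those parameters. No gradients, no update. Simple.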
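To see what requires_grad = False actually buys you, it helps to count parameters and inspect gradients after a backward pass. This is a minimal sketch on a hypothetical two-layer toy network (the helper name count_parameters is made up for illustration); the same check works on any nn.Module.

```python
import torch
import torch.nn as nn

def count_parameters(model):
    """Return (trainable, total) parameter counts for a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total

# Hypothetical toy network standing in for "backbone + head".
model = nn.Sequential(
    nn.Linear(100, 50),  # "backbone": 100*50 + 50 = 5050 parameters
    nn.ReLU(),
    nn.Linear(50, 5),    # "head": 50*5 + 5 = 255 parameters
)

# Freeze the first layer only.
for param in model[0].parameters():
    param.requires_grad = False

trainable, total = count_parameters(model)
print(trainable, total)  # 255 5305

# After a backward pass, frozen parameters have no .grad tensor at all.
loss = model(torch.randn(8, 100)).sum()
loss.backward()
print(model[0].weight.grad is None)      # True
print(model[2].weight.grad is not None)  # True
```

Running a check like this before training is a cheap way to confirm you froze what you meant to freeze.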

How to Actually Freeze a Model in Code

Let’s say we’re using a pretrained ResNet-18 from PyTorch’s torchvision.models. Here’s how you perform the surgery:

import torch
import torch.nn as nn
import torchvision.models as models

# Load the pretrained model
model = models.resnet18(weights='IMAGENET1K_V1')

# Freeze all the parameters in the entire network!
for param in model.parameters():
    param.requires_grad = False

# Now, replace the final fully-connected layer.
# ResNet-18's original fc layer is for 1000 ImageNet classes.
# We need a new one for, say, 5 classes of teapots.
num_ftrs = model.fc.in_features  # This gets the number of input features for the current fc layer
model.fc = nn.Linear(num_ftrs, 5)  # Now only THIS layer's parameters will have requires_grad=True

# Move model to GPU if available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Your optimizer should ONLY receive parameters that require gradients.
# A classic pitfall is handing it the whole model: frozen parameters get skipped
# (they have no gradients), but it hides what you are actually training.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001) # Only train the new head!
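One thing requires_grad = False does not cover is BatchNorm's running statistics: those are buffers, not parameters, and they keep updating whenever the layer is in training mode. ResNet-18 is full of BatchNorm layers, so a truly frozen backbone probably also wants those layers in eval mode. The sketch below assumes a tiny hypothetical network (TinyNet is made up for illustration, with a backbone attribute that torchvision's ResNet does not have) and runs one training step, then checks that nothing in the backbone moved.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained model: conv backbone with BatchNorm, plus a head.
class TinyNet(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.BatchNorm2d(8),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.fc(self.backbone(x))

model = TinyNet()
for param in model.backbone.parameters():
    param.requires_grad = False

# Hand the optimizer only the parameters that are still trainable.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable_params, lr=0.001)
criterion = nn.CrossEntropyLoss()

# Snapshot weights AND buffers (state_dict includes BatchNorm running stats).
backbone_before = copy.deepcopy(model.backbone.state_dict())
head_before = model.fc.weight.detach().clone()

model.train()
model.backbone.eval()  # must come after model.train(), which would undo it

# Toy batch (assumption): random images and labels.
inputs = torch.randn(16, 3, 32, 32)
targets = torch.randint(0, 5, (16,))

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()

backbone_unchanged = all(
    torch.equal(backbone_before[k], v)
    for k, v in model.backbone.state_dict().items()
)
head_changed = not torch.equal(head_before, model.fc.weight)
print(backbone_unchanged, head_changed)  # True True
```

If you drop the model.backbone.eval() line, the weights still stay put but the BatchNorm running mean and variance drift toward your data. Whether that drift helps or hurts depends on the dataset, so treat eval mode as a default to test rather than a rule.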

The Devil’s in the Details: Normalization

Here’s a detail you must account for: every pretrained model expects its input to be preprocessed the way its training data was. The ImageNet models in torchvision are trained on images scaled to [0, 1] and then normalized per channel with mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225]. These are the per-channel (RGB) mean and standard deviation of the ImageNet training images, and they are now gospel. If you feed in unnormalized images (pixel values in 0-255, or even 0-1), the features coming out of the backbone will be badly degraded, because the model is seeing inputs from a distribution it never trained on. Always, always use the normalization that matches your pretrained weights.

from torchvision import transforms

# The Normalize step is the part you MUST keep for an ImageNet-pretrained model;
# the crops and flips are a common augmentation choice, not a requirement.
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) # The Magic Numbers
    ]),
    'val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}
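To get a feel for what that Normalize step does to the numbers, here is the same arithmetic written out by hand: subtract the channel mean, divide by the channel standard deviation. This is a sketch using plain tensors (the normalize helper is made up for illustration), and it also shows how far off the values land if you forget to scale 0-255 pixels down to [0, 1] first, which ToTensor normally does for you.

```python
import torch

# ImageNet channel statistics, shaped to broadcast over (3, H, W) images.
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def normalize(image):
    """Normalize a float image of shape (3, H, W); expects values in [0, 1]."""
    return (image - IMAGENET_MEAN) / IMAGENET_STD

# A mid-gray image, correctly scaled to [0, 1].
gray = torch.full((3, 4, 4), 0.5)
good = normalize(gray)
print(good[:, 0, 0])  # roughly 0.066, 0.196, 0.418

# The same gray, mistakenly left in the 0-255 range.
gray_raw = torch.full((3, 4, 4), 128.0)
bad = normalize(gray_raw)
print(bad[:, 0, 0])  # values in the hundreds, far outside anything the model saw
```

Correctly normalized ImageNet-style inputs sit within a few units of zero, so activations in the hundreds are a quick tell that the scaling step went missing.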

When to Unfreeze (A Tease for the Next Section)

Sometimes, your new dataset is kind of similar to ImageNet. Your teapots are photographed in similar conditions. In this case, pure feature extraction works wonders. But if your data is fundamentally different (say, you’re classifying medical X-rays or satellite imagery), the lowest-level features (edges) might still be useful, while the higher-level features (the ones that learned to recognize “cat fur”) might be completely irrelevant. In that case, you might get better performance by doing a bit of surgery: freezing only the early layers and fine-tuning the later ones. But that’s a story for the next section. For now, keep it frozen. It’s simpler, faster, and almost always your best first bet.