19.3 Fine-Tuning: Unfreezing and Training with a Lower Learning Rate

Alright, you’ve got your pre-trained base model humming along, its feature extraction layers frozen solid. It’s doing a decent job, but it’s not your model yet. It’s like a brilliant intern who knows all the theory but hasn’t learned your company’s bizarre inside jokes. To truly make it yours, to get those last few percentage points of accuracy, you need to let it get a little more… personal. This is where the real magic, and the real danger, happens: unfreezing and fine-tuning with a lower learning rate.

Think of it this way: those early layers in your pre-trained model have learned fantastic, general-purpose feature detectors—edges, textures, shapes. We kept them frozen initially so we wouldn’t obliterate all that hard-won knowledge while we trained the new head on top. But your new dataset, your specific cats vs. dogs (or whatever), has its own quirks. Maybe all your cat pictures have a particular background, or your medical images have a specific type of noise. The model needs to tweak those general-purpose feature detectors to become your-purpose feature detectors. That’s what unfreezing allows.

The Grand Unfreezing

This is the moment. You’ve trained your classifier head to convergence on the frozen features. The validation loss has plateaued. It’s time. You’ll set trainable = True on the base model, which propagates to every layer inside it. But here’s the first design quirk you need to know: in Keras, simply setting trainable = True isn’t enough to affect a model you’ve already compiled. You have to recompile the model for the change to take effect. It’s a bit of a gotcha, and it’s because compile() locks in each layer’s trainable status; changes made afterwards are ignored until you compile again.

from tensorflow import keras

# Assumes `model` and `base_model` already exist from the frozen phase
print(f"Layers are frozen: {all(not layer.trainable for layer in base_model.layers)}")
# This will print "True"

# Unfreeze the base model! Let the chaos... begin carefully.
base_model.trainable = True

# CRITICAL STEP: You MUST recompile the model for the unfreezing to take effect.
# Notice the learning rate is now an order of magnitude lower (1e-5 vs 1e-4).
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              loss='binary_crossentropy',
              metrics=['accuracy'])

print(f"Layers are frozen: {all(not layer.trainable for layer in base_model.layers)}")
# This will now print "False"

Why the Drastically Lower Learning Rate?

This is the most important concept to grasp. When we were training only the head, we used a relatively healthy learning rate (e.g., 1e-4) because we were training from scratch on a small, randomly initialized set of weights. It was a blank slate.

Now, we’re dealing with pre-trained weights that are already very good. They’re sitting in a nice, deep, wide valley of the loss landscape. If we use a large learning rate, it’s like giving a sculptor a jackhammer to put the finishing touches on a statue. You’ll blow those finely tuned weights right out of that valley and into some chaotic, high-loss nightmare. This phenomenon is called catastrophic forgetting: the model loses what it originally knew in its desperate attempt to learn your new task.

A lower learning rate (I typically start with 10x lower than the one used for the head, so 1e-5 if the head was at 1e-4) allows for gentle, precise nudges. We’re not trying to relearn the concept of an edge; we’re just trying to adjust it slightly to be more sensitive to the particular edges in your dataset.
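
To make the jackhammer intuition concrete, here is a minimal sketch in plain Python (no Keras involved). It assumes a toy one-weight loss, 0.5 * curvature * (w - w_opt)**2, which is an illustration and nothing like a real network's loss surface. The names fine_tune_toy, curvature, and w_opt are made up for this example. The weight starts close to its optimum, the way a pre-trained weight would, and takes ten gradient steps at two different step sizes.

```python
def fine_tune_toy(lr, steps=10, w=1.1, w_opt=1.0, curvature=100.0):
    """Run plain gradient descent on 0.5 * curvature * (w - w_opt)**2.

    Hypothetical toy setup: `w` starts near `w_opt`, standing in for a
    pre-trained weight that is already close to a good solution.
    """
    for _ in range(steps):
        grad = curvature * (w - w_opt)  # derivative of the toy loss
        w = w - lr * grad               # one gradient-descent update
    return abs(w - w_opt)               # distance from the optimum


gentle = fine_tune_toy(lr=1e-3)  # each step scales the error by 0.9
harsh = fine_tune_toy(lr=5e-2)   # each step scales the error by -4

print(f"small step size: distance from optimum = {gentle:.4f}")
print(f"large step size: distance from optimum = {harsh:.1f}")
```

The step sizes here are tuned to the toy loss and say nothing about which values suit a real model. The point is only that the same starting weight settles closer to the optimum with small steps and gets thrown far away from it with large ones.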

Selective Unfreezing: A More Conservative Approach

Unfreezing the entire base model is often overkill. The earliest layers have those very general features (edges, blobs) that probably don’t need much adjustment. The later layers have more complex, task-specific features that absolutely do. A common and highly effective strategy is to only unfreeze a portion of the model.

# Let's say our base_model has 100 layers. We want to unfreeze the last 20.
# Make everything trainable first, then re-freeze all but the last 20 layers
base_model.trainable = True
for layer in base_model.layers[:-20]:
    layer.trainable = False

# Always remember to recompile after changing trainability!
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Let's see what we did
for i, layer in enumerate(base_model.layers):
    print(f"Layer {i}: {layer.name} - Trainable: {layer.trainable}")

This approach is less computationally expensive and reduces the risk of overfitting, as you’re updating far fewer parameters.
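
One slicing detail is worth knowing if you ever wrap the "last N layers" idea in a helper. The sketch below uses a hypothetical stand-in class, FakeLayer, instead of real Keras layers so that it runs anywhere; the helper name unfreeze_last_n is also made up. It assumes only that each layer exposes a boolean trainable attribute, as Keras layers do.

```python
from dataclasses import dataclass


@dataclass
class FakeLayer:
    """Hypothetical stand-in for a Keras layer: just a name and a flag."""
    name: str
    trainable: bool = False


def unfreeze_last_n(layers, n):
    """Make only the last `n` layers trainable and freeze the rest.

    Computing the cut-off index explicitly avoids the `layers[:-0]`
    pitfall: `layers[:-0]` is an empty slice, so a loop over it would
    freeze nothing when n == 0.
    """
    cutoff = max(len(layers) - n, 0)
    for i, layer in enumerate(layers):
        layer.trainable = i >= cutoff


layers = [FakeLayer(f"layer_{i}") for i in range(100)]
unfreeze_last_n(layers, 20)
n_trainable = sum(layer.trainable for layer in layers)
print(f"{n_trainable} of {len(layers)} layers are trainable")
```

With a real model you would presumably pass base_model.layers to a helper like this, and you would still need to recompile afterwards.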

Training, Monitoring, and the Art of Early Stopping

Your training loop now needs even more careful supervision. You must use validation loss to monitor progress.

What you want to see: Validation loss continues to decrease gently, or at least stays stable, while training loss decreases. This is the model successfully specializing.
What you don’t want to see: Training loss decreases but validation loss starts to skyrocket. This is the tell-tale sign of catastrophic forgetting or severe overfitting. The model is losing its general knowledge to memorize your specific data.

This is why early stopping is your best friend here. Set it to monitor validation loss with a patience of just 1 or 2 epochs. Once validation loss has gone that many epochs without improving, it stops training and, if you set restore_best_weights=True, rolls the model back to the best weights it saw. It’s your emergency brake.

from tensorflow.keras.callbacks import EarlyStopping

fine_tune_epochs = 20
# Resume from the epoch where head training finished. Let's say it was epoch 10.
# Note: `epochs` in fit() is the index of the final epoch, not a count, hence the sum below.
initial_epoch = 10

# Patience of 2 means it will stop after 2 epochs of no improvement.
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)

history_fine = model.fit(
    train_dataset,
    epochs=initial_epoch + fine_tune_epochs,
    initial_epoch=initial_epoch,
    validation_data=validation_dataset,
    callbacks=[early_stop] # This is non-negotiable.
)
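
If the patience setting feels abstract, this plain-Python sketch imitates the bookkeeping on a made-up list of validation losses. It is a simplified approximation of what an early-stopping callback does, not Keras's actual implementation. The function name simulate_early_stopping and the loss values are invented for illustration.

```python
def simulate_early_stopping(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a list of validation losses.

    Simplified logic: an epoch counts as an improvement only if its loss
    is strictly lower than the best seen so far. Training stops once
    `patience` epochs in a row fail to improve.
    """
    best = float("inf")
    best_epoch = -1
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the stop condition
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: improving at first, then drifting upward
val_losses = [0.40, 0.35, 0.33, 0.34, 0.36, 0.50]
stop_epoch, best_epoch = simulate_early_stopping(val_losses, patience=2)
print(f"stopped at epoch index {stop_epoch}, best epoch index {best_epoch}")
```

Here the losses bottom out at index 2. After two non-improving epochs the loop stops at index 4, before the jump to 0.50 is ever reached. Restoring the best weights corresponds to going back to the model as it stood at index 2.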

The entire process is a dance. You’re balancing the vast, pre-trained knowledge in the base model with the specific requirements of your new task. It requires a light touch (low LR), careful observation (monitoring validation loss), and a quick trigger finger (early stopping). Get it right, and you’ll end up with a model that feels like it was built just for you, because, well, it finally was.