80.9 Saving and Loading Models

Right, let’s talk about saving your work. This isn’t just hitting Ctrl+S in a text editor. In deep learning, your model’s architecture, its trained weights, and its ability to start training right where it left off are three different things, and the frameworks handle them in… let’s call it varied and occasionally frustrating ways. I’ve seen more people trip over this “simple” task than any fancy custom loss function. We’re going to fix that.

The core idea is simple: you train a model for hours (or days), and you don’t want that effort to vanish into the digital ether when you close your terminal. You need to save it. But what you save depends entirely on why you’re saving it. Are you done training and just want to use it for predictions? Are you taking a coffee break and want to resume training later? The answer dictates the format.

The Quick and Dirty: Saving Weights Only (.h5)

This is the simplest approach. You’re saving only the learned parameters (the weights and biases) of the model. It’s small, it’s fast, and it’s perfect for sharing a trained model with anyone who already has the code. What it is not is self-contained. To use these weights again, you must first rebuild the exact same model architecture in code. Compiling is not needed just to load weights or run predictions; it only matters if you plan to train or evaluate again. It’s like having the key to a very specific lock; you need to already have the lock.

import tensorflow as tf

# Assume we have a built and trained model
model = tf.keras.Sequential([...])
model.compile(...)
model.fit(...)

# Save *only* the weights
model.save_weights('my_awesome_model_weights.h5')

# Later, to use them, you must rebuild the model from code...
rebuilt_model = tf.keras.Sequential([...])  # MUST be the same architecture!
rebuilt_model.compile(...)                  # Only needed if you will train or evaluate again

# ...and then load the weights onto it.
rebuilt_model.load_weights('my_awesome_model_weights.h5')

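Pitfall alert: If your rebuilt architecture doesn’t match the original exactly, load_weights() will typically fail with an error about mismatched layer counts or weight shapes, and the message rarely points at the line of your code that differs. It’s a strict key-and-lock situation.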
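
One way to catch a mismatch before it bites is to compare weight shapes up front. The sketch below is only an illustration: build_model and weight_shapes are hypothetical helpers, and the tiny two-layer network is an assumption, not anything from a real project. Matching shapes are necessary but not sufficient, so treat this as a quick sanity check rather than a guarantee.

```python
import tensorflow as tf

def build_model(hidden_units):
    # Hypothetical tiny architecture, used only for this illustration.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

def weight_shapes(model):
    # One shape tuple per weight array, in the order Keras stores them.
    return [w.shape for w in model.get_weights()]

original = build_model(hidden_units=8)
same = build_model(hidden_units=8)
different = build_model(hidden_units=16)

# Identical architectures produce identical shape lists...
shapes_match = weight_shapes(original) == weight_shapes(same)
# ...while changing a single layer width is enough to break the match.
shapes_differ = weight_shapes(original) != weight_shapes(different)
```

If the shape lists differ, loading the saved weights into the rebuilt model is not going to work, and you have found out without reading a stack trace.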

The Full Monty: The SavedModel Format

This is TensorFlow’s preferred, full-model saving format. It saves everything: the architecture, the weights, and the training configuration (the optimizer and its state). This is your go-to for “I want to pause training and resume later” or “I want to ship this model for serving without the original code.”

The directory it creates contains a protocol buffer file (saved_model.pb) that stores the model’s computation graph and a variables directory containing the weights. It’s a complete snapshot.

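# Save the entire model as a SavedModel (TF 2.x with Keras 2).
# Note: Keras 3 rejects a path without a .keras or .h5 extension here
# and writes SavedModel directories through model.export() instead.
model.save('my_full_model')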

# Later, you can load it back. No need to define or compile anything.
# This object is a complete, ready-to-use model.
loaded_model = tf.keras.models.load_model('my_full_model')

# Make predictions
predictions = loaded_model.predict(new_data)

# Or, if it was saved with optimizer state, resume training
loaded_model.fit(more_data, epochs=10)

This is usually what you want. It’s robust and the closest thing to an “it just works” solution in the ecosystem.

The Keras Classic: .keras (or the old .h5)

Keras also has its own standalone serialization format (.keras in TF 2.12 and later, .h5 before that). It’s similar to SavedModel in that it saves everything, but it does so in a single file and the internals are different. It’s useful for portability outside of pure TensorFlow environments. SavedModel remains the format that serving tools such as TensorFlow Serving expect, while recent Keras releases recommend .keras for everyday saving, so check which one your version defaults to.

# Save as a single .keras file
model.save('my_model.keras')  # or 'my_model.h5'

# Load it back
loaded_model_from_keras = tf.keras.models.load_model('my_model.keras')
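
A quick way to convince yourself that a full-model save really is complete is a round trip: save, load, and compare predictions. The sketch below assumes a TensorFlow version recent enough to support the .keras extension; the tiny model, the random data, and the file name roundtrip_demo.keras are all made up for the illustration.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Hypothetical toy model and random data, just enough to have trained weights.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(16, 4).astype("float32")
y = np.random.rand(16, 1).astype("float32")
model.fit(x, y, epochs=1, verbose=0)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip_demo.keras")
    model.save(path)                                # architecture + weights + optimizer
    restored = tf.keras.models.load_model(path)     # no model code needed here

# The restored model should reproduce the original predictions.
before = model.predict(x, verbose=0)
after = restored.predict(x, verbose=0)
outputs_match = np.allclose(before, after, atol=1e-6)
```

If outputs_match ever comes back False, something about the model (a custom layer, for instance) did not survive serialization, and it is much better to learn that now than after the training machine is gone.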

The PyTorch State of Affairs

PyTorch, being the wonderfully flexible and sometimes chaotic framework it is, has a similarly flexible approach. The most common method is to save the model’s state_dict, a Python dictionary that maps each parameter’s (and buffer’s) name to its tensor. This is the PyTorch equivalent of saving weights-only.

import torch
import torch.nn as nn

# Define and train a model
model = nn.Sequential(...)
optimizer = torch.optim.Adam(model.parameters())
# ... training loop ...

# Save the state_dict (weights)
torch.save(model.state_dict(), 'model_weights.pth')

# To load, you must first instantiate the model structure
loaded_model = nn.Sequential(...)  # Same architecture!
loaded_model.load_state_dict(torch.load('model_weights.pth'))
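
It helps to look inside a state_dict once, because the key names explain most loading errors. The sketch below is a minimal illustration under a few assumptions: build_model is a hypothetical helper, the three-layer network is invented, and an in-memory buffer stands in for a .pth file so nothing touches the disk.

```python
import io

import torch
import torch.nn as nn

def build_model():
    # Hypothetical small network, used only for this illustration.
    return nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 1))

model = build_model()

# Keys are "<module name>.<parameter name>"; ReLU has no parameters,
# so index 1 never appears.
keys = list(model.state_dict().keys())

# Round trip through an in-memory buffer (a stand-in for a .pth file).
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

clone = build_model()
clone.load_state_dict(torch.load(buffer))
weights_equal = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), clone.state_dict().values())
)

# A different architecture is rejected by the default strict=True check.
wrong = nn.Sequential(nn.Linear(4, 5), nn.ReLU(), nn.Linear(5, 1))
try:
    wrong.load_state_dict(model.state_dict())
    mismatch_detected = False
except RuntimeError:
    mismatch_detected = True
```

Printing keys for your own model before saving is a cheap habit: when a later load complains about missing or unexpected keys, you will know what the names were supposed to be.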

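And for a full save, often you just dump the entire model object. This is convenient but is considered less reliable long-term: the object is pickled, so the file only loads if the same class definitions are importable from the same module paths as when you saved it. Rename a class or move a file and the load breaks.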

# Save the whole model (less portable)
torch.save(model, 'full_model.pth')

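# Load it. Recent PyTorch releases (2.6+) default to weights_only=True,
# which refuses to unpickle whole model objects, so opt out explicitly.
# Only do this for files you trust.
loaded_model = torch.load('full_model.pth', weights_only=False)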

The PyTorch best practice for a full save, especially if you want to resume training, is to save a checkpoint dictionary. This is what you actually do in real projects.

# Create a checkpoint
checkpoint = {
    'epoch': 90,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
    # ... anything else you want to remember ...
}

torch.save(checkpoint, 'model_checkpoint.pth')

# Later, to resume:
checkpoint = torch.load('model_checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

model.train()  # Set back to training mode
# ... and pick up from epoch 91 ...
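
In practice it is handy to wrap the pattern above in a pair of small functions. The following is a sketch, not a canonical recipe: save_checkpoint and load_checkpoint are hypothetical helper names, the single-layer model and the one training step are stand-ins for a real training loop, and the checkpoint is written to a temporary directory.

```python
import os
import tempfile

import torch
import torch.nn as nn

def save_checkpoint(path, model, optimizer, epoch, loss):
    # Bundle everything needed to resume into one dictionary.
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss,
    }, path)

def load_checkpoint(path, model, optimizer):
    # Restores model and optimizer in place; returns the next epoch and last loss.
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"] + 1, checkpoint["loss"]

model = nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters())

# One training step so the optimizer actually has state worth saving.
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_checkpoint.pth")
    save_checkpoint(path, model, optimizer, epoch=0, loss=loss.item())

    # Simulate a fresh process: new objects, then restore into them.
    new_model = nn.Linear(4, 1)
    new_optimizer = torch.optim.Adam(new_model.parameters())
    start_epoch, last_loss = load_checkpoint(path, new_model, new_optimizer)

params_restored = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), new_model.state_dict().values())
)
optimizer_has_state = len(new_optimizer.state_dict()["state"]) > 0
```

Storing the loss as a plain float (loss.item()) rather than a tensor keeps the checkpoint free of anything tied to a device or an autograd graph.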

This checkpoint method is the most powerful and explicit. You know exactly what you’re saving and exactly what you’re loading. There’s no magic, which means there’s less that can go mysteriously wrong. And in this game, avoiding mysterious errors is half the battle.