Right, let’s talk about the old guard. Before we had models that could write sonnets about your cat, we had models that could, with staggering effort, tell you if a picture was of a cat or a dog. This is where we started, and honestly, you need to know this stuff. It’s the foundation. It’s like learning your scales before you try to play jazz. These architectures aren’t just historical footnotes; their ideas are the DNA inside every modern network you’ll use. So let’s pull them apart and see how they tick.

The Granddaddy: LeNet-5

We begin in 1998, with Yann LeCun’s LeNet-5. It was designed for handwritten digit recognition, a task that seems trivial now but was a genuine nightmare for computers at the time. Its genius was in its simplicity and its proof-of-concept for the core CNN components. Here’s the basic blueprint:

  1. Convolutional Layers: It used small 5x5 filters. Why 5x5? It’s a sweet spot: big enough to capture meaningful features like edges and curves, but small enough to keep the number of parameters manageable. This was crucial on 1998’s hardware.
  2. Subsampling (Pooling): After each conv layer, it used a 2x2 subsampling layer, essentially average pooling with a trainable scale and bias. This was the standard back then. It reduces the spatial dimensions and makes the network more tolerant of small shifts and distortions in the input image. Your digit ‘4’ is a ‘4’ whether it’s centered or slightly off to the left.
  3. Activation: It used saturating activations (a scaled tanh in the original paper). ReLU wasn’t in common use yet, which is a big part of why training deeper networks was so painfully difficult.
  4. The Classic Pattern: Conv -> Pool -> Conv -> Pool -> Fully Connected -> Output. This pattern of alternating feature extraction (conv) and dimensionality reduction (pool) is the bedrock of almost every CNN that followed.

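Here’s a modern PyTorch implementation. Notice I use ReLU and max pooling instead. It’s faithful to the spirit, but I’m not a masochist who wants to train with sigmoid. It also assumes 28x28 grayscale inputs (MNIST-sized), not the 32x32 images of the original paper.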

import torch.nn as nn

class LeNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),  # Grayscale input, 6 filters
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(6, 16, kernel_size=5),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 4 * 4, 120),  # 16 channels x 4x4 map, assuming a 28x28 input
            nn.ReLU(inplace=True),
            nn.Linear(120, 84),
            nn.ReLU(inplace=True),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)  # Flatten the feature maps
        x = self.classifier(x)
        return x

# Instantiate it
model = LeNet()

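The pitfall here? That nn.Linear(16 * 4 * 4, 120) line. You must calculate the correct input dimensions after all those conv and pool steps. For a 28x28 input the spatial size goes 28 -> 24 -> 12 -> 8 -> 4, which leaves 16 channels of 4x4. Feed it the 32x32 images of the original LeNet-5 and you would need 16 * 5 * 5 instead. Get this wrong, and your model will throw a shape-mismatch error on the very first forward pass. It’s a classic “why is my tensor the wrong shape” error.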
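If you’d rather not do that arithmetic in your head, here is a minimal sketch of two ways to get the number. The helper names (conv_out, lenet_flat_features) are made up for this example, and the feature stack simply repeats the one from the LeNet class above. The first approach applies the usual output-size formula, floor((n + 2p - k) / s) + 1, layer by layer. The second just pushes a dummy tensor through the layers and reads off the shape.

```python
import torch
import torch.nn as nn


def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a conv or pool layer for a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def lenet_flat_features(input_size):
    """Flattened feature count for the LeNet feature stack shown above."""
    s = conv_out(input_size, kernel=5)   # first 5x5 conv, no padding
    s = conv_out(s, kernel=2, stride=2)  # first 2x2 pool
    s = conv_out(s, kernel=5)            # second 5x5 conv
    s = conv_out(s, kernel=2, stride=2)  # second 2x2 pool
    return 16 * s * s                    # 16 output channels


# Same layers as LeNet.features, repeated so this block runs on its own.
features = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2),
    nn.Conv2d(6, 16, kernel_size=5),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2),
)

# Empirical check: run a dummy 28x28 image through and flatten it.
with torch.no_grad():
    dummy = torch.zeros(1, 1, 28, 28)
    measured = features(dummy).flatten(1).shape[1]

print(lenet_flat_features(28), measured)  # 256 256
print(lenet_flat_features(32))            # 400, i.e. 16 * 5 * 5
```

The dummy-tensor trick is the lazier option and works for any stack of layers, which is why you’ll see it in a lot of real code.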

The Wake-Up Call: AlexNet

Fast forward to 2012. AlexNet didn’t introduce radically new ideas, but it scaled everything the heck up and used them to win the ImageNet competition by a landslide, shocking the entire field. Its success was a cocktail of brute force and clever tricks:

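  1. Bigger, Deeper: Five convolutional layers and three fully connected ones, with large filters in the early layers (11x11, then 5x5), all split across two GPUs because the model didn’t fit in the memory of one.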
  2. The ReLU Revolution: This was their secret weapon. ReLU is computationally cheap and, crucially, it mitigates the vanishing gradient problem that plagued tanh/sigmoid, allowing them to actually train a deeper network.
  3. Dropout: They used dropout in the fully connected layers to reduce overfitting. When you have 60 million parameters, overfitting isn’t a risk; it’s a guarantee without regularization.
  4. Local Response Normalization (LRN): This was the one weird trick they used that we’ve mostly abandoned today. The idea was to encourage competition between adjacent filters. It gave a small boost then, but later work (the VGG paper among them) found it added little, and BatchNorm has largely taken over the normalization job.
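To put a number on the vanishing gradient point from item 2, here is a toy calculation. It is not from the AlexNet paper. It assumes a chain of scalar layers with unit weights that all see the same pre-activation value, which is unrealistic but isolates the effect of the activation’s derivative. The function names are invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid; it peaks at 0.25 when z == 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def chain_gradient(grad_fn, z, depth):
    """Product of activation derivatives along a chain of `depth` layers
    (toy setting: unit weights, same pre-activation z at every layer)."""
    g = 1.0
    for _ in range(depth):
        g *= grad_fn(z)
    return g


# Best case for the sigmoid (z = 0) still shrinks the gradient by 4x per layer.
print(chain_gradient(sigmoid_grad, 0.0, depth=10))  # 0.25 ** 10, about 9.5e-07
# An active ReLU passes the gradient through unchanged.
print(chain_gradient(relu_grad, 1.0, depth=10))     # 1.0
```

Real networks have weights in the mix too, so this is only half the story, but it shows why a saturating activation makes depth expensive.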

The architecture is essentially: [CONV, ReLU, LRN, MAXPOOL] -> repeat -> [CONV, ReLU] -> [CONV, ReLU] -> [CONV, ReLU, MAXPOOL] -> FC -> FC -> Output.
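Here is a rough single-GPU sketch of that layout in PyTorch. Treat it as an illustration under a few assumptions rather than a reproduction: it drops the two-GPU split entirely, it takes 227x227 inputs so the layer arithmetic works out cleanly, and the class name AlexNetSketch is made up. The filter counts and the LRN settings follow the values reported in the paper, though PyTorch’s nn.LocalResponseNorm scales alpha a little differently, so don’t expect identical numerics.

```python
import torch
import torch.nn as nn


class AlexNetSketch(nn.Module):
    """Single-GPU, AlexNet-style network (illustrative sketch only)."""

    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: CONV -> ReLU -> LRN -> MAXPOOL  (227 -> 55 -> 27)
            nn.Conv2d(3, 96, kernel_size=11, stride=4),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # Block 2: CONV -> ReLU -> LRN -> MAXPOOL  (27 -> 27 -> 13)
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # Three 3x3 convs back to back, spatial size stays at 13
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 13 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)


# Quick shape check with a dummy batch of one image.
model = AlexNetSketch()
model.eval()  # disables dropout for the check
with torch.no_grad():
    logits = model(torch.zeros(1, 3, 227, 227))
print(logits.shape)  # torch.Size([1, 1000])
```

Notice how much of the model lives in the classifier: the first nn.Linear alone takes 9,216 inputs to 4,096 units.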

The Simple & Deep Elegance: VGG

After AlexNet’s slightly messy, “throw everything at the wall” approach, the VGG team from Oxford in 2014 asked a beautiful question: What if we just make everything uniform and much, much deeper?

Their contribution was architectural purity. They used a very simple, repeating building block: a stack of 3x3 convolutional layers followed by a 2x2 max-pooling layer. Why 3x3? Because two 3x3 conv layers have the same receptive field as a single 5x5 layer, but with more non-linearity (two ReLUs instead of one) and fewer parameters. It’s a more efficient way to build depth.
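The arithmetic behind that claim is short enough to check directly. This is a small sketch with made-up helper names. It assumes stride-1 convolutions with the same number of channels C going in and out, and it ignores biases.

```python
def conv_weights(kernel, channels, num_layers=1):
    """Weights in a stack of kernel x kernel convs mapping C -> C channels
    (biases ignored)."""
    return num_layers * kernel * kernel * channels * channels


def receptive_field(kernels):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernels:
        rf += k - 1
    return rf


C = 64
stacked = conv_weights(3, C, num_layers=2)  # two 3x3 layers: 18 * C^2
single = conv_weights(5, C)                 # one 5x5 layer:  25 * C^2

print(receptive_field([3, 3]), receptive_field([5]))  # 5 5
print(stacked, single)                                # 73728 102400
print(receptive_field([3, 3, 3]))                     # 7, same as one 7x7
```

So the stacked version covers the same 5x5 window with 28% fewer weights, and the saving grows if you stack three 3x3 layers in place of a 7x7 (27 C^2 versus 49 C^2).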

They defined configurations like VGG-11, VGG-13, VGG-16, and VGG-19 (the number counts the layers with weights, convolutional plus fully connected). VGG-16 became the workhorse for years. Its uniform structure makes it incredibly easy to understand and implement from memory.

import torch
import torch.nn as nn

# A basic VGG block: a repeating unit of Conv layers with same padding
def vgg_block(num_convs, in_channels, out_channels):
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
        in_channels = out_channels  # For the next conv in the block
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# A small VGG-like network
class TinyVGG(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            vgg_block(2, 3, 64),   # Two convs, 64 filters
            vgg_block(2, 64, 128), # Two convs, 128 filters
            vgg_block(3, 128, 256), # Three convs, 256 filters
            # ... you could keep going to build VGG-16
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 4 * 4, 4096),  # 256 channels x 4x4 map, assuming a 32x32 input (32 -> 16 -> 8 -> 4)
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

The glaring weakness of VGG? It’s a parameter monster. In VGG-16 the first fully connected layer alone maps 512x7x7 = 25,088 inputs to 4,096 units, which is about 103 million weights out of roughly 138 million in the whole network. That makes the model large on disk and hungry for memory, and the deep stacks of wide convolutions make it slow to run. While elegant, its inefficiency is why we moved to architectures like ResNet and Inception that give you more bang for your computational buck. But for understanding the power of simple, stacked depth, VGG is unbeatable.
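You can see where the parameters pile up with a few lines of arithmetic. This sketch counts weights and biases for VGG-16, assuming the standard 16-layer configuration, 224x224 inputs (so the last feature map is 512x7x7), and 1000 output classes. The helper names are invented for the example.

```python
# VGG-16 layout: numbers are conv output channels, "M" is a 2x2 max pool.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def count_conv_params(cfg, in_channels=3):
    """Weights plus biases across all 3x3 conv layers in the config."""
    total = 0
    for v in cfg:
        if v == "M":
            continue  # pooling layers have no parameters
        total += 3 * 3 * in_channels * v + v
        in_channels = v
    return total


def count_fc_params(sizes):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(i * o + o for i, o in zip(sizes, sizes[1:]))


conv_params = count_conv_params(VGG16_CFG)
fc_params = count_fc_params([512 * 7 * 7, 4096, 4096, 1000])
first_fc = 512 * 7 * 7 * 4096 + 4096

print(f"conv layers:   {conv_params:,}")              # 14,714,688
print(f"fc layers:     {fc_params:,}")                # 123,642,856
print(f"first fc only: {first_fc:,}")                 # 102,764,544
print(f"total:         {conv_params + fc_params:,}")  # 138,357,544
```

Thirteen conv layers account for under 15 million parameters. The three fully connected layers hold the other 89% or so, and most of that sits in the very first one.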