16.4 Residual Networks (ResNet): Skip Connections and Identity Shortcuts

Right, let’s talk about ResNet. You’ve probably hit the infamous “vanishing gradient” problem by now, or at least you’ve heard the horror stories. As networks get deeper, your gradients (those little error signals that are supposed to travel all the way back to the early layers to guide their learning) just… vanish. Each layer they backpropagate through can shrink them a little more, until after dozens of layers they’re practically zero. The early layers learn glacially slowly, if at all. It’s like trying to whisper a secret through a stadium full of people; by the time it gets to the other side, the message is gone. So for years, we were stuck. We knew depth was powerful, but plain networks past a couple dozen layers tended to train worse than their shallower cousins, not better.

Then the folks at Microsoft Research dropped the ResNet paper (“Deep Residual Learning for Image Recognition,” He et al., 2015), and it was one of those rare, genuine “why didn’t I think of that?” moments. Their solution was so stupidly simple it was brilliant: if the gradient has trouble propagating through a stack of layers, just give it a shortcut.

The Core Idea: Skip Connections

The fundamental building block of a ResNet is the residual block. Instead of a neural network layer (or a few layers) trying to learn the underlying mapping H(x), we let it learn the residual F(x) = H(x) - x.

Why is this a game-changer? Think about it. Suppose the best thing a layer can do is act as the identity function (i.e., do nothing and just pass the input through), which happens more often than you’d expect once a network is deeper than the task strictly needs. What does the block have to learn? F(x) = H(x) - x becomes F(x) = x - x = 0. It’s far easier for a stack of nonlinear layers to learn to output zero, by pushing their weights toward zero, than it is to learn the identity function perfectly. Learning the identity precisely with a bunch of matrix multiplications and ReLUs is actually really, really hard. Learning to output zero? Much simpler.

So the actual output of the block becomes H(x) = F(x) + x. That little + x is the skip connection or identity shortcut. The input x just hops over the main computational block and gets added back in right before the final activation function.
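
Stripped of convolutions, the idea fits in a few lines. The sketch below uses plain Python lists as stand-in feature vectors and leaves out the final activation; `residual_block`, `zero_branch`, and `scale_branch` are hypothetical names made up for illustration, not anything from the paper or from PyTorch.

```python
def residual_block(branch, x):
    """Return H(x) = F(x) + x, where `branch` plays the role of F."""
    fx = branch(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_branch(x):
    """A branch whose weights have collapsed to zero: F(x) = 0."""
    return [0.0 for _ in x]


def scale_branch(x):
    """A branch that has learned a small correction: F(x) = 0.1 * x."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(zero_branch, x)    # block falls back to the identity
corrected_out = residual_block(scale_branch, x)  # identity plus a small residual
print(identity_out)
print(corrected_out)
```

With the zero branch, the block hands its input back untouched, which is exactly the "learn nothing, lose nothing" behavior described above.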

This does two miraculous things:

First, it tames the vanishing gradient problem. The gradient now has a direct, unimpeded highway to flow back to earlier layers. It can just take the shortcut. Even if the gradient coming back through the layers in F(x) is weak, the + x path has a local derivative of exactly 1, so whatever gradient arrives at the block’s output is passed straight back to its input, unscaled. The early layers finally get a signal they can use.
Second, it makes the network much easier to optimize. Even if the layers in the residual block F(x) don’t learn anything useful (i.e., they output zero), the block still defaults to the identity function. Your performance degrades gracefully. In a traditional network, a badly trained layer sits directly in the data path and distorts everything downstream. In a ResNet, bad layers just… do nothing, which is a much better failure mode.
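
To put rough numbers on the gradient argument, here is a toy scalar model, not a real network. It assumes every layer has the same local derivative `local_grad` (a made-up value of 0.01) and ignores the ReLU that follows each addition. In a plain stack the backpropagated gradient is a product of those derivatives; in a residual stack each factor becomes `local_grad + 1` because of the + x term.

```python
def plain_chain_grad(local_grad, depth):
    """Gradient through `depth` stacked layers y = f(x), each with derivative local_grad."""
    grad = 1.0
    for _ in range(depth):
        grad *= local_grad
    return grad


def residual_chain_grad(local_grad, depth):
    """Gradient through `depth` stacked blocks y = f(x) + x.

    Each block contributes a factor of (local_grad + 1): the +1 is the shortcut.
    """
    grad = 1.0
    for _ in range(depth):
        grad *= local_grad + 1.0
    return grad


depth = 50          # hypothetical number of layers/blocks
local_grad = 0.01   # hypothetical (weak) derivative of each learned branch

plain = plain_chain_grad(local_grad, depth)
residual = residual_chain_grad(local_grad, depth)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {residual:.3f}")
```

The plain product collapses toward zero, while the residual product stays on the order of 1. Real layers have Jacobian matrices rather than a single shared scalar, so treat this as intuition only.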

The Two Main Architectures: Original vs. Bottleneck

The original ResNet paper proposed two main styles of blocks for different depths. Let’s get into the code.

The original block for shallower networks (ResNet-18 and ResNet-34) looks like this. We use two 3x3 convolutional layers, each followed by batch norm.

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    expansion = 1  # This block doesn't expand the channel dimension

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # First conv: might downsample with stride
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        
        # Second conv: always stride=1
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # The shortcut connection. We need it if we change the spatial size (stride) or channel depth.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels * self.expansion:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels * self.expansion, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels * self.expansion)
            )

    def forward(self, x):
        identity = x  # Save the input for the shortcut

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        out += self.shortcut(identity)  # The magic happens here
        out = self.relu(out)  # Note: activation *after* the addition

        return out

But for really deep networks (ResNet-50, 101, 152), stacking two full-width 3x3 convs in every block gets too expensive. So they introduced the bottleneck block. This is where the designers made a choice that seems a bit odd at first but is pure pragmatism. They use a 1x1 conv to first reduce the channel dimension (a “bottleneck”), then a 3x3 conv operates on this compressed representation, where it is much cheaper, and finally another 1x1 conv expands the channels back out. It’s like taking a detour through a narrower, faster road.

class BottleneckBlock(nn.Module):
    expansion = 4  # The final output channels will be out_channels * 4

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # The "bottleneck": reduce dimension
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        
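        # The main conv. Note: the stride sits on this 3x3 conv, as in torchvision's
        # implementation; the original paper's version strides the first 1x1 conv instead.
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)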
        self.bn2 = nn.BatchNorm2d(out_channels)
        
        # Expand back out
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)

        self.relu = nn.ReLU(inplace=True)

        # Shortcut connection: same deal as before.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels * self.expansion:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels * self.expansion, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels * self.expansion)
            )

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        out += self.shortcut(identity)
        out = self.relu(out)

        return out
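
To see why the bottleneck is the cheaper road, it helps to count weights. The sketch below is back-of-the-envelope arithmetic, assuming a hypothetical block that takes in and puts out 256 channels, and counting only convolution weights (no batch norm, no shortcut projection). `conv_params` is a helper invented for this example.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weight count of a bias-free 2D convolution with a square kernel."""
    return in_ch * out_ch * kernel * kernel


channels = 256        # assumed input/output width of the block
mid = channels // 4   # bottleneck width, matching expansion = 4

# Two full-width 3x3 convs, as in the basic block
basic = 2 * conv_params(channels, channels, 3)

# 1x1 reduce -> 3x3 at reduced width -> 1x1 expand, as in the bottleneck block
bottleneck = (
    conv_params(channels, mid, 1)
    + conv_params(mid, mid, 3)
    + conv_params(mid, channels, 1)
)

print("basic block weights:     ", basic)
print("bottleneck block weights:", bottleneck)
print("ratio: %.1fx" % (basic / bottleneck))
```

At this width the bottleneck needs roughly one seventeenth of the convolution weights, which is what makes 50-plus-layer networks affordable.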

Best Practices and Pitfalls

Notice the pattern in both blocks? The shortcut is an identity if and only if the input and output dimensions match exactly. If we need to change the spatial size (via stride=2) or the number of channels, we can’t just add x to out; it would cause a runtime error due to dimension mismatch. The solution, as in the code, is to use a simple 1x1 convolutional layer (with batch norm) in the shortcut path to perform the necessary projection. It’s a minimal computational cost for a huge architectural benefit.
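
The shape bookkeeping behind that mismatch can be checked without running a network. The sketch below applies the standard convolution output-size formula (dilation 1) to a hypothetical 56x56 feature map entering a BasicBlock with stride=2; `conv_out_size` is a helper written for this example.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


in_size = 56  # assumed input height/width

# Main path: 3x3 conv with stride 2, then 3x3 conv with stride 1 (both padding=1)
main = conv_out_size(in_size, kernel=3, stride=2, padding=1)
main = conv_out_size(main, kernel=3, stride=1, padding=1)

# Plain identity shortcut: spatial size is unchanged, so it no longer matches
identity = in_size

# Projection shortcut: 1x1 conv with stride 2 and no padding
projected = conv_out_size(in_size, kernel=1, stride=2, padding=0)

print("main path:", main)
print("identity shortcut:", identity)
print("projection shortcut:", projected)
```

The main path and the projected shortcut both come out at 28, while the untouched input is still 56, which is why the addition only works with the projection in place.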

The other critical best practice is the order of operations: the skip connection is added before the final ReLU. This is the design from the original paper, and while there’s been some debate about pre-activation vs. post-activation, this is the canonical, battle-tested version. Sticking with it is a good idea unless you’re specifically trying to implement a later variant like the pre-activation ResNet (often called ResNetV2).
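
The difference between the two orderings is easiest to see on a single number. The following is a deliberately simplified scalar sketch: batch norm is omitted, the branch is a plain Python function, and the function names are invented for this example. It feeds a negative input through a block whose branch outputs zero.

```python
def relu(v):
    """Scalar ReLU."""
    return max(0.0, v)


def post_activation_block(branch, x):
    """Original ordering: add the shortcut, then apply ReLU to the sum."""
    return relu(branch(x) + x)


def pre_activation_block(branch, x):
    """Pre-activation ordering: ReLU inside the branch, nothing after the addition."""
    return branch(relu(x)) + x


def zero_branch(v):
    """A branch that has learned nothing: F = 0."""
    return 0.0


x = -2.0
post = post_activation_block(zero_branch, x)
pre = pre_activation_block(zero_branch, x)
print("post-activation:", post)
print("pre-activation: ", pre)
```

In the post-activation block the final ReLU clips the negative value, so the skip path is an identity only for non-negative inputs. In the pre-activation ordering nothing touches the sum, so the input passes through unchanged. In practice the input to a post-activation block usually comes out of a previous ReLU and is already non-negative, so this is an illustration of the ordering rather than a failure case.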

The biggest pitfall is overthinking it. The beauty of ResNet is its simplicity. Don’t try to get fancy with the shortcut connections. The identity shortcut is the whole point. If you find yourself designing an elaborate gating mechanism for your skip connection, you’re probably just reinventing an LSTM or a GRU for computer vision. Use the simple addition. It works.

So, to summarize: if your network needs to be more than about 20 layers deep, make residual blocks your default. It’s not just a good idea; it’s practically the law. They are the reason we can train networks that are hundreds of layers deep without everything falling apart. They are a workhorse of modern deep learning, and understanding this simple trick of adding the input back in is one of the most important concepts in the entire field.