16.8 ConvNeXt: Modernizing ConvNets to Match Transformers

Alright, let’s talk about ConvNeXt. You remember ResNet, right? The “just stack more blocks, it’s probably fine” architecture that somehow worked shockingly well? It was the workhorse of computer vision for years. Then along came the Vision Transformer (ViT), which basically said, “hold my beer,” and showed that slapping the Transformer architecture from NLP onto image patches could achieve state-of-the-art results. Suddenly, all the cool kids were talking about attention mechanisms and patching strategies, and the humble ConvNet started looking a bit… dated.

This is where ConvNeXt comes in. A team at FAIR asked a brilliantly simple question: “Is a convolution-based network inherently worse, or did we just stop innovating on it?” Turns out, it was mostly the latter. They systematically took a standard ResNet and modernized it with a series of design choices borrowed from Transformers. The result wasn’t just an incremental improvement; it was a pure ConvNet that could go toe-to-toe with Swin Transformers on ImageNet. It’s a masterclass in how careful, thoughtful architecture design beats just jumping on the newest hype train.

The Core Modernization Tweaks

The magic of ConvNeXt isn’t in one revolutionary idea but in a series of meticulous, almost obsessive, upgrades. They started with a bog-standard ResNet-50 and applied changes one by one. Let’s walk through the big ones.

First up, they changed the training recipe. This is the most boring but arguably most important part. They trained for way longer (300 epochs instead of 90) with a modern optimizer (AdamW), heavy data augmentation (MixUp, CutMix, RandAugment), and regularization tricks such as stochastic depth and label smoothing. This alone gave a big performance bump before a single layer was touched: the paper reports ResNet-50 going from 76.1% to 78.8% top-1 on ImageNet. The lesson is that a fair comparison requires a modern training regimen, not just the one from 2015.
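As a rough sketch of what the optimizer side of such a recipe looks like in PyTorch, the snippet below wires up AdamW, a cosine learning-rate schedule, and label smoothing. The tiny stand-in model and every hyperparameter value except the 300 epochs are placeholders chosen for illustration, not the paper's settings, and the augmentations (MixUp, CutMix, RandAugment) are left out entirely.

```python
import torch
import torch.nn as nn

# Stand-in model; any nn.Module works here (hypothetical, for illustration only).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

epochs = 300  # long schedule, as in the text; the values below are placeholders
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# One dummy step to show how the pieces fit together.
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()  # call once per epoch in a real training loop
```

In a real run the `scheduler.step()` call sits at the end of each epoch, so the learning rate decays smoothly over the whole schedule.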

Next, they looked at the macro design, starting with the stage compute ratio. The original ResNet-50 spreads its blocks across the four stages as (3, 4, 6, 3). ConvNeXt follows the roughly 1:1:3:1 split used by Swin and changes this to (3, 3, 9, 3), which shifts a larger share of the compute into the third stage.
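To make the shift concrete, here is a small back-of-the-envelope calculation of how the blocks are distributed across stages. Block count is only a proxy for compute (per-block cost differs between stages), and the helper name `stage_shares` is made up for this example.

```python
def stage_shares(depths):
    """Fraction of all blocks that lives in each stage."""
    total = sum(depths)
    return [d / total for d in depths]

resnet50_depths = (3, 4, 6, 3)
convnext_depths = (3, 3, 9, 3)

resnet_shares = stage_shares(resnet50_depths)
convnext_shares = stage_shares(convnext_depths)

print(resnet_shares[2])    # 6 / 16 = 0.375
print(convnext_shares[2])  # 9 / 18 = 0.5
```

Half of all ConvNeXt blocks sit in the third stage, up from a bit over a third in ResNet-50.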

The “Swapped” Block: A Deeper Look

This is where the real architectural fun begins. Let’s break down the ConvNeXt block and compare it to its predecessor.

The old ResNet used a large kernel (7x7) only at the very beginning, in the stem, and everything after that was a paltry 3x3. Meanwhile, ViT’s self-attention has a global receptive field, and even Swin’s local attention windows are at least 7x7. ConvNeXt bridges this gap by doing the unthinkable: it uses larger kernels. Specifically, the depthwise convolution is moved to the front of the block, ahead of the 1x1 layers, and its kernel is enlarged to 7x7. This sounds insane from a compute perspective until you remember it’s depthwise: each channel gets its own 7x7 filter, so parameters and FLOPs grow linearly with the channel count instead of quadratically. The payoff is a much broader receptive field per layer, in the same ballpark as a Swin attention window, though still local rather than truly global.
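A quick parameter count shows why the depthwise trick makes a 7x7 kernel affordable. The channel width of 96 is just an example value, and `n_params` is a small helper written for this illustration.

```python
import torch.nn as nn

def n_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

dim = 96  # example channel width (an assumption for this illustration)

dense_3x3 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
dense_7x7 = nn.Conv2d(dim, dim, kernel_size=7, padding=3)
depthwise_7x7 = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)

print(n_params(dense_3x3))      # 96*96*9  + 96 = 83040
print(n_params(dense_7x7))      # 96*96*49 + 96 = 451680
print(n_params(depthwise_7x7))  # 96*49    + 96 = 4800
```

The depthwise 7x7 is far cheaper than even a dense 3x3 at the same width. FLOPs follow the same pattern, since each weight is applied once per output position.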

They also replaced the ReLU activation with the smoother GELU, just like ViT uses, and cut back to a single activation per block. And they swapped Batch Normalization for Layer Normalization, which is the normalization of choice in Transformers; the paper reports that, with the other changes in place, it trains without trouble and performs slightly better. It’s a classic case of “if your competitor’s stuff is working, maybe just use it.”
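For a feel of what “smoother” means, the snippet below compares the two activations on a few sample points and checks PyTorch's GELU against its closed form, x times the standard normal CDF. The sample values are arbitrary.

```python
import math
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])

relu_out = F.relu(x)  # hard cutoff: every negative input becomes exactly 0
gelu_out = F.gelu(x)  # smooth: small negative inputs leak through slightly

# Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
gelu_manual = x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

print(relu_out)
print(gelu_out)
```

For large positive inputs the two are nearly identical; the difference is concentrated around zero, where GELU has no kink.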

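Here’s a PyTorch implementation of the core ConvNeXt block. To keep it short, it leaves out the learnable per-channel layer scale that the official implementation applies before the residual add. Type this out, run it, feel the difference.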

import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """A single ConvNeXt block, the modernized replacement for the ResNet bottleneck."""

    def __init__(self, dim, drop_path_rate=0.):
        super().__init__()
        # Depthwise convolution with large 7x7 kernel
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        # LayerNorm, not BatchNorm
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        # Pointwise convolutions expand and contract channels (like a mini-FFN)
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.pwconv2 = nn.Linear(4 * dim, dim)
        # Activation is GELU, not ReLU
        self.act = nn.GELU()
        # Optional stochastic depth for regularization
        self.drop_path = StochasticDepth(drop_path_rate, mode='row') if drop_path_rate > 0. else nn.Identity()

    def forward(self, x):
        shortcut = x  # keep the block input for the residual connection
        x = self.dwconv(x)
        # Conv2d works on [B, C, H, W], but nn.LayerNorm(dim) and nn.Linear act on the
        # last dimension, so we permute to channels-last [B, H, W, C] for them.
        # A Linear applied at every pixel is equivalent to a 1x1 conv.
        x = x.permute(0, 2, 3, 1) # [B, C, H, W] -> [B, H, W, C]
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        x = x.permute(0, 3, 1, 2) # [B, H, W, C] -> [B, C, H, W]

        x = shortcut + self.drop_path(x)
        return x

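# A simple Stochastic Depth implementation for completeness.
# It is only looked up when a block is instantiated, so defining it after ConvNeXtBlock is fine.
class StochasticDepth(nn.Module):
    def __init__(self, drop_prob, mode='row'):
        super().__init__()
        self.drop_prob = drop_prob
        self.mode = mode  # only per-sample ('row') dropping is implemented below

    def forward(self, x):
        # At inference time (or with drop_prob == 0) this is a no-op
        if not self.training or self.drop_prob == 0.:
            return x
        keep_prob = 1 - self.drop_prob
        # One random value per sample, broadcast over all remaining dimensions
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
        random_tensor.floor_() # binarize: 1 with probability keep_prob, else 0
        # Rescale the kept samples so the expected output matches the input
        return x.div(keep_prob) * random_tensor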

Downsampling and The Stem

They also fixed the often-overlooked parts: the stem and the downsampling layers. The original ResNet stem is a 7x7 conv with stride 2 followed by a max pool, which together shrink the input by 4x using overlapping windows. ConvNeXt gets the same 4x reduction with a single 4x4 conv with stride 4. It’s just one non-overlapping operation that patchifies the image, much like a Transformer’s patch embedding, and it’s much cleaner.
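The two stems can be compared side by side with a quick shape check. This is a simplified sketch: the ResNet stem's BatchNorm and ReLU are omitted because they don't affect the output shape, and the output width of 96 for the patchify stem is an example value.

```python
import torch
import torch.nn as nn

# ResNet-style stem (BatchNorm and ReLU omitted; they don't change the shape).
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# Patchify stem: one non-overlapping 4x4 conv with stride 4.
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
resnet_out = resnet_stem(x)
patchify_out = patchify_stem(x)

print(resnet_out.shape)    # torch.Size([1, 64, 56, 56])
print(patchify_out.shape)  # torch.Size([1, 96, 56, 56])
```

Both land on a 56x56 feature map for a 224x224 input; the difference is how they get there.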

For downsampling between stages, ResNet folds the job into the first block of each stage: a strided conv on the main path, plus a 1x1 conv with stride 2 on the shortcut to match dimensions. ConvNeXt pulls downsampling out into its own layer between stages: a LayerNorm followed by a 2x2 conv with stride 2. The paper notes that adding a normalization layer wherever the resolution changes helps keep training stable. It’s elegant and effective.
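A minimal sketch of such a downsampling layer might look like the following. `ChannelsFirstLayerNorm` and `make_downsample` are hypothetical helper names introduced here, and the 96 to 192 channel widths are example values.

```python
import torch
import torch.nn as nn

class ChannelsFirstLayerNorm(nn.Module):
    """LayerNorm over the channel axis of a [B, C, H, W] tensor (hypothetical helper)."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.norm = nn.LayerNorm(dim, eps=eps)

    def forward(self, x):
        x = x.permute(0, 2, 3, 1)     # [B, C, H, W] -> [B, H, W, C]
        x = self.norm(x)              # normalize over the channel dimension
        return x.permute(0, 3, 1, 2)  # back to [B, C, H, W]

def make_downsample(in_dim, out_dim):
    """LayerNorm followed by a 2x2 stride-2 conv: halves H and W, changes width."""
    return nn.Sequential(
        ChannelsFirstLayerNorm(in_dim),
        nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2),
    )

downsample = make_downsample(96, 192)
out = downsample(torch.randn(1, 96, 56, 56))
print(out.shape)  # torch.Size([1, 192, 28, 28])
```

Because the layer lives between stages rather than inside a residual block, no shortcut branch has to be reshaped to match.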

Why This All Works and When to Use It

The reason ConvNeXt is so effective is that it keeps the inductive biases of convolution (translation equivariance and locality) while adopting the training recipe, block layout, and larger receptive fields of modern Transformers. It’s the best of both worlds. You get the efficiency of convolutions, with no attention whose cost grows quadratically with the number of tokens, and the performance of modern architectures.

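Pitfall: The biggest gotcha is the channels-last format requirement within the block. nn.LayerNorm and nn.Linear operate on the last dimension, so a missing or misordered .permute() call usually raises a shape error, and if the feature map’s width happens to equal the channel count it can silently normalize the wrong axis. Double-check your tensor shapes when debugging.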
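Here is a small, self-contained demonstration of the pitfall, using arbitrary example sizes. It applies `nn.LayerNorm` directly to a channels-first tensor, which fails, and then does it correctly with a pair of permutes.

```python
import torch
import torch.nn as nn

dim = 96
norm = nn.LayerNorm(dim)
x = torch.randn(2, dim, 8, 8)  # channels-first: [B, C, H, W]

# Wrong: LayerNorm sees a last dimension of size 8, not 96, and raises.
try:
    norm(x)
    failed = False
except RuntimeError:
    failed = True

# Right: move channels to the end, normalize, then move them back.
y = norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(failed)   # True
print(y.shape)  # torch.Size([2, 96, 8, 8])
```

A cheap safeguard is an assert on `x.shape[-1] == dim` right before the normalization while you are debugging.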

Best Practice: Use pretrained weights. Training a ConvNeXt from scratch on a small dataset is possible, but like any modern architecture, it thrives on large-scale pretraining. The torchvision models are a great starting point.

So, when should you use it? Pretty much anytime you need a rock-solid, efficient, and powerful vision backbone, it deserves a spot on your shortlist. It tends to be less finicky to train than a ViT and more accurate than a vanilla ResNet. It’s not often that a paper makes such a clear, “stop what you’re doing and use this” argument, but ConvNeXt is one of them.