Convolutional | mikePietsch.com

16.8 ConvNeXt: Modernizing ConvNets to Match Transformers

Alright, let’s talk about ConvNeXt. You remember ResNet, right? The “just stack more blocks, it’s probably fine” architecture that somehow worked shockingly well? It was the workhorse of computer vision for years. Then along came the Vision Transformer (ViT), which basically said, “hold my beer,” and showed that slapping the Transformer architecture from NLP onto image patches could achieve state-of-the-art results. Suddenly, all the cool kids were talking about attention mechanisms and patching strategies, and the humble ConvNet started looking a bit… dated.

16.7 Depthwise Separable Convolutions and MobileNet

Right, so you’ve built a nice, beefy CNN. It’s accurate, and it also requires a small power plant to run and thinks a smartphone is a convenient paperweight. This is the problem MobileNet and its secret weapon, the Depthwise Separable Convolution, were designed to solve. We’re going to tear this idea apart, and I promise you, it’s one of the most elegant “why didn’t I think of that?” tricks in modern deep learning.

16.6 EfficientNet: Compound Scaling of Depth, Width, and Resolution

Right, so you’ve built a model. You’ve tweaked the depth, maybe fiddled with the width, and you’re feeling pretty good about yourself. Then you hit that inevitable plateau. The classic move is to just throw more compute at the problem: make the network deeper, wider, or crank up the input resolution. You do that, and sure, you get a bump in accuracy, but the computational cost (those lovely FLOPs) and the number of parameters explode. It’s a brute-force approach, and frankly, it’s a bit inelegant. You’re not a brute; you’re a sophisticated model architect.
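To put rough numbers on that explosion, here is a back-of-the-envelope sketch. It assumes a toy stack of identical 3x3 conv layers with constant width and no downsampling, and `conv_stack_cost` is a made-up helper for illustration, not anything from EfficientNet itself. Real networks are messier, but the scaling trends are the same.

```python
def conv_stack_cost(depth, width, resolution, kernel=3):
    """Rough cost of `depth` conv layers, each mapping width -> width channels
    on resolution x resolution feature maps (toy model, no downsampling)."""
    params_per_layer = width * width * kernel * kernel
    # One multiply-accumulate per weight per output position.
    macs_per_layer = params_per_layer * resolution * resolution
    return depth * macs_per_layer, depth * params_per_layer

base_macs, base_params = conv_stack_cost(depth=10, width=64, resolution=224)

for name, cfg in {
    "2x depth": (20, 64, 224),
    "2x width": (10, 128, 224),
    "2x resolution": (10, 64, 448),
    "2x everything": (20, 128, 448),
}.items():
    macs, params = conv_stack_cost(*cfg)
    print(f"{name:>14}: {macs / base_macs:4.0f}x compute, {params / base_params:4.0f}x params")
```

Under these assumptions, depth scales cost linearly, while width and resolution each scale compute quadratically, so doubling all three at once multiplies compute by 32.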

16.5 Inception and Xception: Multi-Scale Feature Extraction

Right, so you’ve got your basic convolutional stack figured out. You stack a few layers, maybe a pooling layer here and there, and call it a day. It works, but it’s a bit… simple. It’s like trying to solve every problem with a single, standard-sized wrench. Sometimes you need a socket set, sometimes you need a torque wrench, and sometimes you just need to hit it really hard with a bigger wrench.

16.4 Residual Networks (ResNet): Skip Connections and Identity Shortcuts

Right, let’s talk about ResNet. You’ve probably hit the infamous “vanishing gradient” problem by now, or at least you’ve heard the horror stories. As networks get deeper, your gradients (those little error signals that are supposed to travel all the way back to the early layers to guide their learning) just… vanish. They get smaller and smaller as they backpropagate through dozens of layers, until they’re practically zero. The early layers learn at a glacial pace, if at all. It’s like trying to whisper a secret through a stadium full of people; by the time it gets to the other side, the message is gone. So for years, we were stuck. We knew depth was powerful, but we couldn’t reliably train very deep networks.
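If you want to watch the whisper die, here is a toy sketch. It assumes a deliberately silly "network": a chain of scalar sigmoid layers with no weights, which is an illustration and not how ResNet is actually built. The `residual` flag adds the input back at every layer, in the spirit of the skip connections in this section's title.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def chain_gradient(x0, depth, residual=False):
    """Return d(output)/d(input) for a toy chain of `depth` scalar sigmoid layers."""
    x, grad = x0, 1.0
    for _ in range(depth):
        s = sigmoid(x)
        local = s * (1.0 - s)      # sigmoid derivative, never larger than 0.25
        if residual:
            x = x + s              # skip connection: output = input + f(input)
            grad *= 1.0 + local    # the identity path contributes the "1"
        else:
            x = s                  # plain layer: output = f(input)
            grad *= local
    return grad

plain = chain_gradient(0.5, depth=30)
skip = chain_gradient(0.5, depth=30, residual=True)
print(f"plain chain:    {plain:.3e}")
print(f"residual chain: {skip:.3e}")
```

By the chain rule, the plain chain multiplies 30 factors that are each at most 0.25, so the gradient collapses toward zero. With the identity path, every factor is at least 1, so the signal reaching the first layer cannot shrink.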

16.3 Classic Architectures: LeNet, AlexNet, VGG

Right, let’s talk about the old guard. Before we had models that could write sonnets about your cat, we had models that could, with staggering effort, tell you if a picture was of a cat or a dog. This is where we started, and honestly, you need to know this stuff. It’s the foundation. It’s like learning your scales before you try to play jazz. These architectures aren’t just historical footnotes; their ideas are the DNA inside every modern network you’ll use. So let’s pull them apart and see how they tick.

16.2 Pooling: Max Pooling, Average Pooling, Global Average Pooling

Right, let’s talk about pooling. You’ve just learned about convolution, where we slide little filters around an image to find features. That’s brilliant, but it leaves us with a problem: a spatial sensitivity that’s almost too good. If a feature moves by a single pixel, it’s now in a different receptive field, and our activation map changes. That’s not very useful for recognizing a cat whether it’s slightly to the left or the right. This is where pooling waltzes in, not to add new information, but to ruthlessly and efficiently summarize what the convolution just found.
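Here is a minimal NumPy sketch of that summarizing step, assuming non-overlapping windows (stride equal to the window size). `pool2d` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Summarize each non-overlapping size x size window of a 2D feature map."""
    h, w = x.shape
    h, w = h - h % size, w - w % size          # drop rows/cols that don't fill a window
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

# One strong activation, then the same activation shifted right.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[0, 1] = 1.0   # one pixel right, same 2x2 window
c = np.zeros((4, 4)); c[0, 2] = 1.0   # two pixels right, next window over

print(pool2d(a))
print(pool2d(b))
print(pool2d(c))
print(pool2d(a, mode="avg"))
```

The first two pooled maps are identical, because the one-pixel shift stays inside the same window. The third differs, because the activation crossed a window boundary. So pooling buys tolerance to small shifts, not full translation invariance.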

16.1 Convolution: Filters, Stride, Padding, and Feature Maps

Right, let’s get our hands dirty with the actual machinery of a CNN: the convolutional layer. Forget the textbook definition for a second. Think of it like this: you’re trying to find a specific pattern—like a vertical edge, a blotch of color, or eventually, the curve of a cat’s ear—in a big image. You wouldn’t look at the whole image at once; you’d take a small magnifying glass and slide it across every possible spot. That’s convolution in a nutshell. It’s a glorified, learnable pattern-matching slide rule.
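The magnifying glass can be sketched in a few lines of NumPy. This is a slow, loop-based illustration for a single-channel image and a single filter, with a hand-picked vertical-edge kernel rather than a learned one. Like most deep learning frameworks, it computes cross-correlation (no kernel flip). `conv2d` is a hypothetical helper written for this example.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` and record the match score at each position."""
    if padding > 0:
        image = np.pad(image, padding, mode="constant")   # zero padding on all sides
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            out[i, j] = np.sum(patch * kernel)            # elementwise product, then sum
    return out

# 6x6 image: dark left half, bright right half, so one vertical edge in the middle.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Responds when the right side of the window is brighter than the left.
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)

feature_map = conv2d(image, vertical_edge)
print(feature_map)
print(conv2d(image, vertical_edge, padding=1).shape)   # padding keeps the 6x6 size
print(conv2d(image, vertical_edge, stride=2).shape)    # stride 2 shrinks the output
```

The feature map lights up only in the columns where the window straddles the edge and stays at zero over the flat regions, which is exactly the "pattern found here" signal the layer is after.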