16.8 ConvNeXt: Modernizing ConvNets to Match Transformers
Alright, let’s talk about ConvNeXt. You remember ResNet, right? The “just stack more blocks, it’s probably fine” architecture that somehow worked shockingly well? It was the workhorse of computer vision for years. Then along came the Vision Transformer (ViT), which basically said, “hold my beer,” and showed that slapping the Transformer architecture from NLP onto sequences of image patches could achieve state-of-the-art image classification results, given enough pretraining data. Suddenly, all the cool kids were talking about attention mechanisms and patch embeddings, and the humble ConvNet started looking a bit… dated.