13.5 Neural Architecture Search (NAS)

Right, so you’ve built your model, you’ve got your data, and now you’re staring down the barrel of a thousand knobs to turn. Learning rate, batch size, number of layers… it’s enough to make you want to just pick numbers out of a hat. Hyperparameter optimization (HPO) is the formal process of not doing that. But what if the biggest, most architectural knob of them all—the very structure of the neural network itself—is also just a hyperparameter? Enter Neural Architecture Search, or NAS. It’s the meta-game of machine learning: using machine learning to design your machine learning. It’s as gloriously recursive as it is computationally expensive.

The core idea is both simple and profoundly arrogant: instead of you, a mere human, spending weeks sketching network diagrams on napkins based on intuition and folklore, why not let an algorithm search through a vast space of possible architectures to find the best one for your specific problem and dataset? It automates the very kind of innovation that got us here.

The Three Pillars of Any NAS System

Every NAS method, from the simple to the absurdly complex, consists of three core components. Forget this, and you’re just throwing compute at a wall.

Search Space: This is the universe of all possible architectures we’re allowing our algorithm to consider. Do we let it design any conceivable directed acyclic graph? Usually not—that space is infinite and most of it is garbage. Instead, we define a constrained playground. A common approach is the cell-based search space, where the algorithm designs a small, repeating computational unit (a “cell”), and we just stack many copies of that winning cell to form the final deep network. It’s like designing a single, perfect Lego brick and then building a castle out of it. This massively reduces the search space and often leads to architectures that generalize well.
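To make the cell idea concrete, here is a minimal sketch in plain Python. The operation names, the rule that each node picks exactly one input and one operation, and the helper names (count_cells, sample_cell, stack_cells) are all assumptions made for illustration; they do not correspond to any specific published search space.

```python
import random

# Hypothetical candidate operations for each node in a cell.
OPS = ["conv3x3", "conv1x1", "max_pool", "avg_pool", "skip"]

def count_cells(num_nodes, ops=OPS):
    """Count distinct cells when node i picks one (input, op) pair.

    Node i may read from the cell input (index 0) or from any earlier
    node, so it has i + 1 possible inputs and len(ops) possible operations.
    """
    total = 1
    for i in range(num_nodes):
        total *= (i + 1) * len(ops)
    return total

def sample_cell(num_nodes, rng, ops=OPS):
    """Sample one cell as a list of (input_index, op_name), one entry per node."""
    return [(rng.randrange(i + 1), rng.choice(ops)) for i in range(num_nodes)]

def stack_cells(cell, num_cells):
    """The full network is the same cell description repeated num_cells times."""
    return [cell] * num_cells

rng = random.Random(0)
cell = sample_cell(num_nodes=4, rng=rng)
network = stack_cells(cell, num_cells=8)

print(count_cells(4))   # size of the cell search space
print(len(network))     # depth of the stacked network, in cells
```

With 4 nodes and 5 operations this toy space holds 15,000 cells, and that number stays the same however many copies you stack. Searching each of the 8 positions independently would instead raise it to the 8th power.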

Search Strategy: This is the how. How does our algorithm navigate that vast search space to find a good architecture? This is where the magic (and the compute bill) happens. You’ve got your classic Reinforcement Learning approach, where a controller RNN learns to generate model descriptions (“add a 3x3 conv here,” “add a skip connection there”) and gets rewarded based on how well the resulting model performs on validation data. It’s cool, but it’s like training a dog by only giving it a treat after it runs a full marathon: every single reward costs a full training run. You’ll also see Evolutionary Algorithms, which mutate and cross over populations of architectures, killing off the weak. And then there’s the modern darling, Differentiable Architecture Search (DARTS), which is frankly a hack so clever it works. Instead of evaluating each architecture in isolation, DARTS relaxes the discrete search space into a continuous one: you don’t pick a single operation for the connection between two nodes, you learn a weight for every candidate operation on that connection and blend their outputs with a softmax. You then optimize the model weights (on training data) and these architecture weights (on validation data) by alternating gradient descent steps. It’s far more efficient than RL or evolution, but it has its own quirks, like a tendency to favor parameter-free operations such as skip connections over convolutions.
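As a rough illustration of the evolutionary flavor, the sketch below runs a mutation-only tournament search over a toy space. Everything here is an assumption for the sake of the example: the operation list, the six-edge encoding, and especially toy_fitness, which is a made-up scoring rule standing in for "train this architecture and measure validation accuracy". Crossover is left out to keep it short.

```python
import random

OPS = ["conv3x3", "conv1x1", "max_pool", "avg_pool", "skip"]  # hypothetical op set
NUM_EDGES = 6

def toy_fitness(arch):
    """Stand-in for 'train the model and return validation accuracy'.

    Purely illustrative: it rewards convolutions and penalizes having
    more than two skip connections. A real search would train here.
    """
    score = {"conv3x3": 1.0, "conv1x1": 0.7, "max_pool": 0.4,
             "avg_pool": 0.4, "skip": 0.2}
    penalty = 0.5 if arch.count("skip") > 2 else 0.0
    return sum(score[op] for op in arch) - penalty

def mutate(arch, rng):
    """Replace the operation on one randomly chosen edge."""
    child = list(arch)
    child[rng.randrange(len(child))] = rng.choice(OPS)
    return child

def evolve(pop_size=10, sample_size=3, steps=200, seed=0):
    rng = random.Random(seed)
    population = [[rng.choice(OPS) for _ in range(NUM_EDGES)]
                  for _ in range(pop_size)]
    initial_best = max(toy_fitness(a) for a in population)
    for _ in range(steps):
        # Tournament selection: the fittest of a small random sample is the parent.
        parent = max(rng.sample(population, sample_size), key=toy_fitness)
        population.append(mutate(parent, rng))
        # "Kill off the weak": drop the least fit individual.
        population.remove(min(population, key=toy_fitness))
    best = max(population, key=toy_fitness)
    return best, toy_fitness(best), initial_best

best_arch, best_score, initial_best = evolve()
print(best_arch, best_score)
```

Because only the weakest individual is ever removed, the best score in the population can never go down. In a real run each call to the fitness function is a training job, which is where the compute bill comes from.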

Performance Estimation Strategy: This is the brutal, practical bottleneck. You’ve proposed a new architecture. How do you know if it’s any good? The obvious answer is “train it from scratch on the training data and evaluate it on the validation set.” This is also the ludicrously expensive answer. A naive NAS that does this for every candidate is why the early papers from big tech firms reported search costs in the thousands of GPU-days. So we use proxies: we train for fewer epochs, we train on a smaller dataset, we use lower-resolution images, or we employ weight sharing. Weight sharing, used in methods like ENAS and DARTS, is the trick where you train one giant “supernet” that encompasses all possible architectures in your search space. When you want to evaluate a specific sub-architecture, you don’t train it from scratch; you just reuse the weights that the corresponding part of the supernet has already learned. It’s not perfect, because the ranking under shared weights doesn’t always match the ranking after full training, but it cuts the cost from “mortgage your house” to “still pretty pricey.”
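Here is a minimal sketch of what weight sharing looks like in PyTorch. The class names (SharedEdge, SharedSupernet), the three-operation edge, and the estimate_accuracy helper are hypothetical, and the batch is random noise standing in for validation data. The supernet is also left untrained here, so the printed scores are meaningless; the point is only that every sub-architecture is scored with the same set of weights and no per-candidate training.

```python
import torch
import torch.nn as nn

class SharedEdge(nn.Module):
    """One edge of the supernet, holding weights for every candidate op."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),  # index 0: 3x3 conv
            nn.Conv2d(channels, channels, 1),             # index 1: 1x1 conv
            nn.Identity(),                                # index 2: skip
        ])

    def forward(self, x, choice):
        # Only the chosen op runs; its weights are shared by every
        # sub-architecture that picks the same op on this edge.
        return self.ops[choice](x)

class SharedSupernet(nn.Module):
    def __init__(self, channels=8, num_edges=3, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.edges = nn.ModuleList([SharedEdge(channels) for _ in range(num_edges)])
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x, arch):
        x = self.stem(x)
        for edge, choice in zip(self.edges, arch):
            x = torch.relu(edge(x, choice))
        return self.head(x.mean(dim=(2, 3)))  # global average pooling

@torch.no_grad()
def estimate_accuracy(supernet, arch, images, labels):
    """Score a sub-architecture with inherited weights; no training happens here."""
    supernet.eval()
    preds = supernet(images, arch).argmax(dim=1)
    return (preds == labels).float().mean().item()

torch.manual_seed(0)
supernet = SharedSupernet()
images = torch.randn(16, 3, 32, 32)     # random stand-in for a validation batch
labels = torch.randint(0, 10, (16,))

for arch in [(0, 0, 0), (1, 2, 0), (2, 2, 2)]:
    print(arch, estimate_accuracy(supernet, arch, images, labels))
```

An architecture here is just a tuple with one op index per edge. In an ENAS-style setup you would first train the supernet by sampling a different tuple for each batch, then rank candidates with a helper like the one above.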

A (Somewhat) Practical DARTS Example

Let’s be clear: full-scale NAS is still a game for those with serious hardware. But you can play with the concepts on a smaller dataset like CIFAR-10. Here’s a simplified look at what the core of a DARTS-like search might entail, using PyTorch. This is a skeleton to illustrate the idea, not a production-ready script.

import torch
import torch.nn as nn
import torch.optim as optim

# A "mixed operation" on one edge of the supernet: every candidate op is
# applied to the input and the results are blended with learned weights.
class MixedOp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # The operations we're considering between these two nodes.
        # Every op keeps the channel count and spatial size unchanged,
        # otherwise the outputs could not be summed in forward().
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 1, padding=0),  # 1x1 (pointwise) conv
            nn.Conv2d(channels, channels, 3, padding=1),  # 3x3 conv
            nn.MaxPool2d(3, stride=1, padding=1),         # Max pool
            nn.AvgPool2d(3, stride=1, padding=1),         # Avg pool
            nn.Identity(),                                # Skip connection
        ])
        # The architecture parameters (alpha): one logit per candidate op.
        self.alpha = nn.Parameter(torch.randn(len(self.ops)))

    def forward(self, x):
        # The forward pass is a weighted sum of all operations.
        # Use softmax to make the alphas a probability distribution.
        weights = torch.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# A toy supernet: a stem conv followed by a chain of MixedOps.
# (A stand-in for illustration; real DARTS stacks cells built from MixedOps.)
class TinySupernet(nn.Module):
    def __init__(self, channels=16, num_edges=3, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.mixed_ops = nn.ModuleList([MixedOp(channels) for _ in range(num_edges)])
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, x):
        x = self.stem(x)
        for mixed_op in self.mixed_ops:
            x = mixed_op(x)
        return self.classifier(x.mean(dim=(2, 3)))  # global average pooling

supernet = TinySupernet()

# Split the parameters so each optimizer only ever touches its own group.
arch_params = [p for n, p in supernet.named_parameters() if n.endswith('alpha')]
weight_params = [p for n, p in supernet.named_parameters() if not n.endswith('alpha')]

# The search would involve:
# 1. Alternating between updating the model weights on training data (with alphas fixed)...
#    (The DARTS paper uses momentum SGD for the weights and Adam for alpha.)
model_optimizer = optim.SGD(weight_params, lr=0.025, momentum=0.9)
# 2. ...and updating the architecture weights alpha on validation data (with model weights fixed).
arch_optimizer = optim.Adam(arch_params, lr=3e-4)

# After training, you'd derive your final architecture:
# for each MixedOp, choose the operation with the highest alpha value.
final_architecture_ops = [torch.argmax(mixed_op.alpha).item() for mixed_op in supernet.mixed_ops]
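The alternating updates described in the comments above can be sketched as a short loop. This is a self-contained toy, so it redefines a small MixedOp and a hypothetical ToySupernet, and it feeds random tensors (random_batch is a made-up helper) in place of real CIFAR-10 splits. It uses the simpler first-order approximation of DARTS, where alpha is updated with the plain validation gradient at the current weights, rather than the second-order update from the original paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),  # 3x3 conv
            nn.AvgPool2d(3, stride=1, padding=1),         # avg pool
            nn.Identity(),                                # skip connection
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # uniform start

    def forward(self, x):
        weights = torch.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

class ToySupernet(nn.Module):
    def __init__(self, channels=8, num_edges=2, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.mixed_ops = nn.ModuleList([MixedOp(channels) for _ in range(num_edges)])
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x):
        x = self.stem(x)
        for op in self.mixed_ops:
            x = torch.relu(op(x))
        return self.head(x.mean(dim=(2, 3)))

torch.manual_seed(0)
supernet = ToySupernet()
arch_params = [p for n, p in supernet.named_parameters() if n.endswith("alpha")]
weight_params = [p for n, p in supernet.named_parameters() if not n.endswith("alpha")]
weight_opt = torch.optim.SGD(weight_params, lr=0.025, momentum=0.9)
arch_opt = torch.optim.Adam(arch_params, lr=3e-4)

def random_batch(n=8):
    # Random tensors standing in for real training / validation batches.
    return torch.randn(n, 3, 32, 32), torch.randint(0, 10, (n,))

for step in range(5):
    # 1. Architecture step on a validation batch: only alpha is updated.
    x_val, y_val = random_batch()
    arch_opt.zero_grad()
    F.cross_entropy(supernet(x_val), y_val).backward()
    arch_opt.step()

    # 2. Weight step on a training batch: only the model weights are updated.
    #    zero_grad() discards the weight gradients left over from step 1.
    x_train, y_train = random_batch()
    weight_opt.zero_grad()
    F.cross_entropy(supernet(x_train), y_train).backward()
    weight_opt.step()

# Discretize: keep the op with the largest alpha on each edge.
chosen = [torch.argmax(m.alpha).item() for m in supernet.mixed_ops]
print(chosen)
```

Keeping two disjoint parameter lists is what enforces "alphas fixed" and "weights fixed": each backward pass computes gradients for everything, but each optimizer only steps its own group.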

The Rough Edges and Reality Check

NAS isn’t a silver bullet. It’s a cannon that consumes GPUs as ammunition. The biggest pitfall is failing to account for the full cost. The search phase is often just step one; you then need to take the discovered architecture and retrain it from scratch on the full training schedule to get its true performance. Many a paper has been criticized for a retraining protocol that makes the NAS result look better than it is, for instance by giving the searched architecture a longer schedule or heavier augmentation than the baselines it is compared against.

Furthermore, the architectures it finds are often bizarre. They’re full of skip connections, unusual groupings, and look nothing like the clean, human-designed ResNets you’re used to. They can be brittle and harder to train, and their performance gains can sometimes be attributed more to a larger parameter count than genuine algorithmic insight.
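One cheap sanity check for the parameter-count concern is to tally the parameters of each architecture before comparing accuracies. The sketch below does the arithmetic by hand for a hypothetical four-edge encoding on 64 channels; the op names and the two example architectures are invented for illustration, not taken from any paper.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Number of parameters in a standard 2D convolution layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

# Parameter cost of each candidate op on a 64-channel edge (hypothetical op set).
CHANNELS = 64
OP_PARAMS = {
    "conv3x3": conv_params(3, CHANNELS, CHANNELS),
    "conv1x1": conv_params(1, CHANNELS, CHANNELS),
    "max_pool": 0,   # pooling has no learnable parameters
    "skip": 0,       # neither does an identity connection
}

def count_params(arch):
    """Total learnable parameters for a list of op names."""
    return sum(OP_PARAMS[op] for op in arch)

baseline = ["conv3x3", "skip", "conv1x1", "max_pool"]
searched = ["conv3x3", "conv3x3", "conv3x3", "conv1x1"]

print(count_params(baseline))
print(count_params(searched))
print(count_params(searched) / count_params(baseline))
```

In this made-up case the "searched" architecture carries roughly 2.8 times the parameters of the baseline, so an accuracy gain on its own would say little about the quality of the search. A fairer comparison scales the baseline until the counts are close.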

So, should you use it? If you’re a researcher or a company with a data center, absolutely. It’s the frontier. If you’re working on a product with a limited budget, your time and compute are almost always better spent on collecting more data, improving your feature engineering, or doing good old-fashioned manual HPO. Use NAS not because it’s easy, but because you’ve already exhausted the other, cheaper options and that last percent of accuracy is worth its weight in gold—and electricity.