16.6 EfficientNet: Compound Scaling of Depth, Width, and Resolution

Right, so you’ve built a model. You’ve tweaked the depth, maybe fiddled with the width, and you’re feeling pretty good about yourself. Then you hit that inevitable plateau. The classic move is to just throw more compute at the problem: make the network deeper, wider, or crank up the input resolution. You do that, and sure, you get a bump in accuracy, but the computational cost (those lovely FLOPS) and number of parameters explode. It’s a brute-force approach, and frankly, it’s a bit inelegant. You’re not a brute; you’re a sophisticated model architect.

Enter the team at Google Brain with a simple, almost annoyingly sensible question: “Why are we scaling these dimensions arbitrarily and independently?” Depth, width, and resolution aren’t orthogonal concepts; they’re deeply intertwined. Think about it. A higher-resolution image has more pixels and finer-grained patterns, so it tends to need a deeper network (larger receptive fields to cover the same content) and a wider one (more channels to capture those finer patterns). The EfficientNet paper’s genius was in recognizing this and formalizing it. They didn’t just propose a new model; they proposed a new, principled scaling method.

The Compound Scaling Mantra

The core idea is compound scaling, governed by this simple formula. Don’t worry, it’s not as scary as it looks:

Depth: d = α^φ
Width: w = β^φ
Resolution: r = γ^φ

…with the constraint that α · β² · γ² ≈ 2 and α ≥ 1, β ≥ 1, γ ≥ 1.

Hold on, what does this actually mean? Let’s break it down. d, w, and r are multipliers applied to the baseline’s depth, width, and input resolution. φ is your compound coefficient: your knob for controlling how many more resources you want to use. Twist this knob, and it scales all three dimensions up together. The magic is in the constants α, β, γ. They’re determined once, via a small grid search on the baseline model (EfficientNet-B0) with φ fixed at 1, and then held constant while φ is increased to produce the larger models. For EfficientNet-B0, the paper found the best values were α=1.2, β=1.1, γ=1.15.

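The constraint α · β² · γ² ≈ 2 is the key insight. For a regular convolutional network, FLOPS grow linearly with depth but quadratically with width and resolution: doubling the channels doubles both the inputs and the outputs of each convolution, and doubling the resolution quadruples the number of pixels. Scaling with coefficient φ therefore multiplies total FLOPS by roughly (α · β² · γ²)^φ ≈ 2^φ, so each time you increase φ by 1, the FLOPS roughly double. (With the paper’s constants the product is about 1.92.) This is why the cost is so predictable. You’re getting a coordinated boost on all fronts for a known price, rather than a lopsided and inefficient boost on just one.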
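The doubling claim is easy to check for yourself. Here is a minimal sketch, assuming the simple cost model behind the constraint (FLOPS proportional to depth × width² × resolution²). The name `flops_multiplier` is ours for illustration, not something from the paper.

```python
# Coefficients the paper reports for EfficientNet-B0
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def flops_multiplier(phi, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Relative FLOPS versus the baseline, assuming FLOPS ~ d * w^2 * r^2."""
    d = alpha ** phi   # depth multiplier
    w = beta ** phi    # width multiplier
    r = gamma ** phi   # resolution multiplier
    return d * w ** 2 * r ** 2

per_step = flops_multiplier(1)
print(f"alpha * beta^2 * gamma^2 = {per_step:.3f}")  # about 1.92, close to 2

for phi in range(4):
    print(f"phi={phi}: ~{flops_multiplier(phi):.2f}x baseline FLOPS")
```

Because the product is about 1.92 rather than exactly 2, each step of φ costs slightly less than a true doubling under this model.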

Implementing Scaling for a Modern Architecture

You don’t have to use the official EfficientNet to use this idea. The principle is what’s important. Let’s say you have a simple CNN backbone and you want to apply compound scaling. Here’s how you might think about it programmatically. First, you define your baseline.

import torch
import torch.nn as nn

class BaselineBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv(x))

class MyBaselineCNN(nn.Module):
    def __init__(self, depth=3, width=32, resolution=224):
        super().__init__()
        # We'll simulate depth with a stack of blocks
        layers = []
        current_channels = 3  # input RGB channels
        for i in range(depth):
            layers.append(BaselineBlock(current_channels, width))
            current_channels = width
        self.net = nn.Sequential(*layers)
        self.resolution = resolution

    def forward(self, x):
        # A real model would have downsampling, but this is for illustration
        return self.net(x)

# Our baseline model
baseline_model = MyBaselineCNN(depth=3, width=32, resolution=224)

Now, let’s scale it up using the compound principle with φ=1. We’ll use the paper’s values.

# Compound scaling parameters
alpha = 1.2  # for depth
beta = 1.1   # for width
gamma = 1.15 # for resolution
phi = 1      # our chosen compound coefficient

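# Baseline dimensions (these must match baseline_model above)
base_depth, base_width, base_resolution = 3, 32, 224

# Each multiplier scales a baseline dimension. alpha ** phi on its own is
# just the ratio 1.2, not a block count, so multiply it by the baseline first.
scaled_depth = round(base_depth * alpha ** phi)            # 3 * 1.2    = 3.6   -> 4
scaled_width = round(base_width * beta ** phi)             # 32 * 1.1   = 35.2  -> 35
scaled_resolution = round(base_resolution * gamma ** phi)  # 224 * 1.15 = 257.6 -> 258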

print(f"Scaled Depth (number of blocks): {scaled_depth}")
print(f"Scaled Width (channels): {scaled_width}")
print(f"Scaled Resolution: {scaled_resolution}")

# Instantiate the scaled model
scaled_model = MyBaselineCNN(depth=scaled_depth, width=scaled_width, resolution=scaled_resolution)

This is a simplified example, but it captures the essence. You’re not just making it deeper; you’re making it all slightly bigger in a coordinated way.
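To see what the scaling actually costs, you can wrap the arithmetic in a helper and count parameters and multiply-accumulates analytically. This is a sketch under stated assumptions: `compound_scale` and `conv_stack_cost` are hypothetical names, and the cost model assumes a stack of 3×3 same-padded convolutions with bias and no downsampling, like the toy baseline in this section.

```python
def compound_scale(base_depth, base_width, base_resolution, phi,
                   alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution) scaled by the compound coefficient phi."""
    depth = max(1, round(base_depth * alpha ** phi))
    width = max(1, round(base_width * beta ** phi))
    resolution = max(1, round(base_resolution * gamma ** phi))
    return depth, width, resolution

def conv_stack_cost(depth, width, resolution, in_channels=3, kernel=3):
    """Parameters and multiply-accumulates (MACs) for the toy conv stack."""
    params, macs = 0, 0
    channels = in_channels
    for _ in range(depth):
        weights = channels * width * kernel * kernel
        params += weights + width                  # weights plus one bias per output channel
        macs += weights * resolution * resolution  # one kernel application per output pixel
        channels = width
    return params, macs

base_params, base_macs = conv_stack_cost(*compound_scale(3, 32, 224, phi=0))
for phi in range(3):
    dims = compound_scale(3, 32, 224, phi)
    params, macs = conv_stack_cost(*dims)
    print(f"phi={phi}: dims={dims}, params={params}, MACs x{macs / base_macs:.2f}")
```

On a baseline this small, integer rounding distorts the ratio: depth jumps from 3 to 4 blocks (a 1.33× step instead of 1.2×), so the φ=1 cost lands noticeably above 2× the baseline. With larger baselines the rounding error shrinks and the ratio moves closer to the predicted value.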

The Pitfalls and Practical Realities

Now, the honest truth. This scaling law is brilliant, but it’s not a free lunch. The first pitfall is memory. Resolution (γ) is the most expensive dimension for activation memory: the number of pixels in every feature map grows with the square of the resolution multiplier, whereas widening or deepening the network increases activation memory only about linearly. Your GPU memory will feel it long before your FLOPS meter peaks. You might find yourself using smaller batches or more aggressive gradient checkpointing.
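A rough way to compare the dimensions is a proportional model. The sketch below assumes FLOPS scale as d · w² · r² and stored activations as d · w · r² (one feature map kept per layer for the backward pass). It ignores weights, optimizer state, and framework overhead, so treat the numbers as relative trends, not measurements. The function name `relative_costs` is hypothetical.

```python
def relative_costs(d=1.0, w=1.0, r=1.0):
    """Relative (FLOPS, activation memory) for depth/width/resolution multipliers."""
    flops = d * w ** 2 * r ** 2
    activations = d * w * r ** 2   # channels count once, pixels count as r^2
    return flops, activations

scenarios = {
    "depth only (1.2x)": dict(d=1.2),
    "width only (1.1x)": dict(w=1.1),
    "resolution only (1.15x)": dict(r=1.15),
    "compound (phi=1)": dict(d=1.2, w=1.1, r=1.15),
}
for name, multipliers in scenarios.items():
    flops, activations = relative_costs(**multipliers)
    print(f"{name:<26} FLOPS x{flops:.2f}  activations x{activations:.2f}")
```

Under this model, width buys FLOPS faster than it costs activation memory, while resolution raises both by the same squared factor.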

Secondly, the baseline model matters… a lot. The paper’s search for α, β, γ was done on their specifically designed EfficientNet-B0, a compact baseline that was itself found by neural architecture search, optimizing for both accuracy and FLOPS. If you apply these exact same constants to, say, a ResNet-50, you’ll likely get an improvement, but it might not be optimal. The better practice is to repeat the paper’s small grid search on your own baseline (fix φ=1, try combinations of α, β, γ that satisfy the constraint, and keep the best) to find the right constants for your specific architecture and dataset. It’s more work, but it’s the difference between using someone else’s prescription and getting glasses made for your own eyes.
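The cheap part of that grid search is enumerating which (α, β, γ) triples are even allowed. Here is a minimal sketch; the step size, upper bound, and tolerance are assumptions chosen for illustration, and `candidate_coefficients` is a hypothetical helper. The expensive part, training the scaled baseline for each candidate and comparing validation accuracy, is left as a comment.

```python
from itertools import product

def candidate_coefficients(step=0.05, upper=1.5, target=2.0, tolerance=0.1):
    """List (alpha, beta, gamma) grid points with alpha * beta^2 * gamma^2 near target."""
    n = int(round((upper - 1.0) / step)) + 1
    grid = [round(1.0 + i * step, 2) for i in range(n)]  # 1.0, 1.05, ..., upper
    keep = []
    for alpha, beta, gamma in product(grid, repeat=3):
        cost = alpha * beta ** 2 * gamma ** 2
        if abs(cost - target) <= tolerance:
            keep.append((alpha, beta, gamma))
    return keep

candidates = candidate_coefficients()
print(f"{len(candidates)} candidates satisfy the FLOPS constraint")
print("paper's values included:", (1.2, 1.1, 1.15) in candidates)

# Next step (not shown): for each candidate, build the baseline scaled with
# phi=1, train it on your dataset, and keep the triple with the best
# validation accuracy.
```

Every candidate costs roughly the same FLOPS at φ=1, so the comparison between them is a fair one: you’re only choosing how to spend a fixed budget.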

Finally, don’t forget about training time. A compound-scaled model is more efficient for its performance, but it’s still bigger than your baseline, so each epoch will take longer. You may reach a given accuracy in fewer epochs, but that isn’t guaranteed, and your wall-clock time might not decrease. You’re trading raw compute time for a better model, not necessarily for a faster training cycle. It’s about precision, not just speed.