35.5 Progressive GAN, StyleGAN, and BigGAN

Right, let’s get into the good stuff. You’ve got the basics of GANs down—the generator and discriminator locked in their eternal, adversarial dance. It was a brilliant idea, but you quickly hit a wall: scaling them up to generate high-resolution images (say, 1024x1024) was like trying to build a skyscraper out of toothpicks. The training was unstable, the results were often a horrifying mess, and the whole process felt like it was held together with duct tape and hope. This is where the big brains at NVIDIA came in and changed the game.

The Genius of Progressive Growing

The core problem with naively training a giant GAN is that asking a generator to conjure a photorealistic 1024x1024 face from pure noise in one go is, frankly, absurd. It’s like expecting a toddler to compose a symphony before they’ve learned to hum a tune. Progressive GANs solved this with a beautifully simple idea: start small and grow.

Instead of starting at the target resolution, we begin training a very small generator and discriminator, say on 4x4 images. At this resolution, the task is trivial. The network quickly learns to generate basic blobs of color with vaguely correct structure. Once it’s stable at that level, we smoothly add new layers to both networks that progressively double the resolution to 8x8, then 16x16, and so on, all the way up to our target.

Here’s the key: we don’t just slap on new layers and switch. We fade them in gradually. During the fading phase, the previous, lower-resolution output is still active: it is upsampled and blended with the output of the new, higher-resolution layers, much like a residual connection whose weight alpha ramps from 0 to 1. The discriminator mirrors this on its input side. This means the network always has a stable, trained representation to fall back on while it’s learning the new, finer-grained details. It’s the machine learning equivalent of training wheels. This approach made training far more stable and produced some of the first coherent 1024x1024 faces to come out of a GAN.

# A conceptual look at the fade-in block used in Progressive GAN.
# This is a simplified version to illustrate the core idea. The two paths are
# blended in image (RGB) space, so their channel counts always match.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FadingBlock(nn.Module):
    def __init__(self, in_channels, out_channels, img_channels=3):
        super().__init__()
        # The new convolutional layers for the higher resolution
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, 1, 1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1)
        self.upsample = nn.Upsample(scale_factor=2, mode='nearest')
        # 1x1 "to RGB" layers: one for the old stage, one for the new stage
        self.to_rgb_old = nn.Conv2d(in_channels, img_channels, 1)
        self.to_rgb_new = nn.Conv2d(out_channels, img_channels, 1)

    def forward(self, x, alpha):
        # x: feature maps from the previous, stable stage
        # alpha: blending factor, goes from 0 to 1 over many training iterations
        up = self.upsample(x)

        # The "old" path: project the upsampled stable features straight to an image
        old_out = self.to_rgb_old(up)

        # The "new" path: process through the new layers, then project to an image
        new_out = F.leaky_relu(self.conv1(up), 0.2)
        new_out = F.leaky_relu(self.conv2(new_out), 0.2)
        new_out = self.to_rgb_new(new_out)

        # Blend the two paths together
        return (1 - alpha) * old_out + alpha * new_out

# During training, you'd slowly increment alpha from 0 to 1 over thousands of iterations.
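
The alpha schedule itself is just bookkeeping. Here is a minimal sketch, assuming a plain linear ramp; the helper name `fade_alpha` and the parameter `fade_iters` are made up for illustration, and the length of the fade is a hyperparameter you would tune rather than a fixed rule.

```python
def fade_alpha(step, fade_iters):
    """Linear fade-in factor for a newly added resolution block.

    step: iterations since the new block was added (hypothetical counter).
    fade_iters: how many iterations the fade should last (assumed hyperparameter).
    Returns 0.0 at step 0 and 1.0 once step >= fade_iters.
    """
    if fade_iters <= 0:
        # No fade requested: the new block is fully active immediately
        return 1.0
    return min(1.0, max(0.0, step / fade_iters))


# Example: a 10,000-iteration fade, sampled at a few points
for step in (0, 2_500, 10_000, 15_000):
    print(step, fade_alpha(step, 10_000))
```

In a training loop you would compute this once per iteration and pass it as the `alpha` argument of the block's forward pass.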

StyleGAN: Throwing the Rulebook Out the Window

Just when we thought Progressive GAN was the pinnacle, the same team dropped StyleGAN and basically showed us we were doing it all wrong. They took the progressive idea and supercharged it with a fundamentally new architecture. The biggest breakthrough was the separation of style and noise.

The classic GAN feeds a random latent vector z directly into the generator. StyleGAN first passes z through a mapping network (an MLP) to get a vector w in an intermediate latent space W. At each layer of the generator, a learned affine transform turns w into per-channel scale and bias values, which are applied through Adaptive Instance Normalization (AdaIN). Think of w as the overall artistic style instructions: pose, hairstyle, face shape. This is the “global” control.
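
To make the z-to-w-to-AdaIN pipeline concrete, here is a minimal sketch. The class names, the small layer count, and the `1 + scale` parameterization are choices made for this illustration, not a reproduction of the official code (the paper's mapping network is an 8-layer MLP).

```python
import torch
import torch.nn as nn


class MappingNetwork(nn.Module):
    """Small MLP mapping z to w. Two layers here for brevity (illustrative only)."""

    def __init__(self, latent_dim=64, num_layers=2):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)


class AdaIN(nn.Module):
    """Normalize each feature map, then rescale and shift it using the style w."""

    def __init__(self, latent_dim, channels):
        super().__init__()
        # Instance norm without its own learnable scale/bias; the style supplies them
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        # Learned affine transform: w -> (scale, bias) for every channel
        self.affine = nn.Linear(latent_dim, 2 * channels)

    def forward(self, x, w):
        scale, bias = self.affine(w).chunk(2, dim=1)
        scale = scale[:, :, None, None]  # reshape to (batch, channels, 1, 1)
        bias = bias[:, :, None, None]
        # "1 + scale" keeps the modulation centered on identity (a sketch choice)
        return (1 + scale) * self.norm(x) + bias


# Usage: style a batch of 8-channel feature maps with w vectors
mapping = MappingNetwork(latent_dim=64)
adain = AdaIN(latent_dim=64, channels=8)
w = mapping(torch.randn(4, 64))
styled = adain(torch.randn(4, 8, 16, 16), w)
```

Because every layer gets its own affine transform of the same w, different layers can react to the style differently, which is what lets coarse layers steer pose while fine layers steer color and texture.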

But here’s the clever part: they also add per-pixel noise after each convolutional layer, scaled by a learned per-channel factor. This noise is freshly sampled for every image. Why? Because it provides the variation for the local details: the exact placement of every pore, every strand of hair, the subtle skin texture. The style (w) says “this region has freckles,” and the noise actually places each individual freckle. This is why StyleGAN images don’t look plastic; they have a stunning, lifelike stochastic texture.
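
The noise input is easy to sketch as well. This is a minimal illustration, assuming a single-channel noise image that is broadcast across feature maps and scaled by one learned weight per channel; the class name `NoiseInjection` and the zero initialization are choices made for this example.

```python
import torch
import torch.nn as nn


class NoiseInjection(nn.Module):
    """Add per-pixel Gaussian noise to feature maps with a learned per-channel scale."""

    def __init__(self, channels):
        super().__init__()
        # One learned scale per channel; starting at zero means noise is initially off
        self.weight = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x, noise=None):
        if noise is None:
            # Fresh noise for every call: one value per pixel, shared across channels
            b, _, h, w = x.shape
            noise = torch.randn(b, 1, h, w, device=x.device, dtype=x.dtype)
        return x + self.weight * noise


# Usage: apply to the output of a convolution
inject = NoiseInjection(channels=8)
features = torch.randn(2, 8, 16, 16)
out = inject(features)
```

Passing an explicit `noise` tensor is handy when you want to hold the fine details fixed while varying the style, or the other way around.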

BigGAN: The Brute Force (But Brilliant) Approach

While NVIDIA was rearchitecting the generator, the folks at DeepMind took a different approach: “What if we just scaled the hell out of the existing architecture?” BigGAN is a testament to the raw power of massive compute and big batches.

They made three key changes:

Ludicrous Batch Size: They used batches up to 2048 images. This gives the discriminator a huge, comprehensive view of the data distribution every single step, making its gradients incredibly informative for the generator.
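Model Scaling: They increased the number of filters in each layer (the “channel width”) and used a “shared embedding” for the class conditioning: a single class embedding is linearly projected to the gains and biases of each conditional batch-norm layer, instead of every layer keeping its own embedding, which is far more parameter-efficient.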
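Orthogonal Regularization: They applied a relaxed form of orthogonal regularization to the generator’s weights, which penalizes similarity between filters without constraining their norms. Its main job in the paper is to make the generator behave well under the truncation trick (discussed below). Keeping such a huge model from spectacularly exploding leans more on spectral normalization, and even then training eventually collapses.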
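
The orthogonal regularization idea can be sketched as a penalty on the off-diagonal entries of each weight matrix’s Gram matrix. This is a minimal illustration in that spirit, not the paper’s code; the function name `orthogonal_penalty` and the default `beta` are placeholders.

```python
import torch
import torch.nn as nn


def orthogonal_penalty(model, beta=1e-4):
    """Sum of squared off-diagonal entries of W W^T over all weight matrices.

    Each filter is flattened into one row of W, so off-diagonal entries of
    W W^T are dot products between different filters. Diagonal entries
    (the filter norms) are deliberately left unpenalized.
    beta is an assumed hyperparameter, not a recommended value.
    """
    penalty = 0.0
    for param in model.parameters():
        if param.ndim < 2:
            continue  # skip biases and other 1-D parameters
        w = param.reshape(param.shape[0], -1)
        gram = w @ w.t()
        mask = 1.0 - torch.eye(gram.shape[0], device=gram.device, dtype=gram.dtype)
        penalty = penalty + (gram * mask).pow(2).sum()
    return beta * penalty


# Usage: add the penalty to the generator loss before calling backward()
toy_generator = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
reg = orthogonal_penalty(toy_generator)
```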

The results were staggering. BigGAN produced class-conditional ImageNet images at up to 512x512 with a variety and fidelity that blew everything else at the time out of the water. The trade-off? Training was notoriously unstable. The paper devotes a whole section to analyzing that instability and openly admits that training eventually collapses, so in practice you stop early and keep a checkpoint from before the collapse. The truncation trick, covered next, is a separate sampling-time knob for trading diversity against fidelity. It’s the quintessential “go out in a blaze of glory” model.

Common Pitfalls and the Truncation Trick

All of these models share a common GAN pitfall: mode collapse, where the generator finds one or a few “good” outputs and just spams them. BigGAN adds its own flavor of failure: the complete training collapse that sets in if you train long enough. But a more subtle issue is that samples from the tails of the latent distribution often map to “weird,” low-quality outputs you wouldn’t typically see in the training data.

This is where the truncation trick comes in. Instead of sampling z from the full normal distribution, z ~ N(0, 1), you sample from a truncated normal: draw from N(0, 1) and resample any component whose magnitude exceeds a threshold (0.7, say) until it falls inside. This pulls your samples towards the mean of the distribution, which typically corresponds to more average, higher-quality, and safer outputs. It’s a classic trade-off: better average quality at the cost of some diversity. In StyleGAN, you do the truncation in the W space, not the Z space: each w is pulled towards the average w by a factor psi, w' = w_avg + psi * (w - w_avg). You’ll see this in almost every pre-trained model’s inference code.

# A simple implementation of the truncation trick for sampling
import torch

def generate_with_truncation(generator, batch_size, latent_dim,
                             truncation=0.7, mean=0.0, std=1.0):
    """
    generator: Your trained generator model
    batch_size, latent_dim: shape of the latent batch to sample
    truncation: Threshold in units of std; must be > 0. Smaller values mean
        stronger truncation; 0.7 is just an example value.
    mean, std: of the latent distribution the generator was trained on.
    """
    # Sample in standard-normal units first
    eps = torch.randn(batch_size, latent_dim)

    # Resample only the entries that fall outside the threshold.
    # (Rejecting the whole batch at once would almost never terminate.)
    mask = eps.abs() > truncation
    while mask.any():
        eps[mask] = torch.randn(int(mask.sum()))
        mask = eps.abs() > truncation

    # Shift and scale to the generator's latent distribution
    z = eps * std + mean

    # Generate the image
    with torch.no_grad():
        output = generator(z)
    return output

# In practice for StyleGAN, you'd do this on the w-vectors, not the z-vectors.
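
For the W-space version, here is a minimal sketch. It assumes you have access to the mapping network and estimates the average w by sampling; real implementations often track that average during training instead. The helper names and the stand-in mapping network are hypothetical, there only so the example runs end to end.

```python
import torch
import torch.nn as nn


def truncate_w(w, w_avg, psi=0.7):
    """Pull w towards the average latent.

    psi = 1 leaves w unchanged; psi = 0 collapses every sample onto w_avg.
    """
    return w_avg + psi * (w - w_avg)


@torch.no_grad()
def estimate_w_avg(mapping, latent_dim, num_samples=10_000):
    """Estimate the mean of w by pushing many random z vectors through the mapping network."""
    z = torch.randn(num_samples, latent_dim)
    return mapping(z).mean(dim=0, keepdim=True)


# Stand-in mapping network (not a trained model)
mapping = nn.Sequential(nn.Linear(16, 16), nn.LeakyReLU(0.2), nn.Linear(16, 16))

w_avg = estimate_w_avg(mapping, latent_dim=16)
with torch.no_grad():
    w = mapping(torch.randn(4, 16))
w_trunc = truncate_w(w, w_avg, psi=0.7)  # feed this to the synthesis network
```

Unlike the z-space version, nothing is rejected or resampled here: every w is simply interpolated towards the center, so the cost is a single multiply-add per sample.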