35.2 Variational Autoencoders (VAE): Latent Space and ELBO

Right, so you’ve heard of autoencoders, the charmingly simple neural networks that learn to copy their input to their output, squeezing it through a “bottleneck” layer in the middle. Cute, but not much use for generation. Nothing in the training objective says what the latent space should look like between the codes of the training examples, so if you pick a random point and decode it, you tend to get a blurry, incoherent mess rather than a new face. Not exactly what we’re after.

The Variational Autoencoder (VAE) is the clever fix to this. It doesn’t just learn a compressed representation (a code); it learns a probability distribution for that code. Instead of outputting a single vector for an input image, the encoder outputs two vectors: one for the mean (mu) and one for the standard deviation (sigma) of a Gaussian distribution. In practice, as in the code below, the second vector is usually the log-variance, from which sigma is recovered. We then sample from this distribution to get our actual latent code z. This stochasticity, together with the regularizer described in the next section, is the magic sauce. It pushes the latent space towards being continuous and meaningful: each input now claims a small region rather than a single point, and those regions are encouraged to overlap around the origin. So if you wander around that space and decode a point, you should get a coherent output. Far fewer blobs.

The Heart of the Matter: The ELBO

This whole “let’s learn a distribution” thing introduces a massive problem: how do you backpropagate through a random sampling operation? You can’t take the derivative of a random number. The answer is the reparameterization trick, and it’s so brilliant it deserves a slow clap.

Instead of sampling z directly from N(mu, sigma^2), we shift the randomness to a separate input that does not depend on the network’s parameters. We sample a noise variable epsilon from a standard normal distribution N(0, 1) and compute:

z = mu + sigma * epsilon

Now, the path from our parameters (mu, sigma) to the output z is completely deterministic! The stochasticity comes from the input epsilon, which we treat as a constant during the backward pass. This makes backpropagation possible: gradients of the loss flow through z back into mu and sigma, and from there into the encoder weights.
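To make the trick concrete without any framework, here is a minimal sketch in plain Python. The function name `reparameterize` and the values chosen for `mu` and `sigma` are illustrative assumptions, not part of any library. It checks two things: samples built this way really do follow N(mu, sigma^2), and for a fixed epsilon the derivatives of z are simply 1 (with respect to mu) and epsilon (with respect to sigma).

```python
import random
import statistics

def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps: deterministic in (mu, sigma) once eps is fixed."""
    return mu + sigma * eps

rng = random.Random(0)          # fixed seed so the run is repeatable
mu, sigma = 1.5, 0.5            # illustrative values, not from the text

# 1) Samples built from standard-normal noise follow N(mu, sigma^2).
eps_samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
z_samples = [reparameterize(mu, sigma, e) for e in eps_samples]
sample_mean = statistics.fmean(z_samples)
sample_std = statistics.pstdev(z_samples)

# 2) With eps held fixed, z is an ordinary differentiable function of
#    mu and sigma. Central finite differences approximate the gradients.
eps = eps_samples[0]
h = 1e-6
dz_dmu = (reparameterize(mu + h, sigma, eps)
          - reparameterize(mu - h, sigma, eps)) / (2 * h)
dz_dsigma = (reparameterize(mu, sigma + h, eps)
             - reparameterize(mu, sigma - h, eps)) / (2 * h)

print(f"sample mean {sample_mean:.3f}, sample std {sample_std:.3f}")
print(f"dz/dmu {dz_dmu:.6f}, dz/dsigma {dz_dsigma:.6f}, eps {eps:.6f}")
```

An autograd framework does the second part for you; the point here is only that nothing random sits between the parameters and z.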

But why would it learn to do anything useful? This is where the Evidence Lower BOund (ELBO) comes in. The ELBO is the objective function we maximize, and it’s a work of art. It has two terms that are constantly at war with each other:

Reconstruction Loss: This is the familiar “how badly did we mess up the copy?” part, typically the Mean Squared Error or Binary Cross-Entropy between the input and the output. These are the negative log-likelihoods of a Gaussian and a Bernoulli decoder respectively (MSE up to a scale and a constant), so maximizing the ELBO means minimizing this error. It wants the decoder to get really good at reconstructing the input from the sampled z.
KL Divergence Loss: This is the regularizer. It measures how much the learned distribution (defined by mu and sigma) diverges from our prior—which we conveniently set as the standard normal distribution N(0, 1). It pushes the learned distributions for all our data points towards the center of the latent space, preventing them from scattering wildly and ensuring the space remains continuous and navigable.

The ELBO is therefore a trade-off: “Be good at reconstructing your input, but don’t cheat by making your latent distributions too specialized and far from the standard normal.” This tug-of-war is what creates a well-structured latent space.
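For reference, the two terms described above can be written out as follows. The subscripts are the conventional notation and an assumption here, since the text does not name them: phi for the encoder parameters, theta for the decoder parameters, and d for the latent dimension. The second line is the closed-form KL term for a diagonal Gaussian against the standard normal prior.

```latex
\log p_\theta(x) \;\ge\;
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction term}}
\;-\;
\underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{KL regularizer}}

D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))\,\|\,\mathcal{N}(0, I)\big)
= -\tfrac{1}{2}\sum_{j=1}^{d}\big(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\big)
```

The right-hand side of the first line is the ELBO. Training minimizes its negative, which is why the reconstruction error and the KL divergence are added together in a loss function.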

Here’s the code reality of this, because the math is useless without it:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, latent_dim):
        super().__init__()
        self.latent_dim = latent_dim

        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(784, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
        )
        # These layers output the mean and log-variance (more stable than variance)
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)

        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, 784),
            nn.Sigmoid() # Because we're using BCE loss on normalized pixels
        )

    def encode(self, x):
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        # The famous reparameterization trick
        std = torch.exp(0.5 * logvar) # Convert log-variance to standard deviation
        eps = torch.randn_like(std)    # Sample noise from standard normal
        return mu + eps * std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, 784))
        z = self.reparameterize(mu, logvar)
        x_recon = self.decode(z)
        return x_recon, mu, logvar

# The loss function (the negative ELBO, summed over the batch)
def vae_loss(recon_x, x, mu, logvar, beta=1.0):
    # Reconstruction loss (assuming pixel values between 0-1)
    BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction='sum')

    # KL divergence loss (closed-form solution for KL(N(mu, sigma^2) || N(0, 1)))
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    # Combine them. beta=1 gives the standard VAE objective; other values
    # reweight the KL term, which is the "beta-VAE" extension.
    return BCE + beta * KLD
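If the closed-form KL expression looks like it fell from the sky, it can be checked numerically. The following is a rough sketch in plain Python for a single latent dimension. The helper names `kl_closed_form`, `normal_logpdf`, and `kl_numerical` are made up for this illustration. It compares the formula used in the loss against a brute-force trapezoid integration of q(x) * (log q(x) - log p(x)).

```python
import math

def kl_closed_form(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)) for one dimension, as in the loss."""
    return -0.5 * (1.0 + logvar - mu ** 2 - math.exp(logvar))

def normal_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(std)
            - 0.5 * ((x - mean) / std) ** 2)

def kl_numerical(mu, logvar, n=20001, width=10.0):
    """Trapezoid-rule estimate of the integral of q * (log q - log p)."""
    std = math.exp(0.5 * logvar)
    lo, hi = mu - width * std, mu + width * std
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        log_q = normal_logpdf(x, mu, std)
        log_p = normal_logpdf(x, 0.0, 1.0)
        weight = 0.5 if i in (0, n - 1) else 1.0   # trapezoid end points
        total += weight * math.exp(log_q) * (log_q - log_p) * dx
    return total

mu, logvar = 0.7, -1.0          # arbitrary test values
closed = kl_closed_form(mu, logvar)
numeric = kl_numerical(mu, logvar)
print(f"closed form {closed:.6f}, numerical {numeric:.6f}")
```

The two numbers should agree to several decimal places, and the divergence is zero exactly when mu = 0 and logvar = 0, that is, when the encoded distribution equals the prior.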

The Blurry Truth and The Beta Trick

You’ll notice something immediately about VAE outputs: they can be kind of… blurry. Two things contribute. The pixel-wise reconstruction loss rewards the decoder for hedging: when several sharp outputs are plausible for a given z, their average scores better than committing to any one of them. And the KL term limits how much detail about each input can be packed into z. It’s the price of an orderly latent space.

This is where the beta factor in the loss function comes in. In a standard VAE, beta=1. By making beta > 1, you weight the KL term more heavily, which tends to encourage a more disentangled latent space (where each dimension controls a single, independent feature of the data), usually at some cost in reconstruction quality. By making beta < 1, you weight the reconstruction term more, leading to sharper reconstructions but a potentially messier latent space, and so worse samples when you decode points drawn from the prior. Tuning beta is a dark art. Start with 1 and adjust based on whether you care more about reconstruction quality or latent space structure.

The other classic “oops” moment is the KL vanishing problem (also called posterior collapse), where the network drives the KL term to nearly zero by making every encoded distribution match the prior. At that point z carries no information about the input, and the decoder learns to ignore it and produce the same averaged output for everything. Using a beta schedule that starts low and increases (KL annealing) can help the model learn to use the latent space effectively before being strongly regularized.
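One simple form such a schedule can take is a linear warm-up. This is only a sketch: the function name `linear_beta_schedule` and the parameters `warmup_steps` and `beta_max` are assumptions for illustration, and other shapes (cyclical, sigmoid) are also used in practice.

```python
def linear_beta_schedule(step, warmup_steps, beta_max=1.0):
    """Ramp beta linearly from 0 to beta_max over warmup_steps, then hold."""
    if warmup_steps <= 0:
        return beta_max                      # no warm-up requested
    return beta_max * min(1.0, step / warmup_steps)

# Print beta at a few points of a hypothetical 1000-step warm-up.
for step in (0, 250, 500, 1000, 5000):
    print(step, linear_beta_schedule(step, warmup_steps=1000))
```

In a training loop, the value returned for the current step would be passed as the weight on the KL term of the loss. Early on the model is trained almost purely for reconstruction, which gives the decoder a reason to rely on z before the regularizer reaches full strength.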

So, are VAEs the ultimate generative model? For pristine, high-resolution generation, diffusion models have largely taken the crown. But the VAE’s true legacy is that beautiful, structured, and continuous latent space. It’s a probabilistic map of your data, and that is incredibly powerful for more than just generation—it’s used for interpolation, anomaly detection, and as a foundation for more complex models. It taught us how to wrestle with probability in a neural network, and we’re still using its tricks today.