35.1 Generative Modeling: Density Estimation and Sampling

Right, let’s get this straight. You want a machine to create something from nothing. Not just anything, but something that looks plausibly like it belongs in our world—a human face, a cat picture, a sonnet. This isn’t magic; it’s generative modeling. And at its core, it’s a beautifully twisted statistical problem.

Think of it like this: we have a universe of all possible data (say, all possible 64x64 images). Our real data—actual pictures of cats—lives in a tiny, complex, and utterly unknown region of this universe. We call this the true data distribution. Our job is to build a model that can a) understand the shape of this tiny region (density estimation) and then b) point to a random spot inside it (sampling).

The Core Idea: Learning the Data’s “Shape”

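Fundamentally, a generative model is trying to learn the probability density function ( p(x) ) of your data. If I give it a random blob of pixels, a good generative model should assign that blob a vanishingly small density; if I give it a proper cat photo, it should assign a comparatively high one. Note that ( p(x) ) is not “the probability that this is a cat” (that would be a classifier’s job). It measures how plausible the whole input is under the data distribution. This is density estimation.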

But we don’t just want to judge; we want to create. So we also need sampling. This means asking the model to draw a new vector ( x ) from the ( p(x) ) it has learned, so that plausible images come out often and implausible ones almost never. It’s the difference between an art critic and an artist. We need our model to be both.

The problem? That true distribution ( p(x) ) is insanely complicated. We have to approximate it with a simpler, parameterized model ( p_{\theta}(x) ) (e.g., a neural network). We then tune the parameters ( \theta ) to make ( p_{\theta}(x) ) as close as possible to ( p(x) ). This is the heart of the whole game.
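To see both jobs in the smallest possible setting, here is a minimal sketch in plain Python. It assumes a toy one-dimensional dataset and uses a single Gaussian as the parameterized model, so ( \theta ) is just a mean and a standard deviation. The function names and the toy data are illustrative assumptions, not part of any library.

```python
import math
import random

def fit_gaussian(data):
    """Maximum-likelihood estimate of (mean, std) for a 1-D Gaussian."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n  # MLE uses 1/n, not 1/(n-1)
    return mean, math.sqrt(var)

def log_density(x, mean, std):
    """log p_theta(x) under the fitted Gaussian: the density estimation half."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def sample(mean, std, rng):
    """Draw a new point from p_theta: the sampling half."""
    return rng.gauss(mean, std)

rng = random.Random(0)
# Toy stand-in for the "true" distribution: a Gaussian with mean 3 and std 2.
data = [rng.gauss(3.0, 2.0) for _ in range(5000)]
mean, std = fit_gaussian(data)

typical = log_density(3.0, mean, std)    # a point near the bulk of the data
atypical = log_density(10.0, mean, std)  # a point far out in the tail
new_point = sample(mean, std, rng)
```

Here `typical` comes out larger than `atypical`, which is the critic at work, and `new_point` is the artist. Real image models swap the Gaussian for a neural network, but the two roles stay the same.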

Explicit Density Models: The Meticulous Cartographers

These models try to explicitly define and compute ( p_{\theta}(x) ). They’re like cartographers meticulously mapping the territory.

The Challenge of Intractable Normalization

Here’s the first big wall we hit. To be a valid probability distribution, ( p_{\theta}(x) ) must integrate to 1 over all possible ( x ). This means we need a normalization constant, often called the partition function Z: ( p_{\theta}(x) = \frac{\tilde{p}_{\theta}(x)}{Z} ), where ( \tilde{p}_{\theta}(x) ) is our unnormalized model.

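Calculating ( Z = \int \tilde{p}_{\theta}(x) dx ) is a nightmare for high-dimensional data like images. By brute force it’s completely intractable. Latent-variable models hit a close cousin of the same problem: the marginal ( p_{\theta}(x) = \int p_{\theta}(x|z) p(z) dz ) is another integral nobody can compute exactly. This is where VAEs make a brilliant, if slightly compromising, move.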
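To make the blow-up concrete, here is a small sketch under stated assumptions: a made-up unnormalized model over binary vectors (neighboring bits that agree make a state more likely), normalized by brute force. The model and the function names are hypothetical and chosen only because they fit in a few lines.

```python
import itertools
import math

def unnormalized_p(x, coupling=0.5):
    """Unnormalized score: each pair of agreeing neighbors multiplies it by e^coupling."""
    agree = sum(1 for a, b in zip(x, x[1:]) if a == b)
    return math.exp(coupling * agree)

def partition_function(n):
    """Brute-force Z: a sum over all 2**n binary vectors of length n."""
    return sum(unnormalized_p(x) for x in itertools.product((0, 1), repeat=n))

n = 10
Z = partition_function(n)            # 2**10 = 1,024 terms
p_zeros = unnormalized_p((0,) * n) / Z   # normalized probability of the all-zeros state

# Sanity check: the normalized probabilities sum to 1.
total = sum(unnormalized_p(x) / Z for x in itertools.product((0, 1), repeat=n))
num_terms_for_image = 2 ** (64 * 64)  # terms needed for a 64x64 binary image
```

Ten bits is fine. A 64x64 binary image needs ( 2^{4096} ) terms, a number with over a thousand digits, and real images aren’t even binary.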

Variational Autoencoders (VAEs): The Clever Compromisers

VAEs don’t try to learn ( p(x) ) directly. Instead, they introduce a latent variable ( z ) (e.g., a random vector) and model the joint distribution ( p(x, z) ). They’re essentially saying, “Let’s assume every image ( x ) is generated from a simpler, lower-dimensional latent code ( z ).”

The goal is to maximize the log-likelihood ( \log p(x) ), but because that’s intractable, we use variational inference. We introduce an approximate posterior distribution ( q_{\phi}(z | x) ) (the encoder) to approximate the true posterior ( p(z | x) ). This leads to the Evidence Lower BOund (ELBO):

$$\log p(x) \geq \mathbb{E}_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x) \parallel p(z))$$

This is genius. We now have a tractable objective to maximize:

Reconstruction loss: The first term rewards the decoder ( p_{\theta}(x|z) ) for reconstructing the input ( x ) accurately from the latent code ( z ). Its negative is what we minimize as the reconstruction loss.
Regularization loss: The KL divergence term pushes the encoder’s distribution ( q_{\phi}(z|x) ) to be close to a simple prior (usually a standard Normal distribution, ( p(z) = \mathcal{N}(0, I) )).
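The KL term has a closed form when the encoder outputs a diagonal Gaussian and the prior is a standard Normal, which is why VAE code can compute it without sampling. The sketch below checks that formula for a single latent dimension against a Monte Carlo estimate. It is plain Python with hypothetical function names, and the test values for the mean and log-variance are arbitrary.

```python
import math
import random

def kl_closed_form(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ) for one latent dimension."""
    return -0.5 * (1.0 + log_var - mu ** 2 - math.exp(log_var))

def kl_monte_carlo(mu, log_var, n=200_000, seed=0):
    """Estimate the same KL as E_q[log q(z) - log p(z)] using samples from q."""
    rng = random.Random(seed)
    std = math.exp(0.5 * log_var)
    log_norm = -0.5 * math.log(2 * math.pi)
    total = 0.0
    for _ in range(n):
        z = mu + std * rng.gauss(0.0, 1.0)
        log_q = log_norm - 0.5 * log_var - (z - mu) ** 2 / (2 * std ** 2)
        log_p = log_norm - 0.5 * z ** 2
        total += log_q - log_p
    return total / n

exact = kl_closed_form(1.0, -1.0)
estimate = kl_monte_carlo(1.0, -1.0)
at_prior = kl_closed_form(0.0, 0.0)  # q equals the prior, so the KL is zero
```

The two numbers agree to within sampling noise, and the KL is exactly zero when the encoder outputs the prior itself. For a multi-dimensional latent code with independent dimensions, the per-dimension values simply add up.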

Here’s a simplified TensorFlow/Keras example of a VAE encoder and loss function:

import tensorflow as tf
from tensorflow.keras import layers, Model

latent_dim = 2  # Size of the latent code z (assumed value for illustration)

# Encoder: maps an image to the mean and log-variance of q(z|x)
encoder_inputs = layers.Input(shape=(28, 28, 1))
x = layers.Conv2D(32, 3, activation="relu", strides=2, padding="same")(encoder_inputs)
x = layers.Conv2D(64, 3, activation="relu", strides=2, padding="same")(x)
x = layers.Flatten()(x)
z_mean = layers.Dense(latent_dim, name="z_mean")(x)
z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)

# Sampling layer (The "Reparameterization Trick")
def sampling(args):
    z_mean, z_log_var = args
    batch_size = tf.shape(z_mean)[0]
    epsilon = tf.random.normal(shape=(batch_size, latent_dim))
    return z_mean + tf.exp(0.5 * z_log_var) * epsilon # This is the key!

z = layers.Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_var])
encoder = Model(encoder_inputs, [z_mean, z_log_var, z], name="encoder")

# Define the VAE loss (the negative ELBO, averaged over the batch).
# z_mean and z_log_var are the encoder outputs for this batch; they are
# passed in explicitly rather than captured from the symbolic graph above.
def vae_loss(input_img, reconstructed, z_mean, z_log_var):
    # 1. Reconstruction Loss (Binary Cross-Entropy or MSE)
    batch_size = tf.shape(input_img)[0]
    bce = tf.keras.losses.binary_crossentropy(
        tf.reshape(input_img, (batch_size, -1)),
        tf.reshape(reconstructed, (batch_size, -1)),
    )  # Mean over pixels, shape (batch,)
    # Turn the per-pixel mean into a per-image sum so it is on the KL's scale
    reconstruction_loss = tf.reduce_mean(bce) * 28 * 28

    # 2. KL Divergence Loss: sum over latent dimensions, mean over the batch
    kl_loss = -0.5 * tf.reduce_sum(
        1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1
    )
    return reconstruction_loss + tf.reduce_mean(kl_loss)

The reparameterization trick is the secret sauce here. It lets us backpropagate through random sampling by making it deterministic save for a random input epsilon.
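A quick way to convince yourself that gradients really do flow through the sample is to pick a function whose expected value is known. The sketch below is plain Python with no autodiff, and the function name is hypothetical. It uses ( f(z) = z^2 ) with ( z = \mu + \sigma \epsilon ), applies the chain rule by hand, and averages over many draws of ( \epsilon ).

```python
import random

def reparam_gradients(mu, sigma, n=100_000, seed=0):
    """Monte Carlo gradients of E[z**2] w.r.t. mu and sigma, with z = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)  # all the randomness lives here
        z = mu + sigma * eps       # deterministic in mu and sigma
        df_dz = 2.0 * z            # derivative of f(z) = z**2
        grad_mu += df_dz * 1.0     # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps  # chain rule: dz/dsigma = eps
    return grad_mu / n, grad_sigma / n

g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5)
# Analytically E[z**2] = mu**2 + sigma**2, so the true gradients are 2*mu and 2*sigma.
```

The estimates land close to the analytic values ( 2\mu = 3 ) and ( 2\sigma = 1 ). In a real VAE the framework’s autodiff does the chain rule for you; the point is only that once the noise is an input, nothing random sits between the parameters and the loss.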

The VAE’s Pitfall: That KL loss is a double-edged sword. It keeps the latent space well-behaved, but it also limits how much information each latent code can carry. Combined with a simple per-pixel decoder likelihood, this tends to produce overly smooth, blurry reconstructions and samples. When the KL term dominates, the model becomes too good at regularization and not good enough at creating sharp images. It’s the price of their mathematical elegance.

Implicit Density Models: The Forgers

This is where GANs come in. They say, “Forget explicitly modeling ( p(x) ); who needs a probability density anyway? All we care about is sampling.” They don’t map the territory; they learn to forge documents that look exactly like the real thing.

A GAN consists of two networks:

Generator (G): Takes random noise ( z ) and tries to generate a fake sample ( G(z) ).
Discriminator (D): Tries to distinguish between real samples from the data and fake samples from G.

This setup is a beautiful, adversarial min-max game. The generator is trying to minimize the discriminator’s ability to do its job, while the discriminator is trying to maximize it. The objective function is:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log (1 - D(G(z)))]$$
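It helps to evaluate this objective somewhere small enough to do by hand. The sketch below assumes a toy world with only three possible “images”, so both the data distribution and the generator’s output distribution are short lists of probabilities. The second expectation is written over ( x = G(z) ) directly rather than over ( z ). For a fixed generator, the best possible discriminator is ( D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_g(x)) ), a standard result for this objective. The function names and the example distributions are illustrative.

```python
import math

def value(p_data, p_gen, D):
    """V(D, G) = E_{x~p_data}[log D(x)] + E_{x~p_gen}[log(1 - D(x))]."""
    return sum(
        pd * math.log(D[x]) + pg * math.log(1.0 - D[x])
        for x, (pd, pg) in enumerate(zip(p_data, p_gen))
    )

def optimal_discriminator(p_data, p_gen):
    """Best response to a fixed generator: D*(x) = p_data(x) / (p_data(x) + p_gen(x))."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_gen)]

p_data = [0.5, 0.3, 0.2]
bad_gen = [0.1, 0.1, 0.8]    # generator far from the data
good_gen = [0.5, 0.3, 0.2]   # generator matches the data exactly

v_bad = value(p_data, bad_gen, optimal_discriminator(p_data, bad_gen))
v_good = value(p_data, good_gen, optimal_discriminator(p_data, good_gen))
v_clueless = value(p_data, bad_gen, [0.5, 0.5, 0.5])  # discriminator that always shrugs
```

When the generator matches the data, the best discriminator can only say 0.5 everywhere and the value bottoms out at ( -\log 4 \approx -1.386 ). Any mismatch gives the discriminator something to exploit, so `v_bad` sits above that floor. That gap is what the generator is trying to close.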

The GAN’s Pitfall: This game is notoriously unstable. You’re not optimizing a nice, smooth loss function. You’re balancing two networks in a contest where one’s gain is the other’s loss. It’s like training two boxers by having them punch each other and hoping they both become world champions simultaneously. You often get mode collapse, where the generator finds a small handful of outputs that fool the discriminator (in the extreme case, one seemingly perfect dog) and just keeps producing those. Useful for producing one great forgery, useless for generating a diverse dataset.

Diffusion Models: The Masters of Gradual Destruction

Diffusion models take a completely different, almost philosophical approach. They don’t try to build an image from noise in a single leap. Instead, they meticulously learn how to reverse a process of destruction.

The forward process is fixed: we gradually add Gaussian noise to an image over many steps ( T ) until it becomes pure noise. This is trivial. The magic is in the reverse process. We train a neural network (usually a U-Net) to predict the noise that was added at a given step. Why is this brilliant? Because the network isn’t tasked with generating a perfect image from scratch in one go. It’s only ever asked to make a small, denoising correction. It’s the “how do you eat an elephant? One bite at a time” of generative models.
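Here is a minimal sketch of that forward process in plain Python. Two things are assumptions on my part rather than anything stated above: the linear noise schedule (per-step variances rising from 1e-4 to 0.02, a common choice) and the function names. Because each step adds Gaussian noise, the steps compose, and you can jump straight to step ( t ) with ( x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon ), where ( \bar{\alpha}_t ) is the running product of ( 1 - \beta_s ).

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T (assumed linear schedule)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t: how much signal survives."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def q_sample(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(x0, eps)]
    return xt, eps  # eps is the target the denoising network learns to predict

T = 1000
alpha_bars = cumulative_alphas(linear_beta_schedule(T))
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]  # a tiny 4-"pixel" image
x_early, eps_early = q_sample(x0, 10, alpha_bars, rng)
x_late, eps_late = q_sample(x0, T - 1, alpha_bars, rng)
```

Early on, `x_early` is still mostly the image. By the last step `alpha_bars[-1]` is nearly zero, so `x_late` is essentially pure noise. Training amounts to picking a random ( t ), calling something like `q_sample`, and asking the network to recover `eps` from the noisy result.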

By chaining these small, learned denoising steps, we can start with pure noise and gradually subtract the predicted noise to walk backwards to a clean image. The probability distribution is implicitly modeled through this iterative denoising procedure.
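The walk backwards can be sketched without training anything, under one loud assumption: the “dataset” is a single number ( c ). In that case the perfect noise prediction can be written down exactly, and it stands in for the trained network. The update rule is the DDPM-style ancestral sampling step, which is my choice of sampler for illustration. The schedule and all names are hypothetical as well.

```python
import math
import random

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Assumed linear schedule: returns (betas, alpha_bars)."""
    betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars

def oracle_noise(x_t, t, alpha_bars, c):
    """Exact noise prediction when every training point equals c.
    Stand-in for a trained network eps_theta(x_t, t)."""
    a = alpha_bars[t]
    return (x_t - math.sqrt(a) * c) / math.sqrt(1.0 - a)

def ancestral_sample(T, c, rng):
    betas, alpha_bars = make_schedule(T)
    x = rng.gauss(0.0, 1.0)  # start from pure noise
    for t in reversed(range(T)):
        eps_hat = oracle_noise(x, t, alpha_bars, c)
        # Remove a small slice of the predicted noise, then undo the step's shrinkage.
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(1.0 - betas[t])
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no fresh noise on the final step
        x = mean + math.sqrt(betas[t]) * noise
    return x

result = ancestral_sample(T=1000, c=2.0, rng=random.Random(0))
```

Starting from random noise, the loop lands on ( c = 2.0 ) up to floating-point error. With a real dataset the oracle is replaced by the U-Net and the endpoint is a fresh sample rather than a fixed value, but the loop itself has the same shape.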

Why they took over: They are vastly more stable to train than GANs (no adversarial game) and produce higher-quality, more diverse samples than VAEs typically manage. The trade-off? They are painfully slow at sampling because you need to run the network once for every step of the reverse process (often on the order of 1000 network evaluations per sample). It’s the most computationally expensive “one bite at a time” meal you’ll ever serve.