35.6 Denoising Diffusion Probabilistic Models (DDPM)

Alright, let’s get our hands dirty with Denoising Diffusion Probabilistic Models, or DDPMs. This is the paper that really kicked off the modern diffusion revolution, and for good reason. It’s a gloriously simple, almost brute-force idea that just works. Forget the complex adversarial training of GANs or the sometimes-blurry reconstructions of VAEs. Diffusion is all about systematically destroying your data with noise and then teaching a neural network to reverse the process. It’s like teaching someone to clean an incredibly messy room by only showing them how to make it slightly less messy, one step at a time.

The core intuition is that it’s much easier to guess what a slightly less noisy image should look like than to imagine an entire pristine image from pure chaos. Our model is just a denoiser. A very, very good one that’s been trained on every possible level of mess.

The Two-Process Dance: Forward and Reverse

The entire framework is built on two Markov chains: the forward process and the reverse process. The forward process is fixed; it’s not learned. We take our real image, x₀, and over T timesteps, we gradually add a tiny bit of Gaussian noise to it. This is defined by a variance schedule, β_t, which controls how much noise we add at each step. It starts small and usually gets a bit larger towards the end.
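
To make the forward chain concrete, here is a minimal sketch of a linear schedule and a single noising step. It assumes the standard DDPM transition q(x_t | x_{t-1}) = N(sqrt(1 - β_t) * x_{t-1}, β_t * I); the function names and the toy tensor shape are illustrative choices, not something fixed by the method.

```python
import torch

def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances β_1..β_T (endpoints are assumed defaults)."""
    return torch.linspace(beta_start, beta_end, timesteps)

def forward_step(x_prev, beta_t):
    """One forward step: x_t = sqrt(1 - β_t) * x_{t-1} + sqrt(β_t) * ε, with ε ~ N(0, I)."""
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise

# Toy run: push a constant "image" batch through the whole chain, one step at a time.
torch.manual_seed(0)
betas = linear_beta_schedule(1000)
x = torch.ones(4, 3, 8, 8)  # stand-in for a batch of real images
for beta_t in betas:
    x = forward_step(x, beta_t)
# By the end, x should look statistically like a draw from N(0, I).
```

Looping like this is only for illustration; it is exactly the sequential work that training avoids.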

The clever bit is that we don’t have to loop through all T steps to sample a noisy image at timestep t. We can do it in a single step thanks to the magic of closed-form expressions. Given x₀, we can directly compute x_t:

q(x_t | x_0) = N(x_t; sqrt(γ_t) * x_0, (1 - γ_t)I), where γ_t (often called α_bar_t) is the cumulative product γ_t = (1 - β_1)(1 - β_2)...(1 - β_t).
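
For readers who want to see where this comes from, here is a sketch of the derivation. It assumes the standard single-step transition x_t = sqrt(1 - β_t) * x_{t-1} + sqrt(β_t) * ε and uses the shorthand α_t = 1 - β_t, which is not defined in the text above. The key move is that two independent Gaussian noise terms merge into one whose variance is the sum of the two.

```latex
% Notation: \alpha_t = 1 - \beta_t, \gamma_t = \prod_{s=1}^{t} \alpha_s, and every \epsilon \sim \mathcal{N}(0, I).
\begin{aligned}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2}
       + \sqrt{\alpha_t (1-\alpha_{t-1})}\, \epsilon_{t-1}
       + \sqrt{1-\alpha_t}\, \epsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2}
       + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar{\epsilon}
       \qquad \text{since } \alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1 - \alpha_t \alpha_{t-1} \\
    &\;\;\vdots \\
    &= \sqrt{\gamma_t}\, x_0 + \sqrt{1-\gamma_t}\, \epsilon
\end{aligned}
```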

This is huge. During training, we can randomly sample a timestep t, noise the living daylights out of an image in one go, and then ask our model to predict the noise that was added. No sequential processing required.

import torch
import torch.nn as nn

def forward_diffusion_sample(x0, t, sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod):
    """
    Sample from q(x_t | x_0) in one step.
    Args:
        x0: Original image tensor (batch_size, channels, height, width)
        t: Timestep tensor (batch_size,)
        sqrt_alphas_cumprod: Precomputed sqrt(α_bar_t) for each t
        sqrt_one_minus_alphas_cumprod: Precomputed sqrt(1 - α_bar_t) for each t
    """
    # Gather the precomputed values for the given timesteps t
    sqrt_alpha_cumprod_t = sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
    sqrt_one_minus_alpha_cumprod_t = sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)

    # Generate random noise ε ~ N(0, I)
    noise = torch.randn_like(x0)

    # Compute x_t
    x_t = sqrt_alpha_cumprod_t * x0 + sqrt_one_minus_alpha_cumprod_t * noise
    return x_t, noise

The reverse process is where the learning happens. We train a neural network, typically a U-Net, to predict the noise ε we added in the forward process. It takes in the noisy image x_t and the timestep t (which is crucial so the network knows how noisy the image is) and tries to output ε.

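Why predict the noise and not the clean image directly? The two parameterizations are mathematically interchangeable, since x₀ can be recovered from x_t and ε. In practice, though, the DDPM authors found that predicting the noise gave better samples with the simple MSE objective, and it has been a common default ever since.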
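
The equivalence is just the closed-form expression solved for x₀. A minimal sketch, with a hypothetical helper name and an arbitrary noise level chosen for the demo:

```python
import torch

def predict_x0_from_noise(x_t, noise_pred, sqrt_alpha_cumprod_t, sqrt_one_minus_alpha_cumprod_t):
    """Invert x_t = sqrt(α_bar_t) * x0 + sqrt(1 - α_bar_t) * ε to get the implied x0."""
    return (x_t - sqrt_one_minus_alpha_cumprod_t * noise_pred) / sqrt_alpha_cumprod_t

# Sanity check: if the "prediction" is the true noise, we get the clean image back.
torch.manual_seed(0)
x0 = torch.randn(2, 3, 4, 4)
noise = torch.randn_like(x0)
alpha_bar_t = torch.tensor(0.5)  # arbitrary value, for illustration only
a, b = torch.sqrt(alpha_bar_t), torch.sqrt(1.0 - alpha_bar_t)
x_t = a * x0 + b * noise
x0_recovered = predict_x0_from_noise(x_t, noise, a, b)
```

With an imperfect noise prediction the recovered x₀ is only an estimate, and dividing by a small sqrt(α_bar_t) amplifies errors at high noise levels.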

The Training Loop and The “Oops” Moment

The training objective is beautifully simple. It’s basically a mean-squared error loss between the predicted noise and the actual noise we added.

def train_step(model, x0, t, sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod, loss_fn):
    # 1. Sample noise and create noisy image x_t
    x_t, noise = forward_diffusion_sample(x0, t, sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod)

    # 2. Get model prediction (this is the noise ε_θ)
    noise_pred = model(x_t, t) # Your U-Net needs to accept the timestep 't' as input!

    # 3. Calculate loss. Simple MSE.
    loss = loss_fn(noise_pred, noise)
    return loss

This is the “oops” moment for the network. We show it a horribly noisy image and it says, “Hmm, I think the noise in this looks like this.” And we just say, “You absolute fool, it looked like this.” Rinse and repeat a few million times.
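
Putting the pieces together, a training loop might look roughly like the sketch below. Everything here is a stand-in under stated assumptions: TinyDenoiser is a hypothetical toy network (not a U-Net), the batch is random data instead of a real DataLoader, and the linear schedule, learning rate, and step count are arbitrary. The part that matters is that each image gets its own uniformly sampled timestep.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Hypothetical stand-in for a U-Net: one conv block with an additive timestep bias."""
    def __init__(self, channels=3, hidden=16, timesteps=1000):
        super().__init__()
        self.timesteps = timesteps
        self.conv_in = nn.Conv2d(channels, hidden, 3, padding=1)
        self.t_proj = nn.Linear(1, hidden)
        self.conv_out = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x_t, t):
        # Crude timestep conditioning: project the normalized t and add it per channel.
        t_feat = self.t_proj(t.float().view(-1, 1) / self.timesteps)
        h = torch.relu(self.conv_in(x_t) + t_feat.view(-1, t_feat.shape[1], 1, 1))
        return self.conv_out(h)

torch.manual_seed(0)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # assumed linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # α_bar_t for every t
sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)

model = TinyDenoiser(timesteps=T)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

losses = []
for step in range(5):
    x0 = torch.randn(8, 3, 16, 16)           # placeholder batch; real code would use a DataLoader
    t = torch.randint(0, T, (x0.shape[0],))  # one uniformly random timestep per image
    noise = torch.randn_like(x0)
    x_t = (sqrt_alphas_cumprod[t].view(-1, 1, 1, 1) * x0
           + sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1) * noise)
    loss = loss_fn(model(x_t, t), noise)     # MSE between predicted and true noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```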

The Magic Trick: Sampling (Reverse Process)

Once the model is trained, we can generate new images by reversing the forward process. We start from pure noise, x_T ~ N(0, I), and then use our trained model to iteratively denoise it.

At each step from t = T down to t = 1, we:

Use the model to predict the noise ε_θ(x_t, t) in the current image.
Use this prediction to get a slightly less noisy image x_{t-1}. The original DDPM paper uses a stochastic sampling equation that also adds a tiny bit of fresh noise (scaled by a standard deviation σ_t). Each reverse step is therefore a sample from a learned Gaussian rather than a fixed mapping, and that injected randomness helps generate diverse samples.
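
Written out, the update in the second step looks like this in the DDPM paper's notation, where α_t = 1 - β_t and \bar{α}_t is the cumulative product called γ_t earlier. The choice of σ_t shown is the posterior variance; treat the exact choice as a design decision, not a fixed rule.

```latex
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}
          \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)
          + \sigma_t z,
\qquad z \sim \mathcal{N}(0, I) \text{ for } t > 1, \quad z = 0 \text{ for } t = 1

\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\, \beta_t
\qquad \text{(the paper reports that } \sigma_t^2 = \beta_t \text{ gives similar results)}
```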

@torch.no_grad()
def sample(model, num_samples, img_size, timesteps, betas, sqrt_recip_alphas, sqrt_one_minus_alphas_cumprod, posterior_variance):
    """
    Sample from the model using the reverse process.
    """
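    # Start from pure noise, on the same device as the schedule tensors
    # (assumes the model lives on that device too)
    device = betas.device
    x_t = torch.randn((num_samples, 3, img_size, img_size), device=device)

    for t in range(timesteps-1, -1, -1):
        # Create a tensor for this timestep for every sample in the batch
        t_batch = torch.full((num_samples,), t, dtype=torch.long, device=device)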

        # 1. Predict noise using the model
        noise_pred = model(x_t, t_batch)

        # 2. Compute the mean of p(x_{t-1} | x_t) from the predicted noise
        # See the paper for the derivation of this formula. It's not obvious!
        mean = sqrt_recip_alphas[t] * (x_t - betas[t] / sqrt_one_minus_alphas_cumprod[t] * noise_pred)

        # If we're at the last step, variance is 0 (deterministic)
        if t == 0:
            noise = 0
        else:
            noise = torch.randn_like(x_t) * torch.sqrt(posterior_variance[t])

        # 3. Sample x_{t-1} ~ N(mean, σ_t^2 * I)
        x_t = mean + noise

    # After looping from T to 0, x_t is now our generated image x_0
    return x_t
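
The sample function above expects several precomputed tensors that it never builds itself. One way to construct them is sketched below, assuming a linear schedule with T = 1000 and the posterior-variance choice for σ_t²; the variable names simply mirror the function's arguments.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)  # assumed linear schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)  # α_bar_t

# α_bar_{t-1}, shifted by one step, with α_bar before the first step defined as 1
alphas_cumprod_prev = torch.cat([torch.ones(1), alphas_cumprod[:-1]])

sqrt_recip_alphas = torch.sqrt(1.0 / alphas)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)

# Variance of the forward-process posterior q(x_{t-1} | x_t, x_0)
posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)
```

Note that posterior_variance is exactly zero at index 0, which matches the deterministic final step in the sampling loop.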

Best Practices and Pitfalls

Timestep Conditioning: How you feed the timestep t to the U-Net is critical. The standard, brilliant method is to use sinusoidal positional embeddings (like in Transformers), pass them through a small MLP, and add the resulting vector to the feature maps inside each residual block. Some later architectures instead use it to scale and shift the output of the group normalization layers. Don’t just append t as an extra channel; the signal tends to get lost.
The Variance Schedule (β_t): This is a hyperparameter you absolutely must get right. It defines the “noise roadmap.” A linear schedule from β_1=1e-4 to β_T=0.02 works, but later work found that a cosine schedule can perform better, particularly on low-resolution images, because it destroys information more gradually and changes the noise level slowly at the very beginning and end. Get this wrong, and your model will struggle to learn or generate anything coherent.
Compute Cost: Let’s be honest, the main pitfall is the compute. Training a diffusion model from scratch is brutally expensive. You need a big dataset, a powerful U-Net, and many, many timesteps T (often 1000). The sampling is also painfully slow: you have to run the model T times sequentially. This is the biggest trade-off: stunning quality for agonizing speed. Later samplers like DDIM would tackle this sampling slowness head-on.
The “Blurry” Problem: While better than VAEs, early DDPMs could still sometimes produce slightly “averaged” or blurry results, especially on complex datasets. One contributing factor is the MSE objective: when there are multiple “right” ways to denoise a patch, the loss-minimizing prediction is their average, so at high noise levels the model hedges toward the safest, most average option. The stochastic sampling helps, but it’s a fundamental tension.
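
The timestep-conditioning and variance-schedule points above can be sketched in a few lines. Both functions are illustrative: the embedding is one common sinusoidal variant (frequency spacing differs slightly between codebases), and the cosine schedule follows the usual form with offset s = 0.008 and betas clipped at 0.999. Treat the names and defaults as assumptions.

```python
import math
import torch

def sinusoidal_timestep_embedding(t, dim):
    """Map integer timesteps (batch_size,) to (batch_size, dim) features; dim is assumed even."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1/10000
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float().unsqueeze(1) * freqs.unsqueeze(0)  # (batch_size, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=1)

def cosine_beta_schedule(timesteps, s=0.008, max_beta=0.999):
    """Derive betas from a squared-cosine curve for α_bar_t."""
    steps = torch.arange(timesteps + 1, dtype=torch.float64)
    f = torch.cos((steps / timesteps + s) / (1 + s) * math.pi / 2) ** 2
    alphas_cumprod = f / f[0]  # normalize so α_bar starts at 1
    betas = 1.0 - alphas_cumprod[1:] / alphas_cumprod[:-1]
    return betas.clamp(max=max_beta).float()  # clip to avoid a singular final step

emb = sinusoidal_timestep_embedding(torch.tensor([0, 10, 999]), dim=64)
betas = cosine_beta_schedule(1000)
```

In a real U-Net the embedding would typically pass through a small MLP before being injected into each residual block; that wiring is left out here.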