Project·2–3 days

Tiny Diffusion Model (DDPM)

Implement Denoising Diffusion Probabilistic Models from scratch — forward noising, U-Net noise predictor, and the reverse sampling loop.

What you'll build

A DDPM that learns to generate images by reversing a gradual noising process. Start with MNIST (trains in under an hour on a single GPU), then move to CIFAR-10 or a custom 64×64 dataset.

Prerequisites

Diffusion sits at the intersection of probability and deep learning — line these up first:

  • Gaussians — sampling from N(μ, σ²), reparameterization trick
  • Variance schedules — what β_t, α_t, and ᾱ_t mean and how they relate
  • U-Net architecture — encoder/decoder with skip connections, group norm
  • Residual blocks and time/condition injection patterns (FiLM, additive embeddings)
  • Sinusoidal embeddings (same idea as transformer positional encodings)
  • MSE loss, and why the simple noise-prediction objective is a reweighted form of the variational bound (ELBO)
  • EMA of model weights — why sampling uses the EMA copy
  • Mixed precision + gradient clipping for stable training
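
The EMA item can be made concrete with a short sketch. It assumes PyTorch; the helper name `ema_update` is hypothetical and `nn.Linear` stands in for the U-Net. The usual motivation is that averaged weights are less noisy than the latest optimizer iterate, so samples drawn from the EMA copy tend to look cleaner.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(ema_model: nn.Module, model: nn.Module, decay: float = 0.999) -> None:
    """Move each EMA parameter a small step toward the live parameter (hypothetical helper)."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        # p_ema <- decay * p_ema + (1 - decay) * p
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
    # Copy buffers (e.g. normalization statistics) directly rather than averaging them.
    for b_ema, b in zip(ema_model.buffers(), model.buffers()):
        b_ema.copy_(b)

model = nn.Linear(4, 4)                                  # stand-in for the U-Net
ema_model = copy.deepcopy(model).requires_grad_(False)   # EMA copy is never trained directly
```

In a training loop, `ema_update(ema_model, model)` would be called once after every optimizer step, and `ema_model` would be the network passed to the sampler.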

Warm-up exercises

  1. Sample 10k points from a 2D Gaussian mixture and visualize them — then train a tiny MLP diffusion model on this toy data before tackling images.
  2. Implement the forward noising process q(x_t | x_0) and plot a single MNIST digit at t = 0, 100, 500, 999.
  3. Build a minimal U-Net (no time conditioning) that learns to denoise images at a fixed noise level.
  4. Add sinusoidal time embeddings and verify gradients flow through them.
  5. Derive ᾱ_t from a linear β schedule and confirm sqrt(ᾱ_T) ≈ 0 at T=1000.
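
Exercise 5 can be checked without any deep learning library. The sketch below is plain Python; the function names are illustrative, and the endpoints β₁ = 1e-4 and β_T = 0.02 are an assumption taken from the linear schedule in Ho et al. (2020), not values fixed by this project.

```python
import math

def linear_beta_schedule(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> list[float]:
    """Linearly spaced betas; the endpoints are assumed, following Ho et al. (2020)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def cumulative_alpha_bar(betas: list[float]) -> list[float]:
    """alpha_t = 1 - beta_t, and alpha_bar_t is the running product alpha_1 * ... * alpha_t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

T = 1000
alpha_bars = cumulative_alpha_bar(linear_beta_schedule(T))
signal_at_T = math.sqrt(alpha_bars[-1])  # fraction of x_0 that survives at the final step
print(f"sqrt(alpha_bar_T) = {signal_at_T:.4f}")
```

If the printed value is not close to zero, the final x_T still carries visible signal from x_0, and starting the sampler from pure Gaussian noise would not match what the model saw in training.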

Difficulty

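Advanced. You need the probability side (Gaussian forward process, variational bound) and the deep learning side (U-Net training, sampling loops) to work together before any sample looks right.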

The two processes

  • Forward (fixed): Gaussian noise is added over T steps. The marginal q(xₜ | x₀) = N(√ᾱₜ x₀, (1 − ᾱₜ) I) lets you sample any xₜ directly from x₀.
  • Reverse (learned): a U-Net εθ(xₜ, t) predicts the noise that was added.
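
The closed form for the forward process follows from unrolling the per-step Gaussian kernel. The derivation below uses the standard DDPM definitions, αₜ = 1 − βₜ and ᾱₜ as the running product of the αₛ; these are assumed here rather than spelled out in this brief. Independent Gaussian noises are merged at each step by adding their variances.

```latex
\begin{aligned}
q(x_t \mid x_{t-1}) &= \mathcal{N}\!\left(\sqrt{\alpha_t}\,x_{t-1},\ \beta_t I\right),
  \qquad \alpha_t = 1 - \beta_t, \quad \bar\alpha_t = \prod_{s=1}^{t} \alpha_s \\
x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\,x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\,\bar\epsilon \\
    &\ \ \vdots \\
    &= \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I)
\end{aligned}
```

The third line uses the variance identity αₜ(1 − αₜ₋₁) + (1 − αₜ) = 1 − αₜαₜ₋₁.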

Training objective

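# Simple MSE on predicted noise (Ho et al. 2020)
import torch
import torch.nn.functional as F

# x0: (B, C, H, W) batch; sqrt_alpha_bar and sqrt_one_minus_alpha_bar: (T,) tensors on x0's device
t = torch.randint(0, T, (B,), device=x0.device)  # one random timestep per sample
noise = torch.randn_like(x0)
# Reshape the per-sample coefficients to (B, 1, 1, 1) so they broadcast over C, H, W
xt = sqrt_alpha_bar[t].view(-1, 1, 1, 1) * x0 + sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1) * noise
loss = F.mse_loss(model(xt, t), noise)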

Milestones

  1. Schedule. Implement linear or cosine β schedule, precompute αₜ and ᾱₜ.
  2. U-Net. Down/up blocks with residual connections, group norm, sinusoidal time embeddings injected into each block.
  3. Training. Random t per sample, EMA of model weights (decay 0.999) for sampling.
  4. Sampling. Implement DDPM ancestral sampler (T=1000 steps) and DDIM (50 steps).
  5. Eval. Visual quality + FID on a held-out set.
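
The ancestral sampler in milestone 4 can be sketched as follows. This is a minimal version under stated assumptions, not a reference implementation: `model(x_t, t)` is assumed to return predicted noise with the same shape as `x_t`, the reverse variance is set to σₜ² = βₜ (one of the two choices discussed in Ho et al. 2020), and the function name `ddpm_sample` is illustrative.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """Ancestral sampling sketch: start from pure noise and denoise one step at a time."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=betas.device)               # x_T ~ N(0, I)
    for i in reversed(range(len(betas))):
        t = torch.full((shape[0],), i, device=betas.device, dtype=torch.long)
        eps = model(x, t)
        # Posterior mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        mean = (x - betas[i] / torch.sqrt(1.0 - alpha_bar[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            x = mean + torch.sqrt(betas[i]) * torch.randn_like(x)  # assumed variance: sigma_t^2 = beta_t
        else:
            x = mean                                           # no noise is added at the final step
    return x

# Smoke test with a stand-in "model" that always predicts zero noise.
betas = torch.linspace(1e-4, 0.02, 50)
samples = ddpm_sample(lambda x, t: torch.zeros_like(x), (2, 1, 8, 8), betas)
```

With a trained network, `model` would be the EMA copy from milestone 3 and `betas` the full T = 1000 schedule. DDIM reuses the same noise predictor but visits only a subsequence of timesteps.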

Sinusoidal time embedding

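import math
import torch

def time_embed(t, dim):
    # t: (B,) integer timesteps -> (B, dim) embedding; dim is assumed to be even
    half = dim // 2
    # Build freqs as float32 on t's device; .to(t) would also cast them to t's integer dtype
    freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device, dtype=torch.float32) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([args.sin(), args.cos()], dim=-1)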
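
One common way to use such an embedding inside a U-Net block is the additive pattern: project the embedding to one value per channel and add it to the feature map. The sketch below assumes PyTorch; the class name `TimeResBlock`, the layer ordering, and the sizes are illustrative choices, and `temb` is a random stand-in for a (B, time_dim) sinusoidal embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeResBlock(nn.Module):
    """Residual block with an additive time embedding (hypothetical name and layout)."""

    def __init__(self, channels: int, time_dim: int, groups: int = 8):
        super().__init__()
        self.norm1 = nn.GroupNorm(groups, channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.time_proj = nn.Linear(time_dim, channels)    # one bias per channel
        self.norm2 = nn.GroupNorm(groups, channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, temb: torch.Tensor) -> torch.Tensor:
        h = self.conv1(F.silu(self.norm1(x)))
        # Broadcast the (B, C) projection over height and width.
        h = h + self.time_proj(F.silu(temb))[:, :, None, None]
        h = self.conv2(F.silu(self.norm2(h)))
        return x + h                                       # residual connection

block = TimeResBlock(channels=32, time_dim=128)
x = torch.randn(4, 32, 16, 16)
temb = torch.randn(4, 128)   # stand-in for a sinusoidal embedding of shape (B, time_dim)
out = block(x, temb)
```

A quick check for warm-up exercise 4 is to mark `temb` as requiring gradients, backpropagate a scalar of `out`, and confirm `temb.grad` is nonzero.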

Stretch goals

  • Classifier-free guidance with a class-conditional U-Net
  • Latent diffusion — train a small VAE first, diffuse in latent space
  • Text conditioning via a frozen CLIP text encoder + cross-attention
  • v-prediction parameterization (used in Stable Diffusion 2)
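
For the classifier-free guidance goal, the sampling-time change is small: run the network with and without the class label, then blend the two noise predictions. The sketch below assumes PyTorch and one particular scale convention; some write-ups use 1 + w instead, so check which one a given reference follows. The function name `cfg_noise` is hypothetical.

```python
import torch

def cfg_noise(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, guidance_scale: float) -> torch.Tensor:
    """Blend conditional and unconditional noise predictions.

    Convention assumed here: scale 0 is unconditional, 1 is purely conditional,
    and values above 1 extrapolate past the conditional prediction.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy tensors standing in for two forward passes of the same U-Net.
eps_c = torch.ones(2, 1, 4, 4)
eps_u = torch.zeros(2, 1, 4, 4)
guided = cfg_noise(eps_c, eps_u, guidance_scale=3.0)
```

Training for this typically drops the class label at random for a fraction of samples, so that a single network learns both the conditional and the unconditional prediction.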