Project·2–3 days

Tiny Diffusion Model (DDPM)

Implement Denoising Diffusion Probabilistic Models from scratch — forward noising, U-Net noise predictor, and the reverse sampling loop.

What you'll build

A DDPM that learns to generate images by reversing a gradual noising process. Start with MNIST (trains in under an hour on a single GPU), then move to CIFAR-10 or a custom 64×64 dataset.

Prerequisites

Diffusion sits at the intersection of probability and deep learning — line these up first:

Gaussians — sampling from N(μ, σ²), reparameterization trick
Variance schedules — what β_t, α_t, and ᾱ_t mean and how they relate
U-Net architecture — encoder/decoder with skip connections, group norm
Residual blocks and time/condition injection patterns (FiLM, additive embeddings)
Sinusoidal embeddings (same idea as transformer positional encodings)
MSE loss — and why predicting noise ε is equivalent to a weighted ELBO
EMA of model weights — why sampling uses the EMA copy
Mixed precision + gradient clipping for stable training
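As a minimal sketch of how β_t, α_t, and ᾱ_t relate, the snippet below builds a linear schedule and derives the other two quantities from it. The endpoints 1e-4 and 0.02 are the values used by Ho et al. (2020). The helper name `linear_schedule` is an illustrative choice, not something this project prescribes.

```python
import torch

def linear_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Return (beta, alpha, alpha_bar), each a float64 tensor of shape (T,)."""
    beta = torch.linspace(beta_start, beta_end, T, dtype=torch.float64)
    alpha = 1.0 - beta                       # fraction of signal variance kept at step t
    alpha_bar = torch.cumprod(alpha, dim=0)  # fraction kept after steps 0..t combined
    return beta, alpha, alpha_bar

beta, alpha, alpha_bar = linear_schedule()

# Lookup tables under the names the training snippet uses.
sqrt_alpha_bar = alpha_bar.sqrt().float()
sqrt_one_minus_alpha_bar = (1.0 - alpha_bar).sqrt().float()

print(f"alpha_bar[0]         = {alpha_bar[0].item():.6f}")
print(f"sqrt(alpha_bar[T-1]) = {sqrt_alpha_bar[-1].item():.6f}")
```

With these defaults the last printed value should come out well below 0.01, meaning almost none of the original signal survives at the final step.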

Warm-up exercises

Sample 10k points from a 2D Gaussian mixture and visualize them — then train a tiny MLP diffusion model on this toy data before tackling images.
Implement the forward noising process q(x_t | x_0) and plot a single MNIST digit at t = 0, 100, 500, 999.
Build a minimal U-Net (no time conditioning) that learns to denoise images at a fixed noise level.
Add sinusoidal time embeddings and verify gradients flow through them.
Derive ᾱ_t from a linear β schedule and confirm sqrt(ᾱ_T) ≈ 0 at T=1000.
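The first two warm-ups can be combined into one quick sanity check: sample toy data, push it through the forward process, and watch the statistics drift toward a standard Gaussian. This is a sketch under assumptions: the eight-mode ring layout, the linear schedule endpoints, and the helper names `sample_gmm` and `q_sample` are all illustrative.

```python
import math
import torch

def sample_gmm(n, n_modes=8, radius=2.0, std=0.1, generator=None):
    """Draw n points from a mixture of n_modes Gaussians placed on a circle."""
    idx = torch.randint(0, n_modes, (n,), generator=generator)
    angles = 2 * math.pi * idx.float() / n_modes
    centers = torch.stack([angles.cos(), angles.sin()], dim=1) * radius
    return centers + std * torch.randn(n, 2, generator=generator)

def q_sample(x0, t, alpha_bar, generator=None):
    """Sample x_t ~ q(x_t | x_0) in closed form; t is a (B,) long tensor."""
    noise = torch.randn(x0.shape, generator=generator)
    extra_dims = [1] * (x0.dim() - 1)          # broadcast over all non-batch dims
    a = alpha_bar[t].sqrt().view(-1, *extra_dims)
    s = (1 - alpha_bar[t]).sqrt().view(-1, *extra_dims)
    return a * x0 + s * noise

g = torch.Generator().manual_seed(0)
T = 1000
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)
x0 = sample_gmm(10_000, generator=g)

for step in (0, 100, 500, 999):
    t = torch.full((x0.shape[0],), step, dtype=torch.long)
    xt = q_sample(x0, t, alpha_bar, generator=g)
    print(f"t={step:4d}  mean={xt.mean(0).tolist()}  std={xt.std(0).tolist()}")
```

The same `q_sample` works unchanged on image batches of shape (B, C, H, W), so it can be reused for the MNIST plot. A scatter plot of `xt` at each step makes the collapse of the ring into an isotropic blob easy to see.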

Difficulty

Advanced. You need to be comfortable with both the probability side (Gaussian noising, the variational bound) and the engineering side (U-Nets, EMA, stable training).

The two processes

Forward (fixed): q(xₜ | x₀) = N(√ᾱₜ x₀, (1 − ᾱₜ) I) — add Gaussian noise over T steps.
Reverse (learned): a U-Net εθ(xₜ, t) predicts the noise that was added.
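The closed form for the forward process follows from composing single noising steps. The derivation below is a sketch that assumes the standard DDPM definitions from Ho et al. (2020), which this section does not state explicitly: q(x_t | x_{t-1}) = N(√α_t x_{t-1}, β_t I), with α_t = 1 − β_t and ᾱ_t the product of α_s up to step t. The symbol ε̄ stands for the merged noise term. The merge works because the sum of two independent zero-mean Gaussians is Gaussian with the variances added, and α_t(1 − α_{t-1}) + (1 − α_t) = 1 − α_t α_{t-1}.

```latex
\begin{aligned}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_t,
      \qquad \epsilon_t \sim \mathcal{N}(0, I) \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2}
       + \sqrt{\alpha_t (1-\alpha_{t-1})}\, \epsilon_{t-1}
       + \sqrt{1-\alpha_t}\, \epsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2}
       + \sqrt{1-\alpha_t \alpha_{t-1}}\, \bar\epsilon \\
    &\;\;\vdots \\
    &= \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,
      \qquad \epsilon \sim \mathcal{N}(0, I)
\end{aligned}
```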

Training objective

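# Simple MSE on predicted noise (Ho et al. 2020)
import torch
import torch.nn.functional as F
t = torch.randint(0, T, (B,), device=x0.device)  # one random timestep per sample
noise = torch.randn_like(x0)
# Reshape the (B,) schedule lookups to (B, 1, 1, 1) so they broadcast over C, H, W
a = sqrt_alpha_bar[t].view(-1, 1, 1, 1)
s = sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1)
xt = a * x0 + s * noise  # closed-form sample from q(x_t | x_0)
loss = F.mse_loss(model(xt, t), noise)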
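To see the objective in context, here is a sketch of one full optimizer step followed by an EMA update. `TinyEps` is a hypothetical stand-in for the U-Net so the example runs on its own, the random batch stands in for MNIST, and the learning rate and clipping threshold are assumed values rather than recommendations from this project.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEps(nn.Module):
    """Hypothetical stand-in for the U-Net: two convs, timestep added as a channel bias."""
    def __init__(self, T, ch=16):
        super().__init__()
        self.t_bias = nn.Embedding(T, ch)
        self.inp = nn.Conv2d(1, ch, 3, padding=1)
        self.out = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, x, t):
        h = self.inp(x) + self.t_bias(t)[:, :, None, None]
        return self.out(F.silu(h))

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """In place: ema = decay * ema + (1 - decay) * current, parameter by parameter."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

torch.manual_seed(0)
T, B = 1000, 8
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)
model = TinyEps(T)
ema_model = copy.deepcopy(model).requires_grad_(False)  # frozen copy used for sampling
opt = torch.optim.Adam(model.parameters(), lr=2e-4)      # assumed learning rate

x0 = torch.randn(B, 1, 28, 28)        # placeholder batch; real code would load MNIST
t = torch.randint(0, T, (B,))
noise = torch.randn_like(x0)
a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
s = (1 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
loss = F.mse_loss(model(a * x0 + s * noise, t), noise)

opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # assumed clip threshold
opt.step()
ema_update(ema_model, model)          # call once after every optimizer step
print(f"loss = {loss.item():.4f}")
```

This version averages parameters only. A model with buffers (for example batch-norm running statistics) would need those copied across as well.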

Milestones

Schedule. Implement a linear or cosine β schedule and precompute αₜ and ᾱₜ.
U-Net. Down/up blocks with residual connections, group norm, sinusoidal time embeddings injected into each block.
Training. Random t per sample, EMA of model weights (decay 0.999) for sampling.
Sampling. Implement the DDPM ancestral sampler (T=1000 steps) and a DDIM sampler (50 steps).
Eval. Inspect sample grids visually and compute FID between generated samples and a held-out set of real images.
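The sampling milestone can be sketched as two short loops. Both assume a noise-prediction model with the signature `model(x, t)` and a (T,) tensor `beta`. The DDPM loop uses σ_t² = β_t, one of the two variance choices discussed by Ho et al. (2020). The DDIM loop is the deterministic variant (η = 0) on an evenly spaced subset of timesteps. The function names and the even spacing are illustrative choices.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, beta, generator=None):
    """Ancestral sampling: one network call per timestep, from t = T-1 down to 0."""
    alpha = 1 - beta
    alpha_bar = torch.cumprod(alpha, dim=0)
    x = torch.randn(shape, generator=generator)
    for i in reversed(range(len(beta))):
        t = torch.full((shape[0],), i, dtype=torch.long)
        eps = model(x, t)
        # Posterior mean implied by the predicted noise.
        mean = (x - beta[i] / (1 - alpha_bar[i]).sqrt() * eps) / alpha[i].sqrt()
        if i > 0:
            x = mean + beta[i].sqrt() * torch.randn(shape, generator=generator)
        else:
            x = mean  # no noise is added on the final step
    return x

@torch.no_grad()
def ddim_sample(model, shape, beta, n_steps=50, generator=None):
    """Deterministic DDIM (eta = 0) on an evenly spaced subsequence of timesteps."""
    alpha_bar = torch.cumprod(1 - beta, dim=0)
    steps = torch.linspace(len(beta) - 1, 0, n_steps).round().long().tolist()
    x = torch.randn(shape, generator=generator)
    for k, i in enumerate(steps):
        t = torch.full((shape[0],), i, dtype=torch.long)
        eps = model(x, t)
        ab = alpha_bar[i]
        # After the last step the target is the clean image, i.e. alpha_bar = 1.
        ab_prev = alpha_bar[steps[k + 1]] if k + 1 < len(steps) else torch.tensor(1.0)
        x0_pred = (x - (1 - ab).sqrt() * eps) / ab.sqrt()
        x = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps
    return x
```

In practice the EMA copy of the network would be passed as `model`, for example `ddim_sample(ema_model, (16, 1, 28, 28), beta)`. Many implementations also clamp `x0_pred` to the data range at each step, which this sketch leaves out.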

Sinusoidal time embedding

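import math
import torch

def time_embed(t, dim):
    # t: (B,) tensor of integer timesteps; returns a (B, dim) float embedding (dim must be even)
    half = dim // 2
    # Keep freqs in float32: .to(t) would cast them to t's integer dtype and zero most of them
    freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([args.sin(), args.cos()], dim=-1)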
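One common way to inject the embedding into each block is to project it to one bias per channel and add it between the two convolutions. The block below is a sketch of that additive pattern, not a prescribed architecture: the class name `ResBlock`, the group count, and the channel sizes are assumptions, and a random tensor stands in for the output of `time_embed` so the example is self-contained. The backward pass at the end doubles as a check that gradients reach the embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Residual block with group norm and an additive time-embedding shift."""
    def __init__(self, in_ch, out_ch, t_dim, groups=8):
        super().__init__()
        self.norm1 = nn.GroupNorm(groups, in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.t_proj = nn.Linear(t_dim, out_ch)  # embedding -> one bias per channel
        self.norm2 = nn.GroupNorm(groups, out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        # 1x1 conv on the skip path only when the channel count changes.
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x, t_emb):
        h = self.conv1(F.silu(self.norm1(x)))
        h = h + self.t_proj(F.silu(t_emb))[:, :, None, None]  # broadcast over H and W
        h = self.conv2(F.silu(self.norm2(h)))
        return h + self.skip(x)

torch.manual_seed(0)
block = ResBlock(in_ch=32, out_ch=64, t_dim=128)
x = torch.randn(4, 32, 14, 14)
t_emb = torch.randn(4, 128, requires_grad=True)  # placeholder for time_embed(t, 128)

out = block(x, t_emb)
out.sum().backward()
print(out.shape, "grad reaches t_emb:", bool(t_emb.grad.abs().sum() > 0))
```

A FiLM-style variant would project the embedding to both a scale and a shift per channel instead of a shift alone.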

Stretch goals

Classifier-free guidance with a class-conditional U-Net
Latent diffusion — train a small VAE first, diffuse in latent space
Text conditioning via a frozen CLIP text encoder + cross-attention
v-prediction parameterization (used in Stable Diffusion 2)
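For the v-prediction stretch goal, the target and its inverse are short enough to check numerically. The sketch below assumes the definition from Salimans and Ho (2022), v = √ᾱ_t ε − √(1 − ᾱ_t) x₀. The function names are illustrative.

```python
import torch

def v_target(x0, noise, alpha_bar_t):
    """Training target v = sqrt(a_bar) * eps - sqrt(1 - a_bar) * x0."""
    return alpha_bar_t.sqrt() * noise - (1 - alpha_bar_t).sqrt() * x0

def x0_and_eps_from_v(xt, v, alpha_bar_t):
    """Recover (x0, eps) from v; relies on a^2 + s^2 = 1 for a = sqrt(a_bar)."""
    a, s = alpha_bar_t.sqrt(), (1 - alpha_bar_t).sqrt()
    return a * xt - s * v, s * xt + a * v

torch.manual_seed(0)
x0 = torch.randn(4, 1, 8, 8)
noise = torch.randn_like(x0)
alpha_bar_t = torch.tensor([0.9, 0.5, 0.1, 0.01]).view(-1, 1, 1, 1)
xt = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * noise

v = v_target(x0, noise, alpha_bar_t)
x0_rec, eps_rec = x0_and_eps_from_v(xt, v, alpha_bar_t)
print("max x0 error: ", (x0_rec - x0).abs().max().item())
print("max eps error:", (eps_rec - noise).abs().max().item())
```

Switching the training loop to this parameterization means regressing the network output onto `v` instead of `noise`, then converting back to x₀ or ε inside the sampler.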