Project · 2–3 days
Tiny Diffusion Model (DDPM)
Implement Denoising Diffusion Probabilistic Models from scratch — forward noising, U-Net noise predictor, and the reverse sampling loop.
What you'll build
A DDPM that learns to generate images by reversing a gradual noising process. Start with MNIST (trains in under an hour on a single GPU), then move to CIFAR-10 or a custom 64×64 dataset.
Prerequisites
Diffusion sits at the intersection of probability and deep learning — line these up first:
- Gaussians — sampling from N(μ, σ²), reparameterization trick
- Variance schedules — what β_t, α_t, and ᾱ_t mean and how they relate
- U-Net architecture — encoder/decoder with skip connections, group norm
- Residual blocks and time/condition injection patterns (FiLM, additive embeddings)
- Sinusoidal embeddings (same idea as transformer positional encodings)
- MSE loss, and why the simple noise-prediction objective on ε is a reweighted form of the variational bound (ELBO)
- EMA of model weights — why sampling uses the EMA copy
- Mixed precision + gradient clipping for stable training
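To make the EMA item above concrete, here is a minimal sketch of keeping an exponential moving average of the weights. The `EMA` class, its `shadow` attribute, and the `Linear` stand-in model are hypothetical names for illustration; the decay of 0.999 is the value used in the milestones of this project, and other values may suit other setups.

```python
import copy
import torch

class EMA:
    """Keeps an exponential moving average of a model's parameters (illustrative helper)."""

    def __init__(self, model, decay=0.999):
        self.decay = decay
        # The shadow copy is the one you would sample from; it is never trained directly.
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            # shadow = decay * shadow + (1 - decay) * param
            s.mul_(self.decay).add_(p.detach(), alpha=1.0 - self.decay)
        for s, b in zip(self.shadow.buffers(), model.buffers()):
            s.copy_(b)  # buffers (e.g. norm statistics) are copied, not averaged

model = torch.nn.Linear(4, 4)  # stand-in for the U-Net
ema = EMA(model, decay=0.999)
# In a training loop: optimizer.step(), then ema.update(model).
# At sampling time: call ema.shadow instead of model.
```

The averaged weights change more smoothly than the raw ones, which is the usual reason samples are drawn from the EMA copy.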
Warm-up exercises
- Sample 10k points from a 2D Gaussian mixture and visualize them — then train a tiny MLP diffusion model on this toy data before tackling images.
- Implement the forward noising process q(x_t | x_0) and plot a single MNIST digit at t = 0, 100, 500, 999.
- Build a minimal U-Net (no time conditioning) that learns to denoise images at a fixed noise level.
- Add sinusoidal time embeddings and verify gradients flow through them.
- Derive ᾱ_t from a linear β schedule and confirm sqrt(ᾱ_T) ≈ 0 at T=1000.
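The schedule and forward-noising exercises above can be sketched as follows. This is one possible layout, not a reference implementation: the helper names (`linear_beta_schedule`, `make_schedule`, `q_sample`) are made up for illustration, and the β endpoints 1e-4 and 0.02 are the values from Ho et al. (2020), assumed here because the text does not fix them.

```python
import torch

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Endpoints are an assumption (Ho et al. 2020); treat them as tunable.
    return torch.linspace(beta_start, beta_end, T, dtype=torch.float64)

def make_schedule(betas):
    alphas = 1.0 - betas                      # alpha_t = 1 - beta_t
    alpha_bar = torch.cumprod(alphas, dim=0)  # alpha_bar_t = product of alpha_s for s <= t
    return {
        "betas": betas,
        "alphas": alphas,
        "alpha_bar": alpha_bar,
        "sqrt_alpha_bar": alpha_bar.sqrt(),
        "sqrt_one_minus_alpha_bar": (1.0 - alpha_bar).sqrt(),
    }

def q_sample(x0, t, sched, noise=None):
    """Draw x_t ~ q(x_t | x_0) in closed form for a batch of timesteps t."""
    if noise is None:
        noise = torch.randn_like(x0)
    # Reshape (B,) coefficients to (B, 1, ..., 1) so they broadcast over image dims.
    shape = (-1,) + (1,) * (x0.dim() - 1)
    a = sched["sqrt_alpha_bar"].to(x0)[t].view(shape)
    s = sched["sqrt_one_minus_alpha_bar"].to(x0)[t].view(shape)
    return a * x0 + s * noise

sched = make_schedule(linear_beta_schedule())
print(float(sched["sqrt_alpha_bar"][-1]))  # close to 0, so x_T is almost pure noise

x0 = torch.zeros(4, 1, 28, 28)             # stand-in for a batch of MNIST digits
xt = q_sample(x0, torch.tensor([0, 100, 500, 999]), sched)
```

Replacing the zero tensor with four copies of a real digit and plotting `xt` gives the t = 0, 100, 500, 999 figure from the exercise.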
Difficulty
Advanced. You need both the probability side (Gaussian forward process, variational bound) and the deep learning side (U-Net design, stable training) to get it working.
The two processes
- Forward (fixed): q(xₜ | x₀) = N(√ᾱₜ x₀, (1 − ᾱₜ) I) — add Gaussian noise over T steps.
- Reverse (learned): a U-Net εθ(xₜ, t) predicts the noise that was added.
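To show how the predicted noise is actually used in the reverse process, here is a sketch of DDPM ancestral sampling. It assumes a model called as `model(x, t)` that returns a noise estimate with the same shape as `x`, and it uses σ_t² = β_t, one of the two variance choices discussed by Ho et al. (2020). The function name `p_sample_loop` and the dummy model are illustrative only.

```python
import torch

@torch.no_grad()
def p_sample_loop(model, shape, betas):
    """DDPM ancestral sampling: start from x_T ~ N(0, I) and step down to x_0."""
    T = betas.shape[0]
    device = betas.device
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)
    for i in reversed(range(T)):
        t = torch.full((shape[0],), i, dtype=torch.long, device=device)
        eps = model(x, t)  # predicted noise
        # Posterior mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        mean = (x - betas[i] / (1.0 - alpha_bar[i]).sqrt() * eps) / alphas[i].sqrt()
        if i > 0:
            # Assumed variance choice: sigma_t^2 = beta_t
            x = mean + betas[i].sqrt() * torch.randn_like(x)
        else:
            x = mean  # no noise is added on the final step
    return x

# Smoke test with a dummy "model" that always predicts zero noise.
betas = torch.linspace(1e-4, 0.02, 50)
sample = p_sample_loop(lambda x, t: torch.zeros_like(x), (2, 1, 8, 8), betas)
```

With a trained network in place of the lambda and a full T=1000 schedule, the same loop produces images; the dummy model only checks shapes and the control flow.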
Training objective
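import torch
import torch.nn.functional as F
# Simple MSE on predicted noise (Ho et al. 2020); x0 has shape (B, C, H, W)
t = torch.randint(0, T, (B,), device=x0.device)
noise = torch.randn_like(x0)
# Reshape the per-sample coefficients to (B, 1, 1, 1) so they broadcast over x0
a = sqrt_alpha_bar[t].view(-1, 1, 1, 1)
s = sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1)
xt = a * x0 + s * noise
loss = F.mse_loss(model(xt, t), noise)
Milestones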
- Schedule. Implement linear or cosine β schedule, precompute αₜ and ᾱₜ.
- U-Net. Down/up blocks with residual connections, group norm, sinusoidal time embeddings injected into each block.
- Training. Random t per sample, EMA of model weights (decay 0.999) for sampling.
- Sampling. Implement DDPM ancestral sampler (T=1000 steps) and DDIM (50 steps).
- Eval. Visual quality + FID on a held-out set.
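For the sampling milestone, a deterministic DDIM sampler (η = 0) can be sketched as below. It assumes the same `model(x, t)` noise predictor and a precomputed 1-D tensor `alpha_bar` of length T. The function name `ddim_sample` and the evenly spaced timestep subset are illustrative choices; other spacings are possible.

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bar, steps=50):
    """Deterministic DDIM (eta = 0) over an evenly spaced subset of timesteps."""
    T = alpha_bar.shape[0]
    device = alpha_bar.device
    # Descending timesteps from T-1 to 0; even spacing is an assumption.
    ts = torch.linspace(T - 1, 0, steps, device=device).round().long()
    x = torch.randn(shape, device=device)
    for k in range(steps):
        i = ts[k]
        t = torch.full((shape[0],), int(i), dtype=torch.long, device=device)
        eps = model(x, t)
        a_t = alpha_bar[i]
        # alpha_bar is taken to be 1 after the last step, so the final update returns x0_pred.
        if k + 1 < steps:
            a_prev = alpha_bar[ts[k + 1]]
        else:
            a_prev = torch.ones((), device=device, dtype=alpha_bar.dtype)
        x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()       # estimate of the clean image
        x = a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps   # jump to the earlier timestep
    return x

# Smoke test with a dummy model; a trained U-Net would replace the lambda.
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 100), dim=0)
sample = ddim_sample(lambda x, t: torch.zeros_like(x), (2, 1, 8, 8), alpha_bar, steps=10)
```

Because no fresh noise is injected inside the loop, the output depends only on the initial x_T, which makes DDIM convenient for comparing checkpoints from a fixed seed.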
Sinusoidal time embedding
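import math
import torch

def time_embed(t, dim):
    # t: (B,) timesteps; returns (B, dim) embeddings (dim assumed even)
    half = dim // 2
    # Build freqs in float32 on t's device; .to(t) would also cast them to t's integer dtype
    freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device, dtype=torch.float32) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([args.sin(), args.cos()], dim=-1)
Stretch goals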
- Classifier-free guidance with a class-conditional U-Net
- Latent diffusion — train a small VAE first, diffuse in latent space
- Text conditioning via a frozen CLIP text encoder + cross-attention
- v-prediction parameterization (used in Stable Diffusion 2)
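As a starting point for the classifier-free guidance goal, the sampling-time combination of conditional and unconditional predictions can be sketched as follows. Everything here is an assumption for illustration: a class-conditional model called as `model(x, t, y)`, a reserved `null_label` index that stood in for dropped labels during training, and a guidance weight `w` where `w = 1` recovers the plain conditional prediction.

```python
import torch

def guided_eps(model, x, t, y, null_label, w=3.0):
    """Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction."""
    eps_cond = model(x, t, y)
    eps_uncond = model(x, t, torch.full_like(y, null_label))
    return eps_uncond + w * (eps_cond - eps_uncond)

# Dummy model whose "noise" is just the label value, to make the arithmetic visible.
dummy = lambda x, t, y: y.float().view(-1, 1, 1, 1).expand_as(x)
x = torch.zeros(2, 1, 2, 2)
y = torch.tensor([1, 2])
eps = guided_eps(dummy, x, None, y, null_label=0, w=2.0)
```

The returned `eps` can be passed to either sampler in place of the raw model output; larger `w` typically trades sample diversity for closer adherence to the class.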