Project·3–4 hours
Neural Style Transfer
Recreate the iconic 'painting style' transfer using a frozen VGG-19, content loss, and Gram-matrix style loss.
What you'll build
A program that takes a content image (a photo) and a style image (a painting) and produces a new image that has the content of the first but the brushstrokes, colors, and textures of the second. You optimize the pixels of the output image — not network weights.
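To make "optimize the pixels, not the weights" concrete, here is a minimal sketch that fits a random image to a target under plain MSE. The name fit_pixels, the choice of Adam, and the step count and learning rate are illustrative assumptions; the project itself uses L-BFGS and perceptual losses.

```python
import torch
import torch.nn.functional as F

def fit_pixels(target, steps=200, lr=0.1, seed=0):
    """Optimize a random image so that it matches `target` under MSE."""
    torch.manual_seed(seed)
    # The image is the only trainable tensor; there is no model to update.
    image = torch.rand_like(target).requires_grad_(True)
    optimizer = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.mse_loss(image, target)
        loss.backward()          # gradients land in image.grad
        optimizer.step()         # the pixels move, nothing else does
    return image.detach(), loss.item()

torch.manual_seed(1)
target = torch.rand(1, 3, 16, 16)   # stand-in for a real photo
result, final_loss = fit_pixels(target)
```

Swapping the MSE target for content and style losses computed through a frozen network turns this loop into style transfer.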
Prerequisites
- CNN feature hierarchies: what shallow vs deep VGG layers represent (edges → textures → parts → objects)
- Optimizing inputs, not weights: comfort with requires_grad=True on a tensor and frozen model parameters
- MSE loss and weighted multi-term losses
- Gram matrices: the linear-algebra meaning (inner products between feature channels)
- Autograd basics: calling .backward() on a scalar, reading .grad
- L-BFGS at a high level (why it's used here vs SGD)
- PIL / torchvision transforms: normalizing/denormalizing images for VGG
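The normalizing/denormalizing step in the last item can be sketched in plain PyTorch as follows. The helper names normalize and denormalize are hypothetical. The constants are the ImageNet channel statistics that torchvision's pretrained models are documented to expect. The sketch assumes RGB batches of shape (B, 3, H, W) with values in [0, 1].

```python
import torch

# ImageNet per-channel mean and std, shaped to broadcast over (B, 3, H, W).
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def normalize(img):
    """Map an RGB batch in [0, 1] to the input space VGG was trained on."""
    return (img - IMAGENET_MEAN.to(img)) / IMAGENET_STD.to(img)

def denormalize(x):
    """Invert normalize() and clamp to [0, 1] so the result can be saved or shown."""
    return (x * IMAGENET_STD.to(x) + IMAGENET_MEAN.to(x)).clamp(0, 1)

img = torch.rand(2, 3, 8, 8)
roundtrip = denormalize(normalize(img))
```

If the optimized tensor lives in normalized space, it needs denormalize() before it is converted back to a PIL image.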
Warm-up exercises
- Load pretrained VGG-19, run an image through it, and print the shape of every intermediate activation.
- Optimize a random-noise tensor so it matches a target image under MSE; this proves you understand "optimize the pixels".
- Compute the Gram matrix of a (1, 64, 32, 32) tensor by hand and via your function; confirm they match.
- Build a small FeatureExtractor that returns a dict of named activations from chosen layers using forward hooks.
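For the last warm-up, one possible shape of a hook-based FeatureExtractor is sketched below. To stay runnable without downloading weights it wraps a tiny stand-in network rather than VGG-19. The toy layers, the labels "relu1" and "relu2", and the name-to-label mapping are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Runs `model` and returns the outputs of selected child layers as a dict."""

    def __init__(self, model, layers):
        super().__init__()
        self.model = model
        self.features = {}
        children = dict(model.named_children())
        # `layers` maps a child-module name (e.g. "1") to a readable label.
        for name, label in layers.items():
            children[name].register_forward_hook(self._save(label))

    def _save(self, label):
        def hook(module, inputs, output):
            self.features[label] = output   # keep the activation under its label
        return hook

    def forward(self, x):
        self.features = {}                  # drop activations from earlier calls
        self.model(x)
        return dict(self.features)

# Stand-in for VGG: conv -> relu -> pool -> conv -> relu.
toy = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
)
extractor = FeatureExtractor(toy, {"1": "relu1", "4": "relu2"})
with torch.no_grad():
    acts = extractor(torch.rand(1, 3, 32, 32))
shapes = {k: tuple(v.shape) for k, v in acts.items()}
```

One caveat when moving to torchvision's VGG: its ReLU layers operate in place, so a hook placed on a conv layer would end up holding post-ReLU values unless the output is cloned inside the hook.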
Difficulty
Intermediate. A great way to internalize what CNN features actually capture.
Key idea
- Content loss: MSE between feature maps of output and content image at a mid-deep VGG layer (e.g. conv4_2).
- Style loss: MSE between Gram matrices of output and style image across multiple shallow + deep layers.
- Total loss: α · L_content + β · L_style + γ · L_tv, where L_tv is a total-variation penalty on the output image that reduces pixel noise.
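The three terms can be written down directly. The following is a sketch under stated assumptions: the function names are hypothetical, the style loss is an unweighted sum over layers, the total-variation term uses the mean absolute difference between neighboring pixels (one common variant), and the default weights are placeholders to be tuned.

```python
import torch
import torch.nn.functional as F

def gram(f):
    """Channel-by-channel inner products of a (B, C, H, W) feature map."""
    B, C, H, W = f.shape
    f = f.reshape(B, C, H * W)
    return (f @ f.transpose(1, 2)) / (C * H * W)

def content_loss(out_feat, content_feat):
    # MSE between raw feature maps at the chosen content layer.
    return F.mse_loss(out_feat, content_feat)

def style_loss(out_feats, style_feats):
    # MSE between Gram matrices, summed over the chosen style layers.
    return sum(F.mse_loss(gram(o), gram(s)) for o, s in zip(out_feats, style_feats))

def tv_loss(img):
    # Mean absolute difference between vertically and horizontally adjacent pixels.
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return dh + dw

def total_loss(l_content, l_style, l_tv, alpha=1.0, beta=1e4, gamma=1e-4):
    # alpha, beta, gamma defaults are assumed starting points, not tuned values.
    return alpha * l_content + beta * l_style + gamma * l_tv

torch.manual_seed(0)
feat = torch.rand(1, 64, 32, 32)
G = gram(feat)   # shape (1, 64, 64), symmetric
```

Because the Gram matrix sums over all spatial positions, it records which channels fire together but not where, which is why it captures texture rather than layout.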
Gram matrix
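def gram(f):  # f: feature maps of shape (B, C, H, W)
    B, C, H, W = f.shape
    # Flatten the spatial dims; reshape also works on non-contiguous tensors
    f = f.reshape(B, C, H * W)
    # Inner products between channels, normalized by elements per sample
    return (f @ f.transpose(1, 2)) / (C * H * W)
Milestones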
- Load pretrained VGG-19, freeze it, replace MaxPool with AvgPool for smoother gradients.
- Pick content layer (relu4_2) and style layers (relu1_1, relu2_1, relu3_1, relu4_1, relu5_1).
- Initialize output as a copy of the content image; make it the only trainable tensor.
- Optimize with L-BFGS for 300–500 steps.
- Add total-variation loss to reduce noise; tune β/α ratio (typical 1e3–1e6).
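Put together, the milestones above might look like the sketch below. It is written against any nn.Sequential feature stack so it runs without downloading weights; the toy network, the layer indices, the helper names, the strong-Wolfe line search, and the loss weights are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gram(f):
    B, C, H, W = f.shape
    f = f.reshape(B, C, H * W)
    return (f @ f.transpose(1, 2)) / (C * H * W)

def tv_loss(img):
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return dh + dw

def freeze_and_avgpool(features):
    """Freeze all weights and swap every MaxPool2d for an AvgPool2d."""
    layers = []
    for layer in features:
        if isinstance(layer, nn.MaxPool2d):
            layer = nn.AvgPool2d(kernel_size=layer.kernel_size, stride=layer.stride)
        layers.append(layer)
    net = nn.Sequential(*layers).eval()
    for p in net.parameters():
        p.requires_grad_(False)
    return net

def extract(net, x, wanted):
    """Run x through net layer by layer, keeping outputs at the wanted indices."""
    feats = {}
    for i, layer in enumerate(net):
        x = layer(x)
        if i in wanted:
            feats[i] = x
    return feats

def stylize(net, content, style, content_idx, style_idxs,
            steps=10, alpha=1.0, beta=1e4, gamma=1e-4):
    wanted = set(style_idxs) | {content_idx}
    with torch.no_grad():   # targets are constants
        content_target = extract(net, content, wanted)[content_idx]
        style_feats = extract(net, style, wanted)
        style_targets = {i: gram(style_feats[i]) for i in style_idxs}

    # Start from the content image; it is the only trainable tensor.
    output = content.clone().requires_grad_(True)
    # max_iter=1 makes each .step() call a single L-BFGS iteration.
    optimizer = torch.optim.LBFGS([output], max_iter=1, line_search_fn="strong_wolfe")

    def closure():
        optimizer.zero_grad()
        feats = extract(net, output, wanted)
        l_content = F.mse_loss(feats[content_idx], content_target)
        l_style = sum(F.mse_loss(gram(feats[i]), style_targets[i]) for i in style_idxs)
        loss = alpha * l_content + beta * l_style + gamma * tv_loss(output)
        loss.backward()
        return loss

    history = []
    for _ in range(steps):
        # step() returns the loss evaluated at the start of that iteration.
        history.append(float(optimizer.step(closure)))
    return output.detach(), history

torch.manual_seed(0)
toy = nn.Sequential(   # stand-in for the VGG-19 feature stack
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
)
net = freeze_and_avgpool(toy)
content, style = torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32)
output, history = stylize(net, content, style, content_idx=4, style_idxs=[1, 4])
```

For the real project the same functions could be pointed at the features module of torchvision's pretrained vgg19, with ImageNet-normalized inputs and 300–500 steps. In that layout relu1_1, relu2_1, relu3_1, relu4_1, relu4_2, and relu5_1 should sit at indices 1, 6, 11, 20, 22, and 29, but it is worth printing the module to confirm before relying on them.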
Stretch goals
- Train a feed-forward stylization network (Johnson et al.) for real-time transfer
- Arbitrary style transfer via AdaIN
- Video style transfer with temporal consistency loss