Project · 3–4 hours

Neural Style Transfer

Recreate the iconic 'painting style' transfer using a frozen VGG-19, content loss, and Gram-matrix style loss.

What you'll build

A program that takes a content image (a photo) and a style image (a painting) and produces a new image that has the content of the first but the brushstrokes, colors, and textures of the second. You optimize the pixels of the output image — not network weights.

Prerequisites

CNN feature hierarchies — what shallow vs deep VGG layers represent (edges → textures → parts → objects)
Optimizing inputs, not weights — comfort with requires_grad=True on a tensor and frozen model parameters
MSE loss and weighted multi-term losses
Gram matrices — the linear-algebra meaning (inner products between feature channels)
Autograd basics — calling .backward() on a scalar, reading .grad
L-BFGS at a high level (why it's used here vs SGD)
PIL / torchvision transforms — normalizing/denormalizing images for VGG
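For the last prerequisite, a minimal sketch of the normalize/denormalize round trip. It assumes images are float tensors in [0, 1] with shape (B, 3, H, W) and uses the ImageNet channel statistics that torchvision's pretrained VGG weights expect; the helper names are illustrative.

```python
import torch

# ImageNet channel statistics used by torchvision's pretrained VGG weights.
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)


def normalize(img: torch.Tensor) -> torch.Tensor:
    """Map a (B, 3, H, W) image in [0, 1] into VGG input space."""
    return (img - IMAGENET_MEAN.to(img.device)) / IMAGENET_STD.to(img.device)


def denormalize(x: torch.Tensor) -> torch.Tensor:
    """Invert normalize() and clamp to the displayable [0, 1] range."""
    return (x * IMAGENET_STD.to(x.device) + IMAGENET_MEAN.to(x.device)).clamp(0, 1)


img = torch.rand(1, 3, 8, 8)
roundtrip = denormalize(normalize(img))
print(torch.allclose(roundtrip, img, atol=1e-6))
```

If you optimize the image in normalized space, call denormalize() only when saving or displaying the result.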

Warm-up exercises

Load pretrained VGG-19, run an image through it, and print the shape of every intermediate activation.
Optimize a random-noise tensor so it matches a target image under MSE — this proves you understand "optimize the pixels".
Compute the Gram matrix of a (1, 64, 32, 32) tensor by hand and via your function; confirm they match.
Build a small FeatureExtractor that returns a dict of named activations from chosen layers using forward hooks.
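One possible shape for the last warm-up, shown on a tiny stand-in network so it runs without downloading weights. The class layout, the layer names, and the toy model are assumptions for illustration; with real VGG-19 you would pass the model's features module and the indices of the layers you care about.

```python
import torch
import torch.nn as nn


class FeatureExtractor(nn.Module):
    """Collects activations from chosen layers of a sequential model via forward hooks."""

    def __init__(self, model: nn.Sequential, layers: dict[str, int]):
        super().__init__()
        self.model = model
        self.features: dict[str, torch.Tensor] = {}
        for name, idx in layers.items():
            model[idx].register_forward_hook(self._save(name))

    def _save(self, name: str):
        # Each hook stores its layer's output under the chosen name.
        def hook(module, inputs, output):
            self.features[name] = output
        return hook

    def forward(self, x: torch.Tensor) -> dict[str, torch.Tensor]:
        self.features = {}
        self.model(x)  # hooks fill self.features during this call
        return dict(self.features)


# Hypothetical stand-in for VGG (not the real architecture).
toy = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
)
extractor = FeatureExtractor(toy, {"relu1": 1, "relu2": 4})
feats = extractor(torch.rand(1, 3, 32, 32))
for name, act in feats.items():
    print(name, tuple(act.shape))
```

Hooking the ReLU outputs rather than the conv outputs is a deliberate choice here: torchvision's VGG uses in-place ReLU, so a tensor captured at a conv layer would be overwritten by the activation that follows it.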

Difficulty

Intermediate. A great way to internalize what CNN features actually capture.

Key idea

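Content loss: MSE between the feature maps of the output and the content image at a mid-to-deep VGG layer (e.g. relu4_2, the activation after conv4_2).
Style loss: MSE between the Gram matrices of the output and the style image, summed over several layers from shallow to deep.
Total loss: α · L_content + β · L_style + γ · L_tv, where L_tv is a total-variation penalty on the output image that suppresses high-frequency noise.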
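Written out, the three terms might look like the following. This is one common formulation, not the only one: x is the output image, c the content image, s the style image, F^l the flattened features at layer l (batch size 1 assumed), l_c the content layer, and S the set of style layers. The normalization constants and the absolute-difference form of the TV term are assumptions chosen to match mean-squared-error losses.

```latex
\begin{aligned}
G^{l}(x) &= \frac{F^{l}(x)\,F^{l}(x)^{\top}}{C_l H_l W_l},
  \qquad F^{l}(x) \in \mathbb{R}^{C_l \times H_l W_l} \\
L_{\text{content}} &= \frac{1}{C_{l_c} H_{l_c} W_{l_c}}
  \bigl\lVert F^{l_c}(x) - F^{l_c}(c) \bigr\rVert_F^{2} \\
L_{\text{style}} &= \sum_{l \in S} \frac{1}{C_l^{2}}
  \bigl\lVert G^{l}(x) - G^{l}(s) \bigr\rVert_F^{2} \\
L_{\text{tv}} &= \sum_{i,j} \bigl( \lvert x_{i+1,j} - x_{i,j} \rvert
  + \lvert x_{i,j+1} - x_{i,j} \rvert \bigr) \\
L &= \alpha\, L_{\text{content}} + \beta\, L_{\text{style}} + \gamma\, L_{\text{tv}}
\end{aligned}
```

Because the Gram matrix sums over all spatial positions, it discards where a feature occurs and keeps only which channels fire together. That is why it captures texture rather than layout.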

Gram matrix

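def gram(f):  # f: (B, C, H, W) feature map
    B, C, H, W = f.shape
    f = f.reshape(B, C, H * W)  # flatten spatial dims; reshape also handles non-contiguous input
    # (B, C, C) matrix of channel-by-channel inner products, normalized by element count
    return (f @ f.transpose(1, 2)) / (C * H * W)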
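A quick sanity check of the function, following the third warm-up. The "by hand" version here is an einsum that spells out the inner product between every pair of channels; the input is random data, so only shapes and agreement are checked.

```python
import torch


def gram(f):  # f: (B, C, H, W)
    B, C, H, W = f.shape
    f = f.reshape(B, C, H * W)
    return (f @ f.transpose(1, 2)) / (C * H * W)


f = torch.randn(1, 64, 32, 32)
g = gram(f)

# "By hand": entry (i, j) is the inner product of channel i with channel j,
# summed over every spatial position, with the same normalization.
manual = torch.einsum("bihw,bjhw->bij", f, f) / (64 * 32 * 32)

print(tuple(g.shape))                                   # one C x C matrix per batch item
print(torch.allclose(g, manual, atol=1e-5))             # both routes should agree
print(torch.allclose(g, g.transpose(1, 2), atol=1e-6))  # Gram matrices are symmetric
```

The output has no spatial dimensions left, so two images of different sizes still produce Gram matrices of the same shape at a given layer.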

Milestones

Load pretrained VGG-19, freeze it, replace MaxPool with AvgPool for smoother gradients.
Pick content layer (relu4_2) and style layers (relu1_1, relu2_1, relu3_1, relu4_1, relu5_1).
Initialize output as a copy of the content image; make it the only trainable tensor.
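Optimize with L-BFGS for 300–500 steps; torch.optim.LBFGS requires a closure that zeroes the gradients, recomputes the loss, and calls .backward().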
Add total-variation loss to reduce noise; tune the β/α ratio (typically 1e3–1e6, depending on how the Gram matrices are normalized).
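The milestones above can be sketched end to end as follows. Treat it as a starting point under stated assumptions, not a reference implementation: the function names, default weights, and the absolute-difference TV term are illustrative, and the layer indices assume the layout of torchvision's vgg19().features. Inputs are expected to be ImageNet-normalized (1, 3, H, W) tensors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Indices into torchvision's vgg19().features (assumed layout).
STYLE_LAYERS = {"relu1_1": 1, "relu2_1": 6, "relu3_1": 11, "relu4_1": 20, "relu5_1": 29}
CONTENT_LAYERS = {"relu4_2": 22}


def prepare_features(features: nn.Sequential) -> nn.Sequential:
    """Swap MaxPool for AvgPool, make ReLU non-in-place, and freeze all weights."""
    layers = []
    for layer in features:
        if isinstance(layer, nn.MaxPool2d):
            layer = nn.AvgPool2d(kernel_size=2, stride=2)
        elif isinstance(layer, nn.ReLU):
            layer = nn.ReLU(inplace=False)  # keeps captured activations intact
        layers.append(layer)
    net = nn.Sequential(*layers).eval()
    for p in net.parameters():
        p.requires_grad_(False)
    return net


def load_vgg19_features() -> nn.Sequential:
    """Downloads pretrained weights on first call (torchvision >= 0.13 API)."""
    from torchvision.models import vgg19, VGG19_Weights
    return prepare_features(vgg19(weights=VGG19_Weights.DEFAULT).features)


def extract(net, x, layers):
    """Run x through net and return {name: activation} for the requested indices."""
    out = {}
    for i, layer in enumerate(net):
        x = layer(x)
        for name, idx in layers.items():
            if idx == i:
                out[name] = x
    return out


def gram(f):
    B, C, H, W = f.shape
    f = f.reshape(B, C, H * W)
    return (f @ f.transpose(1, 2)) / (C * H * W)


def tv_loss(img):
    """Mean absolute difference between vertically and horizontally adjacent pixels."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return dh + dw


def stylize(net, content, style, content_layers, style_layers,
            steps=300, alpha=1.0, beta=1e4, gamma=1e-3):
    # Targets are fixed, so compute them once without tracking gradients.
    with torch.no_grad():
        target_c = extract(net, content, content_layers)
        target_s = {k: gram(v) for k, v in extract(net, style, style_layers).items()}

    out = content.clone().requires_grad_(True)  # the only trainable tensor
    opt = torch.optim.LBFGS([out], max_iter=steps)

    def closure():
        opt.zero_grad()
        feats = extract(net, out, {**content_layers, **style_layers})
        l_content = sum(F.mse_loss(feats[k], target_c[k]) for k in target_c)
        l_style = sum(F.mse_loss(gram(feats[k]), target_s[k]) for k in target_s)
        loss = alpha * l_content + beta * l_style + gamma * tv_loss(out)
        loss.backward()
        return loss

    opt.step(closure)  # runs up to max_iter L-BFGS iterations
    return out.detach()
```

A typical call would be stylize(load_vgg19_features(), content, style, CONTENT_LAYERS, STYLE_LAYERS). L-BFGS may stop before max_iter if its tolerance checks are met, so log the loss inside the closure if you want to see how many evaluations actually ran.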

Stretch goals

Train a feed-forward stylization network (Johnson et al.) for real-time transfer
Arbitrary style transfer via AdaIN
Video style transfer with temporal consistency loss