Project·3–4 hours

Neural Style Transfer

Recreate the iconic 'painting style' transfer using a frozen VGG-19, content loss, and Gram-matrix style loss.

What you'll build

A program that takes a content image (a photo) and a style image (a painting) and produces a new image that has the content of the first but the brushstrokes, colors, and textures of the second. You optimize the pixels of the output image — not network weights.

Prerequisites

  • CNN feature hierarchies — what shallow vs deep VGG layers represent (edges → textures → parts → objects)
  • Optimizing inputs, not weights — comfort with requires_grad=True on a tensor and frozen model parameters
  • MSE loss and weighted multi-term losses
  • Gram matrices — the linear-algebra meaning (inner products between feature channels)
  • Autograd basics — calling .backward() on a scalar, reading .grad
  • L-BFGS at a high level (why it's used here vs SGD)
  • PIL / torchvision transforms — normalizing/denormalizing images for VGG
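To make the last prerequisite concrete, here is a minimal sketch of normalizing and denormalizing an image batch for VGG. It assumes the image is already a float tensor in [0, 1] with shape (B, 3, H, W) and uses the standard ImageNet channel statistics; the helper names normalize and denormalize are illustrative, not part of torchvision.

```python
import torch

# ImageNet channel statistics (assumed to match the pretrained VGG-19 weights you load).
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def normalize(img):
    """Map an RGB batch in [0, 1], shape (B, 3, H, W), into VGG input space."""
    return (img - IMAGENET_MEAN.to(img.device)) / IMAGENET_STD.to(img.device)

def denormalize(x):
    """Invert normalize() and clamp to the displayable [0, 1] range."""
    return (x * IMAGENET_STD.to(x.device) + IMAGENET_MEAN.to(x.device)).clamp(0, 1)

# Round trip on a random "image": the result should match the input.
img = torch.rand(1, 3, 8, 8)
roundtrip = denormalize(normalize(img))
```

One reasonable design is to optimize the image in [0, 1] space and call normalize inside the forward pass, so the optimized tensor can be saved without any further conversion.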

Warm-up exercises

  1. Load pretrained VGG-19, run an image through it, and print the shape of every intermediate activation.
  2. Optimize a random-noise tensor so it matches a target image under MSE — this proves you understand "optimize the pixels".
  3. Compute the Gram matrix of a (1, 64, 32, 32) tensor by hand and via your function; confirm they match.
  4. Build a small FeatureExtractor that returns a dict of named activations from chosen layers using forward hooks.
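As a starting point for exercises 1 and 4, the sketch below builds a hook-based extractor and prints activation shapes. To stay runnable without downloading weights, it uses a small stand-in network instead of VGG-19; the class layout and the layer labels are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Collect activations from chosen submodules via forward hooks."""

    def __init__(self, model, layers):
        # layers: dict mapping submodule name (e.g. "1") -> label for the output dict
        super().__init__()
        self.model = model
        self.features = {}
        modules = dict(model.named_modules())
        for module_name, label in layers.items():
            modules[module_name].register_forward_hook(self._save(label))

    def _save(self, label):
        def hook(module, inputs, output):
            self.features[label] = output
        return hook

    def forward(self, x):
        self.features = {}          # drop activations from the previous call
        self.model(x)               # hooks fill self.features as layers run
        return dict(self.features)

# Stand-in network (NOT VGG-19) so the sketch runs offline.
toy = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
)
extractor = FeatureExtractor(toy, {"1": "relu1", "4": "relu2"})
acts = extractor(torch.rand(1, 3, 32, 32))
for name, act in acts.items():
    print(name, tuple(act.shape))
```

With torchvision's vgg19().features, the same class should work with index strings as keys; in the standard layout relu1_1, relu2_1, relu3_1, relu4_1, relu4_2, and relu5_1 sit at indices 1, 6, 11, 20, 22, and 29, but print the module list to confirm this for your torchvision version. Hooking ReLU outputs rather than conv outputs also avoids surprises from VGG's in-place ReLUs overwriting a saved conv activation.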

Difficulty

Intermediate. A great way to internalize what CNN features actually capture.

Key idea

  • Content loss: MSE between feature maps of output and content image at a mid-deep VGG layer (e.g. relu4_2, the activation after conv4_2).
  • Style loss: MSE between Gram matrices of output and style image across multiple shallow + deep layers.
  • Total loss: α · L_content + β · L_style + γ · L_tv, where L_tv is a total-variation penalty on the output image that discourages pixel-level noise.
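The three terms can be sketched as plain functions over dicts of named activations. This is one possible arrangement, not the only one: the function names, the dict-based interface, the L1 form of the total-variation term, and the default weights are all assumptions. gram repeats the function from the Gram matrix section so the block runs on its own.

```python
import torch
import torch.nn.functional as F

def gram(f):  # f: (B, C, H, W)
    B, C, H, W = f.shape
    f = f.reshape(B, C, H * W)
    return (f @ f.transpose(1, 2)) / (C * H * W)

def content_loss(out_feats, content_feats, layer):
    """MSE between feature maps at a single layer."""
    return F.mse_loss(out_feats[layer], content_feats[layer])

def style_loss(out_feats, style_feats, layers):
    """Sum of MSEs between Gram matrices over the chosen layers."""
    return sum(F.mse_loss(gram(out_feats[n]), gram(style_feats[n])) for n in layers)

def tv_loss(img):
    """Mean absolute difference between neighboring pixels (one common TV variant)."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return dh + dw

def total_loss(img, out_feats, content_feats, style_feats,
               content_layer, style_layers,
               alpha=1.0, beta=1e4, gamma=1e-2):  # weights are placeholders to tune
    return (alpha * content_loss(out_feats, content_feats, content_layer)
            + beta * style_loss(out_feats, style_feats, style_layers)
            + gamma * tv_loss(img))

# Toy check with random tensors standing in for VGG activations.
feats_a = {"relu1_1": torch.rand(1, 4, 8, 8), "relu4_2": torch.rand(1, 6, 4, 4)}
feats_b = {"relu1_1": torch.rand(1, 4, 8, 8), "relu4_2": torch.rand(1, 6, 4, 4)}
flat_img = torch.full((1, 3, 8, 8), 0.5)   # constant image, so the TV term is zero
same = total_loss(flat_img, feats_a, feats_a, feats_a, "relu4_2", ["relu1_1"])
diff = total_loss(flat_img, feats_a, feats_b, feats_b, "relu4_2", ["relu1_1"])
```

In a real run the content and style activations are fixed targets, so it is usual to compute them once under torch.no_grad() and recompute only out_feats at every step.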

Gram matrix

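def gram(f):  # f: feature maps of shape (B, C, H, W)
    B, C, H, W = f.shape
    f = f.reshape(B, C, H * W)  # flatten spatial dims; reshape also accepts non-contiguous input
    # (B, C, C) matrix of channel-by-channel inner products, normalized by elements per sample
    return (f @ f.transpose(1, 2)) / (C * H * W)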
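A quick way to check the function above (and to do warm-up exercise 3) is to compare it against an explicit double loop over channels. The sketch assumes the same C · H · W normalization; the loop is deliberately naive.

```python
import torch

def gram(f):  # same function as above, repeated so this check runs on its own
    B, C, H, W = f.shape
    f = f.reshape(B, C, H * W)
    return (f @ f.transpose(1, 2)) / (C * H * W)

f = torch.randn(1, 64, 32, 32)
G = gram(f)

# "By hand": entry (i, j) is the dot product of flattened channels i and j.
flat = f[0].reshape(64, -1)
by_hand = torch.empty(64, 64)
for i in range(64):
    for j in range(64):
        by_hand[i, j] = torch.dot(flat[i], flat[j]) / (64 * 32 * 32)

print(tuple(G.shape), torch.allclose(G[0], by_hand, atol=1e-5))
```

The result has shape (B, C, C) regardless of H and W, which is why the content and style images do not need to share a resolution for the style loss.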

Milestones

  1. Load pretrained VGG-19, freeze it, replace MaxPool with AvgPool for smoother gradients.
  2. Pick content layer (relu4_2) and style layers (relu1_1, relu2_1, relu3_1, relu4_1, relu5_1).
  3. Initialize output as a copy of the content image; make it the only trainable tensor.
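  4. Optimize with L-BFGS for 300–500 steps. In PyTorch, torch.optim.LBFGS needs a closure that zeroes the gradients, recomputes the loss, and calls .backward().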
  5. Add total-variation loss to reduce noise; tune the β/α ratio (typically 1e3–1e6, depending on how the Gram matrices are normalized).
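Milestones 1, 3, and 4 can be sketched as two helpers. The names prepare and stylize are hypothetical, loss_fn stands for whatever total loss you assemble, and counting closure evaluations as "steps" is one convention among several. The demo at the bottom uses a toy network and a toy MSE objective so it runs without pretrained weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def prepare(features):
    """Freeze a VGG-style nn.Sequential and swap MaxPool for AvgPool."""
    layers = []
    for layer in features.children():
        if isinstance(layer, nn.MaxPool2d):
            layer = nn.AvgPool2d(kernel_size=layer.kernel_size, stride=layer.stride)
        elif isinstance(layer, nn.ReLU):
            layer = nn.ReLU(inplace=False)  # keeps hooked activations from being overwritten
        layers.append(layer)
    model = nn.Sequential(*layers).eval()
    for p in model.parameters():
        p.requires_grad_(False)             # only the image will receive gradients
    return model

def stylize(loss_fn, content_img, steps=300):
    """Optimize pixels with L-BFGS; loss_fn(img) must return a scalar tensor."""
    img = content_img.clone().requires_grad_(True)   # start from the content image
    optimizer = torch.optim.LBFGS([img], lr=1.0, max_iter=20)
    evals = 0

    def closure():
        nonlocal evals
        optimizer.zero_grad()
        loss = loss_fn(img)
        loss.backward()
        evals += 1
        return loss

    while evals < steps:                    # each step() evaluates the closure at least once
        optimizer.step(closure)
    return img.detach()

# Toy demo: a tiny network and a plain MSE target instead of the style-transfer loss.
toy = nn.Sequential(nn.Conv2d(3, 4, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2))
frozen = prepare(toy)

content = torch.rand(1, 3, 16, 16)
target = torch.rand(1, 3, 16, 16)
initial = F.mse_loss(content, target).item()
result = stylize(lambda img: F.mse_loss(img, target), content, steps=50)
final = F.mse_loss(result, target).item()
print(f"loss before: {initial:.4f}  after: {final:.6f}")
```

For the real project, prepare would be applied to something like torchvision.models.vgg19(weights="DEFAULT").features (assuming a recent torchvision), and loss_fn would run the extractor on img and combine the content, style, and total-variation terms.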

Stretch goals

  • Train a feed-forward stylization network (Johnson et al.) for real-time transfer
  • Arbitrary style transfer via AdaIN
  • Video style transfer with temporal consistency loss
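For the AdaIN stretch goal, the core operation is small enough to sketch here: shift and scale each channel of the content features so its mean and standard deviation match those of the style features. This covers only the normalization step, under the assumption of (B, C, H, W) feature maps; the encoder, decoder, and training loop are not shown, and eps is an illustrative stabilizer.

```python
import torch

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization on (B, C, H, W) feature maps."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    # Whiten per channel with content statistics, then re-color with style statistics.
    return s_std * (content_feat - c_mean) / c_std + s_mean

content_feat = torch.randn(2, 8, 16, 16)
style_feat = 3.0 * torch.randn(2, 8, 12, 12) + 1.0   # different spatial size is fine
out = adain(content_feat, style_feat)
```

After this call, each channel of out has approximately the per-channel mean and standard deviation of style_feat while keeping the spatial layout of content_feat.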