Backpropagation
The algorithm that made deep learning possible — efficient gradient computation via the chain rule.
The Idea
Training a network means minimizing a loss L(θ) with respect to parameters θ. To do that we need ∂L/∂θ for every parameter. Doing this naively scales poorly. Backprop computes all gradients in a single backward pass, in time proportional to the forward pass.
The Chain Rule, Repeatedly
For a composition L = f(g(h(x))):
∂L/∂x = ∂L/∂f · ∂f/∂g · ∂g/∂h · ∂h/∂xBackprop applies this rule layer-by-layer, reusing intermediate results stored during the forward pass.
Forward & Backward Pass
# Forward
z₁ = W₁ x + b₁; a₁ = φ(z₁)
z₂ = W₂ a₁ + b₂; ŷ = z₂
L = ½(ŷ − y)²
# Backward
δ₂ = ŷ − y
∂L/∂W₂ = δ₂ · a₁ᵀ
δ₁ = (W₂ᵀ δ₂) ⊙ φ'(z₁)
∂L/∂W₁ = δ₁ · xᵀWhy It's Fast
- Gradients are computed in one pass backward through the graph.
- Each operation needs only local information: its inputs and its upstream gradient.
- Modern frameworks (PyTorch, JAX) build a computation graph and apply this automatically via autograd.
Backprop isn't magic — it's just bookkeeping. The magic is that it makes the bookkeeping cheap enough that training billion-parameter models is feasible.
Backpropagation is fundamentally an application of which calculus rule?
What is the asymptotic cost of one backward pass relative to one forward pass?
Why do frameworks store activations during the forward pass?
Which of these does backprop NOT do?
If activations are not saved (e.g. with gradient checkpointing), what happens?
In the backward pass through y = Wx, the gradient with respect to W is:
Autograd in PyTorch builds the computation graph: