Autograd & Computation Graphs
The machinery that lets you write a forward pass and get gradients for free.
The Computation Graph
Each operation on a tensor that requires gradients adds a node to a directed acyclic graph (DAG): nodes are operations and edges are the tensors flowing between them. When you call .backward(), autograd walks this graph in reverse, applying the chain rule at each node.
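To make the reverse walk concrete, here is a minimal sketch of a scalar reverse-mode autograd in plain Python. The Value class and its fields are hypothetical teaching names, not part of PyTorch, and real autograd engines are considerably more involved; the sketch only shows the two ideas from the paragraph above: operations record their inputs as they run, and backward() visits nodes in reverse topological order while applying the chain rule.

```python
class Value:
    """A scalar node in a computation graph (hypothetical teaching class, not a PyTorch API)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # input nodes of the op that produced this value
        self.local_grads = local_grads  # d(self)/d(parent), one entry per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    __radd__ = __add__
    __rmul__ = __mul__

    def __pow__(self, k):
        return Value(self.data ** k, (self,), (k * self.data ** (k - 1),))

    def backward(self):
        # Post-order DFS: every node is appended after all of its inputs.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Reversed order guarantees a node's gradient is complete before it is propagated.
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local  # chain rule


x = Value(2.0)
y = x ** 3 + 2 * x
y.backward()
print(x.grad)  # 14.0
```

The variable x feeds two operations, so its gradient is the sum of two contributions (12 from the cubic term, 2 from the linear term). PyTorch does the same bookkeeping for tensors, as in the snippet that follows.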
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + 2 * x
y.backward()  # fills x.grad with dy/dx
print(x.grad)  # tensor(14.): dy/dx = 3x² + 2 at x = 2
Define-by-Run vs Define-and-Run
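- Define-by-run (PyTorch, JAX in eager mode): the graph is built dynamically as Python runs. Easy to debug with ordinary tools, and it supports data-dependent control flow.
- Define-and-run (TF1, ONNX): the graph is defined and compiled before any data flows through it. Often faster, because the whole graph can be optimized ahead of time, but rigid.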
Modern frameworks combine both: write the model eagerly, then apply torch.compile or jax.jit to capture a graph for graph-level optimization.
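As an illustration of what "supports control flow" means in define-by-run mode, the sketch below uses a hypothetical function, double_until_large, whose loop count depends on the runtime value of its input. Nothing here is specific to any particular model; it only shows that the recorded graph, and therefore the gradient, can differ from one call to the next.

```python
import torch


def double_until_large(x, threshold=10.0):
    # Hypothetical example: the number of iterations depends on the tensor's
    # value at runtime, so a different graph is recorded for different inputs.
    y = x
    while y.norm() < threshold:
        y = y * 2
    return y.sum()


a = torch.tensor([1.0, 1.0], requires_grad=True)
double_until_large(a).backward()
print(a.grad)  # tensor([8., 8.]): three doublings were recorded

b = torch.tensor([4.0, 4.0], requires_grad=True)
double_until_large(b).backward()
print(b.grad)  # tensor([2., 2.]): only one doubling was recorded
```

A graph fixed ahead of time would need a dedicated loop construct to express the same function, which is the rigidity the list above refers to.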
Vector-Jacobian Products
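Backprop never explicitly forms Jacobians: for an operation mapping n inputs to m outputs the Jacobian J has m × n entries, which is far too large for typical layers. Instead it computes vector-Jacobian products (VJPs): given an upstream gradient v with one entry per output, return vᵀJ, which has only n entries. Each primitive operation registers its own VJP rule.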
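The following sketch compares the two routes on a small hypothetical function f from R³ to R² (the matrix A and vector v are made-up values for illustration). It uses torch.autograd.grad with grad_outputs to get the VJP in one backward pass, and torch.autograd.functional.jacobian to build the full Jacobian as a reference; at this size both are cheap, so the point is only that they agree.

```python
import torch

A = torch.tensor([[1.0, 2.0, 0.0],
                  [0.0, 1.0, 3.0]])


def f(x):
    # Maps R^3 -> R^2, so the Jacobian has shape (2, 3).
    return torch.sin(A @ x)


x = torch.tensor([0.1, 0.2, 0.3], requires_grad=True)
v = torch.tensor([1.0, -2.0])  # upstream gradient, one entry per output

# VJP: a single backward pass, no Jacobian is materialized.
(vjp,) = torch.autograd.grad(f(x), x, grad_outputs=v)

# Reference: build the full Jacobian explicitly, then multiply by v.
J = torch.autograd.functional.jacobian(f, x)
print(J.shape)                     # torch.Size([2, 3])
print(torch.allclose(vjp, v @ J))  # True
```

When the output is a scalar loss, v is simply 1, which is why loss.backward() needs no extra argument.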
Stop-Gradient and Detach
z = x.detach()  # break the graph
with torch.no_grad():  # don't track ops at all
    out = model(x)
Essential for target networks (DQN), EMA teachers (self-supervised), and inference-only passes.
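A small sketch of the difference, using a single scalar parameter w as a stand-in for a model (the numbers and the "target" computation are hypothetical). With detach(), the target is treated as a constant, so the gradient flows only through the prediction; inside torch.no_grad(), no graph is recorded in the first place.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

# detach(): same values, but gradients stop flowing through this branch.
pred = w * 2.0
target = (w * 5.0).detach()  # treated as a constant, like a frozen target
loss = (pred - target) ** 2
loss.backward()
print(target.requires_grad)  # False
print(w.grad)                # tensor(-36.): 2 * (6 - 15) * 2, via pred only

# no_grad(): operations inside the block are not recorded at all.
with torch.no_grad():
    out = w * 2.0
print(out.requires_grad)  # False
print(out.grad_fn)        # None
```

Without the detach() call the gradient would also flow through the target branch and come out as 2 * (6 - 15) * (2 - 5) = 54, which is the kind of leakage target networks and EMA teachers are meant to avoid.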
Higher-Order Gradients
Set create_graph=True so that the backward pass is itself recorded in the graph, which lets you differentiate through the gradient. Used in meta-learning (MAML), influence functions, and physics-informed nets.
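As a sketch, the second derivative of the earlier example y = x³ + 2x can be computed this way. The snippet uses torch.autograd.grad, which accepts the same create_graph flag as .backward() and returns the gradient instead of accumulating it into .grad, which tends to be more convenient when the gradient is differentiated again.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + 2 * x

# First derivative, keeping the graph of the gradient computation itself.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)
print(dy_dx.item())  # 14.0, i.e. 3x^2 + 2 at x = 2

# Second derivative: differentiate the first derivative.
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)
print(d2y_dx2.item())  # 12.0, i.e. 6x at x = 2
```

Without create_graph=True, dy_dx would carry no graph and the second call to torch.autograd.grad would raise an error.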
Autograd is the unsung hero of deep learning. Before it, every new architecture meant deriving gradients by hand. Now you just write the forward pass.
What does PyTorch's autograd build as you execute tensor ops?
Why does backprop compute VJPs instead of full Jacobians?
What does tensor.detach() do?
When should you use torch.no_grad()?
What does create_graph=True in .backward() enable?
Define-by-run frameworks (PyTorch, JAX eager) are advantageous because:
Which use case typically requires stop-gradient / detach?