Autograd & Computation Graphs

The machinery that lets you write a forward pass and get gradients for free.

The Computation Graph

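Every operation on a tensor with requires_grad=True adds a node to a directed acyclic graph (DAG) in which nodes are operations and edges are tensors. When you call .backward() on a scalar output, autograd walks the graph in reverse, applying the chain rule at each node and accumulating the result in the .grad attribute of each leaf tensor.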

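import torch

x = torch.tensor(2.0, requires_grad=True)  # leaf tensor tracked by autograd
y = x ** 3 + 2 * x                         # forward pass records the graph
y.backward()                               # reverse pass computes dy/dx
print(x.grad)   # tensor(14.), since 3x² + 2 = 14 at x = 2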
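
To make the reverse walk concrete, here is a minimal pure-Python sketch of the same idea. The Value class and its fields are hypothetical names chosen for illustration; this is not how PyTorch is implemented, only a toy that records parents and local derivatives during the forward pass and applies the chain rule in reverse topological order.

```python
class Value:
    """Scalar node in a toy computation graph (hypothetical, not PyTorch's API)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    __rmul__ = __mul__

    def __pow__(self, k):
        # k is a plain Python number, so only self is a parent
        return Value(self.data ** k, (self,), (k * self.data ** (k - 1),))

    def backward(self):
        # Build a topological order so each node is processed after all its consumers.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            # Chain rule: the upstream gradient times the local derivative.
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local


x = Value(2.0)
y = x ** 3 + 2 * x
y.backward()
print(x.grad)  # 14.0
```

Note that x receives contributions from two paths (the cube and the product), which is why gradients are accumulated with += rather than assigned.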

Define-by-Run vs Define-and-Run

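  • Define-by-run (PyTorch, JAX in eager mode): the graph is built dynamically as the Python code runs. Easy to debug with ordinary tools, and native Python control flow (if, for, while) just works.
  • Define-and-run (TF1, exported ONNX graphs): the graph is declared and compiled ahead of time, before any data flows through it. This allows whole-graph optimization and is often faster, but control flow has to be expressed with graph constructs.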

Modern frameworks combine both: you write eager code, then apply torch.compile or jax.jit to capture it as a graph for graph-level optimization.
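
As a small illustration of why define-by-run supports control flow, the sketch below branches on a tensor's value with an ordinary Python if. The function f and its two branches are made up for this example; the point is only that the recorded graph, and therefore the gradient, depends on which branch actually ran.

```python
import torch


def f(x):
    # Plain Python branch: only the path taken is recorded in the graph.
    if x > 0:
        return x ** 2      # gradient 2x
    return -3 * x          # gradient -3


a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(-1.0, requires_grad=True)

f(a).backward()
f(b).backward()

print(a.grad, b.grad)  # tensor(4.) tensor(-3.)
```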

Vector-Jacobian Products

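Backprop never explicitly forms Jacobians: for a function from n inputs to m outputs, the Jacobian J has m × n entries, which is prohibitive when both are large. Instead it computes vector-Jacobian products (VJPs): given the upstream gradient v, each node returns vᵀJ. Each primitive operation registers its own VJP rule.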
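
The following sketch checks this numerically on a toy function, assuming PyTorch is available. The function f and the vector v are arbitrary choices for illustration. torch.autograd.grad with grad_outputs=v returns vᵀJ directly, and the explicit Jacobian is built only to verify the result.

```python
import torch


def f(x):
    # Maps R^3 -> R^2, so the Jacobian is 2 x 3.
    return torch.stack([x[0] * x[1], x[1] + x[2]])


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 10.0])  # upstream gradient, one entry per output

# VJP: computed by reverse-mode autograd without materializing J.
(vjp,) = torch.autograd.grad(f(x), x, grad_outputs=v)

# Explicit Jacobian, built here only for comparison.
J = torch.autograd.functional.jacobian(f, x.detach())

print(vjp)    # tensor([ 2., 11., 10.])
print(v @ J)  # same values
```

At x = (1, 2, 3) the Jacobian is [[2, 1, 0], [0, 1, 1]], so vᵀJ = (2, 11, 10), matching the autograd result.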

Stop-Gradient and Detach

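import torch

model = torch.nn.Linear(1, 1)            # stand-in model for illustration
x = torch.ones(1, requires_grad=True)
z = x.detach()           # same data, cut off from the graph
with torch.no_grad():    # don't track ops at all
    out = model(x)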

Both are essential wherever gradients must not flow: target networks (DQN), EMA teachers (self-supervised learning), and inference-only passes.
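
A quick way to see the effect of detach() is to compare gradients with and without it. This is a minimal sketch on a scalar, not a full target-network setup; the variable names are chosen for illustration.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# Without detach: y = x * x, so dy/dx = 2x = 6.
y_full = x * x
y_full.backward()
grad_full = x.grad.clone()

x.grad = None  # clear the accumulated gradient before the second pass

# With detach: the second factor is treated as a constant, so dy/dx = 3.
y_stop = x * x.detach()
y_stop.backward()
grad_stop = x.grad.clone()

# Inside no_grad, results are not attached to any graph.
with torch.no_grad():
    out = x * 2

print(grad_full, grad_stop, out.requires_grad)  # tensor(6.) tensor(3.) False
```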

Higher-Order Gradients

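Pass create_graph=True to .backward() or torch.autograd.grad so that the backward pass is itself recorded in the graph, which lets you differentiate through the gradient. This is used in meta-learning (MAML), influence functions, and physics-informed neural networks.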
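
As a minimal sketch, the second derivative of y = x³ + 2x can be obtained by calling torch.autograd.grad twice, with create_graph=True on the first call so the first derivative stays differentiable.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + 2 * x

# First derivative, recorded in the graph: dy/dx = 3x^2 + 2
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)

# Second derivative: d2y/dx2 = 6x
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)

print(dy_dx.item(), d2y_dx2.item())  # 14.0 12.0
```

Without create_graph=True, the second call would fail because dy_dx would carry no graph to differentiate.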

Autograd is the unsung hero of deep learning. Before it, every new architecture meant deriving gradients by hand. Now you just write the forward pass.

Quiz: Check your understanding

01

What does PyTorch's autograd build as you execute tensor ops?

02

Why does backprop compute VJPs instead of full Jacobians?

03

What does tensor.detach() do?

04

When should you use torch.no_grad()?

05

What does create_graph=True in .backward() enable?

06

Define-by-run frameworks (PyTorch, JAX eager) are advantageous because:

07

Which use case typically requires stop-gradient / detach?