
Optimization

How we actually descend the loss landscape — from vanilla SGD to AdamW.

Stochastic Gradient Descent

The simplest update rule:

θ ← θ − η · ∇L(θ; batch)

where η is the learning rate and the gradient is computed on a mini-batch rather than the full dataset. The resulting gradient noise can help the optimizer escape sharp minima.
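The update rule can be sketched in a few lines of plain Python. This is a minimal illustration, not a production optimizer: the one-parameter model y = w·x, the noise-free toy data, and the names `sgd_step`, `minibatch_grad`, and `train_sgd` are all assumptions made for the example.

```python
import random

def sgd_step(theta, grad, lr):
    """One SGD update: theta <- theta - lr * grad, element-wise."""
    return [t - lr * g for t, g in zip(theta, grad)]

def minibatch_grad(theta, batch):
    """Gradient of the mean squared error for the toy model y = w * x."""
    w = theta[0]
    return [sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)]

def train_sgd(steps=200, lr=0.05, batch_size=4, seed=0):
    rng = random.Random(seed)
    # Hypothetical toy data generated from y = 3x, so the target weight is 3.
    data = [(k / 10.0, 3.0 * k / 10.0) for k in range(1, 21)]
    theta = [0.0]
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # a fresh mini-batch each step
        theta = sgd_step(theta, minibatch_grad(theta, batch), lr)
    return theta

print(train_sgd())  # the single weight should end up close to 3.0
```

Each step sees only four of the twenty points, so individual gradients differ from the full-data gradient, yet the weight still settles near the target.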

SGD with momentum accumulates a velocity vector v to smooth out oscillations:

v ← β · v + ∇L(θ)
θ ← θ − η · v

A typical value is β = 0.9. Momentum helps push through flat regions and dampens zig-zagging in narrow valleys.
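To make the narrow-valley effect concrete, here is a small sketch that runs the two-line momentum update on a toy quadratic with very different curvature along each axis. The loss function, the curvature values (1 and 50), the learning rate, and the step count are illustrative assumptions; other settings can behave differently.

```python
def grad(theta, curvatures):
    """Gradient of L(theta) = 0.5 * sum(c_i * theta_i ** 2)."""
    return [c * t for c, t in zip(curvatures, theta)]

def loss(theta, curvatures):
    return 0.5 * sum(c * t * t for c, t in zip(curvatures, theta))

def momentum_step(theta, v, g, lr, beta):
    """v <- beta * v + g, then theta <- theta - lr * v."""
    v = [beta * vi + gi for vi, gi in zip(v, g)]
    theta = [t - lr * vi for t, vi in zip(theta, v)]
    return theta, v

def run(beta, steps=200, lr=0.02, curvatures=(1.0, 50.0)):
    theta, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        theta, v = momentum_step(theta, v, grad(theta, curvatures), lr, beta)
    return loss(theta, curvatures)

plain_loss = run(beta=0.0)     # beta = 0 reduces to plain gradient descent
momentum_loss = run(beta=0.9)  # same learning rate, with velocity
print(plain_loss, momentum_loss)
```

On this particular problem the momentum run reaches a lower loss in the same number of steps, because the velocity keeps building along the low-curvature direction where plain gradient descent makes slow progress.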

Adam combines momentum with per-parameter adaptive learning rates using first and second moment estimates:

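m ← β₁·m + (1−β₁)·g
v ← β₂·v + (1−β₂)·g²
m̂ ← m / (1−β₁ᵗ)
v̂ ← v / (1−β₂ᵗ)
θ ← θ − η · m̂ / (√v̂ + ε)

Here g = ∇L(θ) is the current gradient, t is the step count starting at 1, and m̂, v̂ are bias-corrected estimates that compensate for m and v being initialized at zero.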
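A minimal sketch of one Adam step in plain Python, with m̂ = m / (1 − β₁ᵗ) and v̂ = v / (1 − β₂ᵗ) as the bias-corrected estimates. The function name `adam_step` and the default hyperparameters (η = 1e-3, β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are assumptions chosen for illustration.

```python
import math

def adam_step(theta, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over lists of floats; t is the 1-based step count."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction: m and v start at zero, so early estimates are too small.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Two parameters whose gradients differ by four orders of magnitude.
theta, m, v = adam_step([1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [100.0, 0.01], t=1)
print(theta)  # both parameters move by roughly lr on the first step
```

Dividing by √v̂ is what makes the step size per-parameter: on the first step both coordinates move by about η even though their raw gradients are far apart.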

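AdamW decouples weight decay from the gradient update: the decay term is applied directly to the weights instead of being added to the gradient, so it is not rescaled by the adaptive denominator. It is a common default for training transformers.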
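The difference between the two ways of applying decay can be sketched for a single scalar parameter. The helper names (`adam_direction`, `adam_l2_step`, `adamw_step`), the decay coefficient `wd`, and the numbers in the demo are hypothetical; the decoupled form used here, θ ← θ − η·(m̂ / (√v̂ + ε) + λ·θ), is one common way to write the AdamW update.

```python
import math

def adam_direction(m, v, g, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the normalized Adam direction and the updated moments."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v

def adam_l2_step(theta, m, v, g, t, lr, wd):
    """Coupled: the decay term is folded into the gradient, then normalized."""
    step, m, v = adam_direction(m, v, g + wd * theta, t)
    return theta - lr * step, m, v

def adamw_step(theta, m, v, g, t, lr, wd):
    """Decoupled: decay acts on the weight directly, outside the normalization."""
    step, m, v = adam_direction(m, v, g, t)
    return theta - lr * step - lr * wd * theta, m, v

# A large weight with zero loss gradient, so only the decay term acts.
coupled, _, _ = adam_l2_step(100.0, 0.0, 0.0, 0.0, t=1, lr=0.01, wd=0.1)
decoupled, _, _ = adamw_step(100.0, 0.0, 0.0, 0.0, t=1, lr=0.01, wd=0.1)
print(coupled, decoupled)
```

In the coupled version the decay is divided by √v̂ like any other gradient, so the weight shrinks by about η regardless of its size. In the decoupled version the shrinkage is η·λ·θ, proportional to the weight itself.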

Rule of thumb: use AdamW with a cosine decay schedule and a warmup phase covering the first 1–5% of training steps. It's not always optimal, but it's almost never bad.
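One way to write such a schedule is sketched below. The linear shape of the warmup, the function name `lr_at`, the `min_lr` floor, and the example numbers (1,000 steps, 30 of them warmup, peak rate 1e-3) are assumptions for illustration.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at a 0-based step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # progress runs from 0 at the end of warmup to 1 at total_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total_steps = 1000
warmup_steps = 30  # 3% of training, inside the 1-5% range
for step in (0, 29, 30, 515, 1000):
    print(step, lr_at(step, total_steps, warmup_steps, peak_lr=1e-3))
```

The rate climbs to the peak during warmup, passes through half the peak midway through the decay phase, and approaches `min_lr` at the end of training.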

Quiz: Check your understanding

Adam maintains running estimates of:

The 'W' in AdamW stands for the fact that it:

A typical momentum coefficient β for SGD is:

Cosine decay schedules the LR to:

Gradient clipping is most useful when:

Compared to vanilla SGD, Adam typically:

Warmup is especially important when: