Optimization
How we actually descend the loss landscape — from vanilla SGD to AdamW.
Stochastic Gradient Descent
The simplest update rule:
θ ← θ − η · ∇L(θ; batch)
where η is the learning rate and the gradient is computed on a single mini-batch. The noise this introduces can help the optimizer escape sharp minima.
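As a rough illustration of the update rule, here is a minimal pure-Python sketch that fits a 1-D linear model with mini-batch SGD. The toy data (y = 3x + 1), the helper names `sgd_step`, `minibatch_grad`, and `fit_sgd`, and all hyperparameter values are assumptions made for this example only.

```python
import random


def sgd_step(theta, grad, lr):
    """One SGD update: theta <- theta - lr * grad, elementwise."""
    return [t - lr * g for t, g in zip(theta, grad)]


def minibatch_grad(theta, batch):
    """Gradient of mean squared error for the model y = w * x + b."""
    w, b = theta
    n = len(batch)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return [grad_w, grad_b]


def fit_sgd(steps=2000, lr=0.05, batch_size=8, seed=0):
    """Fit [w, b] on noiseless toy data drawn from y = 3x + 1."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(64)]
    data = [(x, 3.0 * x + 1.0) for x in xs]
    theta = [0.0, 0.0]
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # a fresh mini-batch each step
        theta = sgd_step(theta, minibatch_grad(theta, batch), lr)
    return theta


if __name__ == "__main__":
    w, b = fit_sgd()
    print(f"w = {w:.4f}, b = {b:.4f}")
```

Each step sees only 8 of the 64 points, so individual gradients are noisy estimates of the full-data gradient, yet the parameters should still settle close to w = 3 and b = 1 on this easy problem.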
Momentum
Accumulate a velocity vector to smooth out oscillations:
v ← β · v + ∇L(θ)
θ ← θ − η · v
Typical β = 0.9. Helps push through flat regions and dampens zig-zagging in narrow valleys.
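The effect is easiest to see on a toy problem. The sketch below applies the two update lines above to an elongated quadratic bowl, L(x, y) = 0.5 · (x² + 50 · y²), which is an assumption chosen for illustration, as are the function names and the step size.

```python
def momentum_step(theta, v, grad, lr, beta=0.9):
    """v <- beta * v + grad, then theta <- theta - lr * v."""
    v = [beta * vi + gi for vi, gi in zip(v, grad)]
    theta = [ti - lr * vi for ti, vi in zip(theta, v)]
    return theta, v


def grad_valley(theta):
    """Gradient of L(x, y) = 0.5 * (x**2 + 50 * y**2): flat in x, steep in y."""
    x, y = theta
    return [x, 50.0 * y]


def run(beta, steps=100, lr=0.02):
    """Run `steps` updates from a fixed start; beta = 0 gives plain gradient descent."""
    theta, v = [10.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        theta, v = momentum_step(theta, v, grad_valley(theta), lr, beta)
    return theta


if __name__ == "__main__":
    print("no momentum:", run(beta=0.0))
    print("beta = 0.9: ", run(beta=0.9))
```

With the same learning rate and step budget, the run with β = 0.9 should end much closer to the minimum along the flat x direction, because the velocity keeps accumulating small gradients that point the same way.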
Adam & AdamW
Adam combines momentum with per-parameter adaptive learning rates using first and second moment estimates:
m ← β₁·m + (1−β₁)·g
v ← β₂·v + (1−β₂)·g²
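θ ← θ − η · m̂ / (√v̂ + ε)
Here g is the current gradient, and m̂ = m / (1 − β₁ᵗ) and v̂ = v / (1 − β₂ᵗ) are the bias-corrected moment estimates at step t. AdamW decouples weight decay from the gradient update: the decay shrinks the weights directly instead of being added to the gradient as an L2 term. It is the default choice for training transformers.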
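To make the moment updates and the decoupled decay concrete, here is a small pure-Python sketch of a single step. The function name `adamw_step`, its signature, and the default hyperparameters are assumptions for this example and do not mirror any particular library; setting `weight_decay=0.0` reduces it to plain Adam.

```python
import math


def adamw_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One Adam/AdamW step on lists of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, mi, vi, g in zip(theta, m, v, grad):
        mi = beta1 * mi + (1.0 - beta1) * g        # first moment (mean of g)
        vi = beta2 * vi + (1.0 - beta2) * g * g    # second moment (mean of g^2)
        m_hat = mi / (1.0 - beta1 ** t)            # bias correction
        v_hat = vi / (1.0 - beta2 ** t)
        p = p - lr * weight_decay * p              # decoupled weight decay
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_theta.append(p)
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v


if __name__ == "__main__":
    theta, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
    theta, m, v = adamw_step(theta, m, v, grad=[0.5, -4.0], t=1, lr=0.1)
    print(theta)
```

On the first step the bias-corrected ratio m̂ / √v̂ is close to the sign of the gradient, so both parameters move by roughly η even though their gradients differ by a factor of eight. That is the per-parameter scaling at work. The decay line never touches m or v, which is what "decoupled" means here.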
Learning Rate Schedules
- Warmup — start small, ramp up linearly over the first few thousand steps.
- Cosine decay — smoothly anneal toward zero over training.
- Step decay — drop the LR by 10× at fixed milestones.
Rule of thumb: use AdamW with cosine decay and 1–5% warmup. It's not always optimal, but it's almost never bad.
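The schedules listed above can be sketched as plain functions of the step index. The names `lr_at` and `step_decay`, the 3% warmup default, and the choice to count warmup as part of the total step budget are assumptions for this example.

```python
import math


def lr_at(step, total_steps, base_lr, warmup_frac=0.03, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay toward min_lr."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        # Ramp linearly; the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def step_decay(step, base_lr, milestones, gamma=0.1):
    """Multiply the LR by gamma (a 10x drop by default) at each milestone passed."""
    drops = sum(1 for milestone in milestones if step >= milestone)
    return base_lr * gamma ** drops


if __name__ == "__main__":
    for s in (0, 29, 30, 515, 1000):
        print(s, lr_at(s, total_steps=1000, base_lr=3e-4))
```

In a training loop, the returned value would be written into the optimizer's learning rate before each update. Keeping the schedule as a pure function of the step makes it easy to plot and check before committing to a long run.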