Optimization

How we actually descend the loss landscape — from vanilla SGD to AdamW.

Stochastic Gradient Descent

The simplest update rule:

θ ← θ − η · ∇L(θ; batch)

where η is the learning rate and the gradient is computed on a mini-batch instead of the full dataset. The resulting gradient noise can help the optimizer escape sharp minima.
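
As a minimal sketch of the update rule above, the following fits a line by mini-batch SGD on mean squared error. The function name `sgd_linear_fit`, the toy data, and the hyperparameter values are illustrative assumptions, not part of any library.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, batch_size=8, epochs=200, seed=0):
    """Fit y ~ w*x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a fresh random batching every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w*x + b - y)^2) over this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # theta <- theta - lr * grad
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Hypothetical toy data: y = 3x - 1 sampled on [-1, 1)
xs = [i / 32 - 1.0 for i in range(64)]
ys = [3.0 * x - 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
```

Because the toy data is noise-free, every batch agrees on the same optimum and the fit should approach w = 3, b = −1; with noisy data the iterates would keep jittering around the optimum at a scale set by η.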

Momentum

Accumulate a velocity vector to smooth out oscillations:

v ← β · v + ∇L(θ)
θ ← θ − η · v

A typical value is β = 0.9. Momentum helps push through flat regions and dampens zig-zagging in narrow valleys.
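
To illustrate, here is a small sketch that runs the two-line momentum update on an ill-conditioned quadratic (a "narrow valley"). The loss, starting point, and step count are assumptions chosen for the demonstration; setting β = 0 recovers plain gradient descent.

```python
def grad(theta):
    """Gradient of the toy loss L(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    x, y = theta
    return [x, 25.0 * y]

def loss(theta):
    x, y = theta
    return 0.5 * (x * x + 25.0 * y * y)

def run(beta, lr=0.03, steps=100):
    theta = [10.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(theta)
        # v <- beta * v + grad
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        # theta <- theta - lr * v
        theta = [ti - lr * vi for ti, vi in zip(theta, v)]
    return theta

plain_loss = loss(run(beta=0.0))      # no momentum
momentum_loss = loss(run(beta=0.9))   # typical momentum coefficient
```

With these settings the momentum run should end at a lower loss than the plain run. Part of that gain comes from the larger effective step size, roughly η / (1 − β) along directions where the gradient is consistent, so a fair comparison in practice would retune η for each setting.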

Adam & AdamW

Adam combines momentum with per-parameter adaptive learning rates using first and second moment estimates:

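m ← β₁·m + (1−β₁)·g
v ← β₂·v + (1−β₂)·g²
m̂ ← m / (1−β₁ᵗ)
v̂ ← v / (1−β₂ᵗ)
θ ← θ − η · m̂ / (√v̂ + ε)

where g = ∇L(θ) is the current gradient, t is the step count, and m̂, v̂ are the bias-corrected moments. The correction matters early in training: m and v start at zero, so the raw averages are biased toward zero.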
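
The Adam update can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a library implementation: `adam_step`, the toy loss, and the chosen learning rate are assumptions, and m̂, v̂ are taken to be the standard bias-corrected moments m / (1−β₁ᵗ) and v / (1−β₂ᵗ).

```python
import math

def adam_step(theta, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a list of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for th, g, mi, vi in zip(theta, grads, m, v):
        mi = beta1 * mi + (1 - beta1) * g        # first-moment estimate
        vi = beta2 * vi + (1 - beta2) * g * g    # second-moment estimate
        m_hat = mi / (1 - beta1 ** t)            # bias correction
        v_hat = vi / (1 - beta2 ** t)
        new_theta.append(th - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v

# Hypothetical toy loss: sum((theta_i - target_i)^2)
target = [1.0, 2.0]

def loss(theta):
    return sum((th - tg) ** 2 for th, tg in zip(theta, target))

def grad(theta):
    return [2 * (th - tg) for th, tg in zip(theta, target)]

theta, m, v = [5.0, -3.0], [0.0, 0.0], [0.0, 0.0]
initial_loss = loss(theta)
first_theta, _, _ = adam_step(theta, grad(theta), m, v, t=1, lr=0.1)
for t in range(1, 501):
    theta, m, v = adam_step(theta, grad(theta), m, v, t, lr=0.1)
final_loss = loss(theta)
```

On the first step m̂ / √v̂ is approximately sign(g), so each parameter moves by about η regardless of how large its gradient is. That per-parameter normalization is what "adaptive learning rate" refers to here.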

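AdamW decouples weight decay from the gradient update: instead of adding an L2 penalty term to g, where it would be rescaled by the adaptive denominator, it shrinks the weights directly at each step. It is a common default for training transformers.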
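
The difference between the two forms of decay is easiest to see on a single step. The sketch below assumes one common convention in which the decoupled decay is scaled by the learning rate (θ ← θ − η·λ·θ); the helper names `adam_direction`, `adam_l2_step`, and `adamw_step` are hypothetical.

```python
import math

def adam_direction(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (m_hat / (sqrt(v_hat) + eps), m, v) for a scalar gradient g."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v

def adam_l2_step(theta, g, m, v, t, lr, wd):
    """Adam with L2 regularization: the decay term is folded into the gradient."""
    direction, m, v = adam_direction(g + wd * theta, m, v, t)
    return theta - lr * direction, m, v

def adamw_step(theta, g, m, v, t, lr, wd):
    """AdamW: moments see only the loss gradient; decay acts on theta directly."""
    direction, m, v = adam_direction(g, m, v, t)
    return theta - lr * direction - lr * wd * theta, m, v

# One step from theta = 2.0 with a zero loss gradient, so only the decay acts.
theta0, lr, wd = 2.0, 0.1, 0.01
l2_theta, _, _ = adam_l2_step(theta0, 0.0, 0.0, 0.0, 1, lr, wd)
w_theta, _, _ = adamw_step(theta0, 0.0, 0.0, 0.0, 1, lr, wd)
```

In the L2 variant the decay term passes through the adaptive normalization, so the first step moves θ by roughly η (here from 2.0 to about 1.9) no matter how small λ is. The decoupled variant shrinks θ by exactly η·λ·θ (here to 1.998), which keeps the regularization strength independent of the gradient statistics.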

Learning Rate Schedules

  • Warmup — start small, ramp up linearly over the first few thousand steps.
  • Cosine decay — smoothly anneal toward zero over training.
  • Step decay — drop the LR by 10× at fixed milestones.

Rule of thumb: use AdamW with cosine decay and 1–5% warmup. It's not always optimal, but it's almost never bad.
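
The schedules listed above can be written as plain functions of the step index. This is a sketch under assumed conventions: the function names, the `min_lr` floor, and the example values (10,000 total steps, 3% warmup) are illustrative and not tied to a particular framework.

```python
import math

def warmup_cosine_lr(step, total_steps, base_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        # Ramp linearly; the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr past the end of training
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def step_decay_lr(step, base_lr, milestones, factor=0.1):
    """Multiply the LR by `factor` (10x drop by default) at each milestone passed."""
    drops = sum(1 for milestone in milestones if step >= milestone)
    return base_lr * factor ** drops

# Example: 10,000 steps with 3% warmup and a peak LR of 1e-3.
total, warmup, peak = 10_000, 300, 1e-3
lr_start = warmup_cosine_lr(0, total, peak, warmup)
lr_peak = warmup_cosine_lr(warmup, total, peak, warmup)
lr_mid = warmup_cosine_lr(warmup + (total - warmup) // 2, total, peak, warmup)
lr_end = warmup_cosine_lr(total, total, peak, warmup)
```

In a training loop the returned value would be assigned to the optimizer's learning rate before each step. The cosine curve spends more steps near the peak and near zero than a linear ramp would, which is one commonly cited reason for its popularity.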

Quiz: Check your understanding
01

Adam maintains running estimates of:

02

The 'W' in AdamW stands for the fact that it:

03

A typical momentum coefficient β for SGD is:

04

Cosine decay schedules the LR to:

05

Gradient clipping is most useful when:

06

Compared to vanilla SGD, Adam typically:

07

Warmup is especially important when: