Gradient Descent
The geometric intuition behind how networks learn — walking downhill on the loss surface, with the right step size and a little bit of inertia.
The Loss Landscape
Picture the loss L(θ) as a hilly terrain in a very high-dimensional space. Every point is a setting of the model's parameters; the height is how badly it performs. Training is just walking downhill until you reach a valley.
The gradient ∇L(θ) points uphill — the direction of steepest increase. So we step in the opposite direction:
θ ← θ − η · ∇L(θ)
Here η is the learning rate, the size of each step. That's it. That's the whole algorithm. Everything else is variations on how to compute the gradient, how big a step to take, and whether to remember where you've been.
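As a rough illustration, the update rule can be sketched in a few lines of plain Python. The toy loss, `toy_grad`, and `grad_descent` are hypothetical names chosen for this example; a real model would get its gradient from automatic differentiation rather than a hand-written formula.

```python
def grad_descent(grad, theta, lr=0.1, steps=100):
    """Repeatedly apply theta <- theta - lr * grad(theta)."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


# Toy loss (an assumption for illustration):
# L(theta) = (theta0 - 3)^2 + (theta1 + 1)^2, with its minimum at (3, -1).
def toy_grad(theta):
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]


theta_final = grad_descent(toy_grad, [0.0, 0.0])
print(theta_final)  # ends up very close to [3, -1]
```

Each step shrinks the distance to the minimum by a constant factor here, because the toy loss is a simple bowl; real loss surfaces are far less cooperative.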
Batch, Stochastic, Mini-batch
- Batch GD: compute the gradient over the entire dataset. Smooth and accurate, but slow and memory-hungry.
- Stochastic GD: one sample at a time. Fast and noisy. The noise can help the optimizer escape sharp or poor local minima.
- Mini-batch GD: somewhere in between (typically 32–4096 samples). The usual default.
The noise from mini-batching is a feature, not a bug — it acts like an implicit regularizer.
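The three variants differ only in how many samples feed each gradient estimate. The sketch below makes that concrete on a made-up one-parameter regression problem; the dataset, `mse_grad`, and `train` are assumptions for illustration, not part of any library.

```python
import random


def mse_grad(w, batch):
    """Gradient of the mean squared error for the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


random.seed(0)
# Hypothetical dataset: y = 2x plus a little Gaussian noise.
data = [(i / 100, 2.0 * (i / 100) + random.gauss(0, 0.1)) for i in range(1, 201)]


def train(batch_size, lr=0.05, epochs=20):
    w = 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for i in range(0, len(data), batch_size):
            w -= lr * mse_grad(w, data[i:i + batch_size])
    return w


w_full = train(batch_size=len(data))  # batch GD: 1 update per epoch
w_mini = train(batch_size=32)         # mini-batch GD: 7 updates per epoch
w_sgd = train(batch_size=1)           # stochastic GD: 200 updates per epoch
```

With the same number of passes over the data, the smaller batches take many more (noisier) steps, which is why they usually get closer to the true slope of 2 in this toy setup.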
Why You Need Momentum
Vanilla GD gets stuck in two failure modes:
- Narrow ravines — the gradient zig-zags across the steep walls instead of moving along the floor.
- Flat plateaus — gradients are tiny; progress crawls.
Momentum fixes both by maintaining a running velocity:
v ← β · v + ∇L(θ)
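θ ← θ − η · v
Think of a ball rolling downhill instead of a hiker taking discrete steps. Oscillating components cancel out across updates; consistent components accumulate. A typical β = 0.9 means each step retains 90% of the previous velocity. Under a constant gradient the velocity converges to ∇L(θ) / (1 − β), so β = 0.9 can amplify the effective step by up to 10×.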
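To see the effect, here is a minimal sketch that runs the two update equations on a hypothetical "ravine": a quadratic loss that is 50 times steeper in one direction than the other. The loss, the starting point, and the `run` helper are all assumptions made for the example.

```python
import math


# Hypothetical ravine: L(x, y) = 0.5 * (x**2 + 50 * y**2); steep in y, shallow in x.
def grad(theta):
    x, y = theta
    return (x, 50.0 * y)


def run(beta, lr=0.03, steps=200, start=(10.0, 1.0)):
    theta = start
    v = (0.0, 0.0)
    for _ in range(steps):
        g = grad(theta)
        v = tuple(beta * vi + gi for vi, gi in zip(v, g))      # v <- beta * v + grad
        theta = tuple(t - lr * vi for t, vi in zip(theta, v))  # theta <- theta - lr * v
    return math.hypot(*theta)  # distance from the minimum at (0, 0)


dist_plain = run(beta=0.0)     # beta = 0 recovers vanilla gradient descent
dist_momentum = run(beta=0.9)
print(dist_plain, dist_momentum)
```

The learning rate has to stay small enough for the steep direction, which leaves vanilla GD crawling along the shallow one; with momentum the same learning rate makes much faster progress along the ravine floor.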
The Learning Rate, Intuitively
The learning rate η is the single most important hyperparameter. It controls how big a step you take in the gradient direction.
- Too small — training crawls. Loss decreases, but takes forever.
- Just right — loss drops smoothly and quickly.
- Too large — loss bounces, plateaus, or diverges to NaN.
- Way too large — the very first step shoots you to a worse part of the landscape.
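These regimes are easy to reproduce on the simplest possible loss, L(θ) = θ². The sketch below is only a toy (the function name `final_loss` and the specific learning rates are assumptions), but the arithmetic is exact: each step multiplies θ by (1 − 2η), so training converges only when |1 − 2η| < 1.

```python
def final_loss(lr, steps=50, theta=1.0):
    """Run gradient descent on L(theta) = theta**2 and return the final loss."""
    for _ in range(steps):
        theta -= lr * 2 * theta  # the gradient of theta**2 is 2 * theta
    return theta ** 2


too_small = final_loss(1e-4)  # barely moves from the starting loss of 1.0
good = final_loss(0.3)        # theta shrinks to 0.4x its value on every step
too_large = final_loss(1.1)   # |1 - 2 * lr| > 1, so every step overshoots further
print(too_small, good, too_large)
```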
How to Find It
Run an LR range test: start at 1e-7, multiply the LR by ~1.3 each step, and plot loss against LR on a log scale. Pick an LR roughly one order of magnitude below the point where the loss first starts diverging; that is usually close to where the curve falls most steeply.
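A minimal sketch of the range test, run on a toy quadratic loss rather than a real model. The helper `lr_range_test`, the 4× blow-up threshold used to detect divergence, and the toy loss are all assumptions for illustration; in practice the loss would come from a mini-batch forward pass and the curve would be inspected by eye.

```python
def lr_range_test(loss_fn, grad_fn, theta, lr=1e-7, factor=1.3,
                  max_steps=200, blowup=4.0):
    """Grow the LR geometrically, recording (lr, loss) until the loss blows up."""
    history = []
    best = float("inf")
    for _ in range(max_steps):
        theta = theta - lr * grad_fn(theta)
        loss = loss_fn(theta)
        history.append((lr, loss))
        if loss > blowup * best:  # assumed divergence criterion
            break
        best = min(best, loss)
        lr *= factor
    return history


# Toy loss (assumption): L(theta) = 25 * theta**2, unstable once lr exceeds 0.04.
history = lr_range_test(lambda t: 25.0 * t * t, lambda t: 50.0 * t, theta=1.0)
diverge_lr = history[-1][0]
suggested_lr = diverge_lr / 10  # one order of magnitude below divergence
print(diverge_lr, suggested_lr)
```

Starting from 1e-7 with a factor of 1.3, the sweep covers seven orders of magnitude in roughly 60 steps, so the whole test is cheap compared with a training run.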
Typical starting points:
Adam / AdamW → 1e-4 to 3e-4
SGD + momentum → 1e-2 to 1e-1
Fine-tuning (LLM) → 1e-5 to 5e-5
Schedules: The LR Should Move
A constant learning rate is almost never optimal. Two patterns dominate:
- Warmup — start with a tiny LR and linearly ramp up over the first few hundred or thousand steps. Prevents early instability when activations and gradients haven't settled.
- Cosine decay — smoothly anneal the LR toward zero over the rest of training. Lets you take big confident steps early and careful refining steps late.
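Combined, the two patterns give a schedule that can be written as a single function of the step count. This is a sketch under assumed settings (the name `lr_at`, a 100-step warmup, 1000 total steps, and a peak LR of 3e-4 are all illustrative), not the exact formula used by any particular library.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


schedule = [
    lr_at(s, total_steps=1000, peak_lr=3e-4, warmup_steps=100)
    for s in range(1000)
]
```

The list `schedule` rises linearly over the first 100 steps, peaks at 3e-4, and then falls smoothly to nearly zero by the final step.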
LR ┃    ╱‾‾‾─╮
   ┃   ╱      ╲___
   ┃  ╱           ╲_____
   ┃ ╱                  ‾─╮___
   ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━▶ steps
    warmup ──── cosine decay ────
Putting It Together
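import torch

# Assumes `model` (an nn.Module) and `num_steps` (total optimizer steps) are already defined.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
)
# OneCycleLR takes over the optimizer's lr: it starts at max_lr / 25 by default,
# ramps up to max_lr, then anneals (cosine by default). With the default
# cycle_momentum=True it also cycles momentum between 0.85 and 0.95.
# Call scheduler.step() after every optimizer.step().
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,
    total_steps=num_steps,
    pct_start=0.05,  # 5% warmup
)
If your model isn't training, lower the learning rate by 10× before changing anything else. If it's training but slowly, raise it by 3× and watch what happens. Many "architecture problems" are actually LR problems in disguise.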