Gradient Descent

The geometric intuition behind how networks learn — walking downhill on the loss surface, with the right step size and a little bit of inertia.

The Loss Landscape

Picture the loss L(θ) as a hilly terrain in a very high-dimensional space. Every point is a setting of the model's parameters; the height is how badly it performs. Training is just walking downhill until you reach a valley.

The gradient ∇L(θ) points uphill — the direction of steepest increase. So we step in the opposite direction:

θ ← θ − η · ∇L(θ)

That's it. That's the whole algorithm. Everything else is variations on how to compute the gradient, how big a step to take, and whether to remember where you've been.
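To make the update rule concrete, here is a minimal sketch in plain Python. The bowl-shaped loss, its minimum at (3, -1), the learning rate, and the step count are all made up for illustration; a real model would get its gradient from backpropagation rather than a hand-written formula.

```python
def loss(theta):
    # Toy bowl-shaped loss with its minimum at (3, -1).
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

def grad(theta):
    # Analytic gradient of the toy loss above.
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]

def gradient_descent(theta, lr=0.1, steps=100):
    for _ in range(steps):
        g = grad(theta)
        # theta <- theta - lr * grad
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

theta_final = gradient_descent([0.0, 0.0])
print(theta_final, loss(theta_final))
```

Starting from the origin, the parameters should end up very close to the bottom of the bowl.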

Batch, Stochastic, Mini-batch

Batch GD — compute the gradient over the entire dataset. Smooth, accurate, slow, memory-hungry.
Stochastic GD — one sample at a time. Fast and noisy. The noise can help the optimizer escape sharp local minima and saddle points.
Mini-batch GD — somewhere in between (32–4096 samples). The universal default.

The noise from mini-batching is a feature, not a bug — it acts like an implicit regularizer.
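The trade-off between the three variants can be seen by measuring how much the gradient estimate wobbles at different batch sizes. The sketch below uses a made-up 1-D regression problem (y = 2x plus noise) and hypothetical helper names; it only illustrates the effect and is not a benchmark.

```python
import random
import statistics

random.seed(0)

# Synthetic 1-D regression data: y = 2x + noise (toy data, purely illustrative).
xs = [random.uniform(-1.0, 1.0) for _ in range(2000)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]
data = list(zip(xs, ys))

def grad_w(w, batch):
    # Gradient of 0.5 * (w*x - y)**2 with respect to w, averaged over the batch.
    return sum((w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
full_gradient = grad_w(w, data)  # batch GD: uses every sample

def gradient_spread(batch_size, trials=500):
    # Standard deviation of the gradient estimate across random batches.
    estimates = [grad_w(w, random.sample(data, batch_size)) for _ in range(trials)]
    return statistics.stdev(estimates)

spread_1 = gradient_spread(1)    # stochastic GD: one sample per estimate
spread_32 = gradient_spread(32)  # mini-batch GD: 32 samples per estimate
print(full_gradient, spread_1, spread_32)
```

The single-sample estimates scatter much more widely around the full-batch gradient than the 32-sample ones, which is the noise the text refers to.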

Why You Need Momentum

Vanilla GD gets stuck in two failure modes:

Narrow ravines — the gradient zig-zags across the steep walls instead of moving along the floor.
Flat plateaus — gradients are tiny; progress crawls.

Momentum fixes both by maintaining a running velocity:

v ← β · v + ∇L(θ)
θ ← θ − η · v

Think of a ball rolling downhill instead of a hiker taking discrete steps. Oscillating components cancel out across updates; consistent components accumulate. Typical β = 0.9 means each step remembers ~90% of the previous velocity, so under a steady gradient the velocity builds up to about 1/(1 − β) = 10× the raw gradient.
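Here is a small sketch of the two update rules side by side on a toy ravine. The loss 0.5 · (x² + 100y²), the starting point, and the learning rate of 0.019 are assumptions picked to make the effect visible; the function names are hypothetical.

```python
def grad(theta):
    # Gradient of the toy ravine loss 0.5 * (x**2 + 100 * y**2):
    # shallow along x (the ravine floor), steep along y (the walls).
    x, y = theta
    return (x, 100.0 * y)

def loss(theta):
    x, y = theta
    return 0.5 * (x ** 2 + 100.0 * y ** 2)

def run_plain(theta, lr, steps):
    for _ in range(steps):
        g = grad(theta)
        theta = (theta[0] - lr * g[0], theta[1] - lr * g[1])
    return theta

def run_momentum(theta, lr, beta, steps):
    v = (0.0, 0.0)
    for _ in range(steps):
        g = grad(theta)
        # v <- beta * v + grad ; theta <- theta - lr * v
        v = (beta * v[0] + g[0], beta * v[1] + g[1])
        theta = (theta[0] - lr * v[0], theta[1] - lr * v[1])
    return theta

start = (10.0, 1.0)
loss_plain = loss(run_plain(start, lr=0.019, steps=100))
loss_momentum = loss(run_momentum(start, lr=0.019, beta=0.9, steps=100))
print(loss_plain, loss_momentum)
```

With the same learning rate and step budget, the plain run is still creeping along the ravine floor while the momentum run has reached a much lower loss.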

The Learning Rate, Intuitively

The learning rate η is arguably the single most important hyperparameter. It controls how big a step you take in the gradient direction.

Too small — training crawls. Loss decreases, but takes forever.
Just right — loss drops smoothly and quickly.
Too large — loss bounces, plateaus, or diverges to NaN.
Way too large — the very first step shoots you to a worse part of the landscape.
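The four regimes show up even on the simplest possible loss. The sketch below runs gradient descent on L(θ) = θ², where each step multiplies θ by (1 − 2η). The specific learning rates are illustrative picks for this toy problem, not recommendations.

```python
def final_loss(lr, steps=50, theta=1.0):
    # Gradient descent on L(theta) = theta**2, whose gradient is 2 * theta.
    for _ in range(steps):
        theta = theta - lr * 2.0 * theta
    return theta ** 2

too_small = final_loss(0.001)     # theta shrinks by 0.998 per step: barely moves
just_right = final_loss(0.3)      # theta shrinks by 0.4 per step: converges fast
too_large = final_loss(0.98)      # factor -0.96: overshoots and bounces side to side
way_too_large = final_loss(1.5)   # factor -2: every step is worse than the last

print(too_small, just_right, too_large, way_too_large)
```

The starting loss is 1.0, so any run that ends above 1.0 has made things worse than doing nothing.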

How to Find It

Run an LR range test: start at 1e-7, multiply by ~1.3 each step, and plot loss vs LR on a log scale. Pick an LR roughly one order of magnitude below the point where loss first starts diverging. That usually lands near the region where the curve is falling most steeply.
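A rough sketch of the range test on a noise-free toy loss follows. The function name, the `diverge_ratio` stopping heuristic (stop once loss exceeds 4× the best loss seen), and the toy loss θ² are all assumptions for illustration. In practice the loss comes from mini-batches and is usually smoothed before reading the curve.

```python
def lr_range_test(loss_fn, grad_fn, theta, lr_start=1e-7, factor=1.3,
                  max_steps=200, diverge_ratio=4.0):
    history = []            # (lr, loss) pairs, ready for a log-scale plot
    best = float("inf")
    lr = lr_start
    for _ in range(max_steps):
        current = loss_fn(theta)
        history.append((lr, current))
        best = min(best, current)
        if current > diverge_ratio * best:
            break           # loss has blown up relative to the best seen so far
        theta = theta - lr * grad_fn(theta)
        lr *= factor        # exponentially increase the learning rate
    diverge_lr = history[-1][0]
    # Heuristic from the text: back off about one order of magnitude.
    return history, diverge_lr / 10.0

history, suggested_lr = lr_range_test(
    loss_fn=lambda t: t ** 2,
    grad_fn=lambda t: 2.0 * t,
    theta=1.0,
)
print(len(history), suggested_lr)
```

The returned `history` is what you would plot; the suggested value is only a starting point to refine by hand.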

Typical starting points:
  Adam / AdamW       → 1e-4  to  3e-4
  SGD + momentum     → 1e-2  to  1e-1
  Fine-tuning (LLM)  → 1e-5  to  5e-5

Schedules: The LR Should Move

A constant learning rate is almost never optimal. Two patterns dominate:

Warmup — start with a tiny LR and linearly ramp up over the first few hundred or thousand steps. Prevents early instability when activations and gradients haven't settled.
Cosine decay — smoothly anneal the LR toward zero over the rest of training. Lets you take big confident steps early and careful refining steps late.

         ┃   ╱‾‾‾─╮
LR       ┃  ╱     ╲___
         ┃ ╱          ╲_____
         ┃╱                 ‾─╮___
         ┗━━━━━━━━━━━━━━━━━━━━━━━━━▶ steps
          warmup  ──── cosine decay ────
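The shape in the diagram can be written as a small function of the step index. This is a minimal sketch; the function name, the 50-step warmup, and the 3e-4 peak are illustrative assumptions.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    if step < warmup_steps:
        # Linear warmup: ramps from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down toward min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [
    lr_at_step(s, total_steps=1000, warmup_steps=50, peak_lr=3e-4)
    for s in range(1000)
]
print(schedule[0], schedule[49], schedule[500], schedule[999])
```

The LR climbs to its peak at the end of warmup, then falls smoothly toward zero by the final step.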

Putting It Together

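import torch

# Assumes `model` (an nn.Module) and `num_steps` (total optimizer steps) are already defined.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,           # overwritten by OneCycleLR, which starts at max_lr / 25 by default
    momentum=0.9,
    weight_decay=1e-4,
)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,       # peak LR reached at the end of warmup
    total_steps=num_steps,
    pct_start=0.05,   # 5% warmup, then cosine annealing (the default strategy)
)
# By default OneCycleLR also cycles momentum between 0.85 and 0.95;
# pass cycle_momentum=False to hold it at 0.9.
# In the training loop, call scheduler.step() after every optimizer.step().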

If your model isn't training, lower the learning rate by 10× before changing anything else. If it's training but slowly, raise it by 3× and watch what happens. Most "architecture problems" are actually LR problems in disguise.

Quiz: Check your understanding

The gradient ∇L(θ) points in the direction of:

If your loss diverges to NaN within the first few steps, the most likely cause is:

What does momentum (β ≈ 0.9) actually accumulate?

Mini-batch gradient descent vs full-batch:

A typical starting LR for AdamW on a transformer is roughly:

What is LR warmup and why is it used?

In a narrow ravine on the loss surface, plain SGD tends to:

Cosine LR decay does what?