Regularization

The toolkit for fighting overfitting — keeping your model honest on data it hasn't seen.

The Bias-Variance Tradeoff

A model that's too simple underfits — high bias. A model that's too flexible overfits — high variance, memorizing noise instead of learning structure. Regularization gives us knobs to slide along this axis.
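For squared-error loss the tradeoff can be written out explicitly. As a sketch, assume the data follow y = f(x) + ε with zero-mean noise of variance σ², and let f̂ be the model fitted on a randomly drawn training set. These symbols are introduced here for illustration and are not part of the lesson's notation. The expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Regularization typically lowers the variance term at the cost of some added bias; the noise term is untouched by any modeling choice.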

Weight Decay (L2)

L_total = L + λ · Σ ‖w‖²

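Penalizes large weights, encouraging simpler functions. Here L is the task loss and λ sets the penalty strength. With Adam, an L2 penalty added to the loss is rescaled by the adaptive per-parameter step size, so it no longer shrinks every weight in proportion to its size. AdamW applies the decay directly to the weights, decoupled from the adaptive gradient. Typical λ ≈ 0.01–0.1.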
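The difference between a coupled L2 penalty and decoupled decay is easiest to see on a single optimizer step. The following is a minimal sketch in plain Python, not any library's implementation: the names `new_state`, `adam_l2_step`, and `adamw_step` are hypothetical, the hyperparameter defaults are illustrative, and the penalty gradient 2λw follows the formula above (frameworks often fold the factor 2 into λ).

```python
import math

def new_state(n):
    """Fresh optimizer state for n weights (step count and moment estimates)."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}

def _adam_direction(g, i, state, b1, b2, eps):
    """Bias-corrected Adam update direction for weight i given gradient g."""
    state["m"][i] = b1 * state["m"][i] + (1 - b1) * g
    state["v"][i] = b2 * state["v"][i] + (1 - b2) * g * g
    m_hat = state["m"][i] / (1 - b1 ** state["t"])
    v_hat = state["v"][i] / (1 - b2 ** state["t"])
    return m_hat / (math.sqrt(v_hat) + eps)

def adam_l2_step(w, grad, state, lr=0.1, lam=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adam with the L2 penalty folded into the gradient (coupled)."""
    state["t"] += 1
    return [
        # The penalty gradient 2*lam*wi passes through Adam's normalization.
        wi - lr * _adam_direction(gi + 2.0 * lam * wi, i, state, b1, b2, eps)
        for i, (wi, gi) in enumerate(zip(w, grad))
    ]

def adamw_step(w, grad, state, lr=0.1, lam=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adam on the task gradient, plus decay applied directly to the weights."""
    state["t"] += 1
    return [
        wi - lr * _adam_direction(gi, i, state, b1, b2, eps) - lr * lam * wi
        for i, (wi, gi) in enumerate(zip(w, grad))
    ]

# Zero task gradient isolates the effect of the regularizer.
w = [1.0, 5.0]
zero_grad = [0.0, 0.0]
coupled = adam_l2_step(w, zero_grad, new_state(2))
decoupled = adamw_step(w, zero_grad, new_state(2))
print("Adam + L2:", coupled)
print("AdamW:    ", decoupled)
```

With a zero task gradient, the coupled version moves both weights by roughly the same absolute amount (about `lr`), because Adam normalizes away the magnitude of the penalty gradient. The decoupled version shrinks each weight by the factor 1 − lr·λ, so larger weights decay more.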

Dropout

Randomly zero out a fraction p of activations during training:

h = mask ⊙ h / (1 − p)   # mask ~ Bernoulli(1−p)

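Forces the network not to rely on any single neuron, which behaves like training an ensemble of thinned sub-networks. At inference time dropout is switched off; the 1/(1 − p) scaling during training keeps the expected activation unchanged, so no correction is needed at test time. Common values: 0.1–0.5. Many modern LLMs are pretrained with little or no dropout and lean on data scale and weight decay instead.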
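The masking formula above can be sketched in a few lines of plain Python. This is an illustrative version of inverted dropout on a flat list of activations; the function name `dropout` and its `training` and `rng` parameters are assumptions made for the example, and a real model would use its framework's dropout layer.

```python
import random

def dropout(h, p, training=True, rng=None):
    """Inverted dropout: zero each activation with probability p while training,
    and scale survivors by 1 / (1 - p) so the expected value is unchanged."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(h)  # identity at inference time
    rng = rng or random.Random()
    keep = 1.0 - p
    # Each unit is kept with probability (1 - p), matching mask ~ Bernoulli(1 - p).
    return [x / keep if rng.random() < keep else 0.0 for x in h]

activations = [1.0] * 8
print(dropout(activations, p=0.5, rng=random.Random(0)))   # mix of 0.0 and 2.0
print(dropout(activations, p=0.5, training=False))         # unchanged
```

Scaling during training rather than at inference is a design choice: it keeps the inference path free of any dropout-specific logic.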

Early Stopping

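Track validation loss and stop training once it has stopped improving, keeping the checkpoint with the best validation loss. It is one of the cheapest regularizers available: it needs no change to the model or the loss, only a held-out set.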
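In practice "stops improving" needs a concrete rule, because validation loss is noisy. A common choice, assumed here rather than taken from the lesson, is a patience counter: stop after a fixed number of epochs without a new best. The sketch below replays a recorded list of validation losses; the function name `early_stop` and the `patience` and `min_delta` parameters are hypothetical.

```python
def early_stop(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops at the first epoch that is `patience` epochs past the best one.
    An improvement only counts if it beats the best loss by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch   # new best: this checkpoint is kept
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch              # patience exhausted
    return best_epoch, len(val_losses) - 1        # ran out of epochs first

losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.9]
best, stopped = early_stop(losses, patience=3)
print(f"restore checkpoint from epoch {best}, stopped at epoch {stopped}")
```

For this loss curve the best epoch is 2 and training halts at epoch 5, so the weights to restore are the ones saved three epochs before stopping.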

Data Augmentation

Vision — flips, crops, color jitter, MixUp, CutMix, RandAugment.
Text — back-translation, token masking, synonym replacement.
Audio — SpecAugment, pitch/time stretching.

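Increasing the effective dataset size often helps more than an architectural change, provided the augmentations preserve the label (or, as with MixUp and CutMix, adjust it consistently).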
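Of the techniques listed, MixUp is compact enough to sketch directly: it blends two training examples and their one-hot labels with the same mixing coefficient, usually drawn from a Beta(α, α) distribution. The version below works on plain lists; the function name `mixup`, its signature, and the default α = 0.2 are assumptions for illustration.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two examples and their one-hot labels with a shared coefficient.

    Returns the mixed input, the mixed (soft) label, and the coefficient used.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

# Two toy "images" (flattened pixels) belonging to class 0 and class 1.
x_a, y_a = [0.0, 0.0, 0.0], [1.0, 0.0]
x_b, y_b = [1.0, 1.0, 1.0], [0.0, 1.0]
x_mix, y_mix, lam = mixup(x_a, y_a, x_b, y_b, rng=random.Random(0))
print(lam, x_mix, y_mix)
```

Because the label is mixed with the same coefficient as the input, the soft label still sums to 1 and tells the model how much of each class is present.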

Label Smoothing

y_smooth = (1 − ε) · y_onehot + ε / K

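Replaces hard 0/1 targets with soft ones: the true class gets 1 − ε + ε/K and every other class gets ε/K, where K is the number of classes. This reduces overconfidence and often improves calibration. It is a common ingredient in image classifier training recipes, typically with ε ≈ 0.1.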
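A small worked example makes the formula concrete. The sketch below builds the smoothed target for one example from its class index; the function name `smooth_labels` and its parameters are hypothetical, and frameworks usually expose the same idea as an option on their cross-entropy loss.

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Smoothed target vector: (1 - eps) * one_hot(target) + eps / num_classes."""
    if not 0 <= target < num_classes:
        raise ValueError("target must be a valid class index")
    off_value = eps / num_classes           # mass given to every class
    on_value = (1.0 - eps) + off_value      # true class keeps the rest
    return [on_value if k == target else off_value for k in range(num_classes)]

# K = 4 classes, true class 2, eps = 0.1:
# true class gets 0.9 + 0.025 = 0.925, every other class gets 0.025.
print(smooth_labels(2, 4, eps=0.1))
```

The smoothed vector still sums to 1, so it remains a valid target distribution for cross-entropy.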

Rule of thumb: more data > augmentation > weight decay > dropout. Work down that list until the gap between training and validation loss closes.

Quiz: Check your understanding

A model with low training loss but high validation loss is most likely:

L2 regularization (weight decay) encourages:

Why use AdamW instead of Adam + L2 penalty?

Dropout is typically applied:

Which is generally the most effective regularizer for vision models?

Early stopping works by:

Label smoothing with ε replaces hard targets to:

MixUp augmentation combines: