Regularization
The toolkit for fighting overfitting — keeping your model honest on data it hasn't seen.
The Bias-Variance Tradeoff
A model that's too simple underfits — high bias. A model that's too flexible overfits — high variance, memorizing noise instead of learning structure. Regularization gives us knobs to slide along this axis.
Weight Decay (L2)
L_total = L + λ · Σ ‖w‖²
Penalizes large weights, encouraging simpler functions. Use AdamW so decay is decoupled from the adaptive gradient; typical λ ≈ 0.01–0.1.
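To make the AdamW point concrete, the sketch below runs a single Adam update on one scalar weight in two modes. It is a minimal illustration under stated assumptions, not a library implementation: `adam_step` and its parameters `l2` and `decay` are names invented here, the factor of 2 from differentiating λ · w² is folded into `l2`, and the decoupled branch follows the AdamW form in which the decay term sits outside the adaptive scaling.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decay=0.0):
    """One Adam update on a scalar weight (illustrative sketch).

    l2    : coupled L2 coefficient, added to the gradient before scaling.
    decay : decoupled weight decay coefficient, applied directly to w.
    """
    g = g + l2 * w                       # coupled: penalty gradient is rescaled by Adam
    m = b1 * m + (1 - b1) * g            # first-moment estimate
    v = b2 * v + (1 - b2) * g * g        # second-moment estimate
    m_hat = m / (1 - b1 ** t)            # bias correction for step t (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    adaptive = m_hat / (math.sqrt(v_hat) + eps)
    w = w - lr * adaptive - lr * decay * w   # decoupled: shrink w directly
    return w, m, v

# Zero data gradient, so only the regularizer moves the weight.
w_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, l2=0.1)
w_wd, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decay=0.1)
print(w_l2, w_wd)
```

With a zero data gradient, the coupled version moves the weight by roughly the full learning rate whatever value `l2` takes, because Adam normalizes the penalty gradient. The decoupled version shrinks the weight by lr · decay · w, which is the behavior the penalty was meant to have.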
Dropout
Randomly zero out a fraction p of activations during training:
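h = mask ⊙ h / (1 − p) # mask ~ Bernoulli(1−p)
Forces the network not to rely on any single neuron, much like training an ensemble. Dividing by 1 − p keeps the expected activation unchanged, so nothing needs rescaling at inference, when dropout is switched off. Common values of p: 0.1–0.5. Largely replaced by other regularizers in modern LLMs.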
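The masking rule can be sketched in plain Python on a list of activations. This is an illustrative version only: the function name `dropout` and its `training` and `rng` parameters are assumptions made for the example, and a real framework would operate on tensors instead of lists.

```python
import random

def dropout(h, p, training=True, rng=random):
    """Inverted dropout on a list of activations (illustrative sketch)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(h)                       # dropout is a no-op at inference
    keep = 1.0 - p
    # Keep each unit with probability 1 - p, then rescale survivors by 1 / (1 - p).
    return [x / keep if rng.random() < keep else 0.0 for x in h]

rng = random.Random(0)                       # fixed seed, for reproducibility only
out = dropout([1.0] * 10, p=0.5, rng=rng)
print(out)                                   # each entry is either 0.0 or 2.0
```

Because survivors are scaled up during training, the same function with `training=False` simply returns the activations unchanged.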
Early Stopping
Track validation loss and stop training once it stops improving. It adds almost no cost and is often among the most effective regularizers available.
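One common way to decide that validation loss has "stopped improving" is a patience counter. The sketch below assumes that approach: `early_stopping`, `patience`, and `min_delta` are hypothetical names chosen for the example, and the list of losses stands in for values a real training loop would produce one epoch at a time.

```python
def early_stopping(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops after `patience` consecutive epochs without an
    improvement of more than `min_delta` over the best loss so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch     # stop here, restore best_epoch weights
    return best_epoch, len(val_losses) - 1   # ran out of epochs without stopping

losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
print(early_stopping(losses, patience=2))    # (2, 4)
```

In practice the weights saved at `best_epoch` are the ones to keep, since the epochs after it were spent confirming that the loss was no longer falling.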
Data Augmentation
- Vision — flips, crops, color jitter, MixUp, CutMix, RandAugment.
- Text — back-translation, token masking, synonym replacement.
- Audio — SpecAugment, pitch/time stretching.
Doubling the effective dataset size often does more for generalization than an architectural change.
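As one concrete case from the vision list, MixUp blends two training examples and their labels with the same mixing weight. The sketch below is a minimal version on plain lists: the function name `mixup`, the default `alpha=0.2`, and the optional `lam` override are assumptions for illustration, with the weight drawn from a Beta(alpha, alpha) distribution when not supplied.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, lam=None, rng=random):
    """Blend two examples and their one-hot labels (illustrative sketch).

    lam is drawn from Beta(alpha, alpha) unless given explicitly.
    """
    if lam is None:
        lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]   # mixed input
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]   # mixed soft label
    return x, y, lam

x, y, lam = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], lam=0.75)
print(x, y)   # [0.25, 0.75] [0.75, 0.25]
```

The mixed label is a soft target, so the loss used with it has to accept probability vectors instead of class indices.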
Label Smoothing
y_smooth = (1 − ε) · y_onehot + ε / K
Replaces hard 0/1 targets with soft ones, where K is the number of classes and ε is a small constant (commonly 0.1). Reduces overconfidence and improves calibration; widely used in modern image classifiers.
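The smoothing formula is easy to check by hand for a single example. The sketch below builds the smoothed target from a class index; `smooth_labels`, `target`, and `num_classes` are names assumed for the example, with `num_classes` playing the role of K.

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Return the smoothed target distribution for one class index."""
    if not 0 <= target < num_classes:
        raise ValueError("target must be a valid class index")
    off = eps / num_classes                  # mass spread over every class
    y = [off] * num_classes
    y[target] += 1.0 - eps                   # true class gets (1 - eps) + eps / K
    return y

print(smooth_labels(2, num_classes=4, eps=0.1))
```

For K = 4 and ε = 0.1 the true class ends up near 0.925 and each other class near 0.025, and the entries still sum to 1.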
Rule of thumb: more data > augmentation > weight decay > dropout. Apply in that order until validation loss is happy.
A model with low training loss but high validation loss is most likely:
L2 regularization (weight decay) encourages:
Why use AdamW instead of Adam + L2 penalty?
Dropout is typically applied:
Which is generally the most effective regularizer for vision models?
Early stopping works by:
Label smoothing with ε replaces hard targets to:
MixUp augmentation combines: