Normalization
Keeping activations well-behaved so deep networks actually train.
Why Normalize?
As signals propagate through a deep network, the distribution of each layer's inputs keeps shifting while earlier layers update, an effect known as internal covariate shift. Activations can explode or vanish, gradients follow, and training stalls. Normalization re-centers and re-scales activations to keep the optimization landscape friendly.
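The explode-or-vanish behavior can be reproduced in a toy setting. The sketch below is only an illustration, assuming a stack of random linear layers with no nonlinearity; `final_rms`, `gain`, `depth`, and `width` are hypothetical names chosen for this example.

```python
import numpy as np

def final_rms(gain, normalize, depth=50, width=256, seed=0):
    """Push a random vector through `depth` random linear layers and
    return the root-mean-square of the final activations."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)
    for _ in range(depth):
        # Weights scaled so each layer multiplies the activation scale by ~gain.
        W = rng.standard_normal((width, width)) * gain / np.sqrt(width)
        h = W @ h
        if normalize:
            # Re-center and re-scale the activations after every layer.
            h = (h - h.mean()) / np.sqrt(h.var() + 1e-5)
    return float(np.sqrt(np.mean(h ** 2)))

exploding = final_rms(gain=1.5, normalize=False)  # scale grows layer by layer
vanishing = final_rms(gain=0.5, normalize=False)  # scale shrinks layer by layer
stable_hi = final_rms(gain=1.5, normalize=True)   # scale stays close to 1
stable_lo = final_rms(gain=0.5, normalize=True)   # scale stays close to 1
```

Without normalization the activation scale compounds geometrically with depth; with it, the scale is reset after every layer regardless of the gain.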
Batch Normalization
Normalize each feature across the batch dimension, then learn an affine transform:
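μ_B = mean(x, dim=batch)
σ²_B = var(x, dim=batch)
x̂ = (x − μ_B) / sqrt(σ²_B + ε)
y = γ · x̂ + β
Works well in CNNs. Weaknesses: it behaves differently at training and inference time (batch statistics vs. running averages), and it degrades at small batch sizes, where the batch statistics become noisy.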
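To make the train/inference difference concrete, here is a minimal NumPy sketch of BatchNorm for inputs of shape (batch, features). The class name, the `momentum` update rule, and the use of the biased variance for the running estimate are assumptions of this sketch; conventions vary between libraries.

```python
import numpy as np

class BatchNorm1d:
    """Minimal BatchNorm sketch for inputs of shape (batch, features)."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)         # learnable scale
        self.beta = np.zeros(num_features)         # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.eps = eps
        self.momentum = momentum                   # assumed update rate

    def __call__(self, x, training):
        if training:
            # Statistics come from the current batch (axis 0 = batch).
            mu = x.mean(axis=0)
            var = x.var(axis=0)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference uses the running averages collected during training.
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 4)) * 3 + 2

bn = BatchNorm1d(4)
y_train = bn(x, training=True)       # normalized with batch statistics
y_eval = bn(x, training=False)       # normalized with running statistics
y_single = bn(x[:1], training=True)  # batch of 1: x equals its own mean
```

With a batch of one, every feature equals its batch mean, so `y_single` collapses to `beta` and the input is erased. This is the extreme case of the small-batch weakness.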
Layer Normalization
Normalize across the feature dimension within a single sample — no batch dependency:
y = γ · (x − mean(x)) / sqrt(var(x) + ε) + β
The default choice in transformers. It works identically at training and inference time, and at batch size 1.
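A minimal NumPy sketch of LayerNorm follows, assuming the features live on the last axis and `eps` is a small constant inside the square root. It also checks the batch-independence claim: a sample gets the same output whether it is processed alone or inside a batch.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are computed per sample, over the feature axis only.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16)) * 3 + 2   # (batch, features)
gamma, beta = np.ones(16), np.zeros(16)

y_batch = layer_norm(x, gamma, beta)        # whole batch at once
y_single = layer_norm(x[:1], gamma, beta)   # first sample on its own
```

Because no statistic crosses the batch axis, there is no running average to maintain and no separate inference mode.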
RMSNorm
Drop the mean-subtraction; just rescale by the root-mean-square:
y = γ · x / sqrt( mean(x²) + ε )
Used in LLaMA, T5, and many modern LLMs. Cheaper than LayerNorm because it skips the mean computation, and generally reported to match its quality.
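The formula translates almost directly into NumPy. This is a sketch, assuming features on the last axis; the function names and the `eps` default are choices made for this example. It also illustrates the relationship to LayerNorm: on input that already has zero mean, the two coincide when β = 0.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Rescale by the root-mean-square over the feature axis; no centering.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def layer_norm_no_bias(x, gamma, eps=1e-6):
    # LayerNorm without the shift term, for comparison only.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16)) * 3 + 2
gamma = np.ones(16)

y = rms_norm(x, gamma)
out_rms = np.sqrt(np.mean(y ** 2, axis=-1))   # close to 1 for every sample

# On zero-mean input, mean(x²) equals var(x), so both norms agree.
x_centered = x - x.mean(axis=-1, keepdims=True)
same = np.allclose(rms_norm(x_centered, gamma),
                   layer_norm_no_bias(x_centered, gamma))
```

The output of `rms_norm` has unit RMS but is not re-centered, so its per-sample mean is generally nonzero.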
Pre-Norm vs Post-Norm
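- Post-norm: normalize after the residual add, y = Norm(x + Sublayer(x)) (original transformer). Typically needs learning-rate warmup, or training tends to diverge.
- Pre-norm: normalize the sublayer's input, y = x + Sublayer(Norm(x)). Usually trains stably without warmup. Standard today.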
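The two placements differ only in where the norm sits relative to the residual add. The sketch below uses hypothetical names and a LayerNorm without learnable parameters. `sublayer` stands in for attention or an MLP. The zero sublayer is a probe that exposes the residual path: pre-norm passes the input through unchanged, while post-norm rewrites it at every block. That untouched path is a commonly cited reason for pre-norm's stability.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Parameter-free LayerNorm over the feature axis, for illustration.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Original transformer: the norm is applied after the residual add.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm: the norm is applied only to the sublayer's input.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16)) * 3 + 2

def zero_sublayer(h):
    # A sublayer that contributes nothing, to isolate the residual path.
    return np.zeros_like(h)

pre_out = pre_norm_block(x, zero_sublayer)    # identical to x
post_out = post_norm_block(x, zero_sublayer)  # x after normalization
```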
If your model is exploding, normalization is the first thing to check. If it's already normalized, check it again — wrong axis is the most common bug.
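A wrong-axis bug is easy to miss because the output shape is unchanged. One way to check, sketched here with a hypothetical `normalize` helper, is to inspect the statistics along the axis you intended to normalize:

```python
import numpy as np

def normalize(x, axis, eps=1e-5):
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32)) * 3 + 2   # (batch, features)

intended = normalize(x, axis=-1)  # per sample, over features (LayerNorm-style)
mistaken = normalize(x, axis=0)   # per feature, over the batch (wrong axis here)

# Both results have the same shape, so nothing fails loudly.
same_shape = intended.shape == mistaken.shape

# The per-sample mean is ~0 only when the intended axis was used.
mean_ok = float(np.abs(intended.mean(axis=-1)).max())
mean_bad = float(np.abs(mistaken.mean(axis=-1)).max())
```

Asserting that the mean is near 0 and the standard deviation near 1 along the intended axis, before any learned scale and shift, catches this class of bug early.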
BatchNorm normalizes across which dimension?
Why is LayerNorm preferred over BatchNorm in transformers?
What does RMSNorm drop compared to LayerNorm?
What problem does normalization primarily address in deep networks?
Pre-norm vs post-norm: which trains more stably without warmup?
At inference time, BatchNorm uses:
Why does BatchNorm degrade with very small batch sizes?