Normalization

Keeping activations well-behaved so deep networks actually train.

Why Normalize?

As signals propagate through a deep network, the distribution of each layer's inputs drifts as the earlier layers update, an effect often called internal covariate shift. Activations can explode or vanish, gradients follow, and training stalls. Normalization re-centers and re-scales activations, which keeps the optimization problem better conditioned.

Batch Normalization

Normalize each feature across the batch dimension, then learn an affine transform:

μ_B  = mean(x, dim=batch)
σ²_B = var(x, dim=batch)
x̂    = (x − μ_B) / sqrt( σ²_B + ε )
y    = γ · x̂ + β

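Great in CNNs. Weakness: it behaves differently at train and inference time (batch statistics during training, running averages at inference), and it breaks down at small batch sizes, where the batch statistics become noisy estimates.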
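
The train/inference split is easiest to see in code. Below is a minimal pure-Python sketch, not a reference implementation: the class name `BatchNorm1d`, the `momentum` value, and the use of the biased (divide-by-n) variance for the running average are all assumptions made for illustration, and γ and β are held fixed instead of being learned.

```python
import math


class BatchNorm1d:
    """Toy batch norm over a batch of feature vectors (lists of floats)."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum  # assumed update rate for the running statistics
        self.gamma = [1.0] * num_features  # scale; learned in a real model
        self.beta = [0.0] * num_features   # shift; learned in a real model
        self.running_mean = [0.0] * num_features
        self.running_var = [1.0] * num_features

    def __call__(self, batch, training):
        if training:
            n = len(batch)
            columns = list(zip(*batch))  # one tuple per feature, spanning the batch
            mean = [sum(col) / n for col in columns]
            var = [sum((v - m) ** 2 for v in col) / n
                   for col, m in zip(columns, mean)]
            # Blend the batch statistics into the running averages used at inference.
            k = self.momentum
            self.running_mean = [(1 - k) * r + k * m
                                 for r, m in zip(self.running_mean, mean)]
            self.running_var = [(1 - k) * r + k * s
                                for r, s in zip(self.running_var, var)]
        else:
            # Inference: no dependence on whatever else is in the batch.
            mean, var = self.running_mean, self.running_var
        return [
            [g * (v - m) / math.sqrt(s + self.eps) + b
             for v, m, s, g, b in zip(row, mean, var, self.gamma, self.beta)]
            for row in batch
        ]


bn = BatchNorm1d(num_features=2)
train_out = bn([[1.0, 10.0], [3.0, 30.0]], training=True)  # uses batch statistics
eval_out = bn([[1.0, 10.0]], training=False)               # uses running averages
```

The same input row comes out differently in the two modes, because the running averages have only seen one update. With a batch of one in training mode, every batch variance would be zero, which is the small-batch failure in its extreme form.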

Layer Normalization

Normalize across the feature dimension within a single sample — no batch dependency:

y = γ · (x − mean(x)) / sqrt( var(x) + ε ) + β

Default choice in transformers. Works identically at train and inference, and on batch size 1.
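
A minimal sketch of the per-sample computation, in plain Python. The function name `layer_norm` and the default `eps` are illustrative assumptions; the ε inside the square root guards against division by zero when all features are equal.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one sample's features to zero mean and (near) unit variance."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    # Per-feature scale and shift, as in the formula above.
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


sample = [2.0, 4.0, 6.0, 8.0]
out = layer_norm(sample, gamma=[1.0] * 4, beta=[0.0] * 4)
```

Nothing here looks at other samples, so the result for `sample` is the same whether it arrives alone or inside a batch of a thousand.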

RMSNorm

Drop the mean-subtraction; just rescale by the root-mean-square:

y = γ · x / sqrt( mean(x²) + ε )

Used in LLaMA, T5, and many other modern LLMs. Cheaper than LayerNorm because it skips the mean computation and the shift β, with little or no quality loss reported in practice.
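
The formula translates almost line for line. This is a sketch with an assumed function name (`rms_norm`) and default `eps`, not any particular library's implementation:

```python
import math


def rms_norm(x, gamma, eps=1e-6):
    """Rescale a feature vector by its root-mean-square; no centering, no shift."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]


x = [3.0, 4.0]
y = rms_norm(x, gamma=[1.0, 1.0])
```

The output has a root-mean-square of about 1, but its mean is not zero: the direction of `x` is preserved and only its scale changes.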

Pre-Norm vs Post-Norm

Post-norm: normalize after the residual add (original transformer). Deep stacks usually need learning-rate warmup, or they tend to diverge.
Pre-norm: normalize the sublayer's input and leave the residual path untouched. Usually trains stably without warmup. Standard today.
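
The difference is only where the norm sits relative to the residual add. In this sketch, `unit_rms` and `toy_sublayer` are hypothetical stand-ins (for the normalization layer and for attention or an MLP), chosen to keep the example self-contained:

```python
import math


def unit_rms(x, eps=1e-6):
    """Stand-in normalization: rescale to root-mean-square 1 (no learned scale)."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms for v in x]


def toy_sublayer(x):
    """Hypothetical stand-in for attention or an MLP."""
    return [0.5 * v for v in x]


def post_norm_block(x, sublayer, norm):
    """Original-transformer ordering: add the residual, then normalize the sum."""
    return norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_block(x, sublayer, norm):
    """Pre-norm ordering: normalize the sublayer input; the skip path is identity."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]


x = [3.0, 4.0]
post = post_norm_block(x, toy_sublayer, unit_rms)
pre = pre_norm_block(x, toy_sublayer, unit_rms)
```

In the post-norm block every layer's output passes through the norm, including the skip path. In the pre-norm block `x` reaches the output unchanged, so there is a direct identity route from the last layer back to the first, which is the usual explanation for its more forgiving training behavior.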

If your model is exploding, normalization is the first thing to check. If it's already normalized, check it again — wrong axis is the most common bug.
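
A quick way to catch a wrong-axis bug is to check the statistics of the output along the axis you meant to normalize. The helper names below are made up for this sketch; the idea is simply that per-sample normalization should leave every row with a mean near zero.

```python
import math


def standardize(values, eps=1e-5):
    """Zero-mean, unit-variance rescaling of a flat sequence of floats."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def normalize_rows(batch):
    """Per sample, across features (LayerNorm-style axis)."""
    return [standardize(row) for row in batch]


def normalize_columns(batch):
    """Per feature, across the batch (BatchNorm-style axis)."""
    columns = [standardize(col) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def max_abs_row_mean(batch):
    """Largest per-sample mean magnitude; near 0 if rows were normalized."""
    return max(abs(sum(row) / len(row)) for row in batch)


batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]]
intended = max_abs_row_mean(normalize_rows(batch))     # close to 0
mistaken = max_abs_row_mean(normalize_columns(batch))  # far from 0
```

Both calls return an output of the same shape with no error raised, which is why this bug goes unnoticed. Only the statistics check tells them apart.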

Quiz: Check your understanding

BatchNorm normalizes across which dimension?

Why is LayerNorm preferred over BatchNorm in transformers?

What does RMSNorm drop compared to LayerNorm?

What problem does normalization primarily address in deep networks?

Pre-norm vs post-norm: which trains more stably without warmup?

At inference time, which statistics does BatchNorm use?

Why does BatchNorm degrade with very small batch sizes?