Advanced·8 min read

Training Tricks

The practical knowledge that separates models that train from models that diverge.

Regularization

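Weight decay: shrinks the weights toward zero at every step. With plain SGD this is equivalent to an L2 penalty; with Adam it is not, so use AdamW, which applies the decay directly to the weights instead of routing it through the adaptive gradient update.
Dropout: randomly zeroes activations during training and is switched off at evaluation time. Less common in modern LLMs.
Data augmentation: often the most effective regularizer for vision.
Label smoothing: softens hard one-hot targets, which reduces overconfidence.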
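
A minimal sketch of how three of these regularizers might be wired together in PyTorch. The toy model, the hyperparameter values, and the choice to exempt biases from weight decay are illustrative assumptions, not recommendations from this article.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy classifier: 16 input features, 4 classes.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Dropout(p=0.1),  # only active while the model is in train() mode
    nn.Linear(32, 4),
)

# Assumed convention: decay weight matrices, leave 1-D parameters (biases) alone.
decay = [p for p in model.parameters() if p.ndim >= 2]
no_decay = [p for p in model.parameters() if p.ndim < 2]

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)

# Label smoothing is built into the cross-entropy loss.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)

x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))

model.train()
loss = loss_fn(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Call model.eval() before validation so dropout is disabled.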

Normalization

BatchNorm: normalizes each feature (channel) across the batch dimension. Standard in CNNs.
LayerNorm: normalizes across the features of each sample, independent of batch size. Standard in transformers.
RMSNorm: LayerNorm without the mean subtraction; it rescales by the root mean square only. Used in LLaMA.
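
To make the LayerNorm versus RMSNorm difference concrete, here is a small RMSNorm sketch next to PyTorch's built-in nn.LayerNorm. The class below is an illustrative implementation written for this comparison, with an assumed eps of 1e-6. It is not the code of any particular model.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Illustrative RMSNorm: rescale by the root mean square, no mean subtraction."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Root mean square over the feature (last) dimension, per sample.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms


torch.manual_seed(0)
x = torch.randn(2, 5, 64)  # (batch, sequence, features)

rms_out = RMSNorm(64)(x)
ln_out = nn.LayerNorm(64)(x)  # subtracts the mean, then divides by the std
```

After LayerNorm each feature vector has roughly zero mean. After RMSNorm the mean is left wherever it was and only the scale is fixed.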

Mixed Precision

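Train in bfloat16 or float16 with master weights in float32. On modern GPUs this often gives a ~2× speedup and roughly ½ the activation memory, usually with little or no loss in quality. float16 has a narrow dynamic range and typically needs loss scaling to avoid gradient underflow; bfloat16 generally does not. Use torch.autocast (torch.cuda.amp in older PyTorch releases).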
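
A minimal sketch of one bfloat16 autocast training step. The model, data, and learning rate are placeholders, and the CPU fallback is only there so the snippet runs without a GPU. On CUDA, bfloat16 assumes hardware that supports it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(32, 8).to(device)  # parameters stay in float32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(16, 32, device=device)
target = torch.randn(16, 8, device=device)

# Inside the context, eligible ops such as the linear layer's matmul run in bfloat16.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = model(x)
    # Compute the loss in float32 for numerical safety.
    loss = nn.functional.mse_loss(out.float(), target)

optimizer.zero_grad()
loss.backward()   # gradients land on the float32 parameters
optimizer.step()
```

With float16 instead of bfloat16, the same loop would normally also wrap the backward pass and optimizer step in a gradient scaler such as torch.cuda.amp.GradScaler, which is omitted here.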

Gradient Clipping

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

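Rescales all gradients whenever their global norm exceeds max_norm, so the occasional huge gradient cannot blow up training. Call it after loss.backward() and before optimizer.step(). Essential for transformers and RNNs.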
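
The sketch below shows where the clipping call sits in a training step and what it does to the gradient norm. The model and the deliberately oversized inputs are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Deliberately large inputs so that the gradient is large.
x = torch.randn(4, 10) * 1e3
y = torch.randn(4, 1)

loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()

# Returns the total gradient norm measured BEFORE clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the global norm to confirm the gradients were rescaled in place.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()
print(f"grad norm before: {norm_before.item():.1f}, after: {norm_after.item():.4f}")
```

Logging the returned pre-clip norm over time is a cheap way to spot instability before the loss diverges.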

Initialization Matters

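Use scheme-appropriate init: Kaiming for ReLU networks, Xavier for tanh/sigmoid. For transformers, a width-aware parametrization such as μP, which scales both initialization and per-layer learning rates with model width, enables hyperparameter transfer across model sizes.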
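
As an illustration, the two classic schemes can be applied with torch.nn.init as follows. The layer sizes and the zero bias init are arbitrary choices for this sketch, and μP is not shown.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 512, 512

# Kaiming (He) init for a layer followed by ReLU: std = sqrt(2 / fan_in).
relu_layer = nn.Linear(fan_in, fan_out)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")
nn.init.zeros_(relu_layer.bias)

# Xavier (Glorot) init for a layer followed by tanh, using the tanh gain of 5/3.
tanh_layer = nn.Linear(fan_in, fan_out)
gain = nn.init.calculate_gain("tanh")
nn.init.xavier_uniform_(tanh_layer.weight, gain=gain)
nn.init.zeros_(tanh_layer.bias)

# Theoretical values to compare against the sampled weights.
kaiming_std = math.sqrt(2.0 / fan_in)
xavier_bound = gain * math.sqrt(6.0 / (fan_in + fan_out))

print(f"Kaiming std: {relu_layer.weight.std().item():.4f} (target {kaiming_std:.4f})")
print(f"Xavier max |w|: {tanh_layer.weight.abs().max().item():.4f} (bound {xavier_bound:.4f})")
```

Both schemes aim to keep the variance of activations roughly constant from layer to layer, which is why the right choice depends on the nonlinearity.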

Most "training problems" are actually data problems, init problems, or learning rate problems — in that order.