Advanced·8 min read
Training Tricks
The practical knowledge that separates models that train from models that diverge.
Regularization
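- Weight decay — shrinks weights toward zero at every step. Under plain SGD this is equivalent to an L2 penalty, but under Adam it is not, so use AdamW, which applies the decay directly to the weights instead of passing it through the adaptive gradient update.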
- Dropout — randomly zero activations during training. Less common in modern LLMs.
- Data augmentation — often the most effective regularizer for vision (random crops, flips, color jitter).
- Label smoothing — softens hard targets, reduces overconfidence.
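A minimal sketch of how three of these regularizers are typically wired up in PyTorch. The toy model, layer sizes, and hyperparameter values (lr=1e-3, weight_decay=0.01, p=0.1, label_smoothing=0.1) are illustrative assumptions, not recommendations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy classifier; the sizes are placeholders.
model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.1),  # only active in model.train() mode
    nn.Linear(64, 10),
)

# AdamW applies weight decay directly to the weights (decoupled),
# rather than adding an L2 term to the gradient.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Label smoothing is built into CrossEntropyLoss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Random stand-in data: 8 samples, 10 classes.
x = torch.randn(8, 32)
y = torch.randint(0, 10, (8,))

model.train()
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Dropout is a no-op in eval mode, so repeated forward passes agree.
model.eval()
with torch.no_grad():
    eval_outputs_match = torch.equal(model(x), model(x))
```

Forgetting model.eval() at evaluation time leaves dropout active and makes validation metrics noisy, which is an easy mistake to misread as a training problem.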
Normalization
- BatchNorm — normalize across the batch dimension. Standard in CNNs.
- LayerNorm — normalize across features per sample. Standard in transformers.
- RMSNorm — LayerNorm without the mean subtraction. Used in LLaMA.
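The difference between the three comes down to which axis the statistics are computed over. The sketch below writes each one out by hand on a (batch, features) tensor and checks the first two against PyTorch's functional API. It omits the learned scale and shift parameters, and the tensor shape and eps value are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
eps = 1e-5

x = torch.randn(16, 8)  # (batch, features)

# BatchNorm: statistics over the batch axis, one mean/variance per feature.
bn = (x - x.mean(dim=0)) / torch.sqrt(x.var(dim=0, unbiased=False) + eps)

# LayerNorm: statistics over the feature axis, one mean/variance per sample.
ln = (x - x.mean(dim=-1, keepdim=True)) / torch.sqrt(
    x.var(dim=-1, unbiased=False, keepdim=True) + eps
)

# RMSNorm: divide by the root-mean-square of the features, no mean subtraction.
rms = x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

# Reference results from PyTorch (no learned scale or shift).
bn_ref = F.batch_norm(x, running_mean=None, running_var=None, training=True, eps=eps)
ln_ref = F.layer_norm(x, normalized_shape=(8,), eps=eps)
```

Because LayerNorm and RMSNorm never look across the batch, they behave identically at batch size 1 and at inference time, which is one reason they suit transformers.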
Mixed Precision
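Train in bfloat16 or float16 while keeping master weights in float32. On modern GPUs this typically gives around a 2× speedup and up to roughly half the memory (mostly activations), usually with little or no loss in quality. float16 needs loss scaling (GradScaler) to keep small gradients from underflowing; bfloat16 shares float32's exponent range and usually does not. Use torch.autocast (torch.cuda.amp in older PyTorch code), with dtype=torch.bfloat16 where the hardware supports it.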
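A minimal sketch of a bf16 autocast training step. It runs on CPU so it works without a GPU; that choice, the toy linear model, and the SGD settings are assumptions made for illustration. On a GPU you would move the model and data to "cuda" and pass device_type="cuda".

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumption: CPU so the sketch runs anywhere; use "cuda" on a GPU.
device_type = "cpu"

model = nn.Linear(16, 4)  # parameters are created in float32
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 16)
target = torch.randn(8, 4)

# Inside autocast, eligible ops such as matmul run in bfloat16.
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    out = model(x)

# Compute the loss in float32 for numerical headroom.
loss = nn.functional.mse_loss(out.float(), target)

optimizer.zero_grad()
loss.backward()
optimizer.step()

weight_dtype = model.weight.dtype  # master weights stay float32
output_dtype = out.dtype           # activations were computed in bfloat16
```

With float16 instead of bfloat16, the usual addition is a gradient scaler (GradScaler) wrapped around backward() and step(); it is left out here because bf16 generally does not need it.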
Gradient Clipping
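torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Rescales all gradients so that their combined norm is at most max_norm, which prevents the occasional huge gradient from blowing up training. Call it after backward() and before optimizer.step(). Essential for transformers and RNNs.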
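To show where the call sits in a training step, here is a small sketch with a toy linear model and deliberately oversized inputs (both assumptions, chosen so the raw gradient is large enough to be clipped).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Deliberately large inputs and targets so the raw gradient norm is big.
x = 100.0 * torch.randn(32, 10)
y = 100.0 * torch.randn(32, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Clip after backward() and before step(). The return value is the total
# gradient norm measured before clipping, which is handy for logging.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the global norm to confirm the gradients were rescaled.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()
```

Logging norm_before over time is a cheap diagnostic: if it sits far above max_norm on most steps, clipping is acting as a hidden learning rate reduction rather than an occasional safety net.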
Initialization Matters
Use scheme-appropriate init: Kaiming for ReLU networks, Xavier for tanh/sigmoid. For transformers, μP (maximal update parametrization), which scales both initialization and per-layer learning rates with width, enables hyperparameter transfer across model sizes.
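A short sketch of the two classic schemes using torch.nn.init. The layer sizes are arbitrary assumptions, and μP is not shown because it involves more than an init call.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

fan_in, fan_out = 1024, 512

# Kaiming (He) normal init for a layer followed by ReLU:
# std = sqrt(2 / fan_in), which keeps activation variance roughly constant.
relu_layer = nn.Linear(fan_in, fan_out)
nn.init.kaiming_normal_(relu_layer.weight, mode="fan_in", nonlinearity="relu")
nn.init.zeros_(relu_layer.bias)
kaiming_std_expected = math.sqrt(2.0 / fan_in)
kaiming_std_actual = relu_layer.weight.std().item()

# Xavier (Glorot) uniform init for a layer followed by tanh:
# bound = gain * sqrt(6 / (fan_in + fan_out)).
tanh_layer = nn.Linear(fan_in, fan_out)
gain = nn.init.calculate_gain("tanh")  # 5/3 for tanh
nn.init.xavier_uniform_(tanh_layer.weight, gain=gain)
nn.init.zeros_(tanh_layer.bias)
xavier_bound = gain * math.sqrt(6.0 / (fan_in + fan_out))
xavier_max_abs = tanh_layer.weight.abs().max().item()
```

Comparing the measured standard deviation of a layer's weights against the value the scheme predicts, as done here, is a quick sanity check when a deep network refuses to train.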
Most "training problems" are actually data problems, init problems, or learning rate problems — in that order.