Activations & Loss
The non-linearities that make networks expressive and the losses that tell them what to learn.
Activation Functions
- ReLU: max(0, x). Cheap, sparse, the default for hidden layers.
- GELU: smooth ReLU variant. Standard in transformers.
- SiLU / Swish: x · σ(x). Used in modern LLMs.
- Sigmoid: squashes to (0, 1). Used for binary outputs and gates.
- Softmax: turns logits into a probability distribution.
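The activations listed above can be sketched in a few lines of plain Python. This is a minimal scalar illustration, not how a framework implements them: the function names are chosen here for clarity, GELU is written in its exact form x · Φ(x) (Φ being the standard normal CDF), and the max-subtraction in softmax is a common stability trick that is an assumption of this sketch rather than something the text prescribes.

```python
import math

def relu(x):
    # max(0, x): passes positives through, zeroes out negatives
    return max(0.0, x)

def sigmoid(x):
    # Squashes to (0, 1); branch on sign so exp() never overflows
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def silu(x):
    # SiLU / Swish: x * sigmoid(x)
    return x * sigmoid(x)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi written via the error function
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(logits):
    # Subtract the max logit first; the result is unchanged
    # mathematically but exp() stays in a safe range
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With the max-subtraction in place, `softmax([1000.0, 1000.0])` returns `[0.5, 0.5]` instead of overflowing.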
Loss Functions
Cross-Entropy (Classification)
L = −Σ y_i · log(ŷ_i)
Pairs naturally with softmax outputs. Penalizes confident wrong answers heavily.
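To make the "confident wrong answers" point concrete, here is a small sketch that computes cross-entropy directly from logits. It assumes a one-hot target, so the sum −Σ y_i · log(ŷ_i) reduces to −log(ŷ_target), and it folds softmax into a log-sum-exp step for numerical stability. The names `log_softmax`, `cross_entropy`, and `target_index` are illustrative.

```python
import math

def log_softmax(logits):
    # log(softmax(z)_i) = z_i - logsumexp(z), computed stably
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def cross_entropy(logits, target_index):
    # One-hot target: only the true class term of the sum survives
    return -log_softmax(logits)[target_index]

confident_right = cross_entropy([5.0, 0.0, 0.0], target_index=0)
uncertain = cross_entropy([0.0, 0.0, 0.0], target_index=1)
confident_wrong = cross_entropy([5.0, 0.0, 0.0], target_index=1)
```

Here `confident_right` is about 0.013, `uncertain` equals log(3) (about 1.10), and `confident_wrong` is about 5.013: the loss grows roughly linearly with the logit margin by which the model is wrong.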
Mean Squared Error (Regression)
L = (1/N) · Σ (y_i − ŷ_i)²
Contrastive Losses
Used in self-supervised learning (CLIP, SimCLR) — pull similar pairs together, push dissimilar pairs apart in embedding space.
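One common way to express "pull together, push apart" is an InfoNCE-style loss: each anchor is scored against every candidate in the batch, and cross-entropy is applied with the matching pair as the correct class. The sketch below is a simplified, one-directional version for illustration only. It is not the exact objective of CLIP or SimCLR, and the function names and the `temperature` default of 0.1 are assumptions made here.

```python
import math

def normalize(v):
    # Scale to unit length so the dot product is cosine similarity
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchors, positives, temperature=0.1):
    # anchors[i] should match positives[i]; every other positives[j]
    # in the batch acts as a negative for anchor i
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        m = max(logits)
        lse = m + math.log(sum(math.exp(l - m) for l in logits))
        total += lse - logits[i]  # cross-entropy with class i as target
    return total / len(a)

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

When each anchor sits next to its own positive (`aligned`), the loss is close to zero; when the pairs are swapped (`shuffled`), it is large, which is the gradient signal that reshapes the embedding space.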
Pick the loss that matches your output distribution. The activation on the final layer should be consistent with it (softmax + cross-entropy, linear + MSE, sigmoid + BCE).
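The two remaining pairings can be sketched the same way. For linear + MSE the network output is used as the prediction unchanged; for sigmoid + BCE the sketch takes raw logits and applies the sigmoid inside the loss, a standard rearrangement that avoids computing log(0). The function names are illustrative, and the fused form is an assumption of this example rather than something the text specifies.

```python
import math

def mse(targets, predictions):
    # L = (1/N) * sum((y_i - y_hat_i)^2), predictions from a linear output
    n = len(targets)
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / n

def bce_with_logits(targets, logits):
    # Binary cross-entropy on sigmoid(z), rewritten as
    # max(z, 0) - z*y + log(1 + exp(-|z|)) so it stays finite
    total = 0.0
    for y, z in zip(targets, logits):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(targets)

regression_loss = mse([1.0, 2.0], [1.0, 4.0])    # (0 + 4) / 2 = 2.0
undecided_loss = bce_with_logits([1.0], [0.0])   # sigmoid(0) = 0.5, so log(2)
```

Keeping the sigmoid inside the loss means even a logit of 1000 with the wrong label yields a finite loss of 1000 rather than infinity.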
Which activation is the standard choice for hidden layers in modern transformers?
What does softmax output?
Which loss should pair with a sigmoid output for binary classification?
Why is ReLU usually preferred over sigmoid in hidden layers?
Cross-entropy loss heavily penalizes which kind of prediction?
Which loss is used in contrastive self-supervised methods like SimCLR and CLIP?
SiLU (Swish) is defined as:
For a regression problem with unbounded real-valued targets, the typical final layer + loss is: