Core · 8 min read

Activations & Loss

The non-linearities that make networks expressive and the losses that tell them what to learn.

Activation Functions

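ReLU: max(0, x). Cheap to compute and sparse (negative inputs map to exactly 0); the usual default for hidden layers.
GELU: x · Φ(x), where Φ is the standard normal CDF. A smooth ReLU variant, standard in transformers such as BERT and GPT-2.
SiLU / Swish: x · σ(x), where σ is the sigmoid. Used in modern LLMs, often inside gated feed-forward layers (SwiGLU).
Sigmoid: σ(x) = 1 / (1 + e^(−x)). Squashes any real input to (0,1). Used for binary outputs and gates.
Softmax: softmax(z)_i = e^(z_i) / Σ_j e^(z_j). Turns a vector of logits into a probability distribution (non-negative entries that sum to 1).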
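The activations listed above can be sketched in a few lines of plain Python. This is a minimal illustration on scalars and lists, not how a deep learning framework implements them. The function names are chosen here for illustration, and GELU is written in its exact form x · Φ(x), which is an assumption, since some libraries default to a tanh approximation instead.

```python
import math

def relu(x):
    # max(0, x): negative inputs are clipped to zero
    return max(0.0, x)

def sigmoid(x):
    # Branch on the sign so exp() never receives a large positive argument
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    # SiLU / Swish: x * sigmoid(x)
    return x * sigmoid(x)

def softmax(logits):
    # Subtracting the max leaves the result unchanged but avoids overflow
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Note that ReLU, GELU, SiLU and sigmoid act on each element independently, while softmax needs the whole logit vector because every output depends on the normalizing sum.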

Loss Functions

Cross-Entropy (Classification)

L = −Σ y_i · log(ŷ_i)

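Here y is the one-hot target and ŷ is the vector of predicted probabilities, so the sum reduces to −log of the probability assigned to the true class. Pairs naturally with softmax outputs. Penalizes confident wrong answers heavily: the loss grows without bound as the true-class probability approaches 0.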
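To make the "confident wrong answers" point concrete, here is a small sketch that evaluates the cross-entropy formula for a single example. The probability vectors are made up for illustration, and the function assumes a one-hot target; terms with y_i = 0 are skipped so that log(0) is never evaluated for classes that do not contribute.

```python
import math

def cross_entropy(y, y_hat):
    # L = -sum_i y_i * log(y_hat_i); zero-weight terms are skipped
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)

y = [1.0, 0.0, 0.0]  # true class is index 0 (one-hot)

confident_right = cross_entropy(y, [0.98, 0.01, 0.01])
unsure = cross_entropy(y, [0.40, 0.30, 0.30])
confident_wrong = cross_entropy(y, [0.01, 0.98, 0.01])

print(confident_right, unsure, confident_wrong)
```

The three losses increase in that order: the confident wrong prediction costs −log(0.01), far more than the hedged one at −log(0.40).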

Mean Squared Error (Regression)

L = (1/N) · Σ (y_i − ŷ_i)²
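A direct translation of the formula, as a minimal sketch with made-up targets and predictions (the function name `mse` is just an illustrative choice):

```python
def mse(targets, predictions):
    # L = (1/N) * sum_i (y_i - y_hat_i)^2
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    n = len(targets)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions)) / n

targets = [3.0, -0.5, 2.0, 7.0]
predictions = [2.5, 0.0, 2.0, 8.0]
loss = mse(targets, predictions)
print(loss)
```

Because errors are squared, the single miss of 1.0 on the last example contributes more to the total than the two misses of 0.5 combined.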

Contrastive Losses

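Used in self-supervised learning (CLIP, SimCLR): pull similar (positive) pairs together and push dissimilar (negative) pairs apart in embedding space. In SimCLR the positives are two augmented views of the same image; in CLIP they are an image and its matching caption.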
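One common way to express "pull together, push apart" is an InfoNCE-style loss: treat the positive as the correct class among a set of candidates and apply cross-entropy to the similarity scores. The sketch below handles a single anchor with toy 2-D embeddings. It is a simplified illustration, not the exact SimCLR or CLIP objective; the function names, the use of cosine similarity, and the temperature of 0.1 are assumptions made for this example.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    # Index 0 holds the positive; the remaining entries are negatives
    candidates = [positive] + list(negatives)
    logits = [cosine_similarity(anchor, c) / temperature for c in candidates]
    # Cross-entropy with the positive as the target class,
    # using a max-shifted log-sum-exp for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[0]

anchor = [1.0, 0.0]
# Positive close to the anchor, negatives far away: low loss
well_separated = info_nce(anchor, [0.9, 0.1], [[-1.0, 0.0], [0.0, 1.0]])
# Positive far from the anchor, a negative close to it: high loss
poorly_separated = info_nce(anchor, [-1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]])
print(well_separated, poorly_separated)
```

Minimizing this quantity raises the anchor-positive similarity relative to every anchor-negative similarity, which is the pull/push behaviour described above.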

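Pick the loss that matches your output distribution. The activation on the final layer should be consistent with it: softmax + cross-entropy for multi-class classification, linear (no activation) + MSE for regression, and sigmoid + binary cross-entropy (BCE) for binary classification.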
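As an illustration of the sigmoid + BCE pairing, the sketch below computes binary cross-entropy in two ways: from a probability (after applying sigmoid) and directly from the raw logit. The second form folds the sigmoid into the loss, which is why frameworks often expose a combined "with logits" variant. The function names here are hypothetical and the inputs are made up.

```python
import math

def sigmoid(z):
    # Branch on the sign so exp() never receives a large positive argument
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bce(prob, target):
    # L = -[y * log(p) + (1 - y) * log(1 - p)]
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def bce_with_logits(logit, target):
    # Algebraically equal to bce(sigmoid(logit), target), rewritten as
    # max(z, 0) - z * y + log(1 + exp(-|z|)) so that it stays finite
    # even for very large |z|
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

two_step = bce(sigmoid(2.0), 1.0)
fused = bce_with_logits(2.0, 1.0)
print(two_step, fused)
```

Both values agree for moderate logits. For an extreme logit such as 1000 with target 0, the two-step version would try to evaluate log(0), while the fused version returns a finite loss.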

Quiz: Check your understanding

Which activation is the standard choice for hidden layers in modern transformers?

What does softmax output?

Which loss should pair with a sigmoid output for binary classification?

Why is ReLU usually preferred over sigmoid in hidden layers?

Cross-entropy loss heavily penalizes which kind of prediction?

Which loss is used in contrastive self-supervised methods like SimCLR and CLIP?

SiLU (Swish) is defined as:

For a regression problem with unbounded real-valued targets, the typical final layer + loss is: