Transformers
The architecture that ate machine learning. Self-attention, parallelism, and scale.
The Core Idea: Attention
Instead of processing tokens sequentially, every token attends to every other token directly. For each token we compute three projections: Query, Key, and Value.
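Attention(Q, K, V) = softmax(Q · Kᵀ / √dₖ) · V
The softmax turns each row of scores into non-negative weights that sum to 1, so each output token is a weighted average of all value vectors, weighted by query-key similarity. Dividing by √dₖ keeps the scores from growing with the key dimension.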
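The formula above can be sketched in a few lines of NumPy. This is a minimal single-sequence version with no masking or batching; the weight matrices `W_q`, `W_k`, `W_v` and the sizes are made up for illustration and would be learned in a real model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one sequence.

    Q, K: (n, d_k); V: (n, d_v).
    Returns the outputs (n, d_v) and the attention weights (n, n).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # query-key similarity, (n, n)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

# Illustrative sizes and random projections (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
x = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = attention(x @ W_q, x @ W_k, x @ W_v)
```

Row i of `weights` says how much token i draws from every other token, and `out[i]` is the corresponding mixture of value vectors.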
Multi-Head Attention
Run h attention operations in parallel, each with its own learned projections, then concatenate the results and mix them with a final linear projection. Different heads can learn to attend to different relationships (syntax, coreference, position, etc.).
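One common way to implement this is to compute full-width projections once and split them into h slices, rather than keeping h separate small matrices. The sketch below assumes that layout, plus the output projection `W_o` from the original transformer design; all names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Multi-head self-attention for one sequence x of shape (n, d_model)."""
    n, d_model = x.shape
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads

    def split(t):
        # (n, d_model) -> (n_heads, n, d_head)
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(x @ W_q), split(x @ W_k), split(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (n_heads, n, d_head)
    # Concatenate the heads back into (n, d_model), then mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
```

Each head works in a `d_model / n_heads` dimensional subspace, so the total cost is roughly the same as one full-width attention operation.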
The Full Block
x = x + MultiHeadAttention(LayerNorm(x))
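x = x + FeedForward(LayerNorm(x))
This is the pre-norm residual structure: LayerNorm is applied to the input of each sublayer, and the sublayer's output is added back to x. The FFN is two linear layers with a GELU activation between them, and its hidden layer is usually 4× wider than the model dimension.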
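The two update lines can be turned into a small runnable sketch. To keep it short, this version makes several simplifying assumptions: single-head attention stands in for MultiHeadAttention, LayerNorm has no learned scale or shift, GELU uses its tanh approximation, and the parameter dictionary `params` is a hypothetical layout.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's features; learned gain and bias are omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def self_attention(x, p):
    # Single-head stand-in for MultiHeadAttention.
    Q, K, V = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return (weights @ V) @ p["W_o"]

def feed_forward(x, p):
    # Two linear layers with GELU in between; the hidden width is d_ff.
    return gelu(x @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]

def transformer_block(x, p):
    x = x + self_attention(layer_norm(x), p)  # attention sublayer + residual
    x = x + feed_forward(layer_norm(x), p)    # FFN sublayer + residual
    return x

# Illustrative sizes: d_ff = 4 * d_model, as described above.
rng = np.random.default_rng(0)
n, d_model = 4, 8
d_ff = 4 * d_model
params = {
    "W_q": rng.normal(size=(d_model, d_model)) * 0.1,
    "W_k": rng.normal(size=(d_model, d_model)) * 0.1,
    "W_v": rng.normal(size=(d_model, d_model)) * 0.1,
    "W_o": rng.normal(size=(d_model, d_model)) * 0.1,
    "W1": rng.normal(size=(d_model, d_ff)) * 0.1,
    "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)) * 0.1,
    "b2": np.zeros(d_model),
}
x = rng.normal(size=(n, d_model))
y = transformer_block(x, params)
```

Because each sublayer only adds to `x`, the block reduces to the identity when the sublayer outputs are zero, which is part of why deep stacks of these blocks stay trainable.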
Positional Encodings
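Self-attention has no built-in notion of order: permuting the input tokens simply permutes the outputs the same way (it is permutation-equivariant), so we inject position information. Modern choices: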
- RoPE — rotary embeddings, the dominant choice in LLMs (LLaMA, GPT-NeoX).
- ALiBi — biases attention scores by distance, enables length extrapolation.
- Sinusoidal — the original transformer's fixed encoding.
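As a concrete example of the simplest option in the list, here is a sketch of the fixed sinusoidal encoding, using the base of 10000 from the original transformer paper. The function name and sizes are illustrative; RoPE and ALiBi work differently, acting inside the attention computation rather than being added to the embeddings.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Return a (n_positions, d_model) table of fixed position encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    pos = np.arange(n_positions)[:, None]         # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))   # one frequency per pair
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even feature indices
    pe[:, 1::2] = np.cos(angles)  # odd feature indices
    return pe

# Illustrative sizes (assumptions, not from the text).
pe = sinusoidal_positions(n_positions=50, d_model=16)

# The table is added to the token embeddings before the first block.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(50, 16))
x = token_embeddings + pe
```

Every position gets a distinct pattern of sines and cosines at different frequencies, which is what lets attention tell otherwise identical tokens apart.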
Why It Won
- Parallelism — entire sequences process at once, perfect for GPUs.
- Long-range dependencies — every token directly sees every other.
- Scaling laws — performance keeps improving predictably with compute and data.
"Attention is all you need" — and it turned out to be enough for text, images, audio, video, protein folding, and just about everything else.