
Transformers

The architecture that ate machine learning. Self-attention, parallelism, and scale.

The Core Idea: Attention

Instead of processing tokens one at a time, as a recurrent network does, self-attention lets every token attend to every other token directly. For each token we compute three learned linear projections of its embedding: a Query, a Key, and a Value.

Attention(Q, K, V) = softmax(Q · Kᵀ / √dₖ) · V

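The softmax turns each row of scores into weights that sum to 1, so each output token is a weighted average of all value vectors, weighted by query-key similarity. Dividing by √dₖ keeps the dot products from growing with the key dimension and saturating the softmax.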
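The formula can be sketched in a few lines of NumPy. This is a minimal single-sequence illustration, not a reference implementation: the names `softmax`, `attention`, `W_q`, `W_k`, and `W_v` are chosen here for illustration, and random matrices stand in for the learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one sequence.

    Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (output, weights).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) query-key similarities
    weights = softmax(scores, axis=-1)  # each row is a distribution over tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))  # toy token embeddings

# Assumption: random matrices stand in for the learned projections.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)             # one output vector per input token
print(weights.sum(axis=-1))  # every row of weights sums to 1
```

Each row of `weights` says how much one token draws from every other token, which is the "mixture" described above.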

Multi-Head Attention

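Run h attention operations in parallel, each with its own learned projections into a smaller subspace (typically the model dimension divided by h), then concatenate the results and apply a final linear projection. Different heads can learn to attend to different relationships (syntax, coreference, position, etc.).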
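One common way to implement this is to compute full-width projections once and reshape them into h heads. The sketch below assumes that layout and adds the output projection `W_o` from the original transformer design. All names and the random weights are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """x: (seq_len, d_model); every W_*: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

    # Batched attention: the leading axis indexes heads.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (n_heads, seq_len, d_head)

    # Concatenate heads back to (seq_len, d_model), then mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
# Assumption: random matrices stand in for learned weights.
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)  # same shape as the input
```

Because each head works in a d_model / h subspace, the total cost is roughly that of one full-width attention operation.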

The Full Block

x = x + MultiHeadAttention(LayerNorm(x))
x = x + FeedForward(LayerNorm(x))

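This is the pre-norm residual structure: LayerNorm is applied to the input of each sublayer, and the sublayer's output is added back to the unnormalized x. The FFN is two linear layers with a GELU activation between them; its hidden layer is usually 4× wider than the model dimension.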
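The two update lines can be made concrete with a small NumPy sketch. To stay short it makes several simplifying assumptions: a single attention head stands in for MultiHeadAttention, LayerNorm has no learned gain or bias, GELU uses its tanh approximation, and the parameter dictionary `params` with random weights is purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's features (learned gain/bias omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def self_attention(x, W_q, W_k, W_v):
    # Single-head stand-in for MultiHeadAttention.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def feed_forward(x, W1, b1, W2, b2):
    # Expand to the wider hidden layer, apply GELU, project back down.
    return gelu(x @ W1 + b1) @ W2 + b2

def transformer_block(x, p):
    # Pre-norm: normalize the sublayer input, add the result to the raw x.
    x = x + self_attention(layer_norm(x), p["W_q"], p["W_k"], p["W_v"])
    x = x + feed_forward(layer_norm(x), p["W1"], p["b1"], p["W2"], p["b2"])
    return x

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
d_ff = 4 * d_model  # hidden layer 4x wider than the model dimension

# Assumption: small random weights stand in for trained parameters.
params = {
    "W_q": 0.1 * rng.normal(size=(d_model, d_model)),
    "W_k": 0.1 * rng.normal(size=(d_model, d_model)),
    "W_v": 0.1 * rng.normal(size=(d_model, d_model)),
    "W1": 0.1 * rng.normal(size=(d_model, d_ff)),
    "b1": np.zeros(d_ff),
    "W2": 0.1 * rng.normal(size=(d_ff, d_model)),
    "b2": np.zeros(d_model),
}

x = rng.normal(size=(seq_len, d_model))
y = transformer_block(x, params)
print(y.shape)  # the block preserves the input shape, so blocks can be stacked
```

Because both sublayers only add to x, the input always has a direct path to the output. If the value projection and the second FFN layer are zero, the block reduces to the identity.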

Positional Encodings

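Self-attention on its own is permutation-equivariant: shuffle the input tokens and the outputs are shuffled the same way, so nothing tells the model about word order. We therefore inject position information. Common choices: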

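  • RoPE: rotary embeddings, which rotate queries and keys by a position-dependent angle; the dominant choice in LLMs (LLaMA, GPT-NeoX).
  • ALiBi: adds a bias to attention scores that penalizes query-key distance linearly, which helps extrapolation to sequences longer than those seen in training.
  • Sinusoidal: the original transformer's fixed encoding, added to the input embeddings.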
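As an illustration of the first option, here is a minimal RoPE sketch. It assumes the interleaved pairing of dimensions (0 with 1, 2 with 3, and so on) and the conventional base of 10000; real implementations differ in how they pair dimensions. The function name `rope` and the helper `score` are illustrative.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embeddings.

    x: (seq_len, d) float array with even d; positions: (seq_len,) integers.
    Each consecutive pair of dimensions is rotated by position * frequency.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # (d/2,) one frequency per pair
    angles = positions[:, None] * inv_freq[None, :]  # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin        # standard 2D rotation
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))  # one query vector
k = rng.normal(size=(1, 8))  # one key vector

def score(m, n):
    # Attention score between the query at position m and the key at position n.
    return (rope(q, np.array([m])) @ rope(k, np.array([n])).T).item()

# Both pairs are 2 positions apart, so the two scores agree
# up to floating-point error.
print(score(3, 1), score(10, 8))
```

The score depends only on the offset between the two positions, not on their absolute values. This is how RoPE gives attention a sense of relative position without adding anything to the embeddings.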

Why It Won

  • Parallelism: during training, the whole sequence is processed at once rather than step by step, which maps well onto GPUs.
  • Long-range dependencies: every token attends directly to every other, at a cost that grows quadratically with sequence length.
  • Scaling laws: performance keeps improving predictably with model size, compute, and data.

"Attention is all you need," as the original paper's title put it, and it turned out to be enough for text, images, audio, video, protein folding, and just about everything else.