Project · 1 weekend

Build a Mini-GPT From Scratch

Implement a decoder-only transformer language model from first principles — tokenizer, attention, training loop, and sampling — in under 400 lines of PyTorch.

What you'll build

A character-level (or BPE) GPT trained on a small corpus (Shakespeare, TinyStories, or your own text). You implement multi-head causal self-attention, positional encodings, layer norm, residual streams, and a sampler with temperature and top-k.

Prerequisites

Don't skip these — transformers reward solid fundamentals:

  • Embeddings — token and positional, why we add them
  • Scaled dot-product attention — Q/K/V intuition, why we divide by √d_k
  • Multi-head attention — splitting d_model across heads, concatenating, projecting
  • Causal masking — why a decoder-only LM uses a lower-triangular mask
  • LayerNorm (and the pre-norm vs post-norm distinction)
  • Residual streams and why gradient flow needs them
  • Cross-entropy over a vocabulary; the shift-by-one targets trick
  • Optimization — AdamW, gradient clipping, warmup + cosine schedule
  • Sampling math — temperature, top-k, top-p (nucleus)
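
For reference, the attention items in this list can be written out as follows. This is the standard formulation with an additive causal mask M. The assumption that the entries of q and k are independent with zero mean and unit variance is an idealization, used here only to motivate the √d_k scaling.

```latex
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V,
\qquad
M_{ij} =
\begin{cases}
  0       & j \le i \\
  -\infty & j > i
\end{cases}

% Why divide by sqrt(d_k): under the idealized assumption above,
\operatorname{Var}(q \cdot k)
  = \sum_{m=1}^{d_k} \operatorname{Var}(q_m k_m) = d_k
\quad\Longrightarrow\quad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
```

Without the scaling, the logits grow with d_k and the softmax saturates, which shrinks its gradients. The mask M is what makes row i of the attention matrix depend only on positions j ≤ i.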

Warm-up exercises

  1. Implement scaled dot-product attention in ~10 lines and verify it matches F.scaled_dot_product_attention on random tensors.
  2. Hand-derive what a 4×4 causal mask looks like and apply it to a toy attention matrix.
  3. Train a 2-layer MLP bigram language model on Shakespeare and report val loss — this is your baseline to beat.
  4. Write a sampler that, given logits, supports temperature, top-k, and top-p; test on a fixed distribution.
  5. Implement a tiny char-level tokenizer with encode/decode round-tripping.
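
For exercise 4, one possible shape of the sampler is sketched below. The function name sample_next and its signature are illustrative, not prescribed by this project. The top-p rule used here keeps the token that crosses the threshold, which is a common convention but an assumption on our part.

```python
import torch

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, generator=None):
    """Sample one token id per row from logits of shape (B, V). Hypothetical helper."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    logits = logits / max(temperature, 1e-8)

    if top_k is not None:
        k = min(top_k, logits.size(-1))
        # k-th largest logit in each row, kept as shape (B, 1) for broadcasting.
        kth = torch.topk(logits, k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, dim=-1, descending=True)
        probs = sorted_logits.softmax(dim=-1)
        cum = probs.cumsum(dim=-1)
        # Drop a token when the mass *before* it already reaches top_p,
        # so the token that crosses the threshold is still kept.
        remove = (cum - probs) >= top_p
        sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
        # Undo the sort so each logit returns to its vocabulary index.
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

    probs = logits.softmax(dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=generator)  # (B, 1)
```

A fixed distribution such as probabilities [0.5, 0.3, 0.15, 0.05] makes this easy to check: with top_k=2 or top_p=0.7 only ids 0 and 1 should ever appear, and a very low temperature should always return id 0.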

Difficulty

Intermediate → Advanced. By the end you will have written and tested every component of a decoder-only transformer yourself.

Architecture

Token Embedding + Positional Embedding
  → N × [LayerNorm → Causal MHA → Residual
          LayerNorm → MLP (4× hidden) → Residual]
  → LayerNorm → LM Head (tied with token embeddings)
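
The diagram above can be sketched in PyTorch roughly as follows. The names Block and MiniGPT, the default sizes, and the GELU activation are assumptions made for illustration, since the diagram does not fix an activation. The attention here leans on F.scaled_dot_product_attention (PyTorch 2.0+) to stay short, whereas the project asks you to write that part by hand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """Pre-norm block: x + Attn(LN(x)), then x + MLP(LN(x))."""
    def __init__(self, d, h):
        super().__init__()
        self.h = h
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.qkv = nn.Linear(d, 3 * d, bias=False)
        self.proj = nn.Linear(d, d)
        # 4x hidden width as in the diagram; GELU is an assumed activation.
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def attn(self, x):
        B, T, D = x.shape
        # Split into heads: each of q, k, v becomes (B, h, T, D // h).
        q, k, v = (t.view(B, T, self.h, D // self.h).transpose(1, 2)
                   for t in self.qkv(x).chunk(3, dim=-1))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, D))

    def forward(self, x):
        x = x + self.attn(self.ln1(x))    # residual around attention
        return x + self.mlp(self.ln2(x))  # residual around MLP

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, block_size, d=64, h=4, n_layer=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d)
        self.pos_emb = nn.Embedding(block_size, d)  # learned positions
        self.blocks = nn.ModuleList([Block(d, h) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(d)
        self.head = nn.Linear(d, vocab_size, bias=False)
        self.head.weight = self.tok_emb.weight  # weight tying: both are (vocab_size, d)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)  # (B, T, d)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))  # logits, (B, T, vocab_size)
```

A useful sanity check on any implementation of this architecture: changing the last input token must leave the logits at all earlier positions unchanged.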

Milestones

  1. Tokenizer. Start with char-level. Stretch: implement a tiny BPE.
  2. Data loader. Sample random contiguous blocks of size block_size with shifted targets.
  3. Attention. Implement causal mask, scaled dot-product, multi-head split/concat. Verify against F.scaled_dot_product_attention.
  4. Block + Model. Pre-norm transformer block. Weight-tie the LM head to the token embedding.
  5. Training. AdamW, cosine schedule with warmup, gradient clipping at 1.0, mixed precision.
  6. Sampling. Implement temperature, top-k, and top-p (nucleus) sampling.
  7. Eval. Track train/val loss and bits-per-character (for a char-level model, the cross-entropy in nats divided by ln 2). Generate samples every N steps.
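
Milestones 1 and 2 fit in a few lines. The sketch below uses a one-sentence stand-in corpus, and the helper names encode, decode, and get_batch are illustrative choices rather than required ones.

```python
import torch

text = "to be, or not to be, that is the question"  # stand-in corpus

# Milestone 1: char-level tokenizer built from the characters in the corpus.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

data = torch.tensor(encode(text), dtype=torch.long)

# Milestone 2: random contiguous blocks with targets shifted by one position.
def get_batch(data, block_size, batch_size, generator=None):
    # The highest valid start leaves room for block_size inputs plus one extra target token.
    starts = torch.randint(0, len(data) - block_size, (batch_size,), generator=generator).tolist()
    x = torch.stack([data[s : s + block_size] for s in starts])
    y = torch.stack([data[s + 1 : s + block_size + 1] for s in starts])
    return x, y  # y[:, t] is the token that follows x[:, t]
```

Because the targets are the inputs shifted by one, x[:, 1:] and y[:, :-1] should be identical, which makes a quick assertion to keep in your data-loader tests.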

Causal self-attention (core)

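import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d, h, T):
        # d: model width, h: number of heads, T: maximum sequence length (block_size)
        super().__init__()
        assert d % h == 0, "d must be divisible by h"
        self.h, self.dh = h, d // h
        self.qkv = nn.Linear(d, 3 * d, bias=False)  # fused Q, K, V projection
        self.proj = nn.Linear(d, d)                 # output projection over concatenated heads
        # Lower-triangular mask: position i may attend only to positions j <= i.
        self.register_buffer("mask", torch.tril(torch.ones(T, T)).bool())

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split heads: (B, T, D) -> (B, h, T, dh)
        q = q.view(B, T, self.h, self.dh).transpose(1, 2)
        k = k.view(B, T, self.h, self.dh).transpose(1, 2)
        v = v.view(B, T, self.h, self.dh).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (self.dh ** 0.5)        # scores, (B, h, T, T)
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))  # hide future positions
        att = att.softmax(dim=-1)
        # Merge heads: (B, h, T, dh) -> (B, T, D)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, D)
        return self.proj(y)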
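
To check the core math against PyTorch's fused kernel, it helps to pull the attention computation out into a plain function. The sketch below repeats the same steps as the module above on already-split heads. The function name causal_attention and the tensor sizes are arbitrary choices, and F.scaled_dot_product_attention requires PyTorch 2.0 or newer.

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Manual causal scaled dot-product attention; q, k, v are (B, h, T, dh)."""
    T, dh = q.shape[-2], q.shape[-1]
    att = (q @ k.transpose(-2, -1)) / dh ** 0.5
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    att = att.masked_fill(~mask, float("-inf"))
    return att.softmax(dim=-1) @ v

torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))
ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)
max_err = (causal_attention(q, k, v) - ref).abs().max().item()
print(f"max abs error vs F.scaled_dot_product_attention: {max_err:.2e}")
```

In float32 the two results should agree to within ordinary rounding error. A large gap usually points to a wrong mask orientation or a missing scale factor.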

Stretch goals

  • Swap learned positional embeddings for RoPE
  • Replace LayerNorm with RMSNorm, and swap the MLP for a ReLU² or SwiGLU variant
  • Implement KV-cache for fast incremental decoding
  • Fine-tune on instruction data (SFT)
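
As a starting point for the RMSNorm stretch goal, here is a minimal sketch. The eps default and the choice of a learned scale with no bias are assumptions. Newer PyTorch releases also ship a built-in nn.RMSNorm, but writing it by hand is in the spirit of this project.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale each feature vector to unit root-mean-square, then apply a learned gain."""
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.eps = eps                            # assumed default, for numerical stability
        self.weight = nn.Parameter(torch.ones(d))  # learned per-feature gain

    def forward(self, x):
        # Unlike LayerNorm, there is no mean subtraction and no bias term.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.weight
```

It is a drop-in replacement wherever the model calls nn.LayerNorm(d), since both map (..., d) to (..., d).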