Project·1 weekend

Build a Mini-GPT From Scratch

Implement a decoder-only transformer language model from first principles — tokenizer, attention, training loop, and sampling — in under 400 lines of PyTorch.

What you'll build

A character-level (or BPE) GPT trained on a small corpus (Shakespeare, TinyStories, or your own text). You implement multi-head causal self-attention, positional encodings, layer norm, residual streams, and a sampler with temperature and top-k.

Prerequisites

Don't skip these — transformers reward solid fundamentals:

Embeddings — token and positional, why we add them
Scaled dot-product attention — Q/K/V intuition, why we divide by √d_k
Multi-head attention — splitting d_model across heads, concatenating, projecting
Causal masking — why a decoder-only LM uses a lower-triangular mask
LayerNorm (and the pre-norm vs post-norm distinction)
Residual streams and why gradient flow needs them
Cross-entropy over a vocabulary; the shift-by-one targets trick
Optimization — AdamW, gradient clipping, warmup + cosine schedule
Sampling math — temperature, top-k, top-p (nucleus)

Warm-up exercises

Implement scaled dot-product attention in ~10 lines and verify it matches F.scaled_dot_product_attention on random tensors.
Hand-derive what a 4×4 causal mask looks like and apply it to a toy attention matrix.
Train a 2-layer MLP bigram language model on Shakespeare and report val loss — this is your baseline to beat.
Write a sampler that, given logits, supports temperature, top-k, and top-p; test on a fixed distribution.
Implement a tiny char-level tokenizer with encode/decode round-tripping.

Difficulty

Intermediate → Advanced. You'll truly understand transformers by the end.

Architecture

Token Embedding + Positional Embedding
  → N × [LayerNorm → Causal MHA → Residual
          LayerNorm → MLP (4× hidden) → Residual]
  → LayerNorm → LM Head (tied with token embeddings)

Milestones

Tokenizer. Start with char-level. Stretch: implement a tiny BPE.
Data loader. Sample random contiguous blocks of size block_size with shifted targets.
Attention. Implement causal mask, scaled dot-product, multi-head split/concat. Verify against F.scaled_dot_product_attention.
Block + Model. Pre-norm transformer block. Weight-tie the LM head to the token embedding.
Training. AdamW, cosine schedule with warmup, gradient clipping at 1.0, mixed precision.
Sampling. Implement temperature, top-k, and top-p (nucleus) sampling.
Eval. Track train/val loss and bits-per-character. Generate samples every N steps.

Causal self-attention (core)

class CausalSelfAttention(nn.Module):
    def __init__(self, d, h, T):
        super().__init__()
        self.h, self.dh = h, d // h
        self.qkv = nn.Linear(d, 3 * d, bias=False)
        self.proj = nn.Linear(d, d)
        self.register_buffer("mask", torch.tril(torch.ones(T, T)).bool())

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.h, self.dh).transpose(1, 2)
        k = k.view(B, T, self.h, self.dh).transpose(1, 2)
        v = v.view(B, T, self.h, self.dh).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (self.dh ** 0.5)
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))
        att = att.softmax(dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, D)
        return self.proj(y)

Stretch goals

Swap learned positional embeddings for RoPE
Replace LayerNorm with RMSNorm and ReLU² / SwiGLU MLP
Implement KV-cache for fast incremental decoding
Fine-tune on instruction data (SFT)