Project · 1 weekend
Build a Mini-GPT From Scratch
Implement a decoder-only transformer language model from first principles — tokenizer, attention, training loop, and sampling — in under 400 lines of PyTorch.
What you'll build
A character-level (or BPE) GPT trained on a small corpus (Shakespeare, TinyStories, or your own text). You implement multi-head causal self-attention, positional encodings, layer norm, residual streams, and a sampler with temperature and top-k.
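For the character-level option, the tokenizer can be as small as two dictionaries. The sketch below is one possible shape, not a prescribed design: the class name `CharTokenizer` and its attributes are hypothetical, and it assumes the whole corpus fits in memory as a single string.

```python
from typing import List


class CharTokenizer:
    """Hypothetical char-level tokenizer: one integer id per distinct character."""

    def __init__(self, text: str):
        chars = sorted(set(text))  # sorted so ids are deterministic
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    @property
    def vocab_size(self) -> int:
        return len(self.stoi)

    def encode(self, s: str) -> List[int]:
        # Raises KeyError for characters that never appeared in the corpus.
        return [self.stoi[ch] for ch in s]

    def decode(self, ids: List[int]) -> str:
        return "".join(self.itos[i] for i in ids)


corpus = "to be, or not to be"
tok = CharTokenizer(corpus)
round_trip = tok.decode(tok.encode(corpus))
```

A BPE tokenizer would keep the same `encode`/`decode` interface, so the rest of the model would not need to change.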
Prerequisites
Don't skip these — transformers reward solid fundamentals:
- Embeddings — token and positional, why we add them
- Scaled dot-product attention — Q/K/V intuition, why we divide by √d_k
- Multi-head attention — splitting d_model across heads, concatenating, projecting
- Causal masking — why a decoder-only LM uses a lower-triangular mask
- LayerNorm (and the pre-norm vs post-norm distinction)
- Residual streams and why gradient flow needs them
- Cross-entropy over a vocabulary; the shift-by-one targets trick
- Optimization — AdamW, gradient clipping, warmup + cosine schedule
- Sampling math — temperature, top-k, top-p (nucleus)
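To make the shift-by-one targets trick concrete, here is a minimal sketch on a toy sequence. The vocabulary size, the token ids, and the all-zero logits standing in for a model's output are assumptions made purely for illustration.

```python
import math

import torch
import torch.nn.functional as F

vocab_size = 5                                # toy vocabulary (assumption)
tokens = torch.tensor([[0, 3, 1, 4, 2, 0]])   # shape (B=1, T+1=6)

inputs = tokens[:, :-1]    # positions 0..T-1 are fed to the model
targets = tokens[:, 1:]    # positions 1..T are what it must predict

# Stand-in for model(inputs): all-zero logits, i.e. a uniform prediction.
logits = torch.zeros(inputs.shape[0], inputs.shape[1], vocab_size)

# cross_entropy expects (N, C) logits and (N,) class ids, so flatten B and T.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
uniform_loss = math.log(vocab_size)  # loss of a model that knows nothing
```

A uniform prediction gives a loss of ln(vocab_size), so a freshly initialized model should start close to that value. A much higher initial loss usually points to an initialization or indexing bug.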
Warm-up exercises
- Implement scaled dot-product attention in ~10 lines and verify it matches F.scaled_dot_product_attention on random tensors.
- Hand-derive what a 4×4 causal mask looks like and apply it to a toy attention matrix.
- Train a 2-layer MLP bigram language model on Shakespeare and report val loss — this is your baseline to beat.
- Write a sampler that, given logits, supports temperature, top-k, and top-p; test on a fixed distribution.
- Implement a tiny char-level tokenizer with encode/decode round-tripping.
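As a reference point for the first two warm-ups, the sketch below writes causal scaled dot-product attention by hand and compares it with the built-in. It assumes PyTorch 2.0 or later (where `F.scaled_dot_product_attention` is available); the function name `naive_attention` and the tensor shapes are arbitrary choices for the check.

```python
import math

import torch
import torch.nn.functional as F


def naive_attention(q, k, v):
    """Causal scaled dot-product attention over (..., T, d_k) tensors."""
    T, d_k = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~mask, float("-inf"))  # hide future positions
    return scores.softmax(dim=-1) @ v


# The 4x4 causal mask: row i is True for columns 0..i.
mask4 = torch.tril(torch.ones(4, 4, dtype=torch.bool))

torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))  # (B, heads, T, d_k)
expected = F.scaled_dot_product_attention(q, k, v, is_causal=True)
max_err = (naive_attention(q, k, v) - expected).abs().max().item()
```

The two results should agree up to floating-point error. Comparing the maximum absolute difference against a small tolerance is more informative than exact equality.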
Difficulty
Intermediate → Advanced. By the end you will have written every component of a decoder-only transformer yourself.
Architecture
Token Embedding + Positional Embedding
→ N × [LayerNorm → Causal MHA → Residual
       LayerNorm → MLP (4× hidden) → Residual]
→ LayerNorm → LM Head (tied with token embeddings)
Milestones
- Tokenizer. Start with char-level. Stretch: implement a tiny BPE.
- Data loader. Sample random contiguous blocks of size block_size with shifted targets.
- Attention. Implement causal mask, scaled dot-product, multi-head split/concat. Verify against F.scaled_dot_product_attention.
- Block + Model. Pre-norm transformer block. Weight-tie the LM head to the token embedding.
- Training. AdamW, cosine schedule with warmup, gradient clipping at 1.0, mixed precision.
- Sampling. Implement temperature, top-k, and top-p (nucleus) sampling.
- Eval. Track train/val loss and bits-per-character. Generate samples every N steps.
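For the sampling milestone, one possible shape for the sampler is sketched below. The function name `sample_next`, its signature, and the order of the filters (top-k first, then top-p) are assumptions for illustration. It handles a single 1-D logits vector and assumes `temperature > 0`.

```python
import torch


def sample_next(logits, temperature=1.0, top_k=None, top_p=None, generator=None):
    """Sample one token id from a 1-D logits tensor of shape (vocab_size,)."""
    logits = logits / temperature  # <1 sharpens the distribution, >1 flattens it
    if top_k is not None:
        k = min(top_k, logits.numel())
        kth_largest = torch.topk(logits, k).values[-1]
        logits = logits.masked_fill(logits < kth_largest, float("-inf"))
    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = sorted_logits.softmax(dim=-1)
        # Probability mass ranked strictly above each token; once that mass
        # already reaches top_p, the token falls outside the nucleus.
        mass_before = probs.cumsum(dim=-1) - probs
        remove_sorted = mass_before >= top_p
        # Map the removal flags from sorted order back to vocabulary order.
        remove = torch.zeros_like(remove_sorted).scatter(0, sorted_idx, remove_sorted)
        logits = logits.masked_fill(remove, float("-inf"))
    probs = logits.softmax(dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=generator).item()


# Test on a fixed, known distribution rather than on model output.
fixed = torch.log(torch.tensor([0.5, 0.3, 0.15, 0.05]))
g = torch.Generator().manual_seed(0)
greedy = sample_next(fixed, top_k=1, generator=g)
nucleus_draws = {sample_next(fixed, top_p=0.7, generator=g) for _ in range(200)}
```

With `top_k=1` the sampler reduces to greedy decoding. With `top_p=0.7` on this distribution only the two most likely tokens (0.5 + 0.3) can ever be drawn, which makes the filter easy to check.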
Causal self-attention (core)
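import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d, h, T):
        # d: model width, h: number of heads, T: maximum sequence length
        super().__init__()
        assert d % h == 0, "d must be divisible by h"
        self.h, self.dh = h, d // h
        self.qkv = nn.Linear(d, 3 * d, bias=False)  # fused Q, K, V projection
        self.proj = nn.Linear(d, d)                 # output projection
        # Lower-triangular mask: position i may attend to positions <= i.
        self.register_buffer("mask", torch.tril(torch.ones(T, T)).bool())

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split heads: (B, T, D) -> (B, h, T, dh)
        q = q.view(B, T, self.h, self.dh).transpose(1, 2)
        k = k.view(B, T, self.h, self.dh).transpose(1, 2)
        v = v.view(B, T, self.h, self.dh).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (self.dh ** 0.5)  # (B, h, T, T)
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))
        att = att.softmax(dim=-1)
        # Merge heads: (B, h, T, dh) -> (B, T, D)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, D)
        return self.proj(y)
Stretch goals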
- Swap learned positional embeddings for RoPE
- Replace LayerNorm with RMSNorm, and swap the MLP for a ReLU² or SwiGLU variant
- Implement KV-cache for fast incremental decoding
- Fine-tune on instruction data (SFT)
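As a starting point for the RMSNorm stretch goal, here is a minimal sketch. The `eps` default and the attribute name `weight` are assumptions, and this is a simplified version rather than a reference implementation. Unlike LayerNorm it does not subtract the mean and has no bias term.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Scale each feature vector to unit root-mean-square, then apply a learned gain."""

    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d))  # learned per-feature gain

    def forward(self, x):
        # 1 / sqrt(mean(x^2) + eps), computed over the feature dimension.
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight


torch.manual_seed(0)
x = torch.randn(2, 3, 16)
y = RMSNorm(16)(x)
mean_square = y.pow(2).mean(dim=-1)  # close to 1 everywhere at initialization
```

Because the input and output shapes match, it can stand in wherever the block currently applies LayerNorm.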