Architectures·8 min read
RNNs & LSTMs
The pre-transformer way to model sequences — still useful, still elegant.
Recurrent Neural Networks
An RNN processes a sequence by maintaining a hidden state that gets updated at each step:
hₜ = φ(W_h · hₜ₋₁ + W_x · xₜ + b)
yₜ = W_y · hₜSame weights at every timestep — the network has "memory" of past inputs encoded in h.
The Vanishing Gradient Problem
When unrolling through many timesteps, gradients are multiplied repeatedly by the same Jacobian. If its largest eigenvalue is <1, gradients vanish; if >1, they explode. Either way, long-range dependencies become impossible to learn.
LSTM: Gating to the Rescue
Long Short-Term Memory cells introduce a separate cell state and three gates that control information flow:
fₜ = σ(W_f · [hₜ₋₁, xₜ]) # forget gate
iₜ = σ(W_i · [hₜ₋₁, xₜ]) # input gate
oₜ = σ(W_o · [hₜ₋₁, xₜ]) # output gate
c̃ₜ = tanh(W_c · [hₜ₋₁, xₜ])
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ
hₜ = oₜ ⊙ tanh(cₜ)The additive update on c creates a gradient highway — information (and gradients) flow across hundreds of timesteps.
When to Use Them Today
- On-device inference where transformer memory cost is prohibitive.
- Strictly sequential / streaming tasks (audio, time series).
- Hybrid models — modern state-space models (Mamba, S4) are spiritual successors.