Architectures·8 min read

RNNs & LSTMs

The pre-transformer way to model sequences — still useful, still elegant.

Recurrent Neural Networks

An RNN processes a sequence by maintaining a hidden state that gets updated at each step:

hₜ = φ(W_h · hₜ₋₁ + W_x · xₜ + b)
yₜ = W_y · hₜ

Same weights at every timestep — the network has "memory" of past inputs encoded in h.

The Vanishing Gradient Problem

When unrolling through many timesteps, gradients are multiplied repeatedly by the same Jacobian. If its largest eigenvalue is <1, gradients vanish; if >1, they explode. Either way, long-range dependencies become impossible to learn.

LSTM: Gating to the Rescue

Long Short-Term Memory cells introduce a separate cell state and three gates that control information flow:

fₜ = σ(W_f · [hₜ₋₁, xₜ])    # forget gate
iₜ = σ(W_i · [hₜ₋₁, xₜ])    # input gate
oₜ = σ(W_o · [hₜ₋₁, xₜ])    # output gate
c̃ₜ = tanh(W_c · [hₜ₋₁, xₜ])
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ
hₜ = oₜ ⊙ tanh(cₜ)

The additive update on c creates a gradient highway — information (and gradients) flow across hundreds of timesteps.

When to Use Them Today

On-device inference where transformer memory cost is prohibitive.
Strictly sequential / streaming tasks (audio, time series).
Hybrid models — modern state-space models (Mamba, S4) are spiritual successors.