Architectures·8 min read
Convolutional Neural Networks
The architecture that defined a decade of computer vision.
Why Convolutions?
Images have two properties an MLP ignores:
- Local structure — nearby pixels are correlated.
- Translation invariance — a cat is a cat wherever it appears.
Convolutions exploit both by sliding a small kernel across the input, sharing weights across spatial positions.
The Convolution Operation
y[i,j] = Σ_m Σ_n x[i+m, j+n] · k[m,n]A 3×3 kernel has only 9 parameters but is applied at every pixel — orders of magnitude fewer parameters than a dense layer.
Building Blocks
- Conv layer — learns filters (edges, textures, parts, objects).
- Pooling — downsamples (max-pool or strided conv) to grow the receptive field.
- BatchNorm — stabilizes training by normalizing activations per channel.
- Residual connections —
y = F(x) + x, enables training of 100+ layers (ResNet).
A Modern CNN Block
class ResBlock(nn.Module):
def __init__(self, c):
super().__init__()
self.conv1 = nn.Conv2d(c, c, 3, padding=1)
self.conv2 = nn.Conv2d(c, c, 3, padding=1)
self.bn1 = nn.BatchNorm2d(c)
self.bn2 = nn.BatchNorm2d(c)
def forward(self, x):
h = F.relu(self.bn1(self.conv1(x)))
h = self.bn2(self.conv2(h))
return F.relu(h + x)CNNs are no longer state-of-the-art on every vision task — Vision Transformers caught up — but they remain efficient, well-understood, and excellent baselines.