Architectures·8 min read

Convolutional Neural Networks

The architecture that defined a decade of computer vision.

Why Convolutions?

Images have two properties an MLP ignores:

Local structure — nearby pixels are correlated.
Translation invariance — a cat is a cat wherever it appears.

Convolutions exploit both by sliding a small kernel across the input, sharing weights across spatial positions.

The Convolution Operation

y[i,j] = Σ_m Σ_n x[i+m, j+n] · k[m,n]

A 3×3 kernel has only 9 parameters but is applied at every pixel — orders of magnitude fewer parameters than a dense layer.

Building Blocks

Conv layer — learns filters (edges, textures, parts, objects).
Pooling — downsamples (max-pool or strided conv) to grow the receptive field.
BatchNorm — stabilizes training by normalizing activations per channel.
Residual connections — y = F(x) + x, enables training of 100+ layers (ResNet).

A Modern CNN Block

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(c)
        self.bn2 = nn.BatchNorm2d(c)
    def forward(self, x):
        h = F.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return F.relu(h + x)

CNNs are no longer state-of-the-art on every vision task — Vision Transformers caught up — but they remain efficient, well-understood, and excellent baselines.