Project·Multi-day project

Deep Q-Network for Atari

Train an agent to play Atari games from pixels using a convolutional Q-network, replay buffer, target network, and ε-greedy exploration.

What you'll build

A from-scratch reimplementation of DeepMind's 2015 Nature DQN. You'll train an agent on Pong or Breakout that learns superhuman play purely from screen pixels and the score.

Prerequisites

RL has its own vocabulary on top of deep learning — get these solid first:

MDPs — states, actions, rewards, discount factor γ, episodes
Value vs Q-functions — what V(s) and Q(s, a) mean intuitively
Bellman optimality equation — Q*(s, a) = E[r + γ·max over a' of Q*(s', a')]
Exploration vs exploitation — ε-greedy policies
Off-policy learning and why experience replay is needed
Why a target network — moving targets, training instability
CNNs for processing 84×84×4 stacked frames
Huber / smooth L1 loss and why it's preferred over MSE here
Gym / Gymnasium API — reset, step, observation/action spaces
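
To make the Huber item above concrete, here is a minimal pure-Python sketch comparing the two losses and their gradients on a single TD error. The threshold delta = 1.0 and all function names are illustrative assumptions, not part of the project spec.

```python
def mse(td_error: float) -> float:
    """Squared error with the conventional 1/2 factor."""
    return 0.5 * td_error ** 2


def huber(td_error: float, delta: float = 1.0) -> float:
    """Quadratic for |error| <= delta, linear beyond it."""
    abs_err = abs(td_error)
    if abs_err <= delta:
        return 0.5 * td_error ** 2
    return delta * (abs_err - 0.5 * delta)


def mse_grad(td_error: float) -> float:
    """Gradient of the squared error grows without bound."""
    return td_error


def huber_grad(td_error: float, delta: float = 1.0) -> float:
    """Gradient of the Huber loss is capped at +/- delta."""
    return max(-delta, min(delta, td_error))


# Small errors behave identically; large errors are damped under Huber.
for err in (0.5, 1.0, 10.0):
    print(err, mse(err), huber(err), mse_grad(err), huber_grad(err))
```

The capped gradient is the point: one transition with a badly wrong Q estimate cannot dominate an update the way it can under MSE.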

Warm-up exercises

Solve CartPole-v1 with a non-tabular DQN (small MLP, no conv) — get average return > 475.
Implement a replay buffer as a ring buffer and benchmark random sampling speed for 1M transitions.
Wrap an Atari env with frame skip, grayscale, resize, and frame stack — print the resulting observation shape and dtype.
Hand-derive the Bellman target for one transition and verify your code computes the same number.
Plot how ε decays over 1M steps with your annealing schedule before training.
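
For the ring-buffer exercise, one possible starting point is sketched below. It stores transitions as plain Python tuples for brevity; the class and method names are hypothetical, and a buffer holding 1M Atari transitions would more likely use preallocated uint8 arrays to keep memory in check.

```python
import random


class RingReplayBuffer:
    """Fixed-capacity buffer that overwrites the oldest transition when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.storage = [None] * capacity  # preallocated slots
        self.write_idx = 0                # next slot to overwrite
        self.size = 0                     # number of valid transitions

    def push(self, state, action, reward, next_state, done):
        self.storage[self.write_idx] = (state, action, reward, next_state, done)
        self.write_idx = (self.write_idx + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size: int):
        # Uniform sampling with replacement over the valid slots.
        idxs = [random.randrange(self.size) for _ in range(batch_size)]
        return [self.storage[i] for i in idxs]

    def __len__(self):
        return self.size


# With capacity 3, pushing 5 transitions keeps only the last 3.
buf = RingReplayBuffer(capacity=3)
for t in range(5):
    buf.push(t, 0, 0.0, t + 1, False)
print(len(buf), sorted(tr[0] for tr in buf.storage))  # 3 [2, 3, 4]
```

To benchmark sampling speed, wrap repeated sample calls in time.perf_counter() once the buffer is full.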

Difficulty

Advanced. RL is famously finicky — debugging requires patience.

Core components

Atari wrappers: frame skip (4), grayscale, resize to 84×84, frame stack (4), reward clipping, no-op reset, episodic life.
Q-network: 3 conv layers + 2 FC, outputs Q-values for each action.
Replay buffer: 1M transitions, uniform sampling (stretch: prioritized).
Target network: hard-update every 10k steps.
ε-greedy: linear anneal from 1.0 → 0.1 over 1M steps.
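
The ε-greedy component above could be sketched as follows, using the 1.0 → 0.1 schedule over 1M steps from the list. The function names and the choice to hold ε constant after the anneal ends are assumptions for illustration.

```python
import random


def linear_epsilon(step: int, eps_start: float = 1.0, eps_end: float = 0.1,
                   anneal_steps: int = 1_000_000) -> float:
    """Linearly interpolate from eps_start to eps_end, then hold at eps_end."""
    fraction = min(step / anneal_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon: float, rng=random) -> int:
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Early in training almost every action is random; later most are greedy.
for step in (0, 250_000, 1_000_000, 5_000_000):
    print(step, round(linear_epsilon(step), 3))
```

Here q_values is any indexable sequence of per-action Q-values; in the full agent it would come from a forward pass of the online network on the current stacked frames.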

The Bellman target

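import torch
import torch.nn.functional as F

# Sample batch from replay buffer: (s, a, r, s_next, done)
# a: int64 actions, shape [B]; done: 1 where the episode ended, else 0
with torch.no_grad():  # no gradient flows through the target network
    target = r + gamma * (1.0 - done.float()) * q_target(s_next).max(1).values

# Q-value of the action actually taken in each transition
q_pred = q_online(s).gather(1, a.unsqueeze(1)).squeeze(1)
loss = F.smooth_l1_loss(q_pred, target)  # Huber loss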
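
To check the target expression above by hand, it helps to reduce it to a single transition in plain Python. The sketch below mirrors the same formula without tensors; gamma = 0.99 and the example Q-values are assumed numbers chosen only for illustration.

```python
def bellman_target(reward, done, next_q_values, gamma=0.99):
    """r + gamma * max over a' of Q_target(s', a'); no bootstrap at terminal states."""
    bootstrap = 0.0 if done else max(next_q_values)
    return reward + gamma * bootstrap


next_q = [0.5, 2.0, 1.0]  # hypothetical target-network outputs for s'

# Non-terminal: 1.0 + 0.99 * 2.0
print(bellman_target(1.0, False, next_q))

# Terminal: the (1 - done) factor zeroes the bootstrap term, leaving only r
print(bellman_target(-1.0, True, next_q))
```

If the batched code disagrees with this function on the same transition, the usual suspects are a done flag with the wrong dtype or a max taken over the batch dimension instead of the action dimension.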

Milestones

Get a random agent running with all preprocessing wrappers.
Implement replay buffer + Q-network; verify forward shapes.
Train on Pong (one of the easiest Atari games). Expect roughly 1M frames before the agent starts winning.
Add target network and Huber loss; observe stability improvement.
Log episode return, ε, mean Q, loss. Plot smoothed curves.
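
For the "verify forward shapes" milestone, you can work out the expected shapes on paper before building the network. The sketch below assumes the kernel sizes and strides reported in the Nature DQN paper (32 filters of 8×8 stride 4, 64 of 4×4 stride 2, 64 of 3×3 stride 1, no padding); this brief does not specify them, so treat them as an assumption.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output side length of a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


# (out_channels, kernel, stride), assumed from the Nature DQN architecture
LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

side = 84      # input frames are 84 x 84
channels = 4   # 4 stacked grayscale frames
for out_channels, kernel, stride in LAYERS:
    side = conv_out(side, kernel, stride)
    channels = out_channels
    print(f"{channels} x {side} x {side}")

# Number of inputs to the first fully connected layer
flat_features = channels * side * side
print("flattened:", flat_features)  # flattened: 3136
```

If your network's flattened size differs from this number, check the input layout first: PyTorch conv layers expect channels before height and width.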

Stretch goals

Double DQN — decouple action selection from evaluation
Dueling DQN — split into value and advantage streams
Prioritized replay — sample by TD error
Rainbow — combine all DQN improvements
Port to PPO and train on continuous-control tasks (MuJoCo)
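
As a pointer for the Double DQN stretch goal, the only change is in how the target is computed: the online network picks the next action and the target network scores it. The single-transition sketch below contrasts the two targets; the function names, gamma = 0.99, and the Q-values are illustrative assumptions.

```python
def dqn_target(reward, done, q_target_next, gamma=0.99):
    """Vanilla DQN: the target network both selects and evaluates the action."""
    bootstrap = 0.0 if done else max(q_target_next)
    return reward + gamma * bootstrap


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: online network selects the action, target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


q_online_next = [1.0, 3.0, 2.0]  # online network prefers action 1
q_target_next = [4.0, 0.5, 2.0]  # target network's own max is action 0

print(dqn_target(0.0, False, q_target_next))
print(double_dqn_target(0.0, False, q_online_next, q_target_next))
```

The Double DQN target is never larger than the vanilla one for the same inputs, which is how it reduces the overestimation that the max operator introduces.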

Watch out for

Reward signal is sparse and noisy. If episode return shows no improvement after 500k frames, suspect a preprocessing bug first (wrong frame stack, missing reward clipping); hyperparameters are rarely the cause.
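
One cheap guard against the missing-reward-clip bug is to assert on the rewards you sample from the replay buffer. The sketch below assumes sign-based clipping to {-1, 0, +1}, as used in the Nature DQN setup; the helper names are hypothetical.

```python
def clip_reward(reward: float) -> float:
    """Map a raw game reward to -1.0, 0.0, or +1.0 by its sign."""
    return float((reward > 0) - (reward < 0))


def check_clipped(rewards) -> None:
    """Raise if any sampled reward falls outside {-1, 0, 1}."""
    bad = [r for r in rewards if r not in (-1.0, 0.0, 1.0)]
    if bad:
        raise ValueError(f"unclipped rewards in batch: {bad[:5]}")


raw = [0.0, 7.0, -200.0, 1.0]
clipped = [clip_reward(r) for r in raw]
check_clipped(clipped)  # passes silently
print(clipped)          # [0.0, 1.0, -1.0, 1.0]
```

Calling check_clipped on the first few sampled batches costs almost nothing and fails loudly if the clipping wrapper was left out of the env stack.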