Project · Multi-day project
Deep Q-Network for Atari
Train an agent to play Atari games from pixels using a convolutional Q-network, replay buffer, target network, and ε-greedy exploration.
What you'll build
A from-scratch reimplementation of DeepMind's 2015 Nature DQN. You'll train an agent on Pong or Breakout that learns superhuman play purely from screen pixels and the score.
Prerequisites
RL has its own vocabulary on top of deep learning — get these solid first:
- MDPs — states, actions, rewards, discount factor γ, episodes
- Value vs Q-functions — what V(s) and Q(s, a) mean intuitively
- Bellman optimality equation — Q*(s, a) = E[r + γ·max over a' of Q*(s', a')], with the expectation taken over the next state s'
- Exploration vs exploitation — ε-greedy policies
- Off-policy learning and why experience replay is needed
- Why a target network — moving targets, training instability
- CNNs for processing 84×84×4 stacked frames
- Huber / smooth L1 loss and why it's preferred over MSE here
- Gym / Gymnasium API — reset, step, observation/action spaces
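On the Huber loss item above: the usual argument is that its gradient is bounded, so a single large TD error cannot dominate an update. The sketch below illustrates this in plain Python. It assumes a threshold of delta = 1.0 (which matches the default beta of PyTorch's smooth_l1_loss), and the function names are illustrative rather than taken from any library.

```python
def half_squared_error(td_error):
    """0.5 * e^2: the gradient (equal to e) grows without bound."""
    return 0.5 * td_error ** 2


def huber(td_error, delta=1.0):
    """Quadratic for |e| <= delta, linear beyond it."""
    e = abs(td_error)
    if e <= delta:
        return 0.5 * e ** 2
    return delta * (e - 0.5 * delta)


def huber_grad(td_error, delta=1.0):
    """Derivative of huber() w.r.t. the error: the error clipped to [-delta, delta]."""
    return max(-delta, min(delta, td_error))


# Small errors behave identically; large errors are penalised linearly by Huber.
for e in (0.5, 2.0, 10.0):
    print(e, half_squared_error(e), huber(e), huber_grad(e))
```

For a TD error of 10, the squared-error gradient is 10 while the Huber gradient is capped at 1, which is the stabilising effect the prerequisite refers to.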
Warm-up exercises
- Solve CartPole-v1 with a small DQN (MLP, no conv layers) — reach an average return above 475 over 100 consecutive episodes.
- Implement a replay buffer as a ring buffer and benchmark random sampling speed for 1M transitions.
- Wrap an Atari env with frame skip, grayscale, resize, and frame stack — print the resulting observation shape and dtype.
- Hand-derive the Bellman target for one transition and verify your code computes the same number.
- Plot how ε decays over 1M steps with your annealing schedule before training.
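For the replay-buffer warm-up, a ring buffer can be sketched as a preallocated list with a wrapping write index. This is a minimal, standard-library-only illustration; the class name and method signatures are assumptions made for the example, and a buffer meant to hold 1M Atari transitions would more likely store frames in preallocated uint8 arrays.

```python
import random


class RingReplayBuffer:
    """Fixed-capacity replay buffer; the oldest transition is overwritten first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = [None] * capacity  # preallocated slots
        self.write_idx = 0                # next slot to overwrite
        self.size = 0                     # number of valid transitions

    def add(self, state, action, reward, next_state, done):
        self.storage[self.write_idx] = (state, action, reward, next_state, done)
        self.write_idx = (self.write_idx + 1) % self.capacity  # wrap around
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, rng=random):
        """Uniformly sample transitions (with replacement) from the filled part."""
        idxs = [rng.randrange(self.size) for _ in range(batch_size)]
        return [self.storage[i] for i in idxs]

    def __len__(self):
        return self.size


# Push 5 transitions into a buffer of capacity 3: only the last 3 survive.
buf = RingReplayBuffer(capacity=3)
for t in range(5):
    buf.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
print(len(buf), sorted(tr[0] for tr in buf.storage))
```

Sampling with replacement keeps the sketch short; for a batch of 32 drawn from a buffer of a million transitions, duplicates are rare enough that the difference is usually negligible.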
Difficulty
Advanced. RL is famously finicky — debugging requires patience.
Core components
- Atari wrappers: frame skip (4), grayscale, resize to 84×84, frame stack (4), reward clipping, no-op reset, episodic life.
- Q-network: 3 conv layers + 2 FC, outputs Q-values for each action.
- Replay buffer: 1M transitions, uniform sampling (stretch: prioritized).
- Target network: hard-update every 10k steps.
- ε-greedy: linear anneal from 1.0 → 0.1 over 1M steps.
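The ε-greedy component listed above can be sketched as two small functions: a schedule that anneals linearly from 1.0 to 0.1 over 1M steps and then holds, and an action selector. The function names and the choice to hold ε constant after annealing are assumptions of this example.

```python
import random


def linear_epsilon(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Linearly interpolate from eps_start to eps_end, then hold at eps_end."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the argmax of q_values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Inspect the schedule at a few points before committing to a long training run.
for step in (0, 500_000, 1_000_000, 2_000_000):
    print(step, linear_epsilon(step))
```

In a training loop this would be called as epsilon_greedy(q_values, linear_epsilon(global_step)), where q_values are the online network's outputs for the current state.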
The Bellman target
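import torch
import torch.nn.functional as F
# Batch from the replay buffer: (s, a, r, s_next, done); a is int64, done is a 0/1 float tensor
with torch.no_grad():
    # Bootstrap from the frozen target network; terminal transitions keep only r
    target = r + gamma * (1.0 - done) * q_target(s_next).max(dim=1).values
# Q-value of the action actually taken in each transition
q_pred = q_online(s).gather(1, a.unsqueeze(1)).squeeze(1)
loss = F.smooth_l1_loss(q_pred, target)
Milestones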
- Get a random agent running with all preprocessing wrappers.
- Implement replay buffer + Q-network; verify forward shapes.
- Train on Pong (among the easiest Atari games for DQN). Expect roughly 1M frames before the agent starts winning.
- Add target network and Huber loss; observe stability improvement.
- Log episode return, ε, mean Q, loss. Plot smoothed curves.
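For the "verify forward shapes" milestone, it helps to compute the expected feature-map sizes by hand first. The sketch below does that arithmetic for an 84×84 input. The kernel sizes, strides, and channel counts are assumed from the Nature DQN paper's architecture (8×8 stride 4, 4×4 stride 2, 3×3 stride 1, no padding); this page only specifies "3 conv layers + 2 FC", so adjust them if your network differs.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output side length of a square convolution (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


# (out_channels, kernel, stride) per conv layer; assumed from the Nature DQN paper.
CONV_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]


def feature_shapes(input_size=84):
    """Return (channels, height, width) after each conv layer."""
    shapes = []
    size = input_size
    for channels, kernel, stride in CONV_LAYERS:
        size = conv_out(size, kernel, stride)
        shapes.append((channels, size, size))
    return shapes


shapes = feature_shapes()
# Number of inputs the first fully connected layer must accept.
flat_features = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes, flat_features)
```

If the flattened size your network reports differs from the value computed here, the mismatch usually points to a wrong input resolution or a channel-ordering problem in the frame stack.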
Stretch goals
- Double DQN — decouple action selection from evaluation
- Dueling DQN — split into value and advantage streams
- Prioritized replay — sample by TD error
- Rainbow — combine all DQN improvements
- Port to PPO and train on continuous-control tasks (MuJoCo)
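The first stretch goal, Double DQN, changes only how the target is computed: the online network picks the next action and the target network scores it. The sketch below contrasts the two targets for a single transition using plain Python lists in place of network outputs. The Q-values are made-up numbers and gamma = 0.99 is an assumed discount factor, not a value stated on this page.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def dqn_target(reward, done, q_target_next, gamma=0.99):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    return reward + gamma * (1.0 - float(done)) * max(q_target_next)


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    best_action = argmax(q_online_next)
    return reward + gamma * (1.0 - float(done)) * q_target_next[best_action]


# Hypothetical Q-values for the next state s' under each network.
q_online_next = [1.0, 2.0, 0.5]
q_target_next = [1.5, 1.0, 3.0]

print(dqn_target(1.0, False, q_target_next))
print(double_dqn_target(1.0, False, q_online_next, q_target_next))
print(double_dqn_target(1.0, True, q_online_next, q_target_next))
```

Here the vanilla target bootstraps from the target network's largest value, while the Double DQN target uses the target network's value for the action the online network prefers, which is the decoupling that reduces overestimation. On a terminal transition both reduce to the reward alone.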
Watch out for
Reward signal is sparse and noisy. If your agent isn't learning after 500k frames, it's almost always a preprocessing bug (wrong frame stack, missing reward clip) — not a hyperparameter issue.