Reinforcement Learning Series

Welcome to my Reinforcement Learning playground.

The plan is simple: work through every major RL algorithm family by training agents on classic OpenAI Gymnasium games. Each entry in this series pairs a core algorithm with a game environment that highlights its strengths (and exposes its weaknesses). Code is written from scratch — no stable-baselines shortcuts.

Tier 1: Tabular RL — The Foundations

Before neural networks enter the picture. Pure Q-tables, pure intuition.

Algorithm	Core Idea	Game
Q-Learning	Off-policy TD control. Always updates toward the greedy max-Q action.	FrozenLake-v1
SARSA	On-policy TD control. Updates toward the action actually taken — more conservative.	Taxi-v3

Tier 2: Deep Value — When Tables Aren't Enough

Replacing Q-tables with neural networks. The birth of modern deep RL.

Algorithm	Core Idea	Game
DQN	Function approximation + experience replay + target network. The DeepMind Atari breakthrough.	CartPole-v1
Double DQN	Decouples action selection from evaluation to fix DQN's overestimation bias.	MountainCar-v0
Dueling DQN	Separates state-value $V(s)$ from advantage $A(s,a)$ for more efficient learning.	Breakout (Atari)

Tier 3: Actor-Critic — The Modern Mainstream

Two networks working together: an Actor (policy) and a Critic (value function). This is the architecture behind RLHF in LLMs.

Algorithm	Core Idea	Game
A2C	Synchronous advantage actor-critic with a shared feature extractor.	LunarLander-v2
PPO	Clipped surrogate objective — the most stable and widely deployed policy gradient method. OpenAI's default.	CarRacing-v3
DDPG	Deterministic policy gradient for continuous actions. Essentially DQN for continuous spaces.	Pendulum-v1
SAC	Maximum-entropy RL — optimizes reward and exploration simultaneously. Current SOTA for continuous control.	CarRacing-v3

Tier 4: Advanced & Model-Based

Beyond model-free. Learning a model of the world and planning inside it.

Algorithm	Core Idea	Game
Dyna-Q	Model-based tabular RL. The agent "imagines" transitions to supplement real experience.	GridWorld
World Models	VAE + MDN-RNN + Controller. Train a policy entirely inside a learned dream.	CarRacing-v3
DPO	Direct Preference Optimization — bypassing the critic entirely with human preference data.	LLM alignment

Published Episodes

Car Racing

RL Series [01]: CarRacing-v3 — From Pixels to Policy

Mar 3, 2026

This is going to be a long ride. If you spot bugs or have ideas for which games to pair with specific algorithms, reach out — we're building this in the open.