Published on: Reinforcement Learning Series
Author: Shuqi Wang
Welcome to my Reinforcement Learning playground.
The plan is simple: work through every major RL algorithm family by training agents on classic Gymnasium (formerly OpenAI Gym) environments. Each entry in this series pairs a core algorithm with a game environment that highlights its strengths (and exposes its weaknesses). Code is written from scratch — no Stable-Baselines3 shortcuts.
## Tier 1: Tabular RL — The Foundations
Before neural networks enter the picture. Pure Q-tables, pure intuition.
| Algorithm | Core Idea | Game |
|---|---|---|
| Q-Learning | Off-policy TD control. Always updates toward the greedy max-Q action. | FrozenLake-v1 |
| SARSA | On-policy TD control. Updates toward the action actually taken — more conservative. | Taxi-v3 |
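The only difference between the two updates is the bootstrap term: Q-Learning backs up from the greedy action in the next state, while SARSA backs up from the action the policy actually took. A minimal sketch (the function names and the tiny 2×2 table are illustrative, not from any library):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the greedy (max-Q) action in s_next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action a_next actually taken in s_next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# Toy 2-state, 2-action table: one Q-Learning backup from (s=0, a=1)
Q = np.zeros((2, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```

Because the two functions share every term except the bootstrap, SARSA inherits the exploration policy's caution — which is exactly why it tends to learn safer paths on cliff-like environments.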
## Tier 2: Deep Value — When Tables Aren't Enough
Replacing Q-tables with neural networks. The birth of modern deep RL.
| Algorithm | Core Idea | Game |
|---|---|---|
| DQN | Function approximation + experience replay + target network. The DeepMind Atari breakthrough. | CartPole-v1 |
| Double DQN | Decouples action selection from evaluation to fix DQN's overestimation bias. | MountainCar-v0 |
| Dueling DQN | Separates state-value from advantage for more efficient learning. | Breakout (Atari) |
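Double DQN's fix is easy to see at the level of the target computation: plain DQN lets the target network both select and evaluate the next action, while Double DQN selects with the online network and evaluates with the target network. A sketch under assumed shapes (batch of Q-value rows; the function names are illustrative):

```python
import numpy as np

def dqn_target(r, gamma, q_next_target):
    # Plain DQN: target net both selects and evaluates the next action,
    # so any upward noise in q_next_target is maxed over -> overestimation.
    return r + gamma * np.max(q_next_target, axis=1)

def double_dqn_target(r, gamma, q_next_online, q_next_target):
    # Double DQN: online net picks the action, target net scores it.
    best = np.argmax(q_next_online, axis=1)
    return r + gamma * q_next_target[np.arange(len(best)), best]

rng = np.random.default_rng(0)
r = np.array([1.0, 0.0])
gamma = 0.99
q_online = rng.normal(size=(2, 3))   # online net's Q-values for s'
q_target = rng.normal(size=(2, 3))   # target net's Q-values for s'
y_dqn = dqn_target(r, gamma, q_target)
y_ddqn = double_dqn_target(r, gamma, q_online, q_target)
```

Since the target net's value at the online net's argmax can never exceed the target net's own max, the Double DQN target is always at most the plain DQN target — the overestimation bias only shrinks.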
## Tier 3: Actor-Critic — The Modern Mainstream
Two networks working together: an Actor (policy) and a Critic (value function). This is the architecture behind RLHF in LLMs.
| Algorithm | Core Idea | Game |
|---|---|---|
| A2C | Synchronous advantage actor-critic with a shared feature extractor. | LunarLander-v2 |
| PPO | Clipped surrogate objective — the most stable and widely deployed policy gradient method. OpenAI's default. | CarRacing-v3 |
| DDPG | Deterministic policy gradient for continuous actions. Essentially DQN for continuous spaces. | Pendulum-v1 |
| SAC | Maximum-entropy RL — optimizes reward and exploration simultaneously. Current SOTA for continuous control. | CarRacing-v3 |
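PPO's clipped surrogate objective is short enough to show in full: take the probability ratio between the new and old policy, clip it to `[1 - eps, 1 + eps]`, and keep the pessimistic minimum so that large policy steps earn no extra reward. A minimal NumPy sketch (the helper name and toy inputs are illustrative):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate: negative mean of min(r*A, clip(r)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()

# With a positive advantage, pushing the ratio far past 1+eps buys nothing:
# the clipped term caps the objective at (1+eps) * advantage.
loss_near = ppo_clip_loss(np.array([1.1]), np.array([2.0]))  # inside the clip
loss_far = ppo_clip_loss(np.array([5.0]), np.array([2.0]))   # capped by the clip
```

The cap is what makes PPO stable: the gradient of the clipped branch with respect to the ratio is zero outside the trust region, so an overly aggressive update simply stops being rewarded.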
## Tier 4: Advanced & Model-Based
Beyond model-free. Learning a model of the world and planning inside it.
| Algorithm | Core Idea | Game |
|---|---|---|
| Dyna-Q | Model-based tabular RL. The agent "imagines" transitions to supplement real experience. | GridWorld |
| World Models | VAE + MDN-RNN + Controller. Train a policy entirely inside a learned dream. | CarRacing-v3 |
| DPO | Direct Preference Optimization — fine-tunes directly on human preference data, skipping the explicit reward model and RL loop. | LLM alignment |
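Dyna-Q's "imagined" transitions are just extra Q-updates replayed from a learned model of the environment. A minimal sketch of the planning phase, assuming a deterministic tabular model stored as a dict (the function name and toy model are illustrative):

```python
import random
import numpy as np

def dyna_q_planning(Q, model, n_steps, alpha=0.1, gamma=0.99):
    """Replay n_steps imagined transitions from the learned model.

    model maps (s, a) -> (r, s_next) for previously seen transitions.
    Each imagined step applies the same Q-Learning backup as real experience.
    """
    for _ in range(n_steps):
        (s, a), (r, s_next) = random.choice(list(model.items()))
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

# One remembered transition: taking a=1 in s=0 gave r=1.0 and landed in s=2.
Q = np.zeros((3, 2))
model = {(0, 1): (1.0, 2)}
dyna_q_planning(Q, model, n_steps=3)
```

Three imagined replays of a single real transition already push `Q[0, 1]` most of the way toward its target — the whole point of Dyna-Q is buying this extra convergence without extra environment interaction.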
## Published Episodes

### RL-01: CarRacing-v3 — From Pixels to Policy
We kick off the series by dissecting the `CarRacing-v3` environment in Gymnasium — its history, the landmark papers that made it famous (World Models, Dreamer), the reward function's mathematical structure, and a random-agent baseline. Includes the full algorithm roadmap.
This is going to be a long ride. If you spot bugs or have ideas for which games to pair with specific algorithms, reach out — we're building this in the open.