Saguaro
Reinforcement Learning Series

Author: Shuqi Wang

Welcome to my Reinforcement Learning playground.

The plan is simple: work through every major RL algorithm family by training agents on classic Gymnasium (formerly OpenAI Gym) environments. Each entry in this series pairs a core algorithm with a game environment that highlights its strengths (and exposes its weaknesses). Code is written from scratch — no stable-baselines shortcuts.


Tier 1: Tabular RL — The Foundations

Before neural networks enter the picture. Pure Q-tables, pure intuition.

| Algorithm | Core Idea | Game |
| --- | --- | --- |
| Q-Learning | Off-policy TD control. Always updates toward the greedy max-Q action. | FrozenLake-v1 |
| SARSA | On-policy TD control. Updates toward the action actually taken — more conservative. | Taxi-v3 |
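The difference between the two update rules is small enough to show side by side. A minimal NumPy sketch — the function names, signatures, and hyperparameter defaults here are illustrative assumptions, not tied to any particular environment:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD(0): bootstrap from the greedy max over next-state actions."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD(0): bootstrap from the action the policy actually took next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```

The only difference is one term in the TD target: the max over next actions (Q-learning) versus the sampled next action (SARSA) — which is exactly why SARSA behaves more conservatively under an exploratory policy.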

Tier 2: Deep Value — When Tables Aren't Enough

Replacing Q-tables with neural networks. The birth of modern deep RL.

| Algorithm | Core Idea | Game |
| --- | --- | --- |
| DQN | Function approximation + experience replay + target network. The DeepMind Atari breakthrough. | CartPole-v1 |
| Double DQN | Decouples action selection from evaluation to fix DQN's overestimation bias. | MountainCar-v0 |
| Dueling DQN | Separates state-value V(s) from advantage A(s,a) for more efficient learning. | Breakout (Atari) |
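Two of these fixes are compact enough to sketch directly: the Double DQN target (select the action with the online net, evaluate it with the target net) and the dueling aggregation Q(s,a) = V(s) + A(s,a) - mean_a A(s,a). Hedged NumPy versions — the function names and signatures are mine, not from any library:

```python
import numpy as np

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99, done=False):
    # Select the next action with the online network...
    a_star = int(np.argmax(q_online_next))
    # ...but evaluate it with the target network, breaking the max-bias loop.
    bootstrap = 0.0 if done else q_target_next[a_star]
    return r + gamma * bootstrap

def dueling_aggregate(v, advantages):
    # Q(s,a) = V(s) + A(s,a) - mean_a A(s,a). Subtracting the mean keeps the
    # V/A decomposition identifiable by forcing advantages to zero mean.
    return v + advantages - advantages.mean()
```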

Tier 3: Actor-Critic — The Modern Mainstream

Two networks working together: an Actor (policy) and a Critic (value function). This is the architecture behind RLHF in LLMs.

| Algorithm | Core Idea | Game |
| --- | --- | --- |
| A2C | Synchronous advantage actor-critic with a shared feature extractor. | LunarLander-v2 |
| PPO | Clipped surrogate objective — the most stable and widely deployed policy gradient method. OpenAI's default. | CarRacing-v3 |
| DDPG | Deterministic policy gradient for continuous actions. Essentially DQN for continuous spaces. | Pendulum-v1 |
| SAC | Maximum-entropy RL — optimizes reward and exploration simultaneously. Current SOTA for continuous control. | CarRacing-v3 |
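PPO's clipped surrogate objective fits in a few lines, which is a big part of its appeal. A NumPy sketch of the negated (to-be-minimized) loss, assuming per-sample log-probabilities and advantage estimates are already computed; the function name is my own:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clipping removes any incentive to push the ratio outside [1-eps, 1+eps]
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic min of the two terms, negated because optimizers minimize
    return -np.mean(np.minimum(unclipped, clipped))
```

When the new policy equals the old one the ratio is 1 and the loss reduces to the plain policy-gradient surrogate; only as the ratio drifts does the clip bite.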

Tier 4: Advanced & Model-Based

Beyond model-free. Learning a model of the world and planning inside it.

| Algorithm | Core Idea | Game |
| --- | --- | --- |
| Dyna-Q | Model-based tabular RL. The agent "imagines" transitions to supplement real experience. | GridWorld |
| World Models | VAE + MDN-RNN + Controller. Train a policy entirely inside a learned dream. | CarRacing-v3 |
| DPO | Direct Preference Optimization — bypassing the critic entirely with human preference data. | LLM alignment |
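Dyna-Q's "imagination" step is just Q-learning replayed against a memorized model. A minimal sketch, assuming deterministic transitions stored in a dict — the model representation and all names here are illustrative, not from any library:

```python
import random
import numpy as np

def dyna_q_planning(Q, model, n_steps=5, alpha=0.1, gamma=0.99, seed=0):
    """Run imagined Q-learning updates from a learned model.

    model maps (state, action) -> (reward, next_state), memorized from real
    experience; sampling from it stands in for interacting with the world.
    """
    rng = random.Random(seed)
    transitions = list(model.items())
    for _ in range(n_steps):
        (s, a), (r, s_next) = rng.choice(transitions)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    return Q
```

In the full Dyna-Q loop each real environment step both updates Q directly and adds its transition to `model`, after which several of these planning updates run for free.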

Published Episodes

RL-01: CarRacing-v3 — From Pixels to Policy

We kick off the series by dissecting the CarRacing-v3 environment in Gymnasium — its history, the landmark papers that made it famous (World Models, Dreamer), the reward function's mathematical structure, and a random-agent baseline. Includes the full algorithm roadmap.


This is going to be a long ride. If you spot bugs or have ideas for which games to pair with specific algorithms, reach out — we're building this in the open.

Thanks for reading. Stay curious!