RL Series [01]: CarRacing-v3 — From Pixels to Policy
Shuqi Wang
Reinforcement Learning Series — Episode 01
This series walks through every major family of RL algorithms, each paired with a classic Gymnasium game. We start with the continuous control benchmark CarRacing-v3 — the same environment that powered Ha & Schmidhuber's World Models paper. No black boxes, just code and raw pixels.
Overview
This episode is a setup post. Before writing a single line of policy network code, we need to fully understand the environment we're living in. That means:
- Understanding where CarRacing came from — its history and why it matters to the RL research community.
- Dissecting the game mechanics: observations, actions, and the reward function in precise detail.
- Running our first experiment: a random-agent baseline to see just how badly an untrained agent performs.
- Laying out the roadmap of algorithms we'll implement in future episodes.
1. A Brief History of CarRacing
The CarRacing environment was introduced as part of OpenAI Gym in the mid-2010s. At the time, most RL benchmarks were either too simple (CartPole, MountainCar) or too expensive to run (Atari, MuJoCo). CarRacing struck a rare balance: it offered a visually rich, pixel-based observation combined with a continuous action space — making it genuinely hard, while remaining computationally accessible.
The environment was written by Oleg Klimov and is built on the Box2D physics engine, which gives the car realistic tire friction and skidding dynamics.
Key Papers That Made CarRacing Famous
CarRacing became a landmark benchmark precisely because of the papers written around it:
| Year | Paper | Contribution |
|---|---|---|
| 2018 | World Models (Ha & Schmidhuber, NeurIPS) | Trained an agent entirely inside a "dream" — a learned generative model of the environment. Achieved 906 ± 21 average reward on CarRacing-v0. |
| 2019 | Dreamer (Hafner et al., ICLR 2020) | Extended world models with latent imagination and backpropagation through time. Set a new state of the art on continuous control tasks. |
| 2021 | DreamerV2 (Hafner et al., ICLR 2021) | Switched to discrete latents (categorical distributions) via straight-through gradients, improving stability and performance. |
| 2023 | TD-MPC2 (Hansen et al.) | Unified model-based planning with a temporal difference objective. Achieved near-human performance across dozens of continuous control environments. |
NOTE
The World Models paper is arguably the most influential use of CarRacing. The core idea: instead of learning to act in the real environment, train a VAE to compress pixels into a compact latent vector z, then train an MDN-RNN to predict future latents. The controller is then trained entirely within the generated dream, avoiding expensive environment rendering during policy learning.
Understanding these papers gives us a sense of what we're building towards — even our simple DQN and PPO implementations will serve as a foundation for eventually exploring model-based approaches.
From OpenAI Gym to Farama Gymnasium
OpenAI officially transferred maintenance of the Gym library to the Farama Foundation in 2022. The community-maintained fork, Gymnasium, introduced several improvements to CarRacing, including the CarRacing-v3 revision, which fixed subtle physics inconsistencies and added proper support for a discrete action space. In this series we use gymnasium throughout.
```shell
# Install the correct dependencies
pip install "gymnasium[box2d]"   # Box2D physics engine required
pip install "gymnasium[other]"   # Additional optional dependencies
```
2. Environment Deep Dive
2.1 Observation Space
The agent perceives the world as a top-down, 96×96 RGB image — a tensor of shape (96, 96, 3) with uint8 values in [0, 255].
This is not just a clean bird's-eye view. The bottom band of the image contains a mock dashboard with continuous readouts for:
- True speed (white bar)
- Four ABS sensor readings
- Steering wheel position
- Gyroscope reading
This matters architecturally — many implementations crop the bottom 12 pixels to remove these indicators, while others let the convolutional network learn to use them. We'll explore both.
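As a concrete sketch, here is one way the cropping option could look, assuming a plain NumPy frame. The 12-pixel crop follows the convention mentioned above; the channel-averaging grayscale conversion is our own simplification, not something the environment provides:

```python
import numpy as np

def preprocess(frame: np.ndarray, crop_dashboard: bool = True) -> np.ndarray:
    """Convert a (96, 96, 3) uint8 frame to a float32 grayscale image in [0, 1].

    If crop_dashboard is True, the bottom 12 rows (the mock dashboard with
    the speed, ABS, steering, and gyroscope indicators) are removed,
    leaving an 84x96 image.
    """
    if crop_dashboard:
        frame = frame[:-12, :, :]   # drop the bottom 12 pixel rows
    gray = frame.mean(axis=2)       # naive grayscale: average the RGB channels
    return (gray / 255.0).astype(np.float32)

# Shape check on a dummy frame with the environment's observation shape
dummy = np.zeros((96, 96, 3), dtype=np.uint8)
print(preprocess(dummy).shape)  # (84, 96)
```

Which variant works better is an empirical question; we will compare both once we have a trained agent.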
2.2 Action Space
The default action space is continuous: a 3-dimensional Box with bounds low = [-1, 0, 0] and high = [1, 1, 1].
| Dimension | Range | Meaning |
|---|---|---|
| a[0] — Steering | [-1.0, 1.0] | Negative = full left, positive = full right |
| a[1] — Gas | [0.0, 1.0] | Throttle (forward acceleration) |
| a[2] — Brake | [0.0, 1.0] | Braking force |
This continuous space is the natural interface for algorithms like PPO and SAC. However, deep Q-learning (DQN) requires a discrete action space. Gymnasium provides an optional discrete mode via continuous=False, which collapses to 5 actions: do nothing, steer left, steer right, gas, or brake.
We'll need this discretization in our DQN episode — and we'll implement the wrapper ourselves to understand exactly what gets lost in the translation.
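To make that translation concrete, here is a hypothetical lookup table in the spirit of such a wrapper. The exact magnitudes (full steering, full gas, partial brake) are our own illustrative choices, not Gymnasium's internal values:

```python
import numpy as np

# One plausible mapping from 5 discrete actions to continuous
# (steer, gas, brake) triples. The magnitudes here are illustrative.
DISCRETE_TO_CONTINUOUS = {
    0: np.array([ 0.0, 0.0, 0.0], dtype=np.float32),  # do nothing
    1: np.array([-1.0, 0.0, 0.0], dtype=np.float32),  # steer left
    2: np.array([ 1.0, 0.0, 0.0], dtype=np.float32),  # steer right
    3: np.array([ 0.0, 1.0, 0.0], dtype=np.float32),  # gas
    4: np.array([ 0.0, 0.0, 0.8], dtype=np.float32),  # brake (partial, to avoid lockup)
}

def to_continuous(action: int) -> np.ndarray:
    """Translate a discrete action index into the continuous action vector."""
    return DISCRETE_TO_CONTINUOUS[action]
```

Note what the table makes obvious: the discrete agent can never steer and accelerate in the same step, which is exactly the kind of loss we want to quantify.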
2.3 Reward Function
The reward function encodes the designer's intent in a mathematically precise way. Understanding it is critical for understanding why an agent behaves the way it does.
Let N be the total number of track tiles in a generated episode. The reward at each timestep has three components:
- +1000/N per newly visited tile (each tile is rewarded only once).
- −0.1 every frame as a time penalty, regardless of what the car does.
- −100 if the car strays too far off the track, triggering episode termination.
Summing this up: if an agent visits all N tiles, its total tile reward is exactly 1000. The time penalty of −0.1/frame means a perfect episode lasting 1000 frames yields a net reward of 900. The solve threshold is typically defined as a 900+ average reward over 100 consecutive episodes.
This deceptively simple reward function creates some interesting optimization traps:
- A greedy agent might spin in circles repeatedly visiting the same tile (the tile reward only applies to new tiles).
- An agent optimizing too hard for speed might accumulate large off-track penalties.
- An agent that drives very slowly may visit all tiles but bleed away its reward through the accumulated time penalty.
These are exactly the kinds of reward shaping pathologies that make this environment a useful benchmark.
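A quick back-of-the-envelope model of the return makes the 900-point threshold tangible. This is our own toy function; it ignores tile ordering and the exact termination step:

```python
def episode_return(n_tiles: int, tiles_visited: int, frames: int,
                   went_off_track: bool = False) -> float:
    """Toy model of the CarRacing episode return."""
    tile_reward = 1000.0 * tiles_visited / n_tiles   # +1000/N per newly visited tile
    time_penalty = -0.1 * frames                     # -0.1 every frame
    off_track_penalty = -100.0 if went_off_track else 0.0
    return tile_reward + time_penalty + off_track_penalty

# A perfect run: every tile visited over a 1000-frame episode
print(episode_return(n_tiles=300, tiles_visited=300, frames=1000))  # 900.0
```

Plug in a slow driver (all tiles, but 2000 frames) and the return drops to 800 — the time penalty quietly dominates, which is exactly the third trap above.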
3. Environment Setup & Random Policy Baseline
Now let's ground everything above in code. The first notebook (01_Environment_Setup.ipynb) initializes the environment, inspects its spaces, and runs a completely random agent.
Section 1: Inspecting the Environment
```python
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display, clear_output

# Initialize environment in 'rgb_array' mode for inline visualization
env = gym.make('CarRacing-v3', render_mode='rgb_array')
state, info = env.reset()

print(f"Observation Space: {env.observation_space}")
# >> Box(0, 255, (96, 96, 3), uint8)
print(f"Action Space: {env.action_space}")
# >> Box([-1. 0. 0.], 1.0, (3,), float32)

plt.figure(figsize=(6, 6))
plt.imshow(state)
plt.title('Initial Environment State (96x96 RGB Px)')
plt.axis('off')
plt.show()
env.close()
```
The output confirms our theory: a 96×96 uint8 image, and a 3-dimensional continuous action vector with asymmetric bounds (steering is symmetric around zero; gas and brake are non-negative).
Section 2: Random Policy Baseline
```python
env_test = gym.make('CarRacing-v3', render_mode='rgb_array')
state, info = env_test.reset()

steps_to_run = 500
total_reward = 0
fig, ax = plt.subplots(figsize=(5, 5))

for step in range(steps_to_run):
    # Sample a uniformly random action from the entire action space
    action = env_test.action_space.sample()
    state, reward, terminated, truncated, info = env_test.step(action)
    total_reward += reward

    # Inline rendering for notebook visualization
    frame = env_test.render()
    ax.clear()
    ax.imshow(frame)
    ax.set_title(f"Step: {step+1} | Cumulative Reward: {total_reward:.1f}")
    ax.axis('off')
    display(fig)
    clear_output(wait=True)

    if terminated or truncated:
        print(f"Episode ended at step {step+1}. Resetting...")
        state, info = env_test.reset()
        total_reward = 0

env_test.close()
plt.close()
print("Done.")
```
What you'll observe: the car immediately starts spinning, veering off-track, and triggering the terminal penalty within the first 10-20 frames. This is our baseline — any reasonably trained agent should dramatically outperform this random policy.
NOTE
The random policy serves as more than just a sanity check. It's also how we collect an initial dataset for offline methods and world model training — random rollouts provide broad, policy-independent coverage of the state space, unlike on-policy data, which becomes increasingly concentrated as the agent improves.
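A minimal container for storing those random rollouts might look like this — a generic FIFO replay-buffer sketch of our own, not tied to any particular library:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (state, action, reward, next_state, done) tuples.

    When the buffer is full, the oldest transitions are evicted automatically
    thanks to deque's maxlen behavior.
    """
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform sampling without replacement within one batch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The same structure will be reused in the DQN episode, where prioritized sampling replaces the uniform `random.sample` call.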
4. What's Coming: The Algorithm Roadmap
Over the next several episodes, we'll train agents on CarRacing-v3 using progressively more sophisticated algorithms. Here's the plan:
Episode RL-02: Deep Q-Network (DQN)
Before tackling continuous control, we'll discretize the action space and implement a full DQN agent from scratch:
- Dueling DQN architecture: a separate value stream V(s) and advantage stream A(s, a), combined as Q(s, a) = V(s) + A(s, a) − mean_a' A(s, a')
- Prioritized Experience Replay: sample transitions proportional to their TD error, focusing training on informative experiences
- Double DQN: decouple action selection and action evaluation to reduce overestimation bias
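The dueling combination rule from the first bullet can be sketched with plain NumPy (the batch shapes and the 5-action count are illustrative; the real agent will compute this inside the network's forward pass):

```python
import numpy as np

def dueling_q(value: np.ndarray, advantages: np.ndarray) -> np.ndarray:
    """Combine a state value V(s) with per-action advantages A(s, a):

        Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')

    Subtracting the mean advantage keeps the V/A decomposition identifiable.
    value: shape (batch, 1); advantages: shape (batch, n_actions).
    """
    return value + advantages - advantages.mean(axis=1, keepdims=True)

V = np.array([[2.0]])
A = np.array([[1.0, -1.0, 0.0, 3.0, -3.0]])  # one row per state, 5 discrete actions
print(dueling_q(V, A))  # [[ 3.  1.  2.  5. -1.]]
```

Note that the mean-subtraction leaves the argmax over actions unchanged — it only pins down how credit is split between the two streams.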
Episode RL-03: Proximal Policy Optimization (PPO)
PPO is the workhorse of modern RL — the same algorithm in the RLHF loop that fine-tunes LLMs. We move back to the continuous action space and implement:
- Actor-Critic architecture with a shared CNN feature extractor
- Surrogate clipping objective: L^CLIP(θ) = E_t[min(r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t)], where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)
- GAE (Generalized Advantage Estimation): exponentially-weighted advantage estimation that interpolates between Monte Carlo and TD(0)
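The GAE computation from the last bullet can be sketched over a single rollout; the array shapes and the convention that `values` carries one extra bootstrap entry for the final state are our own assumptions:

```python
import numpy as np

def gae(rewards: np.ndarray, values: np.ndarray, dones: np.ndarray,
        gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    """Generalized Advantage Estimation over one rollout of length T.

    values has length T+1 (it includes a bootstrap value for the final state).
    lam=1 recovers Monte Carlo advantages; lam=0 recovers one-step TD errors.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):            # sweep backwards through time
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        advantages[t] = last
    return advantages
```

The backward sweep is what makes the interpolation cheap: each advantage reuses the already-computed estimate one step ahead, so the whole rollout costs O(T).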
Episode RL-04+: World Models & Dreamer (The Long Game)
Once we have solid model-free baselines, we'll revisit the idea that made CarRacing famous in RL literature — learning to act inside a learned world model.
Recommended Reading & References
- Gymnasium Documentation: CarRacing-v3
- World Models (Ha & Schmidhuber, 2018): Paper · Interactive Demo
- Dreamer (Hafner et al., 2019): Paper
- DreamerV2 (Hafner et al., 2020): Paper
- Proximal Policy Optimization (Schulman et al., 2017): Paper
- Human-level control through deep RL — the original DQN paper (Mnih et al., Nature 2015): Paper