RL Series [01]: CarRacing-v3 — From Pixels to Policy
Shuqi Wang
Reinforcement Learning Series — Episode 01
This series walks through every major family of RL algorithms, each paired with a classic Gymnasium game. We start with the continuous control benchmark CarRacing-v3 — the same environment that powered Ha & Schmidhuber's World Models paper. No black boxes, just code and raw pixels.
Overview
This episode is a setup post. Before writing a single line of policy network code, we need to fully understand the environment we're living in. That means:
- Understanding where CarRacing came from — its history and why it matters to the RL research community.
- Dissecting the game mechanics: observations, actions, and the reward function in precise detail.
- Running our first experiment: a random-agent baseline to see just how badly an untrained agent performs.
- Laying out the roadmap of algorithms we'll implement in future episodes.
1. A Brief History of CarRacing
The CarRacing environment was introduced as part of OpenAI Gym in the mid-2010s. At the time, most RL benchmarks were either too simple (CartPole, MountainCar) or too expensive to run (Atari, MuJoCo). CarRacing struck a rare balance: it offered a visually rich, pixel-based observation combined with a continuous action space — making it genuinely hard, while remaining computationally accessible.
The environment was written by Oleg Klimov and is built on the Box2D physics engine, which gives the car realistic tire friction and skidding dynamics.
Key Papers That Made CarRacing Famous
CarRacing became a landmark benchmark precisely because of the papers written around it:
| Year | Paper | Contribution |
|---|---|---|
| 2018 | World Models (Ha & Schmidhuber, NeurIPS) | Trained an agent entirely inside a "dream" — a learned generative model of the environment. Achieved 906 ± 21 average reward on CarRacing-v0. |
| 2019 | Dreamer (Hafner et al., ICLR 2020) | Extended world models with latent imagination and backpropagation through time. Set a new state of the art on continuous control tasks. |
| 2021 | DreamerV2 (Hafner et al., ICLR 2021) | Switched to discrete latents (categorical distributions) via straight-through gradients, improving stability and performance. |
| 2023 | TD-MPC2 (Hansen et al.) | Unified model-based planning with a temporal difference objective. Achieved near-human performance across dozens of continuous control environments. |
NOTE
The World Models paper is arguably the most influential use of CarRacing. The core idea: instead of learning to act in the real environment, train a VAE to compress pixels into a compact latent vector z, then train an MDN-RNN to predict future latents. The controller is then trained entirely within the generated dream, avoiding expensive environment rendering during policy learning.
Understanding these papers gives us a sense of what we're building towards — even our simple DQN and PPO implementations will serve as a foundation for eventually exploring model-based approaches.
From OpenAI Gym to Farama Gymnasium
OpenAI officially transferred maintenance of the Gym library to the Farama Foundation in 2022. The community-maintained fork, Gymnasium, introduced several improvements to CarRacing, including the CarRacing-v3 revision, which fixed subtle physics inconsistencies and added proper support for a discrete action space. In this series we use gymnasium throughout.
```shell
# Install the correct dependencies
pip install "gymnasium[box2d]"   # Box2D physics engine required
pip install "gymnasium[other]"   # Additional optional dependencies
```
2. Environment Deep Dive
2.1 Observation Space
The agent perceives the world as a top-down, 96×96 RGB image — a tensor of shape (96, 96, 3) with uint8 values in [0, 255].
This is not just a clean bird's-eye view. The bottom band of the image contains a mock dashboard with continuous readouts for:
- True speed (white bar)
- Four ABS sensor readings
- Steering wheel position
- Gyroscope reading
This matters architecturally — many implementations crop the bottom 12 pixels to remove these indicators, while others let the convolutional network learn to use them. We'll explore both.
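As a concrete sketch, here is one way the cropping option could look, assuming a plain NumPy frame. The 12-pixel crop follows the convention mentioned above; the channel-averaging grayscale conversion is our own simplification, not something the environment provides:

```python
import numpy as np

def preprocess(frame: np.ndarray, crop_dashboard: bool = True) -> np.ndarray:
    """Convert a (96, 96, 3) uint8 frame to a float32 grayscale image in [0, 1].

    If crop_dashboard is True, the bottom 12 rows (the mock dashboard with
    the speed, ABS, steering, and gyroscope indicators) are removed,
    leaving an 84x96 image.
    """
    if crop_dashboard:
        frame = frame[:-12, :, :]   # drop the bottom 12 pixel rows
    gray = frame.mean(axis=2)       # naive grayscale: average the RGB channels
    return (gray / 255.0).astype(np.float32)

# Shape check on a dummy frame with the environment's observation shape
dummy = np.zeros((96, 96, 3), dtype=np.uint8)
print(preprocess(dummy).shape)  # (84, 96)
```

Which variant works better is an empirical question; we will compare both once we have a trained agent.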
2.2 Action Space
The default action space is continuous: a 3-dimensional Box with bounds low = [-1, 0, 0] and high = [1, 1, 1].
| Dimension | Range | Meaning |
|---|---|---|
| a[0] — Steering | [-1.0, 1.0] | Negative = full left, positive = full right |
| a[1] — Gas | [0.0, 1.0] | Throttle (forward acceleration) |
| a[2] — Brake | [0.0, 1.0] | Braking force |
This continuous space is the natural interface for algorithms like PPO and SAC. However, deep Q-learning (DQN) requires a discrete action space. Gymnasium provides an optional discrete mode via continuous=False, which collapses to 5 actions: do nothing, steer left, steer right, gas, or brake.
We'll need this discretization in our DQN episode — and we'll implement the wrapper ourselves to understand exactly what gets lost in the translation.
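To make that translation concrete, here is a hypothetical lookup table in the spirit of such a wrapper. The exact magnitudes (full steering, full gas, partial brake) are our own illustrative choices, not Gymnasium's internal values:

```python
import numpy as np

# One plausible mapping from 5 discrete actions to continuous
# (steer, gas, brake) triples. The magnitudes here are illustrative.
DISCRETE_TO_CONTINUOUS = {
    0: np.array([ 0.0, 0.0, 0.0], dtype=np.float32),  # do nothing
    1: np.array([-1.0, 0.0, 0.0], dtype=np.float32),  # steer left
    2: np.array([ 1.0, 0.0, 0.0], dtype=np.float32),  # steer right
    3: np.array([ 0.0, 1.0, 0.0], dtype=np.float32),  # gas
    4: np.array([ 0.0, 0.0, 0.8], dtype=np.float32),  # brake (partial, to avoid lockup)
}

def to_continuous(action: int) -> np.ndarray:
    """Translate a discrete action index into the continuous action vector."""
    return DISCRETE_TO_CONTINUOUS[action]
```

Note what the table makes obvious: the discrete agent can never steer and accelerate in the same step, which is exactly the kind of loss we want to quantify.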
2.3 Reward Function
The reward function encodes the designer's intent in a mathematically precise way. Understanding it is critical for understanding why an agent behaves the way it does.
Let N be the total number of track tiles in a generated episode. The reward at each timestep has three components:
- +1000/N per newly visited tile (each tile is rewarded only once).
- −0.1 every frame as a time penalty, regardless of what the car does.
- −100 if the car strays too far off the track, triggering episode termination.
Summing this up: if an agent visits all N tiles, its total tile reward is exactly 1000. The time penalty of −0.1/frame means a perfect episode lasting 1000 frames yields a net reward of 900. The solve threshold is typically defined as a 900+ average reward over 100 consecutive episodes.
This deceptively simple reward function creates some interesting optimization traps:
- A greedy agent might spin in circles repeatedly visiting the same tile (the tile reward only applies to new tiles).
- An agent optimizing too hard for speed might accumulate large off-track penalties.
- An agent that drives very slowly may visit all tiles but bleed away its reward through the accumulated time penalty.
These are exactly the kinds of reward shaping pathologies that make this environment a useful benchmark.
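A quick back-of-the-envelope model of the return makes the 900-point threshold tangible. This is our own toy function; it ignores tile ordering and the exact termination step:

```python
def episode_return(n_tiles: int, tiles_visited: int, frames: int,
                   went_off_track: bool = False) -> float:
    """Toy model of the CarRacing episode return."""
    tile_reward = 1000.0 * tiles_visited / n_tiles   # +1000/N per newly visited tile
    time_penalty = -0.1 * frames                     # -0.1 every frame
    off_track_penalty = -100.0 if went_off_track else 0.0
    return tile_reward + time_penalty + off_track_penalty

# A perfect run: every tile visited over a 1000-frame episode
print(episode_return(n_tiles=300, tiles_visited=300, frames=1000))  # 900.0
```

Plug in a slow driver (all tiles, but 2000 frames) and the return drops to 800 — the time penalty quietly dominates, which is exactly the third trap above.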
3. Environment Setup & Random Policy Baseline
Now let's ground everything above in code. The first notebook (01_Environment_Setup.ipynb) initializes the environment, inspects its spaces, and runs a completely random agent.
Section 1: Inspecting the Environment
```python
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display, clear_output

# Initialize environment in 'rgb_array' mode for inline visualization
env = gym.make('CarRacing-v3', render_mode='rgb_array')
state, info = env.reset()

print(f"Observation Space: {env.observation_space}")
# >> Box(0, 255, (96, 96, 3), uint8)
print(f"Action Space: {env.action_space}")
# >> Box([-1. 0. 0.], 1.0, (3,), float32)

plt.figure(figsize=(6, 6))
plt.imshow(state)
plt.title('Initial Environment State (96x96 RGB Px)')
plt.axis('off')
plt.show()
env.close()
```
The output confirms our theory: a 96×96 uint8 image, and a 3-dimensional continuous action vector with asymmetric bounds (steering is symmetric around zero; gas and brake are non-negative).
Section 2: Random Policy Baseline
```python
env_test = gym.make('CarRacing-v3', render_mode='rgb_array')
state, info = env_test.reset()

steps_to_run = 500
total_reward = 0
fig, ax = plt.subplots(figsize=(5, 5))

for step in range(steps_to_run):
    # Sample a uniformly random action from the entire action space
    action = env_test.action_space.sample()
    state, reward, terminated, truncated, info = env_test.step(action)
    total_reward += reward

    # Inline rendering for notebook visualization
    frame = env_test.render()
    ax.clear()
    ax.imshow(frame)
    ax.set_title(f"Step: {step+1} | Cumulative Reward: {total_reward:.1f}")
    ax.axis('off')
    display(fig)
    clear_output(wait=True)

    if terminated or truncated:
        print(f"Episode ended at step {step+1}. Resetting...")
        state, info = env_test.reset()
        total_reward = 0

env_test.close()
plt.close()
print("Done.")
```
What you'll observe: the car immediately starts spinning, veering off-track, and triggering the terminal penalty within the first 10-20 frames. This is our baseline — any reasonably trained agent should dramatically outperform this random policy.
NOTE
The random policy serves as more than just a sanity check. It's also how we collect an initial dataset for offline methods and world model training — random rollouts provide broad, policy-independent coverage of the state space, unlike on-policy data, which becomes increasingly concentrated as the agent improves.
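A minimal container for storing those random rollouts might look like this — a generic FIFO replay-buffer sketch of our own, not tied to any particular library:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (state, action, reward, next_state, done) tuples.

    When the buffer is full, the oldest transitions are evicted automatically
    thanks to deque's maxlen behavior.
    """
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform sampling without replacement within one batch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The same structure will be reused in the DQN episode, where prioritized sampling replaces the uniform `random.sample` call.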
4. What's Coming: The Algorithm Roadmap
Over the next several episodes, we'll train agents on CarRacing-v3 using progressively more sophisticated algorithms. Here's the plan:
Episode RL-02: Deep Q-Network (DQN)
Before tackling continuous control, we'll discretize the action space and implement a full DQN agent from scratch:
- Dueling DQN architecture: a separate value stream V(s) and advantage stream A(s, a), combined as Q(s, a) = V(s) + A(s, a) − mean_a' A(s, a')
- Prioritized Experience Replay: sample transitions proportional to their TD error, focusing training on informative experiences
- Double DQN: decouple action selection and action evaluation to reduce overestimation bias
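The dueling combination rule from the first bullet can be sketched with plain NumPy (the batch shapes and the 5-action count are illustrative; the real agent will compute this inside the network's forward pass):

```python
import numpy as np

def dueling_q(value: np.ndarray, advantages: np.ndarray) -> np.ndarray:
    """Combine a state value V(s) with per-action advantages A(s, a):

        Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')

    Subtracting the mean advantage keeps the V/A decomposition identifiable.
    value: shape (batch, 1); advantages: shape (batch, n_actions).
    """
    return value + advantages - advantages.mean(axis=1, keepdims=True)

V = np.array([[2.0]])
A = np.array([[1.0, -1.0, 0.0, 3.0, -3.0]])  # one row per state, 5 discrete actions
print(dueling_q(V, A))  # [[ 3.  1.  2.  5. -1.]]
```

Note that the mean-subtraction leaves the argmax over actions unchanged — it only pins down how credit is split between the two streams.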
Episode RL-03: Proximal Policy Optimization (PPO)
PPO is the workhorse of modern RL — the same algorithm in the RLHF loop that fine-tunes LLMs. We move back to the continuous action space and implement:
- Actor-Critic architecture with a shared CNN feature extractor
- Surrogate clipping objective: L^CLIP(θ) = E_t[min(r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t)], where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)
- GAE (Generalized Advantage Estimation): exponentially-weighted advantage estimation that interpolates between Monte Carlo and TD(0)
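The GAE computation from the last bullet can be sketched over a single rollout; the array shapes and the convention that `values` carries one extra bootstrap entry for the final state are our own assumptions:

```python
import numpy as np

def gae(rewards: np.ndarray, values: np.ndarray, dones: np.ndarray,
        gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    """Generalized Advantage Estimation over one rollout of length T.

    values has length T+1 (it includes a bootstrap value for the final state).
    lam=1 recovers Monte Carlo advantages; lam=0 recovers one-step TD errors.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):            # sweep backwards through time
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        advantages[t] = last
    return advantages
```

The backward sweep is what makes the interpolation cheap: each advantage reuses the already-computed estimate one step ahead, so the whole rollout costs O(T).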
Episode RL-04+: World Models & Dreamer (The Long Game)
Once we have solid model-free baselines, we'll revisit the idea that made CarRacing famous in RL literature — learning to act inside a learned world model.
Recommended Reading & References
- Gymnasium Documentation: CarRacing-v3
- World Models (Ha & Schmidhuber, 2018): Paper · Interactive Demo
- Dreamer (Hafner et al., 2019): Paper
- DreamerV2 (Hafner et al., 2020): Paper
- Proximal Policy Optimization (Schulman et al., 2017): Paper
- Human-level control through deep RL — the original DQN paper (Mnih et al., Nature 2015): Paper