# Stable Baselines3 Algorithm Reference
This document provides detailed characteristics of all RL algorithms in Stable Baselines3 to help select the right algorithm for specific tasks.
## Quick Comparison

| Algorithm | Type | Action Space | Sample Efficiency | Training Speed | Use Case |
|---|---|---|---|---|---|
| PPO | On-Policy | All | Medium | Fast | General-purpose, stable |
| A2C | On-Policy | All | Low | Very Fast | Quick prototyping, multiprocessing |
| SAC | Off-Policy | Continuous | High | Medium | Continuous control, sample-efficient |
| TD3 | Off-Policy | Continuous | High | Medium | Continuous control, deterministic |
| DDPG | Off-Policy | Continuous | High | Medium | Continuous control (use TD3 instead) |
| DQN | Off-Policy | Discrete | Medium | Medium | Discrete actions, Atari games |
| HER | Off-Policy | All | Very High | Medium | Goal-conditioned tasks |
| RecurrentPPO | On-Policy | All | Medium | Slow | Partial observability (POMDP) |
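All of these algorithms share the same SB3 API, so switching between them is usually a one-line change. A minimal sketch of the common workflow (the environment id is illustrative):

```python
from stable_baselines3 import PPO  # or A2C, SAC, TD3, DDPG, DQN

# Every algorithm follows the same create -> learn -> predict workflow
model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=50_000)

vec_env = model.get_env()
obs = vec_env.reset()
action, _states = model.predict(obs, deterministic=True)
```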
## PPO (Proximal Policy Optimization)

Overview: General-purpose on-policy algorithm with good performance across many tasks.

Strengths:
- Stable training that is robust to hyperparameter choices
- Supports all action spaces (discrete, continuous, multi-discrete, multi-binary)
- Scales well across many vectorized environments

Weaknesses:
- Less sample-efficient than off-policy methods: rollout data is discarded after each update

Best For:
- General-purpose use; a strong first algorithm to try on a new task
Hyperparameter Guidance:
- `n_steps`: 2048-4096 for continuous control, 128-256 for Atari
- `learning_rate`: 3e-4 is a good default
- `n_epochs`: 10 for continuous control, 4 for Atari
- `batch_size`: 64
- `gamma`: 0.99 (0.995-0.999 for long episodes)
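A minimal sketch of the Atari-style settings above; the environment id and the `CnnPolicy` preprocessing choices are illustrative assumptions, not values mandated by this guide:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Standard Atari preprocessing plus frame stacking
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=8)
env = VecFrameStack(env, n_stack=4)

model = PPO(
    "CnnPolicy",
    env,
    n_steps=128,  # short rollouts per environment for Atari
    n_epochs=4,
    batch_size=64,
    learning_rate=3e-4,
)
```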
## A2C (Advantage Actor-Critic)

Overview: Synchronous variant of A3C, simpler than PPO but less stable.

Strengths:
- Very fast wall-clock training, especially with many parallel environments
- Simpler update rule than PPO (single gradient step per rollout, no clipping)

Weaknesses:
- Low sample efficiency and noisier updates than PPO

Best For:
- Quick prototyping with cheap, heavily parallelized simulations
Hyperparameter Guidance:
- `n_steps`: 5-256 depending on task
- `learning_rate`: 7e-4
- `gamma`: 0.99
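A minimal A2C sketch using the defaults above (the environment id and number of environments are illustrative):

```python
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env

# Many cheap environments amortize the very short rollouts
env = make_vec_env("CartPole-v1", n_envs=16)

model = A2C("MlpPolicy", env, n_steps=5, learning_rate=7e-4, gamma=0.99)
model.learn(total_timesteps=100_000)
```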
## SAC (Soft Actor-Critic)

Overview: Off-policy algorithm with entropy regularization, state-of-the-art for continuous control.

Strengths:
- High sample efficiency from replay buffer reuse
- Automatic entropy coefficient tuning keeps exploration strong with little manual tuning

Weaknesses:
- Continuous action spaces only
- More compute per environment step than on-policy methods

Best For:
- Robotics and other continuous-control tasks where environment samples are expensive
Hyperparameter Guidance:
- `learning_rate`: 3e-4
- `buffer_size`: 1M for most tasks
- `learning_starts`: 10000
- `batch_size`: 256
- `tau`: 0.005 (target network update rate)
- `train_freq`: 1 with `gradient_steps=-1` for best performance (see the full training example at the end of this document)
## TD3 (Twin Delayed DDPG)

Overview: Improved DDPG with double Q-learning and delayed policy updates.

Strengths:
- Twin critics and target policy smoothing curb the value overestimation that destabilizes DDPG
- Learns a deterministic policy

Weaknesses:
- Continuous action spaces only
- Exploration depends on externally added action noise, which needs tuning

Best For:
- Continuous control when a deterministic policy is preferred over SAC's stochastic one
Hyperparameter Guidance:
- `learning_rate`: 1e-3
- `buffer_size`: 1M
- `learning_starts`: 10000
- `batch_size`: 100
- `policy_delay`: 2 (update the policy every 2 critic updates)
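A minimal TD3 sketch with the values above; the Gaussian exploration noise scale and the environment id are illustrative assumptions:

```python
import numpy as np
import gymnasium as gym

from stable_baselines3 import TD3
from stable_baselines3.common.noise import NormalActionNoise

env = gym.make("Pendulum-v1")
n_actions = env.action_space.shape[0]

# TD3 has no built-in exploration; Gaussian noise is added to actions
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = TD3(
    "MlpPolicy",
    env,
    action_noise=action_noise,
    learning_rate=1e-3,
    buffer_size=1_000_000,
    learning_starts=10_000,
    batch_size=100,
    policy_delay=2,
)
model.learn(total_timesteps=100_000)
```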
## DDPG (Deep Deterministic Policy Gradient)

Overview: Early off-policy continuous control algorithm.

Strengths:
- Simple actor-critic structure; historically influential

Weaknesses:
- Brittle training: prone to value overestimation and sensitive to hyperparameters

Best For:
- Reproducing older results; for new projects, use TD3 instead
## DQN (Deep Q-Network)

Overview: Classic off-policy algorithm for discrete action spaces.

Strengths:
- Sample-efficient learning from a replay buffer
- Well-studied, with an established Atari training recipe

Weaknesses:
- Discrete action spaces only
- Sensitive to the exploration schedule and learning rate

Best For:
- Discrete-action tasks such as Atari games
Hyperparameter Guidance:
- `learning_rate`: 1e-4
- `buffer_size`: 100K-1M depending on task
- `learning_starts`: 50000 for Atari
- `batch_size`: 32
- `exploration_fraction`: 0.1
- `exploration_final_eps`: 0.05

Variants: SB3 ships vanilla DQN; distributional variants such as QR-DQN live in the sb3-contrib package.
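A minimal DQN sketch with the guidance above; the environment id is illustrative, and the buffer is kept smaller than the Atari-standard 1M to limit memory use:

```python
from stable_baselines3 import DQN
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

env = make_atari_env("PongNoFrameskip-v4", n_envs=1)
env = VecFrameStack(env, n_stack=4)

model = DQN(
    "CnnPolicy",
    env,
    learning_rate=1e-4,
    buffer_size=100_000,  # 1M is standard for Atari but needs far more RAM
    learning_starts=50_000,
    batch_size=32,
    exploration_fraction=0.1,
    exploration_final_eps=0.05,
)
model.learn(total_timesteps=1_000_000)
```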
## HER (Hindsight Experience Replay)

Overview: Not a standalone algorithm but a replay buffer strategy for goal-conditioned tasks.

Strengths:
- Turns failed episodes into useful training signal by relabeling achieved outcomes as goals
- Dramatically improves sample efficiency under sparse rewards

Weaknesses:
- Requires a goal-conditioned environment (Dict observations with `observation`, `achieved_goal`, and `desired_goal` keys)
- Must be paired with an off-policy algorithm (SAC, TD3, DDPG, or DQN)

Best For:
- Goal-reaching tasks with sparse rewards, such as robotic manipulation
Usage:
```python
from stable_baselines3 import SAC, HerReplayBuffer

# env must be goal-conditioned: Dict observations with
# "observation", "achieved_goal", and "desired_goal" keys
model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,  # relabeled goals sampled per stored transition
        goal_selection_strategy="future",  # or "episode", "final"
    ),
)
```
## RecurrentPPO

Overview: PPO with an LSTM policy for handling partial observability. Provided by the sb3-contrib package rather than core SB3.

Strengths:
- The LSTM integrates information across timesteps, recovering hidden state from observation histories

Weaknesses:
- Noticeably slower to train than standard PPO and adds LSTM-specific hyperparameters

Best For:
- POMDPs: masked velocities, occlusions, and other tasks where a single observation is not enough
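A minimal sketch, assuming sb3-contrib is installed (the environment id is illustrative). At inference time the LSTM state must be threaded through successive `predict` calls:

```python
import numpy as np
from sb3_contrib import RecurrentPPO

model = RecurrentPPO("MlpLstmPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=100_000)

# Carry the recurrent state (and episode boundaries) across steps
vec_env = model.get_env()
obs = vec_env.reset()
lstm_states = None
episode_starts = np.ones((vec_env.num_envs,), dtype=bool)
for _ in range(1000):
    action, lstm_states = model.predict(
        obs, state=lstm_states, episode_start=episode_starts, deterministic=True
    )
    obs, rewards, dones, infos = vec_env.step(action)
    episode_starts = dones
```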
## Choosing an Algorithm

- What is your action space? Discrete: DQN, PPO, or A2C. Continuous: SAC, TD3, or PPO. MultiDiscrete/MultiBinary: PPO or A2C.
- Is sample efficiency critical (slow simulator, real hardware)? Prefer off-policy: SAC or TD3 for continuous actions, DQN for discrete.
- Do you need fast wall-clock training on a cheap simulator? Prefer on-policy with many vectorized environments: PPO or A2C.
- Is the task goal-conditioned with sparse rewards? Combine HER with an off-policy algorithm.
- Is the environment partially observable? Use RecurrentPPO, or add frame stacking to a standard algorithm.
## Training Configuration Examples

PPO with vectorized environments:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

env_id = "HalfCheetah-v4"  # any Gymnasium environment id

# Use vectorized environments for speed
env = make_vec_env(env_id, n_envs=8, vec_env_cls=SubprocVecEnv)

model = PPO(
    "MlpPolicy",
    env,
    n_steps=2048,  # collect this many steps per environment before each update
    batch_size=64,
    n_epochs=10,
    learning_rate=3e-4,
    gamma=0.99,
)
model.learn(total_timesteps=1_000_000)
```
SAC with fewer environments:

```python
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env

env_id = "HalfCheetah-v4"  # any continuous-control Gymnasium environment id

# Fewer environments, but use gradient_steps=-1 for efficiency
env = make_vec_env(env_id, n_envs=4)

model = SAC(
    "MlpPolicy",
    env,
    buffer_size=1_000_000,
    learning_starts=10000,
    batch_size=256,
    train_freq=1,
    gradient_steps=-1,  # one gradient step per collected transition (4 per vec-env step with 4 envs)
    learning_rate=3e-4,
)
model.learn(total_timesteps=1_000_000)
```
## Benchmarks

For approximate expected performance (mean reward) on common benchmarks, see the RL Baselines3 Zoo, which publishes results with tuned hyperparameters. Note: performance varies significantly with hyperparameters and training time.