scientific-skills/pufferlib/references/training.md
PuffeRL is PufferLib's high-performance training algorithm, based on CleanRL's PPO with LSTM support and enhanced with proprietary research improvements. It trains at millions of steps per second through optimized vectorization and an efficient implementation.
The PuffeRL trainer provides three core methods:
```python
# Collect environment interactions
rollout_data = trainer.evaluate()

# Train on the collected batch
train_metrics = trainer.train()

# Aggregate and log results
trainer.mean_and_log()
```
Quick-start training from the command line:
```bash
# Basic training
puffer train environment_name --train.device cuda --train.learning-rate 0.001

# Custom configuration
puffer train environment_name \
    --train.device cuda \
    --train.batch-size 32768 \
    --train.learning-rate 0.0003 \
    --train.num-iterations 10000
```
```python
import pufferlib
from pufferlib import PuffeRL

# Initialize vectorized environment
env = pufferlib.make('environment_name', num_envs=256)

# Create trainer
trainer = PuffeRL(
    env=env,
    policy=my_policy,
    device='cuda',
    learning_rate=3e-4,
    batch_size=32768,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_coef=0.2,
    ent_coef=0.01,
    vf_coef=0.5,
    max_grad_norm=0.5
)

# Training loop
num_iterations = 10000
for iteration in range(num_iterations):
    # Collect rollouts
    rollout_data = trainer.evaluate()

    # Train on batch
    train_metrics = trainer.train()

    # Aggregate and log results
    trainer.mean_and_log()
```
Use torchrun for distributed training across multiple GPUs on a single node:
```bash
torchrun --nproc_per_node=4 train.py \
    --train.device cuda \
    --train.batch-size 131072
```
For distributed training across multiple nodes:
```bash
# On the main node (rank 0)
torchrun --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=0 \
    --master_addr=MASTER_IP \
    --master_port=29500 \
    train.py

# On each worker node (ranks 1, 2, 3)
torchrun --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=NODE_RANK \
    --master_addr=MASTER_IP \
    --master_port=29500 \
    train.py
```
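torchrun starts one process per GPU and exposes each process's rank through environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). A minimal sketch of what `train.py` might contain, assuming PuffeRL accepts a per-rank device string as in the examples above (how PuffeRL synchronizes gradients across ranks is not covered here):

```python
import os
import pufferlib
from pufferlib import PuffeRL

# torchrun sets LOCAL_RANK for each process it spawns
local_rank = int(os.environ.get('LOCAL_RANK', 0))

env = pufferlib.make('environment_name', num_envs=256)

# Assumption: pin each trainer to its rank's GPU; my_policy is
# defined elsewhere, as in the single-GPU example above
trainer = PuffeRL(env=env, policy=my_policy, device=f'cuda:{local_rank}')

for iteration in range(10000):
    trainer.evaluate()
    trainer.train()
    trainer.mean_and_log()
```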
PufferLib supports multiple logging backends:
Weights & Biases:

```python
from pufferlib import WandbLogger

logger = WandbLogger(
    project='my_project',
    entity='my_team',
    name='experiment_name',
    config=trainer_config
)
trainer = PuffeRL(env, policy, logger=logger)
```

Neptune:

```python
from pufferlib import NeptuneLogger

logger = NeptuneLogger(
    project='my_team/my_project',
    name='experiment_name',
    api_token='YOUR_TOKEN'
)
trainer = PuffeRL(env, policy, logger=logger)
```

To disable logging entirely:

```python
from pufferlib import NoLogger

trainer = PuffeRL(env, policy, logger=NoLogger())
```
Training logs include:
- Performance metrics: throughput figures such as steps per second (SPS) and timing breakdowns
- Learning metrics: losses such as policy loss, value loss, and entropy
- Environment metrics: episode statistics such as mean return and episode length
PufferLib also provides a real-time terminal dashboard that displays these metrics live during training.
```python
# Save checkpoint
trainer.save_checkpoint('checkpoint.pt')

# Save with additional metadata
trainer.save_checkpoint(
    'checkpoint.pt',
    metadata={'iteration': iteration, 'best_reward': best_reward}
)
```
```python
# Load checkpoint
trainer.load_checkpoint('checkpoint.pt')

# Resume training; resume_iteration comes from your run state,
# e.g. the checkpoint metadata saved above
for iteration in range(resume_iteration, num_iterations):
    trainer.evaluate()
    trainer.train()
    trainer.mean_and_log()
```
The Protein system enables automatic hyperparameter and reward tuning:
```python
from pufferlib import Protein

# Define search space
search_space = {
    'learning_rate': [1e-4, 3e-4, 1e-3],
    'batch_size': [16384, 32768, 65536],
    'ent_coef': [0.001, 0.01, 0.1],
    'clip_coef': [0.1, 0.2, 0.3]
}

# Run hyperparameter search
protein = Protein(
    env_name='environment_name',
    search_space=search_space,
    num_trials=100,
    metric='mean_reward'
)
best_config = protein.optimize()
```
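Assuming `optimize()` returns the best trial's configuration as a dict keyed by the searched hyperparameters (the exact return type is an assumption here), the result can be passed straight back to the trainer:

```python
# Hypothetical usage: run full training with the winning configuration
trainer = PuffeRL(env, policy, device='cuda', **best_config)
```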
```python
# Start with easy tasks and gradually increase difficulty
difficulty_levels = [0.1, 0.3, 0.5, 0.7, 1.0]
iterations_per_level = 2000  # example budget per difficulty level

for difficulty in difficulty_levels:
    env = pufferlib.make('environment_name', difficulty=difficulty)
    trainer = PuffeRL(env, policy)
    for iteration in range(iterations_per_level):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
```
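Because the same `policy` object is passed to each new trainer, weights learned at easier difficulty levels carry over as the curriculum advances.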
```python
import pufferlib

# Subclass the environment to apply custom reward shaping
class RewardShapedEnv(pufferlib.PufferEnv):
    def step(self, actions):
        obs, rewards, dones, infos = super().step(actions)
        # Add shaped rewards; proximity_bonus is a task-specific term
        # computed from the environment state (not shown here)
        shaped_rewards = rewards + 0.1 * proximity_bonus
        return obs, shaped_rewards, dones, infos
```
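The shaped environment drops in wherever the base environment was used; the construction arguments below are hypothetical and depend on your PufferEnv:

```python
# Hypothetical usage: only the reward signal changes, so the
# trainer setup is identical to the unshaped case
env = RewardShapedEnv()
trainer = PuffeRL(env, policy, device='cuda')
```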
```python
# Train in multiple stages with different configurations
stages = [
    {'learning_rate': 1e-3, 'iterations': 1000},  # Exploration
    {'learning_rate': 3e-4, 'iterations': 5000},  # Main training
    {'learning_rate': 1e-4, 'iterations': 2000}   # Fine-tuning
]

for stage in stages:
    trainer.learning_rate = stage['learning_rate']
    for iteration in range(stage['iterations']):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
```
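Whether assigning `trainer.learning_rate` propagates to the underlying optimizer depends on how PuffeRL exposes it; if it does not, the standard PyTorch fallback is to update the optimizer's parameter groups directly (assuming the trainer exposes its optimizer as `trainer.optimizer`, which this document does not confirm):

```python
# Assumption: trainer exposes a standard torch optimizer
for group in trainer.optimizer.param_groups:
    group['lr'] = stage['learning_rate']
```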
For performance profiling, monitor GPU utilization with `nvidia-smi` and inspect per-operation timing with `torch.profiler`.
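A minimal sketch profiling one collect/train iteration; the trainer calls follow the API above, and the profiler usage is standard PyTorch:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a single collect/train iteration
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    trainer.evaluate()
    trainer.train()

# Print the operations that dominate GPU time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```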