scientific-skills/pufferlib/references/training.md
PuffeRL is PufferLib's high-performance training algorithm, based on CleanRL's PPO with LSTM support and enhanced with proprietary research improvements. It trains at millions of steps per second through optimized vectorization and an efficient implementation.
The PuffeRL trainer provides three core methods:
```python
# Collect environment interactions
rollout_data = trainer.evaluate()

# Train on the collected batch
train_metrics = trainer.train()

# Aggregate and log results
trainer.mean_and_log()
```
Quick-start training from the command line:
```bash
# Basic training
puffer train environment_name --train.device cuda --train.learning-rate 0.001

# Custom configuration
puffer train environment_name \
    --train.device cuda \
    --train.batch-size 32768 \
    --train.learning-rate 0.0003 \
    --train.num-iterations 10000
```
```python
import pufferlib
from pufferlib import PuffeRL

# Initialize vectorized environment
env = pufferlib.make('environment_name', num_envs=256)

# Create trainer
trainer = PuffeRL(
    env=env,
    policy=my_policy,
    device='cuda',
    learning_rate=3e-4,
    batch_size=32768,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_coef=0.2,
    ent_coef=0.01,
    vf_coef=0.5,
    max_grad_norm=0.5
)

# Training loop
num_iterations = 10000
for iteration in range(num_iterations):
    # Collect rollouts
    rollout_data = trainer.evaluate()

    # Train on batch
    train_metrics = trainer.train()

    # Aggregate and log results
    trainer.mean_and_log()
```
Use torchrun for distributed training across multiple GPUs on a single node:
```bash
torchrun --nproc_per_node=4 train.py \
    --train.device cuda \
    --train.batch-size 131072
```
For distributed training across multiple nodes:
```bash
# On the main node (rank 0)
torchrun --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=0 \
    --master_addr=MASTER_IP \
    --master_port=29500 \
    train.py

# On each worker node (ranks 1, 2, 3)
torchrun --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=NODE_RANK \
    --master_addr=MASTER_IP \
    --master_port=29500 \
    train.py
```
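torchrun starts one process per GPU and exposes each process's rank through environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). A minimal sketch of what `train.py` might contain, assuming PuffeRL accepts a per-rank device string as in the examples above (how PuffeRL synchronizes gradients across ranks is not covered here):

```python
import os
import pufferlib
from pufferlib import PuffeRL

# torchrun sets LOCAL_RANK for each process it spawns
local_rank = int(os.environ.get('LOCAL_RANK', 0))

env = pufferlib.make('environment_name', num_envs=256)

# Assumption: pin each trainer to its rank's GPU; my_policy is
# defined elsewhere, as in the single-GPU example above
trainer = PuffeRL(env=env, policy=my_policy, device=f'cuda:{local_rank}')

for iteration in range(10000):
    trainer.evaluate()
    trainer.train()
    trainer.mean_and_log()
```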
PufferLib supports multiple logging backends:
Weights & Biases:

```python
from pufferlib import WandbLogger

logger = WandbLogger(
    project='my_project',
    entity='my_team',
    name='experiment_name',
    config=trainer_config
)
trainer = PuffeRL(env, policy, logger=logger)
```

Neptune:

```python
from pufferlib import NeptuneLogger

logger = NeptuneLogger(
    project='my_team/my_project',
    name='experiment_name',
    api_token='YOUR_TOKEN'
)
trainer = PuffeRL(env, policy, logger=logger)
```

To disable logging entirely:

```python
from pufferlib import NoLogger

trainer = PuffeRL(env, policy, logger=NoLogger())
```
Training logs include:
- Performance metrics: throughput figures such as steps per second (SPS) and timing breakdowns
- Learning metrics: losses such as policy loss, value loss, and entropy
- Environment metrics: episode statistics such as mean return and episode length
PufferLib also provides a real-time terminal dashboard that displays these metrics live during training.
```python
# Save checkpoint
trainer.save_checkpoint('checkpoint.pt')

# Save with additional metadata
trainer.save_checkpoint(
    'checkpoint.pt',
    metadata={'iteration': iteration, 'best_reward': best_reward}
)
```
```python
# Load checkpoint
trainer.load_checkpoint('checkpoint.pt')

# Resume training; resume_iteration comes from your run state,
# e.g. the checkpoint metadata saved above
for iteration in range(resume_iteration, num_iterations):
    trainer.evaluate()
    trainer.train()
    trainer.mean_and_log()
```
The Protein system enables automatic hyperparameter and reward tuning:
```python
from pufferlib import Protein

# Define search space
search_space = {
    'learning_rate': [1e-4, 3e-4, 1e-3],
    'batch_size': [16384, 32768, 65536],
    'ent_coef': [0.001, 0.01, 0.1],
    'clip_coef': [0.1, 0.2, 0.3]
}

# Run hyperparameter search
protein = Protein(
    env_name='environment_name',
    search_space=search_space,
    num_trials=100,
    metric='mean_reward'
)
best_config = protein.optimize()
```
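Assuming `optimize()` returns the best trial's configuration as a dict keyed by the searched hyperparameters (the exact return type is an assumption here), the result can be passed straight back to the trainer:

```python
# Hypothetical usage: run full training with the winning configuration
trainer = PuffeRL(env, policy, device='cuda', **best_config)
```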
```python
# Start with easy tasks and gradually increase difficulty
difficulty_levels = [0.1, 0.3, 0.5, 0.7, 1.0]
iterations_per_level = 2000  # example budget per difficulty level

for difficulty in difficulty_levels:
    env = pufferlib.make('environment_name', difficulty=difficulty)
    trainer = PuffeRL(env, policy)
    for iteration in range(iterations_per_level):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
```
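Because the same `policy` object is passed to each new trainer, weights learned at easier difficulty levels carry over as the curriculum advances.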
```python
import pufferlib

# Subclass the environment to apply custom reward shaping
class RewardShapedEnv(pufferlib.PufferEnv):
    def step(self, actions):
        obs, rewards, dones, infos = super().step(actions)
        # Add shaped rewards; proximity_bonus is a task-specific term
        # computed from the environment state (not shown here)
        shaped_rewards = rewards + 0.1 * proximity_bonus
        return obs, shaped_rewards, dones, infos
```
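The shaped environment drops in wherever the base environment was used; the construction arguments below are hypothetical and depend on your PufferEnv:

```python
# Hypothetical usage: only the reward signal changes, so the
# trainer setup is identical to the unshaped case
env = RewardShapedEnv()
trainer = PuffeRL(env, policy, device='cuda')
```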
```python
# Train in multiple stages with different configurations
stages = [
    {'learning_rate': 1e-3, 'iterations': 1000},  # Exploration
    {'learning_rate': 3e-4, 'iterations': 5000},  # Main training
    {'learning_rate': 1e-4, 'iterations': 2000}   # Fine-tuning
]

for stage in stages:
    trainer.learning_rate = stage['learning_rate']
    for iteration in range(stage['iterations']):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
```
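Whether assigning `trainer.learning_rate` propagates to the underlying optimizer depends on how PuffeRL exposes it; if it does not, the standard PyTorch fallback is to update the optimizer's parameter groups directly (assuming the trainer exposes its optimizer as `trainer.optimizer`, which this document does not confirm):

```python
# Assumption: trainer exposes a standard torch optimizer
for group in trainer.optimizer.param_groups:
    group['lr'] = stage['learning_rate']
```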
For performance profiling, monitor GPU utilization with `nvidia-smi` and inspect per-operation timing with `torch.profiler`.
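A minimal sketch profiling one collect/train iteration; the trainer calls follow the API above, and the profiler usage is standard PyTorch:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a single collect/train iteration
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    trainer.evaluate()
    trainer.train()

# Print the operations that dominate GPU time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```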