# Running RL Rollouts on Sandboxes Guide (TRL + Daytona)

## Overview

This guide demonstrates how to integrate Daytona with TRL (Transformer Reinforcement Learning) so that code generated during rollouts runs inside Daytona sandboxes. It combines TRL's synchronous trainer with parallelized, asynchronous execution in sandboxes.

In the guide, we use GRPO to train the Qwen3-1.7B-Base model on two code-writing tasks: a sorting function and a function that finds the maximal contiguous subarray sum.

## Features

- **Sandboxed code execution**: Generated code runs in isolated Daytona sandboxes, preventing harmful code from affecting your system
- **Parallel evaluation**: Multiple completions are evaluated concurrently across a pool of sandboxes
- **Test-based rewards**: The reward signal is the fraction of tests passed
- **vLLM integration**: Uses vLLM in colocate mode to run both training and generation on a single GPU
- **Multi-task training**: Easily extensible to new coding tasks by adding to the `TASKS` dictionary

## Requirements

- **Python**: Version 3.10 or higher
- **GPU**: Required for training and vLLM inference; 80 GB of VRAM or more recommended

## Environment Variables

- `DAYTONA_API_KEY`: Required for access to Daytona sandboxes. Get it from the Daytona Dashboard.
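
For reference, loading the key and constructing a client could look like the sketch below. This assumes the `daytona` Python SDK and the `python-dotenv` package; the guide's actual `train.py` may wire this up differently.

```python
# Minimal sketch: load DAYTONA_API_KEY from .env and construct a client.
# Assumes the `daytona` Python SDK and python-dotenv.
import os

from dotenv import load_dotenv
from daytona import Daytona, DaytonaConfig

load_dotenv()  # reads .env from the working directory
daytona = Daytona(DaytonaConfig(api_key=os.environ["DAYTONA_API_KEY"]))
```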

## Getting Started

### Setup and Run

1. Create and activate a virtual environment:

   ```bash
   python3.10 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. Install dependencies:

   ```bash
   pip install -e .
   ```

3. Set your Daytona API key in `.env` (copy from `.env.example`):

   ```bash
   cp .env.example .env
   # edit .env with your API key
   ```

4. Run the training:

   ```bash
   python train.py
   ```

## Configuration

The script has several configurable parameters:

### Sandbox Settings

- `EFFECTIVE_BATCH_SIZE`: The effective batch size for training, which is also the number of Daytona sandboxes to create (default: 500); see the sketch after this list.
- `MAX_TIMEOUT_SECONDS`: Timeout for code execution in each sandbox (default: 1 second). Prevents infinite loops from blocking training.
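
A rough sketch of up-front pool creation, assuming the `daytona` SDK's `create()` call and a thread pool (the actual script may create sandboxes differently):

```python
# Sketch: create one sandbox per completion in the effective batch, in parallel.
# Assumes `daytona.create()` from the Daytona Python SDK.
from concurrent.futures import ThreadPoolExecutor

from daytona import Daytona

EFFECTIVE_BATCH_SIZE = 500

daytona = Daytona()  # picks up DAYTONA_API_KEY from the environment

def create_pool(n: int) -> list:
    # Parallel creation keeps pool startup from dominating training time.
    with ThreadPoolExecutor(max_workers=32) as pool:
        return list(pool.map(lambda _: daytona.create(), range(n)))

sandboxes = create_pool(EFFECTIVE_BATCH_SIZE)
```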

### Model Settings

- `MODEL_NAME`: The base model to train (default: `Qwen/Qwen3-1.7B-Base`)

### Training Settings (GRPOConfig)

Key parameters in the training configuration (see the sketch after this list):

- `per_device_train_batch_size`: Batch size per device (default: 20)
- `gradient_accumulation_steps`: Steps to accumulate before an update (default: 25)
- `max_steps`: Total training steps (default: 8)
- `max_completion_length`: Maximum tokens for generated code (default: 512)
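
Put together, a configuration with these defaults might look roughly like this sketch using TRL's `GRPOConfig`. The `use_vllm` and `vllm_mode="colocate"` flags correspond to the colocate setup mentioned under Features; the actual `train.py` may set additional fields:

```python
# Sketch: a GRPOConfig with the defaults listed above.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="training_results",
    per_device_train_batch_size=20,
    gradient_accumulation_steps=25,  # 20 * 25 = 500, the effective batch size
    max_steps=8,
    max_completion_length=512,
    use_vllm=True,
    vllm_mode="colocate",            # generation shares the training GPU
)
```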

## Adding New Tasks

To add a new coding task, add an entry to the `TASKS` dictionary:

```python
TASKS = {
    "your_task": {
        "prompt": "Your prompt here...",
        "func_name": "function_name",
        "banned_patterns": ["patterns", "to", "ban"],
        "tests": [
            "test_input_1",
            "test_input_2",
        ],
        "reference": "expected_output_expression",
    },
}
```
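
For instance, an entry for the sorting task might look roughly like this. The values below are a hypothetical illustration, not the guide's actual task definition:

```python
# Hypothetical example entry; field values are illustrative only.
TASKS = {
    "sort_list": {
        "prompt": "Write a Python function `sort_list(xs)` that returns the "
                  "list sorted in ascending order, without using built-ins.",
        "func_name": "sort_list",
        # Reject completions that fall back to Python's built-in sorting.
        "banned_patterns": ["sorted(", ".sort("],
        # Inputs the generated function is tested on.
        "tests": [
            "[3, 1, 2]",
            "[]",
            "[5, -1, 5, 0]",
        ],
        # Trusted expression that computes the expected output.
        "reference": "sorted(xs)",
    },
}
```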

## How It Works

The script runs reinforcement learning training using TRL's `GRPOTrainer`.

1. **Sandbox pool**: Daytona sandboxes are created upfront for safe, parallel code execution
2. **Generation**: The model generates N completions per prompt via vLLM
3. **Sanitization**: Completions that use banned patterns (e.g., calls to built-in functions) are rejected
4. **Evaluation**: Each completion runs in a Daytona sandbox against the test suite
5. **Reward**: -1 for errors or banned patterns; otherwise, the reward is the fraction of tests passed (see the sketch after this list)
6. **Policy update**: GRPO reinforces completions that scored higher than their group average
7. **Cleanup**: When training finishes, sandboxes are deleted
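
Steps 3 to 5 amount to a per-completion reward function. A sketch, assuming the `daytona` SDK's `sandbox.process.code_run` and the hypothetical task format shown earlier (the guide's actual helpers may differ):

```python
# Sketch of the per-completion reward (steps 3-5). Assumes the Daytona SDK's
# `sandbox.process.code_run`, which returns a response with an `exit_code`.
def compute_reward(completion: str, task: dict, sandbox) -> float:
    # Step 3: banned patterns score -1 immediately.
    if any(p in completion for p in task["banned_patterns"]):
        return -1.0
    # Completions that fail to even execute (e.g., syntax errors) score -1.
    if sandbox.process.code_run(completion).exit_code != 0:
        return -1.0
    # Step 4: run each test in the sandbox. The real script also enforces
    # MAX_TIMEOUT_SECONDS per execution so infinite loops cannot block training.
    passed = 0
    for test in task["tests"]:
        check = (
            f"{completion}\n"
            f"xs = {test}\n"
            f"assert {task['func_name']}(xs) == {task['reference']}\n"
        )
        if sandbox.process.code_run(check).exit_code == 0:
            passed += 1
    # Step 5: reward is the fraction of tests passed.
    return passed / len(task["tests"])
```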

## Output

After training finishes, the trained model is saved in the `training_results` folder. The metrics computed during the training run can be found in `training_results/metrics.jsonl`.
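
To inspect the metrics after a run, note that each line of the JSONL file is a standalone JSON object; the exact fields depend on what the script logs:

```python
import json

# Read one JSON object per logged training step.
with open("training_results/metrics.jsonl") as f:
    metrics = [json.loads(line) for line in f]

for record in metrics:
    print(record)
```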

## License

See the main project LICENSE file for details.
