# Daytona + TRL
This guide demonstrates how to integrate Daytona with TRL (Transformer Reinforcement Learning) so that the code generated during rollouts is executed in Daytona sandboxes. We combine TRL's synchronous trainer with parallelized, asynchronous execution in sandboxes.

In the guide, we use GRPO to train the Qwen/Qwen3-1.7B-Base model on two code-writing tasks: a sorting function, and a function that finds the maximum contiguous subarray sum.
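The core of the integration is the reward function handed to TRL: each completion in a batch runs in its own sandbox, in parallel, and is scored from its output. The following is a minimal sketch of that pattern, not the guide's exact code. It assumes prompts in TRL's standard (plain-text) format, and it assumes the Daytona Python SDK exposes `Daytona().create()`, `sandbox.process.code_run()`, and `sandbox.delete()` (check the SDK docs for exact signatures); the scorer is a toy placeholder.

```python
from concurrent.futures import ThreadPoolExecutor

from daytona import Daytona  # assumed import path for the Daytona Python SDK

MAX_TIMEOUT_SECONDS = 1  # matches the default described below

daytona = Daytona()  # reads DAYTONA_API_KEY from the environment


def score_output(stdout: str) -> float:
    """Toy scorer: reward 1.0 when the sandboxed program printed 'True'."""
    return 1.0 if stdout.strip() == "True" else 0.0


def run_in_sandbox(code: str) -> str:
    """Execute one generated completion in a fresh sandbox and return stdout."""
    sandbox = daytona.create()
    try:
        # The `timeout` argument is an assumption; check the SDK docs.
        response = sandbox.process.code_run(code, timeout=MAX_TIMEOUT_SECONDS)
        return response.result
    finally:
        sandbox.delete()


def reward_fn(completions: list[str], **kwargs) -> list[float]:
    """GRPO reward function: execute all completions in parallel, then score."""
    with ThreadPoolExecutor(max_workers=max(1, len(completions))) as pool:
        outputs = list(pool.map(run_in_sandbox, completions))
    return [score_output(out) for out in outputs]
```

Because the sandbox calls are I/O-bound, a thread pool is enough to overlap hundreds of executions while the trainer itself stays synchronous.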
## Prerequisites

- `DAYTONA_API_KEY`: Required for access to Daytona sandboxes. Get it from the Daytona Dashboard.

## Setup

Create a virtual environment and install the project:

```bash
python3.10 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e .
```

Create a `.env` file (copy from `.env.example`) and add your API key:

```bash
cp .env.example .env
# edit .env with your API key
```

## Usage

```bash
python train.py
```
## Configuration

The script has several configurable parameters:
- `EFFECTIVE_BATCH_SIZE`: The effective batch size for training, also equal to the number of Daytona sandboxes to create (default: 500).
- `MAX_TIMEOUT_SECONDS`: Timeout for code execution in each sandbox (default: 1 second). Prevents infinite loops from blocking training.
- `MODEL_NAME`: The base model to train (default: `Qwen/Qwen3-1.7B-Base`).
Key parameters in the training configuration (see the sketch after this list):

- `per_device_train_batch_size`: Batch size per device (default: 20)
- `gradient_accumulation_steps`: Steps to accumulate before an update (default: 25)
- `max_steps`: Total training steps (default: 8)
- `max_completion_length`: Maximum tokens for generated code (default: 512)
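These settings correspond to fields on TRL's `GRPOConfig`. Note that on a single device, 20 × 25 = 500 rollouts per optimizer step, which lines up with the default `EFFECTIVE_BATCH_SIZE` (and therefore the number of sandboxes). A minimal sketch; the `output_dir` value is assumed from the output location mentioned at the end of this README:

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="training_results",
    per_device_train_batch_size=20,
    gradient_accumulation_steps=25,  # 20 * 25 = 500 rollouts per update
    max_steps=8,
    max_completion_length=512,
)
```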
## Adding a New Task

To add a new coding task, add an entry to the `TASKS` dictionary:

```python
TASKS = {
    "your_task": {
        "prompt": "Your prompt here...",
        "func_name": "function_name",
        "banned_patterns": ["patterns", "to", "ban"],
        "tests": [
            "test_input_1",
            "test_input_2",
        ],
        "reference": "expected_output_expression",
    },
}
```
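For intuition on how an entry might be consumed, here is one plausible harness: call the generated function on a test input, evaluate the reference expression, and print the comparison so the sandbox's stdout can be scored. This is a hypothetical sketch of the schema's mechanics, not the guide's actual code; `build_payload`, `violates_ban`, and the assumption that `reference` may refer to the test input as `x` are all invented for illustration.

```python
def build_payload(completion: str, task: dict, test_input: str) -> str:
    """Assemble the source string sent to a sandbox for one test case."""
    return "\n".join([
        completion,                                    # model-written function
        f"x = {test_input}",                           # bind one test input
        f"expected = {task['reference']}",             # reference expression
        f"print({task['func_name']}(x) == expected)",  # prints True/False
    ])


def violates_ban(completion: str, task: dict) -> bool:
    """Reject completions containing banned patterns (e.g. a built-in sort)."""
    return any(pattern in completion for pattern in task["banned_patterns"])
```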
## Training

The script runs reinforcement learning training using TRL's `GRPOTrainer`.
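Wired together, the setup looks roughly like the sketch below; the tiny dataset is illustrative, and `training_args` and `reward_fn` refer to the earlier sketches:

```python
from datasets import Dataset
from trl import GRPOTrainer

# Illustrative dataset; in the guide, prompts come from the TASKS dictionary.
train_dataset = Dataset.from_list([{"prompt": "Write a sorting function..."}])

trainer = GRPOTrainer(
    model="Qwen/Qwen3-1.7B-Base",  # MODEL_NAME default
    args=training_args,            # GRPOConfig sketch from above
    train_dataset=train_dataset,
    reward_funcs=reward_fn,        # sandbox reward sketch from above
)
trainer.train()
```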
After training finishes, the trained model is saved in the `training_results` folder. The metrics computed during the training run can be found in `training_results/metrics.jsonl`.
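Since `metrics.jsonl` is plain JSON Lines (assuming one JSON object per logged step), it can be inspected with a few lines of Python:

```python
import json

with open("training_results/metrics.jsonl") as f:
    metrics = [json.loads(line) for line in f]

print(metrics[-1])  # metrics from the last logged step
```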
## License

See the main project LICENSE file for details.