guides/python/reinforcement-learning/openenv/README.md
This guide demonstrates how to use OpenEnv with Daytona sandboxes to evaluate and train models on FinQA, a financial question-answering dataset built from SEC 10-K filings. An agent interacts with company financial tables through SQL queries, then submits a final numerical answer.
Two modes are included: a lightweight single-episode demo (`run.py`) and a full GRPO training loop (`train.py`) that runs parallel rollouts across hundreds of sandboxes.
Training fine-tunes Qwen3-14B with LoRA using Group Relative Policy Optimization (GRPO).

> [!TIP]
> `run.py` does not require a GPU. Only `train.py` needs GPU resources.
`DAYTONA_API_KEY` — Required for Daytona sandbox access. Get it from the Daytona Dashboard.

Set up the environment and run the demo:

```shell
python3.10 -m venv venv
source venv/bin/activate
pip install -e .
cp .env.example .env
# edit .env with your API key
python build_snapshot.py
python run.py
```
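The `cp .env.example .env` step assumes the scripts read the API key from a `.env` file. As an illustration of what that loading might look like, here is a minimal stdlib sketch (the project may well use `python-dotenv` instead; `load_env_file` is a hypothetical helper, not the project's actual code):

```python
import os

def load_env_file(path=".env"):
    """Minimal .env parser: KEY=VALUE lines, '#' comments, blanks ignored.

    Existing environment variables are not overwritten (setdefault), so values
    already exported in the shell take precedence over the file.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

In practice, `from dotenv import load_dotenv; load_dotenv()` does the same job with better edge-case handling (quoting, multiline values).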
Install the training extras:

```shell
pip install -e ".[train]"
```
vLLM requires Python development headers for Triton's JIT compilation. Make sure they're installed:

```shell
apt install python3.10-dev
```
With hundreds of concurrent WebSocket connections, you may hit the default open file descriptor limit. Raise it before starting:

```shell
ulimit -n 65536
```
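You can verify the limit the training process actually sees from Python using the standard library's `resource` module (a quick Unix-only check, not part of the project scripts):

```python
import resource

# The soft limit is what the kernel enforces; the hard limit is the ceiling
# the soft limit can be raised to without privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open file descriptors: soft={soft}, hard={hard}")
if soft < 65536:
    print("soft limit below 65536; run `ulimit -n 65536` in this shell first")
```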
Run training:

```shell
python train.py
```
Quick smoke test (no real training, just verifies the pipeline):

```shell
python train.py --sandboxes 2 --iterations 1 --group-size 2
```
`train.py` accepts the following arguments:

- `--sandboxes` — Number of Daytona sandboxes (default: 500)
- `--iterations` — Number of training iterations (default: 10)
- `--group-size` — Episodes per prompt group for GRPO (default: 6)
- `--target-groups-per-iter` — Stop rollout collection once this many groups are formed (default: 100; set 0 to use all rounds)
- `--max-rollout-rounds` — Maximum rollout rounds per training iteration (default: 8)
- `--model` — HuggingFace model ID (default: Qwen/Qwen3-14B)
- `--snapshot` — Daytona snapshot name (default: openenv-finqa)
- `--lr` — Learning rate (default: 8e-5)
- `--temperature` — Sampling temperature (default: 1.0)
- `--max-steps` — Max episode steps (default: 20)
- `--max-gen-tokens` — Max tokens per generation (default: 512)
- `--rollout-dispatch-wait-ms` — Max wait to accumulate ready episodes before generation dispatch (default: 500)
- `--tensor-parallel-size` — vLLM tensor parallelism (default: 2)
- `--gpu-memory-utilization` — vLLM GPU memory fraction (default: 0.85)
- `--lora-rank` — LoRA rank (default: 16)
- `--lora-alpha` — LoRA alpha (default: 32)
- `--lora-dropout` — LoRA dropout (default: 0.0)
- `--lora-target-modules` — LoRA target modules (default: all attention + MLP projections)
- `--sync-every` — Export the LoRA adapter to vLLM every N iterations (default: 1)
- `--grpo-update-batch-size` — Batch size for GRPO updates (default: 12)
- `--disable-gradient-checkpointing` — Disable gradient checkpointing
- `--run-dir` — Directory for persistent JSONL logs (default: runs/YYYYMMDD_HHMMSS/)
- `--save-token-ids` — Include prompt/completion token ID lists in trajectory logs

`build_snapshot.py` accepts:
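`--group-size` controls how many episodes share a prompt group for GRPO's group-relative baseline: each episode's reward is normalized against the mean and standard deviation of its own group, so no learned value function is needed. The core of that computation can be sketched as follows (a simplified illustration; `group_advantages` is not the project's actual function name):

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt group of episode rewards.

    Each reward is centered on the group mean and scaled by the group's
    (population) standard deviation; eps guards against all-equal rewards.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# With the default group size of 6, two correct episodes out of six yield
# positive advantages for the successes and negative ones for the failures:
adv = group_advantages([1.0, 0.0, 1.0, 0.0, 0.0, 0.0])
```

Because the baseline is computed per group, a group where every episode gets the same reward (all correct or all wrong) produces near-zero advantages and contributes no gradient signal, which is why forming enough diverse groups per iteration matters.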
- `--snapshot-name` — Name for the snapshot (default: openenv-finqa)

A single episode proceeds as follows:

1. Create a sandbox from the `openenv-finqa` snapshot and wait for the server health check
2. Call `reset()` to start a new episode with a random question and company
3. Use `get_descriptions` and `get_table_info` to discover available tables and their schemas
4. Explore the data with `sql_query`, then submit a final answer with `submit_answer`

Training additionally loads Qwen3-14B with LoRA on the training GPUs, starts vLLM on the inference GPUs, and creates the sandbox pool.

The FinQA environment exposes four tools via the OpenEnv protocol:
| Tool | Arguments | Description |
|---|---|---|
| `get_descriptions` | `company_name` | List available table names for a company |
| `get_table_info` | `company_name`, `table_name` | Get table metadata: columns, types, unique values |
| `sql_query` | `query` | Execute a SQL query against the company's 10-K data |
| `submit_answer` | `answer` | Submit a final numerical answer (terminates the episode) |
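The four tools can be exercised end to end. Here is a minimal sketch of one episode against an in-memory stand-in for the sandboxed environment (the real OpenEnv client API, return shapes, and schemas will differ; `FakeFinQAEnv` and its canned data are illustrative only):

```python
class FakeFinQAEnv:
    """In-memory stand-in for the sandboxed FinQA server (illustrative only)."""

    def __init__(self):
        # One company with one table, standing in for real 10-K data.
        self.tables = {"ACME": {"income_statement": ["year", "revenue"]}}
        self.done = False

    def reset(self):
        self.done = False
        return {"question": "What was ACME's 2023 revenue?", "company": "ACME"}

    def get_descriptions(self, company_name):
        return list(self.tables[company_name])

    def get_table_info(self, company_name, table_name):
        return {"columns": self.tables[company_name][table_name]}

    def sql_query(self, query):
        return [(2023, 1_000_000.0)]  # canned result for illustration

    def submit_answer(self, answer):
        self.done = True  # submitting terminates the episode
        return {"reward": 1.0 if answer == 1_000_000.0 else 0.0}

env = FakeFinQAEnv()
obs = env.reset()                                     # 1. new question + company
tables = env.get_descriptions(obs["company"])         # 2. discover tables
info = env.get_table_info(obs["company"], tables[0])  # 3. inspect the schema
rows = env.sql_query(
    "SELECT year, revenue FROM income_statement WHERE year = 2023"
)
result = env.submit_answer(rows[0][1])                # 4. final numerical answer
```

The real agent runs this loop under a step budget (`--max-steps`, default 20), and the binary reward from `submit_answer` is what GRPO normalizes within each episode group.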
Configuration constants:

- `SANDBOX_COUNT` — Number of sandboxes in the pool (default: 500)
- `MAX_CONCURRENT_CREATE` — Max concurrent sandbox creations (default: 100)
- `MAX_CONCURRENT_PLAY` — Max concurrent active episodes (default: 200)
- `MAX_PLAY_RETRIES` — Retries for failed episodes (default: 3)
- `MODEL_NAME` — HuggingFace model ID (default: Qwen/Qwen3-14B)
- `TENSOR_PARALLEL_SIZE` — vLLM tensor parallelism (default: 2)
- `GPU_MEMORY_UTILIZATION` — vLLM GPU memory fraction (default: 0.85)
- `MAX_GEN_TOKENS` — Max tokens per generation (default: 512)
- `TEMPERATURE` — Sampling temperature (default: 1.0)
- `LEARNING_RATE` — Adam learning rate for LoRA (default: 8e-5)
- `LORA_RANK` / `LORA_ALPHA` — LoRA hyperparameters (default: 16 / 32)
- `LORA_DROPOUT` — LoRA dropout (default: 0.0)
- `LORA_TARGET_MODULES` — Linear layers targeted by LoRA (default: all attention + MLP projections)
- `EPISODES_PER_GROUP` — Group size for GRPO advantage computation (default: 6)
- `TRAINING_ITERATIONS` — Number of outer training loop iterations (default: 10)
- `TARGET_GROUPS_PER_ITER` — Stop rollout collection once this many groups are formed (default: 100)
- `MAX_ROLLOUT_ROUNDS` — Maximum rollout rounds per training iteration (default: 8)
- `ROLLOUT_DISPATCH_WAIT_MS` — Max wait to accumulate ready episodes before generation dispatch (default: 500)
- `GRPO_UPDATE_BATCH_SIZE` — Batch size for gradient updates (default: 12)
- `SYNC_EVERY` — Export the LoRA adapter to vLLM every N iterations (default: 1)

See the main project LICENSE file for details.