# NeMo Gym Integration
NVIDIA NeMo Gym is a library for building RL environments for large language models. This integration enables training models in NeMo Gym environments using TRL's `GRPOTrainer` with vLLM server mode.

The integration supports multi-step and multi-turn rollouts, multi-environment training, and any NeMo Gym environment (thoroughly tested: workplace assistant, reasoning gym, MCQA, and math with judge).
## Available Environments

NeMo Gym provides training-ready environments across multiple domains, including but not limited to:
| Environment | Domain | Description |
|---|---|---|
| Workplace Assistant | Agent | Multi-step tool calling in common office scenarios (calendar, email, and more) |
| Math with Judge | Math | Math problems with algorithmic or judge-based verification |
| Code Gen | Coding | Competitive programming problems with code execution |
| MCQA | Knowledge | Multiple-choice question answering |
| Instruction Following | Instruction Following | IFEval/IFBench style tasks |
| Reasoning Gym | Multiple | Single-step procedurally generated verifiable tasks across domains |
For a complete list of available training environments, refer to the NeMo Gym repository.
## Setup

Complete these one-time setup steps before running training.
### Install TRL with vLLM extras

```shell
cd trl/
uv venv
source .venv/bin/activate
uv sync --extra vllm
```
### Install NeMo Gym

```shell
# deactivate trl venv
deactivate

git clone https://github.com/NVIDIA-NeMo/Gym.git
cd Gym
uv venv --python 3.12
source .venv/bin/activate
uv sync
```
## Dataset Preparation

Many NeMo Gym datasets used to train Nemotron models are available on Hugging Face. Use `ng_prepare_data` to download and prepare datasets. This command downloads the data and adds an `agent_ref` field to each example that tells NeMo Gym which agent server should handle that example.

Note: `train_multi_environment.py` adds the `agent_ref` field when loading datasets, so this step is optional if datasets are created another way.
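If you create a dataset another way, you can attach `agent_ref` yourself when writing the JSONL. A minimal sketch, assuming your pipeline already produces task dicts; the field names follow the workplace assistant example in this page, and the output filename is illustrative:

```python
import json

# Hypothetical tasks produced by your own pipeline; field names follow the
# workplace assistant dataset example shown on this page.
tasks = [
    {
        "responses_create_params": {
            "input": [{"role": "user", "content": "Move any of jinsoo's tasks that are in review to completed"}],
        },
        "ground_truth": [],
    },
]

# agent_ref tells NeMo Gym which agent server should handle each example.
agent_ref = {"type": "responses_api_agents", "name": "workplace_assistant_simple_agent"}

with open("my_dataset.jsonl", "w") as f:
    for task in tasks:
        f.write(json.dumps({**task, "agent_ref": agent_ref}) + "\n")
```

With the field present, the training script can route each example without any dataset-specific configuration.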
### Set Hugging Face Token

Create `env.yaml` in `Gym/` with your HF token:

```yaml
hf_token: <your_hf_token>
```
### Prepare Dataset

```shell
# Enter Gym and activate the venv
cd Gym
source .venv/bin/activate

# Set config paths
config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
resources_servers/workplace_assistant/configs/workplace_assistant.yaml"

# Download data and prep for training
ng_prepare_data "+config_paths=[${config_paths}]" \
    +output_dirpath=data/workplace_assistant \
    +mode=train_preparation \
    +should_download=true \
    +data_source=huggingface
```
This creates `train.jsonl` and `validation.jsonl` files in `data/workplace_assistant/`.
To create a new environment, refer to the environment creation guide. We suggest running an existing one first!
## Dataset Format

NeMo Gym datasets are stored as JSONL. Each line contains a task with input messages, tool definitions, metadata such as ground truth for verification, and an agent server reference. The following example shows the workplace dataset structure. Metadata fields can differ between datasets, as long as the corresponding resources server uses the fields appropriately.
```json
{
  "responses_create_params": {
    "input": [
      {"role": "system", "content": "..."},
      {"role": "user", "content": "Move any of jinsoo's tasks that are in review to completed"}
    ],
    "tools": [...],
    "parallel_tool_calls": false,
    "temperature": 1
  },
  "ground_truth": [
    {"name": "project_management_update_task", "arguments": "{...}"},
    ...
  ],
  "category": "workbench_project_management",
  "environment_name": "workbench",
  "agent_ref": {
    "type": "responses_api_agents",
    "name": "workplace_assistant_simple_agent"
  }
}
```
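Before training on a dataset built or modified by hand, it can help to check that every line parses and carries the fields described above. A small validation sketch; the required-field list is an assumption based on the example, not an exhaustive schema:

```python
import json

# Fields the integration relies on, inferred from the workplace example above.
REQUIRED_FIELDS = ("responses_create_params", "agent_ref")

def validate_line(line):
    """Parse one JSONL line and check the fields the training script relies on."""
    task = json.loads(line)
    for field in REQUIRED_FIELDS:
        if field not in task:
            raise ValueError(f"missing field: {field}")
    if not task["responses_create_params"].get("input"):
        raise ValueError("empty input messages")
    return task

sample = (
    '{"responses_create_params": {"input": [{"role": "user", "content": "hi"}]},'
    ' "agent_ref": {"type": "responses_api_agents", "name": "workplace_assistant_simple_agent"}}'
)
task = validate_line(sample)
```

Running this over each line of `train.jsonl` catches malformed tasks before they reach a rollout server.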
## Single-Node Training

These steps cover development and testing on a single node.
### Update Environment Config

Update `env.yaml` in `Gym/` to include model information:

```yaml
policy_base_url: http://127.0.0.1:8000/v1
policy_api_key: EMPTY
policy_model_name: Qwen/Qwen2.5-1.5B-Instruct
hf_token: ...
```
### Update Training Config

Update `examples/scripts/nemo_gym/config.yaml` to point to the dataset generated above, and make any other optional modifications.

The following steps run in three terminals. They can also be run as background processes or inside tmux.
### Start NeMo Gym Servers (Terminal 1)

```shell
cd Gym/
source .venv/bin/activate

config_paths="resources_servers/workplace_assistant/configs/workplace_assistant.yaml,\
responses_api_models/vllm_model/configs/vllm_model_for_training.yaml"

ng_run "+config_paths=[${config_paths}]"
```
This starts the workplace assistant resources server and the vLLM model server used for agent rollouts.
### Start TRL vLLM Server on GPU 0 (Terminal 2)

```shell
cd trl/
source .venv/bin/activate

CUDA_VISIBLE_DEVICES=0 trl vllm-serve \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --max-model-len 16384 \
    --host 0.0.0.0 \
    --port 8000
```
### Run Training on GPU 1 (Terminal 3)

```shell
source trl/.venv/bin/activate
cd trl/examples/scripts/nemo_gym

export WANDB_API_KEY=...
uv add omegaconf

CUDA_VISIBLE_DEVICES=1 python train_multi_environment.py --config config.yaml
```
## Multi-Node Training

An example five-node training script is provided in `submit.sh`. Nodes one through four run the training algorithm, while node five runs vLLM inference for NeMo Gym agent rollouts.
### Configure the Script

Update `submit.sh` with your Slurm account, partition, paths to your project directory, and updated training configs.
### Submit the Job

```shell
sbatch submit.sh
```
### Monitor Training

```shell
tail -f logs/<job_id>/*
```
Tip: Set up wandb logging for detailed training metrics. For more details on TRL's vLLM integration, refer to the vLLM integration page.
## Multi-Environment Training

Train on multiple NeMo Gym environments simultaneously. This allows learning diverse capabilities, such as tool calling and math reasoning, in a single training run.
### Prepare Individual Datasets

Prepare datasets for each environment. The workplace assistant dataset was prepared above. Now let's create a dataset for the mini sudoku environment implemented by the reasoning gym resources server in NeMo Gym:

```shell
cd Gym
source .venv/bin/activate
uv add reasoning-gym
cd resources_servers/reasoning_gym

python scripts/create_dataset.py \
    --task mini_sudoku \
    --size 2000 \
    --seed 42 \
    --output data/reasoning_gym/train_mini_sudoku.jsonl

python scripts/create_dataset.py \
    --task mini_sudoku \
    --size 50 \
    --seed 24 \
    --output data/reasoning_gym/val_mini_sudoku.jsonl
```
### Create Combined Dataset

Combine datasets into a single file with tasks from both environments:

```shell
cat data/workplace_assistant/train_workplace.jsonl data/reasoning_gym/train_mini_sudoku.jsonl | shuf > train_multi_env.jsonl
```

Tip: Ensure datasets are the same size before shuffling for an even blend of tasks. Repeat for the validation dataset.
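The size-matching tip can also be handled in a few lines of Python instead of `shuf`. A sketch that truncates each dataset to the smallest one before shuffling with a fixed seed; the helper name and file paths are illustrative, not part of NeMo Gym:

```python
import json
import random

def blend(paths, output_path, seed=42):
    """Truncate each JSONL dataset to the smallest one, then write a
    seeded shuffle of the union for an even blend of tasks."""
    datasets = []
    for path in paths:
        with open(path) as f:
            datasets.append([json.loads(line) for line in f])
    n = min(len(d) for d in datasets)  # same task count per environment
    blended = [task for d in datasets for task in d[:n]]
    random.Random(seed).shuffle(blended)
    with open(output_path, "w") as f:
        for task in blended:
            f.write(json.dumps(task) + "\n")
    return blended
```

A fixed seed makes the blend reproducible across reruns, which `shuf` alone does not guarantee.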
### Update Training Config

Update the config to point to the combined dataset:

```yaml
model_name: "Qwen/Qwen3-4B-Instruct-2507"
dataset_path: "/path/to/data/train_multi_env.jsonl"
eval_dataset_path: "/path/to/data/val_multi_env.jsonl"
task: "workplace-sudoku"  # used in wandb run name
output_dir: "outputs/nemo_gym_multi_env"
# ... rest of config same
```
### Update ng_run

Whether training interactively or via Slurm, update the `ng_run` command to include config files from each resources server:

```shell
cd Gym
source .venv/bin/activate

config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
resources_servers/workplace_assistant/configs/workplace_assistant.yaml,\
resources_servers/reasoning_gym/configs/reasoning_gym.yaml"

ng_run "+config_paths=[${config_paths}]"
```
This starts servers for both environments. The training script automatically routes each example to the correct agent server based on its `agent_ref` field.
### Run Training

Update the Slurm submission script to use the new training config and both `ng_run` resources server configs, then submit the job as before.
The training script reads `agent_ref` from each example's metadata, routes requests to the correct NeMo Gym agent server, and handles different agents and environments in the same batch.
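That per-batch routing amounts to grouping examples by their `agent_ref` before dispatching requests. A minimal sketch of the grouping step, with the server lookup itself omitted; the batch contents and the second agent name are made up for illustration:

```python
from collections import defaultdict

def group_by_agent(batch):
    """Group tasks by the (type, name) of their agent_ref so each group
    can be sent to the matching NeMo Gym agent server."""
    groups = defaultdict(list)
    for task in batch:
        ref = task["agent_ref"]
        groups[(ref["type"], ref["name"])].append(task)
    return dict(groups)

# Illustrative mixed batch; the mini sudoku agent name is hypothetical.
batch = [
    {"agent_ref": {"type": "responses_api_agents", "name": "workplace_assistant_simple_agent"}},
    {"agent_ref": {"type": "responses_api_agents", "name": "mini_sudoku_agent"}},
    {"agent_ref": {"type": "responses_api_agents", "name": "workplace_assistant_simple_agent"}},
]
groups = group_by_agent(batch)
```

Each group can then be sent to its agent server concurrently, so a single batch can mix workplace assistant and reasoning gym tasks.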