import { TabItem, Tabs } from '@astrojs/starlight/components'
This guide demonstrates how to build a recursive language model (RLM) agent system that uses Daytona sandboxes, based on the approach pioneered in Recursive Language Models (Zhang, Kraska, Khattab) and further explored by Prime Intellect.
While the original paper and Prime Intellect's implementation focus on single-level recursion (depth=1), this guide extends the concept to unlimited recursion depth — agents can spawn sub-agents, which can spawn their own sub-agents, and so on. Each agent runs in its own isolated Daytona sandbox with a fresh clone of the target repository.
The system implements a recursive agent architecture where agents can delegate subtasks to child agents:
- Agents call `rlm_query()` to spawn sub-agents, each with their own sandbox

```
Root Agent (depth=0)
├── Sub-Agent A (depth=1)
│   ├── Sub-Agent A1 (depth=2)
│   └── Sub-Agent A2 (depth=2)
└── Sub-Agent B (depth=1)
    ├── Sub-Agent B1 (depth=2)
    └── Sub-Agent B2 (depth=2)
```
Each agent runs in its own isolated Daytona sandbox with a fresh repository clone, enabling parallel exploration.
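The tree above can be modeled with a small recursive data structure. The sketch below is illustrative only: `AgentNode` and its methods are hypothetical names chosen for this example, not classes from the example repository.

```python
from dataclasses import dataclass, field


@dataclass
class AgentNode:
    """Hypothetical stand-in for one agent in the tree (not the repo's class)."""
    name: str
    depth: int = 0
    children: list["AgentNode"] = field(default_factory=list)

    def spawn(self, name: str) -> "AgentNode":
        # A child always lives one level deeper than its parent.
        child = AgentNode(name=name, depth=self.depth + 1)
        self.children.append(child)
        return child

    def count(self) -> int:
        # Total agents in this subtree, including this one.
        return 1 + sum(child.count() for child in self.children)

    def max_depth(self) -> int:
        # Deepest level reached anywhere below (or at) this node.
        return max((c.max_depth() for c in self.children), default=self.depth)


root = AgentNode("Root Agent")
for letter in ("A", "B"):
    sub = root.spawn(f"Sub-Agent {letter}")
    for i in (1, 2):
        sub.spawn(f"Sub-Agent {letter}{i}")

print(root.count(), root.max_depth())
```

Because every agent would need its own sandbox, a count like `root.count()` is also the number of sandboxes such a tree would consume.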
Clone the Daytona repository and navigate to the example directory:
```bash
git clone https://github.com/daytonaio/daytona.git
cd daytona/guides/python/recursive-language-models
```

Create a virtual environment and install the package:

```bash
python3.10 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e .
```
This installs:
- `daytona` - Daytona SDK for sandbox management
- `litellm` - Unified LLM interface for any provider
- `typer` - CLI framework
- `pyyaml` - Configuration parsing

Get your Daytona API key from the Daytona Dashboard and create a `.env` file:

```bash
DAYTONA_API_KEY=your_daytona_api_key
LLM_API_KEY=your_llm_api_key
```
The LLM_API_KEY is used via LiteLLM, supporting OpenRouter, OpenAI, Anthropic, and other providers.
With setup complete, let's run an agent. Here's an example that investigates TODO comments in scikit-learn:
```bash
python run.py https://github.com/scikit-learn/scikit-learn \
  -p "Investigate TODO comments across this repository. Spawn sub-agents to explore different modules. Find the easiest TODO and fix it."
```
This spawns a root agent that explores the codebase, delegates to sub-agents for parallel investigation, and produces a git patch fixing the easiest TODO it finds. We'll walk through the results and trace the execution in detail later, but first, let's look at how the code works.
| Option | Description |
|---|---|
| `repo` | GitHub repository URL (required) |
| `-p, --prompt` | Task prompt for the agent (required) |
| `-b, --branch` | Branch name (optional) |
| `--commit` | Specific commit SHA (optional) |
| `-c, --config` | Path to config file (default: `config.yaml`) |
| `-o, --output` | Output file for patch (default: stdout) |
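The options in the table map directly onto an argument parser. The example project uses Typer; the sketch below uses the standard library's `argparse` as a stand-in so that it runs without extra dependencies. `build_parser` is a hypothetical name, and the real `run.py` may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical argparse equivalent of the CLI options listed above."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("repo", help="GitHub repository URL")
    parser.add_argument("-p", "--prompt", required=True, help="Task prompt for the agent")
    parser.add_argument("-b", "--branch", default=None, help="Branch name")
    parser.add_argument("--commit", default=None, help="Specific commit SHA")
    parser.add_argument("-c", "--config", default="config.yaml", help="Path to config file")
    # None here stands for "write the patch to stdout".
    parser.add_argument("-o", "--output", default=None, help="Output file for patch")
    return parser


args = build_parser().parse_args(
    ["https://github.com/scikit-learn/scikit-learn", "-p", "Find the easiest TODO and fix it."]
)
print(args.repo, args.config, args.output)
```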
Let's walk through the key components of the agent system.
Each agent runs an iteration loop that calls the LLM, extracts code blocks, and executes them. The core loop in agent.py:
```python
for iteration in range(self.config.rlm.max_iterations):
    # Check global timeout
    if self._is_timeout():
        break

    # Build user prompt with previous execution result
    user_prompt = build_user_prompt(iteration, execution_result)
    messages.append({"role": "user", "content": user_prompt})

    # Get model completion
    response = self.client.completion(messages)
    messages.append({"role": "assistant", "content": response})

    # Execute code blocks in REPL
    repl_result = self.repl.execute_response(response)

    # Check for final answer
    if repl_result.final_answer is not None:
        self._result = repl_result.final_answer
        break

    # Format result for next iteration
    execution_result = format_execution_result(...)
```
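To try the shape of this loop without an API key or a sandbox, here is a self-contained sketch in which the LLM is replaced by a scripted fake. Everything in it is an assumption made for illustration: `FakeClient`, `run_agent`, the regex-based extraction, and the use of `python`-tagged fences are not taken from the example repository, and the code runs in-process with `exec` rather than in a Daytona sandbox.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # built indirectly so this example can itself live in a fenced block
CODE_BLOCK_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


class FakeClient:
    """Scripted stand-in for an LLM client; returns canned replies in order."""

    def __init__(self, replies):
        self._replies = iter(replies)

    def completion(self, messages):
        return next(self._replies)


def run_agent(client, max_iterations=5):
    namespace = {"final_answer": None}

    def FINAL(answer):
        namespace["final_answer"] = answer

    namespace["FINAL"] = FINAL
    messages, execution_result = [], "(no output yet)"

    for iteration in range(max_iterations):
        prompt = f"Iteration {iteration}. Last output:\n{execution_result}"
        messages.append({"role": "user", "content": prompt})
        response = client.completion(messages)
        messages.append({"role": "assistant", "content": response})

        # Run every extracted block, capturing stdout for the next prompt.
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            for code in CODE_BLOCK_RE.findall(response):
                exec(code, namespace)
        execution_result = buffer.getvalue()

        if namespace["final_answer"] is not None:
            return namespace["final_answer"], iteration + 1
    return None, max_iterations


replies = [
    f"Let me compute.\n{FENCE}python\nx = 6 * 7\nprint(x)\n{FENCE}",
    f"Done.\n{FENCE}python\nFINAL(str(x))\n{FENCE}",
]
answer, iterations = run_agent(FakeClient(replies))
print(answer, iterations)
```

Variables persist between iterations because every block is executed in the same `namespace`, which is what lets the second reply refer to `x`.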
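Each iteration:

1. Builds a user prompt that includes the previous execution result
2. Gets a completion from the model
3. Extracts the code blocks from the response and executes them in the REPL
4. Checks whether the code called `FINAL()` to submit results

When agent code calls `rlm_query()`, a new sub-agent is created with its own sandbox: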
```python
# Create sub-agent at depth + 1
sub_agent = RLMAgent(
    client=self.client,
    sandbox_manager=self.sandbox_manager,
    config=self.config,
    depth=self.depth + 1,
    task=task,
    # ... other params
)

# Run sub-agent (blocking)
result = sub_agent.run()

# Return result, truncated if necessary
return result.result or "No result"
```
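The last comment in the excerpt mentions truncation, but the excerpt does not show how it is done. A minimal sketch of such a helper follows, assuming the limit is a character count like `rlm.result_truncation_limit`. The name `truncate_result` and the wording of the truncation marker are hypothetical.

```python
def truncate_result(text: str, limit: int = 10000) -> str:
    """Hypothetical helper: cap a sub-agent result at `limit` characters."""
    if len(text) <= limit:
        return text
    omitted = len(text) - limit
    # Keep the head and tell the parent how much was dropped.
    return f"{text[:limit]}\n... [truncated {omitted} chars]"


short = truncate_result("found 3 TODOs", limit=20)
long = truncate_result("x" * 25, limit=20)
print(short)
print(long)
```

Capping results this way keeps a verbose sub-agent from flooding its parent's context window.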
For parallel spawning, rlm_query_batched() uses a thread pool:
```python
with ThreadPoolExecutor(max_workers=10) as executor:
    future_to_idx = {
        executor.submit(self._handle_rlm_query, task): i
        for i, task in enumerate(tasks)
    }
    for future in as_completed(future_to_idx):
        idx = future_to_idx[future]
        results[idx] = future.result()
return results
```
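The pattern above returns results in task order even though futures complete in arbitrary order, because each future is mapped back to its task's index. The following runnable sketch isolates that idea. `fake_sub_agent` and `query_batched` are hypothetical stand-ins, and the per-task error handling is an addition for illustration rather than something the excerpt shows.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_sub_agent(task: str) -> str:
    # Hypothetical stand-in for running a sub-agent; longer tasks "take" longer.
    time.sleep(0.01 * len(task))
    return f"done: {task}"


def query_batched(tasks: list[str], max_workers: int = 10) -> list[str]:
    results = [""] * len(tasks)  # pre-sized so each result lands at its task's index
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_idx = {executor.submit(fake_sub_agent, t): i for i, t in enumerate(tasks)}
        for future in as_completed(future_to_idx):
            idx = future_to_idx[future]
            try:
                results[idx] = future.result()
            except Exception as exc:  # one failing sub-agent should not sink the batch
                results[idx] = f"Error: {exc}"
    return results


tasks = ["a long-running task", "short", "medium task"]
print(query_batched(tasks))
```

Threads fit here because the work is I/O-bound: each worker mostly waits on its sandbox and on LLM calls.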
Inside the REPL, agents have access to these functions:
| Function | Description |
|---|---|
| `rlm_query(task)` | Spawn a single sub-agent, returns result string |
| `rlm_query_batched(tasks)` | Spawn multiple sub-agents in parallel |
| `FINAL(answer)` | Submit final result (root: triggers patch extraction) |
| `FINAL_VAR(var_name)` | Submit the value of a variable as result |
| `edit_file(path, old, new)` | Edit a file with syntax validation |
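One way helpers like these could be exposed is by placing them in the globals dictionary that agent code is executed in. The sketch below shows that idea for `FINAL`, `FINAL_VAR`, and `edit_file`. It is an assumption-laden illustration: `make_repl_namespace`, the `state` dictionary, the "must match exactly once" rule, and the use of `ast.parse` for validation are all hypothetical and may differ from the repository's implementation.

```python
import ast
import tempfile
from pathlib import Path


def make_repl_namespace(state: dict) -> dict:
    """Build the globals dict that agent code is exec'd in (hypothetical design)."""
    namespace: dict = {}

    def FINAL(answer):
        state["final_answer"] = str(answer)

    def FINAL_VAR(var_name):
        # Look the variable up in the same namespace the agent's code ran in.
        state["final_answer"] = str(namespace[var_name])

    def edit_file(path, old, new):
        source = Path(path).read_text()
        if source.count(old) != 1:
            return "Error: `old` must match exactly once"
        updated = source.replace(old, new)
        if str(path).endswith(".py"):
            try:
                ast.parse(updated)  # refuse edits that break Python syntax
            except SyntaxError as exc:
                return f"Error: edit rejected, syntax error at line {exc.lineno}"
        Path(path).write_text(updated)
        return "OK"

    namespace.update(FINAL=FINAL, FINAL_VAR=FINAL_VAR, edit_file=edit_file)
    return namespace


state = {"final_answer": None}
ns = make_repl_namespace(state)

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo.py"
    target.write_text("x = 1\n")
    bad = ns["edit_file"](target, "x = 1", "x = (1")   # unbalanced parenthesis
    good = ns["edit_file"](target, "x = 1", "x = 2")
    final_text = target.read_text()

exec("summary = 'edited demo.py'\nFINAL_VAR('summary')", ns)
print(bad, good, state["final_answer"])
```

Validating before writing means a rejected edit leaves the file untouched, so the agent can retry with a corrected replacement.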
Example spawning pattern used by agents:
<Tabs>
  <TabItem label="Python" icon="seti:python">
    ```python
    # Spawn multiple sub-agents to explore different modules
    results = rlm_query_batched([
        "Search for TODO comments in sklearn/linear_model/ and assess difficulty",
        "Search for TODO comments in sklearn/ensemble/ and assess difficulty",
        "Search for TODO comments in sklearn/tree/ and assess difficulty",
    ])

    for i, result in enumerate(results):
        print(f"=== Sub-agent {i+1} findings ===")
        print(result)
    ```
  </TabItem>
</Tabs>
Let's trace what happens when we run an agent on a popular machine learning library, scikit-learn:
```bash
python run.py https://github.com/scikit-learn/scikit-learn \
  -p "Investigate TODO comments across this repository. Spawn sub-agents to explore different modules under sklearn/ in parallel. For each TODO found, assess how difficult it would be to fix (easy/medium/hard). After gathering results, pick the easiest TODO and fix it."
```
Note that there are about 400 lines in scikit-learn that contain the substring "# TODO".
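A count like this can be reproduced with a short script. The sketch below is a rough approach, demonstrated on a throwaway directory rather than a real clone. `count_todo_lines` is a hypothetical helper, and restricting the search to `.py` files is an assumption, so the number it gives for scikit-learn may differ from the figure above.

```python
import tempfile
from pathlib import Path


def count_todo_lines(root: Path, needle: str = "# TODO") -> int:
    """Count lines containing `needle` in all .py files under `root`."""
    total = 0
    for path in root.rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        total += sum(1 for line in text.splitlines() if needle in line)
    return total


# Demo on a throwaway directory; point `root` at a scikit-learn clone instead.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg").mkdir()
    (root / "pkg" / "a.py").write_text("x = 1  # TODO: rename\n# TODO remove\ny = 2\n")
    (root / "pkg" / "b.py").write_text("# todo lowercase is not counted\n")
    demo_count = count_todo_lines(root)

print(demo_count)
```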
Step 1: Root agent explores and spawns depth-1 sub-agents
The root agent (depth=0) examines the repository structure, identifies all sklearn modules, and spawns 25 sub-agents in parallel:
<Tabs>
  <TabItem label="Python" icon="seti:python">
    ```python
    # Define the subdirectories to investigate
    subdirs = [
        "cluster", "compose", "covariance", "cross_decomposition", "datasets",
        "decomposition", "ensemble", "feature_extraction", "feature_selection",
        "gaussian_process", "impute", "inspection", "linear_model", "manifold",
        "metrics", "mixture", "model_selection", "neighbors", "neural_network",
        "preprocessing", "semi_supervised", "svm", "tree", "utils"
    ]

    # Create queries for sub-agents
    queries = [
        f"Search for 'TODO' comments in 'sklearn/{subdir}/'. For each TODO found, provide: "
        f"1. The file path and line number. 2. The content of the TODO. 3. An assessment "
        f"of how difficult it would be to fix (easy/medium/hard) with a brief justification."
        for subdir in subdirs
    ]

    results = rlm_query_batched(queries)
    ```
  </TabItem>
</Tabs>
Each of these 25 sub-agents gets its own Daytona sandbox with a fresh clone of scikit-learn.
Step 2: Depth-1 agents spawn depth-2 agents
Some depth-1 agents decide their module is too large and spawn their own sub-agents. For example, the sklearn/metrics/ agent spawned 3 depth-2 agents:
```python
tasks = [
    "Identify and assess TODOs in 'sklearn/metrics/cluster/'. Provide file, line, content, and difficulty.",
    "Identify and assess TODOs in 'sklearn/metrics/tests/'. Provide file, line, content, and difficulty.",
    "Identify and assess TODOs in 'sklearn/metrics/_plot/' and its 'tests' sub-directory."
]
results = rlm_query_batched(tasks)
```
Step 3: Results propagate back
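Each sub-agent returns its findings via `FINAL()`. Results then flow back up the tree: each parent receives its children's results as the return value of `rlm_query()` or `rlm_query_batched()`.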
Step 4: Root agent synthesizes and acts
The root agent reviews all findings, identifies the easiest TODO, and makes the fix.
Step 5: Git patch produced
<Tabs>
  <TabItem label="Python" icon="seti:python">
    ```python
    import subprocess

    subprocess.run(['git', 'add', '-A'], cwd='/workspace')
    result = subprocess.run(['git', 'diff', '--cached', 'HEAD'],
                            capture_output=True, text=True, cwd='/workspace')
    FINAL(result.stdout)
    ```
  </TabItem>
</Tabs>

Generated patch:

```diff
diff --git a/sklearn/utils/_array_api.py b/sklearn/utils/_array_api.py
--- a/sklearn/utils/_array_api.py
+++ b/sklearn/utils/_array_api.py
@@ -19,8 +19,7 @@ from sklearn.externals.array_api_compat import numpy as np_compat
 from sklearn.utils._dataframe import is_df_or_series
 from sklearn.utils.fixes import parse_version
-# TODO: complete __all__
-__all__ = ["xpx"] # we import xpx here just to re-export it, need this to appease ruff
+__all__ = ['device', 'get_namespace', 'get_namespace_and_device', 'indexing_dtype', 'move_to', 'size', 'supported_float_dtypes', 'xpx', 'yield_namespace_device_dtype_combinations', 'yield_namespaces']
```
The agent found the easiest TODO (`# TODO: complete __all__` in `sklearn/utils/_array_api.py`) and completed the `__all__` list with all public symbols from the module.
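How the agent collected those names is not shown. As one plausible approach, the sketch below lists a module's public top-level functions and classes with the standard `ast` module. `public_names` is a hypothetical helper, and treating "public" as "does not start with an underscore" is an assumption; it also ignores re-exported imports such as `xpx`.

```python
import ast


def public_names(source: str) -> list[str]:
    """Return sorted top-level function/class names that don't start with '_'."""
    names = set()
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if not node.name.startswith("_"):
                names.add(node.name)
    return sorted(names)


example = '''
def size(x): ...
def _private(x): ...
class Helper: ...
def device(x): ...
'''
print(public_names(example))
```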
Configure the agent in config.yaml:
```yaml
# RLM configuration
rlm:
  max_sandboxes: 50
  max_iterations: 50
  global_timeout: 3600
  result_truncation_limit: 10000
```
| Parameter | Default | Description |
|---|---|---|
| `model.name` | `openrouter/google/gemini-3-flash-preview` | LLM model in LiteLLM format |
| `rlm.max_sandboxes` | 50 | Maximum total sandboxes across entire rollout |
| `rlm.max_iterations` | 50 | Maximum iterations per agent |
| `rlm.global_timeout` | 3600 | Total timeout in seconds |
| `rlm.result_truncation_limit` | 10000 | Max chars in sub-agent results |
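Because `rlm.max_sandboxes` caps sandboxes across the whole rollout and sub-agents are spawned from several threads at once, any counter enforcing it has to be thread-safe. Here is a minimal sketch of such a budget. `SandboxBudget` is a hypothetical class, and the repository's sandbox manager may enforce the limit differently.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SandboxBudget:
    """Hypothetical thread-safe counter enforcing a global sandbox cap."""

    def __init__(self, max_sandboxes: int = 50):
        self._max = max_sandboxes
        self._used = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # Check-and-increment must be atomic, hence the lock.
        with self._lock:
            if self._used >= self._max:
                return False
            self._used += 1
            return True

    @property
    def used(self) -> int:
        with self._lock:
            return self._used


budget = SandboxBudget(max_sandboxes=50)
with ThreadPoolExecutor(max_workers=10) as pool:
    # 80 concurrent requests compete for 50 slots.
    granted = list(pool.map(lambda _: budget.try_acquire(), range(80)))

print(sum(granted), budget.used)
```

An agent whose request is refused could fall back to doing the work itself instead of delegating.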
:::tip[Scaling Tips]
- Increase `max_sandboxes` for tasks requiring more parallel exploration
:::

Results are saved to the `results/` directory as JSON files. Use the built-in viewer:
```bash
python -m http.server 8000
# Open http://localhost:8000/viewer/
```
The viewer provides:
Current language models aren't specifically trained to leverage recursive delegation, so RLMs don't necessarily outperform single-agent approaches on benchmarks yet. However, the architecture demonstrates compelling properties for complex tasks.
In our scikit-learn example, 40 agents ran in parallel across the agent tree, each with its own isolated sandbox, completing the entire run in just over 5 minutes. This level of parallelism, where each agent can freely modify files, run tests, and explore without affecting others, would be difficult to achieve without per-agent sandboxes.
Key advantages of this approach:
- `rlm_query_batched()` enables concurrent investigation