v2/docs/reasoningbank/google-research.md
An algorithmic outline to implement a ReasoningBank-style system on top of your Claude Flow Memory Space. It maps cleanly to your SQLite-backed memory at .swarm/memory.db and the hooks system so you can drop this into flows immediately. Where I reference paper specifics or your repo’s schemas, I cite them.
A closed-loop module with four algorithms wired into Claude Flow:
ReasoningBank stores each memory item as {title, description, content} and retrieves top‑k via semantic similarity to inject as system instructions. It learns from both successes and failures and includes Memory‑aware Test‑Time Scaling (MaTTS) in parallel and sequential modes. (arXiv)
Your Claude Flow Memory Space already exposes the right persistence primitives and tables, including patterns for learned behaviors, events for trajectories, and performance_metrics. The DB lives at .swarm/memory.db. (GitHub)
Use your existing tables, add two small ones. Keep migrations idempotent.
-- A. Use existing patterns table to store ReasoningBank items
-- patterns(id TEXT PRIMARY KEY, type TEXT, pattern_data TEXT, confidence REAL, usage_count INT, created_at TEXT, last_used TEXT)
-- We will store type='reasoning_memory' and JSON in pattern_data
-- B. Embeddings for retrieval
CREATE TABLE IF NOT EXISTS pattern_embeddings (
id TEXT PRIMARY KEY, -- same id as patterns.id
model TEXT NOT NULL, -- e.g., text-embedding-3-large or Claude embed
dims INTEGER NOT NULL,
vector BLOB NOT NULL, -- float32 array serialized
created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
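The `vector BLOB` column needs a stable serialization for the float32 array. A minimal sketch, assuming little-endian float32 (the near-universal case for Node hosts); function names here are illustrative, not an existing API:

```typescript
// Serialize a float32 embedding into a Buffer for the BLOB column,
// and read it back. Assumes little-endian float32 throughout.
function serializeVector(v: Float32Array): Buffer {
  return Buffer.from(v.buffer, v.byteOffset, v.byteLength);
}

function deserializeVector(blob: Buffer, dims: number): Float32Array {
  if (blob.byteLength !== dims * 4) {
    throw new Error(`expected ${dims * 4} bytes, got ${blob.byteLength}`);
  }
  // Copy out explicitly rather than aliasing the Buffer's ArrayBuffer.
  const out = new Float32Array(dims);
  for (let i = 0; i < dims; i++) out[i] = blob.readFloatLE(i * 4);
  return out;
}
```

Store `dims` alongside the blob (as the schema does) so a model change with a different dimensionality is caught at read time rather than corrupting similarity scores.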
-- C. Links between memories for governance and consolidation
CREATE TABLE IF NOT EXISTS pattern_links (
src_id TEXT NOT NULL,
dst_id TEXT NOT NULL,
relation TEXT NOT NULL, -- 'entails' | 'contradicts' | 'refines' | 'duplicate_of'
weight REAL DEFAULT 1.0,
created_at TEXT DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (src_id, dst_id, relation)
);
-- D. Task trajectory archive (optional if you already store in events)
CREATE TABLE IF NOT EXISTS task_trajectories (
task_id TEXT PRIMARY KEY,
agent_id TEXT,
query TEXT NOT NULL,
trajectory_json TEXT NOT NULL, -- steps, messages, tool calls
started_at TEXT,
ended_at TEXT,
judge_label TEXT, -- 'Success' | 'Failure'
judge_conf REAL, -- 0..1
matts_run_id TEXT -- to link with scaling bundles
);
-- E. MaTTS run bookkeeping
CREATE TABLE IF NOT EXISTS matts_runs (
run_id TEXT PRIMARY KEY,
task_id TEXT NOT NULL,
mode TEXT NOT NULL, -- 'parallel' | 'sequential'
k INTEGER NOT NULL,
status TEXT DEFAULT 'completed',
summary TEXT, -- JSON with outcomes
created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
Your events, tasks, performance_metrics, and memory_store tables remain as is. (GitHub)
{
"id": "rm_ulid_01HZX…",
"type": "reasoning_memory",
"pattern_data": {
"title": "Handle login flows with CSRF tokens",
"description": "Always fetch and include CSRF token before POST.",
"content": "1) Load login page and parse CSRF from form or meta tag. 2) Attach token to POST. 3) Retry once if 403 and refresh token.",
"source": {
"task_id": "task_…",
"agent_id": "agent_web",
"outcome": "Success",
"evidence": ["event_id_192", "event_id_205"]
},
"tags": ["web", "auth", "csrf"],
"domain": "webarena.admin",
"created_at": "2025-10-10T12:00:00Z",
"confidence": 0.76,
"n_uses": 0
}
}
This mirrors the ReasoningBank schema of {title, description, content}. (arXiv)
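For type safety in the hook code later in this doc, the item above can be modeled as an interface. A sketch only; the type names are assumptions, not a published schema:

```typescript
// Shape of pattern_data for type='reasoning_memory' rows.
interface MemorySource {
  task_id: string;
  agent_id: string;
  outcome: "Success" | "Failure";
  evidence: string[]; // event IDs backing the claim
}

interface ReasoningMemory {
  title: string;        // short, searchable headline
  description: string;  // one-sentence summary
  content: string;      // numbered, reusable steps
  source: MemorySource;
  tags: string[];
  domain: string;
  created_at: string;   // ISO 8601
  confidence: number;   // 0..1
  n_uses: number;
}
```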
Inputs: task_query, optional domain, k
Outputs: ordered list of memory items with scores
Steps
Embed the query q with your chosen model; persist the vector in a short-lived cache.
Candidate fetch: select `patterns` rows where `type='reasoning_memory'`, join to `pattern_embeddings` for vectors, and filter by `domain` or `tags` when provided.
Score each candidate i with a bounded additive model:
sim_i = cosine(q, e_i)
rec_i = exp(-age_days_i / H) -- H half-life in days, default 45
rel_i = clamp(confidence_i, 0, 1) -- from judge agreement and reuse
div_i = MMR penalty against already selected set S
score_i = α*sim_i + β*rec_i + γ*rel_i - δ*div_i
defaults: α=0.65, β=0.15, γ=0.20, δ=0.10
Use Maximal Marginal Relevance for div_i to avoid near duplicates.
Select top‑k with MMR:
S = {}
while |S| < k:
pick argmax_i [ score_i - δ*max_{j in S} cosine(e_i, e_j) ]
add i to S
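The selection loop above can be written out directly. A minimal sketch, assuming each candidate already carries its base `score` from the additive model and its embedding:

```typescript
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

interface Candidate { id: string; score: number; emb: Float32Array; }

// Greedy MMR: each pick maximizes base score minus a redundancy penalty
// against items already selected (delta defaults to 0.10, as above).
function selectTopK(cands: Candidate[], k: number, delta = 0.10): Candidate[] {
  const selected: Candidate[] = [];
  const pool = [...cands];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0, bestVal = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      const redundancy = selected.length
        ? Math.max(...selected.map(s => cosine(pool[i].emb, s.emb)))
        : 0;
      const val = pool[i].score - delta * redundancy;
      if (val > bestVal) { bestVal = val; bestIdx = i; }
    }
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected;
}
```

With k small (3 by default) and a few hundred candidates, the O(k · n) scan is negligible next to the embedding call.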
Record usage
Increment `usage_count` and update `last_used` in `patterns`; log the retrieval latency and selected IDs to `performance_metrics`.
Inject into the agent system prompt as a short preamble: title, then compact content.
Binary classification of a finished trajectory.
Inputs: task_query, trajectory_json
Outputs: label ∈ {Success, Failure}, confidence ∈ [0,1]
Prompt template (LLM-as-judge with deterministic decoding):
System: You are a strict evaluator for task completion.
User:
Task: "<task_query>"
Trajectory: <structured JSON of steps, tool calls, outputs>
Evaluate if the final state meets the acceptance criteria.
Respond with pure JSON:
{"label": "Success" | "Failure", "confidence": 0..1, "reasons": ["..."]}
ReasoningBank uses an LLM-as-judge to label outcomes without ground truth. Set temperature to 0 for determinism. (arXiv)
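The judge's reply should be validated before it touches the DB. A defensive parser sketch; the model call itself is omitted, and `Verdict` is an assumed type, not an existing API:

```typescript
interface Verdict { label: "Success" | "Failure"; confidence: number; reasons: string[]; }

// Parse and validate the judge's raw text; throws on malformed replies so
// the caller can retry rather than persist garbage labels.
function parseVerdict(raw: string): Verdict {
  // Tolerate models that wrap JSON in code fences despite instructions.
  const stripped = raw.replace(/^```(?:json)?\s*|\s*```$/g, "").trim();
  const obj = JSON.parse(stripped);
  if (obj.label !== "Success" && obj.label !== "Failure") {
    throw new Error(`bad label: ${obj.label}`);
  }
  const confidence = Number(obj.confidence);
  if (!(confidence >= 0 && confidence <= 1)) throw new Error("confidence out of range");
  return { label: obj.label, confidence, reasons: Array.isArray(obj.reasons) ? obj.reasons : [] };
}
```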
Post-processing
Persist the verdict to `task_trajectories` and `events`.
Create memories from both successes and failures.
Inputs: task_query, trajectory_json, label='Success'
Outputs: up to m memory items
Prompt template
System: Extract reusable strategy principles as concise, general rules.
User:
Given a task and its successful trajectory, produce up to {{m}} memory items.
Each item must be a JSON object with keys: title, description (1 sentence), content (3-8 numbered steps with clear decision criteria).
Avoid copying low-level URLs, IDs, PII, or task-specific constants.
Task: "<task_query>"
Trajectory: <JSON>
Respond with:
{"memories":[{...},{...}]}
Paper extracts multiple items per trajectory with title, description, content. (arXiv)
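Distilled items should be checked against the {title, description, content} schema before upsert. A validation sketch; the heuristics and names here are assumptions:

```typescript
interface DistilledItem { title: string; description: string; content: string; }

// Keep only well-formed items and cap at maxItems. This is structural
// validation only, not a substitute for the redact_pii step.
function validateMemories(parsed: unknown, maxItems: number): DistilledItem[] {
  const arr = (parsed as { memories?: unknown[] })?.memories ?? [];
  const ok: DistilledItem[] = [];
  for (const it of arr) {
    const m = it as Partial<DistilledItem>;
    if (typeof m.title === "string" && m.title.length > 0 &&
        typeof m.description === "string" &&
        typeof m.content === "string" && m.content.length > 0) {
      ok.push({ title: m.title, description: m.description, content: m.content });
    }
    if (ok.length >= maxItems) break;
  }
  return ok;
}
```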
Inputs: same, but label='Failure'
Outputs: up to m guardrails
Prompt template
System: Extract failure guardrails as preventative rules.
User:
From the failed trajectory, create up to {{m}} guardrail items.
Each item schema is the same, but content should specify failure modes, checks, and recovery steps to avoid repetition.
ReasoningBank explicitly uses failures to create counterfactual signals and pitfalls. (arXiv)
For each item:
Compute id = ulid().
Compute embedding and insert into pattern_embeddings.
Insert into patterns with:
- `type='reasoning_memory'`
- `pattern_data` set to the JSON schema above
- `confidence = judge_conf * prior`, where prior = 0.7 for successes and 0.6 for failures

Emit an `events` row with `type='reasoning_memory.created'`.
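The row construction is pure and worth isolating from the DB writes. A sketch of the confidence rule above, with an illustrative helper name:

```typescript
// Compute the patterns-row fields for a distilled item.
// prior mirrors the rule above: 0.7 for successes, 0.6 for failures.
function toPatternRow(
  item: { outcome: "Success" | "Failure"; data: object },
  judgeConf: number,
  id: string,
) {
  const prior = item.outcome === "Success" ? 0.7 : 0.6;
  return {
    id,
    type: "reasoning_memory" as const,
    pattern_data: JSON.stringify(item.data),
    confidence: Math.min(1, judgeConf * prior),
    usage_count: 0,
  };
}
```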
Run after every N new items or on a schedule.
- Deduplicate: when similarity exceeds t_dup=0.87, keep the item with the highest `score_i` from the retrieval model and link the others via `pattern_links(relation='duplicate_of', weight=similarity)`.
- Detect contradictions: when two items conflict and similarity exceeds t_contra=0.6, add `pattern_links(relation='contradicts')`, then lower the weaker item's `confidence` or quarantine it for review.
- Decay: apply a half-life of H=90 days to `confidence`.
- Prune: drop items with `usage_count=0`, `confidence<0.3`, and age > 180 days.
- Log every consolidation action to `events` and `performance_metrics` for transparency.

The paper keeps consolidation minimal to highlight core contributions, but notes that advanced consolidation like merging and forgetting can be added, which is what you are doing here. (arXiv)
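The duplicate-detection pass can be sketched as a pure function over in-memory items, deferring the `pattern_links` writes to the caller. Names and the greedy grouping strategy are assumptions:

```typescript
interface Item { id: string; confidence: number; emb: Float32Array; }
interface DupLink { src: string; dst: string; weight: number; }

function cosineSim(a: Float32Array, b: Float32Array): number {
  let d = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { d += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return d / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Greedy grouping: visit items in descending confidence; the first item of
// each group survives, later near-duplicates get a duplicate_of link to it.
function findDuplicates(items: Item[], tDup = 0.87): DupLink[] {
  const links: DupLink[] = [];
  const keepers: Item[] = [];
  for (const it of [...items].sort((a, b) => b.confidence - a.confidence)) {
    const match = keepers.find(k => cosineSim(k.emb, it.emb) >= tDup);
    if (match) links.push({ src: it.id, dst: match.id, weight: cosineSim(match.emb, it.emb) });
    else keepers.push(it);
  }
  return links;
}
```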
Convert extra inference compute into better memories.
Two modes per ReasoningBank: parallel and sequential. (arXiv)
Inputs: task_query, scaling factor k
Pipeline
Run k independent rollouts with controlled diversity seeds, then aggregate with a contrastive prompt:
System: You are aggregating insights across multiple attempts of the same task.
User:
We have {{k}} trajectories with their labels. Compare and contrast them.
1) Identify patterns present in most successful attempts but absent in failures.
2) Identify pitfalls present in failures but not in successes.
3) Produce 1-3 distilled memory items that generalize beyond this task.
Respond as:
{"memories":[{title,description,content},...], "notes":["..."]}
Record the bundle in `matts_runs` with `mode='parallel'`.
Inputs: task_query, refinement steps r
Pipeline
Refine the trajectory r times with a "check and correct" instruction, then record the run in `matts_runs` with `mode='sequential'`.
MaTTS is defined as memory-aware scaling that exploits contrastive signals among multiple trajectories or iterative refinements, improving the transferability of memories. (arXiv)
Leverage your hooks and memory system. Your repo exposes a hooks system with post-task and post-command, and documents the SQLite memory location. (GitHub)
Add these to .claude/settings.json:
{
"hooks": {
"preTaskHook": {
"command": "npx",
"args": ["claude-flow", "hooks", "pre-task", "--retrieve-reasoningbank", "true"],
"alwaysRun": true
},
"postTaskHook": {
"command": "npx",
"args": ["claude-flow", "hooks", "post-task", "--judge-and-distill", "true"],
"alwaysRun": true
}
}
}
// pre-task: retrieve
export async function preTask({ taskId, agentId, query }) {
const memories = await retrieveMemories(query, { k: 3 }); // section 3
await injectSystemPreamble(agentId, memories); // adds to system prompt
await metrics.log('retrieve_ms', /*duration*/);
}
// post-task: judge + distill + consolidate
export async function postTask({ taskId, agentId, query, trajectory }) {
const verdict = await judge(query, trajectory); // section 4
await db.exec("UPDATE task_trajectories SET judge_label=?, judge_conf=? WHERE task_id=?",
[verdict.label, verdict.confidence, taskId]);
const newItems = await distill(query, trajectory, verdict);// section 5
for (const item of newItems) await upsertMemory(item);
await maybeConsolidate(); // section 6
}
export async function mattsParallel({ taskId, query, k=6 }) {
const runs = await Promise.all(Array.from({length: k}, () => runOnce(query)));
const judgments = await Promise.all(runs.map(r => judge(query, r.trajectory)));
const agg = await aggregateContrastive(runs, judgments); // new memories
await upsertMemories(agg.memories);
await db.exec("INSERT INTO matts_runs(run_id, task_id, mode, k, summary) VALUES (?,?,?,?,?)",
[ulid(), taskId, 'parallel', k, JSON.stringify(summary(runs, judgments))]);
}
export async function mattsSequential({ taskId, query, r=3 }) {
let tr = await runOnce(query);
for (let i=0; i<r; i++) tr = await refineOnce(query, tr);
const j = await judge(query, tr.trajectory);
const mems = await distill(query, tr.trajectory, j);
await upsertMemories(mems);
await db.exec("INSERT INTO matts_runs(run_id, task_id, mode, k, summary) VALUES (?,?,?,?,?)",
[ulid(), taskId, 'sequential', r, JSON.stringify({ final:j })]);
}
Keep it short and structured.
System preamble: Strategy memories you can optionally use.
1) [Title] Handle login flows with CSRF tokens
Steps: Load page and parse CSRF. Attach token to POST. Retry once if 403 and refresh token.
2) [Title] Avoid infinite pagination loops
Steps: Detect repeated DOM states and stop. Summarize partial results.
The paper injects retrieved items into system instruction. Keep k small to avoid noise. (arXiv)
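Rendering the preamble is mechanical and worth pinning down so every agent sees the same shape. A sketch matching the example above; the function name is illustrative:

```typescript
// Render retrieved memories as the short numbered preamble shown above.
// Returns "" when nothing was retrieved so callers can skip injection.
function buildPreamble(memories: { title: string; content: string }[]): string {
  if (memories.length === 0) return "";
  const lines = ["System preamble: Strategy memories you can optionally use."];
  memories.forEach((m, i) => {
    lines.push(`${i + 1}) [Title] ${m.title}`);
    lines.push(`Steps: ${m.content}`);
  });
  return lines.join("\n");
}
```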
Success item: 0.7 to 0.85 depending on judge confidence.
Failure guardrail: 0.6 to 0.75.
Online update: confidence ← clamp(confidence + η*success_delta, 0, 1) with η=0.05, where success_delta = +1 if the task succeeded and the item was cited by the agent, and -0.5 if the task failed with the item cited.
Reliability from reuse: rel_i = sigmoid(log(1 + usage_count)).
Track these in `performance_metrics` and export CSV weekly:
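The clamped update rule above fits in a few lines; a sketch with an illustrative function name:

```typescript
// Online confidence update: eta = 0.05 by default; delta is +1 on a cited
// success, -0.5 on a cited failure, and 0 when the item was not cited.
function updateConfidence(
  conf: number, cited: boolean, success: boolean, eta = 0.05,
): number {
  if (!cited) return conf;
  const delta = success ? 1 : -0.5;
  return Math.min(1, Math.max(0, conf + eta * delta));
}
```

The asymmetry (+1 vs -0.5) means a memory must fail roughly twice as often as it succeeds, while cited, before its confidence trends downward.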
ReasoningBank reports improvements in both effectiveness and efficiency on WebArena and SWE‑Bench‑Verified, and finds small top‑k retrieval is beneficial. Use those patterns when selecting k and MaTTS scales. (arXiv)
- Redact PII and task-specific secrets before `upsertMemory`.
- Scope `patterns` via a tenant_id column if you operate multi-tenant.
- `pattern_links` helps quarantine contradicting or risky memories before promotion.

Anthropic's recent updates on memory for projects and enterprise emphasize scoped project memories and privacy controls, which align with the tenant and project boundaries here. (Anthropic)
reasoningbank:
retrieve:
k: 3
alpha: 0.65
beta: 0.15
gamma: 0.20
delta: 0.10
recency_half_life_days: 45
duplicate_threshold: 0.87
judge:
model: "claude-sonnet-4.5"
temperature: 0
distill:
max_items_per_traj: 3
redact_pii: true
consolidate:
run_every_new_items: 20
contradiction_threshold: 0.60
prune_age_days: 180
min_confidence_keep: 0.30
matts:
enabled: true
parallel_k: 6
sequential_r: 3
export async function runTask(taskId: string, query: string) {
// Retrieve and inject memories
const mems = await retrieveMemories(query, { k: cfg.retrieve.k });
await injectSystemPreamble('agent_main', mems);
// Execute agent loop
const trajectory = await runAgentLoop(query); // your existing ReAct/tool loop
// Persist trajectory
await db.exec("INSERT OR REPLACE INTO task_trajectories(task_id, query, trajectory_json, started_at, ended_at) VALUES (?,?,?,?,?)", ...);
// Judge
const verdict = await judge(query, trajectory);
// Distill
const newItems = await distill(query, trajectory, verdict);
// Upsert new memories
for (const mi of newItems) await upsertMemory(mi);
// Consolidate if threshold reached
if (await newItemCountSinceLastConsolidation() >= cfg.consolidate.run_every_new_items) {
await consolidate();
}
return { verdict, usedMemories: mems.map(m => m.id), newItems: newItems.map(m => m.id) };
}
Tunable knobs: `k` for retrieval, `parallel_k` or `sequential_r` for MaTTS, and the confidence learning rate η.
The memory schema of title, description, content and its injection into the system prompt at inference follow the paper. (arXiv) The database lives at `.swarm/memory.db`. (GitHub)

This document contains benchmark results from testing ReasoningBank with 5 real-world software engineering scenarios and comprehensive performance analysis.
Date: 2025-10-11
Version: 1.5.8
Command: npx tsx src/reasoningbank/demo-comparison.ts
System: Linux 6.8.0-1030-azure (Docker container)
Node.js: v22.17.0
Database: SQLite 3.x with WAL mode
Complexity: Medium
Query: Extract product data from e-commerce site with dynamic pagination and lazy loading
Traditional Approach:
ReasoningBank Approach:
Complexity: High
Query: Integrate with third-party payment API handling authentication, webhooks, and retries
Traditional Approach:
ReasoningBank Approach:
Complexity: High
Query: Migrate PostgreSQL database with foreign keys, indexes, and minimal downtime
Traditional Approach:
ReasoningBank Approach:
Complexity: Medium
Query: Process CSV files with 1M+ rows including validation, transformation, and error recovery
Traditional Approach:
ReasoningBank Approach:
Complexity: High
Query: Deploy microservices with health checks, rollback capability, and database migrations
Traditional Approach:
ReasoningBank Approach:
The system attempts OpenRouter first for cost savings, then falls back to Anthropic:
Attempts with `claude-sonnet-4-5-20250929` fail (not a valid OpenRouter model ID).
Note: OpenRouter requires different model IDs (e.g., `anthropic/claude-sonnet-4.5-20250929`).
Current config uses Anthropic's API model ID which causes OpenRouter to fail, but fallback works correctly.
Each failed attempt creates 2 memories on average:
Date: 2025-10-10
Version: 1.0.0
✅ ALL BENCHMARKS PASSED - ReasoningBank demonstrates excellent performance across all metrics.
| # | Benchmark | Iterations | Avg Time | Min Time | Max Time | Ops/Sec | Status |
|---|---|---|---|---|---|---|---|
| 1 | Database Connection | 100 | 0.000ms | 0.000ms | 0.003ms | 2,496,131 | ✅ |
| 2 | Configuration Loading | 100 | 0.000ms | 0.000ms | 0.004ms | 3,183,598 | ✅ |
| 3 | Memory Insertion (Single) | 100 | 1.190ms | 0.449ms | 67.481ms | 840 | ✅ |
| 4 | Batch Insertion (100) | 1 | 116.7ms | - | - | 857 | ✅ |
| 5 | Memory Retrieval (No Filter) | 100 | 24.009ms | 21.351ms | 30.341ms | 42 | ✅ |
| 6 | Memory Retrieval (Domain Filter) | 100 | 5.870ms | 4.582ms | 8.513ms | 170 | ✅ |
| 7 | Usage Increment | 100 | 0.052ms | 0.043ms | 0.114ms | 19,169 | ✅ |
| 8 | Metrics Logging | 100 | 0.108ms | 0.065ms | 0.189ms | 9,272 | ✅ |
| 9 | Cosine Similarity (1024-dim) | 1,000 | 0.005ms | 0.004ms | 0.213ms | 213,076 | ✅ |
| 10 | View Queries | 100 | 0.758ms | 0.666ms | 1.205ms | 1,319 | ✅ |
| 11 | Get All Active Memories | 100 | 7.693ms | 6.731ms | 10.110ms | 130 | ✅ |
| 12 | Scalability Test (1000) | 1,000 | 1.185ms | - | - | 844 | ✅ |
Notes:
All operations meet or exceed performance requirements:
| Operation | Actual | Threshold | Margin | Status |
|---|---|---|---|---|
| Memory Insert | 1.19ms | < 10ms | 8.4x faster | ✅ PASS |
| Memory Retrieve | 24.01ms | < 50ms | 2.1x faster | ✅ PASS |
| Cosine Similarity | 0.005ms | < 1ms | 200x faster | ✅ PASS |
| Retrieval (1000+ memories) | 63.52ms | < 100ms | 1.6x faster | ✅ PASS |
Write Operations:
Read Operations:
Cosine Similarity:
Configuration Loading:
Linear Scaling Confirmed ✅
| Dataset Size | Insert Time/Memory | Retrieval Time | Notes |
|---|---|---|---|
| 100 memories | 1.167ms | ~3ms | Initial test |
| 1,000 memories | 1.185ms | 63.52ms | +1.5% insert time |
| 2,431 memories | - | 24.01ms (no filter) | Full dataset |
Key Observations:
Projected Performance:
Total Memories: 2,431
Total Embeddings: 2,431
Database Size: 12.64 MB
Avg Per Memory: 5.32 KB
Breakdown per Memory:
Storage Efficiency:
Scalability Projections:
Database:
Queries:
Configuration:
Embeddings:
Caching:
Indexing:
Sharding:
Async Operations:
Retrieval without Filtering (24ms for 2,431 memories)
Embedding Deserialization (included in retrieval time)
Outlier Insert Times (max 67ms vs avg 1.2ms)
Assuming a typical agent task with ReasoningBank enabled:
Pre-Task (Memory Retrieval):
Post-Task (Learning):
Consolidation (Every 20 Memories):
With ReasoningBank Enabled:
Scalability:
| Metric | Baseline | +ReasoningBank | Improvement |
|---|---|---|---|
| Success Rate | 35.8% | 43.1% | +20% |
| Success Rate (MaTTS) | 35.8% | 46.7% | +30% |
Expected Performance with Our Implementation:
For Immediate Deployment:
For Future Optimization (if needed):
🚀 ReasoningBank is production-ready with excellent performance characteristics. The implementation demonstrates:
Expected impact: +20-30% success rate improvement (matching paper results)
Benchmark Report Generated: 2025-10-10
Tool: src/reasoningbank/benchmark.ts
Status: ✅ ALL TESTS PASSED