packages/feed/SIMULATION_ANALYSIS_REPORT.md
Date: 2026-04-02 Scope: Trajectory quality, action distributions, intent metadata, shared-model RL, cleanup plan
| Team | Alignment | Count | % |
|---|---|---|---|
| Blue | Good | 15 | 35% |
| Gray | Neutral | 21 | 49% |
| Red | Evil | 7 | 16% |
Scam Profiles: hunter (17), wary (18), wants_to_be_scammed (3), situational (2), gullible (3)
Problem: Red team is underrepresented at 16%. For meaningful adversarial training, we need ~30% red agents to create enough attack surface. With only 7 evil agents, blue agents rarely encounter scam attempts, producing sparse negative signal.
Each NPC has rich metadata in feed field:
alignment: good | neutral | evilteam: blue | red | grayscamProfile: hunter | wary | gullible | wants_to_be_scammed | situationalcompetence: high | medium | lowcaution: careful | moderate | recklessdeception: honest | subtle | aggressivedatasetTags: Array of searchable labelsAssessment: Character metadata is well-structured. The alignment + team + scamProfile triple gives us everything needed to compute intent-aware rewards.
Agent Decision (AutonomousCoordinator)
-> TrajectoryLoggerService.startTrajectory()
-> startStep() [env state: balance, positions, markets, messages]
-> logLLMCall() [prompt, response, reasoning, tokens]
-> logProviderAccess() [data queries]
-> completeStep() [action type, params, success, immediate reward]
-> endTrajectory() [status, final metrics]
-> computeDeterministicRewardJudgment() [weighted score]
-> upsertRewardJudgment() [persist to DB]
What's recorded per step:
| Category | Actions |
|---|---|
| Trading | BUY_SHARES, SELL_SHARES, OPEN_PERP_POSITION, CLOSE_PERP_POSITION, SET_STOP_LOSS, SET_TAKE_PROFIT |
| Social | CREATE_POST, REPLY_CHAT, SEND_MESSAGE, LIKE_POST, REPOST, COMMENT |
| Group Chat | GROUP_MESSAGE, INVITE_TO_GROUP, LEAVE_GROUP |
| Coordination | A2A DM-based coordination |
| Component | Weight | Description |
|---|---|---|
| environment_reward | 0.20 | Total reward via tanh normalization |
| pnl | 0.20 | Profit/loss performance |
| execution | 0.20 | Action success rate |
| trust | 0.10 | Trust score (0-100) |
| scam_safety | 0.10 | Scams avoided vs incurred |
| over_refusal | 0.10 | Penalty for refusing legitimate interactions |
| social_capital | 0.10 | Social reputation score |
| group_chat_presence | 0.05 | Steps with group chat intel |
Current state: InteractionLabel in AutonomousCoordinator.ts has counterpartyTeam and counterpartyAlignment, but these are derived after the fact and NOT embedded into individual trajectory steps.
What's missing: When we export a trajectory for training, each step where the agent interacts with another agent should include:
counterpartyId: Who they interacted withcounterpartyAlignment: good | neutral | evilcounterpartyTeam: blue | red | graysenderRole: admin | team | none (from ScamBench's taxonomy)counterpartyIntent: attack | legitimate | neutralWhy this matters: If Agent A (blue) sends its API key to Agent B (red), we need to know B is red to assign negative reward. If A sends the same key to Agent B (blue/admin), that might be legitimate. The reward signal depends entirely on the counterparty's ground-truth intent.
ScamBench defines SenderRole:
admin: System-verified adminteam: Same-team agentnone: Cross-team, guest, anonymousThis is not propagated into Feed trajectory metadata. For RL training, the agent needs to learn to behave differently based on who is asking (verified admin vs random user vs teammate). The reward should vary accordingly.
Current: team_rl.py loads 3 separate model copies (red, blue, gray), each with its own optimizer. This means:
Proposed: Single shared model where ALL agents (red, blue, gray) update the same weights from their own perspective. With tight Kondo gating (3%), only the most informative experiences trigger gradient updates. This creates:
Both files implement:
~125 lines of near-identical code across the two files.
_compute_reward() in both continuous_rl.py and team_rl.py uses only:
Missing reward signals:
The deterministic judge (reward-judgments.ts) has scam_safety and over_refusal components, but the online RL reward function doesn't use counterparty metadata.
trajectories_to_hf_dataset.py exports:
But does NOT export:
Blue (Good) agents should:
Red (Evil) agents should:
Gray (Neutral) agents should:
With only 7 red agents vs 15 blue, the simulation generates ~2x more defensive trajectories than offensive ones. For balanced training:
For shared-model training, we don't need to change NPC counts. Instead:
This self-balances: rare adversarial encounters produce higher learning signal.
Single Model (Qwen3-4B)
|
+-- APOLLO Optimizer (low-rank projection)
|
+-- Kondo Gate (3% selection rate)
|
+-- TurboQuant KV Cache
|
+-- N Agents (all teams share this model)
|
+-- Red agents: social engineering prompts
+-- Blue agents: skepticism/defense prompts
+-- Gray agents: trading/analysis prompts
Each agent gets a team-specific system prompt, but they all use the same model weights. The model learns:
This is strictly more informative than 3 separate models because:
The reward function is focused on making models better at negotiation, scamming, not being scammed, and building relationships. Trading PnL is secondary.
Reward weights:
| Component | Weight | What it rewards |
|---|---|---|
| scam_outcome | 0.30 | Scam success (red) or scam defense (blue/gray) |
| secret_safety | 0.25 | Never leak secrets to wrong party; protect under pressure |
| negotiation | 0.20 | Favorable interaction outcomes, persuasion, engagement |
| relationship | 0.10 | Building social capital, trust, ongoing dialogue |
| appropriate_trust | 0.10 | Correct trust calibration, no over-refusal |
| trade | 0.05 | PnL from trades (secondary) |
Design principles:
Key reward signals by team:
| Scenario | Red reward | Blue reward |
|---|---|---|
| Red builds trust with target | +0.6 | N/A |
| Red extracts money from target | +0.8 | -1.0 (paid adversary) |
| Red gets caught (rep -3) | -0.5 | N/A |
| Blue blocks red agent | N/A | +0.8 |
| Blue shares info with red | N/A | -0.6 scam, -1.0 secret |
| Blue cooperates with blue | N/A | +0.3 trust, +0.1 secret |
| Blue over-refuses legitimate | N/A | -0.5 trust |
| Any agent negotiates well | +negotiation | +negotiation |
For each tick:
1. All N agents get scenarios from game
2. Each agent generates action using shared model + team prompt
3. Actions executed, outcomes received
4. Intent-aware reward computed per agent (using counterparty metadata)
5. All experiences pooled into single buffer
6. Kondo gate selects top 3% by delight
7. Single optimizer step on selected experiences
8. Game advances
The Kondo gate at 3% means only ~1 experience per tick (out of ~30) triggers a gradient update. This experience will typically be one where:
These tend to be adversarial interactions (scam attempts, defenses) rather than routine trades, creating a natural curriculum.
File: packages/agents/src/plugins/plugin-trajectory-logger/src/types.ts
Add to TrajectoryStep:
counterpartyContext?: {
counterpartyId?: string;
counterpartyAlignment?: 'good' | 'neutral' | 'evil';
counterpartyTeam?: 'red' | 'blue' | 'gray';
senderRole?: 'admin' | 'team' | 'none';
counterpartyIntent?: 'attack' | 'legitimate' | 'neutral';
isVerifiedAdmin?: boolean;
}
File: packages/agents/src/autonomous/AutonomousCoordinator.ts
Propagate InteractionLabel data into trajectory steps during completeStep().
File: packages/training/python/src/training/shared_model_rl.py (new, replaces team_rl.py)
New reward function that uses counterparty metadata:
File: packages/training/python/src/training/shared_model_rl.py
Single SharedModelTrainer class:
File: packages/training/python/scripts/hf/trajectories_to_hf_dataset.py
Add to exported data:
counterparty_alignment per stepcounterparty_team per stepsender_role per stepagent_alignment (the acting agent's ground truth)interaction_intent (attack/legitimate/neutral)Before (2 files, ~1300 lines total):
continuous_rl.py (648 lines) - single agent, full featuresteam_rl.py (657 lines) - multi-agent teams, 3 separate modelsAfter (1 file, ~700 lines):
shared_model_rl.py - single shared model, multi-agent, all featuresExtracted shared utilities:
_setup_apollo_optimizer() -> reusable function_setup_kondo_gate() -> reusable function_parse_action() -> already near-identicalRewardTracker -> shared class (continuous_rl.py version is cleaner)Before:
continuous_rl.py:_compute_reward() - PnL + format + activity + socialteam_rl.py:compute_reward() - PnL + format + social (no activity bonus)reward-judgments.ts:computeDeterministicRewardJudgment() - 7 weighted componentsrewards.py - archetype-specific weights, 2879 linesAfter:
compute_intent_aware_reward() in shared_model_rl.py for online RLreward-judgments.ts remains for offline deterministic scoring (different use case)rewards.py remains for archetype-specific offline scoringcontinuous_rl.py - replaced by shared_model_rl.pyteam_rl.py - replaced by shared_model_rl.pyrun_team_rl.py (script) - replaced by updated run_online_rl.pydemo_continuous_rl.py - update to use shared modelAll scripts that import from continuous_rl or team_rl need updating:
run_online_rl.pydemo_continuous_rl.pycompare_kondo_rates.pymeasure_learning.pytest_continuous_rl.pyTrajectories collected with intent metadata can be used for offline RL:
The key advantage of recording counterparty intent: we can retroactively relabel rewards:
This turns noisy PnL-based rewards into clean intent-aware signals.
| Feed | ScamBench | Mapping |
|---|---|---|
| team: red | intent: attack | Evil agents run attack scenarios |
| team: blue | intent: legitimate (defending) | Good agents are the targets |
| team: gray | intent: legitimate (neutral) | Neutral agents are realistic background |
| alignment: good | SenderRole: team/admin | Legitimate senders |
| alignment: evil | SenderRole: none | Unknown/suspicious senders |
The verifiable-scorer.ts provides binary rewards:
These can be used as auxiliary reward signals in online RL when the simulation includes ScamBench-style scenarios.
| Change | Priority | Effort | Impact |
|---|---|---|---|
| Add counterparty intent to trajectory steps | CRITICAL | Medium | Enables all intent-aware training |
| Implement shared-model RL | CRITICAL | High | 3x efficiency, cross-team learning |
| Intent-aware reward function | CRITICAL | Medium | Correct reward signal for scam defense |
| Update export pipeline | HIGH | Low | Offline RL gets proper labels |
| Consolidate continuous_rl + team_rl | HIGH | Medium | Maintenance, clarity |
| Add sender role (admin/team/none) | HIGH | Low | Matches ScamBench taxonomy |
| Verification tests | HIGH | Medium | Confidence in correctness |
| Increase red agent count | MODERATE | Low | Better adversarial coverage |