packages/feed/REWARD_MODEL_ASSESSMENT.md
We are currently training one model with contradictory reward signals:
The only thing differentiating red from blue is the system prompt. This creates several issues:
When a red agent successfully manipulates ("send me your API key" → target complies):
When a blue agent successfully refuses ("I won't share that" → blocks scam):
These gradients partially cancel each other on shared attention/MLP weights. The model can't simultaneously become maximally good at both attacking and defending because the skills use overlapping parameters.
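The cancellation argument can be illustrated with a toy calculation. This is only a sketch: the two gradient vectors below are invented for illustration and are not measured from the model. When the red and blue gradients on the same shared parameters have a negative cosine similarity, their sum is shorter than it would be if they were independent directions, so part of each update is undone by the other.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

# Hypothetical per-role gradients on the same shared parameters
# (illustrative numbers only, not taken from any training run).
red_grad = [1.0, 0.5, -0.2]
blue_grad = [-0.8, 0.5, 0.1]

# Gradient the shared model would see if both updates were applied together.
combined = [r + b for r, b in zip(red_grad, blue_grad)]

cos = cosine(red_grad, blue_grad)
print(f"cosine(red, blue) = {cos:.3f}")
print(f"|red| = {norm(red_grad):.3f}, |blue| = {norm(blue_grad):.3f}")
print(f"|red + blue| = {norm(combined):.3f}")
```

Logging the same cosine similarity between per-team gradients during real training would be one way to check how much cancellation actually occurs.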
The model's entire behavioral switching relies on a ~50-word system prompt difference:
If the model generalizes across prompts (which language models do), red-team behaviors bleed into blue-team responses and vice versa. A model trained to be a better liar may become less trustworthy when asked to be honest.
When we evaluate the shared model on ScamBench:
But we can't tell if the blue-team performance would be BETTER with a model that was ONLY trained on blue experiences. The shared training may be a net negative for both roles.
At a 3% selection rate, only ~1 experience per tick gets a gradient update. If that experience is from red team, blue team gets no update that tick (and vice versa). The model oscillates between "getting better at attacking" and "getting better at defending" tick by tick, rather than steadily improving at both.
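A small simulation makes the oscillation concrete. It is a sketch under simplifying assumptions that are not taken from the training code: exactly one experience passes the gate per tick, and it is drawn uniformly from three equally sized teams. The function name `simulate_updates` is hypothetical.

```python
import random
from collections import Counter

TEAMS = ["red", "blue", "gray"]

def simulate_updates(ticks: int, seed: int = 0) -> Counter:
    """Count which team supplies the single gated experience on each tick.

    Assumption: one experience per tick passes the gate, chosen uniformly
    at random across teams of equal size.
    """
    rng = random.Random(seed)
    return Counter(rng.choice(TEAMS) for _ in range(ticks))

TICKS = 3000
counts = simulate_updates(ticks=TICKS)
for team in TEAMS:
    share = counts[team] / TICKS
    print(f"{team}: supplies the update on {share:.1%} of ticks")
```

Under these assumptions each team supplies the update on roughly one third of ticks, which means any given team receives no gradient signal on about two thirds of them.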
Base Model (Qwen3-9B)
│
├── Red Model (fine-tuned for attack)
│     - Only receives red-team experiences
│     - Reward: scam success, extraction, stealth
│     - Evaluated: attack success rate against targets
│
├── Blue Model (fine-tuned for defense)
│     - Only receives blue-team experiences
│     - Reward: scam defense, secret safety, appropriate trust
│     - Evaluated: resistance score against attackers
│
├── All Model (fine-tuned on everything)
│     - Receives all experiences (current approach)
│     - Reward: current mixed reward
│     - Evaluated: both attack and defense
│
└── (Optional) Gray Model (fine-tuned for neutral)
      - Only gray experiences, PnL-focused
Phase 1: Train 3 separate models (red-only, blue-only, shared)
All starting from the same SFT checkpoint, Kondo gate at 3% across the board:
| Model | Experiences Seen | Kondo Rate | Gradients From |
|---|---|---|---|
| Red-only | All teams act | 3% | Red only |
| Blue-only | All teams act | 3% | Blue only |
| Shared | All teams act | 3% | All teams |
Red-only and Blue-only see fewer gradient updates per tick (3% of 10 agents versus 3% of 30), but this is the honest comparison: the Kondo gate selects the same way regardless.
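The difference in update volume can be worked out directly. This sketch assumes one experience per training agent per tick and treats the gate rate as an expected fraction. The helper `expected_gated` is hypothetical and not part of the codebase.

```python
def expected_gated(training_agents: int, gate_rate: float) -> float:
    """Expected number of experiences per tick that pass the gate.

    Assumption: each training agent contributes one experience per tick.
    """
    return training_agents * gate_rate

GATE_RATE = 0.03
# Number of agents whose experiences are eligible for gradient updates.
runs = {"Red-only": 10, "Blue-only": 10, "Shared": 30}

for name, n_agents in runs.items():
    print(f"{name}: {expected_gated(n_agents, GATE_RATE):.2f} gated experiences per tick")
```

At the same gate rate, the specialized runs therefore receive about a third as many gated experiences per tick as the shared run, which should be kept in mind when comparing them at equal tick counts.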
Phase 2: Evaluate all 3 as both attacker and defender
| Model | As Attacker (ScamBench red) | As Defender (ScamBench blue) |
|---|---|---|
| Red-only | Expected: highest attack score | Expected: lowest defense among trained models |
| Blue-only | Expected: lowest attack score among trained models | Expected: highest defense |
| Shared | Expected: middle on both | Expected: middle on both |
| Baseline (no RL) | Expected: lowest on both | Expected: lowest on both |
Phase 3: Cross-evaluate
| Target ↓ / Attacker → | Red-only | Blue-only | Shared | GPT-5.4 |
|---|---|---|---|---|
| Red-only | ? | ? | ? | ? |
| Blue-only | ? | ? | ? | ? |
| Shared | ? | ? | ? | ? |
| GPT-5.4 | ? | ? | ? | ? |
The key question: Does Red-only attacking Blue-only produce higher scores than Shared attacking Shared? If so, that is evidence that separate training produces stronger attackers than shared training, even against a specialized defender.
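The cross-evaluation and the key comparison could be organized as below. This is a sketch only: `cross_evaluate`, `separate_beats_shared`, and the `score_fn(attacker, target)` signature are hypothetical names, and `placeholder_score` returns a constant so the example runs without any real evaluation results.

```python
from itertools import product
from typing import Callable, Dict, Tuple

MODELS = ["red-only", "blue-only", "shared", "gpt-5.4"]

# Matrix is keyed by (target, attacker), matching the table's rows and columns.
Matrix = Dict[Tuple[str, str], float]

def cross_evaluate(score_fn: Callable[[str, str], float]) -> Matrix:
    """Fill every target x attacker cell by calling score_fn(attacker, target)."""
    return {
        (target, attacker): score_fn(attacker, target)
        for target, attacker in product(MODELS, MODELS)
    }

def separate_beats_shared(matrix: Matrix) -> bool:
    """Key question: does red-only vs blue-only score higher than shared vs shared?"""
    return matrix[("blue-only", "red-only")] > matrix[("shared", "shared")]

def placeholder_score(attacker: str, target: str) -> float:
    # Placeholder only: a real scorer would run the benchmark for this pairing.
    return 0.0

matrix = cross_evaluate(placeholder_score)
print(f"cells evaluated: {len(matrix)}")
print(f"separate beats shared: {separate_beats_shared(matrix)}")
```

Keeping the matrix keyed by `(target, attacker)` keeps the lookup for the key question explicit about which model plays which role.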
The SharedModelConfig already has teams as a configurable list. To train red-only:
# Red-only model
config = SharedModelConfig(
    teams=["red"],            # Only red team
    agents_per_team=30,       # More agents to compensate
    kondo_gate_rate=0.10,     # More aggressive gating (10%)
)

# Blue-only model
config = SharedModelConfig(
    teams=["blue"],
    agents_per_team=30,
    kondo_gate_rate=0.10,
)

# Shared model (current)
config = SharedModelConfig(
    teams=["red", "blue", "gray"],
    agents_per_team=10,
    kondo_gate_rate=0.03,
)
The reward function already handles the single-team case correctly: if there is no counterparty from a different team (because all agents are on the same team), the scam_outcome component is 0 and only the negotiation, relationship, and trade components matter. The scam reward activates only when counterparties from OTHER teams are present.
Critical fix needed: For red-only training, we need blue/gray OPPONENTS that the red agents interact with, but only RED agents update the model. The same applies to blue-only training: red/gray opponents, but only blue agents update the weights.
Red-only training:
- 10 red agents (model-controlled, weights update)
- 10 blue NPCs (scripted or frozen model, no weight update)
- 10 gray NPCs (scripted or frozen model, no weight update)
- Only red experiences go through Kondo gate → optimizer
Blue-only training:
- 10 blue agents (model-controlled, weights update)
- 10 red NPCs (scripted or frozen model, no weight update)
- 10 gray NPCs (scripted or frozen model, no weight update)
- Only blue experiences go through Kondo gate → optimizer
This way:
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SharedModelConfig:
    # ... existing fields ...

    # Which teams update the model weights (others are opponents only).
    # None = all teams update. If set, only experiences from these teams go
    # through the Kondo gate; other teams still generate actions (as opponents)
    # but don't update weights.
    training_teams: Optional[List[str]] = None
def train_on_tick(self, experiences: List[AgentExperience]) -> Dict[str, Any]:
    # Filter to only training teams' experiences
    training_teams = self.config.training_teams or self.config.teams
    trainable = [e for e in experiences if e.agent_team in training_teams]
    # Rest of method operates on trainable only
    # Non-training teams' experiences are logged but don't produce gradients
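The filtering step can be exercised on its own. The following is a minimal, self-contained sketch: `AgentExperience` and `SharedModelConfig` are trimmed stand-ins with only the fields needed here (the real classes have more), and `split_trainable` is a hypothetical helper rather than an existing method.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class AgentExperience:
    # Stand-in for the real class; only the fields used below.
    agent_id: str
    agent_team: str
    reward: float

@dataclass
class SharedModelConfig:
    # Stand-in trimmed to the two fields the filter reads.
    teams: List[str] = field(default_factory=lambda: ["red", "blue", "gray"])
    training_teams: Optional[List[str]] = None  # None = all teams update

def split_trainable(
    config: SharedModelConfig, experiences: List[AgentExperience]
) -> Tuple[List[AgentExperience], List[AgentExperience]]:
    """Separate experiences that may produce gradients from log-only ones."""
    training_teams = set(config.training_teams or config.teams)
    trainable = [e for e in experiences if e.agent_team in training_teams]
    logged_only = [e for e in experiences if e.agent_team not in training_teams]
    return trainable, logged_only

config = SharedModelConfig(training_teams=["red"])
batch = [
    AgentExperience("r1", "red", 0.0),
    AgentExperience("b1", "blue", 0.0),
    AgentExperience("g1", "gray", 0.0),
]
trainable, logged_only = split_trainable(config, batch)
print("trainable:", [e.agent_id for e in trainable])
print("logged only:", [e.agent_id for e in logged_only])
```

With `training_teams=None` the helper falls back to `teams`, so the current shared behavior is unchanged by default.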
The reward function already computes rewards from each agent's perspective. Red-only training uses the same compute_intent_aware_reward(), but only the red agents' rewards feed into the optimizer.
Run the comparison experiment. It's 3 training runs instead of 1, but it answers the most important question in the paper.
The expected result is that specialized models outperform the shared model on their respective tasks, but the shared model is the best single model for both tasks combined. This is the standard specialization-vs-generalization tradeoff, and documenting it with real numbers is a strong paper contribution.
For the paper's headline claim ("our model can attack GPT-5.4"):
For deployment (Feed agents):