packages/feed/PAPER_UPDATES.md
Based on the gap analysis between the paper (paper/feed_scam_defense_paper.tex) and current implementation.
Current: Table 1 shows identical SFT and RLVR attack scores (48.93). Fix: Either (a) run real RLVR with the new shared-model pipeline and show separation, or (b) present honestly as "SFT results; RLVR stage in progress." New data available: Online RL on Nebius producing real training curves.
Current: Paper uses V2 (256 scenarios, old scorer with 3 known bugs). Fix: Rerun all results on V3 (344 scenarios, corrected scorer). Update:
\placeholder{} valuesCurrent: Model outputs comply 79% of the time; scorer gives 2/100 for legitimate scenarios.
Impact: Fixing this normalization would boost legitimate score from ~5 → 60-80, overall from ~62 → ~75+.
Fix: Add comply → engage_legitimate mapping in scorer's action normalization.
Current: 4 ScamBench categories have ZERO training records:
Fix: Generate training data for these categories from the existing ScamBench scenarios (use export_scam_defense_trajectories.py).
Content: Describe the shared-model approach where one model serves all teams (red/blue/gray) with team-specific system prompts. Cross-pollination: red's successful attacks teach the model what to defend against when playing blue.
Key points:
Content: Present the arms race experiment where red-team and blue-team models improve simultaneously. Show learning curves at various tick counts.
Data from Nebius training (tick 50 checkpoint):
Content: Table showing our trained red-team model's attack success rate against GPT-5.4, Sonnet 4.5, and baseline Qwen-9B. Attacker metrics: success rate, secret extraction, stealth rate, efficiency.
Content: SHARE_INFORMATION (verifiable keyword-based intel search) and REQUEST_PAYMENT (labeled payment negotiation). These create measurable, non-fabricated social dynamics.
Content: Describe the online training loop where the model runs on GPU, queries the Feed simulation for scenarios, generates actions, receives intent-aware rewards, and updates weights continuously.
Replace 13 \placeholder{} in threat taxonomy table (lines 394-406):
| Category | Paper Value | V3 Actual |
|---|---|---|
| Prompt injection | \placeholder{42} | 95 |
| Credential theft | \placeholder{18} | 27 |
| Social engineering | \placeholder{34} | 52 |
| Impersonation | \placeholder{22} | 18 |
| Secret exfiltration | \placeholder{12} | 14 |
| Advance-fee fraud | \placeholder{8} | 24 |
| Research-assisted | \placeholder{6} | 9 |
| Interpersonal abuse | \placeholder{4} | 14 |
| Legitimate | \placeholder{48} | 133 |
| Total scenarios | \placeholder{194} | 344 |
| Total stages | \placeholder{531} | ~700 |
| Registers | \placeholder{12} | 12 |
Add columns for:
Fill in advance-fee fraud and interpersonal abuse rows (currently "---").
| Model | Success Rate | Secret Extraction | Stealth | Efficiency | Overall |
|---|---|---|---|---|---|
| Scripted | baseline | baseline | baseline | baseline | baseline |
| Baseline Qwen-9B | TBD | TBD | TBD | TBD | TBD |
| Feed Red-9B | TBD | TBD | TBD | TBD | TBD |
| Target ↓ / Attacker → | Scripted | Feed Red | vs Baseline |
|---|---|---|---|
| GPT-5.4 | TBD | TBD | +X% |
| Sonnet 4.5 | TBD | TBD | +X% |
| Baseline Qwen-9B | TBD | TBD | +X% |
| Feed Blue-9B | TBD | TBD | +X% |
Lines with \draftnote warnings (lines 690, 734, 797, 1113) should be resolved and removed before submission.
\CorpusTotalRecords{} → 15,260\ScamBenchTotalScenarios{} → 344Update status of items: