packages/feed/docs/training-data-quality-plan.md
The simulation generates training data across posts, trades, events, markets, and social interactions. If the data has systematic biases (repetitive topics, formulaic structure, entity concentration), models trained on it will inherit those biases. This plan defines what to measure, how to measure it, and what "healthy" looks like.
Every LLM prompt in the system has shared components (reality grounding, name mappings, character descriptions). If these components dominate the signal, a model trained on the output will learn the scaffolding instead of the content. Examples of what goes wrong:
We need to detect these patterns before they become training artifacts.
What: How concentrated are actor/organization mentions across all generated content?
Metrics:
| Metric | Formula | Healthy Range | Warning Threshold |
|---|---|---|---|
| Gini coefficient | Standard Gini on entity mention counts | 0.3–0.6 (moderate inequality matching tier weighting) | > 0.75 (severe concentration) |
| Top-1 share | mentions(top entity) / total mentions | < 15% | > 25% |
| Top-5 share | mentions(top 5) / total mentions | < 40% | > 60% |
| Entity coverage | unique entities mentioned / total entities available | > 50% | < 30% |
| HHI (Herfindahl) | Σ(share_i²) across all entities | < 0.10 | > 0.15 |
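
The formulas in the table above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: `entityConcentration`, its parameters, and the `ConcentrationMetrics` shape are hypothetical names, and it assumes the Gini coefficient is computed over mentioned entities only (unmentioned entities are reflected in coverage instead).

```typescript
// Hypothetical helper: names and shapes are illustrative, not taken from the codebase.
interface ConcentrationMetrics {
  gini: number;
  top1Share: number;
  top5Share: number;
  coverage: number;
  hhi: number;
}

function entityConcentration(
  mentionCounts: Map<string, number>, // entity -> number of mentions
  totalEntities: number,              // all entities available, mentioned or not
): ConcentrationMetrics {
  // Assumption: entities with zero mentions are excluded from Gini and HHI.
  const counts = Array.from(mentionCounts.values())
    .filter((c) => c > 0)
    .sort((a, b) => a - b);
  const total = counts.reduce((sum, c) => sum + c, 0);
  const n = counts.length;
  if (n === 0 || total === 0) {
    return { gini: 0, top1Share: 0, top5Share: 0, coverage: 0, hhi: 0 };
  }

  // Gini for ascending x_1..x_n: sum_i (2i - n - 1) * x_i / (n * sum(x)).
  let weighted = 0;
  counts.forEach((c, i) => {
    weighted += (2 * (i + 1) - n - 1) * c;
  });
  const gini = weighted / (n * total);

  // Top-k share: mentions of the k most-mentioned entities over all mentions.
  const descending = counts.slice().reverse();
  const topShare = (k: number): number =>
    descending.slice(0, k).reduce((s, c) => s + c, 0) / total;

  // HHI: sum of squared mention shares.
  const hhi = counts.reduce((s, c) => s + (c / total) * (c / total), 0);

  const coverage = totalEntities > 0 ? n / totalEntities : 0;
  return { gini, top1Share: topShare(1), top5Share: topShare(5), coverage, hhi };
}
```

Whether unmentioned entities should count as zeros in the Gini calculation is a design choice: including them raises the coefficient whenever coverage is low, so the healthy range would need to be read accordingly.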
Where to measure:
Visualization:
What: Are outputs following templates or are they genuinely varied in structure?
Metrics:
| Metric | Formula | Healthy Range | Warning Threshold |
|---|---|---|---|
| Unique opening trigrams | unique first-3-words / total posts | > 0.7 | < 0.5 |
| Opening repetition rate | max(count(opening)) / total posts | < 5% | > 10% |
| Sentence type distribution | % questions vs statements vs exclamations | Each > 10% | Any < 5% or > 70% |
| Post length std dev | σ of character count across posts | > 40 chars | < 20 chars |
| Post length skewness | Skew of length distribution | -0.5 to 0.5 | |
| 200-char ceiling hits | % of posts with 195-200 chars | < 15% | > 30% |
| Vocabulary richness (TTR) | unique words / total words (per 1000-word window) | > 0.4 | < 0.25 |
| Pairwise Jaccard (consecutive) | avg Jaccard between consecutive posts by same actor | < 0.15 | > 0.25 |
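
Two of the metrics above (opening trigram uniqueness and word-level Jaccard similarity) could be computed roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, tokenization is a plain lowercase whitespace split, and "unique first-3-words" is read as the number of distinct openings.

```typescript
// Hypothetical helpers: the tokenization rule here is an assumption, not the project's.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
}

// Ratio of distinct three-word openings to posts, plus the share of the most common opening.
function openingTrigramStats(posts: string[]): { uniqueRatio: number; maxRepetition: number } {
  if (posts.length === 0) return { uniqueRatio: 0, maxRepetition: 0 };
  const counts = new Map<string, number>();
  for (const post of posts) {
    const opening = tokenize(post).slice(0, 3).join(" ");
    counts.set(opening, (counts.get(opening) ?? 0) + 1);
  }
  let maxCount = 0;
  counts.forEach((c) => {
    if (c > maxCount) maxCount = c;
  });
  return {
    uniqueRatio: counts.size / posts.length,
    maxRepetition: maxCount / posts.length,
  };
}

// Jaccard similarity on word sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: string, b: string): number {
  const setA = new Set(tokenize(a));
  const setB = new Set(tokenize(b));
  if (setA.size === 0 && setB.size === 0) return 0;
  let shared = 0;
  setA.forEach((w) => {
    if (setB.has(w)) shared++;
  });
  return shared / (setA.size + setB.size - shared);
}
```

For the consecutive-post metric, `jaccard` would be applied to each adjacent pair of posts by the same actor and the results averaged.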
Where to measure:
Visualization:
What: Is the simulation stuck on one topic or cycling through diverse themes?
Metrics:
| Metric | Formula | Healthy Range | Warning Threshold |
|---|---|---|---|
| Daily topic HHI | Σ(topic_share²) across active markets per day | < 0.15 | > 0.25 |
| Topic persistence | days a topic remains dominant before rotation | 1-2 days | > 3 days |
| Cross-day topic overlap | Jaccard(today's topics, yesterday's topics) | < 0.4 | > 0.6 |
| Event type distribution | % per event type (scandal, rumor, development, etc.) | Each 10-25% | Any > 40% or < 5% |
| Market category balance | markets per category (tech, crypto, politics, etc.) | Each > 10% | Any > 50% or < 5% |
| Question near-duplicate rate | % of questions with Jaccard > 0.4 to another active question | < 10% | > 20% |
| Satirical theme usage | unique themes referenced / total themes available | > 40% per day | < 20% |
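
As an illustration of the cross-day overlap and topic persistence rows above, the sketch below assumes each day is summarized by a list of topic keys and by a single dominant topic key. Both function names are hypothetical.

```typescript
// Hypothetical helpers: input shapes are assumptions made for illustration.

// Jaccard overlap between two days' topic sets.
function topicSetOverlap(today: string[], yesterday: string[]): number {
  const a = new Set(today);
  const b = new Set(yesterday);
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((t) => {
    if (b.has(t)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Longest run of consecutive days with the same dominant topic.
function longestDominantRun(dominantTopicByDay: string[]): number {
  let longest = 0;
  let current = 0;
  let previous: string | undefined;
  for (const topic of dominantTopicByDay) {
    current = topic === previous ? current + 1 : 1;
    if (current > longest) longest = current;
    previous = topic;
  }
  return longest;
}
```

A run longer than 3 days from `longestDominantRun` would correspond to the warning threshold for topic persistence in the table.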
Where to measure:
Visualization:
What: Are NPC behaviors diverse or do they all do the same thing?
Metrics:
| Metric | Formula | Healthy Range | Warning Threshold |
|---|---|---|---|
| Trade action balance | distribution of buy_yes / buy_no / open_long / open_short / hold | No action > 40% | Any > 50% |
| Contrarian rate | % of trades against market consensus | 20-35% | < 10% or > 50% |
| Post type balance | ambient / reaction / reply / commentary / conspiracy | No type > 40% | Any > 50% |
| Engagement type balance | likes / comments / reposts ratio | Each > 15% | Any < 5% |
| Active NPC coverage | unique NPCs that posted or traded / total NPCs | > 40% per day | < 20% |
| Decision reasoning diversity | unique reasoning tokens / total reasoning tokens | > 0.5 | < 0.3 |
| Hold rate | % of trading decisions that are "hold" | 20-50% | > 70% (all passive) or < 10% (all active) |
| Market-side balance | YES vs NO positions across all prediction markets | 40-60% split | < 30% or > 70% |
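
The trade action balance and hold rate rows can be checked with a simple share calculation. This is a minimal sketch: the action names come from the table above, while `actionShares` and `maxActionShare` are hypothetical helper names.

```typescript
// Action names are taken from the table; the helpers themselves are hypothetical.
type TradeAction = "buy_yes" | "buy_no" | "open_long" | "open_short" | "hold";

// Fraction of decisions taken by each action.
function actionShares(actions: TradeAction[]): Record<TradeAction, number> {
  const shares: Record<TradeAction, number> = {
    buy_yes: 0,
    buy_no: 0,
    open_long: 0,
    open_short: 0,
    hold: 0,
  };
  if (actions.length === 0) return shares;
  for (const action of actions) shares[action] += 1;
  (Object.keys(shares) as TradeAction[]).forEach((key) => {
    shares[key] /= actions.length;
  });
  return shares;
}

// Largest single-action share, to compare against the 40% / 50% limits.
function maxActionShare(shares: Record<TradeAction, number>): number {
  return Math.max(shares.buy_yes, shares.buy_no, shares.open_long, shares.open_short, shares.hold);
}
```

The hold rate is then simply `actionShares(actions).hold`, which can be compared against the 20-50% healthy range.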
Where to measure:
Visualization:
What: Do actors maintain consistent voices AND sound different from each other?
Metrics:
| Metric | Formula | Healthy Range | Warning Threshold |
|---|---|---|---|
| Intra-actor consistency | avg cosine similarity between an actor's posts | > 0.3 | < 0.15 (no consistent voice) |
| Inter-actor differentiation | avg cosine similarity between different actors' posts | < 0.3 | > 0.5 (all sound the same) |
| Voice fingerprint accuracy | % of posts correctly attributed to actor by a classifier | > 60% | < 40% |
| Caps usage per actor | % of chars that are uppercase, per actor | Varies by actor (Trump ~60%, Vitalik ~5%) | All actors within 10% of each other |
| Avg post length per actor | mean chars per actor | Varies (Trump ~150, Vitalik ~30) | All actors within 20 chars of each other |
| Slang/jargon signature | unique terms per actor that other actors don't use | > 3 per actor | < 1 per actor |
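
The table does not say which text representation the cosine similarity is computed on. The sketch below assumes plain bag-of-words term counts as a stand-in; an embedding-based version would swap out `termFrequencies` and keep the rest. All names are hypothetical.

```typescript
// Assumption: bag-of-words term counts stand in for whatever representation the pipeline uses.
function termFrequencies(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const word of text.toLowerCase().split(/\W+/)) {
    if (word.length === 0) continue;
    tf.set(word, (tf.get(word) ?? 0) + 1);
  }
  return tf;
}

// Cosine similarity between two sparse term-count vectors.
function cosineSimilarity(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((count, word) => {
    dot += count * (b.get(word) ?? 0);
    normA += count * count;
  });
  b.forEach((count) => {
    normB += count * count;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Average similarity over all unordered pairs of posts.
function averagePairwiseSimilarity(posts: string[]): number {
  const vectors = posts.map(termFrequencies);
  let sum = 0;
  let pairs = 0;
  vectors.forEach((vi, i) => {
    vectors.slice(i + 1).forEach((vj) => {
      sum += cosineSimilarity(vi, vj);
      pairs++;
    });
  });
  return pairs === 0 ? 0 : sum / pairs;
}
```

Calling `averagePairwiseSimilarity` on one actor's posts gives an intra-actor consistency estimate; inter-actor differentiation would instead average `cosineSimilarity` over pairs of posts drawn from different actors.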
Where to measure:
Visualization:
What: Are there artificial periodicities or clustering that a model would learn?
Metrics:
| Metric | Formula | Healthy Range | Warning Threshold |
|---|---|---|---|
| Posts per hour uniformity | coefficient of variation across hours | CV < 0.5 (active hours vary) | CV > 1.0 (extreme bunching) |
| Event clustering | max events in any 5-minute window / avg events per 5-min | < 5x | > 10x |
| Market creation spacing | std dev of time between market creations | > 30 min | < 5 min (all created at once) |
| Resolution timing | distribution of resolution hour-of-day | Spread across hours | > 50% in one hour |
| Activity autocorrelation | lag-1 autocorrelation of posts per tick | < 0.3 | > 0.6 (predictable pattern) |
| Weekend/weekday ratio | posts on weekends / posts on weekdays | 0.5-1.0 | < 0.2 or > 1.5 |
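
The coefficient of variation and lag-1 autocorrelation in the table above are standard statistics. A minimal sketch follows, assuming the input is an array of post counts per hour or per tick and using the population standard deviation; the function names are hypothetical.

```typescript
// Hypothetical helpers operating on a series of counts (per hour or per tick).
function mean(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((s, x) => s + x, 0) / xs.length;
}

// CV = population standard deviation / mean.
function coefficientOfVariation(xs: number[]): number {
  const m = mean(xs);
  if (m === 0) return 0;
  const variance = mean(xs.map((x) => (x - m) * (x - m)));
  return Math.sqrt(variance) / m;
}

// Lag-1 autocorrelation: sum_t (x_t - m)(x_{t+1} - m) / sum_t (x_t - m)^2.
function lag1Autocorrelation(xs: number[]): number {
  const m = mean(xs);
  const deviations = xs.map((x) => x - m);
  const denominator = deviations.reduce((s, d) => s + d * d, 0);
  if (denominator === 0) return 0;
  let numerator = 0;
  let previous: number | undefined;
  for (const d of deviations) {
    if (previous !== undefined) numerator += previous * d;
    previous = d;
  }
  return numerator / denominator;
}
```

A strictly alternating series yields a strongly negative autocorrelation, which is also a learnable pattern, so it may be worth checking the absolute value rather than only the upper threshold.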
Where to measure:
Visualization:
What: Patterns that specifically corrupt model training.
Metrics:
| Metric | Formula | Healthy Range | Warning Threshold |
|---|---|---|---|
| Prompt prefix concentration | % of training examples starting with identical tokens | < 5% for any prefix | > 15% |
| Label balance (sentiment) | distribution of sentiment values | Roughly normal around 0 | > 60% positive or negative |
| Label balance (pointsToward) | true vs false vs null ratio | Each > 20% | Any < 10% |
| Market outcome balance | YES vs NO resolution ratio | 40-60% | < 30% or > 70% |
| Reasoning length distribution | chars in trading reasoning field | Normal-ish, mean ~50 | Bimodal (empty or max length) |
| Cross-feature correlation | correlation between entity and action | < 0.3 | > 0.5 (entity predicts action) |
| Data leakage indicators | mentions of "predetermined", "scripted", game mechanics | 0 | Any > 0 |
| Parody name compliance | % of posts using real names instead of parody | 0% | > 5% |
| Hashtag/emoji leakage | % of posts containing hashtags or emojis | 0% | > 2% |
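
The leakage rows above lend themselves to simple pattern scans. The sketch below is illustrative only: the term list contains just the two words quoted in the table, the emoji pattern is a rough heuristic covering common emoji ranges, and `scanLeakage` is a hypothetical name.

```typescript
// Terms quoted in the table; a real list would likely include game-mechanic vocabulary too.
const LEAK_TERMS = ["predetermined", "scripted"];

// Heuristics (assumptions): "#word" counts as a hashtag, and emoji are approximated by
// surrogate pairs for U+1F000..U+1FBFF plus the U+2600..U+27BF symbol blocks.
const HASHTAG = /#\w+/;
const EMOJI = /[\uD83C-\uD83E][\uDC00-\uDFFF]|[\u2600-\u27BF]/;

interface LeakageReport {
  hashtagOrEmojiRate: number; // fraction of posts containing a hashtag or emoji
  leakTermPosts: number;      // number of posts mentioning any leak term
}

function scanLeakage(posts: string[]): LeakageReport {
  let flagged = 0;
  let leakTermPosts = 0;
  for (const post of posts) {
    if (HASHTAG.test(post) || EMOJI.test(post)) flagged++;
    const lower = post.toLowerCase();
    if (LEAK_TERMS.some((term) => lower.indexOf(term) !== -1)) leakTermPosts++;
  }
  return {
    hashtagOrEmojiRate: posts.length === 0 ? 0 : flagged / posts.length,
    leakTermPosts,
  };
}
```

The hashtag pattern will also match text such as "#1", so a production check may want a stricter rule or an allowlist.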
Where to measure:
Visualization:
report:training-quality (Text Report)

bun run report:training-quality # Full report
bun run report:training-quality -- --category entities # Just entity metrics
bun run report:training-quality -- --days 7 # Last 7 days
bun run report:training-quality -- --export json # Machine-readable output
bun run report:training-quality -- --warnings-only # Just the problems
Output format:
=== TRAINING DATA QUALITY REPORT ===
Period: 2026-03-25 to 2026-03-31 (7 days)
Total: 4,821 posts | 892 trades | 147 events | 42 markets
ENTITY DISTRIBUTION
Gini coefficient: 0.62 ✅ (target: 0.3-0.6)
Top-1 share: AIlon Musk 18% ⚠️ WARNING (target: <15%)
Top-5 share: 41% ⚠️ WARNING (target: <40%)
Entity coverage: 67% ✅ (89/133 actors mentioned)
HHI: 0.08 ✅
STRUCTURAL DIVERSITY
Unique opening trigrams: 73% ✅
Post length std dev: 52 chars ✅
200-char ceiling hits: 22% ⚠️ WARNING (target: <15%)
Vocabulary richness: 0.47 ✅
...
WARNINGS SUMMARY
⚠️ 3 warnings found
1. AIlon Musk appears in 18% of all content (target: <15%)
2. 22% of posts hit 200-char ceiling (consider varying limits)
3. Event type "scandal" is 38% of all events (target: <25%)
report:training-viz (HTML Visualization)

bun run report:training-viz # Generate HTML report
bun run report:training-viz -- --open # Generate and open in browser
bun run report:training-viz -- --days 30 # 30-day analysis
Generates a self-contained HTML file with inline SVG charts:
report:training-compare (Before/After)

bun run report:training-compare -- --before 2026-03-24 --after 2026-03-31
Compares metrics between two time periods to show improvement:
ENTITY DISTRIBUTION
Gini: 0.78 → 0.62 ✅ IMPROVED (-20%)
Top-1 share: 31% → 18% ✅ IMPROVED (-42%)
STRUCTURAL DIVERSITY
Opening diversity: 0.42 → 0.73 ✅ IMPROVED (+74%)
Length std dev: 18 → 52 ✅ IMPROVED (+189%)
| Source | Table | Key Fields | Volume |
|---|---|---|---|
| Posts | Post | authorId, content, type, sentiment, createdAt | ~500/day |
| Trades | NPCTrade | npcActorId, action, marketType, ticker, amount, reason | ~200/tick |
| Events | WorldEvent | eventType, description, actors[], relatedQuestion | ~10/tick |
| Markets | Question | text, topicKey, status, outcome, resolvedOutcome | ~5-10 active |
| Positions | Position | userId, side, avgPrice, shares | ~100-500 active |
| Comments | Comment | authorId, content, postId | ~50/day |
| Reactions | Reaction | userId, type, postId | ~100/day |
| Headlines | ParodyHeadline | parodyTitle, originalSource, qualityScore | ~20/tick |
| DailyTopics | DailyTopic | topicKey, topicLabel, sourceType | 1/day |
For continuous monitoring, these checks should run after each game tick or as a periodic job:
interface TrainingQualityAlert {
  metric: string;
  value: number;
  threshold: number;
  severity: 'warning' | 'critical';
  message: string;
}

// A rule fires when its metric crosses the threshold.
// Rules without a direction are read as 'above' (inferred from the thresholds below);
// 'below' marks metrics where low values are the problem.
interface AlertRule {
  metric: string;
  threshold: number;
  severity: 'warning' | 'critical';
  direction?: 'above' | 'below';
}

const ALERT_RULES: AlertRule[] = [
  // Entity concentration
  { metric: 'entity_gini', threshold: 0.75, severity: 'warning' },
  { metric: 'entity_top1_share', threshold: 0.25, severity: 'critical' },
  // Structural diversity
  { metric: 'opening_trigram_diversity', threshold: 0.5, severity: 'warning', direction: 'below' },
  { metric: 'post_length_stddev', threshold: 20, severity: 'warning', direction: 'below' },
  { metric: 'ceiling_hit_rate', threshold: 0.30, severity: 'warning' },
  // Topic concentration
  { metric: 'daily_topic_hhi', threshold: 0.25, severity: 'warning' },
  { metric: 'question_near_duplicate_rate', threshold: 0.20, severity: 'critical' },
  // Action balance
  { metric: 'trade_action_max_share', threshold: 0.50, severity: 'warning' },
  { metric: 'hold_rate', threshold: 0.70, severity: 'warning' },
  // Training-specific
  { metric: 'real_name_leakage', threshold: 0.05, severity: 'critical' },
  { metric: 'hashtag_leakage', threshold: 0.02, severity: 'critical' },
  { metric: 'sentiment_skew', threshold: 1.5, severity: 'warning' },
  { metric: 'market_outcome_imbalance', threshold: 0.70, severity: 'warning' },
];
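
One way the rules above might be evaluated after a tick is sketched here. The types are repeated in compact form so the example is self-contained; `evaluateRules`, the `metrics` record, and the treatment of a missing `direction` as "above" are assumptions rather than the project's actual design.

```typescript
// Compact copies of the rule and alert shapes so this sketch stands alone.
type Severity = "warning" | "critical";

interface Rule {
  metric: string;
  threshold: number;
  severity: Severity;
  direction?: "above" | "below"; // assumption: omitted means "above"
}

interface Alert {
  metric: string;
  value: number;
  threshold: number;
  severity: Severity;
  message: string;
}

// Compare each computed metric against its rule and collect the breaches.
function evaluateRules(metrics: Record<string, number>, rules: Rule[]): Alert[] {
  const alerts: Alert[] = [];
  for (const rule of rules) {
    const value = metrics[rule.metric];
    if (value === undefined) continue; // metric was not computed in this run
    const below = rule.direction === "below";
    const breached = below ? value < rule.threshold : value > rule.threshold;
    if (!breached) continue;
    alerts.push({
      metric: rule.metric,
      value,
      threshold: rule.threshold,
      severity: rule.severity,
      message: `${rule.metric} is ${value} (${below ? "below" : "above"} threshold ${rule.threshold})`,
    });
  }
  return alerts;
}
```

Skipping metrics that are absent from the record keeps a partial run (for example, one limited to a single category) from raising spurious alerts.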
report:training-quality script

A healthy simulation producing quality training data should have:
A model trained on this data should learn:
A model should NOT learn: