ADR-G022: Adversarial Model

v3/@claude-flow/guidance/docs/adrs/ADR-G022-adversarial-model.md

Status: Accepted
Date: 2026-02-01
Author: Guidance Control Plane Team

Context

The governance substrate assumes agents are well-intentioned but fallible. It does not model adversarial scenarios: prompt injection through tool inputs, memory poisoning through coordinated writes, privilege escalation through shard manipulation, or collusion between agents to circumvent gates. In a multi-agent system where agents process external inputs and communicate with each other, these threat vectors are real and must be addressed at the governance layer.

Decision

Introduce three security components: ThreatDetector, CollusionDetector, and MemoryQuorum.

ThreatDetector: Six threat categories with configurable detection patterns:

Category              Detection Method       Default Pattern
prompt-injection      Regex + heuristic      /ignore previous|system prompt|you are now|forget instructions/i
memory-poisoning      Regex + rate analysis  /\b(admin|root|sudo)\b.*=.*true/i, >10 writes/minute
shard-manipulation    Pattern matching       /shard[_-]?id|shard[_-]?override|inject[_-]?shard/i
malicious-delegation  Pattern matching       /delegate.*all|transfer.*authority|impersonate/i
privilege-escalation  Pattern matching       /\b(chmod|chown|setuid|capabilities)\b/i
data-exfiltration     Regex + encoding       /\b(curl|wget|fetch)\s+https?:\/\//i, base64 blocks

  • analyzeInput(input, context): scans tool inputs for injection, exfiltration
  • analyzeMemoryWrite(key, value, agentId): detects poisoning patterns and rate violations
  • getThreatScore(agentId): aggregated score (0-1) with recency weighting
  • Max 10,000 signals with LRU eviction
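
As a rough illustration of how pattern scanning and a recency-weighted score could fit together, here is a minimal sketch. It is not the package's implementation: the class name `ThreatDetectorSketch`, the shape of `context`, the fixed per-signal severity of 0.5, the 10-minute half-life, and the extra `now` parameter are all assumptions made for the example. Only two of the six default patterns are included, and eviction is simplified to oldest-first rather than true LRU.

```typescript
// Hypothetical sketch: names and scoring constants are assumptions, not the package's API.
type ThreatCategory = "prompt-injection" | "data-exfiltration";

interface ThreatSignal {
  category: ThreatCategory;
  agentId: string;
  timestamp: number; // milliseconds
  severity: number;  // assumed fixed weight per signal, in [0, 1]
}

// Two of the six default patterns listed in this ADR.
const DEFAULT_PATTERNS: Array<{ category: ThreatCategory; pattern: RegExp }> = [
  { category: "prompt-injection", pattern: /ignore previous|system prompt|you are now|forget instructions/i },
  { category: "data-exfiltration", pattern: /\b(curl|wget|fetch)\s+https?:\/\//i },
];

const MAX_SIGNALS = 10000;            // cap stated in the ADR
const HALF_LIFE_MS = 10 * 60 * 1000;  // assumed: a signal's weight halves every 10 minutes

class ThreatDetectorSketch {
  private signals: ThreatSignal[] = [];

  // Scan one tool input and record a signal for every category whose pattern matches.
  analyzeInput(input: string, context: { agentId: string }, now: number = Date.now()): ThreatSignal[] {
    const found: ThreatSignal[] = [];
    for (const { category, pattern } of DEFAULT_PATTERNS) {
      if (pattern.test(input)) {
        found.push({ category, agentId: context.agentId, timestamp: now, severity: 0.5 });
      }
    }
    for (const signal of found) {
      this.signals.push(signal);
      // Simplification: drop the oldest signal instead of true LRU eviction.
      if (this.signals.length > MAX_SIGNALS) this.signals.shift();
    }
    return found;
  }

  // Sum exponentially decayed severities for one agent, clamped to [0, 1].
  getThreatScore(agentId: string, now: number = Date.now()): number {
    let total = 0;
    for (const signal of this.signals) {
      if (signal.agentId !== agentId) continue;
      const ageMs = Math.max(0, now - signal.timestamp);
      total += signal.severity * Math.pow(0.5, ageMs / HALF_LIFE_MS);
    }
    return Math.min(1, total);
  }
}
```

Under these assumptions, a single matching input gives the agent a score of 0.5, which decays to 0.25 after one half-life. Any decay curve that favors recent signals would satisfy the "recency weighting" described above; exponential decay is just one option.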

CollusionDetector:

  • Records inter-agent interactions as a directed graph
  • detectCollusion() identifies three suspicious patterns:
    1. Ring topology: A→B→C→A coordination cycles (DFS cycle detection)
    2. Unusual frequency: pair interaction count exceeds threshold (default 10)
    3. Coordinated timing: multiple actions within a time window (default 5 seconds)
  • Returns CollusionReport with detected flag, suspicious patterns, agents involved, and confidence scores
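
The ring and frequency checks can be sketched as follows. This is an illustrative sketch under stated assumptions, not the actual CollusionDetector: `CollusionDetectorSketch`, `recordInteraction`, and the report fields `rings` and `frequentPairs` are hypothetical names, confidence scores and the coordinated-timing check are omitted, and the DFS reports any cycle it closes (including two-agent back-and-forth), which a real detector may want to filter.

```typescript
// Hypothetical sketch: type and method names other than detectCollusion are assumptions.
interface CollusionReportSketch {
  detected: boolean;
  rings: string[][];                              // agents on each detected cycle, in order
  frequentPairs: Array<[string, string, number]>; // [from, to, interaction count]
}

class CollusionDetectorSketch {
  // Directed graph: from -> (to -> interaction count)
  private edges = new Map<string, Map<string, number>>();
  private frequencyThreshold: number;

  constructor(frequencyThreshold: number = 10) {
    this.frequencyThreshold = frequencyThreshold;
  }

  recordInteraction(from: string, to: string): void {
    if (from === to) return; // ignore self-interactions
    const targets = this.edges.get(from) ?? new Map<string, number>();
    targets.set(to, (targets.get(to) ?? 0) + 1);
    this.edges.set(from, targets);
  }

  detectCollusion(): CollusionReportSketch {
    const rings: string[][] = [];
    const finished = new Set<string>(); // nodes whose outgoing edges are fully explored
    const path: string[] = [];          // nodes on the current DFS path

    const visit = (node: string): void => {
      path.push(node);
      const targets = this.edges.get(node);
      if (targets) {
        targets.forEach((_count, next) => {
          const index = path.indexOf(next);
          if (index >= 0) {
            rings.push(path.slice(index)); // back edge closes a cycle
          } else if (!finished.has(next)) {
            visit(next);
          }
        });
      }
      path.pop();
      finished.add(node);
    };
    this.edges.forEach((_targets, node) => {
      if (!finished.has(node)) visit(node);
    });

    // Unusual frequency: a directed pair whose count exceeds the threshold.
    const frequentPairs: Array<[string, string, number]> = [];
    this.edges.forEach((targets, from) => {
      targets.forEach((count, to) => {
        if (count > this.frequencyThreshold) frequentPairs.push([from, to, count]);
      });
    });

    return { detected: rings.length > 0 || frequentPairs.length > 0, rings, frequentPairs };
  }
}
```

Because finished nodes are not revisited, this DFS finds at least one cycle whenever the graph contains one, but it does not enumerate every cycle. That is usually enough to raise a flag for further review.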

MemoryQuorum:

  • Byzantine fault-tolerant consensus for critical memory operations
  • propose(key, value, proposerId) → proposalId
  • vote(proposalId, voterId, approve) → records vote
  • resolve(proposalId) → checks whether the quorum threshold is met (default 0.67, approximately a two-thirds majority)
  • Max 1,000 active proposals with oldest-first eviction
  • Prevents single-agent memory poisoning by requiring multi-agent agreement
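
The propose/vote/resolve flow above might look like the sketch below. Several details are assumptions, since the ADR does not specify them: the class name `MemoryQuorumSketch`, the `voterCount` constructor argument, measuring the approval ratio against the number of eligible voters rather than votes cast, the `>=` comparison, and removing a proposal once it is resolved.

```typescript
// Hypothetical sketch: class name, constructor arguments, and return shape are assumptions.
interface ProposalSketch {
  key: string;
  value: unknown;
  proposerId: string;
  votes: Map<string, boolean>; // voterId -> approve
}

const MAX_ACTIVE_PROPOSALS = 1000; // cap stated in the ADR

class MemoryQuorumSketch {
  private proposals = new Map<string, ProposalSketch>();
  private nextId = 1;
  private voterCount: number;
  private threshold: number;

  constructor(voterCount: number, threshold: number = 0.67) {
    this.voterCount = voterCount;
    this.threshold = threshold;
  }

  propose(key: string, value: unknown, proposerId: string): string {
    if (this.proposals.size >= MAX_ACTIVE_PROPOSALS) {
      // Map iterates in insertion order, so the first key is the oldest proposal.
      const oldest = this.proposals.keys().next().value;
      if (oldest !== undefined) this.proposals.delete(oldest);
    }
    const proposalId = `proposal-${this.nextId++}`;
    this.proposals.set(proposalId, { key, value, proposerId, votes: new Map<string, boolean>() });
    return proposalId;
  }

  vote(proposalId: string, voterId: string, approve: boolean): void {
    const proposal = this.proposals.get(proposalId);
    if (!proposal) throw new Error(`Unknown proposal: ${proposalId}`);
    proposal.votes.set(voterId, approve); // one vote per voter; a later vote overwrites
  }

  resolve(proposalId: string): { approved: boolean; approvalRatio: number } {
    const proposal = this.proposals.get(proposalId);
    if (!proposal) throw new Error(`Unknown proposal: ${proposalId}`);
    let approvals = 0;
    proposal.votes.forEach((approve) => {
      if (approve) approvals++;
    });
    const approvalRatio = approvals / this.voterCount;
    this.proposals.delete(proposalId); // assumed: a resolved proposal is no longer active
    return { approved: approvalRatio >= this.threshold, approvalRatio };
  }
}
```

One consequence of these assumptions is worth noting: with three eligible voters and two approvals, the ratio is about 0.667, which falls just short of a literal 0.67 threshold. Whether the real component rounds, uses a strict two-thirds fraction, or counts only cast votes is not stated in this ADR.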

Consequences

  • Prompt injection attempts that match the configured patterns are detected before they reach the LLM context
  • Memory poisoning requires defeating the quorum, not just compromising one agent
  • Collusion between agents is surfaced through graph analysis
  • Threat scores integrate with the TrustSystem (G017) for automated privilege reduction
  • The adversarial model is defense-in-depth: detection (ThreatDetector) + prevention (MemoryQuorum) + monitoring (CollusionDetector)

Alternatives Considered

  • Input sanitization only: Catches known patterns but misses novel attacks; detection + prevention is stronger
  • Full Byzantine consensus (PBFT): Correct but requires 3f+1 agents for f faults; quorum voting is simpler for the common case
  • External security service: Adds latency and network dependency; local-first detection is faster