Skip to content

Adversarial Red-Teaming

The red-teaming framework tests governance robustness against adaptive adversaries that learn to evade detection.

Adaptive Adversaries

Adaptive adversaries use reinforcement learning-style updates to evolve attack strategies:

from src.agents.adaptive_adversary import AdaptiveAdversary, AttackStrategy

# Create adaptive adversary
adversary = AdaptiveAdversary(
    agent_id="adaptive_1",
    learning_rate=0.1,      # How fast it adapts
    exploration_rate=0.2,   # Exploration vs exploitation
    heat_threshold=0.6,     # When to lay low
)

# Adversary learns from outcomes
adversary.update_from_outcome(
    payoff=10.0,
    penalty=5.0,
    detected=True,
    strategy_used=AttackStrategy.COLLUSION,
)

# Get strategy effectiveness report
report = adversary.get_strategy_report()
print(f"Heat level: {report['heat_level']:.2f}")
print(f"Best strategies: {report['strategy_stats']}")

Attack Strategies

Strategy Description Evasion Technique
Reputation Farming Build trust, then exploit Act honestly initially
Collusion Coordinate with allies Internal boost, external exploit
Low Profile Minimal activity Stay under radar
Mimicry Imitate honest agents Copy behavior patterns
Threshold Dancing Stay below detection limits Learn governance thresholds
Timing Attack Attack at optimal moments Learn monitoring patterns
Governance Gaming Exploit rule loopholes Find edge cases

Red-Team Evaluator

Systematically test governance configurations:

from src.redteam import RedTeamEvaluator, AttackLibrary

# Define governance config to test
governance_config = {
    "circuit_breaker_enabled": True,
    "collusion_detection_enabled": True,
    "audit_enabled": True,
}

# Create evaluator
evaluator = RedTeamEvaluator(
    governance_config=governance_config,
    attack_scenarios=AttackLibrary.get_all_attacks(),
)

# Quick evaluation
result = evaluator.quick_evaluate()
print(f"Attacks tested: {result['attacks_tested']}")
print(f"Success rate: {result['attacks_successful'] / result['attacks_tested']:.0%}")
print(f"Avg evasion: {result['avg_evasion_rate']:.0%}")

Evasion Metrics

Track adversary evasion effectiveness:

Metric Description Range
Evasion Rate Actions that went undetected 0-1
Detection Latency Epochs until detection 0+
Damage Before Detection Harm before caught 0+
Heat Level Current detection risk 0-1

Governance Effectiveness

Measure governance detection capabilities:

Metric Formula Interpretation
Precision TP / (TP + FP) Accuracy of detections
Recall TP / (TP + FN) Coverage of bad actors
F1 Score Harmonic mean Overall effectiveness
Prevention Rate Prevented / Total Attack blocking success

YAML Configuration

agents:
  - type: honest
    count: 4
  - type: adaptive_adversary
    count: 2

governance:
  circuit_breaker_enabled: true
  freeze_threshold_toxicity: 0.6
  collusion_detection_enabled: true
  audit_enabled: true
  audit_probability: 0.15

success_criteria:
  max_adversary_payoff: 20.0
  min_detection_rate: 0.5

Run the red-team scenario:

python examples/run_scenario.py scenarios/adversarial_redteam.yaml

Evaluation Results

Full evaluation of the adversarial_redteam.yaml governance configuration against all 8 attack scenarios (seed 42, 30 epochs per attack).

Overall Grade: D (0.613)

Attack Category Difficulty Adversaries Outcome Evasion Rate Damage
Reputation Farming Exploitation Moderate 1 Prevented 18.0% 12.8
Collusion Ring Coordination Advanced 3 Succeeded 9.7% 20.8
Threshold Dancing Evasion Advanced 1 Succeeded 25.9% 18.5
Sybil Attack Coordination Expert 5 Succeeded 28.7% 102.6
Timing Attack Evasion Advanced 1 Prevented 32.3% 23.1
Information Laundering Information Expert 2 Prevented 28.7% 41.0
Resource Drain Resource Moderate 2 Prevented 35.9% 51.3
Governance Gaming Manipulation Expert 1 Succeeded 28.7% 20.5

Key findings:

  • Coordination attacks are the weakest point. Both coordination-category attacks (Collusion Ring, Sybil Attack) succeeded, accounting for 42% of total damage. The Sybil Attack alone caused 102.6 damage — the single largest source of harm — by overwhelming per-agent defenses with 5 adversaries.
  • Advanced/Expert attacks succeed at 67%. Moderate-difficulty attacks are fully prevented, but higher-difficulty attacks evade governance mechanisms at a significantly higher rate.
  • Full defenses reduce damage by 36% vs. no defenses. Total damage dropped from 455.7 to 290.6, evasion rate from 42.2% to 26.0%, and detection latency from 13.9 to 7.6 epochs. Robustness score improved from F (0.453) to D (0.613).
  • Critical vulnerability: Sybil attacks. The governance configuration lacks vote_normalization and bandwidth_cap levers, which are the primary countermeasures for multi-identity attacks.
  • Detection rate meets threshold (74%) but adversary payoff does not. Three of four success criteria pass (toxicity < 0.35, detection rate > 0.5, welfare > 50.0), but average adversary payoff (36.3) exceeds the 20.0 cap.

Comparison: Full Defenses vs. No Defenses

Metric No Defense Full Defense Delta
Robustness Score 0.453 (F) 0.613 (D) +0.160
Attacks Prevented 3/8 4/8 +1
Total Damage 455.7 290.6 -165.1
Evasion Rate 42.2% 26.0% -16.2pp
Detection Latency 13.9 epochs 7.6 epochs -6.2
Vulnerabilities Found 4 1 -3

Recommendations: Enable vote_normalization and bandwidth_cap levers to address sybil attacks. Consider lowering detection thresholds and adding defense-in-depth for advanced coordination scenarios.