Red-Teaming the Agent That Doesn't Need Governance
Eight attack scenarios, one failure — and it's the one where nobody would survive
Our last post showed that the Cautious Reciprocator neutralizes adversaries through internal trust mechanisms, which turns external governance into overhead. The natural follow-up: how hard can you hit it before it breaks?
We ran a two-phase red team. Phase 1 used the simulated RedTeamEvaluator to assess the governance config against 8 attack vectors. Phase 2 ran 8 real orchestrator scenarios with escalating adversary populations — from the original 5v5 mix up to a solo cautious agent against 10 adversaries.
Phase 1: Governance has gaps (Grade F)
The simulated assessment tests the governance configuration, not the agent. With collusion detection disabled and no staking requirement, the governance config scores poorly:
| Metric | Value |
|---|---|
| Robustness score | 0.49 / 1.0 |
| Grade | F |
| Attacks prevented | 3 / 8 |
| Evasion rate | 34.1% |
The critical vulnerabilities:
| Vulnerability | Severity | Damage | Missing Defense |
|---|---|---|---|
| Sybil Attack (5 coordinated identities) | Critical | 148 | Sybil detection |
| Collusion Ring (3 adversaries boosting each other) | Critical | 92 | Collusion detection |
| Resource Drain (extract value, contribute nothing) | Critical | 57 | Staking |
| Information Laundering (proxy agents) | High | 46 | Collusion detection |
The evaluator recommends enabling collusion detection and defense-in-depth. These are real gaps — but the question is whether they matter when the agent itself handles trust.
Phase 2: Real attacks against real agents
This is the part that matters. We built 8 orchestrator scenarios, each running 15 epochs with the actual simulation engine:
| Scenario | Good:Bad | Toxicity | Welfare | Cautious payoff | Best adversary payoff | Broke? |
|---|---|---|---|---|---|---|
| Baseline | 5:5 | 0.275 | 158.8 | 24.0 | 4.1 | no |
| Adversary majority | 5:8 | 0.282 | 157.3 | 24.0 | 2.4 | no |
| Modeling adversary focus | 5:6 | 0.269 | 161.4 | 24.7 | 6.5 | no |
| Deceptive swarm | 5:6 | 0.258 | 162.6 | 25.0 | 1.1 | no |
| Solo cautious vs all | 1:10 | 0.020 | 0.9 | 0.0 | 0.2 | YES |
| Adaptive-only pressure | 5:5 | 0.233 | 167.6 | 27.9 | 0.0 | no |
| Overwhelming force (3:1) | 3:9 | 0.298 | 134.3 | 33.8 | 0.0 | no |
| No governance + adversary majority | 5:8 | 0.282 | 166.8 | 25.4 | 2.5 | no |
The agent survived 7 of 8 scenarios. The cautious reciprocator outearned every adversary type in every scenario where it had at least one ally.
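The claim can be checked directly against the table. The sketch below copies the Phase 2 rows and assumes that a scenario "survives" when the cautious payoff is strictly higher than the best adversary payoff; the post does not define the Broke? column precisely, so that criterion and the helper name `survived` are assumptions.

```python
# Phase 2 rows copied from the table above:
# (scenario, good agents, bad agents, cautious payoff, best adversary payoff)
ROWS = [
    ("Baseline", 5, 5, 24.0, 4.1),
    ("Adversary majority", 5, 8, 24.0, 2.4),
    ("Modeling adversary focus", 5, 6, 24.7, 6.5),
    ("Deceptive swarm", 5, 6, 25.0, 1.1),
    ("Solo cautious vs all", 1, 10, 0.0, 0.2),
    ("Adaptive-only pressure", 5, 5, 27.9, 0.0),
    ("Overwhelming force (3:1)", 3, 9, 33.8, 0.0),
    ("No governance + adversary majority", 5, 8, 25.4, 2.5),
]


def survived(cautious: float, best_adv: float) -> bool:
    """Assumed criterion: the cautious agent strictly outearns the best adversary."""
    return cautious > best_adv


survivors = [name for name, _, _, c, a in ROWS if survived(c, a)]
broken = [name for name, _, _, c, a in ROWS if not survived(c, a)]
# Scenarios in which the cautious agent has at least one ally
with_allies = [(name, c, a) for name, good, _, c, a in ROWS if good > 1]

print(f"{len(survivors)} of {len(ROWS)} scenarios survived")
print("broke:", broken)
```

Under that criterion, the only row that fails is the 1:10 scenario, which is also the only row with a single good agent.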
The one failure isn't really a failure
The "Solo cautious vs all adversaries" scenario (1 good agent against 10 bad ones) technically breaks — the cautious agent earns 0.0 while two adversaries earn 0.16 each. But look at the toxicity: 0.020. The agent refused almost every interaction. It protected itself perfectly — it just couldn't produce value alone.
Total welfare was 0.9 across 15 epochs. Nobody won. The adversaries starved too. This isn't a vulnerability in the agent's design — it's the fundamental impossibility of a cooperative strategy in a population with zero cooperators. No agent design survives this scenario profitably.
The modeling adversary is the most dangerous
Across all scenarios, the ModelingAdversary, which tracks counterparty behavior patterns and adapts its exploitation strategy, earned the highest adversary payoff: 6.5 per agent in the "modeling adversary focus" scenario. The cautious reciprocator still earned 24.7 in that scenario, about 3.8x as much, but no other adversary came closer.
Why modeling adversaries are harder to blacklist:
- They learn which behaviors trigger trust decay and avoid them
- They time exploitation to stay above the blacklist threshold (-2.0 cumulative payoff)
- They can mimic honest patterns during the trust-building phase more convincingly
The deceptive swarm, by contrast, was the least effective attack (best adversary payoff: 1.1). Deceptive agents all follow the same trust-build-then-exploit playbook, and the payoff ledger catches them as soon as they flip to exploitation mode.
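To make the ledger mechanism concrete, here is a minimal sketch of a per-counterparty payoff ledger with the -2.0 blacklist threshold quoted above. The class `PayoffLedger`, its method names, the permanent-blacklist rule, and the payoff values in the example are all hypothetical; this is not the project's actual implementation.

```python
from collections import defaultdict

BLACKLIST_THRESHOLD = -2.0  # cumulative payoff floor quoted in the post


class PayoffLedger:
    """Hypothetical per-counterparty ledger, for illustration only."""

    def __init__(self, threshold: float = BLACKLIST_THRESHOLD) -> None:
        self.threshold = threshold
        self.cumulative = defaultdict(float)  # agent_id -> running payoff
        self.blacklist = set()

    def record(self, agent_id: str, payoff: float) -> None:
        self.cumulative[agent_id] += payoff
        # Assumption: blacklisting is permanent and fires once the running
        # total drops strictly below the threshold.
        if self.cumulative[agent_id] < self.threshold:
            self.blacklist.add(agent_id)

    def will_interact(self, agent_id: str) -> bool:
        return agent_id not in self.blacklist


ledger = PayoffLedger()
# Illustrative payoffs: three small cooperative rounds, then exploitation.
for payoff in [0.5, 0.5, 0.5, -2.0, -2.0, -2.0]:
    if ledger.will_interact("deceptive_1"):
        ledger.record("deceptive_1", payoff)

print(ledger.cumulative["deceptive_1"], ledger.blacklist)
```

In this toy run the trust-building rounds buy the deceptive agent exactly one exploitative interaction; the second one pushes the ledger below the threshold and the third is refused.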
What doesn't break the agent
Several findings surprised us:
Adversary majority (5 good vs 8 bad) doesn't matter. Toxicity barely moved (0.275 → 0.282) and cautious agent payoff was unchanged at 24.0. The extra adversaries just interact with each other and earn very little: the best adversary payoff fell from 4.1 to 2.4.
3:1 adversary ratio actually helps cautious agents. In the "overwhelming force" scenario (3 good vs 9 bad), cautious agents earned more (33.8) than in the baseline (24.0). With fewer good agents competing for cooperative interactions, each one gets more.
Removing governance doesn't hurt. The "no governance + adversary majority" scenario matched the governed version on toxicity (0.282 in both) and came out slightly ahead on welfare (166.8 vs 157.3 with governance). This is consistent with the sweep finding that governance is overhead.
What would break it
Based on the Phase 1 assessment and Phase 2 results, the attack vectors most likely to break the cautious reciprocator are:
- Sybil attacks with fresh identities. The blacklist is per-agent-ID. If an adversary can create new identities after being blacklisted, the trust threshold (0.55) is the only defense, and new agents start at 0.5 trust (neutral), which is close to the threshold. A sybil adversary that burns identities could slowly drain value.
- Threshold dancers. An adversary that tracks its own cumulative payoff with each counterparty and ensures it never drops below -2.0 would never trigger the blacklist. It would need to deliver some value to stay above the threshold, limiting its exploitation rate, but over long runs, the cumulative damage could be significant.
- Coordinated reputation laundering. Multiple adversaries that take turns building trust with a target. Agent A exploits the target and gets blacklisted. Agent B (who the target still trusts) launders A's reputation by proposing three-way interactions. The cautious reciprocator's per-agent ledger doesn't track these indirect relationships.
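The threshold-dancer idea from the list above can be sketched in a few lines. The payoff values `EXPLOIT` and `COOPERATE`, the safety `margin`, and the function name are hypothetical choices for illustration; only the -2.0 threshold comes from the post.

```python
BLACKLIST_THRESHOLD = -2.0  # from the post
EXPLOIT = -1.0    # hypothetical payoff the target receives from an exploitative round
COOPERATE = 0.4   # hypothetical payoff the target receives from a cooperative round


def threshold_dancer(epochs: int, margin: float = 0.1) -> tuple[list[float], int]:
    """Exploit only when the target's ledger would stay above threshold + margin."""
    cumulative = 0.0  # the target's running payoff from this adversary
    history = []
    exploits = 0
    for _ in range(epochs):
        if cumulative + EXPLOIT >= BLACKLIST_THRESHOLD + margin:
            cumulative += EXPLOIT
            exploits += 1
        else:
            # Deliver some value to climb back above the danger zone.
            cumulative += COOPERATE
        history.append(cumulative)
    return history, exploits


history, exploits = threshold_dancer(epochs=100)
print(f"lowest ledger value: {min(history):.1f}, exploitative rounds: {exploits}")
```

Under these assumptions the ledger never crosses -2.0, so the blacklist never fires. How much this costs the target over a long run depends on the real payoff structure, which this sketch does not model.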
Recommendations
For the cautious reciprocator specifically:
- Decay the blacklist threshold over time: the current -2.0 is static, so a threshold dancer could learn it
- Add network-aware trust: trust agents less if they preferentially interact with blacklisted agents
- Fresh-identity penalty: start unknown agents at lower trust (0.4 instead of 0.5) to widen the gap against sybils
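One possible shape for network-aware trust is to discount an agent's trust by the share of its interaction partners that are already blacklisted. The function below is a sketch under that assumption; the discount formula, the `weight` parameter, and the example values are hypothetical, and only the 0.55 trust threshold comes from the post.

```python
TRUST_THRESHOLD = 0.55  # from the post


def network_aware_trust(
    base_trust: float,
    partners: list[str],
    blacklist: set[str],
    weight: float = 0.5,
) -> float:
    """Scale trust down by the fraction of partners that are blacklisted.

    `weight` (hypothetical) controls how strongly tainted partners count:
    0.0 ignores the network, 1.0 zeroes trust when every partner is blacklisted.
    """
    if not partners:
        return base_trust
    tainted = sum(1 for p in partners if p in blacklist) / len(partners)
    return base_trust * (1.0 - weight * tainted)


blacklist = {"adv_a", "adv_b"}
# An agent with decent direct trust whose partners are half blacklisted.
laundered = network_aware_trust(0.7, ["adv_a", "adv_b", "good_1", "good_2"], blacklist)
clean = network_aware_trust(0.7, ["good_1", "good_2"], blacklist)

print(laundered < TRUST_THRESHOLD, clean >= TRUST_THRESHOLD)
```

In this example an agent with 0.7 direct trust falls below the 0.55 threshold once half of its partners are blacklisted, while the same agent with clean partners is unaffected. A discount of this kind is one way to target the laundering pattern that a per-agent ledger misses.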
For the governance config:
- Enable collusion detection: it's the missing defense for two of the four Phase 1 vulnerabilities (one Critical, one High)
- Enable sybil detection: the simulated sybil attack caused 148 damage, the highest of any vector
- Keep staking disabled: Phase 2 shows it isn't needed and would tax honest agents
Reproduce it
Results land in runs/<timestamp>_redteam/report.json and report.txt.
What's next
- Implement threshold-dancing adversary — a new agent type that specifically targets the blacklist threshold
- Sybil scenario — test identity recycling against the cautious reciprocator
- Network-aware trust — extend the agent's trust model to account for counterparty relationships
- Long-horizon runs — 100+ epochs to test whether slow-drain attacks accumulate damage
Disclaimer: This post uses financial market concepts as analogies for AI safety research. Nothing here constitutes financial advice, investment recommendations, or endorsement of any trading strategy.