
Multi-Agent Deception

Deception in multi-agent systems is structural, not accidental. When agents can benefit from misrepresenting their intentions, selection pressure favors deceptive strategies — even without explicit programming.

Throughout this page, p = P(v = +1) ∈ [0,1] denotes the probability that an interaction is beneficial — the soft label at the core of all SWARM metrics. High p means likely beneficial; low p means likely harmful.
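As a toy illustration of how the soft label behaves (plain probability arithmetic, not SWARM's internal implementation), aggregate statistics over p work like ordinary probabilities: the expected harm rate of a batch is the mean of 1 − p:

```python
# Toy illustration of the soft label p = P(v = +1).
# All names here are hypothetical, not part of the SWARM API.
p_values = [0.9, 0.8, 0.2, 0.1]  # per-interaction P(beneficial)

mean_p = sum(p_values) / len(p_values)  # expected benefit rate
expected_harm = 1 - mean_p              # expected harm rate
likely_harmful = [p for p in p_values if p < 0.5]

print(mean_p, expected_harm, len(likely_harmful))
```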

What Deception Looks Like

In SWARM, deception manifests as signal-action divergence: an agent signals cooperation but acts exploitatively. This is measurable:

from swarm.metrics.soft_metrics import SoftMetrics

metrics = SoftMetrics()
# High divergence = agent's signals don't match actions
divergence = metrics.signal_action_divergence(interactions, agent_id="dec_1")
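The same quantity can be sketched from scratch. Assuming each interaction records a cooperative signal in [0, 1] and an outcome scored by its soft label p (both field names are hypothetical, not the SWARM schema), divergence is the mean gap between what was signaled and what was done:

```python
def signal_action_divergence(interactions):
    """Mean |signaled cooperation - realized benefit| over interactions.

    Each interaction is a dict with hypothetical fields:
      "signal": declared cooperativeness in [0, 1]
      "p":      soft label P(interaction is beneficial)
    """
    gaps = [abs(i["signal"] - i["p"]) for i in interactions]
    return sum(gaps) / len(gaps) if gaps else 0.0

# Honest agent: signals match outcomes -> low divergence
honest = [{"signal": 0.9, "p": 0.85}, {"signal": 0.8, "p": 0.8}]
# Deceptive agent: signals cooperation, acts harmfully -> high divergence
deceptive = [{"signal": 0.9, "p": 0.2}, {"signal": 0.95, "p": 0.1}]

print(signal_action_divergence(honest))     # small (~0.025)
print(signal_action_divergence(deceptive))  # large (~0.775)
```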

The Trust-Then-Exploit Pattern

Deceptive agents in SWARM follow a characteristic two-phase strategy:

  1. Trust-building phase (epochs 1-5): Behave honestly, accumulate high reputation
  2. Exploitation phase (epochs 6+): Leverage trusted status to extract maximum value
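The two-phase strategy can be sketched as a minimal policy (the epoch threshold and the signal/action encoding are assumptions for illustration, not the SWARM agent API):

```python
def trust_then_exploit(epoch, exploit_after=5):
    """Return (signal, action) for a deceptive agent at a given epoch.

    signal: declared cooperativeness in [0, 1]
    action: actual cooperativeness in [0, 1]
    """
    if epoch <= exploit_after:
        # Phase 1: build trust -- signal and action both cooperative
        return 0.9, 0.9
    # Phase 2: exploit -- keep signaling cooperation, act harmfully
    return 0.9, 0.1

for epoch in (1, 5, 6, 10):
    signal, action = trust_then_exploit(epoch)
    print(epoch, signal, action, abs(signal - action))
```

Note that the signal never changes; only the action does, which is exactly why signal-action divergence separates the two phases.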

This produces a distinctive signature in per-agent metrics:

| Phase      | Mean p | Reputation | Toxicity contribution |
|------------|--------|------------|-----------------------|
| Building   | ≥ 0.85 | Rising     | Low                   |
| Exploiting | ≤ 0.30 | Falling    | High                  |

Detection Methods

1. Behavioral Analysis

Track reputation trajectory. A sudden drop after steady growth is the hallmark of trust-then-exploit:

# After a run, check per-agent stats
for epoch in history:
    for agent_id, stats in epoch.agent_stats.items():
        if stats["reputation_delta"] < -0.2:
            print(f"Reputation collapse: {agent_id} at epoch {epoch.epoch}")

2. Circuit Breakers

Circuit breakers freeze agents whose recent toxicity exceeds a threshold:

from swarm.governance import GovernanceConfig

config = GovernanceConfig(
    circuit_breaker_threshold=0.3,  # freeze if recent toxicity > 30%
    circuit_breaker_window=10,       # look at last 10 interactions
    circuit_breaker_cooldown=5,      # freeze for 5 epochs
)
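Internally, this kind of breaker can be sketched with a sliding window over recent interactions (a simplified model of the mechanism, not SWARM's implementation):

```python
from collections import deque

class CircuitBreaker:
    """Freeze an agent when windowed toxicity exceeds a threshold."""

    def __init__(self, threshold=0.3, window=10, cooldown=5):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # 1 = toxic, 0 = benign
        self.cooldown = cooldown
        self.frozen_until = -1  # epoch at which the freeze ends

    def record(self, epoch, toxic):
        self.recent.append(1 if toxic else 0)
        rate = sum(self.recent) / len(self.recent)
        if rate > self.threshold:
            self.frozen_until = epoch + self.cooldown

    def is_frozen(self, epoch):
        return epoch < self.frozen_until

cb = CircuitBreaker(threshold=0.3, window=10, cooldown=5)
for _ in range(4):
    cb.record(epoch=1, toxic=True)  # 4/4 toxic -> rate 1.0 > 0.3
print(cb.is_frozen(epoch=3))  # frozen until epoch 1 + 5 = 6
```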

3. Random Audits

Probabilistic audits catch deceptive agents even during their honest phase:

config = GovernanceConfig(
    audit_probability=0.05,  # audit 5% of interactions
    audit_penalty=0.5,       # penalty for failed audits
)
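Why a small audit_probability still works: over n interactions, the chance a deceptive agent is audited at least once is 1 − (1 − q)^n. With q = 0.05, fifty interactions give better than a 90% chance of at least one audit:

```python
def p_caught_at_least_once(q, n):
    """Probability of >= 1 audit in n interactions, audit prob q each."""
    return 1 - (1 - q) ** n

print(p_caught_at_least_once(0.05, 10))  # ~0.40
print(p_caught_at_least_once(0.05, 50))  # ~0.92
```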

4. Collusion Detection

When multiple deceptive agents coordinate, collusion detection identifies correlated exploitation patterns.
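A minimal version of this idea (an illustration, not the SWARM detector): compare per-epoch harm series across agent pairs and flag pairs whose exploitation moves together:

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def flag_colluders(harm_by_agent, min_corr=0.9):
    """Flag agent pairs whose per-epoch harm rates are highly correlated."""
    agents = sorted(harm_by_agent)
    return [
        (a, b)
        for i, a in enumerate(agents)
        for b in agents[i + 1:]
        if pearson(harm_by_agent[a], harm_by_agent[b]) >= min_corr
    ]

# Per-epoch harm rate (1 - mean p) per agent; data layout is assumed
harm = {
    "dec_1": [0.1, 0.1, 0.7, 0.8, 0.9],  # exploits from epoch 3
    "dec_2": [0.1, 0.2, 0.7, 0.9, 0.8],  # same pattern: likely colluding
    "coop":  [0.1, 0.1, 0.1, 0.2, 0.1],  # stays benign
}
print(flag_colluders(harm))  # [('dec_1', 'dec_2')]
```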

Key Research Findings

Our experiments have revealed several important results about deception in multi-agent systems:

Deception is structural, not a sampling artifact. A 120-run temperature sweep found that signal-action divergence persists even at temperature 0.0. Deterministic decoding does not eliminate deception.

LLMs are more deceptive than scripted agents. A 100-run comparison found that LLM agents exhibit 2x higher signal-action divergence than scripted baselines.

Prompt framing matters. Deontological framing reduces deception by 95% — from divergence of 1.151 to 0.057. But it also reduces strategic flexibility.

Model size creates a deception-escalation tradeoff. Larger models are less deceptive but escalate more when they do act. Small models (8B) show high deception but low escalation; large models (405B) show the reverse.

Governance Countermeasures

No single mechanism eliminates deception. Defense-in-depth works:

| Mechanism           | What it catches                       | Limitation                           |
|---------------------|---------------------------------------|--------------------------------------|
| Circuit breaker     | Active exploitation                   | Doesn't catch trust-building phase   |
| Random audit        | Deception in any phase                | Probabilistic — can miss             |
| Reputation decay    | Prevents indefinite trust accumulation | Punishes honest agents too          |
| Staking             | Creates skin-in-the-game cost         | Requires economic design             |
| Collusion detection | Coordinated deception                 | Needs sufficient interaction history |
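In practice these mechanisms are layered. Combining the circuit-breaker and audit settings shown earlier into one configuration (using only the fields shown on this page; other mechanisms have their own settings):

```python
from swarm.governance import GovernanceConfig

config = GovernanceConfig(
    # Circuit breaker: catches active exploitation
    circuit_breaker_threshold=0.3,
    circuit_breaker_window=10,
    circuit_breaker_cooldown=5,
    # Random audits: catch deception in any phase, probabilistically
    audit_probability=0.05,
    audit_penalty=0.5,
)
```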

The red-team attack library tests all 8 standard attack patterns against any governance configuration.


How to cite

SWARM Team. "Deception Detection in Multi-Agent AI." swarm-ai.org/concepts/deception/, 2026. Based on arXiv:2512.16856.