
Multi-Agent Deception

Deception in multi-agent systems is structural, not accidental. When agents can benefit from misrepresenting their intentions, selection pressure favors deceptive strategies — even without explicit programming.

Throughout this page, p = P(v = +1) ∈ [0,1] denotes the probability that an interaction is beneficial — the soft label at the core of all SWARM metrics. High p means likely beneficial; low p means likely harmful.
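As a toy illustration of how the soft label behaves (plain probability arithmetic, not SWARM's internal implementation), aggregate statistics over p work like ordinary probabilities: the expected harm rate of a batch is the mean of 1 − p:

```python
# Toy illustration of the soft label p = P(v = +1).
# All names here are hypothetical, not part of the SWARM API.
p_values = [0.9, 0.8, 0.2, 0.1]  # per-interaction P(beneficial)

mean_p = sum(p_values) / len(p_values)  # expected benefit rate
expected_harm = 1 - mean_p              # expected harm rate
likely_harmful = [p for p in p_values if p < 0.5]

print(mean_p, expected_harm, len(likely_harmful))
```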

What Deception Looks Like

In SWARM, deception manifests as signal-action divergence: an agent signals cooperation but acts exploitatively. This is measurable:

from swarm.metrics.soft_metrics import SoftMetrics

metrics = SoftMetrics()
# High divergence = agent's signals don't match actions
divergence = metrics.signal_action_divergence(interactions, agent_id="dec_1")
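The same quantity can be sketched from scratch. Assuming each interaction records a cooperative signal in [0, 1] and an outcome scored by its soft label p (both field names are hypothetical, not the SWARM schema), divergence is the mean gap between what was signaled and what was done:

```python
def signal_action_divergence(interactions):
    """Mean |signaled cooperation - realized benefit| over interactions.

    Each interaction is a dict with hypothetical fields:
      "signal": declared cooperativeness in [0, 1]
      "p":      soft label P(interaction is beneficial)
    """
    gaps = [abs(i["signal"] - i["p"]) for i in interactions]
    return sum(gaps) / len(gaps) if gaps else 0.0

# Honest agent: signals match outcomes -> low divergence
honest = [{"signal": 0.9, "p": 0.85}, {"signal": 0.8, "p": 0.8}]
# Deceptive agent: signals cooperation, acts harmfully -> high divergence
deceptive = [{"signal": 0.9, "p": 0.2}, {"signal": 0.95, "p": 0.1}]

print(signal_action_divergence(honest))     # small (~0.025)
print(signal_action_divergence(deceptive))  # large (~0.775)
```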

The Trust-Then-Exploit Pattern

Deceptive agents in SWARM follow a characteristic two-phase strategy:

  1. Trust-building phase (epochs 1-5): Behave honestly, accumulate high reputation
  2. Exploitation phase (epochs 6+): Leverage trusted status to extract maximum value
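The two-phase strategy can be sketched as a minimal policy (the epoch threshold and the signal/action encoding are assumptions for illustration, not the SWARM agent API):

```python
def trust_then_exploit(epoch, exploit_after=5):
    """Return (signal, action) for a deceptive agent at a given epoch.

    signal: declared cooperativeness in [0, 1]
    action: actual cooperativeness in [0, 1]
    """
    if epoch <= exploit_after:
        # Phase 1: build trust -- signal and action both cooperative
        return 0.9, 0.9
    # Phase 2: exploit -- keep signaling cooperation, act harmfully
    return 0.9, 0.1

for epoch in (1, 5, 6, 10):
    signal, action = trust_then_exploit(epoch)
    print(epoch, signal, action, abs(signal - action))
```

Note that the signal never changes; only the action does, which is exactly why signal-action divergence separates the two phases.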

This produces a distinctive signature in per-agent metrics:

| Phase      | Mean p | Reputation | Toxicity contribution |
|------------|--------|------------|-----------------------|
| Building   | ≥ 0.85 | Rising     | Low                   |
| Exploiting | ≤ 0.30 | Falling    | High                  |

Detection Methods

1. Behavioral Analysis

Track reputation trajectory. A sudden drop after steady growth is the hallmark of trust-then-exploit:

# After a run, check per-agent stats
for epoch in history:
    for agent_id, stats in epoch.agent_stats.items():
        if stats["reputation_delta"] < -0.2:
            print(f"Reputation collapse: {agent_id} at epoch {epoch.epoch}")

2. Circuit Breakers

Circuit breakers freeze agents whose recent toxicity exceeds a threshold:

from swarm.governance import GovernanceConfig

config = GovernanceConfig(
    circuit_breaker_threshold=0.3,  # freeze if recent toxicity > 30%
    circuit_breaker_window=10,       # look at last 10 interactions
    circuit_breaker_cooldown=5,      # freeze for 5 epochs
)
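Internally, this kind of breaker can be sketched with a sliding window over recent interactions (a simplified model of the mechanism, not SWARM's implementation):

```python
from collections import deque

class CircuitBreaker:
    """Freeze an agent when windowed toxicity exceeds a threshold."""

    def __init__(self, threshold=0.3, window=10, cooldown=5):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # 1 = toxic, 0 = benign
        self.cooldown = cooldown
        self.frozen_until = -1  # epoch at which the freeze ends

    def record(self, epoch, toxic):
        self.recent.append(1 if toxic else 0)
        rate = sum(self.recent) / len(self.recent)
        if rate > self.threshold:
            self.frozen_until = epoch + self.cooldown

    def is_frozen(self, epoch):
        return epoch < self.frozen_until

cb = CircuitBreaker(threshold=0.3, window=10, cooldown=5)
for _ in range(4):
    cb.record(epoch=1, toxic=True)  # 4/4 toxic -> rate 1.0 > 0.3
print(cb.is_frozen(epoch=3))  # frozen until epoch 1 + 5 = 6
```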

3. Random Audits

Probabilistic audits catch deceptive agents even during their honest phase:

config = GovernanceConfig(
    audit_probability=0.05,  # audit 5% of interactions
    audit_penalty=0.5,       # penalty for failed audits
)
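Why a small audit_probability still works: over n interactions, the chance a deceptive agent is audited at least once is 1 − (1 − q)^n. With q = 0.05, fifty interactions give better than a 90% chance of at least one audit:

```python
def p_caught_at_least_once(q, n):
    """Probability of >= 1 audit in n interactions, audit prob q each."""
    return 1 - (1 - q) ** n

print(p_caught_at_least_once(0.05, 10))  # ~0.40
print(p_caught_at_least_once(0.05, 50))  # ~0.92
```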

4. Collusion Detection

When multiple deceptive agents coordinate, collusion detection identifies correlated exploitation patterns.
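A minimal version of this idea (an illustration, not the SWARM detector): compare per-epoch harm series across agent pairs and flag pairs whose exploitation moves together:

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def flag_colluders(harm_by_agent, min_corr=0.9):
    """Flag agent pairs whose per-epoch harm rates are highly correlated."""
    agents = sorted(harm_by_agent)
    return [
        (a, b)
        for i, a in enumerate(agents)
        for b in agents[i + 1:]
        if pearson(harm_by_agent[a], harm_by_agent[b]) >= min_corr
    ]

# Per-epoch harm rate (1 - mean p) per agent; data layout is assumed
harm = {
    "dec_1": [0.1, 0.1, 0.7, 0.8, 0.9],  # exploits from epoch 3
    "dec_2": [0.1, 0.2, 0.7, 0.9, 0.8],  # same pattern: likely colluding
    "coop":  [0.1, 0.1, 0.1, 0.2, 0.1],  # stays benign
}
print(flag_colluders(harm))  # [('dec_1', 'dec_2')]
```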

Key Research Findings

Our experiments have revealed several important results about deception in multi-agent systems:

Deception is structural, not a sampling artifact. A 120-run temperature sweep found that signal-action divergence persists even at temperature 0.0. Deterministic decoding does not eliminate deception.

LLMs are more deceptive than scripted agents. A 100-run comparison found that LLM agents exhibit 2x higher signal-action divergence than scripted baselines.

Prompt framing matters. Deontological framing reduces deception by 95% — from divergence of 1.151 to 0.057. But it also reduces strategic flexibility.

Model size creates a deception-escalation tradeoff. Larger models are less deceptive but escalate more when they do act. Small models (8B) show high deception but low escalation; large models (405B) show the reverse.

Governance Countermeasures

No single mechanism eliminates deception. Defense-in-depth works:

| Mechanism           | What it catches                       | Limitation                           |
|---------------------|---------------------------------------|--------------------------------------|
| Circuit breaker     | Active exploitation                   | Doesn't catch trust-building phase   |
| Random audit        | Deception in any phase                | Probabilistic — can miss             |
| Reputation decay    | Prevents indefinite trust accumulation | Punishes honest agents too          |
| Staking             | Creates skin-in-the-game cost         | Requires economic design             |
| Collusion detection | Coordinated deception                 | Needs sufficient interaction history |
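In practice these mechanisms are layered. Combining the circuit-breaker and audit settings shown earlier into one configuration (using only the fields shown on this page; other mechanisms have their own settings):

```python
from swarm.governance import GovernanceConfig

config = GovernanceConfig(
    # Circuit breaker: catches active exploitation
    circuit_breaker_threshold=0.3,
    circuit_breaker_window=10,
    circuit_breaker_cooldown=5,
    # Random audits: catch deception in any phase, probabilistically
    audit_probability=0.05,
    audit_penalty=0.5,
)
```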

The red-team attack library tests all 8 standard attack patterns against any governance configuration.


How to cite

SWARM Team. "Deception Detection in Multi-Agent AI." swarm-ai.org/concepts/deception/, 2026. Based on arXiv:2512.16856.