Skip to content

Distributional AI Safety

Distributional AI safety is a research paradigm that studies how risks emerge from populations of interacting agents rather than from any single model. It shifts the unit of analysis from "is this agent aligned?" to "is this ecosystem healthy?"

SWARM measures ecosystem health using soft labels: each interaction gets a probability p = P(v = +1) ∈ [0,1] of being beneficial. From p, four metrics detect failure modes: toxicity rate E[1-p | accepted] (harm getting through), quality gap E[p | accepted] - E[p | rejected] (selection quality), conditional loss (value creation/destruction), and incoherence index (decision stability).

Why Distributional?

Traditional AI safety focuses on individual agents: alignment, reward hacking, deceptive alignment. These are real problems. But they miss a class of failures that only appear at the population level:

Individual-level safety Distributional safety
Is this agent aligned? Is this ecosystem healthy?
Does this agent deceive? Does the system select for deception?
Will this agent cause harm? Do interaction patterns amplify harm?
Can we control this agent? Can governance mechanisms keep pace?

The core insight: AGI-level risks don't require AGI-level agents. Catastrophic failures can emerge from the interaction of many sub-AGI agents — even when none are individually dangerous.

The Four Failure Modes

Distributional safety identifies four interaction-level risks that traditional safety misses:

1. Information Asymmetry

When agents have unequal access to information, markets for cooperation break down. The better-informed party can exploit the gap, creating a market for lemons where high-quality agents exit.

2. Adverse Selection

The system preferentially admits lower-quality interactions than it rejects. This is measured by the quality gap — when it goes negative, governance is selecting for harm.

3. Variance Amplification

Small per-interaction risks compound across thousands of interactions. A 5% chance of harm per interaction becomes near-certainty across a population. Soft probabilistic labels capture this uncertainty where binary labels hide it.

4. Governance Latency

Safety mechanisms react slower than the agents they govern. By the time a circuit breaker triggers, damage has already propagated. This creates a fundamental tension between responsiveness and stability.

How SWARM Implements It

SWARM is the reference implementation of the distributional AGI safety framework. It provides:

Measurement — Four key metrics that capture distributional health:

  • Toxicity rate: Expected harm among accepted interactions
  • Quality gap: Whether governance selects for quality (negative = adverse selection)
  • Conditional loss: Payoff effect of selection
  • Incoherence index: Decision variance across replays

Governance — Configurable mechanisms that operate at the population level:

  • Transaction taxes, circuit breakers, reputation decay
  • Staking, random audits, collusion detection
  • Adaptive governance that tunes itself

ValidationRed-team attacks, parameter sweeps, and reproducibility checks.

Getting Started

pip install swarm-safety
from swarm.core.orchestrator import Orchestrator, OrchestratorConfig
from swarm.agents.honest import HonestAgent
from swarm.agents.deceptive import DeceptiveAgent

config = OrchestratorConfig(n_epochs=10, steps_per_epoch=10, seed=42)
orch = Orchestrator(config=config)
orch.register_agent(HonestAgent(agent_id="h1"))
orch.register_agent(DeceptiveAgent(agent_id="d1"))

metrics = orch.run()
print(f"Toxicity: {metrics[-1].toxicity_rate:.3f}")
print(f"Quality gap: {metrics[-1].quality_gap:+.3f}")

Academic Context

The distributional safety framework was introduced in Distributional Safety in Agentic Systems (arXiv, 2025). It draws on:

  • Market microstructure theory — Akerlof (1970), Kyle (1985), Glosten-Milgrom (1985)
  • Mechanism design — How governance rules shape agent incentives
  • Evolutionary game theory — Population dynamics under selection pressure
  • Bayesian inferenceSoft labels as posterior probabilities

See also


How to cite

SWARM Team. "Distributional Safety in Multi-Agent Systems." swarm-ai.org/concepts/distributional-safety/, 2026. Based on arXiv:2512.16856.