Distributional AI Safety

Q: What is distributional AI safety?

Distributional safety studies how risks emerge from populations of interacting agents rather than from any single model. The core insight: AGI-level risks don't require AGI-level agents — catastrophic failures can emerge from many sub-AGI systems interacting.

Q: What is adverse selection in multi-agent AI systems?

Adverse selection occurs when the system admits interactions of lower quality than those it rejects, which shows up as a negative quality gap. It is self-reinforcing: low quality → agent exit → worse pool → lower quality.

Q: What causes governance latency in AI systems?

Governance latency is the delay between a safety problem emerging and governance mechanisms responding. Safety mechanisms react more slowly than the agents they govern, so damage propagates before circuit breakers or audits take effect.

Distributional AI safety is a research paradigm that studies how risks emerge from populations of interacting agents rather than from any single model. It shifts the unit of analysis from "is this agent aligned?" to "is this ecosystem healthy?"

SWARM measures ecosystem health using soft labels: each interaction gets a probability p = P(v = +1) ∈ [0,1] of being beneficial. From p, four metrics detect failure modes: toxicity rate E[1-p | accepted] (harm getting through), quality gap E[p | accepted] - E[p | rejected] (selection quality), conditional loss (value creation/destruction), and incoherence index (decision stability).
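The two metrics with explicit formulas above can be sketched directly from their definitions. This is a minimal illustration, not SWARM's implementation: the `Interaction` class, the function names, and the choice to return 0.0 for an empty group are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Interaction:
    p: float        # soft label: P(v = +1), probability the interaction is beneficial
    accepted: bool  # whether governance admitted the interaction


def toxicity_rate(interactions):
    """E[1 - p | accepted]: expected harm among admitted interactions."""
    harm = [1.0 - i.p for i in interactions if i.accepted]
    return mean(harm) if harm else 0.0  # assumption: report 0 when nothing was accepted


def quality_gap(interactions):
    """E[p | accepted] - E[p | rejected]; negative values indicate adverse selection."""
    accepted = [i.p for i in interactions if i.accepted]
    rejected = [i.p for i in interactions if not i.accepted]
    if not accepted or not rejected:
        return 0.0  # assumption: the gap needs both groups; report 0 otherwise
    return mean(accepted) - mean(rejected)


# Hypothetical sample: governance accepts the two higher-quality interactions.
sample = [
    Interaction(p=0.9, accepted=True),
    Interaction(p=0.7, accepted=True),
    Interaction(p=0.2, accepted=False),
    Interaction(p=0.4, accepted=False),
]
print(f"Toxicity: {toxicity_rate(sample):.3f}")
print(f"Quality gap: {quality_gap(sample):+.3f}")
```

In this sample the gap is positive because the accepted interactions have higher p than the rejected ones. Swapping the `accepted` flags would make it negative, which is the adverse-selection signal.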

Why Distributional?

Traditional AI safety focuses on individual agents: alignment, reward hacking, deceptive alignment. These are real problems. But they miss a class of failures that only appear at the population level:

Individual-level safety	Distributional safety
Is this agent aligned?	Is this ecosystem healthy?
Does this agent deceive?	Does the system select for deception?
Will this agent cause harm?	Do interaction patterns amplify harm?
Can we control this agent?	Can governance mechanisms keep pace?

The core insight: AGI-level risks don't require AGI-level agents. Catastrophic failures can emerge from the interaction of many sub-AGI agents — even when none are individually dangerous.

The Four Failure Modes

Distributional safety identifies four interaction-level risks that traditional safety misses:

1. Information Asymmetry

When agents have unequal access to information, markets for cooperation break down. The better-informed party can exploit the gap, creating a market for lemons where high-quality agents exit.
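The exit dynamic can be illustrated with a toy model. The sketch below is not part of SWARM: it assumes counterparties cannot observe individual quality, so every agent is paid the pool's mean quality, and any agent whose own quality exceeds that payment leaves. The function name `unravel` and the sample qualities are hypothetical.

```python
from statistics import mean


def unravel(qualities, rounds=5):
    """Iterate a lemons-style exit dynamic; return the pool's mean quality per round.

    Toy assumption: each agent is paid the current pool mean, and agents whose
    own quality is above that payment exit before the next round.
    """
    pool = list(qualities)
    history = [mean(pool)]
    for _ in range(rounds):
        offer = mean(pool)
        stayers = [q for q in pool if q <= offer]
        if len(stayers) == len(pool):
            break  # nobody wants to leave, so the pool is stable
        pool = stayers
        history.append(mean(pool))
    return history


history = unravel([0.125, 0.25, 0.5, 0.75, 1.0])
print([round(h, 3) for h in history])
```

Each round the best remaining agents leave, the mean falls, and the next tier finds the payment unattractive. In this example the pool shrinks until only the lowest-quality agent remains.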

2. Adverse Selection

The system admits interactions of lower quality than those it rejects. This is measured by the quality gap, E[p | accepted] - E[p | rejected]: when it goes negative, governance is selecting for harm.

3. Variance Amplification

Small per-interaction risks compound across thousands of interactions. With a 5% chance of harm per interaction, at least one harmful outcome becomes a near-certainty once enough interactions accumulate across a population. Soft probabilistic labels capture this uncertainty where binary labels hide it.
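The compounding can be made concrete with a short calculation. The sketch assumes interactions are independent, which the text does not state and which real agent populations may violate. Under that assumption the probability of at least one harmful interaction among n is 1 - (1 - p_harm)^n. The function name is chosen for the example.

```python
def prob_at_least_one_harm(p_harm: float, n: int) -> float:
    """Probability that at least one of n independent interactions is harmful."""
    return 1.0 - (1.0 - p_harm) ** n


# A 5% per-interaction risk, evaluated at growing interaction counts.
for n in (1, 10, 50, 100):
    print(f"n={n:>3}: {prob_at_least_one_harm(0.05, n):.3f}")
```

With a 5% per-interaction risk, the probability passes 99% at n = 90, which is far fewer than the thousands of interactions a population generates.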

4. Governance Latency

Safety mechanisms react more slowly than the agents they govern. By the time a circuit breaker triggers, damage has already propagated. This creates a fundamental tension between responsiveness and stability: a mechanism tuned to react faster acts on less evidence and is more likely to trigger on noise.
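A toy calculation shows how latency turns into extra damage. This is an illustrative sketch, not SWARM's circuit breaker: it assumes the breaker decides to trip once cumulative harm reaches a threshold, and that the halt takes effect a fixed number of steps later. The function and parameter names are hypothetical.

```python
def harm_before_halt(harm_per_step, threshold, latency_steps):
    """Cumulative harm at the step where a delayed circuit breaker halts activity.

    The breaker decides to trip when cumulative harm first reaches `threshold`,
    but the halt only takes effect `latency_steps` steps after that decision.
    """
    cumulative = 0.0
    decided_at = None
    for step, harm in enumerate(harm_per_step):
        cumulative += harm
        if decided_at is None and cumulative >= threshold:
            decided_at = step
        if decided_at is not None and step - decided_at >= latency_steps:
            return cumulative  # the halt takes effect here
    return cumulative  # the breaker never took effect within the horizon


harm = [0.5] * 20  # constant harm of 0.5 per step
instant = harm_before_halt(harm, threshold=2.0, latency_steps=0)
delayed = harm_before_halt(harm, threshold=2.0, latency_steps=5)
print(f"No latency: {instant:.1f}, five-step latency: {delayed:.1f}")
```

The difference between the two results is the harm that accrues while the mechanism is still reacting. In this model it grows linearly with latency.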

How SWARM Implements It

SWARM is the reference implementation of the distributional AGI safety framework. It provides:

Measurement — Four key metrics that capture distributional health:

Toxicity rate: Expected harm among accepted interactions
Quality gap: Whether governance selects for quality (negative = adverse selection)
Conditional loss: Payoff effect of selection
Incoherence index: Decision variance across replays

Governance — Configurable mechanisms that operate at the population level:

Transaction taxes, circuit breakers, reputation decay
Staking, random audits, collusion detection
Adaptive governance that tunes itself
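As an illustration of one mechanism from the list, reputation decay can be modelled as an exponential moving average over soft labels. This is one plausible reading, not SWARM's actual rule or API: the function name `update_reputation` and the default `decay` value are assumptions for the example.

```python
def update_reputation(reputation: float, p: float, decay: float = 0.9) -> float:
    """Blend the previous reputation with the newest soft label p.

    Hypothetical rule: an exponential moving average, so older evidence
    fades geometrically at rate `decay`.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be in [0, 1]")
    return decay * reputation + (1.0 - decay) * p


reputation = 1.0
for p in [0.0] * 5:  # five interactions labelled as certainly not beneficial
    reputation = update_reputation(reputation, p)
print(f"Reputation after five harmful interactions: {reputation:.3f}")
```

A higher `decay` makes reputation more stable but slower to reflect a change in behavior, which is the same responsiveness-versus-stability trade-off described under governance latency.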

Validation — Red-team attacks, parameter sweeps, and reproducibility checks.

Getting Started

pip install swarm-safety

from swarm.core.orchestrator import Orchestrator, OrchestratorConfig
from swarm.agents.honest import HonestAgent
from swarm.agents.deceptive import DeceptiveAgent

# 10 epochs of 10 steps each, with a fixed seed
config = OrchestratorConfig(n_epochs=10, steps_per_epoch=10, seed=42)
orch = Orchestrator(config=config)

# Put one honest and one deceptive agent in the same population
orch.register_agent(HonestAgent(agent_id="h1"))
orch.register_agent(DeceptiveAgent(agent_id="d1"))

# run() returns a sequence of metrics; report the last entry
metrics = orch.run()
print(f"Toxicity: {metrics[-1].toxicity_rate:.3f}")
print(f"Quality gap: {metrics[-1].quality_gap:+.3f}")

Academic Context

The distributional safety framework was introduced in Distributional Safety in Agentic Systems (arXiv, 2025). It draws on:

Market microstructure theory — Akerlof (1970), Kyle (1985), Glosten-Milgrom (1985)
Mechanism design — How governance rules shape agent incentives
Evolutionary game theory — Population dynamics under selection pressure
Bayesian inference — Soft labels as posterior probabilities