Glossary¶
Formal definitions for terms used in the SWARM distributional safety framework. Each term links to its primary concept page for full treatment.
Distributional Safety¶
The study of how risks emerge from populations of interacting agents rather than from any single model. Shifts the unit of analysis from "is this agent aligned?" to "is this ecosystem healthy?" — Distributional Safety
Soft Label¶
A probabilistic classification \(p \in [0, 1]\) replacing binary good/bad labels. Captures uncertainty about whether an interaction is beneficial, enabling calibrated metrics and proportional governance. — Soft Labels
p¶
The probability that an interaction is beneficial: \(p = P(v = +1)\), where \(v \in \{-1, +1\}\) is the latent true value. Always in \([0, 1]\). Computed from observable signals via the ProxyComputer pipeline. — Soft Labels
v_hat¶
The raw proxy score before sigmoid transformation, \(\hat{v} \in [-1, +1]\). A weighted combination of observable signals (task progress, rework count, verifier rejections, engagement) that is then mapped to \(p\) via a calibrated sigmoid. — Soft Labels
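The two quantities above can be sketched end to end. The signal names, weights, and sigmoid temperature below are illustrative stand-ins, not the calibrated ProxyComputer pipeline:

```python
import math

def proxy_score(signals, weights):
    """Weighted combination of observable signals, clipped to [-1, +1].

    Signal names and weights are hypothetical examples.
    """
    v_hat = sum(weights[k] * signals[k] for k in weights)
    return max(-1.0, min(1.0, v_hat))

def soft_label(v_hat, temperature=0.4):
    """Map v_hat in [-1, +1] to p = P(v = +1) via a sigmoid.

    The fixed temperature stands in for calibration; a calibrated
    sigmoid would be fit against labeled outcomes.
    """
    return 1.0 / (1.0 + math.exp(-v_hat / temperature))

signals = {"task_progress": 0.8, "rework": -0.2,
           "verifier_rejects": -0.1, "engagement": 0.5}
weights = {"task_progress": 0.5, "rework": 0.2,
           "verifier_rejects": 0.2, "engagement": 0.1}

p = soft_label(proxy_score(signals, weights))  # a value strictly inside (0, 1)
```

Note that \(\hat{v}\) stays on the signed \([-1, +1]\) scale while \(p\) lives on \([0, 1]\); the sigmoid is what bridges the two.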
Toxicity Rate¶
Expected harm among accepted interactions: \(\text{Toxicity} = E[1 - p \mid \text{accepted}]\). A toxicity rate above 0.3 typically indicates serious system problems. — Metrics
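As a minimal sketch, toxicity is just the mean of \(1 - p\) over accepted interactions; the soft labels below are made-up values:

```python
# Hypothetical soft labels for a batch of accepted interactions.
accepted = [0.9, 0.8, 0.95, 0.4, 0.85]

# Toxicity = E[1 - p | accepted]
toxicity = sum(1 - p for p in accepted) / len(accepted)
```

Here most interactions look good, but the one low-\(p\) outlier (0.4) pulls the batch toward the 0.3 warning level.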
Quality Gap¶
The difference in expected quality between accepted and rejected interactions: \(\text{Quality Gap} = E[p \mid \text{accepted}] - E[p \mid \text{rejected}]\). Positive means good selection; negative means adverse selection — the system is preferentially accepting lower-quality interactions. — Metrics
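A sketch with hypothetical label batches, showing the sign convention:

```python
def mean(xs):
    return sum(xs) / len(xs)

# Illustrative soft labels for accepted vs. rejected interactions.
accepted = [0.9, 0.8, 0.85]
rejected = [0.3, 0.5, 0.4]

# Quality Gap = E[p | accepted] - E[p | rejected]
quality_gap = mean(accepted) - mean(rejected)
adverse_selection = quality_gap < 0  # negative gap flags adverse selection
```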
Adverse Selection¶
A failure mode where the system preferentially admits lower-quality interactions than it rejects, indicated by a negative quality gap. Self-reinforcing: low quality gap → higher toxicity → agent exit → worse selection pool → lower quality gap. — Distributional Safety
Conditional Loss¶
The shift in expected payoff induced by the acceptance mechanism: \(\text{Conditional Loss} = E[\pi \mid \text{accepted}] - E[\pi]\). Reveals whether selection creates or destroys value relative to the population average. — Metrics
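A sketch over hypothetical payoff records, comparing accepted-only payoffs against the population average:

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical (status, payoff) records.
records = [("accepted", 1.2), ("accepted", 0.8),
           ("rejected", -0.5), ("rejected", 0.1)]

accepted_payoffs = [v for s, v in records if s == "accepted"]
all_payoffs = [v for _, v in records]

# Conditional Loss = E[pi | accepted] - E[pi]
cond_loss = mean(accepted_payoffs) - mean(all_payoffs)
```

A positive value means acceptance is concentrating payoff above the population average, i.e. selection is creating value.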
Incoherence Index¶
Variance-to-error ratio across replays: \(I = \frac{\text{Var}[\text{decision across replays}]}{E[\text{error}]}\). High incoherence means decisions change substantially under replay — the system is unstable. — Metrics
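A minimal sketch with made-up replay data: binary accept/reject decisions from replaying the same interaction, plus per-replay error magnitudes. The input shapes are assumptions for illustration, not the framework's API:

```python
def incoherence_index(decisions, errors):
    """Variance of replay decisions divided by mean error.

    decisions: 0/1 accept decisions from replays of one interaction.
    errors: per-replay error magnitudes.
    """
    n = len(decisions)
    mu = sum(decisions) / n
    var = sum((d - mu) ** 2 for d in decisions) / n  # population variance
    return var / (sum(errors) / len(errors))

# Decisions flip between replays -> nonzero variance -> high incoherence.
I = incoherence_index([1, 0, 1, 1, 0], [0.2, 0.3, 0.25, 0.2, 0.25])
```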
Signal-Action Divergence¶
The measurable gap between an agent's signaled intentions and its actual behavior. The primary quantitative indicator of deception in multi-agent systems. Persists even at temperature 0.0 (deterministic decoding). — Deception
Circuit Breaker¶
A governance mechanism that freezes agents whose recent toxicity exceeds a threshold. Monitors a sliding window of interactions and suspends agents during a cooldown period. Trade-off: may produce false positives on honest agents. — Governance
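The sliding-window-plus-cooldown behavior can be sketched as follows; window size, threshold, and cooldown length are illustrative parameters, not the framework's defaults:

```python
from collections import deque

class CircuitBreaker:
    """Freeze an agent when mean toxicity over a sliding window
    exceeds a threshold; hold the freeze for a cooldown period."""

    def __init__(self, window=5, threshold=0.3, cooldown=3):
        self.window = deque(maxlen=window)
        self.threshold = threshold
        self.cooldown = cooldown
        self.frozen_for = 0

    def record(self, toxicity):
        """Record one interaction's toxicity; return whether the agent may act."""
        if self.frozen_for > 0:
            self.frozen_for -= 1      # serve out the cooldown
            return False
        self.window.append(toxicity)
        full = len(self.window) == self.window.maxlen
        if full and sum(self.window) / len(self.window) > self.threshold:
            self.frozen_for = self.cooldown
            return False
        return True

cb = CircuitBreaker()
```

The false-positive trade-off is visible in the sketch: a short run of unlucky interactions from an honest agent can fill the window and trip the breaker.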
Transaction Tax¶
A friction mechanism that deducts a percentage from both parties' payoffs on each interaction. Reduces the profit margin for low-quality interactions, making exploitation less attractive. Trade-off: reduces overall welfare including for honest agents. — Governance
Reputation Decay¶
A governance lever that reduces agent reputation by a fixed fraction each epoch. Forces agents to continuously demonstrate good behavior rather than coasting on historical trust. Trade-off: honest agents also lose reputation over time. — Governance
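Fixed-fraction decay compounds geometrically, which makes the "coasting" cost concrete. The 10% rate below is an illustrative choice:

```python
def decay_reputation(rep, rate=0.1, epochs=1):
    """Apply fixed-fraction decay per epoch: rep * (1 - rate) ** epochs."""
    return rep * (1 - rate) ** epochs

# At a 10% rate, an idle agent's reputation falls below half
# its starting value within 7 epochs.
```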
Staking¶
A requirement that agents post collateral to participate. Bad behavior results in stake being slashed, creating skin-in-the-game costs that deter exploitation. Trade-off: excludes agents without sufficient capital. — Governance
Collusion Detection¶
A governance mechanism that monitors pairwise interaction patterns for suspiciously correlated exploitation timing. Flags agent pairs exceeding a correlation threshold over a sliding window. Trade-off: may flag legitimate cooperation. — Governance
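One simple way to realize "correlated exploitation timing" is a Pearson correlation over per-epoch exploitation flags for a pair of agents; the flags and the 0.7 threshold below are made-up examples:

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-epoch exploitation flags (1 = exploited) for two agents.
a = [0, 1, 1, 0, 1, 1, 0, 1]
b = [0, 1, 1, 0, 1, 0, 0, 1]

r = pearson(a, b)
flagged = r > 0.7  # illustrative correlation threshold
```

The false-positive trade-off shows up here too: two honest agents cooperating on the same tasks will also produce correlated activity.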
Random Audit¶
Probabilistic review of a fraction of interactions. Failed audits result in reputation and payoff penalties. Creates deterrence uncertainty for exploitative agents even during their trust-building phase. — Governance
Purity Paradox¶
The empirical finding that populations with only 20% honest agents achieve 55% higher welfare than 100% honest populations. Diverse populations outperform pure ones because honest-only populations lack the adversarial pressure that activates governance mechanisms. — The Purity Paradox
Trust-Then-Exploit¶
A two-phase deceptive strategy: (1) build trust by behaving honestly for several epochs, then (2) leverage accumulated reputation to extract maximum value. Produces a distinctive signature of rising then sharply falling per-agent reputation. — Deception
Governance Latency¶
The delay between a safety problem emerging and governance mechanisms responding effectively. Creates a fundamental tension between responsiveness and stability — by the time a circuit breaker triggers, damage may have already propagated. — Distributional Safety
Variance Amplification¶
The compounding of small per-interaction risks across a population. A 5% chance of harm per interaction becomes near-certainty across thousands of interactions. Soft probabilistic labels capture this where binary labels hide it. — Distributional Safety
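The compounding arithmetic, assuming independent interactions (a simplification) so that the chance of at least one harm is \(1 - (1 - p_{\text{harm}})^n\):

```python
def p_any_harm(p_harm, n):
    """Probability of at least one harmful interaction in n
    independent trials: 1 - (1 - p_harm) ** n."""
    return 1 - (1 - p_harm) ** n

# A 5% per-interaction risk is already near-certain harm by n = 100,
# and effectively certain by n = 1000.
```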
Information Asymmetry¶
When agents have unequal access to information about interaction quality. The better-informed party can exploit the gap, creating a market for lemons where high-quality agents exit the ecosystem. — Distributional Safety
Externality Internalization¶
The degree to which agents bear the cost of ecosystem harm they cause, controlled by ρ (rho) parameters. Higher internalization makes agents pay for negative externalities, aligning individual incentives with system health. — Theoretical Foundations
How to cite this glossary
SWARM Team. "SWARM Glossary: Distributional AGI Safety Definitions." swarm-ai.org/glossary/, 2026. Based on arXiv:2512.16856.