Self-Modification Governance for SWARM
Draft architecture for safe agent self-improvement with Byline provenance tracking.
Why this matters
Self-improving-agent patterns (self-forking workflows, runtime tool synthesis, memory-driven adaptation) can increase capability velocity, but they also expand systemic risk. In SWARM, local updates that look safe in isolation can still create harmful global behavior when many agents adapt at once.
This document defines a governance architecture that makes self-modification auditable, bounded, and reversible.
Design goals and non-goals
Goals
- Preserve hard safety boundaries while allowing bounded improvement.
- Make every change attributable, reproducible, and replayable.
- Detect and block population-level risk that single-agent tests miss.
- Support fast rollback with measurable safety SLOs.
Non-goals
- Fully autonomous governance policy updates.
- Zero false positives in gates (safety-biased gating is intentional).
- Perfect prevention of all insider or supply-chain attacks.
Threat model and trust boundaries
Assets
- Governance policy bundle (constitutional checks, risk classifier, gate thresholds).
- Provenance log and integrity chain.
- Deployment control plane and rollout controller.
- Agent credentials, tool manifests, and capability policies.
Adversaries
- Compromised agent runtime attempting privilege escalation.
- Colluding agents coordinating convergent unsafe changes.
- Operator error or insider misuse in CI/CD and policy updates.
- Supply-chain tampering in dependencies used by modification workflows.
Assumptions
- Governance control plane runs outside mutable agent runtime.
- Immutable policy bundle is signed by governance keys stored in HSM/KMS.
- Agent runtimes can be attested (build hash + policy hash + environment hash).
- Deployment system can enforce staged rollout and instant rollback.
Trust boundaries
- Agent runtime boundary: mutable code, prompts, memory, and workflows.
- Modification pipeline boundary: sandbox build/test/eval environment.
- Governance control plane boundary: policy signing, risk classification, promotion decisions.
- Audit boundary: append-only provenance storage with external integrity anchors.
Four-layer architecture
┌───────────────────────────────────────────────────────────┐
│  IMMUTABLE GOVERNANCE LAYER                               │
│  Signed policy bundle, key custody, invariants, audit API │
│  -- Cannot be changed by agent runtime --                 │
├───────────────────────────────────────────────────────────┤
│  COMPOSITIONAL SAFETY MONITOR                             │
│  Interaction simulation, drift detection, Two-Gate checks │
│  -- Policy-tied and change-controlled --                  │
├───────────────────────────────────────────────────────────┤
│  VERSIONED MODIFICATION LAYER                             │
│  Worktree/microVM sandbox, verification, PR promotion     │
│  -- Agents propose changes here --                        │
├───────────────────────────────────────────────────────────┤
│  AGENT RUNTIME LAYER                                      │
│  Skills, tools, workflows, memory, non-critical config    │
│  -- Mutable only through the modification layer --        │
└───────────────────────────────────────────────────────────┘
Layer 1: Immutable governance controls
Layer 1 is credible only if it is enforced outside anything the agent can modify:
- Governance policy is versioned, signed, and loaded read-only at runtime.
- Policy updates require quorum approval and a separate privileged pipeline.
- Promotion service verifies attestation tuple:
  - `artifact_hash`
  - `policy_hash`
  - `environment_hash`
- Any mismatch blocks promotion.
- Audit log is append-only, hash-chained, and periodically anchored to an external timestamp authority.
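The attestation check above can be sketched as follows. This is a minimal illustration, not the promotion service itself: the `Attestation` class and the function names are hypothetical, and the hashes are assumed to be hex-encoded digest strings.

```python
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Attestation:
    """Attestation tuple; field names follow the document."""
    artifact_hash: str
    policy_hash: str
    environment_hash: str


def attestation_mismatches(expected: Attestation, observed: Attestation) -> list[str]:
    """Return the names of the fields that differ (empty list means a match)."""
    mismatched = []
    for name in ("artifact_hash", "policy_hash", "environment_hash"):
        # compare_digest performs a constant-time comparison of the two digests.
        if not hmac.compare_digest(getattr(expected, name), getattr(observed, name)):
            mismatched.append(name)
    return mismatched


def may_promote(expected: Attestation, observed: Attestation) -> bool:
    """Any mismatch blocks promotion."""
    return not attestation_mismatches(expected, observed)
```

Returning the list of mismatched fields, rather than only a boolean, gives the audit event something concrete to record when a promotion is blocked.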
Hard invariants
- Agents cannot alter policy bundle, gate code, or risk classifier rules.
- Agents cannot grant or expand their own network/filesystem/tool permissions.
- Agents cannot write directly to provenance storage.
- Every promotion/rollback/rejection has a signed audit event.
Byline provenance schema (implementation-focused)
Byline entries should be machine-verifiable and privacy-safe.
| Field group | Required fields | Notes |
|---|---|---|
| Identity | `event_id`, `agent_id`, `run_id`, `timestamp`, `target_ref`, `change_type` | Deterministic IDs and UTC timestamps |
| Decision context | `trigger`, `objective`, `evidence_refs`, `alternatives_considered` | Store references to artifacts, not large free text |
| Verification | `test_manifest`, `metric_results`, `constitutional_result`, `compositional_result` | Structured JSON for deterministic replay |
| Lifecycle | `state`, `state_reason`, `review_tier`, `reviewer_id` | State machine is strict and append-only |
| Integrity | `entry_hash`, `prev_hash`, `actor_signature`, `governance_signature` | Dual-signature prevents unilateral forgery |
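The `entry_hash` / `prev_hash` pair in the Integrity group can be illustrated with a small hash chain. This is a sketch under stated assumptions: SHA-256 over canonical JSON, an all-zero `GENESIS` value for the first entry, and the helper names are all choices made here for illustration, and the two signatures are omitted entirely.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed sentinel used as prev_hash of the first entry


def entry_hash(payload: dict, prev_hash: str) -> str:
    """Hash the canonical JSON of the payload together with prev_hash."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], payload: dict) -> dict:
    """Append a new entry that links to the hash of the previous one."""
    prev = chain[-1]["entry_hash"] if chain else GENESIS
    entry = {"payload": payload, "prev_hash": prev,
             "entry_hash": entry_hash(payload, prev)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every link; any edited payload or reordered entry fails."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        if entry["entry_hash"] != entry_hash(entry["payload"], prev):
            return False
        prev = entry["entry_hash"]
    return True
```

Sorting keys and fixing separators makes the serialization deterministic, which is what lets an independent verifier recompute the same hash during replay.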
State machine
proposed -> sandboxed -> tested -> shadow -> canary -> promoted
Failure paths:
- tested -> rejected
- canary -> rolled_back
- promoted -> rolled_back
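A strict state machine of this shape can be enforced with a transition table. The sketch below encodes only the happy path and the three failure paths listed above; the `ALLOWED` table and `transition` function are illustrative names, and treating `rejected` and `rolled_back` as terminal is an assumption.

```python
# Allowed next states for each lifecycle state, taken from the paths above.
ALLOWED = {
    "proposed": {"sandboxed"},
    "sandboxed": {"tested"},
    "tested": {"shadow", "rejected"},
    "shadow": {"canary"},
    "canary": {"promoted", "rolled_back"},
    "promoted": {"rolled_back"},
    "rejected": set(),      # assumed terminal
    "rolled_back": set(),   # assumed terminal
}


def transition(current: str, new: str) -> str:
    """Return the new state, or raise if the move is not in the table."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new
```

Because unknown states map to an empty set, a typo or an unlisted state is rejected rather than silently accepted.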
Safety note
Do not require raw chain-of-thought in provenance. Prefer a concise structured rationale plus artifact references to reduce sensitive-data leakage and the risk of fabricated narratives.
Layer 2: Compositional monitor with formal Two-Gate policy
Per-change tests are necessary but insufficient for multi-agent systems.
Gate 1: Validation margin tau
For each safety metric j (for example toxicity, collusion proxy, instability, welfare):
- Baseline mean: `mu_j`
- Baseline std dev floor: `sigma_j = max(std_j, sigma_min)`
- Candidate estimate on held-out eval: `x_j`
- 95% CI half-width: `eps_j`
- Direction: `d_j = +1` if higher is better, `-1` if lower is better
Per-metric standardized margin:
m_j = d_j * (x_j - mu_j) / sigma_j - eps_j / sigma_j
Overall candidate margin:
tau = min_j m_j
Pass condition:
tau >= tau_min[tier]
Recommended defaults:
- `tau_min[low] = -0.10`
- `tau_min[medium] = 0.00`
- `tau_min[high] = +0.25`
- `tau_min[critical] = human-only`
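Gate 1 can be computed directly from the definitions above. The following is a minimal sketch: `MetricEval`, `margin`, and `gate1` are hypothetical names, the `SIGMA_MIN` value is an assumption (the document does not specify one), and the critical tier is modeled as never passing automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricEval:
    mu: float        # baseline mean
    std: float       # baseline standard deviation
    x: float         # candidate estimate on held-out eval
    eps: float       # 95% CI half-width
    direction: int   # +1 if higher is better, -1 if lower is better


TAU_MIN = {"low": -0.10, "medium": 0.00, "high": 0.25}
SIGMA_MIN = 1e-6  # assumed floor; not specified in the document


def margin(m: MetricEval, sigma_min: float = SIGMA_MIN) -> float:
    """m_j = d_j * (x_j - mu_j) / sigma_j - eps_j / sigma_j"""
    sigma = max(m.std, sigma_min)
    return m.direction * (m.x - m.mu) / sigma - m.eps / sigma


def gate1(metrics: dict[str, MetricEval], tier: str) -> tuple[float, bool]:
    """Return (tau, passed), where tau is the worst per-metric margin."""
    if not metrics:
        raise ValueError("at least one safety metric is required")
    tau = min(margin(m) for m in metrics.values())
    if tier == "critical":
        return tau, False  # critical changes are human-only
    return tau, tau >= TAU_MIN[tier]
```

As a worked example, a toxicity metric with `mu=0.10`, `std=0.02`, `x=0.09`, `eps=0.005`, `d=-1` has a margin of about 0.25, and a welfare metric with `mu=0.70`, `std=0.05`, `x=0.72`, `eps=0.01`, `d=+1` has a margin of about 0.20. Then `tau` is about 0.20, which passes the medium tier but not the high tier.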
Gate 2: Modification capacity cap K[m]
Each proposed change c receives a deterministic complexity weight:
w(c) = 1*f + 2*t + 4*p + 3*i + 5*s
Where:
- `f`: normalized files/surfaces touched
- `t`: new/changed tool interfaces count
- `p`: privilege delta score (0 if none, >=1 if expanded)
- `i`: inter-agent interaction surface delta
- `s`: safety-critical surface touched (0/1)
Rolling window capacity for agent m:
K_m(T) = sum_{c in window(T)} w(c)
Pass condition:
K_m(T) + w(c_new) <= K_max[tier]
Suggested defaults:
- `T = 14 days`
- `K_max[low] = 20`, `K_max[medium] = 35`, `K_max[high] = 50`
If cap is exceeded, force consolidation:
- freeze further self-modification for agent `m`
- require human review for next promotion
- optionally require baseline refresh before unfreezing
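The capacity check for Gate 2 can be sketched as below, using the weight formula and suggested defaults from this section. The `Change` record and function names are hypothetical, and the window is assumed to include changes timestamped exactly 14 days ago.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    at: datetime   # when the change was promoted or proposed
    f: float       # normalized files/surfaces touched
    t: int         # new/changed tool interfaces
    p: float       # privilege delta score
    i: float       # inter-agent interaction surface delta
    s: int         # safety-critical surface touched (0 or 1)


K_MAX = {"low": 20, "medium": 35, "high": 50}
WINDOW = timedelta(days=14)


def weight(c: Change) -> float:
    """w(c) = 1*f + 2*t + 4*p + 3*i + 5*s"""
    return 1 * c.f + 2 * c.t + 4 * c.p + 3 * c.i + 5 * c.s


def used_capacity(history: list[Change], now: datetime) -> float:
    """K_m(T): total weight of this agent's changes inside the rolling window."""
    return sum(weight(c) for c in history if now - WINDOW <= c.at <= now)


def gate2(history: list[Change], new: Change, tier: str, now: datetime) -> bool:
    """Pass if K_m(T) + w(c_new) <= K_max[tier]."""
    return used_capacity(history, now) + weight(new) <= K_MAX[tier]
```

For instance, a change with `f=1, t=0, p=1, i=0, s=1` weighs 10, so an agent that has already used 21 units in the window would fail the low-tier cap of 20 but still fit under the medium-tier cap of 35.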
Deterministic risk-tier classification
Risk tier must be policy-determined, reproducible, and non-overridable by agent output.
Classification algorithm
- Compute change surface from:
  - file/path diff
  - dependency diff
  - tool manifest diff
  - permission policy diff
- Apply ordered rules (highest severity wins):
  - Critical: touches governance policy, provenance pipeline, auth/key handling, isolation controls, or classifier code
  - High: introduces new external API/tool integration, expands permissions, or increases inter-agent capability scope
  - Medium: modifies workflows, model-routing policies, tool parameters, or memory schemas
  - Low: prompt/template tuning, docs-only, memory cleanup with no capability delta
- Emit signed classifier report with matched rules and evidence.
Anti-bypass controls
- Follow transitive ownership: touching generated config derived from critical policy inherits critical tier.
- Block ambiguous classification: unresolved dynamic imports or unknown tool scopes auto-escalate to high.
- Any classifier execution failure defaults to stricter tier.
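The ordered rules and anti-bypass defaults can be sketched as a small deterministic classifier. Everything named here is an assumption for illustration: the surface tags stand in for the output of the diff analysis, unknown tags stand in for ambiguous cases, and a classifier failure is mapped to `critical` as one possible reading of "stricter tier". Signing of the report is not shown.

```python
TIER_ORDER = ["low", "medium", "high", "critical"]

# Hypothetical surface tags; a real classifier would derive them from the
# file, dependency, tool-manifest, and permission-policy diffs.
RULES = [
    ("critical", {"governance_policy", "provenance_pipeline", "auth_keys",
                  "isolation_controls", "classifier_code"}),
    ("high", {"new_external_integration", "permission_expansion",
              "inter_agent_scope_increase"}),
    ("medium", {"workflow", "model_routing", "tool_parameters", "memory_schema"}),
    ("low", {"prompt_template", "docs", "memory_cleanup"}),
]
KNOWN_TAGS = set().union(*(tags for _, tags in RULES))


def stricter(a: str, b: str) -> str:
    """Return whichever tier is more severe."""
    return max(a, b, key=TIER_ORDER.index)


def classify(surface_tags: set[str]) -> tuple[str, list[str]]:
    """Return (tier, matched_rules); the highest matched severity wins."""
    tier = "low"
    matched: list[str] = []
    for rule_tier, tags in RULES:
        hits = sorted(surface_tags & tags)
        if hits:
            tier = stricter(tier, rule_tier)
            matched.extend(f"{rule_tier}:{tag}" for tag in hits)
    # Tags the rules do not recognize are treated as ambiguous: at least high.
    unknown = sorted(surface_tags - KNOWN_TAGS)
    if unknown:
        tier = stricter(tier, "high")
        matched.extend(f"ambiguous:{tag}" for tag in unknown)
    return tier, matched


def classify_fail_closed(surface_tags) -> tuple[str, list[str]]:
    """Any classifier failure falls back to the strictest tier (assumption)."""
    try:
        return classify(set(surface_tags))
    except Exception as exc:
        return "critical", [f"classifier_error:{type(exc).__name__}"]
```

Sorting the matched tags keeps the report identical across runs for the same input, which is what makes the classification reproducible.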
Layer 3: Versioned modification workflow with staged rollout
- Propose change and create Byline event (`proposed`).
- Run sandbox verification and both gates (`tested`).
- Deploy to shadow (no user impact, full telemetry comparison).
- Deploy canary-1 (5% agent population).
- Deploy canary-2 (25% agent population).
- Promote to full population only if SLOs hold at each stage.
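The staged progression above can be sketched as a loop that stops at the first stage where SLOs fail. The `slos_hold` callback is a hypothetical stand-in for the real telemetry comparison, and representing shadow as a 0% population fraction is an assumption.

```python
from typing import Callable

# (stage name, fraction of agent population exposed) in rollout order.
PRE_PROMOTION_STAGES = [("shadow", 0.00), ("canary-1", 0.05), ("canary-2", 0.25)]


def rollout(slos_hold: Callable[[str, float], bool]) -> tuple[bool, str]:
    """Return (promoted, stage): the failing stage, or "full" on success."""
    for stage, fraction in PRE_PROMOTION_STAGES:
        if not slos_hold(stage, fraction):
            return False, stage  # halt here; do not advance further
    return True, "full"
```

A caller would typically record a Byline event for each stage outcome, so the stage at which a rollout halted is visible in the provenance log.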
Rollback and kill-switch policy
Automatic rollback triggers (relative to rolling baseline):
- Toxicity increases by >= 15% for 3 consecutive windows.
- Collusion proxy increases by >= 20% for 2 consecutive windows.
- Instability increases by >= 25% in any window.
- Any policy-attestation mismatch or missing provenance event.
Kill-switch behavior:
- stop new promotions immediately
- freeze self-modification for affected tier/population
- route incident to human review queue
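The three metric-based rollback triggers can be evaluated as each telemetry window closes. This is a sketch with assumptions: the metric keys and function name are hypothetical, the baseline is a single positive number rather than a rolling series, and the attestation and missing-provenance trigger is not modeled.

```python
# metric -> (relative increase threshold, consecutive windows required)
TRIGGERS = {
    "toxicity": (0.15, 3),
    "collusion_proxy": (0.20, 2),
    "instability": (0.25, 1),
}


def should_rollback(metric: str, windows: list[float], baseline: float) -> bool:
    """True if the most recent windows all breach the metric's threshold."""
    if baseline <= 0:
        raise ValueError("baseline must be positive for a relative comparison")
    threshold, needed = TRIGGERS[metric]
    recent = windows[-needed:]
    if len(recent) < needed:
        return False  # not enough windows observed yet
    return all((value - baseline) / baseline >= threshold for value in recent)
```

Because the function looks only at the trailing `needed` windows, calling it after every window is what makes "in any window" hold for the instability trigger.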
Layer 4: Runtime mutability envelope
Allowed mutable surfaces (with checks):
- skills
- workflows
- prompts/templates
- memory entries/schemas
- non-critical configuration
Denied direct mutation:
- governance policy bundle
- provenance storage and integrity chain logic
- compositional monitor thresholds/rules
- auth, key custody, and isolation controls
All allowed mutations must pass capability-diff checks to prevent indirect privilege expansion.
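A capability-diff check can be as simple as a set difference between the capabilities granted before and after a mutation. The sketch below assumes capabilities are represented as sets of strings grouped by category; that representation and the function names are illustrative, not part of the design above.

```python
def capability_diff(before: dict[str, set[str]],
                    after: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return capabilities present after the change but not before it."""
    added: dict[str, set[str]] = {}
    for category, caps in after.items():
        new = caps - before.get(category, set())
        if new:
            added[category] = new
    return added


def mutation_allowed(before: dict[str, set[str]],
                     after: dict[str, set[str]]) -> bool:
    """Allow the mutation only if it adds no capability in any category."""
    return not capability_diff(before, after)
```

Removing capabilities passes this check, while any addition, including an entirely new category, is reported and blocks the mutation.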
Quantitative rollout exit criteria
Use phase exits based on measurable criteria, not only calendar time.
- Provenance foundation
  - 99.9% of modification events have complete required fields
  - 0 broken hash-chain links across 30-day soak
- Self-fork + constitutional gate
  - >= 95% deterministic replay success for sampled events
  - < 2% false reject rate on known-safe benchmark changes
- Compositional monitor
  - Detect >= 90% of seeded collusion/drift scenarios
  - < 5% false alarm rate in baseline runs
- Closed-loop governance
  - Mean time to rollback < 10 minutes for triggered incidents
  - No unresolved critical incident older than 24 hours
Minimal SWARM experiment plan
- Baseline: no self-modification.
- Arm A: self-modification with constitutional gate only.
- Arm B: constitutional + Two-Gate + staged rollout.
Evaluate:
- promotion rate by tier
- rejection/rollback rate
- toxicity, collusion proxy, welfare, instability deltas
- incident response latency (detection to rollback)
Pre-register thresholds and seeds so comparisons are reproducible.