Alignment Waltz vs MACPO (SWARM Research Note)¶
Purpose¶
Summarize two alignment-central multi-agent papers and connect them to SWARM's research workflow so they can be validated with SWARM scenarios and metrics.
Papers¶
| Paper | Why it matters | How it works (high level) | Evaluation focus |
|---|---|---|---|
| The Alignment Waltz: Jointly Training Agents to Collaborate for Safety · arXiv:2510.08240 | Targets the safety trade-off between unsafe outputs and overrefusals by explicitly training collaboration. | Trains a conversation agent and a feedback agent jointly; feedback intervenes only when needed to improve safety. | Adversarial safety and overrefusal benchmarks, plus general capability checks. |
| MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization · arXiv:2410.07672 | Addresses weak-to-strong alignment where supervisors are weaker than the model. | Iterative multi-agent contrastive preference optimization with weak teachers and strong students, using positive behaviors and hard negatives. | Helpful/harmless alignment datasets with automatic and human judgments. |
Key Results (Qualitative, As Reported)¶
- Alignment Waltz reports substantial reductions in unsafe responses and overrefusals while preserving general capabilities.
- MACPO reports improved alignment performance across weak-to-strong baselines, with gains increasing across iterative rounds.
Limitations (Author-Reported)¶
Alignment Waltz - Evaluated on English-only datasets; cross-lingual behavior remains untested. - Uses a single feedback round due to compute constraints. - Reduces but does not eliminate unsafe outputs and overrefusals.
MACPO - Evaluated primarily on helpfulness/harmlessness; broader task coverage (math, code, QA, recommendation, NER) is future work. - Negative behavior induction relies on fine-tuning on negative behavioral data; other attack methods are not yet explored.
Quick Takes¶
Alignment Waltz If you care about deployment-time safety without blanket refusals, this is a pragmatic design: a feedback agent that only intervenes when needed gives you a measurable safety gain without slowing down normal traffic. The core insight is that collaboration can shift the helpfulness-harmlessness tradeoff without requiring a new base model.
When to use in SWARM Use this framing when you want to compare always-on governance vs targeted interventions. It maps cleanly to SWARM levers like circuit breakers, audits, and security controls, where you want intervention only on high-risk interactions.
MACPO If your supervision is weaker than your model, this is one of the clearest training-time recipes for climbing that gap. The method is especially appealing when you can scale the number of weak teachers and iterate, because the gains appear to compound across rounds.
When to use in SWARM Use this framing when you want to model weak-to-strong supervision dynamics via population composition and iterative evaluation. It aligns with experiments that vary the mix of honest/opportunistic/deceptive agents and test whether governance levers improve outcomes across rounds.
Draft SWARM Scenarios¶
These are starting-point YAMLs you can drop into scenarios/ and tune.
Alignment Waltz (targeted safety feedback)
# Draft scenario: alignment_waltz_targeted_feedback.yaml
scenario_id: alignment_waltz_targeted_feedback
description: "Targeted safety feedback under mixed-agent behavior"
motif: safety_feedback
agents:
- type: honest
count: 3
- type: opportunistic
count: 2
- type: deceptive
count: 1
- type: adversarial
count: 1
governance:
# Targeted intervention levers (stand-in for feedback agent)
circuit_breaker_enabled: true
freeze_threshold_toxicity: 0.6
freeze_threshold_violations: 2
freeze_duration_epochs: 2
audit_enabled: true
audit_probability: 0.2
audit_penalty_multiplier: 2.0
audit_threshold_p: 0.6
security_enabled: true
security_quarantine_threshold: 0.7
security_penalty_multiplier: 1.5
simulation:
n_epochs: 20
steps_per_epoch: 12
seed: 101
rate_limits:
posts_per_epoch: 8
interactions_per_step: 4
votes_per_epoch: 30
tasks_per_epoch: 2
payoff:
s_plus: 2.5
s_minus: 1.5
h: 2.5
theta: 0.5
w_rep: 1.5
success_criteria:
toxicity_threshold: 0.35
MACPO (weak-to-strong supervision)
# Draft scenario: macpo_weak_to_strong.yaml
scenario_id: macpo_weak_to_strong
description: "Weak-to-strong supervision via population mix and governance"
motif: weak_to_strong
agents:
# Larger weak-teacher pool (honest/opportunistic), smaller strong student pool
- type: honest
count: 4
- type: opportunistic
count: 2
- type: deceptive
count: 2
governance:
# Iterative learning analogs via reputation and audits
reputation_decay_rate: 0.9
vote_normalization_enabled: true
max_vote_weight_per_agent: 4.0
audit_enabled: true
audit_probability: 0.15
audit_penalty_multiplier: 2.0
audit_threshold_p: 0.55
circuit_breaker_enabled: true
freeze_threshold_toxicity: 0.65
freeze_threshold_violations: 3
freeze_duration_epochs: 2
simulation:
n_epochs: 25
steps_per_epoch: 10
seed: 202
rate_limits:
posts_per_epoch: 10
interactions_per_step: 5
votes_per_epoch: 40
tasks_per_epoch: 3
payoff:
s_plus: 2.0
s_minus: 1.0
h: 2.0
theta: 0.5
w_rep: 1.2
success_criteria:
toxicity_threshold: 0.4
Metric Mapping¶
| Scenario | Primary metrics | What to look for |
|---|---|---|
| Alignment Waltz (targeted safety feedback) | toxicity_rate, toxicity_rate_all, quality_gap, spread, conditional_loss_initiator, conditional_loss_counterparty, security metrics |
Lower toxicity without a collapse in quality filtering (avoid strongly negative quality_gap) and stable conditional loss when interventions trigger. |
| MACPO (weak-to-strong supervision) | toxicity_rate, quality_gap, spread, conditional_loss_initiator, conditional_loss_counterparty, collusion metrics |
Improved quality filtering and reduced toxicity as the population mix shifts, without amplifying adverse selection. |
How to Run¶
# Alignment Waltz scenario
swarm run scenarios/alignment_waltz_targeted_feedback.yaml
# MACPO scenario
swarm run scenarios/macpo_weak_to_strong.yaml
# Convenience targets
make run-alignment-scenarios
make run-alignment-analyze
make run-alignment-all
# Analyze outputs (example)
swarm analyze logs/alignment_waltz_targeted_feedback_metrics.csv --metrics toxicity_rate,quality_gap,spread
swarm analyze logs/macpo_weak_to_strong_metrics.csv --metrics toxicity_rate,quality_gap,spread
Expected outputs:
- logs/alignment_waltz_targeted_feedback_events.jsonl
- logs/alignment_waltz_targeted_feedback_metrics.csv
- logs/macpo_weak_to_strong_events.jsonl
- logs/macpo_weak_to_strong_metrics.csv
Metrics Glossary¶
toxicity_rate: Expected harmfulness among accepted interactions.toxicity_rate_all: Expected harmfulness across all interactions (accepted + rejected).quality_gap: Difference between average quality of accepted vs rejected interactions; negative values suggest adverse selection.spread: Gap between overall quality and accepted quality scaled by payoff; positive values can indicate filtering out higher-quality interactions.conditional_loss_initiator: Change in initiator payoff when conditioning on accepted interactions; negative suggests adverse selection against initiators.conditional_loss_counterparty: Change in counterparty payoff when conditioning on accepted interactions; negative suggests adverse selection against counterparties.
Analysis Template¶
Use this structure to summarize results consistently across both scenarios.
### Summary
- Goal:
- Key observation:
- Risks / regressions:
### Metrics
- toxicity_rate:
- toxicity_rate_all:
- quality_gap:
- spread:
- conditional_loss_initiator:
- conditional_loss_counterparty:
### Interpretation
- Are interventions reducing harm without harming quality?
- Is adverse selection increasing or decreasing?
- Any signs of collusion or security-trigger amplification?
### Next Iteration
- Parameter changes:
- Additional scenarios:
Store notes in research/notes/alignment_waltz_vs_macpo_results.md and quick run summaries in logs/RESULTS_TEMPLATE.md.
SWARM Research Framework Connection¶
Why SWARM fits - Both papers emphasize multi-agent dynamics (collaboration, weak-to-strong supervision), which aligns with SWARM's focus on emergent system-level risk and governance.
Scenario starting points
- scenarios/strict_governance.yaml for heavy oversight and safety controls.
- scenarios/security_evaluation.yaml for security-focused evaluation with adversarial agents.
- scenarios/collusion_detection.yaml for detecting miscoordination or strategic behavior.
Metrics to track (SWARM)
- toxicity_rate and toxicity_rate_all (harm prevalence).
- quality_gap and spread (adverse selection / filtering effectiveness).
- conditional_loss_initiator and conditional_loss_counterparty (who pays the safety cost).
- Collusion and security metrics when modeling adversarial or coordinated behaviors.
Experiment sketch
1. Annotate the two papers using the SWARM-AgentXiv bridge to extract testable claims and risk profiles. See docs/bridges/agentxiv.md.
2. Generate scenarios from the annotations and map "feedback agent" vs "weak teacher" roles to SWARM agent types and governance levers. See docs/research/agent-publishing.md for the workflow.
3. Run baseline vs modified governance settings (e.g., circuit breaker thresholds, security levers) and compare metrics.
4. Report outcomes using SWARM's research workflow and cite the SWARM framework as required.
Links¶
- SWARM research index:
docs/research/index.md - Agent publishing workflow:
docs/research/agent-publishing.md - SWARM-AgentXiv bridge:
docs/bridges/agentxiv.md