Calibration Pilot Findings — Rubric v2 (n=82)¶
Date: 2026-06-02 Pre-registration: calibration-prereg.md Previous findings: calibration-pilot-findings.md (rubric v1) Status: pilot result, not the powered run
TL;DR¶
- Rubric v2 closes the gap that fired the escalation rule in v1. Same 82 items, same 3 judges, same seed: headline α went 0.87 → 0.94, and the mistral ↔ qwen disagreement that motivated v2 dropped from ρ=0.66 to ρ=0.95.
- Four of five p-bins are now
strongorusable. Only[0.4, 0.6)still fires escalate, and the residual is a Krippendorff normalization artifact (judges agree more on the population, so the small within-bin spread looks relatively worse). - Decision-relevant regions are now actually usable. The
policy-obfuscation tail (
p ≥ 0.6, where adaptive proxy-gaming would land) went from α=0.70 to α=0.97. Adaptive findings in that band can be made with confidence. - Cost: $0.00 (same Ollama judges). Wall time: ~10 min.
What v2 changed¶
Three mechanical changes to the rubric (swarm/judges/rubric_v2.md),
derived from the v1 pilot's failure modes:
- Bans the 0.5 default. v1 said "score near 0.5 when metadata is
empty"; qwen took this literally. v2 requires a directional score
from
interaction_type+ party labels even when metadata is sparse. - Reserves 0.0 and 1.0 for unambiguous cases. v1 let mistral anchor to extremes on any positive/negative signal; v2 makes 0.05 and 0.95 the routine "very confident" scores.
- Adds banded anchors with worked examples at 0.05 / 0.15 / 0.30 / 0.45 / 0.55 / 0.70 / 0.85 / 0.95, each tied to a specific evidence profile so judges have less room to interpret "low quality" differently.
The forbidden-fields contract and the orthogonality property are unchanged from v1.
Configuration¶
Identical to the v1 pilot, except --rubric rubric.v2:
| Parameter | Value |
|---|---|
| Scenario | obfuscation (10 epochs, seed 42) |
| Sample size | 82 accepted interactions, stratified by p |
| Judges | llama3.1:8b, qwen2.5:14b, mistral:7b — all local via Ollama |
| Rubric | rubric_v2.md (SHA-prefix 25af1f4eccfadcee) |
| Temperature | 0 (deterministic) |
Headline¶
| Metric | v1 | v2 | Δ |
|---|---|---|---|
| Krippendorff α (interval) | 0.869 | 0.938 | +0.07 |
| ICC(2,k) | 0.953 | 0.979 | +0.03 |
| Verdict | strong | strong | — |
Pairwise Spearman ρ¶
| pair | v1 ρ | v2 ρ | Δ |
|---|---|---|---|
| llama ↔ mistral | 0.82 | 0.96 | +0.14 |
| llama ↔ qwen | 0.82 | 1.00 | +0.18 |
| mistral ↔ qwen | 0.66 | 0.95 | +0.29 |
The mistral ↔ qwen pair was the diagnostic that revealed the rubric ambiguity in v1; closing that gap is the load-bearing test of whether v2 actually fixed it. ρ=0.95 says yes.
Per-bin agreement¶
| bin | n | v1 α | v2 α | v1 mean |Δ| | v2 mean |Δ| | v1 → v2 verdict |
|---|---|---|---|---|---|---|
| [0.0, 0.2) | 13 | −0.03 | 1.00 | 0.01 | 0.00 | degenerate → strong |
| [0.2, 0.4) | 9 | 0.20 | 0.67 | 0.04 | 0.01 | escalate → usable |
| [0.4, 0.6) | 20 | −0.11 | −0.23 | 0.11 | 0.08 | escalate → escalate |
| [0.6, 0.8) | 20 | 0.77 | 0.97 | 0.17 | 0.04 | strong → strong |
| [0.8, 1.0] | 20 | 0.70 | 0.97 | 0.18 | 0.02 | strong → strong |
Four of five bins improved decisively. The policy-obfuscation tail
(p ≥ 0.6) is now strong with α=0.97 and mean per-item disagreement
≈ 0.03 — i.e. the three judges differ by ~3% of full scale on average
in the regime adaptive proxy-gaming would produce its dangerous false
positives. Usable for adaptive findings.
The remaining escalation in [0.4, 0.6)¶
α=−0.23 looks worse than v1's −0.11, but mean pairwise
disagreement actually dropped from 0.11 to 0.08. The negative α is
a normalization artifact: v2 made the overall population more
agreeable, so even a small within-bin spread reads as
worse-than-population.
These items are almost certainly the activation_obfuscation cases —
the obfuscation fixture deliberately drifts p toward 0.5 over epochs
for this class. v2 caps activation_obfuscation at 0.40, but the judges
still disagree on whether specific items land at 0.30 vs 0.40. The
disagreement is bounded (0.08 absolute) but not zero.
This is the residual that v3 would target — but the headline impact on the decision-relevant regions is already small.
What this means for the adaptive study¶
p ≥ 0.6: anchor is now strong; adaptive findings (improvement, filtering, proxy-gaming detection) can be reported without caveat.p < 0.2andp ∈ [0.2, 0.4): anchor is now usable; findings carry their per-bin α as a footnote, not a blocker.p ∈ [0.4, 0.6): anchor still escalates. Adaptive claims about ranking items in this band still need rubric v3 or an external human-rater pass. Disagreement is bounded (≤ 0.08 absolute) but surveying differences <0.08 across this band is below the ensemble's resolution.
The previous finding "the anchor is usable for very-good vs very-bad
but not for the middle" has narrowed to "usable everywhere except
[0.4, 0.6)" under v2.
What's load-bearing in the comparison¶
The v1 → v2 improvement is not just "more agreement is better." The falsifiable claim was: the mistral ↔ qwen split was a rubric ambiguity, not a model difference. If v2's banded anchors were the right fix, ρ(mistral, qwen) should jump close to 1.0 — and it did (0.66 → 0.95). If the gap had been a real model-capability difference, ρ would have stayed flat. v2 validates the rubric-ambiguity hypothesis.
Followups (in priority order)¶
- Adaptive RL implementation can now proceed against v2 with a real anchor for the decision-relevant regions. This is the actual bottleneck for the rest of the study.
- Larger pilot at per-bin=50 would tighten the
[0.2, 0.4)estimate (currently n=9) and provide a publishable powered result. - Cross-scenario — run on
mixedandself_optimizerto confirm v2 transfers off the obfuscation fixture. - Rubric v3 for
[0.4, 0.6)— only worth doing if adaptive findings actually land in that band. - Claude when
ANTHROPIC_API_KEYis available — would test whether v2's wins survive a fourth, qualitatively different judge.
Reproducibility¶
# v2 judge run (3 judges × 82 items, ~10 min on local Ollama)
cd /path/to/swarm
JUDGE_MODEL_LLAMA=llama3.1:8b python -m experiments.calibration_judge \
--rubric rubric.v2 \
--judges llama qwen mistral \
--scenario obfuscation --per-bin 20 --seed 42
# Agreement analysis on the resulting judge_scores.csv
python -m experiments.calibration_agreement \
--scores /path/to/judge_scores.csv
Specific artifacts from this pilot (gitignored per project convention):
- v2 judge scores:
runs/20260603T004551Z_calibration_judge_seed42/ - v2 agreement:
runs/20260603T010608Z_calibration_agreement/ - v1 judge scores (for direct comparison):
runs/20260603T001634Z_calibration_judge_seed42/ - v1 agreement:
runs/20260603T002953Z_calibration_agreement/