Calibration Study — Pre-Registration¶
Status: pre-registered, not yet run Date: 2026-06-01 Unblocks: adaptive agents study (arms 1–3)
Motivation¶
The adaptive agents study cannot distinguish three causes of a toxicity drop without an external signal:
- Agents improved quality. (governance worked)
- Agents rejected more. (filtering)
- Agents gamed the proxy —
v̂rises while true quality stays flat. (governance defeated)
The framework also rests on the implicit assumption that the proxy-derived
p = sigmoid(v̂) is well-calibrated against SoftInteraction.ground_truth
(the realized v ∈ {+1, -1} label). That has never been measured. Until
both things are anchored, every downstream result is interpretable only at
the level of p — which is exactly what a strategic agent will defeat.
Central goal¶
Produce a triangulated quality signal that adaptive-agent results can be interpreted against:
- a measured proxy-fidelity baseline (
pvsground_truth), - an external judge anchor (LLM-judge scores on accepted interactions),
- inter-rater agreement across judges (so "the judge is also gameable" is bounded, not waved away).
The deliverable is a CSV of accepted interactions with {v_hat, p,
ground_truth, judge_score_claude, judge_score_gpt4o_mini,
judge_score_llama, agreement_metrics} that downstream studies can join
against by run-id.
Arms¶
Arm A — Proxy fidelity¶
Measure how well p = sigmoid(v̂) from ProxyComputer tracks the realized
ground_truth label.
- Procedure: generate interactions with a known latent generator
probability drawn from a controlled grid (e.g.
{0.05, 0.2, 0.4, 0.6, 0.8, 0.95}, ≥500 interactions per bin) and sampleground_truth ∈ {+1, -1}from it. RunProxyComputerto compute(v_hat, p)per interaction. Binp, compute the empirical frequency ofground_truth = +1per bin (reliability diagram). - Metrics: expected calibration error (ECE), maximum calibration error (MCE), Brier score, reliability-diagram bins.
- Sweep: the sigmoid
kparameter; report fidelity vsk. Identify thekthat minimizes ECE. - Output:
runs/<ts>_calibration_fidelity/with per-bin CSV and reliability plot.
Arm B — External judge anchor¶
Build the un-gameable quality signal the adaptive study needs.
- Procedure: sample a held-out set of accepted interactions from
static-baseline runs (≥1000 interactions, stratified across scenarios
and
pbins). For each interaction, present its content artifact (text/output, not the proxy observables) to an LLM judge with a fixed rubric. Judge returns a score on[0, 1]. - Prompt discipline: the judge prompt MUST NOT include the
ProxyObservablesfieldsProxyComputerreads — concretely:task_progress_delta,rework_count,verifier_rejections,tool_misuse_flags,counterparty_engagement_delta. The judge rates on content quality dimensions orthogonal to those signals. This is what makes the signal un-gameable from an adaptive agent's perspective — the agent can shift observables without shifting what the judge sees. - Judges: Claude, GPT-4o-mini, Llama 3.x (matches existing Table 9 setup).
- Output:
runs/<ts>_calibration_judge/judge_scores.csvwith one row per interaction × judge.
Arm C — Inter-rater agreement¶
Bound the "the judge is also gameable" concern.
- Procedure: on Arm B's sample, compute pairwise agreement across the three judge models.
- Metrics: Krippendorff's α (continuous), Spearman ρ between judge pairs, intra-class correlation (ICC). Report disagreement-by-bin so we see whether judges disagree more at the tails.
- Decision rule: if α < 0.5, the judge ensemble is too noisy to anchor the adaptive study — escalate (add a 4th judge, switch to a stronger model, or fall back to human spot-checks on a smaller subsample). Pre-register the α threshold so we don't drift it post-hoc.
Critical confound¶
The judge is also a model. A sophisticated adaptive agent could, in principle, learn to game the judge rather than the proxy. Three defenses:
- The judge prompt rubric is fixed and orthogonal to
ProxyComputer's observables, so the gaming channel is not the same one. - Inter-rater agreement (Arm C) bounds the size of model-specific exploits — an attack that fools Claude but not GPT-4o-mini and Llama will surface in α.
- We pre-register an external-quality threshold before the adaptive runs and never tune it afterward. The adaptive study reports judge-rated quality as a held-out signal, not as a target.
These do not eliminate the recursion — they bound it. State the bound explicitly in any downstream finding.
Pre-registered commitments¶
pgrid, sample sizes, andksweep range for Arm A are fixed before data collection.- Judge prompt rubric, judge model versions, and excluded observable
features for Arm B will be version-locked at
swarm/judges/rubric_v1.mdbefore any Arm B data collection. - The Krippendorff α decision threshold for Arm C is set in advance.
- The CSV schema downstream studies join against is frozen before the adaptive study begins.
- All judge calls are logged in append-only JSONL alongside the run artifacts for full reproducibility.
Outcomes¶
pis well-calibrated AND judges agree (α ≥ 0.7): the strong result. Adaptive study unblocked with confidence;pcan be reported as a calibrated proxy; judge serves as the un-gameable anchor.pis poorly calibrated but judges agree: the framework still has a usable external anchor; we report the calibration gap honestly and re-tunek. Adaptive study still unblocked.- Judges disagree (α < 0.5): the anchor itself is too noisy. The adaptive study cannot rely on a single ensemble score. We either add judges, switch models, or fall back to a smaller human-rated subsample. This is itself a publishable result about LLM-judge reliability.
pwell-calibrated AND judges agree with each other, but the proxy and the judge ensemble disagree:pand the external judge measure different things. Most interesting result — interpret carefully; this is where the proxy-gaming threat model lives.
Order of operations¶
- Arm A — proxy fidelity on existing static-baseline data (no new runs needed if logs are sufficient). Cheap, runs first.
- Arm B — build judge pipeline, freeze rubric, run on stratified sample.
- Arm C — compute agreement on Arm B's output.
- Freeze the joined CSV schema. Adaptive study (arms 1–3) is unblocked.
Arm A is independent and can begin immediately. Arms B and C share the judge pipeline and are run in sequence.