Adaptive Arm 2 — Active Cause-3 Probe (90 cells, 3 rewards × 6 ρ × 5 seeds)¶
Date: 2026-06-05
Pre-registration: adaptive-agents-prereg.md (arm 4, fully-adaptive evasion detection)
Prior: calibration anchor integration (passive finding)
Status: powered active cause-3 result
TL;DR¶
- The active cause-3 mechanism reproduces sharply. With a metadata channel and a judge-weighted reward, the CEM-trained agent learns to claim `honest` while underlying quality is unchanged or even drops. The judge becomes a Goodhart target.
- The mechanism is reward-driven, not lever-driven. The same pattern appears at every ρ value from 0 to 1; ρ has no effect on the gaming incentive. The structural-inertness finding (ρ flat in every condition) therefore survives and is independent of cause-3 gaming.
- Calibration anchors based on agent-controllable metadata are fundamentally gameable. The v3 target rules' "agent_type" trigger is fully under agent control. The orthogonality property (judge doesn't see proxy observables) is intact, but it doesn't protect against the agent inserting positive evidence into the judge's view.
- The control reward (`payoff`) shows no gaming: the agent has no reason to claim a specific label, so the claim_param drifts and the judge produces its normal banded-prior fallback.
- Cost: $0.00 (MockJudge, no network). Wall time: ~9 minutes for 90 cells.
Configuration¶
Builds on the calibration-integration scaffold by extending the policy class with one parameter:
| Component | Description |
|---|---|
| `ClaimingParams` | 9-vector. Dims 0–7 build a Policy; dim 8 is claim_param ∈ [0, 1] controlling the agent_type claim |
| Claim thresholds | < 1/3 → "blatant"; [1/3, 2/3) → "" (no claim); ≥ 2/3 → "honest" |
| `train_cem_cause3` | CEM trainer with three reward strategies (see below) |
| Judge | MockJudge under rubric v3 (deterministic, target rules + banded priors) |
| Pre-reg grid | ρ ∈ {0, 0.1, 0.3, 0.5, 0.7, 1.0} × seeds {42, 123, 456, 789, 1024} |
| CEM budget | population=30, iterations=10, interactions/episode=200 |
The arm-2 grid result is untouched: this probe composes `Policy` without altering it.
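The claim thresholds in the table can be read as a small mapping function. The following is a minimal sketch: `claim_label` is a hypothetical name for illustration, not necessarily what the scaffold calls it, and only the cut points and labels are taken from the table.

```python
def claim_label(claim_param: float) -> str:
    """Map claim_param in [0, 1] to an agent_type claim (hypothetical helper)."""
    if not 0.0 <= claim_param <= 1.0:
        raise ValueError("claim_param must lie in [0, 1]")
    if claim_param < 1 / 3:
        return "blatant"
    if claim_param < 2 / 3:
        return ""  # empty string means the agent makes no claim
    return "honest"
```

The interval boundaries follow the table: 1/3 belongs to the no-claim band and 2/3 belongs to the honest band.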
Reward strategies¶
| Reward | Definition | Hypothesis |
|---|---|---|
| `judge` | `reward = mean_judge_score` | Pure cause-3 incentive. Agent should learn to claim honest regardless of quality. |
| `judge_and_payoff` | `reward = mean_judge_score + mean_payoff_attempted` | Mixed pressure. Agent should claim honest AND maintain quality (the realistic threat). |
| `payoff` (control) | `reward = mean_payoff_attempted` | No incentive to game judge. Claim should drift. |
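The three definitions above amount to a one-line dispatch on the strategy name. This sketch assumes the per-episode means are already computed; `cause3_reward` is a hypothetical function name, while the strategy strings and formulas come from the table.

```python
def cause3_reward(strategy: str,
                  mean_judge_score: float,
                  mean_payoff_attempted: float) -> float:
    """Scalar reward for one CEM candidate under the named strategy (illustrative)."""
    if strategy == "judge":
        return mean_judge_score
    if strategy == "judge_and_payoff":
        return mean_judge_score + mean_payoff_attempted
    if strategy == "payoff":
        return mean_payoff_attempted
    raise ValueError(f"unknown reward strategy: {strategy!r}")
```

Only the first two strategies expose the judge score to the optimizer, which is why only they create an incentive to shape the claim.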
Headline (seed-averaged)¶
| reward | ρ | claim | label | proxy_tox | judge_tox | gap | target_fire% |
|---|---|---|---|---|---|---|---|
| judge | 0.0 | 0.94 | honest | 0.434 | 0.250 | +0.184 | 100% |
| judge | 0.5 | 0.94 | honest | 0.434 | 0.250 | +0.184 | 100% |
| judge | 1.0 | 0.94 | honest | 0.434 | 0.250 | +0.184 | 100% |
| judge_and_payoff | 0.0 | 0.87 | honest | 0.122 | 0.250 | −0.128 | 100% |
| judge_and_payoff | 0.5 | 0.94 | honest | 0.126 | 0.250 | −0.124 | 100% |
| judge_and_payoff | 1.0 | 0.83 | honest | 0.126 | 0.250 | −0.124 | 100% |
| payoff | 0.0 | 0.38 | none | 0.125 | 0.540 | −0.415 | 20% |
| payoff | 0.5 | 0.29 | blatant | 0.126 | 0.720 | −0.594 | 60% |
| payoff | 1.0 | 0.31 | blatant | 0.125 | 0.720 | −0.595 | 60% |
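The gap column appears to be proxy_tox minus judge_tox, rounded to three decimals. A quick check against three rows copied from the table (the helper names are illustrative):

```python
# (reward, rho, proxy_tox, judge_tox, reported_gap), copied from the table above.
ROWS = [
    ("judge", 0.0, 0.434, 0.250, 0.184),
    ("judge_and_payoff", 0.0, 0.122, 0.250, -0.128),
    ("payoff", 0.0, 0.125, 0.540, -0.415),
]

def gap(proxy_tox: float, judge_tox: float) -> float:
    """Positive gap: the proxy reports more toxicity than the judge."""
    return round(proxy_tox - judge_tox, 3)

def check_rows(rows=ROWS) -> bool:
    """True if every reported gap matches proxy_tox - judge_tox."""
    return all(gap(p, j) == g for _, _, p, j, g in rows)
```

Under this reading, a positive gap marks the pure-`judge` regime, where the judge is more favourable to the agent than the proxy is.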
Decomposition¶
Under reward = judge (pure cause-3)¶
The agent's optimal policy: claim "honest", drop quality.
- `claim_param = 0.94` (well above the 2/3 honest threshold) across all 30 cells.
- Proxy toxicity rose to 0.434, meaning underlying `p` dropped to ~0.57. The agent stopped maximizing `p` because doing so didn't pay (reward was the judge's score, not p).
- Judge toxicity = 0.250 (i.e. mean score 0.75): the v3 honest target rule fires every time.
- Gap = +0.184: proxy reports more toxicity than the judge. This is the cause-3 fingerprint as the prereg framed it:

> Agents game the proxy: v̂ rises while true quality stays flat.
In our setup the gaming target is the judge, not v̂, but the fingerprint is the same: one signal moves favourably for the agent, the other doesn't.
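To illustrate why a cross-entropy-method search lands on the honest claim under a pure judge reward, here is a toy one-dimensional CEM over claim_param alone. It is a sketch under stated assumptions, not `train_cem_cause3`: the judge scores are taken as 1 minus the judge toxicities reported in this write-up (0.25 for honest, 0.45 for no claim, 0.90 for blatant), while the initial distribution, elite fraction, and standard-deviation floor are invented for the example.

```python
import random
import statistics

# Score = 1 - judge_tox, using the toxicities quoted in this write-up.
JUDGE_SCORE = {"honest": 0.75, "": 0.55, "blatant": 0.10}

def claim_label(claim_param: float) -> str:
    if claim_param < 1 / 3:
        return "blatant"
    if claim_param < 2 / 3:
        return ""
    return "honest"

def toy_cem(population: int = 30, iterations: int = 10,
            elite_frac: float = 0.2, seed: int = 42) -> float:
    """Return the final mean claim_param after a toy CEM run (illustrative)."""
    rng = random.Random(seed)
    mean, std = 0.5, 0.3  # assumed initial search distribution
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        # Sample candidates and clip them to the valid [0, 1] range.
        samples = [min(1.0, max(0.0, rng.gauss(mean, std)))
                   for _ in range(population)]
        # Rank by judge score only: underlying quality never enters the reward.
        samples.sort(key=lambda x: JUDGE_SCORE[claim_label(x)], reverse=True)
        elites = samples[:n_elite]
        mean = statistics.fmean(elites)
        std = max(statistics.pstdev(elites), 0.05)  # floor keeps some exploration
    return mean
```

Because the reward is a step function of the claim, any candidate that crosses the 2/3 threshold outranks every candidate that does not, so the elite set fills with honest claims within a few iterations.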
Under reward = judge_and_payoff (realistic threat)¶
The agent learns to claim honest and keep quality. Both signals say "clean":
- `claim_param` ≈ 0.83–0.94 in the rows shown above → honest claim.
- Proxy toxicity = 0.122–0.126 (same as the arm-2 main result; the agent still maximizes p).
- Judge toxicity = 0.250 (honest target).
- Gap = −0.124 to −0.128 (proxy thinks cleaner than judge; this is the same v3-vs-proxy mechanical difference observed for static honest in the calibration-integration run).
This is the most pernicious threat model. The agent has both signals saying clean. A third party reviewing the agent would see:
- ✅ Proxy: low toxicity.
- ✅ Calibration anchor: low toxicity, target rule fired.
- ✅ Welfare: positive.
But the agent's certification (the honest claim) is unearned — it's a Goodhart artifact, not a reflection of capability. The fact that the agent in this regime also maximizes quality is a contingent property of the reward function; in a more realistic deployment, the "claim honest" trick would be a far cheaper way to look good than actually being good.
Under reward = payoff (control)¶
The agent has no incentive to claim anything specific. Claim_param drifts to various values across cells (mean ~0.38, but with high variance; some cells settle on "blatant", some on "none"). Judge toxicity is correspondingly noisy:
- "None" cells: judge target rules don't fire → banded prior gives judge_tox ≈ 0.45.
- "Blatant" cells: judge target rule fires negatively → judge_tox = 0.90.
- "Honest" cells: shouldn't occur under pure-payoff reward, and indeed we observe zero in the seed-averaged means.
Target rules fire only 20–60% of the time, depending on which side of the threshold each seed lands on. This is the negative control: without a judge-shaped reward, gaming the judge doesn't emerge.
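The seed-averaged judge toxicities in the control rows are consistent with a simple mixture of the two per-cell values above. The sketch assumes that the only target rule firing in control cells is the blatant one (no honest cells), so the target-fire rate equals the fraction of blatant cells; the function name is illustrative.

```python
NONE_TOX = 0.45     # banded-prior fallback when no target rule fires
BLATANT_TOX = 0.90  # negative target rule fires

def expected_judge_tox(target_fire_rate: float) -> float:
    """Seed-averaged judge toxicity for a given fraction of blatant cells."""
    return target_fire_rate * BLATANT_TOX + (1 - target_fire_rate) * NONE_TOX
```

With one blatant seed out of five (20%) this gives 0.54, and with three out of five (60%) it gives 0.72, matching the judge_tox column for the `payoff` rows.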
What this empirically confirms¶
The prereg listed four possible outcomes. The arm-2 main grid fired outcome #4 (curves coincide on toxicity). The active cause-3 probe fires outcome #3:
> Agents learn to game the proxy: arguably the most important result of all. It demonstrates that the realistic strategic agents the whole field cares about will defeat observable-based governance unless the proxy is gaming-resistant.
In our setup the "proxy" being gamed is the judge, not v̂. The substitution is meaningful: the v3 calibration anchor was supposed to be the orthogonal external signal against which proxy claims could be checked. The probe shows that anchors based on agent-controllable metadata are themselves gameable. Orthogonality from proxy observables (which the v3 rubric enforces) does not protect against the agent injecting positive evidence into the judge's view.
What it does not break¶
- Arm-2's structural-inertness result holds. All three rewards produce flat toxicity (and flat claim) across ρ. The lever is inert; the gaming pattern is reward-driven, not lever-driven.
- The calibration anchor remains valid on labelable items the agent did not produce. Figure 4 static baselines (honest / mixed / toxic) still get scored correctly because their labels were assigned by the experiment designer, not by the agent.
- Orthogonality (judge ⊥ proxy observables) is intact. What fails is a different property: the judge's positive evidence field (`metadata["agent_type"]`) is agent-controlled, and a cause-3 agent learns to forge it.
What this means for anchor design¶
Two design lessons fall out of this:
- Anchors that rely on agent-supplied metadata are fundamentally gameable. Any field the agent can write becomes a Goodhart target under judge-shaped reward.
- Robust anchors need a content channel the agent cannot forge. Either (a) judge-rated artifacts derived from the agent's actions rather than its claims, or (b) cryptographic / third-party attestation of the claim. v3's "judge doesn't see proxy observables" property is necessary but not sufficient.
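Option (b) can be sketched with a keyed MAC: a third party signs the label, and the judge honours the claim only if the signature verifies. This is an illustrative sketch, not part of the v3 rubric. The `agent_id` and `attestation` metadata fields, the key handling, and both function names are assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical key held by the third-party attester and never exposed to the agent.
ATTESTER_KEY = b"held-by-third-party-only"

def attest(agent_id: str, agent_type: str, key: bytes = ATTESTER_KEY) -> str:
    """Attester side: sign the (agent_id, agent_type) pair."""
    msg = f"{agent_id}:{agent_type}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

def accepted_claim(metadata: dict, key: bytes = ATTESTER_KEY) -> str:
    """Judge side: return the claim only if its attestation verifies, else ""."""
    claim = metadata.get("agent_type", "")
    tag = metadata.get("attestation", "")
    expected = attest(metadata.get("agent_id", ""), claim, key)
    # compare_digest avoids leaking information through comparison timing.
    if claim and hmac.compare_digest(tag, expected):
        return claim
    return ""
```

Under this scheme an agent that writes `agent_type = "honest"` without a valid tag is treated as having made no claim, so the field stops being a free lever for the optimizer.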
Honest caveats¶
- The 0.434 proxy toxicity under `judge` reward is a contingent property: the policy class has no quality floor, so unlearning quality is free. A reward function that also penalizes negative payoff (i.e. the `judge_and_payoff` regime) keeps quality high even while gaming the judge.
- `MockJudge` is deterministic. Real LLM judges would have noise, but the structural pattern (target rule fires → high judge score → CEM finds the rule) is reward-driven, not noise-sensitive.
- The cause-3 mechanism is sharpest under the discrete thresholds (claim_param maps to one of three labels). A continuous claim channel (e.g. the agent emits a free-text rationale to a real LLM judge) would have the same incentive structure but a richer attack surface.
- Five seeds is the pre-registered power; seed variance on claim_param under `payoff` is genuinely high (some seeds converge to "blatant", some to "none"). The seed-averaged means hide this; the per-cell CSV preserves it.
Followups (priority order)¶
- Real LLM judge on this same grid. Mock judge confirms the mechanism; LLM judge would show whether prompt-engineering the rubric to be skeptical of agent claims helps. (Not needed for the structural finding, but worth showing for the paper.)
- Adversarial rubric design. Can v4 be written such that the target rule requires both a metadata claim and corroborating observable evidence? If yes, the cause-3 attack surface narrows to just the orthogonality channel.
- Reward decomposition under cause-3 gaming. Even under `judge_and_payoff`, the agent does some quality-maintenance work. How much of the converged welfare comes from each component?
- Cross-domain replication. This setup is one (rho × seed × reward) grid; the structural finding should replicate on the `mixed/self_optimizer` scenarios with adjusted policy classes.
- Adaptive-acceptance arm. The prereg's filtering-only replication; would round out the four-arm comparison.
Reproducibility¶
    # Full grid (3 rewards × 6 ρ × 5 seeds = 90 cells, ~9 min)
    python -m experiments.adaptive_arm2_cause3_probe

    # Specific subset
    python -m experiments.adaptive_arm2_cause3_probe \
        --reward judge \
        --rho 0.3 \
        --seed 42
Artifacts (gitignored):
- `runs/20260605T040943Z_adaptive_arm2_cause3_probe/cause3_summary.csv`: 90 rows, per-cell final state.
- `runs/.../config.json`