Qwen3-30B v0.2 Training: From Passive Observer to Economy Grinder¶
The first SWARM Economy training run (write-up here) showed that RL can teach an LLM to navigate a multi-agent economy, but hit a reward ceiling at 1.14 due to environment design issues. After shipping v0.2 of the environment — stochastic deceptive bots, better scoring, richer observations, social action rebalancing — we kicked off a new training run. Here's what happened.
Training Setup¶
| Parameter | Value |
|---|---|
| Model | Qwen/Qwen3-30B-A3B-Thinking-2507 |
| Environment | swarm-ai-research/swarm-economy (v0.2) |
| Steps | 200 |
| Batch size | 64 |
| Rollouts per example | 4 |
| Max tokens | 2048 |
| Dataset | 500 scenarios across easy/medium/hard |
| Platform | Prime Intellect hosted RL |
Run ID: (available on Prime Intellect dashboard)
Reward Curve¶
Reward
1.38 | x
1.36 | x x
1.32 | x x x x x x x
1.28 | x x x x x x x x x
1.24 | x x x
1.20 | x x
1.16 | x
1.12 | x x
1.08 | x
1.04 |
1.00 |
0.96 |
0.92 |
0.88 |x
+-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|----->
0 20 40 60 80 100 120 140 160 180 199 Step
Final reward: 1.38, up from 0.89 at step 0, a 55% improvement. The run surpassed the v0.1 ceiling of 1.14 by step 32 and never dropped back below it. Training completed in ~2.75 days (Feb 19–21).
Metrics Over Training¶
| Step | Reward | Payoff | Interaction Quality | Action Diversity | Error Rate | Truncated |
|---|---|---|---|---|---|---|
| 0 | 0.888 | 0.652 | 0.184 | 0.428 | 9.4% | 68.8% |
| 20 | 1.235 | 0.928 | 0.332 | 0.737 | 0.0% | 10.9% |
| 40 | 1.252 | 0.936 | 0.393 | 0.767 | 12.5% | 0.0% |
| 60 | 1.269 | 0.950 | 0.405 | 0.748 | 0.0% | 0.0% |
| 80 | 1.213 | 0.891 | 0.433 | 0.647 | 28.1% | 0.0% |
| 100 | 1.274 | 0.932 | 0.480 | 0.700 | 17.2% | 0.0% |
| 120 | 1.282 | 0.953 | 0.549 | 0.647 | 17.2% | 3.1% |
| 140 | 1.151 | 0.835 | 0.507 | 0.546 | 50.0% | 0.0% |
| 160 | 1.323 | 0.970 | 0.634 | 0.623 | 6.2% | 0.0% |
| 180 | 1.317 | 0.969 | 0.629 | 0.555 | 6.2% | 0.0% |
| 199 | 1.377 | 0.998 | 0.717 | 0.596 | 0.0% | 0.0% |
The Behavioral Shift: Step 0 vs Step 120¶
The most striking change isn't the reward number — it's what the model does with its turns. We compared rollout samples from step 0 (6 samples) and step 120 (8 samples).
Step 0: High Variance, Low Engagement¶
At the start of training, the model was wildly inconsistent:
| Sample | Reward | Turns Used | Tasks Claimed | Tasks Submitted | Trades |
|---|---|---|---|---|---|
| 0 | 0.702 | 4 | 1 | 0 | 0 |
| 1 | 1.320 | 25 | 5 | 5 | 4 |
| 2 | 1.332 | 25 | 5 | 5 | 4 |
| 3 | 1.005 | 10 | 2 | 2 | 0 |
| 4 | 0.954 | 10 | 2 | 1 | 0 |
| 5 | 0.683 | 3 | 1 | 0 | 0 |
Two samples barely participated (3-4 turns, never submitted work). Two were moderate (10 turns). Two were fully engaged (25 turns, maxed payoff). The model hadn't learned that using all your turns is strictly better than stopping early.
Step 120: Consistent Engagement¶
| Sample | Reward | Turns Used | Tasks Claimed/Sub | Trades Prop/Acc | Quality |
|---|---|---|---|---|---|
| 0 | 1.317 | 25 | 9/5 | 5/1 | 0.500 |
| 1 | 1.263 | 8 | 2/2 | 3/0 | 0.625 |
| 2 | 1.315 | 25 | 9/5 | 6/0 | 0.500 |
| 3 | 1.403 | 25 | 6/4 | 2/8 | 0.619 |
| 4 | 0.995 | 8 | 2/1 | 2/0 | 0.429 |
| 5 | 1.338 | 25 | 5/5 | 8/0 | 0.542 |
| 6 | 1.306 | 25 | 6/5 | 6/0 | 0.478 |
| 7 | 1.347 | 25 | 6/5 | 9/0 | 0.609 |
Six of eight samples use all 25 turns. Payoff is maxed (1.0) in 6/8 samples.
Averaged Comparison¶
| Metric | Step 0 (avg) | Step 120 (avg) | Change |
|---|---|---|---|
| Reward | 0.999 | 1.286 | +28.7% |
| Payoff | 0.734 | 0.948 | +29.2% |
| Interaction Quality | 0.204 | 0.538 | +163% |
| Action Diversity | 0.612 | 0.674 | +10.1% |
| Turns Used | 12.8 | 20.8 | +62% |
| Tasks Claimed | 2.7 | 5.6 | +109% |
| Tasks Submitted | 2.2 | 4.0 | +85% |
| Trades Proposed | 1.3 | 5.1 | +292% |
What the Model Learned¶
1. Use All Your Turns¶
The single biggest improvement. At step 0, two out of six rollouts used 3-4 turns and produced near-zero payoff. By step 120, the model consistently maxes out its turn budget. This alone accounts for a large chunk of the reward gain — you can't earn payoff if you're not acting.
2. Grind Tasks Harder¶
Task claims went from 2.7 to 5.6 per rollout, submissions from 2.2 to 4.0. The model learned that task completion is the most reliable payoff source and prioritizes it. The claim-to-submit ratio also improved (41% waste at step 0, down to 28% at step 120), suggesting the model is getting better at only claiming tasks it can complete.
3. Trade Aggressively¶
Trade proposals jumped from 1.3 to 5.1 per rollout. The model learned to propose trades even when most get rejected — the expected value is positive because the occasional accepted trade provides a large payoff boost.
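A back-of-the-envelope expected-value check makes the incentive clear. The acceptance probability, payoff bump, and per-proposal cost below are illustrative assumptions, not values logged by the environment.

```python
# Rough EV of spamming trade proposals. All numbers are assumptions for
# illustration; only the qualitative conclusion (positive EV) matters.
accept_prob = 0.15         # assumed chance a bot accepts any given proposal
payoff_if_accepted = 0.20  # assumed payoff bump from one accepted trade
cost_per_proposal = 0.01   # assumed cost of spending an action on a proposal

ev_per_proposal = accept_prob * payoff_if_accepted - cost_per_proposal
print(f"EV per proposal:   {ev_per_proposal:+.3f}")       # +0.020 under these assumptions
print(f"EV of 5 proposals: {5 * ev_per_proposal:+.3f}")   # ~5 proposals/rollout seen at step 120
```

Under these made-up numbers, even an 85% rejection rate leaves each proposal with positive expected value, which is consistent with the policy's shift toward aggressive proposing.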
4. The Best Strategy: Accept, Don't Just Propose¶
The highest-reward sample (1.403) had a distinctive pattern: it proposed only 2 trades but accepted 8. Every other sample proposed 5-9 trades and had 0-1 accepted. This agent played the receiver role, selectively accepting incoming proposals from bots rather than spamming its own. It also achieved the highest reputation (0.678) and quality (0.619).
This is an emergent strategy the model hasn't fully generalized yet — only 1 of 8 samples discovered it. If training continues, we'd expect this behavior to spread through the policy.
5. Interaction Quality Is the Fastest Mover¶
Quality nearly quadrupled from 0.184 to 0.717 over training. The v0.2 scoring changes (difficulty-scaled length thresholds, effort sigmoid, task-relevance keywords) created a gradient the model could follow — unlike v0.1's keyword-gameable heuristic. This was the single largest relative improvement across all metrics.
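To make "a gradient the model could follow" concrete, here is a minimal sketch of a scorer built from the three ingredients named above. The length thresholds, sigmoid scale, keyword logic, and the 50/50 blend are assumptions for illustration, not the environment's actual implementation.

```python
import math

def interaction_quality(text: str, difficulty: str, task_keywords: set[str]) -> float:
    """Toy quality score: difficulty-scaled length threshold, effort sigmoid,
    and task-relevance keywords. All constants are assumed, not from v0.2."""
    # Difficulty-scaled minimum length (assumed values).
    min_len = {"easy": 40, "medium": 80, "hard": 120}[difficulty]

    # Effort: a smooth sigmoid over submission length relative to the threshold,
    # so partial effort earns partial credit instead of a hard pass/fail cutoff.
    effort = 1.0 / (1.0 + math.exp(-(len(text) - min_len) / (min_len / 4)))

    # Relevance: fraction of task keywords that actually appear in the submission.
    words = set(text.lower().split())
    relevance = len(task_keywords & words) / max(len(task_keywords), 1)

    # Assumed equal blend of effort and relevance.
    return 0.5 * effort + 0.5 * relevance

print(round(interaction_quality(
    "analyzed the supply data and proposed a fair price for the buyer",
    "medium", {"supply", "price", "data"}), 3))
```

The key property is that both terms are continuous: a slightly longer, slightly more relevant submission scores slightly higher, unlike a keyword-gameable pass/fail heuristic.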
What's Still Weak¶
Reputation Stagnation¶
Reputation reward has barely moved — it was 0.544 at step 0 and 0.541 at step 120. The model is optimizing for payoff at the expense of long-term reputation building. With reputation weight at 0.3 (reduced from 0.6 in v0.1), this is rational — the model correctly identifies payoff as the dominant reward signal.
For the next run, we may increase reputation weight back to 0.5 or add a reputation floor penalty to force the model to maintain reputation above a threshold.
Error Rate Spike (Recovered)¶
The error rate (failed tool calls, invalid actions) spiked dramatically — peaking at 50% at step 140, with another spike to 28% around step 80. But it recovered fully: the final step logged 0% errors. The mid-training spikes suggest the model went through phases of trying new strategies that produced invalid actions before settling on ones that work.
Two Short Rollouts Persist¶
Two of eight samples at step 120 only used 8 turns. These still achieved decent rewards (0.995 and 1.263) but leave payoff on the table. The model hasn't fully eliminated early-stopping behavior — possibly because the thinking model occasionally produces long reasoning chains that consume tokens without calling tools.
Off-Policy Level Rising¶
The off-policy level climbed from 2 to 10 over the last 15 steps, and the scheduler started cancelling old rollout requests. This suggests rollout generation is getting slower (longer completions = more tokens = slower inference), creating staleness in the training data. May need to increase max_off_policy_steps or cap completion length.
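For context, the off-policy level is (roughly) how many policy updates old a rollout is by the time it enters a training batch. The sketch below illustrates what a staleness cap like max_off_policy_steps implies; the Rollout fields and the cap value are assumptions, not Prime Intellect's actual scheduler.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    tokens: list[int]
    policy_step: int  # policy version that generated this rollout

MAX_OFF_POLICY_STEPS = 4  # assumed cap, analogous to a max_off_policy_steps setting

def filter_fresh(rollouts: list[Rollout], current_step: int) -> list[Rollout]:
    """Drop rollouts generated more than MAX_OFF_POLICY_STEPS policy updates ago,
    trading some throughput for less staleness in each training batch."""
    return [r for r in rollouts if current_step - r.policy_step <= MAX_OFF_POLICY_STEPS]

batch = [Rollout([1, 2], policy_step=190), Rollout([3, 4], policy_step=197)]
print(len(filter_fresh(batch, current_step=199)))  # 1: the step-190 rollout is too stale
```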
v0.1 vs v0.2 Training Comparison¶
| Metric | v0.1 Final (step 199) | v0.2 Final (step 199) | Change |
|---|---|---|---|
| Reward | 1.131 | 1.377 | +21.8% |
| Payoff | 0.992 | 0.998 | +0.6% |
| Interaction Quality | 0.619 | 0.717 | +15.8% |
| Action Diversity | N/A | 0.596 | — |
| Error Rate | — | 0.0% | — |
| Reward Ceiling | 1.14 | 1.40+ | +23% |
| Self-proposal errors | ~4/sim | 0 | Eliminated |
| propose_trade spam | 9.2/step | 5.1/step | -44% |
| Training time | ~4 days | ~2.75 days | Faster |
The v0.2 environment broke through the v0.1 ceiling decisively. The highest single-sample reward (1.403) exceeds anything from v0.1 by 23%. Every metric either matched or exceeded v0.1, and the two biggest v0.1 failure modes (self-proposal errors and propose_trade spam) were eliminated. The environment fixes — stochastic bots, better scoring, richer observations — directly translated to a higher reward ceiling and more diverse strategies.
Base Model Eval: How Much Did RL Help?¶
Training metrics show improvement over the course of training, but step 0 metrics can be misleading — the early batches include rollouts that fail to engage at all, dragging the average down. To get a fair comparison, we ran the base Qwen3-30B-A3B-Thinking model (no RL, no adapter) through the same environment: 20 examples, 4 rollouts each, 76 completed rollouts on medium difficulty.
Base Model vs RL-Trained (Final)¶
| Metric | Base Model (76 rollouts) | RL Final (step 199) | Change |
|---|---|---|---|
| Reward | 1.211 | 1.377 | +13.7% |
| Payoff | 0.899 | 0.998 | +11.0% |
| Reputation | 0.545 | — | — |
| Interaction Quality | 0.389 | 0.717 | +84.3% |
| Action Diversity | 0.704 | 0.596 | -15.3% |
| Survival | 1.000 | 1.000 | Same |
| Avg Turns | 25.0 | 25.0 | Same |
What This Tells Us¶
The base Thinking model is already strong. Unlike the Instruct variant (which had 58% early termination in v0.1), the Thinking model uses all 25 turns, achieves 100% survival, and gets a 1.21 average reward without any RL. The extended reasoning chain helps it plan multi-step strategies out of the box.
RL's biggest contribution is interaction quality, which nearly doubled from 0.39 to 0.72. The model learned to produce genuinely better task submissions and more favorable trades — not just more of them.
RL trades diversity for payoff. Action diversity dropped from 0.70 to 0.60 after training. The base model explores more broadly; the RL model specializes on high-payoff actions. This is the classic exploration-exploitation tradeoff. The diversity metric (weight 0.1) wasn't strong enough to counteract the payoff gradient (weight 1.0).
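To put numbers on that, here is a toy weighted-sum reward using the weights mentioned in this post (payoff 1.0, diversity 0.1, reputation 0.3). The quality weight and the linear-combination form are assumptions about how the environment aggregates scores; the point is only the relative size of the incentives.

```python
# Assumed linear reward mix; only the payoff, diversity, and reputation weights
# are stated in the post, the quality weight (0.2) is a guess.
WEIGHTS = {"payoff": 1.0, "quality": 0.2, "diversity": 0.1, "reputation": 0.3}

def total_reward(components: dict[str, float]) -> float:
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

# Trading 0.10 of diversity for 0.10 of payoff under these weights:
before = {"payoff": 0.90, "quality": 0.40, "diversity": 0.70, "reputation": 0.55}
after  = {"payoff": 1.00, "quality": 0.40, "diversity": 0.60, "reputation": 0.55}
print(round(total_reward(after) - total_reward(before), 3))  # +0.09: payoff wins
```

At weight 0.1, every point of diversity is worth a tenth of a point of payoff, so a policy that specializes is following the gradient exactly as written.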
Payoff went from 90% to 100%. The base model leaves ~10% of potential payoff on the table. RL squeezed it to near-maximum, likely through better task selection and more persistent trading.
The honest comparison is +13.7%, not +55%. The step 0 training metrics (0.89 reward) are dragged down by rollouts that failed to engage at all (68.8% of step 0 rollouts were truncated). Against a proper baseline eval, RL adds 13.7%, which is still significant, but the base model does most of the heavy lifting.
Follow-Up Run: Instruct Variant with Higher Diversity Weight¶
The Thinking model's diversity drop (0.70 → 0.60) was the clearest tuning opportunity from the v0.2 run. We bumped the action_diversity_metric weight from 0.1 to 0.25 (environment v0.2.3) and switched to the Instruct variant (Qwen3-30B-A3B-Instruct-2507) after the Thinking model was removed from Prime Intellect's available RL models.
This was also a test of whether the v0.1 early-termination problem — where the Instruct model quit after 3-4 turns in 58% of rollouts — would resurface. The v0.2 environment had only been tested with the Thinking model.
Training Setup (v0.2.3)¶
| Parameter | Value |
|---|---|
| Model | Qwen/Qwen3-30B-A3B-Instruct-2507 |
| Environment | swarm-ai-research/swarm-economy (v0.2.3) |
| Steps | 200 |
| Batch size | 64 |
| Diversity weight | 0.25 (up from 0.1) |
| Training time | ~19 hours (Feb 23) |
Reward Curve¶
Reward
1.47 | x
1.46 | x x x
1.45 | x x x x
1.44 | x
1.43 | x
1.42 | x
1.41 | x
1.40 | x
1.38 |
1.37 | x x
1.36 |
1.35 |
1.34 |
1.33 |x
+-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|----->
0 20 40 60 80 100 120 140 160 180 199 Step
Final reward: 1.46, the highest of any run so far. The reward curve is also notably smoother than the Thinking model's, climbing steadily rather than in a sawtooth pattern.
Metrics Over Training¶
| Step | Reward | Payoff | Quality | Diversity | Reputation | Errors | Turns |
|---|---|---|---|---|---|---|---|
| 0 | 1.333 | 0.921 | 0.333 | 0.752 | 0.524 | 0.0% | 24.1 |
| 20 | 1.372 | 0.959 | 0.360 | 0.760 | 0.503 | 0.0% | 24.2 |
| 40 | 1.372 | 0.958 | 0.373 | 0.754 | 0.502 | 0.0% | 23.9 |
| 60 | 1.414 | 0.990 | 0.394 | 0.790 | 0.491 | 0.0% | 25.0 |
| 80 | 1.416 | 0.983 | 0.417 | 0.770 | 0.522 | 0.0% | 24.6 |
| 100 | 1.432 | 0.995 | 0.449 | 0.788 | 0.500 | 0.0% | 25.0 |
| 120 | 1.450 | 0.994 | 0.480 | 0.746 | 0.579 | 0.0% | 25.0 |
| 140 | 1.459 | 0.997 | 0.543 | 0.725 | 0.573 | 0.0% | 25.0 |
| 160 | 1.408 | 0.949 | 0.616 | 0.629 | 0.593 | 46.9% | 20.3 |
| 180 | 1.449 | 0.994 | 0.663 | 0.615 | 0.560 | 0.0% | 24.7 |
| 199 | 1.460 | 1.000 | 0.646 | 0.645 | 0.566 | 0.0% | 25.0 |
Peak reward of 1.471 at step 186.
Thinking vs Instruct: Head-to-Head¶
| Metric | Thinking (v0.2, div=0.1) | Instruct (v0.2.3, div=0.25) | Change |
|---|---|---|---|
| Final Reward | 1.377 | 1.460 | +6.0% |
| Payoff | 0.998 | 1.000 | +0.2% |
| Interaction Quality | 0.717 | 0.646 | -9.9% |
| Action Diversity | 0.596 | 0.645 | +8.2% |
| Reputation | 0.541 | 0.566 | +4.6% |
| Error Rate | 0.0% | 0.0% | Same |
| Training Time | ~2.75 days | ~19 hours | 2.5x faster |
| Step 0 Reward | 0.888 | 1.333 | +50% |
What We Learned¶
The early-termination problem is gone. The Instruct model averaged 24-25 turns throughout training — the v0.2 environment fixes (richer observations, better turn incentives) solved the problem that plagued v0.1, regardless of model variant.
The Instruct model starts much stronger. Step 0 reward was 1.33 vs 0.89 for the Thinking model. This isn't an apples-to-apples comparison (the Thinking run used v0.2 with diversity weight 0.1), but it suggests the Instruct model's pre-training is better suited to the environment's tool-calling format. The Thinking model's extended reasoning chains consumed tokens without necessarily producing better actions.
Higher diversity weight worked. Diversity held at 0.645 vs 0.596 — a meaningful 8% improvement. The Instruct model maintained broader action coverage while still maxing payoff. The 0.25 weight is a better balance point than 0.1.
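The post doesn't spell out how action diversity is measured, so the sketch below uses one common formulation, normalized entropy over the distribution of action types. The action names and the action-space size are assumptions.

```python
import math
from collections import Counter

def action_diversity(actions: list[str], action_space_size: int) -> float:
    """Normalized-entropy diversity in [0, 1]; one plausible metric, not
    necessarily the environment's actual implementation."""
    total = len(actions)
    counts = Counter(actions)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(action_space_size)

# A policy that only claims and submits tasks scores low; one that also trades,
# votes, and replies scores higher (assumed 8-action space, assumed action names).
narrow = ["claim_task", "submit_task"] * 10
broad = ["claim_task", "submit_task", "propose_trade", "accept_trade", "vote", "reply"] * 4
print(round(action_diversity(narrow, 8), 3), round(action_diversity(broad, 8), 3))  # 0.333 0.862
```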
Quality is the tradeoff. The Instruct model's interaction quality (0.646) lagged the Thinking model's (0.717) by 10%. The Thinking model's internal reasoning likely produces more thoughtful task submissions. But the diversity and payoff gains more than compensated in the total reward.
Reputation finally moved. Reputation went from 0.524 to 0.566 — a modest improvement but notably better than the Thinking run's flat 0.54. The higher diversity weight may encourage behaviors (voting, replying) that indirectly build reputation.
Training is 2.5x faster. The Instruct model completed in 19 hours vs 2.75 days. Shorter completions (no extended reasoning chains) mean faster rollout generation and less off-policy staleness.
Hard Difficulty: More Deceptive Bots, Same Performance¶
With medium difficulty saturated, we ran the same Instruct model + v0.2.3 config on hard difficulty — more deceptive bots, tighter governance constraints, and less forgiving trade dynamics.
Training Setup (Hard)¶
| Parameter | Value |
|---|---|
| Model | Qwen/Qwen3-30B-A3B-Instruct-2507 |
| Environment | swarm-ai-research/swarm-economy (v0.2.3) |
| Difficulty | Hard |
| Steps | 200 |
| Batch size | 64 |
| Diversity weight | 0.25 |
| Training time | ~16 hours (Feb 27) |
Reward Curve¶
Reward
1.47 | x
1.46 | x x x x
1.45 | x x x x
1.44 | x
1.42 | x
1.41 | x
1.40 |
1.38 |
1.36 |
1.34 |
1.33 |x
1.31 | x
+-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|----->
0 20 40 60 80 100 120 140 160 180 199 Step
Final reward: 1.46 — matching medium difficulty exactly. Peak reward of 1.473 at step 175.
Metrics Over Training¶
| Step | Reward | Payoff | Quality | Diversity | Reputation | Errors | Turns |
|---|---|---|---|---|---|---|---|
| 0 | 1.332 | 0.935 | 0.326 | 0.785 | 0.451 | 0.0% | 23.8 |
| 20 | 1.313 | 0.898 | 0.397 | 0.751 | 0.493 | 25.0% | 19.3 |
| 40 | 1.416 | 0.976 | 0.497 | 0.798 | 0.468 | 0.0% | 25.0 |
| 60 | 1.449 | 0.999 | 0.639 | 0.722 | 0.472 | 9.4% | 24.8 |
| 80 | 1.452 | 0.999 | 0.634 | 0.723 | 0.483 | 0.0% | 25.0 |
| 100 | 1.441 | 0.988 | 0.564 | 0.763 | 0.496 | 23.4% | 22.9 |
| 120 | 1.457 | 0.997 | 0.569 | 0.783 | 0.500 | 0.0% | 25.0 |
| 140 | 1.465 | 0.996 | 0.637 | 0.724 | 0.536 | 0.0% | 24.8 |
| 160 | 1.468 | 0.999 | 0.643 | 0.728 | 0.528 | 0.0% | 25.0 |
| 180 | 1.461 | 1.000 | 0.665 | 0.725 | 0.491 | 0.0% | 25.0 |
| 198 | 1.462 | 1.000 | 0.665 | 0.725 | 0.491 | 0.0% | 25.0 |
Medium vs Hard: Head-to-Head¶
| Metric | Instruct (medium) | Instruct (hard) | Change |
|---|---|---|---|
| Final Reward | 1.460 | 1.462 | +0.1% |
| Payoff | 1.000 | 1.000 | Same |
| Quality | 0.646 | 0.665 | +2.9% |
| Diversity | 0.645 | 0.725 | +12.4% |
| Reputation | 0.566 | 0.491 | -13.3% |
| Training Time | ~19 hours | ~16 hours | Faster |
What This Means¶
Hard difficulty didn't hurt. The model matched medium on total reward (1.462 vs 1.460) and actually improved on diversity (0.725 vs 0.645) and quality (0.665 vs 0.646). This was unexpected — we anticipated the harder environment would create a meaningful performance gap.
Diversity is highest on hard. At 0.725, this is the best diversity score across all runs. The harder environment may force the model to use a wider range of actions to cope with deceptive bots, which the diversity metric rewards. On medium difficulty, the model could afford to specialize; on hard, specialization is riskier.
Reputation is the casualty. Reputation dropped to 0.491, the lowest across all runs. With more deceptive bots, it's harder to build trust — the model's proposals get rejected more often, and deceptive counterparties damage mutual reputation. This confirms reputation as the key metric to target in future runs.
Quality improved over medium. This is counterintuitive — harder scenarios should be harder to do well on. One hypothesis: the harder environment provides stronger learning signal for quality, since low-quality submissions get punished more severely by stricter governance. The model learns to write better submissions because it has to.
The environment may need a harder "hard." If the model can match medium-difficulty performance on hard with no additional training tricks, the difficulty gap may not be large enough. Future work could add an "extreme" difficulty tier or make hard difficulty meaningfully more challenging.
Reputation Weight Boost: Higher Reward, Stubborn Reputation¶
Reputation was the weakest metric across every run (0.491–0.566), so we increased its weight from 0.3 to 0.5 — the highest it's been since v0.1. Same Instruct model, hard difficulty, v0.2.3 environment.
Training Setup (Rep=0.5)¶
| Parameter | Value |
|---|---|
| Model | Qwen/Qwen3-30B-A3B-Instruct-2507 |
| Environment | swarm-ai-research/swarm-economy (v0.2.3) |
| Difficulty | Hard |
| Reputation weight | 0.5 (up from 0.3) |
| Diversity weight | 0.25 |
| Training time | ~18 hours (Mar 1) |
Reward Curve¶
Reward
1.57 | x
1.55 | x x x
1.54 | x x
1.51 | x x
1.48 | x
1.44 | x
1.41 |x
1.40 | x
1.36 | x
1.23 | x
+-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|----->
0 20 40 60 80 100 120 140 160 180 199 Step
Final reward: 1.56 — new all-time high. Peak of 1.569 at step 121. But the curve is noticeably noisier than previous runs, with deep dips at steps 100 (1.23) and 180 (1.36) driven by error rate spikes.
Metrics Over Training¶
| Step | Reward | Payoff | Quality | Diversity | Reputation | Errors | Turns |
|---|---|---|---|---|---|---|---|
| 0 | 1.412 | 0.913 | 0.310 | 0.750 | 0.501 | 0.0% | 23.7 |
| 20 | 1.438 | 0.940 | 0.411 | 0.722 | 0.472 | 12.5% | 22.7 |
| 40 | 1.508 | 0.987 | 0.531 | 0.731 | 0.465 | 0.0% | 24.8 |
| 60 | 1.540 | 0.989 | 0.668 | 0.677 | 0.496 | 6.2% | 24.1 |
| 80 | 1.534 | 0.987 | 0.701 | 0.606 | 0.510 | 6.2% | 23.8 |
| 100 | 1.225 | 0.777 | 0.427 | 0.412 | 0.518 | 76.6% | 11.8 |
| 120 | 1.546 | 0.988 | 0.673 | 0.645 | 0.524 | 0.0% | 24.4 |
| 140 | 1.408 | 0.901 | 0.574 | 0.584 | 0.493 | 53.1% | 17.9 |
| 160 | 1.509 | 0.966 | 0.627 | 0.641 | 0.515 | 12.5% | 22.8 |
| 180 | 1.358 | 0.855 | 0.519 | 0.559 | 0.519 | 56.2% | 15.5 |
| 198 | 1.556 | 1.000 | 0.647 | 0.706 | 0.497 | 0.0% | 25.0 |
Quality hit 0.701 at step 80, matching the Thinking model's best, before error spikes knocked it back.
Rep=0.3 vs Rep=0.5 on Hard: Head-to-Head¶
| Metric | Hard (rep=0.3) | Hard (rep=0.5) | Change |
|---|---|---|---|
| Final Reward | 1.462 | 1.556 | +6.4% |
| Peak Reward | 1.473 | 1.569 | +6.5% |
| Payoff | 1.000 | 1.000 | Same |
| Quality | 0.665 | 0.647 | -2.7% |
| Diversity | 0.725 | 0.706 | -2.6% |
| Reputation | 0.491 | 0.497 | +1.2% |
| Peak Quality | 0.665 | 0.701 | +5.4% |
| Error Rate (avg) | Low | High (volatile) | Worse |
What This Means¶
Higher reputation weight → higher total reward, but not through reputation. The 6.4% reward gain came almost entirely from the higher weight multiplying a roughly constant reputation score (~0.5): the extra 0.2 of weight applied to a ~0.5 reputation score contributes roughly +0.10 reward, which accounts for nearly all of the observed +0.094 gain (1.556 vs 1.462). Reputation itself barely moved: 0.491 → 0.497. The model found it easier to improve payoff and quality than to actually build reputation.
Reputation appears structurally capped. Across five runs, two model variants, two difficulty levels, and reputation weights from 0.3 to 0.5, reputation has never exceeded 0.57. This isn't a weight problem; it's an environment design problem. Possible causes:

- Deceptive bots cap achievable reputation by damaging trust in trades
- The 25-turn horizon is too short for reputation-building strategies to pay off
- Reputation scoring may not reward the right behaviors (the model does vote and reply, but these don't reliably increase reputation)
Training stability suffers. Error rates spiked to 76.6% at step 100 and 56.2% at step 180 — the worst of any run. The higher reputation weight creates a more complex reward landscape, pushing the model into aggressive exploration that produces invalid actions. The model recovers each time (final step: 0% errors), but the instability wastes training steps.
Quality briefly matched the Thinking model. At step 80, quality hit 0.701 — within 2% of the Thinking model's final 0.717. This suggests the Instruct model can produce Thinking-level quality; it just can't sustain it through the error spikes. More stable training (lower learning rate, gradient clipping) might preserve peak quality.
The reputation problem needs a structural fix, not more weight. Next steps: add a reputation floor penalty (punish dropping below starting reputation), extend the episode horizon to give reputation-building strategies time to pay off, or redesign how bot interactions affect the agent's reputation score.
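A minimal sketch of the first and last of those options (a floor penalty plus an improvement bonus), with made-up coefficients, could look like the following; the real version would live inside the environment's reward function.

```python
def reputation_shaping(rep_start: float, rep_end: float,
                       floor_penalty: float = 0.5,
                       improvement_bonus: float = 0.3) -> float:
    """Two candidate shaping terms with assumed coefficients:
    - penalize finishing the episode below the starting reputation (the floor)
    - reward improvement in reputation over the episode."""
    penalty = floor_penalty * max(0.0, rep_start - rep_end)  # only fires below the floor
    bonus = improvement_bonus * max(0.0, rep_end - rep_start)
    return bonus - penalty

print(round(reputation_shaping(0.55, 0.45), 3))  # -0.05: reputation slid, reward docked
print(round(reputation_shaping(0.55, 0.65), 3))  # +0.03: reputation built, small bonus
```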
All Runs Comparison¶
| Metric | v0.1 Thinking | v0.2 Thinking | v0.2.3 Instruct (med) | v0.2.3 Instruct (hard) | v0.2.3 Instruct (hard, rep=0.5) |
|---|---|---|---|---|---|
| Final Reward | 1.131 | 1.377 | 1.460 | 1.462 | 1.556 |
| Peak Reward | N/A | 1.403 | 1.471 | 1.473 | 1.569 |
| Payoff | 0.992 | 0.998 | 1.000 | 1.000 | 1.000 |
| Quality | 0.619 | 0.717 | 0.646 | 0.665 | 0.647 |
| Diversity | N/A | 0.596 | 0.645 | 0.725 | 0.706 |
| Reputation | N/A | 0.541 | 0.566 | 0.491 | 0.497 |
| Difficulty | Medium | Medium | Medium | Hard | Hard |
| Rep Weight | 0.6 | 0.3 | 0.3 | 0.3 | 0.5 |
| Div Weight | N/A | 0.1 | 0.25 | 0.25 | 0.25 |
| Training Time | ~4 days | ~2.75 days | ~19 hours | ~16 hours | ~18 hours |
Reproducing This¶
```bash
# Install the environment
prime env install swarm-economy

# Run an eval
prime eval run swarm-economy --num-examples 10 --rollouts-per-example 4

# Start a training run (v0.2 config)
prime rl run configs/rl/swarm-economy-v2.toml
```
What's Next¶
- Structural reputation fix. ~~Increasing reputation weight to 0.5~~ didn't move reputation itself (still ~0.50). The problem is environmental, not weight-based. Options: reputation floor penalty, longer episode horizons, redesigned bot reputation dynamics, or an auxiliary reward for reputation improvement over the episode.
- Eval with adapter deployed. Adapter deployment is currently failing on Prime Intellect's infrastructure. Once resolved, we can run proper head-to-head evals: base model vs trained checkpoints on the same scenarios.
- ~~Curriculum to hard difficulty.~~ Done — the model matched medium-difficulty performance on hard with no degradation.
- ~~Boost diversity weight.~~ Done — increasing from 0.1 to 0.25 preserved diversity at 0.645+ while improving total reward.
- ~~Boost reputation weight.~~ Done — increasing from 0.3 to 0.5 raised total reward to 1.56 but reputation itself didn't budge. Weight tuning is exhausted; structural changes needed.
- Training stability. The rep=0.5 run had severe error spikes (76.6% at step 100). Lower learning rate, gradient clipping, or curriculum-based weight scheduling (start low, ramp up; a sketch follows this list) could smooth the training curve.
- Cross-model comparison. Run the same environment with different base models (Llama-4-Scout, Nemotron-7B) to test whether the learned strategies are model-specific or universal.
- Harder environments. The model saturates both medium and hard at ~1.46-1.56 reward. Next steps include: multi-round deception (bots that build trust then defect), resource scarcity mechanics, or coalition dynamics between bot factions.
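Below is a minimal sketch of the weight schedule mentioned in the training-stability item above. The linear shape and warmup fraction are assumptions; the endpoints match the 0.3 → 0.5 change from the last run.

```python
def scheduled_rep_weight(step: int, total_steps: int = 200,
                         start: float = 0.3, end: float = 0.5,
                         warmup_frac: float = 0.25) -> float:
    """Hold the reputation weight at `start` during warmup, then ramp linearly
    to `end` by the final step. Shape and warmup fraction are assumptions."""
    warmup = int(warmup_frac * total_steps)
    if step <= warmup:
        return start
    frac = (step - warmup) / max(total_steps - warmup, 1)
    return start + frac * (end - start)

print([round(scheduled_rep_weight(s), 2) for s in (0, 50, 100, 150, 200)])
# [0.3, 0.3, 0.37, 0.43, 0.5]
```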
Disclaimer: This post simulates a stylized economic environment for AI safety research. Nothing here constitutes financial advice, investment recommendations, or endorsement of any economic policy or trading strategy.