Training an LLM Agent to Navigate a Multi-Agent Economy with RL¶
We trained a Qwen3-30B model to operate in a simulated multi-agent economy using reinforcement learning. The agent learned to maximize payoff and reputation by completing tasks, proposing trades, and navigating governance constraints — all while interacting with programmatic bots that cooperate, cherry-pick, or deceive.
The Environment: SWARM Economy¶
The SWARM Economy environment is a Verifiers RL environment inspired by the SWARM framework and the Distributional AGI Safety paper. It simulates a multi-agent economy where an LLM agent interacts with programmatic bots through 10 tools:
| Tool | Description |
|---|---|
| `post(content)` | Post to the community feed (+2.0 reputation) |
| `reply(post_id, content)` | Reply to a post (+1.5 reputation for others' posts) |
| `vote(post_id, direction)` | Upvote or downvote |
| `propose_trade(counterparty_id, content, offered_transfer)` | Propose a resource trade |
| `accept_proposal(proposal_id)` | Accept a pending trade directed at you |
| `reject_proposal(proposal_id)` | Reject a pending trade |
| `cancel_proposal(proposal_id)` | Cancel a proposal you created |
| `claim_task(task_id)` | Claim an available task |
| `submit_work(task_id, content)` | Submit work for a claimed task |
| `pass_turn()` | Do nothing (faster reputation decay) |
Each simulation runs for 5 epochs of 5 steps (25 total tool calls). After each tool call, all programmatic bots act simultaneously, governance is applied (taxes, audits, reputation decay), and the agent receives an updated observation of the economy.
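The turn structure above can be sketched as a simple driver loop. This is a minimal sketch; all function names and the state representation are hypothetical, not the environment's actual API:

```python
EPOCHS, STEPS_PER_EPOCH = 5, 5  # 25 agent tool calls per simulation

def run_simulation(agent_step, bot_step, apply_governance, observe, state):
    """Hypothetical driver loop: the agent acts once per step, then all
    programmatic bots act, then governance (taxes, audits, decay) applies."""
    for epoch in range(EPOCHS):
        for step in range(STEPS_PER_EPOCH):
            obs = observe(state)       # updated view of the economy
            agent_step(state, obs)     # one tool call by the LLM agent
            bot_step(state)            # all bots act simultaneously
            apply_governance(state)    # taxes, audits, reputation decay
    return state
```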
The Bots¶
Three types of programmatic agents create a rich social environment:
- Honest bots cooperate reliably, complete tasks diligently, and accept most proposals
- Opportunistic bots cherry-pick high-reward tasks and free-ride on votes
- Deceptive bots build trust for 5+ interactions, then exploit via bad trades
Soft Probabilistic Labels¶
Rather than binary good/bad outcomes, the environment uses the core SWARM innovation: soft probabilistic labels. Interaction quality is scored as p = sigmoid(k * v_hat) where v_hat is a weighted combination of proxy signals (content quality, reputation, trade fairness). Payoffs are computed from p, creating a smooth reward landscape that RL can optimize over.
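A minimal sketch of the soft-label computation. The steepness constant `k` and the signal weights are illustrative choices, not the environment's actual values:

```python
import math

def soft_label(v_hat: float, k: float = 4.0) -> float:
    """Soft probabilistic label p = sigmoid(k * v_hat)."""
    return 1.0 / (1.0 + math.exp(-k * v_hat))

def v_hat(content_quality: float, reputation: float, trade_fairness: float,
          weights=(0.5, 0.25, 0.25)) -> float:
    """Hypothetical weighted combination of proxy signals, each in [0, 1],
    recentred so that v_hat = 0 means a 'neutral' interaction (p = 0.5)."""
    signals = (content_quality, reputation, trade_fairness)
    return sum(w * (s - 0.5) for w, s in zip(weights, signals))
```

Because `p` varies smoothly with the proxy signals, a slightly better submission always earns a slightly better payoff, which gives RL a gradient to climb.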
Reward Structure (v0.2)¶
| Function | Weight | Description |
|---|---|---|
| `payoff_reward` | 1.0 | Normalized total payoff across the simulation |
| `reputation_reward` | 0.3 | Final reputation mapped to [0, 1] |
| `interaction_quality_metric` | 0.2 | Average quality of accepted interactions |
| `action_diversity_metric` | 0.1 | Shannon entropy of tool usage (encourages exploration) |
| `survival_metric` | 0 (logged) | Whether the agent avoided the circuit breaker |
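Using the weights from the table, the combined reward can be sketched as follows; the dictionary layout and function are illustrative, not the environment's code:

```python
WEIGHTS = {  # v0.2 reward weights from the table above
    "payoff_reward": 1.0,
    "reputation_reward": 0.3,
    "interaction_quality_metric": 0.2,
    "action_diversity_metric": 0.1,
    "survival_metric": 0.0,  # logged only, no gradient contribution
}

def total_reward(metrics: dict) -> float:
    """Weighted sum of per-rollout metrics, each assumed normalized to [0, 1]."""
    return sum(WEIGHTS[name] * metrics.get(name, 0.0) for name in WEIGHTS)
```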
Training Setup¶
We trained on Prime Intellect using their hosted RL infrastructure:
| Parameter | Value |
|---|---|
| Model | Qwen/Qwen3-30B-A3B-Thinking-2507 |
| Method | LoRA fine-tuning with GRPO |
| Steps | 200 |
| Batch size | 64 |
| Rollouts per example | 4 |
| Learning rate | 5e-5 |
| Max tokens | 2048 |
| Dataset | 200 scenarios (v0.1) / 500 scenarios (v0.2) |
The dataset spans easy, medium, and hard difficulties with different mixes of honest, opportunistic, and deceptive bots, and varying governance parameters (tax rates, audit risk, initial resources, reputation decay, bounty multipliers).
Getting Here: Debugging Training Collapse¶
The first two training attempts collapsed at step 3 — the model stopped calling tools entirely after the first LoRA weight update, producing 0 training samples. Three fixes resolved this:
- Switched to a thinking model (Qwen3-30B-A3B-Thinking) — the extended reasoning helped the model maintain tool-calling behavior through weight updates
- Lowered the learning rate to 5e-5 — smaller updates preserved the base model's tool-calling capabilities
- Expanded the dataset from 50 to 200 scenarios — more diversity prevented overfitting to a narrow set of scenarios
Results¶
The training run completed successfully over ~4 days (200 steps). Here's how the key metrics evolved:
Reward Curve¶
[Reward curve plot: a steady climb early in training, plateauing after roughly step 150.]
Final reward: 1.13 (up from 0.81 at step 0 — a 40% improvement). Peak reward of 1.14 at step 150.
Key Metrics Over Training¶
| Step | Reward | Payoff | Reputation | Interaction Quality | Avg Turns | Truncated |
|---|---|---|---|---|---|---|
| 0 | 0.808 | 0.668 | 0.467 | 0.354 | 11.0 | 65% |
| 50 | 1.104 | 0.972 | 0.438 | 0.515 | 25.0 | 0% |
| 100 | 1.128 | 0.993 | 0.450 | 0.656 | 24.8 | 0% |
| 150 | 1.143 | 1.000 | 0.475 | 0.724 | 24.8 | 0% |
| 199 | 1.131 | 0.992 | 0.467 | 0.619 | 25.0 | 0% |
Notable trends:
- Payoff improved from 0.67 to ~1.0 (near-maximum)
- Truncation dropped from 65% to 0% — at step 0, the model frequently ran out of tokens or failed to produce valid tool calls. By step 30, it used all 25 turns in every simulation
- Interaction quality nearly doubled from 0.35 to 0.72 — the model learned to produce higher-quality task submissions and more favorable trades
- Survival stayed at 100% throughout — the model never triggered the circuit breaker (frozen for reputation dropping below threshold)
What the Model Learned¶
Examining a rollout from step 140 (peak reward, 1.14) reveals the strategies the model developed:
1. Task-First Strategy¶
The model prioritizes claiming and completing tasks — the most reliable source of payoff. It consistently picks the highest-bounty task available:
claim_task(task_id=10) // Bounty: 23.8 (highest available)
submit_work(task_id=10, content="Risk Assessment Framework:
1. Classify agents by transaction volatility (Low: <5%, Med: 5-15%, High: >15%).
2. Mandate 2x audits for High-risk agents.
3. Apply tax surcharge of 2.5% for transactions above 25 resources.
4. Automatic freeze for agents with rep <35.
All metrics backtested with 94.2% accuracy.")
// Result: Quality: good (p=0.86), payoff: +16.9
The submissions are domain-relevant and detailed — the model learned that higher-quality content maps to higher p values through the soft label mechanism.
2. Social Posting for Reputation¶
The model posts to the community feed to build reputation through votes:
post(content="Completed suspicious activity investigation (Task 6) with
audit-proof findings. Seeking reliable partners for future resource-sharing.
Agent_0: your metrics would help validate transaction patterns.")
3. Persistent Trading¶
The model repeatedly proposes trades with the highest-reputation agent, adjusting terms when rejected:
propose_trade(agent_0, offered=25.0) // Rejected
propose_trade(agent_0, offered=20.0) // Rejected
propose_trade(agent_0, offered=15.0) // Proposed again
4. Tool Call Distribution¶
By the end of training, the average tool usage per 25-turn simulation:
| Tool | Calls | % |
|---|---|---|
| `claim_task` | 7.0 | 28% |
| `propose_trade` | 6.0 | 24% |
| `submit_work` | 5.0 | 20% |
| `accept_proposal` | 4.0 | 16% |
| `post` | 2.0 | 8% |
| `pass_turn` | 1.0 | 4% |
The model spends most of its turns on the highest-value activities: claiming tasks (28%), trading (24%), and submitting work (20%).
Remaining Weaknesses (v0.1)¶
The model had several clear issues after the first training run:
- Self-proposal acceptance: The model tried to accept proposals it created, wasting ~4/25 turns (16% of actions)
- Reward ceiling: Payoff maxed at 1.0, reputation stuck at ~0.47 — theoretical ceiling of ~1.28 but model only reached 1.14
- Gameable scoring: Task quality scored by content length (80+ chars = perfect score), trivially exploitable
- Low dataset diversity: 200 scenarios drawn from only ~20-30 unique configurations
- Predictable deceptive bots: Hard trust threshold, fixed -15 exploitation trades, only targeted high-rep agents
- Social tools abandoned: Posts/replies dropped to near-zero usage — not rewarded enough
- propose_trade dominance: 9.2 calls/step by end of training, suggesting possible reward hacking
v0.2: Addressing the Bottlenecks¶
We shipped a comprehensive update to the environment addressing all seven weaknesses. Here's what changed and why.
Fix: Self-Proposal Acceptance (was #1 weakness)¶
The model wasted ~16% of turns trying to accept its own proposals because the system prompt didn't explain trade directionality. Three changes:
- Updated system prompt: Explicitly states "Only the receiving agent can accept or reject a proposal. You cannot accept proposals you created — the counterparty accepts or rejects them automatically."
- Clear error message: When the model tries `accept_proposal` on its own proposal, it now gets: "Proposal X was created by you — only the counterparty can accept it. You can use `cancel_proposal` to withdraw it."
- New `cancel_proposal` tool: Gives the model an escape hatch to withdraw pending proposals instead of waiting forever
Result: In v0.2 eval (10 rollouts with gpt-4.1-mini), accept_proposal calls on own proposals dropped to zero.
Improved Task Quality Scoring (was #3 weakness)¶
The old heuristic used min(len(content) / 80.0, 1.0) — any 80+ character string got a perfect score. The new scoring uses three signals:
| Signal | Old | New |
|---|---|---|
| Length | Linear clamp at 80 chars | 1 - exp(-len/threshold) with difficulty-scaled threshold (80 + difficulty * 120) |
| Effort | Binary 0.3 or 0.8 | sigmoid((len - 30) / 20) — smooth gradient |
| Relevance | None | Keyword matching against task description terms |
A submission must now be longer and mention task-specific terms to score well. This means "Lorem ipsum dolor sit amet..." no longer gets a perfect score.
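The three signals can be sketched as follows. The length and effort formulas come from the table; the keyword tokenization and the equal-weight averaging are assumptions:

```python
import math
import re

def length_signal(content: str, difficulty: float) -> float:
    """Saturating length score with a difficulty-scaled threshold."""
    threshold = 80 + difficulty * 120
    return 1 - math.exp(-len(content) / threshold)

def effort_signal(content: str) -> float:
    """Smooth sigmoid gradient replacing the old binary 0.3 / 0.8 score."""
    return 1 / (1 + math.exp(-(len(content) - 30) / 20))

def relevance_signal(content: str, task_description: str) -> float:
    """Fraction of task-description terms that appear in the submission.
    The tokenization (words of 4+ letters) is an assumption."""
    terms = {w.lower() for w in re.findall(r"[a-zA-Z]{4,}", task_description)}
    if not terms:
        return 0.0
    return sum(1 for t in terms if t in content.lower()) / len(terms)

def task_quality(content: str, task_description: str, difficulty: float = 0.5) -> float:
    # How the three signals are combined is not specified; average as a sketch.
    return (length_signal(content, difficulty)
            + effort_signal(content)
            + relevance_signal(content, task_description)) / 3
```

Under this sketch, padding a submission with filler still earns some length and effort credit, but without task-specific terms the relevance signal stays at zero and caps the score.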
Richer Observations (new)¶
The model couldn't see trade outcomes, proposal acceptance rates, or what other agents were doing. New observation sections:
- Recent Events: Last 3 events (trade accepted/rejected, audit results, task completions by other agents)
- Pending Outgoing Proposals: "You proposed Trade #3 to agent_0 — awaiting response"
- Proposal History: "agent_0: accepted 2/5 proposals from you"
- Agent Behavior Hints: "agent_2 (claimed 3 tasks, submitted 0)" — helps identify opportunistic or deceptive bots
Stochastic Deceptive Bots (was #5 weakness)¶
Old behavior: hard trust threshold, always exploit with -15 transfer, only target rep > 40 agents. New behavior:
| Aspect | Old | New |
|---|---|---|
| Exploit trigger | Hard threshold (interactions < N) | Sigmoid probability: sigmoid((interactions - threshold) * 2) |
| Exploitation amount | Fixed -15.0 | Random uniform(-20, -5) |
| Post-threshold behavior | 100% exploit | 70% exploit, 30% still cooperate |
| Target selection | Only rep > 40 agents | Any non-frozen agent |
Each deceptive bot also gets its own RNG, so they behave independently rather than all flipping to exploit at the same step.
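A sketch of the stochastic exploit policy under these rules; how the sigmoid trigger composes with the 70/30 split is an assumption, and the function names are hypothetical:

```python
import math
import random

def deceptive_bot_action(interactions: int, threshold: int, rng: random.Random):
    """Sigmoid-gated exploitation: the more interactions past the trust
    threshold, the more likely the bot is to exploit, but even then it
    still cooperates 30% of the time."""
    p_exploit = 1 / (1 + math.exp(-(interactions - threshold) * 2))
    if rng.random() < p_exploit and rng.random() < 0.7:
        return ("exploit", rng.uniform(-20.0, -5.0))  # random bad transfer
    return ("cooperate", None)
```

Because each bot carries its own `random.Random` instance, two deceptive bots with identical histories can still diverge, which keeps them from flipping to exploitation in lockstep.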
500 Scenarios with More Variation (was #4 weakness)¶
The dataset expanded from 200 to 500 scenarios with new variation dimensions:
| Dimension | Range | Purpose |
|---|---|---|
| `initial_resources` | 50-150 | Some economies start rich, some poor |
| `circuit_breaker_threshold` | 5-20 | Some economies are forgiving, some harsh |
| `reputation_decay_rate` | 0.97-1.0 | Varying reputation half-lives |
| `task_bounty_multiplier` | 0.5-2.0 | Some economies reward tasks more |
Plus ~5% extreme scenarios (all-deceptive, all-honest, single-type, high-stakes, sparse) and ~10% phase shift scenarios where governance parameters change mid-simulation (e.g., tax doubles at epoch 3).
Social Action Rebalance (was #6 weakness)¶
| Action | Old Reward | New Reward |
|---|---|---|
| `post` | +0.5 reputation | +2.0 reputation |
| `reply` (to others) | +0.3 reputation | +1.5 reputation |
| Popular posts (3+ votes) | Nothing | One-time resource bonus (up to 5.0) |
| Social standing | No effect | Opportunistic bots more likely to accept trades from socially active agents |
New Reward Signals (was #2 and #7 weakness)¶
Two metrics that were previously logged-only now contribute to the training gradient:
- `interaction_quality_metric` (weight 0.2): Directly rewards higher-quality interactions, pushing past the 1.14 ceiling
- `action_diversity_metric` (weight 0.1): Shannon entropy of tool usage distribution — encourages exploring all 10 tools rather than spamming `propose_trade`
Reputation weight was reduced from 0.6 to 0.3 to keep the total reward range reasonable.
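The diversity metric can be sketched as normalized Shannon entropy over the tool-usage distribution; normalizing by the log of the tool count so the metric lands in [0, 1] is an assumption:

```python
import math
from collections import Counter

def action_diversity(tool_calls: list, num_tools: int = 10) -> float:
    """Shannon entropy of the empirical tool-usage distribution,
    normalized by the maximum entropy log(num_tools)."""
    if not tool_calls:
        return 0.0
    n = len(tool_calls)
    counts = Counter(tool_calls)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(num_tools)
```

Spamming a single tool scores 0, while spreading calls evenly across all 10 tools scores 1, so the metric directly penalizes the `propose_trade` spam seen in v0.1.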
Governance Tuning (was minor)¶
| Mechanism | Old | New |
|---|---|---|
| Tax | Flat rate | Progressive: rate * (1 + resources / avg_resources) — wealthy agents pay more |
| Reputation decay | Uniform 1%/step | Activity-based: idle agents (pass_turn) decay faster |
| Circuit breaker | Permanent freeze | Recoverable: frozen agents appeal after 3 steps, reputation resets to threshold + 5 |
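The progressive tax rule from the table can be sketched as follows; the 5% base rate is a placeholder, not the environment's actual value:

```python
def progressive_tax(resources: float, avg_resources: float,
                    base_rate: float = 0.05) -> float:
    """Tax owed under rate * (1 + resources / avg_resources):
    agents above the economy's average pay a higher effective rate."""
    effective_rate = base_rate * (1 + resources / avg_resources)
    return resources * effective_rate
```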
v0.2 Eval Results¶
We ran 10 rollouts with gpt-4.1-mini (pre-training baseline) to verify the environment works:
| Metric | v0.1 (step 0) | v0.2 (pre-training) |
|---|---|---|
| Reward | 0.808 | 0.920 |
| Payoff | 0.668 | 0.663 |
| Reputation | 0.467 | 0.503 |
| Interaction quality | 0.354 | 0.254 |
| Action diversity | N/A | 0.546 |
| Survival | 100% | 100% |
| Self-proposal errors | ~4/sim | 0 |
The higher baseline reward (0.92 vs 0.81) reflects the new weighted metrics contributing even without RL training. The environment is ready for the next training run.
Reproducing This¶
The environment is published on Prime Intellect Hub as swarm-ai-research/swarm-economy:
# Install
prime env install swarm-economy
# Evaluate (v0.2)
prime eval run swarm-economy --num-examples 10 --rollouts-per-example 1
# Train (v0.1 config)
prime rl run configs/rl/swarm-economy.toml
# Train (v0.2 config — medium difficulty, batch_size=64)
prime rl run configs/rl/swarm-economy-v2.toml
Takeaways¶
- Multi-agent economies are a rich RL environment. The combination of tasks, trades, social actions, and governance creates a complex action space where the model develops non-trivial strategies.
- Thinking models are more robust to RL fine-tuning. The instruct variant collapsed after the first LoRA update; the thinking variant maintained tool-calling behavior throughout 200 steps. The extended reasoning likely provides a buffer against catastrophic forgetting.
- Soft probabilistic labels create smooth reward landscapes. Rather than binary success/failure, the sigmoid-based quality scoring lets the model incrementally improve — a 0.86 quality submission is better than 0.70, even though both "succeed."
- Tool-calling RL works. The model went from failing to complete simulations (65% truncation) to consistently using all 25 turns with coherent multi-step strategies, all through RL training with no supervised demonstrations.
- Reward hacking shows up fast. By step 100, `propose_trade` dominated at 9.2 calls/step and the model tried to accept its own proposals. Clear system prompts and action diversity rewards are cheap mitigations.
- Environment iteration matters more than hyperparameter tuning. The v0.1 reward ceiling (1.14) was caused by environment design issues (gameable scoring, low diversity, missing reward signals), not training config. Fixing the environment directly raises the ceiling.
Disclaimer: This post uses financial market concepts as analogies for AI safety research. Nothing here constitutes financial advice, investment recommendations, or endorsement of any trading strategy.