Why Agent Infrastructure Could Be a $10B Category¶

The core thesis is simple: every serious software company will likely need an agent stack, not just a model API.

That stack includes:

internal AI agents that do useful work,
evaluation systems that detect regressions,
safety systems that bound risk,
orchestration systems that coordinate many specialists.

Today, many teams are still hand-assembling those layers.

The opportunity is a platform that makes this stack programmable, observable, and continuously improvable by default.

The category argument¶

If this thesis is right, "agent infrastructure" starts to look like earlier platform waves:

Databricks for data/ML pipelines,
Snowflake for cloud data infrastructure,
GitHub for developer workflows.

In each case, the winning products abstracted painful primitives and became the standard control plane. The equivalent control plane for production agent organizations is still early and unsettled.

The wild part: self-evolving organizations¶

When you combine:

automated eval harnesses,
multi-agent orchestration,
evolutionary improvement loops,
sandboxed governance and enforcement,

you move from "an agent that completes tasks" toward an AI organization that can redesign parts of itself.

That is close to digital evolution for software systems: selective pressure from tasks and metrics, adaptation through policy/tool changes, and governance constraints that define what adaptations are allowed.

The research stack this category needs¶

If this is going to become a real category (not just a narrative), we need a shared research stack with explicit layers:

Workload layer
Standardized task suites: coding, analysis, ops, and long-horizon coordination.
Difficulty tiers and holdout tasks to prevent benchmark overfitting.
Agent + orchestration layer
Reproducible multi-agent topologies (specialist, manager-worker, market-based).
Versioned prompts, tools, memory policies, and handoff contracts.
Evaluation layer
Capability metrics: task completion, latency, cost, and quality.
Robustness metrics: variance across seeds, model changes, and environment perturbations.
Safety + governance layer
Policy checks for data access, action bounds, and escalation controls.
Incident taxonomy: deception, policy bypass, unsafe tool use, cascading failure.
Evolution layer
Controlled improvement loops (proposal, sandbox test, gated promotion).
Counterfactual replay to compare "what changed" before rollout.
Evidence + reporting layer
Run artifacts (traces, configs, seeds, scores) with reproducible replays.
Publication templates that separate exploratory findings from confirmed results.

What would make the $10B claim solid¶

To move from plausible thesis to defensible category creation, evidence should be repeatable across teams and workloads on four fronts:

Economic delta: measurable productivity or margin gains versus manual agent stacks.
Reliability delta: lower failure rates and faster recovery under production-like workloads.
Safety delta: reduced harmful behaviors without collapsing useful capability.
Operational delta: faster iteration velocity with reproducible, auditable changes.

If a platform repeatedly demonstrates those deltas at enterprise scale, a $10B outcome stops sounding speculative and starts sounding like normal platform economics.