Merchant runtime for agentic commerce

Agentic commerce needs a merchant runtime. Not another shopping chatbot.

McKinsey estimates agentic commerce could orchestrate up to $1T in U.S. B2C retail revenue by 2030. That prize only materializes if merchants can safely execute agent intent against catalog, order, policy, fulfillment, and support systems.

$1T

U.S. B2C retail revenue potentially orchestrated by agentic commerce by 2030

92%

ASIN exact match on 500 seed-grouped WebShop test tasks

3ms

Compiled slot extraction latency with zero hot-path LLM calls

1,295×

Faster than LLM slot extraction in Findings v3

Runtime

LLMs learn offline. Deterministic software executes online.

Agents can express commercial intent; merchants must decide what may actually happen. milli.run compiles reasoning and execution possibilities into a special-purpose VM, so covered workflows run with software-grade latency, evidence, policy, and rollback controls.

System 2 · Offline

Learn, reason, compile.

LLMs and agents analyze failure traces, simulate edge cases, generate regression fixtures, discover reusable workflows, and propose merchant-specific skills. This layer is powerful and probabilistic, but not on the covered production path.

System 1 · Online

Verify, route, execute.

Live interactions run through deterministic components: intent routing, structured extraction, catalog repair, evidence verification, typed merchant tools, telemetry, confidence gates, and rollback.

Latency: <10ms total covered-path product-intent resolution.

Cost: Zero hot-path LLM token cost for compiled workflows.

Regression: New skills ship with historical and synthetic fixtures.

Fallback: Novel or low-confidence paths logged for automated self-improvement.
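
A minimal sketch of how a confidence gate might sit in front of the compiled path. Every name here (`Resolution`, `Runtime`, `route`, `CONFIDENCE_FLOOR`, the 0.85 threshold) is hypothetical and is not milli.run's API; the point is only that the online decision can be a plain comparison with no model call, and that rejected interactions leave a trace for the offline loop.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical threshold; a real deployment would tune this per workflow.
CONFIDENCE_FLOOR = 0.85


@dataclass
class Resolution:
    intent: Optional[str]   # routed intent, or None if no router matched
    slots: dict             # structured extraction output
    confidence: float       # score in [0, 1] produced upstream


@dataclass
class Runtime:
    fallback_log: list = field(default_factory=list)

    def route(self, utterance: str, res: Resolution) -> str:
        """Return 'execute' for covered, confident paths; otherwise log and escalate."""
        if res.intent is not None and res.confidence >= CONFIDENCE_FLOOR:
            return "execute"
        # Novel or low-confidence: keep the trace for offline analysis.
        self.fallback_log.append({"utterance": utterance, "resolution": res})
        return "escalate"


runtime = Runtime()
print(runtime.route("where is my order 1042",
                    Resolution("order_status", {"order_id": "1042"}, 0.97)))  # execute
print(runtime.route("can my cat wear this", Resolution(None, {}, 0.12)))      # escalate
print(len(runtime.fallback_log))                                              # 1
```

The gate deliberately fails closed: an unmatched intent escalates even if the upstream score is high.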

Runtime claim: the merchant runtime is the safety boundary between open-ended agents and operational systems of record. It is not a chatbot layer; it is the execution layer that constrains what agents can do.

Proof · Findings v3 · 2026-06-04

Compiled execution matches oracle accuracy at 3ms.

On 500 WebShop test tasks using seed-grouped splitting, the compiled pipeline matches oracle accuracy at 3ms per query, with no LLM cost on the hot path. That is 1,295× faster than an LLM extractor and approximately 5,000× faster than production-grade vector RAG, which was measured on a 20-task sample.

System	Execution model	ASIN match	Latency	Hot-path LLM	Cost at 1M queries/day
golden oracle	Pre-labeled ground-truth slots; upper bound for the shared ranker	92%	0ms	No	N/A
spacy+repair	Trained spaCy NER plus deterministic repair, then shared `search_products()` ranker	92%	3ms	No	~$0/day
llm extractor	Ollama `gemma4:12b` extracts slots from instruction text, then feeds the same ranker	90%	4,101ms	Yes	~$2,000/day
vector RAG*	`nomic-embed-text` retrieves top-12 product cards; `gemma4:12b` selects the candidate	~95%*	~15,800ms*	Yes	~$8,000/day
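
The headline ratios can be sanity-checked against the table with a few lines of arithmetic. The variable names are illustrative, and the note about rounding is an inference from the published numbers, not a statement from the benchmark itself.

```python
# Figures copied from the table above.
compiled_ms = 3          # displayed (likely rounded) latency
llm_ms = 4_101
rag_ms = 15_800          # approximate, from the 20-task sample
queries_per_day = 1_000_000

# Ratio using the displayed 3ms figure.
print(round(llm_ms / compiled_ms))        # 1367
# The published 1,295x implies an unrounded compiled latency near 3.17ms.
print(round(llm_ms / 1_295, 2))           # 3.17
# Vector RAG versus the compiled path, using displayed figures.
print(round(rag_ms / compiled_ms))        # 5267

# Implied per-query cost in dollars at 1M queries/day.
llm_cost_per_query = 2_000 / queries_per_day
rag_cost_per_query = 8_000 / queries_per_day
print(llm_cost_per_query)                 # 0.002
print(rag_cost_per_query)                 # 0.008
```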

Accuracy

The compiled path ties the oracle ceiling.

The 92% ceiling reflects 42 tasks where no extractor can find the correct ASIN because the products are ambiguous or missing from the catalog's top results. The compiled path misses none that the oracle hits. The LLM extractor loses extra tasks to catalog-token normalization, such as extracting navy blue instead of 12-navy.
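
A minimal sketch of what deterministic repair toward catalog tokens could look like. The source does not describe the actual repair algorithm, so the function `repair_slot`, the Jaccard scoring, and the option list are all assumptions chosen for illustration. The idea is to snap a free-text value such as navy blue onto one of the catalog's own option strings, and to refuse to guess when nothing matches.

```python
import re
from typing import Optional


def tokens(text: str) -> set:
    """Lowercase and split on anything that is not a letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def repair_slot(extracted: str, catalog_options: list) -> Optional[str]:
    """Map a free-text slot value onto one of the catalog's own option strings.

    Scores each option by Jaccard overlap of tokens and returns the best one,
    or None when nothing overlaps or the top score is tied.
    """
    want = tokens(extracted)
    scored = []
    for option in catalog_options:
        have = tokens(option)
        union = want | have
        score = len(want & have) / len(union) if union else 0.0
        scored.append((score, option))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if not scored or scored[0][0] == 0.0:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # ambiguous: leave it to the fallback path
    return scored[0][1]


options = ["12-navy", "12-black", "14-red"]  # hypothetical variant tokens
print(repair_slot("navy blue", options))   # 12-navy
print(repair_slot("green", options))       # None
```

Returning `None` on a tie or a miss keeps the repair step from silently picking the wrong variant.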

Economics

RAG can approach accuracy, but not execution economics.

Dense retrieval plus LLM selection is a fair production baseline. It can match product-family relevance, but pays embedding and full LLM generation cost on every query. The remaining failure mode is variant disambiguation when many color/size ASINs share the same product family.

Methodology: benchmark script `eval/compare_extractors.py`; 500 tasks from the WebShop test split; a seed-grouped train/test split prevents paraphrase leakage. Every system feeds the same `search_products()` ranker; only slot extraction changes. *Vector RAG was measured on a 20-task sample; the full 500-task run was skipped because, at roughly 16 seconds per query, it would take approximately two hours.
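
A minimal sketch of a seed-grouped split, assuming each task carries an identifier for the seed it was paraphrased from. The field name `seed_id`, the hash-based assignment, and the 20% test fraction are assumptions for illustration and are not taken from the benchmark script. The property that matters is that every paraphrase of one seed lands on the same side.

```python
import hashlib


def side_for_seed(seed_id: str, test_fraction: float = 0.2) -> str:
    """Assign a whole seed group to 'train' or 'test' using a stable hash."""
    digest = hashlib.sha256(seed_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def seed_grouped_split(tasks: list, test_fraction: float = 0.2):
    """Split tasks so that no seed appears on both sides."""
    train, test = [], []
    for task in tasks:
        side = side_for_seed(task["seed_id"], test_fraction)
        (test if side == "test" else train).append(task)
    return train, test


tasks = [
    {"seed_id": "s1", "text": "navy running shoes size 12"},
    {"seed_id": "s1", "text": "size 12 running shoes in navy"},
    {"seed_id": "s2", "text": "red kettle under 30 dollars"},
    {"seed_id": "s3", "text": "black phone case"},
]
train, test = seed_grouped_split(tasks)
train_seeds = {t["seed_id"] for t in train}
test_seeds = {t["seed_id"] for t in test}
print(train_seeds.isdisjoint(test_seeds))  # True
```

Hashing the seed rather than shuffling rows makes the assignment reproducible across runs and stable as new paraphrases are added.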

Deployment

Start in support. Expand into agentic commerce execution.

Support is the wedge because it already contains repeated commercial intent: "Where is my order?", "Can I return this?", "Which variant fits?", "Can I exchange it?", "Is this eligible?", "What can you recommend?" These workflows can be compiled, verified, routed, and expanded.

Phase 1

Ingest

Sync tickets, macros, policies, product catalog, order states, returns, and fulfillment data. Establish baseline coverage and evidence requirements.

Phase 2

Compile

Train routers and extractors, compile covered workflows, generate regression fixtures, and run in shadow mode against real support sessions.
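
A minimal sketch of a regression gate built on fixtures, assuming a skill is a function from text to a structured result. The names `replay_fixtures` and `order_status_skill`, and the fixture shape, are hypothetical. The toy skill stands in for a compiled workflow only so the example runs end to end.

```python
from typing import Callable


def replay_fixtures(skill: Callable[[str], dict], fixtures: list) -> list:
    """Run every fixture through the skill and return the ones that regress."""
    failures = []
    for fixture in fixtures:
        actual = skill(fixture["input"])
        if actual != fixture["expected"]:
            failures.append({
                "input": fixture["input"],
                "expected": fixture["expected"],
                "actual": actual,
            })
    return failures


def order_status_skill(text: str) -> dict:
    """Toy stand-in for a compiled skill: take the first run of digits as an order id."""
    digits = "".join(ch if ch.isdigit() else " " for ch in text).split()
    return {"intent": "order_status", "order_id": digits[0] if digits else None}


fixtures = [
    {"input": "where is order 1042", "expected": {"intent": "order_status", "order_id": "1042"}},
    {"input": "status of #77 please", "expected": {"intent": "order_status", "order_id": "77"}},
    {"input": "where is my order", "expected": {"intent": "order_status", "order_id": None}},
]

failures = replay_fixtures(order_status_skill, fixtures)
print("ship" if not failures else f"blocked: {len(failures)} regressions")  # ship
```

The same replay loop could serve shadow mode by swapping recorded fixtures for live sessions and logging disagreements instead of blocking a release.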

Phase 3

Route

Route covered traffic through deterministic execution paths. Escalate low-confidence interactions and feed fallback traces back into the compiler loop.

Why this is the right wedge: Support workflows surface repeated buying intent before brands are ready to trust fully agentic checkout. The founding team has seen what happens when insight never reaches execution, in production across AI retail infrastructure, IBM enterprise customer-service AI, demand generation at a top-five CPG company, and outsourced ecommerce support for a major merchant. The lesson is consistent: analytics produce ROI only when they translate into deterministic execution, auditable evidence, enforceable policy, observability, and rollback.

Architecture thesis

Turn agentic commerce into verified merchant action.

Talk to us