McKinsey estimates agentic commerce could orchestrate up to $1T in U.S. B2C retail revenue by 2030. That prize only materializes if merchants can safely execute agent intent against catalog, order, policy, fulfillment, and support systems.
Agents can express commercial intent; merchants must decide what may actually happen. milli.run compiles reasoning and execution possibilities into a special-purpose VM, so covered workflows run with software-grade latency, evidence, policy, and rollback controls.
LLMs and agents analyze failure traces, simulate edge cases, generate regression fixtures, discover reusable workflows, and propose merchant-specific skills. This layer is powerful and probabilistic, but not on the covered production path.
Live interactions run through deterministic components: intent routing, structured extraction, catalog repair, evidence verification, typed merchant tools, telemetry, confidence gates, and rollback.
On 500 WebShop test tasks using seed-grouped splitting, the compiled pipeline matches oracle accuracy at 3ms per query, with no LLM cost on the hot path. That is 1,295× faster than an LLM extractor and approximately 5,000× faster than production-grade vector RAG (the vector RAG figures come from a 20-task sample).
| System | Execution model | ASIN match | Latency | Hot-path LLM | Cost at 1M queries/day |
|---|---|---|---|---|---|
| golden oracle | Pre-labeled ground-truth slots; upper bound for the shared ranker | 92% | 0ms | No | N/A |
| spacy+repair | Trained spaCy NER plus deterministic repair, then shared search_products() ranker | 92% | 3ms | No | ~$0/day |
| llm extractor | Ollama gemma4:12b extracts slots from instruction text, then feeds the same ranker | 90% | 4,101ms | Yes | ~$2,000/day |
| vector RAG* | nomic-embed-text retrieves top-12 product cards; gemma4:12b selects the candidate | ~95%* | ~15,800ms* | Yes | ~$8,000/day |
The 92% ceiling reflects 42 tasks where no extractor can find the correct ASIN because the products are ambiguous or missing from the catalog's top results. The compiled path misses none that the oracle hits. The LLM extractor loses extra tasks to catalog-token normalization, such as extracting navy blue instead of 12-navy.
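The catalog-token normalization problem can be illustrated with a small sketch. This is not the milli.run repair step itself, only a minimal token-overlap matcher under stated assumptions: the function names `tokens` and `repair_slot` and the sample option list are hypothetical, and a real repair stage would likely use richer signals than Jaccard overlap.

```python
import re
from typing import List, Optional, Set


def tokens(text: str) -> Set[str]:
    """Lowercase and split on non-alphanumerics: '12-navy' -> {'12', 'navy'}."""
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}


def repair_slot(extracted: str, catalog_options: List[str]) -> Optional[str]:
    """Map a free-text slot value onto one of the catalog's option tokens.

    Returns the option with the highest Jaccard token overlap, or None when
    nothing overlaps, so the caller can escalate instead of guessing.
    Ties resolve to the earliest option, which keeps the result deterministic.
    """
    want = tokens(extracted)
    best, best_score = None, 0.0
    for option in catalog_options:
        have = tokens(option)
        union = want | have
        score = len(want & have) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = option, score
    return best


# Hypothetical option tokens for one product family.
options = ["12-navy", "14-black", "16-forest green"]
print(repair_slot("navy blue", options))  # prints 12-navy
```

Returning `None` rather than a best guess matters here: a value with no catalog match is a signal to fall back, not something to pass on to the ranker.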
Dense retrieval plus LLM selection is a fair production baseline. It can match product-family relevance, but pays embedding and full LLM generation cost on every query. The remaining failure mode is variant disambiguation when many color/size ASINs share the same product family.
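The evaluation relies on a seed-grouped train/test split to keep paraphrases of the same instruction from landing on both sides. A minimal sketch of that idea follows. It assumes each task carries a `seed_id` field naming the instruction it was paraphrased from; that field name, the 20% test fraction, and both function names are illustrative assumptions, not details taken from eval/compare_extractors.py.

```python
import hashlib
from typing import Dict, List, Tuple


def split_for(seed_id: str, test_fraction: float = 0.2) -> str:
    """Assign a whole seed group to 'train' or 'test' by hashing its id.

    Hashing makes the assignment stable across runs and independent of
    task order, so every paraphrase of one seed gets the same answer.
    """
    digest = hashlib.sha256(seed_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


def seed_grouped_split(
    tasks: List[Dict[str, str]], test_fraction: float = 0.2
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split tasks so that no seed_id appears in both train and test."""
    train, test = [], []
    for task in tasks:
        side = split_for(task["seed_id"], test_fraction)
        (test if side == "test" else train).append(task)
    return train, test


# Hypothetical tasks: three paraphrases for each of 50 seeds.
tasks = [
    {"seed_id": f"seed-{s}", "instruction": f"paraphrase {p} of seed {s}"}
    for s in range(50)
    for p in range(3)
]
train, test = seed_grouped_split(tasks)
overlap = {t["seed_id"] for t in train} & {t["seed_id"] for t in test}
print(len(train) + len(test), len(overlap))
```

A plain random split over tasks would scatter paraphrases across both sides and inflate accuracy for any trained extractor, which is the leakage the grouping is meant to prevent.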
eval/compare_extractors.py; 500 tasks from the WebShop test split; seed-grouped train/test split prevents paraphrase leakage. Every system feeds the same search_products() ranker; only slot extraction changes. *Vector RAG was measured on a 20-task sample; the full 500-task run was skipped because, at roughly 16 seconds per query, it would take approximately two hours.

Support is the wedge because it already contains repeated commercial intent: where is my order, can I return this, which variant fits, can I exchange it, is this eligible, what can you recommend. These workflows can be compiled, verified, routed, and expanded.
Sync tickets, macros, policies, product catalog, order states, returns, and fulfillment data. Establish baseline coverage and evidence requirements.
Train routers and extractors, compile covered workflows, generate regression fixtures, and run in shadow mode against real support sessions.
Route covered traffic through deterministic execution paths. Escalate low-confidence interactions and feed fallback traces back into the compiler loop.
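The routing step above can be sketched as a confidence gate: covered, high-confidence interactions execute deterministically, and everything else escalates and leaves a trace for the compiler loop. The class names, the `0.9` threshold, and the trace fields below are assumptions for illustration, not milli.run's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Decision:
    """Output of a (hypothetical) intent router for one interaction."""
    intent: str
    confidence: float


@dataclass
class Router:
    covered_intents: Set[str]
    threshold: float = 0.9  # assumed gate value; tune per merchant
    fallback_traces: List[Dict[str, object]] = field(default_factory=list)

    def route(self, session_id: str, decision: Decision) -> str:
        """Return 'execute' for covered, confident traffic, else 'escalate'."""
        covered = decision.intent in self.covered_intents
        if covered and decision.confidence >= self.threshold:
            return "execute"
        # Record why the interaction fell back so it can be replayed offline.
        self.fallback_traces.append({
            "session_id": session_id,
            "intent": decision.intent,
            "confidence": decision.confidence,
            "reason": "low_confidence" if covered else "uncovered_intent",
        })
        return "escalate"


router = Router(covered_intents={"order_status", "return_eligibility"})
print(router.route("s1", Decision("order_status", 0.97)))  # prints execute
print(router.route("s2", Decision("order_status", 0.41)))  # prints escalate
print(router.route("s3", Decision("price_match", 0.99)))   # prints escalate
```

Separating the two fallback reasons is a deliberate choice in this sketch: low-confidence traces suggest retraining a router or extractor, while uncovered intents suggest a new workflow to compile.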
Why this is the right wedge: Support workflows surface repeated buying intent before brands are ready to trust fully agentic checkout. The founding team has seen this failure mode in production across AI retail infrastructure, IBM enterprise customer-service AI, demand generation at a top-five CPG company, and outsourced ecommerce support for a major merchant. The lesson is consistent: analytics only produce ROI when they translate into deterministic execution, auditable evidence, enforceable policy, observability, and rollback.