
Building to Learn: What Three Weeks of Prototyping Taught Us About Agentic Marketing Optimization


Contents

Framing

Part I: The Hypothesis

Part II: What We Built (and How It Evolved)

Part III: Discovery Through Building

Part IV: The Deeper Problem We Uncovered

Part V: What We Can Actually Deliver

Part VI: Practical Guidance

Part VII: Where This Goes Next

Conclusion

Appendices


Why We Built This

The market opportunity in agentic commerce is real. The measurement problem is also real. We wanted to understand the gap between those two realities.

So we built an experimental lab as a sandbox for testing hypotheses about AI discoverability optimization. The goal was to learn through building.

Three weeks, start to finish.

When you’re building fast, you hit reality quickly. Assumptions that seemed reasonable in theory broke against actual implementation. Features we thought were essential turned out to be distractions. Problems we thought we’d solved revealed deeper structural challenges.

This paper documents what we learned from three weeks of rapid prototyping where every build cycle surfaced new questions.

The Experimental Approach

We structured this as a research build with:

Starting hypothesis: If LLMs can infer human intent from context, and if AI shopping assistants use LLMs to match products to intent, then we should be able to build a system that scores product-intent alignment and optimizes for it.

Experimental method: Build the minimum viable lab that could test this hypothesis. Run real scenarios. Then run experiments. Validate. Observe where the theory breaks.

Timeline: Three weeks of iterative development with aggressive scope management.

Outcome: We learned a lot and ended up with an app as a lab for testing agentic commerce product discoverability optimization.


Part I: The Hypothesis We Wanted to Test

The Signals That Made This Worth Testing

The agentic commerce shift is visible across multiple signals:

  • Major assistants now operate at very large consumer scale.1
  • Survey data indicates younger users are often earlier adopters of AI-led search and discovery behaviors.2
  • Third-party traffic monitoring shows rapid growth in AI-agent referrals to commerce sites.3
  • Industry forecasts vary, but commonly model AI-mediated commerce impact in the hundreds of billions to trillions over time.4
  • Consumer research shows AI-assisted purchase behavior is already measurable, not hypothetical.3

If AI assistants are becoming a major commerce channel, brands need to understand how to optimize for that channel.

The question is valid. Whether we can answer it with current tools is less clear.

From Keywords to Intent Narratives

Traditional search: “running shoes”

LLM shopping: “I need shoes for marathon training, I pronate slightly, had plantar fasciitis last year, prefer cushioning over ground feel, budget is $150-200, and I care more about injury prevention than race times.”

This isn’t keyword matching. It’s constraint satisfaction with implicit priorities and trade-offs. LLMs handle this well in theory.

The question: can we predict what LLMs will recommend, and optimize product content to improve those recommendations?

Context-Conditioned Intent Activation (CCIA)

Our working hypothesis:

LLMs can reliably infer human intent when given sufficient structured context about the user’s situation, constraints, and goals.

Research supports this direction. Recent NLP and HCI work shows LLM intent inference improves when context is explicit, structured, and constraint-rich.5

The difference between “this person visited our site” and “this marathon runner recovering from plantar fasciitis visited after comparing three stability shoe reviews” is the difference between speculation and precision.

The Logical Leap We Wanted to Test

If LLMs infer intent + AI shopping uses LLMs → we can:

  1. Simulate user intents for a product category
  2. Score how well product descriptions align with those intents
  3. Optimize descriptions to improve alignment
  4. Verify that optimization improves real AI shopping performance

What we learned: Steps 1-3 are buildable. Step 4 is where everything gets complicated.
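Steps 1-3 can be sketched as a minimal pipeline. Everything here is illustrative: the function names, the `Intent` shape, and the keyword-overlap heuristic are stand-ins for the lab's LLM-driven components, not its actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Intent:
    goal: str
    constraints: list[str]

def simulate_intents(category: str) -> list[Intent]:
    # Step 1: stand-in for LLM-driven intent simulation for a category
    return [Intent("marathon training", ["cushioning", "budget $150-200"])]

def score_alignment(description: str, intent: Intent) -> float:
    # Step 2: toy heuristic -- fraction of constraints mentioned in the copy
    hits = sum(c.split()[0].lower() in description.lower() for c in intent.constraints)
    return hits / len(intent.constraints)

def optimize(description: str, intent: Intent) -> str:
    # Step 3: append missing constraint language (the lab uses LLM rewrites instead)
    missing = [c for c in intent.constraints
               if c.split()[0].lower() not in description.lower()]
    return description + " " + ". ".join(f"Offers {c}" for c in missing)

intents = simulate_intents("running shoes")
copy = "Lightweight trainer for long runs."
print(score_alignment(copy, intents[0]))                         # low before optimization
print(score_alignment(optimize(copy, intents[0]), intents[0]))   # higher after
```

Step 4 has no code here on purpose: verifying the optimized copy against real AI shopping surfaces is exactly the part that turned out not to be buildable from the outside.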


Part II: What We Built (and How It Evolved)

Initial Design vs. Final Architecture

Week 1 plan:

  • Simple chat interface for intent exploration
  • Alignment scorer with transparent signals
  • Basic optimization suggestions

Week 3 reality:

  • Chat with conversation memory and context building
  • Multi-signal alignment scoring with evidence decomposition
  • Simulation runner with optimization and retest workflows
  • Full experiment framework with query battery generation
  • Multi-tenant workflow scoping across client/brand/product contexts
  • Validation layer distinguishing synthetic vs. observed signals
  • Learning loop with scoped Bayesian-style belief updates, memory distillation/retrieval, and calibration tracking
  • Operational loop maintenance (manual + scheduled) for calibration refresh and memory upkeep
  • Admin onboarding with canonical intent specification and model-gateway controls (BYOK chat/validation models)

We built significantly more than we planned - but also discovered we were solving a different problem than we thought.

The Pivots We Didn’t Expect

Pivot 1: From black-box scores to transparent signals

Initial plan: Use LLM embeddings for alignment scoring.

Reality: Embeddings are opaque. “Your score is 0.42” isn’t actionable. We pivoted to signal-based scoring where recommendations are explainable: “You lost on benefit articulation and use-case specificity.”
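The difference is easiest to see in code. A hypothetical signal decomposition (the signal names and weights below are examples, not the lab's exact taxonomy) yields both a score and an explanation of where it was lost:

```python
# Illustrative per-signal scores for one product/intent pair (assumed values)
signals = {
    "category_match": 1.0,
    "benefit_articulation": 0.3,   # weak: benefits implied, not stated
    "use_case_specificity": 0.2,   # weak: no concrete scenarios
    "spec_completeness": 0.8,
}
weights = {"category_match": 0.4, "benefit_articulation": 0.25,
           "use_case_specificity": 0.2, "spec_completeness": 0.15}

score = sum(weights[s] * v for s, v in signals.items())
gaps = [s for s, v in signals.items() if v < 0.5]   # the actionable part
print(f"score={score:.2f}, lost on: {', '.join(gaps)}")
```

An embedding similarity of 0.42 carries none of that structure; the decomposed version tells a copywriter what to fix.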

Pivot 2: From product-driven to intent-driven queries

Initial plan: Generate test queries from product features.

Reality: This created brand-contaminated, over-specific queries that didn’t represent real user intents. We built three query generation modes (top-down, bottom-up, hybrid) with quality gating and category confidence checks.

Pivot 3: From single scores to multi-judge evaluation

Initial plan: Trust one LLM’s judgment.

Reality: Different models behave differently. We added multi-judge evaluation to identify where results diverge, because that divergence is itself valuable intelligence.

Pivot 4: From lab metrics to validation layers

Initial plan: Assume lab scores can predict real rankings.

Reality: That assumption is unproven. We built separate validation flows for synthetic signals (LLM judges) vs. observed reality (actual shopping surface behavior), with calibration tracking to measure prediction accuracy over time.

Pivot 5: From immediate insights to confidence gating

Initial plan: Generate strategic recommendations from lab results.

Reality: Lab-only recommendations risk false precision. We added soft-locking: pattern insights remain locked until validated against external observations with defined accuracy thresholds.

Each pivot revealed an assumption we’d been making. Building forced us to confront those assumptions.

What the Lab Actually Does

Final architecture:

Intent Inference Engine

  • Parses queries to extract goals, needs, constraints
  • Builds structured intent representations
  • Identifies gaps between explicit statements and likely meaning

Alignment Scoring System

  • Evaluates product-intent fit using transparent signals
  • Hard-gates category mismatches before soft-scoring
  • Generates explainable evidence for wins and gaps
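The hard-gate-then-soft-score flow can be sketched as follows (a simplified illustration under assumed signal names, not the lab's code):

```python
def evaluate(product_category: str, intent_category: str, soft_signals: dict) -> dict:
    # Hard gate: a category mismatch fails immediately; no soft score is computed
    if product_category != intent_category:
        return {"passed": False, "reason": "category_mismatch", "score": 0.0}
    # Soft scoring runs only for category-matched candidates
    score = sum(soft_signals.values()) / len(soft_signals)
    return {"passed": True, "score": score,
            "evidence": sorted(soft_signals, key=soft_signals.get)}  # weakest first

print(evaluate("laptops", "running_shoes", {}))          # gated out, no scoring
print(evaluate("running_shoes", "running_shoes",
               {"benefits": 0.7, "use_cases": 0.4}))     # scored, with evidence order
```

Ordering the gate before scoring is what prevents a well-written laptop description from ever outscoring a mediocre shoe description on a shoe intent.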

Simulation Runner

  • Tests scenarios against query batteries
  • Identifies content gaps
  • Generates and tests optimized copy variants

Experiment Framework

  • Supports three query generation modes with quality gates
  • Runs controlled variant tests across copy options
  • Tracks win rates and consensus metrics

Validation Layer

  • Distinguishes synthetic (LLM judge) from observed (real performance) signals
  • Logs external validation data
  • Tracks prediction accuracy for calibration
  • Centralizes validation workflows outside the experiment design/run flow

Learning Loop

  • Updates scoped beliefs (client_id, brand_id, product_id) with weighted evidence from synthetic + observed signals
  • Distills high-confidence/high-support artifacts to reusable memory and retrieves them for future generation
  • Logs auditable decision events and uncertainty metadata
  • Gates low-quality or contradictory artifacts from reuse
  • Runs maintenance cycles for calibration refresh and memory distillation
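The scoped, weighted belief update can be illustrated with a minimal sketch. The blend function and the specific weights are assumptions for illustration; the key idea from the lab is that observed signals carry more weight than synthetic ones and that beliefs are keyed per scope.

```python
def update_belief(prior: float, evidence: float, weight: float) -> float:
    """Bayesian-style weighted blend of prior belief and new evidence."""
    return (1 - weight) * prior + weight * evidence

# Belief store scoped by (client_id, brand_id, product_id) -- illustrative keys
beliefs = {("c1", "b1", "p1"): 0.5}
key = ("c1", "b1", "p1")

beliefs[key] = update_belief(beliefs[key], evidence=0.9, weight=0.2)  # synthetic judge: low weight
beliefs[key] = update_belief(beliefs[key], evidence=0.8, weight=0.6)  # observed signal: high weight
print(round(beliefs[key], 3))
```

Scoping the key prevents cross-client contamination: evidence about one brand's products never moves beliefs for another tenant.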

This isn’t a simple tool. It’s a research environment for probing the gap between simulated and real AI shopping behavior.


Part III: Discovery Through Building

Discovery 1: Our Assumptions About LLM-as-Judge

What we assumed: If a single-model alignment scorer says Variant A is better aligned with intent, real shopping systems will also prefer Variant A.

What we discovered: This assumption is unproven and possibly wrong.

LLM-as-judge research shows weak or unstable correlation with downstream metrics. LLM judges exhibit style biases, verbosity preferences, and pattern sensitivities that may not reflect production ranking logic.

Production shopping systems use LLMs as one component among many:

  • Behavioral signals (clicks, conversions)
  • Merchant trust scores
  • Schema and feed health
  • Price competitiveness
  • Review sentiment
  • Shipping and return policies

Single-judge lab evaluation collapses this complexity into a narrow proxy. Even with multi-judge consensus, this is useful for semantic clarity, not a direct ranking predictor.

How we adapted: We now explicitly label lab scores as “semantic clarity metrics” rather than “ranking predictions.” We’re building validation infrastructure to measure prediction accuracy against observed behavior.

Why this matters: The market wants certainty. We can offer directional signals and improved content quality. Promising more than that would be dishonest.

Discovery 2: The Scoring Stability Problem

What we assumed: Alignment scores are stable enough to measure lift.

What we discovered: LLM scores are noisier than we expected.

The same text can score differently based on:

  • Minor prompt variations
  • Different evaluation sequences
  • Surrounding context
  • Temperature settings
  • Model updates

Reporting “+69% improvement” implies precision we don’t have. The numbers look authoritative but rest on unstable foundations.

How we adapted: We shifted from absolute scores to pairwise comparisons. “Variant A won 70% of head-to-head tests” is more stable than “Variant A scored 0.71 vs 0.52.”

We also added win-rate metrics, conservative confidence-interval reporting, and multi-judge consensus tracking.
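One way to implement conservative reporting over pairwise results is a win rate with a Wilson score interval, which is wider and more honest than a raw percentage at small sample sizes. This is a standard statistical construction, shown here as a sketch rather than the lab's exact readout:

```python
import math

def win_rate_with_wilson(wins: int, trials: int, z: float = 1.96) -> tuple:
    """Win rate plus a 95% Wilson score interval."""
    p = wins / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return p, center - margin, center + margin

# Variant A beat variant B in 70 of 100 head-to-head judge comparisons
p, lo, hi = win_rate_with_wilson(70, 100)
print(f"win rate {p:.0%}, 95% CI [{lo:.0%}, {hi:.0%}]")
```

Reporting "70% win rate, CI roughly 60-78%" communicates the same result as "+69% improvement" with far less false precision.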

Why this matters: Precise percentages create false confidence. Relative comparisons and directional signals are more honest about measurement uncertainty.

Discovery 3: The Verification Gap Is Structural

What we assumed: We’d be able to verify lab predictions against real AI shopping performance.

What we discovered: As of February 8, 2026, platform reporting makes clean verification structurally limited.6

Google AI Mode / Gemini Shopping (current state, February 2026):

  • No dedicated, standalone analytics API specifically for AI‑generated shopping impressions or clicks.
  • Merchant Center public reporting documentation exposes performance data views, but not a clearly separated “AI Mode” performance breakout.
  • Verification relies on a mix of Search Console/Merchant Center metrics and manual observation of AI experiences, rather than a fully isolated AI‑Mode reporting view.

OpenAI Shopping (current state, February 2026):

  • No dedicated analytics API that reports impressions or clicks for ChatGPT Shopping experiences.
  • OpenAI platform telemetry focuses on API usage and operations, so shopping-presence measurement has to be implemented in the client’s own analytics stack.
  • No official, shopping‑specific performance reporting is available; measurement relies on OpenAI usage data plus client’s downstream site and conversion analytics.

Perplexity (current state, February 2026):

  • Merchant program provides a dashboard with high‑level, aggregate performance insights rather than detailed logs.
  • Reporting offers useful directional visibility but does not function as a full‑fidelity, ad‑platform style analytics suite.
  • No granular, query‑level attribution is currently exposed; insights are summarized across many queries and users.

How we adapted: We built separate validation flows:

  • Synthetic validation (LLM judges as proxy signals)
  • Observed validation (manual logging of real behavior)
  • Calibration tracking (measuring synthetic vs. observed agreement)

This isn’t the closed-loop feedback we wanted. It’s sampling-based hypothesis testing.7

Why this matters: Any verification is inherently incomplete. We’re correlating lab predictions with observed presence, not measuring causal impact.
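Calibration tracking in this setup reduces to a simple idea: pair each lab prediction with a later manual observation and track the agreement rate over time. The record shape and values below are illustrative:

```python
# Each record pairs a lab prediction with a later real-world observation (toy data)
records = [
    {"predicted_winner": "A", "observed_present": "A"},
    {"predicted_winner": "A", "observed_present": "B"},
    {"predicted_winner": "B", "observed_present": "B"},
    {"predicted_winner": "A", "observed_present": "A"},
]

agreements = sum(r["predicted_winner"] == r["observed_present"] for r in records)
calibration = agreements / len(records)
print(f"synthetic/observed agreement: {calibration:.0%}")  # tracked over time, not a one-off
```

A falling agreement rate is a signal that the synthetic judges have drifted away from production behavior and the lab's beliefs need re-weighting.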

Discovery 4: Context Quality Breaks Everything

What we assumed: LLMs gracefully handle noisy or incomplete context.

What we discovered: Bad context doesn’t just degrade output quality; it produces confident nonsense.

This hit us hardest in query generation. When building the bottom-up mode (inferring likely user intent and relevant query from product category and features), we discovered:

  • Sparse or noisy seed context → malformed queries
  • Typos or ambiguous terminology → over-specific outputs
  • Missing category signals → brand-contaminated queries

The LLM didn’t flag uncertainty. It generated garbage confidently.

How we adapted:

  • Added category confidence gating for bottom-up generation
  • Added canonical-spec normalization and stricter input validation rules
  • Implemented quality validation before persistence
  • Added retry logic with stricter filtering
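A quality gate of this kind can be sketched as a predicate run before persistence. The brand lexicon, word-count bounds, and confidence threshold below are hypothetical values for illustration:

```python
BRAND_TERMS = {"acme", "acmerun"}          # hypothetical brand lexicon for the category
MIN_WORDS, MAX_WORDS = 4, 40               # assumed bounds on a plausible user query

def passes_quality_gate(query: str, category_confidence: float) -> bool:
    """Reject queries that are malformed, brand-contaminated, or low-confidence."""
    words = query.lower().split()
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):
        return False                        # malformed: too short or rambling
    if any(w in BRAND_TERMS for w in words):
        return False                        # brand-contaminated: not a neutral user intent
    return category_confidence >= 0.7      # gate on category confidence

print(passes_quality_gate("best acme shoes for marathons", 0.9))          # brand term -> rejected
print(passes_quality_gate("cushioned shoes for marathon training", 0.9))  # accepted
```

Failed queries trigger regeneration with stricter prompting rather than being persisted, which keeps garbage out of the experiment batteries.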

Why this matters: LLMs aren’t magic boxes that handle messiness. They’re components requiring clean inputs and validated outputs. Budget significant engineering for input normalization; the LLM call is the easy part.

Discovery 5: The Scientific Theater Trap

What we assumed: Precise metrics (p-values, confidence intervals, effect sizes) make our findings more credible.

What we discovered: Precision creates false authority when underlying measurements are uncertain.

Our experiment flow can generate statistics like:

  • Alignment score: 0.71
  • Confidence interval: [+40%, +80%]

This looks rigorous. Under the hood:

  • Scores are noisy and model-dependent
  • Distributions aren’t well-understood
  • Validation against real outcomes is incomplete

Reporting statistical significance implies authority we can’t justify with synthetic, LLM-generated metrics.

How we adapted:

  • Primary readouts now prioritize win rates, robust win rates, and average deltas
  • Statistical diagnostics remain secondary context, not proof of external impact
  • Clear labeling: “simulation-based” not “validated”
  • Soft-locked insights requiring external validation thresholds
  • Explicit uncertainty communication

Why this matters: False precision damages trust. Better to be approximately right with acknowledged uncertainty than precisely wrong with unearned confidence.

Discovery 6: Production Systems Are Different Animals

What we assumed: Optimizing for our evaluation criteria would transfer to production systems.

What we discovered: Production shopping LLMs are heavily specialized and likely don’t behave like general-purpose models.

We score with configurable BYOK judge providers (OpenAI, Gemini, Claude/Anthropic, OpenRouter). Production systems use:

  • Different model versions
  • Extensive fine-tuning for commerce
  • Safety and policy layers
  • Specialized prompt structures
  • Proprietary retrieval logic
  • Behavioral feedback integration

A variant that wins in our lab may not be preferred by the internal shopping LLM at OpenAI or Google.

How we adapted:

  • Multi-judge evaluation to identify cross-model consensus
  • Focus on general LLM-friendliness (clear structure, explicit use cases) rather than model-specific optimization
  • Track where results diverge across judges
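Divergence tracking across judges can be as simple as a vote tally per head-to-head test. Judge names here are placeholders; the point is that the dissenting judges are surfaced, not averaged away:

```python
from collections import Counter

# Each judge's preferred variant for one head-to-head test (placeholder judge names)
votes = {"judge_openai": "A", "judge_gemini": "A", "judge_claude": "B"}

counts = Counter(votes.values())
winner, support = counts.most_common(1)[0]
consensus = support / len(votes)
divergent = [j for j, v in votes.items() if v != winner]
print(f"consensus winner={winner} ({consensus:.0%}); divergent judges: {divergent}")
```

A 2-of-3 consensus with a known dissenter is a very different result from unanimity, and flagging which model dissents builds a picture of each judge's biases over time.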

Why this matters: We can optimize for “good communication with transformer-based systems” but can’t guarantee transfer to production ranking.

Discovery 7: Copy Optimization Isn’t Enough

What we assumed: Better product descriptions would significantly improve AI discoverability.

What we discovered: Copy is one signal among many.

If a competitor has:

  • Better schema and structured data
  • Higher merchant trust scores
  • More competitive pricing
  • Better reviews
  • Faster shipping

…then copy optimization alone won’t overcome those disadvantages.

Google has publicly described the Shopping Graph at tens of billions of listings with frequent updates across price, inventory, and merchant data. Copy quality is table stakes, not a silver bullet.8

How we adapted: We built protocol readiness checks to surface non-copy issues (missing structured data, schema problems, specification gaps) alongside content optimization.

Why this matters: Copy optimization is necessary but not sufficient. Feed health, pricing, operations, and customer experience remain critical.


Part IV: The Deeper Problem We Uncovered

Why Google’s Approach Is Fundamentally Different

Google’s use of Gemini for shopping is tightly integrated with its Shopping Graph:

  1. Gemini handles conversational intent understanding
  2. Intents resolve against the Shopping Graph (large-scale listing graph with frequent updates)
  3. Gemini composes answers from graph candidates

Core ranking still depends on:

  • Feed quality
  • Behavioral signals
  • Price competitiveness
  • Reviews
  • Shipping and policies

Google is “graph-driven with an LLM front-end.” Our lab is “LLM-driven with simulated ranking.”

We can help optimize for the LLM layer. But the graph layer - where most ranking happens - is beyond our reach.

The Behavioral Loop We Can’t Enter

Production systems learn from outcomes:

  • What gets shown and clicked
  • What converts to purchases
  • What leads to returns
  • How users respond to different recommendation styles

This creates a feedback loop where ranking improves based on real behavior.

Third-party tools can’t directly observe the platform-internal loop. We typically do not see impression-level decision logs, and we only see downstream clicks/conversions when tracking is available on owned properties.6, 7

We’re optimizing for a proxy (LLM alignment) that correlates with the target (user satisfaction) but isn’t the same thing.


Part V: What We Can Actually Deliver

What Works Today

1. Intent Coverage Analysis We can identify whether your product content addresses the range of intents likely to matter in your category. This is valuable regardless of AI rankings - intent-aligned copy helps human readers too.

2. Gap Identification We can spot what’s missing: features not mentioned, benefits not articulated, use cases not addressed. Filling gaps improves content quality.

3. Semantic Clarity Optimization We can help structure content for efficient information extraction by LLMs (and humans). Clear benefit framing, explicit use cases, complete specifications.

4. Cross-Model Testing We can show how different LLMs interpret your content, highlighting strengths and weaknesses across the ecosystem.

5. Competitive Intelligence We can compare your content against competitors using the same evaluation framework.

6. Protocol Readiness We can surface non-copy issues: missing structured data, schema problems, specification gaps that affect AI systems.

What Remains Unproven

1. Ranking Predictions We cannot reliably predict how you’ll rank in any specific AI shopping surface. The connection is too indirect.

2. Traffic Impact We cannot predict how optimization will affect AI-referred traffic. We lack the necessary data.

3. Attribution We cannot measure what percentage of conversions came from AI shopping. That requires analytics infrastructure we don’t control.

4. Competitive Positioning We can compare content quality, but we cannot tell you where you rank against competitors in production.

The Honest Value Proposition

We help you write product content that is:

  • Clear for LLM interpretation
  • Comprehensive in addressing likely intents
  • Structured for efficient extraction
  • Competitive with category benchmarks

If AI shopping systems retrieve your product, optimized content increases the probability of recommendation.

But we can’t control retrieval, and we can’t guarantee ranking.

Think of it like SEO content quality: necessary but not sufficient. Good content doesn’t guarantee page-one rankings, but bad content guarantees you won’t rank even with strong authority signals.


Part VI: Practical Guidance

What to Invest In

1. Content Quality (our domain)

  • Intent-aligned descriptions
  • Clear benefit articulation
  • Complete specifications
  • Structured, parseable formats

2. Feed Health (traditional optimization)

  • Accurate product data
  • Correct schema markup
  • Real-time inventory and pricing
  • High-quality images

3. Operational Excellence (business fundamentals)

  • Competitive pricing
  • Reliable shipping
  • Clear return policies
  • Strong reviews

4. Measurement Infrastructure (long-term capability)

  • Track AI referrers where possible
  • Survey customers about discovery paths
  • Monitor AI surfaces (sampling-based)
  • Build correlation datasets
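Tracking AI referrers where possible can start with simple referrer classification. The hostname patterns below are assumptions based on publicly visible domains; real referrer behavior varies by platform and may be stripped entirely, so treat matches as a lower bound:

```python
import re

# Assumed AI-assistant hostnames -- illustrative, not an authoritative list
AI_REFERRER_PATTERNS = re.compile(
    r"(chatgpt\.com|chat\.openai\.com|perplexity\.ai|gemini\.google)", re.I)

def classify_referrer(referrer: str) -> str:
    if not referrer:
        return "direct/unknown"   # referrer stripped or absent
    return "ai_assistant" if AI_REFERRER_PATTERNS.search(referrer) else "other"

print(classify_referrer("https://chatgpt.com/"))                   # ai_assistant
print(classify_referrer("https://www.google.com/search?q=shoes"))  # other
print(classify_referrer(""))                                       # direct/unknown
```

Even a lower-bound count of AI-referred sessions gives you a correlation dataset to test lab predictions against.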

How to Think About AI Optimization Tools

Ask these questions of any tool (including ours):

1. “What have you validated against real outcomes?”

  • Good: “We’ve correlated lab scores with observed presence for N products”
  • Bad: “Our scores predict AI rankings”

2. “How stable are your measurements?”

  • Good: “We use win rates because absolute scores are noisy”
  • Bad: “We achieve 0.71 alignment with 95% confidence”

3. “What can’t you measure?”

  • Good: Explicit acknowledgment of verification gaps
  • Bad: “We measure everything”

4. “How do you handle model variation?”

  • Good: “We test across multiple judges and report divergence”
  • Bad: “Our model represents all AI systems”

Setting Reasonable Expectations

AI optimization makes sense if you:

  • Have strong fundamentals (feed, pricing, operations, reviews)
  • Want better content regardless of AI ranking impact
  • Can tolerate measurement uncertainty
  • Are building capabilities for where commerce is heading

It doesn’t make sense if you:

  • Need guaranteed ranking improvements
  • Require precise ROI measurement
  • Want a quick fix for competitive disadvantage
  • Think optimization substitutes for operational excellence

Part VII: Where This Goes Next

Our Next Experiments

Short-term:

  • Validate lab predictions against observed AI shopping behavior (manual sampling)
  • Build calibration datasets measuring prediction accuracy
  • Test protocol readiness optimization (UCP/ACP structured data)
  • Expand multi-judge evaluation across more models
  • Improve end-to-end UX flow (setup -> run -> validate -> decide) to reduce friction in the lab journey

Medium-term:

  • Native GA4 integration for automated analytics correlation
  • Live AI surface probing for systematic observation
  • Category-specific intent taxonomies with confidence scoring
  • Attribution estimation models (probabilistic, not causal)
  • Scheduled experiment execution and stronger loop orchestration

Long-term:

  • Platform partnerships for verified measurement (if available)
  • Automated calibration refresh as systems evolve
  • Pattern libraries soft-locked until validation thresholds met

What We Still Need to Build

We are far from done. The current app is a working research lab, not a finished platform.

Priority roadmap areas:

  • Full protocol layer maturity (UCP/ACP schema validation, ingestion, transaction surfaces)
  • Stronger real-world verification and observability (live verification harness, attribution tracking, alerts)
  • Multi-tenant productization (self-serve onboarding, permissions, tenant-scoped analytics dashboards)
  • Agentic loop extensions (scheduled runs, active learning from lessons, richer loop orchestration)
  • Infrastructure for scale (Postgres tenancy, queues/workers, unified automation runner)
  • UX/product polish (scenario templates, versioned simulations, batch export, shareable reports, clearer journey design)

What the Market Needs

Honest measurement frameworks that acknowledge gaps rather than hiding them.

Calibration infrastructure for tracking prediction accuracy over time.

Platform transparency around ranking signals and attribution data.

Realistic expectations about what third-party tools can deliver.

The market wants certainty. Engineering delivers probability. The gap is frustrating, but navigating it honestly is the only path forward.


Conclusion: Building to Understand

Three weeks isn’t enough time to build a mature product. It’s barely enough time to understand the problem.

That turned out to be the point.

We started believing we could predict AI shopping rankings. We ended up building a lab that revealed how little we - or anyone - can actually verify about AI discoverability optimization.

Every component we implemented raised new questions:

  • LLM-as-judge: useful proxy or misleading signal?
  • Alignment scores: semantic clarity metric or ranking predictor?
  • Multi-judge consensus: cross-model truth or shared bias?
  • Validation layers: calibration path or theater?

We don’t have definitive answers. We have better questions and infrastructure for testing them.

What we learned:

  • Building reveals assumptions faster than planning
  • The verification gap is structural, not solvable with clever engineering
  • Lab metrics have value but can’t substitute for behavioral feedback
  • Honest uncertainty is more valuable than false precision
  • The market wants promises engineering can’t deliver
  • A good lab architecture is necessary, but UX quality determines whether teams can use it effectively

What we’re doing about it:

  • Treating lab outputs as hypotheses, not predictions
  • Building validation infrastructure to measure accuracy
  • Gating strategic recommendations until externally verified
  • Communicating uncertainty explicitly
  • Investing in measurement capabilities for when platforms provide access
  • Continuing to iterate on product design so the workflow feels like an exciting, usable discovery-and-optimization platform, not a collection of disconnected tools

The agentic commerce shift is real. The optimization opportunity exists. The measurement problem is also real and possibly permanent given current platform APIs.

We’re building to understand that gap - and to be ready when (if) it closes.

This is a research report, not a product launch. We’re sharing what we learned from three weeks of rapid prototyping because the market needs realistic expectations more than it needs optimistic promises.

We are still in active iteration, and significant product, UX, and systems work remains before this can be considered a mature platform.

If you’re building in this space: build to learn, measure what you can, acknowledge what you can’t, and don’t confuse lab signals with production outcomes.

We’re still figuring this out. So is everyone else.


Appendix A: Lab Architecture

System Components (What We Actually Built)

Frontend (Next.js)
├─ Chat (intent exploration)
├─ Alignment (scoring + evidence)
├─ Simulation (run → optimize → retest)
├─ Experiments (batteries, variants, metrics)
├─ Validation (synthetic + observed signals)
└─ Admin (onboarding + operational controls)

Backend (FastAPI)
├─ Conversation + intent routes
├─ Alignment + evidence routes  
├─ Simulation routes
├─ Experiment orchestration
├─ Validation logging
├─ Learning loop (beliefs, memory, calibration)
└─ Analytics event ingestion

Services Layer
├─ Intent inference
├─ Alignment scoring (signal-based, transparent)
├─ Query generation (top-down, bottom-up, hybrid)
├─ Experiment runner with multi-judge evaluation
├─ Validation coordinator (synthetic vs. observed)
├─ Memory service (quality-gated artifact distillation)
└─ Calibration tracker (prediction accuracy)

Infrastructure
├─ LLM clients (OpenAI, Gemini, Claude/Anthropic, OpenRouter via BYOK)
├─ SQLite database
├─ Skill prompt storage
└─ Protocol adapters (UCP/ACP)

Key Design Decisions (What We Learned)

Signal-Based Scoring Transparent, decomposable scores over black-box embeddings. “You lost on benefit articulation” is actionable. “Similarity: 0.42” is not.

Hard Category Gating Category mismatch fails immediately before soft-scoring. Prevents recommending laptops when user wants shoes.

Pairwise Comparisons Win rates over absolute scores. More stable measurement foundation.

Validation Separation Synthetic signals (LLM judges) distinguished from observed reality (actual shopping behavior). Different confidence levels.

Soft-Locked Insights Pattern recommendations locked until validated against external observations. Prevents over-confident strategic advice.

Multi-Judge Evaluation Cross-model consensus as stronger evidence than single-model preference. Divergence itself is informative.

Quality-Gated Memory Low-confidence beliefs excluded from reuse. Prevents compounding errors in query generation.


Appendix B: The Theory We Started With

Bayesian Framework for Intent Inference

$$P(H | E) \propto P(E | H) \cdot P(H)$$

Where:

  • $H$ = hypothesis about user intent
  • $E$ = evidence from query and context
  • $P(H)$ = prior belief about likely intents
  • $P(E|H)$ = likelihood of evidence given intent

Each turn updates posterior beliefs. Rich context activates specific intent hypotheses.
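A toy worked example of the update, with assumed priors and likelihoods (the numbers are invented for illustration): observing "plantar fasciitis" in a shoe query shifts mass sharply toward the training hypothesis.

```python
# Two intent hypotheses for "shoes for long runs" -- all values are assumed
priors = {"marathon_training": 0.3, "casual_walking": 0.7}
# Likelihood of mentioning "plantar fasciitis" under each hypothesis
likelihood = {"marathon_training": 0.6, "casual_walking": 0.1}

unnorm = {h: priors[h] * likelihood[h] for h in priors}   # P(E|H) * P(H)
z = sum(unnorm.values())                                   # normalizing constant
posterior = {h: v / z for h, v in unnorm.items()}
print(posterior)   # marathon_training now dominates despite the lower prior
```

The prior favored casual walking 70/30; one piece of constraint-rich evidence flips the posterior to 72/28 the other way, which is exactly the "context activates intent" behavior the hypothesis predicts.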

Connection to Active Inference

Free Energy Principle: agents minimize surprise (prediction error).

When uncertain → ask clarifying questions (reduce entropy)

When confident → recommend (act on beliefs)

Creates natural exploration-exploitation balance.

Implementation Reality

The math is elegant. Practice is messier:

  • LLMs approximate but don’t perform exact Bayesian updates
  • Prompt engineering elicits structured representations
  • Confidence requires external validation
  • “Priors” come from category knowledge, not formal distributions
  • The implemented loop is Bayesian-style weighting over scoped evidence, with auditable belief revisions and memory artifact promotion rules

We use the framework as design guidance, not literal implementation.


Appendix C: Validation Approaches We’re Testing

Currently Implemented

1. Synthetic Validation (LLM Judges)

  • Fast screening signal
  • Cross-model consistency checks
  • BYOK provider/model support
  • Structured result logging

2. Observed Reality Validation

  • Manual logging of actual shopping surface behavior
  • Links to lab predictions
  • Accuracy tracking over time
  • Grounding signal for calibration

3. Analytics Event Ingestion

  • API endpoint for external signals
  • Clicks, conversions, referrer data
  • Correlation analysis infrastructure
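The ingestion contract can be sketched in plain Python. The real lab exposes this through a FastAPI route; the field names, allowed event types, and validation rules below are illustrative assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AnalyticsEvent:
    event_type: str          # e.g. "click", "conversion" -- assumed vocabulary
    product_id: str
    referrer: str = ""
    received_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

ALLOWED_TYPES = {"click", "conversion", "impression"}

def ingest(payload: dict, store: list) -> bool:
    """Validate and persist an external analytics event; reject malformed input."""
    if payload.get("event_type") not in ALLOWED_TYPES or not payload.get("product_id"):
        return False                       # reject rather than guess at intent
    store.append(AnalyticsEvent(payload["event_type"], payload["product_id"],
                                payload.get("referrer", "")))
    return True

events: list = []
print(ingest({"event_type": "click", "product_id": "p1",
              "referrer": "https://chatgpt.com/"}, events))   # accepted
print(ingest({"event_type": "hover", "product_id": "p1"}, events))  # rejected
print(len(events))
```

Rejecting unknown event types at the boundary keeps the correlation dataset clean, mirroring the same "validate before persistence" stance the query generator takes.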

4. Loop Operations and Maintenance

  • Loop state, step, belief, memory, and calibration endpoints for explicit control of the learning cycle
  • Loop-maintenance runs that refresh calibration profiles and distill memory artifacts
  • Run history for auditable operations tracking

5. Multi-Tenant + Model Operations

  • Tenant-scoped entities and workflows (clients, brands, products) to prevent cross-scope contamination
  • Admin model gateway for provider keys/models with separate chat/generation and validation settings

In Development

Live AI Surface Probing

  • Automated queries to shopping surfaces
  • Presence and positioning tracking
  • Correlation with lab predictions

GA4 Integration

  • Direct analytics ingestion
  • Automated experiment mapping
  • Reduced manual logging burden

Attribution Estimation

  • Statistical models for AI-referred traffic
  • Referrer pattern analysis
  • Probabilistic (not causal) inference

The Irreducible Gap

Even with all planned features, we cannot:

  • Access internal ranking scores
  • See impression data (what was shown but not clicked)
  • Track full decision paths within assistants
  • Prove causation (only correlation)

Validation improves calibration. It doesn’t eliminate uncertainty.


This reflects our current understanding. The field evolves rapidly. Measurements will improve. Our models will update.


For collaboration or questions: Performics Labs

Footnotes

  1. OpenAI Help Center, Shopping in ChatGPT Search (updated Jan 16, 2026).

  2. AP-NORC via AP News, How US adults are using AI (Jul 2025).

  3. Adobe, Generative AI-powered shopping rises with traffic to U.S. retail sites (Jul 15, 2025).

  4. Morgan Stanley Research, Here Come the Shopping Bots (Dec 8, 2025); McKinsey, The economic potential of generative AI (2023, updated).

  5. ACL Anthology: Exploring the Use of Natural Language Descriptions of Intents for LLMs in Zero-shot Intent Classification (SIGDIAL 2024); Intent Detection in the Age of LLMs (EMNLP Industry 2024).

  6. Google: Try Google Search’s AI Mode in Labs (May 2025); Merchant Center docs Performance reports and Content API Reporting guide.

  7. OpenAI docs Shopping in ChatGPT Search and API Usage Dashboard.

  8. Google Shopping Graph references: Try Google Search’s AI Mode in Labs and Behind the scenes of the Google Shopping Graph.

Published on Sunday, February 8, 2026 · Estimated read time: 30 min