Methodology · v2.1 · Updated 2026-04-12

How Replica forecasts A/B test outcomes

This page walks through the system end-to-end: what simulated users are made of, how we calibrate against your historical experiments, the statistical methods behind our confidence intervals, and where Replica is known to perform worse than a live test. Every claim on the homepage cites a section here.

1 · Simulated user construction

A simulated user is a behavioral trace, not an avatar. Each one is parameterized by three layers:

  • Attribute distribution. Demographic, account, and behavioral attributes are sampled from your observed user-base distribution (e.g. Amplitude or Statsig population stats). We do not invent attributes that don’t appear in your data.
  • Persona archetype. We cluster your session-recording corpus into 8–32 archetypes capturing recurring behavioral patterns (price-anchoring shoppers, fast-bounce browsers, multi-session researchers). Each simulated user is assigned an archetype proportional to its prevalence in your data.
  • Action policy. A fine-tuned vision-language model takes the page state plus the user’s attributes and archetype, and returns the next action (scroll, click, hover, type, leave). The policy is trained on 5,000+ real session recordings per customer; see the finetuning paper. A sketch of all three layers follows this list.
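
The following Python sketch ties the three layers together. It is illustrative only: ARCHETYPE_WEIGHTS, SimulatedUser, sample_user, run_session, and the policy interface are assumed names, and the prevalence values are invented for the example; none of this is Replica’s actual API.

```python
# Hedged sketch of the three-layer construction. All names and values
# here are assumptions for illustration, not Replica's actual API.
import random
from dataclasses import dataclass

# Assumed archetype prevalences, as measured from the clustered
# session-recording corpus (values invented for the example).
ARCHETYPE_WEIGHTS = {
    "price_anchoring_shopper": 0.22,
    "fast_bounce_browser": 0.31,
    "multi_session_researcher": 0.47,
}

@dataclass
class SimulatedUser:
    attributes: dict   # layer 1: sampled from the observed population
    archetype: str     # layer 2: drawn proportional to corpus prevalence

def sample_user(population_stats: dict, rng: random.Random) -> SimulatedUser:
    # Layer 1: sample each attribute from its observed marginal distribution;
    # no attribute is invented that does not appear in the customer's data.
    attributes = {
        name: rng.choices(spec["values"], weights=spec["freqs"])[0]
        for name, spec in population_stats.items()
    }
    # Layer 2: assign an archetype in proportion to its prevalence.
    archetype = rng.choices(
        list(ARCHETYPE_WEIGHTS), weights=list(ARCHETYPE_WEIGHTS.values())
    )[0]
    return SimulatedUser(attributes, archetype)

def run_session(user: SimulatedUser, page_state, policy) -> list:
    # Layer 3: the action policy maps (page state, attributes, archetype)
    # to the next action, looping until the simulated user leaves.
    trace = []
    while True:
        action = policy.next_action(page_state, user.attributes, user.archetype)
        trace.append(action)
        if action.kind == "leave":
            return trace
        page_state = page_state.apply(action)
```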

2 · Calibration on your A/B history

Before Replica is allowed to forecast new tests for an account, we run a backtest: we simulate at least 20 of that account’s historical A/B tests and compare each predicted lift against the lift measured in the live test.

We report two calibration metrics:

  • Median absolute error (MAE). The median, across backtested experiments, of |predicted lift − measured lift|. Lower is better. Our current cross-customer MAE is 2.1 percentage points (pp) on primary funnel metrics.
  • 95% CI coverage. The share of backtested experiments in which the live lift fell inside Replica’s predicted 95% confidence interval. Nominal coverage is 95%; we currently measure 94.7%.

If MAE exceeds 4 pp or coverage falls below 88% for an account, we tune the persona-archetype layer or recommend additional session-recording collection before forecasting. We never recommend SHIP/SKIP on an under-calibrated account.
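
As a concrete, hedged restatement of the two metrics and the gate above, in Python; the Backtest record and its field names are assumptions, not Replica’s internal schema. Only the thresholds come from the text.

```python
# Illustrative calibration check. The Backtest record and its fields are
# assumed for this sketch; the thresholds mirror the numbers stated above.
from dataclasses import dataclass
from statistics import median

@dataclass
class Backtest:
    predicted_lift: float  # Replica's forecast, in percentage points
    measured_lift: float   # the live test's result, in percentage points
    ci_low: float          # lower bound of the predicted 95% CI
    ci_high: float         # upper bound of the predicted 95% CI

def calibration(backtests: list[Backtest]) -> tuple[float, float]:
    # MAE: median of |predicted lift - measured lift| across experiments.
    mae = median(abs(b.predicted_lift - b.measured_lift) for b in backtests)
    # Coverage: share of live lifts that fell inside the predicted 95% CI.
    coverage = sum(
        b.ci_low <= b.measured_lift <= b.ci_high for b in backtests
    ) / len(backtests)
    return mae, coverage

def may_forecast(backtests: list[Backtest]) -> bool:
    # Gate: at least 20 backtested experiments, MAE <= 4 pp, coverage >= 88%.
    if len(backtests) < 20:
        return False
    mae, coverage = calibration(backtests)
    return mae <= 4.0 and coverage >= 0.88
```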

3 · Statistical methods

Confidence intervals are constructed via a parametric bootstrap over simulated sessions, with sample sizes chosen to mirror the variance of your historical live tests (so simulated CIs are not artificially tight).
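
For intuition, here is a minimal parametric bootstrap for a conversion-lift CI. The per-arm binomial model, the function name, and every parameter value are assumptions for illustration; the production system bootstraps over full simulated sessions.

```python
# Minimal parametric bootstrap CI for lift, assuming per-arm binomial
# conversion. n_per_arm would be chosen to mirror the variance of the
# account's historical live tests, so the CI is not artificially tight.
import numpy as np

def bootstrap_lift_ci(p_control: float, p_variant: float, n_per_arm: int,
                      n_boot: int = 10_000, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # Resample conversion counts from the fitted binomials (parametric
    # bootstrap), rather than resampling raw sessions.
    c = rng.binomial(n_per_arm, p_control, size=n_boot) / n_per_arm
    v = rng.binomial(n_per_arm, p_variant, size=n_boot) / n_per_arm
    lift_pp = (v - c) * 100                      # lift in percentage points
    return np.percentile(lift_pp, [2.5, 97.5])   # 95% CI bounds

# e.g. bootstrap_lift_ci(0.040, 0.043, n_per_arm=50_000)
```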

Primary metrics are pre-registered per experiment. Replica reports the lift, its 95% CI, and a directional SHIP/SKIP verdict. We support CUPED-style variance reduction when pre-period data is available, and we apply Bonferroni correction across primary and secondary metrics by default. Sequential analyses are available for accounts running optional-stopping protocols.
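
To make the CUPED and Bonferroni steps concrete, a textbook-form sketch; y (the in-experiment metric) and x (the same user’s pre-period covariate) are assumed inputs, and this is the standard formulation rather than Replica’s exact code.

```python
# CUPED-style variance reduction and Bonferroni correction, textbook form.
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    # theta = cov(x, y) / var(x). Subtracting theta * (x - mean(x)) leaves
    # the mean of y unchanged while removing the variance explained by the
    # pre-period covariate x.
    cov = np.cov(x, y)
    theta = cov[0, 1] / cov[0, 0]
    return y - theta * (x - x.mean())

def bonferroni_alpha(alpha: float, n_metrics: int) -> float:
    # Test each of the primary + secondary metrics at alpha / m so the
    # family-wise error rate stays at alpha.
    return alpha / n_metrics
```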

4 · Where Replica is worse than a live test

We’re explicit about failure modes. Replica is currently weaker than a live A/B test when:

  • The variant introduces a behavior with no precedent in the user’s session history (e.g. a brand-new payment method).
  • The metric of interest is downstream of multi-session behavior (LTV, retention beyond ~14 days) — we forecast within-session metrics with much higher fidelity.
  • The simulation requires modeling adversarial third parties (paywalls, A/B-test detectors, dynamic pricing reactions).
  • The customer’s historical experiment corpus is small (<15 tests) or biased toward one variant family.

In these cases Replica abstains rather than recommending over-confidently. The dashboard surfaces an “insufficient calibration” banner with the specific reason.

5 · Reproducibility

Every prediction page includes a run-ID, a hash of the simulated-user population, and the exact prompt and seed used by the action policy. Rerunning with the same population hash, prompt, and seed reproduces the prediction exactly. We retain the per-run audit log for the lifetime of the customer account.
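
One way such a fingerprint can be constructed, as an illustrative sketch; the hash function, field choices, and run-ID format here are assumptions, not Replica’s actual scheme.

```python
# Hypothetical run fingerprint: hash the canonicalized population, then
# combine it with the prompt and seed to derive a stable run-ID.
import hashlib
import json

def population_hash(users: list[dict]) -> str:
    # Canonical JSON (sorted keys) makes the digest independent of dict order.
    blob = json.dumps(users, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def run_id(population_digest: str, prompt: str, seed: int) -> str:
    # Identical population + prompt + seed gives an identical run-ID, which
    # is what makes a rerun byte-for-byte reproducible.
    blob = f"{population_digest}|{prompt}|{seed}".encode()
    return hashlib.sha256(blob).hexdigest()[:16]
```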

Questions about the methodology, or want the calibration report for your account? Book a 30-minute walkthrough →