How Replica forecasts A/B test outcomes
This page walks through the system end-to-end: what simulated users are made of, how we calibrate against your historical experiments, the statistical methods behind our confidence intervals, and where Replica is known to perform worse than a live test. Every claim on the homepage cites a section here.
1 · Simulated user construction
A simulated user is a behavioral trace, not an avatar. Each one is parameterized by three layers:
- Attribute distribution. Demographic, account, and behavioral attributes are sampled from your observed user-base distribution (e.g. Amplitude or Statsig population stats). We do not invent attributes that don’t appear in your data.
- Persona archetype. We cluster your session-recording corpus into 8–32 archetypes that capture recurring behavioral patterns (price-anchoring shoppers, fast-bounce browsers, multi-session researchers). Each simulated user is assigned an archetype with probability proportional to that archetype's prevalence in your data.
- Action policy. A finetuned vision-language model takes the page state plus the user's attributes and archetype, and returns the next action (scroll, click, hover, type, leave). The model is trained on 5,000+ real session recordings per customer; see the finetuning paper.
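The first two layers can be pictured with a minimal Python sketch. It is illustrative only, not Replica's implementation: the names `SimulatedUser` and `sample_users` and the input shapes are assumptions, and each attribute is drawn independently, which ignores correlations that a real population sampler would presumably preserve.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulatedUser:
    attributes: dict   # e.g. {"plan": "free"}
    archetype: str     # e.g. "fast_bounce"


def sample_users(attribute_dist, archetype_prevalence, n, seed=0):
    """Draw n simulated users (hypothetical helper).

    attribute_dist:       {attribute_name: {value: observed_share}}
    archetype_prevalence: {archetype_name: observed_share}
    Only values present in the observed distributions can be drawn.
    """
    rng = random.Random(seed)
    archetypes = list(archetype_prevalence)
    archetype_weights = [archetype_prevalence[a] for a in archetypes]

    users = []
    for _ in range(n):
        attrs = {}
        for name, dist in attribute_dist.items():
            values = list(dist)
            weights = [dist[v] for v in values]
            # Simplifying assumption: attributes are sampled independently.
            attrs[name] = rng.choices(values, weights=weights)[0]
        # Archetype is drawn in proportion to its observed prevalence.
        archetype = rng.choices(archetypes, weights=archetype_weights)[0]
        users.append(SimulatedUser(attrs, archetype))
    return users


population = sample_users(
    {"plan": {"free": 0.8, "pro": 0.2}},
    {"fast_bounce": 0.5, "multi_session_researcher": 0.5},
    n=1000,
    seed=1,
)
```

The action policy is left out on purpose: it is a finetuned model, and a toy stand-in would misrepresent it.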
2 · Calibration on your A/B history
Before Replica is allowed to forecast new tests for an account, we run a backtest: simulate at least 20 historical A/B tests for that account and compare predicted lift against the measured lift from the live test.
We report two calibration metrics:
- Median absolute error (MAE). The median, across backtested experiments, of |predicted lift − measured lift|. Lower is better. Note that MAE here denotes the median, not the more common mean absolute error. Our current cross-customer MAE is 2.1 percentage points on primary funnel metrics.
- 95% CI coverage. The share of backtested experiments where the live lift fell inside Replica’s predicted 95% confidence interval. Nominal is 95%; we currently report 94.7%.
If MAE exceeds 4 pp or coverage falls below 88% for an account, we tune the persona-archetype layer or recommend additional session-recording collection before forecasting. We never recommend SHIP/SKIP on an under-calibrated account.
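As a rough illustration, both metrics and the calibration gate can be computed from a list of backtest results as follows. The function name `calibration_report` and the record fields are assumptions made for this sketch; the 4 pp and 88% thresholds are the ones stated above.

```python
from statistics import median


def calibration_report(backtests, max_mae_pp=4.0, min_coverage=0.88):
    """Summarize backtest calibration (hypothetical helper).

    backtests: list of dicts with keys "predicted", "measured",
               "ci_low", "ci_high", all in percentage points of lift.
    """
    if not backtests:
        raise ValueError("at least one backtested experiment is required")

    abs_errors = [abs(b["predicted"] - b["measured"]) for b in backtests]
    covered = [b["ci_low"] <= b["measured"] <= b["ci_high"] for b in backtests]

    mae_pp = median(abs_errors)              # median, not mean
    coverage = sum(covered) / len(covered)   # share of CIs containing the live lift

    return {
        "median_abs_error_pp": mae_pp,
        "coverage": coverage,
        # Under-calibrated if MAE exceeds 4 pp or coverage falls below 88%.
        "calibrated": mae_pp <= max_mae_pp and coverage >= min_coverage,
    }


report = calibration_report([
    {"predicted": 1.0, "measured": 2.0, "ci_low": 0.0, "ci_high": 3.0},
    {"predicted": 5.0, "measured": 0.0, "ci_low": 3.0, "ci_high": 7.0},
    {"predicted": 2.0, "measured": 2.5, "ci_low": 1.0, "ci_high": 3.0},
])
```

In this toy input the median error is within the threshold, but only two of three intervals cover the measured lift, so the account would not pass the gate.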
3 · Statistical methods
Confidence intervals are constructed via a parametric bootstrap over simulated sessions, with sample sizes chosen to mirror the variance of your historical live tests (so simulated CIs are not artificially tight).
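To make the idea concrete, here is a minimal sketch of a parametric bootstrap for the lift in a binary conversion metric. It is not Replica's estimator. It assumes a Bernoulli model per arm and a percentile interval. The function name and parameters are hypothetical, and `n_a` and `n_b` stand in for sample sizes already chosen to mirror the variance of live tests.

```python
import random


def parametric_bootstrap_lift_ci(conv_a, n_a, conv_b, n_b,
                                 n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for absolute lift (B minus A) in percentage points.

    Parametric step: fit a Bernoulli rate to each arm, then redraw
    whole arms from the fitted model instead of resampling sessions.
    """
    rng = random.Random(seed)
    p_a, p_b = conv_a / n_a, conv_b / n_b

    def draw_rate(p, n):
        # One synthetic arm: n Bernoulli(p) sessions.
        return sum(rng.random() < p for _ in range(n)) / n

    lifts = sorted(
        100.0 * (draw_rate(p_b, n_b) - draw_rate(p_a, n_a))
        for _ in range(n_boot)
    )
    lo = lifts[int((alpha / 2) * n_boot)]
    hi = lifts[max(int((1 - alpha / 2) * n_boot) - 1, 0)]
    return 100.0 * (p_b - p_a), lo, hi


point, lo, hi = parametric_bootstrap_lift_ci(50, 500, 75, 500, n_boot=400, seed=3)
```

Smaller `n_a` and `n_b` widen the interval. That is the lever the paragraph above refers to when it says simulated CIs are kept from being artificially tight.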
Primary metrics are pre-registered per experiment. Replica reports lift, CI, and a directional ship/no-ship verdict. We support CUPED-style variance reduction when pre-period data is available, and we apply Bonferroni correction across primary + secondary metrics by default. Sequential analyses are available for accounts running optional-stopping protocols.
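For readers unfamiliar with the two adjustments, the sketch below shows the textbook forms: CUPED subtracts the part of the metric explained by a pre-period covariate, and Bonferroni divides the significance level by the number of metrics tested. The function names are hypothetical and this is not necessarily how Replica implements its "CUPED-style" reduction. In practice the coefficient is usually estimated on data pooled across arms.

```python
from statistics import mean


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes y - theta * (x - mean(x)).

    y: in-experiment metric per user; x: the same user's pre-period covariate.
    theta = cov(x, y) / var(x) minimizes the variance of the adjusted metric.
    """
    x_bar, y_bar = mean(x), mean(y)
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    if var_x == 0:
        return list(y)  # constant covariate carries no information
    theta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / var_x
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


def bonferroni_alpha(alpha, n_metrics):
    """Per-metric significance level when n_metrics are tested together."""
    if n_metrics < 1:
        raise ValueError("n_metrics must be at least 1")
    return alpha / n_metrics


pre_period = [1.0, 2.0, 3.0, 4.0, 5.0]
outcome = [2.1, 3.9, 6.2, 7.8, 10.1]
adjusted = cuped_adjust(outcome, pre_period)
per_metric_alpha = bonferroni_alpha(0.05, 5)  # 1 primary + 4 secondary metrics
```

The adjustment leaves the mean of the metric unchanged and only removes variance, which is why it narrows intervals without shifting the estimated lift.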
4 · Where Replica is worse than a live test
We’re explicit about failure modes. Replica is currently weaker than a live A/B test when:
- The variant introduces a behavior with no precedent in the user’s session history (e.g. a brand-new payment method).
- The metric of interest is downstream of multi-session behavior (LTV, retention beyond ~14 days). We forecast within-session metrics with much higher fidelity.
- The simulation requires modeling adversarial third parties (paywalls, A/B-test detectors, dynamic pricing reactions).
- The customer’s historical experiment corpus is small (<15 tests) or biased toward one variant family.
In these cases Replica abstains rather than issuing an over-confident recommendation. The dashboard surfaces an “insufficient calibration” banner with the specific reason.
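The abstention rule can be read as a checklist over the four conditions above. The sketch below is one way to express it. The function name, argument names, and reason strings are all hypothetical, and only the 15-test and 14-day thresholds come from this page.

```python
def abstention_reasons(has_session_precedent, metric_horizon_days,
                       needs_third_party_modeling, n_historical_tests,
                       corpus_biased_to_one_family,
                       max_horizon_days=14, min_tests=15):
    """Return the list of reasons to abstain; an empty list means forecast."""
    reasons = []
    if not has_session_precedent:
        reasons.append("no_precedent_in_session_history")
    if metric_horizon_days > max_horizon_days:
        reasons.append("multi_session_metric")
    if needs_third_party_modeling:
        reasons.append("adversarial_third_party")
    if n_historical_tests < min_tests or corpus_biased_to_one_family:
        reasons.append("small_or_biased_experiment_corpus")
    return reasons


# A 30-day retention metric on an account with only 10 historical tests.
reasons = abstention_reasons(
    has_session_precedent=True,
    metric_horizon_days=30,
    needs_third_party_modeling=False,
    n_historical_tests=10,
    corpus_biased_to_one_family=False,
)
```

Returning every matching reason, rather than the first, mirrors the banner's promise to show the specific cause.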
5 · Reproducibility
Every prediction page includes a run-ID, a hash of the simulated-user population, and the exact prompt/seed used by the action policy. Reruns are deterministic. We retain the per-run audit log for the lifetime of the customer account.
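A population hash of this kind is typically computed over a canonical serialization, so the same population always yields the same digest. The sketch below shows one possible approach using SHA-256 over sorted-key JSON. The function names, record layout, and choice of hash are assumptions, not a description of Replica's audit format.

```python
import hashlib
import json


def population_hash(users):
    """SHA-256 over a canonical JSON encoding of the simulated users.

    users: list of JSON-serializable dicts. Key order inside each dict
    does not affect the digest; the order of users in the list does.
    """
    canonical = json.dumps(users, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_record(run_id, users, prompt, seed):
    """Bundle the identifiers needed to rerun a prediction (hypothetical layout)."""
    return {
        "run_id": run_id,
        "population_hash": population_hash(users),
        "prompt": prompt,
        "seed": seed,
    }


users = [{"plan": "free", "archetype": "fast_bounce"}]
record = run_record("run-0001", users, prompt="(action-policy prompt)", seed=42)
```

A rerun can then be checked by recomputing the hash from the stored population and comparing it with the recorded value before replaying the action policy with the same prompt and seed.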
6 · Cited research
- Ablation analysis · Population attributes vs persona archetypes
- Finetuning on session recordings · Action prediction accuracy improvements
- Explainpaper case study · Predicting a live A/B test in <1 hour
Questions about the methodology, or want the calibration report for your account? Book a 30-minute walkthrough →