
How more real user data improves A/B test simulation accuracy

A systematic ablation study examining how population-level attributes and persona archetypes impact the accuracy of Replica’s A/B test simulations.

  • 41% reduction in prediction error from persona-archetype calibration
  • 8 simulation configurations compared across two primary metrics
  • Browser-based simulation outperforms LLM screenshot heuristics

Background

What is Ablation Analysis?

Ablation analysis is a structured way to understand which components of a system are actually contributing to its performance. Rather than evaluating a system as a single black box, ablation removes or modifies one component at a time while keeping everything else fixed, then observes how the system's behavior changes.

The intuition is simple: if performance degrades when a component is removed, that component is providing real signal. If performance remains largely unchanged, the component is likely redundant. We use ablation to determine what components are worth investing more time into exploring given their proven impact.
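The remove-one-component-at-a-time loop can be sketched in a few lines of Python. Everything here is illustrative: the component names and the toy scoring function are stand-ins, not Replica's actual system.

```python
# Minimal ablation loop: score the full system, then re-score it with one
# component disabled at a time and record the performance drop.
# Component names and signal values are toy stand-ins, not Replica's internals.

def evaluate(components: set[str]) -> float:
    # Pretend each component contributes a fixed amount of predictive signal
    # on top of a base score.
    signal = {"attributes": 0.10, "personas": 0.25, "browser": 0.40}
    return 0.20 + sum(v for name, v in signal.items() if name in components)

full = {"attributes", "personas", "browser"}
baseline = evaluate(full)
for component in sorted(full):
    drop = baseline - evaluate(full - {component})
    print(f"removing {component!r} costs {drop:.2f} of the score")
```

Components whose removal barely moves the score are candidates for simplification; components with large drops are where further investment pays off.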

Why Ablation Analysis Matters for Replica

Replica creates digital twins of real user segments whose session-level reasoning, navigation paths, and decision outcomes closely match observed desktop and mobile behavior.

Replica simulates A/B test outcomes by modeling how these digital twins interact with different website variants. Each digital twin is constructed from three inputs:

  1. Population-level attributes that define the statistical makeup of the real user base (e.g., demographics such as age and income) and are used to generate the digital twins. In practice, these distributions can be derived directly from a customer's existing data via lightweight integrations (e.g., Supabase or Snowflake), ensuring simulations reflect the real population rather than assumed demographics.
  2. Higher-level personas that represent common user archetypes, motivations, and intent.
  3. Computer-use model and prompts, which govern how the digital twins interact with the browser, including the specified task, termination criteria, and contextual description of the website.
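The three inputs above can be pictured as a single configuration object. This is a hypothetical sketch for illustration; the field names and example values are ours, not Replica's API.

```python
# Hypothetical sketch of the three inputs that define a digital-twin simulation.
# Field names and example values are illustrative, not Replica's actual schema.
from dataclasses import dataclass, field


@dataclass
class SimulationConfig:
    # 1. Population-level attribute distributions (e.g. country of origin).
    attribute_distributions: dict[str, dict[str, float]] = field(default_factory=dict)
    # 2. Persona archetypes with their share of the population.
    personas: dict[str, float] = field(default_factory=dict)
    # 3. Computer-use settings: task, termination criteria, site context.
    task: str = ""
    termination: str = ""
    site_context: str = ""


config = SimulationConfig(
    attribute_distributions={"country": {"US": 0.6, "DE": 0.2, "UK": 0.2}},
    personas={"curious researcher": 0.5, "time-pressed student": 0.5},
    task="Evaluate whether to sign up for Explainpaper",
    termination="Stop after signing up or abandoning the funnel",
)
```

Ablation then amounts to zeroing out one field at a time while holding the others fixed.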

A simulation may closely match a real A/B test, but without isolating individual components, it's impossible to understand their relative impact. That leaves us guessing about what to improve, what's over-engineered, and where additional effort will meaningfully move the needle.

For this study, we focus on ablating population-level attributes and personas, while holding the browser interaction model and task framing constant. This allows us to isolate how modeling who users are and what they're trying to do affects A/B test prediction accuracy.

Explainpaper's A/B Test

Explainpaper is a research platform that helps readers understand complex academic papers, with over 400,000 users worldwide. As part of their growth efforts, the team set out to improve their new user signup funnel – one of the highest-leverage parts of their website.

In a recent pilot, Explainpaper hypothesized that emphasizing how their product makes papers easier to understand, rather than simply faster to read, would increase new user signups. To test this, they launched an A/B experiment using Statsig, comparing two website variants with differences in wording and visuals. The experiment focused on two key funnel stages: conversion from the Landing page to the Pricing page, and conversion from the Pricing page to Signup.

We simulated this A/B test using Replica and observed directionally accurate results relative to the real world experiment. In the sections that follow, we use this same A/B test as the foundation for a series of ablation experiments to understand which components of Replica are responsible for that accuracy.

Methodology

Metrics

This A/B test evaluated two primary metrics tied to key stages of the new user signup funnel:

  1. Landing Page → Pricing Page Conversion: the fraction of users who, after viewing the landing page, proceeded to the pricing page.
  2. Pricing Page → Signup Page Conversion: the fraction of users who, after viewing the pricing page, proceeded to the signup page.
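Both metrics are plain conditional conversion rates. A minimal sketch, using made-up view counts rather than Explainpaper's real traffic:

```python
# Each funnel metric is a conditional conversion rate: of the users who viewed
# the earlier page, what fraction proceeded to the next one?
# The counts below are illustrative, not Explainpaper's real traffic.

def conversion_rate(viewed: int, proceeded: int) -> float:
    return proceeded / viewed

landing_views, pricing_views, signup_views = 1000, 240, 60
landing_to_pricing = conversion_rate(landing_views, pricing_views)  # 0.24
pricing_to_signup = conversion_rate(pricing_views, signup_views)    # 0.25
```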

Real World Attributes

The only real user attribute that Explainpaper tracks is country of origin, which Statsig automatically infers from a user's device. This country-level distribution can be used directly when generating the population of AI agents, allowing the simulated digital twins to more closely reflect Explainpaper's real user base.

User attribute: Country of origin

  • United States: x%
  • Germany: y%
  • United Kingdom: z%
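Turning such a distribution into a simulated population is a weighted sampling step. A sketch with placeholder weights (the real shares x, y, z come from Statsig and are not reproduced here):

```python
# Illustrative only: sample a country attribute for each digital twin from a
# population-level distribution. The weights below are placeholders, not
# Explainpaper's real country shares.
import random

random.seed(0)  # reproducible example
country_weights = {"United States": 0.7, "Germany": 0.15, "United Kingdom": 0.15}
population = random.choices(
    list(country_weights), weights=list(country_weights.values()), k=2000
)
# Each simulated digital twin now carries a country attribute drawn from the
# population-level distribution.
```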

Simulations

We ran eight simulations, progressively adding complexity one component at a time, and compared results before and after each addition to measure its impact on simulation accuracy.

  1. Simulation I: LLM expert with screenshots

    Rather than simulating real browser behavior (observe, reason, act), we give the LLM static screenshots of the Control and Treatment website variants and ask whether it would proceed to the Pricing and Signup pages. No user attributes or personas are provided. This is akin to simply asking ChatGPT to compare screenshots.

  2. Simulation II: Extreme homogeneity

    The first browser-based simulation, where we simulate user behavior in a real web browser but treat all users as generic by providing no user attributes or personas.

  3. Simulation III: Geographic generalization

    Introduce country of origin as a user attribute for the first time, with a simplified population where 100% of simulated users are from the United States.

  4. Simulation IV: Real world attributes

    Use the true country-of-origin attribute distribution to construct the simulated user population, sourced directly from the Statsig integration.

  5. Simulation V: Real world attributes and generated attributes

    In addition to the true country-of-origin distribution, introduce age and sex attributes by using an LLM to estimate plausible population-level distributions for the simulated user population.

  6. Simulation VI: Persona archetypes only

    Remove all user attributes and provide only the LLM-generated persona archetypes for this website.

  7. Simulation VII: Real world attributes and generated personas

    Revert to using only the true country-of-origin user attribute, and introduce personas by having an LLM generate a small set of representative user archetypes along with their population distribution.

  8. Simulation VIII: Real world attributes, generated attributes, and generated personas

    Include all components: the true country-of-origin distribution, LLM-generated age and sex attribute distributions, and LLM-generated personas (user archetypes and their distribution).
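The eight configurations can be summarized as toggles over four ingredients. The shorthand labels below are ours, not Replica's internal identifiers:

```python
# Shorthand ablation grid for the eight simulations. "US-only" marks
# Simulation III's simplified 100%-US population; labels are descriptive
# shorthand, not Replica's internal names.
SIMULATIONS = {
    "I":    {"browser": False, "real_attrs": False,     "gen_attrs": False, "personas": False},
    "II":   {"browser": True,  "real_attrs": False,     "gen_attrs": False, "personas": False},
    "III":  {"browser": True,  "real_attrs": "US-only", "gen_attrs": False, "personas": False},
    "IV":   {"browser": True,  "real_attrs": True,      "gen_attrs": False, "personas": False},
    "V":    {"browser": True,  "real_attrs": True,      "gen_attrs": True,  "personas": False},
    "VI":   {"browser": True,  "real_attrs": False,     "gen_attrs": False, "personas": True},
    "VII":  {"browser": True,  "real_attrs": True,      "gen_attrs": False, "personas": True},
    "VIII": {"browser": True,  "real_attrs": True,      "gen_attrs": True,  "personas": True},
}
```

Reading the grid column by column makes the ablation pairs explicit: IV vs. VII and V vs. VIII isolate personas, while II vs. III, IV, and V isolates attributes.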

Results

Ablation Analysis

We ran each simulation with 2,000 runs in Control and 2,000 runs in Treatment. The specific attribute and persona inputs for each simulation are listed in the Appendix.

Each results table reports three quantities:

  1. Relative Lift %: the two-sided 95% confidence interval for relative lift, reported as [lower, upper].
  2. Abs Error: the signed difference between each simulated interval endpoint and the corresponding ground-truth endpoint.
  3. Endpoint RMSE: a single-number error metric measuring how far a simulated confidence interval is from the ground-truth interval, computed over the lower and upper endpoints. We use it to assess the overall performance of each simulation.
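Concretely, Endpoint RMSE is the root-mean-square of the two endpoint differences. The sketch below applies it to Simulation I's Landing → Pricing interval from the tables that follow:

```python
# Endpoint RMSE: root-mean-square of the differences between the simulated
# CI endpoints and the ground-truth CI endpoints (values in relative-lift %).
import math

def endpoint_rmse(sim, truth):
    return math.sqrt(((sim[0] - truth[0]) ** 2 + (sim[1] - truth[1]) ** 2) / 2)

ground_truth = (-11.01, 5.76)   # Statsig A/B test, Landing -> Pricing
simulation_1 = (-52.18, 2.94)   # Simulation I: LLM expert with screenshots

# Signed per-endpoint differences (the "Abs Error" column).
abs_error = (simulation_1[0] - ground_truth[0], simulation_1[1] - ground_truth[1])
rmse = endpoint_rmse(simulation_1, ground_truth)
# rmse comes out to roughly 29.18, matching the ~29.17% in the table (the small
# gap is rounding in the published interval endpoints).
```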

Conversion Landing → Pricing

| Simulation | Relative Lift % | Abs Error | Endpoint RMSE |
| --- | --- | --- | --- |
| Real world ground truth (Statsig A/B test) | [-11.01%, 5.76%] | [0%, 0%] | 0% |
| Simulation I: LLM expert with screenshots | [-52.18%, 2.94%] | [-41.16%, -2.82%] | 29.17% |
| Simulation II: Extreme homogeneity | [-12.53%, -2.60%] | [-1.52%, -8.36%] | 6.01% |
| Simulation III: Geographic generalization | [-9.29%, 0.75%] | [1.73%, -5.01%] | 3.75% |
| Simulation IV: Real world attributes | [-8.48%, 2.22%] | [2.53%, -3.54%] | 3.08% |
| Simulation V: Real world + generated attributes | [-12.55%, -1.99%] | [-1.54%, -7.75%] | 5.59% |
| Simulation VI: Persona archetypes only | [-12.50%, -1.09%] | [-1.49%, -6.85%] | 4.96% |
| Simulation VII: Real world attributes + generated personas | [-8.82%, 4.43%] | [2.19%, -1.33%] | 1.81% |
| Simulation VIII: All components | [-11.90%, 0.99%] | [-0.88%, -4.77%] | 3.43% |

Endpoint RMSE — Conversion Landing → Pricing

Conversion Pricing → Signup

| Simulation | Relative Lift % | Abs Error | Endpoint RMSE |
| --- | --- | --- | --- |
| Real world ground truth (Statsig A/B test) | [-25.87%, 9.60%] | [0%, 0%] | 0% |
| Simulation I: LLM expert with screenshots | [0.00%, 0.00%] | [25.87%, -9.60%] | 19.51% |
| Simulation II: Extreme homogeneity | [-7.38%, -1.67%] | [18.49%, -11.26%] | 15.31% |
| Simulation III: Geographic generalization | [-7.44%, -1.90%] | [18.43%, -11.50%] | 15.36% |
| Simulation IV: Real world attributes | [-5.12%, 1.08%] | [20.74%, -8.52%] | 15.86% |
| Simulation V: Real world + generated attributes | [-7.65%, -1.80%] | [18.21%, -11.40%] | 15.19% |
| Simulation VI: Persona archetypes only | [-7.43%, -1.46%] | [18.44%, -11.05%] | 15.20% |
| Simulation VII: Real world attributes + generated personas | [-7.81%, -1.21%] | [18.06%, -10.81%] | 14.88% |
| Simulation VIII: All components | [-9.93%, -3.64%] | [15.94%, -13.24%] | 14.65% |

Endpoint RMSE — Conversion Pricing → Signup

Key Observations

1. Browser-based simulations significantly outperform LLM screenshot heuristics

The simplest browser-based simulation (Simulation II) significantly outperforms the simplified LLM expert heuristic (Simulation I).

This suggests that simulating real browser behavior (observe, reason, act) is needed to accurately predict A/B test outcomes on websites.

2. Persona archetypes clearly improve accuracy

There was a clear improvement in accuracy for both metrics with the addition of persona archetypes.

  • From Simulation IV (real world attributes) to Simulation VII (real world attributes and generated persona archetypes), the addition of generated personas improved Conv_Landing_To_Pricing's Endpoint RMSE from 3.08% to 1.81% and Conv_Pricing_To_Signup's Endpoint RMSE from 15.86% to 14.88%.
  • From Simulation V (real world attributes + generated attributes) to Simulation VIII (real world attributes + generated attributes + generated persona archetypes), the addition of generated personas improved Conv_Landing_To_Pricing's Endpoint RMSE from 5.59% to 3.43% and Conv_Pricing_To_Signup's Endpoint RMSE from 15.19% to 14.65%.

3. More user data improves overall simulation accuracy

Overall, the addition of more user data in the form of user attributes or persona archetypes improved the accuracy of the simulations.

From Simulation II (no attributes or persona archetypes) to Simulation VIII (real world attributes + generated attributes + generated persona archetypes), the addition of user data improved Conv_Landing_To_Pricing's Endpoint RMSE from 6.01% to 3.43% and Conv_Pricing_To_Signup's Endpoint RMSE from 15.31% to 14.65%.

4. User attributes alone have minimal impact

The inclusion of user attributes (Simulations III, IV, V) minimally improved both metrics compared to the baseline simulation without any attributes (Simulation II).

  • There was no clear pattern of improvement for Conv_Landing_To_Pricing as we added real and generated attributes, with ultimately only a 0.42 percentage-point improvement in Endpoint RMSE, from 6.01% (Simulation II) to 5.59% (Simulation V).
  • There was only a minimal improvement for Conv_Pricing_To_Signup as we added real and generated attributes: a 0.12 percentage-point improvement in Endpoint RMSE, from 15.31% (Simulation II) to 15.19% (Simulation V).

Conclusion and Next Steps

  1. Browser-based simulations are significantly more accurate than simply giving an LLM screenshots – simulating actions in a real web browser is necessary to accurately model user behavior.
  2. Persona archetypes clearly improve simulation accuracy, so the generation of accurate persona archetypes should be investigated further.
  3. User attributes (e.g. sex, age, country of origin) yield only a small improvement in simulation accuracy, so they should be included when convenient.