A systematic ablation study examining how population-level attributes and persona archetypes impact the accuracy of Replica’s A/B test simulations.
Ablation analysis is a structured way to understand which components of a system are actually contributing to its performance. Rather than evaluating a system as a single black box, ablation removes or modifies one component at a time while keeping everything else fixed, then observes how the system's behavior changes.
The intuition is simple: if performance degrades when a component is removed, that component is providing real signal. If performance remains largely unchanged, the component is likely redundant. We use ablation to determine which components are worth further investment based on their demonstrated impact.
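In code, an ablation sweep is just a loop that drops one component at a time from a baseline configuration and re-scores the system. This is a minimal sketch; the component names and the `run_simulation` callable are hypothetical placeholders, not Replica's actual API:

```python
# Minimal ablation sweep sketch. Component names and run_simulation
# are illustrative placeholders, not Replica's real interface.
COMPONENTS = ["country_attribute", "generated_attributes", "personas"]

def ablation_sweep(run_simulation, baseline_config):
    """Score the full configuration, then re-score with each
    component removed while holding everything else fixed."""
    results = {"baseline": run_simulation(baseline_config)}
    for component in COMPONENTS:
        reduced = {k: v for k, v in baseline_config.items() if k != component}
        results[f"without_{component}"] = run_simulation(reduced)
    return results
```

Comparing each `without_*` score against the baseline shows which components carry real signal and which are redundant.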
Replica creates digital twins that simulate real user segments such that their session-level reasoning, navigation paths, and decision outcomes closely resemble real observed desktop and mobile behavior.
Replica simulates A/B test outcomes by modeling how these digital twins, designed to resemble real users, interact with different website variants. These digital twins are created from several components: population-level user attributes, persona archetypes, a browser interaction model, and task framing.
A simulation may closely match a real A/B test, but without isolating individual components, it's impossible to understand their relative impact. That leaves us guessing about what to improve, what's over-engineered, and where additional effort will meaningfully move the needle.
For this study, we focus on ablating population-level attributes and personas, while holding the browser interaction model and task framing constant. This allows us to isolate how modeling who users are and what they're trying to do affects A/B test prediction accuracy.
Explainpaper is a research platform that helps readers understand complex academic papers, with over 400,000 users worldwide. As part of their growth efforts, the team set out to improve their new user signup funnel – one of the highest-leverage parts of their website.
In a recent pilot, Explainpaper hypothesized that emphasizing how their product makes papers easier to understand, rather than simply faster to read, would increase new user signups. To test this, they launched an A/B experiment using Statsig, comparing two website variants with differences in wording and visuals. The experiment focused on two key funnel stages: conversion from the Landing page to the Pricing page, and conversion from the Pricing page to Signup.
We simulated this A/B test using Replica and observed directionally accurate results relative to the real-world experiment. In the sections that follow, we use this same A/B test as the foundation for a series of ablation experiments to understand which components of Replica are responsible for that accuracy.
This A/B test evaluated two primary metrics tied to key stages of the new user signup funnel: conversion from the Landing page to the Pricing page (Conv_Landing_To_Pricing) and conversion from the Pricing page to Signup (Conv_Pricing_To_Signup).
The only real user attribute that Explainpaper tracks is country of origin, which Statsig automatically infers from a user's device. This country-level distribution can be used directly when generating the population of AI agents, allowing the simulated digital twins to more closely reflect Explainpaper's real user base.
User attribute: Country of origin
We ran eight simulations, varying one component at a time, and compared results before and after each change to measure its impact on simulation accuracy.
Simulation I (LLM expert with screenshots): Rather than simulating real browser behavior (observe, reason, act), we give the LLM static screenshots of the Control and Treatment website variants and ask whether it would proceed to the Pricing and Signup pages. No user attributes or personas are provided; this is akin to simply asking ChatGPT to compare screenshots.

Simulation II (Extreme homogeneity): The first browser-based simulation, in which we simulate user behavior in a real web browser but treat all users as generic, providing no user attributes or personas.

Simulation III (Geographic generalization): Introduces country of origin as a user attribute for the first time, with a simplified population in which 100% of simulated users are from the United States.

Simulation IV (Real world attributes): Uses the true country-of-origin attribute distribution, sourced directly from the Statsig integration, to construct the simulated user population.

Simulation V (Real world + generated attributes): In addition to the true country-of-origin distribution, introduces age and sex attributes by using an LLM to estimate plausible population-level distributions for the simulated user population.

Simulation VI (Persona archetypes only): Removes all user attributes and provides only the LLM-generated persona archetypes for this website.

Simulation VII (Real world attributes + generated personas): Reverts to using only the true country-of-origin user attribute, and introduces personas by having an LLM generate a small set of representative user archetypes along with their population distribution.

Simulation VIII (All components): Includes everything: the true country-of-origin distribution, LLM-generated age and sex attribute distributions, and LLM-generated personas (user archetypes and their distribution).
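Concretely, constructing a simulated population from attribute and persona distributions can be sketched as independent draws per simulated user. The distributions below are illustrative placeholders, not Explainpaper's real data (the real country distribution comes from the Statsig integration, and the persona distribution from an LLM):

```python
import random

# Illustrative placeholder distributions, NOT Explainpaper's real data.
COUNTRY_DIST = {"US": 0.60, "IN": 0.15, "DE": 0.15, "BR": 0.10}
PERSONA_DIST = {"curious student": 0.5, "time-pressed researcher": 0.5}

def sample_population(n, seed=0):
    """Build n simulated users by drawing each attribute independently
    from its population-level distribution."""
    rng = random.Random(seed)
    users = []
    for _ in range(n):
        users.append({
            "country": rng.choices(list(COUNTRY_DIST), weights=list(COUNTRY_DIST.values()))[0],
            "persona": rng.choices(list(PERSONA_DIST), weights=list(PERSONA_DIST.values()))[0],
        })
    return users
```

Each ablation then amounts to dropping one of these distributions (or replacing it, e.g. with 100% United States for Simulation III) before sampling.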
We ran each simulation with 2,000 runs in Control and 2,000 runs in Treatment. The specific attribute and persona inputs for each simulation can be found in the Appendix.
We calculated the lower and upper Relative Lift % as the two-sided 95% confidence interval for relative lift.
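One standard way to compute such an interval from raw conversion counts is a log-normal approximation to the ratio of two proportions. This is a generic textbook sketch, not necessarily the exact method used by Replica or Statsig:

```python
import math

def lift_ci(conv_c, n_c, conv_t, n_t, z=1.96):
    """Two-sided ~95% CI for relative lift (p_t / p_c - 1), using a
    log-normal approximation for the ratio of two proportions.
    A generic approach; the pipeline's exact method may differ."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    log_ratio = math.log(p_t / p_c)
    se = math.sqrt((1 - p_t) / (n_t * p_t) + (1 - p_c) / (n_c * p_c))
    lower = math.exp(log_ratio - z * se) - 1
    upper = math.exp(log_ratio + z * se) - 1
    return lower, upper
```

With 2,000 runs per arm and equal conversion rates in both arms, this yields an interval roughly centered on zero, whose width shrinks as the number of runs grows.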
We calculated the Relative Lift Abs Error as the signed difference between each simulated interval endpoint and the corresponding ground-truth endpoint (simulated minus ground truth), reported for the lower and upper bounds.
Endpoint RMSE is a single-number error metric that measures how far a simulated confidence interval is from the ground-truth confidence interval, by comparing their lower and upper endpoints. We use it to assess overall performance of the simulations.
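The definition above corresponds to the root mean square of the two endpoint errors, which reproduces the reported values up to rounding of the published endpoints. A minimal sketch:

```python
import math

def endpoint_rmse(sim_ci, gt_ci):
    """RMSE over the lower and upper confidence-interval endpoints,
    in percentage points."""
    lo_err = sim_ci[0] - gt_ci[0]
    hi_err = sim_ci[1] - gt_ci[1]
    return math.sqrt((lo_err ** 2 + hi_err ** 2) / 2)

# Simulation II vs. ground truth for Landing -> Pricing,
# using the intervals reported in the results tables below.
error = endpoint_rmse((-12.53, -2.60), (-11.01, 5.76))
print(f"{error:.2f}%")  # 6.01%
```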
| Simulation | Relative Lift % | Abs Error | Endpoint RMSE |
|---|---|---|---|
| Real world ground truth Statsig A/B test | [-11.01%, 5.76%] | [0%, 0%] | 0% |
| Simulation I: LLM expert with screenshots | [-52.18%, 2.94%] | [-41.16%, -2.82%] | 29.17% |
| Simulation II: Extreme homogeneity | [-12.53%, -2.60%] | [-1.52%, -8.36%] | 6.01% |
| Simulation III: Geographic generalization | [-9.29%, 0.75%] | [1.73%, -5.01%] | 3.75% |
| Simulation IV: Real world attributes | [-8.48%, 2.22%] | [2.53%, -3.54%] | 3.08% |
| Simulation V: Real world + generated attributes | [-12.55%, -1.99%] | [-1.54%, -7.75%] | 5.59% |
| Simulation VI: Persona archetypes only | [-12.50%, -1.09%] | [-1.49%, -6.85%] | 4.96% |
| Simulation VII: Real world attributes + generated personas | [-8.82%, 4.43%] | [2.19%, -1.33%] | 1.81% |
| Simulation VIII: All components | [-11.90%, 0.99%] | [-0.88%, -4.77%] | 3.43% |
Endpoint RMSE — Conversion Landing → Pricing
| Simulation | Relative Lift % | Abs Error | Endpoint RMSE |
|---|---|---|---|
| Real world ground truth Statsig A/B test | [-25.87%, 9.60%] | [0%, 0%] | 0% |
| Simulation I: LLM expert with screenshots | [0.00%, 0.00%] | [25.87%, -9.60%] | 19.51% |
| Simulation II: Extreme homogeneity | [-7.38%, -1.67%] | [18.49%, -11.26%] | 15.31% |
| Simulation III: Geographic generalization | [-7.44%, -1.90%] | [18.43%, -11.50%] | 15.36% |
| Simulation IV: Real world attributes | [-5.12%, 1.08%] | [20.74%, -8.52%] | 15.86% |
| Simulation V: Real world + generated attributes | [-7.65%, -1.80%] | [18.21%, -11.40%] | 15.19% |
| Simulation VI: Persona archetypes only | [-7.43%, -1.46%] | [18.44%, -11.05%] | 15.20% |
| Simulation VII: Real world attributes + generated personas | [-7.81%, -1.21%] | [18.06%, -10.81%] | 14.88% |
| Simulation VIII: All components | [-9.93%, -3.64%] | [15.94%, -13.24%] | 14.65% |
Endpoint RMSE — Conversion Pricing → Signup
The simplest browser-based simulation (Simulation II) significantly outperforms the simplified LLM expert heuristic (Simulation I).
This suggests that simulating real browser behavior (observe, reason, act) is needed to accurately predict A/B test outcomes on websites.
There was a clear improvement in accuracy for both metrics with the addition of persona archetypes.
Overall, the addition of more user data in the form of user attributes or persona archetypes improved the accuracy of the simulations.
From Simulation II (no attributes or persona archetypes) to Simulation VIII (real world attributes + generated attributes + generated persona archetypes), the addition of user data improved Conv_Landing_To_Pricing's Endpoint RMSE from 6.01% to 3.43% and Conv_Pricing_To_Signup's Endpoint RMSE from 15.31% to 14.65%.
The inclusion of user attributes alone (Simulations III, IV, V) modestly improved the Landing → Pricing metric and left the Pricing → Signup metric roughly unchanged, compared to the baseline simulation without any attributes (Simulation II).