Fine-tuning on just 5,000 real sessions improves action prediction accuracy by 70% and click precision by 2.5x over base GPT-4o — enabling simulated users that browse, hesitate, and drop off like real users.
Replica builds accurate digital twins of real users — AI models that simulate how actual people browse, click, scroll, hesitate, and leave websites. Unlike chatbots or task-completion agents, a digital twin doesn't use a website to accomplish a goal. It behaves like a specific type of user would: exploring, getting confused, abandoning checkout, or converting — all with the timing, spatial patterns, and decision-making of a real human. These digital twins power use cases like A/B test simulation, prelaunch validation, and UX research — letting companies test product changes against realistic synthetic users before deploying to real ones.
The foundation of any digital twin is behavioral fidelity. Generic foundation models can browse websites, but they don't behave like real users. They're too systematic, too fast, too "perfect." They click through forms with machine-like precision, never hesitate before a signup button, never scroll past content they find uninteresting, and never bounce from a page that loads slowly. If the digital twin isn't accurate, the simulations aren't trustworthy.
Our approach: fine-tune multimodal vision-language models on a company's actual recorded user sessions. Each training example includes screenshots at every decision point, a rich action space (clicks with spatial coordinates, scrolls, text inputs, session termination), timing signals that capture cognitive load, and chain-of-thought reasoning that captures why the user acted. The model learns real decision-making patterns — hesitation, exploration, drop-offs, and mistakes — not idealized browsing behavior.
Fine-tuning on just 5,000 real sessions improves action prediction accuracy by 70% and click precision by 2.5x over base GPT-4o.
Companies already record user sessions. Replica maps those recordings to user attributes and builds segment-specific fine-tuned models — segmented by geography, device type, demographics, or any attribute the company tracks. This enables A/B test simulations with digital twins that reflect how specific user segments actually behave, not just an average user. And because the pipeline is standardized, a company can go from raw session recordings to production-ready digital twins without building ML infrastructure.
The bigger picture: aggregate anonymized behavioral data across companies to build reusable, attribute-conditioned models — "US women aged 20-29 on mobile" as a pretrained behavioral prior — available even to companies without enough internal data. Cross-company benchmarking and attribute-level behavioral analysis at scale.
In this article, we walk through the pipeline, the metrics, and the results — using Explainpaper (82,148 recorded sessions) as our case study.
A digital twin needs a language for describing what users do. We designed a structured action schema that captures the core range of real browsing behavior — clicks, scrolls, text input, and the critical decision to leave.
This schema is deliberately simplified for this initial study. Real user behavior includes hover events, drag-and-drop, multi-touch gestures, right-clicks, keyboard shortcuts, tab switches, and more. We chose to start with the four action types that account for the vast majority of meaningful interactions, and plan to expand the schema in future work. The current design validates the core hypothesis — that fine-tuning on real sessions improves behavioral fidelity — without the complexity of a fully exhaustive action vocabulary.
| Field | Click | Scroll | Input | Terminate |
|---|---|---|---|---|
| type | "click" | "scroll" | "input" | "terminate" |
| timing | quick / normal / delayed | quick / normal / delayed | — | — |
| label | element text (max 50 chars) | — | field label (max 50 chars) | — |
| x | 0.0–1.0 (viewport-normalized) | — | — | — |
| y | 0.0–1.0 (viewport-normalized) | — | — | — |
| direction | — | up / down | — | — |
| distance | — | viewport heights (float) | — | — |
| value | — | — | [QUERY] / [TEXT] / [SENSITIVE] | — |
Every action is a JSON object:
{"type":"click", "timing":"delayed", "label":"Sign Up", "x":0.72, "y":0.45}
{"type":"scroll", "timing":"quick", "direction":"down", "distance":1.5}
{"type":"input", "label":"Search papers", "value":"[QUERY]"}
{"type":"terminate"}
Several design decisions make this schema meaningful for digital twin fidelity:
terminate is a prediction the model makes, not a timeout.

The pipeline transforms raw session recordings into multimodal fine-tuning examples:
We use GPT-4o-mini to synthesize a 1–2 sentence first-person chain-of-thought for each action. The model receives the screenshot the user saw and the action they took, and generates a natural-language explanation of the user's likely thinking:
Prompt: "You are a real person visiting explainpaper.com. You are given a screenshot of what you currently see and the action you took. Write a brief first-person rationale (1-2 sentences) explaining your thinking."
Input: [screenshot] + {"type":"click","label":"Sign Up","x":0.72,"y":0.45}
Output: "I've read through the features and this looks useful for my research papers. I'll sign up to try it out."
This synthesized rationale becomes the reasoning field in each training example. At inference time, the model generates both a rationale and an action — the reasoning grounds its prediction in visual context and makes the action more interpretable. We achieve a 99.4% synthesis success rate across 50 concurrent requests, with the remaining 0.6% falling back to empty rationales.
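A minimal sketch of that synthesis step, assuming the OpenAI Python SDK's async client, base64-encoded screenshots, and a semaphore capping concurrency at 50 (the helper name and fallback behavior are illustrative):

```python
import asyncio
import base64
import json
from openai import AsyncOpenAI

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(50)  # cap at 50 concurrent synthesis requests

PROMPT = (
    "You are a real person visiting explainpaper.com. You are given a screenshot "
    "of what you currently see and the action you took. Write a brief first-person "
    "rationale (1-2 sentences) explaining your thinking."
)

async def synthesize_rationale(screenshot_path: str, action: dict) -> str:
    """Generate a first-person rationale for one (screenshot, action) pair.
    Falls back to an empty rationale if the request fails."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    async with semaphore:
        try:
            resp = await client.chat.completions.create(
                model="gpt-4o-mini",
                max_tokens=80,
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text",
                         "text": f"{PROMPT}\n\nAction taken: {json.dumps(action)}"},
                        {"type": "image_url", "image_url": {
                            "url": f"data:image/jpeg;base64,{image_b64}",
                            "detail": "low",
                        }},
                    ],
                }],
            )
            return (resp.choices[0].message.content or "").strip()
        except Exception:
            return ""  # the ~0.6% failure case: empty rationale
```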
| Stage | Count | Notes |
|---|---|---|
| Total sessions recorded | 82,148 | All Explainpaper traffic |
| Real sessions from landing page | 21,610 | Filtered: project_key = '004', url = explainpaper.com |
| High-quality training examples | 14,236 | After filtering bots, bounces, mobile, rage clicks |
| Average events per session | 13.5 | — |
Train/val/test split (by session ID, seed=42): 11,503 / 1,461 / 1,272 (80/10/10).
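A sketch of that split, assuming each training example records the session it came from (the `session_id` field name is illustrative). Splitting by session ID keeps every step of a session in the same set:

```python
import random

def split_by_session(examples, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Split examples into train/val/test at the session level so that
    steps from a single session never leak across sets."""
    session_ids = sorted({ex["session_id"] for ex in examples})
    random.Random(seed).shuffle(session_ids)

    n_train = int(len(session_ids) * ratios[0])
    n_val = int(len(session_ids) * ratios[1])
    train_ids = set(session_ids[:n_train])
    val_ids = set(session_ids[n_train:n_train + n_val])

    train = [ex for ex in examples if ex["session_id"] in train_ids]
    val = [ex for ex in examples if ex["session_id"] in val_ids]
    test = [ex for ex in examples
            if ex["session_id"] not in train_ids and ex["session_id"] not in val_ids]
    return train, val, test
```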
This study makes several simplifying assumptions worth calling out:
Each training example is a single conversation turn: the system prompt establishes the user persona, the user message contains the full browsing history (interleaved screenshots and actions), and the assistant message is the next action to predict.
```json
{
"messages": [
{
"role": "system",
"content": "You are a real user browsing a website...\nWebsite: https://www.explainpaper.com/\nYour goal: Decide whether to sign up for an account\n..."
},
{
"role": "user",
"content": [
{"type": "text", "text": "{\"website\":\"https://www.explainpaper.com/\",\"viewport\":{\"width\":1512,\"height\":816}}"},
{"type": "image_url", "image_url": {"url": "step_1.jpg", "detail": "low"}},
{"type": "text", "text": "Action 1:\n{\"reasoning\":\"I just landed on the page. Let me scroll down...\",\"action\":{\"type\":\"scroll\",\"timing\":\"quick\",\"direction\":\"down\",\"distance\":1.2}}"},
{"type": "image_url", "image_url": {"url": "step_2.jpg", "detail": "low"}},
{"type": "text", "text": "Action 2:\n{\"reasoning\":\"I can see the feature list now...\",\"action\":{\"type\":\"click\",\"timing\":\"delayed\",\"label\":\"Try it\",\"x\":0.50,\"y\":0.68}}"},
{"type": "image_url", "image_url": {"url": "step_3.jpg", "detail": "low"}},
{"type": "text", "text": "Step 3, 12s elapsed. Next action:"}
]
},
{
"role": "assistant",
"content": "{\"reasoning\":\"I see the sign-up form now...\",\"action\":{\"type\":\"click\",\"timing\":\"normal\",\"label\":\"Create Account\",\"x\":0.52,\"y\":0.71}}"
}
]
}
```

Key design decisions:
Each configuration (sample size × epoch count: 1k/2k/5k training examples, 1–3 epochs) produces a checkpoint, giving us 9 fine-tuned models plus the baseline — 10 models in total, all evaluated on the same test set.
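A sketch of launching that grid with the OpenAI fine-tuning API, assuming the 1k/2k/5k JSONL training files have already been uploaded (the file IDs and base-model snapshot below are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder file IDs for the uploaded 1k/2k/5k JSONL training sets.
training_files = {"1k": "file-xxxx1", "2k": "file-xxxx2", "5k": "file-xxxx3"}

jobs = {}
for size, file_id in training_files.items():
    # One job per sample size, trained for 3 epochs; the per-epoch
    # checkpoints give the 3 sizes x 3 epochs grid of fine-tuned models.
    jobs[size] = client.fine_tuning.jobs.create(
        model="gpt-4o-2024-08-06",   # assumed vision-capable base snapshot
        training_file=file_id,
        hyperparameters={"n_epochs": 3},
    )
    print(size, jobs[size].id)
```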
Every fine-tuned model — regardless of sample size or epoch — beats the baseline on nearly every metric. Here's the full picture:
| Model | Exact Match | Type Acc. | Click Error | Scroll Dir. | Scroll MAE | Input Acc. | Label Match |
|---|---|---|---|---|---|---|---|
| Baseline GPT-4o | 12.4% | 44.5% | 0.282 | 22.9% | 0.908 | 95.8% | 45.8% |
| 1k epoch 1 | 13.9% | 52.8% | 0.132 | 76.7% | 0.736 | 98.8% | 67.2% |
| 1k epoch 2 | 18.7% | 55.8% | 0.136 | 85.0% | 0.664 | 99.4% | 68.6% |
| 1k epoch 3 | 15.6% | 56.2% | 0.129 | 88.9% | 0.711 | 98.8% | 68.7% |
| 2k epoch 1 | 13.5% | 51.9% | 0.129 | 87.4% | 0.677 | 98.6% | 66.7% |
| 2k epoch 2 | 16.0% | 55.9% | 0.124 | 93.7% | 0.751 | 99.4% | 70.3% |
| 2k epoch 3 | 17.6% | 57.2% | 0.132 | 92.4% | 0.667 | 98.8% | 68.7% |
| 5k epoch 1 | 15.2% | 52.4% | 0.141 | 90.6% | 0.718 | 99.4% | 70.8% |
| 5k epoch 2 | 21.1% | 60.7% | 0.115 | 94.9% | 0.684 | 98.8% | 71.1% |
| 5k epoch 3 | 20.5% | 61.7% | 0.120 | 94.5% | 0.687 | 99.4% | 72.4% |
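Most of these metrics are mechanical to compute from paired (predicted, actual) actions. A sketch, under the assumptions that click error is Euclidean distance in normalized viewport coordinates, scroll MAE is in viewport heights, and both are measured only on steps where prediction and ground truth share the action type (exact match, which requires the full action to agree, is omitted here):

```python
import math

def click_error(pred: dict, actual: dict) -> float:
    """Euclidean distance between predicted and actual click, in normalized viewport units."""
    return math.hypot(pred["x"] - actual["x"], pred["y"] - actual["y"])

def evaluate(pairs):
    """pairs: list of (predicted_action, actual_action) dicts from the test set."""
    clicks = [(p, a) for p, a in pairs if p["type"] == a["type"] == "click"]
    scrolls = [(p, a) for p, a in pairs if p["type"] == a["type"] == "scroll"]
    return {
        "type_accuracy": sum(p["type"] == a["type"] for p, a in pairs) / len(pairs),
        "click_error": sum(click_error(p, a) for p, a in clicks) / max(len(clicks), 1),
        "scroll_direction_acc": sum(p["direction"] == a["direction"]
                                    for p, a in scrolls) / max(len(scrolls), 1),
        "scroll_mae": sum(abs(p["distance"] - a["distance"])
                          for p, a in scrolls) / max(len(scrolls), 1),
    }
```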
The best model (5k, epoch 2) improves over baseline by +70% on exact match, 2.5x on click precision, and 4.1x on scroll direction. But even the weakest fine-tuned model (1k epoch 1) already cuts click error by more than half and triples scroll direction accuracy. Fine-tuning provides a consistent, dramatic improvement — the question is how much, not whether.
Exact Match Accuracy by Model
But the most revealing insight is in the action type distribution. Baseline GPT-4o predicts "click" for 67% of actions — even when the user scrolled or left the page. It barely produces scroll or terminate predictions. The fine-tuned model learns the real distribution:
| Action type | Ground truth | Baseline predicted | Fine-tuned predicted |
|---|---|---|---|
| Click | 41.2% | 67.2% | 33.4% |
| Scroll | 19.3% | 4.2% | 27.6% |
| Input | 15.4% | 17.7% | 19.1% |
| Terminate | 24.1% | 10.9% | 19.9% |
Action Type Distribution: Ground Truth vs. Predicted (%)
The fine-tuned model's distribution is far closer to reality. It understands that real users scroll, leave, and type — not just click.
| Action Type | Baseline F1 | Fine-tuned F1 | Improvement |
|---|---|---|---|
| Click | 53.5% | 69.8% | +30% |
| Scroll | 23.4% | 56.9% | +143% |
| Input | 55.9% | 72.5% | +30% |
| Terminate | 19.8% | 33.4% | +69% |
F1 Score by Action Type: Baseline vs. Fine-tuned
The confusion matrices tell the story. Baseline GPT-4o's dominant failure mode: it predicts "click" when the user actually scrolled (194 out of 245 scroll actions misclassified as clicks) or terminated (236 out of 306 terminates misclassified as clicks). The fine-tuned model corrects both failure modes.
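Both the per-type F1 scores and the confusion matrices come straight from the paired actual and predicted action types. A short sketch using scikit-learn (variable names are illustrative):

```python
from sklearn.metrics import classification_report, confusion_matrix

ACTION_TYPES = ["click", "scroll", "input", "terminate"]

def action_type_report(actual_types: list[str], predicted_types: list[str]) -> None:
    """Per-type precision/recall/F1 plus the raw confusion matrix."""
    print(classification_report(actual_types, predicted_types,
                                labels=ACTION_TYPES, zero_division=0))
    print(confusion_matrix(actual_types, predicted_types, labels=ACTION_TYPES))
```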
The biggest gain is scroll prediction — F1 jumps from 23% to 57%, a 143% improvement. Scrolling is invisible to text-only models but obvious in screenshots: the model learns that when a user sees a long page with content below the fold, they scroll.
| Training Size | Epoch 1 | Epoch 2 | Epoch 3 |
|---|---|---|---|
| 1k | 13.9% | 18.7% | 15.6% |
| 2k | 13.5% | 16.0% | 17.6% |
| 5k | 15.2% | 21.1% | 20.5% |
Exact Match by Training Size and Epoch
Three patterns emerge:
Not all predictions are equally hard. Segmenting by context reveals where the model excels and where it struggles:
| Position | Steps | Baseline | Fine-tuned (5k) |
|---|---|---|---|
| Early (1–5) | 683 | 10.3% | 16.4% |
| Mid (6–15) | 555 | 14.5% | 25.4% |
| Late (16+) | 25 | 23.1% | 56.0% |
Exact Match by Session Position: Baseline vs. Fine-tuned (5k)
Early actions are the hardest — the model has little context about this user's intent. But as the session progresses and the model accumulates action history + screenshots, it gets dramatically better. At step 16+, the fine-tuned model achieves 56% exact match — it understands where this user has been and can predict where they're going.
| Condition | Baseline | Fine-tuned (5k) |
|---|---|---|
| Actions with visible text labels | 17.0% | 29.2% |
| Actions without labels | 0.0% | 0.0% |
Exact Match by Label Availability: Baseline vs. Fine-tuned (5k)
When the model can match the action to a visible text label ("Sign Up", "Search papers"), it performs well. Without a label (e.g., clicking a blank area or an icon without text), both baseline and fine-tuned models struggle. This suggests that label-rich UI designs are inherently more predictable — a useful insight for UX evaluation.
Timing accuracy: 54.2% overall for the fine-tuned model (vs. 21.2% baseline). The model learns that users pause before high-commitment actions and move quickly through familiar interfaces — capturing the human pace that makes a digital twin feel realistic.
The numbers tell one story. Concrete predictions tell another.
A user lands on Explainpaper's landing page and reads the hero section. The fine-tuned model predicts a delayed click on "Try it" — matching the real user's 7-second pause before committing. Baseline GPT-4o predicted a quick click, as if the user already knew what they wanted.
After reading a paragraph, the real user scrolled down 1.5 viewport heights to see more features, then clicked "Sign Up." Baseline GPT-4o predicted a click on the navigation bar (skipping the scroll entirely). The fine-tuned model correctly predicted {"type":"scroll", "timing":"normal", "direction":"down", "distance":1.3} — not pixel-perfect, but the right behavior.
The model predicted a click on the "Upload PDF" button when the user actually clicked "Sign Up" (adjacent elements in the top-right corner). The click coordinates were close (0.05 distance) but the label was wrong. This reveals a known limitation: when multiple interactive elements are spatially close, the model captures the region of interest but sometimes picks the wrong target.
The key insight: the model learns distributions of human behavior, not deterministic playbooks. Given the same page state, it might predict "scroll down" 60% of the time and "click Sign Up" 40% of the time — reflecting the real split in user behavior. This is exactly what makes it useful for simulation: run 1,000 sessions and get a realistic distribution of outcomes, not 1,000 identical bot runs.
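In practice, that means sampling the fine-tuned model repeatedly at a nonzero temperature on the same page state and reading off the resulting distribution. A sketch, assuming the model returns the reasoning-plus-action JSON format shown earlier (the fine-tuned model ID is a placeholder):

```python
import json
from collections import Counter
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-2024-08-06:replica::placeholder"  # placeholder fine-tuned model ID

def sample_action_distribution(messages: list, n: int = 1000) -> Counter:
    """Sample the digital twin n times on the same page state and tally the
    action types it chooses: a distribution of outcomes, not one answer."""
    outcomes = Counter()
    for _ in range(n):
        resp = client.chat.completions.create(
            model=MODEL, messages=messages, temperature=1.0, max_tokens=200,
        )
        action = json.loads(resp.choices[0].message.content)["action"]
        outcomes[action["type"]] += 1
    return outcomes
```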
Replica's digital twins are only as useful as they are accurate. This research shows that fine-tuning produces a step-change in behavioral fidelity — every metric improves substantially, and the model learns behavioral patterns that generic foundation models simply cannot capture.
Run experiments against fine-tuned digital twins that click, scroll, hesitate, and drop off like real users. Test conversion impact before shipping to production. When the model predicts that 40% of users will bounce from a new pricing page layout — and the predicted click heatmap matches real user patterns — you can trust the simulation.
Simulate thousands of realistic user sessions on a new feature or redesign. Identify UX issues, drop-off points, and navigation confusion before any real user sees it. The 2.5x improvement in click precision means the model targets the right UI elements, and the 4.1x improvement in scroll direction means it navigates pages the way real users do.
Fine-tune on subsets of users sharing attributes — geography, device type, income bracket, new vs. returning — and simulate how specific audiences will respond. The pipeline is standardized: map session_id → user_id → user_attributes, filter training data by segment, fine-tune separate models. Mobile users scroll more and click less. First-time visitors explore more broadly. Enterprise users go straight to pricing. Each segment gets a model that captures its behavioral signature.
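A sketch of that segment-filtering step with pandas (file and column names are illustrative; in practice they come from the company's own session store and attribute tables):

```python
import pandas as pd

# Illustrative inputs: per-step training examples, a session -> user mapping,
# and whatever attributes the company already tracks per user.
examples = pd.read_json("training_examples.jsonl", lines=True)  # includes session_id
sessions = pd.read_csv("sessions.csv")                          # session_id -> user_id
attributes = pd.read_csv("user_attributes.csv")                 # user_id -> country, device, ...

joined = (examples
          .merge(sessions[["session_id", "user_id"]], on="session_id")
          .merge(attributes, on="user_id"))

# One training file per segment; each is fine-tuned into its own digital twin.
us_mobile = joined[(joined["country"] == "US") & (joined["device"] == "mobile")]
us_mobile.to_json("train_us_mobile.jsonl", orient="records", lines=True)
```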
Companies already collect session recordings via tools like FullStory, Hotjar, and LogRocket. Replica ingests these recordings and maps them to user attributes to build attribute-conditioned digital twins from your existing data — no new instrumentation required.
We validated the core hypothesis: fine-tuning on real sessions dramatically improves behavioral fidelity. Several directions follow:
The signal is clear: more data helps, visual context helps, and real sessions contain learnable structure that foundation models miss. The path to accurate digital twins runs through fine-tuning — and we're just getting started.