Research · Finetuning

Finetuning models on session recordings for realistic user behavior

Finetuning on just 5,000 real sessions improves action prediction accuracy by 70% and click precision by 2.5× over base GPT-4o — enabling simulated users that browse, hesitate, and drop off like real users.

  • 70% improvement in action prediction accuracy over base GPT-4o
  • 2.5× increase in click-target precision after finetuning
  • 5,000 real user sessions used in the finetuning corpus

Replica & the Digital Twin Problem

Replica builds accurate digital twins of real users — AI models that simulate how actual people browse, click, scroll, hesitate, and leave websites. Unlike chatbots or task-completion agents, a digital twin doesn't use a website to accomplish a goal. It behaves like a specific type of user would: exploring, getting confused, abandoning checkout, or converting — all with the timing, spatial patterns, and decision-making of a real human. These digital twins power use cases like A/B test simulation, prelaunch validation, and UX research — letting companies test product changes against realistic synthetic users before deploying to real ones.

The foundation of any digital twin is behavioral fidelity. Generic foundation models can browse websites, but they don't behave like real users. They're too systematic, too fast, too "perfect." They click through forms with machine-like precision, never hesitate before a signup button, never scroll past content they find uninteresting, and never bounce from a page that loads slowly. If the digital twin isn't accurate, the simulations aren't trustworthy.

Our approach: fine-tune multimodal vision-language models on a company's actual recorded user sessions. Each training example includes screenshots at every decision point, a rich action space (clicks with spatial coordinates, scrolls, text inputs, session termination), timing signals that capture cognitive load, and chain-of-thought reasoning that captures why the user acted. The model learns real decision-making patterns — hesitation, exploration, drop-offs, and mistakes — not idealized browsing behavior.

Fine-tuning on just 5,000 real sessions improves action prediction accuracy by 70% and click precision by 2.5x over base GPT-4o.

Companies already record user sessions. Replica maps those recordings to user attributes and builds segment-specific fine-tuned models — segmented by geography, device type, demographics, or any attribute the company tracks. This enables A/B test simulations with digital twins that reflect how specific user segments actually behave, not just an average user. And because the pipeline is standardized, a company can go from raw session recordings to production-ready digital twins without building ML infrastructure.

The bigger picture: aggregate anonymized behavioral data across companies to build reusable, attribute-conditioned models — "US women aged 20-29 on mobile" as a pretrained behavioral prior — available even to companies without enough internal data. This also unlocks cross-company benchmarking and attribute-level behavioral analysis at scale.

In this article, we walk through the pipeline, the metrics, and the results — using Explainpaper (82,148 recorded sessions) as our case study.

The Action Space: Defining "Realistic" Behavior

A digital twin needs a language for describing what users do. We designed a structured action schema that captures the core range of real browsing behavior — clicks, scrolls, text input, and the critical decision to leave.

This schema is deliberately simplified for this initial study. Real user behavior includes hover events, drag-and-drop, multi-touch gestures, right-clicks, keyboard shortcuts, tab switches, and more. We chose to start with the four action types that account for the vast majority of meaningful interactions, and plan to expand the schema in future work. The current design validates the core hypothesis — that fine-tuning on real sessions improves behavioral fidelity — without the complexity of a fully exhaustive action vocabulary.

Action Schema

| Field | Click | Scroll | Input | Terminate |
| --- | --- | --- | --- | --- |
| type | "click" | "scroll" | "input" | "terminate" |
| timing | quick / normal / delayed | quick / normal / delayed | | |
| label | element text (max 50 chars) | | field label (max 50 chars) | |
| x | 0.0–1.0 (viewport-normalized) | | | |
| y | 0.0–1.0 (viewport-normalized) | | | |
| direction | | up / down | | |
| distance | | viewport heights (float) | | |
| value | | | [QUERY] / [TEXT] / [SENSITIVE] | |

Every action is a JSON object:

{"type":"click", "timing":"delayed", "label":"Sign Up", "x":0.72, "y":0.45}

{"type":"scroll", "timing":"quick", "direction":"down", "distance":1.5}

{"type":"input", "label":"Search papers", "value":"[QUERY]"}

{"type":"terminate"}

Several design decisions make this schema meaningful for digital twin fidelity:

  1. Timing as cognitive signal. We bucket the time between actions into three categories: quick (<1s) for reflexive muscle memory, normal (1–5s) for scanning and reading, delayed (>5s) for thinking or deciding. This is what separates a digital twin from a bot — real users hesitate before signup buttons and breeze through familiar navigation. A user landing on Explainpaper might quickly scroll down (timing: "quick"), pause to read the feature list, then click "Try it" with a delayed timing — reflecting genuine deliberation. The model must learn these patterns, not just the click coordinates.
  2. Spatial precision. Click coordinates are normalized to (0, 1) relative to the viewport. The model must learn where users actually click — the center of a button vs. the edge, the nav bar vs. the hero section — not just what element they target. This matters because spatial patterns differ across user segments: experienced users click with precision, while first-time visitors explore more broadly.
  3. Input value categories. We classify text inputs as [QUERY] for search/filter boxes, [TEXT] for general form fields, and [SENSITIVE] for passwords. This preserves behavioral patterns (search-first vs. browse-first users) while protecting user data — no raw input text ever reaches the training set.
  4. Learned session termination. There's no hardcoded rule for when a session ends. The model learns drop-off patterns from real sessions: users who bounce after 3 seconds, users who leave after reading the pricing page, users who complete signup. terminate is a prediction the model makes, not a timeout.
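
A minimal sketch of the timing bucketing described in point 1 above, assuming the thresholds stated there (the function name is illustrative):

def timing_bucket(seconds_since_last_action: float) -> str:
    """Map inter-action delay to the schema's timing categories."""
    if seconds_since_last_action < 1.0:
        return "quick"      # reflexive muscle memory
    if seconds_since_last_action <= 5.0:
        return "normal"     # scanning / reading
    return "delayed"        # thinking or deciding

assert timing_bucket(0.4) == "quick"
assert timing_bucket(3.2) == "normal"
assert timing_bucket(7.0) == "delayed"   # e.g., hesitating before "Sign Up"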

Data Pipeline: Sessions to Training Examples

Architecture Overview

The pipeline transforms raw session recordings into multimodal fine-tuning examples:

  1. Session ingestion. Structured action data is pulled from Supabase — clean, ordered actions with scroll deltas, click coordinates, and timestamps. Raw rrweb events from GCS are used only for DOM label lookup and screenshot capture.
  2. Screenshot capture & privacy. Headless Playwright browsers replay each session via rrweb, capturing a screenshot at every action timestamp. DNN-based face detection blurs faces in screenshots; PII in text (emails, phone numbers) is masked before upload to S3.
  3. Rationale synthesis. This is a critical step. Raw session recordings contain what a user did (clicked here, scrolled there) but not why. A click on "Sign Up" could mean "I'm ready to commit" or "I want to see what's behind the wall." Without rationale, the model learns surface patterns — coordinate sequences — instead of decision-making.

We use GPT-4o-mini to synthesize a 1–2 sentence first-person chain-of-thought for each action. The model receives the screenshot the user saw and the action they took, and generates a natural-language explanation of the user's likely thinking:

Prompt: "You are a real person visiting explainpaper.com. You are given a screenshot of what you currently see and the action you took. Write a brief first-person rationale (1-2 sentences) explaining your thinking."

Input: [screenshot] + {"type":"click","label":"Sign Up","x":0.72,"y":0.45}

Output: "I've read through the features and this looks useful for my research papers. I'll sign up to try it out."

This synthesized rationale becomes the reasoning field in each training example. At inference time, the model generates both a rationale and an action — the reasoning grounds its prediction in visual context and makes the action more interpretable. We achieve a 99.4% synthesis success rate across 50 concurrent requests, with the remaining 0.6% falling back to empty rationales.
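
A sketch of what the synthesis call looks like with the OpenAI Python client. The prompt matches the one above; the helper name, token limit, and empty-string fallback are illustrative choices, not the exact production pipeline.

import base64
import json
from openai import OpenAI

client = OpenAI()

def synthesize_rationale(screenshot_path: str, action: dict) -> str:
    """Ask GPT-4o-mini for a 1-2 sentence first-person rationale; fall back to empty on failure."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": (
                    "You are a real person visiting explainpaper.com. You are given a screenshot "
                    "of what you currently see and the action you took. Write a brief first-person "
                    "rationale (1-2 sentences) explaining your thinking."
                )},
                {"role": "user", "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}", "detail": "low"}},
                    {"type": "text", "text": json.dumps(action)},
                ]},
            ],
            max_tokens=80,
        )
        return response.choices[0].message.content.strip()
    except Exception:
        return ""  # the ~0.6% failure case falls back to an empty rationale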

Scale

| Stage | Count | Notes |
| --- | --- | --- |
| Total sessions recorded | 82,148 | All Explainpaper traffic |
| Real sessions from landing page | 21,610 | Filtered: project_key = '004', url = explainpaper.com |
| High-quality training examples | 14,236 | After filtering bots, bounces, mobile, rage clicks |
| Average events per session | 13.5 | |

Train/val/test split (by session ID, seed=42): 11,503 / 1,461 / 1,272 (80/10/10).
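
Splitting by session ID (rather than by individual step) keeps every step of a session in the same split, so the model is never evaluated on a session it partially saw during training. A minimal sketch of such a split, assuming the stated seed and ratios (the exact shuffling code may differ):

import random

def split_by_session(session_ids: list[str], seed: int = 42):
    """Shuffle unique session IDs and split them 80/10/10 into train/val/test."""
    ids = sorted(set(session_ids))
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]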

Assumptions & Limitations

This study makes several simplifying assumptions worth calling out:

  • Desktop only. We filtered out all mobile sessions. Mobile browsing involves fundamentally different interaction patterns — tap vs. click, swipe vs. scroll, smaller viewports, different UI layouts. Building accurate mobile digital twins is a separate challenge we plan to address with device-specific models.
  • Single website. All results are from Explainpaper, a relatively simple content site with a linear conversion funnel. Results may differ on complex web apps (e.g., dashboards, multi-step workflows, e-commerce with product grids). We expect the approach to generalize, but the specific accuracy numbers are site-dependent.
  • Simplified action schema. As noted above, we model four action types. Real browsing includes hover events, drag-and-drop, keyboard shortcuts, multi-tab behavior, and more. These omissions mean certain user behaviors (e.g., hovering to preview tooltips, using Cmd+F to search) are invisible to the current model.
  • Single user goal. All sessions share the same system prompt: "Decide whether to sign up." In practice, users arrive with diverse intents — some want to read papers, some are evaluating for their team, some landed from a Google search and will bounce. Segment-conditioned prompts are a natural next step.

Training Format: Interleaved Vision + Action History

Each training example is a single conversation turn: the system prompt establishes the user persona, the user message contains the full browsing history (interleaved screenshots and actions), and the assistant message is the next action to predict.

{
  "messages": [
    {
      "role": "system",
      "content": "You are a real user browsing a website...\nWebsite: https://www.explainpaper.com/\nYour goal: Decide whether to sign up for an account\n..."
    },
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "{\"website\":\"https://www.explainpaper.com/\",\"viewport\":{\"width\":1512,\"height\":816}}"},
        {"type": "image_url", "image_url": {"url": "step_1.jpg", "detail": "low"}},
        {"type": "text", "text": "Action 1:\n{\"reasoning\":\"I just landed on the page. Let me scroll down...\",\"action\":{\"type\":\"scroll\",\"timing\":\"quick\",\"direction\":\"down\",\"distance\":1.2}}"},
        {"type": "image_url", "image_url": {"url": "step_2.jpg", "detail": "low"}},
        {"type": "text", "text": "Action 2:\n{\"reasoning\":\"I can see the feature list now...\",\"action\":{\"type\":\"click\",\"timing\":\"delayed\",\"label\":\"Try it\",\"x\":0.50,\"y\":0.68}}"},
        {"type": "image_url", "image_url": {"url": "step_3.jpg", "detail": "low"}},
        {"type": "text", "text": "Step 3, 12s elapsed. Next action:"}
      ]
    },
    {
      "role": "assistant",
      "content": "{\"reasoning\":\"I see the sign-up form now...\",\"action\":{\"type\":\"click\",\"timing\":\"normal\",\"label\":\"Create Account\",\"x\":0.52,\"y\":0.71}}"
    }
  ]
}

Key design decisions:

  • Full action history, no sliding window. Every training example includes ALL prior actions from the session — no truncation. Context builds naturally, and the model learns that step 15 depends on what happened at step 3. This is computationally expensive but critical: we found that model accuracy increases substantially with more context (56% exact match at step 16+ vs. 16% at steps 1–5).
  • Interleaved screenshots. Each action in the history is paired with the screenshot the user saw at that moment. The model sees the visual state at every decision point, not just the current one. Screenshots use detail: "low" (512x512) for cost optimization — sufficient for UI layout understanding while keeping token costs manageable.
  • LLM-synthesized rationale. GPT-4o-mini generates a 1–2 sentence chain-of-thought for each action with a 99.4% success rate. This rationale bridges the gap between raw coordinates and user intent — "I see the sign-up form, so I'll click Create Account" is richer signal than just {"type":"click","x":0.52,"y":0.71}.
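
Putting these decisions together, here is a sketch of how one recorded session could be assembled into a training example in this format. The message structure mirrors the JSON above; the builder function and the assumed session dictionary layout are illustrative.

import json

def build_example(system_prompt: str, session: dict) -> dict:
    """Convert one recorded session into an interleaved vision + action fine-tuning example.

    `session` is assumed to hold `meta` (website/viewport), an ordered list of `steps`
    (each with a screenshot URL, reasoning, and action), and the target next action.
    """
    content = [{"type": "text", "text": json.dumps(session["meta"])}]
    for i, step in enumerate(session["steps"], start=1):
        # Screenshot the user saw at this decision point, in low detail to limit tokens.
        content.append({"type": "image_url",
                        "image_url": {"url": step["screenshot_url"], "detail": "low"}})
        content.append({"type": "text", "text": f"Action {i}:\n" + json.dumps(
            {"reasoning": step["reasoning"], "action": step["action"]})})
    # Current screenshot plus the prediction cue.
    content.append({"type": "image_url",
                    "image_url": {"url": session["current_screenshot_url"], "detail": "low"}})
    content.append({"type": "text",
                    "text": f"Step {len(session['steps']) + 1}, {session['elapsed_s']}s elapsed. Next action:"})
    return {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": content},
        {"role": "assistant", "content": json.dumps(
            {"reasoning": session["target_reasoning"], "action": session["target_action"]})},
    ]}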

Experimental Setup

  • Base model: gpt-4o-2024-08-06 (supports vision fine-tuning)
  • Hyperparameters: learning rate multiplier 0.5, batch size 1, 3 epochs
  • Sample size sweep: 1k, 2k, 5k training examples (subsampled from 11.5k full training set)
  • Evaluation: 1,272 held-out test examples, 10 metrics with segmented analysis
  • Baseline: Unmodified gpt-4o-2024-08-06 with identical system prompt and input format

Each configuration (sample size × epoch) produces a checkpoint, giving us 9 fine-tuned models plus the baseline — 10 models evaluated on the same test set.
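
Each run is a standard OpenAI vision fine-tuning job over a JSONL of the examples above; a sketch with the hyperparameters listed (the training file name is a placeholder):

from openai import OpenAI

client = OpenAI()

# Upload the JSONL of interleaved vision + action examples (placeholder file name).
training_file = client.files.create(file=open("train_5k.jsonl", "rb"), purpose="fine-tune")

# One job per (sample size, epoch budget) configuration.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",
    training_file=training_file.id,
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 1,
        "learning_rate_multiplier": 0.5,
    },
)
print(job.id, job.status)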

Results

Fine-tuning Dramatically Improves Every Metric

Every fine-tuned model — regardless of sample size or epoch — beats the baseline on nearly every metric. Here's the full picture:

| Model | Exact Match | Type Acc. | Click Error (↓) | Scroll Dir. | Scroll MAE (↓) | Input Acc. | Label Match |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline GPT-4o | 12.4% | 44.5% | 0.282 | 22.9% | 0.908 | 95.8% | 45.8% |
| 1k epoch 1 | 13.9% | 52.8% | 0.132 | 76.7% | 0.736 | 98.8% | 67.2% |
| 1k epoch 2 | 18.7% | 55.8% | 0.136 | 85.0% | 0.664 | 99.4% | 68.6% |
| 1k epoch 3 | 15.6% | 56.2% | 0.129 | 88.9% | 0.711 | 98.8% | 68.7% |
| 2k epoch 1 | 13.5% | 51.9% | 0.129 | 87.4% | 0.677 | 98.6% | 66.7% |
| 2k epoch 2 | 16.0% | 55.9% | 0.124 | 93.7% | 0.751 | 99.4% | 70.3% |
| 2k epoch 3 | 17.6% | 57.2% | 0.132 | 92.4% | 0.667 | 98.8% | 68.7% |
| 5k epoch 1 | 15.2% | 52.4% | 0.141 | 90.6% | 0.718 | 99.4% | 70.8% |
| 5k epoch 2 | 21.1% | 60.7% | 0.115 | 94.9% | 0.684 | 98.8% | 71.1% |
| 5k epoch 3 | 20.5% | 61.7% | 0.120 | 94.5% | 0.687 | 99.4% | 72.4% |

(↓) lower is better; all other columns, higher is better.

The best model (5k, epoch 2) improves over baseline by +70% on exact match, 2.5x on click precision, and 4.1x on scroll direction. But even the weakest fine-tuned model (1k epoch 1) already cuts click error by more than half and triples scroll direction accuracy. Fine-tuning provides a consistent, dramatic improvement — the question is how much, not whether.
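
For reference, the headline per-action metrics are simple to compute from predicted vs. ground-truth actions. Below is a simplified sketch of three of them; the exact definitions in our evaluation harness may handle edge cases differently.

import math

def click_error(pred: dict, truth: dict) -> float:
    """Euclidean distance between predicted and actual click points, in normalized viewport units."""
    return math.hypot(pred["x"] - truth["x"], pred["y"] - truth["y"])

def type_accuracy(preds: list[dict], truths: list[dict]) -> float:
    """Fraction of steps where the predicted action type matches the ground truth."""
    hits = sum(p["type"] == t["type"] for p, t in zip(preds, truths))
    return hits / len(truths)

def scroll_mae(preds: list[dict], truths: list[dict]) -> float:
    """Mean absolute error of scroll distance (viewport heights), over steps where both are scrolls."""
    pairs = [(p, t) for p, t in zip(preds, truths) if t["type"] == "scroll" and p["type"] == "scroll"]
    return sum(abs(p["distance"] - t["distance"]) for p, t in pairs) / len(pairs)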

[Figure: Exact Match Accuracy by Model]

But the most revealing insight is in the action type distribution. Baseline GPT-4o predicts "click" for 67% of actions — even when the user scrolled or left the page. It barely produces scroll or terminate predictions. The fine-tuned model learns the real distribution:

| Action type | Ground truth | Baseline predicted | Fine-tuned predicted |
| --- | --- | --- | --- |
| Click | 41.2% | 67.2% | 33.4% |
| Scroll | 19.3% | 4.2% | 27.6% |
| Input | 15.4% | 17.7% | 19.1% |
| Terminate | 24.1% | 10.9% | 19.9% |

[Figure: Action Type Distribution: Ground Truth vs. Predicted (%)]

The fine-tuned model's distribution is far closer to reality. It understands that real users scroll, leave, and type — not just click.

Per-Action-Type Breakdown

| Action Type | Baseline F1 | Fine-tuned F1 | Improvement |
| --- | --- | --- | --- |
| Click | 53.5% | 69.8% | +30% |
| Scroll | 23.4% | 56.9% | +143% |
| Input | 55.9% | 72.5% | +30% |
| Terminate | 19.8% | 33.4% | +69% |

[Figure: F1 Score by Action Type: Baseline vs. Fine-tuned]

The confusion matrices tell the story. Baseline GPT-4o's dominant failure mode: it predicts "click" when the user actually scrolled (194 out of 245 scroll actions misclassified as clicks) or terminated (236 out of 306 terminates misclassified as clicks). The fine-tuned model corrects both:

  • Scroll recall: 14.3% → 81.0%. The baseline barely understands scrolling — it predicts scroll for only 54 out of 1,272 actions. The fine-tuned model predicts scroll 447 times, correctly capturing 196 of the 242 actual scrolls.
  • Terminate recall: 14.4% → 23.0%. Still the hardest action type to predict (knowing when a user will leave is inherently difficult), but the model learns meaningful exit signals — pricing page views, time-on-page thresholds, completed task patterns.
  • Click precision: 43.2% → 77.8%. By learning to produce scrolls and terminates when appropriate, the model stops over-predicting clicks, and the clicks it does predict are far more accurate.

The biggest gain is scroll prediction — F1 jumps from 23% to 57%, a 143% improvement. Scrolling is invisible to text-only models but obvious in screenshots: the model learns that when a user sees a long page with content below the fold, they scroll.

Scaling: More Data = Better Models

| Training Size | Epoch 1 | Epoch 2 | Epoch 3 |
| --- | --- | --- | --- |
| 1k | 13.9% | 18.7% | 15.6% |
| 2k | 13.5% | 16.0% | 17.6% |
| 5k | 15.2% | 21.1% | 20.5% |

[Figure: Exact Match by Training Size and Epoch]

Three patterns emerge:

  1. More data helps. Comparing each training size at its best epoch (18.7% for 1k, 17.6% for 2k, 21.1% for 5k), the 5k model is the clear winner, though the dip from 1k to 2k shows the trend is not perfectly monotonic.
  2. Epoch 2 is the sweet spot. For both 1k and 5k, epoch 2 outperforms epoch 3 — a clear overfitting signal. The model memorizes training examples rather than generalizing. For 2k, epoch 3 is slightly better (17.6% vs 16.0%), suggesting smaller datasets benefit from more passes.
  3. Diminishing returns not yet reached. The jump from 2k to 5k (17.6% → 21.1%) is the largest step in the sweep. We have 11.5k training examples available — scaling further will likely push accuracy higher.

What Works Best: Segmented Analysis

Not all predictions are equally hard. Segmenting by context reveals where the model excels and where it struggles:

Session position matters — a lot.

| Position | Steps | Baseline | Fine-tuned (5k) |
| --- | --- | --- | --- |
| Early (1–5) | 683 | 10.3% | 16.4% |
| Mid (6–15) | 555 | 14.5% | 25.4% |
| Late (16+) | 25 | 23.1% | 56.0% |

[Figure: Exact Match by Session Position: Baseline vs. Fine-tuned (5k)]

Early actions are the hardest — the model has little context about this user's intent. But as the session progresses and the model accumulates action history + screenshots, it gets dramatically better. At step 16+, the fine-tuned model achieves 56% exact match — it understands where this user has been and can predict where they're going.

Labels provide critical signal.

| Condition | Baseline | Fine-tuned (5k) |
| --- | --- | --- |
| Actions with visible text labels | 17.0% | 29.2% |
| Actions without labels | 0.0% | 0.0% |

[Figure: Exact Match by Label Availability: Baseline vs. Fine-tuned (5k)]

When the model can match the action to a visible text label ("Sign Up", "Search papers"), it performs well. Without a label (e.g., clicking a blank area or an icon without text), both baseline and fine-tuned models struggle. This suggests that label-rich UI designs are inherently more predictable — a useful insight for UX evaluation.

Timing accuracy: 54.2% overall for the fine-tuned model (vs. 21.2% baseline). The model learns that users pause before high-commitment actions and move quickly through familiar interfaces — capturing the human pace that makes a digital twin feel realistic.

What the Model Learns

The numbers tell one story. Concrete predictions tell another.

Example 1: Learned hesitation

A user lands on Explainpaper's landing page and reads the hero section. The fine-tuned model predicts a delayed click on "Try it" — matching the real user's 7-second pause before committing. Baseline GPT-4o predicted a quick click, as if the user already knew what they wanted.

Example 2: Scroll-then-click pattern

After reading a paragraph, the real user scrolled down 1.5 viewport heights to see more features, then clicked "Sign Up." Baseline GPT-4o predicted a click on the navigation bar (skipping the scroll entirely). The fine-tuned model correctly predicted {"type":"scroll", "timing":"normal", "direction":"down", "distance":1.3} — not pixel-perfect, but the right behavior.

Example 3: Failure case

The model predicted a click on the "Upload PDF" button when the user actually clicked "Sign Up" (adjacent elements in the top-right corner). The click coordinates were close (0.05 distance) but the label was wrong. This reveals a known limitation: when multiple interactive elements are spatially close, the model captures the region of interest but sometimes picks the wrong target.

The key insight: the model learns distributions of human behavior, not deterministic playbooks. Given the same page state, it might predict "scroll down" 60% of the time and "click Sign Up" 40% of the time — reflecting the real split in user behavior. This is exactly what makes it useful for simulation: run 1,000 sessions and get a realistic distribution of outcomes, not 1,000 identical bot runs.
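
In practice, that distributional behavior comes from sampling the fine-tuned model with nonzero temperature. Here is a sketch of tallying next-action outcomes for a single page state; the fine-tuned model ID is a placeholder, and the messages are assumed to follow the same interleaved history format used in training.

import json
from collections import Counter
from openai import OpenAI

client = OpenAI()

def next_action_distribution(messages: list[dict], model_id: str, n: int = 100) -> Counter:
    """Sample the fine-tuned model repeatedly for the same page state and tally action types."""
    outcomes = Counter()
    for _ in range(n):
        response = client.chat.completions.create(
            model=model_id,          # e.g., a fine-tuned "ft:gpt-4o-2024-08-06:..." ID (placeholder)
            messages=messages,       # same interleaved history format used in training
            temperature=1.0,         # nonzero temperature -> a distribution, not a single playbook
            max_tokens=150,
        )
        predicted = json.loads(response.choices[0].message.content)
        outcomes[predicted["action"]["type"]] += 1
    return outcomes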

What This Means for Replica's Digital Twins

The core promise

Replica's digital twins are only as useful as they are accurate. This research shows that fine-tuning produces a step-change in behavioral fidelity — every metric improves substantially, and the model learns behavioral patterns that generic foundation models simply cannot capture.

A/B test simulation

Run experiments against fine-tuned digital twins that click, scroll, hesitate, and drop off like real users. Test conversion impact before shipping to production. When the model predicts that 40% of users will bounce from a new pricing page layout — and the predicted click heatmap matches real user patterns — you can trust the simulation.

Prelaunch validation

Simulate thousands of realistic user sessions on a new feature or redesign. Identify UX issues, drop-off points, and navigation confusion before any real user sees it. The 2.5x improvement in click precision means the model targets the right UI elements, and the 4.1x improvement in scroll direction means it navigates pages the way real users do.

Segment-specific models

Fine-tune on subsets of users sharing attributes — geography, device type, income bracket, new vs. returning — and simulate how specific audiences will respond. The pipeline is standardized: map session_id → user_id → user_attributes, filter training data by segment, fine-tune separate models. Mobile users scroll more and click less. First-time visitors explore more broadly. Enterprise users go straight to pricing. Each segment gets a model that captures its behavioral signature.
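
A sketch of the segment filter step: join training examples to user attributes and write one JSONL per segment. For brevity it collapses the session_id → user_id → user_attributes join into a single lookup; field and file names are illustrative.

import json
from collections import defaultdict

def write_segment_files(examples: list[dict], attributes: dict[str, dict], key: str) -> None:
    """Group fine-tuning examples by a user attribute (e.g., device, geography) and write one JSONL per segment."""
    segments = defaultdict(list)
    for example in examples:
        # `attributes` maps session_id -> attribute dict, e.g., {"device": "desktop", "geo": "US"}.
        segment = attributes.get(example["session_id"], {}).get(key, "unknown")
        segments[segment].append(example)
    for segment, rows in segments.items():
        with open(f"train_{key}_{segment}.jsonl", "w") as f:
            for row in rows:
                f.write(json.dumps({"messages": row["messages"]}) + "\n")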

Your sessions, your twins

Companies already collect session recordings via tools like FullStory, Hotjar, and LogRocket. Replica ingests these recordings and maps them to user attributes to build attribute-conditioned digital twins from your existing data — no new instrumentation required.

What's Next

We validated the core hypothesis: finetuning on real sessions dramatically improves behavioral fidelity. Several directions follow:

  • Ablation study: A 2x2 matrix testing four training variants — Actions only (A), Actions + Rationale (AR), Actions + Screenshots (AS), and Actions + Screenshots + Rationale (ASR) — to isolate the contribution of visual context vs. chain-of-thought reasoning to model performance.
  • Segment-conditioned models: Fine-tune on user subgroups (mobile vs. desktop, new vs. returning, geographic cohorts) to capture behavioral differences across segments and validate that segment-specific models outperform one-size-fits-all.
  • Global attribute models: Aggregate anonymized behavioral data across companies to build reusable attribute-conditioned models — e.g., "US women aged 20-29 on desktop" as a pretrained behavioral prior for companies lacking sufficient internal data.
  • Session-level evaluation: Beyond per-action metrics, do generated sessions feel realistic end-to-end? We're developing session-level metrics that evaluate behavioral realism across an entire simulated session — funnel completion rates, time-on-page distributions, navigation graph similarity.
  • DPO/RLHF: Reinforcement learning from session-level feedback to optimize for holistic realism rather than per-action accuracy.
  • Cross-company benchmarking: Attribute-level behavioral analysis at scale across industries, enabling companies to understand how their users compare to broader behavioral baselines.

The signal is clear: more data helps, visual context helps, and real sessions contain learnable structure that foundation models miss. The path to accurate digital twins runs through fine-tuning — and we're just getting started.

See finetuning applied to your data

We’ll backtest three of your past A/B tests before the call
