How we test analytics migrations: methodology and tolerance bands

By Lucas Brandao · São Paulo · verified 2026-05-04

Every page on migrateanalytics.com gives you numbers — pageview deltas, reconciliation thresholds, engineering-hour ranges. This page documents how those numbers are produced and where they break. Read it once before you trust anything else on the site. The methodology is itself the credential; nothing else here works without it.

Why a methodology page exists at all

Most analytics-comparison content on the open web sits at the same trust level: a ranked list, a couple of screenshots, an affiliate disclosure footer. It is not testable. There is no test stand, no version pin, no raw data, no protocol you can rerun. The numbers are vibes.

This site sits at domain-rating zero. There is no decade-old backlink profile to coast on, no parent brand to borrow trust from, no prior reader email list to vouch for accuracy. The only credential available is a methodology you can check, on data you can re-pull, with a protocol you can rerun on your own infrastructure.

So the methodology page is not appendix material. It is the foundation. Every pair page links here. Every tolerance band on every reconciliation table is calibrated against the protocol described below. If you find a number on this site that does not reconcile against the methodology — file a GitHub issue and I will rerun the test or correct the number. That is the deal.

Test stand setup standards

Five disclosures appear at the top of every measurement run on this site. They are non-negotiable. A page that quotes a reconciliation gap without all five is not a measurement; it is an opinion.

Vendor and version. Every tracker version is pinned. "Plausible" is not a measurement target; "Plausible Cloud, dashboard build 2026.04, tracker script.js v2.0.4" is. Vendor versions change reconciliation behavior — Matomo 5.1 fingerprinting differs from 4.14, Plausible's outbound-link variant differs from the default tag — and unpinned versions are how comparison content goes stale within six months.

Hardware and region. The test stand for parallel runs is a Hetzner CX22 in Falkenstein (EU-central, 2 vCPU, 4 GB RAM, Debian 12). The static site generator is Hugo 0.135 with a 1.4 MB built site (300 pages, no images above the fold). When a destination has its own self-host option (Plausible CE, Matomo, Umami), the destination DB runs on the same VPS in a separate Docker network. Latency to the tracker endpoint is under 50 ms for EU traffic and under 200 ms for US traffic. If you reproduce on a different stand, expect a different reconciliation gap: the protocol still holds, but the numbers shift slightly.

Traffic profile. All measurement runs use a synthetic visitor mix: 40% EU IPs, 35% US, 15% APAC, 10% other; 60% mobile, 40% desktop; weekday-skewed. Volume during a run sits between 8 K and 30 K pageviews per day. Both ends of that range are below the volume thresholds where rate-limiting or sampling kicks in on any of the four destinations. If your real production traffic is heavily one-sided (90%+ EU, or a US-only consumer app), the cookieless inflation gap will land at the high or low end of the published band.
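The visitor mix above can be sketched as a weighted draw. This is a minimal illustration, not the generator used on the test stand: the function name, the seed, and the choice to sample region and device independently are all assumptions, and the weekday skew is left out.

```python
import random
from collections import Counter

# Shares taken from the traffic profile; treating region and device
# as independent is an assumption made for this sketch.
REGION_WEIGHTS = {"EU": 40, "US": 35, "APAC": 15, "other": 10}
DEVICE_WEIGHTS = {"mobile": 60, "desktop": 40}


def synthetic_visitors(n: int, seed: int = 0) -> list[tuple[str, str]]:
    """Draw n (region, device) pairs according to the published mix."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    regions = rng.choices(
        list(REGION_WEIGHTS), weights=list(REGION_WEIGHTS.values()), k=n
    )
    devices = rng.choices(
        list(DEVICE_WEIGHTS), weights=list(DEVICE_WEIGHTS.values()), k=n
    )
    return list(zip(regions, devices))


visitors = synthetic_visitors(20_000)
region_share = Counter(region for region, _ in visitors)
for region, count in region_share.items():
    print(f"{region}: {count / len(visitors):.1%}")
```

Comparing the printed shares against your own production split is a quick way to see whether your traffic is one-sided enough to push the gap toward an edge of the band.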

Measurement period. Minimum 14 days per parallel run. Most published numbers come from runs of 21 or 28 days. Anything shorter than two weeks gets thrown out: weekday seasonality alone moves pageview counts by 12-18%, and you cannot tell a tracker bug from a Tuesday-vs-Sunday effect on a 5-day sample. The 14-day minimum is also why the parallel-run protocol on every pair page calls for two weeks of dual-tag time: that is the floor for a defensible cutover.
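A hedged sketch of the 14-day floor as a check you could run before trusting a sample. The function name is hypothetical, and requiring the days to be consecutive is an assumption this sketch adds; the page only states the minimum length.

```python
from datetime import date, timedelta

MIN_RUN_DAYS = 14  # the floor stated in the measurement-period rule


def run_is_long_enough(days: list[date], min_days: int = MIN_RUN_DAYS) -> bool:
    """Return True if `days` holds at least `min_days` consecutive calendar days."""
    if not days:
        return False
    ordered = sorted(set(days))
    longest = current = 1
    for prev, cur in zip(ordered, ordered[1:]):
        # Extend the streak on a one-day step, otherwise start a new one.
        current = current + 1 if cur - prev == timedelta(days=1) else 1
        longest = max(longest, current)
    return longest >= min_days


start = date(2026, 4, 1)
five_day_sample = [start + timedelta(days=i) for i in range(5)]
three_week_run = [start + timedelta(days=i) for i in range(21)]
print(run_is_long_enough(five_day_sample))  # False
print(run_is_long_enough(three_week_run))   # True
```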

Raw data availability. Every measurement run has its raw daily reconciliation CSV published on GitHub under /data/runs/<run-id>/. The CSV has one row per day per metric per tracker. Reproducing a number from this site means pulling the CSV, opening it in DuckDB or Google Sheets, and checking the math yourself. If the CSV is not there, the number is not measured — it is annotated as (estimate) in-line on the page.
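Checking the math yourself might look like the sketch below, which uses only the Python standard library. The column names (date, metric, tracker, value), the tracker labels, and the sample rows are assumptions made for illustration; the published CSVs may use a different header.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the assumed layout: one row per day per metric per tracker.
SAMPLE_CSV = """date,metric,tracker,value
2026-04-01,pageviews,ga4,10000
2026-04-01,pageviews,plausible,10150
2026-04-02,pageviews,ga4,12000
2026-04-02,pageviews,plausible,11880
"""


def daily_deltas(csv_text: str, metric: str, source: str, destination: str) -> dict:
    """Return {date: percent delta}, computed as (destination - source) / source."""
    values = defaultdict(dict)  # date -> {tracker: value}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["metric"] == metric:
            values[row["date"]][row["tracker"]] = float(row["value"])
    deltas = {}
    for day, by_tracker in sorted(values.items()):
        src = by_tracker.get(source)
        dst = by_tracker.get(destination)
        if src and dst is not None:  # skip days with a missing or zero source
            deltas[day] = (dst - src) / src * 100
    return deltas


deltas = daily_deltas(SAMPLE_CSV, "pageviews", "ga4", "plausible")
for day, pct in deltas.items():
    print(f"{day}: {pct:+.2f}%")
```

To run it against a downloaded file, read the file's text and pass it in place of SAMPLE_CSV.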

The parallel-run protocol (5 phases)

Every pair page on this site assumes you follow this protocol. The phases are cumulative; skipping one breaks the next.

Phase 1 — Discovery. Capture the existing event taxonomy before you touch anything. Export the source tracker's event list to CSV. From GA4 that is an export via Reports → Events → Export, or an events_* table dump from BigQuery. From Mixpanel it is the project-level event schema export. The output is a flat list of event names plus their properties. This list is the input to the destination's Event Mapping Wizard and the source of truth for everything that follows. Average duration: 1-2 hours.

Phase 2 — Install. Add the destination tracker as a second <script> tag alongside the existing one. Both fire client-side, with no defer race and no shared global. Change no other code: no event handlers, no consent banner, no robots.txt, no analytics.js wrapper. The dual-tag period is a hands-off observation phase; the goal is to measure baseline divergence between two trackers seeing the same traffic, not to redesign your stack. Average duration: 1-3 hours including review.

Phase 3 — Verify. On day 1 of dual-tag, manually walk five URLs with DevTools open: a homepage, a content page, a conversion page, a 404, and a paginated archive. Confirm that both tracker endpoints fire for each URL with a 2xx response and no console errors; collection endpoints often answer 202 or 204 rather than 200, so check for success, not for the literal code. Verify that any dynamically-bound event (form submit, click on a <button>) fires on both trackers. The five-URL DevTools check catches roughly 80% of install bugs in the first hour. The remaining 20% surface during reconciliation. Average duration: 30-60 minutes.
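If you save the DevTools network log as a HAR file for each of the five URLs, the endpoint check can be scripted. The sketch below is one possible approach, not part of the published protocol: the two hostnames are placeholders, the function name is hypothetical, and it accepts any 2xx status because many collection endpoints answer 202 or 204 rather than 200.

```python
from urllib.parse import urlsplit

# Placeholder hosts; substitute the endpoints your two tags actually call.
TRACKER_HOSTS = ("source-tracker.example", "destination-tracker.example")


def trackers_fired(har: dict, hosts=TRACKER_HOSTS) -> dict:
    """Map each tracker host to True if the HAR holds a successful hit to it."""
    fired = {host: False for host in hosts}
    for entry in har["log"]["entries"]:
        host = urlsplit(entry["request"]["url"]).hostname
        status = entry["response"]["status"]
        if host in fired and 200 <= status < 300:
            fired[host] = True
    return fired


# Tiny hand-written HAR fragment: the source tag succeeds, while the
# destination request is recorded with status 0 (blocked or failed).
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"url": "https://source-tracker.example/collect?v=2"},
                "response": {"status": 204},
            },
            {
                "request": {"url": "https://destination-tracker.example/api/event"},
                "response": {"status": 0},
            },
        ]
    }
}
print(trackers_fired(sample_har))
```

For a real export, load the file with json.load and pass the resulting dict to trackers_fired. Console errors and dynamically-bound events still need the manual walk.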

Phase 4 — Reconcile. Once a day for 14 days, pull yesterday's pageview, session, and goal counts from both dashboards and write them into a reconciliation spreadsheet. Compute deltas as percentages (destination minus source, divided by source). Plot the delta against the tolerance bands below. If a metric drifts outside the yellow band (beyond 10% in either direction) on three consecutive days, stop and investigate before phase 5. Average duration: 5-10 minutes per day, roughly 70-140 minutes total over a 14-day run.
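The delta formula and the three-consecutive-days stop rule can be sketched as follows. The function names are hypothetical, and reading "outside the yellow band" as an absolute delta above 10% is this sketch's interpretation.

```python
YELLOW_UPPER = 10.0  # percent; upper edge of the yellow band


def percent_delta(source: float, destination: float) -> float:
    """(destination - source) / source, expressed as a percentage."""
    return (destination - source) / source * 100


def should_stop(daily_deltas: list[float], limit: float = YELLOW_UPPER,
                run_length: int = 3) -> bool:
    """True once `run_length` consecutive days fall outside +/- `limit` percent."""
    streak = 0
    for delta in daily_deltas:
        # A day back inside the band resets the streak.
        streak = streak + 1 if abs(delta) > limit else 0
        if streak >= run_length:
            return True
    return False


print(should_stop([1.2, 11.0, 12.5, 3.0, 11.4]))  # False: the streak is broken
print(should_stop([1.2, 11.0, -12.5, 13.1]))      # True: three days in a row
```

Using the absolute value means a destination that undercounts by 12% trips the rule just as an overcount does.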

Phase 5 — Cutover. A single deploy, Wednesday morning, 9 AM local time. The deploy is one commit: remove the source <script> tag, leave the destination tag in place. No simultaneous theme change, no copy update, no robots.txt change, no DNS change. Wednesday morning gives you three working days (Wednesday through Friday) to roll back before a weekend; 9 AM gives you a full traffic day to spot regressions. The deploy itself takes a few minutes; the monitoring tail runs for a week. Average duration: 5 minutes deploy, 7 days of light monitoring.

Tolerance bands and why they're calibrated, not opinion

Every reconciliation table on this site uses the same color bands. They are not arbitrary.

| Band | Range | Meaning |
|---|---|---|
| Green | ±2% | Two trackers agree within measurement noise |
| Yellow | 2-10% | Expected gap from methodology differences |
| Red | >10% | Tracker bug, install error, or unmapped events |

The numbers come from six measurement runs on the test stand described above, comparing GA4 against each of Plausible, Matomo, Fathom, and Umami in pairwise dual-tag deployments over 14-28 days each. The pageview-count delta between any two correctly-installed trackers on the same traffic averaged 1.3% across the six runs, with a 95th-percentile of 8.4% on cookie-banner sites where one tracker counted bannerless visitors and the other did not.

The 2% green band is roughly one and a half times the 1.3% average measured noise (2 / 1.3 ≈ 1.5), which leaves headroom for ordinary day-to-day variation. The 10% red threshold is above the 95th-percentile gap observed on the worst pair (GA4 vs Plausible on a heavily-bannered site, which produced 9.1% as the cookieless-inflation maximum). Anything above 10% in a real install is not methodology drift; it is a bug.
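The bands reduce to a small classifier. This is a minimal sketch: the function name is hypothetical, and placing the boundary values (exactly 2% and exactly 10%) in the lower band is an assumption, since the table does not say which side they fall on.

```python
GREEN_LIMIT = 2.0    # percent
YELLOW_LIMIT = 10.0  # percent


def classify_band(delta_pct: float) -> str:
    """Map a percentage delta (signed or unsigned) to a tolerance band."""
    gap = abs(delta_pct)  # the bands are symmetric around zero
    if gap <= GREEN_LIMIT:
        return "green"
    if gap <= YELLOW_LIMIT:
        return "yellow"
    return "red"


for delta in (1.3, -8.4, 9.1, 12.0):
    print(f"{delta:+.1f}% -> {classify_band(delta)}")
```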

The bands are recalibrated annually. Every January the six runs are repeated against current tracker versions and the percentages updated. The current bands are valid for runs against tracker versions released between 2025-09 and 2026-12; older versions sit on a different methodology era and should not use these thresholds.

Reconciliation rules per metric type

Not every metric reconciles the same way. Treating pageviews and bounce rate with the same tolerance band is the most common methodology error in published comparisons.

Pageviews — ±2% green. This is the only metric that should reconcile tightly across trackers. Every tracker counts a pageview the same way: a successful tracker-endpoint hit on page load. Definitions are stable across vendors. Anything outside ±2% on pageviews is either a tracker not firing on some pages (check <noscript> paths, single-page-app route changes, AMP variants) or double-counting (check service-worker replay, browser back-forward cache).

Sessions — methodology-dependent. GA4 closes a session after 30 minutes of inactivity. Matomo's default is also 30 minutes. Plausible has no session concept at all (it counts visits, defined as a 30-minute idle window from the last pageview). Fathom counts visits via a similar idle-timeout heuristic. Umami uses a 30-minute window keyed by client-id hash. If two trackers use the same definition, expect ±5% reconciliation. If one is session-based and one is visit-based, the gap routinely runs 8-15% and is not a bug. Document the methodology, do not chase the number.
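To see why the idle-window definition matters, here is a minimal sketch of counting visits for a single visitor with a 30-minute inactivity timeout. It illustrates the shared idea only, not any vendor's implementation; each tracker layers its own rules on top, which is where the gaps come from. The function name is hypothetical.

```python
from datetime import datetime, timedelta

IDLE_TIMEOUT = timedelta(minutes=30)


def count_visits(hit_times: list[datetime], idle: timedelta = IDLE_TIMEOUT) -> int:
    """Count visits for one visitor: a new visit starts after `idle` of silence."""
    visits = 0
    last_hit = None
    for hit in sorted(hit_times):
        if last_hit is None or hit - last_hit > idle:
            visits += 1
        last_hit = hit  # the window is measured from the most recent hit
    return visits


hits = [
    datetime(2026, 4, 1, 9, 0),
    datetime(2026, 4, 1, 9, 10),
    datetime(2026, 4, 1, 9, 45),  # 35 minutes after the previous hit
    datetime(2026, 4, 1, 10, 0),
]
print(count_visits(hits))  # 2
```

Changing the timeout, or what counts as a hit, changes the count on identical traffic, so document each tracker's definition before comparing session numbers.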

Bounce rate — definitionally different per vendor. GA4 replaced UA's bounce rate with "engagement rate"; the bounce rate GA4 reports today is the inverse of engagement rate, where an engaged session lasts longer than 10 seconds, triggers a key event, or records two or more pageviews. UA's bounce rate, Matomo's bounce rate, and Plausible's bounce rate are three different metrics with the same name. There is no honest reconciliation across vendors here; if your weekly workflow depends on bounce-rate parity post-migration, the workflow is broken before the migration starts. See the engaged-sessions glossary entry for the full definitional matrix.

Goals and conversions — only meaningful if event-mapping is correct. Goal counts can only reconcile after step 1 of the Event Mapping Wizard is complete and verified URL-by-URL. Mapping coverage on most GA4 sources is 85-90% of events, and the unmapped tail produces destination-side gaps of 10-30% on goal counts that are mapping problems, not tracker problems. The yellow tolerance band does not apply to goals; goal reconciliation is binary — events mapped reconcile within ±5%, unmapped events do not reconcile at all.
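The binary rule for goals can be sketched like this. Event names, counts, and function names are invented for illustration, and the mapping dict stands in for whatever the Event Mapping Wizard produces; only the ±5% tolerance comes from the text above.

```python
def mapping_coverage(source_events: set[str], mapping: dict[str, str]) -> float:
    """Percentage of source events that have a destination mapping."""
    if not source_events:
        return 0.0
    mapped = sum(1 for event in source_events if event in mapping)
    return mapped / len(source_events) * 100


def reconcile_goals(source_counts: dict[str, int],
                    destination_counts: dict[str, int],
                    mapping: dict[str, str],
                    tolerance_pct: float = 5.0) -> dict[str, str]:
    """Label each source event 'ok', 'gap', or 'unmapped'."""
    status = {}
    for event, src in source_counts.items():
        dest_name = mapping.get(event)
        if dest_name is None:
            status[event] = "unmapped"  # no band applies; fix the mapping first
            continue
        dest = destination_counts.get(dest_name, 0)
        if src == 0:
            within = dest == 0
        else:
            within = abs(dest - src) / src * 100 <= tolerance_pct
        status[event] = "ok" if within else "gap"
    return status


# Hypothetical counts for one day of a dual-tag run.
source_counts = {"sign_up": 200, "purchase": 50, "video_play": 400}
mapping = {"sign_up": "Signup", "purchase": "Purchase"}
destination_counts = {"Signup": 206, "Purchase": 41}

print(f"coverage: {mapping_coverage(set(source_counts), mapping):.0f}%")
print(reconcile_goals(source_counts, destination_counts, mapping))
```

Keeping "unmapped" separate from "gap" keeps a mapping problem from being misread as a tracker problem.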

What's NOT a method, just a recommendation

A few things on this site read as methodology but are actually opinion. They are flagged in-line as recommendations and excluded from the protocol above. Honesty about the line matters; readers deserve to know which numbers are testable and which are workflow preferences.

Wednesday-morning cutover. A recommendation, not a measurement. There is no statistically-defensible reason Wednesday at 9 AM is better than Tuesday at 10 AM. The reasoning is operational: three working days (Wednesday through Friday) before the weekend to roll back, a mid-morning traffic level high enough to spot regressions but not peak, and time zone overlap with both EU and US team members. Pick a day that fits your team; the protocol still works.

Two-week parallel-run minimum. This is a measurement floor (see "Test stand setup standards" above), but the upper bound is a recommendation. Most teams do not need 28 days. Two weeks catches weekday seasonality; four weeks adds nothing measurable except inertia. The "longer is safer" intuition is wrong past 14 days.

One commit per cutover. A workflow opinion. There is no measurement that says bundling the cutover with a copy change breaks reconciliation; what breaks is your ability to attribute a regression to the analytics change vs the copy change. Single-deploy cutover is a debuggability choice, not a methodology requirement.

How to challenge the methodology

The site is a public utility, not a product. There is nothing to defend.

If you find a number that does not reconcile against the published CSV, open a GitHub issue with the run-id and the spreadsheet showing the gap. I will rerun on the test stand within a week and either correct the number or post a methodology note explaining where your reproduction differs.

If you reproduce a measurement run on your own infrastructure and get different numbers, write it up and I will link it from the relevant pair page. The protocol is reproducible by design; that is the point. Differing reproductions are the most useful kind of feedback this site receives. The methodology improves through challenge, not consensus.

Monthly notes on methodology drift, tracker version changes, and recalibrated tolerance bands ship via the /reconcile/ newsletter. One short email per month, no marketing, raw CSVs linked.
Written by
Lucas Brandao
Analytics engineer · São Paulo · 11 years in data
Two Berlin SaaS migrations behind me. I write migrateanalytics.com as a public utility — no product, no affiliate, no consulting. All measurements are reproducible; raw data lives on GitHub.
v1 · 2026-05-04 · first publication.