Sampled vs. Unsampled Data: Does it Matter?
This post is inspired by something that's become a bit of a pastime for me these days: diving into unsampled data while working with our GA Premium clients. It's almost addictive! After working with unsampled data more and more, working with sampled data sets is just not as much fun, because I know what's likely missing.
If you've seen the box pictured below in Google Analytics, then you've experienced data sampling in GA. But what is it, and does it matter? In this post I'll work to demystify Google Analytics data sampling.
Background on Sampling
I've worked with Google Analytics since before it was Google Analytics (remember Urchin OnDemand, anyone?). It's always been an impressive tool for collecting and analyzing web data. One of the aspects I appreciate most is how fast it runs. I've heard others say (and caught myself saying it too) that Google Analytics is "slow." That's a relative term. I remember using WebTrends, where I would define a report with the equivalent of a simple GA Advanced Segment applied over a few months of data and literally have to wait hours for the report to be ready. That's unheard of in Google Analytics.
How Google Analytics Sampling Works
All that said, speed comes with a consequence. For Google Analytics, that consequence is experienced as "data sampling." In order for GA to serve up an amazing report analyzing millions of data points in less time than McDonald's can make a cheeseburger, it speeds things up with a technique that uses only a portion of all available data, applies some math to scale the results back up, and shows you the estimate.
More specifically, data sampling in Google Analytics activates when the total data set against which you're reporting exceeds a threshold – the default is 250,000 sessions, but it can be adjusted down to 1,000 or up to 500,000 using the "sampling selector" tool.
When sampling runs, the system determines the total number of visits in the date range you're using and calculates the sampling rate from the sampling setting (250,000 visits, 500,000 visits, etc.). For example, with a 500,000-visit setting on a date range with 1 million visits, it would calculate a 50% sampling rate. It then uses this sampling rate when selecting data from across the date range. So, if your sample rate is 50% and Monday had 10,000 visits, it would randomly select 5,000 of those visits. If Tuesday had just 1,000 visits, it would select 500 of those visits for the sample. After the "sample" is selected, additional filtering is applied, such as report filters, drill-downs, or advanced segments.
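The arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the calculation as described in this post, not Google's actual implementation; the function names `sampling_rate` and `visits_sampled_per_day` are made up for the example.

```python
def sampling_rate(total_visits, sample_size=250_000):
    """Fraction of visits kept for the sample; 1.0 means no sampling is needed."""
    if total_visits <= 0:
        return 1.0
    return min(1.0, sample_size / total_visits)


def visits_sampled_per_day(daily_visits, rate):
    """Apply the same rate to every day in the date range."""
    return {day: round(visits * rate) for day, visits in daily_visits.items()}


# The example from the text: a 500,000-visit setting on 1 million visits.
rate = sampling_rate(1_000_000, sample_size=500_000)
per_day = visits_sampled_per_day({"Monday": 10_000, "Tuesday": 1_000}, rate)

print(rate)     # 0.5
print(per_day)  # {'Monday': 5000, 'Tuesday': 500}
```

Note that the rate is fixed before any segment or filter is applied, which is why narrow reports suffer most.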
Where Sampling Hurts
In cases where your report analyzes a thin slice of all data, let's say a report showing the landing pages for a certain keyword, you could be dealing with very thin data slices even before sampling applies. Let's say your keyword is your top term at 10% of all organic search. If organic search makes up 30% of all your site traffic, your data in focus is just 3% of all site traffic. Add another dimension, Landing Page, and by your tenth landing page you could be dealing with just 0.3% of all data.
When the data you are analyzing is based on these thin slices, the actual number of visits behind a row can be just a few hundred. Apply sampling and those numbers can drop to dozens, even handfuls. When ratios and metrics are based on a handful of visits and then re-inflated by the sampling rate, you can end up with wild variations from reality.
For example, let's say a thin slice of data is selected through sampling: 10 visits based on a 10% sample rate. Ten visits isn't likely to lend itself to a statistically significant result. If the sample happens to include an unusually high or low share of converting visits, your conversion rate is going to be skewed, and that skew carries through when the numbers are scaled back up by the sample rate. The result? You can end up with numbers that are significantly above or below reality.
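To make the thin-slice arithmetic concrete, here is a small worked example. The traffic total (100,000 visits) and the 10% sample rate are hypothetical numbers chosen for illustration, and the tenth landing page's share (10% of the keyword slice) is inferred from the 3% and 0.3% figures above.

```python
from fractions import Fraction

total_visits = 100_000                   # hypothetical total for the date range
sample_rate = Fraction(10, 100)          # hypothetical 10% sampling rate

organic_share = Fraction(30, 100)        # organic search as a share of all traffic
keyword_share = Fraction(10, 100)        # top keyword as a share of organic search
landing_page_share = Fraction(10, 100)   # inferred: tenth landing page within the keyword

# Multiply the shares to get the slice's share of all traffic.
slice_share = organic_share * keyword_share * landing_page_share
visits_in_slice = total_visits * slice_share
sampled_visits = visits_in_slice * sample_rate

print(f"slice share of all traffic: {float(slice_share):.1%}")   # 0.3%
print(f"visits in slice, unsampled: {int(visits_in_slice)}")     # 300
print(f"visits in slice, sampled:   {int(sampled_visits)}")      # 30
```

Under these assumptions, a row that really represents 300 visits is estimated from about 30.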
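A quick simulation shows how much a 10-visit sample can swing. This is a sketch with made-up numbers (a slice of 100 visits, 3 of which converted), not data from a real account, and simple random sampling stands in for whatever selection method GA actually uses.

```python
import random


def sampled_conversion_rates(visits, conversions, rate, trials, seed=0):
    """Repeatedly draw a random sample of visits and estimate the conversion rate."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    population = [1] * conversions + [0] * (visits - conversions)
    sample_size = max(1, round(visits * rate))
    return [sum(rng.sample(population, sample_size)) / sample_size
            for _ in range(trials)]


true_rate = 3 / 100  # hypothetical slice: 3 conversions in 100 visits
estimates = sampled_conversion_rates(visits=100, conversions=3,
                                     rate=0.10, trials=1000)

print(f"true conversion rate: {true_rate:.1%}")
print(f"lowest estimate:      {min(estimates):.1%}")
print(f"highest estimate:     {max(estimates):.1%}")
print(f"trials reporting 0%:  {estimates.count(0.0) / len(estimates):.1%}")
```

With only 10 sampled visits, every estimate is a multiple of 10%, so the true 3% rate can never be reported exactly: each sample either misses the converting visits entirely or overstates them several times over.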
Here are some examples of sampled data varying based on the sample rate. Notice how the data changes – the only thing changing is the sample rate.
Real World Example
Here is a real-world example of sampled vs. unsampled data. The first image shows sampled data. The reported "returning visitor" segment converted at 0%, while the "new visitor" segment converted at 2.86%.
The unsampled data shows that, in reality, returning visitors converted at 0.726% and new visitors converted at 1.066%.
Comparing these two, the difference between sampled and unsampled data is stark!
Avoiding Sampled Data
If you are analyzing data that is heavily sampled and want to break free, you have a couple of options:
- Reduce your date ranges to get below 500,000 visits. The smallest range you can get to is one day. If you have more than 500,000 visits per day, then you will always get sampled data. Tools like Analytics Canvas make it easier to export daily data and combine it to reduce the impact of sampling.
- Go Premium. Yep – it costs, but what's the value of accurate data? What's the value of the decisions that need to be made? If you have the kind of data scale that runs into sampling issues, then it's likely worth investing in. Using good data has a much higher value, and that's worth the cost of accurately analyzing it.
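For the first option, the idea of shrinking date ranges can be sketched as a simple greedy grouping of days. This is a hypothetical helper, not part of Google Analytics or Analytics Canvas; it assumes you already know each day's visit count and that a range avoids sampling as long as its total does not exceed the 500,000-visit setting.

```python
def split_into_ranges(daily_visits, threshold=500_000):
    """Group consecutive days so each group's total stays within the threshold.

    A single day that exceeds the threshold ends up alone in its own group
    and would still be sampled.
    """
    ranges, current, total = [], [], 0
    for day, visits in daily_visits:
        # Start a new range when adding this day would cross the threshold.
        if current and total + visits > threshold:
            ranges.append(current)
            current, total = [], 0
        current.append(day)
        total += visits
    if current:
        ranges.append(current)
    return ranges


# Hypothetical daily visit counts.
daily = [("Mon", 200_000), ("Tue", 250_000), ("Wed", 300_000),
         ("Thu", 600_000), ("Fri", 100_000)]
ranges = split_into_ranges(daily)
print(ranges)  # [['Mon', 'Tue'], ['Wed'], ['Thu'], ['Fri']]
```

Each group would be queried separately and the results combined afterward. Thursday, at 600,000 visits, cannot be rescued this way, which is where the second option comes in.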