Essential guide to data accuracy in web analytics
The issue of data quality and accuracy in web analytics is something that most web analysts have no option but to learn and internalise very quickly, especially when people start asking why numbers don’t match. However, it is easy for us to forget that our clients, business users and marketing teams don’t live and breathe this data as we do. This post is therefore a reminder of the essential (by no means definitive) facts about why web analytics data can’t necessarily be taken as fact.
Why are the numbers different?
Most people first recognise a problem with web analytics data because they are trying to reconcile absolute numbers between two different systems, for example when comparing visits in Google Analytics with clicks as reported by Atlas (or some other ad tracking tool). The following are the key reasons why these numbers don’t match:
- The terminology used to calculate metrics usually differs slightly. For example, unique visitors must always be unique visitors within [a certain time frame]. Different vendors may use different time frames. Neither is right or wrong; they are just different. This same principle can also apply to lots of other metrics, and sometimes on a much more subtle level.
- Whilst advances are constantly being made, there are currently no agreed standards for these definitions. Analytics vendors often try to name-drop ABCe standards (at least in the UK), but these are generally considered to be outdated and were created for reporting on visits that derive from banner advertising and search, not for web analysis. Here is a good synopsis of the current state of standards.
- Tracking methodologies, such as cookies, packet sniffers and IP addresses, all collect data in different ways, and each has pros and cons. See the example below for further info on this one.
- The Internet is composed of a huge array of different technologies, which are all constantly evolving and changing. These technologies play a big part in the accuracy of data collection.
- New browser versions invariably feature new types of technology that allow increasingly savvy web users to hide their on-line behaviour, or even block this behaviour by default.
- Robots and spiders crawl Internet pages, for example to index what is in them for search engines. Data quality in web analytics is a race to keep up with these creatures!
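To make the first point above concrete, here is a minimal sketch of how the same visit log can yield two different "unique visitor" figures depending on the time frame used. The visitor names, dates and function names are hypothetical and purely illustrative; no vendor's actual calculation is implied.

```python
from datetime import date

# Hypothetical visit log: (visitor_id, day of visit). Illustrative data only.
visits = [
    ("alice", date(2024, 1, 1)),
    ("alice", date(2024, 1, 2)),
    ("bob", date(2024, 1, 2)),
    ("alice", date(2024, 1, 9)),
    ("carol", date(2024, 1, 9)),
]

def daily_uniques_summed(visits):
    """Count unique visitors per day, then add the daily figures together."""
    return len({(visitor, day) for visitor, day in visits})

def period_uniques(visits):
    """Count unique visitors across the whole period."""
    return len({visitor for visitor, _ in visits})

print(daily_uniques_summed(visits))  # 5
print(period_uniques(visits))        # 3
```

Both figures could reasonably be labelled "unique visitors"; they simply answer different questions, which is why the time frame always needs to be stated alongside the number.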
Cookies and Unique Visitors – An Example
The issue of cookies is generally the biggest area of confusion. A client of mine was recently comparing Google Analytics to their incumbent provider, Sophus3. They noticed large differences in unique visitors and wanted to understand why. Whilst this issue is in some respects the product of all the points raised above, the main cause is the type of cookie used:
With Google Analytics, visitors are tracked using 1st party cookies. Estimates suggest that around 1% of users block these cookies and a further 4% block JavaScript. GA is therefore unable to track these users, so real visitors may be under-counted by about 5%.
Sophus3, on the other hand, uses 3rd party cookies. Many browsers block these by default, so estimates suggest that around 65% of traffic is lost due to the combination of this and JavaScript blocking.
Sophus3 then uses IP addresses to track visitors who have blocked cookies. However, most broadband providers use dynamic IP addresses, which change periodically. In some cases, the IP address could change every time the person switches on their computer. As a result, Sophus3 will register individual people as multiple visitors, and overall numbers will be inflated.
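The inflation effect of falling back to IP addresses can be sketched in a few lines. This is an illustration of the general principle only, not a description of how Sophus3 or any other vendor actually identifies visitors; the hits, cookie IDs and IP addresses are made up.

```python
# Hypothetical hits: (cookie_id or None if cookies are blocked, ip_address).
hits = [
    ("c1", "10.0.0.1"),
    ("c1", "10.0.0.7"),   # same cookie, new dynamic IP: still one visitor
    (None, "10.0.0.2"),   # cookie-blocking person, first visit
    (None, "10.0.0.9"),   # the same person, after their IP changed
    (None, "10.0.0.4"),   # the same person again, another new IP
]

def count_visitors(hits):
    """Identify visitors by cookie where available, otherwise by IP address."""
    identities = set()
    for cookie_id, ip in hits:
        identities.add(("cookie", cookie_id) if cookie_id else ("ip", ip))
    return len(identities)

print(count_visitors(hits))  # 4, although only 2 real people visited
```

The cookie keeps the first person counted once despite the IP change, while the cookie-blocking person is counted as a new visitor every time their address changes.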
The following chart illustrates this issue in a more visual way (numbers are rough estimates to illustrate a point, and are not meant to be accurate):
Chart: How cookies can affect data accuracy in web analytics
Whilst 1st party cookies are generally considered in the industry to be best practice, in truth neither is perfect. For more information, here is a more detailed overview of how cookies affect web analytics data.
Get over it!
The issue of data accuracy can cripple companies and cause vast amounts of wasted time. In truth there is no solution; it is much better to:
- Understand the limitations in as much detail as possible and ensure that all recipients of web reporting and analysis are familiar with what the numbers do and don’t tell them.
- Focus on trends and segments, and not on absolute numbers. This is easy to do when the focus is on analysis and not pure reporting; insight never comes from pure numbers.
- Use confidence levels to make reasonable judgements about numbers such as unique visitors, where these are required for decision making.
- Set a consistent baseline of data, as accurate as we can get it, and then use this data to make sound trend assumptions and draw conclusions from time-series analyses.
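As a rough sketch of the "trends, not absolute numbers" point, the example below indexes two tools' weekly figures to their own baseline week. The numbers are invented for illustration, and `index_to_baseline` is a hypothetical helper, not part of any analytics product.

```python
# Hypothetical weekly unique visitors from two tools; illustrative numbers only.
tool_a = [1000, 1100, 1210, 1150]
tool_b = [1400, 1540, 1694, 1610]

def index_to_baseline(series):
    """Express each value as a percentage of the first (baseline) value."""
    baseline = series[0]
    return [round(100 * value / baseline, 1) for value in series]

print(index_to_baseline(tool_a))  # [100.0, 110.0, 121.0, 115.0]
print(index_to_baseline(tool_b))  # [100.0, 110.0, 121.0, 115.0]
```

In this made-up case the absolute numbers differ by 40%, yet both tools tell the same story once each is compared against its own consistent baseline. Real data will rarely line up this neatly, but the direction and rough size of a trend are usually far easier to agree on than the raw totals.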
This is a very interesting read. I was talking with someone just last week about the inaccuracy I see in my data. This is always more apparent on new projects when traffic is low, as you think to yourself ‘I know I found my site through Yahoo yesterday but it’s not listed in my referrers list’.
I agree completely that this is about trends, monitoring all changes you make on a site against the performance and behaviour of your traffic, experimenting with different landing pages and watching your bounce rates change.
Very good article, thanks!
Rob
Try to get over this one
Total visits
A) GA: 7,609,334
B) Coremetrics: 7,667,586
A/B = 99.24%
Direct visits
A) GA: 5,420,192
B) Coremetrics: 6,984,755
A/B = 77.60%
Search Engines visits
A) GA: 1,527,967
B) Coremetrics: 478,278
A/B = 319.47%
Referring sites visits
A) GA: 660,830
B) Coremetrics: 189,301
A/B = 349.09%
GA reports more than 3 times as many visits as Coremetrics for search engines and referring sites.
BTW, both tools are set up using 1st party cookies and are implemented using the same content in the CMS, so every tagged page is tagged with both tools. And I installed Coremetrics “first”.
Sebastien,
I have seen this quite a bit with GA, and it normally has to do with 3rd party shopping carts or cross-domain tracking. Since I do not know the domain you are referring to, I am just taking a guess. If it is this, you can easily tell: check whether the domain itself is the #1 referring domain under Referring Sites in GA. This could also be impacting Direct as well.
Best of luck!
This is an often overlooked aspect of analytics but something very close to my heart. Your data is actually only as good as the underlying accuracy of unique visitor identification. It’s like a building: if the foundations are weak (i.e. you can’t identify people), the whole structure above becomes weak (i.e. your business is at risk). So understanding these limitations is crucial.
Now that everything is starting to go mobile this issue gets even more complex. The key methods of visitor identification become flawed on mobile, whether on a standard phone, smartphone or a PC connecting via mobile broadband.
To start with, cookies are less reliable: as well as user deletion, some handsets automatically delete cookies when you close the browser or restart the phone. Also, some mobile operators block cookies; this often depends on the connection gateway used or the use of transcoders.
Secondly, the IP address is rarely that of the physical device but rather that of the operator’s gateway machine the device is connected through, so all Sprint customers and Sprint MVNO customers (like Boost) in the US come through the same range of IP addresses.
Lastly, consumers now switch connections between operator networks and WiFi (see http://bango.com/wifi), which affects the ability to track; most systems will allocate a new identity in this case.
Fortunately, unlike the PC, mobile devices come with operator identity which can be used safely and anonymously to persistently track consumers. For example, Bango Analytics (http://bango.com/analytics) uses this technique through relationships with operators around the world. It means Bango can offer much higher visitor identification accuracy than other solutions. It also works on PC sites – it’s worth trying the free trial (http://bango.com/signup) and comparing.
Ah yes… this issue is one that we must all face. It would be nice not to have to go on these wild goose chases to figure out why the data says what it does. However, how does one go about verifying the data without doing this?
Interesting the difference between ad-tracking and site tracking. The media agency says that there were 100K UVs… and the creative agency says there were only 65K of which 25K left immediately.
This occurs regularly for display, affiliates and search. The drop-off comes from mistakes, people not arriving because page loading at either the media tracking or the landing page was so slow, tags being so far down the landing page that users had clicked on before they were triggered, and so on, plus all the reasons you suggest about methodology.
I’ve seen major differences between reporting of channels when SEO, PPC, display, affiliates and direct traffic feature in the multiple visits required to buy a product (a holiday, for example). This can have a dramatic effect on how you optimize spend on each of these marketing approaches.
So one of the biggest outcomes for a planning team is that the whole business says they don’t trust the data you are using… so regularly communicating that all is well is very important.
We’re a fairly IT-led company when it comes to reporting, and we are experiencing difficulty with management buy-in of GA goal data (we’ve been running GA for 2 years now).
When someone completes a goal, e.g. website registration, what % tolerance would you say is acceptable comparing to our CRM database?
How accurate should GA be? Is GA any more/less accurate than other tools?
I’d be really interested to learn about other people’s experiences here!
Justin
Hi Justin,
Thanks for reading. All being well your GA form submissions should be reasonably similar to what you see in the back-end database, with the main exception being duplication (i.e. people might submit more than one form, but if you are deduping the database before counting then you would have lower numbers). If the figures are completely out then something is wrong I would say.
GA is as accurate as any other tool. Commercial vendors would tell you different, but any improvement would be marginal.
Hi Jonny,
Thanks for your quick response. We’ve set a GA goal on the thankyou page for people who have registered with the website.
Last month this showed a 22% difference to our CRM database.
Surely GA should perform better than 22%?
Using the JavaScript console inside Chrome, I can see that __utm.gif always takes much longer to load than all the other resources for the rest of the pages (sometimes over a second). Could it be latency? Could the user be moving beyond our thankyou/goal page before GA has a chance to register a page impression?
Justin
That sounds plausible. Do you use the WASP Firefox extension to check your tags are firing? The basic version is free; I would recommend installing it and running a load of tests to see how quickly the tag fires.
Good luck
Thank you for this discussion. I’m a neophyte in this arena and am just trying to figure out why GA and Sitemeter differ so much but this article answered that question for me. Either metric is only as accurate as the data it’s seeing or whether the data can be seen. Between the two, I’m probably somewhere in the middle.
Thank you again. I’ll put less emphasis on the UVs and concentrate on conversion.
Rob