Data quality and accuracy in web analytics is something most web analysts have no option but to learn and internalise very quickly, especially once people start asking why the numbers don’t match. However, it is easy to forget that our clients, business users and marketing teams don’t live and breathe this data as we do. This post is therefore a reminder of the essential (by no means definitive) reasons why web analytics data can’t necessarily be taken as fact.
Why are the numbers different?
Most people first recognise a problem with web analytics data because they are trying to reconcile absolute numbers between two different systems, for example when comparing visits in Google Analytics with clicks as reported by Atlas (or some other ad tracking tool). The following are the key reasons why these numbers don’t match:
- The definitions used to calculate metrics usually differ slightly between tools. For example, unique visitors are always unique within [a certain time frame], and different vendors may use different time frames. Neither is right or wrong; they are just different. The same principle applies to many other metrics, sometimes at a much more subtle level.
- Whilst advances are constantly being made, there are currently no agreed standards for these definitions. Analytics vendors often name-drop ABCe standards (at least in the UK), but these are generally considered outdated: they were created for reporting on visits from banner advertising and search, not for web analysis. Here is a good synopsis of the current state of standards.
- Tracking methodologies such as cookies, packet sniffing and IP addresses all collect data in different ways, each with its own pros and cons. See the cookies example below for more detail on this one.
- The Internet is composed of a huge array of different technologies, which are all constantly evolving and changing. These technologies play a big part in the accuracy of data collection.
- New browser versions invariably introduce features that allow increasingly savvy web users to hide their online behaviour, or that block tracking by default.
- Robots and spiders crawl web pages, for example to index their content for search engines. Data quality in web analytics is a race to keep up with these creatures!
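To make the first point above concrete, here is a minimal sketch (the visit log and visitor names are invented for illustration) of how the very same data produces different “unique visitor” counts depending on the time frame over which a vendor resets uniqueness:

```python
from datetime import date

# Hypothetical visit log: (visitor_id, visit_date) pairs.
# The same person returning in a later week appears again in the log.
visits = [
    ("alice", date(2024, 1, 1)),
    ("alice", date(2024, 1, 9)),   # returns the following week
    ("bob",   date(2024, 1, 2)),
    ("bob",   date(2024, 1, 16)),  # returns two weeks later
    ("carol", date(2024, 1, 3)),
]

def uniques_per_window(visits, days):
    """Count 'unique visitors' where uniqueness resets every `days` days."""
    seen = set()
    for visitor, d in visits:
        window = d.toordinal() // days  # which fixed window the visit falls in
        seen.add((visitor, window))
    return len(seen)

# A tool that resets uniqueness weekly reports more "uniques"
# than one that counts uniqueness across the whole month.
print(uniques_per_window(visits, days=7))   # 5 uniques with weekly windows
print(uniques_per_window(visits, days=31))  # 3 uniques across one month
```

Neither answer is wrong; each is correct for its own definition, which is exactly why two vendors’ headline numbers rarely reconcile.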
Cookies and Unique Visitors – An Example
Cookies are generally the biggest area of confusion. A client of mine was recently comparing Google Analytics with their incumbent provider, Sophus3. They noticed large differences in unique visitors and wanted to understand why. Whilst this is in some respects the product of all the points raised above, the main cause is the type of cookie used:
Sophus3 then uses IP addresses to track visitors who have blocked cookies. However, most broadband providers use dynamic IP addresses, which change periodically; in some cases the address changes every time the person switches on their computer. Sophus3 therefore registers individual people as multiple visitors, inflating the overall numbers.
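The effect is easy to simulate. The following sketch uses invented numbers (the population size and the `IP_CHANGE_PROB` rotation rate are assumptions, not measured figures) to show how an IP-based fallback inflates unique-visitor counts relative to a stable cookie ID:

```python
import random

random.seed(0)  # deterministic for the illustration

REAL_VISITORS = 100     # assumed: 100 real people
VISITS_EACH = 5         # assumed: 5 visits each
IP_CHANGE_PROB = 0.5    # assumed chance the dynamic IP rotated between visits

cookie_ids, ip_ids = set(), set()
for person in range(REAL_VISITORS):
    ip = (person, 0)  # the address the provider first handed out
    for visit in range(VISITS_EACH):
        cookie_ids.add(person)        # cookie tool: same ID every visit
        if random.random() < IP_CHANGE_PROB:
            ip = (person, visit + 1)  # provider assigned a new address
        ip_ids.add(ip)                # IP tool: new address = "new visitor"

print(len(cookie_ids))  # matches the real population
print(len(ip_ids))      # inflated, roughly 3x here, by IP churn
```

With these assumptions the IP-based count comes out around three times the real population, which is the kind of gap that prompts questions like my client’s.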
The following chart illustrates this issue in a more visual way (numbers are rough estimates to illustrate a point, and are not meant to be accurate):
Whilst first-party cookies are generally considered best practice in the industry, in truth neither approach is perfect. For more information, here is a more detailed overview of how cookies affect web analytics data.
Get over it!
The issue of data accuracy can cripple companies and waste vast amounts of time. In truth there is no solution; it is much better to:
- Understand the limitations in as much detail as possible and ensure that all recipients of web reporting and analysis are familiar with what the numbers do and don’t tell them.
- Focus on trends and segments, not on absolute numbers. This is easy to do when the focus is on analysis rather than pure reporting; insight never comes from raw numbers alone.
- Where numbers such as unique visitors are required for decision making, use confidence levels to make reasonable judgements about those numbers.
- Set a consistent baseline of data, at the most accurate level we can get it, and use that data to make sound trend assumptions and draw conclusions from time-series analyses.
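The trends-over-absolutes point can be illustrated with a small sketch. The weekly figures below are made up, but they capture a common situation: two tools disagree on absolute visits yet report the same week-over-week growth, so either can support the same decision:

```python
# Hypothetical weekly visit counts from two tools that disagree on
# absolute numbers (different cookie handling, bot filtering, etc.)
tool_a = [1000, 1100, 1210, 1331]
tool_b = [1400, 1540, 1694, 1863]  # consistently higher, same pattern

def week_over_week_growth(series):
    """Relative change between consecutive weeks, rounded to 2 d.p."""
    return [round((b - a) / a, 2) for a, b in zip(series, series[1:])]

print(week_over_week_growth(tool_a))  # [0.1, 0.1, 0.1]
print(week_over_week_growth(tool_b))  # [0.1, 0.1, 0.1]
```

Both tools agree that traffic is growing about 10% a week, even though their absolute counts differ by hundreds of visits; that shared trend is the insight worth reporting.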