Overview

Web analytics solutions today identify users in many ways. From IP/User Agent heuristics to requiring registration and authentication, each method has its own set of issues, ranging from being completely imprecise to identifying the user far too personally. So it is no wonder that most of the analytics community has gravitated to something in the middle: the browser cookie.

Cookies, with their little pieces of text set by and returned to the web server, have become the best means for discreet, secure and fairly anonymous user tracking. So when Jupiter Research released a report in the spring of 2005 stating that as many as 39% of internet users claim to delete their cookies monthly, it caused quite a stir in the industry. To an outsider looking in, this reaction looked a lot like the five stages of grief.

First there were a number of counter-reports aimed at denying Jupiter’s results. Then there was anger, which manifested itself in attacks on Jupiter’s research practices. Bargaining came in the discussion of site-specific variability. Depression ensued with the realization that there weren’t a lot of options evolving for dealing with the issue. Finally, acceptance became apparent with new reports that confirmed Jupiter’s results and a general dying down of the discussion (except for when it resurfaces every so often in the Yahoo Web Analytics Group).

The use of the browser cache described in this paper is not aimed at completely solving the issue. Nor is it aimed at replacing the use of cookies. This paper is simply the discussion of one approach for identifying, measuring and trending what users do to their cookies. It is an attempt at improving accuracy and also presents pieces of methodology that could be applied to other forms of multiple ID user identification to further expand knowledge of user behavior.

The Problem

As with any web analytics effort, the best place to start is by defining the problem or business question that needs to be answered. This definition works best when it is concise and specifically laid out in a couple of sentences. There is room for a little open-endedness, and the question may be refined along the way. But keep in mind that it is very difficult to reach a meaningful destination without some notion of the journey.

For this particular exercise the question is defined as follows:

We know a percentage of users are defeating cookie-based identification through periodically deleting cookies or blocking them in the first place. Jupiter’s report, and others since then, have told us as much. What can be done to identify users more reliably? If there isn’t any way to increase reliability, how might the deletion and blocking be identified, measured and quantified through analysis?

As you might guess, there are actually several possible answers to the problem above. It should be reiterated that, from here on out, this paper describes only one approach, and a technical one at that. There are certainly a number of other ways to enhance or replace this method with behavioral, mathematical or other approaches.

The Method

By default (without additional plugins or extensions), web browsers store server-specific information on the local disk in two ways: cookies and the cache. Both methods, and the time-to-live of the content stored, are controlled by response headers passed from the web server. Cookies are meant to be used for securely maintaining state information between the client and the server. The cache, however, is simply meant to optimize performance by reducing the number of times the client has to request the same content when it is used many times within a page or across many pages.

Because the intended use of the cache is relatively insecure, it is not a good place to store a lot of state-related data. It is, however, a good place to implement a simple combined cookie and cache system of duplicate IDs. In this system both client-side and server-side programming are used to reset each ID when the other is missing, to customize cache headers so that the tag lives as long as possible, and to establish parameters for initiating tracking requests that measure when the deletion of either has occurred.
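As a rough illustration of the server side of such a system, the sketch below builds a cacheable script that carries the ID, along with long-lived cache headers. It is written in Python and is not the implementation behind this paper: the function name build_cached_tag, the one-year lifetime, the JavaScript variable names cuid and cuidCreated, and the particular headers chosen are all assumptions made for the example.

```python
import json
import uuid
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Assumed lifetime for the cached tag; the source only says "as long as possible".
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def build_cached_tag(inbound_cookie_id=None, now=None):
    """Return (headers, body) for the cacheable script that carries the ID.

    inbound_cookie_id is the ID found in the request's cookie, or None.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # Reuse the cookie's ID when one arrived; otherwise mint a new one.
    cuid = inbound_cookie_id or uuid.uuid4().hex
    headers = {
        "Content-Type": "application/javascript",
        # "private" keeps shared proxies from handing one user's ID to another.
        "Cache-Control": f"private, max-age={ONE_YEAR_SECONDS}",
        "Expires": format_datetime(
            now + timedelta(seconds=ONE_YEAR_SECONDS), usegmt=True
        ),
        "Last-Modified": format_datetime(now, usegmt=True),
    }
    # json.dumps quotes and escapes the ID so it is a valid JavaScript string.
    body = (
        f"var cuid = {json.dumps(cuid)};\n"
        f"var cuidCreated = {int(now.timestamp())};\n"
    )
    return headers, body
```

Marking the response as private is a deliberate choice in this sketch: a shared cache that stored the script would serve the same ID to every user behind it.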

The system works as a circle. When generating the cached JavaScript file, the server looks for an inbound cookie to use as the ID in the JavaScript. If one doesn’t exist, the server creates a new ID. When executing the JavaScript, the client looks for the cookie and compares it to the cache ID. If the cookie doesn’t exist, the JavaScript sets the cookie and initiates a tracking request to record the event. If the cookie is present but has a different ID value, the script reports that an anomaly has occurred.
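The client-side half of the circle comes down to a three-way decision. The sketch below models that decision in Python so it can be read and tested in isolation; in practice it would run as JavaScript in the browser. The names Outcome and reconcile and the event labels cookie_restored and id_mismatch are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    cookie_id: str         # value the cookie holds after the script has run
    event: Optional[str]   # tracking event to report, or None if nothing to report


def reconcile(cache_id: str, cookie_id: Optional[str]) -> Outcome:
    """Compare the ID baked into the cached script with the cookie's ID."""
    if cookie_id is None:
        # Cookie missing: set it from the cache ID and record the event.
        return Outcome(cookie_id=cache_id, event="cookie_restored")
    if cookie_id != cache_id:
        # Both present but different: report the anomaly, leave the cookie alone.
        return Outcome(cookie_id=cookie_id, event="id_mismatch")
    # Both present and equal: nothing to do.
    return Outcome(cookie_id=cookie_id, event=None)
```

For example, reconcile("abc", None) yields a cookie set to "abc" along with a cookie_restored event, while reconcile("abc", "abc") reports nothing.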

Throughout the process, a few other variables are used in both the cookie and the cache to record additional information, such as the independent ages of the cookie and the cache, the number of times the client-side script has run, and flags for key events such as the creation of both IDs rather than the regeneration of one of them.

One additional suggestion for the initial implementation: it is always a good idea to start off by implementing the CUID to Cookie ID system off to the side, separate from the analytics solution’s method of setting unique IDs. There are a couple of good reasons for this. First, nothing ever launches as planned, and it is better not to risk breaking your current system of identification. Second, your Visitor numbers should change when using the new ID system. To explain the change relative to historical data, it is better to run the two systems side by side for a while than to directly replace the existing one at the start.

The Data

Once the programs are written and the scripts implemented, there should be a wealth of data flowing in from the new identification method. The first inclination may be to take the raw numbers from both the old and the new, calculate the percentage difference and call that the cumulative impact on Visitor numbers of cookie blocking and deletion. Be wary of this approach. The truth is that the browser’s cache as an ID mechanism is just as defeatable through browser settings and user choice as cookies are.

The key to really understanding the impact is in the behaviors. Look for the key segments of users by frequency of deletion events and by the ages of the cache versus the cookie. Examine other behavioral similarities in how they interact with the site. Combine these segments with other captured user characteristics such as technographic, geographic, demographic and psychographic data. Compare and trend the size and characteristics of these segments against each other as well as against the whole.

For other trendable indicators of change, be aware that the cookie and the cache are both more likely to be accurate over shorter time segments. If you do use the percentage difference between multiple ID visitor numbers, do so over a shorter time period such as a day and trend that daily number over multiple days. Other indicators of change you can use to separate deletion from blocking are the daily rate of deletions per visitor, the daily percentage of visitors that deleted a cookie, and the daily percentage of visitors that didn’t allow the page script to set the cookie.
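The three daily indicators can be computed from a simple log of tracking events. The following is a minimal sketch in Python, assuming each event arrives as a (day, cache ID, event label) tuple and that visitors are counted by cache ID; the function name daily_indicators and the labels ok, cookie_restored and cookie_blocked are hypothetical and not part of any particular analytics product.

```python
from collections import defaultdict


def daily_indicators(events):
    """Compute per-day deletion and blocking indicators.

    events: iterable of (day, cache_id, event) tuples, where event is one of
    'ok', 'cookie_restored' (cookie was missing and got reset) or
    'cookie_blocked' (the script could not set the cookie).
    """
    per_day = defaultdict(
        lambda: {"visitors": set(), "deleters": set(), "blockers": set(), "deletions": 0}
    )
    for day, cache_id, event in events:
        bucket = per_day[day]
        bucket["visitors"].add(cache_id)
        if event == "cookie_restored":
            bucket["deletions"] += 1
            bucket["deleters"].add(cache_id)
        elif event == "cookie_blocked":
            bucket["blockers"].add(cache_id)

    result = {}
    for day, bucket in sorted(per_day.items()):
        visitors = len(bucket["visitors"])  # at least 1 for any day present
        result[day] = {
            "deletions_per_visitor": bucket["deletions"] / visitors,
            "pct_visitors_deleting": 100 * len(bucket["deleters"]) / visitors,
            "pct_visitors_blocking": 100 * len(bucket["blockers"]) / visitors,
        }
    return result
```

Because each day is computed independently, the output can be trended over multiple days without relying on either ID surviving over a long period.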

Extending the Method

The CUID is only one way of defining multiple IDs that can be compared. For instance, a site that implements even partial registration and authentication to participate at deeper levels can be analyzed in a similar fashion and add to the overall understanding. Be aware, though, that a select segment of users who choose to be more engaged and do something such as registering is really more descriptive of that segment than of the total population.

Another way to apply the cache ID is to use it to identify the effect of third-party cookie blocking. Because browsers don’t yet govern the cache via P3P, privacy settings and domain-level security controls, it is an excellent method for identifying this behavior. Be selective and careful in the use of this tool to avoid violating your own privacy policies.

5 Responses to “Measuring Cookie Deletion and blocking with the browser’s cache”


  1. Ian Thomas, Apr 18th, 2007 at 3:46 pm

    Ian,

    This is a great post, though as I read it I got that sinking feeling you get when you think you have a great topic for a post and realize that someone else has already written it.

    Any thoughts on how you would go about implementing privacy policy and opt-out control for something like this? It’s relatively easy to delete a cookie in JavaScript; rather more tricky to do so in a cache file.

    Ian

  2. Fulton, Apr 24th, 2007 at 4:38 am

    Try running this on a lab box and then measure the effects on the browser cache and cookie deletion:

    http://www.f-secure.com/home_user/products_a-z/fsis2007.html

    Free “download a trial” towards the bottom on right hand side. I am certain you will get a variation on the numbers.

    /Fulton

  3. Todd, Apr 25th, 2007 at 10:12 am

    Thank you for the great article. If anyone is interested, I also found a demo of this in action at:

    http://www.mukund.org/files/archive/2006/09/14/tracking-using-cache.html

  1. Pingback: As the cookie crumbles, how might you measure the pieces at visioactive, Apr 16th, 2007 at 2:47 pm
  2. Pingback: Web Analytics Demystified » Blog Archive » Ian Houston publishes very interesting cookie deletion data of his own, Apr 23rd, 2007 at 11:18 am
