
Accuracy and User Identification

Jeremiah Owyang posted some great questions yesterday about the accuracy of User measurement and cultural reasons why “User” measurement may not be accurate in certain countries such as China and India. The most interesting part of the post (IMHO) is that both the post and the responses seem to focus on personal perceptions for why User measurement may not be accurate in a specific locale instead of focusing on whether User numbers are accurate at all. The underlying assumption appears to be that since they are not accurate for a particular country they would be accurate for other countries?

Whether or not User measurement is more or less accurate because of cultural differences is interesting but the answer is completely dependent on how you define and identify “Users”. There are issues with all forms of user identification regardless of cultural and socioeconomic differences around the globe.

For instance, take the use of just an identity cookie as the sole form of identification (the most common approach today by far). Cookies do not have a one to one relationship with a “computer” (a single physical machine). They are stored per browser, per user account and per virtual machine (where they are employed). There are several ways they are skewed in both directions from the reality. Take these scenarios for instance:

More “Users” than identity cookies.

  • Many users grouped around one machine viewing the content at the same time (per your example). This may happen more frequently in some cultures but I would ask you to also consider how many times you may have only seen a site over a coworker’s shoulder.
  • Many users using the same computer such as a public machine at a library or an internet cafe and a central family computer.

More id cookies than “Users”.

  • How about a single user who uses both Firefox and IE (or Firefox and Safari) on the same sites? More cookies than users.
  • Someone with a work computer and a home computer and more than one browser, account, and/or virtual machine on either or both.

Add to the mix above the issue of cookie deletion!
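To make the two lists above concrete, here is a minimal sketch in Python. The people, machines and browsers are made up for illustration; the only rule taken from the discussion above is that a cookie is stored per machine, per user account and per browser.

```python
from collections import defaultdict

# Hypothetical visit records: (person, machine, os_account, browser).
visits = [
    ("alice", "work-pc", "alice", "firefox"),
    ("alice", "work-pc", "alice", "ie"),
    ("alice", "home-pc", "family", "firefox"),
    ("bob", "home-pc", "family", "firefox"),
    ("carol", "library-pc", "guest", "ie"),
    ("dave", "library-pc", "guest", "ie"),
]


def cookie_jar(visit):
    """Return the place a cookie lives: machine + user account + browser."""
    _person, machine, account, browser = visit
    return (machine, account, browser)


people = {v[0] for v in visits}
jars = {cookie_jar(v) for v in visits}

# Map each cookie jar to the people behind it, and each person to their jars.
people_per_jar = defaultdict(set)
jars_per_person = defaultdict(set)
for v in visits:
    people_per_jar[cookie_jar(v)].add(v[0])
    jars_per_person[v[0]].add(cookie_jar(v))
```

In this made-up sample the totals happen to match, yet no single cookie jar maps cleanly to one person: one person owns three jars while two jars are each shared by two people. Matching totals say nothing about whether the identification underneath is accurate.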

This raises perhaps the perfect example of why it is important as an analyst to understand the underlying data, how that translates into the metrics, and the goals of the analysis. For all of the issues I have listed above and more, some might complain that this is one of those areas where there is a problem with data accuracy. However, unless we move to some sort of Orwellian scenario where every computer in the world includes a proximity sensor that reads data from implanted chips and transmits identities in every request, there is no 100% solution for identifying users. There are, however, smarter ways to use what you have in the analysis.

Here are a number of points to consider when working on how to best utilize the user tracking of your implementation:

Manage your Implementation

Don’t just install an analytics package and expect to know what every metric means. Work with your vendor and/or your IT folks to make sure you understand how your data is collected. For cookie based tracking this may even include identifying how well your site follows these best practices:

  • Count accepted cookies, not attempted cookies. A number of analytics systems will log an id whether it was returned from the user’s browser or not. Unless the cookie has been accepted you can’t accurately use the id as a measure of a successful identification.
  • Use First Party Cookies instead of third wherever possible. Whether or not you believe the recent comScore report on cookie deletion, it is undeniable that today’s browsers are becoming increasingly tougher on third party cookies with each release.
  • Set a proper P3P header. It is equally undeniable that the most popular browser (IE) has become more reliant on seeing an acceptable privacy policy from the server in order to accept any cookie.
  • Don’t overuse cookies. This applies to your whole site. Cookie storage in the browsers is finite. They are limited both by site and in total. If you are using too many of them you will create a different form of cookie deletion for yourself and possibly the other sites your users are interested in.
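The first point in the list above, accepted versus attempted cookies, can be sketched in a few lines of Python. This assumes a hypothetical log format where each request records the id the browser sent back (if any) and the id the server tried to set (if any); real analytics packages will differ.

```python
def split_attempted_and_accepted(requests):
    """Separate ids the server tried to set from ids a browser sent back.

    Each request is a dict with two hypothetical fields:
      'sent_id'   : id found in the request's Cookie header, or None
      'issued_id' : id the server set in the response, or None
    An id only counts as accepted once some later request presents it.
    """
    attempted = set()
    accepted = set()
    for request in requests:
        if request["issued_id"] is not None:
            attempted.add(request["issued_id"])
        if request["sent_id"] is not None:
            accepted.add(request["sent_id"])
    return attempted, accepted


# One browser that accepts its cookie, one that blocks every attempt.
sample_log = [
    {"sent_id": None, "issued_id": "a1"},
    {"sent_id": "a1", "issued_id": None},
    {"sent_id": None, "issued_id": "b1"},
    {"sent_id": None, "issued_id": "b2"},
    {"sent_id": None, "issued_id": "b3"},
]
attempted_ids, accepted_ids = split_attempted_and_accepted(sample_log)
```

With this sample, counting attempted ids would report four "users" while only one id was ever returned by a browser.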

Work with What You Know

Beyond understanding the implementation, you also need to know how the tool arrives at the metrics and what they really tell you about your audience. For the standard cookie based implementation this can also be broken down into a set of best practices:

  • Be aware of what you are measuring. A cookie only has a one to one relationship with a specific instance of a browser and user account on a machine. To call them “Visitors” or “Users” is widely accepted but the reality is that you are measuring unique client instances.
  • Focus on the Known instead of the Possible. If a user id shows up in your traffic for multiple sessions across multiple days, weeks or months, that is a known return of a unique client instance. You can say for certain that it occurred, segment on it and analyze the behavior. If a new cookie id is seen in your traffic you know it is possible that it was the first time a user came to the site but there are several other possibilities. This doesn’t mean that you should ignore the new users metrics but you do need to be aware of this difference to tell a more accurate story with your analysis.
  • Trend on shorter time periods. With the exception of a group of people huddling around the glow of a single computer screen, most methods of inflation or deflation are more likely to occur over time. For this reason a daily count is more accurate than a weekly one which is more accurate than a monthly number. When measuring things like growth trends, consider using rolling averages of shorter time segments.
  • Embrace the Value of Session data. The session is not only a smaller segment of time but there are also methods (such as click stream data) by which it can be at least partially validated.
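As a rough illustration of "the Known instead of the Possible", the sketch below separates ids seen on more than one day (known returns) from ids seen on a single day (possibly new, possibly something else). The input format and function name are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def classify_ids(hits):
    """Split cookie ids into known returners and single-day ids.

    hits: iterable of (cookie_id, day) pairs, where day is a datetime.date.
    Returns (known_returning, single_day) as two sets of ids.
    """
    days_seen = defaultdict(set)
    for cookie_id, day in hits:
        days_seen[cookie_id].add(day)
    known_returning = {cid for cid, days in days_seen.items() if len(days) > 1}
    single_day = set(days_seen) - known_returning
    return known_returning, single_day


sample_hits = [
    ("c1", date(2007, 5, 1)),
    ("c1", date(2007, 5, 1)),  # second hit on the same day is not a return
    ("c1", date(2007, 5, 8)),
    ("c2", date(2007, 5, 3)),
]
returning, single = classify_ids(sample_hits)
```

The "returning" set is something you can state with certainty and segment on. The "single" set is only a candidate list of new users, since a deleted or blocked cookie looks exactly the same.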

Learn More Where You Can

Along with managing and understanding your User identification, you should also work with your site to find possibilities for alternative data that would refine what you know about your User numbers.

  • Identify multiple identification situations that will at least give you a better understanding of how your primary method performs within a segment of your audience. Even options such as an optional site registration and login that only a part of your audience may participate in will give you knowledge through the comparison of 2 IDs.
  • Implement alternative forms of identification in other types of web clients such as email and blog readers. Capturing both IDs in the click through events between the other clients and the browser will give you 2 numbers that can help you identify inflation of your primary tracking for a higher value segment of your audience (the ones that are responding to your marketing).
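A comparison of 2 IDs like the ones described above might look something like this. It is a sketch under the assumption that logged-in requests carry both a registration id and a cookie id; the field layout and function name are hypothetical.

```python
def cookie_inflation(events):
    """Estimate cookie ids per registered user within the logged-in segment.

    events: iterable of (login_id, cookie_id) pairs from logged-in requests.
    Returns distinct cookie ids divided by distinct login ids, or None when
    there are no events to compare.
    """
    logins = set()
    cookies = set()
    for login_id, cookie_id in events:
        logins.add(login_id)
        cookies.add(cookie_id)
    if not logins:
        return None
    return len(cookies) / len(logins)


sample_events = [
    ("u1", "c1"),  # u1 at work
    ("u1", "c2"),  # u1 at home
    ("u1", "c3"),  # u1 after deleting cookies
    ("u2", "c4"),  # u2 and u3 share one family browser
    ("u3", "c4"),
]
ratio = cookie_inflation(sample_events)
```

A ratio above 1 suggests the cookie count overstates this segment, and a ratio below 1 suggests shared browsers dominate. Either way it only describes the part of the audience that registers.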

In the end, I believe the golden rule for all of the above is to simply understand what you are analyzing and the answers available from your data AND your analytics tool together. That is the best way to be more accurate in your User measurement and reporting regardless of where on the globe those users are coming from.

-Ian

NOTE: A majority of the content in this post comes from my second presentation at Emetrics last month. Gary Angel wrote a review of the presentation here.


Back from traveling - Emetrics wrap up

I’m finally back in the office after enjoying a wonderful 2 weeks in California at Emetrics in San Francisco and the Visual Sciences User conference in San Diego. Now after spending a few days catching up on all the reading I missed I’m finally getting back to blogging. Sorry for the delay!

There have been a lot of great posts on Emetrics in the past couple of weeks. There were a number of good presentations and big announcements. I don’t want to duplicate coverage too much this far after the fact but I do want to mention 2 of my favorite presentations and my biggest personal take away from the conference.

Favorite Presentations

My favorite presentation of the conference was Jodi McDermot’s (of InPhonic) discussion on optimizing landing pages and display ads. She gave some very practical examples about how she uses the power of real-time analytics to get the most out of their display advertising when the ads are running instead of just doing performance review after the fact.

Through evaluating and comparing ads as they are running and optimizing accordingly, Jodi has achieved what many are still struggling with, a real dollars and cents ROI for InPhonic’s analytics investment. This is a point that I feel sometimes gets lost in a lot of the analytics discussions. Great analysis and better understanding of the consumer are important but at the end of the day, how much of what you do can be attributed to improving the bottom line?

If you are looking for advice or ideas for how to make a similar impact on your organization I would highly recommend checking out her presentation if you have the opportunity to see her give it again in the future.

My next favorite presentation of the conference was by Joseph Carrabis of NextStage Evolution. His perspective as a cultural anthropologist, while not necessarily focused on web analytics, contained a wealth of insight about how we humans perceive things.

From learning the percentage of communication that is nonverbal to understanding the difference presenting my left cheek can have on an interaction and much more, I went away from his talk with a list of things to think about not only for web analysis and marketing but for life in general.

I highly recommend checking out their site as well as his upcoming book.

Biggest Take Away

As a blogger and a consultant, my biggest take away of the conference was the need for more practical examples of how to do web analysis. I know Robbin has been calling out for help in generating more “newbie” content for a while now. Gary Angel also made similar comments on the percentage of practitioner examples. Beyond them, the desire for more down on the ground, get your hands dirty, examples of analysis was definitely a prevailing theme with a number of the practitioners I spoke with. That is something I am going to do my best to keep in mind as I go back through my ever growing list of things to blog about.

Bloggers Round Table Video

If you haven’t checked it out yet, Rene also posted the videos of one of the bloggers tables from the birds of a feather lunch. (I’m the one sitting to the left of Eric Peterson)

-Ian


Joining the team at Visual Sciences

I’ve been working with the great team at Visual Sciences for over 2 years as the Senior Web Analyst and Architect for one of their customers. When the opportunity presented itself to join them and help other companies get the most out of the most powerful and flexible analysis tool I’ve used to date, it wasn’t a difficult decision for me to make.

So it is with great pleasure that I announce joining Visual Sciences as a member of their consulting team. I am very excited about this opportunity and getting to work a lot closer with a truly awesome group of people.

I’d also like to say that Visioactive is not going away in this transition. I will still be blogging here and staying active in the community.

-Ian


Comparing comScore to Collected Data

There has been a lot of great discussion about comScore’s recent press release in the yahoo group and the blogs this past week. Two posts that particularly caught my eye were the responses from comScore to questions by Eric Peterson about their methodology, along with the promise that they will be releasing more information soon, and Marshall Sponder’s post where he looks at what reducing the measured Visitor numbers by a factor of 2.5 could mean to the numbers published by some of the bigger online retail sites.

Checking Their Method

I actually talked to Eric about the questions he sent on to them beforehand. I must say they pretty much addressed my major concerns about their data collection although I will look to their upcoming write up before making any final judgments.

Also, there is still another piece of this that is sticking in my mind: they gave the participants privacy tools. Understanding those tools and how they might have contributed to the results is next on my list of things to question for this. comScore’s numbers have always been a bit lower than cookie based numbers. Have they always given their participants privacy tools? I don’t necessarily think that this will account for a huge amount of the factor but questioning whether their panel is 2.5 times more likely to delete their 1st party cookies is certainly something for consideration.

Overstating the Factor

According to the press release:

Frequent Cookie Deletion by 3 out of 10 U.S. Internet Users Leads to Overstatements in Audience Sizes by a Factor as High as 2.5

So what if the method holds up and the factor becomes more or less accepted?

Marshall made this comment in his post after making a little before and after comparison of the numbers published for a few online retailers:

The corrected numbers are much more believable and feel right - but I don’t think anyone who sells something on the web and puts a value on the number of visitors they’re getting will be in any hurry to divide their web analytics sanctified Uniques by 2.5.

I have two problems with the comment. First, it assumes that the numbers he started with were based on cookie id driven log analysis when in fact they appear to have come from Nielsen, which is a panel based method similar to comScore. Second, it assumes that the 2.5 factor has been accepted.

I’m including this because I do agree that anyone that has been using cookie based numbers for their public figures is certainly in for an adjustment. It also highlights the importance of knowing where the numbers come from. All methods of finding user numbers are estimations in the end and the panel based numbers are certainly not flawless. See my data below for an example of how variable comScore’s numbers can be when compared to cookie based data. When it comes to the trends and the fluctuations in types of users, neither is 100% accurate. Knowing your data and how it came to be is key to really understanding your results.

Investigating the factor

So what is so magical about the 2.5 number? I just happen to work with a number of geographically focused news publisher sites in the US who are also comScore customers. So I decided to go to the best tool at my disposal for looking into this question, the data.

What I did was take the numbers from 6 of the sites for 2 months and compare them to the comScore numbers. The result was interestingly supportive of comScore’s report:

comScore comparison summary

Or is it? To be clear, until the method is fully vetted, don’t take this as conclusive evidence of comScore’s findings. It also could end up that the factor simply explains the difference between our analytics tools and comScore. One thing this method shows is that despite the tendency toward the 2.5 factor in the average, there is certainly a bit of variability in the numbers for any given site/month comparison. Here is the raw data:

comScore comparison data

The one part of this that really troubles me is the size of the factor for the daily averages per month. With cookies tending to be more accurate over shorter time periods, I was expecting the difference there to be far less than I found. The notion that the daily numbers could still be twice as high as the reality is a little disturbing.
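For anyone who wants to run the same kind of comparison on their own numbers, the arithmetic is simple enough to sketch. The figures below are hypothetical placeholders, not the data behind the images above, and the site and function names are made up for illustration.

```python
from statistics import mean

# HYPOTHETICAL placeholder figures, not the data from this post.
# (site, month) -> (cookie_based_uniques, panel_based_uniques)
observations = {
    ("site_a", "month_1"): (300_000, 100_000),
    ("site_a", "month_2"): (240_000, 120_000),
    ("site_b", "month_1"): (500_000, 200_000),
    ("site_b", "month_2"): (450_000, 180_000),
}


def overstatement_factors(obs):
    """Return the cookie-to-panel ratio for every site/month pair."""
    return {key: cookie / panel for key, (cookie, panel) in obs.items()}


factors = overstatement_factors(observations)
average_factor = mean(factors.values())
# The range shows how far individual site/months stray from the average.
factor_range = (min(factors.values()), max(factors.values()))
```

Reporting the range next to the average keeps the variability for any single site/month visible instead of hiding it behind one headline factor.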

Anyone else willing to share some data to widen the sample for this comparison?

I have provided some additional notes below about this research.

-Ian

Notes, Methodology, Disclaimer

Here is some more information about myself and my findings in order to be as transparent in this as possible:

  • Cookie based data is derived from an implementation of Visual Sciences designed and maintained by myself over the past 2 years. The implementation uses a hybrid data collection model in order to see the cookies that have been accepted by the browsers on the first page view to the site. A side effect of the hybrid model is that the cookies attempted are inflated by almost a 3 to 1 ratio for those that block them. The cookies used in this implementation are almost always 1st party but there is a small number of pages for each site that employ 3rd party cookies.
  • Although the site and months are masked to provide some anonymity, the results are based on 2 side by side months per site within the past 3 months.
  • All sites used in this example use the comScore number as their published number. Cookie based data is used solely for trending and performance analysis. Cookie based data is also used to help understand the possible inaccuracies in the panel based numbers.
  • Even though I have conducted this work as a practitioner/consultant, I will be announcing in my next couple of blog posts that I am joining the team at Visual Sciences on May 1st.

comScore calls to question the use of User numbers from cookies

I spent some time last night looking over the comScore press release again and thinking about the issues with identifying users by cookies. The first thing that pops into my mind is this question:

Is it really that much of a surprise?

The trend in technology (browsers and privacy tools) has been to make it easier to delete all cookies rather than try to distinguish the “good” from the “bad”. Picking out cookies to save involves work on the part of the user. Why would the user want to do that work when deleting anything that could be used for tracking has become an easy to find 2 click process in the latest versions of IE and Firefox?

When it comes to the technology, there are many reasons beyond deletion why cookies don’t live long term. These include blocking in the first place, limits on the number of cookies the browsers will store (per domain and total), and loss during a computer upgrade. There are even more reasons why cookies don’t equal real people, such as multiple computers, multiple browsers on the same computer and multiple users of the same browser.

So if cookies aren’t an accurate method of measuring users, why do we use them for this? The simple answer is they are the best and easiest method to implement we have (that also allows for the user to maintain a decent amount of control over their privacy). Even though cookies have their issues (and always have), they still can be used for some visitor level analysis. Here are some top level things to bear in mind when doing so:

Cookies don’t equal Users.

With all of the reasons for a single person to have many cookies and many people to have a single cookie, it is really more appropriate to call cookie based user identification an estimate or a best guess. With the exception of requiring registration and authentication to use a site, there really isn’t anything much better to use. When looking at totals you have to keep this in mind but it doesn’t mean cookies aren’t valuable for use this way.

Cookies are most accurate over shorter time periods.

One way to compensate for the inaccuracies of cookies is to consider how you use the numbers. The longer a cookie is stored in the browser the more likely it is to be removed. Counts by day are more accurate than counts by week which are in turn more accurate than a month. If you are looking at using the numbers to show growth consider trending averages of shorter time periods. For example, take a rolling 52 Week Average Users per Week across a few reporting cycles and you end up with a trend that illustrates audience growth more accurately than monthly totals and accounts for seasonality at the same time.
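The rolling average described above could be computed along these lines. This is a minimal sketch; the function name is an assumption, and the weekly counts would come from whatever your analytics tool reports as Users per Week.

```python
def rolling_average(weekly_users, window=52):
    """Return the trailing mean of weekly user counts.

    weekly_users: list of counts, oldest week first.
    window: number of weeks per average (52 covers a full year, which
            smooths out seasonality).
    One value is produced for each week that has a full window behind it.
    """
    if window <= 0:
        raise ValueError("window must be a positive number of weeks")
    averages = []
    running_total = 0
    for index, count in enumerate(weekly_users):
        running_total += count
        if index >= window:
            # Drop the week that just fell out of the window.
            running_total -= weekly_users[index - window]
        if index >= window - 1:
            averages.append(running_total / window)
    return averages
```

For example, `rolling_average([1, 2, 3, 4], window=2)` returns `[1.5, 2.5, 3.5]`. With the default 52 week window you need at least a year of weekly counts before the first value appears.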

Sessions are relatively unaffected

I am still waiting to see the survey results for when the cookie deleters remove their cookies but, just as the above section states, sessions, being smaller than a day, are even more accurate. Any analysis you do based on sessions is relatively unaffected by the issue of cookies being deleted. There is a lot of good analysis that can be done on this level. Consider this as an option for items you might be wary of measuring over days, weeks, months, etc.

The accuracy of cookies can be improved but will never reach 100%

There are many ways you can approach better accuracy through multiple ID comparison. My Projects and Papers section describes a couple based on the browser’s cache and flash. One I have not documented yet is the use of cross client activity with distinct identity methods such as clicks from a feed reader to a web site. Also, even partial registration can be used as well to improve understanding of what is probably your more valuable audience segment (those who are engaged enough to register).

Nothing is 100% accurate.

The one last thing to remember about the comScore press release is that it is to a large degree a piece of marketing material. For every one of the points above I can give similar examples of cause for inaccuracy in panel based measurement. Both methods are valuable for audience analysis, the trick is to understand that they aren’t perfect and know how to use the numbers.

-Ian


As the cookie crumbles, how might you measure the pieces

comScore announced today the release of a new cookie deletion study. The title of the press release nearly says it all, “Cookie-Based Counting Overstates Size of Web Site Audiences”. The results are interesting but are not very surprising. In fact, they appear to coincide nicely with the Jupiter Research study done in 2005.

Since the first reports of this issue started surfacing a couple of years ago, I have done some personal research on methods for measuring both cookie deletion and cookie blocking. Rather than write a couple of really long posts I have put up a couple of pages discussing using the browser’s cache and Flash local shared objects. The latter of the two was also published in Web Site Measurement Hacks, by Eric T. Peterson. You can find them under the Projects and Papers section of my site.

-Ian


Jeremiah’s World, Party On, Excellent!

I was catching up on my blog reading late Saturday night when I stumbled on Jeremiah Owyang’s post about testing a new video site called UStream at the Web 2.0 Expo. My timing was incredible because I clicked over to UStream.tv to check it out and found a LIVE show with Jeremiah talking to Kris Tate of Zoomr.

It was a pretty interesting experience. Jeremiah was on camera talking to Kris on the phone plus a dozen or more of us looking in and commenting via text chat. Before long Kris was simulcasting from Zoomr HQ and taking over the show for a bit after Jeremiah went to bed.

The whole thing reminded me a lot of cable access call in shows I participated in while in college in the early 90s. Now, with UStream being the method of transmission, anyone with a web cam and a broadband connection is able to set up their own “broadcast studio”. It will be interesting to see how this latest iteration of user generated video takes shape. For my part, I am currently pondering how I would quantify it analytically.

You can check out Jeremiah and his stream from the expo here.

-Ian


It is not about Time.

I just read a great post by Stéphane Hamel about the notion of Time being tied to Attention (or rather not being tied).

He did a nice job summing up many of my own similar thoughts on Compete.com’s notion of Attention.

The only point I disagree on is the parting thought that the use of time is a matter of finding the right equation. I don’t think the equation is the problem when it comes to time and web analytics. I think the problem with time in web analytics is the limit to what we can measure about it. This whole discussion reminds me of the phrase “quality, not quantity”.

Take two users that read the same article on a web site and in almost all ways are equally satisfied that they received the content or experience they were looking for. Say user A is a speed reader and was completely focused on reading with no environmental distractions. Now say user B was an average reader, got a call from the boss in the middle of reading the story, perhaps decided to reread a portion of the article when getting back to it but completed it and was satisfied with that conclusion.

Since we are unable to measure the other factors related to time, both those internal to the user and those external in the user’s environment, what does the fact that user A spent a tenth of the time on the page tell us? Not all that much.

Even if we could take all of these other pieces of information into account I would still ask what does it matter. Isn’t the important bit that the user was satisfied with the result (quality)? I can think of times when I have been very attentive to a piece of content, read it to completion, and still didn’t find what I was looking for. In the end I was unsatisfied with the experience. So what did my time on that piece of information tell the web analyst? That I was attentive?

Beyond the Compete.com discussion, I have also seen a lot of other discussions recently on using time as a way to provide comparable information in the face of the “page view is dead” talk. In the end, none of them have convinced me that time is the great replacement for activity. If anything, I think it is even more misleading than page views have been because at least page views are a measure of user actions, as opposed to time, which also counts during user inaction. It also is often used without qualifying positive vs. negative time and usually just amounts to noise without meaning in reports. It is one of those “things that makes you go hmmm” but is rarely presented in a way that amounts to real actionable insight. Interesting and insightful are not synonyms.

IMHO, activity is still the most important indicator of the user’s attention to the site and whether that attention is positive or negative. The only thing better than activity when it comes to identifying quality is psycho-graphic data such as surveys and ratings.

So what do you think? Am I being too hard on Time?


Finding time to blog

Many apologies for the long hiatus since my last post. I’ve been traveling a lot the last few weeks, the cause of which I will be announcing very soon.

In the meantime, I have started a new section of the site called Projects and Papers where I am beginning to put up information on projects I am developing and ideas for projects to be developed. The first piece I have just put up is the description of a method I have developed for using the browser’s cache to help maintain user identity cookies and record the deletion and blocking events. You can read it directly here:

Measuring Cookie Deletion and Blocking through the Browsers Cache


But what has my data done for you lately?

Google announced, today, new efforts to make the data they collect from users more privacy friendly.

When you search on Google, we collect information about your search, such as the query itself, IP addresses and cookie details. Previously, we kept this data for as long as it was useful. Today we’re pleased to report a change in our privacy policy: Unless we’re legally required to retain log data for longer, we will anonymize our server logs after a limited period of time. When we implement this policy change in the coming months, we will continue to keep server log data (so that we can improve Google’s services and protect them from security and other abuses)—but will make this data much more anonymous, so that it can no longer be identified with individual users, after 18-24 months.

The whole post seems to me to be more a statement on how long they think they can monetize your information than any great stride for privacy.

Don’t get me wrong, I actually don’t see anything wrong with them monetizing their business data. And if I were seriously opposed to it I would respond by not using their search tools.

The thing that raises the little hairs on the back of my neck is the “we’re doing this for you” nature of the release. The message is: we care about your concerns, so 18 to 24 months after you use our services we won’t know what you did. Why even bother with such an announcement? It seems to only highlight the possibility that they might know exactly what I did yesterday and might also be able to track it down to me specifically.

While I’m asking questions, the other questions that immediately scream at me are: Why 18 to 24 months? Why not 6 months or 12 months? What are they doing that makes my identifiable data something they can monetize over a year after it occurred?

Anonymizing their log data doesn’t mean they lose any of the reporting they did on my data at the time. It also doesn’t mean they necessarily have to lose the ties that bind my records together over finite periods of time. It only means they have to replace any references to my IP addresses and login IDs with something that doesn’t track back to me. The only reason I can see for having this data for 18 to 24 months is if they have the means and a plan to use it.
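As an illustration of that last point, one way to replace IP addresses and login IDs with something that does not track back to a person, while still tying records together over a finite period, is a keyed hash whose key is thrown away when the period ends. This is a sketch of the general idea only, not a description of what Google does; the function name and keys are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, period_key: bytes) -> str:
    """Replace an IP address or login ID with a keyed hash.

    Records from the same period share a key, so the same identifier
    always maps to the same token and the records stay linked. Once the
    key for a period is discarded, tokens from that period can no longer
    be matched to tokens from any other period.
    """
    digest = hmac.new(period_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Hypothetical keys, one per reporting period.
q1_key = b"secret-key-for-q1"
q2_key = b"secret-key-for-q2"

token_q1 = pseudonymize("192.0.2.10", q1_key)
token_q1_again = pseudonymize("192.0.2.10", q1_key)
token_q2 = pseudonymize("192.0.2.10", q2_key)
```

Within a period the two tokens match, so session and visitor level reporting still works. Across periods they differ, which is the "finite periods of time" part. The period length is then a policy choice rather than a technical constraint.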

The other part of the statement that catches my attention is

This also means providing clear, easy to understand privacy policies that help you make informed decisions about using our services.

I don’t know what internet you’ve been surfing lately but, with the rapid proliferation of their advertising and analytics offerings, they are pretty near impossible to avoid. They don’t really make any clear distinction as to which products/server logs this policy covers if not all the data they collect.

What do you think?

Personally, I think the news is probably only useful for some ridiculous plot twist in a future episode of your favorite prime-time crime drama. I can hear it now. “We would have all the evidence we need to convict him for the murder he committed 2 years plus one day ago but Google anonymizes their logs after 24 months. Drats, foiled again!” ;-)
