Evolution of the Web Analytics Data Model Part I

Visitor -> Visits -> Page Views. This is a typical expression of the data model at the core of most web analytics reporting today. As the dominant view of web site traffic, it has worked well to foster understanding and communication across the web analytics discipline and across both technical and non-technical audiences. But it presents a very limited view of what activity on a web site is. It often forces those who strive to capture and analyze more specific levels of interaction between a user and a web site to display the results through this very simple view and then explain what is really being tracked.

As we move forward with this concept of the next generation of the internet, this issue has achieved new focus as perhaps one of the most important areas needing improvement if the web analytics discipline is to keep pace with the growing use of RIAs, flexible reuse of content via feeds and the expanding variations of internet clients.

For this post I would like to start by focusing on the bane of many an analyst, the “Page View”, and in future posts address the other levels of the underlying data architecture of web analytics.

Part I, Reinventing the Page View

What is a page view? In simple terms it is what evolved to replace “hits” by applying filters and/or data capture techniques that limited the requests (hits) counted to only those that represented the display of a complete ‘page’ of content in the web browser. It is neither universally the same in all analytics tools nor the only unit of activity by which traffic is measured.

To support the need to track purchasable space on a page, the “impression” came into being to track pieces of content (creatives), and with it the “clickthrough” to track the response to that content. To track files that are requested by the browser but not necessarily displayed by it, some tools support identifying certain files as ‘downloads’. Many email marketing applications have their own variations of page views, impressions and clickthroughs to support measurement of a campaign’s effectiveness.

Now we have the web 2.0 world of feeds and feed readers, desktop information clients (widgets, gadgets) and a rapid increase in the use of RIAs, all of which have caused the ad hoc invention of new units of measurement such as “map scrolls” and “post views”. This has resulted in naming things on the fly and/or reaching out to the community to ask what these new things should be called.

Events

What all of these have in common is that they are all attempts at capturing finite units of activity with an online property. They are also all represented by specific requests (hits) deemed to have value for measurement and analysis. So the first step is to replace page views, as the lowest-level, most finite measurement of web activity, with something less tied to the measurement of one specific activity and more generally illustrative of the level of activity being tracked. I call these requests captured and retained for use in analysis (or hits with meaning) Events.

An Event is the measure of any internet request captured and retained for the purpose of analysis. An Event requires first that there is a request and that the request passes a URI that is distinct to the event. The URI of the event may or may not be equal to the URI of the request, to support the capture of data through representational requests (page tags).

Now that I have all of my events together in one dimension I need to be able to flexibly attach meta-data to each event for the grouping of events of like types, their time of occurrence, their order and relation to other events and any other data I might need to better identify context and value of each event.

Additionally, in order to derive meaning from events, analytics tools must support the attachment of meta-data to the event both by parameters passed with the request and through interpretation of the URI and/or parameters passed with the request. The two required attributes for every event are “type” and “timestamp”.

In other words, page views, impressions, clicks, etc. are all “types” of events. The time at which an event occurs must be recorded to support reporting occurrences within time intervals as well as ordering events for path or process analysis. Additionally, sub dimensions of the events should exist by the value of type that allow for the same time based ordering for path based analysis within the sub dimension.
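A minimal sketch of the Event definition above, in Python. Everything named here is an assumption made for illustration: the `Event` fields, the `type` and `uri` query parameters of the hypothetical page tag, and the rule that a `.pdf` path means a download. It only shows the two required attributes, meta-data taken from parameters or interpreted from the URI, and a time-ordered sub-dimension by type.

```python
from dataclasses import dataclass, field
from datetime import datetime
from urllib.parse import urlsplit, parse_qs


@dataclass
class Event:
    uri: str             # URI distinct to the event (may differ from the request URI)
    type: str            # required attribute, e.g. "page views", "clicks"
    timestamp: datetime  # required attribute, used for intervals and ordering
    attributes: dict = field(default_factory=dict)  # any additional meta-data


def event_from_request(request_uri, timestamp):
    """Build an Event from a request URI (hypothetical page-tag convention)."""
    parts = urlsplit(request_uri)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    # Meta-data passed explicitly as parameters of the request.
    event_type = params.pop("type", None)
    event_uri = params.pop("uri", None)
    # Meta-data derived by interpreting the URI itself (assumed rule).
    if event_type is None:
        event_type = "downloads" if parts.path.endswith(".pdf") else "page views"
    return Event(event_uri or parts.path, event_type, timestamp, params)


def path_by_type(events, event_type):
    """URIs of all events of one type, in order of occurrence."""
    selected = [e for e in events if e.type == event_type]
    return [e.uri for e in sorted(selected, key=lambda e: e.timestamp)]


events = [
    event_from_request("/tag.gif?type=clicks&uri=/ads/banner-1&campaign=spring",
                       datetime(2007, 2, 1, 10, 0, 5)),
    event_from_request("/home.html", datetime(2007, 2, 1, 10, 0, 0)),
    event_from_request("/whitepaper.pdf", datetime(2007, 2, 1, 10, 1, 0)),
    event_from_request("/products.html", datetime(2007, 2, 1, 10, 0, 30)),
]
print(path_by_type(events, "page views"))
```

Note that the first request is a page tag: its own URI is `/tag.gif`, while the Event it represents carries the URI `/ads/banner-1`.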

Higher end analytics tools will also support the flexible association of additional meta-data to an event as attribute sub dimensions of the top level Event dimension. They will also support the flexible modification of the types of events instead of fixed configurations for only a small subset of specific types (page views, clicks, etc.).

So now we have Visitors -> Visits -> Events. Doesn’t seem like all that big a change, does it? For some vendors it isn’t. In simple terms it is the replacement of the Page View dimension with the Event dimension and the Events metric for counting. Then the new Page Views dimension is a subgrouping of the Events by type=’page views’.
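To make the hierarchy concrete, here is a rough Python sketch. The class and function names are hypothetical, not taken from any particular tool; the point is only that the Events metric counts everything, while Page Views become a subgroup selected by type.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Event:
    type: str
    uri: str


@dataclass
class Visit:
    events: list = field(default_factory=list)


@dataclass
class Visitor:
    visitor_id: str
    visits: list = field(default_factory=list)


def all_events(visitor):
    """Flatten Visitor -> Visits -> Events into one list of events."""
    return [event for visit in visitor.visits for event in visit.events]


def page_views(visitor):
    """The Page Views dimension as a subgroup of Events by type."""
    return [event for event in all_events(visitor) if event.type == "page views"]


visitor = Visitor("v1", visits=[
    Visit([Event("page views", "/home"), Event("clicks", "/ads/banner-1")]),
    Visit([Event("page views", "/products"), Event("post views", "/feed/post-42")]),
])

events_metric = len(all_events(visitor))           # counts every event
by_type = Counter(e.type for e in all_events(visitor))
print(events_metric, len(page_views(visitor)), dict(by_type))
```

A new unit such as “post views” needs no change to the structure here; it is just another value of type.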

What it provides is the ability to do path and process analysis across all the Events as a whole and across subgroupings by type. But there is still something missing in the above for describing how the type based sub-dimensions relate to the Events dimension and the parent dimensions above that. I’ll try to tackle this in my next post by refining the definition of Events to the concept of an Event Super Class where the sub-dimensions by type are not children of the Events dimension but rather sub-classes of the Events class that inherit their properties and relations within the data model from the Super Class.

-Ian


6 Responses to “Evolution of the Web Analytics Data Model Part I”


  1. 1 Steve Feb 6th, 2007 at 5:11 am

    Ian,
    do correct me, but aren’t you trying to redefine a transaction?

    Not so much in the 1:1 select == transaction model, rather in the select, update, (repeat both as necessary) commit database model of a transaction?

    Or am I missing something? :-)
    Cheers!
    - Steve

  2. 2 Jeroen Feb 8th, 2007 at 10:25 am

    Ian,

    I see an opportunity to align your terminology with probability theory. Just rename your “Event” to “Atomic Event” and specify that every set of Atomic Events is an “Event”. Then, if a Visitor requests say this blog page, that can be said to be the occurrence of the Event WADM. If that request contains a referrer URL that points to Eric Peterson’s blog, then it would also be an occurrence of the, more specific, Event EP->WADM. Et voilà: all standard theory on probabilities would be applicable to your model.

    If you would go further, and break down Event as a Request/Response pair, then the Event can specify aspects of the returned content that cannot be derived from the request (because the web server could use a random generator or data sources that are not accessible to the analysis system). In addition to that your model would also benefit from theory about (Partially Observable) Markov Decision Processes.

    Jeroen

  3. 3 Ian Feb 11th, 2007 at 12:22 am

    Steve,

    I’m not sure how you mean transaction in terms of databases. Transactions in terms of transactional databases (and file systems) contain instructions to do something and require that the transaction complete entirely or abort. In logging for data collection it is acceptable and sometimes desirable to log incomplete interactions such as partial file downloads. Or am I missing your meaning?

    Jeroen,

    I see where you are going with the “Atomic Event” (or elementary event depending on your preference) and for the most part I am in fact trying to describe the “Event” as a set of atomic events as you describe but that is not what the term “Event” means in probability theory.

    You are absolutely right about the “Request/Response” pair. It is simply my omission that I didn’t phrase it that way. This comes from the way data is collected and the fact that the information you would wish to collect from both is or can be made available for both logging and page tag methods. Though it is more difficult to get the information from both simultaneously through the packet sniffing method of data collection.

    One of my goals for the outcome of this series is to foster the discussion of the Data Model in language that is practical and understandable across multiple disciplines (Statisticians, Programmers, Marketers, etc.). My hope is that the end result is a more flexible framework for defining names for new types of events without requiring the management and acceptance of these names across the entire industry for them to be valuable. While I would love to find an affordable and practical tool that employs Markov and other advanced methods of deriving meaning from my data, this is more of a programming and implementation detail for the tool providers than I think should be involved for these purposes.

    Question: Do you think it would be clearer if I relabeled this thread of posts the Web Analytics Data Framework instead of Model?

    -Ian

  4. 4 Jeroen Feb 16th, 2007 at 4:00 pm

    Ian,

    You are right on both counts: elementary event is better, because the things that we try to name have parts, so they are not ‘unsplittable’, which is what atomos means. In the language of probability theory the elementary events would not be single request/response pairs, but entire descriptions of all interactions with the site over a long(ish) period of time. Viewed like that, a fully specified request/response pair would be the non-trivial event that consists of all possible interaction histories that contain the given pair. This implies awfully large ‘elementary events’ though. I am sure you would agree that is also quite counterintuitive. It makes more sense to define the most simple experiment as ‘wait for the next request/response pair’, which would make the description of all interactions a sequence of outcomes of a repeated experiment. (Note that the experiments are not statistically independent at all.)

    I must confess that I wrote my comments from my background in Model Theory, which implies a view of ‘models’ that is not quite the same as the ‘data models’ used in computer science and software engineering. However, if you aim to produce something close to universal, I find it is often useful to ground your definitions in some mathematical framework.

    If I were to go back to UML to define Event, then I would say that Event is a class that defines a method isMember that takes a single argument of type ElementaryEvent and produces a Boolean value. The value that is returned by isMember indicates whether the given ElementaryEvent is a member of the Event instance or not.

    As for the name of the thread, I would say: stick to ‘data model’. You can always define an API that allows vendors to integrate with other applications that use this data model and create (or motivate others to create) a framework that implements the API.

  1. 1 hey jeff jarvis » Instant Cognition » web analytics Pingback on Feb 8th, 2007 at 8:57 pm
  2. 2 Web Analytics Demystified » Blog Archive » Etc. Pingback on Feb 19th, 2007 at 12:26 pm
