Ever since I learned about Market Basket Analysis, my head has been spinning with ideas on how it could be applied to web data. To back up for a second, Market Basket Analysis (MBA) is a data mining technique that measures the strength of the relationships between combinations of items that appear together in a transaction. Applications often include:
The typical use case, and the source of the name, is the retail setting, where marketers want to know which products are commonly purchased together at checkout. We need specialized algorithms for this type of analysis because the number of combinations to evaluate explodes. As an example, if you wanted to look at every combination of 2 or 3 items out of a set of 50 items, you would have more than 20,000 combinations to evaluate (1,225 pairs plus 19,600 triples). That number grows rapidly as you increase the number of unique items and the size of the combinations.
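To make the arithmetic concrete, here is a quick sketch in Python that counts the candidate itemsets for the 50-item example. The helper name `itemset_count` is my own, purely for illustration; the only library call is `math.comb` from the standard library.

```python
from math import comb

def itemset_count(n_items, sizes):
    """Count the distinct itemsets of the given sizes drawn from n_items."""
    return sum(comb(n_items, k) for k in sizes)

pairs = comb(50, 2)                 # every 2-item combination
triples = comb(50, 3)               # every 3-item combination
total = itemset_count(50, [2, 3])   # pairs and triples together

print(pairs, triples, total)        # 1225 19600 20825

# Doubling the catalog to 100 items grows the count roughly eightfold.
print(itemset_count(100, [2, 3]))   # 166650
```

Brute-force counting across every one of those candidates gets expensive quickly, which is why MBA algorithms prune combinations that cannot be frequent rather than enumerating them all.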
In the retail case, the “item” is a “product” and the “transaction” is “checkout”. However, the algorithm underlying MBA doesn’t care what you use as an “item” and “transaction”. We can just as easily run an analysis that looks at web pages as “items” and browsing sessions as “transactions”. Going further, if we have information related to webpage taxonomy and unique user IDs, we can abstract the analysis away from individual pages and look at taxonomy tags as “items” and users as “transactions”. Hopefully that gives some flavor for how flexible this analysis can be.
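Here is a minimal sketch of that idea in Python, assuming the raw data arrives as (transaction ID, item) rows. The session IDs, page paths, and function names below are all made up for illustration; swapping sessions for user IDs or pages for taxonomy tags only changes what goes into the two columns.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical page-view log: (session_id, page) pairs.
page_views = [
    ("s1", "/home"), ("s1", "/about"), ("s1", "/post-a"),
    ("s2", "/home"), ("s2", "/post-a"),
    ("s3", "/post-a"), ("s3", "/post-b"),
]

def build_baskets(rows):
    """Group items by transaction ID, keeping each item once per basket."""
    baskets = defaultdict(set)
    for transaction_id, item in rows:
        baskets[transaction_id].add(item)
    return dict(baskets)

def pair_support(baskets):
    """Share of transactions that contain each pair of items."""
    counts = Counter()
    for items in baskets.values():
        # Sort so each pair has one canonical ordering.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: count / n for pair, count in counts.items()}

baskets = build_baskets(page_views)
support = pair_support(baskets)
print(support[("/home", "/post-a")])  # appears in 2 of the 3 sessions
```

This only counts pairs by brute force, so it is a toy rather than a real MBA implementation, but it shows that the "basket" is nothing more than a set of items keyed by whatever you choose to call a transaction.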
As a simple example, we’ll run MBA on my personal blog. Given my small number of pages and limited traffic, this analysis won’t do justice to the full power of MBA. Just be aware that the technique scales to thousands of items and tens of thousands of transactions without much effort.
I’m interested in understanding the combinations of pages that users visit during a session so that I might recommend new pages of interest during their journey. Perhaps I plan on asking my editorial team to manually attach these recommendations in WordPress or perhaps I plan to feed this information into some sort of automated personalization engine. The first step is to pull down our “items” (webpages) and “t