1 Introduction

Ever since I learned about Market Basket Analysis, my head has been spinning with ideas on how it could be applied to web data. To back up for a second, Market Basket Analysis (MBA) is a data mining technique that catalogs the strength of the relationships between combinations of items placed together during a transaction. Applications often include:

  • Recommending content in the form of “Users who view X and Y also view Z”
  • Offering promotions for combinations of items to increase revenue
  • Better understanding of user behavior and intent
  • Updating editorial decisions based on popular combinations of items

The typical use case, and the source of the name, is the retail setting, where marketers want to know which products are commonly associated with one another during checkout. We need fancy algorithms for this type of analysis because the number of combinations to evaluate explodes. As an example, if you wanted to look at the combinations of 2 or 3 items out of a set of 50 items, you would have 20,825 combinations to evaluate (1,225 pairs plus 19,600 triples). That number expands immensely as you increase the number of unique items and the size of the combinations.
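The combination count is easy to check. Here is a minimal sketch in Python (used for illustration only; the analysis in this post is run in R) that relies on nothing but the standard library’s math.comb:

```python
from math import comb

n_items = 50

pairs = comb(n_items, 2)    # ways to choose 2 items out of 50
triples = comb(n_items, 3)  # ways to choose 3 items out of 50
total = pairs + triples

print(pairs, triples, total)  # 1225 19600 20825

# The same calculation with ten times as many items shows how fast this grows.
print(comb(500, 2) + comb(500, 3))
```

Going from 50 to 500 items pushes the count from the tens of thousands into the tens of millions, which is why brute-force counting stops being practical so quickly.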

In the retail case, the “item” is a “product” and the “transaction” is “checkout”. However, the algorithm underlying MBA doesn’t care what you use as an “item” and “transaction”. We can just as easily run an analysis that looks at web pages as “items” and browsing sessions as “transactions”. Going further, if we have information related to webpage taxonomy and unique user IDs, we can abstract the analysis away from individual pages and look at taxonomy tags as “items” and users as “transactions”. Hopefully that gives some flavor for how flexible this analysis can be.
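To make the “item” and “transaction” swap concrete, here is a small Python sketch (illustrative only, not the R code behind this post). The pageview rows, the page_tags lookup, and the build_transactions helper are all hypothetical names and data invented for the example:

```python
from collections import defaultdict

# Hypothetical pageview log; every value here is made up for illustration.
pageviews = [
    {"session": "s1", "user": "u1", "page": "/post-one/"},
    {"session": "s1", "user": "u1", "page": "/blog/"},
    {"session": "s2", "user": "u1", "page": "/post-two/"},
    {"session": "s3", "user": "u2", "page": "/post-two/"},
]

# Hypothetical taxonomy lookup: page path -> set of tags.
page_tags = {
    "/post-one/": {"analytics", "gtm"},
    "/post-two/": {"analytics", "case-study"},
    "/blog/": {"navigation"},
}


def build_transactions(rows, transaction_key, to_items):
    """Group rows into one set of items per transaction."""
    transactions = defaultdict(set)
    for row in rows:
        transactions[row[transaction_key]] |= to_items(row["page"])
    return dict(transactions)


# Pages as "items", sessions as "transactions".
by_session = build_transactions(pageviews, "session", lambda page: {page})

# Taxonomy tags as "items", users as "transactions".
by_user_tags = build_transactions(
    pageviews, "user", lambda page: page_tags.get(page, set())
)

print(by_session)
print(by_user_tags)
```

The analysis downstream only ever sees sets of items grouped by a key, so changing what counts as an item or a transaction is purely a data preparation decision.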

As a simple example, we’ll run MBA on my own personal blog. Given my small number of pages and limited amount of traffic, this analysis won’t do justice to the full power of MBA. Just be aware that this technique scales to thousands of items and tens of thousands of transactions without much effort.

2 Pull Data from Google Analytics

I’m interested in understanding the combinations of pages that users visit during a session so that I might recommend new pages of interest during their journey. Perhaps I plan on asking my editorial team to manually attach these recommendations in WordPress, or perhaps I plan to feed this information into some sort of automated personalization engine. The first step is to pull down our “items” (webpages) and “transactions” (session IDs). We’ll do this by calling the Google Analytics reporting API with the googleAnalyticsR library and grabbing pages, landing pages, and session IDs.

I’ve included the session ‘landing page’, and its purpose becomes clear once you think about how we plan to use our results. On its own, MBA doesn’t provide any information about the sequence of items in a transaction; it simply indicates that “these items are associated”. Given that we want to recommend a new page to a user during their journey, we want to avoid recommending a page that is commonly associated with the start of a journey. In other words, let’s not recommend a landing page after they’ve landed!

To resolve this issue, we’ll tag the starting pages with ‘ENTRANCE-’ at the beginning. In the example below, you can see that we mark a session that landed on the ‘differential scroll tracking’ blog post by prepending ‘ENTRANCE-’ to that page path. We make no distinction regarding the ordering of the remaining pages.

Session ID             Page Path
1582641547248.274nwi3  ENTRANCE-/2020/02/introducing-differential-scroll-tracking-with-gtm/
1582641547248.274nwi3  /blog/page/2/
1582641547248.274nwi3  /blog/
1582641547248.274nwi3  /2016/03/google-analytics-public-service-announcement-direct-traffic-is-unknown-traffic/
1582641547248.274nwi3  /
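The tagging step can be sketched in a few lines of Python (illustrative only, not the R code behind this post). The three-column row layout and the tag_entrances helper are assumptions made for the example; the session ID and page paths are taken from the table above:

```python
ENTRANCE_PREFIX = "ENTRANCE-"

SESSION = "1582641547248.274nwi3"
LANDING = "/2020/02/introducing-differential-scroll-tracking-with-gtm/"

# Assumed row layout: (session_id, landing_page, page_path).
rows = [
    (SESSION, LANDING, LANDING),
    (SESSION, LANDING, "/blog/page/2/"),
    (SESSION, LANDING, "/blog/"),
    (SESSION, LANDING, "/"),
]


def tag_entrances(rows):
    """Return (session_id, item) pairs with the landing page prefixed."""
    tagged = []
    for session_id, landing_page, page_path in rows:
        if page_path == landing_page:
            item = ENTRANCE_PREFIX + page_path
        else:
            item = page_path
        tagged.append((session_id, item))
    return tagged


for session_id, item in tag_entrances(rows):
    print(session_id, item)
```

One simplification to be aware of: this sketch tags every row whose page path equals the landing page, so a later return visit to the landing page within the same session is also treated as the entrance item.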

Looking at the results below, we can see that most sessions contained only 1 pageview and that the number of sessions tapers off quickly as the pageview count grows. It’s good to get a general sense of the shape of the data before running MBA because it will influence the size of the combinations that we can reasonably expect. For example, it would be unreasonable to look for combinations of 9 different pages because only 1 session generated a combination of that length.

Count of Pageviews  Sessions
1                   2436
2                   146
3                   55
4                   19
5                   2
6                   1
7                   2
8                   1
9                   1
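A distribution like the one above takes only two counting passes to produce. Here is a minimal Python sketch (illustrative only, not the R code behind this post); the rows are hypothetical (session_id, page_path) pairs invented for the example:

```python
from collections import Counter

# Hypothetical (session_id, page_path) rows, one per pageview.
rows = [
    ("s1", "/"), ("s1", "/blog/"),
    ("s2", "/"),
    ("s3", "/portfolio/"), ("s3", "/blog/"), ("s3", "/"),
    ("s4", "/blog/"),
]

# First pass: how many pageviews did each session generate?
pageviews_per_session = Counter(session_id for session_id, _ in rows)

# Second pass: how many sessions share each pageview count?
sessions_per_count = Counter(pageviews_per_session.values())

for n_pageviews in sorted(sessions_per_count):
    print(n_pageviews, sessions_per_count[n_pageviews])
```

The largest pageview count with more than a handful of sessions is a reasonable upper bound on the itemset size worth searching for.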

3 Running MBA

To run our Market Basket Analysis, we’ll use the arules package in R. Before we look at any results, it might be helpful to cover some terminology that often appears in MBA:

  • Itemsets - These are combinations of items and are often associated with a count that shows how frequently the combination appeared in the transaction history. You’ll often see size-2 itemsets or size-3 itemsets, etc., indicating how many unique items appear in the itemset.
  • Support - This is the percentage of transactions in which the itemsets (or association rules, covered next) appear.
  • Association Rules - These are presented in the format of “{Left Hand Side} => {Right Hand Side}” and indicate that transactions that contain the itemset on the LHS tend to also include the item on the RHS. Note that the RHS only ever contains 1 item while the LHS can contain an itemset of any size.
  • Confidence - This is a percentage indicating the strength of our association rule. It says “Out of the users who visited the items in the LHS, XX% visited the RHS”. This is helpful, but can be misleading when items in the RHS are ubiquitous and relevant to nearly every combination of items. To resolve this, we often look at both confidence and lift.
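  • Lift - This number indicates how much more likely we are to see the LHS and RHS together than we would expect if they appeared independently of one another. A lift of 3 means we’re 3x more likely to see these items together than chance alone would predict, and a lift of .33 means we’re 1/3 as likely. A lift of 1 means the LHS tells us nothing about the RHS.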
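These three metrics follow the standard definitions: support is the share of transactions containing an itemset, confidence is support(LHS and RHS together) divided by support(LHS), and lift is confidence divided by support(RHS). Here is a minimal Python sketch (illustrative only; in R, arules reports these values for you). The five toy transactions are invented for the example:

```python
# Toy transactions, invented for illustration.
transactions = [
    {"/post-a/", "/post-b/"},
    {"/post-a/", "/post-b/", "/blog/"},
    {"/post-a/"},
    {"/post-b/", "/blog/"},
    {"/blog/"},
]


def support(itemset, transactions):
    """Share of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Of the transactions containing the LHS, the share that also contain the RHS."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)


def lift(lhs, rhs, transactions):
    """Confidence relative to how common the RHS is on its own."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)


lhs, rhs = {"/post-a/"}, {"/post-b/"}
print("support:   ", support(lhs | rhs, transactions))  # 2 of 5 transactions
print("confidence:", confidence(lhs, rhs, transactions))  # 2 of the 3 with /post-a/
print("lift:      ", lift(lhs, rhs, transactions))
```

In this toy data the rule {/post-a/} => {/post-b/} has a lift only slightly above 1, because /post-b/ already appears in 3 of the 5 transactions regardless of what else was viewed.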

With that, let’s format our data for use with MBA and get started. The following table shows the top 4 itemsets discovered, sorted by support.

Itemset Support
{/2016/08/google-analytics-autotrack-js-updated-see-whats-new/,ENTRANCE-/2016/02/deploying-autotrack-js-through-google-tag-manager/} 20.26%
{/blog/,ENTRANCE-/} 17.18%
{/portfolio/,ENTRANCE-/} 17.18%
{/blog/,/portfolio/} 10.13%
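For intuition, a size-2 itemset table like this one can be produced by brute force on a small dataset. The Python sketch below (illustrative only, not the R code behind this post) counts every pair directly; the four transactions are invented for the example, and a real frequent-itemset miner prunes candidates by minimum support instead of enumerating them all:

```python
from collections import Counter
from itertools import combinations

# Toy transactions, invented for illustration.
transactions = [
    {"ENTRANCE-/", "/blog/"},
    {"ENTRANCE-/", "/blog/", "/portfolio/"},
    {"ENTRANCE-/", "/portfolio/"},
    {"ENTRANCE-/post-a/", "/post-b/"},
]

# Count each unordered pair once per transaction that contains it.
pair_counts = Counter()
for items in transactions:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# Convert counts to support and keep the most frequent pairs.
n_transactions = len(transactions)
top_pairs = [
    (pair, count / n_transactions)
    for pair, count in pair_counts.most_common(4)
]

for pair, pair_support in top_pairs:
    print(pair, f"{pair_support:.2%}")
```

Sorting each transaction before pairing keeps the pair keys consistent, so the same two pages are always counted under one key.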

Next, we run the Apriori algorithm to find association rules. Remember that we want to filter out any association rules where the ‘entrance’ page is on the RHS. This ensures that we never recommend an entrance page.

The best way to present association rules is often in a scatter chart that allows us to look at support, confidence, and lift in one view. Below, you can see that 4 association rules were generated that have a minimum support of 2% and a minimum confidence of 80%.

# Association Rule Confidence Support Lift
1 {ENTRANCE-/2016/08/google-analytics-autotrack-js-updated-see-whats-new/} => {/2016/02/deploying-autotrack-js-through-google-tag-manager/} 87.50% 3.08% 15.28
2 {/2020/03/mobile-app-live-streaming-analytics-case-study-hope-channel/} => {/blog/} 90.00% 3.96% 2.65
3 {ENTRANCE-/2016/02/deploying-autotrack-js-through-google-tag-manager/} => {/2016/08/google-analytics-autotrack-js-updated-see-whats-new/} 92.00% 20.26% 3.60
4 {/2020/03/mobile-app-live-streaming-analytics-case-study-hope-channel/,ENTRANCE-/} => {/blog/} 100.00% 3.52% 2.95
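The thresholds and the entrance filter are easiest to see in code. The Python sketch below is a brute-force stand-in, not the Apriori algorithm and not the R code behind this post: it enumerates every candidate itemset up to a small size, whereas Apriori prunes candidates that cannot meet the minimum support. The find_rules helper, the toy transactions, and the thresholds used in the call are all invented for the example:

```python
from itertools import combinations

ENTRANCE_PREFIX = "ENTRANCE-"


def support(itemset, transactions):
    """Share of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def find_rules(transactions, min_support, min_confidence, max_size=3):
    """Brute-force rule search; only practical for a handful of items."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, max_size + 1):
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            rule_support = support(itemset, transactions)
            if rule_support < min_support:
                continue
            for rhs in candidate:
                if rhs.startswith(ENTRANCE_PREFIX):
                    continue  # never recommend an entrance page
                lhs = itemset - {rhs}
                conf = rule_support / support(lhs, transactions)
                if conf < min_confidence:
                    continue
                lift = conf / support(frozenset([rhs]), transactions)
                rules.append((sorted(lhs), rhs, rule_support, conf, lift))
    return rules


# Toy transactions, invented for illustration.
transactions = [
    {"ENTRANCE-/post-a/", "/post-b/"},
    {"ENTRANCE-/post-a/", "/post-b/"},
    {"ENTRANCE-/post-a/", "/post-b/", "/blog/"},
    {"ENTRANCE-/", "/blog/"},
    {"ENTRANCE-/", "/portfolio/"},
]

# Thresholds are loosened for the toy data; the post uses 2% and 80%.
rules = find_rules(transactions, min_support=0.4, min_confidence=0.8)
for lhs, rhs, rule_support, conf, lift in rules:
    print(lhs, "=>", rhs, f"{conf:.2%} {rule_support:.2%} {lift:.2f}")
```

Skipping entrance pages while choosing the RHS, instead of filtering the finished rule list, means those rules are never built in the first place.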

4 Analysis of Results

The scatter plot and table yield some interesting results. First, I should point out that finding an association rule with strong support, confidence, and lift is the holy grail, but exceedingly rare. Most commonly, you’ll find items with high confidence and low support, or high support and low confidence.

Notice that many of the itemsets we discovered previously, such as {/blog/,ENTRANCE-/}, didn’t make the cut as association rules. This is because we’re filtering to search for association rules with a minimum confidence of 80%. The filter matters because it keeps us from recommending content that is broadly popular but not tailored to the user’s unique viewing history.

So what can we determine from the graph and table above?

  • Rules #1 and #3 are nearly mirrors of one another. Remember that the ‘entrance’ version of each page is considered to be a unique page. What stands out is the high lift: these 2 pages are clearly connected to one another in a way that stands apart from their connection to other pages.

  • Rule #3 is interesting because of the high confidence and high support. This combination is often hard to find. When I review some of the analytics underlying these figures, I see that my blog is generating a lot of SEO traffic to the ‘deploying autotrack’ page and that those users are going on to the second page 92% of the time. If we look at the page in question, we can see that I have an “Update” callout. It looks like that callout is working very well!

  • Rule #2 is notable because it doesn’t include an ‘entrance’ page. It’s a nice, broadly applicable rule stating that users who visit my live streaming case study are 90% likely to visit, or to have visited, the blog landing page.

  • Rule #4 is interesting given the 100% confidence (which I doubt you would ever see in a more realistic scenario). What this says is that if a user enters on the home page and at some point visits my live streaming case study then at some point they will (or will have already), with 100% certainty, visit the blog landing page. Notice that I have to emphasize the fact that this analysis gives no indication of the ordering of events. If we wanted to turn this rule into a content recommendation, we would likely want to check their browsing history first to avoid recommending a page they’ve already visited.
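The browsing-history check mentioned in the last bullet can be sketched in Python (illustrative only, not the R code behind this post). The rules and lift values are copied from the table above; the recommend helper and the choice to rank suggestions by lift are assumptions made for the example:

```python
AUTOTRACK_DEPLOY = "/2016/02/deploying-autotrack-js-through-google-tag-manager/"
AUTOTRACK_UPDATE = "/2016/08/google-analytics-autotrack-js-updated-see-whats-new/"
CASE_STUDY = "/2020/03/mobile-app-live-streaming-analytics-case-study-hope-channel/"

# (LHS itemset, RHS page, lift) for rules #1 through #4.
rules = [
    ({"ENTRANCE-" + AUTOTRACK_UPDATE}, AUTOTRACK_DEPLOY, 15.28),
    ({CASE_STUDY}, "/blog/", 2.65),
    ({"ENTRANCE-" + AUTOTRACK_DEPLOY}, AUTOTRACK_UPDATE, 3.60),
    ({CASE_STUDY, "ENTRANCE-/"}, "/blog/", 2.95),
]


def recommend(visited, rules):
    """Suggest unvisited RHS pages whose LHS the session has fully matched."""
    visited = set(visited)
    suggestions = []
    # Assumption: prefer the rule with the highest lift when several match.
    for lhs, rhs, _lift in sorted(rules, key=lambda rule: rule[2], reverse=True):
        if lhs <= visited and rhs not in visited and rhs not in suggestions:
            suggestions.append(rhs)
    return suggestions


# A session that entered on the home page and then read the case study.
print(recommend({"ENTRANCE-/", CASE_STUDY}, rules))

# The same session after it has already seen the blog landing page.
print(recommend({"ENTRANCE-/", CASE_STUDY, "/blog/"}, rules))
```

The first call suggests the blog landing page, and the second suggests nothing because the only matching rules point at a page the session has already seen.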

5 Closing Thoughts

Hopefully the analysis above shows how MBA can help someone dig deeper into user behavior and start looking at metrics for patterns as opposed to metrics for individual pages/products. While I used individual pages as the “items” above, websites with thousands of pages may benefit from an analysis centered on content taxonomy such as “content types”, “tags”, or “topics”. This makes the results much easier to interpret. One application of such an analysis may be feedback for the editorial team to focus on content that contains specific combinations of topics. Happy analyzing!