The Google Analytics Related Products feature is an extremely valuable set of dimensions and metrics that can greatly benefit your product bundling, remarketing, email campaigns, product recommendations and more. The core of the feature is the correlation score, a metric that assigns a value to the association between two products based on their transaction data.
The correlation score takes values from 0 (lowest correlation) to 1 (highest correlation). One important thing to note is that the metric is directed. That means that product B may be associated with product A differently than A is associated with B. Example: our clients buy ketchup in 80% of cases when they buy pizza, but they only buy pizza in 25% of cases when they buy ketchup. The correlation score where the main product (query product) is pizza and the related product is ketchup is therefore a lot higher than in the opposite case.
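The direction of the relationship can be illustrated with a toy example. The sketch below does not reproduce how Google Analytics computes the correlation score. It only computes, for a made-up set of baskets, the share of pizza orders that contain ketchup and the share of ketchup orders that contain pizza. The data frame `baskets` and the helper `share_with` are hypothetical names used for illustration.

```r
# Made-up baskets: 5 contain pizza (4 of them also contain ketchup),
# 16 contain ketchup (4 of them also contain pizza).
baskets <- data.frame(
  pizza   = c(rep(TRUE, 5), rep(FALSE, 12)),
  ketchup = c(rep(TRUE, 4), FALSE, rep(TRUE, 12))
)

# Share of baskets containing `main` that also contain `related`.
share_with <- function(data, main, related) {
  mean(data[[related]][data[[main]]])
}

pizza_to_ketchup <- share_with(baskets, "pizza", "ketchup")  # 0.80
ketchup_to_pizza <- share_with(baskets, "ketchup", "pizza")  # 0.25
```

The same pair of products gives two different numbers depending on which product is treated as the main one, which is the sense in which the metric is directed.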
Online Grocery Store
We own an online grocery store business where we sell: Pears, Apples, Bananas, Pizza, Ketchup, Soda, Milk and Cocoa. If we pair all of those together (in two directions) we get 56 pairs*.
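Let’s do some exploratory analysis to get more comfortable with the data that Related Products has to offer. Through the Google Analytics Reporting API we’ll query for the query product and the related product (dimensions), together with the correlation score and the quantity sold of the query product (metrics).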
The query for our grocery store example returns the following results:
To have a quick look at your own data, check out the GA Query Explorer tool. It lets you build custom queries in seconds without any programming experience.
Column 1 is the query product, column 2 is the related product, column 3 is the correlation score and column 4 is the number of items of the query product that were sold. The first row would read: the correlation score where the main product is pear and the related product is apple is 0.8, and 55 pears were sold.
Visualizing Related Products in R
To query Google Analytics Reporting API in R we will be using the RGoogleAnalytics library. (A more detailed explanation of how to do that is available in our How to connect Google Analytics with R blog post).
First, let’s build the query and retrieve the GA data.
library(RGoogleAnalytics)

# Assumes `view.id` and an authorized `token` were created beforehand.
query.list <- Init(start.date = "2014-12-01",
                   end.date = "2014-12-31",
                   dimensions = "ga:queryProductId,ga:relatedProductId",
                   metrics = "ga:correlationScore,ga:queryProductQuantity",
                   sort = "ga:queryProductId,ga:relatedProductId",
                   table.id = view.id)
ga.query <- QueryBuilder(query.list)
ga.data <- GetReportData(ga.query, token)
Because the data provided by Google Analytics is already in a good format for some basic graph visualization, let’s do that first.
Creating a Directed Graph of Related Products
The input data for a directed graph consists of a vector of “from” vertices, a vector of “to” vertices and the width of each edge. In our case each edge starts at a query product vertex and points to the related product vertex. The width of the edge is the value of the correlation score. To make the widths easier to see we multiply them by a factor of 7.
# Edge list: one row per directed edge, from query product to related product.
E <- as.matrix(data.frame(from = ga.data$queryProductId,
                          to = ga.data$relatedProductId,
                          width = 7 * ga.data$correlationScore))
Before we draw the directed graph, let’s set the size of each vertex based on the quantity of that product we’ve sold. First we need one row per vertex, which we get by dropping rows that repeat a queryProductId value. Once we have a vector of product quantities, we divide it by its maximum so that the largest value is 1, and then scale it to taste (in our case we multiply by 12).
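# Keep the first row of each query product and take its quantity.
size <- ga.data[!duplicated(ga.data$queryProductId), ]$queryProductQuantity
# Scale so that the best-selling product gets a vertex size of 12.
size <- size / max(size) * 12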
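To see what these two lines do without access to Google Analytics, here is a minimal sketch on a made-up data frame. The frame `toy` and its quantities are hypothetical and only mimic the column names used above.

```r
# Made-up rows in the same shape as the query response:
# the quantity repeats on every row of the same query product.
toy <- data.frame(
  queryProductId       = c("Pear", "Pear", "Apple", "Apple", "Banana", "Banana"),
  relatedProductId     = c("Apple", "Banana", "Pear", "Banana", "Pear", "Apple"),
  queryProductQuantity = c(55, 55, 40, 40, 20, 20),
  stringsAsFactors     = FALSE
)

# One row per query product (the first occurrence of each ID).
first_rows <- toy[!duplicated(toy$queryProductId), ]

# Divide by the maximum and scale to a largest vertex size of 12.
size <- first_rows$queryProductQuantity
size <- size / max(size) * 12
names(size) <- first_rows$queryProductId
```

The sizes follow the order in which products first appear in the data, so that order should match the vertex order used when plotting.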
All we need to do now is take the matrix of Edges and vector of sizes and pass it to the qgraph function.
library(qgraph)

# mode = "direct" uses the edge weights directly as line widths.
qgraph(E, mode = "direct", vsize = size)
Bigger vertex = higher quantity of that product sold
Thicker line = higher correlation between the two products
A quick look at this graph reveals a lot more than looking at the table of data that we got from the query response. R takes great care of positioning vertices and shows us that there are a couple of groups forming. On top we can see that fruits are often bought together, on the bottom left we can see that ketchup and pizza form a little group as well and on the right milk and cocoa form the third group. Soda seems to be close to all, but may be closest to ketchup and pizza.
A great way to confirm the groups would be through a dendrogram visualization.
Creating a Heatmap with Dendrogram
The dendrogram (the lines on the left and top of the heat map) is a visual representation of how the data is correlated. The bottom of the dendrogram consists of the 8 individual products. As we move towards the top, small groups (clusters) start to form, and at the very top all of our products are in 1 group (cluster). Building a dendrogram also orders related products next to each other, which makes the resulting heat map a lot easier to read.
Using the tidyr library we will reshape the data into a matrix where each row represents a query product, each column a related product, and each cell holds the correlation score. The result resembles what is known as a distance matrix. The only difference is that in a distance matrix small values usually mean higher similarity, while in our case values closer to 1 show higher correlation.
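library(tidyr)

# Assumption: onlyScore is ga.data reduced to the product IDs and the score.
onlyScore <- ga.data[, c("queryProductId", "relatedProductId", "correlationScore")]
newScore <- spread(onlyScore, relatedProductId, correlationScore)
newScore[is.na(newScore)] <- 1  # a product has no score with itself
row.names(newScore) <- newScore$queryProductId
newScore <- newScore[, -1]  # drop the ID column; under this assumption only column 1 is not a score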
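For readers who prefer to avoid an extra package, the same long-to-wide reshaping can be sketched in base R with `tapply`. This is an alternative to the tidyr approach, not what the article itself uses. The data frame `scores` and its values are made up.

```r
# Made-up scores for three products, in both directions.
scores <- data.frame(
  queryProductId   = c("Pizza", "Pizza", "Ketchup", "Ketchup", "Soda", "Soda"),
  relatedProductId = c("Ketchup", "Soda", "Pizza", "Soda", "Pizza", "Ketchup"),
  correlationScore = c(0.8, 0.5, 0.3, 0.4, 0.6, 0.2)
)

# Rows are query products, columns are related products.
# Each cell holds one score, so sum() just returns that value.
score_matrix <- with(scores, tapply(correlationScore,
                                    list(queryProductId, relatedProductId),
                                    sum))

# Pairs that never occur (here only the diagonal) come back as NA.
score_matrix[is.na(score_matrix)] <- 1
```

Because the score is directed, the matrix is not symmetric: the cell for row Pizza, column Ketchup differs from the cell for row Ketchup, column Pizza.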
Now that our data is looking good, all we need to do is draw R’s native “heat map + dendrograms” plot.
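The exact plotting call is not shown here, so the following is only a sketch, assuming the native plot is the `heatmap()` function that ships with R. It builds a made-up 8 x 8 score matrix with three obvious blocks (the values 0.8 and 0.1 are invented) and draws the heat map with row and column dendrograms.

```r
products <- c("Pear", "Apple", "Banana", "Pizza", "Ketchup", "Soda", "Milk", "Cocoa")
group    <- c(1, 1, 1, 2, 2, 2, 3, 3)  # hypothetical group membership

# Made-up scores: 0.8 inside a group, 0.1 across groups, 1 on the diagonal.
score_matrix <- ifelse(outer(group, group, "=="), 0.8, 0.1)
diag(score_matrix) <- 1
dimnames(score_matrix) <- list(products, products)

# scale = "none" keeps the scores as they are instead of rescaling each row.
hm <- heatmap(score_matrix, scale = "none", margins = c(6, 6))
```

By default `heatmap()` clusters rows and columns separately, using Euclidean distance and complete linkage, and reorders them so that similar products end up next to each other.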
A look at both dendrograms shows that the most independent product is Soda (Soda’s dendrogram line is the last one to join any other). When all of the products have joined a group (cluster) we have 3 basic groups that are marked on the visualization above.
- Cocoa and Milk
- Apple, Pear and Banana
- Soda, Ketchup and Pizza
We could have chosen a different cut-off to define our clusters, but 3 clusters seems very reasonable for the amount of data that we have. Just as an example: with 2 clusters, the first cluster would be milk and cocoa and the second would be everything else. With 4 clusters, soda would become a one-product group, separated from ketchup and pizza, compared to our 3-cluster structure.
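Choosing the number of clusters can also be done in code rather than by eye. The sketch below assumes the dendrogram comes from `hclust()` on Euclidean distances between rows (the defaults that `heatmap()` uses) and cuts it with `cutree()`. The score matrix is made up, with invented values of 0.8 inside a group and 0.1 across groups.

```r
products <- c("Pear", "Apple", "Banana", "Pizza", "Ketchup", "Soda", "Milk", "Cocoa")
group    <- c(1, 1, 1, 2, 2, 2, 3, 3)  # hypothetical group membership

# Made-up scores: 0.8 inside a group, 0.1 across groups, 1 on the diagonal.
score_matrix <- ifelse(outer(group, group, "=="), 0.8, 0.1)
diag(score_matrix) <- 1
dimnames(score_matrix) <- list(products, products)

# Hierarchical clustering of the row profiles, then a cut into 3 clusters.
tree     <- hclust(dist(score_matrix))
clusters <- cutree(tree, k = 3)

# Products listed per cluster.
split(names(clusters), clusters)
```

Changing `k` to 2 or 4 shows how the grouping changes with a different cut-off. On real data the result depends on the actual scores.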
Using The Related Products Data
The data that Google Analytics provides here in the form of the correlation score is extremely valuable and within easy reach. The correlation is already calculated and can be retrieved fast. All the expensive machine learning analysis has already been done for us by Google Analytics; the correlation scores are there, waiting for us. Exploring and analyzing them is a great first step, but it is of much greater value to put the data to use and start increasing our analytics ROI.
Here are a few examples of how to do that:
- Upsell. If a user is about to buy only pizza and ketchup, we could display a message offering 20% off soda if they add it to their order. Our data suggests that our users like to see these products in the same basket; maybe all that’s missing is a little enticement.
- Email marketing with products. If a user has bought a certain product, or even if we have only noticed their high engagement with it, an email with related (personalized) products is worth a lot more than a generic email. In our case, if a user buys apples and bananas we could send them an email suggesting they try our pears.
- Remarketing. We know the type of products a visitor is interested in. If we decide to display our ads to them, we should make certain that those ads are relevant to what they’re searching for. So if they’re looking at a product that suggests a pizza party, it may be smart to offer them ketchup and soda.
- Recommend. The “you may also be interested in” type of products always seems to work on me. A good phone case goes well with a new phone, and it looks like suggesting milk to someone purchasing cocoa makes sense as well! Then there are also cases when a user is not sure whether apples are what they really want to buy on our website, but offering them pears and bananas may persuade them to finish a purchase.
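All four ideas above boil down to the same lookup: given a product, find the related products with the highest correlation score. A minimal sketch of that lookup follows. The helper `top_related` and the data frame `related` with its scores are hypothetical. In practice the input would be the data retrieved from Google Analytics.

```r
# Made-up scores in the same shape as the query response.
related <- data.frame(
  queryProductId   = c("Cocoa", "Cocoa", "Cocoa", "Pizza", "Pizza"),
  relatedProductId = c("Milk", "Soda", "Apple", "Ketchup", "Soda"),
  correlationScore = c(0.9, 0.2, 0.1, 0.8, 0.6),
  stringsAsFactors = FALSE
)

# Return the n related products with the highest score for one query product.
top_related <- function(data, product, n = 2) {
  rows <- data[data$queryProductId == product, ]
  rows <- rows[order(rows$correlationScore, decreasing = TRUE), ]
  head(rows$relatedProductId, n)
}

cocoa_suggestions <- top_related(related, "Cocoa")         # "Milk" "Soda"
pizza_suggestion  <- top_related(related, "Pizza", n = 1)  # "Ketchup"
```

Because the score is directed, the lookup should always start from the product the user has bought or is viewing, that is, the query product.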
* 8 products form 28 pairs (the handshake problem: n*(n-1)/2 = 8*7/2 = 28). Because the correlation score in Google Analytics Related Products is directed, we need to multiply that by 2 to get 56 correlations.
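The count in the footnote can be checked with a few lines of R, as a small sketch that enumerates every ordered pair of different products:

```r
products <- c("Pear", "Apple", "Banana", "Pizza", "Ketchup", "Soda", "Milk", "Cocoa")

# All ordered (query, related) combinations, without a product paired with itself.
pairs <- expand.grid(query = products, related = products, stringsAsFactors = FALSE)
pairs <- pairs[pairs$query != pairs$related, ]

n_directed   <- nrow(pairs)                  # 56
n_undirected <- choose(length(products), 2)  # 28
```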