Adswerve

Analytics per Capita – Correlations, Revenues, States (R)


February 5, 2015

In this post I’ll show you how to connect your Google Analytics data with an outside source of data in R and perform some basic data analysis. We will focus on identifying US states based on their per capita value for our website or application.

Motivation

We are interested in performing an analysis of sessions (visits) and sales per US state. We want to know how interested people are in our brand and how much they’re spending depending on the state or region.

Google Analytics offers a great report for an overview of visits and sales by geographical location. You can find that report under Audience → Geo → Location.

Although we can get some really valuable information regarding users that visited our website, we cannot answer questions like how many times on average a Washingtonian visited our website or how much money on average a person from Montana spent on our website last month.

Google Analytics visits by state

Example 1: For almost any website, the heat map of sessions and transaction revenue in Google Analytics looks very similar. California leads the way, followed by Florida, Texas and New York. This comes as no surprise, since the number of visits and transactions is strongly correlated with each state’s population. In this blog post we will look at how engaged the residents of each state are with our website. So instead of visits per US state, we will look at visits per capita for each state. The 912,907 sessions from California then become 0.023 sessions per resident. This makes it much easier to identify states where our presence is stronger.

Example 2: In the report we can look at the per-session value for each state. As you can see, the per-session value for users from Montana is $0.81. That means that 1,000 visits from Montana together brought in $810. The number tells us how valuable each visit from the state is, but it doesn’t tell us how much money an average resident of Montana spent on our website.

Google Analytics per-session value by state

Google Analytics eCommerce transaction numbers

Data Source

To get data from Google Analytics we will be using the RGoogleAnalytics library. If you are new to connecting Google Analytics to R please read our How To tutorial.

As a data source for population we will be using the US Census Bureau’s 2014 state population estimates. These are available at census.gov.

Analysis

First we need to read the US Census data.

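library(RCurl)
# censusURL: address of the 2014 state population estimates CSV on census.gov (not given here)
usCensus <- read.csv(text = getURL(censusURL), stringsAsFactors = FALSE)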

Let’s continue the analysis by querying Google Analytics with the ga:region dimension (which provides US states) and the ga:sessions and ga:transactionRevenue metrics, filtered by ga:country==United States.

query.list <- Init(start.date = start.date,
                   end.date = end.date,
                   dimensions = "ga:region",
                   metrics = "ga:sessions,ga:transactionRevenue",
                   filters = "ga:country==United%20States",
                   table.id = profile.id)

ga.query <- QueryBuilder(query.list)
ga.data <- GetReportData(ga.query, token)

In the next few lines we use the dplyr library. First we join the US census data and the GA data on the state name, which is stored as the attribute “NAME” in the US census data and as “region” in the Google Analytics data. Then we calculate the transaction revenue per resident and the number of sessions per resident.

library(dplyr)
stateData <- inner_join(usCensus, ga.data, by=c("NAME" = "region"))
stateData <- mutate(stateData, transactionRevenuePerResident = transactionRevenue/POPESTIMATE2014)
stateData <- mutate(stateData, sessionsPerResident = sessions/POPESTIMATE2014)

The structure of the data table stateData at this step should look like this:

NAME, SUMLEV, REGION, DIVISION, STATE, POPESTIMATE2014, POPEST18PLUS2014, PCNT_POPEST18PLUS, sessions, transactionRevenue, transactionRevenuePerResident, sessionsPerResident
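
An inner join silently drops rows whose keys do not match, so it can be worth checking which Google Analytics regions found no partner in the census table. The sketch below uses two small made-up data frames in place of usCensus and ga.data; the names census, ga, unmatched and joined are hypothetical, and the figures are illustrative rather than real data.

```r
library(dplyr)

# Toy stand-ins for usCensus and ga.data; all numbers are made up.
census <- data.frame(NAME = c("Montana", "Utah", "Colorado"),
                     POPESTIMATE2014 = c(1000000, 3000000, 5000000),
                     stringsAsFactors = FALSE)
ga <- data.frame(region = c("Montana", "Utah", "(not set)"),
                 sessions = c(20000, 90000, 500),
                 transactionRevenue = c(16200, 30000, 100),
                 stringsAsFactors = FALSE)

# Rows of ga whose region has no matching NAME in census.
unmatched <- anti_join(ga, census, by = c("region" = "NAME"))

# Same join and per-resident columns as in the post.
joined <- inner_join(census, ga, by = c("NAME" = "region")) %>%
  mutate(transactionRevenuePerResident = transactionRevenue / POPESTIMATE2014,
         sessionsPerResident = sessions / POPESTIMATE2014)

print(unmatched$region)
print(joined[, c("NAME", "sessionsPerResident", "transactionRevenuePerResident")])
```

Running the same anti_join on the real tables would list any regions that fall out of stateData, for example because of spelling differences between the two sources.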

Using ggplot2, we’ll first illustrate the claim from Example 1 that transaction revenue is tightly connected to the population of each state.

library(ggplot2)
states <- map_data("state")
stateData$NAME <- tolower(stateData$NAME)
choro <- inner_join(states, stateData, by = c("region" = "NAME"))
choro <- choro[order(choro$order), ]

plot1 <- qplot(long, lat, data = choro, group = group, fill = transactionRevenue, geom = "polygon")
plot2 <- qplot(long, lat, data = choro, group = group, fill = POPESTIMATE2014, geom = "polygon") 

library(gridExtra)  # provides grid.arrange()
grid.arrange(plot1, plot2, ncol=2)

Transaction revenue and population estimate maps side by side. The resemblance is quite obvious.

The two maps look much alike: the larger a state’s population, the more transaction revenue it generated. With R we can check this quickly.

cor(stateData$transactionRevenue, stateData$POPESTIMATE2014)

In our case the (Pearson’s) correlation is 0.85*, which is considered a high positive correlation and is statistically significant.
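
Note that cor() returns only the coefficient. One way to attach a p-value to it is base R’s cor.test(), which tests the null hypothesis of zero correlation. The post does not show how significance was assessed, so the sketch below is an assumption about one reasonable approach, and it runs on simulated data (hypothetical variables population and revenue) rather than the real stateData.

```r
# Simulated stand-in for 50 states; values are random, not real data.
set.seed(42)
population <- runif(50, min = 5e5, max = 4e7)
revenue <- 0.03 * population + rnorm(50, sd = 2e5)

r <- cor(revenue, population)           # Pearson correlation coefficient
test <- cor.test(revenue, population)   # t-test of H0: correlation is zero

print(r)
print(test$p.value)
```

On the real table the equivalent call would be cor.test(stateData$transactionRevenue, stateData$POPESTIMATE2014).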

plot3 <- qplot(long, lat, data = choro, group = group, fill = transactionRevenuePerResident, geom = "polygon")
plot4 <- qplot(long, lat, data = choro, group = group, fill = sessionsPerResident, geom = "polygon")
grid.arrange(plot3, plot4, ncol=2)

Map showing transaction revenue per resident and sessions per resident.

These two maps reveal some states that were hidden before. First let’s interpret the scales. The transaction revenue per resident tells us how many dollars were spent per resident in the selected date range (1 month in our case). This means that an “average resident” of Colorado or Wyoming spent more than 6 cents in that month, while an average resident of Florida spent only about 2 cents. On the sessions-per-resident map, another state stands out west of Colorado. It looks like a lot of Utah residents are aware of our products but are not as interested in buying them. In this case, it may be smart to run a Utah-only special promotion and see how it affects sales.

Extra

At this point we have produced good answers for the examples from the Motivation section, but we still have some cool data lying in R, just waiting to be played with.

Just because we can, let’s look at how the percentage of people over 18 relates to transaction revenue per resident in each state.

plot5 <- ggplot(stateData, aes(x = PCNT_POPEST18PLUS, y = transactionRevenuePerResident)) +
       geom_point(shape = 1) +      # one hollow circle per state
       geom_smooth(method = "lm")   # linear regression line with confidence band
plot5

Scatter plot of transaction revenue per resident against the percentage of residents over 18. Dots represent different states. The regression line shows that states with a higher percentage of residents over 18 tend to have higher transaction revenue per resident.

This is pretty interesting, so let’s explore it a little more.

# Slope and intercept of the linear regression
lm(transactionRevenuePerResident ~ PCNT_POPEST18PLUS, data = stateData)$coefficients
# Pearson correlation and a test of whether it differs from zero
cor(stateData$transactionRevenuePerResident, stateData$PCNT_POPEST18PLUS)
cor.test(stateData$transactionRevenuePerResident, stateData$PCNT_POPEST18PLUS)

From the slope of the linear regression we can see that each additional percentage point of the population over 18 goes with around 0.2 cents more revenue per resident per month. From the test we can conclude that the positive correlation of 0.30 between the two variables is statistically significant.
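
To make the slope and the significance test concrete, the sketch below fits the same kind of model on simulated data (made-up values; toy, pctOver18 and revenuePerResident are hypothetical names). In a simple regression with one predictor, the p-value of the slope equals the p-value from cor.test() on the same two variables, so either can back a claim about significance.

```r
# Simulated stand-in for the state table; values are random, not real data.
set.seed(1)
toy <- data.frame(pctOver18 = runif(50, min = 70, max = 82))
toy$revenuePerResident <- 0.002 * toy$pctOver18 + rnorm(50, sd = 0.01)

fit <- lm(revenuePerResident ~ pctOver18, data = toy)

# Change in revenue per resident for one extra percentage point over 18.
slope <- coef(fit)[["pctOver18"]]

# p-value of the slope and p-value of the correlation test.
slopeP <- summary(fit)$coefficients["pctOver18", "Pr(>|t|)"]
corP <- cor.test(toy$revenuePerResident, toy$pctOver18)$p.value

print(c(slope = slope, slopeP = slopeP, corP = corP))
```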

Because it would be interesting and only takes 3 lines of code, let’s also look at the USA by region and compare transaction revenue per resident.

stateData$REGION <- factor(stateData$REGION)
levels(stateData$REGION) <- c("Northeast", "Midwest", "South", "West")
plot(stateData$REGION, stateData$transactionRevenuePerResident)

Box plot of transaction revenue per resident by US region.

The best Midwest state performs only as well as the worst Northeast state, and the average revenue per resident there is about half of that in the Northeast. This could mean that a major competitor is taking over those states, or maybe that we simply lack presence there.
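
A comparison like this can also be read off a per-region summary table instead of the plot. The sketch below is a minimal example on made-up numbers (toy and regionStats are hypothetical names), constructed so that the best Midwest value equals the worst Northeast value.

```r
# Toy stand-in for stateData; all numbers are made up.
toy <- data.frame(
  REGION = factor(c(1, 1, 2, 2, 3, 4),
                  labels = c("Northeast", "Midwest", "South", "West")),
  transactionRevenuePerResident = c(0.04, 0.06, 0.01, 0.04, 0.03, 0.05)
)

# One column per region, with the minimum, median and maximum per resident.
regionStats <- sapply(
  split(toy$transactionRevenuePerResident, toy$REGION),
  function(x) c(min = min(x), median = median(x), max = max(x))
)
print(regionStats)
```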

Next Steps

Doing a good analysis of your data is only part of the work. The next step is reacting: coming up with a plan for how best to use the results of the analysis.

In our example we could identify states that have a high per-session revenue and a low number of sessions per resident. Let’s call them X states. We could assume that a typical resident of an X state has not heard of our website (an assumption based on the low number of sessions per resident), but that those who have tend to spend a lot more money than an average user (an assumption based on the high per-session revenue). We believe that spending more money on marketing in those states would increase transaction revenue faster than marketing somewhere else.

A solid suggestion to our marketing team would therefore be to create campaigns that target those states, given that not many people there know about our website and those who do tend to buy more. The campaign ROI will soon tell us whether our assumptions were correct.
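
The selection of X states can be sketched as a simple filter. The post does not say what counts as “high” or “low”, so the median thresholds below are an assumption, and the data frame toy with its numbers is made up for illustration.

```r
# Toy stand-in for stateData; all numbers are made up.
toy <- data.frame(
  NAME = c("A", "B", "C", "D"),
  sessions = c(1000, 50000, 2000, 40000),
  transactionRevenue = c(2000, 25000, 1000, 60000),
  POPESTIMATE2014 = c(1e6, 2e6, 1e6, 1e6),
  stringsAsFactors = FALSE
)
toy$revenuePerSession <- toy$transactionRevenue / toy$sessions
toy$sessionsPerResident <- toy$sessions / toy$POPESTIMATE2014

# Assumed thresholds: above-median value per session, below-median reach.
isX <- toy$revenuePerSession > median(toy$revenuePerSession) &
  toy$sessionsPerResident < median(toy$sessionsPerResident)
xStates <- toy$NAME[isX]
print(xStates)
```

Other cut-offs, such as quartiles, would work the same way; the choice mainly controls how many states end up on the campaign list.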

*The Pearson correlation coefficient measures the strength of the linear association between two variables (A and B). A value of 0 implies no linear association, +1 implies a perfect positive correlation (bigger A => bigger B) and -1 implies a perfect negative correlation (bigger A => smaller B).