Google Analytics for Firebase export to Google BigQuery has been available for a while but not long enough to eliminate ambiguity around calculating some of the most basic metrics we’re accustomed to working with. Additionally, in the next few weeks, new BigQuery export schema will be introduced, replacing what’s currently in use.
Luckily, the difference between schemas is not going to be dramatic. Google also released documentation around migrating from the old to new schema. Essentially, there is one large query which takes data from the old schema, makes some transforms, and writes data back to BigQuery in the new schema. Nothing is added, nothing removed, just columns renamed and shuffled around.
To be ready for the future, this blog post will assume the reader is using the new schema, however, even if you still have some of the data in the old format, the insights from this blog post will be applicable to the old schema.
The challenge we’ll try to solve is how to group Firebase events into user sessions. Once we do that we’ll be able to calculate session metrics like session duration, number of (specific) events in the session, first event / last event, and all the others you might think of…
But why is that the challenge in the first place and why is there ambiguity around it?
There may be many reasons, but let’s point out three of the main ones:
- Long before Google Analytics for Firebase export to BQ was available, Google Analytics export to BQ already existed. The rows in that export represent one single user session and, additionally, the schema already includes pre-calculated metrics like session duration, number of pageviews in a session, number of hits in a session… All of which make it very simple to think about user activity in session-based context especially because whenever you start querying you always start at the session level (each row is one session). Many people got used to that and are now having to deal with Firebase data where data is structured differently. Rows represent a single event (in the context of Google Analytics, that’s a hit) which was tracked. You don’t get any session id or pre-aggregated session metrics. You have to do that all by yourself.
- That brings us to the second reason, if you want to construct session ids by yourself, you have to first understand how Firebase tracks sessions. This is where much of the ambiguity lies.
- Even if you do understand all of the above, the query actually is not very straightforward and it takes some advanced SQL knowledge to pull it off.
How Google Analytics for Firebase tracks sessions
The first thing worth noting is that Google Analytics for Firebase tracks some events automatically. The complete list can be found here, but, most importantly, among those events there is a session_start event which plays a crucial role in defining Firebase sessions in BQ.
Google Analytics for Firebase defines a session as a user engaging with your app for a minimum amount of time (10 seconds by default) followed by your user not engaging with your app for a certain amount of time (30 minutes by default). You can change those times if you’d like something different. This opens a window for 3 cases:
- No session_start event logged if the user engages with a screen for less than 10 seconds.
- session_start event logged if the user engages for more than 10 seconds till there is an inactivity for continuous 30 minutes. All the screens logged in between this time would be considered as a single session.
- User logs a session_start and there is an inactivity for 30 minutes by closing the app, user opens the app would result in the another session_start event logged.
Now that we know what defines a session we can start implementing this logic in SQL.
The goal is to write a query that will associate every user event with an id of that user’s session. All events with the same id will be considered in the same session. The query has a few layers, so we’ll go layer by layer starting with the innermost which is the one everything else is built upon.
Layer 1 – The Setup
SELECT user_pseudo_id, --unique user id event_timestamp, --time in microseconds UTC as logged on the client TIMESTAMP_MICROS(event_timestamp) AS ts, --transform to TIMESTAMP type LAG(TIMESTAMP_MICROS(event_timestamp)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS prev_evt_ts, --timestamp of the event before the current one IF(event_name = "session_start", 1, 0) AS is_session_start_event FROM `<your project id>.<your dataset id>.<your table id>`
The query is pretty straightforward with the exception of the usage of the LAG analytic function, which might be new to those less proficient in SQL. Generally, we’re only selecting user id and event timestamp but with the usage of the LAG function we take a look at the timestamp before the current event (we’ll use this later to check if 30 minutes has already passed). And lastly, we take note when we encounter event by the name of session_start.
One thing to note: if there’s no previous event available prev_evt_ts will return null. If is_session_start_event is 1 that means the row represents a session_start event. That information will come in handy later.
Layer 2 – Marking Session Starts
SELECT *, IF(COALESCE(TIMESTAMP_DIFF(ts, prev_evt_ts, SECOND) / 60, 1e10) >= 30, 1, 0) AS session_start FROM ( SELECT user_pseudo_id, event_timestamp, TIMESTAMP_MICROS(event_timestamp) AS ts, LAG(TIMESTAMP_MICROS(event_timestamp)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS prev_evt_ts, IF(event_name = "session_start", 1, 0) AS is_session_start_event FROM `<your project id>.<your dataset id>.<your table id>` )
We’ve wrapped our layer 1 query into another query. The only addition is the session_start column. The purpose is to make a note on which event a new session should be started.
TIMESTAMP_DIFF(ts, prev_evt_ts, SECOND) / 60 is used to calculate the difference between the current timestamp and the previous timestamp in minutes. Keep in mind that prev_evt_ts can be null if there is no previous event, meaning that TIMESTAMP_DIFF would also return null. That’s why we wrapped it inside a COALESCE function which returns the first non null value. In case the first one is null we output 1e10 which is a big number and clearly larger than 30, that means we automatically make note to start a new session if there is no previous event before the current one. Also if the difference is larger than 30 that also means a new session should be started as per Firebase’s default session timeout duration.
Layer 3 – Creating Session ID
SELECT *, SUM(session_start) OVER (PARTITION BY user_pseudo_id ORDER BY ts) AS session_id FROM ( SELECT *, IF(COALESCE(TIMESTAMP_DIFF(ts, prev_evt_ts, SECOND) / 60, 1e10) >= 30, 1, 0) AS session_start FROM ( SELECT user_pseudo_id, event_timestamp, TIMESTAMP_MICROS(event_timestamp) AS ts, LAG(TIMESTAMP_MICROS(event_timestamp)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS prev_evt_ts, IF(event_name = "session_start", 1, 0) AS is_session_start_event FROM `<your project id>.<your dataset id>.<your table id>` ) )
We did the same thing as before, wrapping our layer 2 query into a new query and added exactly one additional column.
Column session_id is meant to represent the id of the user’s session.
We achieve that by using analytic functions once again. This time we partition our data by user id and order it by event timestamp in an ascending order and SUM our session_start column that we prepared in the previous layer. The function moves row by row and adds the previous sum to the new sum.
We see how session_id changes once we need to start with a new session. This is almost good enough. What we’re doing is grouping all user events that are no more than 30 minutes apart and making that our session id.
However, at the beginning we mentioned that session_start event plays a role in how a session is defined. That’s why we created an is_session_start_event column, but we’re not yet using it.
Firebase will only acknowledge a session if there is a session_start event inside it.
Layer 4 – Marking Valid and Invalid Sessions
SELECT user_pseudo_id, event_timestamp, session_id, MAX(is_session_start_event) OVER (PARTITION BY user_pseudo_id, session_id) AS has_session_start_event FROM ( SELECT *, SUM(session_start) OVER (PARTITION BY user_pseudo_id ORDER BY ts) AS session_id FROM ( SELECT *, IF(COALESCE(TIMESTAMP_DIFF(ts, prev_evt_ts, SECOND) / 60, 1e10) >= 30, 1, 0) AS session_start FROM ( SELECT user_pseudo_id, event_timestamp, TIMESTAMP_MICROS(event_timestamp) AS ts, LAG(TIMESTAMP_MICROS(event_timestamp)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS prev_evt_ts, IF(event_name = "session_start", 1, 0) AS is_session_start_event FROM `<your project id>.<your dataset id>.<your table id>` ) ) )
This is the last layer—we expose user id—event timestamp and session id from the previous query and add another new column.
has_session_start_event will tell us if the session had session_start event or not. That way we can easily exclude sessions that lasted less than 10 seconds and had no session_start or do just the opposite and analyze only those short, not-acknowledged-by-firebase sessions.
The analytic function partitions data by user id and session id and calculates the MAX value of is_session_start_event column, if the result is 1 we know that somewhere inside that session session_start event was sent.
Thing to note is that session_start event is not guaranteed to be the first one in the session.
Example is the same as the previous layer. Here we see that has_session_start_event is set to 1 indicating both sessions are valid.
Above, we see a session with 3 events, but none of those events are session_start which would make this an invalid session per Firebase’s session definition.
That does it. With this query we can now explore user sessions in more detail.
Using the query to calculate session metrics
You can use this query as is to explore further, in our case we’ll save it as a view so the examples will look cleaner.
The query shown below gives us the number of events in each session as well as session duration.
SELECT user_pseudo_id, session_id, COUNT(*) AS events, TIMESTAMP_DIFF(MAX(TIMESTAMP_MICROS(event_timestamp)), MIN(TIMESTAMP_MICROS(event_timestamp)), SECOND) AS session_duration FROM `<your project id>.<your dataset id>.<your view name>` WHERE has_session_start_event = 1 --only look at valid sessions GROUP BY user_pseudo_id, session_id
Note that we are only looking at valid sessions. From there on it’s fairly easy to further calculate global averages of events per session and average session duration.
The idea is that you can use user_pseudo_id and event_timestamp from the query (or view) and join it back with the original data so you have that extra data available right beside your main data.
You’ll notice that if you look at only invalid sessions (sessions without session_start event), you’ll most likely see a few sessions longer than 10 seconds. Due to the way Firebase works, all sessions longer than 10 seconds should have session_start event however it is never guaranteed that network errors or other problems on the device might prevent it from being sent. Overall, there should be a small number of those cases which likely won’t interfere with your reporting.
If you have an Android and iOS app, you’ll have two tables with Firebase data. The schemas of both tables are the same so the query will work on both of them.
Make sure to not only copy the query but also understand how it works!