Funnels with SQL

Don't build funnels in SQL. Unless you have to.

Funnels with SQL

When it comes to product optimization, funnels are everything. OK, not everything. These days product leaders talk a lot about loops, not funnels. The thinking is, products should create positive feedback loops if they're going to succeed! (Or if they're going to attract VC.) But actually loops are funnels, they just feed into each other. So no matter what you're building, it's inevitable that you're going to need to build a funnel at some point. In this post, I'll show you how to create funnels with SQL.

Interpreting a funnel

In tools like Mixpanel, funnel reports look something like this.

We might interpret this report as...

For the given period, 94.1K users entered a login funnel, with dropoff at every step of the way until users get to their dashboard. The biggest dropoff occurs in the final step – about half of logged in users don't make it to their dashboard.

Counting method

Users, Sessions, and Events

In fact, there's a little bit more going on here. We don't actually know our counting method. Is this users, sessions, or events? I'm a fan of using sessions by default because sessions are discrete (you can always add them up to get a total), and in many cases a user should be able to complete a flow in a single session. But defining sessions with SQL can get tricky (though it is very doable). So let's work with users.

Conversion window

Minutes, Days, and Unbound

Another consideration is the conversion window. How much time are we giving users to complete all the steps? If we set the conversion window at 5 minutes and the user logs in at 4 minutes and then sees the dashboard at 10 minutes, then that will only count as a conversion for login. Sometimes it makes sense as a product manager to define the conversion window to a short period of time because you know it's important that your users can complete a flow quickly. But other times it makes sense to set the conversion window to weeks or longer because you know that the happy path on your user journey is long. Your analytics reports should reflect what's happening and what has business value in real life.

A funnel in SQL

We'll use the Stack Overflow dataset because it's the only public dataset on BigQuery that includes tables that look a lot like the event stream data you might find in a data warehouse at work.

DECLARE yr, conversion_window INT64;
SET (yr, conversion_window) = (2021, 30);

WITH users AS (
FROM `bigquery-public-data.stackoverflow.users`
WHERE EXTRACT(YEAR FROM creation_date) = yr

users_questions as (
SELECT u.display_name, user_id, u.creation_date signup, COUNT( questions, MIN(q.creation_date) first_question
FROM users u
LEFT JOIN `bigquery-public-data.stackoverflow.posts_questions` q on q.owner_user_id = 
  AND DATE_DIFF(q.creation_date, u.creation_date, day) <= conversion_window
GROUP BY 1,2,3

users_questions_answers as (
SELECT display_name, user_id, signup, questions, first_question, count( answers_after_question
FROM users_questions uq
LEFT JOIN `bigquery-public-data.stackoverflow.posts_answers` a on a.owner_user_id = uq.user_id 
  AND a.creation_date > uq.first_question
  AND DATE_DIFF(a.creation_date, first_question, day) <= conversion_window
GROUP BY 1,2,3,4,5

  EXTRACT(MONTH FROM signup) month,
  COUNT(user_id) new_users,
  COUNT(DISTINCT CASE WHEN questions >0 THEN user_id ELSE NULL END) asked,
  ROUND(COUNT(DISTINCT CASE WHEN questions >0 THEN user_id ELSE NULL END) * 1.0 / COUNT(user_id), 3) pct_asked,
  COUNT(DISTINCT CASE WHEN answers_after_question >0 THEN user_id ELSE NULL END) then_answered,
  ROUND(COUNT(DISTINCT CASE WHEN answers_after_question >0 THEN user_id ELSE NULL END) * 1.0 / COUNT(user_id), 3) pct_then_answered
FROM users_questions_answers

Stack Overflow Funnel - Ask & then Answer - Run the query!

The query above creates a funnel for monthly cohorts of Stack Overflow users who signed up in 2021. We join common table expressions (CTEs) to progressively build up to a final query that produces a result: for each cohort, what was the conversion rate across steps from Signup to Ask to Answer, where every event in the funnel needs to occur sequentially and within a 30-day conversion window.

The query generates this result:

Stack Overflow Funnel - Ask & then Answer - Run the query!

Here we see that about 250K users signed up every month, with about 8% of users posting a question within the first 30 days after signup. About 1% of those users asked and then answered within 30 days.

DECLARE and SET your variables

We can tweak the query to see how conversion rates have changed across years. One way to easily explore the data is to update the yr variable in the DECLARE and SET statement. See what happens when we change it to 2009 and hit run.

DECLARE yr, conversion_window INT64;
SET (yr, conversion_window) = (2009, 30);

New users were A LOT more engaged in 2009. 50% asked a question within 30 days, and about 20% asked and then answered within 30 days.

Declaring variables is a powerful way to keep your code DRY (Don't Repeat Yourself). Variables allow you to store and update a value in a single place, minimizing opportunity for error, and saving you from the headache of picking through code to update a value in many places across your code.

DECLARE and SET can be used together. DECLARE creates the variable that can be used throughout your query. This statement must sit at the top of your code. SET assigns a value to the variable. You can include SET statements throughout your query to set and reset variables.

Variables are a powerful tool used across all programming languages. The official docs on Procedural Language do a good job demonstrating how you can use them in BigQuery. But not all SQL editors allow you to declare variables. Looker SQL Runner does not let you declare variables. Snowflake has no DECLARE, but it does let you SET a variable.

Using Variables in Funnels

Variables are great for funnels because we can store conversion criteria, such as the conversion window, at the top of our query, and we can easily modify the value just once to see how conversion rate changes as we tweak parameters.

For example, if we tweak our conversion_window from 30 to 180 days, our funnel gives new users a much longer duration during which they might convert, and we see conversion rates tick up across our funnel steps. In this case conversion does not increase dramatically, which tells us that most new users activate quickly after signup.

We can also modify the query and create a new variable counting_method to toggle between users, sessions, and events. And we can create an additional variable sequential to control whether or not steps must be completed one after the other to count as a conversion. We can do these things, but I need to get to the next section and I have no idea if anyone is reading this, so drop a comment if you want a funnel query with more controls and maybe I'll give it to you.

Why use SQL for Funnels

Don't use SQL! If you don't have to.

In the perfect world a company has a single analytics tool that is easy for anyone to use. The user interface lets you drag and drop, scan across event logs for user sessions, and easily create reports like funnels and retention curves. In the perfect world, databases, tables, records, and events are named intuitively, and there is up-to-date documentation on how everything works. Needless to say, this perfect world does not exist.

Real life is messy. Data is dirty. Naming conventions are inconsistent. And documentation is not up to date. So when you use an analytics tool like Mixpanel, Amplitude, or Google Analytics, you will have questions. Does that event mean what I think it means? How does this funnel logic work? Can a user enter the funnel multiple times? You can and should ask for help. But before you do, put in some effort to get an answer yourself. SQL can help you get these answers and become a person who has the answers when others have trouble.

Did you learn something from this post? See an error or have an observation? Leave a comment!

Other posts in this series: