BigQuery is a tremendously powerful tool. Not just because you can use it as a SQL client for your own data (for instance, you can use it to query a CSV that you upload), but also because it's a tremendous tool for teaching. BigQuery includes hundreds of public datasets that you can query for free (up to 1 TB per month). This is a great playground where you can practice your skills or...
I've spent many hours testing the datasets that are publicly available. A lot of them are outdated or thin. So many of them only include aggregated data, which limits the potential for fun and insights. A lot of these datasets are interesting, but they don't look anything like the data you might use at work.
But there are some keepers in here! These are my favorite BigQuery datasets.
Probably the most usable dataset to practice product analysis. That's because this dataset looks a lot like what you might use at work. It includes a huge
users table and a few tables – questions, answers, and votes – that resemble event logs associated with users, so that we can query the data to understand product fundamentals like activation, retention, and user segmentation. Many of the tutorials shared through this website use Stack Overflow data as an example.
Example query – Signup >> Ask Funnel (from Funnels in SQL)
DECLARE yr, conversion_window INT64; SET (yr, conversion_window) = (2021, 30); WITH users AS ( SELECT * FROM `bigquery-public-data.stackoverflow.users` WHERE EXTRACT(YEAR FROM creation_date) = yr ), users_questions as ( SELECT u.display_name, u.id user_id, u.creation_date signup, COUNT(q.id) questions, MIN(q.creation_date) first_question FROM users u LEFT JOIN `bigquery-public-data.stackoverflow.posts_questions` q on q.owner_user_id = u.id AND DATE_DIFF(q.creation_date, u.creation_date, day) <= conversion_window GROUP BY 1,2,3 ) select EXTRACT(MONTH FROM signup) month, COUNT(user_id) new_users, COUNT(DISTINCT CASE WHEN questions >0 THEN user_id ELSE NULL END) asked, ROUND(COUNT(DISTINCT CASE WHEN questions >0 THEN user_id ELSE NULL END) * 1.0 / COUNT(user_id), 3) pct_asked FROM users_questions GROUP BY 1 ORDER BY 1 ASC
NYC Taxi Fares
A record of every yellow taxi ride in NYC since 2011, including trip distance, passenger count, fare, and tip amount. These datasets are pretty big, with 36 million rows for the 2022 dataset. Read more about this dataset.
Example query – NYC taxi fare deciles (from Histograms in SQL)
SELECT APPROX_QUANTILES(total_amount, 9) FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022`
Machine Learning Datasets
Includes classic datasets used in machine learning 101 classes such as Irises, Credit Card Defaults, ULB Fraud Detection, and Penguins.
Example query – Distribution of penguin height (from Histograms in SQL)
DECLARE bin INT64 DEFAULT 200; DECLARE outlier FLOAT64 DEFAULT null; SET outlier = (SELECT PERCENTILE_CONT(body_mass_g, outlier) OVER() from `bigquery-public-data.ml_datasets.penguins` limit 1); WITH cleaned as ( SELECT body_mass_g, outlier, species, (case when body_mass_g > outlier then outlier else body_mass_g end) clean FROM `bigquery-public-data.ml_datasets.penguins` ) SELECT FLOOR(clean / bin * 1.0) * bin flr, CASE WHEN FLOOR(outlier) > (FLOOR(clean / bin * 1.0) * bin) AND FLOOR(outlier) < (FLOOR(clean / bin * 1.0) * bin + bin) then FLOOR(clean / bin * 1.0) * bin || ' - max' else FLOOR(clean / bin * 1.0) * bin || ' - ' || (FLOOR(clean / bin * 1.0) * bin + bin) end weight_range, COUNT(*) AS penguins, COUNT(CASE WHEN species = 'Adelie Penguin (Pygoscelis adeliae)' then 1 else null end) adelie, COUNT(CASE WHEN species = 'Chinstrap penguin (Pygoscelis antarctica)' then 1 else null end) chinstrap, ROUND(COUNT(*) * 1.0 / (SELECT COUNT(*) FROM cleaned),3) percent FROM cleaned GROUP BY 1,2 ORDER BY 1 ASC
A massive database that includes traffic data for every single page on Wikipedia. This is a great dataset for you to practice
PARTITION BY. You'll need it, or else you'll run out of your free 1 TB fast.
Example query – Yesterday's most popular Wikipedia pages (from How to Use BigQuery)
SELECT title, sum(views) views FROM `bigquery-public-data.wikipedia.pageviews_2023` WHERE DATE(datehour) > CURRENT_DATE() - 1 AND LOWER(title) NOT LIKE '%wikipedia%' AND LOWER(title) NOT LIKE '%cookie%' AND LOWER(title) NOT LIKE '%main%' AND LOWER(title) NOT LIKE '%search%' GROUP BY 1 ORDER BY 2 desc LIMIT 100
Like Stack Overflow and Wikipedia, this is another corpus of user generated content. There used to be a few more tables in here (the comments were especially fun), but now it's just a record of posts and votes.
Example query – Most popular posts about Steve Jobs
SELECT f.id, f.title story_title, f.timestamp story_time, f.score story_score FROM `bigquery-public-data.hacker_news.full` f WHERE LOWER(f.title) like '%steve jobs%' ORDER BY f.score DESC NULLS LAST LIMIT 10
See what the world is searching for. Actually, what they searched for – this massive dataset is usually a few days stale. Still, it can be useful for real life use cases like planning a content calendar or establishing an SEO strategy.
Example query – Top 25 search terms in the UK
SELECT DISTINCT t.term, min(t.rank) rnk FROM `bigquery-public-data.google_trends.international_top_terms` t WHERE refresh_date = DATE_SUB(CURRENT_DATE(), INTERVAL 4 DAY) AND country_code = 'GB' GROUP BY 1 ORDER BY 2 ASC
A comprehensive record of names given to babies in the United States, grouped on states. A fun dataset for trendspotting and prediction. Which grandma name will be popular next year?
SELECT YEAR, SUM(n.number) number, COUNT(DISTINCT n.name) names, SUM(case when n.name in ('Sam','Samuel','Samantha') then n.number else null end) sams FROM `bigquery-public-data.usa_names.usa_1910_current` n GROUP BY 1 ORDER BY 1 DESC
Iowa Liquor Sales
A record of every bottle of liquor sold in the state of Iowa since 2012, including details about the manufacturer and the store where the bottle was sold. Fun dataset for time series analysis and can be used for machine learning and forecasting. Read more.
Sample query – Products purchased per month
SELECT DATE_TRUNC(date, month) month, COUNT(*) items FROM `bigquery-public-data.iowa_liquor_sales.sales` GROUP BY 1 ORDER BY 1 DESC
NYC Citi Bike Trips
A record of all rides that occur on the largest bike share system in the US. At the time of this writing, it appears this dataset only goes through 2016.
Example query – Most popular Citi Bike stations
SELECT start_station_name, start_station_latitude, start_station_longitude, COUNT(*) AS num_trips FROM `bigquery-public-data.new_york.citibike_trips` GROUP BY 1,2,3 ORDER BY num_trips DESC LIMIT 10
This is a fake ecommerce dataset that is somewhat useful for practice working on analysis problems you might find if you worked for a business that sells widgets online. This dataset includes a few tables such as inventory items, orders, users and distribution centers. Read more.
Example query – Distribution of orders per user
WITH oo as ( SELECT u.id, count(o.order_id) orders FROM `bigquery-public-data.thelook_ecommerce.users` u LEFT JOIN `bigquery-public-data.thelook_ecommerce.orders` o on o.user_id = u.id GROUP BY 1 ORDER BY 2 DESC ) SELECT orders, count(*) users FROM oo GROUP BY 1 ORDER BY 1 ASC