Best BigQuery Datasets

A list of the top 10 most useful datasets available to query for free through BigQuery.

Best BigQuery Datasets

BigQuery is a tremendously powerful tool. Not just because you can use it as a SQL client for your own data (for instance, you can use it to query a CSV that you upload), but also because it's a tremendous tool for teaching. BigQuery includes hundreds of public datasets that you can query for free (up to 1 TB per month). This is a great playground where you can practice your skills or...

I've spent many hours testing the datasets that are publicly available. A lot of them are outdated or thin. So many of them only include aggregated data, which limits the potential for fun and insights. A lot of these datasets are interesting, but they don't look anything like the data you might use at work.

But there are some keepers in here! These are my favorite BigQuery datasets.

Stack Overflow

Probably the most usable dataset to practice product analysis. That's because this dataset looks a lot like what you might use at work. It includes a huge users table and a few tables – questions, answers, and votes – that resemble event logs associated with users, so that we can query the data to understand product fundamentals like activation, retention, and user segmentation. Many of the tutorials shared through this website use Stack Overflow data as an example.

Example query – Signup >> Ask Funnel (from Funnels in SQL)

DECLARE yr, conversion_window INT64;
SET (yr, conversion_window) = (2021, 30);

WITH users AS (
SELECT *
FROM `bigquery-public-data.stackoverflow.users`
WHERE EXTRACT(YEAR FROM creation_date) = yr
),

users_questions as (
SELECT u.display_name, u.id user_id, u.creation_date signup, COUNT(q.id) questions, MIN(q.creation_date) first_question
FROM users u
LEFT JOIN `bigquery-public-data.stackoverflow.posts_questions` q on q.owner_user_id = u.id 
  AND DATE_DIFF(q.creation_date, u.creation_date, day) <= conversion_window
GROUP BY 1,2,3
)

select 
  EXTRACT(MONTH FROM signup) month,
  COUNT(user_id) new_users,
  COUNT(DISTINCT CASE WHEN questions >0 THEN user_id ELSE NULL END) asked,
  ROUND(COUNT(DISTINCT CASE WHEN questions >0 THEN user_id ELSE NULL END) * 1.0 / COUNT(user_id), 3) pct_asked
FROM users_questions
GROUP BY 1
ORDER BY 1 ASC

NYC Taxi Fares

A record of every yellow taxi ride in NYC since 2011, including trip distance, passenger count, fare, and tip amount. These datasets are pretty big, with 36 million rows for the 2022 dataset. Read more about this dataset.

Example query – NYC taxi fare deciles (from Histograms in SQL)

SELECT APPROX_QUANTILES(total_amount, 9)
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022`

Machine Learning Datasets

Includes classic datasets used in machine learning 101 classes such as Irises, Credit Card Defaults, ULB Fraud Detection, and Penguins.

Example query – Distribution of penguin height (from Histograms in SQL)

DECLARE bin INT64 DEFAULT 200;
DECLARE outlier FLOAT64 DEFAULT null;
SET outlier = (SELECT PERCENTILE_CONT(body_mass_g, outlier) OVER() from `bigquery-public-data.ml_datasets.penguins` limit 1);

WITH cleaned as (
SELECT 
  body_mass_g, outlier, species,
  (case when body_mass_g > outlier then outlier else body_mass_g end) clean
FROM `bigquery-public-data.ml_datasets.penguins`
)

SELECT
  FLOOR(clean / bin * 1.0) * bin flr,
  CASE WHEN FLOOR(outlier) > (FLOOR(clean / bin * 1.0) * bin) AND FLOOR(outlier) < (FLOOR(clean / bin * 1.0) * bin + bin) then FLOOR(clean / bin * 1.0) * bin || ' - max' else FLOOR(clean / bin * 1.0) * bin || ' - ' || (FLOOR(clean / bin * 1.0) * bin + bin) end weight_range,
  COUNT(*) AS penguins,
  COUNT(CASE WHEN species = 'Adelie Penguin (Pygoscelis adeliae)' then 1 else null end) adelie,
  COUNT(CASE WHEN species = 'Chinstrap penguin (Pygoscelis antarctica)' then 1 else null end) chinstrap,
  ROUND(COUNT(*) * 1.0 / (SELECT COUNT(*) FROM cleaned),3) percent
FROM cleaned
GROUP BY 1,2
ORDER BY 1 ASC

Wikipedia

A massive database that includes traffic data for every single page on Wikipedia. This is a great dataset for you to practice PARTITION BY. You'll need it, or else you'll run out of your free 1 TB fast.

Example query – Yesterday's most popular Wikipedia pages (from How to Use BigQuery)

SELECT title, sum(views) views
FROM `bigquery-public-data.wikipedia.pageviews_2023`
WHERE DATE(datehour) > CURRENT_DATE() - 1
AND LOWER(title) NOT LIKE '%wikipedia%' AND LOWER(title) NOT LIKE '%cookie%' AND LOWER(title) NOT LIKE '%main%' AND LOWER(title) NOT LIKE '%search%'
GROUP BY 1
ORDER BY 2 desc
LIMIT 100

Hacker News

Like Stack Overflow and Wikipedia, this is another corpus of user generated content. There used to be a few more tables in here (the comments were especially fun), but now it's just a record of posts and votes.

Example query – Most popular posts about Steve Jobs

SELECT
 f.id, f.title story_title, f.timestamp story_time, f.score story_score
FROM `bigquery-public-data.hacker_news.full` f
WHERE LOWER(f.title) like '%steve jobs%'
ORDER BY f.score DESC NULLS LAST
LIMIT 10

See what the world is searching for. Actually, what they searched for – this massive dataset is usually a few days stale. Still, it can be useful for real life use cases like planning a content calendar or establishing an SEO strategy.

Example query – Top 25 search terms in the UK

SELECT DISTINCT t.term, min(t.rank) rnk
FROM `bigquery-public-data.google_trends.international_top_terms` t
WHERE refresh_date = DATE_SUB(CURRENT_DATE(), INTERVAL 4 DAY) AND country_code = 'GB'
GROUP BY 1
ORDER BY 2 ASC

Baby Names

A comprehensive record of names given to babies in the United States, grouped on states. A fun dataset for trendspotting and prediction. Which grandma name will be popular next year?

Example query

SELECT
  YEAR,
  SUM(n.number) number,
  COUNT(DISTINCT n.name) names,
  SUM(case when n.name in ('Sam','Samuel','Samantha') then n.number else null end) sams
FROM `bigquery-public-data.usa_names.usa_1910_current` n
GROUP BY 1 
ORDER BY 1 DESC

Iowa Liquor Sales

A record of every bottle of liquor sold in the state of Iowa since 2012, including details about the manufacturer and the store where the bottle was sold. Fun dataset for time series analysis and can be used for machine learning and forecasting. Read more.

Sample query – Products purchased per month

SELECT  
DATE_TRUNC(date, month) month,
COUNT(*) items
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY 1
ORDER BY 1 DESC

NYC Citi Bike Trips

A record of all rides that occur on the largest bike share system in the US. At the time of this writing, it appears this dataset only goes through 2016.

Example query – Most popular Citi Bike stations

SELECT
  start_station_name, start_station_latitude, start_station_longitude,
  COUNT(*) AS num_trips
FROM `bigquery-public-data.new_york.citibike_trips`
GROUP BY 1,2,3
ORDER BY num_trips DESC
LIMIT 10

The Look

This is a fake ecommerce dataset that is somewhat useful for practice working on analysis problems you might find if you worked for a business that sells widgets online. This dataset includes a few tables such as inventory items, orders, users and distribution centers. Read more.

Example query – Distribution of orders per user

WITH oo as (
SELECT u.id, count(o.order_id) orders
FROM `bigquery-public-data.thelook_ecommerce.users` u
LEFT JOIN `bigquery-public-data.thelook_ecommerce.orders` o on o.user_id = u.id
GROUP BY 1 
ORDER BY 2 DESC
)

SELECT orders, count(*) users
FROM oo
GROUP BY 1 
ORDER BY 1 ASC