Run this. Return a list of all users on Stack Overflow.
SELECT * FROM `bigquery-public-data.stackoverflow.users`
Read this first to set up BigQuery:
Congratulations, you know SQL.
Seriously. SQL is copy / pasting and troubleshooting. This query is pretty simple so you probably won't need to troubleshoot, at least not yet.
The syntax for SQL is fairly straightforward.
SELECT determines the columns you want to return.
FROM accesses the dataset.
In this specific example, we are using the asterisk:
*. This is a shortcut to return all of the columns in the dataset. The dataset we are using here is the users table in the stackoverflow database which is included in the bigquery-public-data dataset.
Let's try the same query but with a different dataset.
SELECT * FROM `bigquery-public-data.hacker_news.stories`
We chose these 2 tables for first queries because they're relatively small, so you won't burn through your 1 TB using
SELECT * here. But in general be careful querying all columns. The more columns you return the more data you burn through.
For any query, you can analyze the result at a glance in the Results that are returned below your query in the BigQuery console. Or you can save the results to a CSV or to Google Sheets and then do calculations there.
SELECT display_name, reputation FROM `bigquery-public-data.stackoverflow.users`
This query becomes about 10x cheaper now that we're only returning the columns we care about:
It is important that you have some “business sense” for what it is that you're querying. Data is never just data. It's a real record of an event or an entity that exists in real life. In this case, we're looking at user records for Stack Overflow, an online community website where users post questions and answers about software development (disclosure: I used to work there).
Stack Overflow licenses all content under Creative Commons so anybody can use it for (almost) any purpose. And periodically they make a full data dump of all posts available for anyone to access. Occasionally this dump is uploaded to Google Cloud. That's what we're querying now.
We'll be using the Stack Overflow database a lot on this website, not because I used to work there, but because this is probably the most robust and realistic dataset publicly available through BigQuery. This dataset looks a lot like what you'll find at work. It has a real users table and we can analyze their behavior through their questions, answers, and votes. We can do segmentation and cohorting to identify user traits that correlate with levels of activity. We can measure activation and retention and chart trends. These are real analysis techniques that I regularly use at work.
SELECT FROM WHERE
OK so you've returned an entire table. Now let's filter that result. Add a
WHERE statement to constrain the scope of what your query returns.
SELECT display_name, reputation FROM `bigquery-public-data.stackoverflow.users` WHERE display_name = 'samthebrand'
WHERE is how we filter. I know my username so I can find myself in this dataset.
We can include additional columns in the result to see even more user traits like name, reputation, location, and more.
SELECT display_name, reputation, location FROM `bigquery-public-data.stackoverflow.users` WHERE display_name = 'samthebrand'
WHERE: Comparison operators
We can use comparison operators to filter a dataset based on logic.
SELECT display_name, reputation, location FROM `bigquery-public-data.stackoverflow.users` WHERE reputation > 1000000
The query above returns all users where reputation is greater than one million. There are only a handful of generous Stack Overflow users who ever reached one million rep.
We found samthebrand with the
= comparison operator. There are several other operators we can use.
Here’s another example. This time we use a date to define the filter.
SELECT display_name, reputation, location, creation_date FROM `bigquery-public-data.stackoverflow.users` WHERE creation_date < ‘2008-08-01’
This query returns all 9 users who registered earlier than August 2008. I see some old friends here.
Here is another fundamental building block of SQL. You will use
ORDER BY all the time.
SELECT display_name, reputation, location FROM `bigquery-public-data.stackoverflow.users` ORDER BY reputation DESC
ORDER BY sorts a result by a column of our choice. It's just like Excel or Google Sheets when you click on the top of a column to sort a list alphabetically or chronologically. In the example above we sort the result by user reputation in descending order. This returns a list of the users with the most reputation. We could also sort by reputation ascending:
ASC. (Nobody on Stack Overflow has less than 1 rep — that'd just be cruel.)
Here's another example.
SELECT display_name, creation_date, reputation, views, location FROM `bigquery-public-data.stackoverflow.users` ORDER BY creation_date ASC
Sorting can also be done on dates, or really any datatype. The query above returns all user accounts with the oldest up top.
Here's another bit of SQL you'll use all the time.
SELECT display_name, reputation, location FROM `bigquery-public-data.stackoverflow.users` ORDER BY reputation DESC LIMIT 10
It's good practice to always use
LIMIT when writing queries. This puts a cap on the number of rows in your result. In some environments this might save your butt in case you were to otherwise accidentally download a massive dataset that crashes your computer.
In the example above, we use
ORDER BY and
LIMIT to return the “top ten” users by rep.
ORDER BY and
LIMIT is a common query pattern that delivers the most or least or best records in a dataset.
SELECT WHERE ORDER BY LIMIT
Putting it all together...
SELECT display_name, reputation, location, about_me FROM `bigquery-public-data.stackoverflow.users` WHERE creation_date > ‘2022-01-01’ ORDER BY reputation DESC LIMIT 10
We can combine what we've learned so far to write a query that returns the top users who meet all sorts of criteria. In the above query we filter to find the top 10 users by rep who registered their account since 2022.
AND & OR
We can combine comparison operators to build complex filters.
SELECT display_name, reputation, location FROM `bigquery-public-data.stackoverflow.users` WHERE location = ‘Pakistan' OR location = ‘Mongolia'
In the query above we find all users whose Location is stored exactly as Pakistan or Mongolia (there’s more than 20,000 users who fit this description).
We can go even further, combining as many comparison operators as we want. And we can include filters across properties.
SELECT display_name, reputation, location FROM `bigquery-public-data.stackoverflow.users` WHERE (location = 'NYC' OR location = 'New York City') AND reputation > 10000
Here we find all users with more than 10,000 rep who say they live in NYC or New York City. Now you can see how it might be useful for a technical recruiter to learn SQL.
Be careful when using comparison operators. The computer reads operators in a specific order according to logical precedence, similar to PEMDAS. It is useful to always use parentheses to group similar statements.
And always check your work. Run the query and scan your result. Does the result match the parameters of your filter? If you see users in your result who have less than 10k rep then you did something wrong.
LIKE and NOT LIKE
NOT LIKE are useful operators that let you specify results that do not exactly match a lookup value.
SELECT display_name, reputation, location, about_me FROM `bigquery-public-data.stackoverflow.users` WHERE location LIKE ‘%City’
The query above returns all users who have a location that ends in "City". So in your result you’ll see users from New York City, Ho Chi Minh City, Salt Lake City, etc.
When using LIKE you’ll probably want to also include a wildcard character:
% to indicate where the matching starts and ends.
IN & NOT IN
IN to pattern match for a list of values.
SELECT display_name, reputation, location, about_me FROM `bigquery-public-data.stackoverflow.users` WHERE location IN ('Houston’,'houston') ORDER BY reputation DESC LIMIT 10
The query above will find the top users by reputation whose location is either Houston or houston.
SELECT display_name, reputation, location, about_me FROM `bigquery-public-data.stackoverflow.users` WHERE LOWER(location) = 'houston' ORDER BY reputation DESC LIMIT 10
Here’s another way to get the same result.
LOWER to lowercase all values in the column. In the example above, we get the users whose lower-cased location is equal to ‘houston’. This will catch users whose location is Houston, housTon, HousTon, etc. This is a great tool to use when you are dealing with user generated content or unstructured data.
NOT IN to create a compound filter and clean up your result to your needs.
SELECT display_name, reputation, location, about_me FROM `bigquery-public-data.stackoverflow.users` WHERE location LIKE '%ston' AND LOWER(location) NOT IN ('houston','boston')
NULL is a special value that means “does not exist”.
NULL is not equal to zero. Zero is a number.
NULL is the absence of a number or a string or any other data type.
SELECT display_name, reputation, location, about_me FROM `bigquery-public-data.stackoverflow.users` WHERE location is NULL ORDER BY reputation DESC LIMIT 10
In the query above we find the top 10 users by reputation who have no location in their profile.
As you use SQL more and more you'll learn that
NULL is cool in all sorts of ways, such as when it is safe to use
SUM (zeros count but they don't add up, whereas
NULLs don't count and they don’t add up).
That's it! If you've made it this far you're about halfway done with SQL 101. Next up: aggregates (using group by & order by), but I haven't yet published this tutorial. Do you want it? Lemme know if so.
Other articles you can find on this website: