Run a query
Follow these instructions to run your first query.
- Go here: https://console.cloud.google.com/bigquery
- Log in with your email
- Click "Create a new query"
- Copy / paste the code below into the console and click “Run”
SELECT title, sum(views) views FROM `bigquery-public-data.wikipedia.pageviews_2023` WHERE DATE(datehour) > CURRENT_DATE() - 1 AND LOWER(title) NOT LIKE '%wikipedia%' AND LOWER(title) NOT LIKE '%cookie%' AND LOWER(title) NOT LIKE '%main%' AND LOWER(title) NOT LIKE '%search%' GROUP BY 1 ORDER BY 2 desc LIMIT 100
You just accessed a database to see yesterday's top 100 most popular Wikipedia pages. Congratulations. You know SQL.
Don't worry about what that code is doing, for now. You'll get there soon. Let's back up for a second.
What is BigQuery
BigQuery is a data warehouse. It's exactly what it sounds like. A place where data is stored and an interface through which you can access that data.
BigQuery is one of a handful of data warehousing tools, including Amazon Redshift, Microsoft Azure, Databricks, and Snowflake. Companies use data warehouses to store datasets such as account information, inventory, order history, analytics & engagement data, and anything else that might be useful to track user or customer histories and run a business.
We're using BigQuery in these tutorials for a few reasons. For one, there is an active team invested in maintaining the product (it's not going anywhere). Also, BigQuery is packed with cool features which reveal themselves over time, such as integrations with Google Analytics and Google Sheets, CSV ingestion, and new machine learning features that let you make predictions with SQL. Also, BigQuery is a great place to learn because it is very possible that today or in the future you will work at a company that runs on Google's cloud infrastructure, and so you might use BigQuery to do analysis on the job.
But the main reason I use BigQuery in the tutorials found across this site is because it comes out of the box linked to hundreds of public datasets that are free to query. BigQuery is a fantastic place to learn because the data is already connected to your interface, and this data is packed with insights waiting to be revealed through your code.
BigQuery has hundreds of them. These datasets are uploaded and maintained by governmental organizations and corporate entities committed to keeping data open and free. Some datasets are massive, and up-to-date, and represent datasets that look a lot like what you might query in a professional setting. But there's a lot of junk in here, too.
Here are two useful datasets:
- Stack Overflow – Questions, answers, upvotes, accounts, and more. This is probably the realest dataset available through BigQuery because it features a users table and tables for user actions, so you can analyze user histories and aggregate them to understand things like activation and retention rates, and funnels.
- NYC Taxi Fares – 30 million rides per year, including fares, tips, and pickup and dropoff locations. We use this dataset in the tutorial on creating histograms with SQL.
Read about my 10 favorite BigQuery datasets here.
Publicly available data is great, but your own business data will always be the most useful data to query. Learning in-house where you work is the best place to learn because the data reflects business problems that you're familiar with. And you will be intrinsically motivated to grind through the learning curve to extract insights.
But not everybody has the opportunity to learn on the job. And even if you do, it's worth breaking out of your 9-5 bubble to expand your mind and enhance your skills. This is where public datasets can be extremely helpful.
Click around and find out.
Click on a dataset to see “Dataset info” which will show you useful info about what's included in the dataset.
“Last modified” will let you know how current the data is. This is one of the first things I look at when considering a dataset to query.
“Description” is also useful. The maintainer of the dataset will often include a basic overview of how the data was collected and what's included.
Once you've found a dataset that's interesting to you, click the star so you can easily find it in the future.
Split your screen
Open tables in a “split tab” so that you can write SQL and see a table's schema at the same time. This helps to prevent context switching between tabs. Now you know why software developers love big screens.
Databases & tables
BigQuery is a data warehouse that contains relational structured data. The data is relational because tables relate to one another. For example, in the
stackoverflow database you'll find tables like
posts, all of which can be joined together through a common key, such as a unique identifier or a date, to build more complex datasets and draw insights.
In the Stack Overflow database, for example, we can query the questions table (
post_questions) to find the highest scoring questions, and then join that result to the
users table so we only return results where the question was asked by users from Berlin.
Here's the query.
SELECT q.title, q.score, u.display_name, u.location FROM `bigquery-public-data.stackoverflow.posts_questions` q INNER JOIN `bigquery-public-data.stackoverflow.users` u on q.owner_user_id = u.id WHERE u.location = 'Berlin' ORDER BY q.score DESC LIMIT 100
We never could have returned this result by looking just at a single table because no one table includes questions AND user location. We had to join the tables by the common key
user_id. (Heads up that JOINs are beyond the scope of SQL 101, so don't worry about it if it doesn't stick just yet.)
Here's what it looks like in BigQuery when you put it all together.
Yep, that's the highest scoring question by a user whose profile includes the location Berlin. Here is the question: https://stackoverflow.com/q/4797534/896802
Rows & Columns
Tables are just like CSVs or spreadsheets. Every row is a record. Every column is a trait or a property.
In a true relational database there should only ever be 1 row for a specific entity. If it us a users table, then there should be only 1 row per user, and that user would be identifiable by a specific ID — their user id.
User ID is a column, property, or trait (these words all mean the same thing). A users table might also include columns like
permissions, etc. These are all traits that describe the user and other users, and as we'll soon see, we can group by traits like this to ask questions like “How many users were created in the last year.”
SQL is way better than Excel and Google Sheets
SQL is more powerful than Excel in a hundred ways. For one, you can't destroy your dataset with a sneeze. Unless you're writing to the database (we are not), you're never actually modifying the data! We're just reading it, and we're modifying the presentation layer with code. It's a much more safe and powerful way to interact with data.
SQL is how the internet works
The entire web is powered by SQL.
Human analysts aren't the only "clients" who query databases. So do applications like websites, mobile apps, and APIs. When you use a search engine you write a human-readable query, such as "what's the weather", that generates real SQL under the hood to crawl through a database and return a result.
In addition to reading from a database, SQL is used to write to databases. As an analyst, we might run a query to read the most popular sites on Wikipedia, but when you edit a Wikipedia page and click "save" then SQL might be working under the hood to save the updated article to the database.
For now, we're just analysts. We're going to be writing SQL to read and transform data to extract insights.
If you hang around engineers long enough you'll hear them throw around the word "expensive" when talking about computational processes or queries. "Expensive" doesn't necessarily mean it will cost you lots of money, though in some cases it might also mean that. Queries are expensive when they take a while to run or when there might be a more efficient way to produce an equivalent result.
In BigQuery you can see how much your query costs before you hit "Run". This can help you modify your query to make it more efficient so you don't waste your free 1 TB per month. 1 TB equals 1,000 GB, which is plenty for our purposes. I've been hammering this system pretty hard and I haven't yet hit my limit.
OK, now you're ready to start writing queries.
See the next tutorial: SELECT FROM WHERE - Basic Queries