Codelab-at-a-conference setup

The instructor will share temporary accounts with existing projects that are already set up, so you do not need to worry about enabling billing or about any costs associated with running this codelab. Note that all of these accounts will be disabled soon after the codelab is over.

Once you have received a temporary username and password from the instructor, log in to the Google Cloud Console: https://console.cloud.google.com/.

Here's what you should see once logged in:

In the Google Cloud Console, click the Menu icon at the top left of the screen.

Scroll down to the bottom of the menu and click BigQuery:

This will open the BigQuery console in a new browser tab, which looks like this:

But there is nothing in here yet! Luckily, there are tons of open datasets available in BigQuery for you to query. We'll query the Wikimedia pageviews dataset in the next section.

We need a dataset to query. We'll talk about loading your own data into BigQuery in another lab. In this section, we'll query the Wikimedia pageviews dataset, one of the many public datasets available in BigQuery today, alongside Hacker News, GitHub, GDELT (news events), and many more.

For this lab, we'll use the Wikimedia public dataset. To add the dataset, visit this URL:

https://bigquery.cloud.google.com/table/bigquery-samples:wikimedia_pageviews.201112

Click bigquery-samples:wikimedia_pageviews; this is the dataset we will use.

Scroll down, find the table 201112, and click it:

You can see the table schema in the Schema view on the right:

You can find out how much data is in the table by navigating to the Details view on the right:

Alright! Let's use BigQuery to find the most popular Wikipedia page in December 2011.

Click Compose Query on the top left:

This will bring up the New Query view:

First, let's find the total number of Wikimedia page views in December 2011 with this query:

SELECT SUM(views)
FROM `bigquery-samples.wikimedia_pageviews.201112`

Before we run the query, let's disable data caching for the purposes of this lab, so that we aren't using any cached results. Click Show Options:

Then, uncheck Use Cached Results:

Also, uncheck Use legacy SQL:

Click Run Query:

In a few seconds, the result will be listed at the bottom, along with how much data was processed:

In seconds, we queried a 105 GB table, but we only needed to process 12.2 GB of data to get the result! This is because BigQuery is a columnar database: since we queried only a single column, the total amount of data processed is significantly less than the total table size.
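As an aside (not one of the lab steps), the flip side of columnar storage is that selecting every column costs the full table. A query like the sketch below would process roughly the entire 105 GB, because LIMIT reduces only the rows returned, not the data scanned:

-- Sketch only: SELECT * reads every column, so BigQuery scans the full
-- table even though LIMIT 10 returns just ten rows.
SELECT *
FROM `bigquery-samples.wikimedia_pageviews.201112`
LIMIT 10

So, when exploring large tables, select only the columns you need.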

Find Wikipedia page views

The Wikimedia dataset contains page views for all of the Wikimedia projects (including Wikipedia, Wiktionary, Wikibooks, Wikiquote, etc.). Let's narrow the query down to just Wikipedia pages by adding a WHERE clause:

SELECT SUM(views), wikimedia_project
FROM `bigquery-samples.wikimedia_pageviews.201112`
WHERE wikimedia_project = "wp"
GROUP BY wikimedia_project

Notice that, by querying an additional column, wikimedia_project, the amount of data processed increased from 12.2 GB to 18 GB.
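If you're curious how the other projects compare, here's a small variation (a sketch, not a lab step) that drops the filter and ranks every Wikimedia project by total views:

-- Sketch: total views per Wikimedia project in December 2011, largest first.
SELECT wikimedia_project, SUM(views) views
FROM `bigquery-samples.wikimedia_pageviews.201112`
GROUP BY wikimedia_project
ORDER BY views DESC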

BigQuery supports many familiar SQL clauses, such as WHERE, GROUP BY, and ORDER BY, as well as a number of aggregation functions. In addition, you can use regular expressions to query text fields! Let's try one:

SELECT title, SUM(views) views
FROM `bigquery-samples.wikimedia_pageviews.201112`
WHERE
  wikimedia_project = "wp"
  AND REGEXP_CONTAINS(title, 'Red.*t')
GROUP BY title
ORDER BY views DESC
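And that brings us back to the question from earlier: which Wikipedia page was the most popular in December 2011? As a quick sketch (not one of the lab steps), you can drop the regular expression filter and keep just the top rows with LIMIT:

-- Sketch: the most-viewed Wikipedia pages in December 2011.
SELECT title, SUM(views) views
FROM `bigquery-samples.wikimedia_pageviews.201112`
WHERE wikimedia_project = "wp"
GROUP BY title
ORDER BY views DESC
LIMIT 10

The first row of the result is the most popular page.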

Query across multiple tables

You can select a range of tables and query them as a union by using a wildcard table. Let's query the entire year of 2011 by matching tables with "2011" as a prefix:

This query scans tables totaling 1 TB, but processes only 672 GB of data.

SELECT title, SUM(views) views
FROM `bigquery-samples.wikimedia_pageviews.2011*`
WHERE
  wikimedia_project = "wp"
  AND REGEXP_CONTAINS(title, 'Red.*t')
GROUP BY title
ORDER BY views DESC

Querying that much data wasn't too hard!

You can filter the tables more selectively with the _TABLE_SUFFIX pseudo column. This query limits the scan to the tables for the last two months of the year:

SELECT title, SUM(views) views
FROM `bigquery-samples.wikimedia_pageviews.2011*`
WHERE
  (_TABLE_SUFFIX = '11' OR _TABLE_SUFFIX = '12')
  AND wikimedia_project = "wp"
  AND REGEXP_CONTAINS(title, 'Red.*t')
GROUP BY title
ORDER BY views DESC

Because the query limited the tables scanned with the _TABLE_SUFFIX pseudo column, it only processed 113 GB of data.
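_TABLE_SUFFIX also works with range comparisons, so the same two-month window can be written as a BETWEEN filter, as in this sketch:

-- Sketch: the same last-two-months window expressed as a range.
SELECT title, SUM(views) views
FROM `bigquery-samples.wikimedia_pageviews.2011*`
WHERE
  _TABLE_SUFFIX BETWEEN '11' AND '12'
  AND wikimedia_project = "wp"
  AND REGEXP_CONTAINS(title, 'Red.*t')
GROUP BY title
ORDER BY views DESC

Either form scans the same two tables and processes the same amount of data.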

BigQuery is that simple! With basic SQL knowledge, you are now able to query terabytes and terabytes of data!

What we've covered
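In this codelab, you used the BigQuery web console to explore the Wikimedia pageviews public dataset, queried a 105 GB table in seconds, saw how BigQuery's columnar storage keeps the amount of data processed far below the table size, and queried a whole year of tables at once with wildcard tables and the _TABLE_SUFFIX pseudo column.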

Next Steps