First, create a new dataset in the project. A dataset is composed of multiple tables. To create a dataset, click the project name under the resources pane, then click the Create dataset button:
lab as the Dataset ID:
Click the Create dataset button to create an empty dataset.
From the GCP Console click the Cloud Shell icon on the top right toolbar:
Then click "Start Cloud Shell":
It should only take a few moments to provision and connect to the environment:
This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory, and runs on the Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this lab can be done with simply a browser or your Google Chromebook.
Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID.
Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list
Credentialed accounts: - <myaccount>@<mydomain>.com (active)
gcloud config list project
[core] project = <PROJECT_ID>
If it is not, you can set it with this command:
gcloud config set project <PROJECT_ID>
Updated property [core/project].
For your convenience, some of the data for April 10, 2019 from the Wikimedia pageviews dataset is available on Google Cloud Storage at
gs://cloud-samples-data/third-party/wikimedia/pageviews/pageviews-20190410-140000.gz. The data file is a GZip'ed CSV file. You can load this file directly using the
bq command line utility. As part of the load command, you'll also describe the schema of the file.
bq load \ --source_format CSV \ --field_delimiter " " \ --quote "" \ --max_bad_records 3 \ $GOOGLE_CLOUD_PROJECT:lab.pageviews_20190410_140000 \ gs://cloud-samples-data/third-party/wikimedia/pageviews/pageviews-20190410-140000.gz \ wiki,title,requests:integer,zero:integer
You used a few advanced options to load the page views file:
--source_format CSVto indicate the file should be parsed as a CSV file. This step is optional, since CSV is the default format.
--field_delimiter " "to indicate that a single space is used to delimit fields.
--quote ""to indicate that strings are unquoted.
--max_bad_records 3to ignore at most 3 errors while parsing the CSV file. This is needed because there are several lines in the file with only 2 columns.
You can learn more about the bq command line in the documentation.
In the BigQuery console, open one of the tables that you just loaded.
You can see the table schema in the Schema view on the right. Find out how much data is in the table, by navigating to the Details view:
Open the Preview view to see a selection of rows from the table.
Click Compose new query button on the top right:
This will bring up the Query editor view:
Find the total number of Wikimedia views between 2-3pm on April 10, 2019, by writing this query:
SELECT SUM(requests) FROM `lab.pageviews_20190410_140000`
Click the Run button:
In a few seconds, the result will be listed in the bottom, and it'll also tell you how much data was processed:
This query processed 123.9 MB, even though the table is 691.4 MB. BigQuery only processes the bytes from the columns which are used in the query, so the total amount of data processed can be significantly less than the table size. With clustering and partitioning, the amount of data processed can be reduced even further.
The Wikimedia dataset contains page views for all of the Wikimedia projects (including Wikipedia, Wiktionary, Wikibooks, Wikiquotes, etc). Let's narrow down the query to just English Wikipedia pages by adding a
SELECT SUM(requests), wiki FROM `lab.pageviews_20190410_140000` WHERE wiki = "en" GROUP BY wiki
Notice that, by querying an additional column,
wiki, the amount of data processed increased from 124 MB to 204 MB.
BigQuery supports many of the familiar SQL clauses, such as
ORDER BY, and a number of aggregation functions. In addition, you can also use regular expressions to query text fields! Let's try one:
SELECT title, SUM(requests) requests FROM `lab.pageviews_20190410_140000` WHERE wiki = "en" AND REGEXP_CONTAINS(title, 'Red.*t') GROUP BY title ORDER BY requests DESC
You can select a range of tables to form the union using a wildcard table. First, create a second table to query over by loading the next hour's page views into a new table:
bq load \ --source_format CSV \ --field_delimiter " " \ --quote "" \ $GOOGLE_CLOUD_PROJECT:lab.pageviews_20190410_150000 \ gs://cloud-samples-data/third-party/wikimedia/pageviews/pageviews-20190410-150000.gz \ wiki,title,requests:integer,zero:integer
In the Query editor pane, query over both tables you loaded by querying tables with "
pageviews_2019" as a prefix:
SELECT title, SUM(requests) requests FROM `lab.pageviews_2019*` WHERE wiki = "en" AND REGEXP_CONTAINS(title, 'Red.*t') GROUP BY title ORDER BY requests DESC
You can filter the tables more selectively with the _TABLE_SUFFIX pseudo column. This query will limit to tables corresponding to April 10.
SELECT title, SUM(requests) requests FROM `lab.pageviews_2019*` WHERE _TABLE_SUFFIX BETWEEN '0410' AND '0410_9999999' AND wiki = "en" AND REGEXP_CONTAINS(title, 'Red.*t') GROUP BY title ORDER BY requests DESC
Optionally, delete the dataset you created with the
bq rm command. Use the
-r flag to remove any tables it contains.
bq rm -r lab
You've used BigQuery and SQL to query the real-world Wikipedia page views dataset. You have the power to query petabyte-scale datasets!