In this lab, you learn how to load data into BigQuery and how to transform data using BigQuery.

What you need

To complete this lab, you need:

Access to a supported Internet browser:

A Google Cloud Platform project with billing enabled

What you learn

In this lab, you:

In this lab, you load data into BigQuery in multiple ways. You also transform the data you load, and you query the data.

In this section of the lab, you upload a CSV file to BigQuery using the BigQuery web UI. This data is transformed later in the lab.

BigQuery supports the following data formats when loading data into tables: CSV, JSON, AVRO, or Cloud Datastore backups. This example focuses on loading a CSV file into BigQuery.

Step 1

Open the BigQuery web UI.

Step 2

Click the blue arrow to the right of your project name and choose Create new dataset.

Step 3

In the ‘Create Dataset' dialog, for Dataset ID, type cpb200_flight_data and then click OK.

Step 4

Download the airports.csv file to your local machine. This file contains the data that will populate the first table.

Step 5

Create a new table in the cpb200_flight_data dataset to store the data from the CSV file. Click the create table icon (the plus sign) to the right of the cpb200_flight_data dataset.

Step 6

On the Create Table page, in the Source Data section:

Step 7

In the Destination Table section:

Step 8

In the Schema section:

Step 9

In the Options section:

Step 10

Once the load job is complete, click cpb200_flight_data > AIRPORTS.

Step 11

On the Table Details page, click Details to view the table properties and then click Preview to view the table data.

In this section of the lab, you upload multiple JSON files and an associated schema file to BigQuery using the CLI. This data is transformed later in the lab.

Step 1

Navigate to the Google Cloud Platform Console and to the right of your project name, click Activate Google Cloud Shell.

Step 2

Type the following command to download schema_flight_performance.json (the schema file for the table in this example) to your working directory.

curl https://storage.googleapis.com/cloud-training/CPB200/BQ/lab4/schema_flight_performance.json -o schema_flight_performance.json

Step 3

The JSON files containing the data for your table are stored in a Google Cloud Storage bucket. They have URIs like the following:

gs://cloud-training/CPB200/BQ/lab4/domestic_2014_flights_*.json

Type the following command to create a table named flights_2014 in the cpb200_flight_data dataset, using data from files in Google Cloud Storage and the schema file stored on your virtual machine.

Note that your Project ID is stored as a variable in Cloud Shell ($DEVSHELL_PROJECT_ID) so there's no need for you to remember it. If you require it, you can view your Project ID in the command line to the right of your username (after the @ symbol).

bq load --source_format=NEWLINE_DELIMITED_JSON $DEVSHELL_PROJECT_ID:cpb200_flight_data.flights_2014 gs://cloud-training/CPB200/BQ/lab4/domestic_2014_flights_*.json ./schema_flight_performance.json

Step 4

Once the table is created, type the following command to verify table flights_2014 exists in dataset cpb200_flight_data.

bq ls $DEVSHELL_PROJECT_ID:cpb200_flight_data

The output should look like the following:

tableId        Type
-------------- -------
AIRPORTS       TABLE
flights_2014   TABLE

Step 5

Type the following command to display the schema for the flights_2014 table. Compare this to the schema file on your virtual machine.

bq show $DEVSHELL_PROJECT_ID:cpb200_flight_data.flights_2014

The output should look like the following:

|- YEAR: integer (required)
|- QUARTER: integer (required)
|- MONTH: integer (required)
|- DAY_OF_MONTH: integer
|- DAY_OF_WEEK: integer
|- FULL_DATE: string
|- CARRIER: string
|- TAIL_NUMBER: string
|- FLIGHT_NUMBER: string
|- ORIGIN: string
|- DESTINATION: string
|- SCHEDULED_DEPART_TIME: integer
|- ACTUAL_DEPART_TIME: integer
|- DEPARTURE_DELAY: integer
|- TAKE_OFF_TIME: integer
|- LANDING_TIME: integer
|- SCHEDULED_ARRIVAL_TIME: integer
|- ACTUAL_ARRIVAL_TIME: integer
|- ARRIVAL_DELAY: integer
|- FLIGHT_CANCELLED: integer
|- CANCELLATION_CODE: string
|- SCHEDULED_ELAPSED_TIME: integer
|- ACTUAL_ELAPSED_TIME: integer
|- AIR_TIME: integer
|- DISTANCE: integer
|- CARRIER_DELAY: integer
|- WEATHER_DELAY: integer
|- NAS_DELAY: integer
|- SECURITY_DELAY: integer
|- LATE_AIRCRAFT_DELAY: integer

In this section of the lab you use the BigQuery web UI to transform and merge the data from the AIRPORTS and flights_2014 tables into a single denormalized table, which you upload to BigQuery.

Step 1

Open the BigQuery web UI.

Step 2

Click Compose Query.

Step 3

Type the following query in the New Query window.

This query associates airport information with the origin and destination of each flight. The query uses two JOIN clauses. The first joins information from AIRPORTS based on the flight's origin. The second joins information based on the destination airport.

#standardSQL
SELECT
  YEAR,
  QUARTER,
  MONTH,
  DAY_OF_MONTH,
  DAY_OF_WEEK,
  FULL_DATE,
  CARRIER,
  TAIL_NUMBER,
  FLIGHT_NUMBER,
  ORIGIN,
  a.AIRPORT AS ORIG_AIRPORT,
  a.CITY AS ORIG_CITY,
  a.STATE AS ORIG_STATE,
  a.LATITUDE AS ORIG_LATITUDE,
  a.LONGITUDE AS ORIG_LONGITUDE,
  DESTINATION,
  d.AIRPORT AS DEST_AIRPORT,
  d.CITY AS DEST_CITY,
  d.STATE AS DEST_STATE,
  d.LATITUDE AS DEST_LATITUDE,
  d.LONGITUDE AS DEST_LONGITUDE,
  SCHEDULED_DEPART_TIME,
  ACTUAL_DEPART_TIME,
  DEPARTURE_DELAY,
  TAKE_OFF_TIME,
  LANDING_TIME,
  SCHEDULED_ARRIVAL_TIME,
  ACTUAL_ARRIVAL_TIME,
  ARRIVAL_DELAY,
  FLIGHT_CANCELLED,
  SCHEDULED_ELAPSED_TIME,
  ACTUAL_ELAPSED_TIME,
  AIR_TIME,
  DISTANCE,
  CARRIER_DELAY,
  WEATHER_DELAY,
  NAS_DELAY,
  SECURITY_DELAY,
  LATE_AIRCRAFT_DELAY
FROM
  `cpb200_flight_data.flights_2014`
JOIN
  `cpb200_flight_data.AIRPORTS` AS a
ON
  origin=a.IATA
JOIN
  `cpb200_flight_data.AIRPORTS` AS d
ON
  destination= d.IATA
LIMIT
  100

Step 4

Verify you have no destination table set and then click Run Query.

Notice that the flight data from flights_2014 is joined to the AIRPORTS table to get the airport name and location information for both the origin and destination airports.

Step 5

Repeat the previous steps to run the following modified query to produce the same results. In the modified query, the FLIGHT_CANCELLED field is a boolean value rather than an integer value. You can view the modification in the CAST function: CAST(FLIGHT_CANCELLED AS BOOLEAN) AS FLIGHT_CANCELLED.

#standardSQL
SELECT
  YEAR,
  QUARTER,
  MONTH,
  DAY_OF_MONTH,
  DAY_OF_WEEK,
  FULL_DATE,
  CARRIER,
  TAIL_NUMBER,
  FLIGHT_NUMBER,
  ORIGIN,
  a.AIRPORT AS ORIG_AIRPORT,
  a.CITY AS ORIG_CITY,
  a.STATE AS ORIG_STATE,
  a.LATITUDE AS ORIG_LATITUDE,
  a.LONGITUDE AS ORIG_LONGITUDE,
  DESTINATION,
  d.AIRPORT AS DEST_AIRPORT,
  d.CITY AS DEST_CITY,
  d.STATE AS DEST_STATE,
  d.LATITUDE AS DEST_LATITUDE,
  d.LONGITUDE AS DEST_LONGITUDE,
  SCHEDULED_DEPART_TIME,
  ACTUAL_DEPART_TIME,
  DEPARTURE_DELAY,
  TAKE_OFF_TIME,
  LANDING_TIME,
  SCHEDULED_ARRIVAL_TIME,
  ACTUAL_ARRIVAL_TIME,
  ARRIVAL_DELAY,
  CAST(FLIGHT_CANCELLED AS BOOLEAN) AS FLIGHT_CANCELLED,
  SCHEDULED_ELAPSED_TIME,
  ACTUAL_ELAPSED_TIME,
  AIR_TIME,
  DISTANCE,
  CARRIER_DELAY,
  WEATHER_DELAY,
  NAS_DELAY,
  SECURITY_DELAY,
  LATE_AIRCRAFT_DELAY
FROM
  `cpb200_flight_data.flights_2014`
JOIN
  `cpb200_flight_data.AIRPORTS` AS a
ON
  origin=a.IATA
JOIN
  `cpb200_flight_data.AIRPORTS` AS d
ON
  destination= d.IATA
LIMIT
  100

Step 6

When the query completes, verify the values in the FLIGHT_CANCELLED column are boolean.

Step 7

You can save the result set of any query for future analysis by specifying a destination table. Click the Show Options button.

Step 8

For Destination Table, click the Select Table button.

Step 9

In the Select Destination Table dialog:

Step 10

Check the Allow Large Results box. This will allow for a larger than normal result set. ‘Allow Large Results' is discussed later in the course.

Step 11

In the New Query window, remove the LIMIT clause at the end of the query. This ensures the entire result set is stored to the destination table for analysis.

Step 12

Click Run Query.

Step 13

When the query is complete, verify that the results were saved to a the flights_2014_w_airports table in cpb200_flight_data. You may need to refresh your browser window.

Step 14

As time allows, examine the schema and details of the new table. Verify the field FLIGHT_CANCELLED is a boolean value, and that airport information is bound to origin and destination airports for each flight.

©Google, Inc. or its affiliates. All rights reserved. Do not distribute.