In this lab, you query relational and denormalized data.

What you need

To complete this lab, you need:

Access to a supported Internet browser:

A Google Cloud Platform project with billing enabled

What you learn

In this lab, you:

This lab covers querying relational and denormalized data.

In this section of the lab, you measure query performance for relational data in BigQuery. The data used in this section includes liquor sales for the state of Iowa.

BigQuery supports very large JOINs, and JOIN performance is very good. However, BigQuery is a columnar datastore, and maximum performance is achieved on denormalized datasets. Because BigQuery storage is very inexpensive and scalable, it is often prudent to denormalize and pre-JOIN datasets into homogeneous tables. In essence, you exchange compute resources for storage resources (the latter being more performant and cost-effective).

In this section, you upload a set of tables from a relational schema (in 3rd normal form). You then run queries against the relational tables. Later, you compare performance by running the same queries against a table from a denormalized schema containing the same information.

The first set of tables you need to upload have a relational schema. The relational schema consists of the following tables.

A diagram of the relational schema is shown below.

Step 1

Open the BigQuery web UI.

Step 2

Click the blue arrow to the right of your project name and choose Create new dataset.

Step 3

In the ‘Create Dataset' dialog, for Dataset ID, type cpb200_liquor_sales, and then click OK.

Step 4

Create the sales table. Click the plus sign to the right of the cpb200_liquor_sales dataset to create a new table.

Step 5

On the Create Table page, in the Source Data section:

Step 6

In the Destination Table section:

Step 7

In the Schema section:

Step 8

In the Options section:

Step 9

Create the remaining tables in the relational schema using Cloud Shell. Open the Google Cloud Platform Console and to the right of your project name, click Activate Google Cloud Shell.

Step 10

Type the following commands to create the tables.

bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines cpb200_liquor_sales.category gs://cloud-training/CPB200/BQ/lab8/category.csv category:STRING,category_name:STRING

bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines cpb200_liquor_sales.convenience_store gs://cloud-training/CPB200/BQ/lab8/convenience_store.csv store:STRING

bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines cpb200_liquor_sales.county gs://cloud-training/CPB200/BQ/lab8/county.csv county_number:STRING,county:STRING

bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines cpb200_liquor_sales.item gs://cloud-training/CPB200/BQ/lab8/item.csv item:STRING,description:string,pack:INTEGER,liter_size:INTEGER

bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines cpb200_liquor_sales.store gs://cloud-training/CPB200/BQ/lab8/store.csv store:STRING,name:STRING,address:STRING,city:STRING,zipcode:STRING,store_location:STRING,county_number:STRING

bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines cpb200_liquor_sales.vendor gs://cloud-training/CPB200/BQ/lab8/vendor.csv vendor_no:STRING,vendor:STRING

Step 11

Open the BigQuery web UI.

Step 12

Click Compose Query, and in the New Query window, type and run the following query against the relational tables.

Important: Click the Show Options button and ensure that Use Cached Results is unchecked. If you have to run the query more than once, you do not want to use cached results.

#standardSQL
SELECT
  gstore.county AS county,
  ROUND(cstore_total/gstore_total * 100,1) AS cstore_percentage
FROM (
  SELECT
    cy.county AS county,
    SUM(total) AS gstore_total
  FROM
    `cpb200_liquor_sales.sales` AS s
  JOIN
    `cpb200_liquor_sales.store` AS st
  ON
    s.store = st.store
  JOIN
    `cpb200_liquor_sales.county` AS cy
  ON
    st.county_number = cy.county_number
  LEFT OUTER JOIN
    `cpb200_liquor_sales.convenience_store` AS c
  ON
    s.store = c.store
  WHERE
    c.store IS NULL
  GROUP BY
    county) AS gstore
JOIN (
  SELECT
    cy.county AS county,
    SUM(total) AS cstore_total
  FROM
    `cpb200_liquor_sales.sales` AS s
  JOIN
    `cpb200_liquor_sales.store` AS st
  ON
    s.store = st.store
  JOIN
    `cpb200_liquor_sales.county` AS cy
  ON
    st.county_number = cy.county_number
  LEFT OUTER JOIN
    `cpb200_liquor_sales.convenience_store` AS c
  ON
    s.store = c.store
  WHERE
    c.store IS NOT NULL
  GROUP BY
    county) AS hstore
ON
  gstore.county = hstore.county

Step 13

Click Query History and select the query job.

Step 14

Make a note of the time the query takes to execute by subtracting the Start Time from the End Time.

In this section of the lab, you improve query performance by denormalizing tables into a flattened schema.

A denormalized schema typically "flattens" all of the relational data into a single row. You can, however, use a nested repeated denormalized schema as well. For example, in the denormalized schema, county_number, county, store, name, address, city, zipcode, store_location, county_number, and cstore are fields containing all of the fields from the County, Store, and Convenience_store tables.

Note that the cstore field (in the denormalized schema) represents the convenience_store.store field in the relational schema above. It has a value of Y if a store is a convenience store, and "null" otherwise.

The following is a diagram of the denormalized schema.

Step 1

Open the BigQuery web UI.

Step 2

Create the iowa_sales_denorm table. Click the plus sign to the right of the cpb200_liquor_sales dataset to create a new table.

Step 3

On the Create Table page, in the Source Data section:

Step 4

In the Destination Table section:

Step 5

In the Schema section:

Step 6

In the Options section:

Step 7

Type and run the following query against the table with a denormalized schema (this query produces the same results as the query in the previous section).

#standardSQL
SELECT
  gstore.county AS county,
  ROUND(cstore_total/gstore_total * 100,1) AS cstore_percentage
FROM (
  SELECT 
    county,  
    sum(total) AS gstore_total 
  FROM 
    `cpb200_liquor_sales.iowa_sales_denorm`
  WHERE cstore is null
  GROUP BY 
    county) AS gstore
JOIN (
  SELECT 
    county, 
    sum(total) AS cstore_total 
  FROM 
    `cpb200_liquor_sales.iowa_sales_denorm`
  WHERE cstore is not null
  GROUP BY
    county) AS cstore
ON gstore.county = cstore.county
ORDER BY county

Step 8

Click Query History and selecting the query job.

Step 9

Make a note of the time the query takes to execute by subtracting the Start Time from the End Time.

Notice that the query corresponding to the table with denormalized schema runs faster, and has simpler syntax. Wherever possible, pre-JOIN datasets into homogeneous tables in order to optimize performance in BigQuery.

©Google, Inc. or its affiliates. All rights reserved. Do not distribute.