This lab demonstrates how to handle some common anti-patterns in BigQuery.

What you need

To complete this lab, you need:

Access to a supported Internet browser

A Google Cloud Platform project with billing enabled

What you learn

In this lab, you modify queries that encounter resource constraints.

In this section of the lab, you modify a "large ORDER BY" query that exceeds available resources. This particular example focuses on a query that joins the public trigrams and shakespeare sample tables, then attempts to arrange the results using GROUP BY and ORDER BY clauses. This query is for demonstration purposes only.

Step 1

Open the BigQuery Web UI.

Step 2

Click Compose Query.

Step 3

Type and run the following query.

#standardSQL
SELECT ngram,first,second,third,fourth,fifth
FROM
  `bigquery-public-data.samples.trigrams` AS trigram
JOIN
  `bigquery-public-data.samples.shakespeare` AS shakespeare
ON
  trigram.second = shakespeare.word
GROUP BY ngram,first,second,third,fourth,fifth
ORDER BY ngram

Step 4

You should receive an error similar to the following (it may take a moment for the error to appear).

Error: Resources exceeded during query execution.

Step 5

This error occurred because the final sort (the ORDER BY clause) must be performed on a single worker, and the result set is too large for that worker to process in a reasonable amount of time. You can overcome this error in one of two ways: apply a LIMIT clause to the end of the query, or enable the Allow Large Results option with a destination table.

Add the following line to the end of the query.

LIMIT 10

Then run the query again and examine the results.
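
With the LIMIT clause in place, the complete query reads as follows; the LIMIT allows BigQuery to keep only the top rows at each stage instead of sorting the entire result set on a single worker.

#standardSQL
SELECT ngram,first,second,third,fourth,fifth
FROM
  `bigquery-public-data.samples.trigrams` AS trigram
JOIN
  `bigquery-public-data.samples.shakespeare` AS shakespeare
ON
  trigram.second = shakespeare.word
GROUP BY ngram,first,second,third,fourth,fifth
ORDER BY ngram
LIMIT 10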

In this section of the lab, you also modify a query that exceeds available resources. This particular example focuses on a query that uses the ARRAY_AGG function against the GitHub public dataset. This query is for demonstration purposes only.

Step 1

Open the BigQuery Web UI.

Step 2

Click Compose Query.

Step 3

Type and run the following query.

#standardSQL
SELECT
  author.tz_offset as tz,
  ARRAY_AGG(commit) as repeated_commits
FROM
  `bigquery-public-data.github_repos.commits`
GROUP BY tz

Step 4

You should receive an error similar to the following (it may take up to 6 minutes for the error to appear).

Error: Resources exceeded during query execution.

Step 5

This error occurred because ARRAY_AGG collects every commit string for each tz_offset value, and tz_offset has low cardinality. If you examine the frequency distribution of its values, you find that there are only ~169 distinct tz values and that the data is heavily skewed toward a few of them. For example, a single worker ends up processing tz = 0 and has to build a single row containing ~36 million commit strings.

Type and run the following query.

#standardSQL
SELECT
  COUNT(1) as total_tz,
  MIN(frequency) as min_freq,
  CAST(AVG(frequency) as INT64) as avg_freq,
  APPROX_QUANTILES(frequency, 101)[OFFSET(50)] as p50_freq,
  APPROX_QUANTILES(frequency, 101)[OFFSET(90)] as p90_freq,
  APPROX_QUANTILES(frequency, 101)[OFFSET(99)] as p99_freq,
  MAX(frequency) as max_freq
FROM
(
SELECT
  author.tz_offset as tz,
  COUNT(commit) as frequency
FROM
  `bigquery-public-data.github_repos.commits`
GROUP BY tz
)

Step 6

Examine the query results. This query reveals the data skew in the dataset: a few tz values account for most of the rows. What might you do to mitigate the data skew? Some possibilities: limit the number of elements collected per group, filter out the most heavily skewed keys before aggregating, or group on a higher-cardinality key.
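
One possible mitigation, shown here as a sketch rather than as part of the lab's steps, is to cap how many commit strings ARRAY_AGG collects per group using its optional LIMIT clause (the cap of 1000 is an arbitrary choice).

#standardSQL
SELECT
  author.tz_offset as tz,
  -- Collect at most 1000 commits per time zone instead of all of them
  ARRAY_AGG(commit LIMIT 1000) as repeated_commits
FROM
  `bigquery-public-data.github_repos.commits`
GROUP BY tz

Because no single output row then has to hold tens of millions of strings, even the heavily skewed groups stay within per-worker limits.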

In this section of the lab, you again modify a query that exceeds available resources. This particular example focuses on a query that causes JOIN explosion. This query is for demonstration purposes only.

Step 1

Type the following query to generate some fake accounts from the IRS filing data. 10% of the accounts have INVALID account_managers and 40% have UNASSIGNED account_managers.

#standardSQL
-- Sketch of the account-generation query; column names are assumed
-- from the irs_990_ein schema. Roughly 10% of accounts get an INVALID
-- account_manager, 40% get UNASSIGNED, and the rest are spread across
-- agent0 through agent99.
SELECT
  name as orgname,
  ico,
  street,
  city,
  state,
  zip,
  `group` as exgroup,
  classification,
  affiliation,
  asset_amt,
  revenue_amt,
  ein,
  CASE
    WHEN MOD(ABS(FARM_FINGERPRINT(CAST(ein as STRING))), 10) = 0 THEN 'INVALID'
    WHEN MOD(ABS(FARM_FINGERPRINT(CAST(ein as STRING))), 10) < 5 THEN 'UNASSIGNED'
    ELSE CONCAT('agent',
      CAST(MOD(ABS(FARM_FINGERPRINT(CONCAT(CAST(ein as STRING), '-mgr'))), 100) as STRING))
  END as account_manager
FROM
  `bigquery-public-data.irs_990.irs_990_ein`

Step 2

Click Show Options and for Destination Table, click Select Table.

Step 3

Verify the values for Project and Dataset, and for Table ID, type accounts.

Step 4

Click OK and then click Run Query.

Step 5

Type and run the following query to create some fake agents, with names drawn from the 100 most common names in the usa_names public dataset, plus 11 UNASSIGNED phone extensions. Save the results in a destination table named account_managers.

#standardSQL
SELECT
  CONCAT("agent", CAST(row - 1 as STRING)) as account_manager,
  name,
  MOD(ABS(FARM_FINGERPRINT(CONCAT("agent", CAST(row - 1 as STRING)))),10000) as phone_extension
FROM
(
SELECT
 name,
 ROW_NUMBER() OVER () as row
FROM
(
SELECT
  name,
  COUNT(1) as freq
FROM
 `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER by freq DESC
LIMIT 100
)
)
UNION ALL
SELECT
  "UNASSIGNED" as account_manager,
  "No Agent Assigned" as name,
  phone_extension
FROM
  UNNEST(GENERATE_ARRAY(10000, 10010)) as phone_extension

Step 6

Type and run the following query to create the account manager locations. You create 500 cities, many of which are left unassigned to account managers. Save the query results to a destination table named locations.

#standardSQL
SELECT
  city,
  IF(MOD(FARM_FINGERPRINT(city), 200) < 100,
     CONCAT("agent", CAST(ABS(MOD(FARM_FINGERPRINT(city), 100)) as STRING)),
     'UNASSIGNED') as account_manager
FROM
  (SELECT DISTINCT city as city FROM `bigquery-public-data.irs_990.irs_990_ein` LIMIT 500)

Step 7

Now, join the account, account manager, and location info. A few hundred megabytes of input goes through join explosion into multiple gigabytes. Type and run the following query. Be sure to replace [YOUR_DATASET] with the name of the dataset you used to store the data, and be sure the Destination Table field is empty.

#standardSQL
SELECT
  account_manager,
  name,
  phone_extension,
  city,
  ARRAY_AGG(acct ORDER BY acct.asset_amt DESC, acct.revenue_amt DESC, acct.ein) as accounts,
  AVG(acct.asset_amt) as avg_acct_assets,
  FARM_FINGERPRINT(STRING_AGG(acct.org)) as mgr_secret,
  APPROX_QUANTILES(acct.asset_amt, 10001) as asset_dist,
  APPROX_QUANTILES(acct.revenue_amt, 10001) as revenue_dist,
  APPROX_QUANTILES(acct.ein, 10001) as ein_dist,
  MAX(acct.asset_amt) as largest_assets,
  MAX(acct.revenue_amt) as largest_revenue
FROM
(
SELECT
  STRUCT(acc.orgname as org,
         acc.ico,
         acc.street,
         acc.city,
         acc.state,
         acc.zip,
         acc.exgroup,
         acc.classification,
         acc.affiliation,
         acc.asset_amt,
         acc.revenue_amt,
         acc.ein as ein
        ) as acct,
  am.account_manager,
  am.name,
  am.phone_extension,
  location.city
FROM
  [YOUR_DATASET].accounts as acc
FULL OUTER JOIN
  [YOUR_DATASET].account_managers as am
ON acc.account_manager = am.account_manager
FULL OUTER JOIN
  [YOUR_DATASET].locations as location
ON
  acc.account_manager = location.account_manager
)
GROUP BY
account_manager,
name,
phone_extension,
city

Step 8

You should receive an error similar to the following (it may take up to 9 minutes for the error to appear).

Error: Resources exceeded during query execution.

What would you do to mitigate this query? Look at the frequency of the join keys, and filter the UNASSIGNED values out of the JOIN: those keys match many rows on both sides of the join, so each one contributes the cross product of its matches to the output.
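
As a sketch of that approach (not part of the lab's steps), you could first measure how often each key appears in the accounts table.

#standardSQL
-- Count how many account rows each join key matches
SELECT
  account_manager,
  COUNT(1) as frequency
FROM
  [YOUR_DATASET].accounts
GROUP BY account_manager
ORDER BY frequency DESC
LIMIT 10

The UNASSIGNED and INVALID keys should dominate the output. Adding a filter such as WHERE acc.account_manager NOT IN ('UNASSIGNED', 'INVALID') to the subquery in Step 7 keeps those keys out of the join, so they no longer multiply into a cross product.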

©Google, Inc. or its affiliates. All rights reserved. Do not distribute.