Gain insight into query performance by authoring queries and rewriting them for better performance. You should complete these exercises using Standard SQL.

What you learn

In this lab, you learn how to:

- Flatten ARRAY (REPEATED) fields with UNNEST
- Query arrays with array functions such as ARRAY_LENGTH
- Filter rows using EXISTS and NOT EXISTS subqueries
- Improve query performance by avoiding self-joins and by using window functions, ARRAY_AGG, and APPROX_COUNT_DISTINCT

Setup

Pin the project: open the project dropdown, scroll to the bottom of the project list to "Display Project", and enter the project google.com:shollyman-bq-experiments.

Step 1: Examine view

Examine the view google.com:shollyman-bq-experiments.codelabdata.heroes. It provides a simple set of heroes and their current location(s); the location field is represented as an ARRAY<STRING>, or a "REPEATED STRING" as shown in the UI.
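(Optional: to see how the repeated field is stored, select a few raw rows first; hero and location are the column names used in the answer below.)

#StandardSQL
SELECT
  hero,
  location  -- the ARRAY<STRING> / REPEATED field
FROM `google.com:shollyman-bq-experiments.codelabdata.heroes`
LIMIT 10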

Step 2: Flatten and order view

Write a query so that the output looks like the following:

Row  hero           location
1    batman         gotham
2    green lantern  coast city
3    superman       fortress of solitude
4    superman       metropolis
5    wolverine      null

(Hint: use UNNEST. UNNEST can only be used in the FROM clause)
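(If UNNEST is new to you, here is a minimal, self-contained illustration with a literal array; it simply turns array elements into rows in the FROM clause.)

#StandardSQL
SELECT element
FROM UNNEST(['gotham', 'metropolis', 'coast city']) AS element
-- one row per array element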

(Answer:

#StandardSQL
SELECT
 hero,
 loc as location
FROM `google.com:shollyman-bq-experiments.codelabdata.heroes`
LEFT JOIN UNNEST(location) as loc
ORDER BY hero, location 


)
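(Note: the LEFT JOIN is what keeps wolverine, whose location array is empty or NULL, in the output with a null location. For comparison, the comma/CROSS JOIN form below would silently drop such heroes.)

#StandardSQL
SELECT
  hero,
  loc AS location
FROM `google.com:shollyman-bq-experiments.codelabdata.heroes`, UNNEST(location) AS loc
ORDER BY hero, location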

Step 1: Examine table

Examine the table google.com:shollyman-bq-experiments.examplebq.DimCustomer, which contains user information; review its schema in the UI. What is the data type of the email field?

_____________________________________________

(Answer: The email field is an ARRAY<STRING>, or a "REPEATED STRING" as shown in the UI.)

Step 2: Top customers

Find the top 5 customers who have more than one email registered, ordered by total number of emails registered (descending) and login_name.

(Hint: Arrays are a first class type, take a look at what array functions are available:

https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#array-functions
)
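(A minimal illustration of the array function you will likely want here:)

#StandardSQL
SELECT ARRAY_LENGTH(['a@example.com', 'b@example.com']) AS email_count
-- returns 2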

(Answer:

#StandardSQL
SELECT 
  customer_id,
  login_name,
  email
FROM `google.com:shollyman-bq-experiments.examplebq.DimCustomer`
WHERE
  ARRAY_LENGTH(email) > 1
ORDER BY
  ARRAY_LENGTH(email) DESC, login_name
LIMIT 5

)
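(If you also want the count itself in the output, which makes the ordering easy to verify, a small variation on the same table and columns is:)

#StandardSQL
SELECT
  customer_id,
  login_name,
  ARRAY_LENGTH(email) AS email_count,
  email
FROM `google.com:shollyman-bq-experiments.examplebq.DimCustomer`
WHERE
  ARRAY_LENGTH(email) > 1
ORDER BY
  email_count DESC, login_name
LIMIT 5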

Step 3: Lucky users

How many customers have an email address with a 7 in it? ________________

(Hint: Look for the section on scanning for values that satisfy a condition.)
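(A minimal illustration of scanning an array for elements that satisfy a condition, using EXISTS over an UNNEST of a literal array:)

#StandardSQL
SELECT
  EXISTS(SELECT 1 FROM UNNEST(['user7@example.com', 'other@example.com']) AS e
         WHERE e LIKE '%7%') AS has_lucky_seven
-- returns true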

(Answer:

#StandardSQL
SELECT 
  COUNT(*)
FROM `google.com:shollyman-bq-experiments.examplebq.DimCustomer`
WHERE
  EXISTS (SELECT 1 FROM UNNEST(email) AS e WHERE e LIKE '%7%')

)
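(An equivalent formulation, using the customer_id column from the earlier answer, flattens the array and de-duplicates customers instead:)

#StandardSQL
SELECT
  COUNT(DISTINCT customer_id) AS lucky_customers
FROM `google.com:shollyman-bq-experiments.examplebq.DimCustomer`, UNNEST(email) AS e
WHERE e LIKE '%7%'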

The table google.com:shollyman-bq-experiments.examplebq.Sales is our company's transaction history. Each record is a transaction, and the 1:many relationship to line items has been inlined as an array of item_bought structs.
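To get a feel for the nested structure before writing the queries, you can flatten the line items; the column names used here (customer_id, item_bought, product_id) are the ones that appear in the answers below.

#StandardSQL
SELECT
  s.customer_id,
  item.product_id
FROM `google.com:shollyman-bq-experiments.examplebq.Sales` s, UNNEST(s.item_bought) AS item
LIMIT 10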

Step 1: Unique users

Write a query to determine how many sales have included any of the product_ids 2222, 8797, or 5664 (not how many units we've sold), and whether the number of distinct customers differs from the total number of transactions that included these items.

Hint: EXISTS() and UNNEST() may help here

Answer:

#StandardSQL

SELECT
  COUNT(DISTINCT customer_id) as distinct_customers,
  COUNT(1) as total_transactions
FROM `google.com:shollyman-bq-experiments.examplebq.Sales`
WHERE
  EXISTS(SELECT product_id FROM UNNEST(item_bought) WHERE product_id IN (2222, 8797, 5664))

Step 2: Quality problems

Write a query to determine the number of products that have been sold but never returned. To do this, you'll need to associate the google.com:shollyman-bq-experiments.examplebq.Returns and google.com:shollyman-bq-experiments.examplebq.Sales tables. You may also use the google.com:shollyman-bq-experiments.examplebq.DimProduct table, which has a list of all product IDs.

Hint:
Use EXISTS to see if a row matches a condition in relation to another table, and NOT EXISTS to verify that it doesn't

Answer:

#StandardSQL

SELECT COUNT(DISTINCT product_id) AS item_count
FROM `google.com:shollyman-bq-experiments.examplebq.DimProduct` product
WHERE
  EXISTS (
    SELECT * FROM `google.com:shollyman-bq-experiments.examplebq.Sales` s, s.item_bought item_bought
    WHERE product.product_id = item_bought.product_id)
  AND NOT EXISTS (
    SELECT * FROM `google.com:shollyman-bq-experiments.examplebq.Returns` r, r.item_returned item_returned
    WHERE product.product_id = item_returned.product_id);
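(An alternative sketch uses a set operation instead of correlated subqueries, with the same item_bought / item_returned fields as above. Note that it counts product_ids that appear in Sales rather than going through DimProduct, so it matches the answer above only if every sold product is listed in DimProduct.)

#StandardSQL
SELECT COUNT(*) AS item_count
FROM (
  SELECT item.product_id
  FROM `google.com:shollyman-bq-experiments.examplebq.Sales` s, UNNEST(s.item_bought) AS item
  EXCEPT DISTINCT
  SELECT item.product_id
  FROM `google.com:shollyman-bq-experiments.examplebq.Returns` r, UNNEST(r.item_returned) AS item
)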

Step 1: Try query

Run this query, which shows the top-scoring article's score and title for each Hacker News user.

#StandardSQL
SELECT
  rankinfo.author AS author,
  rankinfo.top_score AS score,
  textdata.title
FROM (
  SELECT
    author,
    MAX(score) AS top_score
  FROM
    `bigquery-public-data.hacker_news.stories`
  GROUP BY
    author) rankinfo
LEFT JOIN
  `bigquery-public-data.hacker_news.stories` textdata
ON
  rankinfo.author = textdata.author
  AND rankinfo.top_score = textdata.score
ORDER BY
  score DESC
LIMIT
  1000

How long does it take? ________________________

How much data is processed? __________________

Why? ______________________________________

Answer: Look at which tables are being joined. The query self-joins the stories table, which is wasteful; you should only need to scan the table once.

Step 2: Improve performance

Improve the performance.

Hint: Try using a window function (ROW_NUMBER). Partition by author and find the top score for each author.

Answer:

#StandardSQL
SELECT
  author,
  score,
  title
FROM
(SELECT
  author,
  title,
  score,
  ROW_NUMBER() OVER (PARTITION BY author ORDER BY score DESC) as row
FROM
  `bigquery-public-data.hacker_news.stories`
)
WHERE
  row = 1
ORDER BY score DESC
LIMIT 1000

Another option is to use ARRAY_AGG and group by the author:

#StandardSQL
SELECT
  author,
  ARRAY_AGG(STRUCT(title, score) ORDER BY score DESC LIMIT 1)[OFFSET(0)].*
FROM
  `bigquery-public-data.hacker_news.stories`
GROUP BY author
ORDER BY score DESC
LIMIT 1000

Step 1: Try query

Run this query, which counts the number of distinct stories posted by each Hacker News user.

#StandardSQL
SELECT
  author,
  COUNT(DISTINCT id) as distinct_stories
FROM `bigquery-public-data.hacker_news.stories` 
GROUP BY author
ORDER BY distinct_stories DESC
LIMIT 100

How long does it take? ________________________

Why? ______________________________________

Answer: COUNT(DISTINCT) is slow because it must be exact. Also ask yourself: is DISTINCT even needed for a column (id) that is already unique?

Step 2: Improve performance

Improve the performance.

Hint: Use APPROX_COUNT_DISTINCT
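One possible answer, following the hint (since id is already unique, a plain COUNT(id) would also avoid the exact-distinct work):

#StandardSQL
SELECT
  author,
  APPROX_COUNT_DISTINCT(id) AS distinct_stories
FROM `bigquery-public-data.hacker_news.stories`
GROUP BY author
ORDER BY distinct_stories DESC
LIMIT 100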