In this course, you learn how to derive insights from data using Google BigQuery, Cloud Dataprep, and Google Data Studio.

Labs

Course Timing:

Two Datasets:

Lab 1: Exploring your Dataset with Google BigQuery

Lab 2: Troubleshooting common SQL Errors

Lab 3: Calculating Google BigQuery Pricing

Lab 4: Exploring and Transforming data with Cloud Dataprep

Lab 5: Creating new Permanent Tables

Lab 6: Ingesting and Querying New Datasets

Lab 7: Exploring a Dataset in Google Data Studio

Lab 8: Merging Data Tables across multiple years

Lab 9: Joining and Visualizing Data

Lab 10: Running Queries in the Past

Lab 11: Querying and Visualizing Nested Data

Lab 12: Visualizing Insights with Google Data Studio

Lab 13: Deriving Insights from Advanced SQL Functions

Lab 14: Optimizing and Troubleshooting Query Performance

Lab 15: Reading a Google Cloud Datalab notebook

Please see Getting Set Up on the Qwiklabs environment

Summary

In this lab, you will explore and query the Google BigQuery Public Dataset for IRS Form 990 Non-Profit Organizations. You will practice basic Standard SQL syntax and avoid common query pitfalls.

URL:

Category: GCP Big Data

Status:

Environment:

Feedback Link: g.co/CloudTrainEval

Overview

In this lab you will access the Google BigQuery Public dataset for all registered U.S. Non-Profits that have filed form 990 with the Internal Revenue Service (IRS).

You will explore key metadata about the dataset (tables, row counts, and more) and perform basic SQL commands using the Google BigQuery UI. Lastly, you will understand how to find duplicate records in a dataset while avoiding common SQL pitfalls.

Objectives

In this lab, you will learn how to perform the following tasks:

Explore the Google BigQuery Public Dataset: IRS Form 990 Data

Find table row counts within the BigQuery UI

  1. Open BigQuery at https://bigquery.cloud.google.com/project/bigquery-public-data
  2. Expand the bigquery-public-data project in the left-side navigation panel
  3. Expand the irs_990 dataset within the bigquery-public-data project you expanded earlier. It will then look similar to the below:
  4. Scan the list and note the 4 different types of tables in the irs_990 dataset. For our labs we will primarily be using irs_990_ein and irs_990_YYYY.

irs_990_ein
  Explanation: Employer Identification Number (EIN) lookup listings for all Non-Profit organizations in the U.S. for all years.
  Example usage: This is the only table that contains the full name of the organization and other identifying information (it will often serve as a lookup table against a specific filing).

irs_990_YYYY
  Explanation: IRS Form 990 (U.S. organizations exempt from income tax) filings for year YYYY.
  Example usage: Tax filings by year for organizations that are large enough to be required to fill out the full 990 form.

irs_990_ez_YYYY
  Explanation: IRS Form 990 EZ filings for year YYYY.
  Example usage: Tax filings by year for organizations that are allowed to use the 990 EZ shortened form.

irs_990_pf_YYYY
  Explanation: IRS Form 990 Private Foundation (PF) filings for year YYYY.
  Example usage: Tax filings by year for Private Foundation organizations that are required to use the 990 PF form.

  5. Click on the irs_990_ein table
  6. In the right pane, locate the BigQuery table metadata menu bar, which will look like:

Table Details: [table name] | Schema | Details | Preview

  7. Click on Details
  8. Find the total number of records in the metadata table shown.
     Answer: 1,587,766 * (as of the writing of this lab - data updated annually)
  9. Question: Assuming each of those records corresponds to a unique EIN (Employer Identification Number) within the U.S., what does this number mean? (choose one)
     (a) There are 1,587,766 U.S. Non-Profits registered with the IRS
     (b) There are 1,587,766 tax filings for U.S. Non-Profits for all years in the dataset
     (c) There are 1,587,766 tax filings for U.S. Non-Profits for the year 2015

Answer: (a) is correct. irs_990_ein contains all registered Non-Profits with details about their organization. There are other tables in the dataset which have tax filings by year.
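The record count shown in the Details tab can also be confirmed directly with a query; a minimal sketch, run in Standard SQL mode:

#standardSQL
# Sketch: confirm the Details-tab record count with a query
SELECT COUNT(*) AS total_records
FROM `bigquery-public-data.irs_990.irs_990_ein`;

The result should match the number shown in the table metadata.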

Finding Duplicate Records in a Dataset

How can we be 100% sure those EIN records are unique in our table?

To answer this question, we will need to query the dataset using SQL (Structured Query Language). For this example we will be using the Google BigQuery web UI.

  1. In the left panel, select Compose Query
  2. Click Show Options
  3. Find SQL Dialect and uncheck Use Legacy SQL (or leave it unchecked). Standard SQL Mode is now enabled. Leave all other values at their defaults.
     Why use Standard SQL? Standard SQL is ANSI 2011 compliant and has performance advantages we will discuss in future modules.
  4. Click Hide Options
  5. Begin writing your SQL query in the empty top panel
  6. Question: COUNT the number of unique EINs in the irs_990_ein table
     Hint: use `bigquery-public-data.irs_990.irs_990_ein` as your data table (note the backticks for Standard SQL to denote the table name)
     Hint: consider using DISTINCT with your aggregation function, like COUNT(DISTINCT field_name).
  7. Compare your query to the below solution
#standardSQL
# EIN distinct count
SELECT 
  COUNT(DISTINCT ein) AS ein_distinct
FROM `bigquery-public-data.irs_990.irs_990_ein`;
  8. Click on Run Query once you are finished writing your query
  9. How many unique EINs were returned?
     Answer: 1,587,766 * (as of the time of this course writing - dataset updated annually)
  10. Modify your query to count the total number of records and the difference between that total and the unique count
  11. Compare your query to the below solution
#standardSQL
# EIN duplicate count
SELECT 
  COUNT(ein) AS ein_count,
  COUNT(DISTINCT ein) AS ein_distinct,
  COUNT(ein) - COUNT(DISTINCT ein) AS count_of_duplicates
FROM `bigquery-public-data.irs_990.irs_990_ein`;
  12. Click Run Query
  13. How many duplicate EINs were found?
      Answer: 3,179

How do we find out which specific EINs are in that set? We need to write a new query.

  1. Comment out your old query in the editor by highlighting the query and using Ctrl + / (Windows / Linux) or Command + / (Mac). Commented code will not run.
  2. Copy and modify the partially written query below to find the specific EINs that occurred more than once in the dataset. Order the result by the organization's name alphabetically.
     Hint: Use GROUP BY, HAVING, and ORDER BY
#standardSQL
# Exercise: Complete the below query

# Specific EINs that occur more than once
SELECT 
  ein, 
  name,
  COUNT(ein) AS ein_count
FROM `bigquery-public-data.irs_990.irs_990_ein` 
GROUP BY EIN, name 
HAVING > 1
ORDER BY pick_a_field_name;
  3. Compare your answer to the below solution
# Solution: Complete the below query

/*
Explanation: HAVING filters on an aggregation which
is our count of EIN occurrences. Then we simply ORDER
the results by name (defaults to alphabetically)
*/

# Specific EINs that occur more than once

SELECT 
  ein, 
  name,
  COUNT(ein) AS ein_count
FROM `bigquery-public-data.irs_990.irs_990_ein` 
GROUP BY ein, name 
HAVING ein_count > 1
ORDER BY name;
  4. Click Run Query
  5. Which is the first organization that has a duplicate EIN?
     Answer: 1 WORLD FOUNDATION INC


Your result set will look something like the below. Since we applied an ORDER BY in our query above, our results are sorted alphabetically by institution name.

ein       | name                                                                  | ein_count
161574588 | 1 WORLD FOUNDATION INC                                                | 2
943170081 | 19TH SUPPORT COMMAND & TAEGU COMMTY OFFCRS CSMS-SGMS-CIVILIANS FORMAL | 2
660578643 | 3RA MANANTIAL DE VIDA                                                 | 2
473173973 | 493 FS TROPHY CLUB                                                    | 2
980407051 | 4D ART LEMIEUX PILON INC                                              | 2
660821204 | 500 SAILS                                                             | 2
462946884 | 694 COMPANY GRADE OFFICER GROUP                                       | 2

Total Records: 3,179 (Many records not shown)

Cleanup: Save your Query

  1. Click Save Query
  2. Name your query "Lab 1 - Querying Public Datasets - [YOURNAME]"

Nice work! You've just completed the first lab. You're on your way to becoming a master data analyst with Google BigQuery.

Learning Review

Lab extra credit

References

Summary

In this lab you will be faced with several broken SQL queries that are returning common errors or incorrect results in Google BigQuery. You will practice your SQL debugging skills to fix these queries.


Overview

In this lab you will be faced with several broken SQL queries that are returning common errors or incorrect results in Google BigQuery. You will practice your SQL debugging skills to fix these queries.

We will be using the same dataset from the previous lab, Google BigQuery Public Dataset IRS Form 990, and will provide you with broken queries to troubleshoot and fix. In the process you will learn how to avoid these common pitfalls in your own queries.

Objectives

In this lab, you will learn good coding practices and how to debug common query errors.

Good Code Practices: Formatting and Commenting

In this section, you will learn about automatically formatting a query in the Google BigQuery UI and common conventions for commenting your SQL queries.

Enable Standard SQL mode

Google BigQuery has two SQL dialects you can write your queries with. Standard SQL has the latest syntax and features and is better optimized for performance. You will use Standard SQL for these labs.

  1. On the left navigation bar, click Compose Query

  2. Click Show Options in the right panel

  3. Uncheck the option to Use Legacy SQL under SQL Dialect

  4. Click Hide Options

  5. Alternatively or in addition, you can type #standardSQL on the very first line of your code to automatically enable Standard SQL execution (instead of Legacy SQL).
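For instance, with the prefix on the very first line, the dialect is set for that query regardless of the checkbox; a minimal sketch:

#standardSQL
# The first-line prefix forces Standard SQL execution for this query
SELECT 'Standard SQL is enabled' AS dialect_check;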

Compose, name, and save a new query

  1. Copy and paste the query below into the query editor window on the right
#standardSQL
select ein, Count(ein) as col2, name FROM `bigquery-public-data.irs_990.irs_990_ein` group by 1, name HAVING col2>1 order by name ASC;
  2. Click Save Query in the bar below the query editor
  3. Name your query "Lab 2 - SQL Errors - [YOURNAME]" in the box that opens. Leave visibility settings as private.

Format your query

  1. Click Format Query in the bar beneath the query editor. Your query is now styled and formatted automatically.
  2. Compare your result with the following code:
     Important clauses and functions (SELECT, FROM, etc.) are capitalized.
     Fields are separated and indented for readability.
SELECT
  ein,
  COUNT(ein) AS col2,
  name
FROM
  `bigquery-public-data.irs_990.irs_990_ein`
GROUP BY
  1,
  name
HAVING
  col2>1
ORDER BY
  name ASC;

Formatting your query is the first step in writing clear and readable code. It helps you troubleshoot errors later and also helps any team members who are responsible for using and maintaining your queries in the future.

The automatic formatting tool is not enough to produce high quality code, even if your query executes successfully. Great data analysts also explain what their queries are doing through appropriate comments and review their code for readability.

Comment your query

  1. Add a new # comment in front of your query that explains the purpose; for example, # Find Duplicate Records
     Note: If comments span multiple lines, use /* my multiline comment */
  2. Your query should look like the below
#standardSQL

# Find Duplicate EIN records in IRS 990 form data
SELECT
  ein,
  COUNT(ein) AS col2,
  name
FROM
  `bigquery-public-data.irs_990.irs_990_ein`
GROUP BY
  1,
  name
HAVING
  col2>1
ORDER BY
  name ASC;

Clean up your query for readability

One final step when reviewing your queries is to ensure that they are clearly readable for future reviews.

Using the following guidelines, clean up your existing query:

  1. Clean up any ambiguous column names or aliases
  2. Remove any unnecessary defaults
  3. Re-order aggregations / non-aggregations in the SELECT statement
  4. Be consistent in using full column names or column indexes in your GROUP BY

Answer:

#standardSQL

/* 
Lab 2 - SQL Error Clean-up
Goal: Make our code more readable

Clean-up performed:
 Col2 as an alias is an ambiguous placeholder, renamed to EIN_Count
 ORDER BY defaults to sorting Ascending (A-Z), removed ASC
 Moved EIN and name together in the SELECT clause ordering
 GROUP BY had a mix of column indexes and names. Using names now. 
*/

# Find Duplicate EIN records in IRS 990 form data
SELECT
  EIN,
  name,
  COUNT(EIN) EIN_Count
FROM
  `bigquery-public-data.irs_990.irs_990_ein`
GROUP BY
  EIN,
  name
HAVING
  EIN_Count>1
ORDER BY
  name;

Troubleshooting Invalid Queries

In this section, you will be faced with invalid SQL queries that will error upon execution. You will fix them using your knowledge of Standard SQL and the provided hints. As you solve and complete each query exercise, you will comment your solved code and use the same window to append a new query to solve at the end of your editor window.

Error 1: More than one Query

You can only execute one query at a time per browser window.

  1. Copy and Paste the below query into your open query editor
  2. Click Run Query
  3. Confirm you see Error: "Syntax error: Unexpected keyword SELECT" after execution
#standardSQL

/* 
Lab 2 - SQL Error Clean-up
Goal: Fix the below queries by troubleshooting SQL syntax
*/

# Error 1: "Syntax error: Unexpected keyword SELECT"

# Find Duplicate EIN records in IRS 990 form data
SELECT
  EIN,
  name,
  COUNT(EIN) EIN_Count
FROM
  `bigquery-public-data.irs_990.irs_990_ein`
GROUP BY
  EIN,
  name
HAVING
  EIN_Count>1
ORDER BY
  name;
  
SELECT
  'hello' AS greeting;
  4. Highlight the first query (starting with SELECT and ending with ORDER BY name;)
  5. Apply a block comment using Ctrl + / (Windows / Linux) or Command + / (Mac)
  6. Click Run Query (or use the keyboard shortcut Ctrl + Enter (Windows / Linux) or Command + Enter (Mac))
  7. Confirm the solved query results:

greeting

hello

  8. Highlight the second query
  9. Apply a block comment using Ctrl + / (Windows / Linux) or Command + / (Mac). (Note: the same shortcut toggles the comment off again)
  10. Confirm your queries are commented and proceed to the next cleanup exercise

Protip: If you want to run just a few lines of a query, you can click on the arrow next to Run Query and click Run Selected (list of keyboard shortcuts)

Error 2: Incorrect Table Name

  1. Copy and Paste the below query and append it to the end of your existing query from the previous exercise.
#standardSQL
# Error 2: Incorrect Table Name

# Top 10 Employers with the most employees
SELECT
  EIN,
  noemplyeesw3cnt AS employee_count
FROM
  [bigquery-public-data:irs_990.irs_990_2015]
ORDER BY
  noemplyeesw3cnt DESC
LIMIT 10;
  2. Instead of clicking Run Query, click on the query validator icon. Errors are indicated by a red ( ! ). If a green check appears instead, ensure you are in Standard SQL mode by having #standardSQL as the very first line of your query.

     Tip: The query validator will often catch invalid SQL even before you run a query
  3. Confirm the error: Error: Syntax error: Unexpected "[" at [XX:X]. If this is a table identifier, escape the name with `, e.g. `table.name` rather than [table.name].
  4. Solve the error by providing the correct table name for the IRS 990 2015 filings table.

     Hint: ensure both the dataset name and the table name are correct and that the appropriate escaping characters are used for Standard SQL
  5. Confirm your answer against the below solution. Highlight the below box to view.
#standardSQL
# Solution 2: Incorrect Table Name

/*
Explanation: 
  Standard SQL uses `project.dataset.table` syntax
  Legacy SQL uses [project:dataset.table] syntax

  Updated the table to use Standard SQL syntax
*/

# Top 10 Employers with the most employees
SELECT
  EIN,
  noemplyeesw3cnt AS employee_count
FROM
  `bigquery-public-data.irs_990.irs_990_2015`
ORDER BY
  noemplyeesw3cnt DESC
LIMIT 10;
  6. Apply a block comment using Ctrl + / (Windows / Linux) or Command + / (Mac) to your solved query and move on to the next exercise

Error 3: Misspelled column names, commas, and cases

In wide tables and schemas, finding the right columns to query can be a significant effort. Learn how to quickly fix misspelled columns and how to find and navigate a table schema quickly.

  1. Copy and Paste the below query and append it to the end of your existing query from the previous exercise.
#standardSQL
# Error 3: Misspelled column names, commas, and case

# Top 10 Non-Profit Schools by Number of Employees
SELECT 
  EIN, 
  totalrevenue AS total_revenue, 
  totfuncexpns AS total_expenses,
  noemplyeesw3cnt AS employee_count,
  noindiv100kcnt AS emp_salary_over_100k,
FROM 
`bigquery-public-data.irs_990.irs_990_2015`

# Filter on "Yes" response to operates a school
WHERE 
  operateschools170cd = 'y'

ORDER BY 
  noemplyeesw3cnt DESC
  
LIMIT 10;
  2. Click on the query validator icon ( ! )
  3. Confirm an error:
     Error: Syntax error: Trailing comma before the FROM clause is not allowed at [XX:X]
  4. Resolve the error by finding and removing the trailing comma from your SELECT statement
     Solution: [Remove the comma after emp_salary_over_100k]
  5. Check the query validator for your next error:
     Error: Unrecognized name: totalrevenue; Did you mean totrevenue? at [4:3]
  6. One of the columns is misspelled; find the Total Revenue column in the dataset and correct the spelling
     Solution: Use the hint in the error message, which reveals the field you probably intended was totrevenue. In the absence of a useful error hint, check the data table schema for a list of all fields.

A quick way to access the schema inside the Query Editor when you are writing a script is to hold down the ctrl (Windows / Linux) or command (Mac) key, which will highlight all the data tables in your query.

    Click on a highlighted table name to jump to the schema which appears in the bottom panel.

    A simple Find ctrl + F (Windows / Linux) or command + F (Mac) searching on "Revenue" will lead you to the Total revenue field shown below.

totrevenue | INTEGER | NULLABLE | Total revenue


Protip: Click on the column name totrevenue in the schema to automatically add it to your query to avoid typing it manually. Try clicking on more than one column and watch them get added to your query.

  7. Confirm your query looks like the below so far:
#standardSQL
# Error 3: Misspelled column names, commas, and case

# Top 10 Non-Profit Schools by Number of Employees
SELECT 
  EIN, 
  totalrevenue AS total_revenue, 
  totfuncexpns AS total_expenses,
  noemplyeesw3cnt AS employee_count,
  noindiv100kcnt AS emp_salary_over_100k
FROM 
`bigquery-public-data.irs_990.irs_990_2015`

# Filter on "Yes" response to operates a school
WHERE 
  operateschools170cd = 'y'

ORDER BY 
  noemplyeesw3cnt DESC
  
LIMIT 10;
  8. Confirm you have a valid query by checking the query validator icon for a green check
  9. Click Run Query
  10. How many results are returned? Answer: 0
  11. Although the query has valid syntax, it is not returning what we hoped. Check to see if we are unintentionally filtering out rows in our WHERE clause.
  12. Make an update to the code line operateschools170cd = 'y' and re-run your query
      Solution: Most query engines are case sensitive. Because of the nature of our data, the values for yes / no fields are stored as capital Y or N. We need to match those strings exactly in our query. If you're unsure of the range of values, consider doing a SELECT field_name FROM table_name GROUP BY field_name; to get all possible values for that field.
  13. What is the highest number of employees found for non-profit schools?
      Answer: 56,003
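As a concrete sketch of that value check, the query below lists each distinct value stored in the operateschools170cd field (and how often it occurs), which shows how the yes / no responses are actually coded:

#standardSQL
# Sketch: list the distinct stored values of the yes/no field
# to see whether responses are coded as 'Y'/'N' or 'y'/'n'
SELECT
  operateschools170cd,
  COUNT(*) AS value_count
FROM `bigquery-public-data.irs_990.irs_990_2015`
GROUP BY operateschools170cd;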

Review your query against the below solution

#standardSQL
# Solved 3: Misspelled column names, commas, and case

# Top 10 Non-Profit Schools by Number of Employees
SELECT 
  EIN, 
  totrevenue AS total_revenue, 
  totfuncexpns AS total_expenses,
  noemplyeesw3cnt AS employee_count,
  noindiv100kcnt AS emp_salary_over_100k
FROM 
`bigquery-public-data.irs_990.irs_990_2015`

# Filter on "Yes" response to operates a school
WHERE 
  operateschools170cd = 'Y'

ORDER BY 
  noemplyeesw3cnt DESC
  
LIMIT 10;
  14. Apply a block comment using Ctrl + / (Windows / Linux) or Command + / (Mac) to your solved query and move on to the next exercise

Error 4: Missing Quotes from String Literals

Not all data types are treated equally when performing operations. Any string literal must be enclosed in quotes, or it may be treated as a field name.

  1. Copy and Paste the below query and append it to the end of your existing query from the previous exercise. Note that we are using the irs_990_ein table which has organization details now.
#standardSQL
# Error 4: Missing Quotes from String Literals

# All non-profits located in New York
SELECT 
  EIN,
  name,
  city,
  state,
  zip,
  subsection
FROM 
  `bigquery-public-data.irs_990.irs_990_ein`

WHERE state = ny;
  2. Click on the query validator icon ( ! )
  3. Confirm the error:
     Error: Unrecognized name: ny at [XXX:XX]
  4. Update the WHERE clause in your query to properly match the New York string value. Additionally, keep in mind how state values are stored in your table with respect to case.
  5. Solution:
#standardSQL
# Error 4: Missing Quotes from String Literals

# All non-profits located in New York
SELECT 
  EIN,
  name,
  city,
  state,
  zip,
  subsection
FROM 
  `bigquery-public-data.irs_990.irs_990_ein`

WHERE state = 'NY';
  6. Click Run Query
  7. Using the Query Explanation tab in the bottom panel, answer the following:
     How many total Non-Profits are in our dataset? (assume unique EINs)
     Answer: 1,587,766* (found in Input Rows)
     How many Non-Profits are in New York specifically?
     Answer: 102,588* (found in Output Rows)

* Accuracy Note: The IRS Public Dataset is updated annually, your answer may vary slightly if new IRS 990 data has since been loaded and we have not updated these solutions yet.

Error 5: Non-Aggregate fields not found in GROUP BY

If we don't want to check the query explanation every time to get counts of rows returned, we can employ aggregate functions to summarize our results for us. Keep in mind that when you aggregate one field, other non-aggregate fields need to also be grouped in some fashion as you will see below.

  1. Copy and Paste the below query and append it to the end of your existing query from the previous exercise.
#standardSQL
# Error 5: Non-Aggregate fields not found in GROUP BY

# U.S. States with the most non-profits
SELECT 
  COUNT(EIN) as charity_count,
  state
FROM 
  `bigquery-public-data.irs_990.irs_990_ein`
ORDER BY charity_count DESC;
  2. Click on the query validator icon ( ! )
  3. Confirm the error:
     Error: SELECT list expression references column state which is neither grouped nor aggregated at [XXX:X]
  4. Modify your query to include the missing clause
     Hint: Look at which field is aggregated and which should be grouped as a result
     Solution: Add a GROUP BY state clause after the FROM clause
  5. Compare your query to the solution below
#standardSQL
# Solution 5: Non-Aggregate fields not found in GROUP BY

# U.S. States with the most non-profits
SELECT 
  COUNT(EIN) as charity_count,
  state
FROM 
  `bigquery-public-data.irs_990.irs_990_ein`
GROUP BY state
ORDER BY charity_count DESC;
  6. Click Run Query
  7. Question: Which state had the most charities registered?
     Answer: California, with 163,874 * (as of this course writing)

Error 6: Type mismatch on a Comparison Operator (<,>,=,!=)

Practice filtering out rows in your query based on more than one condition as well as using comparison operators. In this next exercise, we want to find all non-profits with employees that had ceased operations in 2015.

  1. Copy and Paste the below query and append it to the end of your existing query from the previous exercise.
#standardSQL
# Error 6: Type mismatch on a Comparison Operator (<,>,=,!=)

# Non-Profits with employees who ceased operations in 2015
SELECT
  # Key stats
  EIN,
  noemplyeesw3cnt AS count_employees,
  noindiv100kcnt AS count_emp_over_100k_salary,
  ceaseoperationscd AS ceased_operation,

  # Revenue = Contributions + Programs + Other
  totcntrbgfts AS total_contributions,
  totprgmrevnue AS total_program_revenue,
  totrevenue AS total_revenue,
  
  # Expenses
  totfuncexpns AS total_functional_expenses

FROM `bigquery-public-data.irs_990.irs_990_2015`

WHERE
  ceaseoperationscd != Y # Ceased operations
  AND noemplyeesw3cnt >= '0' # Had employees
  
ORDER BY
  noemplyeesw3cnt DESC;
  2. Click on the query validator icon ( ! )
  3. Confirm the error:
     Error: Unrecognized name: Y at [XXX:XX]
  4. Modify your query to fix the comparison operators filtering your rows in the WHERE clause. Consult the data type field in your table schema if you are unclear what type of data is stored (INTEGER, STRING, etc.).
     Hint: Integers do not need to be in quotes, whereas string values do
     Hint: Don't forget that a valid query does not mean a correct one. Consider the types of comparison operators used
  5. Click Run Query and fix any remaining validation errors
  6. Compare your code to the solution below:
#standardSQL
# Solution 6: Type mismatch on a Comparison Operator (<,>,=,!=)
# Explanation: Correct quotes for 'Y' and 0 and corrected operators =, >

# Non-Profits with employees who ceased operations in 2015
SELECT
  # Key stats
  EIN,
  noemplyeesw3cnt AS count_employees,
  noindiv100kcnt AS count_emp_over_100k_salary,
  ceaseoperationscd AS ceased_operation,

  # Revenue = Contributions + Programs + Other
  totcntrbgfts AS total_contributions,
  totprgmrevnue AS total_program_revenue,
  totrevenue AS total_revenue,
  
  # Expenses
  totfuncexpns AS total_functional_expenses

FROM `bigquery-public-data.irs_990.irs_990_2015`

WHERE
  ceaseoperationscd = 'Y' # Ceased operations
  AND noemplyeesw3cnt > 0 # Had employees
  
ORDER BY
  noemplyeesw3cnt DESC;
  7. Using either the Query Explanation or the paginated row count in the results window, determine how many non-profits with employees ceased operations in 2015.
     Answer: 399

Nice work! This is the end of the basic query troubleshooting lab. Review the fundamentals you have learned below as they will serve you well in future modules and labs. Even SQL experts commonly run into these errors and even more complex ones that we will cover later.

Learning Review

References

Summary

In this lab, you will explore Google BigQuery pricing and how to estimate query and storage costs. Additionally, you will see how modifying a query can affect the cost.


Overview

In this lab, you will explore Google BigQuery pricing and how to estimate query and storage costs. Additionally, you will see how modifying a query can affect the cost.

Objectives

Note: This lab assumes standard on-demand query pricing, not flat-rate (enterprise) pricing.

Estimate Query Costs

  1. Open bigquery.cloud.google.com
  2. Click on Compose Query
  3. Copy and Paste the below query
#standardSQL
# Click on the validator to see how many bytes the query will process
SELECT * FROM `data-to-insights.irs_990.irs_990_reporting` 
WHERE UPPER(name) LIKE '%KNIGHT%'
ORDER BY ein
LIMIT 10;
  4. Click the query validator icon to see how much data this query will process
     Answer: This query will process 1.59 GB when run.
  5. Using the Pricing Calculator (https://cloud.google.com/products/calculator/), estimate how much 1.59 GB costs to process
     Answer: The first 1 TB (1,000 GB) per month is free, then $5 per TB. 1.59 GB would cost about $0.008.
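The cost arithmetic can be sanity-checked with a scratch query that scans no table, using the on-demand rate quoted above (this sketch ignores the free first 1 TB per month):

#standardSQL
# Scratch arithmetic at the on-demand rate of $5 per TB (1,024 GB)
SELECT
  ROUND(1.59 / 1024 * 5, 4) AS cost_of_1_59_gb_usd,  -- a 1.59 GB scan
  ROUND(454 / 1024 * 5, 2)  AS cost_of_454_gb_usd;   -- a 454 GB scan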

Even though our irs_990_reporting dataset is over a million rows, Google BigQuery scales to billions of rows at the same speed. Let's try pricing a query on a dataset with over 10 billion records.

  1. Clear the first query from the editor
  2. Copy and paste the below query on Wikipedia data
#standardSQL
SELECT
  language,
  wikimedia_project,
  title,
  SUM(views) AS views
FROM
  `bigquery-samples.wikipedia_benchmark.Wiki10B`
WHERE
  title LIKE '%Google%'
GROUP BY
  wikimedia_project,
  title,
  language;
  3. Click the query validator icon to see how much data this query will process
     Answer: This query will process 454 GB when run
  4. Clear or comment out the query
  5. Using the Pricing Calculator (https://cloud.google.com/products/calculator/), estimate how much 454 GB costs to process
     Answer: 454 GB is roughly $2 to process.

Reducing Query Costs

Let's examine two types of modifications to determine whether or not they reduce the cost of a query.

This section of the lab focuses on LIMIT clauses and fields in the SELECT clause, and the effects they have on the amount of data processed. By combining the Pricing Calculator with BigQuery's built-in tools, you can minimize costs before running queries.

  1. In the BigQuery web UI, in the New Query window, type the following query. This query adds a LIMIT clause to the query used in the previous section. Do not run the query!
# do not run, large 4.43 TB query (even with LIMIT)
SELECT
  language,
  wikimedia_project,
  title,
  SUM(views) AS views
FROM
  `bigquery-samples.wikipedia_benchmark.Wiki100B`
WHERE
  title LIKE '%Google%'
GROUP BY
  wikimedia_project,
  title,
  language
LIMIT 10;
  2. Click the query validator icon (if necessary). Notice that the query will still process an estimated 4.43 TB. LIMIT clauses only limit the data a query returns. Since all data is processed before a LIMIT clause is applied, using one has no impact on the cost of a query.
  3. The following modified query removes the wikimedia_project and title fields from the SELECT clause. Type the following query in the New Query window and check the query validator to determine if there is an effect on the amount of data processed. Do not run the query!
# do not run, large query
SELECT
  language,
  SUM(views) AS views
FROM
  `bigquery-samples.wikipedia_benchmark.Wiki100B`
WHERE
  title LIKE '%Google%'
GROUP BY
  language;

The query will now process an estimated 4.06 TB when run, compared with the earlier figure of 4.43 TB. That saves about 0.4 TB, but let's see if there is a less detailed table we can query instead.

Specifying a minimal number of fields in the SELECT clause can reduce costs, though the results are less flexible in terms of future (ad-hoc) analysis.

  4. Using previous steps as a guide, calculate the cost of the modified query using the Pricing Calculator. You should find that reducing the size of the query reduces the cost.
  5. Type the following query in the New Query window. This query retrieves the same information from a smaller table, Wiki1B. This table has the same schema but contains less data (1,249,541,131 rows vs 20,871,119,740 for Wiki100B).
#standardSQL
SELECT
  language,
  SUM(views) AS views
FROM
  `bigquery-samples.wikipedia_benchmark.Wiki1B`
WHERE
  title LIKE '%Google%'
GROUP BY
  language;

You should find that the query processes 51.7 GB of data.

  6. Using previous steps as a guide, calculate the cost of the query using the Pricing Calculator. As expected, reducing the size of the table queried also reduces the cost.
  7. Run the query and note how long it takes to complete.

A key takeaway from this exercise is that queries should be tailored to return only the columns you need. Putting unnecessary selection criteria into a query results in unnecessary cost. Also, a good practice is to do testing and development against a smaller dataset to control query costs.

Estimate Storage Costs

Viewing Dataset size through the UI

  1. Click on irs_990_reporting
  2. In the bottom pane, click on Details
  3. Under Table Info, look up the Table Size
     Answer: 1.59 GB
  4. Using the Pricing Calculator (https://cloud.google.com/products/calculator/), estimate how much 1.59 GB costs to store
     Answer: 1.59 GB * $0.02 per GB per month = $0.03 per month
  5. What about 10 TB of data?
     Answer: 10 TB = 10,240 GB * $0.02 per GB per month = $204.80 per month
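The same storage arithmetic as a scratch query, using the $0.02 per GB per month active-storage rate from above:

#standardSQL
# Scratch arithmetic at $0.02 per GB per month of active storage
SELECT
  ROUND(1.59 * 0.02, 2)  AS monthly_cost_1_59_gb_usd,  -- = $0.03
  ROUND(10240 * 0.02, 2) AS monthly_cost_10_tb_usd;    -- = $204.80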

Viewing Dataset size through SQL

Google BigQuery exposes the metadata on each table for you to query. This includes the size of the table on disk.

  1. Click Compose Query
  2. Copy and Paste the below query
#standardSQL
# calculate storage costs for a dataset (assume $0.02 / GB / month)
SELECT 
  ROUND((SUM(size_bytes)/1024/1024/1024),4) AS total_size_gb,
  ROUND((.02*SUM(size_bytes)/1024/1024/1024),2) AS storage_cost_per_month
FROM `data-to-insights.irs_990.__TABLES__` # change name
  1. Change the dataset name to your own
  2. Click Run Query
  3. View the total size and estimated cost of your dataset

Protip: If a table is not edited for 90 consecutive days, the price of storage for that table automatically drops by 50 percent to $0.01 per GB per month (as of August 2017). See the Pricing Guide for the latest.


Tip: Storage costs are relatively small compared to query costs. Changing your queries should be your first priority before considering changing your storage settings.

Learning Review

References

Summary

Cloud Dataprep is Google's self-service data preparation tool. In this lab, you will learn how to clean and enrich multiple datasets using Cloud Dataprep.

URL

Category

GCP Big Data

Status

Environment

Feedback Link

g.co/CloudTrainEval

Overview

Cloud Dataprep is Google's self-service data preparation tool. In this lab, you will learn how to clean and enrich multiple datasets using Cloud Dataprep.

Objectives

Create a new Storage Bucket to store Cloud Dataprep settings

  1. Open the Google Cloud Console at console.cloud.google.com
  2. Go to Storage in the Products and Services menu (left-side navigation)
  3. Click Create Bucket
  4. In the Create a bucket window that will appear, add a unique bucket name (use data-insights-course-yourname) and leave the remaining settings at their default values.
  5. Click Create
  6. You now have a Cloud Storage Bucket which we will be using to store raw data for ingestion into Google BigQuery later and for storing Cloud Dataprep settings.

Create a new BigQuery Dataset to store Cloud Dataprep output

  1. Open BigQuery at bigquery.cloud.google.com
  2. In the left side bar, hover over your project: data-to-insights-yourname
  3. Click on the down arrow that appears to the right
  4. In the drop down menu, select Create new dataset
  5. For Dataset ID, type irs_990
  6. Click OK (leave the remaining fields at their default values)

Now you have a new empty dataset that we can populate with tables.

Access and Launch Cloud Dataprep

  1. Open the Google Cloud Console at console.cloud.google.com
  2. Open the Products & Services menu in the left-side navigation
  3. Under Big Data, click on Dataprep
  4. Click Allow for Trifacta to access project data. Dataprep is provided in collaboration with Trifacta, a Google partner.

  5. Click on the bucket name that you created earlier which will appear in the list.
  6. Click Use this Folder
  7. Wait for Cloud Dataprep to initialize (less than a minute typically)

Import IRS Dataset into a Dataprep Flow

  1. In the Cloud Dataprep UI, click Create Flow
  2. Specify the Flow:
  1. Name: "IRS 990 Data Cleanup"
  2. Description: "Deduplication, filtering, and joining IRS Form 990 U.S. Non-Profit Data"
  3. Click Import & Add Datasets
  4. Download these tables locally:
  1. irs_990_2014
  2. irs_990_2015
  3. irs_990_ein
  5. In the data importer left side menu, click Upload
  6. Click on the irs_990 dataset you created earlier in your own project
  7. Add each data table to Dataprep by clicking on the [ + ] plus sign once. Afterward, you should have 3 new Dataprep Datasets showing in the right-side preview panel. (Dataprep refers to each BigQuery table as a Dataset.)
  8. Click Import Datasets
  9. Wait for the datasets to be loaded into Dataprep
  10. Select all datasets in the Add Datasets to Flow window and click Add

You are now ready to create new transformations using Recipes on each raw data table. Continue on to the next section

Transforming Raw Data with a Dataprep Recipe
Let's start cleaning up the base table with all Non-Profit Organization names: irs_990_ein

  1. Select the irs_990_ein dataset icon
  2. Click Add New Recipe in the Details right-side navigation
  3. Click Edit Recipe
  4. Wait for the dataset to load a sample of your data into the transformer preview

In The Transformer, you will get useful summary stats about the dataset

Explore the Sampled Data

Before we apply any transformations to the dataset, let's get an idea of what data values are present which will inform our recipe making. Note that Dataprep, at the time of writing, loads a sample size of 10MB for you to preview. Keep this limited view in mind during your exploration.

Viewing Frequent Values
Above each field is a histogram which counts frequency of the data values. This is particularly useful to see the shape of the data to quickly highlight fields that are skewed to particular values.

  1. Hover over a few of the taller bars in the name column and see if there are non-profit organizations that share a name in your data sample. Do you see four or five names that are more frequent than others?

Assessing Data Quality

The colored horizontal bar beneath each column name indicates what portion of the dataset sample has valid values (according to the data type), missing values (no content or non-existent), or mismatched values.

  1. Scroll horizontally through the columns in the dataset and see if you can find any with mismatched or missing values.
  2. Locate the State field and hover over the data quality bar. Are there any mismatched values according to the Dataprep rules for U.S. states?
  1. Answer: You may find state codes AP, AE, AA, and GU classified as mismatched. These actually represent Armed Forces - Pacific, Armed Forces - Europe, Armed Forces - Americas, and Guam, respectively. Depending on what you want included or excluded from your dataset, you can choose to filter these out or keep them in a recipe.

Counting Duplicate Records
You can drill into more stats on each table to find useful details like count of unique values, mismatched data types and more.

  1. Click on the down arrow next to the ein column and select Column Details
  2. In the Summary section, is there a difference between the total Valid values and the total unique values? This indicates the presence of duplicate records.

    A quicker way to visually check for duplicates is by looking at the Top Values chart and seeing if there are values with an occurrence greater than one in the bar label.

Build a Recipe to Transform your Raw Data

Below are the key priorities of our data cleanup and transformation process for the irs_990 dataset.

  1. Eliminate duplicate values
  2. Trim unused columns
  3. Filter by key financial metrics to remove anomalies
  4. Join data tables into a final single reporting table we can query and visualize

Note that the above steps can be accomplished through SQL, but the value of Cloud Dataprep comes from quickly exploring and transforming data through a GUI. It is another tool in the Data Analyst's toolkit.

Eliminate Duplicate Values

  1. In the bottom panel, click Cancel if any recipe suggestions are present from your data exploration exercise. You should now see New Step and Choose a Transformation
  2. Start typing deduplicate into the Choose Transformation input box
  3. Select the deduplicate option when it appears
  4. Note the impact this transformation step will have on your data sample with the preview of "Transformed X Rows" in the top right.
  5. Once you are satisfied with the transformation, click Add to Recipe
  6. To review your recipe so far, click the script icon in the right-side toolbar

Trim Unused Columns

We want to trim our data table to remove the columns we know are unclean or unused:

  1. For each of these columns select the down arrow next to the field name and click Drop

Cleaning Historical Tables

Deduplicate Records

  1. Click on your Flow, IRS 990 Data Cleanup, to return to the main page
  2. Click on irs_990_2015 table icon
  3. Click Add Recipe, then Edit Recipe
  4. In the transformation box type deduplicate
  5. Click Add to Recipe
  6. Click on the script icon in the right sidebar and verify your new step has been added

Adding a new field for Calendar Year
Annual Tax periods often vary from the traditional calendar year. We want to capture the Calendar Year from the table name in a field value before we union all historical data together.

  1. Type derive as a new transformation step
  2. In the formula box type "2015" to save this as a static string value
  3. In the New column name field, type "calendar_year_filed" (leave other input boxes at their default values)
  4. Click Add to Recipe

Ensure proper data types

Dataprep does its best to read from the BigQuery metadata and go deeper into data field subtypes to save you effort. But sometimes, you might want to correct a field type.

  1. Confirm the column ein is stored as an integer and not as ssn (social security number). If necessary, convert the column by clicking on the down arrow and changing the data type

Repeat the above steps for recipes on the data table irs_990_2014. For the derived column "calendar_year_filed" be sure to change "2015" to the respective year of the table (e.g. "2014").

Unioning Historical Tables Together for a taller dataset

Now that we have 4 years of cleaner historical data, it is time to append the tables vertically in what we call a Union. Afterward, we will be able to query one master table with multiple years of tax filings instead of 4 underlying tables.

Adding a Union to combine data across years

  1. Click on irs_990_2015 table icon
  2. Click edit recipe
  3. In the transformation box type union
  4. Click Add to Recipe. You will then be taken to the Union Page where you can modify specific settings (i.e. which datasets you want to union with)
  5. Click on Add Datasets and select:
  1. irs_990_2015 - 2 (should already be present)
  2. irs_990_2014 - 2

Protip: the " - 2" indicates this is already a transformed version of the underlying table (remember that derived field we added for calendar_year_filed?).

  1. Click Add to Recipe

Date Parsing through Derived Fields

Cleaning raw data often involves changing data types, parsing values, and interpreting logic. Dataprep provides the option for you to write custom formulas to create these calculated fields. We'll perform a simple one to parse the tax period (YYYYMM) into a usable date value.

  1. Click on the down arrow next to tax_pd
  2. Hover over Formula
  3. Select Custom Formula
  4. For Choose a transformation, type in derive (it is likely already populated)
  5. For Formula paste in the below:
merge([left(tax_pd, 4),right(tax_pd, 2),'01'], '-')

Protip: Use DATEFORMAT() whenever you can instead of string parsing. Unfortunately, our data is in YYYYMM, which, as of the time of this writing, we have to parse manually.

  1. Name the column tax_pd_cleaned (leave other input boxes at their default values)
  2. Click Add to Recipe

We now have a new column in proper date format which will prove useful in our queries and visualizations later.
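The Dataprep merge/left/right formula above is equivalent to this string manipulation, sketched here in Python with a hypothetical sample value:

```python
def parse_tax_period(tax_pd):
    """Convert a YYYYMM tax period (e.g. '201512') into a YYYY-MM-DD date string,
    mirroring the Dataprep formula merge([left(tax_pd,4), right(tax_pd,2), '01'], '-')."""
    tax_pd = str(tax_pd)
    return "-".join([tax_pd[:4], tax_pd[-2:], "01"])

print(parse_tax_period("201512"))  # -> 2015-12-01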

We're almost done! We have just one final step left: looking up valuable organization details (name, location) from our main ein table against our unioned historical tables, which we will do through a dataset join.

Adding a Join to lookup data columns from another dataset

  1. Type in Join under Choose a transformation
  2. Click Add to Recipe. This will bring you to the Join Page.
  3. Select irs_990_ein-2 for the dataset to join against
  4. Click Preview Selected Dataset
  5. Click Select Join Keys to specify how we will link the two datasets
  6. Under Join Keys, ensure we are matching on the field ein between the datasets
# ein = # ein

Tip: If Dataprep suggests a different relationship, click Edit and select ein for both tables

  1. Under columns in the left-side bar, click All
  2. Beneath the Join Keys tab, change the drop down marked INNER JOIN to LEFT OUTER JOIN. Essentially, this means we want all Form 990 filings from our historical table even if they have no matching organization ein in the lookup table.

    Note: We will be covering JOINs in greater detail later in the course.
  3. Click Add to Recipe

We have just added all of the columns from our main ein table (organization name, location etc.) into our historical filings table. This enriches our final dataset by allowing us to have both filing and organizational detail information in one place.

We are not done yet. Once our recipe is finalized, we still need to run the job and specify where we want the results stored.

Running our Dataprep Job

  1. Review your recipe and confirm all the steps you want included are present
  2. Click Run Job in the upper-right corner
  3. By default, Dataprep will store your results in a CSV in Google Cloud Storage. Since we want a new Google BigQuery Table, click the pencil edit icon to modify the publish action
  4. Select BigQuery in the left side panel
  5. Select the irs_990 dataset you created earlier
  6. Click Create a new table in the right side panel
  7. Type irs_990_reporting for your data table name
  8. Choose Replace the data every run as your option for saving the results. The source table we are pulling from will always be a complete historical source even as new records are added upstream.
  9. Click Save Settings
  10. Review your Job and Publishing Actions and confirm the location is data-to-insights-yourname:irs_990.irs_990_reporting
  11. Click Run Job
  12. Monitor your Jobs in the subsequent page or by clicking on Jobs in the top navigation bar in the Dataprep UI.

    Protip: Under the hood, Cloud Dataprep passes your recipes to Cloud Dataflow, a data pipeline automation tool commonly used by Data Engineers (and the subject of future courses)
  13. Wait for your transformation to be completed (time varies with the complexity of your recipe. Cloud Dataflow will autoscale resources to meet the job need.)
  14. In the Jobs page, click on a completed job to review your results

    Insight: Note the large number of missing values in the field elf (Electronic Filing). If you dig deeper into this anomaly, you will discover that only the 2015 filing dataset has this new field marking whether filings were filed electronically or on paper.


Confirming Results in Google BigQuery

  1. Open the Google BigQuery UI at bigquery.cloud.google.com
  2. Click Compose Query
  3. Copy and Paste the below SQL into the editor and click Run Query
#standardSQL

# count filings by year from our
# final reporting table
SELECT
  COUNT(ein) AS ein_count,
  calendar_year_filed
FROM
  `data-to-insights.irs_990.irs_990_reporting` # change to your dataset
GROUP BY
  2
ORDER BY
  2 DESC;
  1. Validate your results against the below solution table:

ein_count | calendar_year_filed
294782 | 2015
299405 | 2014
289603 | 2013
294019 | 2012

Congratulations! You have successfully transformed and cleaned your raw data into a final reporting table using Cloud Dataprep. In subsequent labs, we will continue to explore, filter, query, and visualize this dataset.

Learning Review

References

Summary

Core to BigQuery are the concepts of datasets, tables, and views. Learn how to create your own datasets, tables, and how to store and export query results.

URL

Category

GCP Big Data

Status

Environment

Feedback Link

g.co/CloudTrainEval

Overview

Core to BigQuery are the concepts of datasets, tables, and views. Learn how to create your own datasets, tables, and how to store and export query results.

Objectives

In this lab you will perform the following:

Create a Dataset with your Google Cloud Project

  1. Open BigQuery at bigquery.cloud.google.com
  2. In the left side bar, hover over your project: data-to-insights-yourname
  3. Click on the down arrow that appears to the right
  4. In the drop down menu, select Create new dataset
  5. For Dataset ID, type irs_990
  6. Click OK (leave the remaining fields at their default values)

Now that we have a new empty dataset, it's time we fill it in with some tables. Next we will look at a couple of different methods for creating permanent tables from existing public data sources already on Google BigQuery. In a future lab we will cover ingesting new data.

Copying Existing Tables

Storage on BigQuery is relatively inexpensive (we'll cover pricing in a later lab: $0.02 per GB / month after the first free 10GB) so a common use case for data teams is to store copies of their tables for testing and development. We will practice by copying the IRS 990 Public BigQuery Dataset into our own personal project.

  1. Expand the bigquery-public-data Project in the left side navigation panel

  1. Expand the irs_990 Dataset within the bigquery-public-data project you expanded earlier.
  2. Hover over the irs_990_ein table and click the down arrow that appears
  3. Click Copy Table
  4. In the modal window, select your Destination Project to the one you created earlier: data-to-insights-yourname
  5. Select irs_990 as the Destination Dataset
  6. Name the Destination Name to be irs_990_ein (the same as the source name)
  7. Repeat steps 2 - 6 to copy the additional tables below (keeping their destination names the same as the source table names):
  1. irs_990_2012
  2. irs_990_2013
  3. irs_990_2014
  4. irs_990_2015
  8. In the left side panel, expand your project data-to-insights-yourname
  9. Expand the irs_990 dataset
  10. Confirm the 5 tables you copied are now present:
  1. irs_990_2012
  2. irs_990_2013
  3. irs_990_2014
  4. irs_990_2015
  5. irs_990_ein

Protip: Click on any of the new tables above and select Details (next to Schema). At the bottom of your metadata table you can optionally add a label to this data table. Common uses for labels include marking tables as 'dev' or 'prod'.


Saving Query Results into a Permanent Table

  1. Click Compose Query
  2. Copy and Paste the below query in its entirety into the query editor (don't worry about interpreting it; we'll be covering a lot of these concepts later!)
#standardSQL

  #CTEs
  WITH 
  
  # 2015 filings joined with organization details
  irs_990_2015_ein AS (
  SELECT
    *
  FROM
    `bigquery-public-data.irs_990.irs_990_2015`
  JOIN
    `bigquery-public-data.irs_990.irs_990_ein` USING (ein)),
    
  # duplicate EINs in organization details
  multiples AS (
  SELECT
    ein AS ein,
    COUNT(ein) AS ein_count
  FROM
    irs_990_2015_ein
  GROUP BY
    ein
  HAVING
    ein_count > 1 )
    
# return results to store in a permanent table
SELECT
  irs_990.ein AS ein,
  irs_990.name AS name,
  irs_990.noemplyeesw3cnt AS num_employees,
  irs_990.grsrcptspublicuse AS gross_receipts,
  irs_990.totcntrbgfts AS total_contributions,
  irs_990.lessdirfndrsng AS fundraising_expenses,
  irs_990.totfuncexpns AS total_functional_expenses,
  irs_990.totliabend AS total_liabilities,
  irs_990.totnetliabastend AS total_liabilities_and_net_asset_balances,
  irs_990.totrevenue AS total_revenue,
  irs_990.pubsupplesub509 AS public_support,
  irs_990.officexpns AS office_expenses,
  irs_990.legalfees AS legal_fees,
  irs_990.accntingfees AS accounting_fees,
  irs_990.feesforsrvcmgmt AS management_services_fees,
  irs_990.elf AS efiled_indicator
FROM
  irs_990_2015_ein AS irs_990
LEFT JOIN
  multiples
ON
  irs_990.ein=multiples.ein
WHERE
  # filter out duplicate records
  multiples.ein IS NULL;
  1. Click Run Query
  2. In the results pane you will see many options for saving and exporting the results:
  1. Save Query - simply saves the SQL file you wrote
  2. Save View - saves a logical view and re-runs your SQL each time it's queried
  3. Download as CSV
  4. Download as JSON
  5. Save as Table - creates a new BigQuery destination table
  6. Save to Google Sheets


  3. Click on Save as Table (we will cover the other options shortly)
  4. Type in irs_990_2015_reporting for the Destination Name and leave the default values for project and dataset
  5. Confirm the new table shows up in your dataset list in the left panel. You can now query it directly as you would any other table
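The deduplication pattern in the query above (find EINs occurring more than once, then anti-join to keep only single-occurrence records) can be sketched in Python:

```python
from collections import Counter

def drop_duplicate_eins(rows):
    """Keep only rows whose ein appears exactly once, mirroring the
    HAVING ein_count > 1 CTE plus the LEFT JOIN ... IS NULL anti-join."""
    counts = Counter(row["ein"] for row in rows)
    return [row for row in rows if counts[row["ein"]] == 1]

# Hypothetical sample filings: ein 222 is duplicated and gets dropped entirely
filings = [{"ein": 111}, {"ein": 222}, {"ein": 222}, {"ein": 333}]
print(drop_duplicate_eins(filings))  # -> [{'ein': 111}, {'ein': 333}]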

Creating a View

A view simply saves the SQL script you ran originally as its definition. It does not store any results, but rather re-runs the SQL script whenever the view itself is queried. A common use case for views is creating Authorized Views, which allow you to limit the rows a user can see in the resulting data based on their role.

  1. With the results of the last query still open (or re-run the above query if need be), click on Save View

    Tip: If you get an error about legacy-sql, click on Show Options and disable Use Legacy SQL.
  2. Type in irs_990_2015_reporting_view for the Destination Name and leave the default values for project and dataset
  3. Confirm the new view shows up in your dataset list in the left panel. Note the different icon for a view as opposed to a table.

Editing an Existing View

  1. Click on the new irs_990_2015_reporting_view in the left side panel
  2. Click on Details in the bottom pane (where the schema is visible)
  3. Scroll to the bottom to see the query that comprises the view
  4. Click on Edit Query to make any changes to the view or leave it as is
  5. Click on Save View

Exporting Results

  1. Click Compose Query
  2. Copy and Paste in the below query and change the project name to yours
#standardSQL
# 2015 filings with > 1000 employees
SELECT
  *
FROM
  `data-to-insights.irs_990.irs_990_2015_reporting_view` # change name
WHERE
  num_employees > 1000;
  1. Click Run Query
  2. Click Save to Google Sheets
  3. Wait for results to save and then Click to View in the notification that appears
  4. Confirm your results are now showing in Google Sheets

Performance Pitfall: Working with large CSVs and query results in Google Sheets (>1M records) is not advised. Consider analyzing your tables with SQL in Google BigQuery and visualizing them in Google Data Studio instead (covered in an upcoming lab).

Access Pitfall: Exporting Data out of BigQuery or Saving as a new Destination Table requires extreme caution when sharing the results. While underlying source tables may be access controlled, any new tables and exports you create may not be.

Query Cache

  1. Click Compose Query
  2. Copy and Paste in the below query and change the project name to yours
#standardSQL
# most employees
# test cache by re-running the same query twice
# disable cache in Show Options
SELECT
  name,
  num_employees
FROM
  `data-to-insights.irs_990.irs_990_2015_reporting_view` # change name
ORDER BY
  num_employees DESC;
  1. Click Run Query and, after completion, make note of the execution time (appears to the right of Show Options)
  2. Without changing the query, click Run Query again and note the execution time. It should now read (X.Xs elapsed, cached) and execute faster.

Performance Tip: Cached query results (which are essentially temporary tables) will last 24 hours before cache is cleared.

Well done! You have successfully created your own dataset, populated it with data tables and a view, and exported a subset of data. Next, we will cover how to get external data into Google BigQuery.

Learning Review

References:

Summary

In this lab, you will ingest new data sources into Google BigQuery and learn how to query external data sources directly.

URL

Category

GCP Big Data

Status

Environment

Feedback Link

g.co/CloudTrainEval

Overview

In this lab, you will ingest new data sources into Google BigQuery and learn how to query external data sources directly.

Objectives

Ingesting Data from Google Cloud Storage

Upload a dataset to Google Cloud Storage

  1. Download this CSV:
  1. NAICS_digit_2017_codes.csv
  2. Open the Google Cloud Console at console.developers.google.com
  3. Go to Storage in the Products and Services menu (left-side navigation)
  4. Click your-bucket-name
  5. Click Upload Files
  6. Navigate to the CSV file you downloaded from Step 1 and upload it
  7. Wait for the file to upload
  8. Confirm the file has uploaded to your storage bucket

Ingesting a CSV into a Google BigQuery Table

  1. Open bigquery.cloud.google.com
  2. Expand your project-name in the left side bar to reveal the irs_990 dataset you created earlier
  3. Hover over the irs_990 dataset name and click on the down arrow
  4. Click Create Table
  5. Populate the following Create Table options:

Source Data

(leave as default)

Repeat Job

(leave as default)

Location

  1. Change drop down to Google Cloud Storage

Location File Path

  1. Click the View Files link
  2. Navigate your Bucket to your files
  3. Hover over the NAICS_digit_2017_codes.csv name and right-click to copy link address
  4. Paste that link into the Location File Path
  5. Replace https://storage.cloud.google.com/ with gs:// so your file path will end up similar to: gs://data-insights-course/labs/lab5-ingesting-and-querying/NAICS_digit_2017_codes.csv
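The path substitution in step 5 is a simple prefix replacement; a minimal sketch in Python, using this lab's example path:

```python
def to_gs_uri(https_url):
    """Rewrite a storage.cloud.google.com browser URL into a gs:// URI."""
    prefix = "https://storage.cloud.google.com/"
    if https_url.startswith(prefix):
        return "gs://" + https_url[len(prefix):]
    return https_url  # already a gs:// URI or some other path

print(to_gs_uri("https://storage.cloud.google.com/data-insights-course/labs/lab5-ingesting-and-querying/NAICS_digit_2017_codes.csv"))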

Destination Table

  1. Leave the irs_990 dataset drop down
  2. Populate the name as: naics_digit_2017_codes

Table Type

Native Table (default)

Schema

  1. Check Automatically detect

Other Options

Leave as Default

  1. Click Create Table
  2. Confirm the new naics_digit_2017_codes schema looks similar to the below

Seq__No_ | INTEGER
_2017_NAICS_US___Code | STRING
_2017_NAICS_US_Title | STRING
string_field_3 | STRING
string_field_4 | STRING
string_field_5 | STRING

  1. Click on Preview to see sample data values

It looks like we've potentially ingested unnamed or blank columns. We can clean these up using SQL or Cloud Dataprep, as we learned in previous labs.

Reading a CSV as an External Data Source in Google BigQuery Table
Instead of ingesting and storing the CSV data table in Google BigQuery, you decide you want to query the underlying data source directly.

The process is essentially the same as before except for changing the Table Type

  1. Hover over the irs_990 dataset name and click on the down arrow
  2. Click Create Table

Source Data

(leave as default)

Repeat Job

(leave as default)

Location

  1. Change drop down to Google Cloud Storage

Location File Path

  1. Copy and Paste this Google Cloud Storage bucket link gs://data-insights-course/labs/lab5-ingesting-and-querying/irs990_code_lookup.csv

Destination Table

  1. Leave the irs_990 dataset drop down
  2. Populate the name as: irs990_code_lookup

Table Type

  1. Change the default to External table

Schema

  1. Populate the Schema as follows by filling out the input boxes and clicking Add Field

Name | Type | Mode
irs_990_field | STRING | NULLABLE
code | STRING | NULLABLE
description | STRING | NULLABLE

  1. Under Options, Header rows to skip put a 1 and leave the remaining default values


Other Options

Leave as Default

  1. Click Create Table
  2. Wait for the table to be created
  3. To query the new table, click Compose Query
  4. Copy and Paste the below query into the editor
#standardSQL
# Lookup what IRS code values mean
SELECT
  irs_990_field,
  code,
  description
FROM
  `data-to-insights.irs_990.irs990_code_lookup`
WHERE
  irs_990_field IN ('elf','subcd')
  1. Click Run Query
  2. Insights: Read through the query results:
  1. What does the field 'elf' mean?
  1. Answer: elf denotes how the return was filed: 'E' for Electronic, 'P' for Paper
  2. Are the subsection (subcd) codes unique?
  1. Answer: No. Code 3 alone is used to denote 8 different possible Organization types (Charitable Corporation, Educational Organization, etc.). This insight will become particularly important when we use this table as a lookup for our individual filings. We will learn how to handle this in upcoming labs.

Performance Pitfall: Creating and querying from External Data Sources directly (e.g. CSVs stored on Google Cloud Storage) has performance impacts as Google BigQuery has less control over data outside of its fully-managed data warehouse.

Congratulations! You have learned how to ingest data into Google BigQuery and query external data sources directly. In future labs we will merge these data sources together for a single enriched reporting data source.

Learning Review

References

Summary

In this lab, you will visually explore Google BigQuery data tables inside of Google Data Studio. You will look for relationships and insights between fields in your dataset.

URL

Category

GCP Big Data

Status

Environment

Feedback Link

g.co/CloudTrainEval

Overview

In this lab, you will visually explore Google BigQuery data tables inside of Google Data Studio. You will look for relationships and insights between fields in your dataset.

Objectives

Create a Blank Report

  1. Open datastudio.google.com
  2. On the Reports page, in the Start a new report section, click the Blank template. This creates a new untitled report.
  3. If prompted, click I accept the terms and conditions and then click Accept. You may need to click the Blank template again after agreeing to the terms and conditions.
  4. In the Add a data source window, click Create new data source.
  5. For Connectors, click BigQuery.
  6. For Authorization, click Authorize. This allows Data Studio access to your GCP project.
  7. In the Request for permission dialog, click Allow to give Data Studio the ability to view data in BigQuery. You may not receive this prompt if you previously used Data Studio.
  8. Click on My Projects
  9. Find your data-insights project and the irs_990 dataset you created in an earlier lab
  10. For table, click irs_990_2015_reporting
  11. For Billing Project, select your GCP project.
  12. In the upper right corner of the window, click Connect.
  13. Once Data Studio has connected to the BigQuery data source, the table's fields are displayed. You can use this page to adjust the field properties or to create new calculated fields. Click Add to report.
  14. When prompted, click Add to report.
  15. In the Request for permission dialog, click Allow to give Data Studio the ability to view and manage files in Google Drive. You may not receive this prompt if you previously used Data Studio.

Create a Bar Chart to Compare Revenue and Expenses

  1. At the top of the page, click Untitled Report to change the report name. For example, type "Exploring U.S. 2015 Non-Profit Data"
  2. When the report editor loads, click Insert > Bar chart.
  3. Using the handle, draw a rectangle on the report to display the chart.
  4. In the Bar chart properties window, on the Data tab, notice the value for Data Source (IRS_990_2015_reporting) and the default values for Dimension and Metric.
  5. Change the Dimension to name
  6. Change the Metric to total_revenue
  7. Add a second Metric, total_functional_expenses
  8. Re-size and reposition the bar chart on the page to your liking
  9. Click View in the upper right corner to preview what the Published dashboard will look like
  10. Hover over the 2015 Non-Profit with the most Revenue for 2015
  11. Insights: Are there any insights you can glean from the relationship between total revenue and total functional expenses by looking at the bar chart?
  1. Generally, for these Non-Profits, Revenue matches Expenses for the year
  12. Return to the authoring view by clicking Edit in the upper right


Create a Data Table to show Employee Counts

  1. Click Insert > Table
  2. Using the handle, draw a rectangle on the report to display the chart.
  3. In the Table Properties window, on the Data tab, notice the value for Data Source (IRS_990_2015_reporting) and the default values for Dimension and Metric.
  4. Change the Dimension to name
  5. Change the Metric to num_employees
  6. Re-size and reposition the table on the page to your liking

Create a Scatter Chart to show financial ratios

  1. Click Insert > Scatter Chart
  2. Using the handle, draw a rectangle on the report to display the chart.
  3. In the Scatter Chart Properties window, on the Data tab, notice the value for Data Source (IRS_990_2015_reporting) and the default values for Dimension and Metric.
  4. Change the Dimension to name
  5. Change the Metric X to total_liabilities_and_net_asset_balances
  6. Change the Metric Y to total_liabilities
  7. Re-size and reposition the scatter chart on the page to your liking
  8. Click View to preview
  9. Insight: Non-Profit Organizations that fall below an imaginary Y = X diagonal line have fewer liabilities relative to their net assets. Hover over a few of the outlier dots in the Scatter Chart to see this relationship.

Learning Review

References

Summary

In this lab, you will learn how to apply SQL UNIONs and JOINs to enrich your dataset.

URL

Category

GCP Big Data

Status

Environment

Feedback Link

g.co/CloudTrainEval

Practice Unioning and Joining Datasets

1. Open BigQuery: https://bigquery.cloud.google.com

2. Compose a New Query

3. Ensure #standardSQL is set

4. Write a Query that will count the number of tax filings by calendar year for all IRS Form 990 filings

Use the below partially-written query as a guide. Hint: You will need to use Table Wildcards *.

#standardSQL
# UNION Wildcard and returning a table suffix
SELECT
  COUNT(*) as number_of_filings,
  AS year_filed
FROM `bigquery-public-data.irs_990.irs_990`
GROUP BY year_filed 
ORDER BY year_filed DESC

5. Compare with the below solution

#standardSQL
# UNION Wildcard and returning a table suffix
SELECT
  COUNT(*) as number_of_filings,
  _TABLE_SUFFIX AS year_filed
FROM `bigquery-public-data.irs_990.irs_990_*`
GROUP BY year_filed 
ORDER BY year_filed DESC
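The wildcard-plus-_TABLE_SUFFIX pattern above can be sketched outside of BigQuery. The Python snippet below (illustrative only, not BigQuery syntax; the table names and row counts are made up) simulates unioning several per-year tables while keeping track of which table each count came from:

```python
# Hypothetical miniature of the wildcard-table pattern: several per-year
# tables are unioned, and the table suffix is carried along with each count.
tables = {  # suffix -> rows (each row stands in for one tax filing)
    "2015": [{"ein": 1}, {"ein": 2}],
    "2014": [{"ein": 3}],
}

# COUNT(*) grouped by suffix, ordered by suffix descending
counts = sorted(
    ((suffix, len(rows)) for suffix, rows in tables.items()),
    key=lambda t: t[0], reverse=True)
print(counts)  # [('2015', 2), ('2014', 1)]
```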

6. Run the query and confirm against the results below

Result:

7. Modify the query you just wrote to only include the IRS tables with the following format: irs_990_YYYY (i.e. filter out pf, ez, ein)

Start with the partially completed query below:

#standardSQL
# UNION Wildcard and returning a table suffix
SELECT
  COUNT(*) as number_of_filings,
  CONCAT(,_TABLE_SUFFIX) AS year_filed
FROM `bigquery-public-data.irs_990.irs_990_*`
GROUP BY year_filed 
ORDER BY year_filed DESC

8. Compare with the below solution:

#standardSQL
# UNION Wildcard and returning a table suffix
SELECT
  COUNT(*) as number_of_filings,
  CONCAT("2",_TABLE_SUFFIX) AS year_filed
FROM `bigquery-public-data.irs_990.irs_990_2*`
GROUP BY year_filed 
ORDER BY year_filed DESC
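The reason the solution prepends "2" is that _TABLE_SUFFIX only contains whatever characters the wildcard matched, not the full table name. A small Python sketch of that relationship (the table name mirrors the lab; the snippet is illustrative, not BigQuery behavior):

```python
# For the wildcard `irs_990_2*`, _TABLE_SUFFIX holds everything
# after the fixed prefix "irs_990_2".
table = "irs_990_2015"
prefix = "irs_990_2"           # the literal part of irs_990_2*
suffix = table[len(prefix):]   # what _TABLE_SUFFIX would contain: "015"

# CONCAT("2", _TABLE_SUFFIX) rebuilds the full year
assert "2" + suffix == "2015"
```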

9. Run the query and confirm the result:

10. Lastly, modify your query to only include tax filings from tables on or after 2013. Also include average totrevenue and average totfuncexpns as additional metrics.

11. Compare with the below solution:

#standardSQL
# count of filings, revenue, expenses since 2013
SELECT
  CONCAT("20",_TABLE_SUFFIX) AS year_filed,
  COUNT(ein) AS nonprofit_count,
  AVG(totrevenue) AS avg_revenue,
  AVG(totfuncexpns) AS avg_expenses
FROM `bigquery-public-data.irs_990.irs_990_20*`
WHERE _TABLE_SUFFIX >= '13'
GROUP BY year_filed
ORDER BY year_filed DESC
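Note that _TABLE_SUFFIX is a string, so the filter `>= '13'` is a lexicographic comparison. It works here only because the year suffixes are all the same length. A quick Python illustration of the general pitfall (not BigQuery-specific):

```python
# Equal-length, zero-padded suffixes compare as expected:
assert "13" <= "15" <= "16"

# But lexicographic order surprises when lengths differ:
assert "9" > "13"  # '9' sorts after '1', regardless of numeric value
```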

12. Run the query and confirm the result:

Practice Joining Tables

Find the Org Names of all EINs for 2015 with some revenue or expenses. You will need to join tax filing table data with the organization details table.

1. Start with the below query and fill in the tables, join condition, and any filter you will need.

#standardSQL
# Find the Org Names of all EINs for 2015 with some revenue or expenses, limit 100
SELECT
  tax.ein AS tax_ein,
  org.ein AS org_ein,
  org.name,
  tax.totrevenue,
  tax.totfuncexpns
FROM
  AS tax
JOIN
  AS org
ON
  tax.ein = 
WHERE
  > 0
LIMIT
  100;

2. Compare your query to the below solution

#standardSQL
# Find the Org Names of all EINs for 2015 with some revenue or expenses, limit 100
SELECT
  tax.ein AS tax_ein,
  org.ein AS org_ein,
  org.name,
  tax.totrevenue,
  tax.totfuncexpns
FROM
  `bigquery-public-data.irs_990.irs_990_2015` AS tax
JOIN
  `bigquery-public-data.irs_990.irs_990_ein` AS org
ON
  tax.ein = org.ein
WHERE
  tax.totrevenue + tax.totfuncexpns > 0
LIMIT
  100;
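If you want to see the JOIN semantics without touching BigQuery, the same shape can be reproduced with a few rows in an in-memory SQLite database. The table names mirror the lab, but all values here are toy data for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tax (ein INTEGER, totrevenue REAL, totfuncexpns REAL);
CREATE TABLE org (ein INTEGER, name TEXT);
INSERT INTO tax VALUES (1, 100.0, 50.0), (2, 0.0, 0.0), (3, 10.0, 5.0);
INSERT INTO org VALUES (1, 'Alpha'), (2, 'Beta');
""")

# Inner join on ein, keeping only filings with some revenue or expenses.
# ein 2 is dropped by the WHERE filter; ein 3 has no matching org row.
rows = conn.execute("""
SELECT tax.ein, org.name, tax.totrevenue
FROM tax JOIN org ON tax.ein = org.ein
WHERE tax.totrevenue + tax.totfuncexpns > 0
""").fetchall()
print(rows)  # [(1, 'Alpha', 100.0)]
```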

3. Run the Query

4. Confirm the results show 100 records, the names of the Organization, and at least some expenses or revenues

5. Clear the Query Editor

Practicing Working with NULLs

6. Write a query to find where tax records exist for 2015 but no corresponding Org Name

Fill out the partially written starter query below

#standardSQL
# Find where tax records exist for 2015 but no corresponding Org Name
SELECT
  tax.ein AS tax_ein,
  org.ein AS org_ein,
  org.name,
  tax.totrevenue,
  tax.totfuncexpns
FROM
  `bigquery-public-data.irs_990.irs_990_2015` tax
FULL 
  `bigquery-public-data.irs_990.irs_990_ein` org
ON
  
WHERE
  IS NULL # put tax.ein or org.ein to check here (one is correct)

7. Compare your solution to the below

#standardSQL
# Find where tax records exist for 2015 but no corresponding Org Name
SELECT
  tax.ein AS tax_ein,
  org.ein AS org_ein,
  org.name,
  tax.totrevenue,
  tax.totfuncexpns
FROM
  `bigquery-public-data.irs_990.irs_990_2015` tax
FULL JOIN
  `bigquery-public-data.irs_990.irs_990_ein` org
ON
  tax.ein = org.ein
WHERE
  org.ein IS NULL

8. Run the Query

9. How many tax filings occurred in 2015 but have no corresponding record in the Organization Details table?

Answer: 1351
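The NULL-check pattern can also be tried locally. The SQLite build bundled with Python may lack FULL JOIN, but for finding tax rows with no matching org row, a LEFT JOIN anti-join is equivalent. Toy data, for illustration only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tax (ein INTEGER);
CREATE TABLE org (ein INTEGER, name TEXT);
INSERT INTO tax VALUES (1), (2), (99);
INSERT INTO org VALUES (1, 'Alpha'), (2, 'Beta');
""")

# LEFT JOIN keeps every tax row; unmatched rows get NULL for org.ein,
# so filtering on org.ein IS NULL isolates the orphaned filings.
orphans = conn.execute("""
SELECT tax.ein FROM tax
LEFT JOIN org ON tax.ein = org.ein
WHERE org.ein IS NULL
""").fetchall()
print(orphans)  # [(99,)]
```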

Summary

In this lab, you will practice querying nested and repeated fields.

URL

Category

GCP Big Data

Status

Environment

Feedback Link

g.co/CloudTrainEval

You are given a reporting table that has nested and repeated rows. Each ein has its name and corresponding expense and revenue accounts repeated. You will practice querying repeated rows through the use of UNNEST().

  1. Skim the below schema and note the repeated fields

2. Preview the Contents of the table by running the following query (or using the Web to find and Preview)

#standardSQL
SELECT * FROM `data-to-insights.irs_990.irs_990_repeated` LIMIT 10

The results should look similar to the below:

3. Attempt to filter for all the Legal expenses by running the below query

#standardSQL
SELECT * FROM `data-to-insights.irs_990.irs_990_repeated` 
WHERE expense_struct.type = 'Legal' 
LIMIT 10

4. Confirm the error "Error: Cannot access field type on a value with type ARRAY<STRUCT<type STRING, amount INT64>>"

We need to find another way to access those array elements.

5. Find the top 10 nonprofits that spent the most on legal expenses in the table. As a start, complete the below query by adding in the appropriate UNNEST( ) and a WHERE clause filter on the expense type 'Legal'. Hint: UNNEST( ) typically follows the FROM much like a JOIN and should enclose the STRUCT.

#standardSQL
# Expenses by Category for each EIN
SELECT
  ein,
  expense
FROM `data-to-insights.irs_990.irs_990_repeated` n
CROSS JOIN n.expense_struct AS expense
ORDER BY expense.amount DESC
LIMIT 10

6. Compare against the below solution

#standardSQL
# Expenses by Category for each EIN
SELECT
  ein,
  expense
FROM `data-to-insights.irs_990.irs_990_repeated` n
CROSS JOIN UNNEST(n.expense_struct) AS expense
WHERE expense.type = 'Legal'
ORDER BY expense.amount DESC
LIMIT 10

7. Run the Query

8. Confirm the Result:
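Conceptually, UNNEST() flattens an array of structs into one row per element, after which ordinary filters and sorts apply. A plain-Python sketch of the same flatten-then-filter logic (toy rows, not the real dataset):

```python
# Each row carries a repeated field: a list of expense structs.
rows = [
    {"ein": 1, "expense_struct": [{"type": "Legal", "amount": 500},
                                  {"type": "Travel", "amount": 200}]},
    {"ein": 2, "expense_struct": [{"type": "Legal", "amount": 900}]},
]

# The UNNEST step: cross join each row with its own array elements.
flat = [(r["ein"], e["type"], e["amount"])
        for r in rows for e in r["expense_struct"]]

# WHERE expense.type = 'Legal' ORDER BY expense.amount DESC
legal = sorted((t for t in flat if t[1] == "Legal"),
               key=lambda t: t[2], reverse=True)
print(legal)  # [(2, 'Legal', 900), (1, 'Legal', 500)]
```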

Summary

In this lab, you will derive insights from advanced SQL functions.

URL

Category

GCP Big Data

Status

Environment

Feedback Link

g.co/CloudTrainEval

Practicing with Sub-Queries

1. Open BigQuery

2. Compose Query

3. Using the provided WITH clause below, write a SELECT statement that calculates avg_income (avg_revenue - avg_expenses) and sorts the results by avg_income from highest to lowest. Also include year_filed and the other metrics from the WITH clause in your result.

#standardSQL
WITH summary AS (
# count of filings, revenue, expenses since 2013
SELECT
  CONCAT("20",_TABLE_SUFFIX) AS year_filed,
  COUNT(ein) AS nonprofit_count,
  AVG(totrevenue) AS avg_revenue,
  AVG(totfuncexpns) AS avg_expenses
FROM `bigquery-public-data.irs_990.irs_990_20*`
WHERE _TABLE_SUFFIX >= '13'
GROUP BY year_filed
ORDER BY year_filed DESC
)

# write your code here

4. Compare your result against the solution below:

#standardSQL
WITH summary AS (
# count of filings, revenue, expenses since 2013
SELECT
  CONCAT("20",_TABLE_SUFFIX) AS year_filed,
  COUNT(ein) AS nonprofit_count,
  AVG(totrevenue) AS avg_revenue,
  AVG(totfuncexpns) AS avg_expenses
FROM `bigquery-public-data.irs_990.irs_990_20*`
WHERE _TABLE_SUFFIX >= '13'
GROUP BY year_filed
ORDER BY year_filed DESC
)

SELECT
  year_filed,
  nonprofit_count,
  avg_revenue,
  avg_expenses,
  avg_revenue - avg_expenses AS avg_income
FROM summary
ORDER BY avg_income DESC

5. Run the Query

6. Which Year had the highest avg_income? What was it?

Solution: 2015 with 541214.61
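The WITH clause defines a named subquery (a common table expression) that the final SELECT reads from like a table. The same pattern runs in SQLite, so you can experiment locally; the values below are toy data for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filings (year TEXT, revenue REAL, expenses REAL)")
conn.executemany("INSERT INTO filings VALUES (?, ?, ?)",
                 [("2015", 100, 40), ("2014", 80, 50), ("2013", 60, 55)])

# A CTE computes per-year averages; the outer SELECT derives avg_income.
rows = conn.execute("""
WITH summary AS (
  SELECT year, AVG(revenue) AS avg_revenue, AVG(expenses) AS avg_expenses
  FROM filings GROUP BY year
)
SELECT year, avg_revenue - avg_expenses AS avg_income
FROM summary ORDER BY avg_income DESC
""").fetchall()
print(rows)  # [('2015', 60.0), ('2014', 30.0), ('2013', 5.0)]
```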

Summary

In this lab, you will explore and query the Google BigQuery Public Dataset for IRS Form 990 Non-Profit Organizations. You will practice basic Standard SQL syntax and avoid common query pitfalls.

URL

Category

GCP Big Data

Status

Environment

Feedback Link

g.co/CloudTrainEval

Fixing and Troubleshooting Query Performance

1. Open BigQuery: https://bigquery.cloud.google.com

2. Compose a New Query

3. Ensure #standardSQL is set

4. The below query is running slowly, what can you do to correct it? (run the query to get a benchmark)

#standardSQL
# count all paper filings for 2015
SELECT * FROM `bigquery-public-data.irs_990.irs_990_2015`
WHERE UPPER(elf) LIKE '%P%' #Paper Filers in 2015
ORDER BY ein 
# 86,831 as per pagination count, 23s

5. Compare against the below solution

#standardSQL
SELECT COUNT(*) AS paper_filers FROM `bigquery-public-data.irs_990.irs_990_2015`
WHERE elf = 'P' #Paper Filers in 2015

# 86,831 at 2s

/*
Remove ORDER BY when there is no LIMIT
Use aggregation functions instead of returning every row
Examine the data to confirm 'P' is always uppercase, then use an exact match instead of UPPER() and LIKE
*/

6. Run your updated version and track the time

7. Clear the Query Editor

8. This next query is also running slowly. What can you do to correct it? (run the query to get a benchmark)

#standardSQL
  # get all Organization names who filed in 2015
SELECT
  tax.ein,
  name
FROM
  `bigquery-public-data.irs_990.irs_990_2015` tax
JOIN
  `bigquery-public-data.irs_990.irs_990_ein` org
ON
  tax.tax_pd = org.tax_period

9. Compare against the below solution

#standardSQL
  # get all Organization names who filed in 2015
SELECT
  tax.ein,
  name
FROM
  `bigquery-public-data.irs_990.irs_990_2015` tax
JOIN
  `bigquery-public-data.irs_990.irs_990_ein` org
ON
  tax.ein = org.ein
  

/*
   Incorrect JOIN key resulted in CROSS JOIN
   Correct result: 294,374 at 13s
*/

10. Run your updated version and track the time
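Why is the wrong join key so costly? When the join column has very low cardinality (like a tax period shared by most rows), nearly every row on one side matches nearly every row on the other, so the join degenerates toward a cross join whose output grows as rows(a) x rows(b). A tiny Python sketch of that blowup (toy data):

```python
# With a bad join key, the effective result is every pairing of rows.
tax = [1, 2, 3]
org = ["a", "b"]

# A cross join pairs each tax row with each org row.
cross = [(t, o) for t in tax for o in org]

# Output size is the product of the two table sizes: 3 * 2 = 6 rows.
assert len(cross) == len(tax) * len(org)
```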