In this lab, you learn how to use BigQuery as a data source for Dataflow, and how to use the result of one pipeline as a side input to another pipeline.

What you need

To complete this lab, you need:

Access to a supported Internet browser

A Google Cloud Platform project

What you learn

In this lab, you learn how to:

Use BigQuery as a data source for Dataflow

Use the result of one pipeline as a side input to another pipeline

Step 1

Open the Google Cloud Console (in the incognito window). Using the menu, navigate to the BigQuery web UI, then click Compose Query.

Step 2

Copy and paste this query:

SELECT
  content
FROM
  [fh-bigquery:github_extracts.contents_java_2016]
LIMIT
  10

Then click Show Options and make sure that the Legacy SQL option is checked. The square-bracket table reference in the query is legacy SQL syntax.

Step 3

Click on Run Query.

What is being returned? ___________________________________________________

The BigQuery table fh-bigquery:github_extracts.contents_java_2016 contains the content (and some metadata) of all the Java files present on GitHub in 2016.

Step 4

To find out how many Java files this table has, type the following query and click Run Query (again, make sure the Legacy SQL option is checked):

SELECT
  COUNT(*)
FROM
  [fh-bigquery:github_extracts.contents_java_2016]

This query processes zero bytes because the row count is stored as table metadata, so BigQuery does not need to scan the table to answer it.

How many files are there in this dataset? __________________________________

Is this a dataset you want to process locally or on the cloud? ______________

Step 1

In the Cloud Console, start Cloud Shell and navigate to the directory for this lab:

cd ~/training-data-analyst/courses/data_analysis/lab2 

If this directory doesn't exist, you may need to git clone the repository:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

Step 2

View the pipeline code using nano and answer the following questions:

cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
nano src/main/java/com/google/cloud/training/dataanalyst/javahelp/JavaProjectsThatNeedHelp.java

The pipeline looks like this (refer to this diagram as you read the code):

Step 3

Looking at the class documentation at the very top, what is the purpose of this pipeline? __________________________________________________________

Where does GetJava get Java content from? _______________________________

What does ToLines do? (Hint: look at the content field of the BigQuery result) ____________________________________________________

Step 4

Why is the result of ToLines stored in a named PCollection instead of being directly passed to another apply()? ________________________________________________

What are the two actions carried out on javaContent? ____________________________

Step 5

If a file has 3 FIXMEs and 2 TODOs in its content (on different lines), how many calls for help are associated with it? __________________________________________________
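To reason about this question, it can help to see the counting rule written out. The following is a minimal sketch in plain Java (no Dataflow SDK), assuming the pipeline counts one call for help per line that contains FIXME or TODO. The class and method names are hypothetical, so check them against the actual source file.

```java
import java.util.Arrays;
import java.util.List;

final class HelpCounter {
    // Assumption: one call for help per line that mentions FIXME or TODO.
    // A line that mentions both markers is still counted only once.
    static int countCallsForHelp(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            if (line.contains("FIXME") || line.contains("TODO")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "// FIXME: handle nulls",
            "int x = 0; // TODO: rename",
            "return x;");
        System.out.println(countCallsForHelp(lines));
    }
}
```

Under this rule the count depends on the number of matching lines, not on the number of markers, which is why the question specifies that the markers are on different lines.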

If a file is in the package com.google.devtools.build, what are the packages that it is associated with? ____________________________________________________
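One way to picture the package association is as a list of dot-delimited prefixes. The sketch below is plain Java and assumes that a file is associated with its own package and with every parent package. The class and method names are hypothetical and the example uses a different package name from the one in the question.

```java
import java.util.ArrayList;
import java.util.List;

final class PackagePrefixes {
    // Returns every dot-delimited prefix of a package name, shortest first,
    // ending with the full package name itself.
    static List<String> prefixes(String packageName) {
        List<String> result = new ArrayList<>();
        int dot = packageName.indexOf('.');
        while (dot >= 0) {
            result.add(packageName.substring(0, dot));
            dot = packageName.indexOf('.', dot + 1);
        }
        result.add(packageName);
        return result;
    }

    public static void main(String[] args) {
        // Prints the prefixes of an example package, one per line.
        for (String prefix : prefixes("org.apache.beam.sdk")) {
            System.out.println(prefix);
        }
    }
}
```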

Why is the numHelpNeeded variable not enough? Why do we need to do Sum.integersPerKey()? (Hint: there are multiple files in a package) ___________________________________
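The effect of a per-key sum can be sketched in plain Java with a map. This is only an illustration of the aggregation, not the Dataflow implementation: the class name, method name, and sample data are hypothetical, and each input pair stands for one file's (package, calls for help) output.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SumPerKey {
    // Adds up the values of all pairs that share the same key.
    static Map<String, Integer> sumPerKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Two files in com.example and one file in org.sample (made-up data).
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<String, Integer>("com.example", 3));
        pairs.add(new SimpleEntry<String, Integer>("org.sample", 1));
        pairs.add(new SimpleEntry<String, Integer>("com.example", 2));
        System.out.println(sumPerKey(pairs));
    }
}
```

Each file contributes its own count, so a package total only emerges after the values for the same key are combined.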

Why is this converted to a View? ___________________________________________

Step 6

Which operation uses the View as a side input? _____________________________

Instead of simply ParDo.of(), this operation uses ____________________________

Besides c.element() and c.output(), this operation also makes use of what method in ProcessContext? __________________________________________________________
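Conceptually, a map-valued side input behaves like a read-only lookup table that is available while each element of the main input is processed. The sketch below imitates that idea in plain Java without the Dataflow SDK. Everything in it is an assumption made for illustration: the main input (a hypothetical map from package name to usage count), the class and method names, and the scoring formula, which is a placeholder and not necessarily what the lab's pipeline computes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SideInputSketch {
    // For each main-input element, looks up the same key in the side-input map
    // and emits a combined value only when the key is present there.
    static Map<String, Integer> joinWithSideInput(
            Map<String, Integer> timesUsed, Map<String, Integer> helpNeeded) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> element : timesUsed.entrySet()) {
            Integer help = helpNeeded.get(element.getKey());
            if (help != null) {
                // Placeholder score: usage count multiplied by calls for help.
                scores.put(element.getKey(), element.getValue() * help);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Integer> timesUsed = new LinkedHashMap<>();
        timesUsed.put("com.example", 10);
        timesUsed.put("org.sample", 4);
        Map<String, Integer> helpNeeded = new LinkedHashMap<>();
        helpNeeded.put("com.example", 5);
        System.out.println(joinWithSideInput(timesUsed, helpNeeded));
    }
}
```

In the real pipeline the lookup table is not an in-memory variable built by hand. It is the View produced by the earlier branch, which is why that branch's result has to be converted before it can be consulted here.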

Step 1

If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.

Step 2

Execute the pipeline by typing the following commands (replace <PROJECT> with your project ID and <YOUR-BUCKET-NAME> with the name of the bucket from the previous step):

cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
./run_oncloud3.sh <PROJECT> <YOUR-BUCKET-NAME> JavaProjectsThatNeedHelp

Monitor the job from the Dataflow section of the GCP console.

Step 3

Once the pipeline has finished executing, download and view the output:

gsutil cp gs://<YOUR-BUCKET-NAME>/javahelp/output.csv .
head output.csv

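In this lab, you used BigQuery as a data source for Dataflow and used the result of one pipeline as a side input to another pipeline.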

© Google, Inc. or its affiliates. All rights reserved. Do not distribute.