In this lab, you learn how to use BigQuery as a data source into Dataflow, and how to use the results of a pipeline as a side input to another pipeline.
To complete this lab, you need:
Access to a supported Internet browser
A Google Cloud Platform project
In this lab, you learn how to:
The goal of this lab is to learn how to use BigQuery as a data source into Dataflow, and how to use the result of a pipeline as a side input to another pipeline.
Open the Google Cloud Console (in the incognito window), use the menu to navigate to the BigQuery web UI, and click on Compose Query.
Copy-and-paste this query:
SELECT content FROM [fh-bigquery:github_extracts.contents_java_2016] LIMIT 10
Then, click on Show Options and ensure that the Legacy SQL option is checked.
Click on Run Query.
What is being returned? _______________________________ ____________________
The BigQuery table fh-bigquery:github_extracts.contents_java_2016 contains the content (and some metadata) of all the Java files present on GitHub in 2016.
To find out how many Java files this table has, type the following query and click Run Query (again, make sure the Legacy SQL option is checked):
SELECT COUNT(*) FROM [fh-bigquery:github_extracts.contents_java_2016]
The query reports zero bytes processed because the row count is table metadata, so BigQuery does not need to scan the table to answer it.
How many files are there in this dataset? __________________________________
Is this a dataset you want to process locally or on the cloud? ______________
In the Cloud Console, start Cloud Shell and navigate to the directory for this lab:
If this directory doesn't exist, you may need to git clone the repository:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
View the pipeline code using nano and answer the following questions:
cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
nano src/main/java/com/google/cloud/training/dataanalyst/javahelp/JavaProjectsThatNeedHelp.java
The pipeline looks like this (refer to this diagram as you read the code):
Looking at the class documentation at the very top, what is the purpose of this pipeline? __________________________________________________________
Where does GetJava get Java content from? _______________________________
What does ToLines do? (Hint: look at the content field of the BigQuery result) ____________________________________________________
Why is the result of ToLines stored in a named PCollection instead of being directly passed to another apply()? ________________________________________________
What are the two actions carried out on javaContent? ____________________________
If a file has 3 FIXMEs and 2 TODOs in its content (on different lines), how many calls for help are associated with it? __________________________________________________
If a file is in the package com.google.devtools.build, what are the packages that it is associated with? ____________________________________________________
Why is the numHelpNeeded variable not enough? Why do we need to do Sum.integersPerKey()? ___________________________________ (Hint: there are multiple files in a package)
Why is this converted to a View? ___________________________________________
Which operation uses the View as a side input? _____________________________
Instead of simply ParDo.of(), this operation uses ____________________________
Besides c.element() and c.output(), this operation also makes use of what method in ProcessContext? __________________________________________________________
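As you work through the questions above, it may help to see the counting and aggregation ideas in isolation. The following is a minimal plain-Java sketch, not the lab's actual pipeline code and not using the Dataflow SDK. The class name HelpNeededSketch and the helpers countCallsForHelp, packagePrefixes and addFile are hypothetical names chosen for illustration, and the rules they encode (one call for help per line containing FIXME or TODO, every parent package credited, files without markers contributing nothing) are assumptions you should check against JavaProjectsThatNeedHelp.java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HelpNeededSketch {

    // Hypothetical helper. Assumption: each line that contains FIXME or TODO
    // counts as one call for help, even if it contains both markers.
    static int countCallsForHelp(String[] lines) {
        int count = 0;
        for (String line : lines) {
            if (line.contains("FIXME") || line.contains("TODO")) {
                count++;
            }
        }
        return count;
    }

    // Hypothetical helper. Assumption: a file is associated with its own
    // package and with every parent package, e.g. "a.b.c" -> a, a.b, a.b.c.
    static List<String> packagePrefixes(String packageName) {
        List<String> prefixes = new ArrayList<>();
        int dot = packageName.indexOf('.');
        while (dot >= 0) {
            prefixes.add(packageName.substring(0, dot));
            dot = packageName.indexOf('.', dot + 1);
        }
        prefixes.add(packageName);
        return prefixes;
    }

    // Adds one file's count to the running per-package totals. A single
    // file's count is not the package total, because several files can share
    // a package; summing per key is what combines them.
    static void addFile(Map<String, Integer> totals, String packageName, String[] lines) {
        int numHelpNeeded = countCallsForHelp(lines);
        if (numHelpNeeded > 0) { // assumption: files without markers add nothing
            for (String prefix : packagePrefixes(packageName)) {
                totals.merge(prefix, numHelpNeeded, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        String[] fileA = {
            "// FIXME: handle null", "int x = 0;", "// TODO: rename",
            "// FIXME: slow", "// TODO: add test", "// FIXME: leak"
        };
        String[] fileB = { "// TODO: document", "return;" };

        // TreeMap keeps keys sorted so the printed order is stable.
        Map<String, Integer> totals = new TreeMap<>();
        addFile(totals, "com.google.devtools.build", fileA);
        addFile(totals, "com.google.common", fileB);

        // prints {com=6, com.google=6, com.google.common=1,
        //         com.google.devtools=5, com.google.devtools.build=5}
        System.out.println(totals);
    }
}
```

In this sketch the totals map plays a role loosely similar to the per-package sums in the pipeline: it is a small, fully aggregated result that a later step could look up by key while processing other data.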
If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.
Execute the pipeline by typing in (make sure to replace <YOUR-BUCKET-NAME> with the bucket name you created in the previous step):
cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
./run_oncloud3.sh <PROJECT> <YOUR-BUCKET-NAME> JavaProjectsThatNeedHelp
Monitor the job in the Dataflow section of the GCP console.
Once the pipeline has finished executing, download and view the output:
gsutil cp gs://<YOUR-BUCKET-NAME>/javahelp/output.csv .
head output.csv
In this lab, you:
©Google, Inc. or its affiliates. All rights reserved. Do not distribute.