In this lab, you learn how to use pipeline options and carry out Map and Reduce operations in Dataflow.
You must have completed Lab 0 and have the following:
In this lab, you learn how to:
The goal of this lab is to learn how to write MapReduce operations using Dataflow.
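As a rough mental model of the two kinds of operation named above, the following sketch uses plain java.util.stream rather than the Dataflow SDK. The map step transforms each input line independently of the others, which is why such steps can run in parallel. The reduce step aggregates the mapped values per key. The class name, method name, and sample input are hypothetical and are not taken from the lab's starter code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MapReduceSketch {
    // Map: split each line into words, independently of every other line.
    // Reduce: group identical words together and count each group.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
            .filter(word -> !word.isEmpty())
            .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical sample input, for illustration only.
        List<String> lines = Arrays.asList("import java.util.List", "import java.util.Map");
        System.out.println(countWords(lines));
    }
}
```

In a Dataflow pipeline the same split applies: per-element transforms play the role of the map step, and grouping or combining by key plays the role of the reduce step.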
In Cloud Shell, clone the repository if you haven't already:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst.git
Navigate to the folder containing the starter code for this lab:
View the source code for the pipeline by opening the Cloud Shell Code Editor and navigating to:
What getX() methods are present in the class MyOptions? ____________________
What is the default output prefix? _________________________________________
How is the variable outputPrefix in main() set? _____________________________
What are the key steps in the pipeline? _____________________________________________________________________________
Which of these steps happen in parallel? ____________________________________
Which of these steps are aggregations? _____________________________________
Copy and paste the following Maven command:
export PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:$PATH
cd ~/training-data-analyst/courses/machine_learning/deepdive/04_features/dataflow/javahelp
mvn compile -e exec:java \
 -Dexec.mainClass=com.google.cloud.training.dataanalyst.javahelp.IsPopular
Examine the output file:
Change the output prefix from the default value:
mvn compile -e exec:java \
 -Dexec.mainClass=com.google.cloud.training.dataanalyst.javahelp.IsPopular \
 -Dexec.args="--outputPrefix=/tmp/myoutput"
What will be the name of the new .csv file that is written out?
Note that we now have a new .csv file in the /tmp directory:
ls -lrt /tmp/*.csv
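The --outputPrefix flag passed through -Dexec.args above follows the common --name=value convention. As a minimal sketch of how such a flag can be resolved against a fallback value, the plain Java below scans the argument array by hand. It is independent of the MyOptions class in the lab's pipeline, and the class name, method name, and fallback value are hypothetical; the fallback is deliberately not the lab's actual default.

```java
public class OutputPrefixSketch {
    // Hypothetical fallback for illustration only; not the lab's actual default.
    static final String FALLBACK_PREFIX = "/tmp/example";

    // Returns the value of --outputPrefix=<value> if present, else the fallback.
    static String resolveOutputPrefix(String[] args) {
        String flag = "--outputPrefix=";
        for (String arg : args) {
            if (arg.startsWith(flag)) {
                return arg.substring(flag.length());
            }
        }
        return FALLBACK_PREFIX;
    }

    public static void main(String[] args) {
        System.out.println(resolveOutputPrefix(args));
    }
}
```

Running this sketch with --outputPrefix=/tmp/myoutput returns /tmp/myoutput, and running it with no arguments returns the fallback.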
In this lab, you:
©Google, Inc. or its affiliates. All rights reserved. Do not distribute.