In this lab, you learn how to use pipeline options and carry out Map and Reduce operations in Dataflow.

What you need

To complete this lab, you need:

Access to a supported Internet browser

A Google Cloud Platform project

What you learn

In this lab, you learn how to write MapReduce operations using Dataflow and how to configure the pipeline with pipeline options.

Step 1

Start Cloud Shell and navigate to the directory for this lab:

cd ~/training-data-analyst/courses/data_analysis/lab2

If this directory doesn't exist, you may need to git clone the repository:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

Step 2

View the source code for the pipeline using nano:

cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
nano src/main/java/com/google/cloud/training/dataanalyst/javahelp/IsPopular.java

Normally, you would develop this Java code in an Integrated Development Environment such as Eclipse or IntelliJ (not in Cloud Shell).

Step 3

What getX() methods are present in the class MyOptions? ____________________

What is the default output prefix? _________________________________________

How is the variable outputPrefix in main() set? _____________________________

Step 4

What are the key steps in the pipeline? _____________________________________________________________________________

Which of these steps happen in parallel? ____________________________________

Which of these steps are aggregations? _____________________________________
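
The distinction these questions draw can be sketched in plain Java. The following is a minimal illustration using java.util.stream, not the Dataflow SDK and not the IsPopular source; the class name MapReduceSketch, its methods, and the sample input are all hypothetical. The map step handles each line on its own, so it can run in parallel; the reduce step groups by key and must see every value for a key, which makes it an aggregation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration of Map and Reduce. This is NOT the Dataflow API
// and NOT the lab's IsPopular code; every name here is hypothetical.
class MapReduceSketch {

    // Map step: each line is processed independently of the others,
    // which is why a runner could handle many lines in parallel.
    static String toKey(String line) {
        // Use the first whitespace-separated token as the key.
        return line.trim().split("\\s+")[0];
    }

    // Reduce step: an aggregation that needs every value for a key
    // before it can produce that key's count.
    static Map<String, Long> countPerKey(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.trim().isEmpty())   // skip blank lines
                .map(MapReduceSketch::toKey)              // map
                .collect(Collectors.groupingBy(           // reduce
                        key -> key, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("apple pie", "banana split", "apple tart");
        System.out.println(countPerKey(lines)); // prints {apple=2, banana=1}
    }
}
```

When reading IsPopular.java, it may help to ask the same thing of each step: does it look at one element at a time, or does it need to combine many elements?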

Step 1

Copy and paste the following commands into Cloud Shell to set the Java path and run the pipeline with Maven:

export PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:$PATH
cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
mvn compile -e exec:java \
  -Dexec.mainClass=com.google.cloud.training.dataanalyst.javahelp.IsPopular

Step 2

Examine the output file:

cat /tmp/output.csv

Step 1

Change the output prefix from the default value:

mvn compile -e exec:java \
  -Dexec.mainClass=com.google.cloud.training.dataanalyst.javahelp.IsPopular \
  -Dexec.args="--outputPrefix=/tmp/myoutput"

What will be the name of the new .csv file that is written out?

Step 2

List the .csv files in the /tmp directory and note that there is now a new one:

ls -lrt /tmp/*.csv
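
How a command-line option such as --outputPrefix can override a default may be easier to see in a small plain-Java sketch. This is a hand-written stand-in, not the Dataflow SDK's options mechanism; the class OutputPrefixSketch and its method are hypothetical, the default prefix is inferred from the /tmp/output.csv file examined earlier, and appending ".csv" to the prefix is an assumption.

```java
// Hypothetical stand-in for option handling; the lab's pipeline relies on
// the Dataflow SDK's pipeline options rather than a parser like this one.
class OutputPrefixSketch {

    // Assumed default, inferred from the /tmp/output.csv file seen earlier.
    static final String DEFAULT_PREFIX = "/tmp/output";

    // Returns the output file name implied by the given arguments.
    static String outputFile(String[] args) {
        String prefix = DEFAULT_PREFIX;
        String flag = "--outputPrefix=";
        for (String arg : args) {
            if (arg.startsWith(flag)) {
                // A value on the command line replaces the default.
                prefix = arg.substring(flag.length());
            }
        }
        // Assumption: the pipeline appends ".csv" to the prefix.
        return prefix + ".csv";
    }

    public static void main(String[] args) {
        System.out.println(outputFile(args));
    }
}
```

Under these assumptions, running the sketch with no arguments yields the default file name, and passing --outputPrefix=/tmp/myoutput yields a name built from the new prefix.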

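In this lab, you examined a Dataflow pipeline that carries out Map and Reduce operations, ran it with Maven, and used a pipeline option to change its output prefix.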

©Google, Inc. or its affiliates. All rights reserved. Do not distribute.