In this lab, you learn how to write a simple Dataflow pipeline and run it both locally and on the cloud.

What you need

You must have completed Lab 0 and have the following:

What you learn

In this lab, you learn how to:

The goal of this lab is to become familiar with the structure of a Dataflow project and learn how to execute a Dataflow pipeline.

Open Dataflow project

Step 1

In CloudShell, if you haven't already done so, clone the repository:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst.git

Navigate to the folder containing the starter code for this lab:

cd training-data-analyst/courses/machine_learning/deepdive/04_features/dataflow

Step 2

To create a new Dataflow project, use Maven. Copy and paste the following Maven command:

mvn archetype:generate \
  -DarchetypeArtifactId=google-cloud-dataflow-java-archetypes-starter \
  -DarchetypeGroupId=com.google.cloud.dataflow \
  -DgroupId=com.example.pipelinesrus.newidea \
  -DartifactId=newidea \
  -Dversion="[1.0.0,2.0.0]" \
  -DinteractiveMode=false

What directory has been created? ______________________

What package has been created inside the src directory? ______________________
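
For reference, the archetype generates a skeleton pipeline class under the new directory. The sketch below shows the typical shape of that generated file for a Dataflow SDK 1.x starter archetype; the exact contents depend on the archetype version, so treat it as illustrative rather than exact:

package com.example.pipelinesrus.newidea;

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class StarterPipeline {
  public static void main(String[] args) {
    // Build a pipeline from any command-line arguments.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The generated starter applies a trivial transform chain.
    p.apply(Create.of("Hello", "World"))
     .apply(ParDo.of(new DoFn<String, String>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().toUpperCase());
       }
     }));

    p.run();
  }
}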

Step 3

Examine the Maven command that was used to create the lab code:

cat create_mvn.sh

What directory will get created? ______________________

What package will get created inside the src directory? ______________________

Pipeline filtering

Step 1

Open the CloudShell code editor by clicking the pencil icon.

Step 2

View the source code for the pipeline by navigating to: training-data-analyst/courses/machine_learning/deepdive/04_features/dataflow/javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/Grep.java

Step 3

What files are being read? _____________________________________________________

What is the search term? ______________________________________________________

Where does the output go? ___________________________________________________

There are three apply statements in the pipeline (a generic sketch with the same shape follows these questions):

  1. What does the first apply() do? _________________________________
  2. What does the second apply() do? ______________________________
  3. What does the third apply() do? _____________________
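
To check your reading of the code, here is a minimal sketch of a grep-style pipeline with the same three-apply shape, written against the Apache Beam Java SDK. It is illustrative, not a copy of the lab's Grep.java: the input path, search term, and output prefix shown here are assumptions.

package com.google.cloud.training.dataanalyst.javahelp;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class GrepSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    String input = "src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java"; // assumed input
    final String searchTerm = "import";  // assumed search term
    String outputPrefix = "/tmp/output"; // assumed output prefix

    p.apply("GetJava", TextIO.read().from(input))           // apply #1: read the input files
     .apply("Grep", ParDo.of(new DoFn<String, String>() {   // apply #2: keep only matching lines
        @ProcessElement
        public void processElement(ProcessContext c) {
          if (c.element().contains(searchTerm)) {
            c.output(c.element());
          }
        }
      }))
     .apply(TextIO.write().to(outputPrefix)                 // apply #3: write out the results
        .withSuffix(".txt").withoutSharding());

    p.run();
  }
}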

Execute the pipeline locally

Step 1

Copy and paste the following Maven command:

export PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:$PATH
cd ~/training-data-analyst/courses/machine_learning/deepdive/04_features/dataflow/javahelp
mvn compile -e exec:java \
 -Dexec.mainClass=com.google.cloud.training.dataanalyst.javahelp.Grep
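
For context: this command runs the pipeline inside the CloudShell JVM because no --runner argument is supplied, so Beam falls back to its direct runner. A small sketch (assuming Beam's standard options API and a hypothetical class name) that shows which runner is in effect:

package com.google.cloud.training.dataanalyst.javahelp;

import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Hypothetical helper, not part of the lab code.
public class RunnerCheckSketch {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    // With no --runner flag, this resolves to the DirectRunner,
    // which executes the pipeline in-process and writes to local disk.
    System.out.println("Runner: " + options.getRunner().getSimpleName());
  }
}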

Step 2

Examine the output file:

cat /tmp/output.txt

Does the output seem logical? ______________________

Execute the pipeline on the cloud

Step 1

In the GCP Console's search box, type dataflow to find the Dataflow API, and click on the hyperlink.

Click Enable if necessary.

Step 2

If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.

Step 3

Copy some Java files to the cloud (make sure to replace <YOUR-BUCKET-NAME> with the bucket name you created in the previous step):

gsutil cp src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java gs://<YOUR-BUCKET-NAME>/javahelp

Step 4

Edit the Dataflow pipeline in Grep.java by opening it in the CloudShell code editor and changing the input and output variables to be:

String input = "gs://<YOUR-BUCKET-NAME>/javahelp/*.java";
String outputPrefix = "gs://<YOUR-BUCKET-NAME>/javahelp/output";

Make sure to change the input and outputPrefix strings that are already present in the source code; do not copy and paste the entire lines above, or you will end up with two variables named input.

Step 5

Examine the script that submits the Dataflow job to the cloud:

cat run_oncloud1.sh

What is the difference between this Maven command and the one to run locally?
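
One way to reason about the difference: the cloud script passes extra arguments (the project, a staging location on Cloud Storage, and the Dataflow runner), which Beam parses into Dataflow-specific pipeline options. A hedged sketch of that mechanism, assuming the Beam Dataflow runner's option interface and a hypothetical class name:

package com.google.cloud.training.dataanalyst.javahelp;

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Hypothetical illustration, not part of the lab code.
public class CloudOptionsSketch {
  public static void main(String[] args) {
    // e.g. args = {"--project=<PROJECT-ID>",
    //              "--stagingLocation=gs://<YOUR-BUCKET-NAME>/staging/",
    //              "--runner=DataflowRunner"}
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    // With --runner=DataflowRunner, p.run() submits the job to the
    // Dataflow service instead of executing it locally.
    Pipeline p = Pipeline.create(options);
    p.run();
  }
}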

Step 6

Submit the Dataflow job to the cloud:

bash run_oncloud1.sh <PROJECT-ID> <YOUR-BUCKET-NAME> Grep

Because this is such a small job, running it on the cloud will take significantly longer than running it locally (on the order of 2-3 minutes).

Step 7

In your Cloud Console, navigate to the Dataflow section (from the three-bar menu at the top left) and look at the jobs. Select your job and monitor its progress.

Step 8

Wait for the job status to turn to Succeeded. At this point, your CloudShell will display a command-line prompt. In CloudShell, download and examine the output:

gsutil cp gs://<YOUR-BUCKET-NAME>/javahelp/output.txt .
cat output.txt

In this lab, you wrote a simple Dataflow pipeline and ran it both locally and on the cloud.

© Google, Inc. or its affiliates. All rights reserved. Do not distribute.