In this lab, you learn how to write a simple Dataflow pipeline and run it both locally and on the cloud.
You must have completed Lab 0 and have the following:
In this lab, you learn how to:
The goal of this lab is to become familiar with the structure of a Dataflow project and learn how to execute a Dataflow pipeline.
In CloudShell, clone the repository if you haven't already:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst.git
Navigate to the folder containing the starter code for this lab:
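cd ~/training-data-analyst/courses/machine_learning/deepdive/04_features/dataflow/javahelp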
To create a new Dataflow project, use the build tool Maven. Copy and paste the following Maven command:
mvn archetype:generate \
  -DarchetypeArtifactId=google-cloud-dataflow-java-archetypes-starter \
  -DarchetypeGroupId=com.google.cloud.dataflow \
  -DgroupId=com.example.pipelinesrus.newidea \
  -DartifactId=newidea \
  -Dversion="[1.0.0,2.0.0]" \
  -DinteractiveMode=false
What directory has been created? ______________________
What package has been created inside the src directory? ______________________
Examine the Maven command that was used to create the lab code:
What directory will get created? ______________________
What package will get created inside the src directory? ______________________
Open the CloudShell code editor by clicking the pencil icon.
View the source code for the pipeline by navigating to:
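training-data-analyst/courses/machine_learning/deepdive/04_features/dataflow/javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/Grep.java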
What files are being read? _____________________________________________________
What is the search term? ______________________________________________________
Where does the output go? ___________________________________________________
There are three apply statements in the pipeline:
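For reference, here is a minimal sketch of what those three apply statements typically look like, assuming the Apache Beam Java SDK. The transform names, search term, and paths below are illustrative, not the exact values in the lab; check Grep.java itself for the real code:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class GrepSketch {
  public static void main(String[] args) {
    // Illustrative paths and search term; Grep.java defines its own values.
    String input = "src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java";
    String outputPrefix = "/tmp/output";
    final String searchTerm = "import";

    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("GetJava", TextIO.read().from(input))              // 1. read the input files
     .apply("Grep", ParDo.of(new DoFn<String, String>() {      // 2. keep only lines containing the search term
        @ProcessElement
        public void processElement(ProcessContext c) {
          String line = c.element();
          if (line.contains(searchTerm)) {
            c.output(line);
          }
        }
     }))
     .apply(TextIO.write().to(outputPrefix).withSuffix(".txt")); // 3. write the matching lines

    p.run();
  }
}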
Copy and paste the following Maven command:
export PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:$PATH
cd ~/training-data-analyst/courses/machine_learning/deepdive/04_features/dataflow/javahelp
mvn compile -e exec:java \
  -Dexec.mainClass=com.google.cloud.training.dataanalyst.javahelp.Grep
Examine the output file:
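The local run writes its results under the outputPrefix defined in Grep.java. Assuming that prefix is /tmp/output (an assumption; check the variable in the source if your file is elsewhere):

cat /tmp/output*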
Does the output seem logical? ______________________
In the search box, type dataflow to find the Dataflow API and click on the hyperlink.
Click Enable if necessary
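Alternatively, you can enable the API from the command line in CloudShell:

gcloud services enable dataflow.googleapis.com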
If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.
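For example, you can create a bucket from CloudShell instead (the location is your choice; replace the placeholder with your own unique name):

gsutil mb -l us-central1 gs://<YOUR-BUCKET-NAME>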
Copy some Java files to the cloud (make sure to replace <YOUR-BUCKET-NAME> with the bucket name you created in the previous step):
gsutil cp src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java gs://<YOUR-BUCKET-NAME>/javahelp
Edit the Dataflow pipeline in Grep.java by opening it up in CloudShell Code Editor:
and changing the input and output variables to be:
String input = "gs://<YOUR-BUCKET-NAME>/javahelp/*.java";
String outputPrefix = "gs://<YOUR-BUCKET-NAME>/javahelp/output";
Make sure that you change the input and outputPrefix strings that are already present in the source code (do not copy-and-paste the entire lines above, because you will then end up with two variables named input, for example).
Examine the script that submits the Dataflow job to the cloud:
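cat run_oncloud1.sh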
What is the difference between this Maven command and the one to run locally?
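As a point of comparison, a cloud submission typically adds Dataflow-specific arguments to the same Maven invocation. The following is only a sketch assuming the Beam DataflowRunner; the exact flags live in run_oncloud1.sh and may differ:

mvn compile -e exec:java \
  -Dexec.mainClass=com.google.cloud.training.dataanalyst.javahelp.Grep \
  -Dexec.args="--project=<PROJECT-ID> \
               --stagingLocation=gs://<YOUR-BUCKET-NAME>/staging/ \
               --runner=DataflowRunner"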
Submit the Dataflow job to the cloud:
bash run_oncloud1.sh <PROJECT-ID> <YOUR-BUCKET-NAME> Grep
Because this is such a small job, running it on the cloud will take significantly longer than running it locally (on the order of 2-3 minutes).
On your Cloud Console, navigate to the Dataflow section (from the three-bar menu at the top left) and look at the Jobs. Select your job and monitor its progress.
Wait for the job status to change to Succeeded. At that point, CloudShell will display a command-line prompt again. In CloudShell, download and examine the output:
gsutil cp gs://<YOUR-BUCKET-NAME>/javahelp/output.txt .
cat output.txt
In this lab, you:
© Google, Inc. or its affiliates. All rights reserved. Do not distribute.