In this lab, you learn how to write a simple Dataflow pipeline and run it both locally and on the cloud.

What you need

You must have completed Lab 0 and have the following:

What you learn

In this lab, you learn how to:

The goal of this lab is to become familiar with the structure of a Dataflow project and learn how to execute a Dataflow pipeline.

Step 1

In CloudShell, clone the repository if you haven't already:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst.git

Navigate to the folder containing the starter code for this lab:

cd training-data-analyst/courses/machine_learning/deepdive/04_features/dataflow/python/

Step 2

Install the necessary dependencies for Python Dataflow:

sudo ./install_packages.sh
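The script installs the Python packages the lab's pipelines need. As an optional check afterwards, you can confirm that the Apache Beam SDK is importable (this assumes the script installs Apache Beam, which is the module the pipeline code imports):

python -c 'import apache_beam; print(apache_beam.__version__)'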

Verify that you have the right version of pip (should be > 8.0):

pip -V

If not, open a new CloudShell tab; the new session should pick up the updated pip.

Step 1

Open the CloudShell code editor by clicking the pencil icon.

Step 2

View the source code for the pipeline by navigating to: training-data-analyst/courses/machine_learning/deepdive/04_features/dataflow/python/grep.py
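If Beam pipelines are new to you, here is a minimal sketch of how a three-transform grep pipeline is typically structured with the Apache Beam Python SDK. The input pattern, search term, output prefix, and transform labels below are illustrative assumptions for the sketch only; read grep.py itself to answer the questions in Step 3.

import apache_beam as beam

# Illustrative values only; grep.py defines its own input pattern, search term, and output prefix.
INPUT_PATTERN = '../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java'
SEARCH_TERM = 'import'
OUTPUT_PREFIX = '/tmp/output'

with beam.Pipeline() as p:
    (p
     | 'GetJava' >> beam.io.ReadFromText(INPUT_PATTERN)        # read the input files line by line
     | 'Grep' >> beam.FlatMap(
           lambda line: [line] if SEARCH_TERM in line else []) # keep only lines containing the search term
     | 'Write' >> beam.io.WriteToText(OUTPUT_PREFIX))          # write the matches to sharded output files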

Step 3

What files are being read? _____________________________________________________

What is the search term? ______________________________________________________

Where does the output go? ___________________________________________________

There are three transforms in the pipeline:

  1. What does the first transform do? _________________________________
  2. What does the second transform do? ______________________________
  3. What does the third transform do? _____________________

Step 1

Execute locally in CloudShell:

python grep.py

Note: if you see an error that says "No handlers could be found for logger 'oauth2client.contrib.multistore_file'", you may ignore it. The message simply means that logging from the oauth2 library will go to stderr.

Step 2

Examine the output file:

cat /tmp/output-*

Does the output seem logical? ______________________

Step 1

In the Console search box, type dataflow to find the Dataflow API and click on the hyperlink.

Click Enable if necessary.

Step 2

If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.
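If you prefer the command line, you can also create the bucket with gsutil (the region below is just an example; choose one close to you):

gsutil mb -l us-central1 gs://<YOUR-BUCKET-NAME>/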

Step 3

Copy the input Java files to Cloud Storage (make sure to replace <YOUR-BUCKET-NAME> with the name of the bucket you created in the previous step):

gsutil cp ../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java gs://<YOUR-BUCKET-NAME>/javahelp
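Optionally, verify that the files were copied:

gsutil ls gs://<YOUR-BUCKET-NAME>/javahelp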

Step 4

Edit the Dataflow pipeline in grepc.py by opening it up in CloudShell Code Editor and changing the PROJECT and BUCKET variables appropriately.
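For context, the main difference between the local pipeline and this cloud version is the set of pipeline options passed to Beam. The sketch below shows a typical Dataflow submission; the option values, job name, and region are illustrative assumptions, and grepc.py's actual code may differ.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

PROJECT = 'your-project-id'   # illustrative; use your GCP project ID
BUCKET = 'your-bucket-name'   # illustrative; use the bucket you created in Step 2

options = PipelineOptions(
    runner='DataflowRunner',                              # submit to Cloud Dataflow instead of running locally
    project=PROJECT,
    job_name='examplejob',                                # illustrative job name
    staging_location='gs://{0}/staging/'.format(BUCKET),  # where Dataflow stages the pipeline code
    temp_location='gs://{0}/temp/'.format(BUCKET),        # scratch space used by the job
    region='us-central1')                                 # illustrative region

p = beam.Pipeline(options=options)
(p
 | 'GetJava' >> beam.io.ReadFromText('gs://{0}/javahelp/*.java'.format(BUCKET))
 | 'Grep' >> beam.FlatMap(lambda line: [line] if 'import' in line else [])
 | 'Write' >> beam.io.WriteToText('gs://{0}/javahelp/output'.format(BUCKET)))
p.run()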

Step 5

Submit the Dataflow job to the cloud:

python grepc.py

Because this is such a small job, running on the cloud will take significantly longer than running it locally (on the order of 2-3 minutes).
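While the job spins up, you can also check its status from CloudShell instead of the console (an optional alternative):

gcloud dataflow jobs list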

Step 6

In the Cloud Console, navigate to the Dataflow section (from the three-bar menu at the top left) and look at the list of jobs. Select your job and monitor its progress.

Step 7

Wait for the job status to turn to Succeeded.

In CloudShell, examine the output:

gsutil cat gs://<YOUR-BUCKET-NAME>/javahelp/output-*

In this lab, you:

©Google, Inc. or its affiliates. All rights reserved. Do not distribute.