In this lab, you learn how to write a simple Dataflow pipeline and run it both locally and on the cloud.

What you need

To complete this lab, you need:

Access to a supported Internet browser

A Google Cloud Platform project

What you learn

In this lab, you learn how to:

Write a simple Dataflow pipeline in Python

Execute the pipeline locally

Execute the pipeline on the cloud

The goal of this lab is to become familiar with the structure of a Dataflow project and learn how to execute a Dataflow pipeline.
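As a preview of what you will be working with, here is a minimal sketch of how a Beam/Dataflow pipeline is put together in Python. The file paths and transforms below are illustrative assumptions, not the code used in this lab:

import apache_beam as beam

# Create the pipeline; DirectRunner executes it on the local machine.
p = beam.Pipeline(argv=['--runner=DirectRunner'])

# Chain transforms together with the '|' operator: a source, a processing
# step, and a sink.
(p
 | 'read' >> beam.io.ReadFromText('/tmp/input.txt')    # read text lines (illustrative path)
 | 'process' >> beam.Map(lambda line: line.strip())    # apply a function to each line
 | 'write' >> beam.io.WriteToText('/tmp/output')       # write sharded output files
)

# Run the pipeline and wait for it to finish.
p.run().wait_until_finish()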

Preparation

Step 1

Start CloudShell and navigate to the directory for this lab:

cd ~/training-data-analyst/courses/data_analysis/lab2/python

If this directory doesn't exist, you may need to git clone the repository first:

cd ~
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd ~/training-data-analyst/courses/data_analysis/lab2/python

Step 2

Install the necessary dependencies for Python Dataflow:

sudo ./install_packages.sh

Verify that you have the right version of pip (should be > 8.0):

pip -V

If not, open a new CloudShell tab and it should pick up the updated pip.

Pipeline filtering

Step 1

View the source code for the pipeline using nano:

cd ~/training-data-analyst/courses/data_analysis/lab2/python
nano grep.py
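Before answering the questions in the next step, it may help to see the general shape of a grep-style Beam pipeline. The sketch below is an assumption, not a copy of grep.py: the input pattern, search term, and output prefix are illustrative, so check the file itself for the real values.

import apache_beam as beam

def my_grep(line, term):
    # Emit the line only if it contains the search term.
    if term in line:
        yield line

# Illustrative values only -- grep.py defines its own.
input_pattern = '../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java'
search_term = 'import'
output_prefix = '/tmp/output'

p = beam.Pipeline(argv=['--runner=DirectRunner'])
(p
 | 'GetJava' >> beam.io.ReadFromText(input_pattern)                    # transform 1: read the input files
 | 'Grep' >> beam.FlatMap(lambda line: my_grep(line, search_term))     # transform 2: keep only matching lines
 | 'write' >> beam.io.WriteToText(output_prefix)                       # transform 3: write the results
)
p.run().wait_until_finish()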

Step 2

What files are being read? _____________________________________________________

What is the search term? ______________________________________________________

Where does the output go? ___________________________________________________

There are three transforms in the pipeline:

  1. What does the first transform do? _________________________________
  2. What does the second transform do? ______________________________
  3. What does the third transform do? _____________________

Execute the pipeline locally

Step 1

Execute locally:

python grep.py

Note: if you see an error that says No handlers could be found for logger "oauth2client.contrib.multistore_file", you may ignore it. The error simply means that logging from the oauth2 library will go to stderr.

Step 2

Examine the output file:

cat /tmp/output-*

Does the output seem logical? ______________________

Execute the pipeline on the cloud

Step 1

If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.

Step 2

Copy some Java files to the cloud (make sure to replace <YOUR-BUCKET-NAME> with the bucket name you created in the previous step):

gsutil cp ../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java gs://<YOUR-BUCKET-NAME>/javahelp

Step 3

Edit the Dataflow pipeline in grepc.py by opening it in nano:

cd ~/training-data-analyst/courses/data_analysis/lab2/python
nano grepc.py

and changing the PROJECT and BUCKET variables appropriately.
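For reference, the sketch below shows one common way PROJECT and BUCKET are wired into the pipeline options so the job runs on the Dataflow service instead of locally. The option values, job name, and search term here are assumptions for illustration; grepc.py itself is the authoritative version.

import apache_beam as beam

PROJECT = 'my-project-id'   # replace with your GCP project ID
BUCKET = 'my-bucket-name'   # replace with the bucket you created earlier

argv = [
    '--project={0}'.format(PROJECT),
    '--job_name=examplejob2',                               # assumed job name
    '--staging_location=gs://{0}/staging/'.format(BUCKET),  # where Dataflow stages code
    '--temp_location=gs://{0}/staging/'.format(BUCKET),     # where Dataflow keeps temp files
    '--runner=DataflowRunner'                               # run on the Dataflow service
]

p = beam.Pipeline(argv=argv)
(p
 | 'GetJava' >> beam.io.ReadFromText('gs://{0}/javahelp/*.java'.format(BUCKET))   # the files copied in Step 2
 | 'Grep' >> beam.FlatMap(lambda line: [line] if 'import' in line else [])        # assumed search term
 | 'write' >> beam.io.WriteToText('gs://{0}/javahelp/output'.format(BUCKET))      # output examined in Step 6
)
p.run()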

Step 4

Submit the Dataflow job to the cloud:

python grepc.py

Because this is such a small job, running on the cloud will take significantly longer than running it locally (on the order of 2-3 minutes).

Step 5

On your Cloud Console, navigate to the Dataflow section (from the three bars in the top-left menu), and look at the Jobs. Select your job and monitor its progress.

Step 6

Wait for the job status to turn to Succeeded. At this point, your CloudShell will display a command-line prompt. In CloudShell, examine the output:

gsutil cat gs://<YOUR-BUCKET-NAME>/javahelp/output-*

In this lab, you:

Wrote a simple Dataflow pipeline in Python

Executed the pipeline locally

Executed the pipeline on the cloud

© Google, Inc. or its affiliates. All rights reserved. Do not distribute.