In this lab, you learn how to use pipeline options and carry out Map and Reduce operations in Dataflow.

What you need

You must have completed Lab 0 and have the following:

What you learn

In this lab, you learn how to:

The goal of this lab is to learn how to write MapReduce operations using Dataflow.

Step 1

In CloudShell, clone the repository if you haven't already:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst.git

Navigate to the folder containing the starter code for this lab:

cd training-data-analyst/courses/machine_learning/deepdive/04_features/dataflow/python/

Step 2

Open the CloudShell Code Editor and view the source code for the pipeline at: training-data-analyst/courses/machine_learning/deepdive/04_features/dataflow/python/is_popular.py

Step 3

What custom arguments are defined? ____________________

What is the default output prefix? _________________________________________

How is the variable output_prefix in main() set? _____________________________

How are the pipeline arguments such as --runner set? ______________________
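As a hint for the questions above, here is a minimal sketch of one common way a Python pipeline script separates its own flags from runner flags. It is not taken from is_popular.py: the function name parse_args and the default value /tmp/example-output are hypothetical, and only the standard-library argparse module is used. In Beam scripts the leftover list is commonly handed to the pipeline constructor, but check the lab source to see what it actually does.

```python
import argparse


def parse_args(argv):
    """Split command-line flags into custom options and everything else."""
    parser = argparse.ArgumentParser(description="Hypothetical pipeline options")
    # Hypothetical custom flag; this default is illustrative, not the lab's.
    parser.add_argument("--output_prefix", default="/tmp/example-output",
                        help="Prefix for the output files")
    # parse_known_args() returns (known, remaining), so flags the parser
    # does not define (for example --runner) are kept instead of rejected.
    known, remaining = parser.parse_known_args(argv)
    return known, remaining


known, pipeline_args = parse_args(
    ["--output_prefix=/tmp/myoutput", "--runner=DirectRunner"])
print(known.output_prefix)
print(pipeline_args)
```

The design choice here is that the script only needs to know about its own options; anything it does not recognize is passed along untouched.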

Step 4

What are the key steps in the pipeline? _____________________________________________________________________________

Which of these steps happen in parallel? ____________________________________

Which of these steps are aggregations? _____________________________________
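To help tell the parallel steps from the aggregations, the sketch below imitates the Map and Reduce pattern in plain Python. It is an illustration only: it does not use the Dataflow API, and the function names (emit_pairs, sum_per_key, top_n) and the sample input are hypothetical. A map-style step looks at one element at a time, so it can run in parallel; a reduce-style step needs all values for a key (or all keys) before it can produce a result.

```python
import heapq
from collections import Counter


def emit_pairs(line):
    """Map-style step: turn one line into zero or more (key, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def sum_per_key(pairs):
    """Reduce-style step: combine all values that share a key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def top_n(totals, n):
    """Global aggregation: keep the n keys with the largest totals."""
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])


lines = ["map reduce map", "reduce map", "combine"]
# Each line is processed independently of the others (parallelizable).
pairs = [pair for line in lines for pair in emit_pairs(line)]
# These two steps need to see every pair, so they are aggregations.
totals = sum_per_key(pairs)
top = top_n(totals, 2)
print(top)
```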

Step 1

Install the dependencies needed to run Dataflow pipelines in Python:

sudo ./install_packages.sh

Verify that your version of pip is greater than 8.0:

pip -V

If it is not, open a new CloudShell tab, which should pick up the updated pip.

Step 2

Run the pipeline locally:

./is_popular.py

Note: If you see the error 'No handlers could be found for logger "oauth2client.contrib.multistore_file"', you may ignore it. It only means that logging from the oauth2 library will go to stderr.

Step 3

Examine the output file:

cat /tmp/output-*

Step 1

Change the output prefix from the default value:

./is_popular.py --output_prefix=/tmp/myoutput

What will be the name of the new file that is written out?

Step 2

Note that we now have a new file in the /tmp directory:

ls -lrt /tmp/myoutput*
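If you would rather check for the output from Python, the listing above can be approximated with the standard-library glob module. This is a sketch under assumptions: the helper name list_outputs is hypothetical, and the demo writes two stand-in files to a temporary directory rather than reading the real pipeline output.

```python
import glob
import os
import tempfile


def list_outputs(prefix):
    """Return paths that start with prefix, oldest first (like ls -rt)."""
    return sorted(glob.glob(prefix + "*"), key=os.path.getmtime)


# Demo with stand-in files; the suffixes are arbitrary, not Dataflow's.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "myoutput")
    for i, suffix in enumerate(["-b", "-a"]):
        path = prefix + suffix
        with open(path, "w") as f:
            f.write("demo\n")
        # Set explicit modification times so the ordering is deterministic.
        os.utime(path, (1000 + i, 1000 + i))
    found = [os.path.basename(p) for p in list_outputs(prefix)]

print(found)
```

To inspect the lab's own output, call list_outputs("/tmp/myoutput") after running the pipeline.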

In this lab, you:

©Google, Inc. or its affiliates. All rights reserved. Do not distribute.