In this lab, you learn how to use pipeline options and carry out Map and Reduce operations in Dataflow.
To complete this lab, you need:
Access to a supported Internet browser
A Google Cloud Platform project
The goal of this lab is to learn how to write MapReduce operations using Dataflow and how to use custom pipeline options.
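Before working with Dataflow itself, it may help to see the Map and Reduce pattern in plain Python. This is a conceptual sketch only, not part of the lab's code: the Map phase transforms each record independently (and can therefore run in parallel), while the Reduce phase aggregates the mapped records.

```python
from functools import reduce

# Map phase: transform each record independently (parallelizable).
lines = ["import os", "import sys", "import os"]
packages = list(map(lambda line: line.split()[1], lines))

# Reduce phase: aggregate the mapped records into a summary.
counts = reduce(
    lambda acc, pkg: {**acc, pkg: acc.get(pkg, 0) + 1},
    packages,
    {},
)
print(counts)  # → {'os': 2, 'sys': 1}
```

Dataflow applies the same two-phase idea, but distributes the Map and Reduce steps across many workers.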
Start CloudShell and navigate to the directory for this lab:
cd ~/training-data-analyst/courses/data_analysis/lab2/python
If this directory doesn't exist, you may need to git clone the repository:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
View the source code for the pipeline using nano:
cd ~/training-data-analyst/courses/data_analysis/lab2/python
nano is_popular.py
What custom arguments are defined? ____________________
What is the default output prefix? _________________________________________
How is the variable output_prefix in main() set? _____________________________
How are the pipeline arguments such as --runner set? ______________________
What are the key steps in the pipeline? _____________________________________________________________________________
Which of these steps happen in parallel? ____________________________________
Which of these steps are aggregations? _____________________________________
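As a rough mental model for these questions: the pipeline reads source files, keeps the import lines (a Map step that runs in parallel), extracts package names (another parallel Map step), counts uses per package (an aggregation), and keeps the most popular packages (another aggregation). A plain-Python sketch of that flow follows; the parsing details are illustrative assumptions, not the lab's actual Beam code:

```python
from collections import Counter

def extract_packages(lines):
    # "Map" steps: keep only the import lines and pull out the package name.
    # Each line is processed independently, so this work parallelizes.
    for line in lines:
        if line.startswith("import "):
            yield line.split()[1].rstrip(";")

def top_packages(lines, n=5):
    # "Reduce" steps: count uses per package, then keep the n most popular.
    # Both are aggregations: they need to see all mapped records.
    return Counter(extract_packages(lines)).most_common(n)

sample = [
    "package com.example;",
    "import java.util.List;",
    "import java.io.File;",
    "import java.util.List;",
]
print(top_packages(sample, n=2))  # → [('java.util.List', 2), ('java.io.File', 1)]
```

Compare this sketch against the actual steps you identified in is_popular.py.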
Install the necessary dependencies for Python Dataflow:
Verify that you have the right version of pip (it should be later than 8.0):
pip -V
If not, open a new CloudShell tab and it should pick up the updated pip.
Run the pipeline locally:
Note: if you see an error that says No handlers could be found for logger "oauth2client.contrib.multistore_file", you may ignore it. The message simply means that logging from the oauth2 library will go to stderr.
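If you want to silence that warning rather than ignore it, Python's standard logging module lets you attach a no-op handler to the named logger. A small sketch, using the logger name from the message above:

```python
import logging

# Attaching a NullHandler suppresses the "No handlers could be found"
# warning for this logger without changing where other log output goes.
logging.getLogger("oauth2client.contrib.multistore_file").addHandler(
    logging.NullHandler()
)
```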
Examine the output file:
Change the output prefix from the default value:
What will be the name of the new file that is written out?
Note that we now have a new file in the /tmp directory:
ls -lrt /tmp/myoutput*
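Beam's text sink writes sharded output files named with the pattern prefix-SSSSS-of-NNNNN, so a single-shard run with the prefix /tmp/myoutput produces /tmp/myoutput-00000-of-00001. A small sketch of how that shard name is formed (the helper function here is hypothetical, for illustration only):

```python
def shard_file_name(prefix, shard_index, num_shards):
    # Beam's WriteToText names its shards <prefix>-SSSSS-of-NNNNN,
    # zero-padding both the shard index and the shard count to 5 digits.
    return f"{prefix}-{shard_index:05d}-of-{num_shards:05d}"

print(shard_file_name("/tmp/myoutput", 0, 1))  # → /tmp/myoutput-00000-of-00001
```

Check the ls output above against the name this pattern predicts.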
In this lab, you learned how to write Map and Reduce operations using Dataflow and how to use custom pipeline options.
©Google, Inc. or its affiliates. All rights reserved. Do not distribute.