In this lab, you learn how to write a simple Dataflow pipeline and run it both locally and on the cloud.
To complete this lab, you need:
* Access to a supported Internet browser
* A Google Cloud Platform project
In this lab, you learn how to:
* Write a simple pipeline in Python
* Execute the pipeline locally
* Execute the pipeline on the cloud using the Dataflow service
The goal of this lab is to become familiar with the structure of a Dataflow project and learn how to execute a Dataflow pipeline.
Start CloudShell and navigate to the directory for this lab:
cd ~/training-data-analyst/courses/data_analysis/lab2/python
If this directory doesn't exist, you may need to git clone the repository first:
cd ~
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd ~/training-data-analyst/courses/data_analysis/lab2/python
Install the necessary dependencies for Python Dataflow:
Verify that you have the right version of pip (should be > 8.0):
If not, upgrade pip, then open a new CloudShell tab; the new tab will pick up the updated pip.
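For reference, the version check, the upgrade, and a direct install of the Beam SDK typically look like the commands below. Treat these as a sketch rather than the lab's prescribed steps: whether you need sudo, and whether the lab directory instead provides its own install script, depends on your environment.
pip -V
sudo pip install -U pip
pip install apache-beam[gcp]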
View the source code for the pipeline using nano:
cd ~/training-data-analyst/courses/data_analysis/lab2/python
nano grep.py
What files are being read? _____________________________________________________
What is the search term? ______________________________________________________
Where does the output go? ___________________________________________________
There are three transforms in the pipeline: one that reads the input files, one that does the grep, and one that writes the output.
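A rough sketch of what such a three-transform pipeline looks like in the Beam Python SDK is shown below. The input path, search term, output prefix, and the helper name my_grep are illustrative placeholders, not the values used by grep.py (those are what the questions above ask you to find):

import apache_beam as beam

def my_grep(line, term):
    # Emit the line only if it contains the search term.
    if term in line:
        yield line

with beam.Pipeline('DirectRunner') as p:
    (p
     | 'GetLines' >> beam.io.ReadFromText('path/to/input*.java')
     | 'Grep' >> beam.FlatMap(my_grep, 'searchterm')
     | 'Write' >> beam.io.WriteToText('/tmp/output'))

Execute the pipeline locally by running the script with the Python interpreter:
python grep.py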
Note: if you see an error that says "No handlers could be found for logger oauth2client.contrib.multistore_file", you may ignore it. The error simply means that logging from the oauth2 library will go to stderr.
Examine the output file:
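Assuming the output prefix you found in grep.py is /tmp/output (adjust the path if yours differs), you can print it with:
cat /tmp/output*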
Does the output seem logical? ______________________
If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.
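Alternatively, you can create the bucket from CloudShell with gsutil; the location below is only an example:
gsutil mb -l us-central1 gs://<YOUR-BUCKET-NAME>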
Copy some Java files to the cloud (make sure to replace <YOUR-BUCKET-NAME> with the bucket name you created in the previous step):
gsutil cp ../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java gs://<YOUR-BUCKET-NAME>/javahelp
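You can verify that the files arrived with:
gsutil ls gs://<YOUR-BUCKET-NAME>/javahelp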
Edit the Dataflow pipeline in grepc.py by opening it up in nano:
cd ~/training-data-analyst/courses/data_analysis/lab2/python
nano grepc.py
and change the PROJECT and BUCKET variables appropriately.
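Near the top of grepc.py you will find assignments along these lines; the values shown here are placeholders for your own project ID and bucket name:
PROJECT = 'your-project-id'
BUCKET = 'your-bucket-name'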
Submit the Dataflow job to the cloud:
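If the submit command is not shown in your copy of the lab, it is typically just a matter of running the script (an assumption based on the lab's structure; grepc.py is set up to launch the job on the Dataflow service):
python grepc.py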
Because this is such a small job, running on the cloud will take significantly longer than running it locally (on the order of 2-3 minutes).
On your Cloud Console, navigate to the Dataflow section (from the three bars in the top-left menu) and look at the Jobs. Select your job and monitor its progress; you will see the pipeline's job graph and its status as it runs.
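If you prefer the command line, you can also check on the job from CloudShell (optional):
gcloud dataflow jobs list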
Wait for the job status to turn to Succeeded. At this point, your CloudShell will display a command-line prompt. In CloudShell, examine the output:
gsutil cat gs://<YOUR-BUCKET-NAME>/javahelp/output-*
In this lab, you:
* Wrote a simple Dataflow pipeline in Python
* Executed the pipeline locally
* Executed the pipeline on the cloud using the Dataflow service
©Google, Inc. or its affiliates. All rights reserved. Do not distribute.