In this lab, you learn how to write a simple Dataflow pipeline and run it both locally and on the cloud.
You must have completed Lab 0 and have the following:
In this lab, you learn how to:
The goal of this lab is to become familiar with the structure of a Dataflow project and learn how to execute a Dataflow pipeline.
In CloudShell, if you haven't already, git clone the repository:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst.git
Navigate to the folder containing the starter code for this lab:
Install the necessary dependencies for Python Dataflow:
Verify that you have the right version of pip (should be > 8.0):
If not, open a new CloudShell tab and it should pick up the updated pip.
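If you prefer to check from Python instead of running pip -V in the shell, a quick sketch (assuming the pip package is importable in your environment):

```python
# Check the installed pip version from Python
# (an alternative to running `pip -V` in the shell).
import pip

print(pip.__version__)  # should be greater than 8.0
```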
Open the CloudShell code editor by clicking the pencil icon.
View the source code for the pipeline by navigating to:
What files are being read? _____________________________________________________
What is the search term? ______________________________________________________
Where does the output go? ___________________________________________________
There are three transforms in the pipeline:
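The exact transform names are in the lab's source file. As a rough sketch of what a grep-style pipeline does, written in plain Python rather than the Beam API (the file list and search term below are hypothetical; the real values are the answers to the questions above):

```python
# Plain-Python sketch of the three grep-pipeline stages.
# File paths and the search term here are illustrative placeholders.

def read_lines(paths):
    """Stage 1: read every line from every input file."""
    for path in paths:
        with open(path) as f:
            for line in f:
                yield line.rstrip('\n')

def grep(lines, term):
    """Stage 2: keep only the lines containing the search term."""
    return (line for line in lines if term in line)

def write_output(lines, out_path):
    """Stage 3: write the matching lines to the output file."""
    with open(out_path, 'w') as f:
        for line in lines:
            f.write(line + '\n')
```

In the real pipeline, each stage is a Beam transform (a ReadFromText source, a FlatMap or Filter in the middle, and a WriteToText sink), which is what lets the same code run either locally or on Dataflow.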
Execute locally in CloudShell:
Note: if you see an error that says No handlers could be found for logger "oauth2client.contrib.multistore_file", you may ignore it. The message simply means that logging from the oauth2 library will go to stderr.
Examine the output file:
Does the output seem logical? ______________________
In the Console search box, type "dataflow" to find the Dataflow API and click the hyperlink. Click Enable if necessary.
If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.
Copy some Java files to the cloud (make sure to replace <YOUR-BUCKET-NAME> with the bucket name you created in the previous step):
gsutil cp ../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java gs://<YOUR-BUCKET-NAME>/javahelp
Edit the Dataflow pipeline in grepc.py by opening it in the CloudShell code editor and changing the PROJECT and BUCKET variables appropriately.
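The edited variables might look like the following (the values shown are hypothetical placeholders; use your own project ID and bucket name):

```python
# Hypothetical values -- replace with your own project ID and bucket name.
PROJECT = 'my-project-id'
BUCKET = 'my-bucket-name'

# The pipeline typically uses these to build its Dataflow submission options,
# e.g. a --project flag from PROJECT and gs://BUCKET/... staging paths.
```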
Submit the Dataflow job to the cloud:
Because this is such a small job, running it on the cloud will take significantly longer than running it locally (on the order of 2-3 minutes).
On your Cloud Console, navigate to the Dataflow section (from the three-bar menu at the top left) and look at the Jobs page. Select your job and monitor its progress. You will see something like this:
Wait for the job status to change to Succeeded.
In CloudShell, examine the output:
gsutil cat gs://<YOUR-BUCKET-NAME>/javahelp/output-*
In this lab, you:
©Google, Inc. or its affiliates. All rights reserved. Do not distribute.