In this lab, you process Landsat data in a distributed manner using Apache Beam and Cloud Dataflow. This lab is part of a series of labs on processing scientific data.

What you need

To complete this lab, you need:

  * Access to a Google Cloud Platform project, including the web console and Cloud Shell

What you learn

In this lab, you:

  * Enable the Dataflow API
  * Examine an Apache Beam pipeline that carries out distributed processing of Landsat data
  * Submit the pipeline to the Cloud Dataflow service
  * Monitor the job and verify the Landsat output in Cloud Storage

Consider using Apache Beam on Cloud Dataflow to scale out compute-intensive jobs that have these characteristics (a minimal code sketch follows the list):

  1. Your data is not tabular and you cannot use SQL to do the analysis. (If it is tabular, use BigQuery.)
  2. Large portions of the job are embarrassingly parallel -- in other words, you can process different subsets of the data on different machines.
  3. Your logic involves custom functions, iterations, etc.
  4. The distribution of the work varies across your data subsets.
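
Here is a minimal, hypothetical Beam pipeline that illustrates characteristics 2 and 3: a custom Python function is applied independently to each element, which is exactly the kind of step Dataflow can fan out across many workers. The scene IDs and the count_pixels function are placeholders, not part of this lab's code; this sketch runs locally with the DirectRunner.

import apache_beam as beam

def count_pixels(scene_id):
    # Stand-in for compute-intensive, per-scene logic.
    return scene_id, len(scene_id)

with beam.Pipeline() as p:
    (p
     | 'scenes'  >> beam.Create(['LC80440342016259LGN00',
                                 'LC80440342016275LGN00'])
     | 'process' >> beam.Map(count_pixels)
     | 'print'   >> beam.Map(print))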

To enable the Dataflow API:

Step 1

In the web console, navigate to the API Manager and click on Enable API.

Step 2

In the search window, type "Dataflow" and select Google Cloud Dataflow API.

Step 3

If not already enabled, click ENABLE.

The Apache Beam code sets up a pipeline to carry out distributed processing of Landsat data.

Step 1

Read this blog post on what this pipeline does, why it does it that way, and what the results look like.

Step 2

Open Cloud Shell. The Cloud Shell icon is at the top right of the web console.

Step 3

In Cloud Shell, clone the repository containing the code for data processing:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

Step 4

In Cloud Shell, click on the "Launch code editor" button on the top-right ribbon. In the code editor, navigate to training-data-analyst/blogs/landsat/, click on dfndvi.py, and examine the pipeline code in the run() method.
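
For orientation while reading run(): a Beam run() method typically parses command-line flags, builds PipelineOptions, and then chains named transforms with the | and >> operators, as in the generic sketch below. The helper function, transform names, and paths here are placeholders, not the actual dfndvi.py code.

import argparse
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_scene(line):
    # Placeholder; dfndvi.py has its own per-record parsing and NDVI logic.
    return line.split(',')

def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--output_file', default='output.txt')
    known_args, pipeline_args = parser.parse_known_args(argv)

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        (p
         | 'read'  >> beam.io.ReadFromText('gs://some-bucket/scene_index.csv')
         | 'parse' >> beam.Map(parse_scene)
         | 'write' >> beam.io.WriteToText(known_args.output_file))

if __name__ == '__main__':
    run()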

To submit Beam code to the Dataflow service:

Step 1

In Cloud Shell, navigate to the folder containing the code for this lab and install the necessary packages.

cd ~/training-data-analyst/blogs/landsat/
yes | sudo ./install_packages.sh

Step 2

Submit the Beam pipeline to Dataflow, specifying your PROJECT and BUCKET appropriately (create a bucket in the web console if necessary):

./run_oncloud.sh <PROJECT> <BUCKET>
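
The script is a thin wrapper; its exact contents are not reproduced here, but submitting any Beam pipeline to the Dataflow service amounts to setting options like the ones below, which is roughly how <PROJECT> and <BUCKET> get used. The job name, paths, and worker cap are illustrative values only.

from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative values; run_oncloud.sh builds the real ones from its arguments.
options = PipelineOptions(
    runner='DataflowRunner',                    # execute on the Dataflow service
    project='my-project',                       # <PROJECT>
    job_name='landsat-ndvi',
    staging_location='gs://my-bucket/staging',  # <BUCKET>
    temp_location='gs://my-bucket/temp',
    max_num_workers=10)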

To monitor the job:

Step 1

Navigate to the web console's Dataflow section and notice the newly submitted job.

Step 2

Click on the job link and view the pipeline and the autoscaling worker graphs.

Step 3

When the job finishes (this will take 15-20 minutes, possibly more if your quota doesn't allow 10 simultaneous workers), navigate to your bucket (in the Storage section of the web console) and verify that the Landsat output files are there.
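
If you prefer to check from code rather than the console, one option (assuming the google-cloud-storage client library is installed) is to list the objects in your bucket from a Python shell; the bucket name and prefix below are placeholders to replace with your own.

from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs('my-bucket', prefix='landsat/'):
    print(blob.name, blob.size)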

In this lab, you learned how to create a data pipeline in Python and run it in an autoscaled manner on Google Cloud Platform.

©Google, Inc. or its affiliates. All rights reserved. Do not distribute.