In this lab you process Landsat data in a distributed manner using Apache Beam and Cloud Dataflow. This lab is part of a series of labs on processing scientific data.
To complete this lab, you need:
In this lab, you:
Consider using Apache Beam on Cloud Dataflow to scale out compute-intensive jobs that meet these characteristics:
To enable the Dataflow API:
In the web console, navigate to API Manager and click on Enable API
In the search window, type "Dataflow" and select Google Cloud Dataflow API
If not already enabled, click ENABLE
The Apache Beam code sets up a pipeline to carry out distributed processing of Landsat data.
Read this blog post on what this pipeline does, why it does it that way, and what the results look like.
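A key property the blog highlights is that each Landsat scene can be processed independently of the others, which is why the job scales out so well on Dataflow. The shape of that pattern can be sketched with Python's standard library (the scene IDs and the per-scene function below are made-up placeholders, not the lab's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def process_scene(scene_id):
    """Placeholder for the real per-scene work (fetch bands, compute an index)."""
    return scene_id + ": processed"

# Made-up scene identifiers; the real pipeline reads them from a Landsat index.
scenes = ["scene-001", "scene-002", "scene-003"]

# Each element is independent, so a local pool (or, at scale, Dataflow
# workers) can process scenes concurrently with no coordination between them.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process_scene, scenes))

print(results)
```

Dataflow applies the same idea across many machines: because no scene depends on another, adding workers speeds the job up almost linearly, which is what the autoscaling graphs later in this lab show.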
Open Cloud Shell. The Cloud Shell icon is at the top right of the web console:
In Cloud Shell, clone the repository containing the code for data processing:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
In Cloud Shell, click on the "Launch code editor" button on the top-right ribbon. In the code editor, navigate to
training-data-analyst/blogs/landsat/ and click on
dfndvi.py, then examine the pipeline code in its run() method.
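As its name suggests, the per-pixel computation in dfndvi.py is the Normalized Difference Vegetation Index (NDVI), derived from Landsat's red and near-infrared bands. A minimal sketch of that formula, with scalar reflectances standing in for raster pixels (the real code operates on whole images):

```python
def ndvi(red, nir):
    """NDVI = (NIR - Red) / (NIR + Red); values range from -1 to 1."""
    denom = nir + red
    if denom == 0:
        return 0.0  # guard against empty / zero-reflectance pixels
    return (nir - red) / denom

# Healthy vegetation reflects strongly in near-infrared and absorbs red,
# so vegetated pixels score close to 1; water and bare soil near or below 0.
print(ndvi(0.1, 0.5))   # vegetated pixel: high positive NDVI
print(ndvi(0.4, 0.3))   # non-vegetated pixel: negative NDVI
```

Computing this for every pixel of every scene is what makes the job compute-intensive, and since pixels and scenes are independent, it parallelizes cleanly.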
To submit Beam code to the Dataflow service:
In Cloud Shell, navigate to the folder containing the code for this lab and install the necessary packages.
cd ~/training-data-analyst/blogs/landsat/
yes | sudo ./install_packages.sh
Submit the Beam pipeline to Dataflow specifying the PROJECT and BUCKET appropriately (create a bucket on the web console if necessary):
./run_oncloud.sh <PROJECT> <BUCKET>
Navigate to the web console's Dataflow section and notice your newly submitted job.
Click on the job link and view the pipeline and the autoscaling worker graphs.
When the job finishes (this will take 15-20 minutes, possibly more if your quota doesn't allow 10 simultaneous workers), navigate to your bucket (in the Storage section of the web console) and verify that the Landsat output files are now present.
In this lab, you learned how to create data pipelines in Python and run them in an autoscaled manner on Google Cloud Platform.
©Google, Inc. or its affiliates. All rights reserved. Do not distribute.