In this lab you spin up a virtual machine, configure its security, access it remotely, and then carry out the steps of an ingest-transform-and-publish data pipeline manually.
You must have completed Lab 0 and have the following:
In this lab, you:
In this lab you spin up a virtual machine, install software on it, and use it to do scientific data processing. We do not recommend that you work with Compute Engine instances at such a low level, but you can!
In this lab, you will use Google Cloud Platform in a manner similar to the way you likely use clusters today. Spinning up a virtual machine and running your jobs on it is the closest you can get to working with the public cloud as simply rented infrastructure. It doesn't take advantage of the other benefits that Google Cloud Platform provides -- namely, the ability to forget about infrastructure and treat your scientific computation problems simply as software that needs to be run.
You will ingest real-time earthquake data published by the United States Geological Survey (USGS) and create maps that look like this:
To create a Compute Engine instance:
Browse to https://cloud.google.com/
Click on Console.
Click on the Menu (three horizontal lines):
Select Compute Engine.
Click Create Instance and wait for a form to load. You will need to change some options on the form that comes up.
Change Identity and API access for the Compute Engine default service account to Allow full access to all Cloud APIs:
Now, click Create
You can remotely access your Compute Engine instance using Secure Shell (SSH):
Click on SSH:
To find some information about the Compute Engine instance, type the following into the command-line:
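The exact command for this step is not shown here. As a minimal sketch, assuming a Linux image, the following prints a short summary of the machine you are logged in to. The helper name describe_machine is hypothetical, and this is one possible way to inspect the instance, not necessarily the command the lab intends.

```shell
# Hypothetical helper: summarize the machine you are logged in to.
describe_machine() {
  # Kernel name and release
  echo "kernel=$(uname -sr)"
  # Number of available processors (nproc on Linux, getconf as a fallback)
  echo "cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"
}

describe_machine

# For full per-CPU details on Linux, you could also run:
#   cat /proc/cpuinfo
```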
Type the following into command-line:
sudo apt-get update
sudo apt-get -y -qq install git
Verify that git is now installed
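One way to perform this check, sketched under the assumption that you only need to confirm the command is on your PATH. The helper name check_tool is hypothetical.

```shell
# Hypothetical helper: report whether a command is available on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: installed"
  else
    echo "$1: not found"
    return 1
  fi
}

# On the lab VM, check for git. The "|| true" keeps a script going if it is missing.
check_tool git || true

# Once git is present, its version can be printed with:
#   git --version
```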
On the command-line, type:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst.git
This clones the code repo.
Navigate to the folder corresponding to this lab:
Examine the ingest code using
The less command allows you to view the file (press the spacebar to scroll down, the letter b to back up a page, and the letter q to quit).
ingest.sh downloads a dataset of earthquakes in the past 7 days from the US Geological Survey. Where is this file downloaded? To disk or to Cloud Storage? __________________________
Run the ingest code:
Verify that some data has been downloaded:
The head command shows you the first few lines of the file.
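To illustrate how head behaves, here is a small self-contained sketch. It builds a 20-line placeholder file in a temporary directory (the contents are stand-ins, not real earthquake data) and then prints the beginning of it.

```shell
# Build a 20-line placeholder file in a temporary directory (not real earthquake data).
workdir="$(mktemp -d)"
sample="$workdir/sample.csv"
echo "id,value" > "$sample"
i=1
while [ "$i" -le 19 ]; do
  echo "$i,placeholder" >> "$sample"
  i=$((i + 1))
done

# head prints the first 10 lines by default.
head "$sample" | wc -l

# The -n option selects a different number of lines.
head -n 3 "$sample"
```

On the lab VM you would point head at the downloaded file instead of the placeholder.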
You will use a Python program to transform the raw data into a map of earthquake activity:
The transformation code is explained in detail in this notebook:
Feel free to read the narrative to understand what the transformation code does. The notebook itself was written in Datalab, a GCP product that you will use later in this set of labs.
First, install the necessary Python packages on the Compute Engine instance:
Then, run the transformation code:
You will notice a new image file if you list the contents of the directory:
Create a bucket using the GCP console:
Browse to the GCP Console by visiting http://cloud.google.com and clicking on Go To Console
Click on the Menu (3 bars) at the top-left and select Storage
Click on Create Bucket.
Choose a globally unique bucket name (your project name is unique, so you could use that). You can leave the bucket as Multi-Regional, or improve speed and reduce costs by making it Regional (choose the same region as your Compute Engine instance). Check https://cloud.google.com/about/locations/ to make sure that Machine Learning Engine is available in the region you select, since you will need it later in the course. Then, click Create.
Note down the name of your bucket: _______________________________
In this and future labs, you will insert this whenever the directions ask for
To store the original and transformed data in Cloud Storage:
In the SSH window of the Compute Engine instance, type:
gsutil cp earthquakes.* gs://<YOUR-BUCKET>/earthquakes/
to copy the files to Cloud Storage
On the GCP console, click on your bucket name, and notice there are three new files present in the earthquakes folder.
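The copy command relies on the shell expanding the pattern earthquakes.* into every matching file name before gsutil runs. The sketch below shows that expansion in a scratch directory. The file names are assumptions for illustration, and count_matches is a hypothetical helper.

```shell
# Scratch directory with stand-in files; the names are assumptions for illustration.
scratch="$(mktemp -d)"
touch "$scratch/earthquakes.csv" "$scratch/earthquakes.htm" \
      "$scratch/earthquakes.png" "$scratch/notes.txt"

# Hypothetical helper: count the files in a directory that match earthquakes.*
# The body runs in a subshell, so the cd does not affect the caller.
count_matches() (
  cd "$1" || exit 1
  set -- earthquakes.*
  # If nothing matches, the pattern is left unexpanded; treat that as zero.
  [ -e "$1" ] || { echo 0; exit 0; }
  echo "$#"
)

# Show the names the pattern expands to, then the count.
( cd "$scratch" && printf '%s\n' earthquakes.* )
count_matches "$scratch"
```

Only the files whose names begin with earthquakes. are matched, so notes.txt is left out of the copy.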
To publish Cloud Storage files to the web:
Follow the instructions in the cloud docs to make the objects public
What is the URL of the published Cloud Storage file? How does it relate to your bucket name and content?
What are some advantages of publishing to Cloud Storage? _____________________________________________
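As a hint for the URL question, publicly readable Cloud Storage objects can be fetched over HTTPS at storage.googleapis.com, followed by the bucket name and the object path. The sketch below composes such a URL. The helper name public_url, the bucket name, and the object path are hypothetical, and the helper does no URL-encoding, so it assumes names without spaces or special characters.

```shell
# Hypothetical helper: compose the public HTTPS URL for a Cloud Storage object.
# Assumes the bucket and object names need no URL-encoding.
public_url() {
  bucket="$1"
  object="$2"
  echo "https://storage.googleapis.com/${bucket}/${object}"
}

# Example with placeholder names; substitute your own bucket and object path.
public_url my-example-bucket earthquakes/earthquakes.htm
```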
To delete the Compute Engine instance (since we won't need it any more):
On the GCP console, click the Menu (three horizontal bars) and select Compute Engine
Click on the checkbox corresponding to the instance that you created (the default name was instance-1)
Click on the Delete button in the top-right corner
Does deleting the instance have any impact on the files that you stored on Cloud Storage? ___________________
In this lab, you used Google Cloud Platform (GCP) as rented infrastructure. You can spin up a Compute Engine VM, install custom software on it, and run your processing jobs. However, using GCP in this way doesn't take advantage of the other benefits that Google Cloud Platform provides -- namely, the ability to forget about infrastructure and treat your scientific computation problems simply as software that needs to be run.
©Google, Inc. or its affiliates. All rights reserved. Do not distribute.