In this lab you spin up a virtual machine, configure its security, access it remotely, and then carry out the steps of an ingest-transform-and-publish data pipeline manually. This lab is part of a series of labs on processing scientific data.
To complete this lab, you need:
In this lab, you:
- Spin up a virtual machine
- Configure its security
- Access it remotely
- Carry out the steps of an ingest-transform-and-publish data pipeline manually
In this lab you spin up a virtual machine, install software on it, and use it to do scientific data processing. We do not recommend working with Compute Engine instances at such a low level, but you can!
In this lab, you will use Google Cloud Platform in a manner similar to the way you likely use clusters today. Spinning up a virtual machine and running your jobs on it is the closest you can get to treating the public cloud as simply rented infrastructure. It doesn't take advantage of the other benefits that Google Cloud Platform provides -- namely, the ability to forget about infrastructure and treat your scientific computation problems simply as software that needs to be run.
You will ingest real-time earthquake data published by the United States Geological Survey (USGS) and create maps that look like this:
If you don't already have a Google Account (Gmail or Google Apps), you must create one. Sign in to the Google Cloud Platform console (console.cloud.google.com) and create a new project:
Remember the project ID, a unique name across all Google Cloud projects (the name above has already been taken and will not work for you, sorry!). It will be referred to later in this codelab as
Next, you'll need to enable billing in the Developers Console in order to use Google Cloud resources.
Running through this codelab shouldn't cost you more than a few dollars, but it could be more if you decide to use more resources or if you leave them running (see the "cleanup" section at the end of this document). Compute Engine pricing is documented here.
New users of Google Cloud Platform are eligible for a $300 free trial.
To create a Compute Engine instance:
Browse to https://cloud.google.com/
Click on Go to Console.
Click on the Menu (three horizontal lines):
Select Compute Engine.
Click Create Instance and wait for the form to load. You will need to change some of the options on it.
Change Identity and API access for the Compute Engine default service account to Allow full access to all Cloud APIs:
Now, click Create.
You can remotely access your Compute Engine instance using Secure Shell (SSH):
Click on SSH:
To find some information about the Compute Engine instance, type the following into the command-line:
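For example, the following standard Linux commands report basic facts about the VM (a hedged suggestion -- the original lab may have intended different commands):

```shell
# Kernel version, hostname, and architecture of the VM:
uname -a
# Number of CPU cores allocated to this instance:
nproc
```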
Type the following into the command line:
sudo apt-get update
sudo apt-get -y -qq install git
Verify that git is now installed
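One way to verify the installation (assuming git is on your PATH) is to print its version string:

```shell
# Prints something like "git version 2.x.y" if the install succeeded:
git --version
```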
On the command-line, type:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
This downloads the code from GitHub.
Navigate to the folder corresponding to this lab:
Examine the ingest code using the less command, which allows you to view a file (press the spacebar to scroll down, the letter b to back up a page, and the letter q to quit).
ingest.sh downloads a dataset of earthquakes from the past 7 days from the US Geological Survey. Notice where the file is downloaded (to disk or to Cloud Storage).
Run the ingest code:
Verify that some data has been downloaded:
The head command shows you the first few lines of the file.
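As a self-contained illustration of head (the sample rows below are made up purely for this example; in the lab you would run head on the file that ingest.sh actually downloaded):

```shell
# Create a tiny stand-in for the downloaded file (fabricated sample rows,
# just so this example is runnable; ingest.sh produces the real data):
printf 'time,latitude,longitude,mag\n2024-01-01T00:00Z,35.0,-118.0,2.3\n' > sample.csv
# head prints the first few lines of a file (here, both lines of sample.csv):
head sample.csv
```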
You will use a Python program to transform the raw data into a map of earthquake activity:
The transformation code is explained in detail in this notebook:
Feel free to read the narrative to understand what the transformation code does. The notebook itself was written in Datalab, a GCP product that you will use later in this set of labs.
First, install the necessary Python packages on the Compute Engine instance:
Then, run the transformation code:
You will notice a new image file if you list the contents of the directory:
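For example, ls -l lists the directory contents with sizes and timestamps, so a newly created image file (for instance a .png; the exact name depends on the transformation code) stands out:

```shell
# List the current directory with details, newest files last,
# to spot any file the transformation step just created:
ls -ltr
```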
Create a bucket using the GCP console:
Browse to the GCP Console by visiting https://cloud.google.com and clicking on Go to Console.
Click on the Menu (three horizontal lines) at the top-left and select Storage
Click on Create Bucket.
Choose a globally unique bucket name (your project ID is unique, so you could use that). You can leave the storage class as Multi-Regional, or improve speed and reduce costs by making it Regional (choose the same region as your Compute Engine instance). Then, click Create.
Note down the name of your bucket. You will insert it whenever the instructions ask for <YOUR-BUCKET>.
To store the original and transformed data in Cloud Storage:
In the SSH window of the Compute Engine instance, type:
gsutil cp earthquakes.* gs://<YOUR-BUCKET>/earthquakes/
to copy the files to Cloud Storage. Remember to change <YOUR-BUCKET> to the bucket name you created earlier.
On the GCP console, click on your bucket name, and notice there are three new files present in the earthquakes folder. (You may want to click the Refresh button on the top).
To publish Cloud Storage files to the web:
On the GCP console, select all three earthquakes files that you uploaded to the bucket and click on Share publicly
Click on the Public link corresponding to earthquakes.htm
What is the URL of the published Cloud Storage file? How does it relate to your bucket name and content?
What are some advantages of publishing to Cloud Storage?
You may close the SSH window now.
To delete the Compute Engine instance (since we won't need it any more):
On the GCP console, click the Menu (three horizontal lines) and select Compute Engine
Click on the checkbox corresponding to the instance that you created (the default name was instance-1)
Click on the Delete button in the top-right corner
Does deleting the instance have any impact on the files that you stored on Cloud Storage? ___________________
In this lab, you used Google Cloud Platform (GCP) as rented infrastructure. You can spin up a Compute Engine VM, install custom software on it, and run your processing jobs. However, using GCP in this way doesn't take advantage of the other benefits that Google Cloud Platform provides -- namely, the ability to forget about infrastructure and treat your scientific computation problems simply as software that needs to be run.
In the rest of the labs of this quest, you will learn how to take advantage of the separation of data and compute that GCP enables.