In this lab you spin up a virtual machine, configure its security, access it remotely, and then carry out the steps of an ingest-transform-and-publish data pipeline manually. This lab is part of a series of labs on processing scientific data.

What you need

To complete this lab, you need:

* A Google Account (Gmail or Google Apps)
* A Google Cloud Platform project with billing enabled

What you learn

In this lab, you:

* Spin up a Compute Engine virtual machine and configure its security
* Access the instance remotely using SSH
* Carry out the steps of an ingest-transform-and-publish data pipeline manually

In this lab you spin up a virtual machine, install software on it, and use it to do scientific data processing. We do not recommend that you work with Compute Engine instances at such a low level, but you can!

In this lab, you will use Google Cloud Platform in a manner similar to the way you likely use clusters today. Spinning up a virtual machine and running your jobs on it is the closest you can get to working with the public cloud as simply rented infrastructure. It doesn't take advantage of the other benefits that Google Cloud Platform provides, namely the ability to forget about infrastructure and treat your scientific computation problems simply as software that needs to run.

You will ingest real-time earthquake data published by the United States Geological Survey (USGS) and create maps that look like this:

Self-paced environment setup

If you don't already have a Google Account (Gmail or Google Apps), you must create one. Sign in to the Google Cloud Platform console (console.cloud.google.com) and create a new project:

Remember the project ID, a unique name across all Google Cloud projects (the name above has already been taken and will not work for you, sorry!). It will be referred to later in this codelab as PROJECT_ID.

Next, you'll need to enable billing in the Developers Console in order to use Google Cloud resources.

Running through this codelab shouldn't cost you more than a few dollars, but it could be more if you decide to use more resources or if you leave them running (see the "cleanup" section at the end of this document). Compute Engine pricing is documented here.

New users of Google Cloud Platform are eligible for a $300 free trial.

To create a Compute Engine instance:

Step 1

Browse to https://cloud.google.com/

Step 2

Click on Go to Console.

Step 3

Click on the Menu (three horizontal lines):

Step 4

Select Compute Engine.

Step 5

Click Create Instance and wait for the form to load. You will need to change some options on this form.

Step 6

Change Identity and API access for the Compute Engine default service account to Allow full access to all Cloud APIs:

Step 7

Now, click Create.

You can remotely access your Compute Engine instance using Secure Shell (SSH):

Step 1

Click on SSH:

Step 2

To find some information about the Compute Engine instance, type the following into the command-line:

cat /proc/cpuinfo
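A few other standard Linux commands can tell you more about the machine you just created; for example:

```shell
# Additional ways to inspect the instance (all standard on the
# Debian-based images that Compute Engine uses by default):
nproc       # number of CPU cores
free -h     # memory usage, in human-readable units
df -h /     # free disk space on the root filesystem
uname -r    # kernel version
```

Compare the core count and memory against the machine type you selected when creating the instance.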

To install software (in this case, git) on the Compute Engine instance:

Step 1

Type the following into the command-line:

sudo apt-get update
sudo apt-get -y -qq install git

Step 2

Verify that git is now installed:

git --version

To download the code for this lab:

Step 1

On the command-line, type:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

This downloads the code from GitHub.

Step 2

Navigate to the folder corresponding to this lab:

cd training-data-analyst/CPB100/lab2b

Step 3

Examine the ingest code using less:

less ingest.sh

The less command allows you to view the file (Press the spacebar to scroll down; the letter b to back up a page; the letter q to quit).

The program ingest.sh downloads a dataset of earthquakes in the past 7 days from the US Geological Survey. Notice where the file is downloaded (to local disk or to Cloud Storage).
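For reference, the core of such an ingest step can be sketched in a couple of commands. The feed URL below is an assumption based on the USGS real-time CSV feeds; the actual ingest.sh may differ:

```shell
# Hypothetical sketch of an ingest step (the real ingest.sh may differ).
# The URL assumes the USGS "all earthquakes, past 7 days" CSV feed.
FEED_URL="https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.csv"

if curl -fsSL "$FEED_URL" -o earthquakes.csv; then
  wc -l earthquakes.csv   # rough count of events ingested
else
  echo "download failed (no network access?)"
fi
```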

Step 4

Run the ingest code:

bash ingest.sh

Step 5

Verify that some data has been downloaded:

head earthquakes.csv

The head command shows you the first few lines of the file.
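To see which columns the file contains, you can split its header line. The sketch below runs against a small stand-in file, since the exact columns of the real feed may vary (time, latitude, longitude, depth, and mag are typical):

```shell
# Create a small stand-in CSV with hypothetical (but typical) columns:
printf 'time,latitude,longitude,depth,mag\n2017-01-01T00:00:00Z,35.0,-118.0,10.2,4.5\n' > sample.csv

head -1 sample.csv | tr ',' '\n'   # print one column name per line
tail -n +2 sample.csv | wc -l      # count data rows, excluding the header
```

You can run the same two commands against earthquakes.csv on your instance.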

You will use a Python program to transform the raw data into a map of earthquake activity:

Step 1

The transformation code is explained in detail in this notebook:

https://github.com/GoogleCloudPlatform/datalab-samples/blob/master/basemap/earthquakes.ipynb

Feel free to read the narrative to understand what the transformation code does. The notebook itself was written in Datalab, a GCP product that you will use later in this set of labs.

Step 2

First, install the necessary Python packages on the Compute Engine instance:

bash install_missing.sh

Step 3

Then, run the transformation code:

python transform.py

Step 4

You will notice a new image file if you list the contents of the directory:

ls -l

Create a bucket using the GCP console:

Step 1

Browse to the GCP Console by visiting https://cloud.google.com and clicking on Go to Console.

Step 2

Click on the Menu (three horizontal lines) at the top-left and select Storage.

Step 3

Click on Create Bucket.

Step 4

Choose a globally unique bucket name (your project name is unique, so you could use that). You can leave it as Multi-Regional, or improve speed and reduce costs by making it Regional (choose the same region as your Compute Engine instance). Then, click Create.

Step 5

Note down the name of your bucket.

You will insert this whenever the instructions ask for <YOUR-BUCKET>.
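As an alternative to the console, a bucket can also be created from the command line with gsutil. This is only a sketch: it assumes the Cloud SDK is installed and you are authenticated, and the bucket name and region below are placeholders you must replace:

```shell
BUCKET="your-unique-bucket-name"   # placeholder: substitute your bucket name
REGION="us-central1"               # placeholder: match your instance's region

if command -v gsutil >/dev/null 2>&1; then
  # "mb" = make bucket; -l sets the bucket's location
  gsutil mb -l "$REGION" "gs://${BUCKET}/" \
    || echo "bucket creation failed (check authentication and name uniqueness)"
else
  echo "gsutil not found; create the bucket in the console instead"
fi
```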

To store the original and transformed data in Cloud Storage:

Step 1

In the SSH window of the Compute Engine instance, type:

gsutil cp earthquakes.* gs://<YOUR-BUCKET>/earthquakes/

to copy the files to Cloud Storage. Remember to change <YOUR-BUCKET> to the bucket name you created earlier.
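You can also verify the upload from the SSH window rather than the console. A sketch, assuming gsutil is available (it is preinstalled on Compute Engine instances) and using a placeholder bucket name:

```shell
BUCKET="your-unique-bucket-name"   # placeholder: the bucket you created above

if command -v gsutil >/dev/null 2>&1; then
  # -l adds the size and timestamp of each object to the listing
  gsutil ls -l "gs://${BUCKET}/earthquakes/" \
    || echo "listing failed (check authentication and bucket name)"
else
  echo "gsutil not found; check the bucket in the console instead"
fi
```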

Step 2

On the GCP console, click on your bucket name, and notice there are three new files present in the earthquakes folder. (You may want to click the Refresh button at the top.)

To publish Cloud Storage files to the web:

Step 1

On the GCP console, select all three earthquakes files that you uploaded to the bucket and click on Share publicly.
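The console's Share publicly option can also be approximated from the command line by granting read access to AllUsers with gsutil. A sketch, with a placeholder bucket name:

```shell
BUCKET="your-unique-bucket-name"   # placeholder: the bucket you created above

if command -v gsutil >/dev/null 2>&1; then
  # "acl ch" changes the access control list; -u AllUsers:R grants
  # read access to everyone. The wildcard matches all three files.
  gsutil acl ch -u AllUsers:R "gs://${BUCKET}/earthquakes/earthquakes.*" \
    || echo "sharing failed (check authentication and bucket name)"
else
  echo "gsutil not found; use Share publicly in the console instead"
fi
```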

Step 2

Click on the Public link corresponding to earthquakes.htm

Step 3

What is the URL of the published Cloud Storage file? How does it relate to your bucket name and content?

______________________________________________________

Step 4

What are some advantages of publishing to Cloud Storage?

_____________________________________________

Step 5

You may close the SSH window now.

To delete the Compute Engine instance (since we won't need it any more):

Step 1

On the GCP console, click the Menu (three horizontal bars) and select Compute Engine

Step 2

Click on the checkbox corresponding to the instance that you created (the default name was instance-1)

Step 3

Click on the Delete button in the top-right corner
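If you prefer the command line, the instance can also be deleted with gcloud. A sketch, assuming the default instance name from earlier in the lab and a placeholder zone:

```shell
INSTANCE="instance-1"   # the default name used earlier in this lab
ZONE="us-central1-a"    # placeholder: substitute your instance's zone

if command -v gcloud >/dev/null 2>&1; then
  # --quiet skips the interactive confirmation prompt
  gcloud compute instances delete "$INSTANCE" --zone "$ZONE" --quiet \
    || echo "deletion failed (check the instance name and zone)"
else
  echo "gcloud not found; delete the instance in the console instead"
fi
```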

Step 4

Does deleting the instance have any impact on the files that you stored on Cloud Storage? ___________________

In this lab, you used Google Cloud Platform (GCP) as rented infrastructure. You can spin up a Compute Engine VM, install custom software on it, and run your processing jobs. However, using GCP in this way doesn't take advantage of the other benefits that Google Cloud Platform provides, namely the ability to forget about infrastructure and treat your scientific computation problems simply as software that needs to run.

In the rest of the labs of this quest, you will learn how to take advantage of the separation of data and compute that GCP enables.