In this lab you spin up a virtual machine, configure its security, access it remotely, and then carry out the steps of an ingest-transform-and-publish data pipeline manually.

What you need

You must have completed Lab 0 and have access to the Google Cloud Platform project that you set up in that lab.

What you learn

In this lab, you:

- Spin up a Compute Engine virtual machine and configure its security
- Access the virtual machine remotely over SSH
- Manually carry out the steps of an ingest-transform-and-publish data pipeline

In this lab, you spin up a virtual machine, install software on it, and use it to do scientific data processing. We do not recommend that you work with Compute Engine instances at such a low level, but you can!

In this lab, you will use Google Cloud Platform in a manner similar to the way you likely use clusters today. Spinning up a virtual machine and running your jobs on it is the closest you can get to treating the public cloud as simply rented infrastructure. It doesn't take advantage of the other benefits that Google Cloud Platform provides, namely the ability to forget about infrastructure and work with your scientific computation problems simply as software that needs to be run.

You will ingest real-time earthquake data published by the United States Geological Survey (USGS) and use it to create a map of recent earthquake activity.

To create a Compute Engine instance:

Step 1

Browse to https://cloud.google.com/

Step 2

Click on Console.

Step 3

Click on the Menu (three horizontal lines).

Step 4

Select Compute Engine.

Step 5

Click Create Instance and wait for the form to load. You will need to change some of the default options on this form.

Step 6

Change Identity and API access for the Compute Engine default service account to Allow full access to all Cloud APIs.

Step 7

Now, click Create.
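If you prefer to work from the command line, an equivalent instance can be created with the gcloud tool (for example, from Cloud Shell). This is a sketch, not the lab's prescribed path; the instance name and zone are placeholders:

gcloud compute instances create instance-1 \
    --zone=us-central1-a \
    --scopes=cloud-platform

The --scopes=cloud-platform flag corresponds to the Allow full access to all Cloud APIs setting on the form.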

You can remotely access your Compute Engine instance using Secure Shell (SSH):

Step 1

Click on SSH.
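If you would rather connect from a terminal that has the Cloud SDK installed, the same connection can be made with gcloud (the instance name and zone below are placeholders):

gcloud compute ssh instance-1 --zone=us-central1-a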

Step 2

To find some information about the Compute Engine instance, type the following at the command line:

cat /proc/cpuinfo
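If you are curious, standard Linux commands can also be used to inspect the instance's memory and disk; for example:

free -h
df -h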

Step 1

Type the following at the command line:

sudo apt-get update
sudo apt-get -y -qq install git

Step 2

Verify that git is now installed:

git --version

Step 1

On the command-line, type:

gcloud source repos clone asl-ml-immersion --project=asl-ml-immersion

This clones the code repo.

Step 2

Navigate to the folder corresponding to this lab:

cd asl-ml-immersion/1.1_welcome_soln/

Step 3

Examine the ingest code using less:

less ingest.sh

The less command allows you to view the file (press the spacebar to scroll down, b to back up a page, and q to quit).

The program ingest.sh downloads a dataset of earthquakes in the past 7 days from the US Geological Survey. Where is this file downloaded? To disk or to Cloud Storage? __________________________
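For reference, the heart of such an ingest script is typically a single download of the USGS weekly CSV feed. The following is a sketch, not necessarily the exact contents of ingest.sh; check the script itself for the URL and flags it actually uses:

wget https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.csv -O earthquakes.csv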

Step 4

Run the ingest code:

bash ingest.sh

Step 5

Verify that some data has been downloaded:

head earthquakes.csv

The head command shows you the first few lines of the file.
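If you also want a rough record count, wc -l prints the number of lines in the file (including the header row):

wc -l earthquakes.csv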

You will use a Python program to transform the raw data into a map of earthquake activity:

Step 1

The transformation code is explained in detail in this notebook:

https://github.com/GoogleCloudPlatform/datalab-samples/blob/master/basemap/earthquakes.ipynb

Feel free to read the narrative to understand what the transformation code does. The notebook itself was written in Datalab, a GCP product that you will use later in this set of labs.

Step 2

First, install the necessary Python packages on the Compute Engine instance:

bash install_missing.sh

Step 3

Then, run the transformation code:

python transform.py

Step 4

You will notice a new image file if you list the contents of the directory:

ls -l

Create a bucket using the GCP console:

Step 1

Browse to the GCP Console by visiting http://cloud.google.com and clicking on Go To Console.

Step 2

Click on the Menu (three horizontal lines) at the top-left and select Storage.

Step 3

Click on Create Bucket.

Step 4

Choose a globally unique bucket name (your project name is unique, so you could use that). You can leave it as Multi-Regional, or improve speed and reduce costs by making it Regional (choose the same region as your Compute Engine instance). Then, click Create.
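Equivalently, the bucket can be created from the SSH window with gsutil; this is a sketch, and the region below is only an example (match the region of your Compute Engine instance):

gsutil mb -l us-central1 gs://<YOUR-BUCKET>/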

Step 5

Note down the name of your bucket: _______________________________

In this and future labs, you will insert this whenever the directions ask for <YOUR-BUCKET>.

To store the original and transformed data in Cloud Storage:

Step 1

In the SSH window of the Compute Engine instance, type:

gsutil cp earthquakes.* gs://<YOUR-BUCKET>/earthquakes/

This copies the files to Cloud Storage.
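Before switching to the console, you can verify the copy from the command line:

gsutil ls -l gs://<YOUR-BUCKET>/earthquakes/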

Step 2

On the GCP console, click on your bucket name, and notice there are three new files present in the earthquakes folder.

To publish Cloud Storage files to the web:

Step 1

On the GCP console, select all three earthquakes files that you uploaded to the bucket and click on Share publicly
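If you prefer the command line, a similar effect can be achieved with gsutil; this is a sketch, assuming your files live under the earthquakes folder of your bucket:

gsutil acl ch -u AllUsers:R gs://<YOUR-BUCKET>/earthquakes/*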

Step 2

Click on the Public link corresponding to earthquakes.htm

Step 3

What is the URL of the published Cloud Storage file? How does it relate to your bucket name and content?

______________________________________________________

Step 4

What are some advantages of publishing to Cloud Storage? _____________________________________________

To delete the Compute Engine instance (since we won't need it any more):

Step 1

On the GCP console, click the Menu (three horizontal lines) and select Compute Engine.

Step 2

Click on the checkbox corresponding to the instance that you created (the default name was instance-1)

Step 3

Click on the Delete button in the top-right corner
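The same deletion can also be done from the command line; the instance name and zone below are placeholders:

gcloud compute instances delete instance-1 --zone=us-central1-a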

Step 4

Does deleting the instance have any impact on the files that you stored on Cloud Storage? ___________________

In this lab, you used Google Cloud Platform (GCP) as rented infrastructure. You can spin up a Compute Engine VM, install custom software on it, and run your processing jobs. However, using GCP in this way doesn't take advantage of the other benefits that Google Cloud Platform provides, namely the ability to forget about infrastructure and work with your scientific computation problems simply as software that needs to be run.

©Google, Inc. or its affiliates. All rights reserved. Do not distribute.