In this lab, you will create a Dataproc cluster that includes Datalab and the Google Python Client API. You will then create IPython notebooks that integrate with BigQuery and Cloud Storage and utilize Spark.

What you learn

In this lab, you:

- Create a custom initialization action that installs a Python package on a Dataproc cluster
- Create a Dataproc cluster with Datalab installed using a pre-built initialization action
- Create Datalab notebooks that integrate with BigQuery, Cloud Storage, and Spark

Additional software can be added to Dataproc clusters, and clusters can be customized using initialization actions. Initialization actions are simply executables that are run when the cluster is being created.
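
To make this concrete, here is a minimal sketch of the shape an initialization action typically takes. It is not the script used in this lab, and the package name is only an illustration; the point is that the script runs on every node, and it can read the node's dataproc-role metadata to do extra work on the master only.

#!/bin/bash
# Minimal illustrative initialization action (not the one used in this lab).
set -e

# Runs on every node in the cluster, e.g. to install a package.
pip install --upgrade some-python-package   # hypothetical package name

# Dataproc exposes each node's role through instance metadata.
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)

if [[ "${ROLE}" == 'Master' ]]; then
  # Work that should happen only on the master node, such as cloning a repository.
  echo "Running master-only setup"
fi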

You will use a pre-built initialization action to install Datalab and a custom one to install the Google Client Python API.

Datalab allows you to write interactive Python and PySpark notebooks that are useful in data analysis. You will create a couple of notebooks in this exercise that make use of your Dataproc cluster and also integrate with Google BigQuery and Google Cloud Storage.

You will create a custom initialization action to install a Python package.

Step 1

In your Cloud Console, open a Google Cloud Shell and enter the following command to create a Cloud Storage bucket with the same name as your project ID:

gsutil mb -c regional -l us-central1 gs://$DEVSHELL_PROJECT_ID
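
If you want to confirm that the bucket was created, you can list it from Cloud Shell:

gsutil ls gs://$DEVSHELL_PROJECT_ID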

Step 2

In your Cloud Shell, git clone the course repository, and upload the custom initialization script to GCS. Change the bucket name as necessary.

git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd training-data-analyst/courses/unstructured/
bash replace_and_upload.sh <YOUR-BUCKET-NAME>

Step 3

View the custom initialization script. Change the bucket name as necessary.

gsutil cat gs://<YOUR-BUCKET-NAME>/unstructured/init-script.sh

What does this initialization action do on all nodes? What does it do only on the master node?

____________________________________________________________________________________

You will create a cluster that will include two initialization actions: (1) a pre-built action from Google to install Datalab, and (2) a custom initialization action to install a Python package.

Step 1

Use the Products and Services menu to navigate to the Dataproc service. If prompted, click Enable API. If you have any clusters currently running, you can delete them.

Step 2

Click the Create cluster button and set the following parameters.

Click on the link shown below to expand more options.

Copy and paste the following script URL into the Initialization actions text box and press Enter. (This script installs Google Cloud Datalab on your cluster's master node.)

gs://dataproc-initialization-actions/datalab/datalab.sh

Copy and paste this second initialization action into the Initialization actions text box and press Enter. Change the bucket name appropriately. (This script installs the Google Python Client API on all the machines in the cluster and clones the course repository to the Master node, so that Datalab will have access to the notebooks that are in the repository.)

gs://<YOUR-BUCKET-NAME>/unstructured/init-script.sh

Check the Project access box as shown below to allow your cluster to access other Google Cloud Platform services.

Step 3

To create the cluster, either click the Create button, or click the command line link, copy the command to your clipboard, and run it from Google Cloud Shell.
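
If you use the command line, the generated command will look something like the sketch below. The cluster name and zone here are placeholders, and the command the console generates for you is the authoritative version.

gcloud dataproc clusters create my-cluster \
    --zone us-central1-a \
    --scopes cloud-platform \
    --initialization-actions gs://dataproc-initialization-actions/datalab/datalab.sh,gs://<YOUR-BUCKET-NAME>/unstructured/init-script.sh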

Step 4

It will take a little longer for your cluster to be created this time because the initialization scripts have to run. While you are waiting, browse to the following GitHub repository, where you will find many other initialization actions that have been written for you.

https://github.com/GoogleCloudPlatform/dataproc-initialization-actions

Step 5

You are going to allow access to your Dataproc cluster, but only from your machine. To do this, you need to know your machine's public IP address. Go to the following URL to find it:

http://ip4.me/
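
Alternatively, you can look up your public IPv4 address from a terminal on your local machine (not from Cloud Shell, which has its own IP address), for example with a service such as ifconfig.me:

# Run this on your local machine; -4 forces an IPv4 result.
curl -4 ifconfig.me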

Step 6

In the Cloud Console, click the menu on the left and select Networking from the Compute section. Click Firewall rules in the left-hand navigation pane. Click the Create Firewall Rule button. Give the rule a name, set Source IP ranges to the IP address you found in the previous step followed by /32, and enter the following in the Protocols and ports field:

tcp:8088;tcp:9870;tcp:8080
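
If you prefer the command line, an equivalent rule can be created from Cloud Shell. This is a sketch: the rule name is a placeholder, and <YOUR-IP> is the address you found in the previous step. Note that gcloud separates ports with commas rather than semicolons.

gcloud compute firewall-rules create allow-dataproc-ui \
    --allow tcp:8088,tcp:9870,tcp:8080 \
    --source-ranges <YOUR-IP>/32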

Step 7

When your cluster is finished initializing, click on its name to go to its details page, then click on the VM Instances tab, and finally click on the master node to view its details.

Scroll down and find the master node's external IP address and copy it to your clipboard.

Open a new browser tab, paste in this IP address, and add :8080 after it. This opens Datalab, and you will see the Datalab main screen as shown below:

Let's just create a simple Python Notebook and make sure everything is working.

Step 1

On the left side of the Datalab home page click the + Notebook button.

Step 2

In the first cell, just enter the following Python code.

temp = 212.0

def toCelsius(fahrenheit):
    return (fahrenheit - 32) * 5.0 / 9.0

print(toCelsius(temp))

Step 3

Click the Run button in the toolbar and examine the results. It should look as shown below. (It might take a little while for the notebook to start.)

The Python package Pandas comes with support to run BigQuery queries.

Step 1

In the second code block add the following code and click Run. These import statements will allow you to run a BigQuery query.

import pandas as pd
from pandas.io import gbq

print("Imports run.")

Step 2

In the next code block, add the following code, changing the projectId variable to your project ID.

(You can find your project id in the Google Cloud Platform Web Console. Select Home from the Cloud Console menu.)

projectId = "YOUR-PROJECT-ID-HERE" # CHANGE
sql = """
SELECT
  year,
  AVG(weight_pounds) AS avg_weight
FROM
  publicdata.samples.natality
GROUP BY
  year
ORDER BY
  year ASC
"""

print('Running query...')
data = gbq.read_gbq(sql, project_id=projectId)

data[:5]

Click the Run button. The BigQuery query is run and the results are put into a Pandas DataFrame. The last line just outputs the first 5 records. The results are shown below.

Step 3

In the next code block, add the following code to plot a graph using Pandas.

data.plot(x='year', y='avg_weight');

You should get a graph that looks like this:

Step 4

In the Datalab menu bar, select Notebook | Rename. Name the notebook BigQuery-Test and then click OK. You can then close that tab and return to the Datalab Home page.

Step 5

Back at the Datalab home page, in the upper-right corner of the toolbar, there are 4 icons. Hover over the second one (the one that looks like a stack of progress bars); the resulting tooltip should read Running Sessions. Click on that icon.

On the resulting page you should see one active notebook, the BigQuery-Test notebook you just created.

Click the Shutdown button on the right side and then close this tab.

The last notebook didn't run anything in parallel on your Dataproc cluster. This time, let's get a notebook from the GitHub repository and execute it. This notebook uses PySpark and makes use of your Spark cluster.

Step 1

Back at the Datalab home page, in the upper-right corner of the toolbar, there are 4 icons. Hover over the first one (the one that looks like a fork in the road); the resulting tooltip should read Open ungit. Click on that icon.

Step 2

Fill out the form to clone the github repository corresponding to the course:

https://github.com/GoogleCloudPlatform/training-data-analyst

Then click on Clone repository.

Step 3

Back on the Datalab home page click the Home icon and navigate to datalab/notebooks/training-data-analyst/courses/unstructured. Click on PySpark-Test-Solution.ipynb to open that notebook.

Step 4

In the notebook, click on Clear | All Cells. Now execute each cell in turn, making sure to change any occurrences of BUCKET_NAME to the name of your bucket.

Step 5

You will want to stop this notebook as you did the previous one. Click the Running Sessions link on the right side of the toolbar. Then, click the Shutdown button to the left of the PySpark-Test-Solution notebook.

Close this tab and return to the Datalab home page.

There's no need to keep any clusters.

Step 1

Close the Datalab tab.

Step 2

Navigate to the Dataproc service using the Web Console. Delete any clusters that you created in this exercise.
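
If you created the cluster from the command line, you can also delete it from Cloud Shell. A minimal sketch, assuming a cluster named my-cluster in the us-central1 region:

gcloud dataproc clusters delete my-cluster --region us-central1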