In this workshop, we walk through the process of building a complete machine learning pipeline, covering ingest, exploration, training, evaluation, deployment, and prediction. Along the way, we will discuss how to explore and split large datasets correctly using BigQuery and Cloud Datalab. The TensorFlow machine learning model will be developed locally on a small sample. The preprocessing operations will be implemented in Cloud Dataflow, so that the same preprocessing can also be applied in streaming mode. Training of the model will then be distributed and scaled out on Cloud ML Engine. Finally, the trained model will be deployed as a microservice, and predictions will be invoked from a web application.

This lab consists of 7 parts and will take you about 3 hours. It goes along with this slide deck.

What you need

To complete this lab, you need:

- A Google Cloud Platform account and a project with billing enabled, so that you can use Cloud Shell, BigQuery, Cloud Datalab, Cloud Dataflow, Cloud ML Engine, and App Engine

What you learn

In this lab, you:

- Explore a large dataset using BigQuery and Cloud Datalab
- Sample the dataset and develop a machine learning model in TensorFlow locally
- Preprocess the data at scale using Cloud Dataflow
- Train and hyperparameter-tune the model on Cloud ML Engine
- Deploy the trained model as a microservice
- Invoke predictions from a web application and from batch and streaming pipelines

This lab illustrates how you can carry out data exploration of large datasets, but continue to use familiar tools like Pandas and Jupyter. The "trick" is to do the first part of your aggregation in BigQuery, get back a Pandas dataframe, and then work with the smaller dataframe locally. Cloud Datalab provides a managed Jupyter experience, so that you don't need to run notebook servers yourself.
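As a minimal sketch of this pattern (the aggregation query here is purely illustrative; later steps use the real ones), using the same google.datalab.bigquery module that the notebooks below use:

import google.datalab.bigquery as bq

# The GROUP BY runs inside BigQuery over the full dataset; only the
# small per-year summary comes back as a Pandas dataframe.
sql = """
SELECT
  year,
  COUNT(1) AS num_babies
FROM
  publicdata.samples.natality
GROUP BY
  year
"""
df = bq.Query(sql).execute().result().to_dataframe()
df.plot(x='year', y='num_babies', kind='bar');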

Launch Cloud Datalab

To launch Cloud Datalab:

Step 1

Open Cloud Shell. The Cloud Shell icon is at the top right of the Google Cloud Platform web console:

Step 2

In Cloud Shell, type:

gcloud compute zones list

Pick a zone in a geographical region close to you.

Step 3

In Cloud Shell, type:

datalab create babyweight --zone <ZONE>

substituting the zone you chose in Step 2 for "<ZONE>". Datalab will take about 5 minutes to start. Move on to the next step while it continues working to set up your Jupyter server.

Invoke BigQuery

You will now use BigQuery, a serverless data warehouse, to explore the natality dataset so that you can choose the features for your machine learning model.

To invoke a BigQuery query:

Step 1

Navigate to the BigQuery console by selecting BigQuery from the top-left-corner ("hamburger") menu.

Step 2

In the BigQuery Console, click on Compose Query.

Step 3

In the query textbox, type:

#standardsql
SELECT
  plurality,
  COUNT(1) AS num_babies,
  AVG(weight_pounds) AS avg_wt
FROM
  publicdata.samples.natality
WHERE
  year > 2000 AND year < 2005
GROUP BY
  plurality

How many triplets were born in the US between 2000 and 2005? ___________

Draw graphs in Cloud Datalab

Switch back to your Cloud Shell window.

Step 1

If necessary, wait for Datalab to finish launching. Datalab is ready when you see a message prompting you to do a "Web Preview".

Step 2

Click on the Web Preview icon on the top-right corner of the Cloud Shell ribbon. Switch to port 8081.

Step 3

In Datalab, start a new notebook by clicking on the +Notebook icon.

Step 4

In a cell in Datalab, type the following, then click Run and wait until you see a table of data.

query="""
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks
FROM
  publicdata.samples.natality
WHERE year > 2000
"""
import google.datalab.bigquery as bq
# Execute the query (with a LIMIT for this quick preview) and fetch
# the result as a Pandas dataframe.
df = bq.Query(query + " LIMIT 100").execute().result().to_dataframe()
df.head()

Note that the results from BigQuery have come back as a Pandas dataframe.
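Since df is an ordinary dataframe, the rest of the Pandas toolkit applies. For example, to see summary statistics of the 100 sampled rows:

df.describe()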

Step 5

In the next cell in Datalab, type the following, then click Run.

def get_distinct_values(column_name):
  # Group by the given column inside BigQuery and return the small
  # summary (count and average weight per distinct value) as a dataframe.
  sql = """
SELECT
  {0},
  COUNT(1) AS num_babies,
  AVG(weight_pounds) AS avg_wt
FROM
  publicdata.samples.natality
WHERE
  year > 2000
GROUP BY
  {0}
  """.format(column_name)
  return bq.Query(sql).execute().result().to_dataframe()

df = get_distinct_values('is_male')
df.plot(x='is_male', y='avg_wt', kind='bar');

Are male babies heavier or lighter than female babies? Did you know this? _______

Is the sex of the baby a good feature to use in our machine learning model? _____

Step 6

In the next cell in Datalab, type the following, then click Run.

df = get_distinct_values('gestation_weeks')
df = df.sort_values('gestation_weeks')
df.plot(x='gestation_weeks', y='avg_wt', kind='bar');

This graph shows the average weight of babies born in each week of pregnancy. The way you'd read the graph is to look at the y-value for x=35 to find the average weight of a baby born in the 35th week of pregnancy.
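You can also read the same value directly off the dataframe instead of the graph; for example, for babies born in week 35:

df.loc[df['gestation_weeks'] == 35, 'avg_wt']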

Is gestation_weeks a good feature to use in our machine learning model? _____

Is gestation_weeks always available? __________

Compare the variability of birth weight due to sex of baby and due to gestation weeks. Which factor do you think is more important for accurate weight prediction? __________________________________

Summary

In this step, you learned how to carry out data exploration of large datasets using BigQuery, Pandas, and Jupyter. The "trick" is to do the first part of your aggregation in BigQuery, get back a Pandas dataframe, and then work with the smaller dataframe locally. Cloud Datalab provides a managed Jupyter experience, so that you don't need to run notebook servers yourself.

Clone repository

In Datalab:

Step 1

Click on the Datalab icon to go to the listing page and start a new notebook by clicking on the +Notebook icon.

Step 2

In the cell in Datalab, type the following, then click Run.

%bash
git clone \
    https://github.com/GoogleCloudPlatform/training-data-analyst/

Run notebook

Step 1

Go back to the listing page in Datalab and navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 2_sample.ipynb

Step 2

Clear all the cells in the notebook (look for the Clear button on the Datalab toolbar), change the region, project and bucket settings in the first cell, and then Run the cells one by one.

Summary

In this step, you learned how to use Pandas in Datalab and sample a dataset for local development.
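A common way to sample a BigQuery dataset repeatably (a sketch; the notebook's exact query may differ) is to hash a field with FARM_FINGERPRINT and keep only a slice of the hash values, so that you draw the same sample on every run:

import google.datalab.bigquery as bq

# Keep roughly 1% of the data by hashing the year-month; hashing
# (unlike RAND()) yields the same sample every time the query runs.
sql = """
SELECT
  weight_pounds, is_male, mother_age, plurality, gestation_weeks
FROM
  publicdata.samples.natality
WHERE
  year > 2000
  AND ABS(MOD(FARM_FINGERPRINT(CONCAT(CAST(year AS STRING), CAST(month AS STRING))), 100)) < 1
"""
df = bq.Query(sql).execute().result().to_dataframe()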

Develop the TensorFlow model

Step 1

Go back to the listing page in Datalab and navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 3_tensorflow.ipynb

Step 2

Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.

Summary

In this step, you learned how to develop TensorFlow models in Datalab on a small sampled dataset.
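As a flavor of what that modeling code involves, here is a minimal sketch (not the notebook's actual code) using the tf.estimator API; it assumes df is a Pandas dataframe holding the sampled columns from the earlier steps:

import tensorflow as tf

df = df.dropna()  # the natality data contains missing values
features = ['mother_age', 'gestation_weeks']
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=df[features], y=df['weight_pounds'], num_epochs=10, shuffle=True)
model = tf.estimator.DNNRegressor(
    hidden_units=[64, 32],
    feature_columns=[tf.feature_column.numeric_column(f) for f in features])
model.train(input_fn=train_input_fn)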

Preprocess data at scale

Step 1

Go back to the listing page in Datalab and navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 4_preproc.ipynb

Step 2

Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.

Summary

In this step, you learned how to preprocess data at scale for machine learning.
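Cloud Dataflow pipelines are written with Apache Beam. A minimal sketch of the pattern (not the notebook's actual code, and with made-up file names) looks like this; the same pipeline runs at scale on Cloud Dataflow simply by changing the runner:

import apache_beam as beam

# Read records, keep only well-formed lines, and write the result.
with beam.Pipeline('DirectRunner') as p:
    (p
     | 'read'  >> beam.io.ReadFromText('natality.csv')   # hypothetical input
     | 'valid' >> beam.Filter(lambda line: len(line.split(',')) == 5)
     | 'write' >> beam.io.WriteToText('preprocessed'))   # hypothetical output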

Train on Cloud ML Engine

Step 1

Go back to the listing page in Datalab and navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 5_train.ipynb

Step 2

Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.

Summary

In this step, you learned how to train a large scale model, hyperparameter tune it, and create a model ready to deploy.
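Building on the earlier modeling sketch (reusing its df, features, and model), distributed training is typically driven by tf.estimator.train_and_evaluate, which is what lets the same code scale out across workers on Cloud ML Engine. Again, this is a sketch, not the notebook's actual code:

import tensorflow as tf

# For a real job, the eval input would be a held-out split, not the
# training data reused as here.
train_spec = tf.estimator.TrainSpec(
    input_fn=tf.estimator.inputs.pandas_input_fn(
        x=df[features], y=df['weight_pounds'], num_epochs=None, shuffle=True),
    max_steps=1000)
eval_spec = tf.estimator.EvalSpec(
    input_fn=tf.estimator.inputs.pandas_input_fn(
        x=df[features], y=df['weight_pounds'], shuffle=False))
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)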

Deploy the model

Step 1

Go back to the listing page in Datalab and navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 6_deploy.ipynb

Step 2

Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.

Summary

In this step, you learned how to deploy a trained model as a microservice and use it for both online and batch prediction.
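For reference, here is a sketch of invoking the deployed model for online prediction from Python. The model name and input fields below are assumptions based on this lab, so match them to what you actually deployed:

from oauth2client.client import GoogleCredentials
from googleapiclient import discovery

# Build a client for the Cloud ML Engine v1 API using the
# application-default credentials.
credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials)

# Assumed model name and input fields; substitute your own.
parent = 'projects/PROJECT/models/babyweight'
request_data = {'instances': [
    {'is_male': 'True',
     'mother_age': 26.0,
     'plurality': 'Single(1)',
     'gestation_weeks': 39}]}
response = api.projects().predict(body=request_data, name=parent).execute()
print(response)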

Deploy the web application

Step 1

Open Cloud Shell and, if you have not already done so, clone the repository:

git clone \
    https://github.com/GoogleCloudPlatform/training-data-analyst/

Step 2

In Cloud Shell, deploy the website application:

cd training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
./deploy.sh

Step 3

In a browser, visit https://<PROJECT>.appspot.com/, substituting your project ID for "<PROJECT>", and try out the application.

Step 4

In Cloud Shell, call a Java program that invokes the web service:

cd ~/training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
./run_once.sh

Step 5

In Cloud Shell, call a Dataflow pipeline that invokes the web service on a text file:

cd ~/training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
./run_ontext.sh

The code will also work in real time, reading from Cloud Pub/Sub and writing to BigQuery:

cd ~/training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
cat ./run_dataflow.sh
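The repository's streaming pipeline (invoked via run_dataflow.sh) is written in Java. Purely as an illustration of its shape, here is a sketch in the Beam Python SDK; the topic and table names are made up, and the per-message call to the prediction service that the real pipeline makes is omitted:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# Streaming shape only: read messages from Pub/Sub and append them
# to a BigQuery table.
opts = PipelineOptions()
opts.view_as(StandardOptions).streaming = True
with beam.Pipeline(options=opts) as p:
    (p
     | 'read'  >> beam.io.ReadFromPubSub(topic='projects/PROJECT/topics/babies')
     | 'parse' >> beam.Map(lambda msg: {'input': msg.decode('utf-8')})
     | 'write' >> beam.io.WriteToBigQuery(
           'PROJECT:babyweight.predictions',
           schema='input:STRING',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))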

Summary

In this step, you deployed an App Engine web application that consumes the machine learning service. You also looked at how to consume the ML predictions from Dataflow, both in batch mode and in real time.

Clean up

Step 1

Click on the person icon in the top-right corner of your Datalab window and click on the link to manage the VM.

Step 2

In the web console, select the Datalab VM and click DELETE.

Note: Cloud Datalab creates a persistent disk to store your notebooks. These notebooks will be available the next time you start Datalab. Datalab also creates a firewall rule for tunneling. Use the web console to delete the disk and the firewall rule if you will not use Datalab again.

©Google, Inc. or its affiliates. All rights reserved. Do not distribute.