In this workshop, we walk through the process of building a complete machine learning pipeline covering ingest, exploration, training, evaluation, deployment, and prediction. Along the way, we will discuss how to explore and split large data sets correctly using BigQuery and Cloud Datalab. The machine learning model in TensorFlow will be developed on a small sample locally. The preprocessing operations will be implemented in Cloud Dataflow, so that the same preprocessing can be applied in streaming mode as well. The training of the model will then be distributed and scaled out on Cloud ML Engine. The trained model will be deployed as a microservice and predictions invoked from a web application.
This lab consists of 8 parts and will take you about 3 hours. It goes along with an accompanying slide deck.
To complete this lab, you need access to a Google Cloud Platform project.
In this lab, you explore a large dataset with BigQuery and Cloud Datalab, sample it so you can develop a TensorFlow model locally, preprocess the data at scale with Cloud Dataflow, train the model on Cloud ML Engine, deploy it as a microservice, automate the workflow with Kubeflow Pipelines, and consume the predictions from a web application.
This lab illustrates how you can carry out data exploration of large datasets while continuing to use familiar tools like Pandas and Jupyter. The "trick" is to do the first part of your aggregation in BigQuery, get back a Pandas dataframe, and then work with the smaller dataframe locally. Cloud Datalab provides a managed Jupyter experience, so that you don't need to run notebook servers yourself.
To launch Cloud Datalab:
Open Cloud Shell. The Cloud Shell icon is at the top right of the Google Cloud Platform web console.
In Cloud Shell, type:
gcloud compute zones list
Pick a zone in a geographical region close to you.
In Cloud Shell, type:
datalab create babyweight --zone <ZONE>
substituting the zone you chose in Step 2 for "<ZONE>". Datalab will take about 5 minutes to start. Move on to the next step while it continues to set up your Jupyter server.
You will now use BigQuery, a serverless data warehouse, to explore the natality dataset so that we can choose the features for our machine learning model.
To invoke a BigQuery query:
Navigate to the BigQuery console by selecting BigQuery from the top-left-corner ("hamburger") menu.
In the BigQuery Console, click on Compose Query.
In the query textbox, type:
#standardsql
SELECT
  plurality,
  COUNT(1) AS num_babies,
  AVG(weight_pounds) AS avg_wt
FROM
  publicdata.samples.natality
WHERE
  year > 2000 AND year < 2005
GROUP BY
  plurality
How many triplets were born in the US between 2000 and 2005? ___________
Switch back to your Cloud Shell window.
If necessary, wait for Datalab to finish launching. Datalab is ready when you see a message prompting you to do a "Web Preview".
Click on the Web Preview icon on the top-right corner of the Cloud Shell ribbon. Switch to port 8081.
In Datalab, start a new notebook by clicking on the +Notebook icon.
In a cell in Datalab, type the following, then click Run and wait until you see a table of data.
query="""
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks
FROM
  publicdata.samples.natality
WHERE year > 2000
"""
import google.datalab.bigquery as bq
df = bq.Query(query + " LIMIT 100").execute().result().to_dataframe()
df.head()
Note that we have gotten the results from BigQuery as a Pandas dataframe.
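Because the results are now an ordinary dataframe, any further slicing, aggregation, or plotting happens locally. For example (an illustrative aside, not one of the lab's cells):
df.describe()                                    # summary statistics for the numeric columns
df.groupby('plurality')['weight_pounds'].mean()  # average weight by plurality, computed locally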
In the next cell in Datalab, type the following, then click Run
def get_distinct_values(column_name):
  sql = """
SELECT
  {0},
  COUNT(1) AS num_babies,
  AVG(weight_pounds) AS avg_wt
FROM
  publicdata.samples.natality
WHERE year > 2000
GROUP BY
  {0}
""".format(column_name)
  return bq.Query(sql).execute().result().to_dataframe()

df = get_distinct_values('is_male')
df.plot(x='is_male', y='avg_wt', kind='bar');
Are male babies heavier or lighter than female babies? Did you know this? _______
Is the sex of the baby a good feature to use in our machine learning model? _____
In the next cell in Datalab, type the following, then click Run
df = get_distinct_values('gestation_weeks')
df = df.sort_values('gestation_weeks')
df.plot(x='gestation_weeks', y='avg_wt', kind='bar');
This graph shows the average weight of babies born in each week of pregnancy. To read it, look at the y-value at x=35 to find the average weight of a baby born in the 35th week of pregnancy.
Is gestation_weeks a good feature to use in our machine learning model? _____
Is gestation_weeks always available? __________
Compare the variability of birth weight due to sex of baby and due to gestation weeks. Which factor do you think is more important for accurate weight prediction? __________________________________
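If you would rather quantify the comparison than eyeball the two charts, a quick check like the following (illustrative only, reusing the get_distinct_values helper defined above) computes how far the average weight moves for each feature:
# Rough comparison of how much each feature shifts the average weight
for col in ['is_male', 'gestation_weeks']:
  df = get_distinct_values(col)
  spread = df['avg_wt'].max() - df['avg_wt'].min()
  print('{}: average weight varies by about {:.1f} lbs'.format(col, spread))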
In this step, you learned how to carry out data exploration of large datasets using BigQuery, Pandas, and Jupyter. The "trick" is to do the first part of your aggregation in BigQuery, get back a Pandas dataframe, and then work with the smaller dataframe locally. Cloud Datalab provides a managed Jupyter experience, so that you don't need to run notebook servers yourself.
In Datalab:
Click on the Datalab icon to go to the listing page and start a new notebook by clicking on the +Notebook icon.
In the cell in Datalab, type the following, then click Run
%bash
git clone \
   https://github.com/GoogleCloudPlatform/training-data-analyst/
Go back to the listing page in Datalab and navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 2_sample.ipynb
Clear all the cells in the notebook (look for the Clear button on the Datalab toolbar), change the region, project and bucket settings in the first cell, and then Run the cells one by one.
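The first cell of each of these notebooks defines a small set of variables that the rest of the notebook reads. As a rough illustration of what that edit looks like (the values below are placeholders, and the exact variable names should be taken from the notebook's own first cell):
# Placeholder values -- replace with your own project, bucket, and region
PROJECT = 'my-project-id'     # your GCP project id
BUCKET = 'my-project-id-ml'   # a Cloud Storage bucket you own
REGION = 'us-central1'        # region to use for Cloud ML Engine jobs

import os
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION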
In this step, you learned how to use Pandas in Datalab and sample a dataset for local development.
Go back to the listing page in Datalab and navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 3_tensorflow.ipynb
Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.
In this step, you learned how to develop TensorFlow models in Datalab on a small sampled dataset.
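To give a sense of what the notebook builds, here is a simplified sketch (not the notebook's exact model) of a baby-weight regressor using the TensorFlow Estimator API, with feature columns for the features explored earlier:
import tensorflow as tf

# Simplified sketch: feature columns for the explored features feeding a DNN regressor.
# The notebook's actual model and input functions are richer than this.
feature_columns = [
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            'is_male', ['True', 'False', 'Unknown'])),   # vocabulary here is an assumption
    tf.feature_column.numeric_column('mother_age'),
    tf.feature_column.numeric_column('plurality'),
    tf.feature_column.numeric_column('gestation_weeks'),
]

estimator = tf.estimator.DNNRegressor(
    feature_columns=feature_columns,
    hidden_units=[64, 32],
    model_dir='babyweight_trained')
# estimator.train(input_fn=...) and estimator.evaluate(input_fn=...) then follow.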
Go back to the listing page in Datalab and navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 4_preproc.ipynb
Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.
In this step, you learned how to preprocess data at scale for machine learning.
Go back to the listing page in Datalab and navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 5_train.ipynb
Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.
In this step, you learned how to train a large scale model, hyperparameter tune it, and create a model ready to deploy.
Go back to the listing page in Datalab and navigate to: training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 6_deploy.ipynb
Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.
In this step, you learned to deploy a trained model as a microservice and get it to do both online and batch prediction.
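Once deployed, the model can be invoked over REST by any authorized client. The deploy notebook shows the authoritative call; the following is only a hedged Python sketch, in which the project, model, version, and instance fields are placeholders:
from googleapiclient import discovery

# Hedged sketch of an online prediction request to the deployed model.
# Project, model, version, and feature values below are placeholders.
api = discovery.build('ml', 'v1')
body = {'instances': [
    {'is_male': 'True', 'mother_age': 26.0,
     'plurality': 'Single(1)', 'gestation_weeks': 39}
]}
name = 'projects/{}/models/{}/versions/{}'.format('my-project-id', 'babyweight', 'v1')
response = api.projects().predict(body=body, name=name).execute()
print(response)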
In Cloud Shell, type:
git clone \
   https://github.com/GoogleCloudPlatform/training-data-analyst/
cd training-data-analyst/courses/machine_learning/deepdive/06_structured/pipelines
./create_cluster.sh
Edit the following files in your favorite text editor so that the pipeline finishes quickly:
containers/hypertrain/train.sh: change train_examples to 2000
containers/hypertrain/hyperparam.yaml: change maxTrials to 2 and maxParallelTrials to 2
containers/traintuned/: change train_examples to 20000
Navigate to the GKE section of the GCP console and make sure the cluster has been created.
In Cloud Shell, build the containers:
cd containers
./build_all.sh
cd ..
In Cloud Shell, start the UI:
./start_ui.sh
Navigate to https://localhost:8085
Open up a Terminal in the Jupyter environment
Type in:
git clone \
   https://github.com/GoogleCloudPlatform/training-data-analyst/
Go back to the listing page in Jupyter and navigate to: training-data-analyst/courses/machine_learning/deepdive/06_structured/pipelines and click on mlp_babyweight.py
Edit the project name to match your project. It has to be a static string.
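The pipeline definition is compiled into a package before it is uploaded, so the project must appear as a literal string rather than being read from the environment at run time. For example (the variable name here is illustrative; use whatever the file actually defines):
PROJECT = 'my-project-id'  # must be a literal string, not os.environ.get('PROJECT')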
Go back to the listing page in Jupyter and navigate to: training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 7_pipelines.ipynb
Run the notebook to upload the pipeline to Kubeflow pipelines.
Step 13
Go to the Kubeflow pipelines and create a run of the pipeline, changing project and bucket as necessary.
Step 14
Monitor the logs until the run completes (this will take about an hour).
In this step, you learned to deploy the entire training pipeline onto Kubeflow Pipelines and automate the training, hyperparameter tuning, and deployment of the model.
Open Cloud Shell and clone the repository if you have not already done so:
git clone \
   https://github.com/GoogleCloudPlatform/training-data-analyst/
In Cloud Shell, deploy the website application:
cd training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
./deploy.sh
In a browser, visit https://<PROJECT>.appspot.com/ and try out the application.
In Cloud Shell, call a Java program that invokes the web service:
cd ~/training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
./run_once.sh
In Cloud Shell, call a Dataflow pipeline that invokes the web service on a text file:
cd ~/training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
./run_ontext.sh
The code will also work in real time, reading from Pub/Sub and writing to BigQuery:
cd ~/training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
cat ./run_dataflow.sh
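The real-time shape is the standard streaming Beam topology: read messages from Pub/Sub, attach a prediction to each one, and stream the results into BigQuery. The script above contains the lab's own implementation; the following Python sketch only illustrates that shape, with the topic, table, and predict_weight() helper as placeholders:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder prediction step -- in the lab, this is where the deployed model is invoked.
def predict_weight(record):
    return record

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | 'read'    >> beam.io.ReadFromPubSub(topic='projects/my-project-id/topics/babies')
     | 'parse'   >> beam.Map(lambda msg: {'instance': msg.decode('utf-8')})
     | 'predict' >> beam.Map(predict_weight)
     | 'write'   >> beam.io.WriteToBigQuery('my-project-id:mydataset.predictions',
                                            schema='instance:STRING'))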
In this step, you deployed an App Engine web application that consumes the machine learning service. You also looked at how to consume the ML predictions from Dataflow, both in batch mode and in real time.
Step 1
Click on the person icon in the top-right corner of your Datalab window and click on the link to manage the VM.
Step 2
In the web console, select the Datalab VM and click DELETE.
Note: Cloud Datalab creates a persistent disk to store your notebooks. These notebooks will be available the next time you start Datalab. Datalab also creates a firewall rule for tunneling. If you will not use Datalab again, use the web console to delete the disk and the firewall rule.
©Google, Inc. or its affiliates. All rights reserved. Do not distribute.