In this lab you analyze a large (137 million rows) natality dataset using Google BigQuery and AI Platform Notebooks. This lab is part of a series of labs on processing scientific data.

What you need

To complete this lab, you need:

What you learn

In this lab, you:

This lab illustrates how you can carry out data exploration of large datasets, but continue to use familiar tools like Pandas and Jupyter Notebooks. The "trick" is to do the first part of your aggregation in BigQuery, get back a Pandas dataset and then work with the smaller Pandas dataset locally. Google Cloud provides a managed Jupyter experience, so that you don't need to run notebook servers yourself.

To launch a notebook instance on GCP:

Step 1

Click the Navigation menu and scroll to AI Platform, then select Notebooks.

Step 2

Click New Instance and select TensorFlow 2.x > Without GPUs

Step 3

Once the instance has fully started, click Open JupyterLab to get a new notebook environment.

You will now use BigQuery, a serverless data warehouse, to explore the natality dataset so that we can choose the features for our machine learning model.

To invoke a BigQuery query:

Step 1

In the GCP console, selecting BigQuery from the top-left-corner Navigation Menu icon.

Step 3

In the Query editor textbox, enter the following query:

SELECT
  plurality,
  COUNT(1) AS num_babies,
  AVG(weight_pounds) AS avg_wt
FROM
  publicdata.samples.natality
WHERE
  year > 2000 AND year < 2005
GROUP BY
  plurality

How many triplets were born in the US between 2000 and 2005? ___________

Step 1

Switch back to the JupyterLab window.

Step 2

In JupyterLab, start a new notebook by clicking on the Python 3 icon under the Notebook header.

Step 4

In a cell in the notebook, type the following, then click the Run button (which looks like a play button) and wait until you see a table of data.

query="""
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks
FROM
  publicdata.samples.natality
WHERE year > 2000
"""
from google.cloud import bigquery
df = bigquery.Client().query(query + " LIMIT 100").to_dataframe()
df.head()

Note that we have gotten the results from BigQuery as a Pandas dataframe.

Step 5

In the next cell in the notebook, type the following, then click Run.

def get_distinct_values(column_name):
  sql = """
SELECT
  {0},
  COUNT(1) AS num_babies,
  AVG(weight_pounds) AS avg_wt
FROM
  publicdata.samples.natality
WHERE
  year > 2000
GROUP BY
  {0}
  """.format(column_name)
  return bigquery.Client().query(sql).to_dataframe()

df = get_distinct_values('is_male')
df.plot(x='is_male', y='avg_wt', kind='bar');

Are male babies heavier or lighter than female babies? Did you know this? _______

Is the sex of the baby a good feature to use in our machine learning model? _____

Step 6

In the next cell in the notebook, type the following, then click Run.

df = get_distinct_values('gestation_weeks')
df = df.sort_values('gestation_weeks')
df.plot(x='gestation_weeks', y='avg_wt', kind='bar');

This graph shows the average weight of babies born in the each week of pregancy. The way you'd read the graph is to look at the y-value for x=35 to find out the average weight of a baby born in the 35th week of pregnancy.

Is gestation_weeks a good feature to use in our machine learning model? _____

Is gestation_weeks always available? __________

Compare the variability of birth weight due to sex of baby and due to gestation weeks. Which factor do you think is more important for accurate weight prediction? __________________________________

Step 1

In the AI Platform Notebooks page on the GCP console, select the notebook instance and click DELETE.

In this lab, you learned how to carry out data exploration of large datasets using BigQuery, Pandas, and Juypter. The "trick" is to do the first part of your aggregation in BigQuery, get back a Pandas dataset and then work with the smaller Pandas dataset locally. AI Platform Notebooks provides a managed Jupyter notebooks experience, so that you don't need to run notebook servers yourself.

┬ęGoogle, Inc. or its affiliates. All rights reserved. Do not distribute.