In this lab you analyze a large (137 million rows) natality dataset using Google BigQuery and Cloud Datalab. This lab is part of a series of labs on processing scientific data.

What you need

To complete this lab, you need:

What you learn

In this lab, you:

This lab illustrates how you can carry out data exploration of large datasets, but continue to use familiar tools like Pandas and Juypter. The "trick" is to do the first part of your aggregation in BigQuery, get back a Pandas dataset and then work with the smaller Pandas dataset locally. Cloud Datalab provides a managed Jupyter experience, so that you don't need to run notebook servers yourself.

To launch Cloud Datalab:

Step 1

Open Cloud Shell. The cloud shell icon is at the top right of the Google Cloud Platform web console:

Step 2

In Cloud Shell, type:

gcloud compute zones list

Pick a zone in a geographically closeby region.

Step 2

In Cloud Shell, type:

datalab create babyweight --zone <ZONE>

Datalab will take about 5 minutes to start. Move on to the next step.

To invoke a BigQuery query:

Step 1

Navigate to the BigQuery console by selecting BigQuery from the top-left-corner ("hamburger") menu.

Step 2

In the BigQuery Console, click on Compose Query. Then, select Show Options and ensure that the Legacy SQL menu is NOT checked (we will be using Standard SQL).

Step 3

In the query textbox, type:

SELECT
  plurality,
  COUNT(1) AS num_babies,
  AVG(weight_pounds) AS avg_wt
FROM
  publicdata.samples.natality
WHERE
  year > 2000 AND year < 2005
GROUP BY
  plurality

How many triplets were born in the US between 2000 and 2005? ___________

Switch back to your Cloud Shell window.

Step 1

If necessary, wait for Datalab to finish launching. Datalab is ready when you see a message prompting you to do a "Web Preview".

Step 2

Click on the Web Preview icon on the top-right corner of the Cloud Shell ribbon. Switch to port 8081.

Step 3

In Datalab, start a new notebook by clicking on the +Notebook icon.

Step 4

In a cell in Datalab, type the following, then click Run

query="""
SELECT
  weight_pounds,
  is_male,
  mother_age,
  mother_race,
  plurality,
  gestation_weeks,
  mother_married,
  ever_born,
  cigarette_use,
  alcohol_use,
  FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE year > 2000
"""
import google.datalab.bigquery as bq
df = bq.Query(query + " LIMIT 100").execute().result().to_dataframe()
df.head()

Note that we have gotten the results from BigQuery as a Pandas dataframe.

Step 4

In the next cell in Datalab, type the following, then click Run

def get_distinct_values(column_name):
  sql = """
SELECT
  {0},
  COUNT(1) AS num_babies,
  AVG(weight_pounds) AS avg_wt
FROM
  publicdata.samples.natality
WHERE
  year > 2000
GROUP BY
  {0}
  """.format(column_name)
  return bq.Query(sql).execute().result().to_dataframe()

df = get_distinct_values('is_male')
df.plot(x='is_male', y='avg_wt', kind='bar');

Are male babies heavier or lighter than female babies? Did you know this?

Step 5

In the next cell in Datalab, type the following, then click Run

df = get_distinct_values('gestation_weeks')
df = df.sort_values('gestation_weeks')
df.plot(x='gestation_weeks', y='avg_wt', kind='bar');

Step 1

Click on the person icon in the top-right corner of your Datalab window and click on the link to manage the VM.

Step 2

In the web console, select the Datalab VM and click DELETE.

In this lab, you learned how to carry out data exploration of large datasets using BigQuery, Pandas, and Juypter. The "trick" is to do the first part of your aggregation in BigQuery, get back a Pandas dataset and then work with the smaller Pandas dataset locally. Cloud Datalab provides a managed Jupyter experience, so that you don't need to run notebook servers yourself.

┬ęGoogle, Inc. or its affiliates. All rights reserved. Do not distribute.