In this lab you analyze a large (137 million rows) natality dataset using Google BigQuery and Cloud Datalab. This lab is part of a series of labs on processing scientific data.
To complete this lab, you need:
In this lab, you:
- Launch Cloud Datalab
- Invoke a BigQuery query
- Create graphs in Cloud Datalab
This lab illustrates how you can carry out data exploration of large datasets while continuing to use familiar tools like Pandas and Jupyter. The "trick" is to do the first part of your aggregation in BigQuery, get back a small Pandas DataFrame, and then work with that smaller DataFrame locally. Cloud Datalab provides a managed Jupyter experience, so you don't need to run notebook servers yourself.
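The overall workflow is previewed in the minimal sketch below; the exact cells you will run appear later in the lab, and this sketch only shows the shape of the pattern using the same public natality table:
import google.datalab.bigquery as bq

# Aggregate the 137-million-row table inside BigQuery; only a few summary rows come back.
sql = """
SELECT plurality, AVG(weight_pounds) AS avg_wt
FROM publicdata.samples.natality
GROUP BY plurality
"""
df = bq.Query(sql).execute().result().to_dataframe()  # small Pandas DataFrame

# Everything from here on is ordinary, local Pandas work.
df.plot(x='plurality', y='avg_wt', kind='bar');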
To launch Cloud Datalab:
Open Cloud Shell. The cloud shell icon is at the top right of the Google Cloud Platform web console:
In Cloud Shell, type:
gcloud compute zones list
Pick a zone in a region that is geographically close to you.
In Cloud Shell, type:
datalab create babyweight --zone <ZONE>
Datalab will take about 5 minutes to start. Move on to the next step.
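If your Cloud Shell session disconnects before you open the Datalab web preview, you can reattach to the instance later from Cloud Shell (an optional recovery step; the command assumes the instance name used above):
datalab connect babyweight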
To invoke a BigQuery query:
Navigate to the BigQuery console by selecting BigQuery from the top-left-corner ("hamburger") menu.
In the BigQuery Console, click on Compose Query. Then, select Show Options and ensure that the Legacy SQL checkbox is NOT checked (we will be using Standard SQL).
In the query textbox, type:
SELECT
  plurality,
  COUNT(1) AS num_babies,
  AVG(weight_pounds) AS avg_wt
FROM
  publicdata.samples.natality
WHERE
  year > 2000 AND year < 2005
GROUP BY
  plurality
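(Optional) The same query can also be run from Cloud Shell with the bq command-line tool instead of the console:
bq query --use_legacy_sql=false 'SELECT plurality, COUNT(1) AS num_babies, AVG(weight_pounds) AS avg_wt FROM publicdata.samples.natality WHERE year > 2000 AND year < 2005 GROUP BY plurality'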
How many triplets were born in the US between 2000 and 2005? ___________
Switch back to your Cloud Shell window.
If necessary, wait for Datalab to finish launching. Datalab is ready when you see a message prompting you to do a "Web Preview".
Click on the Web Preview icon on the top-right corner of the Cloud Shell ribbon. Switch to port 8081.
In Datalab, start a new notebook by clicking on the +Notebook icon.
In a cell in Datalab, type the following, then click Run:
query = """
SELECT
  weight_pounds,
  is_male,
  mother_age,
  mother_race,
  plurality,
  gestation_weeks,
  mother_married,
  ever_born,
  cigarette_use,
  alcohol_use,
  FARM_FINGERPRINT(CONCAT(CAST(year AS STRING), CAST(month AS STRING))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE
  year > 2000
"""
import google.datalab.bigquery as bq
df = bq.Query(query + " LIMIT 100").execute().result().to_dataframe()
df.head()
Note that we have gotten the results from BigQuery as a Pandas dataframe.
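Because the results are now an ordinary, small Pandas DataFrame, you can inspect it locally with standard Pandas calls; for example:
df.describe()  # local summary statistics for the numeric columns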
In the next cell in Datalab, type the following, then click Run:
def get_distinct_values(column_name):
  sql = """
  SELECT
    {0},
    COUNT(1) AS num_babies,
    AVG(weight_pounds) AS avg_wt
  FROM
    publicdata.samples.natality
  WHERE
    year > 2000
  GROUP BY
    {0}
  """.format(column_name)
  return bq.Query(sql).execute().result().to_dataframe()

df = get_distinct_values('is_male')
df.plot(x='is_male', y='avg_wt', kind='bar');
Are male babies heavier or lighter than female babies? Did you know this?
In the next cell in Datalab, type the following, then click Run:
df = get_distinct_values('gestation_weeks')
df = df.sort_values('gestation_weeks')
df.plot(x='gestation_weeks', y='avg_wt', kind='bar');
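(Optional) The same helper works for any other column returned by the query above; for example, to look at average birth weight by mother_age:
df = get_distinct_values('mother_age')
df = df.sort_values('mother_age')
df.plot(x='mother_age', y='avg_wt', kind='bar');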
Step 1
Click on the person icon in the top-right corner of your Datalab window and click on the link to manage the VM.
Step 2
In the web console, select the Datalab VM and click DELETE.
In this lab, you learned how to carry out data exploration of large datasets using BigQuery, Pandas, and Jupyter. The "trick" is to do the first part of your aggregation in BigQuery, get back a small Pandas DataFrame, and then work with that smaller DataFrame locally. Cloud Datalab provides a managed Jupyter experience, so you don't need to run notebook servers yourself.