In this lab you analyze a large (70 million rows, 8 GB) airline dataset using Google BigQuery and Cloud Datalab.

What you need

You must have completed Lab 0 and have the following:

What you learn

In this lab, you:

This lab illustrates how you can carry out data exploration of large datasets while continuing to use familiar tools like Pandas and Jupyter. The "trick" is to do the first part of your aggregation in BigQuery, get back a Pandas DataFrame, and then work with the smaller DataFrame locally. Cloud Datalab provides a managed Jupyter experience, so that you don't need to run notebook servers yourself.
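The aggregation trick can be sketched locally with toy data. This is illustrative only (the simulated rows and the 90-value delay range are assumptions, not the real dataset): grouping collapses millions of raw rows to at most one row per distinct key, which is why the aggregated result fits in local memory.

```python
import pandas as pd
import numpy as np

# Simulate a large raw table: one million rows with a delay column.
rng = np.random.default_rng(0)
raw = pd.DataFrame({'departure_delay': rng.integers(-30, 60, size=1_000_000)})

# Grouping by departure_delay leaves at most one row per distinct delay
# value (here, at most 90 rows), no matter how many raw rows there were.
agg = raw.groupby('departure_delay').size().reset_index(name='num_flights')
print(len(agg))
```

In the lab itself, this grouping happens inside BigQuery over 70 million rows, and only the small aggregate is pulled into Pandas.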

Duration: 2

To launch Cloud Datalab:

Step 1

From the GCP Console click the Cloud Shell icon on the top right toolbar:

Then click "Start Cloud Shell":

Step 2

In Cloud Shell, type the following to list the available Compute Engine zones (you will pick one of these in the next step):

gcloud compute zones list

Step 3

In Cloud Shell, type the following, replacing <NAME> with a name for your Datalab instance and <ZONE> with one of the zones listed in the previous step:

datalab create <NAME> --zone <ZONE>

Datalab will take about 5 minutes to start. Move on to the next step.

To invoke a BigQuery query:

Step 1

Navigate to the BigQuery console by selecting BigQuery from the top-left-corner ("hamburger") menu.

Step 2

In the BigQuery Console, click on Compose Query. Then, select Show Options and ensure that the Use Legacy SQL box is NOT checked (we will be using Standard SQL).

Step 3

In the query textbox, type:

#standardSQL
SELECT
  departure_delay,
  COUNT(1) AS num_flights,
  APPROX_QUANTILES(arrival_delay, 4) AS arrival_delay_quantiles
FROM
  `bigquery-samples.airline_ontime_data.flights`
GROUP BY
  departure_delay
HAVING
  num_flights > 100
ORDER BY
  departure_delay ASC

And then click on "Run Query".
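APPROX_QUANTILES(arrival_delay, 4) divides the values into four quantile buckets and returns the five boundary values: minimum, 25th percentile, median, 75th percentile, and maximum. A rough local analogue on toy data (made-up delays, and NumPy's exact percentiles rather than BigQuery's approximation):

```python
import numpy as np

# Toy arrival delays in minutes; negative means the flight arrived early.
arrival_delays = [-20, -10, -5, 0, 3, 7, 12, 25, 40]

# APPROX_QUANTILES(x, 4) in BigQuery returns 5 boundary values:
# minimum, 25th percentile, median, 75th percentile, maximum.
quartiles = np.percentile(arrival_delays, [0, 25, 50, 75, 100])
print(quartiles)  # the middle value, 3.0, is the median
```

So in the query results, the middle element of each arrival_delay_quantiles array is the median arrival delay for that departure delay.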

What is the median arrival delay for flights that left 35 minutes early? ___________

(Answer: the typical flight that left 35 minutes early arrived 19 minutes early.)

Step 4

Look back at Cloud Shell, and follow any prompts. If asked for an SSH passphrase, just hit return (for no passphrase).

Step 5 (Optional)

Can you write a query to find the airport pair (departure and arrival airport) that had the maximum number of flights between them?

Hint: you can group by multiple fields.

One possible answer:

#standardSQL
SELECT
  departure_airport,
  arrival_airport,
  COUNT(1) AS num_flights
FROM
  `bigquery-samples.airline_ontime_data.flights`
GROUP BY
  departure_airport,
  arrival_airport
ORDER BY
  num_flights DESC
LIMIT
  10
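The same group-by-two-fields idea can be tried in Pandas on a toy table (the airport codes and counts here are made up for illustration):

```python
import pandas as pd

# Toy flights table; each row is one flight between two airports.
flights = pd.DataFrame({
    'departure_airport': ['SFO', 'SFO', 'LAX', 'SFO'],
    'arrival_airport':   ['JFK', 'JFK', 'JFK', 'LAX'],
})

# Grouping by the (departure, arrival) pair counts flights per route,
# mirroring the GROUP BY two fields in the SQL above.
pairs = (flights.groupby(['departure_airport', 'arrival_airport'])
                .size().reset_index(name='num_flights')
                .sort_values('num_flights', ascending=False))
print(pairs.iloc[0].tolist())  # ['SFO', 'JFK', 2]
```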

Step 1

If necessary, wait for Datalab to finish launching. Datalab is ready when you see a message prompting you to do a "Web Preview".

Step 2

Click on the Web Preview icon on the top-right corner of the Cloud Shell ribbon. Switch to port 8081.

Step 3

In the Cloud Datalab home page (in your browser), navigate into the notebooks folder. You should now be in datalab/notebooks/

Step 4

Start a new notebook by clicking on the +Notebook icon. Rename the notebook to be flights.

Step 5

In a cell in Datalab, type the following, then click Run

query="""
SELECT
  departure_delay,
  COUNT(1) AS num_flights,
  APPROX_QUANTILES(arrival_delay, 10) AS arrival_delay_deciles
FROM
  `bigquery-samples.airline_ontime_data.flights`
GROUP BY
  departure_delay
HAVING
  num_flights > 100
ORDER BY
  departure_delay ASC
"""

import google.datalab.bigquery as bq
df = bq.Query(query).execute().result().to_dataframe()
df.head()

Note that we have retrieved the results from BigQuery as a Pandas DataFrame.

What Python data structure are the deciles in?

Step 6

In the next cell in Datalab, type the following, then click Run

import pandas as pd
percentiles = df['arrival_delay_deciles'].apply(pd.Series)
percentiles = percentiles.rename(columns = lambda x : str(x*10) + "%")
df = pd.concat([df['departure_delay'], percentiles], axis=1)
df.head()

What has the above code done to the columns in the Pandas DataFrame?
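To see what apply(pd.Series) does in isolation, here is a tiny self-contained example (the variable names and values are illustrative, and deliberately distinct from the lab's df):

```python
import pandas as pd

# A tiny frame with a list-valued column, mimicking arrival_delay_deciles.
toy = pd.DataFrame({'departure_delay': [-5, 0],
                    'deciles': [[1, 2, 3], [4, 5, 6]]})

# apply(pd.Series) expands each list into its own columns (0, 1, 2),
# which the rename then relabels as percentage strings.
expanded = toy['deciles'].apply(pd.Series)
expanded = expanded.rename(columns=lambda x: str(x * 10) + "%")
result = pd.concat([toy['departure_delay'], expanded], axis=1)
print(result.columns.tolist())  # ['departure_delay', '0%', '10%', '20%']
```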

Step 7

In the next cell in Datalab, type the following, then click Run

without_extremes = df.drop(['0%', '100%'], axis=1)
without_extremes.plot(x='departure_delay', xlim=(-30,50), ylim=(-50,50));

Suppose we were creating a machine learning model to predict the arrival delay of a flight. Do you think departure delay is a good input feature? Is this true at all ranges of departure delays?

Hint: Try removing the xlim and ylim from the plotting command.

Let's save our changes to the git repository in your project.

Step 1

In Datalab, click on the "Ungit" icon in the Datalab toolbar.

Step 2

Write a commit message and commit the changed notebook.

Step 3

A git commit only commits changes locally, i.e. to your Datalab instance. To push changes to the actual repository associated with your project, select the master branch and click on the Push button:

Note that after this, your local "master" branch is up-to-date with the remote "master" branch. Now, you can delete your Datalab VM without losing the notebook.

Step 4

In GCP console, navigate to Source Repositories | Repositories. Click on datalab-notebooks. Notice that your flights.ipynb is now saved on the cloud.

Step 1

You could leave the Datalab instance running until your class ends. The default machine type is relatively inexpensive. However, if you want to be frugal, you can stop and restart the instance between labs or when you go home for the day. To do so, follow the next two steps.

Step 2

Click on the person icon in the top-right corner of your Datalab window and click on the button to STOP the VM.

Step 3

You are not billed for stopped VMs. Whenever you want to restart Datalab, open Cloud Shell and type in:

datalab connect <NAME>

This will restart the virtual machine and launch the Docker image that runs Datalab.

In this lab, you learned how to carry out data exploration of large datasets using BigQuery, Pandas, and Jupyter. The "trick" is to do the first part of your aggregation in BigQuery, get back a Pandas DataFrame, and then work with the smaller DataFrame locally. Cloud Datalab provides a managed Jupyter experience, so that you don't need to run notebook servers yourself.

©Google, Inc. or its affiliates. All rights reserved. Do not distribute.