In this lab you analyze historical weather observations using BigQuery and use weather data in conjunction with other datasets. This lab is part of a series of labs on processing scientific data.

What you need

To complete this lab, you need:

What you learn

In this lab, you:

In this lab you use two public datasets in BigQuery -- weather data from NOAA and citizen complaints data from New York City.

In this lab, you will encounter, for the first time, several aspects of Google Cloud Platform that are of great benefit to scientists:

  1. Serverless. You will not need to download data to your machine in order to work with it. The dataset will remain on the cloud.
  2. Ease of use. You will run ad-hoc SQL queries on your dataset without having to prepare the data beforehand (in other words, no indexes, etc.). This is invaluable for data exploration.
  3. Scale. You can carry out data exploration on extremely large datasets interactively. You don't need to sample the data in order to work with it in a timely manner.
  4. Shareability. You will be able to run queries on data from different datasets without any issues -- BigQuery is a convenient way to share datasets. Of course, you can also keep your data private, or share them only with specific persons -- not all data need to be public.

The end-result is that you will find what types of municipal complaints are correlated with weather. For example, you will find (not surprisingly) that complaints about residential furnaces are most common when it is cold outside:

To invoke a BigQuery query:

Step 1

Navigate to the BigQuery console by selecting BigQuery from the top-left-corner ("hamburger") menu of the Google Cloud console.

Step 2

In the BigQuery Console, see if the bigquery-public-data project is listed on the left-hand side menu. If not, click on the blue arrow next to your project name and select Switch to project > Display project and type in bigquery-public-data

Step 3

Navigate to the table bigquery-public-data > noaa_gsod > gsod2014 and click on the table and in the table details that show on the right-hand pane, click on Preview.

Examine the columns and some of the data values.

Step 4

In the BigQuery Console, click on Compose Query. Then, select Show Options and ensure that the Legacy SQL menu is NOT checked (we will be using Standard SQL).

Step 5

In the query textbox, type:

#standardsql
SELECT
  -- Create a timestamp from the date components.
  stn,
  TIMESTAMP(CONCAT(year,"-",mo,"-",da)) AS timestamp,
  -- Replace numerical null values with actual null
  AVG(IF (temp=9999.9,
      null,
      temp)) AS temperature,
  AVG(IF (wdsp="999.9",
      null,
      CAST(wdsp AS Float64))) AS wind_speed,
  AVG(IF (prcp=99.99,
      0,
      prcp)) AS precipitation
FROM
  `bigquery-public-data.noaa_gsod.gsod20*`
WHERE
  CAST(YEAR AS INT64) > 2010
  AND CAST(MO AS INT64) = 6
  AND CAST(DA AS INT64) = 12
  AND (stn="725030" OR  -- La Guardia
    stn="744860")    -- JFK
GROUP BY
  stn,
  timestamp
ORDER BY
  timestamp DESC,
  stn ASC

Then, click on Run Query. What does this query do?

Step 1

In the BigQuery Console, select the table bigquery-public-data > new_york > nypd_mv_collisions and click on Preview. What columns are present? What does the data look like?

Step 2

In the BigQuery Console, click on Compose Query and run the following query:

#standardsql
SELECT
  EXTRACT(YEAR
  FROM
    created_date) AS year,
  complaint_type,
  COUNT(1) AS num_complaints
FROM
  `bigquery-public-data.new_york.311_service_requests`
GROUP BY
  year,
  complaint_type
ORDER BY
  num_complaints DESC

What are the most common complaints? Do you think the number of complaints about some of these problems will be highly correlated with the weather on a particular day?

Step 1

In the left hand-side of the BigQuery Console, click on the blue arrow next to your project name and click on Create dataset. Give the dataset the name demos.

Step 2

In the BigQuery Console, click on Compose Query and then click on Show Options if necessary. Click on the Select Table button to select a destination table. Specify the dataset demos and table nyc_weather.

Step 3

In the BigQuery Console, click on Compose Query and run the following query:

#standardsql
SELECT
  -- Create a timestamp from the date components.
  timestamp(concat(year,"-",mo,"-",da)) as timestamp,
  -- Replace numerical null values with actual nulls
  AVG(IF (temp=9999.9, null, temp)) AS temperature,
  AVG(IF (visib=999.9, null, visib)) AS visibility,
  AVG(IF (wdsp="999.9", null, CAST(wdsp AS Float64))) AS wind_speed,
  AVG(IF (gust=999.9, null, gust)) AS wind_gust,
  AVG(IF (prcp=99.99, null, prcp)) AS precipitation,
  AVG(IF (sndp=999.9, null, sndp)) AS snow_depth
FROM
  `bigquery-public-data.noaa_gsod.gsod20*`
WHERE
  CAST(YEAR AS INT64) > 2008
  AND (stn="725030" OR  -- La Guardia
       stn="744860")    -- JFK
GROUP BY timestamp

The results are now saved in a new table in your project.

Step 4

Click on the x to remove the destination table for future queries.

Step 1

In the BigQuery Console, run the following query. This query uses the CORR function to compute the correlation between number of complaints and the temperature.

#standardsql
SELECT
  descriptor,
  sum(complaint_count) as total_complaint_count,
  count(temperature) as data_count,
  ROUND(corr(temperature, avg_count),3) AS corr_count,
  ROUND(corr(temperature, avg_pct_count),3) AS corr_pct
From (
SELECT
  avg(pct_count) as avg_pct_count,
  avg(day_count) as avg_count,
  sum(day_count) as complaint_count,
  descriptor,
  temperature
FROM (
  SELECT
    DATE(timestamp) AS date,
    temperature
  FROM
    demos.nyc_weather) a
  JOIN (
  SELECT x.date, descriptor, day_count, day_count / all_calls_count as pct_count
  FROM
    (SELECT
      DATE(created_date) AS date,
      concat(complaint_type, ": ", descriptor) as descriptor,
      COUNT(*) AS day_count
    FROM
      `bigquery-public-data.new_york.311_service_requests` 
    GROUP BY
      date,
      descriptor)x 
    JOIN (
      SELECT
        DATE(created_date) AS date,
        COUNT(*) AS all_calls_count
      FROM `bigquery-public-data.new_york.311_service_requests` 
      GROUP BY date
    )y
  ON x.date=y.date
)b
ON
  a.date = b.date
GROUP BY
  descriptor,
  temperature
)
GROUP BY descriptor
HAVING 
  total_complaint_count > 5000 AND 
  ABS(corr_pct) > 0.5 AND
  data_count > 5
ORDER BY
  ABS(corr_pct) DESC

The results indicate that Heating complaints are negatively correlated with temperature (i.e., more heating calls on cold days) and calls about dead trees are positively correlated with temperature (i.e., more calls on hot days).

Step 2

In the BigQuery Console, run the following query. This query uses the CORR function to compute the correlation between number of complaints and wind speed.

#standardsql
SELECT
  descriptor,
  sum(complaint_count) as total_complaint_count,
  count(wind_speed) as data_count,
  ROUND(corr(wind_speed, avg_count),3) AS corr_count,
  ROUND(corr(wind_speed, avg_pct_count),3) AS corr_pct
From (
SELECT
  avg(pct_count) as avg_pct_count,
  avg(day_count) as avg_count,
  sum(day_count) as complaint_count,
  descriptor,
  wind_speed
FROM (
  SELECT
    DATE(timestamp) AS date,
    wind_speed
  FROM
    demos.nyc_weather)a
  JOIN (
  SELECT x.date, descriptor, day_count, day_count / all_calls_count as pct_count
  FROM
    (SELECT
      DATE(created_date) AS date,
      concat(complaint_type, ": ", descriptor) as descriptor,
      COUNT(*) AS day_count
    FROM
      `bigquery-public-data.new_york.311_service_requests` 
    GROUP BY
      date,
      descriptor)x 
    JOIN (
      SELECT
        DATE(created_date) AS date,
        COUNT(*) AS all_calls_count
      FROM `bigquery-public-data.new_york.311_service_requests` 
      GROUP BY date
    )y
  ON x.date=y.date
)b
ON
  a.date = b.date
GROUP BY
  descriptor,
  wind_speed
)
GROUP BY descriptor
HAVING 
  total_complaint_count > 5000 AND 
  ABS(corr_pct) > 0.5 AND
  data_count > 10
ORDER BY
  ABS(corr_pct) DESC

Do you have a hypothesis for why noise complaints reduce on windy days?

In this lab, you did ad-hoc queries on two datasets. You were able to query the data without setting up any clusters, creating any indexes, etc. You were also able to mash up the two datasets and get some interesting insights. All without ever leaving your browser!

For more fun analysis of the NYC data and how it is correlated with weather, see Reto Meier's blog post.

┬ęGoogle, Inc. or its affiliates. All rights reserved. Do not distribute.