Time Series Forecasting with Cloud AI Platform

In this lab, you'll learn how to build time-series forecasting models using AutoML and TensorFlow, and then how to deploy these models with Google Cloud AI Platform.

What you learn

You'll learn how to:

  • Transform data so that it can be used in an ML model
  • Visualize and explore data
  • Build a time-series forecasting model with TensorFlow using LSTM and CNN architectures

The focus of this codelab is on applying time-series forecasting techniques using Google Cloud Platform. It isn't a general time-series forecasting course, but a brief tour of the concepts may be helpful.

Time Series Data

First, what is a time series? It's a set of observations recorded at regular time intervals. A time-series dataset contains a time dimension and at least one variable that depends on time.

Components

A time-series can be decomposed into components:

  • Trend: moves up or down in a reasonably predictable pattern
  • Seasonal: repeats over a specific period such as a day, week, month, season, etc.
  • Random: residual fluctuations

There can be multiple layers of seasonality. For example, a call center might see a pattern in call volume on certain days of the week as well as in certain months. Some of the residual may be explainable by variables other than time.
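
If you'd like to try decomposition yourself, the statsmodels library provides a ready-made utility. Below is a minimal sketch on a synthetic monthly series; the series and its parameters are made up purely for illustration.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: linear trend + yearly seasonality + noise
idx = pd.date_range('2015-01-01', periods=48, freq='MS')
y = pd.Series(np.arange(48)
              + 10 * np.sin(2 * np.pi * np.arange(48) / 12)
              + np.random.randn(48), index=idx)

# Split the series into trend, seasonal, and residual components
result = seasonal_decompose(y, model='additive', period=12)
result.trend.tail()  # also available: result.seasonal, result.resid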

Stationarity

For best results in forecasting, time-series data should be made stationary, where statistical properties such as mean and variance are constant over time. Techniques such as differencing and detrending can be applied to raw data to make it more stationary.
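
For instance, differencing and linear detrending can each be done in a few lines; the sketch below uses a short made-up series purely for illustration.

import numpy as np
import pandas as pd

y = pd.Series([10.0, 12.0, 15.0, 14.0, 18.0, 21.0])  # illustrative values

# Differencing: replace each value with its change from the previous step
y_diff = y.diff().dropna()

# Detrending: fit a linear trend over time and subtract it
t = np.arange(len(y))
slope, intercept = np.polyfit(t, y.values, deg=1)
y_detrended = y - (slope * t + intercept)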

For example, CO2 concentration data shows a repeating yearly pattern with an upward trend.

After removing the linear trend, the data is more suitable for forecasting, as it now has a constant mean.

Using Time Series Data for Machine Learning

To use time-series data in a machine learning problem, it needs to be transformed so that previous values can be used to predict future values. A common approach is to create lagged variables, where earlier observations become features used to predict the target.
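
With pandas, building lagged features might look like the following sketch; the column names and values are only for illustration.

import pandas as pd

df = pd.DataFrame({'y': [112, 118, 132, 129, 121, 135]})  # illustrative target values

# Previous values become features used to predict the current value
df['y_lag_1'] = df['y'].shift(1)
df['y_lag_2'] = df['y'].shift(2)
df = df.dropna()  # drop rows where a lag isn't available yet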

Now that we've covered some fundamentals, let's get started with exploring the data and forecasting!

First, let's set up our model development environment.

Step 1: Enable APIs

The BigQuery connector uses the BigQuery Storage API. Search for the BigQuery Storage API in the console and enable the API if it is currently disabled.
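
If you prefer the command line, the API can also be enabled with gcloud (for example, from Cloud Shell with your project already selected):

gcloud services enable bigquerystorage.googleapis.com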

Step 2: Create an AI Platform Notebooks instance

Navigate to the AI Platform Notebooks section of your Cloud Console and click New Instance. Then select the latest TensorFlow Enterprise 2.x instance type without GPUs.

Use the default options and then click Create. Once the instance has been created, select Open JupyterLab.

Then create a Python 3 notebook from JupyterLab.

Step 3: Download lab materials

Create a new Terminal window from the JupyterLab interface: File -> New -> Terminal.

From there, clone the source material with this command:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

In this lab, you will:

  • Create a query that groups data into a time-series
  • Fill missing values (a sketch of one approach follows this list)
  • Visualize data
  • Decompose time-series into trend and seasonal components
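
As a preview of filling missing values, a daily series can be reindexed to a continuous date range and the gaps filled. This is only a sketch, using the column names y and ds adopted later in this lab.

import pandas as pd

# A daily series with a missing day (Jan 3)
df = pd.DataFrame({'ds': ['2021-01-01', '2021-01-02', '2021-01-04'],
                   'y': [10, 12, 9]})

df['ds'] = pd.to_datetime(df['ds'])
df = df.set_index('ds').sort_index().asfreq('D')  # insert a row for each missing day
df['y'] = df['y'].fillna(0)                       # treat missing days as zero counts
df = df.reset_index()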

Step 1

In AI Platform Notebooks, navigate to training-data-analyst/courses/ai-for-time-series/notebooks and open 01-explore.ipynb.

Step 2

Clear all the cells in the notebook (Edit > Clear All Outputs), change the region, project and bucket settings in one of the first few cells, and then Run the cells one by one.
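
Those settings will look something like the lines below; the variable names may differ slightly in the notebook, and the values shown are placeholders for your own project, region, and bucket.

PROJECT = 'your-project-id'   # used by the BigQuery client later in this lab
REGION = 'us-central1'
BUCKET = 'your-bucket-name'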

In this lab, you will:

  • Remove outliers from the data
  • Perform multi-step forecasting
  • Include additional features in a time-series model
  • Learn about neural network architectures for time-series forecasting: LSTM and CNN (a minimal example follows this list)
  • Learn about statistical models, including Holt-Winters Exponential Smoothing
  • Ensemble models
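
As a preview of the neural-network architectures mentioned above, here is a minimal, hypothetical Keras sketch of a 1-dimensional CNN forecaster; the layer sizes and window lengths are illustrative, not the notebook's exact configuration.

import tensorflow as tf

n_input_steps, n_features, n_output_steps = 30, 2, 7  # illustrative values

model = tf.keras.Sequential([
    # Slide a 1-D convolution across the lookback window of past observations
    tf.keras.layers.Conv1D(filters=32, kernel_size=3, activation='relu',
                           input_shape=(n_input_steps, n_features)),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Flatten(),
    # One output per future time step to forecast
    tf.keras.layers.Dense(n_output_steps),
])
model.compile(loss='mse', optimizer='adam')

# An LSTM variant would replace the convolutional layers with, for example,
# tf.keras.layers.LSTM(64, input_shape=(n_input_steps, n_features))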

Step 1

In AI Platform Notebooks, navigate to training-data-analyst/courses/ai-for-time-series/notebooks and open 02-model.ipynb.

Step 2

Clear all the cells in the notebook (Edit > Clear All Outputs), change the region, project and bucket settings in one of the first few cells, and then Run the cells one by one.

In this lab, you will:

  • Prepare data and models for training in the cloud
  • Train your model and monitor the progress of the job with AI Platform Training (an example command follows this list)
  • Predict using the model with AI Platform Predictions
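
For context, submitting a training job to AI Platform Training generally looks like the command below. The job name, package path, module name, bucket, and versions shown here are placeholders; the notebook assembles the real command and arguments for you.

gcloud ai-platform jobs submit training my_training_job \
  --region=us-central1 \
  --package-path=trainer \
  --module-name=trainer.task \
  --job-dir=gs://your-bucket/job-dir \
  --runtime-version=2.3 \
  --python-version=3.7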

Step 1

In AI Platform Notebooks, navigate to training-data-analyst/courses/ai-for-time-series/notebooks and open 03-cloud-training.ipynb.

Step 2

Clear all the cells in the notebook (Edit > Clear All Outputs), change the region, project and bucket settings in one of the first few cells, and then Run the cells one by one.

In this section, you will try applying the concepts you learned to a new dataset!

We won't provide detailed instructions, just some hints (if you want them!).

The goal is to predict 311 service requests from the City of New York. These non-emergency requests include noise complaints, street light issues, etc.

Step 1

Let's start by understanding the dataset.

First, access the City of New York 311 Service Requests dataset.

To get to know the data better, try out a couple of the sample queries listed in the dataset description:

  • What is the number of 311 requests related to ice cream trucks?
  • What days get the most 311 requests related to parties?

In the BigQuery UI, select Create Query to see how to access the dataset. Note that the SELECT statement queries bigquery-public-data.new_york_311.311_service_requests.

Step 2

We're ready to get started. In this section, make modifications to the Explore and Visualize notebook to work with this data.

Hints

  • Duplicate the 01-explore.ipynb notebook and begin working from it.
  • To explore the data, try this query:
from google.cloud import bigquery as bq

sql = """
SELECT * FROM `bigquery-public-data.new_york_311.311_service_requests` LIMIT 5
"""

# PROJECT is set in the notebook's settings cell near the top
client = bq.Client(project=PROJECT)
df = client.query(sql).to_dataframe()

df.head()
  • To get the counts of incidents by month, use this query:
SELECT
  COUNT(unique_key) AS y,
  DATE_TRUNC(DATE(created_date), MONTH) AS ds
FROM `bigquery-public-data.new_york_311.311_service_requests`
GROUP BY ds ORDER BY ds ASC
  • Update the column variables in the constants section. In the query above, the target column is y, and the date column is ds. There are no additional features.
  • Consider changing the file name in which you export the data for the next lab.

Step 3

Let's now create a time-series model with the daily data.

Hints

  • Duplicate the 02-model.ipynb notebook and begin working from it.
  • Change the input file name if you changed it in the previous notebook.
  • There don't appear to be any obvious outliers in the data, so skip or comment out those cells.
  • Adjust the LSTM units, CNN filters, and kernel size for this new model.

Step 4

For a final challenge, let's predict with monthly data, which will require several changes to the parameters:

n_features = 1  # Holidays aren't included in the monthly dataset we created
n_input_steps = 12 # Lookback window of 12 months
n_output_steps = 1 # Predict one month ahead
n_seasons = 12 # For the statistical model, use yearly periodicity (12 months)

Hints

  • The LSTM model does not perform well with the monthly dataset. More complex architectures such as stacked LSTMs or LSTM-CNN could be investigated.
  • The 1-Dimensional CNN architecture does much better using the following parameters (which should be optimized using hyperparameter tuning): filters=64 and kernel_size=12.
  • The exponential smoothing model performs better than the LSTM but worse than the CNN.
  • An ensemble of the CNN and exponential smoothing models does even better than either model individually. In basic testing, using a weight of 2:1 CNN to ES provided good results. (In this case, the weights array would be set to [2, 0, 1]; see the sketch after this list.)
  • Note that you will see slightly different evaluation results when inspecting an ensembled model. The code that evaluates an ensemble uses only the test data shared by all model types: for instance, the TensorFlow models need the first n_input_steps of the test data before they can predict, while the statistical model predicts from the first test data point onward.
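
A weighted ensemble can be as simple as a weighted average of the models' predictions over the shared test window; the prediction values below are made up for illustration.

import numpy as np

# Hypothetical per-model predictions over the same test window
cnn_preds = np.array([105.0, 98.0, 110.0])
lstm_preds = np.array([120.0, 90.0, 100.0])
es_preds = np.array([100.0, 95.0, 108.0])

weights = [2, 0, 1]  # CNN : LSTM : exponential smoothing, per the 2:1 hint above
ensemble_preds = np.average(np.stack([cnn_preds, lstm_preds, es_preds]),
                            axis=0, weights=weights)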

If you'd like to continue using this notebook, it is recommended that you turn it off when not in use. From the Notebooks UI in your Cloud Console, select the notebook and then select Stop.
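
The instance can also be stopped from the command line; the instance name and zone below are placeholders for your own values.

gcloud notebooks instances stop my-notebook-instance --location=us-central1-a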

If you'd like to delete all the resources you've created in this lab, simply Delete the notebook instance instead of stopping it.

Using the Navigation menu in your Cloud Console, browse to Storage and delete both buckets you created to store your model assets.
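
The buckets can also be removed from the command line with gsutil; the bucket names below are placeholders for the ones you created.

gsutil rm -r gs://your-model-bucket gs://your-staging-bucket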