Time Series Forecasting with the Cloud AI Platform and BQML

In this lab, you'll learn how to build time-series forecasting models using BigQuery ML (BQML) and TensorFlow, and then learn how to deploy these models with the Google Cloud AI Platform.

What you'll learn

You'll learn how to:

  • Transform data so that it can be used in an ML model
  • Visualize and explore data
  • Build a time-series forecasting model with TensorFlow using LSTM and CNN architectures

The focus of this codelab is on how to apply time-series forecasting techniques using the Google Cloud Platform. It isn't a general time-series forecasting course, but a brief tour of the concepts may be helpful for our users.

Time Series Data

First, what is a time series? It's a dataset with data recorded at regular time intervals. A time-series dataset contains both time and at least one variable that is dependent on time.

85af6a1ff05c69f2.png

Components

A time series can be decomposed into components:

  • Trend: moves up or down in a reasonably predictable pattern
  • Seasonal: repeats over a specific period such as a day, week, month, season, etc.
  • Random: residual fluctuations

There can be multiple layers of seasonality. For example, a call center might see a pattern in call volume on certain days of the week as well as in certain months. The residual may be partly explained by variables other than time.

6e8d45bbbbc388ec.png
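To make the decomposition concrete, here is a minimal sketch of an additive decomposition in plain Python. It is not the method used in the lab notebooks; the function names and the synthetic data are assumptions made for illustration, and the moving-average approach shown only handles an odd seasonal period.

```python
def centered_moving_average(series, period):
    """Estimate the trend with a centered moving average (odd period only)."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    half = period // 2
    trend = [None] * len(series)  # the trend is undefined near the edges
    for i in range(half, len(series) - half):
        trend[i] = sum(series[i - half : i + half + 1]) / period
    return trend


def decompose_additive(series, period):
    """Split a series into trend + seasonal + residual components."""
    trend = centered_moving_average(series, period)
    # Average the detrended values at each position in the seasonal cycle.
    buckets = [[] for _ in range(period)]
    for i, (value, t) in enumerate(zip(series, trend)):
        if t is not None:
            buckets[i % period].append(value - t)
    pattern = [sum(b) / len(b) for b in buckets]
    seasonal = [pattern[i % period] for i in range(len(series))]
    # Whatever is left after removing trend and seasonality is the residual.
    residual = [
        None if t is None else value - t - s
        for value, t, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, residual


# Synthetic daily data (made up): upward trend plus a weekly pattern that sums to zero.
weekly_pattern = [3, 1, 0, 0, -1, -1, -2]
series = [100 + 0.5 * day + weekly_pattern[day % 7] for day in range(28)]

trend, seasonal, residual = decompose_additive(series, period=7)
print(trend[3], seasonal[:7])
```

Because the synthetic series has no noise, the residual is essentially zero here. With real data, the residual is where holidays, weather, and other non-time effects would show up.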

Stationarity

For best results in forecasting, time-series data should be made stationary, meaning that statistical properties such as mean and variance are constant over time. Techniques such as differencing and detrending can be applied to raw data to make it more stationary.

For example, the plot below of CO2 concentration shows a repeating yearly pattern with an upward trend.

ab82857e2e7d0b89.png

After removing the linear trend, the data is more suitable for forecasting, as it now has a constant mean.

c936381ab1095528.png
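The two techniques mentioned above, differencing and detrending, can be sketched in a few lines of plain Python. This is an illustrative sketch on made-up data, not the code from the lab notebooks; the helper names difference and remove_linear_trend are hypothetical.

```python
def difference(series, lag=1):
    """Subtract the value observed `lag` steps earlier from each value."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def remove_linear_trend(series):
    """Fit a straight line by least squares and subtract it from the series."""
    n = len(series)
    x_mean = (n - 1) / 2
    y_mean = sum(series) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series)) / sum(
        (x - x_mean) ** 2 for x in range(n)
    )
    intercept = y_mean - slope * x_mean
    return [y - (intercept + slope * x) for x, y in enumerate(series)]


# Made-up series: a linear trend plus a pattern that repeats every 2 steps.
series = [10 + 2 * i + (1 if i % 2 == 0 else -1) for i in range(12)]

# Differencing at the seasonal lag removes both the trend and the repeating pattern.
print(difference(series, lag=2))

# Detrending keeps the repeating pattern but centers the series around zero.
detrended = remove_linear_trend(series)
print(sum(detrended) / len(detrended))
```

Differencing discards the first lag values, so the differenced series is shorter than the original. Detrending keeps the length but assumes the trend really is close to linear.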

Using Time Series Data for Machine Learning

To use time-series data in a machine learning problem, it needs to be transformed so that previous values can be used to predict future values. This table shows an example of how lagged variables are created to help predict the target.

d667a941dbd470f5.png
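As a rough sketch of the lagged-variable idea, the snippet below turns a series into rows of (previous values, target). The function name make_lagged_rows and the sample values are assumptions made for illustration; the notebooks may build these features differently.

```python
def make_lagged_rows(series, n_lags):
    """Pair each value with the n_lags values that precede it."""
    rows = []
    for t in range(n_lags, len(series)):
        features = series[t - n_lags : t]  # previous values, oldest first
        target = series[t]                 # the value to predict
        rows.append((features, target))
    return rows


# Made-up values for illustration.
series = [10, 20, 30, 40, 50]
for features, target in make_lagged_rows(series, n_lags=2):
    print(features, "->", target)
```

The first n_lags observations cannot be used as targets because they do not have enough history, so the training set is slightly shorter than the original series.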

Now that we've covered some fundamentals, let's get started with exploring data and forecasting!

Now that we have gone through a brief introduction to time-series data, let's set up our model development environment.

Step 1: Enable APIs

The BigQuery connector uses the BigQuery Storage API. Search for the BigQuery Storage API in the console and enable the API if it is currently disabled.

9895a2fd3cdf8f8c.png

Step 2: Create an AI Platform Notebooks instance

Navigate to the AI Platform Notebooks section of your Cloud Console and click New Instance. Then select the latest TensorFlow Enterprise 2.x instance type without GPUs:

51c53c81072b6edb.png

Use the default options and then click Create. Once the instance has been created, select Open JupyterLab:

709a5a8f9d77f1ea.png

Then, create a Python 3 notebook from JupyterLab:

58523671a252b95a.png

Step 3: Download lab materials

Create a new Terminal window from the JupyterLab interface: File -> New -> Terminal.

From there, clone the source material with this command:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

In this lab, you will:

  • Create a query that groups data into a time-series
  • Fill missing values
  • Visualize data
  • Decompose a time series into trend and seasonal components

Step 1

In AI Platform Notebooks, navigate to training-data-analyst/courses/ai-for-time-series/notebooks and open 01-explore.ipynb.

Step 2

Clear the outputs of all cells in the notebook (Edit > Clear All Outputs), change the region, project, and bucket settings in one of the first few cells, and then run the cells one by one.

In this lab, you will:

  • Import your time series input data into a BigQuery table
  • Create a time series model using BQML syntax
  • Learn how to evaluate your model parameters and accuracy
  • Forecast using your model

Step 1

We are going to create a BigQuery table with the raw data from the CSV we just explored. Let's start by downloading the CSV from the notebook environment.

From the training-data-analyst/courses/ai-for-time-series/notebooks/data directory, right-click on cta_ridership.csv and Download it to your local environment.

Step 2

Next, we will upload this data into a BigQuery table.

Navigate to BigQuery in the console (by searching or using this link):

649e7ab1c44b75e8.png

You can add the table to a new or existing dataset, which groups related tables. If you haven't already created a dataset, click on your project in the lower-left corner, and then select CREATE DATASET in the lower-right corner.

281b97020cd52f29.png

Pick a name of your choice, such as demo, accept the defaults, and continue.

With that dataset selected, select CREATE TABLE in the lower-right corner to create a new table.

ad47810d44cfb289.png

For the table creation options, select:

  • Create table from: Upload
  • Select file: cta_ridership.csv
  • Table name: cta_ridership
  • Schema: Check the box to auto detect Schema and input parameters

213e4177e9e79544.png

Step 3

It's now time to create our model! BigQuery ML provides a straightforward syntax similar to SQL that enables you to create a wide variety of model types.

In the query editor, paste/type in this query, replacing demo if needed with your dataset name in both places:

CREATE OR REPLACE MODEL
  `demo.cta_ridership_model` OPTIONS(MODEL_TYPE='ARIMA',
    TIME_SERIES_TIMESTAMP_COL='service_date',
    TIME_SERIES_DATA_COL='total_rides',
    HOLIDAY_REGION='us') AS
SELECT
  service_date, total_rides
FROM
  `demo.cta_ridership`

Let's go through key elements of the syntax for understanding:

CREATE OR REPLACE MODEL
demo.cta_ridership_model

This statement creates the model. There are variants of this statement, e.g. CREATE MODEL, but we chose to replace an existing model with the same name here.

OPTIONS(MODEL_TYPE='ARIMA' ... )

Here, we define the model options, with the first option being the model type. Selecting ARIMA will create a time-series forecasting model.

TIME_SERIES_TIMESTAMP_COL='service_date'

The column with date/time information.

TIME_SERIES_DATA_COL='total_rides'

The data column, containing the values to forecast.

HOLIDAY_REGION='us'

This optional parameter allows us to include holidays in the model. Since our data exploration in the previous step showed that ridership was lower on holidays, and the data comes from Chicago, IL, USA, we include US holidays in the model.

AS SELECT ... FROM ...

This section selects the input data we will use to train the model.

There are a number of other options you can add to the query, such as defining a column if you have multiple time series, or choosing whether to automatically discover the ARIMA model parameters. You can find out more details in the CREATE MODEL statement for time series models syntax reference.
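If you would rather train the model from a notebook than from the query editor, the same statement can be submitted with the BigQuery Python client. The sketch below is one possible way to do it, not part of the lab materials: build_create_model_sql and train_model are hypothetical helper names, and the dataset name is passed in once so it is substituted in both places.

```python
def build_create_model_sql(dataset):
    """Return the CREATE MODEL statement with the dataset name filled in."""
    return f"""
CREATE OR REPLACE MODEL
  `{dataset}.cta_ridership_model` OPTIONS(MODEL_TYPE='ARIMA',
    TIME_SERIES_TIMESTAMP_COL='service_date',
    TIME_SERIES_DATA_COL='total_rides',
    HOLIDAY_REGION='us') AS
SELECT
  service_date, total_rides
FROM
  `{dataset}.cta_ridership`
"""


def train_model(project, dataset):
    """Submit the statement and wait for training to finish."""
    # Imported here so the SQL can be inspected without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    client.query(build_create_model_sql(dataset)).result()  # blocks until done


# Inspect the statement before running it.
print(build_create_model_sql("demo"))
```

Calling train_model requires credentials and a project that contains the cta_ridership table, so it is left uncalled here.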

Step 4

Let's find out more about our model. After it has finished training, let's run another query, again replacing demo if needed:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `demo.cta_ridership_model`)

Let's interpret the results. In each row, you will see a candidate model, with its parameters and evaluation statistics. The results are returned in ascending order of AIC, or Akaike information criterion, which provides a relative indicator of model quality. So, the model in the first row has the lowest AIC, and is considered the best model.

You will be able to see the p, d, and q parameters of the ARIMA model, as well as the seasonality discovered in the model. In this case, the top model includes both weekly and yearly seasonality.

5b5b1e129c70a340.png
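To see why the lowest AIC is preferred, recall that AIC is defined as 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below ranks a few candidates by that formula. The candidate values are made up for illustration and are not output from this model, and the parameter counts are an assumption; BigQuery ML may count parameters differently.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical candidates: (p, d, q), log likelihood, assumed parameter count.
candidates = [
    {"order": (0, 1, 1), "log_likelihood": -1210.0, "n_params": 2},
    {"order": (1, 1, 1), "log_likelihood": -1205.0, "n_params": 3},
    {"order": (2, 1, 2), "log_likelihood": -1204.5, "n_params": 5},
]

# Rank in ascending order of AIC, as in the ML.EVALUATE results.
ranked = sorted(
    candidates, key=lambda c: aic(c["log_likelihood"], c["n_params"])
)
for c in ranked:
    print(c["order"], aic(c["log_likelihood"], c["n_params"]))
```

In this made-up example, the candidate with the highest likelihood is not ranked first, because its extra parameters are penalized. That trade-off between fit and complexity is what AIC is designed to capture.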

Step 5

Now, we're ready to forecast with the ML.FORECAST function!

Paste/type in the following (replacing demo if needed):

SELECT
  *
FROM
  ML.FORECAST(MODEL `demo.cta_ridership_model`,
    STRUCT(7 AS horizon))

This query simply forecasts 7 days out using our model! We can see the seven rows returned below. The forecast also includes a prediction interval; its confidence level defaults to 0.95 but is configurable in the query.

e1efdaea83347d2c.png
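As a sketch of how the horizon and confidence level could be varied, the helper below builds the forecast query with both values as arguments. The function name build_forecast_sql is hypothetical; the horizon and confidence_level fields are the ML.FORECAST settings, and the range check is a conservative assumption in this sketch.

```python
def build_forecast_sql(dataset, horizon=7, confidence_level=0.95):
    """Return an ML.FORECAST query for the ridership model."""
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    if not 0 < confidence_level < 1:
        raise ValueError("confidence_level must be between 0 and 1")
    return (
        "SELECT\n"
        "  *\n"
        "FROM\n"
        f"  ML.FORECAST(MODEL `{dataset}.cta_ridership_model`,\n"
        f"    STRUCT({horizon} AS horizon, {confidence_level} AS confidence_level))"
    )


# Forecast two weeks ahead with a narrower, 80% prediction interval.
print(build_forecast_sql("demo", horizon=14, confidence_level=0.8))
```

The printed query can be pasted into the query editor. A lower confidence level yields a narrower interval around each forecast value.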

Great work: we've created a time series model with just a few BQML queries.

In this lab, you will:

  • Remove outliers from the data
  • Perform multi-step forecasting
  • Include additional features in a time-series model
  • Learn about neural network architectures for time-series forecasting: LSTM and CNN
  • Learn about statistical models, including Holt-Winters Exponential Smoothing
  • Ensemble models

Step 1

In AI Platform Notebooks, navigate to training-data-analyst/courses/ai-for-time-series/notebooks and open 02-model.ipynb.

Step 2

Clear the outputs of all cells in the notebook (Edit > Clear All Outputs), change the region, project, and bucket settings in one of the first few cells, and then run the cells one by one.

In this lab, you will:

  • Prepare data and models for training in the cloud
  • Train your model and monitor the progress of the job with AI Platform Training
  • Predict using the model with AI Platform Predictions

Step 1

In AI Platform Notebooks, navigate to training-data-analyst/courses/ai-for-time-series/notebooks and open 03-cloud-training.ipynb.

Step 2

Clear the outputs of all cells in the notebook (Edit > Clear All Outputs), change the region, project, and bucket settings in one of the first few cells, and then run the cells one by one.

In this section, you will try applying the concepts you learned to a new dataset!

We won't provide detailed instructions, just some hints (if you want them!).

The goal is to predict 311 service requests from the City of New York. These non-emergency requests include noise complaints, street light issues, etc.

Step 1

Let's start by understanding the dataset.

First, access the City of New York 311 Service Requests dataset.

To get to know the data better, try out a couple of the sample queries listed in the dataset description:

  • What is the number of 311 requests related to ice cream trucks?
  • What days get the most 311 requests related to parties?

In the BigQuery UI, select Create Query to see how to access the dataset. Note that the SELECT statement queries bigquery-public-data.new_york_311.311_service_requests.

Step 2

We're ready to get started. In this section, make modifications to the Explore and Visualize notebook to work with this data.

Hints

  • Duplicate the 01-explore.ipynb notebook and begin working from it.
  • To explore the data, try this query:
from google.cloud import bigquery as bq

sql = """
SELECT * FROM `bigquery-public-data.new_york_311.311_service_requests` LIMIT 5
"""

client = bq.Client(project=PROJECT)
df = client.query(sql).to_dataframe()

df.head()
  • To get the counts of incidents by month, use this query:
SELECT
  COUNT(unique_key) AS y,
  DATE_TRUNC(DATE(created_date), MONTH) AS ds
FROM `bigquery-public-data.new_york_311.311_service_requests`
GROUP BY ds
ORDER BY ds ASC
  • Update the column variables in the constants section. In the query above, the target column is y, and the date column is ds. There are no additional features.
  • Consider changing the name of the file to which you export the data for the next lab.

Step 3

Let's now create a time-series model with the daily data.

Hints

  • Duplicate the 02-model.ipynb notebook and begin working from it.
  • Change the input file name if you changed it in the previous notebook.
  • There don't appear to be any obvious outliers in the data, so skip or comment out those cells.
  • Adjust the LSTM units and CNN filters and kernel size for this new model.

Step 4

For a final challenge, let's predict with monthly data, which will require several changes to the parameters:

n_features = 1  # Holidays aren't included in the monthly data set we created
n_input_steps = 12  # Lookback window of 12 months
n_output_steps = 1  # Predict one month ahead
n_seasons = 12  # For the statistical model, use yearly periodicity (12 months)

Hints

  • The LSTM model does not perform well with the monthly dataset. More complex architectures such as stacked LSTMs or LSTM-CNN could be investigated.
  • The 1-dimensional CNN architecture does much better using the following parameters (which should be optimized using hyperparameter tuning): filters=64 and kernel_size=12.
  • The exponential smoothing model performs better than the LSTM but worse than the CNN.
  • An ensemble of the CNN and exponential smoothing models does even better than either model individually. In basic testing, a 2:1 weighting of CNN to ES provided good results. (In this case, the weights array would be set to [2, 0, 1].)
  • Note that you will see slightly different evaluation results when inspecting an ensembled model. The code to evaluate an ensemble of models uses the intersection of the test data used between model types. For instance, the TensorFlow model requires the first n_input_steps of test data to predict, while the statistical model predicts right away on the first test data point.
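
The last two hints can be illustrated with a small sketch. It assumes the weights array is ordered [CNN, LSTM, ES], which is an inference from the 2:1 CNN-to-ES example rather than something the notebook is known to guarantee. The function name weighted_ensemble and all prediction values are made up for illustration.

```python
def weighted_ensemble(predictions, weights):
    """Weighted average of several models' predictions at each time step."""
    total = sum(weights)
    n_steps = len(predictions[0])
    return [
        sum(w * preds[t] for w, preds in zip(weights, predictions)) / total
        for t in range(n_steps)
    ]


n_input_steps = 3  # small hypothetical lookback window, for readability

# The statistical model predicts every test point (5 here).
es_preds = [100.0, 102.0, 104.0, 106.0, 108.0]
# The neural networks consume the first n_input_steps test points as input,
# so they only produce predictions for the remaining 2 points.
cnn_preds = [109.0, 111.0]
lstm_preds = [90.0, 95.0]

# Evaluate on the intersection: drop the ES predictions the networks lack.
es_aligned = es_preds[n_input_steps:]

# Weights assumed to be ordered [CNN, LSTM, ES]; 0 leaves the LSTM out.
ensemble = weighted_ensemble([cnn_preds, lstm_preds, es_aligned], [2, 0, 1])
print(ensemble)
```

Because the ensemble is scored only on the overlapping points, its metrics are not directly comparable to a statistical model evaluated on the full test set.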

If you'd like to continue using this notebook, it is recommended that you turn it off when not in use. From the Notebooks UI in your Cloud Console, select the notebook and then select Stop:

57213ef2edad9257.png

If you'd like to delete all the resources you've created in this lab, simply Delete the notebook instance instead of stopping it.

Using the Navigation menu in your Cloud Console, browse to Storage and delete both buckets you created to store your model assets.