Time Series Forecasting with Vertex AI and BigQuery ML

1. Overview

In this lab, you'll learn how to build a time-series forecasting model with TensorFlow, and then learn how to deploy these models with the Vertex AI.

What you learn

You'll learn how to:

Transform data so that it can be used in an ML model
Visualize and explore data
Use BigQuery ML to create a time-series forecasting model
Build a time-series forecasting model with TensorFlow using LSTM and CNN architectures

2. Introduction to Time-Series Forecasting

The focus of this codelab is on how to apply time-series forecasting techniques using the Google Cloud Platform. It isn't a general time-series forecasting course, but a brief tour of the concepts may be helpful for our users.

Time Series Data

First, what is a time series? It's a dataset with data recorded at regular time intervals. A time-series dataset contains both time and at least one variable that is dependent on time.

Components

A time-series can be decomposed into components:

Trend: moves up or down in a reasonably predictable pattern
Seasonal: repeats over a specific period such as a day, week, month, season, etc.
Random: residual fluctuations

There can be multiple layers of seasonality. For example, a call center might see a pattern in call volume on certain days of the week as well as on given months. The residual might be able to be explained by other variables besides time.

Stationarity

For best results in forecasting, time-series data should be made stationary, where statistical properties such as mean and variance are constant over time. Techniques such as differencing and detrending can be applied to raw data to make it more stationary.

For example, the plot below of CO2 concentration shows a repeating yearly pattern with an upward trend. ( Source)

After removing the linear trend, the data is more suitable for forecasting, as it now has a constant mean.

Using Time Series Data for Machine Learning

To use time-series data in a machine learning problem, it needs to be transformed so that previous values can be used to predict future values. This table shows an example of how lagged variables are created to help predict the target.

Now that we've covered some fundamentals, let's get started with exploring data and forecasting!

3. Setup your Notebook environment

Now that we have gone through a brief introduction to the data, let's now set up our model development environment.

Step 1: Enable APIs

The BigQuery connector uses the BigQuery Storage API. Search for the BigQuery Storage API in the console and enable the API if it is currently disabled.

Step 2: Create a Vertex AI Workbench notebook

Navigate to the Vertex AI Workbench section of your Cloud Console and click New Notebook. Then select the latest TensorFlow Enterprise 2.x notebook type without GPUs:

Use the default options and then click Create. Once the instance has been created, select Open JupyterLab:

Then, create a Python 3 notebook from JupyterLab:

Step 3: Download lab materials

Create a new Terminal window from the JupyterLab interface: File -> New -> Terminal.

From there, clone the source material with this command:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

4. Explore and Visualize Data

In this section, you will:

Create a query that groups data into a time-series
Fill missing values
Visualize data
Decompose time-series into trend and seasonal components

Step 1

In Vertex AI Workbench, navigate to training-data-analyst/courses/ai-for-time-series/notebooks and open 01-explore.ipynb.

Step 2

Clear all the cells in the notebook (Edit > Clear All Outputs), change the region, project and bucket settings in one of the first few cells, and then Run the cells one by one.

Step 3

In this section, you have imported the data, and visualized various dimensions of it. Now that you have a clearer picture of the data, you will be ready to proceed to machine learning modeling using this data.

5. Create a Model with BigQuery Time Series Forecasting

In this section, you will:

Import your time series input data into a BigQuery table
Create a time series model using BQML syntax
Learn how to evaluate your model parameters and accuracy
Forecast using your model

Step 1

We are going to create a BigQuery table with the raw data from the CSV we just explored. Let's start by downloading the CSV from the notebook environment.

From the training-data-analyst/courses/ai-for-time-series/notebooks/data directory, right-click on cta_ridership.csv and Download it to your local environment.

Step 2

Next, we will upload this data into a BigQuery table.

Navigate to BigQuery in the console (by searching or using this link):

You can add the table to a new or existing dataset, which groups related tables. In case you haven't already created a dataset, you can click on your project in the lower-left corner, and then select Create Datasetin the lower-right corner.

Pick a name of your choice, such as demo, accept the defaults, and continue.

With that dataset selected, select Create Table in the lower-right corner to create a new table.

For the table creation options, select:

Create table from: Upload
Select file: cta_ridership.csv
Table name: cta_ridership
Schema: Check the box to auto detect Schema and input parameters

Step 3

It's now time to create our model! BigQuery ML provides a straightforward syntax similar to SQL that enables you to create a wide variety of model types.

In the query editor, paste/type in this query, replacing demo if needed with your dataset name in both places:

CREATE OR REPLACE MODEL
  `demo.cta_ridership_model` OPTIONS(MODEL_TYPE='ARIMA',
    TIME_SERIES_TIMESTAMP_COL='service_date',
    TIME_SERIES_DATA_COL='total_rides',
    HOLIDAY_REGION='us') AS
SELECT
  service_date, total_rides
FROM
  `demo.cta_ridership`

Let's go through key elements of the syntax for understanding:

`CREATE OR REPLACE MODEL demo.cta_ridership_model`	This statement creates the model. There are variants of this statement, e.g. `CREATE MODEL`, but we chose to replace an existing model with the same name here.
`OPTIONS(MODEL_TYPE=‘ARIMA' ... )`	Here, we define the model options, with the first option being the model type. Selecting ARIMA will create a time-series forecasting model.
`TIME_SERIES_TIMESTAMP_COL=‘service_date'`	The column with date/time information
`TIME_SERIES_DATA_COL=‘total_rides'`	The data column
`HOLIDAY_REGION=‘us'`	This optional parameter allows us to include holidays into the model. Since our data exploration in the previous step showed that ridership was lower on holidays, and the data comes from Chicago, IL, USA, we are including US holidays into the model.
`AS SELECT ... FROM ...`	This section selects the input data we will use to train the model with.

There are a number of other options you can add to the query, such as defining a column if you have multiple time series, or choosing whether to automatically discover the ARIMA model parameters. You can find out more details in the CREATE MODEL statement for time series models syntax reference.

Step 4

Let's find out more about our model. After it has finished training, let's run another query, again replacing demo if needed:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `demo.cta_ridership_model`)

Let's interpret the results. In each row, you will see a candidate model, with its parameters and evaluation statistics. The results are returned in ascending order of AIC, or Akaike information criterion, which provides a relative indicator of model quality. So, the model in the first row has the lowest AIC, and is considered the best model.

You will be able to see the p, d, and q parameters of the ARIMA model, as well as the seasonality discovered in the model. In this case, the top model includes both weekly and yearly seasonality.

Step 5

Now, we're ready to forecast with the ML.FORECAST function!

Paste/type in the following (replacing demo if needed):

SELECT
  *
FROM
  ML.FORECAST(MODEL `demo.cta_ridership_model`,
    STRUCT(7 AS horizon))

This query simply forecasts 7 days out using our model! We can see the seven rows returned below. The forecast also includes a confidence interval, defaulting to 0.95 but configurable in the query.

Great work: we've created a time series model with just a few BQML queries.

6. Build a Custom Forecasting Model

In this section, you will:

Remove outliers from the data
Perform multi-step forecasting
Include additional features in a time-series model
Learn about neural network architectures for time-series forecasting: LSTM and CNN
Learn about statistical models, including Holt-Winters Exponential Smoothing
Ensemble models

Step 1

In Vertex AI Workbench, navigate to training-data-analyst/courses/ai-for-time-series/notebooks and open 02-model.ipynb.

Step 2

Clear all the cells in the notebook (Edit > Clear All Outputs), change the region, project and bucket settings in one of the first few cells, and then Run the cells one by one.

Step 3

In the notebook, you've now explored multiple model architectures: LSTM, CNN, and statistical models. For each model, you can see how the model performs against the test data:

7. Train and Predict in the Cloud

In this section, you will:

Prepare data and models for training in the cloud
Train your model and monitor the progress of the job with AI Platform Training
Predict using the model with AI Platform Predictions

Step 1

In Vertex AI Workbench, navigate to training-data-analyst/courses/ai-for-time-series/notebooks and open 03-cloud-training.ipynb.

Step 2

Clear all the cells in the notebook (Edit > Clear All Outputs), change the region, project and bucket settings in one of the first few cells, and then Run the cells one by one.

Step 3

In the previous section, we trained a model and predicted with it, all within a Workbench notebook. In this section, we demonstrated how to use the Python SDK for Vertex AI from your notebook to use Vertex AI services for training and deployment.

8. Challenge

In this section, you will try applying the concepts you learned to a new dataset!

We won't provide detailed instructions, just some hints (if you want them!).

The goal is to predict 311 service requests from the City of New York. These non-emergency requests include noise complaints, street light issues, etc.

Step 1

Let's start by understanding the dataset.

First, access the City of New York 311 Service Requests dataset.

To get to know the data better, try out a couple of the sample queries listed in the dataset description:

What is the number of 311 requests related to ice cream trucks?
What days get the most 311 requests related to parties?

In the BigQuery UI, select Create Query to see how to access the dataset. Note the select statement is querying from bigquery-public-data.new_york_311.311_service_requests.

Step 2

We're ready to get started. In this section, make modifications to the Explore and Visualize notebook to work with this data.

Hints

Duplicate the 01-explore.ipynb notebook and begin working from it.
To explore the data, try this query:

from google.cloud import bigquery as bq

sql = """
SELECT * FROM `bigquery-public-data.new_york_311.311_service_requests` LIMIT 5
"""

client = bq.Client(project=PROJECT)
df = client.query(sql).to_dataframe()

df.head()

To get the counts of incidents by month, use this query:

SELECT
  COUNT(unique_key) as y,
  DATE_TRUNC(DATE(created_date), month) as ds  
FROM `bigquery-public-data.new_york_311.311_service_requests`
GROUP by ds ORDER BY ds asc

Update the column variables in the constants section. In the query above, the target column is y, and the date column is ds. There are no additional features.
Consider changing the file name in which you export the data for the next lab.
Export the data using: df.to_csv(YOUR-EXPORT-FILENAME, index=False)

Step 3

Let's now create a time-series model with the monthly data.

Hints:

Duplicate the 02-model.ipynb notebook and begin working from it.
Update the dataset parameters:
Update the target_col and ts_col parameters to match your new dataset.
Update the model parameters:
Frequency to monthly (code for month start is ‘MS')
Input steps: 12 (lookback window is 12 months)
Output steps: 3 (predict 3 months forward)
Seasons: 12 (Seasonality is 12 months)
Change the input file name if you changed it in the previous notebook.
If you run the query mid-month, the monthly total for the final month will be much lower than expected. So, for the purposes of this lab, let's remove the final month from the dataset using: df = df[:-1]
There doesn't seem any obvious outliers in the data, so skip or comment out those cells.
Adjust the LSTM units and CNN filters and kernel size for this new model.

9. Cleanup

If you'd like to continue using this notebook, it is recommended that you turn it off when not in use. From the Workbench UI in your Cloud Console, select the notebook and then select Stop:

If you'd like to delete all the resources you've created in this lab, simply Delete the workbench notebook instead of stopping it.

Using the Navigation menu in your Cloud Console, browse to Storage and delete both buckets you created to store your model assets (WARNING: only do this if you created new buckets just for this lab).