BigQuery ML (BQML) enables users to create and execute machine learning models in BigQuery using SQL queries. The goal is to democratise machine learning by enabling SQL practitioners to build models using their existing tools and to increase development speed by eliminating the need for data movement.

What you'll build

You'll use the sample Analytics 360 dataset to create a model that predicts whether a visitor will make a transaction.

What you'll learn

How to create, evaluate and use machine learning models in BigQuery

What you'll need

Self-paced environment setup

If you don't already have a Google Account (Gmail or Google Apps), you must create one. Sign-in to Google Cloud Platform console (console.cloud.google.com) and create a new project:

Remember the project ID, a unique name across all Google Cloud projects (the name above has already been taken and will not work for you, sorry!). It will be referred to later in this codelab as PROJECT_ID.

Next, you'll need to enable billing in the Cloud Console in order to use Google Cloud resources.

Running through this codelab shouldn't cost you more than a few dollars, but it could be more if you decide to use more resources or if you leave them running (see "cleanup" section at the end of this document).

New users of Google Cloud Platform are eligible for a $300 free trial.

Open BigQuery Console

In the Google Developer Console, click the Menu icon on the top left of the screen.

Scroll down the menu to the bottom, and click BigQuery:

This will open up the BigQuery console in a new browser tab, that looks like this:

But, there is nothing in here! Luckily, there are tons of public datasets available in BigQuery for you to explore.

To create a dataset, click the arrow next to your project name and select create new dataset.

Next, name your dataset bqml_codelab and click ok.

Logistic regression for Analytics 360

Now, let's move on to our task. Here is how you would create a model to predict whether a visitor will make a transaction.

#standardSQL
CREATE OR REPLACE MODEL `bqml_codelab.sample_model` 
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170631'
LIMIT 100000;

Here, we use the visitor's device's operating system, whether said device is a mobile device, the visitor's country and the number of page views as the criteria for whether a transaction has been made.

In this case, "codelab" is the name of the dataset and "sample_model" is the name of our model. The model type specified is binary logistic regression. In this case, label is what we're trying to fit to. Note that if you're only interested in 1 column, this is an alternative way to setting input_label_cols. We're also limiting our training data to those collected from 1 August 2016 to 31 June 2017. We're doing this to save the last month of data for "prediction". Furthermore, we're limiting to 100,000 data points to save us some time. Feel free to remove the last line if you're not in a rush.

Running the CREATE MODEL command creates a Query Job that will run asynchronously so you can, for example, close or refresh the BigQuery UI window.

[Optional] Model information & training statistics

If interested, you can get information about the model by clicking on sample_model under bqml_codelab dataset in the UI. Under Details, you should find some basic model info and training options used to produce the model. Under Training Stats, you should see a table similar to this:

Evaluate our model

#standardSQL
SELECT
  *
FROM
  ml.EVALUATE(MODEL `bqml_codelab.sample_model`, (
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'));

If used with a linear regression model, the above query returns the following columns: mean_absolute_error, mean_squared_error, mean_squared_log_error, median_absolute_error, r2_score, explained_variance. If used with a a logistic regression model, the above query returns the following columns: precision, recall, accuracy, f1_score, log_loss, roc_auc. Please consult the machine learning glossary or run a Google search to understand how each of these metrics are calculated and what they mean.

Concretely, you'll realise the SELECT and FROM portions of the query is identical to that used during training. The WHERE portion reflects the change in time frame and the FROM portion shows that we're calling ml.EVALUATE. You should see a table similar to this:

Predict purchases per country

Here we try to predict the number of transactions made by visitors of each country, sort the results and select the top 10 countries by purchases.

#standardSQL
SELECT
  country,
  SUM(predicted_label) as total_predicted_purchases
FROM
  ml.PREDICT(MODEL `bqml_codelab.sample_model`, (
SELECT
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(totals.pageviews, 0) AS pageviews,
  IFNULL(geoNetwork.country, "") AS country
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 10;

Notice this query is very similar to the evaluation query we demonstrated in the previous section. Instead of ml.EVALUATE, we use ml.PREDICT here and we wrap the BQML portion of the query with standard SQL commands. Concretely, we're interested in the country and the sum of purchases for each country, so that's what we SELECT, GROUP BY and ORDER BY. LIMIT is used here to ensure we only get the top 10 results. You should see a table similar to this:

Predict purchases per user

Here is another example. This time we try to predict the number of transactions each visitor makes, sort the results and select the top 10 visitors by transactions.

#standardSQL
SELECT
  fullVisitorId,
  SUM(predicted_label) as total_predicted_purchases
FROM
  ml.PREDICT(MODEL `bqml_codelab.sample_model`, (
SELECT
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(totals.pageviews, 0) AS pageviews,
  IFNULL(geoNetwork.country, "") AS country,
  fullVisitorId
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY fullVisitorId
ORDER BY total_predicted_purchases DESC
LIMIT 10;

You should see a table similar to this:

You're done with this codelab. Looking for a challenge? Try making a linear regression model with BQML.

What we've covered

Next Steps