1. Overview
BigQuery ML (BQML) enables users to create and execute machine learning models in BigQuery using SQL queries. The goal is to democratise machine learning by enabling SQL practitioners to build models using their existing tools and to increase development speed by eliminating the need for data movement.
What you'll build
You'll use the sample Analytics 360 dataset to create a model that predicts whether a visitor will make a transaction.
What you'll learn
How to create, evaluate and use machine learning models in BigQuery
What you'll need
- A Browser, such as Chrome or Firefox
- Basic knowledge of SQL or BigQuery
- BQML documentation
2. Setup and Requirements
Self-paced environment setup
- Sign in to Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.
Remember the project ID, a unique name across all Google Cloud projects (the name above has already been taken and will not work for you, sorry!). It will be referred to later in this codelab as PROJECT_ID
.
- Next, you'll need to enable billing in Cloud Console in order to use Google Cloud resources.
Running through this codelab shouldn't cost much, if anything at all. Be sure to to follow any instructions in the "Cleaning up" section which advises you how to shut down resources so you don't incur billing beyond this tutorial. New users of Google Cloud are eligible for the $300 USD Free Trial program.
Open BigQuery Console
In the Google Developer Console, click the Menu icon on the top left of the screen.
Scroll down the menu to the bottom, and click BigQuery:
This will open up the BigQuery console in a new browser tab, that looks like this:
But, there is nothing in here! Luckily, there are tons of public datasets available in BigQuery for you to explore.
3. Create a dataset
To create a dataset, click the arrow next to your project name and select create new dataset.
Next, name your dataset bqml_codelab and click ok.
4. Create a Model
Logistic regression for Analytics 360
Now, let's move on to our task. Here is how you would create a model to predict whether a visitor will make a transaction.
#standardSQL CREATE OR REPLACE MODEL `bqml_codelab.sample_model` OPTIONS(model_type='logistic_reg') AS SELECT IF(totals.transactions IS NULL, 0, 1) AS label, IFNULL(device.operatingSystem, "") AS os, device.isMobile AS is_mobile, IFNULL(geoNetwork.country, "") AS country, IFNULL(totals.pageviews, 0) AS pageviews FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20170631' LIMIT 100000;
Here, we use the visitor's device's operating system, whether said device is a mobile device, the visitor's country and the number of page views as the criteria for whether a transaction has been made.
In this case, "codelab" is the name of the dataset and "sample_model" is the name of our model. The model type specified is binary logistic regression. In this case, label
is what we're trying to fit to. Note that if you're only interested in 1 column, this is an alternative way to setting input_label_cols
. We're also limiting our training data to those collected from 1 August 2016 to 31 June 2017. We're doing this to save the last month of data for "prediction". Furthermore, we're limiting to 100,000 data points to save us some time. Feel free to remove the last line if you're not in a rush.
Running the CREATE MODEL command creates a Query Job that will run asynchronously so you can, for example, close or refresh the BigQuery UI window.
[Optional] Model information & training statistics
If interested, you can get information about the model by clicking on sample_model under bqml_codelab dataset in the UI. Under Details, you should find some basic model info and training options used to produce the model. Under Training Stats, you should see a table similar to this:
5. Evaluate the Model
Evaluate our model
#standardSQL SELECT * FROM ml.EVALUATE(MODEL `bqml_codelab.sample_model`, ( SELECT IF(totals.transactions IS NULL, 0, 1) AS label, IFNULL(device.operatingSystem, "") AS os, device.isMobile AS is_mobile, IFNULL(geoNetwork.country, "") AS country, IFNULL(totals.pageviews, 0) AS pageviews FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'));
If used with a linear regression model, the above query returns the following columns: mean_absolute_error
, mean_squared_error
, mean_squared_log_error
, median_absolute_error
, r2_score
, explained_variance
. If used with a a logistic regression model, the above query returns the following columns: precision
, recall
, accuracy
, f1_score
, log_loss
, roc_auc
. Please consult the machine learning glossary or run a Google search to understand how each of these metrics are calculated and what they mean.
Concretely, you'll realise the SELECT
and FROM
portions of the query is identical to that used during training. The WHERE
portion reflects the change in time frame and the FROM
portion shows that we're calling ml.EVALUATE
. You should see a table similar to this:
6. Use the Model
Predict purchases per country
Here we try to predict the number of transactions made by visitors of each country, sort the results and select the top 10 countries by purchases.
#standardSQL SELECT country, SUM(predicted_label) as total_predicted_purchases FROM ml.PREDICT(MODEL `bqml_codelab.sample_model`, ( SELECT IFNULL(device.operatingSystem, "") AS os, device.isMobile AS is_mobile, IFNULL(totals.pageviews, 0) AS pageviews, IFNULL(geoNetwork.country, "") AS country FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170801')) GROUP BY country ORDER BY total_predicted_purchases DESC LIMIT 10;
Notice this query is very similar to the evaluation query we demonstrated in the previous section. Instead of ml.EVALUATE
, we use ml.PREDICT
here and we wrap the BQML portion of the query with standard SQL commands. Concretely, we're interested in the country and the sum of purchases for each country, so that's what we SELECT
, GROUP BY
and ORDER BY
. LIMIT
is used here to ensure we only get the top 10 results. You should see a table similar to this:
Predict purchases per user
Here is another example. This time we try to predict the number of transactions each visitor makes, sort the results and select the top 10 visitors by transactions.
#standardSQL SELECT fullVisitorId, SUM(predicted_label) as total_predicted_purchases FROM ml.PREDICT(MODEL `bqml_codelab.sample_model`, ( SELECT IFNULL(device.operatingSystem, "") AS os, device.isMobile AS is_mobile, IFNULL(totals.pageviews, 0) AS pageviews, IFNULL(geoNetwork.country, "") AS country, fullVisitorId FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170801')) GROUP BY fullVisitorId ORDER BY total_predicted_purchases DESC LIMIT 10;
You should see a table similar to this:
7. Congratulations!
You're done with this codelab. Looking for a challenge? Try making a linear regression model with BQML.
What we've covered
- Create a binary logistic regression model
- Evaluate the model
- Use model to make predictions
Next Steps
- For more information on BQML, please refer to the documentation.