Movie Rating Prediction with BQML using SQL

1. Introduction

In this codelab, we'll create a movie score prediction model with a BigQuery ML custom model using only SQL, and deploy the model in Vertex AI. You will learn:

  1. How to create a BigQuery dataset using Cloud Shell and load data from a file
  2. How to use BigQuery ML supervised learning to create a multi-class logistic regression model for movie score prediction, without much coding and using only SQL queries

What you'll build

You'll create:

  • A BigQuery dataset to contain the table and model components
  • A BigQuery ML model for predicting the movie rating using SQL queries
  • A deployment of the model in the Vertex AI Model Registry

Requirements

  • A browser, such as Chrome or Firefox
  • A Google Cloud project with billing enabled

2. Create your project

  1. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
  2. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
  3. Navigate to BigQuery to enable the API. You can also open the BigQuery web UI directly by entering the following URL in your browser: https://console.cloud.google.com/bigquery

3. Activate Cloud Shell

  1. You will use Cloud Shell, a command-line environment running in Google Cloud that comes preloaded with bq. From the Cloud Console, click Activate Cloud Shell in the top right corner.
  2. Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your project ID. Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list
  3. Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
gcloud config list project
  4. If your project is not set, use the following command to set it:
gcloud config set project <PROJECT_ID>

Refer to the documentation for gcloud commands and usage.

4. Preparing training data

Preparing the training data is a major step in any data-related project, product, or app; it requires significant domain expertise, in addition to the technology, to produce an optimal dataset and make it ready for your ML project. For the purposes of this codelab, we are going to use a data file that is already prepared.

5. Creating and loading the dataset

A BigQuery dataset is a collection of tables. All tables in a dataset are stored in the same data location. You can also attach custom access controls to limit access to a dataset and its tables.

  1. In Cloud Shell, use the bq mk command to create a dataset called "movies."
bq mk --location=<<LOCATION>> movies

Set the location to a region (for example, asia-south1).
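Alternatively, if you prefer to stay in SQL, the same dataset can be created from the BigQuery SQL Workspace with a DDL statement. This is a minimal sketch, assuming the asia-south1 location used above:

CREATE SCHEMA IF NOT EXISTS movies
OPTIONS (location = 'asia-south1');  -- same region as in the bq mk example above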

  2. Make sure you have the data file (.csv) ready. Execute the following commands in Cloud Shell to clone the repository and navigate to the project:
git clone <<repository link>>

cd movie-score
  3. Use the bq load command to load your CSV file into a BigQuery table (please note that you can also directly upload from the BigQuery UI):
bq load --source_format=CSV --skip_leading_rows=1 movies.movies_score \
./movies_bq_src.csv \
Id:numeric,name:string,rating:string,genre:string,year:numeric,released:string,score:string,director:string,writer:string,star:string,country:string,budget:numeric,company:string,runtime:numeric,data_cat:string

Option descriptions:

  • --source_format=CSV: uses the CSV data format when parsing the data file.
  • --skip_leading_rows=1: skips the first line in the CSV file because it is a header row.
  • movies.movies_score: the first positional argument, which defines the table the data should be loaded into.
  • ./movies_bq_src.csv: the second positional argument, which defines the file to load. In addition to local files, the bq load command can load files from Cloud Storage with gs://my_bucket/path/to/file URIs.
  • The schema, which can be defined in a JSON schema file or as a comma-separated list (I used a comma-separated list).

Hurray! Our CSV data is now loaded in the table movies.movies_score.
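To quickly verify the load, you can run a simple row count against the new table; the count should match the number of data rows in movies_bq_src.csv:

SELECT COUNT(*) AS total_rows
FROM `movies.movies_score`;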

  4. Query the data in one of three ways:

We can interact with BigQuery in three ways; we'll try out two of them:

a. BigQuery web UI
b. The bq command
c. API

SELECT name, rating, genre, runtime FROM movies.movies_score limit 3;

I have used the BigQuery Web SQL Workspace to run queries.

Using bq command:

bq query --use_legacy_sql=false \
'SELECT name, rating, genre, runtime FROM movies.movies_score limit 3;'

6. Predicting movie rating on a scale of 1 to 10

BigQuery ML supports supervised learning with the logistic regression model type. You can use the binary logistic regression model type to predict whether a value falls into one of two categories, or you can use the multiclass logistic regression model type to predict whether a value falls into one of multiple categories. These are known as classification problems, because they attempt to classify data into two or more categories.

A quick note about the choice of model: the model used here is an experimental choice. Based on an evaluation of the results across a few models I tried initially, I went ahead with LOGISTIC_REG to keep it simple and to get results closer to the actual movie ratings from several databases. Please note that this should be considered just a sample implementation and is NOT necessarily the recommended model for this use case. One other way of implementing this is to predict the outcome of the movie as GOOD / BAD with the same logistic regression model, instead of predicting the score, as sketched below.
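As a rough illustration of that alternative, the sketch below derives a binary GOOD / BAD label from the score column; the model name, the verdict label, and the threshold of 7 are hypothetical choices for illustration, not part of this codelab's dataset design:

CREATE OR REPLACE MODEL `movies.movies_goodbad_model`
OPTIONS (
  model_type = 'LOGISTic_REG',
  auto_class_weights = TRUE,
  input_label_cols = ['verdict']
) AS
SELECT
  * EXCEPT(id, data_cat, score),
  -- Hypothetical threshold: scores of 7 and above are labeled GOOD
  IF(SAFE_CAST(score AS FLOAT64) >= 7, 'GOOD', 'BAD') AS verdict
FROM
  `movies.movies_score`
WHERE
  data_cat = 'TRAIN';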

Select your training data

We have already split the movie data (in the csv) into three categories in the table using the field "data_cat", which has one of three values: TRAIN, TEST and PREDICT. Splitting the dataset for testing and training purposes is an important aspect of the model. If you need more information on dataset splitting, refer to the documentation.
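As a quick sanity check before training, the query below (assuming the table loaded as described earlier) shows how many rows fall into each data_cat bucket:

SELECT
  data_cat,
  COUNT(*) AS row_count
FROM
  `movies.movies_score`
GROUP BY
  data_cat
ORDER BY
  data_cat;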

Create the logistic regression model

We can use the CREATE MODEL statement with the option 'LOGISTIC_REG' to create and train a logistic regression model.

Run the below query in BigQuery Web UI SQL Workspace:

CREATE OR REPLACE MODEL
  `movies.movies_score_model`
OPTIONS
  ( model_type='LOGISTIC_REG',
    auto_class_weights=TRUE,
    data_split_method='NO_SPLIT',
    input_label_cols=['score']
  ) AS
SELECT
  * EXCEPT(id, data_cat)
FROM
  `movies.movies_score`
WHERE
  data_cat = 'TRAIN';

Query Details:

  1. The CREATE MODEL statement trains a model using the training data in the SELECT statement
  2. The OPTIONS clause specifies the model type and training options. Here, the LOGISTIC_REG option specifies a logistic regression model type. It is not necessary to specify a binary logistic regression model versus a multiclass logistic regression model: BigQuery ML can determine which to train based on the number of unique values in the label column
  3. data_split_method='NO_SPLIT' forces BQML to train on the data per the query conditions (data_cat = 'TRAIN'). Also note that it is generally better to use 'AUTO_SPLIT' for this option, to allow the framework (or service, in this case) to randomize the partition of train/test splits; see the sketch after this list
  4. The input_label_cols option specifies which column in the SELECT statement to use as the label column. Here, the label column is score, so the model will learn which of the 10 values of score is most likely based on the other values present in each row
  5. The auto_class_weights=TRUE option balances the class labels in the training data. By default, the training data is unweighted. If the labels in the training data are imbalanced, the model may learn to predict the most popular class of labels more heavily
  6. The SELECT statement queries the table we loaded with the csv data. The WHERE clause filters the rows in the input table so that only the TRAIN dataset is selected in this step
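For reference, here is a minimal sketch of the AUTO_SPLIT variant mentioned in point 3; the model name is hypothetical, and the WHERE clause assumes you still want to keep the PREDICT rows out of training:

CREATE OR REPLACE MODEL `movies.movies_score_model_auto`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  auto_class_weights = TRUE,
  -- Let BigQuery ML randomize the train/evaluation partition
  data_split_method = 'AUTO_SPLIT',
  input_label_cols = ['score']
) AS
SELECT
  * EXCEPT(id, data_cat)
FROM
  `movies.movies_score`
WHERE
  data_cat IN ('TRAIN', 'TEST');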

Once the model is created, its Labels and Features appear in the SCHEMA section of the BigQuery SQL Workspace.

Evaluate your logistic regression model

After creating your model, evaluate the performance of the model using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted values against the actual data. The query to evaluate the model is as follows:

SELECT
  *
FROM
  ML.EVALUATE (MODEL movies.movies_score_model,
    (
    SELECT
      *
    FROM
      movies.movies_score
    WHERE
      data_cat = 'TEST'
    )
  )

The ML.EVALUATE function takes the model trained in our previous step and evaluation data returned by a SELECT subquery. The function returns a single row of statistics about the model.

Because you performed a logistic regression, the results include the following metrics: precision, recall, accuracy, f1_score, log_loss and roc_auc, which are critical in evaluating the performance of the model. You can also call ML.EVALUATE without providing input data; in that case, ML.EVALUATE retrieves the evaluation metrics derived during training, which uses the automatically reserved evaluation dataset.
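For reference, calling ML.EVALUATE with only the model looks like this; note that because our model was trained with data_split_method='NO_SPLIT', these metrics would reflect the training data rather than an automatically reserved split:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `movies.movies_score_model`);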

Key metrics at a glance:

  • Precision - What proportion of positive identifications was actually correct? Precision = True Positives / (True Positives + False Positives)
  • Recall - What proportion of actual positives was identified correctly? Recall = True Positives / (True Positives + False Negatives)
  • Accuracy - A metric for evaluating classification models; it is the fraction of predictions the model actually got right. Accuracy = Number of correct predictions / Total number of predictions

Predict movie rating using ML.PREDICT

The following query predicts the score of each movie in the PREDICT portion of the dataset.

SELECT
  *
FROM
  ML.PREDICT (MODEL movies.movies_score_model,
    (
    SELECT
      *
    FROM
      movies.movies_score
    WHERE
      data_cat = 'PREDICT'
    )
  )

The model result shows the predicted SCORE of the movie on a scale of 1 to 10 (classification). You might be wondering why there are several prediction rows against each movie: that is because the model returns the possible predicted labels and the probability of occurrence of each one, in decreasing order of probability.
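If you only want the single most likely rating per movie, a minimal variant is shown below (assuming BigQuery ML's convention of naming the top-prediction column predicted_<label_column>, here predicted_score):

SELECT
  name,
  predicted_score
FROM
  ML.PREDICT (MODEL movies.movies_score_model,
    (
    SELECT
      *
    FROM
      movies.movies_score
    WHERE
      data_cat = 'PREDICT'
    )
  );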

Analyze predicted results and the model

Additionally, you can do two great analysis steps as part of prediction results:

  1. To understand why your model is generating these prediction results, you can use the ML.EXPLAIN_PREDICT function
  2. To know which features are the most important for determining the movie score in general, you can use the ML.GLOBAL_EXPLAIN function

You can read about these steps in detail in the documentation.
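As a rough sketch of those two steps (a sample only; note in particular that ML.GLOBAL_EXPLAIN typically requires the model to be trained with enable_global_explain=TRUE, which the model above does not set):

-- Per-row explanation: top 3 feature attributions for each prediction
SELECT
  *
FROM
  ML.EXPLAIN_PREDICT (MODEL movies.movies_score_model,
    (SELECT * FROM movies.movies_score WHERE data_cat = 'PREDICT'),
    STRUCT(3 AS top_k_features));

-- Global feature importance, assuming the model was created with enable_global_explain=TRUE
SELECT
  *
FROM
  ML.GLOBAL_EXPLAIN (MODEL movies.movies_score_model);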

7. Predict movie rating using BigQuery AutoML

AutoML is great if you want to let the BigQuery service create the model for you and relax until your model results are generated. The AutoML capability is available directly in BQML (model_type=AUTOML_CLASSIFIER or AUTOML_REGRESSOR) and is much easier to use if the data is already in BigQuery.

CREATE OR REPLACE MODEL movies.auto_movies
       OPTIONS(model_type='AUTOML_CLASSIFIER',
               input_label_cols=['score'],
               budget_hours=1.0)
AS 
SELECT
  * EXCEPT(id, data_cat)
FROM
  movies.movies_score;

Once the model is created, you can carry out the EVALUATE, PREDICT and EXPLAIN_PREDICT steps just as we discussed for the custom BQML model. Read more about BigQuery AutoML in the documentation.
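As with the custom model, a minimal sketch of those follow-up steps could look like this (assuming ML.PREDICT ignores the extra id and data_cat columns that were not used during training):

-- Evaluation metrics for the AutoML model
SELECT
  *
FROM
  ML.EVALUATE (MODEL movies.auto_movies);

-- Predictions from the AutoML model
SELECT
  *
FROM
  ML.PREDICT (MODEL movies.auto_movies,
    (SELECT * FROM movies.movies_score WHERE data_cat = 'PREDICT'));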

8. One-click to Deploy BQML Models to Vertex AI Model Registry

You can now see all your ML models within Vertex AI Model Registry, making it easier for your organization to manage and deploy models. This includes models built with BigQuery ML, AutoML, and custom trained models. Here's how you do it:

CREATE OR REPLACE MODEL
  `movies.movies_score_model`
OPTIONS
  ( model_type='LOGISTIC_REG',
    auto_class_weights=TRUE,
    data_split_method='NO_SPLIT',
    model_registry='vertex_ai',
    vertex_ai_model_version_aliases=['logistic_reg', 'experimental'],
    input_label_cols=['score']
  ) AS
SELECT
  * EXCEPT(id, data_cat)
FROM
  `movies.movies_score`
WHERE
  data_cat = 'TRAIN';

Include the model_registry='vertex_ai' option in the CREATE MODEL query so that you can explicitly choose which models to register to the Vertex AI Model Registry. You can read more about this in this blog.

9. Summary

In less than 30 minutes, we have:

  1. Set up BigQuery as our database for the analytics project
  2. Created a custom prediction model using only BigQuery SQL and no other coding
  3. Created a BigQuery AutoML model using only SQL queries
  4. Deployed the BQML model to the Vertex AI Model Registry

10. Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this post, follow these steps:

  1. In the Google Cloud console, go to the Manage resources page
  2. In the project list, select the project that you want to delete, and then click Delete
  3. In the dialog, type the project ID, and then click Shut down to delete the project

11. Congratulations

Congratulations! You have successfully created a movie score prediction model with both a custom BQML model and AutoML, and deployed the model to the Vertex AI Model Registry, all of it using only SQL.