Generative Insights with BigQuery SQL and Vertex AI

1. Introduction

In this codelab, we will build a Movie Success Rating prediction and prescription app with BigQuery SQL queries and the Vertex AI PaLM API. The model used to perform text generation is text-bison, hosted as a remote function in BigQuery.

The services used are:

  1. BigQuery ML
  2. Vertex AI PaLM API
  3. Cloud Shell

What you'll build

You'll create

  • A BigQuery dataset to contain the model
  • A BigQuery ML model that predicts the success score of a movie based on the GENRE and RUNTIME attributes of the movie
  • A BigQuery model that hosts the Vertex AI PaLM API as a remote function
  • An external connection to establish the connection between BigQuery and Vertex AI

2. Requirements

  • A browser, such as Chrome or Firefox
  • A Google Cloud project with billing enabled

Before you begin

  1. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project
  2. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project
  3. Make sure all the necessary APIs (BigQuery API, Vertex AI API, BigQuery Connection API) are enabled
  4. You will use Cloud Shell, a command-line environment running in Google Cloud that comes preloaded with bq. Refer to the documentation for gcloud commands and usage

From the Cloud Console, click Activate Cloud Shell in the top-right corner.


If your project is not set, use the following command to set it:

gcloud config set project <YOUR_PROJECT_ID>
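If the APIs from step 3 are not yet enabled, you can also enable them from Cloud Shell; one way to do this, using the standard service names:

gcloud services enable bigquery.googleapis.com \
  aiplatform.googleapis.com \
  bigqueryconnection.googleapis.com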
  5. Navigate to the BigQuery console directly by entering the following URL in your browser: https://console.cloud.google.com/bigquery

3. Preparing data

In this use case, we will use the movies dataset derived from the MovieLens source.

  1. Create a dataset:

A BigQuery dataset is a collection of tables. All tables in a dataset are stored in the same data location. You can also attach custom access controls to limit access to a dataset and its tables.

In Cloud Shell, use the bq mk command to create a dataset called "movie_insights"

bq mk --location=us-central1 movie_insights
  2. Clone the source repository to your Cloud Shell machine:
git clone https://github.com/AbiramiSukumaran/movie_score_genai_insights
  3. Navigate to the new project directory created in your Cloud Shell machine:
cd movie_score_genai_insights
  4. Use the bq load command to load your CSV file into a BigQuery table (note that you can also upload it directly from the BigQuery UI):
bq load --source_format=CSV --skip_leading_rows=1 movie_insights.movie_score \
./movies_data.csv \
Id:numeric,name:string,rating:string,genre:string,year:numeric,released:string,score:string,director:string,writer:string,star:string,country:string,budget:numeric,company:string,runtime:numeric,data_cat:string
  5. Query a sample of rows to check that the movie_score table was created and the data loaded into the dataset:
bq query --use_legacy_sql=false \
'SELECT name, rating, genre, runtime FROM movie_insights.movie_score LIMIT 3;'
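
Optionally, as a quick sanity check (not part of the original steps), you can inspect the table schema and row count from Cloud Shell:

bq show movie_insights.movie_score

bq query --use_legacy_sql=false \
'SELECT COUNT(*) AS total_rows FROM movie_insights.movie_score;'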

4. Data to ML

Let's create a classification model to predict the success score of a movie based on the GENRE and RUNTIME attributes. We will use the CREATE MODEL statement with the option model_type='LOGISTIC_REG' to create and train a logistic regression model.

Run the following query in the QUERY EDITOR of the BigQuery console SQL Workspace:

CREATE OR REPLACE MODEL
  `movie_insights.model_rating_by_runtime_genre`
OPTIONS
  ( model_type='LOGISTIC_REG',
    auto_class_weights=TRUE,
    data_split_method='NO_SPLIT',
    model_registry='vertex_ai',   
    vertex_ai_model_version_aliases=['logistic_reg', 'experimental'],
    input_label_cols=['score']
  ) AS
SELECT name, genre,runtime, score
FROM
  movie_insights.movie_score
WHERE
  data_cat = 'TRAIN';

Query Details:

  1. The CREATE MODEL statement trains a model using the training data in the SELECT statement.
  2. The OPTIONS clause specifies the model type and training options. Here, the LOGISTIC_REG option specifies a logistic regression model type. It is not necessary to specify a binary logistic regression model versus a multiclass logistic regression model: BigQuery ML can determine which to train based on the number of unique values in the label column.
  3. data_split_method='NO_SPLIT' forces BigQuery ML to train on the data selected by the query conditions (data_cat = 'TRAIN'). Note that it is generally better to use 'AUTO_SPLIT' for this option so the framework (or the service, in this case) can randomize the train/test split.
  4. The input_label_cols option specifies which column in the SELECT statement to use as the label column. Here, the label column is score, so the model will learn which of the 10 values of score is most likely based on the other values present in each row.
  5. The 'auto_class_weights=TRUE' option balances the class labels in the training data. By default, the training data is unweighted. If the labels in the training data are imbalanced, the model may learn to predict the most popular class of labels more heavily.
  6. The SELECT statement queries the table we loaded with the csv data. The WHERE clause filters the rows in the input table so that only the TRAIN dataset is selected in this step.
  7. The options model_registry='vertex_ai' and vertex_ai_model_version_aliases=['logistic_reg', 'experimental'] are OPTIONAL; they explicitly register the model in the Vertex AI Model Registry. You can read more about this in this blog.

Once the model is created, it appears under the movie_insights dataset in the SCHEMA section of the BigQuery SQL Workspace.


After creating your model, evaluate the performance of the model using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted values against the actual data.
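
For example, a minimal evaluation query over the rows we held out as TEST data (using the model and columns created above) looks like this:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `movie_insights.model_rating_by_runtime_genre`,
    (
    SELECT
      name, genre, runtime, score
    FROM
      movie_insights.movie_score
    WHERE
      data_cat = 'TEST'
    )
  );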

You can also view the evaluation metrics of your model from the MODEL page in the BigQuery console.

Key metrics at a glance:

Precision - What proportion of positive identifications was actually correct? Precision = True Positives / (True Positives + False Positives)

Recall - What proportion of actual positives was identified correctly? Recall = True Positives / (True Positives + False Negatives)

Accuracy - A metric for evaluating classification models; it is the fraction of predictions the model got right. Accuracy = Number of correct predictions / Total number of predictions
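
If you want to see the raw counts behind these metrics, you can also run ML.CONFUSION_MATRIX against the same TEST rows; a sketch using the model and table names from this codelab:

SELECT
  *
FROM
  ML.CONFUSION_MATRIX(MODEL `movie_insights.model_rating_by_runtime_genre`,
    (
    SELECT
      name, genre, runtime, score
    FROM
      movie_insights.movie_score
    WHERE
      data_cat = 'TEST'
    )
  );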

5. Predicting movie score using the model

Prediction time! The following query predicts the score of each movie in the dataset that is categorized as "TEST" data.

Run the following query in the QUERY EDITOR of the BigQuery console SQL Workspace:

SELECT
  *
FROM
  ML.PREDICT (MODEL movie_insights.model_rating_by_runtime_genre,
    (
    SELECT
      *
    FROM
      movie_insights.movie_score
    WHERE
      data_cat= 'TEST'
     )
  );

The model result shows the predicted_score of each movie on a scale of 1 to 10 (classification). You might be wondering why there are several prediction rows for each movie: that is because the model returns the possible predicted labels and the probability of occurrence of each one, in decreasing order.
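
To inspect the per-label probabilities behind each prediction, you can unnest the probability array that ML.PREDICT returns for classification models (the predicted_score and predicted_score_probs column names follow BigQuery ML's predicted_<label> naming; verify them against your actual result schema):

SELECT
  name,
  predicted_score,
  probs.label,
  probs.prob
FROM
  ML.PREDICT (MODEL movie_insights.model_rating_by_runtime_genre,
    (
    SELECT
      *
    FROM
      movie_insights.movie_score
    WHERE
      data_cat = 'TEST'
    )
  ),
  UNNEST(predicted_score_probs) AS probs
ORDER BY
  name, probs.prob DESC;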

Analyze predicted results and the model:

You can do two great analysis steps with the prediction to understand the results:

  1. To understand why your model is generating these prediction results, you can use the ML.EXPLAIN_PREDICT function.
  2. To know which features are the most important in determining the movie score in general, you can use the ML.GLOBAL_EXPLAIN function.

You can read about these steps in detail in the documentation.
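
As an example, here is a minimal ML.EXPLAIN_PREDICT sketch over the TEST rows; top_k_features limits the output to the three most influential features per row. (ML.GLOBAL_EXPLAIN additionally requires the model to have been trained with enable_global_explain=TRUE.)

SELECT
  *
FROM
  ML.EXPLAIN_PREDICT(MODEL `movie_insights.model_rating_by_runtime_genre`,
    (
    SELECT
      *
    FROM
      movie_insights.movie_score
    WHERE
      data_cat = 'TEST'
    ),
    STRUCT(3 AS top_k_features));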

6. Data to Generative AI

Let's derive insights from the movies dataset by asking an LLM (Large Language Model) for a summary of the factors that influence a movie score to be greater than 5, using Generative AI with Vertex AI's text-bison (latest) model and only SQL queries.

  1. The movie_score table we created will be the input for this step as well.
  2. An external connection will be created to establish access between BigQuery and the Vertex AI services.
  3. The BigQuery ML.GENERATE_TEXT construct will be used to invoke the PaLM API remotely from Vertex AI.

7. Create an External Connection

Enable the BigQuery Connection API if you haven't already, and note down the Service Account ID from the connection configuration details:

  1. Click the +ADD button in the BigQuery Explorer pane (on the left of the BigQuery console) and click "Connection to external data sources" in the list of popular sources
  2. Select "BigLake and remote functions" as the Connection type, "Region" as the Location type, "us-central1 (Iowa)" as the region, and "bq_llm_connection" as the Connection ID


  3. Once the connection is created, take note of the Service Account generated in the connection configuration details
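
If you prefer the command line, the connection can also be created and inspected with the bq tool; a sketch (replace the project ID placeholder):

bq mk --connection --location=us-central1 --connection_type=CLOUD_RESOURCE bq_llm_connection

bq show --connection <YOUR_PROJECT_ID>.us-central1.bq_llm_connection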

Grant permissions

In this step, we will grant the Service Account permission to access the Vertex AI service:

Open IAM, add the Service Account you copied after creating the external connection as the Principal, and select the "Vertex AI User" role.
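
Alternatively, you can grant the role from Cloud Shell (replace the placeholders with your project ID and the connection's Service Account):

gcloud projects add-iam-policy-binding <YOUR_PROJECT_ID> \
  --member='serviceAccount:<CONNECTION_SERVICE_ACCOUNT>' \
  --role='roles/aiplatform.user'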


8. Create a remote ML model

Create the remote model that represents a hosted Vertex AI large language model:

CREATE OR REPLACE MODEL movie_insights.llm_model
  REMOTE WITH CONNECTION `us-central1.bq_llm_connection`
  OPTIONS (remote_service_type = 'CLOUD_AI_LARGE_LANGUAGE_MODEL_V1');

It creates a model named llm_model in the dataset movie_insights which leverages the CLOUD_AI_LARGE_LANGUAGE_MODEL_V1 API of Vertex AI as a remote function. This will take several seconds to complete.

9. Generate text using the ML model

Once the model is created, use the model to generate, summarize or categorize text.

SELECT
  ml_generate_text_result['predictions'][0]['content'] AS generated_text,
  ml_generate_text_result['predictions'][0]['safetyAttributes']
    AS safety_attributes,
  * EXCEPT (ml_generate_text_result)
FROM
  ML.GENERATE_TEXT(
    MODEL `movie_insights.llm_model`,
    (
 SELECT
      CONCAT('FROM THE FOLLOWING TEXT ABOUT MOVIES, WHAT DO YOU THINK ARE THE FACTORS INFLUENCING A MOVIE SCORE TO BE GREATER THAN 5?: ', movie_data) AS prompt
    FROM (
      SELECT
        REPLACE(STRING_AGG( CONCAT('A movie named ',name, ' from the country ', country, ' with a censor rating of ',rating, ' and a budget of ', budget, ' produced by ', company, ' with a runtime of about ', runtime, ' and in the genre ', genre, ' starring ', star, ' has had a success score of ', score, '') ), ',','. ') AS movie_data
      FROM (
        SELECT
          *
        FROM
          `movie_insights.movie_score`
        WHERE
          CAST(SCORE AS INT64) > 5
        LIMIT
          50) ) AS MOVIES
    ),
    STRUCT(
      0.2 AS temperature,
      100 AS max_output_tokens));

Explanation:

ml_generate_text_result is the response from the text generation model in JSON format, containing both content and safety attributes:

  a. Content represents the generated text result
  b. Safety Attributes represent the built-in content filter with an adjustable threshold that is enabled in the Vertex AI PaLM API to avoid any unintended or unforeseen responses from the large language model; the response is blocked if it violates the safety threshold

ML.GENERATE_TEXT is the construct you use in BigQuery to access the Vertex AI LLM to perform text generation tasks

CONCAT appends your PROMPT statement and the database record

movie_insights is the dataset name and movie_score is the name of the table that has the data we will use in the prompt design

temperature is the prompt parameter that controls the randomness of the response; the lower the value, the more relevant and deterministic the response

max_output_tokens is the maximum number of tokens you want in the response

As you can see in the query response, the ml_generate_text_result column is nested JSON and not easy to read directly.

10. Flatten the query result

Let's flatten the result so we don't have to decode the JSON explicitly in the query:

SELECT
  *
FROM
  ML.GENERATE_TEXT( MODEL movie_insights.llm_model,
    (
    SELECT
      CONCAT('FROM THE FOLLOWING TEXT ABOUT MOVIES, WHAT DO YOU THINK ARE THE FACTORS INFLUENCING A MOVIE SCORE TO BE GREATER THAN 5?: ', movie_data) AS prompt
    FROM (
      SELECT
        REPLACE(STRING_AGG( CONCAT('A movie named ',name, ' from the country ', country, ' with a censor rating of ',rating, ' and a budget of ', budget, ' produced by ', company, ' with a runtime of about ', runtime, ' and in the genre ', genre, ' starring ', star, ' has had a success score of ', score, '') ), ',','. ') AS movie_data
      FROM (
        SELECT
          *
        FROM
          `movie_insights.movie_score`
        WHERE
          CAST(SCORE AS INT64) > 5
        LIMIT
          50) ) AS MOVIES),
    STRUCT( 0.2 AS temperature,
      100 AS max_output_tokens,
      TRUE AS flatten_json_output));

Explanation:

flatten_json_output is a boolean option which, if set to true, returns the generated text extracted from the JSON response as flat, understandable text.

This time, the generated text in the query response is returned as plain text rather than nested JSON.
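
For example, to keep only the generated text, you can select the flattened output column directly; the ml_generate_text_llm_result column name below follows the ML.GENERATE_TEXT documentation, and the short literal prompt is just an illustration:

SELECT
  ml_generate_text_llm_result AS generated_text
FROM
  ML.GENERATE_TEXT( MODEL movie_insights.llm_model,
    (SELECT 'List three factors that typically influence a movie score to be greater than 5.' AS prompt),
    STRUCT( 0.2 AS temperature,
      100 AS max_output_tokens,
      TRUE AS flatten_json_output));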

11. Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this codelab, you can delete the Vertex AI endpoint that you created as part of the ML step by navigating to the Vertex AI Endpoints page.
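
If you also want to remove the BigQuery resources created in this codelab, a sketch of the cleanup commands (double-check the names before running; bq rm is irreversible):

bq rm -r -f --dataset movie_insights

bq rm --connection --location=us-central1 bq_llm_connection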

12. Congratulations

Congratulations! You have successfully created a BQML model and performed LLM-based analytics on your movies dataset with a Vertex AI API, using only SQL queries. Check out the Vertex AI LLM product documentation to learn more about available models.