SQL-only LLM with BigQuery ML using Vertex AI PaLM API

1. Introduction

In this codelab, I have listed the steps to perform summarization of source code from Github repos and identification of the language of programming in the repo, using Vertex AI Large Language Model for text generation ( text-bison) as a hosted remote function in BigQuery. Thanks to the GitHub Archive Project, we now have a full snapshot of over 2.8 million open source GitHub repositories in Google BigQuery Public Datasets. The list of services used are:

  1. BigQuery ML
  2. Vertex AI PaLM API

What you'll build

You'll create

  • A BigQuery Dataset to contain the model
  • A BigQuery Model that hosts the Vertex AI PaLM API as a remote function
  • An external connection to establish the connection between BigQuery and Vertex AI

2. Requirements

  • A browser, such as Chrome or Firefox
  • A Google Cloud project with billing enabled

3. Before you begin

  1. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project
  2. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project
  3. Make sure all the necessary APIs (BigQuery API, Vertex AI API, BigQuery Connection API) are enabled
  4. You will use Cloud Shell, a command-line environment running in Google Cloud that comes pre-loaded with bq. Refer documentation for gcloud commands and usage

From the Cloud Console, click Activate Cloud Shell on the top right corner:

51622c00acec2fa.png

If your project is not set, use the following command to set it:

gcloud config set project <YOUR_PROJECT_ID>
  1. Navigate to BigQuery console directly by entering the following URL in your browser: https://console.cloud.google.com/bigquery

4. Preparing data

In this use case we are using the source code content from github_repos dataset in the Google BigQuery Public Datasets. To use this, in the BigQuery console, search for "github_repos" and press enter. Click on the star next to the dataset that is listed as the search result. Then click on the "SHOW STARRED ONLY" option to see that dataset only from the public datasets.

dc6bf1e2fa6bba8a.png

Expand the tables in the dataset to view the schema and data preview. We are going to use the sample_contents, which only contains a sample (10%) of the full data in the contents table. Here is a preview of the data:

e021c689c56abf22.png

5. Creating the BigQuery dataset

A BigQuery dataset is a collection of tables. All tables in a dataset are stored in the same data location. You can also attach custom access controls to limit access to a dataset and its tables.

Create a dataset in the region "US" (or any region of our preference) named bq_llm

c67e9f929629739a.png

This dataset will house the ML model that we will create in the upcoming steps. Typically we would also store the data that we use in the ML application in a table in this dataset itself, however in our use case the data already lives in a BigQuery public dataset, we are going to reference that directly from our newly created dataset as required. If you want to do this project on your own dataset that lives in a CSV (or any other file), you can load your data into a BigQuery dataset into table by running the command below from the Cloud Shell terminal:

bq load --source_format=CSV --skip_leading_rows=1 bq_llm.table_to_hold_your_data \
./your_file.csv \ text:string,label:string

6. Creating external connection

Create an External Connection (Enable BQ Connection API if not already done) and note down the Service Account id from the connection configuration details:

  1. Click the +ADD button on the BigQuery Explorer pane (in the left of the BigQuery console) and click "Connection to external data sources" in the popular sources listed
  2. Select Connection type as "BigLake and remote functions" and provide "llm-conn" as Connection ID

6b75042881eaec5f.png

  1. Once the connection is created, take a note of the Service Account generated from the connection configuration details

7. Grant permissions

In this step we will grant permissions to the Service Account to access the Vertex AI service:

Open IAM and add the Service Account you copied after creating the external connection as the Principal and select "Vertex AI User" Role

ff8e1d730879f972.png

8. Create a remote ML model

Create the remote model that represents a hosted Vertex AI large language model:

CREATE OR REPLACE MODEL bq_llm.llm_model
  REMOTE WITH CONNECTION `us.llm-conn`
  OPTIONS (remote_service_type = 'CLOUD_AI_LARGE_LANGUAGE_MODEL_V1');

It creates a model named llm_model in the dataset bq_llm which leverages the CLOUD_AI_LARGE_LANGUAGE_MODEL_V1 API of Vertex AI as a remote function. This will take several seconds to complete.

9. Generate text using the ML model

Once the model is created, use the model to generate, summarize or categorize text.

SELECT
  ml_generate_text_result['predictions'][0]['content'] AS generated_text,
  ml_generate_text_result['predictions'][0]['safetyAttributes']
    AS safety_attributes,
  * EXCEPT (ml_generate_text_result)
FROM
  ML.GENERATE_TEXT(
    MODEL `bq_llm.llm_model`,
    (
  SELECT
        CONCAT('Can you read the code in the following text and generate a summary for what the code is doing and what language it is written in:', content)
        AS prompt from `bigquery-public-data.github_repos.sample_contents`
          limit 5
    ),
    STRUCT(
      0.2 AS temperature,
      100 AS max_output_tokens));

**Explanation:

ml_generate_text_result** is the response from the text generation model in JSON format that contains both content and safety attributes: a. Content represents the generated text result b. Safety Attributes represent the built-in content filter with an adjustable threshold that is enabled in Vertex AI Palm API to avoid any unintended or unforeseen responses from the large language model - the response is blocked if it violates the safety threshold

ML.GENERATE_TEXT is the construct you use in BigQuery to access the Vertex AI LLM to perform text generation tasks

CONCAT appends your PROMPT statement and the database record

github_repos is the dataset name and sample_contents is the name of the table that holds the data we will use in the prompt design

Temperature is the prompt parameter to control the randomness of the response - lesser the better in terms of relevance

Max_output_tokens is the number of words you want in response

The query response looks like this:

1156f6eecb548cd5.png

10. Flatten the query result

Let's flatten the result so we don't have to decode the JSON explicitly in the query:

SELECT *
FROM
  ML.GENERATE_TEXT(
    MODEL `bq_llm.llm_model`,
    (
  SELECT
        CONCAT('Can you read the code in the following text and generate a summary for what the code is doing and what language it is written in:', content)
        AS prompt from `bigquery-public-data.github_repos.sample_contents`
          limit 5
        ),
    STRUCT(
      0.2 AS temperature,
      100 AS max_output_tokens,
      TRUE AS flatten_json_output));

**Explanation:

Flatten_json_output** represents the boolean, which if set true returns a flat understandable text extracted from the JSON response.

The query response looks like this:

3b662ef7d3ba9263.png

11. Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this post, follow these steps:

  1. In the Google Cloud console, go to the Manage resources page
  2. In the project list, select the project that you want to delete, and then click Delete
  3. In the dialog, type the project ID, and then click Shut down to delete the project

12. Congratulations

Congratulations! You have successfully used a Vertex AI Text Generation LLM programmatically to perform text analytics on your data only using SQL-queries. Check out Vertex AI LLM product documentation to learn more about available models.