In-Place LLM Insights: BigQuery & Gemini for Structured & Unstructured Data Analytics

1. Introduction

Massive datasets hold valuable insights, but extracting meaningful understanding can be a daunting task. Traditional analytics tools often fall short of providing the nuanced, human-like explanations that help us truly grasp the story within our data. Data warehouses hold more and more unstructured data, but that data hasn't always been accessible for gaining insights. Unstructured text, images, and videos hold valuable information for evolving your business. LLMs offer a powerful new way to understand your data, providing explanations, sentiment analysis, object detection, and more. With the emergence of multimodal models (and the ability to store object data such as images and video in BigQuery via object tables), integrating powerful LLMs like Gemini 1.0 Pro (for text-based analysis) and Gemini 1.0 Pro Vision (for text and image/video analysis) with BigQuery can transform the way you analyze data.

In this codelab, we'll look at the following two methods of drawing LLM insights from your enterprise data, and scaling those insights to thousands of rows within BigQuery:

Method 1: Remote Model Invocation

  • Call Gemini 1.0 Pro from within BigQuery using ML.GENERATE_TEXT in the SELECT query.
  • Use this approach when the model is already available as a remote model in BigQuery and you would like to use it out-of-the-box. Check the status of the model you want to use in the BigQuery documentation.

Method 2: Remote Function Implementation

  • Implement the model call (in this codelab, Gemini 1.0 Pro Vision) in a Cloud Function, deploy it, and invoke it as a remote function from within BigQuery.
  • Use this approach when the model isn't available out-of-the-box as a remote model, or when you want advanced customizations for your use case.

What you'll build

  • For the Remote Model Invocation method, you'll build a location summarizer for the Internet Archive Books dataset (publicly available in BigQuery) by invoking the remote model for Gemini 1.0 Pro through the ML.GENERATE_TEXT construct in BigQuery.

Flow diagram for remote model invocation

  • For the Remote Function Implementation method, you'll build an image validator that validates test images against a baseline image. For this, you'll create a dataset containing test image screenshots in an external table and ask Gemini 1.0 Pro Vision to validate them against the baseline image. Then, you'll create a Java Cloud Function that implements the Gemini 1.0 Pro Vision call and invoke it as a remote function in BigQuery.

Flow diagram for remote function implementation

Note:

There might be more Vertex AI Generative AI models available as remote models in BigQuery. This means that you can access these models with the ML.GENERATE_TEXT construct from BigQuery out-of-the-box. However, even in the case of remote model availability, you always have the option to build advanced customizations for your use case with Cloud Functions and then access these models as remote functions from BigQuery.

Why BigQuery?

BigQuery is a serverless, multi-cloud data warehouse that can scale from bytes to petabytes with minimal operational overhead and is hence a great choice for storing ML training data. Using the built-in BigQuery Machine Learning (BQML) and analytics capabilities, you can create no-code predictions using just SQL queries. Additionally, you can access data from external sources with federated queries, eliminating the need for complicated ETL pipelines.
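To make the federated-query point concrete, here is a minimal sketch of querying an external database in place. The connection ID `us.my_cloudsql_conn` and the `orders` table are hypothetical, not part of this codelab:

```sql
-- Hypothetical example: query an external (e.g., Cloud SQL) database in
-- place via a federated connection, with no ETL pipeline required.
SELECT *
FROM EXTERNAL_QUERY(
  'us.my_cloudsql_conn',                 -- hypothetical connection ID
  'SELECT order_id, amount FROM orders'  -- runs in the external database
);
```

The inner query executes in the external database, and the results become a regular BigQuery result set you can join with native tables.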

BigQuery is a fully-managed cloud data warehouse that helps users analyze structured, semi-structured, and unstructured data.

2. Requirements

  • A browser, such as Chrome or Firefox
  • A Google Cloud project with billing enabled.

3. Before you begin

  1. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
  2. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
  3. You'll use Cloud Shell, a command-line environment running in Google Cloud that comes preloaded with bq. Click Activate Cloud Shell at the top of the Google Cloud console.

Activate Cloud Shell button image

  4. Once connected to Cloud Shell, check that you're already authenticated using the following command:
gcloud auth list
  5. Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
gcloud config list project
  6. If your project is not set, use the following command to set it:
gcloud config set project <YOUR_PROJECT_ID>
  7. Navigate to the Gemini for Google Cloud Marketplace to enable the API. You can also run the following command in the Cloud Shell terminal (replace PROJECT_ID with your project ID):
gcloud services enable cloudaicompanion.googleapis.com --project PROJECT_ID
  8. Make sure that the following APIs are enabled:
  • BigQuery
  • BigQuery Connection
  • Cloud Functions
  • Cloud Run
  • Vertex AI
  • Cloud Build
  • Cloud Storage

Alternatively, you can enable these APIs through the Google Cloud console instead of using the gcloud command.

Refer to the documentation for gcloud commands and usage.

4. Create a BigQuery dataset and external connection

Let's begin by creating a dataset and a Cloud resource connection.

A dataset in BigQuery is a container for all the tables and objects for your application.

To create a dataset, do the following:

  1. Go to the BigQuery page in the Google Cloud console.
  2. In the Explorer panel, select the project where you want to create the dataset.
  3. Expand the Actions option (the vertical ellipsis icon), and click Create dataset.

Image of the Actions menu and the Create dataset option

  4. Enter gemini_bq_fn in the Dataset ID field.
  5. Set the Location type to Multi-region and accept the default value, US (multiple regions in United States).
  6. Click Create dataset.
  7. Check that the dataset is created and listed under your project ID in the Explorer pane.
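If you prefer SQL over the console, the same dataset can be created with a single DDL statement from the BigQuery query editor. This is a sketch equivalent to the console steps above:

```sql
-- Create the dataset in the US multi-region, matching the console steps.
CREATE SCHEMA IF NOT EXISTS gemini_bq_fn
OPTIONS (location = 'US');
```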

A BigQuery connection is required to interact with your Cloud Function. To create a remote function, you must create a BigQuery connection. In this codelab, we'll use a BigLake connection to access the model from BigQuery via the Cloud Function. BigLake connections let you connect to an external data source (in our case, the Vertex AI Gemini Pro API) while retaining fine-grained BigQuery access control and security.

To create the BigLake connection, do the following:

  1. Click Add on the Explorer pane of the BigQuery page.

BigQuery Console with ADD button highlighted to add external connection

  2. Click Connections to external data sources.
  3. From the Connection type menu, select Vertex AI remote models, remote functions and BigLake (Cloud Resource).
  4. In the Connection ID field, enter gemini-bq-conn as the connection name.
  5. Set the Location type to Multi-region and accept the default value, US (multiple regions in United States).
  6. Click Create connection.
  7. Click Go to connection, and then copy the service account ID from the Connection info pane.

Connection info screenshot

  8. Go to the IAM & Admin page and click Grant access.
  9. Paste the service account ID in the New principals field.
  10. Select the Vertex AI User role from the role list, and then click Save.

Grant access to Service Account screenshot

You have now successfully created the dataset and BigQuery connection. Next, you'll complete the steps that are required for remote model invocation.

5. Use case #1 Remote Model Invocation: Set up remote model invocation

Now that you have created your dataset and connection, let's create a model in BigQuery based on the Vertex AI Gemini Pro foundation model. At the end of this exercise, you'll have your LLM application up and running using only SQL queries.

Step 1: Create a BigQuery table that contains the input data for the remote model

Create a table named books in your dataset that holds about 50 records from the table bigquery-public-data.gdelt_internetarchivebooks.1905 in the Internet Archive Books dataset, which is publicly available through BigQuery.

To do this, execute the following DDL (Data Definition Language) statement from the BigQuery SQL editor pane:

CREATE OR REPLACE TABLE gemini_bq_fn.books AS (
  SELECT *
  FROM `bigquery-public-data.gdelt_internetarchivebooks.1905`
  LIMIT 50
);
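Before wiring in the model, it can help to peek at the text the prompt will receive. The `locations` column holds the free-form location strings that the summarizer will standardize; this optional check is a sketch, not a required step:

```sql
-- Preview the raw location text that will be fed into the prompt.
SELECT SUBSTRING(locations, 0, 200) AS location_snippet
FROM gemini_bq_fn.books
LIMIT 5;
```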

Step 2: Create a BigQuery model

Create a model in your dataset. To do this, run the following DDL from the BigQuery SQL Editor pane:

CREATE MODEL `gemini_bq_fn.gemini_remote_model`
REMOTE WITH CONNECTION `us.gemini-bq-conn`
OPTIONS(ENDPOINT = 'gemini-pro');

Check that the model is created and listed in your dataset; you can open it to view its details.
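As a quick smoke test of the connection (a minimal sketch, independent of the books table), you can call the new remote model on a single inline prompt before scaling to many rows:

```sql
-- One-row sanity check of the remote model.
SELECT ml_generate_text_llm_result
FROM ML.GENERATE_TEXT(
  MODEL `gemini_bq_fn.gemini_remote_model`,
  (SELECT 'In one sentence, what is BigQuery?' AS prompt),
  STRUCT(TRUE AS flatten_json_output));
```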

Step 3: Test your new Generative AI application

Use the ML.GENERATE_TEXT function in a SELECT query to send a request to the remote model.

SELECT ml_generate_text_llm_result AS Gemini_Response, prompt AS Prompt
FROM ML.GENERATE_TEXT(
  MODEL `gemini_bq_fn.gemini_remote_model`,
  (SELECT 'You are a text summarizer and standardizer. From the following text that contains address locations, standardize and print one standardized, consolidated address. You cannot return empty because you know how to pick up sensible data from the text in this field: ' ||
          SUBSTRING(locations, 0, 200) AS prompt
   FROM `gemini_bq_fn.books`),
  STRUCT(TRUE AS flatten_json_output));

You should see the following result:

ML.GENERATE_TEXT query result in BigQuery

Congratulations! You have successfully demonstrated how to use a remote model (based on a gemini-pro model) with the ML.GENERATE_TEXT function to analyze text data in a BigQuery table.
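ML.GENERATE_TEXT also accepts generation parameters in the STRUCT argument. As a sketch (the values here are illustrative, not tuned), you can constrain response length and randomness like this:

```sql
-- Same pattern with illustrative generation parameters.
SELECT ml_generate_text_llm_result AS Gemini_Response
FROM ML.GENERATE_TEXT(
  MODEL `gemini_bq_fn.gemini_remote_model`,
  (SELECT 'Summarize: ' || SUBSTRING(locations, 0, 200) AS prompt
   FROM `gemini_bq_fn.books`
   LIMIT 5),
  STRUCT(
    0.2 AS temperature,        -- lower = more deterministic output
    128 AS max_output_tokens,  -- cap the response length
    TRUE AS flatten_json_output));
```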

Now let's try another Vertex AI model using BigQuery remote functions. Let's say you want to add more customization and flexibility to how you can use the model remotely in BigQuery. Supported models are listed in the BigQuery documentation.

6. Use case #2 Remote Function Implementation: Set up remote function implementation

In this exercise, you'll create a function in BigQuery based on a Java Cloud Function that implements the Gemini 1.0 Pro Vision foundation model. First, you'll create and deploy the Java Cloud Function that compares images using the Gemini 1.0 Pro Vision model, and then you'll create the remote function in BigQuery that invokes the deployed Cloud Function. Remember that the same procedure can be followed for any remote function execution in BigQuery.

So, let's get started!

Step 1: Create the Java Cloud Function

You'll build a Gen 2 Cloud Function in Java that validates test images against a baseline image. The test image screenshots are stored in a Cloud Storage bucket and referenced from an external table in BigQuery. You'll validate the images using the Gemini Pro Vision model (Java SDK), and then deploy the Java Cloud Function to a REST endpoint.

Follow these steps:

  1. Launch the Cloud Shell Editor.
  2. If you're using the Legacy Editor, click Try the new Editor.
  3. In the status bar, click Cloud Code - Sign In.
  4. If prompted to authorize Cloud Shell Editor to make Google Cloud API calls, click Authorize. Follow the prompts to sign into your Google Account. You're now connected to Google Cloud.
  5. If you don't see your project in the Cloud Code status bar, click Cloud Code - Sign In > Select a Google Cloud project. Choose the Google Cloud project where you want to create your Cloud Functions. The project loads and becomes the active project in Cloud Shell Editor.
  6. In the Cloud Code status bar, click the active project name. In the Quick Pick menu that appears, select New Application.
  7. In the Create New Application dialog, select Cloud Functions application.

Create New Application pop up page 1

  8. Select Java: Hello World when prompted to select a template.

Create New Application pop up page 2

  9. Enter Gemini-BQ-Function as your project name in the project path. The project structure appears in a new Cloud Shell Editor view.

New Java Cloud Function application project structure

  10. Add the necessary dependencies within the <dependencies>...</dependencies> tag of pom.xml, or copy the dependencies from the sample pom.xml file.
  11. Rename the class from HelloWorld.java to something more meaningful, say GeminiBigQueryFunction.java, and rename the package accordingly.
  12. Replace the placeholder code in the file GeminiBigQueryFunction.java with the sample code provided in the GitHub repository. Make sure to replace the variable <YOUR_PROJECT_ID> with your project ID in the sample code.
  13. Go to the Cloud Shell terminal and check that you're in your project directory. If not, navigate to the project directory by executing the following command:
cd Gemini-BQ-Function
  14. Execute the following command to build and deploy the Cloud Function:
gcloud functions deploy gemini-bq-fn --gen2 --runtime java17 --trigger-http --entry-point cloudcode.helloworld.GeminiBigQueryFunction --allow-unauthenticated --source .

Replace the entry point in the command with the entry point of your function. The output is a REST URL in the following format:

https://us-central1-YOUR_PROJECT_ID.cloudfunctions.net/gemini-bq-fn
  15. Test the Cloud Function by running the following command from the terminal:
gcloud functions call gemini-bq-fn --region=us-central1 --gen2 --data '{"calls":[["https://storage.googleapis.com/img_public_test/image_validator/baseline/1.JPG", "https://storage.googleapis.com/img_public_test/image_validator/test/2.JPG", "PROMPT_ABOUT_THE_IMAGES_TO_GEMINI"]]}'

Response for the random sample prompt:

JSON Response string from the Cloud Function

Awesome! The generic Cloud Function for Gemini Pro Vision model implementation is ready. Let's use this endpoint directly on BigQuery data from within a BigQuery remote function.

Step 2: Create a BigQuery object table and remote function

Prerequisite

You'll need a Cloud Storage bucket for this part of the exercise. Follow these steps to create a Cloud Storage bucket:

  1. To create a bucket, go to the Cloud Storage console and click Create.
  2. Enter demo-bq-gemini-public-image as your bucket name and click Create.
  3. Uncheck the Enforce public access prevention on this bucket option. Although you're creating this bucket for public access in this exercise, we recommend that you prevent public access and provide permissions to specific service accounts as required in your enterprise use cases.
  4. After the bucket is created, go to the Permissions tab to view and change permissions.
  5. To add principals, click Grant access. Specify a service account ID to give permissions to a specific account or specify allUsers to give public access to your bucket.
  6. Select the Storage Object Viewer role and click Save. If prompted, choose to allow public access.
  7. Go to the Objects tab and upload the baseline image and test images to compare. For this codelab, you can upload the following objects that are publicly available: Baseline.JPG, TEST1.JPG, and TEST2.JPG.

Create a BigQuery object table

Create an external object table from BigQuery to access the unstructured data in the bucket using the connection and the dataset that you created earlier. To do this, execute the following DDL (Data Definition Language) statement from the BigQuery query editor pane:

CREATE OR REPLACE EXTERNAL TABLE `gemini_bq_fn.image_validation`
WITH CONNECTION `us.gemini-bq-conn`
OPTIONS(object_metadata="SIMPLE", uris=["gs://demo-bq-gemini-public-image/*.JPG"]);

Check that a new object table called image_validation is created in your dataset.
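Object tables expose the underlying Cloud Storage objects' metadata as columns. A quick query (assuming the bucket objects uploaded above) confirms that the table sees your images:

```sql
-- List the Cloud Storage objects visible through the object table.
SELECT uri, content_type, size, updated
FROM gemini_bq_fn.image_validation;
```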

Create a BigQuery Remote Function

Create a remote function in BigQuery to invoke the Java Cloud Function that implements the Gemini Pro Vision model. Create this in the same dataset. To do this, run the following DDL from the SQL editor pane of the BigQuery console:

CREATE OR REPLACE FUNCTION `gemini_bq_fn.FN_IMAGE_VALIDATE` (baseline STRING, test STRING, prompt STRING) RETURNS STRING
 REMOTE WITH CONNECTION `us.gemini-bq-conn`
 OPTIONS (
   endpoint = 'https://us-central1-<your project id>.cloudfunctions.net/gemini-bq-fn',
   max_batching_rows = 1
 );

The endpoint URL is the REST URL that was returned earlier in this exercise. This DDL creates the remote function in BigQuery. The function takes three parameters: the first two are URLs of the images stored in the object table created in the previous step, and the last is the prompt to the model (Gemini Pro Vision). The Java Cloud Function parses this signature as follows:

JsonObject requestJson = new Gson().fromJson(request.getReader(), JsonObject.class);
JsonArray calls_array = requestJson.getAsJsonArray("calls");
JsonArray calls = (JsonArray) calls_array.get(0);
String baseline_url = calls.get(0).toString().replace("\"", "");
String test_url = calls.get(1).toString().replace("\"", "");
String prompt_string = calls.get(2).toString();
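For reference, this exchange follows BigQuery's remote function contract: BigQuery sends batched rows in a calls array (here one row per request, because of max_batching_rows = 1), and the function must return a replies array of the same length. A sketch of one round trip, with placeholder values:

```
Request from BigQuery to the Cloud Function (one batched row):
{"calls": [[<baseline_url>, <test_url>, <prompt>]]}

Response the Cloud Function must return:
{"replies": ["<model response for that row>"]}
```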

Invoke Gemini from BigQuery

Use the remote function in a SELECT query to test the function for validating the images against the prompt.

SELECT gemini_bq_fn.FN_IMAGE_VALIDATE(
  'https://storage.googleapis.com/demo-bq-gemini-public-image/Baseline.JPG',
  REPLACE(uri, 'gs://', 'https://storage.googleapis.com/'),
  'You are an expert image validator who can respond with JSON result. Find 2 images here. The first one is the reference image. The second image is the test image. Compare the second image to see if the second image is similar to the first in terms of the text present in the image. Respond ONLY in JSON Format with 3 attributes namely SIMILARITY which is a \"YES\" or \"NO\", SIMILARITY_SCORE which is a percentage and DIFFERENCE_COMMENT which is a string.') AS IMAGE_VALIDATION_RESULT
FROM `gemini_bq_fn.image_validation`
WHERE uri LIKE '%TEST1%';

Try the query above with the TEST1 and TEST2 object files. You'll see a result similar to the following:

BigQuery Remote Function Result screenshot

Here are the images for your reference:

Baseline Image

Baseline Reference Image for the Test Case

Test Image

Test Image 1 for the test case

Notice that the two images are similar in that both show the Duet AI Cloud console view, but the text in them differs, as the model points out in its JSON-formatted response.
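To validate every test image in one pass instead of filtering by a single file name, you can widen the WHERE clause. This is a sketch assuming the object names used above:

```sql
-- Validate all TEST images against the baseline in a single query.
SELECT
  uri,
  gemini_bq_fn.FN_IMAGE_VALIDATE(
    'https://storage.googleapis.com/demo-bq-gemini-public-image/Baseline.JPG',
    REPLACE(uri, 'gs://', 'https://storage.googleapis.com/'),
    'Compare the two images and respond ONLY in JSON with SIMILARITY, SIMILARITY_SCORE and DIFFERENCE_COMMENT attributes.'
  ) AS IMAGE_VALIDATION_RESULT
FROM `gemini_bq_fn.image_validation`
WHERE uri LIKE '%TEST%';
```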

With this, you have successfully tested the remote function implementation.

7. Benefits and Use Cases

Let's look at some of the benefits of integrating BigQuery and Gemini for structured and unstructured data analytics.

  • Bringing GenAI to your data: No more moving data around, duplication, and added complexity. You can analyze data and generate insights within the same BigQuery environment.
  • Enhanced analytics: Gemini's natural language explanations add a new layer of understanding to your data, and you can achieve this with plain SQL queries.
  • Scalability: This solution handles large datasets and complex analysis with ease and low-code effort.

The applications for this are many. Consider scenarios in finance (market trend analysis), retail (customer sentiment), healthcare (medical report summaries), and more, where your analytics and business teams could implement these with relatively low effort and resources, using a familiar language and tool of choice.

8. Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this codelab, follow these steps:

  1. In the Google Cloud console, go to the Manage resources page.
  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
  4. If you want to keep your project, skip the preceding steps and delete only the Cloud Function: go to the Cloud Functions page, select the function that you want to delete from the list, and click Delete.

9. Congratulations

Congratulations! With the Gemini models integrated into BigQuery, you're no longer just analyzing your data; you're a data storyteller. You can surface the hidden narratives within your datasets and transform the way you understand your insights. Start experimenting! Apply this technique to your own datasets and discover the stories within your data. With BigQuery supporting unstructured data in object tables (external tables), try using Gemini Pro Vision to create generative insights on image data. Refer to the documentation for Vertex AI, BigQuery Remote Functions, and Cloud Functions for more in-depth guidance. Here is the GitHub repo for this project. Let us know what you build with this learning!