Analyzing Movie Posters in BigQuery with Remote Models

Analyzing Movie Posters in BigQuery with Remote Models

About this codelab

subjectLast updated Mar 10, 2025
account_circleWritten by Alicia Williams

1. Introduction

In this lab, you will learn how to use BigQuery Machine Learning for inference with remote models ( Gemini models) to analyze movie poster images and generate summaries of the movies based on the posters directly in your BigQuery data warehouse.

3e75cd7cdc8ffcf6.jpeg

Pictured above: A sample of the movie poster images you will analyze.

BigQuery is a fully managed, AI-ready data analytics platform that helps you maximize value from your data and is designed to be multi-engine, multi-format, and multi-cloud. One of its key features is BigQuery Machine Learning for inference, which lets you create and run machine learning (ML) models by using GoogleSQL queries.

Gemini is a family of generative AI models developed by Google that is designed for multimodal use cases.

Running ML models using GoogleSQL Queries

Usually, performing ML or artificial intelligence (AI) on large datasets requires extensive programming and knowledge of ML frameworks. This limits solution development to a small group of specialists within each company. With BigQuery Machine Learning for inference, SQL practitioners can use existing SQL tools and skills to build models, and generate results from LLMs and Cloud AI APIs.

Prerequisites

What you'll learn

  • How to configure your environment and account to use APIs
  • How to create a Cloud Resource connection in BigQuery
  • How to create a dataset and object table in BigQuery for movie poster images
  • How to create the Gemini remote models in BigQuery
  • How to prompt the Gemini model to provide movie summaries for each poster
  • How to generate text embeddings for the movie represented in each poster
  • How to use BigQuery VECTOR_SEARCH to match movie poster images with closely related movies in the dataset

What you'll need

  • A Google Cloud Account and Google Cloud Project, with billing enabled
  • A web browser such as Chrome

2. Setup and Requirements

Self-paced environment setup

  1. Sign-in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.

fbef9caa1602edd0.png

a99b7ace416376c4.png

5e3ff691252acf41.png

  • The Project name is the display name for this project's participants. It is a character string not used by Google APIs. You can always update it.
  • The Project ID is unique across all Google Cloud projects and is immutable (cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference your Project ID (typically identified as PROJECT_ID). If you don't like the generated ID, you might generate another random one. Alternatively, you can try your own, and see if it's available. It can't be changed after this step and remains for the duration of the project.
  • For your information, there is a third value, a Project Number, which some APIs use. Learn more about all three of these values in the documentation.
  1. Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab won't cost much, if anything at all. To shut down resources to avoid incurring billing beyond this tutorial, you can delete the resources you created or delete the project. New Google Cloud users are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Google Cloud Shell, a command line environment running in the Cloud.

From the Google Cloud Console, click the Cloud Shell icon on the top right toolbar:

55efc1aaa7a4d3ad.png

It should only take a few moments to provision and connect to the environment. When it is finished, you should see something like this:

7ffe5cbb04455448.png

This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory, and runs on Google Cloud, greatly enhancing network performance and authentication. All of your work in this codelab can be done within a browser. You do not need to install anything.

3. Before you begin

There are a few setup steps for working with Gemini models in BigQuery, including enabling APIs, creating a Cloud resource connection, and granting the service account for the Cloud resource connection certain permissions. These steps are one-time per project, and will be covered in the next few sections.

Enable APIs

Inside Cloud Shell, make sure that your project ID is setup:

gcloud config set project [YOUR-PROJECT-ID]

Set environment variable PROJECT_ID:

PROJECT_ID=$(gcloud config get-value project)

Configure your default region to use for the Vertex AI models. Read more about available locations for Vertex AI. In the example we are using the us-central1 region.

gcloud config set compute/region us-central1

Set environment variable REGION:

REGION=$(gcloud config get-value compute/region)

Enable all necessary services:

gcloud services enable bigqueryconnection.googleapis.com \
                       aiplatform
.googleapis.com

Expected output after running all above commands:

student@cloudshell:~ (test-project-001-402417)$ gcloud config set project test-project-001-402417
Updated property [core/project].
student@cloudshell:~ (test-project-001-402417)$ PROJECT_ID=$(gcloud config get-value project)
Your active configuration is: [cloudshell-14650]
student@cloudshell:~ (test-project-001-402417)$ 
student@cloudshell:~ (test-project-001-402417)$ gcloud services enable bigqueryconnection.googleapis.com \
                       aiplatform.googleapis.com
Operation "operations/acat.p2-4470404856-1f44ebd8-894e-4356-bea7-b84165a57442" finished successfully.

4. Create a Cloud Resource connection

In this task, you will create a Cloud Resource connection, which enables BigQuery to access image files in Cloud Storage and make calls to Vertex AI.

  1. In the Google Cloud Console, on the Navigation menu ( Navigation menu icon), click BigQuery.

68ddff3628a55957.png

  1. To create a connection, click + ADD, and then click Connections to external data sources.

8b70fdd209f540b6.png

  1. In the Connection type list, select Vertex AI remote models, remote functions and BigLake (Cloud Resource).
  2. In the Connection ID field, enter gemini_conn for your connection.
  3. For Location type, select Multi-region and then, from dropdown select US multi-region.
  4. Use the defaults for the other settings.

deb25a1dd65ee9dd.png

  1. Click Create connection.
  2. Click GO TO CONNECTION.
  3. In the Connection info pane, copy the service account ID to a text file for use in the next task. You will also see that the connection is added under the External Connections section of your project in the BigQuery Explorer.

5. Grant IAM permissions to the connection's service account

In this task, you grant the Cloud Resource connection's service account IAM permissions, through a role, to enable it access the Vertex AI services.

  1. In the Google Cloud console, on the Navigation menu, click IAM & Admin.
  2. Click Grant Access.
  3. In the New principals field, enter the service account ID that you copied earlier.
  4. In the Select a role field, enter Vertex AI, and then select Vertex AI User role.

55ed0ac6cd92aa1e.png

  1. Click Save. The result is the service account ID now includes the Vertex AI User role.

6. Create the dataset and object table in BigQuery for movie poster images

In this task, you will create a dataset for the project and an object table within it to store the poster images.

The dataset of movie poster images used in this tutorial are stored in a public Google Cloud Storage bucket: gs://cloud-samples-data/vertex-ai/dataset-management/datasets/classic-movie-posters

Create a dataset

You will create a dataset to store database objects, including tables and models, used in this tutorial.

  1. In the Google Cloud console, select the Navigation menu ( Navigation menu icon), and then select BigQuery.
  2. In the Explorer panel, next to your project name, select View actions ( More menu icon), and then select Create dataset.
  3. In the Create dataset pane, enter the following information:
  • Dataset ID: gemini_demo
  • Location type: select Multi-region
  • Multi-region: select US
  1. Leave the other fields at their defaults.

9a787006017d5b47.png

  1. Click Create Dataset.

The result is the gemini_demo dataset is created and listed underneath your project in the BigQuery Explorer.

Create the object table

BigQuery holds not only structured data, but it can also access unstructured data (like the poster images) through object tables.

You create an object table by pointing to a Cloud Storage bucket, and the resulting object table has a row for each object from the bucket with its storage path and metadata.

To create the object table you will use a SQL query.

  1. Click the + to Create new SQL query.
  2. In the query editor, paste the query below.
CREATE OR REPLACE EXTERNAL TABLE
  `gemini_demo.movie_posters`
WITH CONNECTION `us.gemini_conn`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://cloud-samples-data/vertex-ai/dataset-management/datasets/classic-movie-posters/*']
  );
  1. Run the query. The result is a movie_posters object table added to the gemini_demo dataset and loaded with the URI (the Cloud Storage location) of each movie poster image.
  2. In the Explorer, click on the movie_posters and review the schema and details. Feel free to query the table to review specific records.

7. Create the Gemini remote model in BigQuery

Now that the object table is created, you can begin to work with it. In this task, you will create a remote model for Gemini 1.5 Flash to make it available in BigQuery.

Create the Gemini 1.5 Flash remote model

  1. Click the + to Create new SQL query.
  2. In the query editor, paste the query below and run it.
CREATE OR REPLACE MODEL `gemini_demo.gemini_1_5_flash`
REMOTE WITH CONNECTION
`us.gemini_conn`
OPTIONS
(endpoint = 'gemini-1.5-flash')

The result is the gemini_1_5_flash model is created and you see it added to the gemini_demo dataset, in the models section.

  1. In the Explorer, click on the gemini_1_5_flash model and review the details.

8. Prompt Gemini model to provide movie summaries for each poster

In this task, you will use the Gemini remote model you just created to analyze the movie poster images and generate summaries for each movie.

You can send requests to the model by using the ML.GENERATE_TEXT function, referencing the model in the parameters.

Analyze the images with Gemini 1.5 Flash model

  1. Create and run a new query with the following SQL statement:
CREATE OR REPLACE TABLE
`gemini_demo.movie_posters_results` AS (
SELECT
    uri,
    ml_generate_text_llm_result
FROM
    ML.GENERATE_TEXT( MODEL `gemini_demo.gemini_1_5_flash`,
    TABLE `gemini_demo.movie_posters`,
    STRUCT( 0.2 AS temperature,
        'For the movie represented by this poster, what is the movie title and year of release? Answer in JSON format with two keys: title, year. title should be string, year should be integer. Do not use JSON decorators.' AS PROMPT,
        TRUE AS FLATTEN_JSON_OUTPUT)));

When the query runs, BigQuery prompts the Gemini model for each row of the object table, combining the image with the specified static prompt. The result is the movie_posters_results table is created.

  1. Now let's view the results. Create and run a new query with the following SQL statement:
SELECT * FROM `gemini_demo.movie_posters_results`

The result is rows for each movie poster with the URI (the Cloud Storage location of the movie poster image) and a JSON result including the movie title and the year the movie was released from the Gemini 1.5 Flash model.

You can retrieve these results in a more human readable way by using the next query. This query uses SQL to pull out the movie title and release year from these responses into new columns.

  1. Create and run a new query with the following SQL statement:
CREATE OR REPLACE TABLE
 
`gemini_demo.movie_posters_results_formatted` AS (
  SELECT
    uri
,
    JSON_VALUE
(ml_generate_text_llm_result, "$.title") AS title,
    JSON_VALUE
(ml_generate_text_llm_result, "$.year") AS year
  FROM
   
`gemini_demo.movie_posters_results` results )

The result is the movie_posters_result_formatted table is created.

  1. You can query the table with the query below, to see the rows created.
SELECT * FROM `gemini_demo.movie_posters_results_formatted`

Notice how the URI column results remain the same, but the JSON is now converted to the title and year columns for each row.

Prompt Gemini 1.5 Flash model to provide movie summaries

What if you wanted a bit more information about each of these movies, say a text summary of each of the movies? This content generation use case is perfect for an LLM model such as the Gemini 1.5 Flash model.

  1. You can use Gemini 1.5 Flash to provide movie summaries for each poster by running the query below:
SELECT
  uri,
  title,
  year,
  prompt,
  ml_generate_text_llm_result
  FROM
 ML.GENERATE_TEXT( MODEL `gemini_demo.gemini_1_5_flash`,
   (
   SELECT
     CONCAT('Provide a short summary of movie titled ',title, ' from the year ',year,'.') AS prompt,
     uri,
     title,
     year
   FROM
     `gemini_demo.movie_posters_results_formatted`
   LIMIT
     20 ),
   STRUCT(0.2 AS temperature,
     TRUE AS FLATTEN_JSON_OUTPUT));    

Notice the ml_generate_text_llm_result field of the results; this includes a short summary of the movie.

9. Generate text embeddings using a remote model

Now you can join the structured data you have built with other structured data in your warehouse. The IMDB public dataset available in BigQuery contains a rich amount of information about movies, including ratings by viewers and some sample freeform user reviews as well. This data can help you deepen your analysis of the movie posters and understand how these movies were perceived.

In order to join data, you will need a key. In this case, the movie titles generated by the Gemini model may not match perfectly to the titles in the IMDB dataset.

In this task, you will generate text embeddings of the movie titles and years from both datasets, and then use the distance between these embeddings to join the closest IMDB title with the movie poster titles from your newly created dataset.

Create the remote model

To generate the text embeddings, you will need to create a new remote model pointing to the text-multilingual-embedding-002 endpoint.

  1. Create and run a new query with the following SQL statement:
CREATE OR REPLACE MODEL `gemini_demo.text_embedding`
REMOTE WITH CONNECTION
`us.gemini_conn`
OPTIONS
(endpoint = 'text-multilingual-embedding-002')

The result is the text_embedding model is created and appears in the explorer underneath the gemini_demo dataset.

Generate text embeddings for the title and year associated with the posters

You will now use this remote model with the ML.GENERATE_EMBEDDING function to create an embedding for each movie poster title and year.

  1. Create and run a new query with the following SQL statement:
CREATE OR REPLACE TABLE
 
`gemini_demo.movie_posters_results_embeddings` AS (
  SELECT
   
*
  FROM
    ML
.GENERATE_EMBEDDING(
    MODEL
`gemini_demo.text_embedding`,
   
(
    SELECT
      CONCAT
('The movie titled ', title, ' from the year ', year,'.') AS content,
      title
,
      year
,
      uri
    FROM
     
`gemini_demo.movie_posters_results_formatted` ),
    STRUCT
(TRUE AS flatten_json_output)));

The result is the movie_poster_results_embeddings table is created containing the embeddings for the text content concatenated for each row of the gemini_demo.movie_posters_results_formatted table.

  1. You can view the results of the query using the new query below:
SELECT * FROM `gemini_demo.movie_posters_results_embeddings`

Here you see the embeddings (vectors represented by numbers) for each movie generated by the model.

Generate text embeddings for a subset of the IMDB dataset

You will create a new view of data from a public IMDB dataset which contains only the movies that were released prior to 1935 (the known time period of the movies from the poster images).

  1. Create and run a new query with the following SQL statement:
CREATE OR REPLACE VIEW
  `gemini_demo.imdb_movies` AS (
  WITH
    reviews AS (
      SELECT
        reviews.movie_id AS movie_id,
        title.primary_title AS title,
        title.start_year AS year,
        reviews.review AS review
      FROM
        `bigquery-public-data.imdb.reviews` reviews
      LEFT JOIN
        `bigquery-public-data.imdb.title_basics` title
      ON
        reviews.movie_id = title.tconst)
  SELECT
    DISTINCT(movie_id),
    title,
    year
  FROM
    reviews
  WHERE
    year < 1935)

The result is a new view containing a list of distinct movie IDs, titles, and year of release from the bigquery-public-data.imdb.reviews table for all movies in the dataset released before 1935.

  1. You will now create embeddings for the subset of movies from IMDB using a similar process to the previous section. Create and run a new query with the following SQL statement:
CREATE OR REPLACE TABLE
 
`gemini_demo.imdb_movies_embeddings` AS (
  SELECT
   
*
  FROM
    ML
.GENERATE_EMBEDDING( MODEL `gemini_demo.text_embedding`,
     
(
      SELECT
        CONCAT
('The movie titled ', title, ' from the year ', year,'.') AS content,
        title
,
        year
,
        movie_id
      FROM
       
`gemini_demo.imdb_movies` ),
      STRUCT
(TRUE AS flatten_json_output) )
  WHERE
    ml_generate_embedding_status
= '' );

The result of the query is a table that contains the embeddings for the text content of the gemini_demo.imdb_movies table.

Now, you can join the two tables using the VECTOR_SEARCH function.

  1. Create and run a new query with the following SQL statement:
SELECT
  query.uri AS poster_uri,
  query.title AS poster_title,
  query.year AS poster_year,
  base.title AS imdb_title,
  base.year AS imdb_year,
  base.movie_id AS imdb_movie_id,
  distance
FROM
 VECTOR_SEARCH( TABLE `gemini_demo.imdb_movies_embeddings`,
   'ml_generate_embedding_result',
   TABLE `gemini_demo.movie_posters_results_embeddings`,
   'ml_generate_embedding_result',
   top_k => 1,
   distance_type => 'COSINE');

The query uses the VECTOR_SEARCH function to find the nearest neighbor in the gemini_demo.imdb_movies_embeddings table for each row in the gemini_demo.movie_posters_results_embeddings table. The nearest neighbor is found using the cosine distance metric, which determines how similar two embeddings are.

This query can be used to find the most similar movie in the IMDB dataset for each of the movies identified by Gemini 1.5 Flash in the movie posters. For example, you could use this query to find the closest match for the movie "Au Secours!" in the IMDB public dataset, which references this movie by its English-language title, "Help!".

  1. Create and run a new query to join some additional information on movie ratings provided in the IMDB public dataset:
SELECT
  query.uri AS poster_uri,
  query.title AS poster_title,
  query.year AS poster_year,
  base.title AS imdb_title,
  base.year AS imdb_year,
  base.movie_id AS imdb_movie_id,
  distance,
  imdb.average_rating,
  imdb.num_votes
FROM
  VECTOR_SEARCH( TABLE `gemini_demo.imdb_movies_embeddings`,
    'ml_generate_embedding_result',
    TABLE `gemini_demo.movie_posters_results_embeddings`,
    'ml_generate_embedding_result',
    top_k => 1,
    distance_type => 'COSINE') DATA
LEFT JOIN
  `bigquery-public-data.imdb.title_ratings` imdb
ON
  base.movie_id = imdb.tconst
ORDER BY
  imdb.average_rating DESC

This query is similar to the previous query. It still uses special numerical representations called vector embeddings to find similar movies to a given movie poster. However, it also joins the average rating and number of votes for each nearest neighbor movie from a separate table from the IMDB public dataset.

10. Congratulations

Congratulations for completing the codelab. You successfully created an object table for your poster images in BigQuery, created a remote Gemini model, used the model to prompt the Gemini model to analyze images and provided movie summaries, generated text embeddings for movie titles, and used those embeddings to match movie poster images to the related movie title in the IMDB dataset.

What we've covered

  • How to configure your environment and account to use APIs
  • How to create a Cloud Resource connection in BigQuery
  • How to create a dataset and object table in BigQuery for movie poster images
  • How to create the Gemini remote models in BigQuery
  • How to prompt the Gemini model to provide movie summaries for each poster
  • How to generate text embeddings for the movie represented in each poster
  • How to use BigQuery VECTOR_SEARCH to match movie poster images with closely related movies in the dataset

Next steps / learn more

11. Survey

How will you use this tutorial?