Build a Patent Search App with Spanner, Vector Search & Gemini 1.0 Pro!

1. Overview

Across different industries, Patent research is a critical tool for understanding the competitive landscape, identifying potential licensing or acquisition opportunities, and avoiding infringing on existing patents.

Patent research is vast and complex. Sifting through countless technical abstracts to find relevant innovations is a daunting task. Traditional keyword-based searches are often inaccurate and time-consuming. Abstracts are lengthy and technical, making it difficult to grasp the core idea quickly. This can lead to researchers missing key patents or wasting time on irrelevant results.

Objective

In this codelab, we will work towards making the process of searching for patents faster, more intuitive, and incredibly precise by leveraging Spanner, and in-place Gemini 1.0 Pro, Embeddings and Vector Search.

What you'll build

As part of this lab, you will:

  1. Create a Spanner instance
  2. Load a Google Patents Public Datasets
  3. Create a remote model for text embeddings with Gemini 1.0 Pro model
  4. Create generative insights from loaded dataset
  5. Generate embeddings from the insights
  6. Issue similarity search queries against the dataset

The following diagram represents the flow of data and steps involved in the implementation.

14cfdde5e24258a.png

Requirements

  • A browser, such as Chrome or Firefox
  • A Google Cloud project with billing enabled.

2. Before you begin

Create a project

  1. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
  2. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project .
  3. You'll use Cloud Shell, a command-line environment running in Google Cloud that comes preloaded with bq. Click Activate Cloud Shell at the top of the Google Cloud console.

Activate Cloud Shell button image

  1. Once connected to Cloud Shell, you check that you're already authenticated and that the project is set to your project ID using the following command:
gcloud auth list
  1. Run the following command in Cloud Shell to confirm that the gcloud command knows about your project.
gcloud config list project
  1. If your project is not set, use the following command to set it:
gcloud config set project <YOUR_PROJECT_ID>
  1. Make sure that the Vertex AI and Spanner APIs are enabled by searching for them in the console. Alternatively, you can also use the following command in the Cloud Shell terminal:
gcloud services enable spanner.googleapis.com --project <<YOUR_PROJECT_ID>>
gcloud services enable aiplatform.googleapis.com --project <<YOUR_PROJECT_ID>>

Another alternative is to use this link.

Refer documentation for gcloud commands and usage.

3. Prepare your Spanner database

Let's create a Spanner instance, database, and table where the patent dataset will be loaded.

Create a Spanner instance

  1. Create a Spanner instance named spanner-vertex.
gcloud spanner instances create spanner-vertex \
--config=regional-us-central1 \
--description=spanner-vertex \
--nodes=1

Create a database

  1. In the Google Cloud console, open the Spanner page.
  2. Select the spanner-vertex instance from the list.
  3. Under Databases, click Create Database.
  4. Set the database name to patents.
  5. Click Create to create the database.

Create a table

  1. In the Google Cloud console, open the Spanner page.
  2. Select the spanner-vertex instance from the list.
  3. Select the patents database.
  4. Under the Tables tab, click Create Table. The Spanner Studio page opens.
  5. Open a new tab by clicking the New SQL editor tab.
  6. Run the following query:
CREATE TABLE patents_data (
   id string(25), type string(25), number string(20), country string(2), date string(20), abstract string(300000), title string(100000),kind string(5), num_claims numeric, filename string(100), withdrawn numeric, 
) PRIMARY KEY (id);

4. Load patent data into the database

The Google Patents Public Datasets on BigQuery will be used as our dataset. We will use Spanner Studio to run our queries. The spanner-gemini-search repository includes the insert_into_patents_data.sql script we will run to load the patent data.

  1. In the Google Cloud console, open the Spanner page.
  2. Select the spanner-vertex instance from the list.
  3. Select the patents database.
  4. In the navigation menu, click Spanner Studio. The Explorer pane displays a list of the objects in your database.
  5. Open a new tab by clicking the New SQL editor tab.
  6. Copy the insert query statement from the insert_into_patents_data.sql script in the editor. You can copy 50-100 insert statements for a quick demo of this use case.
  7. Click Run. The results of your query appear in the Results table.

5. Create Remote Model for Gemini 1.0 Pro

After the patent data is loaded to the database, we will create a remote model that uses Gemini 1.0 Pro Vertex AI model to generate a summarized set of title and keywords.

Run the following DDL command in Spanner Studio Editor:

  1. In the navigation menu, click Spanner Studio. The Explorer pane displays a list of the objects in your database.
  2. Open a new tab by clicking the New SQL editor tab.
  3. Run the following query:
CREATE MODEL gemini_pro_model INPUT(
prompt STRING(MAX),
) OUTPUT(
content STRING(MAX),
) REMOTE OPTIONS (
endpoint = '//aiplatform.googleapis.com/projects/<<YOUR_PROJECT_ID>>/locations/us-central1/publishers/google/models/gemini-pro',
default_batch_size = 1
);
  1. Click Run. The results of your query appear in the Results table.

6. Create Remote Model for Text Embeddings

The result of the earlier step includes a consolidated summary consisting of a title and keywords. We will convert this response into embeddings that will help us generate appropriate matches when we run a query. We will use the Text Embedding Gecko 003 model from Vertex AI remotely from Spanner.

  1. In the navigation menu, click Spanner Studio. The Explorer pane displays a list of the objects in your database.
  2. Open a new tab by clicking the New SQL editor tab.
  3. Run the following query:
CREATE MODEL text_embeddings INPUT(content STRING(MAX))
OUTPUT(
 embeddings
   STRUCT<
     statistics STRUCT<truncated BOOL, token_count FLOAT64>,
     values ARRAY<FLOAT64>>
)
REMOTE OPTIONS (
 endpoint = '//aiplatform.googleapis.com/projects/<<YOUR_PROJECT_ID>>/locations/us-central1/publishers/google/models/textembedding-gecko@003');
  1. Click Run. The results of your query appear in the Results table.

7. Create Generative Insights from Patent Abstracts

We will create a patents_data_gemini table to store generative insights that we will generate using the Gemini 1.5 Pro model we earlier created.

Create the table

  1. In the navigation menu, click Spanner Studio. The Explorer pane displays a list of the objects in your database.
  2. Open a new tab by clicking the New SQL editor tab.
  3. Run the following query:
CREATE TABLE patents_data_gemini (id string(100), gemini_response STRING(MAX)) PRIMARY KEY (id);
  1. Click Run. The results of your query appear in the Results table.

Generate insights

To populate the table with generative insights, it is recommended that you use an application that uses the batch write or mutations method. For this codelab, we will run the following DDL query up to 4 times to populate the table.

INSERT INTO patents_data_gemini (id, gemini_response)
SELECT id, content as gemini_response
FROM ML.PREDICT(MODEL gemini_pro_model,
(select id, concat ('Identify the areas of work or keywords in this abstract', abstract) as prompt from patents_data b where id not in (select id from patents_data_gemini) limit 50
));

Note: If you are getting Quota Exceeded error at this step (possible in the case of the small margin free-credits), try to skip the insert and run only the select query in the workaround section below.

Workaround section:

SELECT id, content as gemini_response
FROM ML.PREDICT(MODEL gemini_pro_model,
(select id, concat ('Identify the areas of work or keywords in this abstract', abstract) as prompt from patents_data b limit 50
))

Observe insights

The table includes insights that are generated for the 'Identify the areas of work or keywords in this abstract', prompt in the query.

Note: If you ran the workaround section query above instead of the INSERT DDL, then skip this part and run the last SELECT query in this page, instead.

Let's run the following query to verify the results of the insights:

select title, abstract, gemini_response from patents_data a inner join patents_data_gemini b
on a.id = b.id;

The following results are seen:

6041fab164aaab93.png

Note: If you ran the workaround section query, replace the table name in the above select query with the query in the workaround section. So you would run the below instead:

select title, abstract, gemini_response from patents_data a inner join (SELECT id, content as gemini_response
FROM ML.PREDICT(MODEL gemini_pro_model,
(select id, concat ('Identify the areas of work or keywords in this abstract', abstract) as prompt from patents_data b limit 50
))) b
on a.id = b.id;

The result should be the same as the one in the result screenshot above.

8. Generate Embeddings for the Generated Insights

After populating the insights in our table, we can now use these insights to generate embeddings. These embeddings help us in not relying on exact keyword matches, but help generate results based on conceptual similarities.

Note: If you ran the workaround section query in the previous step, you can skip this and move to the workaround section query in this step as well.

Run the following query to generate embeddings:

  1. In the navigation menu, click Spanner Studio. The Explorer pane displays a list of the objects in your database.
  2. Open a new tab by clicking the New SQL editor tab.
  3. Run the following query to create the patents_data_embeddings table.
CREATE TABLE patents_data_embeddings (id string(100), patents_embeddings ARRAY<FLOAT64>) PRIMARY KEY (id);
  1. Click Run. The results of your query appear in the Results table.
  2. Run the following query to generate embeddings.
INSERT INTO patents_data_embeddings (id, patents_embeddings)
SELECT id, embeddings.values as patents_embeddings
FROM ML.PREDICT(MODEL text_embeddings,
(SELECT id, gemini_response as content FROM patents_data_gemini));
  1. Click Run. The results of your query appear in the Results table.

Observe results

The table includes embeddings generated for the title and abstracted text.

Let's run the following query to verify the results:

select title, abstract, b.patents_embeddings from patents_data a inner join patents_data_embeddings b
on a.id = b.id;

The following results are seen:

a1e968bac4ab1cb.png

Workaround Section:

Use this query, if you did the workaround section in other steps:

select title, abstract, b.patents_embeddings from patents_data a inner join 
(SELECT id, embeddings.values as patents_embeddings
FROM ML.PREDICT(MODEL text_embeddings,
(SELECT id, gemini_response as content FROM (SELECT id, content as gemini_response
FROM ML.PREDICT(MODEL gemini_pro_model,
(select id, concat ('Identify the areas of work or keywords in this abstract', abstract) as prompt from patents_data b limit 50
)))))) b
on a.id = b.id;

This should result in the same results as shown in the screenshot above.

9. Prepare for vector search

Now that we have generated text embeddings, we can prepare our web application to be ready for performing similarity vector searches. In this codelab, we are creating a web application that includes the logic to give search results based on K-Nearest Neighbors Similarity Search capability. You can use this prepared dataset with a search app to visualize how the search results appear.

For our codelab, we will run a sample query that searches for a prompt, generates results based on context, and limits the results to 10 entries.

Run the following query:

  1. In the navigation menu, click Spanner Studio. The Explorer pane displays a list of the objects in your database.
  2. Open a new tab by clicking the New SQL editor tab.
  3. Run the following query to create the patents_data_embeddings table.
SELECT a.id, a.title, a.abstract, 'A new Natural Language Processing related Machine Learning Model' search_text, COSINE_DISTANCE(c.patents_embeddings,
(SELECT embeddings.values
FROM ML.PREDICT(
MODEL text_embeddings,
(SELECT 'A new Natural Language Processing related Machine Learning Model' as content)))) as distance
FROM patents_data a inner join patents_data_gemini b on a.id = b.id
inner join patents_data_embeddings c on a.id = c.id
ORDER BY distance
LIMIT 10;
  1. Click Run. The results of your query appear in the Results table.

Note: If you used the queries in the workaround section, because of quota errors in one of the earliest insert statements, you can skip all the other steps and directly run the below query to observe the results of performing nearest neighbor search on vector embeddings in Spanner database:

SELECT a.id, a.title, a.abstract, 'A new Natural Language Processing related Machine Learning Model' search_text, COSINE_DISTANCE(c.patents_embeddings,
(SELECT embeddings.values
FROM ML.PREDICT(
MODEL text_embeddings,
(SELECT 'A new Natural Language Processing related Machine Learning Model' as content)))) as distance
FROM patents_data a inner join (SELECT id, embeddings.values as patents_embeddings
FROM ML.PREDICT(MODEL text_embeddings,
(SELECT id, gemini_response as content FROM (SELECT id, content as gemini_response
FROM ML.PREDICT(MODEL gemini_pro_model,
(select id, concat ('Identify the areas of work or keywords in this abstract', abstract) as prompt from patents_data b 
)))))) c on a.id = c.id
ORDER BY distance
LIMIT 2;

Observe results

The earlier query uses the COSINE_DISTANCE method to find the closest 10 matches for our prompt.

The following results are seen:

d26ca8b8238bdf25.png

The results that are generated are quite close contextually to the prompt that was part of the query.

10. Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this post, follow these steps:

  1. In the Google Cloud console, go to the Manage resources page.
  2. In the project list, select the project that you want to delete, and then click Delete. If you don't want to delete the project, just delete the instance you created in Spanner.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

11. Congratulations

Congratulations!You have successfully performed a similarity search using Spanner's built-in vector search. Additionally, you saw how easy it is to work with embedding and LLM models to provide generative AI functionality directly using SQL.

What's next?

Learn more about Spanner's exact nearest neighbor (KNN vector search) feature here: https://cloud.google.com/spanner/docs/find-k-nearest-neighbors

You can also read more about how to perform online predictions with SQL using Spanner's VertexAI integration here: https://cloud.google.com/spanner/docs/ml