Using textembedding-gecko@003 for Vector Embeddings

About this codelab

Last updated Oct 7, 2024
Written by Eduardo Godinez

1. Introduction

Last Updated: 2024-04-08

Text Embedding

Text embedding refers to the process of transforming textual data into numerical representations. These numerical representations, often vectors, capture the semantic meaning and relationships between words in a text. Imagine it like this:

Text is like a complex language, full of nuances and ambiguities.

Text embedding translates that language into a simpler, mathematical format that computers can understand and manipulate.
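For intuition, here's a toy sketch with made-up 3-dimensional vectors (real embedding models output hundreds of dimensions): words with similar meanings end up with vectors that are close together.

import numpy as np

# Made-up toy vectors for illustration only; real embeddings come from a model.
cat = np.array([0.9, 0.1, 0.2])
kitten = np.array([0.85, 0.15, 0.25])
car = np.array([0.1, 0.9, 0.7])

# Smaller distance = more similar meaning.
print(np.linalg.norm(cat - kitten))  # small: related words
print(np.linalg.norm(cat - car))     # large: unrelated words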

Benefits of Text Embedding

  • Enables efficient processing: Numerical representations are much faster for computers to process compared to raw text. This is crucial for tasks like search engines, recommendation systems, and machine translation.
  • Captures semantic meaning: Embeddings go beyond just the literal meaning of words. They capture the context and relationships between words, allowing for more nuanced analysis.
  • Improves machine learning performance: Text embeddings can be used as features in machine learning models, leading to better performance in tasks like sentiment analysis, text classification, and topic modeling.

Use cases of Text Embedding

Text embeddings, by transforming text into numerical representations, unlock a variety of applications in Natural Language Processing (NLP). Here are some key use cases:

1. Search Engines and Information Retrieval:

Text embeddings allow search engines to understand the semantic meaning behind queries and match them with relevant documents, even if the exact keywords aren't present.

By comparing the embeddings of a search query with document embeddings, search engines can identify documents that cover similar topics or concepts.

2. Recommendation Systems:

Recommender systems use text embeddings to analyze user behavior and preferences expressed through reviews, ratings, or browsing history.

The system can then recommend similar items by comparing the embeddings of products, articles, or other content the user has interacted with.

3. Plagiarism Detection:

Comparing the embeddings of two text pieces can help identify potential plagiarism by finding significant similarities in their semantic structure.

These are just a few examples, and the possibilities continue to grow as text embedding techniques evolve. As computers gain a better understanding of language through embeddings, we can expect even more innovative applications in the future.

textembedding-gecko@003

textembedding-gecko@003 is a specific version of a pre-trained text embedding model offered by Google Cloud through Vertex AI, its suite of AI tools and services.

What you'll build

In this codelab, you're going to build a Python script. This script will:

  • Use the Vertex AI API to call textembedding-gecko@003 and transform text into text embeddings (vectors).
  • Create a simulated vector database made of texts and their vectors.
  • Query the simulated vector database by comparing vectors and return the closest response.

What you'll learn

  • How to use text embeddings in GCP
  • How to call textembedding-gecko@003
  • How to use Vertex AI Workbench to execute scripts

What you'll need

  • A recent version of Chrome
  • Knowledge of Python
  • A Google Cloud Project
  • Access to Vertex AI - Workbench

2. Getting set up

Create a Vertex AI Workbench Instance

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  2. Make sure that billing is enabled for your Google Cloud project.
  3. Enable the Notebooks API.

You can create a Vertex AI Workbench instance by using the Google Cloud console, the gcloud CLI, or Terraform. For the purpose of this tutorial, we'll create it using the Google Cloud Console. More information on the other methods can be found here.

  1. In the Google Cloud console, go to the Vertex AI menu and, in the Notebooks section, click Workbench to open the Instances page.
  2. Go to Instances.
  3. Click Create new.
  4. In the Create instance dialog, in the Details section, provide the following information for your new instance:

Name: Provide a name for your new instance. The name must start with a letter followed by up to 62 lowercase letters, numbers, or hyphens (-), and cannot end with a hyphen.

Region and Zone: Select a region and zone for the new instance. For best network performance, select the region that is geographically closest to you.

No GPU is needed for this codelab.

In the Networking section, provide the following:

Networking: Adjust the network options to use a network in your current project or a Shared VPC network from a host project, if one is configured. If you are using a Shared VPC in the host project, you must also grant the Compute Network User role (roles/compute.networkUser) to the Notebooks Service Agent from the service project.

In the Network field: Select the network that you want. You can select a VPC network, as long as the network has Private Google Access enabled or can access the internet.

In the Subnetwork field: Select the subnetwork that you want. You can choose the default one.

In the Instance properties section, you can keep the default machine type, that is, an e2-standard-4.


  5. Click Create.

Vertex AI Workbench creates an instance and automatically starts it. When the instance is ready to use, Vertex AI Workbench activates an Open JupyterLab link. Click on it.

Create a Python 3 Notebook

  1. Inside JupyterLab, from the Launcher, in the Notebook section, click the icon with the Python logo labeled Python 3.
  2. A Jupyter notebook is created with the name Untitled and the extension .ipynb.
  3. You can rename it using the file browser section on the left side, or you can leave it as is.

Now, we can start putting our code in the notebook.

3. Importing required libraries

Once the instance has been created and JupyterLab has been opened, we need to import all the required libraries for our codelab.

We'll need:

  1. numpy
  2. pandas
  3. TextEmbeddingInput, TextEmbeddingModel from vertexai.language_models

Copy and paste the below code in a cell:

from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

import numpy as np
import pandas as pd

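Note: inside a Workbench instance, the Vertex AI SDK is usually preinstalled and already authenticated, so the imports should work out of the box. If the import fails, or you run the notebook somewhere else, you may first need to install the SDK and initialize it. A minimal sketch, where the project ID and region are placeholders to replace with your own:

%pip install google-cloud-aiplatform

import vertexai

# Placeholders: replace with your own project ID and a supported region.
vertexai.init(project="your-project-id", location="us-central1")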

4. Create a simulated vector database

In order to test our code, we'll create a database made of texts and their respective vectors, generated using the gecko@003 text embedding model.

The objective is to take a user's search text, translate it into a vector, look it up in our database, and return the most approximate result.

Our vector database will hold three records. This is how we'll create it:

Copy and paste the below code in a new cell.

DOCUMENT1 = {
    "title": "Operating the Climate Control System",
    "content": "Your Googlecar has a climate control system that allows you to adjust the temperature and airflow in the car. To operate the climate control system, use the buttons and knobs located on the center console.  Temperature: The temperature knob controls the temperature inside the car. Turn the knob clockwise to increase the temperature or counterclockwise to decrease the temperature. Airflow: The airflow knob controls the amount of airflow inside the car. Turn the knob clockwise to increase the airflow or counterclockwise to decrease the airflow. Fan speed: The fan speed knob controls the speed of the fan. Turn the knob clockwise to increase the fan speed or counterclockwise to decrease the fan speed. Mode: The mode button allows you to select the desired mode. The available modes are: Auto: The car will automatically adjust the temperature and airflow to maintain a comfortable level. Cool: The car will blow cool air into the car. Heat: The car will blow warm air into the car. Defrost: The car will blow warm air onto the windshield to defrost it."}

DOCUMENT2 = {
    "title": "Touchscreen",
    "content": "Your Googlecar has a large touchscreen display that provides access to a variety of features, including navigation, entertainment, and climate control. To use the touchscreen display, simply touch the desired icon.  For example, you can touch the \"Navigation\" icon to get directions to your destination or touch the \"Music\" icon to play your favorite songs."}

DOCUMENT3 = {
    "title": "Shifting Gears",
    "content": "Your Googlecar has an automatic transmission. To shift gears, simply move the shift lever to the desired position.  Park: This position is used when you are parked. The wheels are locked and the car cannot move. Reverse: This position is used to back up. Neutral: This position is used when you are stopped at a light or in traffic. The car is not in gear and will not move unless you press the gas pedal. Drive: This position is used to drive forward. Low: This position is used for driving in snow or other slippery conditions."}

documents = [DOCUMENT1, DOCUMENT2, DOCUMENT3]

df_initial_db = pd.DataFrame(documents)
df_initial_db.columns = ['Title', 'Text']
df_initial_db

It would look like this:

26baa3b876c0605d.png

Let's analyze the code

In the variables DOCUMENT1, DOCUMENT2 and DOCUMENT3, we store dictionaries that emulate documents with their titles and contents. These "documents" reference a simulated manual for a Google-made car.

In the next line, we create a list out of those three documents (dictionaries).

documents = [DOCUMENT1, DOCUMENT2, DOCUMENT3]

Finally, leveraging pandas, we create a dataframe out of that list which will be called df_initial_db.

df_initial_db = pd.DataFrame(documents)
df_initial_db.columns = ['Title', 'Text']
df_initial_db

5. Create text embeddings

We'll now get a text embedding using the gecko@003 model for each record in our simulated database of documents.

Copy and paste the below code into a new cell:

def embed_fn(df_input):
    # Load the embedding model once, before iterating over the rows.
    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
    list_embedded_values = []
    for index, row in df_input.iterrows():
        # Embed the text of each row and keep the resulting vector.
        embeddings = model.get_embeddings([row['Text']])
        list_embedded_values.append(embeddings[0].values)
    df_input['Embedded text'] = list_embedded_values
    return df_input

df_embedded_values_db = embed_fn(df_initial_db)
df_embedded_values_db

The dataframe now shows a third column, Embedded text, holding the vector generated for each row.

Let's analyze the code

We defined a function called embed_fn which receives as input a pandas dataframe containing the text to embed. The function returns the same dataframe with each text encoded as a vector.

def embed_fn(df_input):
    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
    list_embedded_values = []
    for index, row in df_input.iterrows():
        embeddings = model.get_embeddings([row['Text']])
        list_embedded_values.append(embeddings[0].values)
    df_input['Embedded text'] = list_embedded_values
    return df_input

The list list_embedded_values is where we store and append the encoded vector of every row.

Using the iterrows method from pandas, we can iterate over every row in the dataframe, getting the values from the Text column (which contains the manual information from our simulated database).

In order to send regular text and get back its vector using the gecko@003 model, we initialize the variable model, setting the model to use by calling the TextEmbeddingModel.from_pretrained function.

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
embeddings = model.get_embeddings([row['Text']])

Then, in the variable embeddings, we capture the vector of the text we send via the model.get_embeddings function.

At the end of the function, we create a new column in the dataframe called Embedded text and this will contain the list of vectors created based on the gecko@003 model.

df_input['Embedded text'] = list_embedded_values
return df_input            

Finally, in the variable df_embedded_values_db we capture the dataframe containing our original data from the simulated database plus a new column containing the list of vectors for each row.

df_embedded_values_db = embed_fn(df_initial_db)
df_embedded_values_db      
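As a side note, get_embeddings accepts a list of texts, so for a table this small you could embed every row in a single call instead of one request per row. A minimal sketch of that batch variant, assuming the whole batch stays within the API's per-request limit:

# Batch variant (sketch): embed all rows in one request.
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
batch_embeddings = model.get_embeddings(df_initial_db['Text'].tolist())
df_initial_db['Embedded text'] = [e.values for e in batch_embeddings]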

6. Asking a question to the vector database

Now that our database contains text and their vectors, we can proceed to ask a question and query our database to find an answer.

For that, copy and paste the below code into a new cell:

question='How do you shift gears in the Google car?'
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
embeddings = model.get_embeddings([question])
text_to_search=embeddings[0].values
len(text_to_search)

The cell outputs the length of the embedding vector, which should be 768 dimensions for the gecko@003 model.

Let's analyze the code

Similar to the function from the previous step, we first initialize the question variable with what we intend to ask our database.

question='How do you shift gears in the Google car?'

Then, in the model variable we set the model we want to use via the TextEmbeddingModel.from_pretrained function, which in this case is the gecko@003 model.

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")

In the embeddings variable, we call the model.get_embeddings function and pass the text to be converted into a vector; in this case, we pass the question we want to ask.

embeddings = model.get_embeddings([question])

Finally, the text_to_search variable holds the list of vector values translated from the question.

We print the length of the vector just as a reference.

text_to_search=embeddings[0].values
len(text_to_search)
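Keep in mind that a query vector is only comparable to database vectors produced by the same model. If you end up embedding many queries, a small helper of our own (not part of the SDK) keeps that consistent:

def embed_text(text, model_name="textembedding-gecko@003"):
    # Hypothetical helper: embeds a single string with the same model
    # used to build the database, so the vectors stay comparable.
    model = TextEmbeddingModel.from_pretrained(model_name)
    return model.get_embeddings([text])[0].values

text_to_search = embed_text(question)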

7. Comparing vectors

We now have a list of vectors in our simulated database and a question transformed into a vector. That is, we can now compare the vector of the question with all the vectors in our database to find out which one is closest and therefore most likely to answer our question accurately.

To accomplish this, we'll measure the distance between the vector of the question and each vector in the database. There are multiple techniques to measure the distance between vectors; for this specific codelab, we'll use the Euclidean distance, also known as the L2 norm:

d(a, b) = √( Σᵢ (aᵢ − bᵢ)² )

In Python, we can leverage numpy's linalg.norm function to accomplish this.

Copy and paste the below code into a new cell:

list_embedded_text_from_db = df_embedded_values_db['Embedded text']
# Start at infinity so the first computed distance always becomes the minimum.
shortest_distance=float('inf')
for position, embedded_value in enumerate(list_embedded_text_from_db):
    distance=np.linalg.norm((np.array(embedded_value) - np.array(text_to_search)), ord = 2)
    print(distance)
    if distance<shortest_distance:
        shortest_distance=distance
        shortest_position=position

print(f'The shortest distance is {shortest_distance} and the position of that value is {shortest_position}')

The cell prints the distance between the question and each document, followed by the shortest distance found and the position of that document in the list.

Let's analyze the code

We start by taking the column holding the embedded vectors of our database and storing it in list_embedded_text_from_db.

We also initialize the shortest_distance variable to infinity, so that the first computed distance replaces it, and we keep updating it until we find the actual shortest distance.

list_embedded_text_from_db = df_embedded_values_db['Embedded text']
shortest_distance=float('inf')

Then, using a for loop, we iterate over the database and compute the distance between the vector of the question and each stored vector, using numpy's linalg.norm function.

If the calculated distance is less than the current value of shortest_distance, we update shortest_distance with it and record the position where it was found in shortest_position.

for position, embedded_value in enumerate(list_embedded_text_from_db):
    distance=np.linalg.norm((np.array(embedded_value) - np.array(text_to_search)), ord = 2)
    print(distance)
    if distance<shortest_distance:
        shortest_distance=distance
        shortest_position=position
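Euclidean distance is only one of several ways to compare embeddings; cosine similarity is another common choice, and for unit-length vectors the two rank results identically. A minimal sketch using numpy (the cosine_similarity helper below is ours, not a library function):

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means same direction, values near 0 mean unrelated.
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Higher is better here, so we look for the maximum instead of the minimum.
similarities = [cosine_similarity(v, text_to_search) for v in list_embedded_text_from_db]
best_position = int(np.argmax(similarities))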

8. Results

Knowing the position in the list of the vector with the shortest distance to the question, we can print the results.

Copy and paste the below code in a new cell:

print("Your question was:\n "+question+ " \nAnd our answer is:\n "+
      df_embedded_values_db
.at[shortest_position, 'Title']+": "+
      df_embedded_values_db
.at[shortest_position, 'Text'])

After executing it, the script prints your question followed by the title and content of the closest matching document; for our question about shifting gears, that is the "Shifting Gears" entry.

9. Congratulations

Congratulations, you've successfully built your first application using the textembedding-gecko@003 model in a real use case!

You learned the foundations of text embeddings and how to use the gecko@003 model in Vertex AI Workbench.

You now know the key steps required to keep applying your knowledge to more use cases.

What's next?

Check out some of these codelabs...

Reference docs