Master Data Management Simplified: Match & Merge with Generative AI!

1. Overview

What is Master Data Management?

Master Data Management (MDM) is all about creating a single, reliable source of truth for your organization's most critical data. Imagine a meticulously organized library where every book (data point) is correctly labeled, up-to-date, and easy to find.

Master data represents the core, foundational business entities that are essential to a company's operations. Here are the key elements of master data:

  • Business entities: entities such as customers, products, suppliers, locations, and employees which are the nouns that your business revolves around
  • Identifiers: unique identifiers that ensure each entity is distinct and traceable across systems
  • Attributes: the characteristics that describe each entity, for example, a customer's address or a product's price

To help you understand master data better, let's compare it with transactional data. Transactional data captures individual events (a purchase, a shipment, and so on), whereas master data provides the context for those events by defining the entities involved. For example, a sales transaction links to master data for the customer, product, and salesperson.
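A tiny Python sketch (with hypothetical records and identifiers) makes the relationship concrete: the transaction stores only identifiers, and the master data supplies the descriptive attributes for the entities involved.

```python
# Hypothetical master records, keyed by unique identifiers.
customers = {"C001": {"name": "Acme Corp", "address": "123 Main St, Brooklyn, NY"}}
products = {"P100": {"name": "Road Bike", "price": 899.00}}

# A transaction captures a single event and references master data by ID.
transaction = {"txn_id": "T-5001", "customer_id": "C001", "product_id": "P100", "quantity": 2}

# The master records supply the context needed to interpret the event.
def describe(txn):
    c = customers[txn["customer_id"]]
    p = products[txn["product_id"]]
    return f'{c["name"]} bought {txn["quantity"]} x {p["name"]}'

print(describe(transaction))  # Acme Corp bought 2 x Road Bike
```

If the customer's address changes, only the master record is updated; every transaction that references C001 automatically reflects the corrected context.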

While implementing robust MDM is essential for strategic decision-making, it can be complex and resource-intensive. This is where the transformative power of Generative AI, particularly models like Gemini 1.0 Pro, Gemini 1.0 Pro Vision, and Gemini 1.5 Pro, comes into play.

2. Objective

In this codelab, you'll use Gemini 1.0 Pro to simplify master data management tasks such as enrichment and deduplication for the citibike_stations data available in the BigQuery public dataset.

What you'll use

  1. BigQuery public dataset bigquery-public-data.new_york_citibike.
  2. Gemini Function Calling (a Java Cloud Function that gets the address information using the reverse Geocoding API for the coordinates available with the citibike_stations data).
  3. Vertex AI Embeddings API and Vector Search in BigQuery to identify duplicates.

What you'll build

  1. You'll create a BigQuery dataset for the use case. In this dataset, you'll create a landing table with data from the public dataset table bigquery-public-data.new_york_citibike.citibike_stations.
  2. You'll deploy the Cloud Function that includes Gemini Function Calling for address standardization.
  3. You'll store the enriched address data in the landing tables (from the two sources that are provided for this demonstration).
  4. You'll invoke the Vertex AI Embeddings API from BigQuery on the address data.
  5. You'll use BigQuery Vector Search to identify duplicate records.

The following diagram represents the flow of data and steps involved in the implementation.

High level flow of the use case

3. Requirements

  • A browser, such as Chrome or Firefox.
  • A Google Cloud project with billing enabled.

4. Before you begin

  1. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
  2. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
  3. You'll use Cloud Shell, a command-line environment running in Google Cloud that comes preloaded with bq. Click Activate Cloud Shell at the top of the Google Cloud console.

Activate Cloud Shell button image

  4. Once connected to Cloud Shell, check that you're already authenticated and that the project is set to your project ID using the following command:
gcloud auth list
  5. Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
gcloud config list project
  6. If your project is not set, use the following command to set it:
gcloud config set project <YOUR_PROJECT_ID>
  7. Navigate to the Gemini for Google Cloud Marketplace to enable the API. You can also use the following command in the Cloud Shell terminal:
gcloud services enable cloudaicompanion.googleapis.com --project PROJECT_ID
  8. Make sure that the BigQuery, BigQuery Connection, Cloud Functions, Cloud Run, Vertex AI, and Cloud Build APIs are enabled. Alternatively, you can enable them from the console using this link.

Refer to the documentation for gcloud commands and usage.

5. Create a BigQuery dataset and external connection

Let's begin by creating a dataset and a Cloud resource connection.

A dataset in BigQuery is a container for all the tables and objects for your application.

To create a dataset, do the following:

  1. Go to the BigQuery page in the Google Cloud console.
  2. In the Explorer panel, select the project where you want to create the dataset.
  3. Expand the Actions option (the vertical ellipsis icon), and click Create dataset.

Image of the Actions menu and the Create dataset option

  4. Enter mdm_gemini in the Dataset ID field.
  5. Set your location type as Multi-region and accept the default value, which is US (multiple regions in United States).
  6. Click Create dataset.
  7. Check that the dataset is created and listed under your project ID in the Explorer pane.

A BigQuery connection is required to interact with your Cloud Function. To create a remote function, you must create a BigQuery connection. In this codelab, we will use a BigLake connection to access the model from BigQuery via the Cloud Function. BigLake connections let you connect to an external data source (in our case, the Vertex AI Gemini Pro API via the Cloud Function) while retaining fine-grained BigQuery access control and security.

To create the BigLake connection, do the following:

  1. Click Add on the Explorer pane of the BigQuery page.

BigQuery Console with ADD button highlighted to add external connection

  2. Click Connections to external data sources.
  3. In the Connection type list, select Vertex AI remote models, remote functions and BigLake (Cloud Resource).
  4. In the Connection ID field, enter your connection name as gemini-bq-conn.
  5. Set your location type as Multi-region and accept the default value, which is US (multiple regions in United States).
  6. Click Create connection.
  7. Click Go to connection, and then copy the service account ID in the Connection info pane.

Connection info screenshot

  8. Go to the IAM & Admin page and click Grant access.
  9. Paste the service account ID in the New principals field.
  10. Select the Vertex AI User role from the role list, and then click Save.

Grant access to Service Account screenshot

You have now successfully created the dataset and BigQuery connection.

6. Deploy Gemini Function Calling (Java Cloud Function)

Follow these steps to deploy the Java Cloud Function that includes Gemini Function Calling.

  1. Clone the github repository from your Cloud Shell terminal using the following command:
git clone https://github.com/AbiramiSukumaran/GeminiFunctionCalling
  2. Replace the placeholders YOUR_API_KEY and YOUR_PROJECT_ID with your values.

If you read the blog here, you will know that the function calling implementation uses the Reverse Geocoding API. You can create your own API_KEY by following the instructions here.
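For context, here is a minimal Python sketch of the kind of mapping the function performs: it flattens a reverse-geocoding response's address_components into the flat schema the codelab stores. The sample response and the type-to-field mapping are illustrative assumptions, not the repository's actual code.

```python
import json

# Abridged, illustrative reverse-geocoding response. The component type names
# follow the Maps Geocoding API convention, but this sample is made up.
sample = {
    "address_components": [
        {"long_name": "277", "types": ["street_number"]},
        {"long_name": "Bedford Ave", "types": ["route"]},
        {"long_name": "Brooklyn", "types": ["sublocality", "political"]},
        {"long_name": "Kings County", "types": ["administrative_area_level_2"]},
        {"long_name": "NY", "types": ["administrative_area_level_1"]},
        {"long_name": "USA", "types": ["country"]},
        {"long_name": "11211", "types": ["postal_code"]},
    ]
}

# Assumed mapping from geocoding component types to the codelab's flat schema.
TYPE_TO_FIELD = {
    "street_number": "DOOR_NUMBER",
    "route": "STREET_ADDRESS",
    "sublocality": "CITY",
    "administrative_area_level_2": "COUNTY",
    "administrative_area_level_1": "STATE",
    "country": "COUNTRY",
    "postal_code": "ZIPCODE",
}

SCHEMA = ["DOOR_NUMBER", "STREET_ADDRESS", "AREA", "CITY", "TOWN",
          "COUNTY", "STATE", "COUNTRY", "ZIPCODE", "LANDMARK"]

def to_flat_address(response):
    # Start with every schema field set to null, then fill what we can find.
    out = {field: None for field in SCHEMA}
    for comp in response["address_components"]:
        for t in comp["types"]:
            if t in TYPE_TO_FIELD:
                out[TYPE_TO_FIELD[t]] = comp["long_name"]
    return out

print(json.dumps(to_flat_address(sample)))
```

Fields with no matching component (AREA, TOWN, LANDMARK here) stay null, matching the shape of the response shown later in this section.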

  3. In the Cloud Shell terminal, go to the newly cloned project directory GeminiFunctionCalling and run the following statement to build and deploy the Cloud Function:
gcloud functions deploy gemini-fn-calling --gen2 --region=us-central1 --runtime=java11 --source=. --entry-point=cloudcode.helloworld.HelloWorld --trigger-http

Say "y" when you are prompted with the "Allow unauthenticated invocations" question. Ideally, you would set up authentication for your enterprise applications, as recommended. But since this is a demo app, we will proceed unauthenticated.

The output is a REST URL in the following format:

https://us-central1-YOUR_PROJECT_ID.cloudfunctions.net/gemini-fn-calling

  4. Test this Cloud Function by running the following command from the terminal:
gcloud functions call gemini-fn-calling --region=us-central1 --gen2 --data '{"calls":[["40.714224,-73.961452"]]}'

Response for a random sample prompt:

 '{"replies":["{ \"DOOR_NUMBER\": \"277\", \"STREET_ADDRESS\": \"Bedford Ave\", \"AREA\":
 null, \"CITY\": \"Brooklyn\", \"TOWN\": null, \"COUNTY\": \"Kings County\", \"STATE\":
 \"NY\", \"COUNTRY\": \"USA\", \"ZIPCODE\": \"11211\", \"LANDMARK\": null}"]}'

The request and response parameters of this Cloud Function are implemented in a way that is compatible with BigQuery's remote function invocation, so it can be consumed directly from BigQuery data in place. This means that if your input data (latitude and longitude) lives in BigQuery, you can call the remote function on that data and store or process the response within BigQuery directly.
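As a sketch of that contract, the snippet below mimics the JSON exchange: BigQuery POSTs a body with one "calls" entry per row (each entry holding that row's argument values) and expects a "replies" array of the same length back. The handler logic here is a stand-in placeholder, not the deployed function's actual code.

```python
import json

def handle_remote_function_request(request_body: str) -> str:
    """Mimics a BigQuery remote function handler: one reply per call."""
    payload = json.loads(request_body)
    replies = []
    for call in payload["calls"]:
        latlng = call[0]  # the single STRING argument of the remote function
        # Placeholder: a real implementation would reverse-geocode latlng here.
        replies.append(json.dumps({"input": latlng, "full_address": None}))
    return json.dumps({"replies": replies})

# One row in, one reply out; the reply count must match the call count.
body = '{"calls":[["40.714224,-73.961452"]]}'
print(handle_remote_function_request(body))
```

Because the reply array lines up positionally with the call array, BigQuery can join each function result back to the row that produced it.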

  5. Run the following DDL from BigQuery to create a remote function that invokes this deployed Cloud Function:
CREATE OR REPLACE FUNCTION
 `mdm_gemini.MDM_GEMINI` (latlng STRING) RETURNS STRING
 REMOTE WITH CONNECTION `us.gemini-bq-conn`
 OPTIONS (
   endpoint = 'https://us-central1-YOUR_PROJECT_ID.cloudfunctions.net/gemini-fn-calling', max_batching_rows = 1
 );

Test query that uses the new remote function:

SELECT mdm_gemini.MDM_GEMINI(latlong) from mdm_gemini.CITIBIKE_STATIONS limit 1;

If the test query fails because of a Cloud Functions permissions issue, go to Cloud Functions in the Google Cloud Console and locate the deployed function named "gemini-fn-calling". Go to the Permissions tab, add "allUsers" as a principal, and grant it the "Cloud Functions Invoker" role to make the function accessible to all users (only because this is a demo app).

7. Try a workaround

If you don't have the necessary API_KEY for the Reverse Geocoding function calling approach, or don't have the Cloud Function deployed for some reason, you can do the following as an alternative:

  1. Download the file CITIBIKE_STATIONS.csv from the repository into your Cloud Shell project folder and navigate into that folder.
  2. Export the data from the csv into your new BigQuery dataset mdm_gemini using the following command in the Cloud Shell Terminal:
bq load --source_format=CSV --skip_leading_rows=1 mdm_gemini.CITIBIKE_STATIONS ./CITIBIKE_STATIONS.csv name:string,latlng:string,capacity:numeric,num_bikes_available:numeric,num_docks_available:numeric,last_reported:timestamp,full_address_string:string

8. Create table and enrich address data

Step 1: Create the table

Important: Skip this step if you used the workaround; the table has already been created.

If you haven't used the workaround, run the following DDL in BigQuery SQL Editor:

CREATE TABLE mdm_gemini.CITIBIKE_STATIONS AS (
SELECT name, latitude || ',' || longitude AS latlong, capacity, num_bikes_available, num_docks_available, last_reported,
'' AS full_address_string
FROM `bigquery-public-data.new_york_citibike.citibike_stations`);

Now let's enrich the address data by invoking the remote function on the latitude and longitude coordinates available in the table. Set the following conditions for the data:

  • Reported in the year 2024
  • Number of bikes available > 0
  • Capacity > 100

Run the following query:

update `mdm_gemini.CITIBIKE_STATIONS`
set full_address_string = `mdm_gemini.MDM_GEMINI`(latlong)
where EXTRACT(YEAR FROM last_reported) = 2024 and num_bikes_available > 0 and capacity > 100;

Step 2: Create a second source for the bike station location data

Don't skip this step even if you used the workaround approach to create the table.

In this step, you'll create a second source for the bike station location data for the purpose of this codelab. This is to demonstrate that MDM brings data from multiple sources together and identifies the golden record.

Run the following DDL and DML statements in the BigQuery SQL editor to create the second source of location data, mdm_gemini.CITIBIKE_STATIONS_SOURCE2, and insert two records into it:

CREATE TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE2 (name STRING(55), address STRING(1000), embeddings_src ARRAY<FLOAT64>);

insert into mdm_gemini.CITIBIKE_STATIONS_SOURCE2 VALUES ('Location broadway and 29','{ "DOOR_NUMBER": "1593", "STREET_ADDRESS": "Broadway", "AREA": null, "CITY": "New York", "TOWN": null, "COUNTY": "New York County", "STATE": "NY", "COUNTRY": "USA", "ZIPCODE": "10019", "LANDMARK": null}', null);

insert into mdm_gemini.CITIBIKE_STATIONS_SOURCE2 VALUES ('Allen St & Hester','{ "DOOR_NUMBER": "36", "STREET_ADDRESS": "Allen St", "AREA": null, "CITY": "New York", "TOWN": null, "COUNTY": "New York County", "STATE": "NY", "COUNTRY": "USA", "ZIPCODE": "10002", "LANDMARK": null}', null);

9. Generate embeddings for address data

Embeddings are high-dimensional numerical vectors that represent a given entity, like a piece of text or an audio file. Machine learning (ML) models use embeddings to encode semantics about such entities to make it easier to reason about and compare them. For example, a common operation in clustering, classification, and recommendation models is to measure the distance between vectors in an embedding space to find items that are most semantically similar. The Vertex AI text-embeddings API lets you create a text embedding using Generative AI on Vertex AI. Text embeddings are numerical representations of text that capture relationships between words and phrases. Read more about Vertex AI Text Embeddings here.
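To see what "measuring the distance between vectors in an embedding space" looks like in practice, here is a minimal Python sketch. The three-dimensional vectors and the items are made up for illustration; real text embeddings from the Vertex AI API have hundreds of dimensions.

```python
import math

# Toy 3-dimensional "embeddings" for three address strings. Similar texts
# get nearby vectors; the numbers here are invented for illustration.
embeddings = {
    "277 Bedford Ave, Brooklyn, NY 11211": [0.9, 0.1, 0.0],
    "277 Bedford Avenue, Brooklyn NY":     [0.88, 0.12, 0.01],
    "36 Allen St, New York, NY 10002":     [0.1, 0.9, 0.2],
}

def euclidean(a, b):
    # Straight-line distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query_vec):
    # The nearest neighbor is the semantically closest stored address.
    return min(embeddings, key=lambda name: euclidean(query_vec, embeddings[name]))

query = [0.89, 0.11, 0.0]  # an embedding close to the two Bedford Ave variants
print(nearest(query))
```

This brute-force scan is exactly what BigQuery Vector Search does for you at scale in the next sections, without you having to compute distances by hand.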

  1. Run the following DDL to create a remote model for the Vertex AI text embeddings API:
CREATE OR REPLACE MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`
REMOTE WITH CONNECTION `us.gemini-bq-conn`
OPTIONS (ENDPOINT = 'textembedding-gecko@latest');
  2. Now that the remote embeddings model is ready, generate embeddings for the first source and store them in a table using the following query:
CREATE TABLE `mdm_gemini.CITIBIKE_STATIONS_SOURCE1` AS (
SELECT *
FROM ML.GENERATE_EMBEDDING(
 MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`,
 ( select name, full_address_string as content from `mdm_gemini.CITIBIKE_STATIONS`
 where full_address_string is not null )
  )
);

Instead of creating a new table, you can also store the embeddings result field in the same mdm_gemini.CITIBIKE_STATIONS table that you created earlier.

  3. To generate embeddings for address data in table CITIBIKE_STATIONS_SOURCE2, run the following query:
update `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` a set embeddings_src =
(
SELECT  ml_generate_embedding_result
FROM ML.GENERATE_EMBEDDING(
 MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`,
 ( select name, address as content from `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` ))
where name = a.name) where name is not null;

This should create embeddings for the second source. Note that we have created the embeddings field in the same table CITIBIKE_STATIONS_SOURCE2.

  4. To visualize the embeddings generated for the source data tables 1 and 2, run the following queries:
select name,address,embeddings_src from `mdm_gemini.CITIBIKE_STATIONS_SOURCE2`;
select name,content,ml_generate_embedding_result from `mdm_gemini.CITIBIKE_STATIONS_SOURCE1`;

Now let's go ahead and perform a vector search to identify duplicates.

10. Run a vector search for flagging duplicate addresses

In this step, you'll search the address embeddings ml_generate_embedding_result column of the mdm_gemini.CITIBIKE_STATIONS_SOURCE1 table for the top two embeddings that match each row of data in the embeddings_src column of the mdm_gemini.CITIBIKE_STATIONS_SOURCE2 table.

To do this, run the following query:

select query.name name1,base.name name2,
/* (select address from mdm_gemini.CITIBIKE_STATIONS_SOURCE2 where name = query.name) content1, base.content content2, */
distance
from VECTOR_SEARCH(
 TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE1,
 'ml_generate_embedding_result',
 (SELECT * FROM mdm_gemini.CITIBIKE_STATIONS_SOURCE2),
 'embeddings_src',
 top_k => 2
) where query.name <> base.name
order by distance desc;

Base table that we search: mdm_gemini.CITIBIKE_STATIONS_SOURCE1 on the field ml_generate_embedding_result

Query table whose rows we look up: mdm_gemini.CITIBIKE_STATIONS_SOURCE2 on the field embeddings_src

top_k: specifies the number of nearest neighbors to return. The default is 10. A negative value is treated as infinity, meaning that all values are counted as neighbors and returned.

distance_type: specifies the type of metric to use to compute the distance between two vectors. Supported distance types are Euclidean and Cosine. The default is Euclidean.

The result of the query is as follows:

Result Set

As you can see, this lists the two nearest neighbors (in other words, the closest duplicates) from CITIBIKE_STATIONS_SOURCE1 for each of the two rows in CITIBIKE_STATIONS_SOURCE2. Since distance_type is unspecified, it defaults to Euclidean, and the distance is read as how far apart the address text embeddings from the two sources are, the lowest value indicating the most similar address texts.

Let's set the distance_type to Cosine using the following query:

select query.name name1,base.name name2,
/* (select address from mdm_gemini.CITIBIKE_STATIONS_SOURCE2 where name = query.name) content1, base.content content2, */
distance
from VECTOR_SEARCH(
 TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE1,
 'ml_generate_embedding_result',
 (SELECT * FROM mdm_gemini.CITIBIKE_STATIONS_SOURCE2),
 'embeddings_src',
 top_k => 2,distance_type => 'COSINE'
) where query.name <> base.name
order by distance desc;

The result of the query is as follows:

Result set 2

Both queries (one for each distance type) are ordered by distance descending, which means we list the results in order of decreasing distance. But you'll notice that the second query's distance order is reversed. Can you guess why?

Yes, you got it right! In cosine similarity, a larger number means greater similarity and therefore a smaller cosine distance, and cosine considers only the angle between vectors, ignoring their magnitude. In Euclidean distance, a larger number means greater distance between values, and magnitude matters. So the two metrics can rank the same pairs of addresses differently.
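To make the difference concrete, here is a small Python sketch with made-up 2-D vectors (real embeddings have hundreds of dimensions). Vector c points in exactly the same direction as a but is much longer, while b points in a different direction but lies nearby, so the two metrics disagree about which neighbor is closest.

```python
import math

def euclidean(a, b):
    # Straight-line distance: sensitive to vector magnitude.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 minus cosine similarity: depends only on the angle between vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

a, b, c = [1.0, 0.0], [0.0, 1.0], [10.0, 0.0]

print(euclidean(a, b), euclidean(a, c))              # b is nearer by Euclidean
print(cosine_distance(a, b), cosine_distance(a, c))  # c is nearer by cosine
```

Under Euclidean distance b is the nearest neighbor of a, but under cosine distance c is, which is why swapping distance_type can reorder the result set.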

For more information on MDM, and for tips on the differences and applications of Euclidean and Cosine distances, read the blog.

11. Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this codelab, follow these steps:

  1. In the Google Cloud console, go to the Manage resources page.
  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
  4. If you want to keep your project, skip the above steps and delete only the Cloud Function: navigate to Cloud Functions, check the function you want to delete in the list of functions, and click Delete.

12. Congratulations

Congratulations! You have demonstrated the power of using Gemini 1.0 Pro and Function Calling to transform a few MDM activities into simplified yet powerful, deterministic, and reliable generative AI capabilities. Now feel free to identify other ways of implementing the same use case or other MDM functionalities. Are there datasets you could validate, information gaps you could fill, or tasks that could be automated with structured calls embedded within your generative AI responses? Refer to the documentation for Vertex AI, BigQuery Remote Functions, Cloud Functions, Embeddings, and Vector Search for more in-depth guidance. Here is the github repo for this project. Let us know what you build with this learning!