Scale Distributed Data Processing with GKE to build a Knowledge Graph in BigQuery

1. Introduction

In this codelab, you will build a distributed knowledge acquisition pipeline for "Petverse". You will process unstructured multimedia assets (audio, video, images, and text/CSV) from a Cloud Storage bucket, extract key information about the pets (such as favorite foods and hobbies), and create a knowledge graph. You will scale the processing of these files using Gemini's multimodal capabilities on Google Kubernetes Engine (GKE). Finally, you will store the data in BigQuery and use the BigQuery Property Graph feature to analyze the relationships.

We will use the power of Google Kubernetes Engine to demonstrate processing of high-volume data in parallel.

Why knowledge graphs?

Knowledge graphs model entities as nodes and relationships as edges, which often makes them better suited than traditional relational tables for representing and analyzing complex relationships between entities.

We will use Gemini 2.5 Flash to analyze images, audio, and video files and establish facts about different pets.

Figure: GKE + BigQuery architecture

What you'll do

Build and deploy a distributed data processing job on GKE.
Use Gemini to extract entities and relationships from multimedia files.
Store the knowledge graph data in BigQuery.
Create and query a Property Graph in BigQuery using Graph Query Language (GQL).

What you'll need

A web browser such as Chrome
A Google Cloud project with billing enabled
Permissions in the project to create resources and modify IAM policies

This codelab is for developers of all levels, including beginners.

Estimated Duration: 45 minutes

Cost: The resources created in this codelab should cost less than $5.

2. Before you begin

Create a Google Cloud Project

Navigate to the Google Cloud Console: https://console.cloud.google.com, and then select or create a Google Cloud project.
⚠️ Note the Project ID. You will use it for several commands in this lab.

Start Cloud Shell

Open Cloud Shell in a new tab: https://shell.cloud.google.com/.
If prompted, click Authorize.
Replace YOUR_PROJECT_ID with your project ID, then paste the following commands into the terminal:

export PROJECT_ID="YOUR_PROJECT_ID"
gcloud config set project $PROJECT_ID

📝 Note: Your project ID is shown in yellow in the command prompt. If your session restarts, re-run the commands above to set the project ID again.

Enable APIs

Run this command to enable all the required APIs:

gcloud services enable \
  aiplatform.googleapis.com \
  bigquery.googleapis.com \
  artifactregistry.googleapis.com \
  container.googleapis.com \
  cloudbuild.googleapis.com \
  pubsub.googleapis.com

Clone repository

Run these commands to clone the repository.

cd ~/
git clone --filter=blob:none --no-checkout https://github.com/GoogleCloudPlatform/devrel-demos.git
cd ~/devrel-demos
git sparse-checkout init --cone
git sparse-checkout set codelabs/gke-knowledge-graph
git checkout main
cd codelabs/gke-knowledge-graph/

Run setup script

This script automates the backend configuration by:

Creating a container image and an Artifact Registry repository
Creating a BigQuery dataset
Creating a BigQuery Connection to execute Gemini AI functions from SQL

Run the following command in your terminal:

./scripts/setup.sh

If the script prompts you for configuration details, use these values:

Project ID: Use the ID you created in the previous step.
Region: us-central1

⚠️ Important: The script takes a few minutes to complete. Leave this terminal tab open so the script can finish in the background. To continue with the next step, open a new terminal tab or window and run your next commands there.

3. Set up the Data Agent Kit

Open the Cloud Shell Editor by clicking the pencil icon in the top-right corner.
In the Cloud Shell Editor, click the Extensions icon in the left sidebar.
Search for Google Cloud Data Agent Kit and click Install if it's not installed already.

Sign in to your Google account with the extension.
In the Configuration Summary, enter your project ID and us-central1 as the region.

Click Configure MCP Servers. You do not need to make any changes in this window; simply click Get started.
Reload the window if prompted. You can close the Quick Start Guide tab for now.

Set up the tables in BigQuery

In the sidebar, return to the Explorer. If your home folder (e.g., /home/your_user_name/) is not already open, click Open Folder and select it.

In the explorer window, locate the folder you cloned from the repository (devrel-demos). Under codelabs/gke-knowledge-graph/scripts, you will find create_tables.sql. Open that file.
On the top right, click Open Query Settings.
Choose BigQuery. Save and Close.
Click Run.

You should see two statements executed successfully. You have now created the tables to store nodes and edges for your knowledge graph.

You can close the create_tables.sql tab and the results console.

4. Initialize GKE Cluster

We will use GKE Autopilot to run our data processing job. Autopilot is the recommended mode because it manages the cluster infrastructure for you.

By now, the setup script should have finished. You should see a success message: 🎉🦄 Setup successfully finished! 🎉🦄.

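Paste these commands in the terminal to create the cluster. If you opened a new terminal tab, first change to the ~/devrel-demos/codelabs/gke-knowledge-graph directory so that scripts/setenv.sh can be found: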

source scripts/setenv.sh
gcloud container clusters create-auto petverse-cluster \
    --region=$REGION

🕓 This will take about 5 minutes.

Get credentials to interact with the cluster:

source scripts/setenv.sh
gcloud container clusters get-credentials petverse-cluster --region $REGION

You should see this output:

Fetching cluster endpoint and auth data.
kubeconfig entry generated for petverse-cluster.

5. Configure Workload Identity

Workload Identity Federation for GKE (using Direct Resource Access) allows your GKE workloads to securely access Google Cloud services without needing to manage service account keys.

Execute deploy.sh to:

Create a Kubernetes service account
Grant the necessary IAM roles directly to the Kubernetes Service Account principal
Bind the IAM service account to the Kubernetes service account
Annotate the Kubernetes service account to complete the link

source scripts/setenv.sh
./scripts/deploy.sh
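
For reference, when IAM roles are granted directly to a Kubernetes service account, the grant targets a principal identifier that names the cluster's workload identity pool, the namespace, and the service account. The sketch below only assembles that identifier string in the documented format. The project number, project ID, namespace, and service account name are hypothetical placeholders, not values read from deploy.sh.

```python
def ksa_principal(project_number: str, project_id: str,
                  namespace: str, ksa_name: str) -> str:
    """Build the IAM principal identifier for a Kubernetes service account."""
    # The workload identity pool of a GKE cluster is named after its project.
    pool = f"{project_id}.svc.id.goog"
    return (
        f"principal://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool}"
        f"/subject/ns/{namespace}/sa/{ksa_name}"
    )


# Hypothetical values, for illustration only.
print(ksa_principal("123456789012", "my-project", "default", "petverse-ksa"))
```

A string of this shape is what you would pass as the member when adding an IAM policy binding for the workload.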

6. Deploy Decoupled Processing Jobs

In this step, you will deploy the enqueuer (Producer) and the processing engines (Workers) onto GKE.

The decoupled architecture uses Pub/Sub to process assets asynchronously:

The Producer scans the Cloud Storage bucket and publishes every file path to a Pub/Sub topic.
A pool of Workers scales up in GKE, dynamically pulling tasks in parallel, processing them via Gemini, and writing to BigQuery.
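
The producer/worker split described above can be sketched locally with the Python standard library. This is a minimal illustration of the pattern, not the code from the repository: an in-memory queue.Queue stands in for Pub/Sub, threads stand in for GKE pods, the file paths are hypothetical, and the processing step is a placeholder for the Gemini call and BigQuery write.

```python
import queue
import threading


def produce(paths, task_queue):
    """Producer: enqueue one lightweight task (a file path) per asset."""
    for path in paths:
        task_queue.put(path)


def work(task_queue, results, idle_timeout):
    """Worker: pull tasks until the queue stays empty for idle_timeout seconds."""
    while True:
        try:
            path = task_queue.get(timeout=idle_timeout)
        except queue.Empty:
            return  # Idle timeout reached: the worker shuts itself down.
        # Placeholder for the real work (model call, then a database write).
        results.append(f"processed {path}")
        task_queue.task_done()


def run_pipeline(paths, num_workers=6, idle_timeout=0.2):
    """Run one producer followed by a pool of parallel workers."""
    task_queue = queue.Queue()
    results = []  # list.append is thread-safe in CPython
    produce(paths, task_queue)
    workers = [
        threading.Thread(target=work, args=(task_queue, results, idle_timeout))
        for _ in range(num_workers)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


# Hypothetical bucket and file names, for illustration only.
demo_paths = [f"gs://example-bucket/asset_{i}.mp3" for i in range(10)]
print(len(run_pipeline(demo_paths)), "assets processed")
```

Because the producer only enqueues paths, it finishes quickly, while the number of workers can be changed independently to match the volume of files.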

The setup.sh script already built and pushed both the Producer and Worker container images, created the Pub/Sub topic and subscription, and generated your Kubernetes Job manifests: job-producer.yaml and job-worker.yaml.

Apply the Producer Job to scan your storage bucket and queue all assets:

kubectl apply -f job-producer.yaml

This job runs and finishes quickly as it only queues metadata.

Deploy the Worker Job, which is configured to run 6 parallel workers that drain the queue:

kubectl apply -f job-worker.yaml

GKE Autopilot will automatically detect the pending pods, scale up compute nodes, and run the workers in parallel to process the enqueued audio, video, image, and CSV files.

7. Verify Results

Check the status of your jobs:

kubectl get jobs

Wait until both petverse-producer-job and petverse-worker-job show successful completions.
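
If you want to script this check, a Kubernetes Job that has finished successfully carries a condition of type Complete with status "True". The sketch below reads that condition from JSON in the shape returned by kubectl get jobs -o json. The sample document is hand-written for illustration and is not real output from this cluster.

```python
import json


def completed_jobs(jobs_json: str) -> dict:
    """Map each Job name to True if it reports a Complete=True condition."""
    items = json.loads(jobs_json).get("items", [])
    result = {}
    for job in items:
        conditions = job.get("status", {}).get("conditions", [])
        result[job["metadata"]["name"]] = any(
            c.get("type") == "Complete" and c.get("status") == "True"
            for c in conditions
        )
    return result


# Hand-written sample: the producer is done, the worker is still running.
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"name": "petverse-producer-job"},
            "status": {"conditions": [{"type": "Complete", "status": "True"}]},
        },
        {
            "metadata": {"name": "petverse-worker-job"},
            "status": {"active": 6},
        },
    ]
})

print(completed_jobs(SAMPLE))
```

As a built-in alternative, kubectl wait --for=condition=complete job/petverse-worker-job blocks until the Job reports the same condition or the wait times out.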

🕓 This will take about 10 minutes. You can follow the progress with the commands below.

Check the logs of the Producer to verify it enqueued files successfully:

cd ~/devrel-demos/codelabs/gke-knowledge-graph
source scripts/setenv.sh
kubectl logs -l app=petverse-producer --tail=50

Watch your parallel workers process files from the queue:

kubectl logs -l app=petverse-worker --tail=50

(The workers have a 60-second idle timeout and shut down automatically once the Pub/Sub queue is empty.)

Verify data in BigQuery

Navigate to BigQuery Studio. You will see two tables created: petverse_kg.Nodes and petverse_kg.Edges.

To see the contents of the tables, double-click their names and then click Preview.

The Nodes table holds the entities that Gemini picked up from the audio, video, and image files. The Edges table contains the relationships between them. For example, if you listen to the audio about the cat named SQL, you will hear that he likes to play with shoestrings and enjoys freeze-dried fishies.

Use the + button to create a new query. Paste the following statement and click Run:

SELECT n.name, n.entity_id, e.relationship, e.target_id
FROM 
  `petverse_kg.Nodes` n
JOIN 
  `petverse_kg.Edges` e
  ON n.entity_id = e.source_id
WHERE n.name = 'SQL'

Use the + button to create a new query. Paste the following statement and click Run:

SELECT name, entity_type, pet_bio,
AI.SIMILARITY('Pets who like to relax', pet_bio, endpoint => 'text-embedding-005') as similarity_score
FROM `petverse_kg.Nodes`
WHERE name IS NOT NULL
ORDER BY similarity_score DESC
LIMIT 5

You should see the nodes for pets who like to relax. This query performed a semantic search using the AI function AI.SIMILARITY to find pets whose bios are most similar to the query text.
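
To build intuition for the similarity score, the sketch below ranks items by cosine similarity between embedding vectors, a common measure for semantic search. It is not the implementation of AI.SIMILARITY. The three-dimensional vectors and the pet names are toy values made up for illustration, whereas embeddings produced by a real model have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, bio_vectors, k=5):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in bio_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings: the first axis loosely stands for "likes to relax".
query = [1.0, 0.0, 0.0]
bios = {
    "Pet A": [0.9, 0.1, 0.0],
    "Pet B": [0.0, 1.0, 0.0],
    "Pet C": [0.5, 0.5, 0.0],
}

for name, score in top_k(query, bios, k=2):
    print(name, round(score, 3))
```

Sorting by the score in descending order and keeping the first few rows mirrors the ORDER BY and LIMIT clauses in the query above.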

Build the Property Graph

Now that we have nodes and edges in BigQuery, we can create a Property Graph to query relationships easily.

Create the Graph

Overwrite the previous query and run the following DDL to create the property graph:

CREATE OR REPLACE PROPERTY GRAPH `petverse_kg.knowledge_graph`
  NODE TABLES (
    `petverse_kg.Nodes` AS `Nodes`
      KEY (`entity_id`)
        LABEL `Nodes` PROPERTIES (entity_id AS `entity_id`, entity_type AS `entity_type`, name AS `name`, pet_bio AS `pet_bio`, properties AS `properties`, bio_embedding AS `bio_embedding`))

  EDGE TABLES (
    `petverse_kg.Edges` AS `Edges`
      KEY (`source_id`,`target_id`,`relationship`)
        SOURCE KEY (`source_id`)
          REFERENCES `Nodes` (`entity_id`)
        DESTINATION KEY (`target_id`)
          REFERENCES `Nodes` (`entity_id`)
        LABEL `Edges` PROPERTIES (properties AS `properties`));

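Click Go to Graph. You will see the graph visualization with a single node that has an edge to itself. This is expected: the graph has one node table, Nodes, and the Edges table references it as both source and destination.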

Query the Graph

You can close all previous queries and open a new, blank one with the + button.
Use GQL to find pets related to other pets via shared interests (like hobbies, favorite foods, or toys). This multi-hop query matches two different pets that are connected to the same node:

GRAPH `petverse_kg.knowledge_graph`
MATCH p = (pet1:Nodes)-[e1]->(interest:Nodes)<-[e2]-(pet2:Nodes)
WHERE pet1.entity_id != pet2.entity_id
  AND (LOWER(pet1.entity_type) = 'pet' AND LOWER(pet2.entity_type) = 'pet')
  AND LOWER(interest.entity_type) IN ('hobby', 'action', 'activity', 'food', 'toy')
RETURN TO_JSON(p) as res
LIMIT 100

You should see the visualization of the graph. You can click on the nodes to see the properties of the nodes and edges.
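
To make the two-hop pattern concrete, the sketch below applies the same logic to a small in-memory graph: it groups pets by the interest node they point to, then pairs up distinct pets that share one. The column names match the Nodes and Edges tables, but the entity IDs and relationship names are hypothetical sample data, and this is a plain Python illustration rather than how BigQuery evaluates GQL.

```python
from collections import defaultdict

INTEREST_TYPES = {"hobby", "action", "activity", "food", "toy"}


def shared_interests(nodes, edges):
    """Return (pet1, interest, pet2) triples for pets sharing an interest."""
    node_type = {n["entity_id"]: n["entity_type"].lower() for n in nodes}

    # First hop: collect the pets that point at each interest node.
    pets_by_interest = defaultdict(list)
    for edge in edges:
        src, dst = edge["source_id"], edge["target_id"]
        if node_type.get(src) == "pet" and node_type.get(dst) in INTEREST_TYPES:
            pets_by_interest[dst].append(src)

    # Second hop: pair up distinct pets that meet at the same interest.
    triples = []
    for interest, pets in pets_by_interest.items():
        for pet1 in pets:
            for pet2 in pets:
                if pet1 != pet2:
                    triples.append((pet1, interest, pet2))
    return triples


# Hypothetical sample data, for illustration only.
nodes = [
    {"entity_id": "pet_1", "entity_type": "Pet"},
    {"entity_id": "pet_2", "entity_type": "Pet"},
    {"entity_id": "food_1", "entity_type": "Food"},
    {"entity_id": "toy_1", "entity_type": "Toy"},
]
edges = [
    {"source_id": "pet_1", "target_id": "food_1", "relationship": "LIKES"},
    {"source_id": "pet_2", "target_id": "food_1", "relationship": "LIKES"},
    {"source_id": "pet_1", "target_id": "toy_1", "relationship": "PLAYS_WITH"},
]

print(shared_interests(nodes, edges))
```

Like the MATCH pattern, this returns each pair in both directions, because either pet can play the role of pet1.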

🕵️ Hint: To change which value is displayed on each node, click Switch to schema view.

You can close all open query tabs.

8. Chat with the graph

Next to the + sign, you will find a drop-down menu. Select Conversation.

You will be prompted to enable the Data Analytics API with Gemini. Enable both APIs. Once this finishes, refresh the window or create a new conversation to see the agent.
Click New Agent.
Give the agent a name, such as petverse.
Click Add source and then Graph.
Select the knowledge_graph you created and click Add.

You can now ask the agent a question and see the answers and reasoning behind them. Here are some sample questions if you need inspiration. A thinking model may take a bit longer but is likely to construct a better GQL query. You can see what it builds by expanding Show Thinking.

Find pets who share similar foods, who are friends with pets who enjoy naps.
Do any pets share the exact same hobby, favorite food, or toy? List the pairs and their shared interests.
Find pets that have the same species or breed, but completely different hobbies.

9. Clean up

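To avoid ongoing charges to your Google Cloud account, delete the resources created during this codelab. The commands below use $PROJECT_ID and $REGION, so make sure both variables are set in your terminal session before running them.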

Delete the GKE cluster:

gcloud container clusters delete petverse-cluster --region $REGION --quiet

Delete the BigQuery dataset (this will delete all tables):

bq rm -r -f -d $PROJECT_ID:petverse_kg

Delete the Pub/Sub queue resources:

gcloud pubsub subscriptions delete petverse-sub --quiet
gcloud pubsub topics delete petverse-topic --quiet

Delete the Artifact Registry repository:

gcloud artifacts repositories delete gke-cats-repo --location=$REGION --quiet

Delete the project-specific GCS bucket:

gcloud storage buckets delete gs://$PROJECT_ID-petverse --quiet

10. Congratulations

Congratulations! You have successfully built a distributed knowledge graph pipeline using GKE and Gemini, and queried it using BigQuery Property Graphs.

What you've learned

How to deploy distributed jobs on GKE Autopilot.
How to use Gemini for multimodal data extraction.
How to use BigQuery auto-embeddings.
How to create and query Property Graphs in BigQuery.
