Lab 1: Ingest and Govern Logistics Data

1. Introduction

In this lab, you will step into the role of a lead data investigator for a global logistics firm. A high-value cargo container carrying precious Android figurine collectibles has gone missing! To find its last known position and trace its route, you must aggregate fragmented shipping manifests from regional logistics partners and unstructured transponder log files. To do so, you will configure a modern Google Cloud Open Data Lakehouse.

Android container

What you'll do

What you'll need

  • A web browser such as Chrome.
  • A Google Cloud project with billing enabled.
  • Familiarity with basic SQL queries and terminal commands.

Expected Cost and Duration

  • Time to complete: ~45 minutes.
  • Estimated Cost: Less than $5.00 USD.

2. Before you begin

Create or Select a Google Cloud Project

  1. In the Google Cloud Console, select or create a Google Cloud project.
  2. Make sure that billing is enabled for your Cloud project. Learn how to confirm billing is enabled on a project.

Configure the Environment

You will run most of your commands from the integrated terminal in the Cloud Shell Editor, a cloud-based development environment that comes pre-loaded with developer tools and the standard Google Cloud SDK.

  1. Open Cloud Shell Editor in a new tab.
  2. Run the following command in the terminal to clone the repository:
    cd ~/
    git clone --filter=blob:none --no-checkout https://github.com/GoogleCloudPlatform/devrel-demos.git
    cd ~/devrel-demos
    git sparse-checkout init --cone
    git sparse-checkout set data-analytics/data-cloud-roadshow/lab1
    git checkout main
    cd data-analytics/data-cloud-roadshow/lab1
    
  3. Set your project ID. You can also Ctrl+Shift+V on Windows/Linux, or Cmd+V on macOS to paste this into the terminal:
    export PROJECT_ID="<YOUR_PROJECT_ID>"
    
  4. Now configure it in your environment.
    gcloud config set project $PROJECT_ID
    
  5. Select a region.
    export REGION="<YOUR_REGION>" # Replace with the region you chose, such as "us-central1"  
    
  6. Enable the required APIs.
    gcloud services enable \
      bigquery.googleapis.com \
      biglake.googleapis.com \
      storage.googleapis.com \
      dataplex.googleapis.com \
      datacatalog.googleapis.com \
      dataproc.googleapis.com \
      aiplatform.googleapis.com \
      cloudresourcemanager.googleapis.com \
      serviceusage.googleapis.com \
      cloudaicompanion.googleapis.com \
      geminidataanalytics.googleapis.com
    

Install Extension

You will now configure the Google Data Agent Kit extension, a tool for interacting with Google Cloud's data tools directly in your IDE.

  1. In the editor's left activity bar, click the Extensions icon (or press Ctrl+Shift+X on Windows/Linux, or Cmd+X on macOS).
  2. In the extensions search box, type: Google Cloud Data Agent Kit
  3. Select the official extension from the results and click Install. If prompted, select "Yes, I trust the authors".

Install Data Agent Kit

  1. Once successfully installed, you should see the Google Cloud Data Agent Kit icon displayed in the activity bar! Click it.

Extension in side bar

  1. Click Sign into cloud.

Sign into cloud

  1. Click Configure MCP Servers.

Configure MCP servers

  1. Select BigQuery, Knowledge Catalog, Apache Spark, and AlloyDB. You will use AlloyDB in Lab 2. Then click Get Started.

Select MCP servers

  1. Click the Project ID selector in the bottom status bar and choose your active Google Cloud project.
  2. In the Data Agent Kit, click SETTINGS, then Settings, and in the Common tab select your Project ID and Region to run your lab, such as us-central1.

Extension settings

  1. Click BigQuery Settings and replace the Region with the region you previous selected. Click Save.

BigQuery settings

You are now ready to use the Data Agent Kit!

Execute Environment Setup Script

In the terminal, run the setup script to create the necessary background resources for this lab and configure IAM permissions:

chmod +x setup_lab1.sh
./setup_lab1.sh
source .env

You should see a serious of output steps showing what resources are being provisioned. We'll cover these throughout the lab.

Once you see a completion message, you're ready to go:

====================================================
Environment Setup Complete!
====================================================

Now, let's begin our search!

3. Ingest Partner Shipping Manifests

Shipping manifest data from partner vessels is stored in standard JSON Lines (JSONL) format in your bucket: gs://${PROJECT_ID}-lost-cargo-lake/shipping_manifests/manifests.jsonl.

Before carrying out deep analytics, you will create a governed BigLake table for this unstructured data. This allows you to explore partner logistics data immediately using standard SQL without duplicate import costs.

Open the Workspace in the Editor and run the query

Create external table

  1. In your Cloud Shell Editor, click on the Google Cloud Data Agent Kit extension icon on the side panel.
  2. Go to BigQuery and select + New Query.
  3. Copy the following query into the query window.
SET @@location = "<YOUR_REGION>"; --Update to your resource region, such as "us-central1". Make sure it is in quotes.

EXECUTE IMMEDIATE ("""
CREATE OR REPLACE EXTERNAL TABLE `lost_cargo_dataset.shipping_manifests`
(
  shipment_id STRING,
  timestamp TIMESTAMP,
  last_ping_lat FLOAT64,
  last_ping_long FLOAT64,
  seal_integrity_status INT64,
  custodian_id STRING OPTIONS(description='Masked identification token for official cargo custodian personnel.')
)
WITH CONNECTION `""" || @@location || """.conn`
OPTIONS (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['gs://""" || @@project_id || """-lost-cargo-lake/shipping_manifests/*.jsonl']
)
""");
  1. Click Run.
  2. To verify that the table was created, you will see a success message in the Query Results panel that automatically slides open at the bottom.

Query the external table to isolate compromised transponders

Let's identify the compromised transponders by locating failures when the seal_integrity_status was set to 0. Copy and run the following query in the query window you opened before:

SELECT shipment_id, last_ping_lat, last_ping_long, custodian_id
FROM `lost_cargo_dataset.shipping_manifests`
WHERE seal_integrity_status = 0;

In the Query Results panel, you should see output similar to this:

shipment_id

last_ping_lat

last_ping_long

custodian_id

MV-CAT-001

51.5074

-0.1278

usr_999_shadow

4. Process Unstructured Logs with Managed Service for Apache Spark

You've found the starting location from the structured manifests, but the lost transponder has gone completely dark The last transponder ping left a cryptic, unstructured message inside a raw text log file in the GCS path gs://${BUCKET_NAME}/raw_logs/maritime_logs.txt.

To process and map this text log, extract timestamps, camouflage identities, and locate the cargo's downstream route, you will submit a serverless Apache Spark (PySpark) job to Managed Service for Apache Spark.

Managed Service for Apache Spark allows you to run Spark workloads without provisioning or managing a cluster. The service handles the underlying compute resources, autoscaling them dynamically, and you only pay for the execution duration.

The script will:

  1. Ingest the raw, bracketed, unstructured transponder text.
  2. Apply PySpark SQL regex extraction filters to separate timestamps, custodian metadata, and raw content.
  3. Split the messy logs into clean, sentence-level records.
  4. Extract the dynamic destination coordinate target where the lost payload departures ended.
  5. Connect and write the processed log dataframe back into your Lakehouse Apache Iceberg REST Catalog as a new analytics table visible directly within BigQuery.

Fix the PySpark Analysis Script

There have been reports of Python Pirates at sea causing all sorts of problems.

  1. Run the following to open the file process_maritime_logs in your Cloud Shell Editor.
    cd ~/devrel-demos/data-analytics/data-cloud-roadshow/lab1
    cloudshell open ./process_maritime_logs.py
    
  2. Spend some time to read the code and understand what it's doing.
  3. Make sure nothing in the code looks suspicious! If you need to delete anything, make sure you save the file using Ctrl + S (Windows/Linux) or Cmd + S (Mac).

Submit the Serverless Spark Job

Submit the job using the gcloud SDK. The config automatically configures the PySpark job to access the Lakehouse catalog.

Run the following command in your integrated editor terminal.

cd ~/devrel-demos/data-analytics/data-cloud-roadshow/lab1
source .env

gcloud dataproc batches submit pyspark process_maritime_logs.py \
    --project=${PROJECT_ID} \
    --region=${REGION} \
    --version=2.3 \
    --deps-bucket=${BUCKET_NAME} \
    --properties="\
spark.sql.defaultCatalog=cargo_catalog,\
spark.sql.catalog.${CATALOG_NAME}=org.apache.iceberg.spark.SparkCatalog,\
spark.sql.catalog.${CATALOG_NAME}.type=rest,\
spark.sql.catalog.${CATALOG_NAME}.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog,\
spark.sql.catalog.${CATALOG_NAME}.warehouse=gs://${BUCKET_NAME},\
spark.sql.catalog.${CATALOG_NAME}.header.x-goog-user-project=${PROJECT_ID},\
spark.sql.catalog.${CATALOG_NAME}.rest.auth.type=org.apache.iceberg.gcp.auth.GoogleAuthManager,\
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,\
spark.sql.catalog.${CATALOG_NAME}.gcs.oauth2.refresh-credentials-endpoint=https://oauth2.googleapis.com/token, \
spark.sql.catalog.${CATALOG_NAME}.header.X-Iceberg-Access-Delegation=vended-credentials" \
-- ${BUCKET_NAME}

Wait a few minutes for the serverless environment to spin up, upload your script, and execute the processing logic.

Once you see output similar to the following, your processed table is saved to the Lakehouse catalog as an Apache Iceberg managed table!

Batch [1fae0b4b42c045a084dd67917bd0c332] finished.
metadata:
  '@type': type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata
  batch: projects/<YOUR_PROJECT>/locations/us-central1/batches/1fae0b4b42c045a084dd67917bd0c332
  batchUuid: 611b4465-fdac-4974-bc47-5f10adaf2a08
  createTime: '2026-06-11T21:35:01.868157Z'
  description: Batch
  labels:
    goog-dataproc-batch-id: 1fae0b4b42c045a084dd67917bd0c332
    goog-dataproc-batch-uuid: 611b4465-fdac-4974-bc47-5f10adaf2a08
    goog-dataproc-drz-resource-uuid: batch-611b4465-fdac-4974-bc47-5f10adaf2a08
    goog-dataproc-location: us-central1
  operationType: BATCH
name: projects/<YOUR_PROJECT/regions/us-central1/operations/7e2044db-dd97-3270-b368-4dc4ba417574

Preview the Processed Logs

In Data Agent Kit extension Query editor, copy the following query to preview the data:

SET @@location = "<YOUR_REGION>"; -- e.g. "us-central1"

EXECUTE IMMEDIATE FORMAT("""
select log_timestamp, custodian_id, sentence from 
`%s.%s-lost-cargo-lake.lost_cargo_namespace.processed_maritime_logs`
limit 5
""", @@project_id, @@project_id);

Query lakehouse table

This shows the Iceberg table registered in the catalog can successfully be accessed from BigQuery!

Extract the Destination Clue

Now that we have the processed logs, let's search for the logs that include a destination target. From there, we can search the logs to which include a mention of our origin city.

In your query editor, run the following query, replacing <YOUR_REGION> with your region and replacing <ORIGIN_CITY> with the origin city you discovered earlier.

SET @@location = "<YOUR_REGION>"; --e.g. "us-central1"

DECLARE origin_city STRING;

SET origin_city = "<ORIGIN_CITY>"; --e.g. "Boston". (It's not Boston)

EXECUTE IMMEDIATE FORMAT(r"""
SELECT
  log_timestamp AS event_time,
  custodian_id,
  REGEXP_EXTRACT(raw_message, r'Destination:\s*([^.]+)') AS destination_target
FROM
  `%s.%s-lost-cargo-lake.lost_cargo_namespace.processed_maritime_logs`
WHERE
  -- STRPOS is highly optimized in BigQuery for a simple .contains() substring check
  STRPOS(sentence, '%s') > 0;
""", @@project_id, @@project_id, origin_city);

Chat with your Data in the BigQuery Console using Conversational Analytics

Instead of writing complex SQL queries to explore your data, you can use Conversational Analytics to chat with your tables using natural language!

  1. Go to the BigQuery Console.
  2. In the Explorer panel on the left, expand your project and the dataset -lost-cargo-lake.lost_cargo_namespace, then click on the processed_maritime_logs table to open its details tab.
  3. Next to Query, click Chat.

Go to Chat

  1. In the chat panel, type the following question and press Enter on your keyboard to send: Based on this table, what color is the shipping container MV-CAT-001?

Chat window

  1. Conversational Analytics (powered by Gemini) will analyze the active table's data and respond with the color.

5. View the Centralized Lakehouse Catalog

To integrate open-source processing engines (like Apache Spark) securely and seamlessly with enterprise data engines (like BigQuery), your setup script configured a Lakehouse Iceberg REST Catalog.

The Apache Iceberg REST Catalog serves as the serverless "single source of truth" for table metadata, managing schemas and partitioning tables dynamically while storing physical Parquet data files in Cloud Storage.

Let's examine this catalog directly in the Google Cloud Console:

  1. Open the Lakehouse Console.
  2. In the Catalogs tab, locate and click on your active Iceberg REST Catalog: -lost-cargo-lake.

Lakehouse catalog

  1. In the catalog details view, under Namespaces you should see lost_cargo_namespace. Click on it.

Lakehouse namespace

  1. Your new Apache Iceberg table, generated by PySpark, automatically registered under this metastore namespace and became instantly queryable across BigQuery!

Lakehouse table

6. Generate Insights on the Shipping Manifests Table

Let's go back and analyze the shipping_manifests table to understand its structure and content using Knowledge Catalog Data Insights. By enriching the metadata, other explorers can better understand the table for future analysis.

Generate Table Insights in BigQuery Studio

  1. In the Google Cloud Console, navigate to BigQuery Studio.
  2. In the Explorer panel, expand your project, expand the lost_cargo_dataset dataset, and click the shipping_manifests table.
  3. In the details panel on the right, click the Insights tab.
  4. Use the dropdown to select Generate and publish.

Generate insights

  1. Wait about 3 minutes for the insights generation to complete. Gemini will analyze the table metadata and generate natural language questions and corresponding SQL queries.
  2. Once complete, you'll see a Table description with a natural language explanation of the table.

View table description

  1. Click View column descriptions to see information about the individual columns.

Full table description

  1. Click Save to details under Gemini generated description and click Save to details in the window that pops up.

Save table description

  1. Similarly, click Save to schema to add the column descriptions to the table metadata.

Save column description

Review Generated Insights

You will also see a list of suggested questions. You can click on any question to see the generated SQL query and run it to explore the data. For example, you might see questions like:

  • "What is the total number of shipments?"
  • "List the unique custodian IDs."

Running these queries helps you understand the data.

7. Implement Data Masking and Governance

To guarantee that active research accounts and usernames cannot be leaked during this ongoing cargo investigation, you must enforce standard security protocols. You will create a Security Policy Tag Taxonomy and configure Knowledge Catalog Data Masking over the sensitive custodian_id column to verify data privacy.

By default, BigQuery denies access to columns protected by policy tags. To query the table and verify active data masks, your user account must hold the BigQuery Data Policy Masked Reader role.

This role was automatically bound to your active user account during your initial execution of setup_lab1.sh!

Create the Taxonomy and Policy Tag

Create a data taxonomy and associated policy tag for managing access to your data.

  1. Go to the Policy tag taxonomies page.
  2. Click + Create Taxonomy.
  3. Configure the parameters:
    • Taxonomy name: Enter lost-cargo-, replacing with your project ID.
    • Region: Select your region.
    • For Policy tag Name: Enter MaskCustodianID
    • For Policy tag Description: Restricts visibility into cargo custodian usernames
  4. Click Create to register your new taxonomy and policy tag.

Create taxonomy and policy tag

Create the Data Masking Policy

Next, configure a data policy to define how the data gets masked under the MaskCustodianID classification tag. You will use the Always Null masking rule (replacing matching values with empty/Null returns for all non-privileged actors).

  1. On the Policy tag taxonomies page, click the newly created taxonomy from your list of taxonomies.
  2. In the hierarchy list, click on the MaskCustodianID tag to select it and then select Manage data policies.

Manage data policies

  1. On the right-hand panel, click the + Add Rule button.
  2. Configure the policy details in the panel that appears:
    • Data policy name: Enter null_masking_policy (do not leave it auto-generated, as we will reference it by name in the next steps).
    • Masking rule: Select Nullify from the dropdown menu.
  3. Click Submit.

Create masking rule

Assign the Policy Tag to your BigQuery Column

With the policy tag and its data masking rule active, map the classification tag directly to the custodian_id column in your BigQuery partner shipping manifest table.

  1. Navigate to the BigQuery console.
  2. In the Explorer panel on the left, expand your active project, expand the lost_cargo_dataset dataset, and click on the shipping_manifests table to open its detailed view.
  3. Click Edit Schema.

Go to edit schema

  1. In the list of columns, check the box next to custodian_id.
  2. Click the Add policy tag button on the top toolbar of the schema editor.
  3. In the Add a policy tag panel:
    • Locate and expand your LostCargo taxonomy.
    • Select the bubble next to MaskCustodianID.
    • Click Select.
  4. Verify that the MaskCustodianID tag is now visible under the Policy tag column in the row representing custodian_id.
  5. Click Save.

Edit schema

Verify the Policy Restrictions

Now that you have the Masked Reader role at the project level, you can query the table to verify that the masking policy is active.

Go back to the Data Agent Kit and run the following query:

SELECT shipment_id, custodian_id 
FROM lost_cargo_dataset.shipping_manifests
LIMIT 5;

You should see output similar to the following:

shipment_id

custodian_id

NORMAL-001

null

NORMAL-002

null

MV-CAT-001

null

Success! Even though you can view the shipment_id records, the sensitive custodian_id field returns secure null masks to prevent leaks!

8. Clean up

To avoid ongoing charges to your Google Cloud account for the resources created during this codelab, run these commands in your Cloud Shell terminal to drop your datasets and buckets:

# Verify active variables
export PROJECT_ID=$(gcloud config get-value project)
export BUCKET_NAME="${PROJECT_ID}-lost-cargo-lake"

# Delete Knowledge Catalog security taxonomies
export TAXONOMY_NAME=$(gcloud data-catalog taxonomies list --location=us-central1 --filter="displayName:LostCargo" --format="value(name)")
if [[ -n "$TAXONOMY_NAME" ]]; then
  curl -X DELETE \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://datacatalog.googleapis.com/v1/${TAXONOMY_NAME}" || true
fi

# Delete Lakehouse Iceberg REST Catalog tables and namespaces
gcloud biglake iceberg tables delete processed_maritime_logs \
    --catalog=${BUCKET_NAME} \
    --namespace=lost_cargo_namespace \
    --quiet || true

gcloud biglake iceberg namespaces delete lost_cargo_namespace \
    --catalog=${BUCKET_NAME} \
    --quiet || true

# Delete Lakehouse Iceberg REST Catalog
gcloud biglake iceberg catalogs delete ${BUCKET_NAME} --quiet || true

# Delete dependencies bucket
gcloud storage rm -r gs://${BUCKET_NAME} || true

# Delete BigQuery dataset and tables
bq rm -r -f -d ${PROJECT_ID}:lost_cargo_dataset || true

9. Congratulations

Congratulations! You have successfully completed the first crucial module of the Lost Cargo investigation. You have established a governed search zone using Lakehouse Iceberg REST Catalogs, PySpark log normalization, and fine-grained data masking.

What you've learned

  • Installing, setting up, and configuring the Data Agent Kit extension inside your IDE workspace.
  • Establishing a serverless Lakehouse Iceberg REST catalog utilizing vended credentials and hierarchical namespaces.
  • Ingesting multi-format regional feeds and building BigQuery external tables over Cloud Storage buckets.
  • Launching serverless Apache Spark jobs to parse, normalise, segment, and write unstructured transponder logs back into BigQuery as registered Iceberg catalog tables.
  • Building security taxnomies and mapping Knowledge Catalog data masking policies to prevent identity leaks on sensitive log indices.
  • Generating and analyzing table metadata insights using BigQuery data insights to accelerate data exploration.

Collected Clues Verification

Verify you have recorded the following definitive clues required to launch into the next lab phase:

  1. Lost Shipment ID: MV-CAT-001 (last ping location: London)
  2. Planned Target Destination: New York (and transponder true alias: MV-DOG-002)
  3. Container Color: Crimson RED
  4. Governance Access Tag: MaskCustodianID

Ready for the next phase?

Now that the transponder Depart / Destination routes are secure, the investigation moves forward! Launch straight into Lab 2 to examine security cameras using multimodal Gemini models, identify the vessel visually, and perform vector searches in AlloyDB to verify tampering anomalies!

➡️ Continue with Step Two: Data Analysis & Multimodal Insights

Reference documentation