1. Introduction
In this codelab, you will explore the capabilities of the Unified Data Lakehouse on Google Cloud. You will interact with public datasets served via the Apache Iceberg REST Catalog on BigLake, and then apply Google Cloud's AI capabilities to both structured and unstructured data.
You will query the classic NYC Taxi dataset using Apache Iceberg, delve into Time Travel to audit data changes, and then use BigQuery ML and Gemini to run AI models over your data.
What you'll do
- Use Google Cloud Serverless for Apache Spark to query Apache Iceberg public datasets hosted on BigLake.
- Query structured data in Apache Iceberg format.
- Demonstrate Time Travel in Apache Iceberg.
- Use BigQuery ML to train a predictive model on structured data.
- Create a BigLake Object Table (Unstructured data) and use Gemini to analyze images.
What you'll need
- A web browser such as Chrome.
- A Google Cloud project with billing enabled.
Expected Cost and Duration
- Time to complete: ~45 minutes.
- Estimated Cost: < $2.00. We use public datasets and serverless queries to keep costs low.
2. Setup and Requirements
In this step, you will prepare your environment and enable the necessary APIs.
Start Cloud Shell
You will run most commands from Google Cloud Shell.
- Click Activate Cloud Shell at the top of the Google Cloud console.
- Verify authentication:
gcloud auth list
- Confirm your project:
gcloud config get project
- If the project is not set, set it using your project ID:
gcloud config set project <YOUR_PROJECT_ID>
Enable APIs
Run the following command to enable the required APIs for BigQuery, Cloud Resource Manager, and Vertex AI:
gcloud services enable \
bigquery.googleapis.com \
aiplatform.googleapis.com \
cloudresourcemanager.googleapis.com
Configure Environment and Create Dependencies Bucket
- Set your environment variables in your terminal:
export PROJECT_ID=$(gcloud config get project)
export REGION=us-central1
export DEPS_BUCKET=$PROJECT_ID-deps-bucket
- Create the dependencies Cloud Storage bucket. PySpark scripts are uploaded here at job submission time:
gcloud storage buckets create gs://$DEPS_BUCKET --location=$REGION
3. Connect to Apache Iceberg Public Catalog
In this step, you will connect to a live, production-grade Apache Iceberg Catalog hosted on Google Cloud's BigLake.
Run Spark SQL with Serverless for Apache Spark Batch CLI
We will use Google Cloud Serverless for Apache Spark to run PySpark jobs without needing to manage infrastructure. We will configure it to point to the public BigLake REST Catalog.
- Define the BigLake REST Catalog properties to avoid repeating them. This configuration tells Spark:
- To use the iceberg-spark-runtime and iceberg-gcp-bundle libraries.
- To configure a catalog named my_catalog using the BigLake REST Catalog endpoint.
- To use Google Cloud Storage (GCS) for reading data files instead of the default local file system.
- To set this my_catalog catalog as the default for our session.
- To use vended credentials for enhanced security and simplified data access.
export METASTORE_PROPERTIES="^|^spark.jars.packages=org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0,org.apache.iceberg:iceberg-gcp-bundle:1.10.0|\
spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog|\
spark.sql.catalog.my_catalog.type=rest|\
spark.sql.catalog.my_catalog.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog|\
spark.sql.catalog.my_catalog.warehouse=gs://biglake-public-nyc-taxi-iceberg|\
spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO|\
spark.sql.catalog.my_catalog.header.x-goog-user-project=$PROJECT_ID|\
spark.sql.catalog.my_catalog.header.X-Iceberg-Access-Delegation=vended-credentials|\
spark.sql.catalog.my_catalog.rest.auth.type=org.apache.iceberg.gcp.auth.GoogleAuthManager|\
spark.sql.defaultCatalog=my_catalog|\
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions|\
spark.log.level=ERROR"
- Create a simple test query file:
cat <<EOF > test.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.sql("SHOW TABLES IN public_data").show()
EOF
- Submit the batch job:
gcloud dataproc batches submit pyspark \
--project=$PROJECT_ID \
--region=$REGION \
--version=2.3 \
--properties="$METASTORE_PROPERTIES" \
--deps-bucket=gs://$DEPS_BUCKET \
test.py
You should see output similar to the following:
+-----------+----------------+-----------+
|  namespace|       tableName|isTemporary|
+-----------+----------------+-----------+
|public_data|     nyc_taxicab|      false|
|public_data|nyc_taxicab_2021|      false|
+-----------+----------------+-----------+
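If you want to explore the table's schema before writing queries, you can submit a DESCRIBE job the same way. A minimal sketch, assuming the illustrative filename describe.py:

```shell
# Illustrative script name: describe.py. Prints the table's column
# names and types via Spark SQL.
cat <<EOF > describe.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.sql("DESCRIBE TABLE public_data.nyc_taxicab").show(truncate=False)
EOF
```

Submit it with the same gcloud dataproc batches submit pyspark command shown above, passing describe.py as the script.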
4. Query Structured Iceberg Data
Once connected, you have full SQL access to the datasets. We will query the NYC Taxi dataset modeled as an Iceberg table.
Run standard aggregation query
Create a file named query.py:
cat <<EOF > query.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
query = """
SELECT
passenger_count,
COUNT(1) AS num_trips,
ROUND(AVG(total_amount), 2) AS avg_fare,
ROUND(AVG(trip_distance), 2) AS avg_distance
FROM public_data.nyc_taxicab
WHERE data_file_year = 2021 AND passenger_count > 0
GROUP BY passenger_count
ORDER BY num_trips DESC
"""
spark.sql(query).show()
EOF
And submit it using Serverless for Apache Spark:
gcloud dataproc batches submit pyspark \
--project=$PROJECT_ID \
--region=$REGION \
--version=2.3 \
--properties="$METASTORE_PROPERTIES" \
--deps-bucket=gs://$DEPS_BUCKET \
query.py
You should see output similar to this in your console:
+---------------+---------+--------+------------+
|passenger_count|num_trips|avg_fare|avg_distance|
+---------------+---------+--------+------------+
|              1| 21508009|   18.82|        3.03|
|              2|  4424746|   20.22|        3.40|
|              3|  1164846|   19.84|        3.27|
|              5|   718282|   18.88|        3.07|
|              4|   466485|   20.61|        3.44|
|              6|   452467|   18.97|        3.11|
|              7|       78|   65.24|        3.71|
|              8|       49|   57.39|        5.88|
|              9|       35|   73.26|        6.20|
|             96|        1|   17.00|        2.00|
|            112|        1|   15.00|        2.00|
+---------------+---------+--------+------------+
Why use Apache Iceberg here?
- Partition Pruning: The query filters on data_file_year = 2021. Iceberg allows the engine to skip scanning data from other years entirely.
- Engine Agility: You can run this in Spark, Trino, or BigQuery without copying data!
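You can inspect the partition-level statistics Iceberg maintains by querying the table's built-in metadata tables. A minimal sketch using the standard partitions metadata table (partitions.py is an illustrative filename; the exact output columns vary by Iceberg version):

```shell
cat <<EOF > partitions.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Each row describes one partition, including record and file counts
# that the engine consults when pruning scans.
spark.sql("SELECT * FROM public_data.nyc_taxicab.partitions").show()
EOF
```

Submit it with the same gcloud dataproc batches submit pyspark command used for query.py.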
5. Time Travel in Apache Iceberg
One of Iceberg's most powerful features is Time Travel. It allows you to query data as it existed at a past version or snapshot.
View Table History
Create a file named history.py:
cat <<EOF > history.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.sql("SELECT * FROM public_data.nyc_taxicab.history").show()
EOF
And submit it:
gcloud dataproc batches submit pyspark \
--project=$PROJECT_ID \
--region=$REGION \
--version=2.3 \
--properties="$METASTORE_PROPERTIES" \
--deps-bucket=gs://$DEPS_BUCKET \
history.py
You should see output similar to the following in your console:
+--------------------+-------------------+-------------------+-------------------+
|     made_current_at|        snapshot_id|          parent_id|is_current_ancestor|
+--------------------+-------------------+-------------------+-------------------+
|2026-01-07 21:32:...|6333415779680505547|               NULL|               true|
|2026-01-07 21:34:...|1840345522877675925|6333415779680505547|               true|
|2026-01-07 21:36:...|7203554539964460256|1840345522877675925|               true|
|2026-01-07 21:38:...|4573466015237516024|7203554539964460256|               true|
|2026-01-07 21:40:...|3353190952148867790|4573466015237516024|               true|
|2026-01-07 21:42:...|1335547378580631681|3353190952148867790|               true|
|2026-01-07 21:44:...|8203141258229894239|1335547378580631681|               true|
|2026-01-07 21:46:...|1597048231706307813|8203141258229894239|               true|
|2026-01-07 21:48:...|6247811509231462655|1597048231706307813|               true|
|2026-01-07 21:50:...|2527184310045633322|6247811509231462655|               true|
|2026-01-07 21:52:...|2512764101237223642|2527184310045633322|               true|
|2026-01-07 21:52:...|7045957533358062548|2512764101237223642|               true|
|2026-01-07 21:53:...| 531753237516076726|7045957533358062548|               true|
|2026-01-07 21:53:...|4184653573199718274| 531753237516076726|               true|
|2026-01-07 21:54:...|5125223829492177301|4184653573199718274|               true|
|2026-01-07 21:54:...|6844673237417600305|5125223829492177301|               true|
|2026-01-07 21:54:...|6634828203344518093|6844673237417600305|               true|
|2026-01-07 21:55:...|7637728273407236194|6634828203344518093|               true|
|2026-01-07 21:55:...|3424071684958740192|7637728273407236194|               true|
|2026-01-07 21:55:...|1743746294196424254|3424071684958740192|               true|
+--------------------+-------------------+-------------------+-------------------+
You will see rows representing different snapshot IDs and when they were committed.
Compare current vs past row count
Create a file named timetravel.py:
cat <<EOF > timetravel.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
query = """
SELECT 'Current State' AS version, COUNT(*) AS count FROM public_data.nyc_taxicab
UNION ALL
SELECT 'Past State' AS version, COUNT(*) AS count FROM public_data.nyc_taxicab VERSION AS OF 4573466015237516024
"""
spark.sql(query).show()
EOF
And submit it:
gcloud dataproc batches submit pyspark \
--project=$PROJECT_ID \
--region=$REGION \
--version=2.3 \
--properties="$METASTORE_PROPERTIES" \
--deps-bucket=gs://$DEPS_BUCKET \
timetravel.py
You should see output similar to the following in your console:
+-------------+----------+
|      version|     count|
+-------------+----------+
|Current State|1293069366|
|   Past State|  72878594|
+-------------+----------+
This makes it easy to audit data changes over time.
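Iceberg also supports timestamp-based time travel with TIMESTAMP AS OF, which resolves a wall-clock instant to the snapshot that was current at that time. A minimal sketch (timestamp_travel.py is an illustrative filename, and the timestamp is a placeholder; pick one between two commits in your table's history output):

```shell
cat <<EOF > timestamp_travel.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Query the table as of a point in time rather than a snapshot ID.
spark.sql("""
SELECT COUNT(*) AS count
FROM public_data.nyc_taxicab TIMESTAMP AS OF '2026-01-07 21:40:00'
""").show()
EOF
```

Submit it with the same gcloud dataproc batches submit pyspark command used for timetravel.py.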
6. Structured AI with BigQuery ML
Now that you have explored the Iceberg data, let's use BigQuery AI capabilities! Since the public Iceberg catalog is read-only, we can use BigQuery to train a model in our workspace by reading from public tables.
Create a local dataset
First, create a dataset in your project to store the AI model using the bq CLI:
bq mk --location=$REGION --project_id=$PROJECT_ID iceberg_ai
Train a Linear Regression Model
Now you will train a Linear Regression model using the public BigLake Iceberg table.
Create a query file and train the model using bq query:
cat <<'EOF' > train_model.sql
CREATE OR REPLACE MODEL `iceberg_ai.predict_fare`
OPTIONS(model_type='LINEAR_REG', input_label_cols=['fare_amount']) AS
SELECT fare_amount, passenger_count, CAST(trip_distance AS FLOAT64) AS trip_distance
FROM `bigquery-public-data`.`biglake-public-nyc-taxi-iceberg`.public_data.nyc_taxicab
WHERE fare_amount > 0 AND trip_distance > 0 AND RAND() < 0.01; -- Using 1% of data to downsample
EOF
bq query --location=$REGION --use_legacy_sql=false < train_model.sql
Predict using the Model
Now that the model is trained, you can use it to predict fare amounts for new trips using ML.PREDICT.
Create a query file and run the prediction using bq query:
cat <<'EOF' > predict_fare.sql
SELECT
predicted_fare_amount, passenger_count, trip_distance
FROM
ML.PREDICT(MODEL `iceberg_ai.predict_fare`,
(
SELECT 2 AS passenger_count, 5.0 AS trip_distance
)
);
EOF
bq query --location=$REGION --use_legacy_sql=false < predict_fare.sql
You should see output similar to the following:
+-----------------------+-----------------+---------------+
| predicted_fare_amount | passenger_count | trip_distance |
+-----------------------+-----------------+---------------+
|     14.12252095150709 |               2 |           5.0 |
+-----------------------+-----------------+---------------+
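Before relying on the model's predictions, you may want to check its fit. A minimal sketch using ML.EVALUATE (evaluate_model.sql is an illustrative filename; linear regression models report metrics such as mean_absolute_error and r2_score):

```shell
# Illustrative file name: evaluate_model.sql. With no input data argument,
# ML.EVALUATE reports metrics computed during training.
cat <<'EOF' > evaluate_model.sql
SELECT * FROM ML.EVALUATE(MODEL `iceberg_ai.predict_fare`);
EOF
```

Run it with bq query --location=$REGION --use_legacy_sql=false < evaluate_model.sql.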
7. Unstructured AI with BigLake
Data isn't just rows and columns. Unified Data Lakehouses handle unstructured data (images, PDFs) too. Let's use Object Tables and Object References to query unstructured data.
Object tables are read-only external tables in BigQuery that list the objects in a Cloud Storage path. Each row represents a file, with columns for metadata such as uri and size, plus a special ref column containing the ObjectRef.
An object reference (ObjectRef) points to the actual data of a single file. Modern BigQuery ML functions (such as AI.GENERATE or AI.AGG) consume ObjectRefs to read file content (images, audio, or text) for analysis without loading the bytes into a standard table.
Create a Dataset for Unstructured AI
First, create a second dataset in your project to store the Object Tables using the bq CLI in the US multi-region:
bq mk --location=US --project_id=$PROJECT_ID iceberg_object_ai
Create an External Connection
To query data stored in Cloud Storage (both Object Tables and unstructured data) from BigQuery, you need to create an external connection.
Run the following in Cloud Shell to create a Cloud Resource connection:
bq mk --connection --project_id=$PROJECT_ID --location=US --connection_type=CLOUD_RESOURCE my-conn
Find the service account ID created for your connection:
CONNECTION_SA=$(bq show --format=json --project_id=$PROJECT_ID --connection $PROJECT_ID.us.my-conn | jq -r '.serviceAccountId // .cloudResource.serviceAccountId')
Grant the Service Account the Vertex AI User and Storage Object Viewer roles so it can call Gemini models and read GCS data:
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:$CONNECTION_SA" \
--role="roles/aiplatform.user"
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:$CONNECTION_SA" \
--role="roles/storage.objectViewer"
Create an Object Table
We will use the external connection my-conn created in the previous section to access unstructured data. Create a query file and create the Object Table using bq query:
cat <<'EOF' > create_object_table.sql
CREATE EXTERNAL TABLE `iceberg_object_ai.sample_images`
WITH CONNECTION `us.my-conn`
OPTIONS (
object_metadata = 'SIMPLE',
uris = ['gs://cloud-samples-data/vision/landmark/*']
);
EOF
bq query --use_legacy_sql=false < create_object_table.sql
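To verify the Object Table sees the sample files, you can select its metadata columns. A minimal sketch (list_objects.sql is an illustrative filename; uri, content_type, and size are standard object table metadata columns):

```shell
# Illustrative file name: list_objects.sql. Lists a few of the files
# the Object Table indexes, without reading their contents.
cat <<'EOF' > list_objects.sql
SELECT uri, content_type, size
FROM `iceberg_object_ai.sample_images`
LIMIT 5;
EOF
```

Run it with bq query --use_legacy_sql=false < list_objects.sql.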
Use Gemini on Object Data
Now run a query using Gemini to evaluate the images without downloading them!
Query the images using standard SQL via bq query:
cat <<EOF > query_images.sql
SELECT
uri,
image_analysis.description
FROM (
SELECT
uri,
AI.GENERATE(
(
'Identify what is happening in the image.',
ref
),
connection_id => 'us.my-conn',
endpoint => 'gemini-2.5-flash-lite',
output_schema => 'event STRING, severity STRING, description STRING'
) AS image_analysis
FROM
iceberg_object_ai.sample_images
);
EOF
bq query --use_legacy_sql=false < query_images.sql
Sample output:
+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| uri | description |
+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| gs://cloud-samples-data/vision/landmark/eiffel_tower.jpg | The Eiffel Tower stands tall against a cloudy sky, overlooking the Seine River in Paris. Boats are docked along the riverbank, and trees line the opposite shore, with bridges and buildings visible in the distance. |
| gs://cloud-samples-data/vision/landmark/pofa.jpg | A wide shot shows the Palace of Fine Arts, a monumental structure in San Francisco, California. The building features a large rotunda with a dome, surrounded by colonnades. In front of the rotunda is a lagoon. Several people are walking around the grounds. The sky is blue with a few scattered clouds. |
| gs://cloud-samples-data/vision/landmark/st_basils.jpeg | A monument stands in front of Saint Basil's Cathedral in Moscow under a bright blue sky with scattered white clouds. The cathedral features distinctive onion domes in various colors and patterns, including red, blue and white stripes, green and beige stripes, and red and blue diamonds. A large green tree partially obscures the left side of the cathedral. People are visible in the foreground near the base of the monument and the cathedral entrance. |
+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Exploring ObjectRefs Directly: Sentiment Analysis
While Object Tables manage file references automatically, you can interact with these objects directly using BigQuery Object References to run on-the-fly analysis on single files.
For example, you can use a small text file stored in your own GCS bucket (using the $DEPS_BUCKET variable created earlier) and analyze it using OBJ.MAKE_REF with bq query.
First, create a small text file and upload it to your bucket:
cat <<'EOF' > review.txt
This product is fantastic! It exceeded my expectations. The quality is top-notch. I highly recommend it to everyone!
EOF
gcloud storage cp review.txt gs://${DEPS_BUCKET}/review.txt
Now query the file using OBJ.MAKE_REF inside standard SQL:
cat <<EOF > sentiment_analysis.sql
SELECT
AI.GENERATE(
(
'Analyze the sentiment of this text file. Is it positive, negative, or neutral? Explain why.',
OBJ.MAKE_REF('gs://${DEPS_BUCKET}/review.txt', 'us.my-conn')
),
connection_id => 'us.my-conn',
endpoint => 'gemini-2.5-flash-lite'
).result AS ml_generate_text_result;
EOF
bq query --use_legacy_sql=false < sentiment_analysis.sql
Sample output:
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ml_generate_text_result |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| This text file has a **strongly positive** sentiment. |
| |
| Here's why: |
| |
| * **Positive Keywords:** The text is filled with unequivocally positive words and phrases: |
| * "fantastic" |
| * "exceeded my expectations" |
| * "top-notch" |
| * "highly recommend" |
| |
| * **Enthusiastic Language:** The use of exclamation marks ("!") further amplifies the positive tone, indicating excitement and strong approval. |
| |
| * **Lack of Negative or Neutral Elements:** There are no words, phrases, or implications that suggest any dissatisfaction, criticism, or even indifference. |
| |
| In summary, the author's language is enthusiastic and uses multiple strong positive descriptors, leaving no room for doubt that their opinion of the product is overwhelmingly positive. |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8. Clean up
To avoid ongoing charges to your Google Cloud account, delete the resources created during this codelab.
Delete Dataset and Connection
Run the following in Cloud Shell to delete your datasets and connection:
bq rm -r -f --location=$REGION iceberg_ai
bq rm -r -f --location=US iceberg_object_ai
bq rm --connection $PROJECT_ID.us.my-conn
Delete GCS buckets and local files
Clean up the GCS buckets and local files:
# Delete GCS buckets
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
gcloud storage rm -r gs://dataproc-temp-${REGION}-${PROJECT_NUMBER}-*
gcloud storage rm -r gs://dataproc-staging-${REGION}-${PROJECT_NUMBER}-*
gcloud storage rm -r gs://${DEPS_BUCKET}
# Delete local files
rm -f train_model.sql predict_fare.sql create_object_table.sql query_images.sql sentiment_analysis.sql test.py query.py history.py timetravel.py review.txt
You can also delete the entire project if you created it just for this lab.
9. Congratulations
Congratulations! You've successfully built a Unified Data Lakehouse using Apache Iceberg, BigLake, and BigQuery AI!
What you've learned
- How to connect and query Public Apache Iceberg REST Catalogs.
- Using Time Travel in Iceberg to audit dataset versions.
- Training BigQuery ML models on structured data.
- Connecting unstructured data (images) using Object Tables & ObjectRef.
- Using Gemini directly in BigQuery SQL to analyze images.