1. Introduction
In this codelab, you will explore the capabilities of the Unified Data Lakehouse on Google Cloud. You will interact with public datasets served via the Apache Iceberg REST Catalog on BigLake, and then apply Google Cloud's AI capabilities to both structured and unstructured data.
You will query the classic NYC Taxi dataset using Apache Iceberg, delve into Time Travel to audit data changes, and then use BigQuery ML and Gemini to run AI models over your data.
What you'll do
- Use Google Cloud Serverless for Apache Spark to query Apache Iceberg public datasets hosted on BigLake.
- Query structured data in Apache Iceberg format.
- Demonstrate Time Travel in Apache Iceberg.
- Use BigQuery ML to train a predictive model on structured data.
- Create a BigLake Object Table (Unstructured data) and use Gemini to analyze images.
What you'll need
- A web browser such as Chrome.
- A Google Cloud project with billing enabled.
Expected Cost and Duration
- Time to complete: ~45 minutes.
- Estimated Cost: < $2.00. We use public datasets and serverless queries to keep costs low.
2. Setup and Requirements
In this step, you will prepare your environment and enable the necessary APIs.
Start Cloud Shell
You will run most commands from Google Cloud Shell.
- Click Activate Cloud Shell at the top of the Google Cloud console.
- Verify authentication:
gcloud auth list
- Confirm your project:
gcloud config get project
- If the project is not set, set it using your project ID:
gcloud config set project <YOUR_PROJECT_ID>
Enable APIs
Run the following command to enable the required APIs for BigQuery, Cloud Resource Manager, and Vertex AI:
gcloud services enable \
bigquery.googleapis.com \
aiplatform.googleapis.com \
cloudresourcemanager.googleapis.com
Configure Environment and Create Dependencies Bucket
- Set your environment variables in your terminal:
export PROJECT_ID=$(gcloud config get project)
export REGION=us-central1
export DEPS_BUCKET=$PROJECT_ID-deps-bucket
- Create the dependencies Cloud Storage bucket. PySpark scripts are uploaded here at job submission time:
gcloud storage buckets create gs://$DEPS_BUCKET --location=$REGION
3. Connect to Apache Iceberg Public Catalog
In this step, you will connect to a live, production-grade Apache Iceberg Catalog hosted on Google Cloud's BigLake.
Run Spark SQL with Serverless for Apache Spark Batch CLI
We will use Google Cloud Serverless for Apache Spark to run PySpark jobs without needing to manage infrastructure. We will configure it to point to the public BigLake REST Catalog.
- Define the BigLake REST Catalog properties to avoid repeating them. This configuration tells Spark:
- To use the iceberg-spark-runtime and iceberg-gcp-bundle libraries.
- To configure a catalog named my_catalog using the BigLake REST Catalog endpoint.
- To use Google Cloud Storage (GCS) for reading data files instead of the default local file system.
- To set this my_catalog catalog as the default for our session.
- To use vended credentials for enhanced security and simplified data access.
export METASTORE_PROPERTIES="^|^spark.jars.packages=org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0,org.apache.iceberg:iceberg-gcp-bundle:1.10.0|\
spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog|\
spark.sql.catalog.my_catalog.type=rest|\
spark.sql.catalog.my_catalog.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog|\
spark.sql.catalog.my_catalog.warehouse=gs://biglake-public-nyc-taxi-iceberg|\
spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO|\
spark.sql.catalog.my_catalog.header.x-goog-user-project=$PROJECT_ID|\
spark.sql.catalog.my_catalog.header.X-Iceberg-Access-Delegation=vended-credentials|\
spark.sql.catalog.my_catalog.rest.auth.type=org.apache.iceberg.gcp.auth.GoogleAuthManager|\
spark.sql.defaultCatalog=my_catalog|\
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions|\
spark.log.level=ERROR"
- Create a simple test query file:
cat <<EOF > test.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.sql("SHOW TABLES IN public_data").show()
EOF
- Submit the batch job:
gcloud dataproc batches submit pyspark \
--project=$PROJECT_ID \
--region=$REGION \
--version=2.3 \
--properties="$METASTORE_PROPERTIES" \
--deps-bucket=gs://$DEPS_BUCKET \
test.py
You should see output similar to the following:
+-----------+----------------+-----------+
|  namespace|       tableName|isTemporary|
+-----------+----------------+-----------+
|public_data|     nyc_taxicab|      false|
|public_data|nyc_taxicab_2021|      false|
+-----------+----------------+-----------+
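If you want to explore the table's schema before writing queries, you can submit a DESCRIBE job the same way. A minimal sketch, assuming the illustrative filename describe.py:

```shell
# Illustrative script name: describe.py. Prints the table's column
# names and types via Spark SQL.
cat <<EOF > describe.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.sql("DESCRIBE TABLE public_data.nyc_taxicab").show(truncate=False)
EOF
```

Submit it with the same gcloud dataproc batches submit pyspark command shown above, passing describe.py as the script.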
4. Query Structured Iceberg Data
Once connected, you have full SQL access to the datasets. We will query the NYC Taxi dataset modeled as an Iceberg table.
Run standard aggregation query
Create a file named query.py:
cat <<EOF > query.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
query = """
SELECT
passenger_count,
COUNT(1) AS num_trips,
ROUND(AVG(total_amount), 2) AS avg_fare,
ROUND(AVG(trip_distance), 2) AS avg_distance
FROM public_data.nyc_taxicab
WHERE data_file_year = 2021 AND passenger_count > 0
GROUP BY passenger_count
ORDER BY num_trips DESC
"""
spark.sql(query).show()
EOF
And submit it using Serverless for Apache Spark:
gcloud dataproc batches submit pyspark \
--project=$PROJECT_ID \
--region=$REGION \
--version=2.3 \
--properties="$METASTORE_PROPERTIES" \
--deps-bucket=gs://$DEPS_BUCKET \
query.py
You should see output similar to this in your console:
+---------------+---------+--------+------------+
|passenger_count|num_trips|avg_fare|avg_distance|
+---------------+---------+--------+------------+
|              1| 21508009|   18.82|        3.03|
|              2|  4424746|   20.22|        3.40|
|              3|  1164846|   19.84|        3.27|
|              5|   718282|   18.88|        3.07|
|              4|   466485|   20.61|        3.44|
|              6|   452467|   18.97|        3.11|
|              7|       78|   65.24|        3.71|
|              8|       49|   57.39|        5.88|
|              9|       35|   73.26|        6.20|
|             96|        1|   17.00|        2.00|
|            112|        1|   15.00|        2.00|
+---------------+---------+--------+------------+
Why use Apache Iceberg here?
- Partition Pruning: The query filters on data_file_year = 2021. Iceberg allows the engine to skip scanning data from other years entirely.
- Engine Agility: You can run this in Spark, Trino, or BigQuery without copying data!
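You can inspect the partition-level statistics Iceberg maintains by querying the table's built-in metadata tables. A minimal sketch using the standard partitions metadata table (partitions.py is an illustrative filename; the exact output columns vary by Iceberg version):

```shell
cat <<EOF > partitions.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Each row describes one partition, including record and file counts
# that the engine consults when pruning scans.
spark.sql("SELECT * FROM public_data.nyc_taxicab.partitions").show()
EOF
```

Submit it with the same gcloud dataproc batches submit pyspark command used for query.py.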
5. Time Travel in Apache Iceberg
One of Iceberg's most powerful features is Time Travel. It allows you to query data as it existed at a past version or snapshot.
View Table History
Create a file named history.py:
cat <<EOF > history.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.sql("SELECT * FROM public_data.nyc_taxicab.history").show()
EOF
And submit it:
gcloud dataproc batches submit pyspark \
--project=$PROJECT_ID \
--region=$REGION \
--version=2.3 \
--properties="$METASTORE_PROPERTIES" \
--deps-bucket=gs://$DEPS_BUCKET \
history.py
You should see output similar to the following in your console:
+--------------------+-------------------+-------------------+-------------------+
|     made_current_at|        snapshot_id|          parent_id|is_current_ancestor|
+--------------------+-------------------+-------------------+-------------------+
|2026-01-07 21:32:...|6333415779680505547|               NULL|               true|
|2026-01-07 21:34:...|1840345522877675925|6333415779680505547|               true|
|2026-01-07 21:36:...|7203554539964460256|1840345522877675925|               true|
|2026-01-07 21:38:...|4573466015237516024|7203554539964460256|               true|
|2026-01-07 21:40:...|3353190952148867790|4573466015237516024|               true|
|2026-01-07 21:42:...|1335547378580631681|3353190952148867790|               true|
|2026-01-07 21:44:...|8203141258229894239|1335547378580631681|               true|
|2026-01-07 21:46:...|1597048231706307813|8203141258229894239|               true|
|2026-01-07 21:48:...|6247811509231462655|1597048231706307813|               true|
|2026-01-07 21:50:...|2527184310045633322|6247811509231462655|               true|
|2026-01-07 21:52:...|2512764101237223642|2527184310045633322|               true|
|2026-01-07 21:52:...|7045957533358062548|2512764101237223642|               true|
|2026-01-07 21:53:...| 531753237516076726|7045957533358062548|               true|
|2026-01-07 21:53:...|4184653573199718274| 531753237516076726|               true|
|2026-01-07 21:54:...|5125223829492177301|4184653573199718274|               true|
|2026-01-07 21:54:...|6844673237417600305|5125223829492177301|               true|
|2026-01-07 21:54:...|6634828203344518093|6844673237417600305|               true|
|2026-01-07 21:55:...|7637728273407236194|6634828203344518093|               true|
|2026-01-07 21:55:...|3424071684958740192|7637728273407236194|               true|
|2026-01-07 21:55:...|1743746294196424254|3424071684958740192|               true|
+--------------------+-------------------+-------------------+-------------------+
You will see rows representing different snapshot IDs and when they were committed.
Compare current vs past row count
Create a file named timetravel.py:
cat <<EOF > timetravel.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
query = """
SELECT 'Current State' AS version, COUNT(*) AS count FROM public_data.nyc_taxicab
UNION ALL
SELECT 'Past State' AS version, COUNT(*) AS count FROM public_data.nyc_taxicab VERSION AS OF 4573466015237516024
"""
spark.sql(query).show()
EOF
And submit it:
gcloud dataproc batches submit pyspark \
--project=$PROJECT_ID \
--region=$REGION \
--version=2.3 \
--properties="$METASTORE_PROPERTIES" \
--deps-bucket=gs://$DEPS_BUCKET \
timetravel.py
You should see output similar to the following in your console:
+-------------+----------+
|      version|     count|
+-------------+----------+
|Current State|1293069366|
|   Past State|  72878594|
+-------------+----------+
This makes it easy to audit data changes over time.
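Iceberg also supports timestamp-based time travel with TIMESTAMP AS OF, which resolves a wall-clock instant to the snapshot that was current at that time. A minimal sketch (timestamp_travel.py is an illustrative filename, and the timestamp is a placeholder; pick one between two commits in your table's history output):

```shell
cat <<EOF > timestamp_travel.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Query the table as of a point in time rather than a snapshot ID.
spark.sql("""
SELECT COUNT(*) AS count
FROM public_data.nyc_taxicab TIMESTAMP AS OF '2026-01-07 21:40:00'
""").show()
EOF
```

Submit it with the same gcloud dataproc batches submit pyspark command used for timetravel.py.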
6. Structured AI with BigQuery ML
Now that you have explored the Iceberg data, let's use BigQuery AI capabilities! Since the public Iceberg catalog is read-only, we can use BigQuery to train a model in our workspace by reading from public tables.
Create a local dataset
First, create a dataset in your project to store the AI model using the bq CLI:
bq mk --location=$REGION --project_id=$PROJECT_ID iceberg_ai
Train a Linear Regression Model
Now you will train a Linear Regression model using the public BigLake Iceberg table.
Create a query file and train the model using bq query:
cat <<'EOF' > train_model.sql
CREATE OR REPLACE MODEL `iceberg_ai.predict_fare`
OPTIONS(model_type='LINEAR_REG', input_label_cols=['fare_amount']) AS
SELECT fare_amount, passenger_count, CAST(trip_distance AS FLOAT64) AS trip_distance
FROM `bigquery-public-data`.`biglake-public-nyc-taxi-iceberg`.public_data.nyc_taxicab
WHERE fare_amount > 0 AND trip_distance > 0 AND RAND() < 0.01; -- Using 1% of data to downsample
EOF
bq query --location=$REGION --use_legacy_sql=false < train_model.sql
Predict using the Model
Now that the model is trained, you can use it to predict fare amounts for new trips using ML.PREDICT.
Create a query file and run the prediction using bq query:
cat <<'EOF' > predict_fare.sql
SELECT
predicted_fare_amount, passenger_count, trip_distance
FROM
ML.PREDICT(MODEL `iceberg_ai.predict_fare`,
(
SELECT 2 AS passenger_count, 5.0 AS trip_distance
)
);
EOF
bq query --location=$REGION --use_legacy_sql=false < predict_fare.sql
You should see output similar to the following:
+-----------------------+-----------------+---------------+
| predicted_fare_amount | passenger_count | trip_distance |
+-----------------------+-----------------+---------------+
|     14.12252095150709 |               2 |           5.0 |
+-----------------------+-----------------+---------------+
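Before relying on the model's predictions, you may want to check its fit. A minimal sketch using ML.EVALUATE (evaluate_model.sql is an illustrative filename; linear regression models report metrics such as mean_absolute_error and r2_score):

```shell
# Illustrative file name: evaluate_model.sql. With no input data argument,
# ML.EVALUATE reports metrics computed during training.
cat <<'EOF' > evaluate_model.sql
SELECT * FROM ML.EVALUATE(MODEL `iceberg_ai.predict_fare`);
EOF
```

Run it with bq query --location=$REGION --use_legacy_sql=false < evaluate_model.sql.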
7. Unstructured AI with BigLake
Data isn't just rows and columns. Unified Data Lakehouses handle unstructured data (images, PDFs) too. Let's use Object Tables and Object References to query unstructured data.
Object tables are read-only external tables in BigQuery that list the objects in a Cloud Storage path. Each row represents a file, with columns for metadata such as uri and size, plus a special ref column containing the ObjectRef.
An object reference (ObjectRef) points to the actual data of a single file. Modern BigQuery ML functions (such as AI.GENERATE or AI.AGG) consume ObjectRefs to read file content (images, audio, or text) for analysis without loading the bytes into a standard table.
Create a Dataset for Unstructured AI
First, create a second dataset in your project to store the Object Tables using the bq CLI in the US multi-region:
bq mk --location=US --project_id=$PROJECT_ID iceberg_object_ai
Create an External Connection
To query data stored in Cloud Storage (both Object Tables and unstructured data) from BigQuery, you need to create an external connection.
Run the following in Cloud Shell to create a Cloud Resource connection:
bq mk --connection --project_id=$PROJECT_ID --location=US --connection_type=CLOUD_RESOURCE my-conn
Find the service account ID created for your connection:
CONNECTION_SA=$(bq show --format=json --project_id=$PROJECT_ID --connection $PROJECT_ID.us.my-conn | jq -r '.serviceAccountId // .cloudResource.serviceAccountId')
Grant the Service Account the Vertex AI User and Storage Object Viewer roles so it can call Gemini models and read GCS data:
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:$CONNECTION_SA" \
--role="roles/aiplatform.user"
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:$CONNECTION_SA" \
--role="roles/storage.objectViewer"
Create an Object Table
We will use the external connection my-conn created in the previous section to access unstructured data. Create a query file and create the Object Table using bq query:
cat <<'EOF' > create_object_table.sql
CREATE EXTERNAL TABLE `iceberg_object_ai.sample_images`
WITH CONNECTION `us.my-conn`
OPTIONS (
object_metadata = 'SIMPLE',
uris = ['gs://cloud-samples-data/vision/landmark/*']
);
EOF
bq query --use_legacy_sql=false < create_object_table.sql
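To verify the Object Table sees the sample files, you can select its metadata columns. A minimal sketch (list_objects.sql is an illustrative filename; uri, content_type, and size are standard object table metadata columns):

```shell
# Illustrative file name: list_objects.sql. Lists a few of the files
# the Object Table indexes, without reading their contents.
cat <<'EOF' > list_objects.sql
SELECT uri, content_type, size
FROM `iceberg_object_ai.sample_images`
LIMIT 5;
EOF
```

Run it with bq query --use_legacy_sql=false < list_objects.sql.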
Use Gemini on Object Data
Now run a query using Gemini to evaluate the images without downloading them!
Query the images using standard SQL via bq query:
cat <<EOF > query_images.sql
SELECT
uri,
image_analysis.description
FROM (
SELECT
uri,
AI.GENERATE(
(
'Identify what is happening in the image.',
ref
),
connection_id => 'us.my-conn',
endpoint => 'gemini-2.5-flash-lite',
output_schema => 'event STRING, severity STRING, description STRING'
) AS image_analysis
FROM
iceberg_object_ai.sample_images
);
EOF
bq query --use_legacy_sql=false < query_images.sql
Sample output:
+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| uri | description |
+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| gs://cloud-samples-data/vision/landmark/eiffel_tower.jpg | The Eiffel Tower stands tall against a cloudy sky, overlooking the Seine River in Paris. Boats are docked along the riverbank, and trees line the opposite shore, with bridges and buildings visible in the distance. |
| gs://cloud-samples-data/vision/landmark/pofa.jpg | A wide shot shows the Palace of Fine Arts, a monumental structure in San Francisco, California. The building features a large rotunda with a dome, surrounded by colonnades. In front of the rotunda is a lagoon. Several people are walking around the grounds. The sky is blue with a few scattered clouds. |
| gs://cloud-samples-data/vision/landmark/st_basils.jpeg | A monument stands in front of Saint Basil's Cathedral in Moscow under a bright blue sky with scattered white clouds. The cathedral features distinctive onion domes in various colors and patterns, including red, blue and white stripes, green and beige stripes, and red and blue diamonds. A large green tree partially obscures the left side of the cathedral. People are visible in the foreground near the base of the monument and the cathedral entrance. |
+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Exploring ObjectRefs Directly: Sentiment Analysis
While Object Tables manage file references automatically, you can interact with these objects directly using BigQuery Object References to run on-the-fly analysis on single files.
For example, you can use a small text file stored in your own GCS bucket (using the $DEPS_BUCKET variable created earlier) and analyze it using OBJ.MAKE_REF with bq query.
First, create a small text file and upload it to your bucket:
cat <<'EOF' > review.txt
This product is fantastic! It exceeded my expectations. The quality is top-notch. I highly recommend it to everyone!
EOF
gcloud storage cp review.txt gs://${DEPS_BUCKET}/review.txt
Now query the file using OBJ.MAKE_REF inside standard SQL:
cat <<EOF > sentiment_analysis.sql
SELECT
AI.GENERATE(
(
'Analyze the sentiment of this text file. Is it positive, negative, or neutral? Explain why.',
OBJ.MAKE_REF('gs://${DEPS_BUCKET}/review.txt', 'us.my-conn')
),
connection_id => 'us.my-conn',
endpoint => 'gemini-2.5-flash-lite'
).result AS ml_generate_text_result;
EOF
bq query --use_legacy_sql=false < sentiment_analysis.sql
Sample output:
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ml_generate_text_result |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| This text file has a **strongly positive** sentiment. |
| |
| Here's why: |
| |
| * **Positive Keywords:** The text is filled with unequivocally positive words and phrases: |
| * "fantastic" |
| * "exceeded my expectations" |
| * "top-notch" |
| * "highly recommend" |
| |
| * **Enthusiastic Language:** The use of exclamation marks ("!") further amplifies the positive tone, indicating excitement and strong approval. |
| |
| * **Lack of Negative or Neutral Elements:** There are no words, phrases, or implications that suggest any dissatisfaction, criticism, or even indifference. |
| |
| In summary, the author's language is enthusiastic and uses multiple strong positive descriptors, leaving no room for doubt that their opinion of the product is overwhelmingly positive. |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8. Clean up
To avoid ongoing charges to your Google Cloud account, delete the resources created during this codelab.
Delete Dataset and Connection
Run the following in Cloud Shell to delete your datasets and connection:
bq rm -r -f --location=$REGION iceberg_ai
bq rm -r -f --location=US iceberg_object_ai
bq rm --connection $PROJECT_ID.us.my-conn
Delete GCS buckets and local files
Clean up the GCS buckets and local files:
# Delete GCS buckets
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
gcloud storage rm -r gs://dataproc-temp-${REGION}-${PROJECT_NUMBER}-*
gcloud storage rm -r gs://dataproc-staging-${REGION}-${PROJECT_NUMBER}-*
gcloud storage rm -r gs://${DEPS_BUCKET}
# Delete local files
rm -f train_model.sql predict_fare.sql create_object_table.sql query_images.sql sentiment_analysis.sql test.py query.py history.py timetravel.py review.txt
You can also delete the entire project if you created it just for this lab.
9. Congratulations
Congratulations! You've successfully built a Unified Data Lakehouse using Apache Iceberg, BigLake, and BigQuery AI!
What you've learned
- How to connect and query Public Apache Iceberg REST Catalogs.
- Using Time Travel in Iceberg to audit dataset versions.
- Training BigQuery ML models on structured data.
- Connecting unstructured data (images) using Object Tables & ObjectRef.
- Using Gemini directly in BigQuery SQL to analyze images.