1. Build a Reverse ETL Pipeline from Snowflake to Spanner using Google Cloud Storage and BigQuery
Introduction
In this codelab, a Reverse ETL pipeline is built from Snowflake to Spanner. Traditionally, ETL (Extract, Transform, Load) pipelines move data from operational databases into a data warehouse like Snowflake for analytics. A Reverse ETL pipeline does the opposite: it moves curated, processed data from the data warehouse back into operational systems where it can power applications, serve user-facing features, or be used for real-time decision-making.
The goal is to move an aggregated dataset from a Snowflake Iceberg table into Spanner, a globally distributed relational database ideal for high-availability applications.
To achieve this, Google Cloud Storage (GCS) and BigQuery are used as intermediate steps. Here's a breakdown of the data flow and the reasoning behind this architecture:
- Snowflake to Google Cloud Storage (GCS) in Iceberg Format:
- The first step is to get the data out of Snowflake in an open, well-defined format. The table is exported in the Apache Iceberg format. This process writes the underlying data as a set of Parquet files and the table's metadata (schema, partitions, file locations) as JSON and Avro files. Staging this complete table structure in GCS makes the data portable and accessible to any system that understands the Iceberg format.
- Convert Iceberg tables in GCS to BigQuery BigLake external table:
- Instead of loading the data directly from GCS into Spanner, BigQuery is used as a powerful intermediary. You will create a BigLake external table in BigQuery that points directly to the Iceberg metadata file in GCS. This approach has several advantages:
- No Data Duplication: BigQuery reads the table structure from the metadata and queries the Parquet data files in place without ingesting them, which saves significant time and storage costs.
- Federated Queries: It allows for running complex SQL queries on GCS data as if it were a native BigQuery table.
- BigQuery to Spanner:
- The final step is to move the data from BigQuery into Spanner. This is achieved with a BigQuery feature called an EXPORT DATA query, which performs the actual "Reverse ETL" step.
- Operational Readiness: Spanner is designed for transactional workloads, providing strong consistency and high availability for applications. By moving the data into Spanner, it is made accessible to user-facing applications, APIs, and other operational systems that require low-latency point lookups.
- Scalability: This pattern allows for leveraging BigQuery's analytical power to process large datasets and then serve the results efficiently through Spanner's globally scalable infrastructure.
Services and Terminology
- Snowflake - A cloud data platform that provides a data warehouse-as-a-service.
- Spanner - A fully managed, globally distributed relational database.
- Google Cloud Storage - Google Cloud's object storage offering.
- BigQuery - A fully managed, serverless data warehouse for analytics.
- Iceberg - An open table format from the Apache Software Foundation that provides a table abstraction over common open data file formats such as Parquet.
- Parquet - An open-source columnar binary data file format by Apache.
What you'll learn
- How to load data into Snowflake
- How to create a GCS Bucket
- How to export a Snowflake table to GCS in the Iceberg format
- How to set up a Spanner instance
- How to load BigLake External Tables in BigQuery to Spanner
2. Setup, Requirements & Limitations
Prerequisites
- A Snowflake account
- A Google Cloud account with a BigQuery Enterprise-tier (or higher) reservation, which is required to export from BigQuery to Spanner
- Access to the Google Cloud Console through a web browser
- A terminal to run Google Cloud CLI commands
- If your Google Cloud organization has the iam.allowedPolicyMemberDomains policy enabled, an administrator may need to grant an exception to allow service accounts from external domains. This will be covered in a later step where applicable.
Limitations
It is important to be aware of certain limitations and data type incompatibilities that can arise in this pipeline.
Snowflake to Iceberg
Column data types differ between Snowflake and Iceberg. Information about translating between them is available in the Snowflake documentation.
Iceberg to BigQuery
When using BigQuery to query Iceberg tables, there are some limitations. For a full list, see the BigQuery documentation. Note that types such as BIGNUMERIC, INTERVAL, JSON, RANGE, or GEOGRAPHY are not currently supported.
BigQuery to Spanner
The EXPORT DATA command from BigQuery to Spanner does not support all BigQuery data types. Exporting a table with any of the following types will result in an error:
- STRUCT
- GEOGRAPHY
- DATETIME
- RANGE
- TIME
Additionally, if the BigQuery project is using the GoogleSQL dialect, the following numeric type is also not supported for export to Spanner:
- BIGNUMERIC
For a complete and up-to-date list of limitations, refer to the official documentation: Exporting to Spanner Limitations.
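Before running the pipeline, the unsupported types listed above can be screened for in a quick pre-flight step. The sketch below is illustrative only: check_types is a hypothetical helper (not part of the bq CLI), and the type list is transcribed from the limitations above.

```shell
# Hypothetical pre-flight helper: warn about BigQuery column types that the
# EXPORT DATA statement cannot write to Spanner (list transcribed from above).
UNSUPPORTED_TYPES="STRUCT GEOGRAPHY DATETIME RANGE TIME BIGNUMERIC"

check_types() {
  for t in "$@"; do
    for u in $UNSUPPORTED_TYPES; do
      if [ "$t" = "$u" ]; then
        echo "Unsupported for Spanner export: $t"
      fi
    done
  done
}

# Example: one offending column type in a candidate schema.
check_types STRING INT64 NUMERIC GEOGRAPHY
```

Running the example prints a warning for GEOGRAPHY and nothing for the supported types.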
Snowflake
For this codelab, you can use an existing Snowflake account, or set up a free trial account.
Google Cloud Platform IAM Permissions
The Google account will need the following permissions to execute all the steps in this codelab.
Service Accounts
- Allows the creation of Service Accounts.

Spanner
- Allows creating a new Spanner instance.
- Allows running DDL statements to create the database.
- Allows running DDL statements to create tables in the database.

Google Cloud Storage
- Allows creating a new GCS bucket to store the exported Parquet files.
- Allows writing the exported Parquet files to the GCS bucket.
- Allows BigQuery to read the Parquet files from the GCS bucket.
- Allows BigQuery to list the Parquet files in the GCS bucket.

Dataflow
- Allows claiming of work items from Dataflow.
- Allows the Dataflow worker to send messages back to the Dataflow service.
- Allows Dataflow workers to write log entries to Google Cloud Logging.

For convenience, predefined roles that contain these permissions can be used.
Setup Reusable Properties
A few values will be needed repeatedly throughout this lab. To make this easier, set them as shell variables for later use.
- GCP_REGION - The region where the GCP resources will be located. The list of regions can be found here.
- GCP_PROJECT - The GCP project ID to use.
- GCS_BUCKET_NAME - The name of the GCS bucket to create, where the data files will be stored.
- SPANNER_INSTANCE - The ID of the Spanner instance to create.
- SPANNER_DB - The ID of the Spanner database to create.
export GCP_REGION=<GCP REGION HERE>
export GCP_PROJECT=<GCP PROJECT HERE>
export GCS_BUCKET_NAME=<GCS BUCKET NAME HERE>
export SPANNER_INSTANCE=<SPANNER INSTANCE ID HERE>
export SPANNER_DB=<SPANNER DATABASE ID HERE>
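Later gcloud and bq commands can fail in confusing ways if one of these variables is empty, so it can help to verify them up front. The guard below is an optional sketch; require_vars is a hypothetical helper, not a gcloud command.

```shell
# Hypothetical guard: stop early if any required environment variable is
# unset or empty, rather than letting a later gcloud/bq command fail obscurely.
require_vars() {
  for name in "$@"; do
    eval "val=\${$name:-}"   # indirect variable lookup, POSIX-sh compatible
    if [ -z "$val" ]; then
      echo "Missing required variable: $name" >&2
      return 1
    fi
  done
  echo "All required variables are set."
}

# Example usage after running the exports above:
# require_vars GCP_REGION GCP_PROJECT GCS_BUCKET_NAME SPANNER_INSTANCE SPANNER_DB
```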
Google Cloud Project
A Project is a basic unit of organization in Google Cloud. If an administrator has provided one for use, this step may be skipped.
A project can be created using the CLI like this:
gcloud projects create $GCP_PROJECT
gcloud config set project $GCP_PROJECT
Learn more about creating and managing projects here.
Set Up Spanner
To start using Spanner, you need to provision an instance and a database. Details about configuring and creating a Spanner instance can be found here.
Create the Instance
gcloud spanner instances create $SPANNER_INSTANCE \
--config=regional-$GCP_REGION \
--description="Codelabs Snowflake RETL" \
--processing-units=100 \
--edition=ENTERPRISE
Create the Database
gcloud spanner databases create $SPANNER_DB \
--instance=$SPANNER_INSTANCE
3. Create a Google Cloud Storage bucket
Google Cloud Storage (GCS) will be used to store the Parquet data files and Iceberg metadata generated by Snowflake. To do so, a new bucket will first need to be created to use as the file destination. From a Terminal window on a local machine, follow these steps.
Create the bucket
Use the following command to create a storage bucket in a specific region (e.g. us-central1).
gcloud storage buckets create gs://$GCS_BUCKET_NAME --location=$GCP_REGION
Verify bucket creation
Once that command succeeds, check the result by listing all buckets. The new bucket should appear in the resulting list, shown with the gs:// prefix in front of its name.
gcloud storage ls | grep gs://$GCS_BUCKET_NAME
Test write permissions
This step ensures that the local environment is correctly authenticated and has the necessary permissions to write files to the newly created bucket.
echo "Hello, GCS" | gcloud storage cp - gs://$GCS_BUCKET_NAME/hello.txt
Verify the uploaded file
List the objects in the bucket. The full path of the file just uploaded should appear.
gcloud storage ls gs://$GCS_BUCKET_NAME
You should see the following output:
gs://$GCS_BUCKET_NAME/hello.txt
To view the contents of an object in a bucket, gcloud storage cat can be used.
gcloud storage cat gs://$GCS_BUCKET_NAME/hello.txt
The contents of the file should be visible:
Hello, GCS
Clean up the test file
The Cloud Storage bucket is now set up. The temporary test file can now be deleted.
gcloud storage rm gs://$GCS_BUCKET_NAME/hello.txt
The output should confirm the deletion:
Removing gs://$GCS_BUCKET_NAME/hello.txt...
/ [1 objects]
Operation completed over 1 objects.
4. Exporting from Snowflake to GCS
For this lab, you will use the TPC-H dataset, which is an industry-standard benchmark for decision support systems. Its schema models a realistic business environment with customers, orders, suppliers, and parts, making it perfect for demonstrating a real-world analytics and data movement scenario. This dataset is available by default in all Snowflake accounts.
Instead of using the raw, normalized TPC-H tables, you will create a new, aggregated table. This new table will join data from the orders, customer, and nation tables to produce a denormalized, summarized view of national sales totals. This pre-aggregation step is a common practice in analytics, as it prepares the data for a specific use case—in this scenario, for consumption by an operational application.
Allow Snowflake to Access Google Cloud Storage
To allow Snowflake to write data to the GCS bucket, two things are needed: an External Volume and the IAM permissions that let its service account access the bucket.
- An External Volume is a Snowflake object that provides a secure link to a specific location in a GCS bucket. It doesn't store data itself, but rather holds the configuration needed for Snowflake to access cloud storage.
- For security, cloud storage buckets are private by default. When an External Volume is created, Snowflake generates a dedicated Service Account. This service account must be granted permissions to read from and write to the bucket.
Create a database
- On the left side menu, under Horizon Catalog, hover over Catalog, then click on Database Explorer
- Once on the Databases page, click on the + Database button on the top right.
- Name the new database codelabs_retl_db
Create a Worksheet
To run SQL commands against the database, a worksheet is needed.
To create a worksheet:
- On the left side menu, under Work with data, hover over Projects, then click on Workspaces
- Under the My Workspaces side bar, click on the + Add new button and select SQL File
Create an External Volume
Run the following command in a Snowflake worksheet to create the volume.
CREATE EXTERNAL VOLUME codelabs_retl_ext_vol
STORAGE_LOCATIONS =
(
(
NAME = 'codelabs_retl_ext_vol'
STORAGE_PROVIDER = 'GCS'
STORAGE_BASE_URL = 'gcs://<Your bucket name>/snowflake_extvol'
)
);
Get the Snowflake Service Account
DESC (describe) the newly-created external volume to get the unique service account that Snowflake has generated for it.
DESC EXTERNAL VOLUME codelabs_retl_ext_vol;
- In the results pane, look through the JSON properties and find the property_value entry that holds a JSON string starting with "NAME":"codelabs_retl_ext_vol"
- Find the STORAGE_GCP_SERVICE_ACCOUNT property within the JSON object and copy its value (it will look like an email address). This is the service account identifier that needs access to the GCS bucket.
- Store this service account in an environment variable in your shell for reuse later
export GCP_SERVICE_ACCOUNT=<Your service account>
Grant GCS Permissions to Snowflake
Now, the Snowflake service account must be granted permission to write to the GCS bucket.
gcloud storage buckets add-iam-policy-binding gs://$GCS_BUCKET_NAME \
--member="serviceAccount:$GCP_SERVICE_ACCOUNT" \
--role="roles/storage.objectAdmin"
gcloud storage buckets add-iam-policy-binding gs://$GCS_BUCKET_NAME \
--member="serviceAccount:$GCP_SERVICE_ACCOUNT" \
--role="roles/storage.legacyBucketReader"
Verify Access in Snowflake
Back in the Snowflake worksheet, run this command to verify that Snowflake can now successfully connect to the GCS bucket.
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('codelabs_retl_ext_vol');
The result should be a JSON object containing "success":true.
For more information about external volumes in Snowflake, refer to the official documentation.
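If the verification result is copied out of Snowflake, the success flag can also be checked mechanically rather than by eye. This is a sketch that assumes the JSON shape described above; the sample string here is illustrative, not a real response.

```shell
# Illustrative check of a saved SYSTEM$VERIFY_EXTERNAL_VOLUME result. The JSON
# string below is a stand-in for the real response copied from the worksheet.
result='{"success":true,"fileWriteTest":"passed"}'

if printf '%s' "$result" | grep -q '"success":true'; then
  echo "External volume verified."
else
  echo "Verification failed; re-check the IAM bindings." >&2
fi
```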
Export Sample Order Data
Now you can create an Iceberg table in Snowflake. The following command tells Snowflake to run a query and store the results in GCS using the Iceberg format. The data files will be Parquet, and the metadata will be Avro and JSON, all stored in the location defined by the codelabs_retl_ext_vol External Volume.
USE DATABASE codelabs_retl_db;
CREATE ICEBERG TABLE REGIONAL_SALES_ICEBERG (
NATION_NAME STRING,
MARKET_SEGMENT STRING,
ORDER_YEAR INTEGER,
ORDER_PRIORITY STRING,
TOTAL_ORDER_COUNT INTEGER,
TOTAL_REVENUE NUMBER(24,2),
UNIQUE_CUSTOMER_COUNT INTEGER
)
EXTERNAL_VOLUME = 'codelabs_retl_ext_vol'
CATALOG = 'SNOWFLAKE'
BASE_LOCATION = 'regional_sales_iceberg'
AS (
SELECT
n.n_name AS nation_name,
c.c_mktsegment AS market_segment,
YEAR(o.o_orderdate) AS order_year,
o.o_orderpriority AS order_priority,
COUNT(o.o_orderkey) AS total_order_count,
ROUND(SUM(o.o_totalprice), 2) AS total_revenue,
COUNT(DISTINCT c.c_custkey) AS unique_customer_count
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.orders AS o
INNER JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.customer AS c
ON o.o_custkey = c.c_custkey
INNER JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.nation AS n
ON c.c_nationkey = n.n_nationkey
GROUP BY
n.n_name,
c.c_mktsegment,
YEAR(o.o_orderdate),
o.o_orderpriority
);
For more information about creating and managing Iceberg tables using Snowflake, refer to the official documentation.
Verify Data in GCP
Now check the GCS bucket. The files that Snowflake created should be visible. This confirms that the export was successful. The Iceberg metadata will be found in the metadata folder and the actual data as Parquet files in the data folder.
gcloud storage ls "gs://$GCS_BUCKET_NAME/snowflake_extvol/**"
The exact file names will vary, but the structure should look like this:
gs://$GCS_BUCKET_NAME/snowflake_extvol/
gs://$GCS_BUCKET_NAME/snowflake_extvol/regional_sales_iceberg.snvrAWuR/data/snow_cbsKIRmdDmo_wLg128fugxg_0_2_009.parquet
...
gs://$GCS_BUCKET_NAME/snowflake_extvol/regional_sales_iceberg.snvrAWuR/metadata/00001-62f831ff-6708-4494-94c5-c891b7ad447f.metadata.json
...
The data has now been copied out of Snowflake and into Google Cloud Storage in Iceberg format.
While we have this list, let's save the metadata.json file to an environment variable as we will need it later.
export GCS_METADATA_JSON=$(gcloud storage ls "gs://$GCS_BUCKET_NAME/snowflake_extvol/**" | grep .metadata.json)
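Note that the grep above can match more than one file once the table has accumulated several snapshots. Iceberg metadata files are conventionally named with a zero-padded sequence number followed by a UUID, so sorting the listing and taking the last entry selects the newest one. A sketch with hypothetical paths, assuming that naming convention holds for the Snowflake-generated files:

```shell
# Pick the newest metadata file from a listing. The paths below are
# hypothetical; in the lab, the input would be the gcloud storage ls output.
latest=$(printf '%s\n' \
  "gs://bucket/tbl/metadata/00001-aaaa.metadata.json" \
  "gs://bucket/tbl/metadata/00002-bbbb.metadata.json" \
  | sort | tail -n 1)

echo "$latest"
```

The same `| sort | tail -n 1` suffix can be appended to the export command above to make it robust to multiple snapshots.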
5. Configure a BigQuery External Table
Now that the Iceberg table is in Google Cloud Storage, the next step is to make it accessible to BigQuery. This can be done by creating a BigLake external table.
BigLake is a storage engine that allows for creating tables in BigQuery that read data directly from external sources like Google Cloud Storage. For this lab, it's the key technology that enables BigQuery to understand the Iceberg table just exported without needing to ingest the data.
To make this work, two components are needed:
- A Cloud Resource Connection: This is a secure link between BigQuery and GCS. It uses a special service account to handle authentication, ensuring that BigQuery has the necessary permissions to read the files from the GCS bucket.
- An External Table Definition: This tells BigQuery where to find the Iceberg table's metadata file in GCS and how it should be interpreted.
Configure a Connection to Google Cloud Storage
First, the connection that allows BigQuery to access GCS will be created. This command creates a connection resource within BigQuery.
bq mk \
--connection \
--project_id=$GCP_PROJECT \
--location=$GCP_REGION \
--connection_type=CLOUD_RESOURCE \
codelabs-retl-connection
Success will look something like this:
Connection 12345678.region.codelabs-retl-connection successfully created
More information about Cloud Resource connections in BigQuery is available in the Google Cloud documentation.
Authorize the BigQuery connection to read data
The new BigQuery connection has its own service account that needs permission to read data from the Google Cloud Storage bucket.
1. Get the Connection Service Account
First, get the service account ID from the connection just created:
bq show \
--location $GCP_REGION \
--connection codelabs-retl-connection
The results will show the connection's details, including its serviceAccountId.
Set the serviceAccountId to an environment variable to be used later.
export GCP_BQ_SERVICE_ACCOUNT=<Your service account email>
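Copying the value by hand works, but it can also be extracted from the connection's JSON description. This is a sketch that assumes `bq show --format=json` reports the account under a cloudResource.serviceAccountId field; the JSON below is a canned example, not live output.

```shell
# Canned example of the JSON a connection description might contain; in the
# lab this would come from: bq show --format=json --connection ...
connection_json='{"cloudResource": {"serviceAccountId": "bqcx-1234@example.iam.gserviceaccount.com"}}'

# Pull out the serviceAccountId with sed (works on the single-line JSON above).
GCP_BQ_SERVICE_ACCOUNT=$(printf '%s' "$connection_json" \
  | sed -n 's/.*"serviceAccountId": *"\([^"]*\)".*/\1/p')

echo "$GCP_BQ_SERVICE_ACCOUNT"
```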
2. Grant Permissions
Authorize the service account to view data in the GCS bucket by running the following command.
gcloud storage buckets add-iam-policy-binding \
gs://$GCS_BUCKET_NAME \
--member serviceAccount:$GCP_BQ_SERVICE_ACCOUNT \
--role roles/storage.objectViewer
Create the External Table
Now, create the BigLake external table in BigQuery. This command doesn't move any data. It simply creates a pointer to the existing data in GCS. The path to one of the .metadata.json files that Snowflake created will be needed.
bq mk --dataset --location=$GCP_REGION codelabs_retl
bq mk \
--table \
--location=$GCP_REGION \
--external_table_definition=ICEBERG=$GCS_METADATA_JSON@projects/$GCP_PROJECT/locations/$GCP_REGION/connections/codelabs-retl-connection \
codelabs_retl.regional_sales
Verify Data in BigQuery
This table can now be queried using standard SQL, just as with any other BigQuery table. BigQuery will use the connection to read the Parquet files from GCS on the fly.
bq query \
--location=$GCP_REGION \
--nouse_legacy_sql "SELECT * FROM \`$GCP_PROJECT.codelabs_retl.regional_sales\` LIMIT 10;"
6. Importing Data from BigQuery to Spanner: The Final Step
The final and most important part of the pipeline has been reached: moving the data from the BigLake table into Spanner. This is the "Reverse ETL" step, where the data, having been processed and curated in the data warehouse, is loaded into an operational system for use by applications.
Spanner is a fully managed, globally distributed relational database. It offers the transactional consistency of a traditional relational database but with the horizontal scalability of a NoSQL database. This makes it an ideal choice for building scalable, highly available applications.
The process will be:
- Create a table schema in the Spanner database that matches the structure of the data.
- Run a BigQuery
EXPORT DATAquery to load the data from the BigLake table directly into the Spanner table.
Create the Spanner Table
Before transferring data from BigQuery, a destination table must be created in Spanner with a compatible schema.
gcloud spanner databases ddl update $SPANNER_DB \
--instance=$SPANNER_INSTANCE \
--ddl="$(cat <<EOF
CREATE TABLE regional_sales (
nation_name STRING(MAX),
market_segment STRING(MAX),
order_year INT64,
order_priority STRING(MAX),
total_order_count INT64,
total_revenue NUMERIC,
unique_customer_count INT64
) PRIMARY KEY (nation_name, market_segment, order_year, order_priority);
EOF
)"
Export Data from BigQuery
This is the final step. With the source data ready in a BigQuery BigLake table and the destination table created in Spanner, the actual data movement is surprisingly simple. A single BigQuery SQL query will be used: EXPORT DATA.
This query is designed specifically for scenarios like this. It efficiently exports data from a BigQuery table (including external ones like the BigLake table) to an external destination. In this case, the destination is the Spanner table.
bq query --location=$GCP_REGION --use_legacy_sql=false <<EOF
EXPORT DATA OPTIONS (
uri="https://spanner.googleapis.com/projects/${GCP_PROJECT}/instances/${SPANNER_INSTANCE}/databases/${SPANNER_DB}",
format='CLOUD_SPANNER',
spanner_options="""{
"table": "regional_sales",
"priority": "HIGH"
}"""
) AS
SELECT * FROM \`${GCP_PROJECT}.codelabs_retl.regional_sales\`
EOF
When the query finishes, the output should indicate that the statement completed successfully (in the BigQuery console, the Results pane says "Update completed").
7. Verify data in Spanner
Congratulations! A complete Reverse ETL pipeline has been successfully built and executed. The final step is to verify that the data has arrived in Spanner as expected.
gcloud spanner databases execute-sql \
--instance=$SPANNER_INSTANCE \
$SPANNER_DB \
--sql='SELECT * FROM regional_sales LIMIT 10'
The imported sample data appears as requested:
nation_name  market_segment  order_year  order_priority   total_order_count  total_revenue  unique_customer_count
ALGERIA      AUTOMOBILE      1992        1-URGENT         375                59232423.66    298
ALGERIA      AUTOMOBILE      1992        2-HIGH           328                47371891.08    269
ALGERIA      AUTOMOBILE      1992        3-MEDIUM         346                52823195.87    262
ALGERIA      AUTOMOBILE      1992        4-NOT SPECIFIED  365                52935998.34    288
ALGERIA      AUTOMOBILE      1992        5-LOW            380                54920263.68    293
ALGERIA      AUTOMOBILE      1993        1-URGENT         394                63145618.78    312
ALGERIA      AUTOMOBILE      1993        2-HIGH           340                50737488.4     277
ALGERIA      AUTOMOBILE      1993        3-MEDIUM         383                55871057.46    298
ALGERIA      AUTOMOBILE      1993        4-NOT SPECIFIED  365                56424662.05    291
ALGERIA      AUTOMOBILE      1993        5-LOW            363                54673249.06    283
The gap between the analytical and operational data worlds has been successfully bridged.
8. Clean-up
Clean up Spanner
Delete the Spanner instance; this also deletes the database it contains.
gcloud spanner instances delete $SPANNER_INSTANCE
Clean up GCS
Delete the GCS Bucket created to host the data
gcloud storage rm --recursive gs://$GCS_BUCKET_NAME
Clean up BigQuery
bq rm -r codelabs_retl
bq rm --connection --location=$GCP_REGION codelabs-retl-connection
Clean up Snowflake
Drop the database
- On the left side menu, under Horizon Catalog, hover over Catalog, then click on Database Explorer
- Click on the ... to the right of the
CODELABS_RETL_DBdatabase to expand the options and select Drop - In the confirmation dialog that pops up, select Drop Database
Delete workbooks
- On the left side menu, under Work with data, hover over Projects, then click Workspaces
- In the My Workspace side bar, hover over the different workspace files you used for this lab to show the ... additional options and click on it
- Select Delete, and then Delete again in the confirmation dialog that pops up.
- Do this for all sql workspace files you created for this lab.
Delete external volumes
- On the left side menu, under Horizon Catalog, hover over Catalog, then click on External Data
- Click the ... to the right of CODELABS_RETL_EXT_VOL, select Drop external volume, and then click Drop external volume again in the confirmation dialog
9. Congratulations
Congratulations for completing the codelab.
What we've covered
- How to load data into Snowflake
- How to create a GCS Bucket
- How to export a Snowflake table to GCS in the Iceberg format
- How to set up a Spanner instance
- How to load BigLake External Tables in BigQuery to Spanner