Reverse ETL from Snowflake to Spanner using BQ

1. Build a Reverse ETL Pipeline from Snowflake to Spanner using Google Cloud Storage and BigQuery

Introduction

In this codelab, a Reverse ETL pipeline is built from Snowflake to Spanner. Traditionally, ETL (Extract, Transform, Load) pipelines move data from operational databases into a data warehouse like Snowflake for analytics. A Reverse ETL pipeline does the opposite: it moves curated, processed data from the data warehouse back into operational systems where it can power applications, serve user-facing features, or be used for real-time decision-making.

The goal is to move an aggregated dataset from a Snowflake Iceberg table into Spanner, a globally distributed relational database ideal for high-availability applications.

To achieve this, Google Cloud Storage (GCS) and BigQuery are used as intermediate steps. Here's a breakdown of the data flow and the reasoning behind this architecture:

  1. Snowflake to Google Cloud Storage (GCS) in Iceberg Format:
  • The first step is to get the data out of Snowflake in an open, well-defined format. The table is exported in the Apache Iceberg format. This process writes the underlying data as a set of Parquet files and the table's metadata (schema, partitions, file locations) as JSON and Avro files. Staging this complete table structure in GCS makes the data portable and accessible to any system that understands the Iceberg format.
  2. Convert the Iceberg table in GCS to a BigQuery BigLake external table:
  • Instead of loading the data directly from GCS into Spanner, BigQuery is used as a powerful intermediary. You will create a BigLake external table in BigQuery that points directly to the Iceberg metadata file in GCS. This approach has several advantages:
  • No Data Duplication: BigQuery reads the table structure from the metadata and queries the Parquet data files in place without ingesting them, which saves significant time and storage costs.
  • Federated Queries: It allows for running complex SQL queries on GCS data as if it were a native BigQuery table.
  3. BigQuery to Spanner:
  • The final step is to move the data from BigQuery into Spanner. You will achieve this using a powerful feature in BigQuery called an EXPORT DATA query, which is the "Reverse ETL" step.
  • Operational Readiness: Spanner is designed for transactional workloads, providing strong consistency and high availability for applications. By moving the data into Spanner, it is made accessible to user-facing applications, APIs, and other operational systems that require low-latency point lookups.
  • Scalability: This pattern allows for leveraging BigQuery's analytical power to process large datasets and then serve the results efficiently through Spanner's globally scalable infrastructure.

Services and Terminology

  • Snowflake - A cloud data platform that provides a data warehouse-as-a-service.
  • Spanner - A fully managed, globally distributed relational database.
  • Google Cloud Storage - Google Cloud's blob storage offering.
  • BigQuery - A fully managed, serverless data warehouse for analytics.
  • Iceberg - An open table format from the Apache project that provides a table abstraction over common open-source data file formats.
  • Parquet - An open-source columnar binary data file format by Apache.

What you'll learn

  • How to load data into Snowflake
  • How to create a GCS Bucket
  • How to export a Snowflake table to GCS in the Iceberg format
  • How to set up a Spanner instance
  • How to export BigLake external tables from BigQuery to Spanner

2. Setup, Requirements & Limitations

Prerequisites

  • A Snowflake account
  • A Google Cloud account with a BigQuery Enterprise edition (or higher) reservation, which is required to export from BigQuery to Spanner
  • Access to the Google Cloud Console through a web browser
  • A terminal to run Google Cloud CLI commands
  • If your Google Cloud organization has the iam.allowedPolicyMemberDomains policy enabled, an administrator may need to grant an exception to allow service accounts from external domains. This will be covered in a later step where applicable.

Limitations

It is important to be aware of certain limitations and data type incompatibilities that can arise in this pipeline.

Snowflake to Iceberg

Column data types differ between Snowflake and Iceberg. Information about translating between them is available in the Snowflake documentation.

Iceberg to BigQuery

When using BigQuery to query Iceberg tables, there are some limitations. For a full list, see the BigQuery documentation. Note that types such as BIGNUMERIC, INTERVAL, JSON, RANGE, or GEOGRAPHY are not currently supported.

BigQuery to Spanner

The EXPORT DATA command from BigQuery to Spanner does not support all BigQuery data types. Exporting a table with the following types will result in an error:

  • STRUCT
  • GEOGRAPHY
  • DATETIME
  • RANGE
  • TIME

Additionally, if the GoogleSQL dialect is in use, the following numeric type is also not supported for export to Spanner:

  • BIGNUMERIC

For a complete and up-to-date list of limitations, refer to the official documentation: Exporting to Spanner Limitations.
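When a source column uses one of these unsupported types, a common workaround is to cast it to a supported type in the SELECT clause of the export query. The sketch below is illustrative only: the table my_dataset.events, its columns, and the Spanner URI are hypothetical placeholders, not part of this codelab's dataset.

```sql
-- Illustrative sketch (hypothetical table and columns): cast an unsupported
-- DATETIME column to TIMESTAMP before exporting to Spanner.
EXPORT DATA OPTIONS (
  uri = "https://spanner.googleapis.com/projects/my-project/instances/my-instance/databases/my-db",
  format = 'CLOUD_SPANNER',
  spanner_options = """{ "table": "events" }"""
) AS
SELECT
  event_id,
  CAST(event_dt AS TIMESTAMP) AS event_ts  -- DATETIME itself cannot be exported
FROM my_dataset.events;
```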

Snowflake

For this codelab, you can use an existing Snowflake account, or set up a free trial account.

Google Cloud Platform IAM Permissions

The Google account will need the following permissions to execute all the steps in this codelab.

Service Accounts

iam.serviceAccountKeys.create

Allows the creation of service account keys.

Spanner

spanner.instances.create

Allows creating a new Spanner instance.

spanner.databases.create

Allows creating a new database in a Spanner instance.

spanner.databases.updateDdl

Allows running DDL statements to create tables in the database.

Google Cloud Storage

storage.buckets.create

Allows creating a new GCS bucket to store the exported Parquet files.

storage.objects.create

Allows writing the exported Parquet files to the GCS bucket.

storage.objects.get

Allows BigQuery to read the Parquet files from the GCS bucket.

storage.objects.list

Allows BigQuery to list the Parquet files in the GCS bucket.

Dataflow

dataflow.workitems.lease

Allows claiming of work items from Dataflow.

dataflow.workitems.sendMessage

Allows the Dataflow worker to send messages back to the Dataflow service.

logging.logEntries.create

Allows Dataflow workers to write log entries to Google Cloud Logging.

For convenience, predefined roles that contain these permissions can be used.

roles/resourcemanager.projectIamAdmin

roles/iam.serviceAccountKeyAdmin

roles/spanner.instanceAdmin

roles/spanner.databaseAdmin

roles/storage.admin

roles/dataflow.serviceAgent

roles/dataflow.worker


Setup Reusable Properties

A few values will be needed repeatedly throughout this lab. To make this easier, set them as shell variables to be used later.

  • GCP_REGION - The region where the GCP resources will be located. The list of regions can be found here.
  • GCP_PROJECT - The GCP project ID to use.
  • GCS_BUCKET_NAME - The name of the GCS bucket to be created, where the data files will be stored.
  • SPANNER_INSTANCE - The ID of the Spanner instance to be created.
  • SPANNER_DB - The ID of the Spanner database to be created.
export GCP_REGION=<GCP REGION HERE>
export GCP_PROJECT=<GCP PROJECT HERE>
export GCS_BUCKET_NAME=<GCS BUCKET NAME HERE>
export SPANNER_INSTANCE=<SPANNER INSTANCE ID HERE>
export SPANNER_DB=<SPANNER DATABASE ID HERE>
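Before continuing, it can help to confirm that every variable is set. The following is a minimal bash sketch (it assumes the exports above were run in the same shell):

```shell
# Minimal sanity check: print each required variable, flagging any that are unset.
# Uses bash indirect expansion (${!var}).
for var in GCP_REGION GCP_PROJECT GCS_BUCKET_NAME SPANNER_INSTANCE SPANNER_DB; do
  if [ -z "${!var}" ]; then
    echo "MISSING: $var"
  else
    echo "OK: $var=${!var}"
  fi
done
```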

Google Cloud Project

A Project is a basic unit of organization in Google Cloud. If an administrator has provided one for use, this step may be skipped.

A project can be created using the CLI like this:

gcloud projects create $GCP_PROJECT
gcloud config set project $GCP_PROJECT

Learn more about creating and managing projects here.

Set Up Spanner

To start using Spanner, you need to provision an instance, and a database. Details about configuring and creating a Spanner Instance can be found here.

Create the Instance

gcloud spanner instances create $SPANNER_INSTANCE \
--config=regional-$GCP_REGION \
--description="Codelabs Snowflake RETL" \
--processing-units=100 \
--edition=ENTERPRISE

Create the Database

gcloud spanner databases create $SPANNER_DB \
--instance=$SPANNER_INSTANCE

3. Create a Google Cloud Storage bucket

Google Cloud Storage (GCS) will be used to store the Parquet data files and Iceberg metadata generated by Snowflake. To do so, a new bucket will first need to be created to use as the file destination. From a Terminal window on a local machine, follow these steps.

Create the bucket

Use the following command to create a storage bucket in a specific region (e.g. us-central1).

gcloud storage buckets create gs://$GCS_BUCKET_NAME --location=$GCP_REGION

Verify bucket creation

Once that command succeeds, check the result by listing all buckets. The new bucket should appear in the resulting list. Bucket references usually show up with the prefix gs:// in front of the bucket name.

gcloud storage ls | grep gs://$GCS_BUCKET_NAME

Test write permissions

This step ensures that the local environment is correctly authenticated and has the necessary permissions to write files to the newly created bucket.

echo "Hello, GCS" | gcloud storage cp - gs://$GCS_BUCKET_NAME/hello.txt

Verify the uploaded file

List the objects in the bucket. The full path of the file just uploaded should appear.

gcloud storage ls gs://$GCS_BUCKET_NAME

You should see the following output:

gs://$GCS_BUCKET_NAME/hello.txt

To view the contents of an object in a bucket, gcloud storage cat can be used.

gcloud storage cat gs://$GCS_BUCKET_NAME/hello.txt

The contents of the file should be visible:

Hello, GCS

Clean up the test file

The Cloud Storage bucket is now set up. The temporary test file can now be deleted.

gcloud storage rm gs://$GCS_BUCKET_NAME/hello.txt

The output should confirm the deletion:

Removing gs://$GCS_BUCKET_NAME/hello.txt...
/ [1 objects]
Operation completed over 1 objects.

4. Exporting from Snowflake to GCS

For this lab, you will use the TPC-H dataset, which is an industry-standard benchmark for decision support systems. Its schema models a realistic business environment with customers, orders, suppliers, and parts, making it perfect for demonstrating a real-world analytics and data movement scenario. This dataset is available by default in all Snowflake accounts.

Instead of using the raw, normalized TPC-H tables, you will create a new, aggregated table. This new table will join data from the orders, customer, and nation tables to produce a denormalized, summarized view of national sales totals. This pre-aggregation step is a common practice in analytics, as it prepares the data for a specific use case—in this scenario, for consumption by an operational application.

Allow Snowflake to Access Google Cloud Storage

To allow Snowflake to write data to the GCS bucket, two things are needed: an External Volume in Snowflake and the permissions that let its service account access the bucket.

  • An External Volume is a Snowflake object that provides a secure link to a specific location in a GCS bucket. It doesn't store data itself, but rather holds the configuration needed for Snowflake to access cloud storage.
  • For security, cloud storage buckets are private by default. When an External Volume is created, Snowflake generates a dedicated Service Account. This service account must be granted permissions to read from and write to the bucket.

Create a database

  1. On the left side menu, under Horizon Catalog, hover over Catalog, then click on Database Explorer
  2. Once on the Databases page, click on the + Database button on the top right.
  3. Name the new db codelabs_retl_db

Create a Worksheet

To run SQL commands against the database, a worksheet is needed.

To create a worksheet:

  1. On the left side menu, under Work with data, hover over Projects, then click on Workspaces
  2. Under the My Workspaces side bar, click on the + Add new button and select SQL File

Create an External Volume

Run the following command in a Snowflake worksheet to create the volume.

CREATE EXTERNAL VOLUME codelabs_retl_ext_vol
STORAGE_LOCATIONS = 
(
  (
    NAME = 'codelabs_retl_ext_vol'
    STORAGE_PROVIDER = 'GCS'
    STORAGE_BASE_URL = 'gcs://<Your bucket name>/snowflake_extvol'
  )
); 

Get the Snowflake Service Account

DESC (describe) the newly-created external volume to get the unique service account that Snowflake has generated for it.

DESC EXTERNAL VOLUME codelabs_retl_ext_vol;
  1. In the results pane, find the property_value entry that holds a JSON string starting with "NAME":"codelabs_retl_ext_vol".
  2. Find the STORAGE_GCP_SERVICE_ACCOUNT property within that JSON object and copy its value (it will look like an email address). This is the service account identifier that needs access to the GCS bucket.
  3. Store this service account in an environment variable in your shell for reuse later.
export GCP_SERVICE_ACCOUNT=<Your service account>
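Alternatively, the value can be extracted from the JSON string with a little shell scripting instead of copying it by hand. This is a minimal sketch; the PROPERTY_VALUE JSON below is a made-up stand-in for what the DESC command returns, so substitute the real value from your results pane.

```shell
# Sketch: pull STORAGE_GCP_SERVICE_ACCOUNT out of the property_value JSON.
# PROPERTY_VALUE is a hypothetical example of the DESC EXTERNAL VOLUME output.
PROPERTY_VALUE='{"NAME":"codelabs_retl_ext_vol","STORAGE_PROVIDER":"GCS","STORAGE_BASE_URL":"gcs://my-bucket/snowflake_extvol","STORAGE_GCP_SERVICE_ACCOUNT":"abc0000@sf-gcp-project.iam.gserviceaccount.com"}'
# Capture the quoted value that follows the STORAGE_GCP_SERVICE_ACCOUNT key.
export GCP_SERVICE_ACCOUNT=$(echo "$PROPERTY_VALUE" | sed -n 's/.*"STORAGE_GCP_SERVICE_ACCOUNT":"\([^"]*\)".*/\1/p')
echo "$GCP_SERVICE_ACCOUNT"
```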

Grant GCS Permissions to Snowflake

Now, the Snowflake service account must be granted permission to write to the GCS bucket.

gcloud storage buckets add-iam-policy-binding gs://$GCS_BUCKET_NAME \
    --member="serviceAccount:$GCP_SERVICE_ACCOUNT" \
    --role="roles/storage.objectAdmin"

gcloud storage buckets add-iam-policy-binding gs://$GCS_BUCKET_NAME \
    --member="serviceAccount:$GCP_SERVICE_ACCOUNT" \
    --role="roles/storage.legacyBucketReader"

Verify Access in Snowflake

Back in the Snowflake worksheet, run this command to verify that Snowflake can now successfully connect to the GCS bucket.

SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('codelabs_retl_ext_vol');

The result should be a JSON object containing "success":true.

For more information about external volumes in Snowflake, refer to the official documentation.

Export Sample Order Data

Now you can create an Iceberg table in Snowflake. The following command tells Snowflake to run a query and store the results in GCS using the Iceberg format. The data files will be Parquet, and the metadata will be Avro and JSON, all stored in the location defined by the codelabs_retl_ext_vol External Volume.

USE DATABASE codelabs_retl_db; 

CREATE ICEBERG TABLE REGIONAL_SALES_ICEBERG (
    NATION_NAME STRING,
    MARKET_SEGMENT STRING,
    ORDER_YEAR INTEGER,
    ORDER_PRIORITY STRING,
    TOTAL_ORDER_COUNT INTEGER,
    TOTAL_REVENUE NUMBER(24,2),
    UNIQUE_CUSTOMER_COUNT INTEGER
) 
EXTERNAL_VOLUME = 'codelabs_retl_ext_vol'
CATALOG = 'SNOWFLAKE'
BASE_LOCATION = 'regional_sales_iceberg'
AS (
    SELECT 
        n.n_name AS nation_name,
        c.c_mktsegment AS market_segment,
        YEAR(o.o_orderdate) AS order_year,
        o.o_orderpriority AS order_priority,
        COUNT(o.o_orderkey) AS total_order_count,
        ROUND(SUM(o.o_totalprice), 2) AS total_revenue,
        COUNT(DISTINCT c.c_custkey) AS unique_customer_count
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.orders AS o
    INNER JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.customer AS c 
        ON o.o_custkey = c.c_custkey
    INNER JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.nation AS n
        ON c.c_nationkey = n.n_nationkey
    GROUP BY 
        n.n_name, 
        c.c_mktsegment, 
        YEAR(o.o_orderdate), 
        o.o_orderpriority
);

For more information about creating and managing Iceberg tables using Snowflake, refer to the official documentation.
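Before moving on, the new table can be sanity-checked from the same worksheet. The exact count depends on the TPC-H scale factor, so only a non-zero result is expected:

```sql
-- Optional sanity check: confirm the aggregated Iceberg table has rows.
SELECT COUNT(*) AS row_count FROM REGIONAL_SALES_ICEBERG;
```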

Verify Data in GCP

Now check the GCS bucket. The files that Snowflake created should be visible. This confirms that the export was successful. The Iceberg metadata will be found in the metadata folder and the actual data as Parquet files in the data folder.

gcloud storage ls "gs://$GCS_BUCKET_NAME/snowflake_extvol/**"

The exact file names will vary, but the structure should look like this:

gs://$GCS_BUCKET_NAME/snowflake_extvol/
gs://$GCS_BUCKET_NAME/snowflake_extvol/regional_sales_iceberg.snvrAWuR/data/snow_cbsKIRmdDmo_wLg128fugxg_0_2_009.parquet
...
gs://$GCS_BUCKET_NAME/snowflake_extvol/regional_sales_iceberg.snvrAWuR/metadata/00001-62f831ff-6708-4494-94c5-c891b7ad447f.metadata.json
...

The data has now been copied out of Snowflake and into Google Cloud Storage in Iceberg format.

While we have this list, let's save the metadata.json file to an environment variable as we will need it later.

export GCS_METADATA_JSON=$(gcloud storage ls "gs://$GCS_BUCKET_NAME/snowflake_extvol/**" | grep .metadata.json)
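Note that if the Iceberg table is later modified, Snowflake writes additional sequence-numbered metadata files, and the grep above would return several paths. A sketch for keeping only the newest one, assuming the standard NNNNN- prefix sorts chronologically (the listing variable below is a made-up stand-in for the gcloud storage ls output):

```shell
# Sketch: from a listing containing several metadata.json files, keep the newest.
# `listing` stands in for the `gcloud storage ls` output above.
listing='gs://my-bucket/snowflake_extvol/t/metadata/00000-aaa.metadata.json
gs://my-bucket/snowflake_extvol/t/data/part-0.parquet
gs://my-bucket/snowflake_extvol/t/metadata/00001-bbb.metadata.json'
# Filter to metadata files, sort by the numeric prefix, take the last (newest).
export GCS_METADATA_JSON=$(echo "$listing" | grep '\.metadata\.json$' | sort | tail -n 1)
echo "$GCS_METADATA_JSON"
```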

5. Configure a BigQuery External Table

Now that the Iceberg table is in Google Cloud Storage, the next step is to make it accessible to BigQuery. This can be done by creating a BigLake external table.

BigLake is a storage engine that allows for creating tables in BigQuery that read data directly from external sources like Google Cloud Storage. For this lab, it's the key technology that enables BigQuery to understand the Iceberg table just exported without needing to ingest the data.

To make this work, two components are needed:

  1. A Cloud Resource Connection: This is a secure link between BigQuery and GCS. It uses a special service account to handle authentication, ensuring that BigQuery has the necessary permissions to read the files from the GCS bucket.
  2. An External Table Definition: This tells BigQuery where to find the Iceberg table's metadata file in GCS and how it should be interpreted.

Configure a Connection to Google Cloud Storage

First, the connection that allows BigQuery to access GCS will be created. This command creates a connection resource within BigQuery.

bq mk \
  --connection \
  --project_id=$GCP_PROJECT \
  --location=$GCP_REGION \
  --connection_type=CLOUD_RESOURCE \
  codelabs-retl-connection

Success will look something like this:

Connection 12345678.region.codelabs-retl-connection successfully created

More information about Cloud Resource connections in BigQuery is available in the Google Cloud documentation.

Authorize the BigQuery connection to read data

The new BigQuery connection has its own service account that needs permission to read data from the Google Cloud Storage bucket.

1. Get the Connection Service Account

First, get the service account ID from the connection just created:

bq show \
  --location $GCP_REGION \
  --connection codelabs-retl-connection

The results will show the connection's properties, including its serviceAccountId.

Set that value to an environment variable for later use.

export GCP_BQ_SERVICE_ACCOUNT=<Your service account email>
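The value can also be parsed out of the connection's JSON description rather than copied by hand. This is a sketch only; CONN_JSON below is a made-up stand-in for what `bq show --format=json --connection` returns, so substitute the real output.

```shell
# Sketch: extract serviceAccountId from the connection's JSON description.
# CONN_JSON is a hypothetical example of `bq show --format=json --connection` output.
CONN_JSON='{"name":"projects/1234/locations/us-central1/connections/codelabs-retl-connection","cloudResource":{"serviceAccountId":"bqcx-1234@gcp-sa-bigquery-condel.iam.gserviceaccount.com"}}'
# Capture the quoted value that follows the serviceAccountId key.
export GCP_BQ_SERVICE_ACCOUNT=$(echo "$CONN_JSON" | sed -n 's/.*"serviceAccountId":"\([^"]*\)".*/\1/p')
echo "$GCP_BQ_SERVICE_ACCOUNT"
```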

2. Grant Permissions

Authorize the service account to view data in the GCS bucket by running the following command.

gcloud storage buckets add-iam-policy-binding \
  gs://$GCS_BUCKET_NAME \
  --member serviceAccount:$GCP_BQ_SERVICE_ACCOUNT \
  --role roles/storage.objectViewer

Create the External Table

Now, create the BigLake external table in BigQuery. This command doesn't move any data; it simply creates a pointer to the existing data in GCS. It uses the path to the .metadata.json file that Snowflake created, saved earlier in $GCS_METADATA_JSON.

bq mk --dataset --location=$GCP_REGION codelabs_retl

bq mk \
    --table \
    --location=$GCP_REGION \
    --external_table_definition=ICEBERG=$GCS_METADATA_JSON@projects/$GCP_PROJECT/locations/$GCP_REGION/connections/codelabs-retl-connection \
    codelabs_retl.regional_sales

Verify Data in BigQuery

This table can now be queried using standard SQL, just as with any other BigQuery table. BigQuery will use the connection to read the Parquet files from GCS on the fly.

bq query \
  --location=$GCP_REGION \
  --nouse_legacy_sql "SELECT * FROM \`$GCP_PROJECT.codelabs_retl.regional_sales\` LIMIT 10;"

6. Importing Data from BigQuery to Spanner: The Final Step

The final and most important part of the pipeline has been reached: moving the data from the BigLake table into Spanner. This is the "Reverse ETL" step, where the data, having been processed and curated in the data warehouse, is loaded into an operational system for use by applications.

Spanner is a fully managed, globally distributed relational database. It offers the transactional consistency of a traditional relational database but with the horizontal scalability of a NoSQL database. This makes it an ideal choice for building scalable, highly available applications.

The process will be:

  1. Create a table schema in the Spanner database that matches the structure of the data.
  2. Run a BigQuery EXPORT DATA query to load the data from the BigLake table directly into the Spanner table.

Create the Spanner Table

Before transferring data from BigQuery, a destination table must be created in Spanner with a compatible schema.

gcloud spanner databases ddl update $SPANNER_DB \
  --instance=$SPANNER_INSTANCE \
  --ddl="$(cat <<EOF
CREATE TABLE regional_sales (
    nation_name STRING(MAX),
    market_segment STRING(MAX),
    order_year INT64,
    order_priority STRING(MAX),
    total_order_count INT64,
    total_revenue NUMERIC,
    unique_customer_count INT64
) PRIMARY KEY (nation_name, market_segment, order_year, order_priority);
EOF
)"

Export Data from BigQuery

This is the final step. With the source data ready in a BigQuery BigLake table and the destination table created in Spanner, the actual data movement is surprisingly simple. A single BigQuery SQL query will be used: EXPORT DATA.

This query is designed specifically for scenarios like this. It efficiently exports data from a BigQuery table (including external ones like the BigLake table) to an external destination. In this case, the destination is the Spanner table.

bq query --location=$GCP_REGION --use_legacy_sql=false <<EOF
EXPORT DATA OPTIONS (
  uri="https://spanner.googleapis.com/projects/${GCP_PROJECT}/instances/${SPANNER_INSTANCE}/databases/${SPANNER_DB}",
  format='CLOUD_SPANNER',
  spanner_options="""{
      "table": "regional_sales",
      "priority": "HIGH"
      }"""
) AS
SELECT * FROM \`${GCP_PROJECT}.codelabs_retl.regional_sales\`
EOF

When the query finishes, the output should indicate that the export completed successfully.

7. Verify data in Spanner

Congratulations! A complete Reverse ETL pipeline has been successfully built and executed. The final step is to verify that the data has arrived in Spanner as expected.

gcloud spanner databases execute-sql \
  --instance=$SPANNER_INSTANCE \
  $SPANNER_DB \
  --sql='SELECT * FROM regional_sales LIMIT 10'

The imported sample data should appear:

nation_name  market_segment  order_year  order_priority   total_order_count  total_revenue  unique_customer_count
ALGERIA      AUTOMOBILE      1992        1-URGENT         375                59232423.66    298
ALGERIA      AUTOMOBILE      1992        2-HIGH           328                47371891.08    269
ALGERIA      AUTOMOBILE      1992        3-MEDIUM         346                52823195.87    262
ALGERIA      AUTOMOBILE      1992        4-NOT SPECIFIED  365                52935998.34    288
ALGERIA      AUTOMOBILE      1992        5-LOW            380                54920263.68    293
ALGERIA      AUTOMOBILE      1993        1-URGENT         394                63145618.78    312
ALGERIA      AUTOMOBILE      1993        2-HIGH           340                50737488.4     277
ALGERIA      AUTOMOBILE      1993        3-MEDIUM         383                55871057.46    298
ALGERIA      AUTOMOBILE      1993        4-NOT SPECIFIED  365                56424662.05    291
ALGERIA      AUTOMOBILE      1993        5-LOW            363                54673249.06    283

The gap between the analytical and operational data worlds has been successfully bridged.

8. Clean-up

Clean up Spanner

Delete the Spanner instance; this also deletes the databases within it.

gcloud spanner instances delete $SPANNER_INSTANCE

Clean up GCS

Delete the GCS Bucket created to host the data

gcloud storage rm --recursive gs://$GCS_BUCKET_NAME

Clean up BigQuery

bq rm -r codelabs_retl
bq rm --connection --location=$GCP_REGION codelabs-retl-connection

Clean up Snowflake

Drop the database

  1. On the left side menu, under Horizon Catalog, hover over Catalog, then click on Database Explorer
  2. Click on the ... to the right of the CODELABS_RETL_DB database to expand the options and select Drop
  3. In the confirmation dialog that pops up, select Drop Database

Delete worksheets

  1. On the left side menu, under Work with data, hover over Projects, then click Workspaces
  2. In the My Workspaces side bar, hover over each workspace file used for this lab to reveal the ... additional options and click on it
  3. Select Delete, and then Delete again in the confirmation dialog that pops up.
  4. Repeat for all SQL workspace files created for this lab.

Delete external volumes

  1. On the left side menu, under Horizon Catalog, hover over Catalog, then click on External Data
  2. Click the ... to the right of CODELABS_RETL_EXT_VOL, select Drop external volume, and then Drop external volume again in the confirmation dialog

9. Congratulations

Congratulations on completing the codelab.

What we've covered

  • How to load data into Snowflake
  • How to create a GCS Bucket
  • How to export a Snowflake table to GCS in the Iceberg format
  • How to set up a Spanner instance
  • How to export BigLake external tables from BigQuery to Spanner