1. Build a reverse ETL pipeline from Databricks to Spanner using GCS and Dataflow
Introduction
In this codelab, you will build a Reverse ETL pipeline from Databricks to Spanner using CSV files stored in Google Cloud Storage. Traditionally, ETL (Extract, Transform, Load) pipelines move data from operational databases into a data warehouse like Databricks for analytics. A Reverse ETL pipeline does the opposite: it moves curated, processed data from the data warehouse back into operational systems where it can power applications, serve user-facing features, or be used for real-time decision-making.
The goal is to move a sample dataset from a Databricks table into Spanner, a globally distributed relational database ideal for high-availability applications.
To achieve this, Google Cloud Storage (GCS) and Dataflow are used as intermediate steps. Here's a breakdown of the data flow and the reasoning behind this architecture:
- Databricks to Google Cloud Storage (GCS) in CSV Format:
- The first step is to get the data out of Databricks in an open, universal format. Exporting to CSV is a common and straightforward method for creating portable data files. These files will be staged in GCS, which provides a scalable and durable object storage solution.
- GCS to Spanner (via Dataflow):
- Instead of writing a custom script to read from GCS and write to Spanner, Google Dataflow, a fully managed data processing service, is used. Dataflow provides pre-built templates specifically for this kind of task. Using the "GCS Text to Cloud Spanner" template allows for a high-throughput, parallelized data import without writing any data processing code, saving significant development time.
What you'll learn
- How to load data into Databricks
- How to create a GCS Bucket
- How to export a Databricks table to GCS in the CSV format
- How to set up a Spanner instance
- How to load CSV Tables to Spanner with Dataflow
2. Setup, Requirements & Limitations
Prerequisites
- A Databricks account with permissions to create clusters and install libraries. A free trial account is not sufficient for this lab.
- A Google Cloud account with Spanner, Cloud Storage, and Dataflow APIs enabled.
- Access to the Google Cloud Console through a web browser.
- A terminal with the Google Cloud CLI installed.
- If your Google Cloud organization has the iam.allowedPolicyMemberDomains policy enabled, an administrator may need to grant an exception to allow service accounts from external domains. This will be covered in a later step where applicable.
Google Cloud Platform IAM Permissions
The Google account will need the following permissions to execute all the steps in this codelab.
Service Accounts
- Allows the creation of Service Accounts.
Spanner
- Allows creating a new Spanner instance.
- Allows running DDL statements to create the database.
- Allows running DDL statements to create tables in the database.
Google Cloud Storage
- Allows creating a new GCS bucket to store the exported CSV files.
- Allows writing the exported CSV files to the GCS bucket.
- Allows Dataflow to read the CSV files from the GCS bucket.
- Allows Dataflow to list the CSV files in the GCS bucket.
Dataflow
- Allows claiming of work items from Dataflow.
- Allows the Dataflow worker to send messages back to the Dataflow service.
- Allows Dataflow workers to write log entries to Google Cloud Logging.
For convenience, predefined roles that contain these permissions can be used.
Limitations
It is important to be aware of data type differences when moving data between systems.
- Databricks to CSV: When exporting, Databricks data types are converted to standard text representations.
- CSV to Spanner: When importing, it is necessary to ensure that the target Spanner data types are compatible with the string representations in the CSV file. This lab guides through a common set of type mappings.
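As an illustration of the CSV-to-Spanner side (hypothetical helpers, not part of the lab), fields destined for INT64 or NUMERIC columns can be sanity-checked before import:

```shell
# Hypothetical validators: check that a CSV field's text representation is
# compatible with the Spanner column type it will be loaded into.
is_int64() {
  printf '%s' "$1" | grep -Eq '^-?[0-9]+$'
}

is_numeric() {
  printf '%s' "$1" | grep -Eq '^-?[0-9]+(\.[0-9]+)?$'
}
```

For example, is_int64 1995 succeeds, while is_int64 1995.0 fails, signaling that the column should be NUMERIC or the value should be cast before export.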
Set Up Reusable Properties
A few values will be needed repeatedly throughout this lab. To make this easier, set them as shell variables for later use.
- GCP_REGION - The specific region where the GCP resources will be located. The list of regions can be found here.
- GCP_PROJECT - The GCP Project id to use.
- GCS_BUCKET_NAME - The GCS Bucket name to be created, and where the data files will be stored.
- SPANNER_INSTANCE - The id of the Spanner instance to be created.
- SPANNER_DB - The id of the Spanner database to be created.
export GCP_REGION=<GCP REGION HERE>
export GCP_PROJECT=<GCP PROJECT HERE>
export GCS_BUCKET_NAME=<GCS BUCKET NAME HERE>
export SPANNER_INSTANCE=<SPANNER INSTANCE ID HERE>
export SPANNER_DB=<SPANNER DATABASE ID HERE>
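For reference, a filled-in example using hypothetical values (substitute your own):

```shell
# Example values only -- replace with your own region, project, and names.
export GCP_REGION=us-central1
export GCP_PROJECT=my-retl-project
export GCS_BUCKET_NAME=my-retl-bucket
export SPANNER_INSTANCE=codelabs-retl-instance
export SPANNER_DB=codelabs-retl-db
```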
Databricks
For this lab, a Databricks account hosted on GCP is required, to allow for defining an external data location in GCS.
Google Cloud
This lab requires a Google Cloud project.
Google Cloud Project
A Project is a basic unit of organization in Google Cloud. If an administrator has provided one for use, this step may be skipped.
A project can be created using the CLI like this:
gcloud projects create $GCP_PROJECT
gcloud config set project $GCP_PROJECT
Learn more about creating and managing projects here.
Set Up Spanner
To start using Spanner, you need to provision an instance, and a database. Details about configuring and creating a Spanner Instance can be found here.
Create the Instance
gcloud spanner instances create $SPANNER_INSTANCE \
--config=regional-$GCP_REGION \
--description="Codelabs Databricks RETL" \
--processing-units=100 \
--edition=ENTERPRISE
Create the Database
gcloud spanner databases create $SPANNER_DB \
--instance=$SPANNER_INSTANCE
3. Create a Google Cloud Storage Bucket
Google Cloud Storage (GCS) will be used to temporarily store the CSV data files generated by Databricks before they are imported into Spanner.
Create the bucket
Use the following command to create a storage bucket in a specific region.
gcloud storage buckets create gs://$GCS_BUCKET_NAME --location=$GCP_REGION
Verify bucket creation
Once that command succeeds, check the result by listing all buckets. The new bucket should appear in the resulting list. Bucket references usually show up with the prefix gs:// in front of the bucket name.
gcloud storage ls | grep gs://$GCS_BUCKET_NAME
Test write permissions
This step ensures that the local environment is correctly authenticated and has the necessary permissions to write files to the newly created bucket.
echo "Hello, GCS" | gcloud storage cp - gs://$GCS_BUCKET_NAME/hello.txt
Verify the uploaded file
List the objects in the bucket. The full path of the file just uploaded should appear.
gcloud storage ls gs://$GCS_BUCKET_NAME
You should see the following output:
gs://$GCS_BUCKET_NAME/hello.txt
To view the contents of an object in a bucket, gcloud storage cat can be used.
gcloud storage cat gs://$GCS_BUCKET_NAME/hello.txt
The contents of the file should be visible:
Hello, GCS
Clean up the test file
The Cloud Storage bucket is now set up. The temporary test file can now be deleted.
gcloud storage rm gs://$GCS_BUCKET_NAME/hello.txt
The output should confirm the deletion:
Removing gs://$GCS_BUCKET_NAME/hello.txt... / [1 objects] Operation completed over 1 objects.
4. Export from Databricks to GCS
Now, the Databricks environment will be configured to securely connect to GCS and export data.
Create Credentials
- On the left side menu, click on Catalog
- Click on External Data if it is available at the top of the catalog page. Otherwise, click on the Connect dropdown and then click on Credentials
- Switch to the Credentials tab if you are not already on it.
- Click on Create Credentials
- Select GCP Service Account for Credential Type
- Enter codelabs-retl-credentials for Credential Name
- Click Create
- Copy the service account email from the dialog box and click Done
Set this service account to an environment variable in your shell instance for re-use:
export GCP_SERVICE_ACCOUNT=<Your service account>
Grant GCS Permissions to Databricks
Now, the Databricks service account must be granted permission to write to the GCS bucket.
gcloud storage buckets add-iam-policy-binding gs://$GCS_BUCKET_NAME \
--member="serviceAccount:$GCP_SERVICE_ACCOUNT" \
--role="roles/storage.objectAdmin"
gcloud storage buckets add-iam-policy-binding gs://$GCS_BUCKET_NAME \
--member="serviceAccount:$GCP_SERVICE_ACCOUNT" \
--role="roles/storage.legacyBucketReader"
Create External Location
- Navigate back to the Credentials page using the bread crumbs at the top of the page
- Switch to the External Location tab
- Click on Create external location
- Set External Location Name to codelabs-retl-gcs
- Keep Storage Type as GCP
- Set the URL field to your bucket path
- Set Storage Credential to codelabs-retl-credentials
- Click Create
- On the confirmation dialog, click Create
Create Catalog and Schema
- On the left side menu, click on Catalog
- Click Create, then Create a catalog
- Set Catalog Name to retl_tpch_project
- Set Type to Standard
- Select codelabs-retl-gcs as external location
- Click Create
- Click on retl_tpch_project from the Catalog list
- Click on Create schema
- Set Schema name to tpch_data
- Select Storage location to be codelabs-retl-gcs
- Click Create
Export Data as CSV
Now the data is ready for export. The sample TPC-H dataset will be used to define our new table that will be externally stored as CSV.
First, copy the sample data into a new table in the workspace. To do so, SQL code will need to be run from a query.
- On the left side menu under SQL, click on Queries
- Click on the Create query button
- Next to the Run button, set the Workspace to retl_tpch_project, then run the following query:
CREATE TABLE retl_tpch_project.tpch_data.regional_sales_csv
USING CSV
LOCATION 'gs://<Your bucket name>/regional_sales_csv'
OPTIONS (
header "false",
delimiter ","
)
AS
SELECT
n.n_name AS nation_name,
c.c_mktsegment AS market_segment,
YEAR(o.o_orderdate) AS order_year,
o.o_orderpriority AS order_priority,
COUNT(o.o_orderkey) AS total_order_count,
ROUND(SUM(o.o_totalprice), 2) AS total_revenue,
COUNT(DISTINCT c.c_custkey) AS unique_customer_count
FROM samples.tpch.orders AS o
INNER JOIN samples.tpch.customer AS c
ON o.o_custkey = c.c_custkey
INNER JOIN samples.tpch.nation AS n
ON c.c_nationkey = n.n_nationkey
GROUP BY 1, 2, 3, 4;
Verify Data in GCS
Check the GCS bucket to see the files Databricks created.
gcloud storage ls gs://$GCS_BUCKET_NAME/regional_sales_csv/
One or more .csv files should be visible, along with _SUCCESS and log files.
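Before importing, it can be useful to confirm how many data rows the export produced; the same count should come back from Spanner after the load. A minimal sketch, assuming the shards have first been copied to a local directory (for example with gcloud storage cp 'gs://$GCS_BUCKET_NAME/regional_sales_csv/*.csv' ./export/):

```shell
# count_csv_rows: sum the data rows across all CSV shards in a directory.
# The export was written with header "false", so every line is a data row.
count_csv_rows() {
  cat "$1"/*.csv 2>/dev/null | awk 'END { print NR }'
}
```

Run count_csv_rows ./export and note the number; it should match the COUNT(*) query against Spanner later in the lab.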
5. Load Data into Spanner with Dataflow
A Google-provided Dataflow template will be used to import the CSV data from GCS into Spanner.
Create the Spanner Table
First, create the destination table in Spanner. The schema needs to be compatible with the data in the CSV files.
gcloud spanner databases ddl update $SPANNER_DB \
--instance=$SPANNER_INSTANCE \
--ddl="$(cat <<EOF
CREATE TABLE regional_sales (
nation_name STRING(MAX),
market_segment STRING(MAX),
order_year INT64,
order_priority STRING(MAX),
total_order_count INT64,
total_revenue NUMERIC,
unique_customer_count INT64
) PRIMARY KEY (nation_name, market_segment, order_year, order_priority);
EOF
)"
Create the Dataflow Manifest
The Dataflow template requires a "manifest" file. This is a JSON file that tells the template where to find the source data files and which Spanner table to load them into.
Define and upload a new regional_sales_manifest.json to the GCS bucket:
cat <<EOF | gcloud storage cp - gs://$GCS_BUCKET_NAME/regional_sales_manifest.json
{
"tables": [
{
"table_name": "regional_sales",
"file_patterns": [
"gs://$GCS_BUCKET_NAME/regional_sales_csv/*.csv"
]
}
]
}
EOF
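If the CSV columns need to be matched to the Spanner table columns explicitly, the manifest format also accepts an optional per-table columns list. A sketch of that variant (field names as documented for the template's import manifest; verify against the current template documentation before relying on it):

```json
{
  "tables": [
    {
      "table_name": "regional_sales",
      "file_patterns": [
        "gs://$GCS_BUCKET_NAME/regional_sales_csv/*.csv"
      ],
      "columns": [
        {"column_name": "nation_name", "type_name": "STRING"},
        {"column_name": "order_year", "type_name": "INT64"}
      ]
    }
  ]
}
```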
Enable Dataflow API
Before using Dataflow, the Dataflow API first needs to be enabled. Do so with:
gcloud services enable dataflow.googleapis.com --project=$GCP_PROJECT
Create and Run the Dataflow Job
The import job is now ready to run. This command launches a Dataflow job using the GCS_Text_to_Cloud_Spanner template.
The command is long and has several parameters. Here is a breakdown:
- --gcs-location: The path to the pre-built template on GCS.
- --region: The region where the Dataflow job will run.
- --staging-location: A GCS path where the job stages temporary files.
- --parameters: A list of key-value pairs specific to the template:
  - instanceId, databaseId: The target Spanner instance and database.
  - importManifest: The GCS path to the manifest file just created.
  - escape: The escape character used when parsing the CSV files.
gcloud dataflow jobs run spanner-import-from-gcs \
--gcs-location=gs://dataflow-templates/latest/GCS_Text_to_Cloud_Spanner \
--region=$GCP_REGION \
--staging-location=gs://$GCS_BUCKET_NAME/staging \
--parameters \
instanceId=$SPANNER_INSTANCE,\
databaseId=$SPANNER_DB,\
importManifest=gs://$GCS_BUCKET_NAME/regional_sales_manifest.json,escape='\'
The status of the Dataflow job can be checked with the following command:
gcloud dataflow jobs list \
--filter="name:spanner-import-from-gcs" \
--region="$GCP_REGION" \
--sort-by="~creationTime" \
--limit=1
The job should take about 5 minutes to complete.
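To wait for completion from a script rather than re-running the list command by hand, the job state can be polled until it reaches a terminal value. A sketch, assuming the state strings Done, Failed, and Cancelled as shown in gcloud dataflow jobs list output (the polling loop is commented out so the helper can be read on its own):

```shell
# is_terminal: succeeds when a Dataflow job state string is terminal.
is_terminal() {
  case "$1" in
    Done|Failed|Cancelled) return 0 ;;
    *) return 1 ;;
  esac
}

# Example polling loop (requires the variables and job from the steps above):
# while true; do
#   state=$(gcloud dataflow jobs list \
#     --filter="name:spanner-import-from-gcs" \
#     --region="$GCP_REGION" \
#     --sort-by="~creationTime" --limit=1 \
#     --format="value(state)")
#   echo "Job state: $state"
#   is_terminal "$state" && break
#   sleep 30
# done
```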
Verify data in Spanner
Once the Dataflow job succeeds, verify that the data has been loaded into Spanner.
First, check the row count; it should be 4375.
gcloud spanner databases execute-sql $SPANNER_DB \
--instance=$SPANNER_INSTANCE \
--sql='SELECT COUNT(*) FROM regional_sales;'
Next, query a few rows to inspect the data.
gcloud spanner databases execute-sql $SPANNER_DB \
--instance=$SPANNER_INSTANCE \
--sql='SELECT * FROM regional_sales LIMIT 5'
The imported data from the Databricks table should be visible.
6. Clean-up
Clean up Spanner
Delete the Spanner instance; this also deletes its databases.
gcloud spanner instances delete $SPANNER_INSTANCE
Clean up GCS
Delete the GCS Bucket created to host the data
gcloud storage rm --recursive gs://$GCS_BUCKET_NAME
Clean up Databricks
Delete Catalog/Schema/Table
- Sign into your Databricks instance
- Click on Catalog from the left side menu
- Select the previously created retl_tpch_project from the catalog list
- In the Schema list, select the tpch_data schema that was created
- Select the previously created regional_sales_csv from the table list
- Expand the table options by clicking on the options menu and select Delete
- Click Delete on the confirmation dialog to delete the table
- Once the table is deleted, you will be brought back up to the schema page
- Expand the schema options by clicking on the options menu and select Delete
- Click Delete on the confirmation dialog to delete the Schema
- Once the schema is deleted, you will be brought back up to the catalog page
- Follow the table and schema deletion steps again to delete the default schema if one exists.
- From the catalog page, expand the catalog options by clicking on the options menu and select Delete
- Click Delete on the confirmation dialog to delete the catalog
Delete External Data Location / Credentials
- From the Catalog screen, click on External Data
- If you do not see an External Data option, you may find External Location listed under a Connect dropdown instead.
- Click on the codelabs-retl-gcs external data location previously created
- From the external location page, expand the location options by clicking on the options menu and select Delete
- Click Delete on the confirmation dialog to delete the external location
- Click on Credentials
- Click on the codelabs-retl-credentials credential that was previously created
- From the credential page, expand the credential options by clicking on the options menu and select Delete
- Click Delete on the confirmation dialog to delete the credentials.
7. Congratulations
Congratulations on completing the codelab.
What we've covered
- How to load data into Databricks
- How to create a GCS Bucket
- How to export a Databricks table to GCS in the CSV format
- How to set up a Spanner instance
- How to load CSV Tables to Spanner with Dataflow