1. Build a reverse ETL pipeline from Databricks to Spanner using GCS and Dataflow
Introduction
In this codelab, you will build a Reverse ETL pipeline from Databricks to Spanner using CSV files stored in Google Cloud Storage. Traditionally, ETL (Extract, Transform, Load) pipelines move data from operational databases into a data warehouse like Databricks for analytics. A Reverse ETL pipeline does the opposite: it moves curated, processed data from the data warehouse back into operational systems where it can power applications, serve user-facing features, or be used for real-time decision-making.
The goal is to move a sample dataset from a Databricks table into Spanner, a globally distributed relational database ideal for high-availability applications.
To achieve this, Google Cloud Storage (GCS) and Dataflow are used as intermediate steps. Here's a breakdown of the data flow and the reasoning behind this architecture:
- Databricks to Google Cloud Storage (GCS) in CSV Format:
- The first step is to get the data out of Databricks in an open, universal format. Exporting to CSV is a common and straightforward method for creating portable data files. These files will be staged in GCS, which provides a scalable and durable object storage solution.
- GCS to Spanner (via Dataflow):
- Instead of writing a custom script to read from GCS and write to Spanner, Google Dataflow, a fully managed data processing service, is used. Dataflow provides pre-built templates specifically for this kind of task. Using the "GCS Text to Cloud Spanner" template allows for a high-throughput, parallelized data import without writing any data processing code, saving significant development time.
What you'll learn
- How to load data into Databricks
- How to create a GCS Bucket
- How to export a Databricks table to GCS in the CSV format
- How to set up a Spanner instance
- How to load CSV Tables to Spanner with Dataflow
2. Setup, Requirements & Limitations
Prerequisites
- A Databricks account with permissions to create clusters and install libraries. A free trial account is not sufficient for this lab.
- A Google Cloud account with Spanner, Cloud Storage, and Dataflow APIs enabled.
- Access to the Google Cloud Console through a web browser.
- A terminal with the Google Cloud CLI installed.
- If your Google Cloud organization has the iam.allowedPolicyMemberDomains policy enabled, an administrator may need to grant an exception to allow service accounts from external domains. This will be covered in a later step where applicable.
Google Cloud Platform IAM Permissions
The Google account will need the following permissions to execute all the steps in this codelab.
Service Accounts
- Allows the creation of Service Accounts.
Spanner
- Allows creating a new Spanner instance.
- Allows running DDL statements to create the database.
- Allows running DDL statements to create tables in the database.
Google Cloud Storage
- Allows creating a new GCS bucket to store the exported CSV files.
- Allows writing the exported CSV files to the GCS bucket.
- Allows Dataflow to read the CSV files from the GCS bucket.
- Allows Dataflow to list the CSV files in the GCS bucket.
Dataflow
- Allows claiming of work items from Dataflow.
- Allows the Dataflow worker to send messages back to the Dataflow service.
- Allows Dataflow workers to write log entries to Google Cloud Logging.
For convenience, predefined roles that contain these permissions can be used.
Limitations
It is important to be aware of data type differences when moving data between systems.
- Databricks to CSV: When exporting, Databricks data types are converted to standard text representations.
- CSV to Spanner: When importing, it is necessary to ensure that the target Spanner data types are compatible with the string representations in the CSV file. This lab guides through a common set of type mappings.
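As an illustration of the CSV-to-Spanner side (hypothetical helpers, not part of the lab), fields destined for INT64 or NUMERIC columns can be sanity-checked before import:

```shell
# Hypothetical validators: check that a CSV field's text representation is
# compatible with the Spanner column type it will be loaded into.
is_int64() {
  printf '%s' "$1" | grep -Eq '^-?[0-9]+$'
}

is_numeric() {
  printf '%s' "$1" | grep -Eq '^-?[0-9]+(\.[0-9]+)?$'
}
```

For example, is_int64 1995 succeeds, while is_int64 1995.0 fails, signaling that the column should be NUMERIC or the value should be cast before export.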
Set Up Reusable Properties
A few values will be needed repeatedly throughout this lab. To make this easier, set them as shell variables for later use.
- GCP_REGION - The specific region where the GCP resources will be located. The list of regions can be found here.
- GCP_PROJECT - The GCP Project id to use.
- GCS_BUCKET_NAME - The GCS Bucket name to be created, and where the data files will be stored.
- SPANNER_INSTANCE - The id of the Spanner instance to be created.
- SPANNER_DB - The id of the Spanner database to be created.
export GCP_REGION=<GCP REGION HERE>
export GCP_PROJECT=<GCP PROJECT HERE>
export GCS_BUCKET_NAME=<GCS BUCKET NAME HERE>
export SPANNER_INSTANCE=<SPANNER INSTANCE ID HERE>
export SPANNER_DB=<SPANNER DATABASE ID HERE>
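For reference, a filled-in example using hypothetical values (substitute your own):

```shell
# Example values only -- replace with your own region, project, and names.
export GCP_REGION=us-central1
export GCP_PROJECT=my-retl-project
export GCS_BUCKET_NAME=my-retl-bucket
export SPANNER_INSTANCE=codelabs-retl-instance
export SPANNER_DB=codelabs-retl-db
```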
Databricks
For this lab, a Databricks account hosted on GCP is required, to allow for defining an external data location in GCS.
Google Cloud
This lab requires a Google Cloud project.
Google Cloud Project
A Project is a basic unit of organization in Google Cloud. If an administrator has provided one for use, this step may be skipped.
A project can be created using the CLI like this:
gcloud projects create $GCP_PROJECT
gcloud config set project $GCP_PROJECT
Learn more about creating and managing projects here.
Set Up Spanner
To start using Spanner, you need to provision an instance, and a database. Details about configuring and creating a Spanner Instance can be found here.
Create the Instance
gcloud spanner instances create $SPANNER_INSTANCE \
--config=regional-$GCP_REGION \
--description="Codelabs Databricks RETL" \
--processing-units=100 \
--edition=ENTERPRISE
Create the Database
gcloud spanner databases create $SPANNER_DB \
--instance=$SPANNER_INSTANCE
3. Create a Google Cloud Storage Bucket
Google Cloud Storage (GCS) will be used to temporarily store the CSV data files generated by Databricks before they are imported into Spanner.
Create the bucket
Use the following command to create a storage bucket in a specific region.
gcloud storage buckets create gs://$GCS_BUCKET_NAME --location=$GCP_REGION
Verify bucket creation
Once that command succeeds, check the result by listing all buckets. The new bucket should appear in the resulting list. Bucket references usually show up with the prefix gs:// in front of the bucket name.
gcloud storage ls | grep gs://$GCS_BUCKET_NAME
Test write permissions
This step ensures that the local environment is correctly authenticated and has the necessary permissions to write files to the newly created bucket.
echo "Hello, GCS" | gcloud storage cp - gs://$GCS_BUCKET_NAME/hello.txt
Verify the uploaded file
List the objects in the bucket. The full path of the file just uploaded should appear.
gcloud storage ls gs://$GCS_BUCKET_NAME
You should see the following output:
gs://$GCS_BUCKET_NAME/hello.txt
To view the contents of an object in a bucket, gcloud storage cat can be used.
gcloud storage cat gs://$GCS_BUCKET_NAME/hello.txt
The contents of the file should be visible:
Hello, GCS
Clean up the test file
The Cloud Storage bucket is now set up. The temporary test file can now be deleted.
gcloud storage rm gs://$GCS_BUCKET_NAME/hello.txt
The output should confirm the deletion:
Removing gs://$GCS_BUCKET_NAME/hello.txt... / [1 objects] Operation completed over 1 objects.
4. Export from Databricks to GCS
Now, the Databricks environment will be configured to securely connect to GCS and export data.
Create Credentials
- On the left side menu, click on Catalog
- Click on External Data if it is available at the top of the catalog page. Otherwise, click on the Connect dropdown and then click on Credentials
- Switch to the Credentials tab if you are not already on it.
- Click on Create Credentials
- Select GCP Service Account for Credential Type
- Enter codelabs-retl-credentials for Credential Name
- Click Create
- Copy the service account email from the dialog box and click Done
Set this service account to an environment variable in your shell instance for re-use:
export GCP_SERVICE_ACCOUNT=<Your service account>
Grant GCS Permissions to Databricks
Now, the Databricks service account must be granted permission to write to the GCS bucket.
gcloud storage buckets add-iam-policy-binding gs://$GCS_BUCKET_NAME \
--member="serviceAccount:$GCP_SERVICE_ACCOUNT" \
--role="roles/storage.objectAdmin"
gcloud storage buckets add-iam-policy-binding gs://$GCS_BUCKET_NAME \
--member="serviceAccount:$GCP_SERVICE_ACCOUNT" \
--role="roles/storage.legacyBucketReader"
Create External Location
- Navigate back to the Credentials page using the bread crumbs at the top of the page
- Switch to the External Location tab
- Click on Create external location
- Set External Location Name to codelabs-retl-gcs
- Keep Storage Type as GCP
- Set the URL field to your bucket path
- Set Storage Credential to codelabs-retl-credentials
- Click Create
- On the confirmation dialog, click Create
Create Catalog and Schema
- On the left side menu, click on Catalog
- Click Create, then Create a catalog
- Set Catalog Name to retl_tpch_project
- Set Type to Standard
- Select codelabs-retl-gcs as external location
- Click Create
- Click on retl_tpch_project from the Catalog list
- Click on Create schema
- Set Schema name to tpch_data
- Select Storage location to be codelabs-retl-gcs
- Click Create
Export Data as CSV
Now the data is ready for export. The sample TPC-H dataset will be used to define our new table that will be externally stored as CSV.
First, copy the sample data into a new table in the workspace. To do so, SQL code will need to be run from a query.
- On the left side menu under SQL, click on Queries
- Click on the Create query button
- Next to the Run button, set the Workspace to retl_tpch_project, then run the following query:
CREATE TABLE retl_tpch_project.tpch_data.regional_sales_csv
USING CSV
LOCATION 'gs://<Your bucket name>/regional_sales_csv'
OPTIONS (
header "false",
delimiter ","
)
AS
SELECT
n.n_name AS nation_name,
c.c_mktsegment AS market_segment,
YEAR(o.o_orderdate) AS order_year,
o.o_orderpriority AS order_priority,
COUNT(o.o_orderkey) AS total_order_count,
ROUND(SUM(o.o_totalprice), 2) AS total_revenue,
COUNT(DISTINCT c.c_custkey) AS unique_customer_count
FROM samples.tpch.orders AS o
INNER JOIN samples.tpch.customer AS c
ON o.o_custkey = c.c_custkey
INNER JOIN samples.tpch.nation AS n
ON c.c_nationkey = n.n_nationkey
GROUP BY 1, 2, 3, 4;
Verify Data in GCS
Check the GCS bucket to see the files Databricks created.
gcloud storage ls gs://$GCS_BUCKET_NAME/regional_sales_csv/
One or more .csv files should be visible, along with _SUCCESS and log files.
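Before importing, it can be useful to confirm how many data rows the export produced; the same count should come back from Spanner after the load. A minimal sketch, assuming the shards have first been copied to a local directory (for example with gcloud storage cp 'gs://$GCS_BUCKET_NAME/regional_sales_csv/*.csv' ./export/):

```shell
# count_csv_rows: sum the data rows across all CSV shards in a directory.
# The export was written with header "false", so every line is a data row.
count_csv_rows() {
  cat "$1"/*.csv 2>/dev/null | awk 'END { print NR }'
}
```

Run count_csv_rows ./export and note the number; it should match the COUNT(*) query against Spanner later in the lab.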
5. Load Data into Spanner with Dataflow
A Google-provided Dataflow template will be used to import the CSV data from GCS into Spanner.
Create the Spanner Table
First, create the destination table in Spanner. The schema needs to be compatible with the data in the CSV files.
gcloud spanner databases ddl update $SPANNER_DB \
--instance=$SPANNER_INSTANCE \
--ddl="$(cat <<EOF
CREATE TABLE regional_sales (
nation_name STRING(MAX),
market_segment STRING(MAX),
order_year INT64,
order_priority STRING(MAX),
total_order_count INT64,
total_revenue NUMERIC,
unique_customer_count INT64
) PRIMARY KEY (nation_name, market_segment, order_year, order_priority);
EOF
)"
Create the Dataflow Manifest
The Dataflow template requires a "manifest" file. This is a JSON file that tells the template where to find the source data files and which Spanner table to load them into.
Define and upload a new regional_sales_manifest.json to the GCS bucket:
cat <<EOF | gcloud storage cp - gs://$GCS_BUCKET_NAME/regional_sales_manifest.json
{
"tables": [
{
"table_name": "regional_sales",
"file_patterns": [
"gs://$GCS_BUCKET_NAME/regional_sales_csv/*.csv"
]
}
]
}
EOF
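If the CSV columns need to be matched to the Spanner table columns explicitly, the manifest format also accepts an optional per-table columns list. A sketch of that variant (field names as documented for the template's import manifest; verify against the current template documentation before relying on it):

```json
{
  "tables": [
    {
      "table_name": "regional_sales",
      "file_patterns": [
        "gs://$GCS_BUCKET_NAME/regional_sales_csv/*.csv"
      ],
      "columns": [
        {"column_name": "nation_name", "type_name": "STRING"},
        {"column_name": "order_year", "type_name": "INT64"}
      ]
    }
  ]
}
```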
Enable Dataflow API
Before using Dataflow, the Dataflow API first needs to be enabled. Do so with:
gcloud services enable dataflow.googleapis.com --project=$GCP_PROJECT
Create and Run the Dataflow Job
The import job is now ready to run. This command launches a Dataflow job using the GCS_Text_to_Cloud_Spanner template.
The command is long and has several parameters. Here is a breakdown:
- --gcs-location: The path to the pre-built template on GCS.
- --region: The region where the Dataflow job will run.
- --staging-location: A GCS path where the job stages temporary files.
- --parameters: A list of key-value pairs specific to the template:
  - instanceId, databaseId: The target Spanner instance and database.
  - importManifest: The GCS path to the manifest file just created.
  - escape: The escape character used when parsing the CSV files.
gcloud dataflow jobs run spanner-import-from-gcs \
--gcs-location=gs://dataflow-templates/latest/GCS_Text_to_Cloud_Spanner \
--region=$GCP_REGION \
--staging-location=gs://$GCS_BUCKET_NAME/staging \
--parameters \
instanceId=$SPANNER_INSTANCE,\
databaseId=$SPANNER_DB,\
importManifest=gs://$GCS_BUCKET_NAME/regional_sales_manifest.json,escape='\'
The status of the Dataflow job can be checked with the following command:
gcloud dataflow jobs list \
--filter="name:spanner-import-from-gcs" \
--region="$GCP_REGION" \
--sort-by="~creationTime" \
--limit=1
The job should take about 5 minutes to complete.
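To wait for completion from a script rather than re-running the list command by hand, the job state can be polled until it reaches a terminal value. A sketch, assuming the state strings Done, Failed, and Cancelled as shown in gcloud dataflow jobs list output (the polling loop is commented out so the helper can be read on its own):

```shell
# is_terminal: succeeds when a Dataflow job state string is terminal.
is_terminal() {
  case "$1" in
    Done|Failed|Cancelled) return 0 ;;
    *) return 1 ;;
  esac
}

# Example polling loop (requires the variables and job from the steps above):
# while true; do
#   state=$(gcloud dataflow jobs list \
#     --filter="name:spanner-import-from-gcs" \
#     --region="$GCP_REGION" \
#     --sort-by="~creationTime" --limit=1 \
#     --format="value(state)")
#   echo "Job state: $state"
#   is_terminal "$state" && break
#   sleep 30
# done
```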
Verify data in Spanner
Once the Dataflow job succeeds, verify that the data has been loaded into Spanner.
First, check the row count; it should be 4375.
gcloud spanner databases execute-sql $SPANNER_DB \
--instance=$SPANNER_INSTANCE \
--sql='SELECT COUNT(*) FROM regional_sales;'
Next, query a few rows to inspect the data.
gcloud spanner databases execute-sql $SPANNER_DB \
--instance=$SPANNER_INSTANCE \
--sql='SELECT * FROM regional_sales LIMIT 5'
The imported data from the Databricks table should be visible.
6. Clean-up
Clean up Spanner
Delete the Spanner instance; this also deletes its databases.
gcloud spanner instances delete $SPANNER_INSTANCE
Clean up GCS
Delete the GCS Bucket created to host the data
gcloud storage rm --recursive gs://$GCS_BUCKET_NAME
Clean up Databricks
Delete Catalog/Schema/Table
- Sign into your Databricks instance
- Click on Catalog from the left side menu
- Select the previously created retl_tpch_project from the catalog list
- In the Schema list, select the tpch_data schema that was created
- Select the previously created regional_sales_csv from the table list
- Expand the table options by clicking on the options menu and select Delete
- Click Delete on the confirmation dialog to delete the table
- Once the table is deleted, you will be brought back up to the schema page
- Expand the schema options by clicking on the options menu and select Delete
- Click Delete on the confirmation dialog to delete the Schema
- Once the schema is deleted, you will be brought back up to the catalog page
- Follow the table and schema deletion steps again to delete the default schema if one exists.
- From the catalog page, expand the catalog options by clicking on the options menu and select Delete
- Click Delete on the confirmation dialog to delete the catalog
Delete External Data Location / Credentials
- From the Catalog screen, click on External Data
- If you do not see an External Data option, you may find External Location listed under a Connect dropdown instead.
- Click on the codelabs-retl-gcs external data location previously created
- From the external location page, expand the location options by clicking on the options menu and select Delete
- Click Delete on the confirmation dialog to delete the external location
- Click on Credentials
- Click on the codelabs-retl-credentials credential that was previously created
- From the credential page, expand the credential options by clicking on the options menu and select Delete
- Click Delete on the confirmation dialog to delete the credentials.
7. Congratulations
Congratulations on completing the codelab.
What we've covered
- How to load data into Databricks
- How to create a GCS Bucket
- How to export a Databricks table to GCS in the CSV format
- How to set up a Spanner instance
- How to load CSV Tables to Spanner with Dataflow