1. Introduction
Bigtable is a fully managed, high-performance NoSQL database service designed for large analytical and operational workloads. Migrating from existing databases like Apache Cassandra to Bigtable often requires careful planning to minimize downtime and application impact.
This codelab demonstrates a migration strategy from Cassandra to Bigtable using a combination of proxy tools:
- Cassandra-Bigtable Proxy: Allows Cassandra clients and tools (like cqlsh or drivers) to interact with Bigtable using the Cassandra Query Language (CQL) protocol by translating queries.
- Datastax Zero Downtime Migration (ZDM) Proxy: An open-source proxy that sits between your application and your database services (origin Cassandra and target Bigtable via the Cassandra-Bigtable Proxy). It orchestrates dual writes and manages traffic routing, enabling migration with minimal application changes and downtime.
- Cassandra Data Migrator (CDM): An open-source tool used for bulk migrating historical data from the source Cassandra cluster to the target Bigtable instance.
What you'll learn
- How to set up a basic Cassandra cluster on Compute Engine.
- How to create a Bigtable instance.
- How to deploy and configure the Cassandra-Bigtable Proxy to map a Cassandra schema to Bigtable.
- How to deploy and configure the Datastax ZDM Proxy for dual writes.
- How to use the Cassandra Data Migrator tool to bulk-migrate existing data.
- The overall workflow for a proxy-based Cassandra-to-Bigtable migration.
What you'll need
- A Google Cloud project with billing enabled. New users are eligible for a free trial.
- Basic familiarity with Google Cloud concepts like projects, Compute Engine, VPC networks, and firewall rules. Basic familiarity with Linux command-line tools.
- Access to a machine with the gcloud CLI installed and configured, or use the Google Cloud Shell.
For this codelab, we will primarily use virtual machines (VMs) on Compute Engine within the same VPC network and region to simplify networking. Using internal IP addresses is recommended.
2. Set up your environment
1. Select or create a Google Cloud Project
Navigate to the Google Cloud Console and select an existing project or create a new one. Note your Project ID.
2. Choose a region and zone
Select a region and zone for your resources. We'll use us-central1 and us-central1-c as examples. Define these as environment variables for convenience:
export PROJECT_ID="<your-project-id>"
export REGION="us-central1"
export ZONE="us-central1-c"
gcloud config set project $PROJECT_ID
gcloud config set compute/region $REGION
gcloud config set compute/zone $ZONE
3. Enable required APIs
Ensure the Compute Engine API and Bigtable API are enabled for your project.
gcloud services enable compute.googleapis.com bigtable.googleapis.com bigtableadmin.googleapis.com
4. Configure firewall rules
We need to allow communication between our VMs within the default VPC network on several ports:
- Cassandra/Proxies CQL Port: 9042
- Cassandra inter-node Port: 7000
- ZDM Proxy Health Check Port: 14001
- SSH: 22
Create a firewall rule to allow internal traffic on these ports. We'll use a tag cassandra-migration to easily apply this rule to relevant VMs.
gcloud compute firewall-rules create allow-migration-internal \
--network=default \
--action=ALLOW \
--rules=tcp:22,tcp:9042,tcp:7000,tcp:14001 \
--source-ranges=10.0.0.0/8 \
--target-tags=cassandra-migration
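If you want to double-check the rule, you can optionally inspect it (this uses gcloud's field projection to show only the relevant fields):

```shell
# Optional: confirm the firewall rule allows the expected ports,
# source range, and target tag.
gcloud compute firewall-rules describe allow-migration-internal \
  --format='yaml(allowed, sourceRanges, targetTags)'
```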
3. Deploy Cassandra cluster (Origin)
For this codelab, we'll set up a simple single-node Cassandra cluster on Compute Engine. In a real-world scenario, you would connect to your existing cluster.
1. Create a GCE VM for Cassandra
gcloud compute instances create cassandra-origin \
--machine-type=e2-medium \
--image-family=ubuntu-2204-lts \
--image-project=ubuntu-os-cloud \
--tags=cassandra-migration \
--boot-disk-size=20GB \
--scopes=cloud-platform \
--zone="$ZONE"
SSH into your Cassandra instance
gcloud compute ssh --zone="$ZONE" "cassandra-origin"
2. Install Cassandra
# Install Java (Cassandra dependency)
sudo apt-get update
sudo apt-get install -y openjdk-11-jre-headless
# Add Cassandra repository
echo "deb https://debian.cassandra.apache.org 41x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
# Install Cassandra
sudo apt update
sudo apt install -y cassandra
# (Optional) Verify Cassandra is running
sudo systemctl status cassandra
3. Configure Cassandra
We need to configure Cassandra to be accessible within the private network.
Grab the cassandra-origin's private IP by running:
hostname -I
Edit the Cassandra config. You shouldn't need to add any new config lines; just update the ones that are already there:
sudo vim /etc/cassandra/cassandra.yaml
- Set seed_provider.parameters.seeds to "CASSANDRA_ORIGIN_PRIVATE_IP:7000"
- Set rpc_address to CASSANDRA_ORIGIN_PRIVATE_IP
- Set listen_address to CASSANDRA_ORIGIN_PRIVATE_IP
Save the file.
Finally, restart Cassandra to load the config changes:
sudo systemctl restart cassandra
# (Optional) Verify Cassandra is running
sudo systemctl status cassandra
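Cassandra can take a minute or two to come back up after a restart. One way to wait for it, a small sketch using nodetool (which ships with Cassandra), is:

```shell
# Poll nodetool until the local node reports Up/Normal ("UN").
until nodetool status 2>/dev/null | grep -q '^UN'; do
  echo "Waiting for Cassandra to start..."
  sleep 5
done
nodetool status
```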
4. Create a keyspace and table
We'll use an employee table example and create a keyspace called "zdmbigtable".
Note: it may take a minute for Cassandra to start accepting connections.
# Start cqlsh
cqlsh $(hostname -I)
Inside cqlsh:
-- Create keyspace (adjust replication for production)
CREATE KEYSPACE zdmbigtable WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};
-- Use the keyspace
USE zdmbigtable;
-- Create the employee table
CREATE TABLE employee (
    name text PRIMARY KEY,
    age bigint,
    code int,
    credited double,
    balance float,
    is_active boolean,
    birth_date timestamp
);
-- Exit cqlsh
EXIT;
Leave the SSH session open or note the IP address of this VM (hostname -I).
4. Set up Bigtable (Target)
Create a Bigtable instance. We'll use zdmbigtable as the instance ID.
gcloud bigtable instances create zdmbigtable \
--display-name="ZDM Bigtable Target" \
--cluster="bigtable-c1" \
--cluster-zone="$ZONE" \
--cluster-num-nodes=1 # Use 1 node for dev/testing; scale as needed
The Bigtable table itself will be created later by the Cassandra-Bigtable Proxy setup script.
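Before moving on, you can optionally verify that the instance and its cluster were created:

```shell
# Optional: confirm the Bigtable instance and cluster exist.
gcloud bigtable instances describe zdmbigtable
gcloud bigtable clusters list --instances=zdmbigtable
```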
5. Set up Cassandra-Bigtable Proxy
1. Create Compute Engine VM for Cassandra-Bigtable Proxy
gcloud iam service-accounts create bigtable-proxy-sa \
    --description="Service account for Bigtable Proxy access" \
    --display-name="Bigtable Proxy Access SA"
export BIGTABLE_PROXY_SA_EMAIL=$(gcloud iam service-accounts list --filter="displayName='Bigtable Proxy Access SA'" --format="value(email)")
gcloud bigtable instances add-iam-policy-binding zdmbigtable \
  --member="serviceAccount:$BIGTABLE_PROXY_SA_EMAIL" \
  --role="roles/bigtable.admin"
gcloud compute instances create bigtable-proxy-vm \
--machine-type=e2-medium \
--image-family=ubuntu-2204-lts \
--image-project=ubuntu-os-cloud \
--tags=cassandra-migration \
--boot-disk-size=20GB \
--zone=$ZONE \
--scopes=cloud-platform \
--service-account="$BIGTABLE_PROXY_SA_EMAIL"
SSH into the bigtable-proxy-vm:
gcloud compute ssh --zone="$ZONE" "bigtable-proxy-vm"
On the bigtable-proxy-vm run:
# Install Git and Go
sudo apt-get update
sudo apt-get install -y git
wget https://go.dev/dl/go1.23.6.linux-amd64.tar.gz
sudo rm -rf /usr/local/go
sudo tar -C /usr/local -xzf go1.23.6.linux-amd64.tar.gz
echo 'export GOPATH=$HOME/go' >> ~/.profile
echo 'export PATH=$PATH:/usr/local/go/bin:$GOPATH/bin' >> ~/.profile
source ~/.profile
# Clone the proxy repository
git clone https://github.com/GoogleCloudPlatform/cloud-bigtable-ecosystem.git
cd cloud-bigtable-ecosystem/cassandra-bigtable-migration-tools/cassandra-bigtable-proxy/
2. Start the Cassandra-Bigtable Proxy
Start the proxy server.
# At the root of the cassandra-bigtable-proxy directory
go run proxy.go --project-id="$(gcloud config get-value project)" --instance-id=zdmbigtable --keyspace-id=zdmbigtable --rpc-address=$(hostname -I)
The proxy will start and listen on port 9042 for incoming CQL connections. Keep this terminal session running. Note the IP address of this VM (hostname -I)
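From another VM in the same VPC (for example, cassandra-origin), you can optionally confirm the proxy is reachable before connecting with cqlsh. This sketch uses bash's /dev/tcp; substitute the proxy VM's internal IP for the placeholder:

```shell
# Replace with the bigtable-proxy-vm internal IP noted above.
export BIGTABLE_PROXY_IP=<your-bigtable-proxy-vm-ip>

# Succeeds (exit 0) only if something is listening on port 9042.
timeout 3 bash -c "exec 3<>/dev/tcp/$BIGTABLE_PROXY_IP/9042" \
  && echo "Proxy is accepting connections on 9042"
```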
3. Create Table via CQL
Connect CQLSH to the Cassandra-Bigtable Proxy VM's IP address. You can find the IP address by running the following command locally:
gcloud compute instances describe bigtable-proxy-vm --format='get(networkInterfaces[0].networkIP)'
In a separate terminal, SSH into your cassandra-origin VM and use cqlsh to connect to the Cassandra-Bigtable Proxy. Note that we set a longer-than-default request timeout to ensure Bigtable has enough time to create the underlying table. You should see "Connected to cassandra-bigtable-proxy-v0.2.3" or similar, indicating that you've connected to the Bigtable proxy and not the local Cassandra server.
# Replace <your-bigtable-proxy-vm-ip> with the ip from the above command
export BIGTABLE_PROXY_IP=<your-bigtable-proxy-vm-ip>
cqlsh --request-timeout=60 $BIGTABLE_PROXY_IP
-- Create the employee table
CREATE TABLE zdmbigtable.employee (
    name text PRIMARY KEY,
    age bigint,
    code int,
    credited double,
    balance float,
    is_active boolean,
    birth_date timestamp
);
In CQLSH, verify that your table has been created by running:
DESC TABLE zdmbigtable.employee;
6. Set up the ZDM Proxy
We'll create a single ZDM Proxy instance for this lab, but you'll want a multi-node setup for a production migration.
1. Create the ZDM Proxy VM
gcloud compute instances create zdm-proxy-vm \
--machine-type=e2-medium \
--image-family=ubuntu-2204-lts \
--image-project=ubuntu-os-cloud \
--tags=cassandra-migration \
--boot-disk-size=20GB \
--scopes=cloud-platform \
--zone=$ZONE
Note the internal IP addresses of the cassandra-origin and bigtable-proxy-vm VMs; the ZDM Proxy uses them as its origin and target contact points.
2. Prepare the ZDM Proxy
gcloud compute ssh --zone="$ZONE" zdm-proxy-vm
export ZDM_VERSION="2.3.4"
wget "https://github.com/datastax/zdm-proxy/releases/download/v$ZDM_VERSION/zdm-proxy-linux-amd64-v$ZDM_VERSION.tgz"
tar -xvzf "zdm-proxy-linux-amd64-v$ZDM_VERSION.tgz"
# Replace YOUR_ZONE with the zone you chose earlier (e.g., us-central1-c)
gcloud config set compute/zone "YOUR_ZONE"
export ZDM_ORIGIN_CONTACT_POINTS=$(gcloud compute instances describe cassandra-origin --format='get(networkInterfaces[0].networkIP)') 
export ZDM_TARGET_CONTACT_POINTS=$(gcloud compute instances describe bigtable-proxy-vm --format='get(networkInterfaces[0].networkIP)')
export ZDM_ORIGIN_USERNAME=""
export ZDM_ORIGIN_PASSWORD=""
export ZDM_TARGET_USERNAME=""
export ZDM_TARGET_PASSWORD=""
export ZDM_PROXY_LISTEN_ADDRESS=0.0.0.0
export ZDM_PROXY_LISTEN_PORT=9042
./zdm-proxy-v${ZDM_VERSION}
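The proxy runs in the foreground, so keep this session open. From a second SSH session on zdm-proxy-vm, you can check its health endpoints on the metrics port (14001); the exact response payload may vary by ZDM version:

```shell
# The ZDM Proxy exposes liveness/readiness endpoints on its metrics port.
curl -s http://localhost:14001/health/readiness
curl -s http://localhost:14001/health/liveness
```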
7. Configure application & start dual writes
At this stage in a real migration, you would reconfigure your application(s) to point to the ZDM Proxy VM's IP address (e.g., <zdm-proxy-vm-ip>:9042) instead of connecting directly to Cassandra.
Once the application connects to the ZDM Proxy: Reads are served from the Origin (Cassandra) by default. Writes are sent to both the Origin (Cassandra) and the Target (Bigtable, via the Cassandra-Bigtable Proxy). This enables your application to continue functioning normally while ensuring new data is written to both databases simultaneously. You can test the connection using cqlsh pointed at the ZDM Proxy:
cqlsh $(gcloud compute instances describe zdm-proxy-vm --format='get(networkInterfaces[0].networkIP)')
Try inserting some data:
INSERT INTO zdmbigtable.employee (name, age, is_active) VALUES ('Alice', 30, true); 
INSERT INTO zdmbigtable.employee (name, age, is_active) VALUES ('Anna', 45, true); 
INSERT INTO zdmbigtable.employee (name, age, is_active) VALUES ('Albert', 50, false); 
SELECT * FROM zdmbigtable.employee;
This data should be written to both Cassandra and Bigtable. You can confirm this in Bigtable by going to the Google Cloud Console and opening the Bigtable Query Editor for your instance. Run a "SELECT * FROM employee" query; the recently inserted data should be visible.
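You can also compare both sides from the command line. This sketch assumes CASSANDRA_ORIGIN_IP and BIGTABLE_PROXY_IP hold the internal IPs noted in the earlier steps:

```shell
# Read the rows from the origin Cassandra node directly...
cqlsh "$CASSANDRA_ORIGIN_IP" -e "SELECT * FROM zdmbigtable.employee;"

# ...and from the target (Bigtable) via the Cassandra-Bigtable Proxy.
cqlsh "$BIGTABLE_PROXY_IP" -e "SELECT * FROM zdmbigtable.employee;"
```

Both queries should return the same three rows.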
8. Migrate historical data using Cassandra Data Migrator
Now that dual writes are active for new data, use the Cassandra Data Migrator (CDM) tool to copy the existing historical data from Cassandra to Bigtable.
1. Create Compute Engine VM for CDM
This VM needs sufficient memory for Spark.
gcloud compute instances create cdm-migrator-vm \
--machine-type=e2-medium \
--image-family=ubuntu-2204-lts \
--image-project=ubuntu-os-cloud \
--tags=cassandra-migration \
--boot-disk-size=40GB \
--scopes=cloud-platform \
--zone=$ZONE
2. Install prerequisites (Java 11, Spark)
SSH into the cdm-migrator-vm:
gcloud compute ssh cdm-migrator-vm
Inside the VM:
# Install Java 11 
sudo apt-get update 
sudo apt-get install -y openjdk-11-jdk
 
# Verify Java installation 
java -version 
# Download and extract Spark 3.5.3
# Check the Apache Spark archives for the correct URL if needed
wget  https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3-scala2.13.tgz
tar -xvzf spark-3.5.3-bin-hadoop3-scala2.13.tgz
echo 'export SPARK_HOME=$PWD/spark-3.5.3-bin-hadoop3-scala2.13' >> ~/.profile
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.profile
source ~/.profile
3. Download Cassandra Data Migrator
In your browser, open the CDM Packages page and copy the .jar link from the Assets panel. If 5.4.0 isn't available, choose the closest version. Paste the link into the command below and run it on your cdm-migrator-vm instance, preserving the single quotes around the URL.
wget 'JAR_URL_GOES_HERE' -O cassandra-data-migrator.jar
Verify that the jar file was downloaded correctly by scanning it with the jar tool: you should see a long list of ".class" files.
jar tf cassandra-data-migrator.jar 
4. Add some data
We need some existing data to migrate, written directly to cassandra-origin (not through the zdm-proxy-vm). SSH into the cassandra-origin VM, connect with cqlsh as before, and run:
INSERT INTO zdmbigtable.employee (name, age, is_active) VALUES ('Alfred', 67, true); 
INSERT INTO zdmbigtable.employee (name, age, is_active) VALUES ('Bobby', 12, false); 
INSERT INTO zdmbigtable.employee (name, age, is_active) VALUES ('Carol', 29, true); 
5. Run the migration job
Execute the migration using spark-submit. This command tells Spark to run the CDM jar against the origin and target, specifying the keyspace and table to migrate. Adjust the memory settings (--driver-memory, --executor-memory) based on your VM size and data volume.
Make sure you are in the directory containing the CDM jar.
Tip: you can get the internal IP of your cassandra and proxy VMs by running these commands from your local machine:
gcloud compute instances describe cassandra-origin --format='get(networkInterfaces[0].networkIP)'
gcloud compute instances describe bigtable-proxy-vm --format='get(networkInterfaces[0].networkIP)'
export ORIGIN_HOST="<your-cassandra-origin-ip>"
export TARGET_HOST="<your-bigtable-proxy-vm-ip>"
export KEYSPACE_TABLE="zdmbigtable.employee"
spark-submit --verbose --master "local[*]" \
--driver-memory 3G --executor-memory 3G \
--conf spark.cdm.schema.origin.keyspaceTable="$KEYSPACE_TABLE" \
--conf spark.cdm.connect.origin.host="$ORIGIN_HOST" \
--conf spark.cdm.connect.origin.port=9042 \
--conf spark.cdm.connect.target.host="$TARGET_HOST" \
--conf spark.cdm.connect.target.port=9042 \
--conf spark.cdm.feature.origin.ttl.automatic=false \
--conf spark.cdm.feature.origin.writetime.automatic=false \
--conf spark.cdm.feature.target.ttl.automatic=false \
--conf spark.cdm.feature.target.writetime.automatic=false \
--conf spark.cdm.schema.origin.column.ttl.automatic=false \
--conf spark.cdm.schema.ttlwritetime.calc.useCollections=false \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator.jar
6. Verify data migration
Once the CDM job completes successfully, verify that the historical data exists in Bigtable.
cqlsh <bigtable-proxy-vm-ip>
Inside cqlsh:
SELECT COUNT(*) FROM zdmbigtable.employee; -- Check row count matches origin 
SELECT * FROM zdmbigtable.employee LIMIT 10; -- Check some sample data
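To automate the comparison, here is a small sketch that extracts the COUNT(*) value from cqlsh's output on both sides. It assumes ORIGIN_HOST and TARGET_HOST are still set from the migration step:

```shell
# Return the row count reported by cqlsh for a given host.
# cqlsh prints the count on a line by itself, which the awk pattern matches.
count_rows() {
  cqlsh "$1" -e "SELECT COUNT(*) FROM zdmbigtable.employee;" \
    | awk '/^[[:space:]]*[0-9]+[[:space:]]*$/ {print $1}'
}

origin=$(count_rows "$ORIGIN_HOST")
target=$(count_rows "$TARGET_HOST")

if [ "$origin" = "$target" ]; then
  echo "Row counts match: $origin"
else
  echo "Row count mismatch: origin=$origin target=$target" >&2
fi
```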
9. Cutover (conceptual)
After thoroughly verifying data consistency between Cassandra and Bigtable, you can proceed with the final cutover.
With the ZDM Proxy, the cutover involves reconfiguring it to primarily read from the target (Bigtable) instead of the Origin (Cassandra). This is typically done via ZDM Proxy's configuration, effectively shifting your application's read traffic to Bigtable.
Once you are confident that Bigtable is serving all traffic correctly, you can eventually:
- Stop dual writes by reconfiguring the ZDM Proxy.
- Decommission the original Cassandra cluster.
- Remove the ZDM Proxy and have the application connect directly to the Cassandra-Bigtable Proxy or use the native Bigtable CQL Client for Java.
The specifics of ZDM Proxy reconfiguration for cutover are beyond this basic codelab but are detailed in the Datastax ZDM documentation.
10. Clean up
To avoid incurring charges, delete the resources created during this codelab.
1. Delete Compute Engine VMs
gcloud compute instances delete cassandra-origin zdm-proxy-vm bigtable-proxy-vm cdm-migrator-vm --zone=$ZONE --quiet
2. Delete Bigtable instance
gcloud bigtable instances delete zdmbigtable
3. Delete Firewall rules
gcloud compute firewall-rules delete allow-migration-internal
4. Delete Cassandra database (if installed locally or persisted)
If you installed Cassandra outside of a Compute Engine VM created here, follow appropriate steps to remove the data or uninstall Cassandra.
11. Congratulations!
You have successfully walked through the process of setting up a proxy-based migration path from Apache Cassandra to Bigtable!
You learned how to:
- Deploy Cassandra and Bigtable.
- Configure the Cassandra-Bigtable Proxy for CQL compatibility.
- Deploy the Datastax ZDM Proxy to manage dual writes and traffic.
- Use the Cassandra Data Migrator to move historical data.
This approach enables migrations with minimal downtime and minimal application changes by leveraging the proxy layer.
Next steps
- Explore Bigtable Documentation
- Consult the Datastax ZDM Proxy documentation for advanced configurations and cutover procedures.
- Review the Cassandra-Bigtable Proxy repository for more details.
- Check the Cassandra Data Migrator repository for advanced usage.
- Try other Google Cloud Codelabs