1. Overview
This codelab walks you through analyzing a stream of credit card transaction data written to Bigtable. You'll learn how to use the Bigtable change streams to BigQuery template to export data in real time. Then, you'll see techniques for querying the change stream log and for reformatting the data to create a dashboard using Looker.
This codelab is intended for technical users who are familiar with Bigtable, using command line tools, and event streaming services.
This codelab shows you how to do the following:
- Create a Bigtable table with a change stream enabled.
- Create a BigQuery dataset.
- Deploy the Bigtable change streams to BigQuery Dataflow template.
- Query the event stream in BigQuery.
- Visualize the event stream with Looker.
This is the architecture of the system you'll be deploying: transaction data is written to Bigtable, a Dataflow pipeline streams the change log into BigQuery, and Looker visualizes the results.
2. Project setup
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Visit this pre-populated API enablement page to enable the necessary APIs for Dataflow, Bigtable, BigQuery, Looker, and Cloud Storage.
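Alternatively, you can enable the services from Cloud Shell with gcloud. A minimal sketch, assuming gcloud is already configured for your project; the service names below are the standard identifiers for these products, but confirm them in the API Library if enablement fails:
# Enable the APIs used in this codelab.
gcloud services enable dataflow.googleapis.com bigtable.googleapis.com \
  bigtableadmin.googleapis.com bigquery.googleapis.com storage.googleapis.com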
3. Create a BigQuery dataset
You'll be analyzing data using BigQuery later in the codelab. Follow these instructions to create the dataset for your data pipeline's output.
- In the Google Cloud console, go to the BigQuery page.
- In the Explorer pane, find your project name, and click the overflow menu (three vertical dots next to the project ID).
- Click Create dataset.
- In the Create dataset panel, do the following:
- For Dataset ID, enter bigtable_bigquery_tutorial.
- Leave the remaining default settings as they are.
- Click Create dataset.
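If you prefer the command line, the bq tool in Cloud Shell can create the same dataset in one step (a sketch, assuming your default project is already set):
# Create the dataset for the pipeline's output.
bq mk --dataset bigtable_bigquery_tutorial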
4. Create a Bigtable table with a change stream enabled
Bigtable is a low-latency, horizontally scalable NoSQL database service, and one common use case for it is serving financial data. Here you are going to create a table that can store credit card transactions. Bigtable can handle the high throughput writes of transactions around the world and could even use that data for real-time fraud detection.
- In the Google Cloud console, go to the Bigtable Instances page.
- Click the ID of the instance that you are using for this tutorial. If you don't have an instance available, create an instance with a name of your choosing in a region near you. You can use the default configurations for everything else.
- In the left navigation pane, click Tables.
- Click Create table.
- Name the table retail-database.
- Add a column family named transactions.
- Select Enable change stream.
- Leave the garbage collection policy and retention period with their default values.
- Click Create.
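You can also create the table from the command line. A sketch, assuming your instance ID is in the BIGTABLE_INSTANCE_ID variable (set in step 6) and that a one-day change stream retention is acceptable; check the gcloud reference for the current flag names:
# Create the table with a transactions column family and a change stream enabled.
gcloud bigtable instances tables create retail-database \
  --instance=$BIGTABLE_INSTANCE_ID \
  --column-families=transactions \
  --change-stream-retention-period=1d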
5. Initialize a data pipeline to capture the change stream
Bigtable's performance is optimal for point reads and row-range scans, but table-wide analytics can strain serving capacity and CPU resources. BigQuery is great for table-wide analytics, so the solution you'll use here is dual writes: your data lands in both Bigtable and BigQuery, a common technique. This won't require additional coding because you'll use the Bigtable change streams to BigQuery Dataflow template: whenever data is written to Bigtable, the pipeline writes a change record to BigQuery. Once the data is in BigQuery, you can run optimized queries over your entire dataset without impacting the performance of your serving data in Bigtable.
- On the Bigtable Tables page, find your table retail-database.
- In the Change stream column, click Connect.
- In the Connect with Dataflow dialog, select BigQuery.
- Click Create Dataflow job.
- In the provided parameter fields, enter your parameter values. You don't need to provide any optional parameters.
- Set the Cloud Bigtable application profile ID to default.
- Set the BigQuery dataset to bigtable_bigquery_tutorial.
- Click Run job.
- Wait until the job status is Starting or Running before proceeding. It can take around five minutes once the job is queued. The page updates automatically. This template creates a streaming job, which continuously processes new data written to Bigtable until the job is manually stopped.
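If you prefer to launch the pipeline from the command line, the same template can be run with gcloud. This is a sketch: the template location and parameter names match the public Bigtable change streams to BigQuery Flex Template at the time of writing, and us-central1 is an assumed region; adjust both to your setup. It also assumes the PROJECT_ID and BIGTABLE_INSTANCE_ID variables from the next step.
# Launch the Bigtable change streams to BigQuery Flex Template.
gcloud dataflow flex-template run bigtable-changestream-to-bq \
  --project=$PROJECT_ID \
  --region=us-central1 \
  --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/Bigtable_Change_Streams_to_BigQuery \
  --parameters=bigtableReadInstanceId=$BIGTABLE_INSTANCE_ID,bigtableReadTableId=retail-database,bigtableChangeStreamAppProfile=default,bigQueryDataset=bigtable_bigquery_tutorial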
6. Write some data to Bigtable
Now, you'll write a year of credit card transactions to your Bigtable table. This example dataset includes information such as the credit card number, merchant name and ID, and amount. In an actual credit card processing application, this data would stream into your database in real time as each transaction occurs.
- Open Cloud Shell by clicking the button in the top right-hand corner of the Google Cloud console.
- If prompted, accept the prompt to authorize the Cloud Shell.
- Download the dataset.
gsutil cp gs://cloud-bigtable-public-datasets/change-streams-tutorial/cc-transactions-2023.csv .
- Set your environment variables in the command line, replacing the values with your project ID and Bigtable instance ID:
PROJECT_ID=your-project-id
BIGTABLE_INSTANCE_ID=your-bigtable-instance-id
- Use the cbt CLI to write a number of credit card transactions to the retail-database table.
cbt -instance=$BIGTABLE_INSTANCE_ID -project=$PROJECT_ID import \
retail-database cc-transactions-2023.csv column-family=transactions
The output will say:
Done importing 10000 rows.
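To spot-check the import, you can read a single row back with cbt (count=1 limits the output to one row):
# Read one row back to verify the import.
cbt -instance=$BIGTABLE_INSTANCE_ID -project=$PROJECT_ID read retail-database count=1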
7. View the change logs in BigQuery
- In the Google Cloud console, go to the BigQuery page.
- In the Explorer pane, expand your project and the dataset bigtable_bigquery_tutorial.
- Click the table retail-database_changelog. You may need to refresh the dataset if the table isn't present.
- To see the change log, click Preview.
Try out some queries
Now, you can run a few queries over this dataset to get some insights. The SQL queries are provided here, but the Query a Bigtable change log in BigQuery documentation has more information about how to write queries for the changelog data.
Look up one transaction
Use the following query to look up the data for one specific transaction.
SELECT *
FROM `bigtable_bigquery_tutorial.retail-database_changelog`
WHERE row_key="3034-442694-3052#2023-03-03T14:50:46.824Z"
You'll see that each column written has been turned into an individual row in BigQuery.
Find the number of purchases in each category
Use the following query to count the number of purchases per category.
SELECT value as category, count(*) as `number of sales`
FROM `bigtable_bigquery_tutorial.retail-database_changelog`
WHERE column="category"
GROUP BY category
Reformat the data
To reconstruct each transaction into a single row, you will pivot the data and save the result as a view. This is a more manageable format to query.
CREATE VIEW bigtable_bigquery_tutorial.retail_transactions_view AS (
SELECT *, parse_numeric(amount) as sales_dollars FROM (
SELECT row_key, timestamp, column, value
FROM `bigtable_bigquery_tutorial.retail-database_changelog`
)
PIVOT (
MAX(value)
FOR column in ("merchant", "amount", "category", "transaction_date")
)
)
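With the view in place, you can query the reshaped data directly, for example from Cloud Shell with the bq tool. A sketch; the merchant and sales_dollars columns come from the pivot above:
# Top five merchants by total sales, using the pivoted view.
bq query --use_legacy_sql=false \
'SELECT merchant, ROUND(SUM(sales_dollars), 2) AS total_sales
FROM bigtable_bigquery_tutorial.retail_transactions_view
GROUP BY merchant
ORDER BY total_sales DESC
LIMIT 5'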
Create a Looker dashboard
- Click Go to view to go to the new view's detail page.
- Click Export.
- Select Explore with Looker Studio.
8. Add charts to the dashboard
Now you can graph the information to make it easier to digest and share as a report. You will add three charts to your dashboard:
- Transaction amounts over time
- Total transactions for each merchant
- Percentage of transactions per category
Set up the page
- Click each existing chart and press Delete to remove them.
- Select Properties on the right-hand side of the page so you can modify the chart data.
Add the charts
Transaction amounts over time
- Click Add a chart, and create a time series chart.
- Set the Dimension to transaction_date.
- Set the Metric to sales_dollars.
Total transactions for each merchant
- Click Add a chart, and create a table.
- Set Dimension to merchant.
- Set Metric to sales_dollars.
Percentage of transactions per category
- Click Add a chart, and create a pie chart.
- Set Dimension to category.
- Set Metric to sales_dollars.
See real-time changes
Spend some time exploring the values in the graphs. You can click a specific merchant in the table or a specific category in the pie chart, which filters all of the charts on that value and can surface more insights. Next, you'll write more data and see the charts update in real time.
- Go back to your Cloud Shell.
- Download and write the second dataset.
gsutil cp gs://cloud-bigtable-public-datasets/change-streams-tutorial/cc-transactions-2024-jan.csv .
cbt -instance=$BIGTABLE_INSTANCE_ID -project=$PROJECT_ID import \
retail-database cc-transactions-2024-jan.csv column-family=transactions
- Return to your Looker dashboard and refresh the data with Ctrl+Shift+E, or click Refresh data in the View menu. Now you should see the data for January 2024 in the charts.
There are many variations of graphs and metrics you can create beyond this. Read more in the Looker documentation.
9. Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Stop the change stream pipeline
- In the Google Cloud console, go to the Dataflow Jobs page.
- Select your streaming job from the job list.
- In the navigation, click Stop.
- In the Stop job dialog, select Cancel, and then click Stop job.
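You can also stop the job from Cloud Shell. A sketch, assuming the job runs in us-central1; replace JOB_ID with the ID shown in the list output:
# Find the streaming job's ID, then cancel it.
gcloud dataflow jobs list --region=us-central1 --status=active
gcloud dataflow jobs cancel JOB_ID --region=us-central1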
Delete the Bigtable resources
If you created a Bigtable instance for this tutorial, you can delete it, or you can clean up the table you created.
- In the Google Cloud console, go to the Bigtable Instances page.
- Click the ID of the instance that you are using for this tutorial.
- In the left navigation pane, click Tables.
- Find the retail-database table.
- Click Edit.
- Clear Enable change stream.
- Click Save.
- Open the overflow menu for the table.
- Click Delete and input the table name to confirm.
- Optional: Delete the instance if you created a new one for this tutorial.
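The same cleanup can be done from Cloud Shell, reusing the variables from step 6. A sketch; check the gcloud reference for the current flag names, and note that the instance delete prompts for confirmation:
# Disable the change stream, then delete the table.
gcloud bigtable instances tables update retail-database \
  --instance=$BIGTABLE_INSTANCE_ID --clear-change-stream-retention-period
cbt -instance=$BIGTABLE_INSTANCE_ID -project=$PROJECT_ID deletetable retail-database
# Optional: delete the whole instance if it was created just for this codelab.
gcloud bigtable instances delete $BIGTABLE_INSTANCE_ID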
Delete the BigQuery dataset
- In the Google Cloud console, go to the BigQuery page.
- In the Explorer panel, find the dataset bigtable_bigquery_tutorial and click it.
- Click Delete, type delete, and then click Delete to confirm.
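Or delete the dataset from Cloud Shell with the bq tool (-r removes the dataset's contents as well, and -f skips the confirmation prompt):
# Delete the dataset and everything in it.
bq rm -r -f -d bigtable_bigquery_tutorial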