Bigtable and Dataflow: Database Monitoring Art (HBase Java Client)

In this codelab, you'll use Cloud Bigtable's monitoring tools to create various works of art by writing and reading data with Cloud Dataflow and the Java HBase client.

You'll learn how to:

  • Load in large amounts of data to Bigtable using Cloud Dataflow
  • Monitor Bigtable instances and tables as your data is ingested
  • Query Bigtable using a Dataflow job
  • Explore the Key Visualizer tool, which you can use to find hotspots caused by your schema design
  • Create art using the Key Visualizer

Cloud Bigtable is Google's NoSQL Big Data database service. It's the same database that powers many core Google services, including Search, Analytics, Maps, and Gmail. It's ideal for running large analytical workloads and building low-latency applications. Check out the Introduction to Cloud Bigtable Codelab for an in-depth introduction.

Create a project

First, create a new project. Open the built-in Cloud Shell by clicking the "Activate Cloud Shell" button in the upper-right corner of the console.

Set the following environment variables to make copying and pasting the codelab commands easier:

BIGTABLE_PROJECT=$GOOGLE_CLOUD_PROJECT
INSTANCE_ID="keyviz-art-instance"
CLUSTER_ID="keyviz-art-cluster"
TABLE_ID="art"
CLUSTER_NUM_NODES=1
CLUSTER_ZONE="us-central1-c" # You can choose a zone closer to you

Cloud Shell comes preinstalled with the tools you'll use in this codelab: the gcloud command-line tool, the cbt command-line interface, and Maven.

Enable the Cloud Bigtable APIs by running this command:

gcloud services enable bigtable.googleapis.com bigtableadmin.googleapis.com

Create an instance by running the following command:

gcloud bigtable instances create $INSTANCE_ID \
    --cluster=$CLUSTER_ID \
    --cluster-zone=$CLUSTER_ZONE \
    --cluster-num-nodes=$CLUSTER_NUM_NODES \
    --display-name=$INSTANCE_ID

After you create the instance, populate the cbt configuration file and then create a table and column family by running the following commands:

echo project = $GOOGLE_CLOUD_PROJECT > ~/.cbtrc
echo instance = $INSTANCE_ID >> ~/.cbtrc

cbt createtable $TABLE_ID
cbt createfamily $TABLE_ID cf

Basics of writing

When you write to Cloud Bigtable, you must provide a CloudBigtableTableConfiguration configuration object. This object specifies the project ID and instance ID for your table, as well as the name of the table itself:

CloudBigtableTableConfiguration bigtableTableConfig =
    new CloudBigtableTableConfiguration.Builder()
        .withProjectId(PROJECT_ID)
        .withInstanceId(INSTANCE_ID)
        .withTableId(TABLE_ID)
        .build();

Then your pipeline can pass HBase Mutation objects, which can include Put and Delete.

p.apply(Create.of("hello", "world"))
    .apply(
        ParDo.of(
            new DoFn<String, Mutation>() {
              @ProcessElement
              public void processElement(@Element String rowkey, OutputReceiver<Mutation> out) {
                long timestamp = System.currentTimeMillis();
                Put row = new Put(Bytes.toBytes(rowkey));

                row.addColumn(...);
                out.output(row);
              }
            }))
    .apply(CloudBigtableIO.writeToTable(bigtableTableConfig));

The LoadData Dataflow job

The next page will show you how to run the LoadData job, but here we'll call out the important parts of the pipeline.

To generate data, you'll create a pipeline that uses the GenerateSequence class (similar to a for loop) to write a number of rows, each containing a few megabytes of random data. The rowkey will be the sequence number padded to a fixed length and then reversed, so with a length of 10, 250 ("0000000250") becomes 0520000000.
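As a standalone sketch (pure Java, with a hypothetical key width passed in as a parameter), the pad-and-reverse transformation looks like this:

```java
public class RowkeySketch {

    // Zero-pad the sequence number to a fixed width, then reverse it so that
    // sequential writes are spread across the keyspace instead of all landing
    // on the same tablet.
    static String makeRowkey(long sequence, int maxLength) {
        String padded = String.format("%0" + maxLength + "d", sequence);
        return new StringBuilder(padded).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(makeRowkey(7, 10)); // prints 7000000000
    }
}
```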

LoadData.java

String numberFormat = "%0" + maxLength + "d";

p.apply(GenerateSequence.from(0).to(max))
    .apply(
        ParDo.of(
            new DoFn<Long, Mutation>() {
              @ProcessElement
              public void processElement(@Element Long rowkey, OutputReceiver<Mutation> out) {
                String paddedRowkey = String.format(numberFormat, rowkey);

                // Reverse the rowkey for more efficient writing
                String reversedRowkey = new StringBuilder(paddedRowkey).reverse().toString();
                Put row = new Put(Bytes.toBytes(reversedRowkey));

                // Generate random bytes
                byte[] b = new byte[(int) rowSize];
                new Random().nextBytes(b);

                long timestamp = System.currentTimeMillis();
                row.addColumn(Bytes.toBytes(COLUMN_FAMILY), Bytes.toBytes("C"), timestamp, b);
                out.output(row);
              }
            }))
    .apply(CloudBigtableIO.writeToTable(bigtableTableConfig));

The following commands will run a Dataflow job that writes 40 GB of data to your table, more than enough for the Key Visualizer to activate:

Enable the Cloud Dataflow API:

gcloud services enable dataflow.googleapis.com

Get the code from GitHub and change into the directory:

git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git
cd java-docs-samples/bigtable/beam/keyviz-art

Generate the data (the script takes around 15 minutes):

mvn compile exec:java -Dexec.mainClass=keyviz.LoadData \
"-Dexec.args=--bigtableProjectId=$BIGTABLE_PROJECT \
--bigtableInstanceId=$INSTANCE_ID --runner=dataflow \
--bigtableTableId=$TABLE_ID --project=$GOOGLE_CLOUD_PROJECT"

Monitor the import

You can monitor the job in the Cloud Dataflow UI, and you can view the load on your Cloud Bigtable instance in its monitoring UI.

In the Dataflow UI, you'll be able to see the job graph and various job metrics, including elements processed, current vCPUs, and throughput.

Bigtable has standard monitoring tools for read/write operations, storage used, error rate, and more at the instance, cluster, and table level. Beyond that, Bigtable also has the Key Visualizer, which breaks down your usage by row key; you'll use it once at least 30 GB of data has been generated.

Basics of reading

When you read from Cloud Bigtable, you must provide a CloudBigtableScanConfiguration configuration object. It is similar to CloudBigtableTableConfiguration, but it also lets you specify which rows to scan and read from.

Scan scan = new Scan();
scan.setCacheBlocks(false);
scan.setFilter(new FirstKeyOnlyFilter());

CloudBigtableScanConfiguration config =
    new CloudBigtableScanConfiguration.Builder()
        .withProjectId(options.getBigtableProjectId())
        .withInstanceId(options.getBigtableInstanceId())
        .withTableId(options.getBigtableTableId())
        .withScan(scan)
        .build();

Then use that to start your pipeline:

p.apply(Read.from(CloudBigtableIO.read(config)))
    .apply(...

However, if you want to do a read as part of your pipeline, you can pass a CloudBigtableTableConfiguration to a DoFn that extends AbstractCloudBigtableTableDoFn.

p.apply(GenerateSequence.from(0).to(10))
    .apply(ParDo.of(new ReadFromTableFn(bigtableTableConfig, options)));

In that DoFn, call super() with your configuration, then call getConnection() to get a distributed connection:

public static class ReadFromTableFn extends AbstractCloudBigtableTableDoFn<Long, Void> {
    public ReadFromTableFn(CloudBigtableConfiguration config, ReadDataOptions readDataOptions) {
      super(config);
    }

    @ProcessElement
    public void processElement(PipelineOptions po) throws IOException {
        ReadDataOptions options = po.as(ReadDataOptions.class);
        Table table = getConnection().getTable(TableName.valueOf(options.getBigtableTableId()));
        ResultScanner imageData = table.getScanner(new Scan());
    }
}

The ReadData Dataflow job

For this codelab, you'll need to read from the table every second, so you start the pipeline with a generated sequence that triggers multiple read ranges based on the elapsed time and an input file.

There is a bit of math to determine which row ranges to scan at a given time; click the filename to view the source code if you want to learn more.
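As a rough sketch of the idea (the names rangesForColumn and drawBand are hypothetical, not from the codelab source): split the sorted array of loaded rowkeys into equal-sized bands, and for the current time window return start/end key pairs only for the bands that should be scanned, i.e. the dark pixels of that image column.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSketch {

    // Given the sorted rowkeys actually in the table and one boolean per band
    // (true = this band should be read during the current time window),
    // return [startKey, endKey] pairs covering only the "on" bands.
    static List<String[]> rangesForColumn(String[] sortedKeys, boolean[] drawBand) {
        List<String[]> ranges = new ArrayList<>();
        int bandSize = sortedKeys.length / drawBand.length;
        for (int band = 0; band < drawBand.length; band++) {
            if (!drawBand[band]) {
                continue;
            }
            String start = sortedKeys[band * bandSize];
            int endIndex = (band + 1) * bandSize;
            String end = endIndex < sortedKeys.length
                ? sortedKeys[endIndex]
                : sortedKeys[sortedKeys.length - 1];
            ranges.add(new String[] {start, end});
        }
        return ranges;
    }
}
```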

ReadData.java

p.apply(GenerateSequence.from(0).withRate(1, new Duration(1000)))
    .apply(ParDo.of(new ReadFromTableFn(bigtableTableConfig, options)));

ReadData.java

  public static class ReadFromTableFn extends AbstractCloudBigtableTableDoFn<Long, Void> {

    List<List<Float>> imageData = new ArrayList<>();
    String[] keys;

    public ReadFromTableFn(CloudBigtableConfiguration config, ReadDataOptions readDataOptions) {
      super(config);
      keys = new String[Math.toIntExact(getNumRows(readDataOptions))];
      downloadImageData(readDataOptions.getFilePath());
      generateRowkeys(getNumRows(readDataOptions));
    }

    @ProcessElement
    public void processElement(PipelineOptions po) {
      // Determine which column will be drawn based on runtime of job.
      long timestampDiff = System.currentTimeMillis() - START_TIME;
      long minutes = (timestampDiff / 1000) / 60;
      int timeOffsetIndex = Math.toIntExact(minutes / KEY_VIZ_WINDOW_MINUTES);

      ReadDataOptions options = po.as(ReadDataOptions.class);
      long count = 0;

      List<RowRange> ranges = getRangesForTimeIndex(timeOffsetIndex, getNumRows(options));
      if (ranges.size() == 0) {
        return;
      }

      try {
        // Scan with a filter that will only return the first key from each row. This filter is used
        // to more efficiently perform row count operations.
        Filter rangeFilters = new MultiRowRangeFilter(ranges);
        FilterList firstKeyFilterWithRanges = new FilterList(
            rangeFilters,
            new FirstKeyOnlyFilter(),
            new KeyOnlyFilter());
        Scan scan =
            new Scan()
                .addFamily(Bytes.toBytes(COLUMN_FAMILY))
                .setFilter(firstKeyFilterWithRanges);

        Table table = getConnection().getTable(TableName.valueOf(options.getBigtableTableId()));
        ResultScanner imageData = table.getScanner(scan);
      } catch (Exception e) {
        System.out.println("Error reading.");
        e.printStackTrace();
      }
    }

    /**
     * Download the image data as a grid of weights and store them in a 2D array.
     */
    private void downloadImageData(String artUrl) {
    ...
    }

    /**
     * Generates an array with the rowkeys that were loaded into the specified Bigtable. This is
     * used to create the correct intervals for scanning equal sections of rowkeys. Since Bigtable
     * sorts keys lexicographically, using standard numeric intervals would give each section a
     * different size.
     */
    private void generateRowkeys(long maxInput) {
    ...
    }

    /**
     * Get the ranges to scan for the given time index.
     */
    private List<RowRange> getRangesForTimeIndex(@Element Integer timeOffsetIndex, long maxInput) {
    ...
    }
  }
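The comment on generateRowkeys is worth a small illustration. Because Bigtable sorts keys lexicographically and the loaded keys were reversed, equal numeric intervals over the sequence numbers would not map to equal-sized sections of the keyspace (a quick sketch, with a hypothetical key width of 3):

```java
import java.util.Arrays;

public class SortOrderSketch {
    public static void main(String[] args) {
        // Padded-and-reversed keys for sequence numbers 0, 1, 2, and 10:
        // 0 -> "000", 1 -> "100", 2 -> "200", 10 -> "010"
        String[] keys = {"000", "100", "200", "010"};
        Arrays.sort(keys);
        // Lexicographic order is 000, 010, 100, 200 -- i.e. 0, 10, 1, 2 --
        // so a numeric interval like [0, 2) does not correspond to a
        // contiguous section of the sorted keyspace.
        System.out.println(Arrays.toString(keys)); // prints [000, 010, 100, 200]
    }
}
```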

Now that you understand how to load data into Bigtable and read from it with Dataflow, you can run the final command, which generates an image of the Mona Lisa over 8 hours.

mvn compile exec:java -Dexec.mainClass=keyviz.ReadData \
"-Dexec.args=--bigtableProjectId=$BIGTABLE_PROJECT \
--bigtableInstanceId=$INSTANCE_ID --runner=dataflow \
--bigtableTableId=$TABLE_ID --project=$GOOGLE_CLOUD_PROJECT"

There is a bucket with existing images you can use, or you can create an input file from any of your own images with this tool and then upload it to a public GCS bucket.

Filenames follow the pattern gs://keyviz-art/[painting]_[hours]h.txt, for example: gs://keyviz-art/american_gothic_4h.txt

painting options:

  • american_gothic
  • mona_lisa
  • pearl_earring
  • persistence_of_memory
  • starry_night
  • sunday_afternoon
  • the_scream

hour options: 1, 4, 8, 12, 24, 48, 72, 96, 120, 144
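If it helps, the input path for any combination above can be assembled with a one-liner (a trivial sketch; the bucket name and pattern come from this codelab):

```java
public class FilePathSketch {

    // Build a gs://keyviz-art/... input path from a painting name and a
    // duration in hours, following the [painting]_[hours]h.txt pattern.
    static String inputPath(String painting, int hours) {
        return String.format("gs://keyviz-art/%s_%dh.txt", painting, hours);
    }

    public static void main(String[] args) {
        System.out.println(inputPath("mona_lisa", 8)); // prints gs://keyviz-art/mona_lisa_8h.txt
    }
}
```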

Make your GCS bucket or file public by giving allUsers the Storage Object Viewer role.

Once you've picked your image, just change the --filePath parameter in this command:

mvn compile exec:java -Dexec.mainClass=keyviz.ReadData \
"-Dexec.args=--bigtableProjectId=$BIGTABLE_PROJECT \
--bigtableInstanceId=$INSTANCE_ID --runner=dataflow \
--bigtableTableId=$TABLE_ID --project=$GOOGLE_CLOUD_PROJECT \
--filePath=gs://keyviz-art/american_gothic_4h.txt"

The full image might take a few hours to come to life, but after 30 minutes you should start to see activity in the Key Visualizer. There are several parameters you can play with: zoom, brightness, and metric. You can zoom using the scroll wheel on your mouse, or by dragging a rectangle on the Key Visualizer grid.

Brightness changes the scaling of the image, which is helpful if you want to take an in-depth look at a very hot area.

You can also adjust which metric is displayed: there's "Ops", "Read bytes client", and "Write bytes client", to name a few. "Read bytes client" seems to produce smooth images, while "Ops" produces images with more lines, which can look really cool on some images.

Clean up to avoid charges

To avoid incurring charges to your Google Cloud Platform account for the resources used in this codelab, you should delete your instance.

gcloud bigtable instances delete $INSTANCE_ID

What we've covered

  • Writing to Bigtable with Dataflow
  • Reading from Bigtable with Dataflow (at the start of your pipeline and in the middle of your pipeline)
  • Using the Dataflow monitoring tools
  • Using the Bigtable monitoring tools including Key Visualizer

Next steps