In this lab, you learn how to work with Cloud Bigtable in a performant way.

What you learn

In this lab, you learn how to:

First, you must set up your GCP project and your environment so that you can interact with all of the frameworks presented in this exercise.

Step 1

First you need a host on which to run the applications you will write for this exercise. A minimal GCE instance with Java 8 and Maven installed is sufficient. The steps are explained in detail in pubsub-exercises, but the following subset should be enough:

Select Your Project

Open the Cloud Platform Console and navigate to the project you are using for this course. If you do not have a project, follow these steps to create one (Don't forget to associate a billing account!).

Create A Host

Create a Cloud host in the GCE page with the following specifications:

Name: datasme

Machine Type: 2 vCPUs

Region: us-central1

Zone: us-central1-a

You may leave all other settings at their defaults.

SSH Into the Host

In a terminal window, run

gcloud compute --project=<project_name> ssh datasme

Or, alternatively, in the GCE console click on the SSH button in the line containing datasme.

Download Resources and Set Up

In the GCE host, run the following to install git, download the required resources, and set up your host:

sudo apt-get install git -y
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
./training-data-analyst/courses/data_analysis/deepdive/pubsub-exercises/setup_vm.sh

Step 2

For your applications to communicate with GCP, you must be properly authenticated. The following steps set up that authentication.

There are two ways to authenticate on the host. The first is easier but less secure; the second is more involved but safer. Both options are presented here, and either should work.

User Login

To log in with your own user account, run the following commands on the host and follow the instructions:

# Login so you can use gcloud properly
gcloud auth login
# Login so your applications can connect
gcloud auth application-default login

Service Account Login

To use a service account, you must first create it. Follow these steps to create a service account. I would personally recommend creating it in the Cloud Console (i.e., the UI) rather than with gcloud, as those steps are simpler (though less comprehensive) for a beginner.

Once created, give it the following set of permissions in the IAM page for your project:

Alternatively, simply give it the Project Editor role.

Then, download the service account key to your GCE host and run the following on that host (where ACCOUNT is the service account name and KEY_FILE is the key file you downloaded):

gcloud auth activate-service-account ACCOUNT --key-file=KEY_FILE

Step 3

Enable the following APIs so that you may call them from your applications:

Using the Console

From the GCP Console, create a Cloud Bigtable instance with the following specifications:

Instance ID: datasme-cbt

Instance Type: Development

Cluster ID: datasme-cbt-c1

Storage Type: SSD

Region: us-central1

Zone: us-central1-a

Nodes: 1

Using the Terminal

Alternatively, you may use gcloud to create a Bigtable instance:

gcloud beta bigtable instances create datasme-cbt \
    --instance-type=DEVELOPMENT \
    --cluster=datasme-cbt-c1 \
    --cluster-zone us-central1-a \
    --display-name=datasme-cbt

Now we can finally start interacting with Cloud Bigtable. Let's go to the exercises directory to start coding!

cd training-data-analyst/courses/data_analysis/deepdive/bigtable-exercises

Now let's make sure everything was created correctly by running a simple script: build-ex0.sh

What does the script do?

With your favorite editor, take a quick glance at the script. In it, you'll see that it really just runs the class Ex0 in the package (and, correspondingly, folder) com.google.cloud.bigtable.training against the Bigtable instance datasme-cbt and the table TrainingTable.

Then what does the Java class do?

Now we can open the Java class to better understand what it's doing:

# Open the following with the editor of your choice:
vim src/main/java/com/google/cloud/bigtable/training/Ex0.java

First, we see that the class establishes a connection to a specific Bigtable instance (the one we specified in the bash script as datasme-cbt):

Connection connection = BigtableConfiguration.connect(projectId, instanceId);
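
As an aside, the Connection returned here is AutoCloseable, so in your own code you might manage it with try-with-resources (just a sketch, not necessarily how Ex0.java is written):

// Sketch: the same connection managed with try-with-resources,
// so it is closed automatically when the block exits.
try (Connection connection = BigtableConfiguration.connect(projectId, instanceId)) {
    // ... create the table and write/read rows here ...
}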

We then build, locally, a descriptor for the table TrainingTable with a column family inventively called column_family:

HTableDescriptor descriptor = new HTableDescriptor(TableName.valueOf(tableName));
descriptor.addFamily(new HColumnDescriptor("column_family"));

Since that descriptor is purely local, the table must be created in the actual Bigtable instance for the change to take effect. We do that with the Admin API as follows:

Admin admin = connection.getAdmin();
try {
    admin.createTable(descriptor);
} catch (TableExistsException e) {
    System.err.println("Could not create table!");
}
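
If you prefer not to rely on catching the exception, the Admin API can also check for the table up front; a small sketch with the same effect:

// Sketch: check whether the table already exists before creating it.
Admin admin = connection.getAdmin();
if (!admin.tableExists(TableName.valueOf(tableName))) {
    admin.createTable(descriptor);
}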

Once the table exists, we need a local handle to it. We get one like so:

Table table = connection.getTable(TableName.valueOf(tableName));

Lastly, we write a new row with key row_1, putting the value It worked! into the column column_key_name of that column family. The call to table.put then sends the change to the instance:

Put put = new Put(Bytes.toBytes("row_1"));
put.addColumn(Bytes.toBytes("column_family"), Bytes.toBytes("column_key_name"),
    Bytes.toBytes("It worked!"));
table.put(put);

What do you think the following lines at the end would do?

Result getResult = table.get(new Get(Bytes.toBytes("row_1")));
String val =
    Bytes.toString(getResult.getValue(Bytes.toBytes("column_family"), Bytes.toBytes("column_key_name")));
System.out.println(val);

You will find the answer in the next page.

Let's run!

Now that we understand exactly what the code is doing, let's run it!

./build-ex0.sh

Those last lines read back and print the value we previously inserted into column_family/column_key_name for row_1.

If all works out, the program should print "It worked!"

For this exercise we are going to start populating our table (with 1 million rows!). In the first part we will populate the table row by row; then we will populate it in batches. Note the performance difference between the two approaches: how much do you gain by batching?

Download the data

We start by downloading the retail data to process. The download is a comma-separated file in which each line represents one retail action. Each line has the form:

TIME, USER, ACTION, ITEM

That is, each line records the (non-unique) timestamp at which the interaction occurred, the id of the user who performed the action (ADD, VIEW, or BUY), and the id of the item acted on: the kind of interaction you would expect on a retail website.

To download the sample data from GCS, run the following:

gsutil -m cp gs://cloud-bigtable-training/actions_subset.csv src/main/resources/actions_subset.csv
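
Ex1.java already contains the code that parses this file; just to make the mapping concrete, one line could be turned into a map of fields roughly like this (a sketch, with placeholder names):

// Hypothetical sketch: split one CSV line into named fields.
String[] fields = line.split(",");
Map<String, String> data = new HashMap<>();
data.put("time", fields[0]);
data.put("user", fields[1]);
data.put("action", fields[2]);
data.put("item", fields[3]);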

Looking at the code

As you may have guessed, the new script is at

build-ex1.sh

And the new Java class is at

src/main/java/com/google/cloud/bigtable/training/Ex1.java

The script this time is very similar to the previous one, except with a flag to use a buffered mutator. We'll get back to this part later.

The code, however, is quite different. We set up the table in a similar manner (though here we truncate it if it is already populated), but for writing the data (unlike the previous example, where we only used a direct put) we also set up a BufferedMutator so that we can batch multiple writes:

BufferedMutatorParams params = new BufferedMutatorParams(TableName.valueOf(tableName))
          .listener((e, bufferedMutator) -> System.out.println(e.getMessage()));
BufferedMutator bufferedMutator = connection.getBufferedMutator(params);
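
Keep in mind that a BufferedMutator buffers writes on the client, so nothing is guaranteed to have reached the server until it is flushed; something along these lines (Ex1.java may already take care of this) pushes out any remaining buffered Puts:

// Sketch: flush any Puts still in the client-side buffer, then release the mutator.
bufferedMutator.flush();
bufferedMutator.close();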

We then parse the CSV we just downloaded and turn each row into a map keyed by field name. Depending on the script flag, the rows are either written in batches through the BufferedMutator, or written as individual puts across 8 threads.

At the end of each batch of writes, the time taken is printed (if using multithreaded puts, the program quits at the 10th batch to save time).
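
To picture the single-put path, the 8-thread version might be structured roughly as follows (a sketch using java.util.concurrent; parsedRows and the singlePut call are placeholders for whatever Ex1.java actually uses):

// Hypothetical sketch of the multithreaded single-put path (8 worker threads).
ExecutorService pool = Executors.newFixedThreadPool(8);
for (Map<String, String> data : parsedRows) {
    pool.submit(() -> singlePut(table, data));  // each row is sent as its own Put
}
pool.shutdown();
pool.awaitTermination(10, TimeUnit.MINUTES);  // throws InterruptedException; handle or declare it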

However, we have not written all of the code: the rest is up to you. In the remainder of this section, we will work together to complete it.

Fixing the code

getPut:

This method takes the parsed data from one CSV row and formats it into a valid input for the table, returning it as a Put to be applied to the table.

The part you must write iterates through each value in the CSV row (time, user, action, item) and adds a column to the Put. This should be familiar: it is done similarly in exercise 0.

You will find the answer on the next page.

singlePut

This method should kick off a thread that executes the put against the table, using the getPut method you just fixed to build the actual data.

Here, the actual put is missing. Once again, this should be familiar, as it is similar to the step done in Ex0.

You will find the answer on the next page.

writeWithBufferedMutator

This is similar to singlePut, but rather than multithreading, we just need to call the mutate method on the BufferedMutator. Note that Put is a subclass of Mutation, so mutate can take it directly.

You will find the answer on the next page.

Run

Check the speed of the implementation by running the previously mentioned shell script.

Pass false to run the SinglePut code, or true to run the BufferedMutator code.

./build-ex1.sh true|false

Did it work?

Rows and Content

Install the cbt CLI to more closely interact with the table.

Once it is installed, check that the number of rows matches the expected number by running:

cbt -instance datasme-cbt count TrainingTable

You can also see a sample of the written data with:

cbt -instance datasme-cbt read TrainingTable count=3

Run Times

The program you ran should have displayed the run times.

getPut

put.addColumn(family, Bytes.toBytes(tag), Bytes.toBytes(data.get(tag)));

singlePut

table.put(getPut(data));

writeWithBufferedMutator

bm.mutate(getPut(data));
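
For context, here is roughly how that addColumn line sits inside getPut (a sketch; the actual signature and surrounding code in Ex1.java may differ):

// Hypothetical sketch of getPut: turn one parsed CSV row into a Put.
static Put getPut(Map<String, String> data, byte[] family, String rowKey) {
    Put put = new Put(Bytes.toBytes(rowKey));
    for (String tag : data.keySet()) {  // the four fields: time, user, action, item
        put.addColumn(family, Bytes.toBytes(tag), Bytes.toBytes(data.get(tag)));
    }
    return put;
}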

For this exercise we will use Dataflow to read data from Bigtable and write it back to Bigtable, with some steps in between. Those steps determine the activity of our retail app customers by calculating actions per minute from the source data. The code to do that is mostly done; we just need you to filter the scan to rows whose keys start with the string "action".

Filter

We have done most of the work, but our scan doesn't have a filter associated with it. This should be a short, one-line solution. The HBase API provides filters that act on rows, and we need one that matches row keys with a specific prefix. Pick the solution from one of the libraries below:

Libraries

Filter directory

Byte Array Comparator

Scan Class

You will find the answer on the next page.

Staging Bucket

In order to successfully run this exercise, we need a staging bucket.

Follow these steps to create a bucket. If you are using a service account, remember to grant it permission to access that bucket. Note that bucket names must be unique.

Run

Run the job to read data out of Bigtable and write aggregated data back in.

./build-ex2.sh gs://YOUR-BUCKET-NAME

Filter

scan.setFilter(new PrefixFilter(Bytes.toBytes("action")));
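
For context, that one-liner acts on the Scan the pipeline uses to read from Bigtable; with the standard HBase classes, the relevant pieces look roughly like this (a sketch, not the exact Ex2 code):

// Sketch: restrict the scan to rows whose keys start with "action".
// PrefixFilter comes from org.apache.hadoop.hbase.filter.
Scan scan = new Scan();
scan.setFilter(new PrefixFilter(Bytes.toBytes("action")));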

In this exercise we are going to look through our newly aggregated retail data (i.e., the rollups column family) and determine whether there are any lulls (periods with drops in activity) in retail actions. Again, we have provided most of the code; you just have to implement one key part. Specifically, you will take the data from the table and process it "offline" (in Python itself).

Preparation

Since this step requires some Python, we need to prepare a Python environment. Let's start by installing pip so we can download the required dependencies:

sudo apt install python-pip

Also, since we don't want to mess with our current environment, let's set up a virtual environment to work in:

sudo apt install virtualenv

Now let's go to the python directory and install all our required libraries in the virtualenv:

cd python
virtualenv .env
source .env/bin/activate
pip install -r requirements.txt

Learning

Before we actually start coding, it is a good idea to get acquainted with the library. Thankfully, Python has a REPL we can use for this. Start it (simply run python) and paste in a modified version of the first few lines of the main method to interact with the data (remember to replace your_project_id with your actual project id):

from google.cloud import bigtable
client = bigtable.Client(project='your_project_id')
instance = client.instance('datasme-cbt')
table = instance.table('TrainingTable')
partial_rows = table.read_rows(start_key='hourly')
partial_rows.consume_all()

Modify the code

Remember, you are doing everything locally: you should iterate over all of the cells in the rollups family for the column named '' (the empty string). Each row represents data for a specific hour, and each cell within it is a time span. If it were written in JSON notation, you could imagine the data looking like this:

{b"rollups": {"": "DATA IS HERE!"}}

Check each cell's value, and print a message if it drops by 50% or more from the previous value. You only need to iterate (no sorting), as the cells should already be in order.

You will find the answer on the next page.

Run

python ex3.py <your project> datasme-cbt TrainingTable

The answer to the "Modify the code" step:

  all_cells = []
  for row_key, row in list(partial_rows.rows.items()):
    for cell in row.cells[b'rollups']['']:
      all_cells.append(cell)

  last_cell = None
  for cell in all_cells:
    if last_cell and int(last_cell.value) / 2 > int(cell.value):
      print("Big drop from {} to {} between {} and {}".format(last_cell.value, cell.value, last_cell.timestamp, cell.timestamp))
    last_cell = cell

For this last exercise, we will simply run a Python script that continuously reads a row from your table. Your task for this step is mostly manual: you will turn on replication, wait for it to complete, run the script, and delete one of your clusters.

Replicate

Edit your instance by setting Instance Type to Production and clicking + Add replicated cluster. The settings can be left at their defaults (any region or zone should work).

Run

Run the Python script at this point:

python ex4.py <your project> datasme-cbt TrainingTable

While this script is running, go to the Cloud Console, click the delete icon for the original cluster (not the one you just created and replicated), and click Save. Thanks to replication, the Python program should keep running uninterrupted.

Press CTRL+C on your terminal to stop.

Delete the following resources: