In this lab, you learn how to work with Cloud Bigtable in a performant way.

What you learn

In this lab, you learn how to:

- Create a Cloud Bigtable instance
- Write data into Bigtable efficiently, first with single Puts and then with a BufferedMutator
- Read and aggregate data from Bigtable using scans and filters
- Query the aggregated results with the Python Bigtable client

You will need a GCE instance with Java 8 and Maven installed. If you do not have the datasme instance appropriately set up, please follow the steps in the pubsub-exercises.

Step 1

From the GCP Console, create a Cloud Bigtable instance with the following specifications (these match the gcloud command below): instance ID datasme-cbt, display name datasme-cbt, Development instance type, cluster ID datasme-cbt-c1, and cluster zone us-central1-b.

(OR)

Use gcloud to create a Bigtable instance:

gcloud beta bigtable instances create datasme-cbt \
    --instance-type=DEVELOPMENT \
    --cluster=datasme-cbt-c1 \
    --cluster-zone us-central1-b \
    --display-name=datasme-cbt

Step 2

Make sure the client works:

cd training-data-analyst/courses/data_analysis/deepdive/bigtable-exercises
bash ./build-ex0.sh

Step 3

Look at build-ex0.sh. Which class was executed in the previous step?

________________________

(Answer: Ex0.java)

Step 4

Why does the program print out "It worked!"?

_________________________________________________________________________

Hint: What column and what value is being written into the table?

_________________________

(Answer: We write the string "It worked!" to the column cf:col. That value is then read back and printed.)
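
For reference, here is a minimal sketch of that write-then-read pattern with the Bigtable HBase client. This is not the actual contents of Ex0.java; the project ID, table name, and row key below are placeholders.

import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class Ex0Sketch {
  public static void main(String[] args) throws Exception {
    // Connect to Bigtable through the HBase-compatible client.
    // Project ID and table name are placeholders, not taken from the lab code.
    try (Connection connection = BigtableConfiguration.connect("YOUR-PROJECT", "datasme-cbt");
         Table table = connection.getTable(TableName.valueOf("TrainingTable"))) {

      // Write the string "It worked!" into column cf:col of one row.
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("It worked!"));
      table.put(put);

      // Read the same cell back and print it, hence the output "It worked!".
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
      System.out.println(Bytes.toString(value));
    }
  }
}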

We are going to import a 1-million-row subset of the actions dataset from the retail example.

Step 0

Download the retail data subset file from GCS and put it where the Java code can read it.

./download_data.sh

Step 1 [If familiar with Java]

Implement the TODOs in Ex1.java (scroll down to the method implementations).

SinglePut:

For 1a, reorder the last 2 parameters to String.join so that consecutive writes are distributed efficiently across the keyspace rather than arriving in sequential key order (see the sketch after these hints).

For 1b, see Ex0.java.

For 1c:

writer.execute(() -> {
   table.put(getPut(data));
}, point);
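
The point of hint 1a is row-key ordering: if the key begins with a monotonically increasing value such as a timestamp, consecutive writes all land on the same node. A hedged illustration of the idea, using made-up field names rather than the actual ones in Ex1.java:

public class KeyOrderSketch {
  public static void main(String[] args) {
    // Hypothetical fields; the real key components in Ex1.java may differ.
    String timestamp = "2017-01-01T00:00:00";
    String itemId = "item-42";

    // Timestamp right after the prefix: consecutive writes produce sequential
    // row keys, so they all hit the same tablet (hotspotting).
    String hotKey = String.join("#", "action", timestamp, itemId);

    // High-cardinality field first, timestamp last: consecutive writes are
    // spread across the keyspace and therefore across Bigtable nodes.
    String distributedKey = String.join("#", "action", itemId, timestamp);

    System.out.println(hotKey + " vs " + distributedKey);
  }
}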

BufferedMutator:

For 1d, use the BufferedMutator in WriteWithBufferedMutator:

bm.mutate(getPut(point));
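
For context, this is roughly how a BufferedMutator is obtained and used together with BufferedMutatorParams. It is only a sketch; the project ID, table name, column family, and buffer size are assumptions, not the actual setup in WriteWithBufferedMutator.

import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedMutatorSketch {
  public static void main(String[] args) throws Exception {
    try (Connection connection = BigtableConfiguration.connect("YOUR-PROJECT", "datasme-cbt")) {

      // BufferedMutatorParams tunes client-side batching, e.g. how much data
      // is buffered before a batch of mutations is flushed to Bigtable.
      BufferedMutatorParams params =
          new BufferedMutatorParams(TableName.valueOf("TrainingTable"))
              .writeBufferSize(4 * 1024 * 1024); // 4 MB buffer (illustrative)

      try (BufferedMutator bm = connection.getBufferedMutator(params)) {
        // Puts are buffered and sent in batches, which is far faster than
        // issuing one RPC per Table.put() call.
        Put put = new Put(Bytes.toBytes("action#item-42#2017-01-01T00:00:00"));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("some value"));
        bm.mutate(put);
      } // close() flushes any remaining buffered mutations
    }
  }
}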

Step 2

Check the speed of the implementation by running either Ex1 or Ex1Solution.

Pass "false" to run the SinglePut code, or "true" to run the BufferedMutator code.

bash ./build-ex1.sh true|false
bash ./build-ex1-solution.sh true|false

Step 3

Fill out this table based on Step 2:

Method                                      Parameter (if any)          Writing rate
SinglePut                                   numThreads=_____________    _____________ rows/sec
                                            numThreads=_____________    _____________ rows/sec
BufferedMutator and BufferedMutatorParams   numThreads=_____________    _____________ rows/sec
                                            numThreads=_____________    _____________ rows/sec

We are now going to run a job that reads the action data back out of Bigtable and writes by-minute aggregations back into the table.

Step 1 [If familiar with Java]

Complete the single TODO in Ex2.java

Hint: Which filter will let us scan only rows that start with "action"?

https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/package-summary.html
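
One filter that matches the hint is PrefixFilter. Below is a hedged sketch of using it in a plain HBase scan; the actual TODO in Ex2.java may wire the Scan into the pipeline differently.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixScanSketch {
  // Scans only the rows whose keys start with "action".
  static void scanActions(Table table) throws IOException {
    Scan scan = new Scan();
    scan.setFilter(new PrefixFilter(Bytes.toBytes("action")));
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result row : scanner) {
        System.out.println(Bytes.toString(row.getRow()));
      }
    }
  }
}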

Step 2

Run the job to read data out of Bigtable and write aggregated data back in.

./build-ex2.sh gs://YOUR-STAGING-LOCATION

or

./build-ex2-solution.sh gs://YOUR-STAGING-LOCATION

We are going to check our by-minute aggregations and look for big drops in retail activity.

Step 1:

Install pip and virtualenv if you do not already have them. You may want to refer to the Python Development Environment Setup Guide for Google Cloud Platform for instructions.

Step 2:

Create a virtualenv:

cd python
virtualenv env
source env/bin/activate

Step 3: Install Requirements

pip install -r requirements.txt

Step 4: Complete the TODO if familiar with Python

Python library docs are here: https://googlecloudplatform.github.io/google-cloud-python/latest/bigtable/data-api.html

Hint: Iterate over all the cells in the rollups column family for the column whose name is the empty string (""). Add them all to a list and check that list for any value that drops more than 50% from the previous one. Print something out with the two values and the two timestamps (both accessible from the cell).
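
The comparison logic itself is language-agnostic. Here is a hedged sketch of it in Java (the exercise itself is completed in ex3.py with the Python client), with made-up values standing in for the cells read from Bigtable:

public class DropDetectionSketch {
  public static void main(String[] args) {
    // Illustrative (timestamp, value) pairs; the real ones come from the
    // cells of the rollups family, column "", read with the Python client.
    long[] timestamps = {1000L, 2000L, 3000L, 4000L};
    double[] values = {200.0, 210.0, 90.0, 95.0}; // 210 -> 90 is a >50% drop

    for (int i = 1; i < values.length; i++) {
      // Flag any value that is less than half of the previous one.
      if (values[i] < 0.5 * values[i - 1]) {
        System.out.printf("Drop: %.1f @ %d -> %.1f @ %d%n",
            values[i - 1], timestamps[i - 1], values[i], timestamps[i]);
      }
    }
  }
}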

Step 5:

Run it!

python ex3.py <your project> datasme-cbt TrainingTable

or

python ex3_solution.py <your project> datasme-cbt TrainingTable

Delete the following resources: