In this lab, you learn how to use Dataflow to aggregate records received in real time from Cloud Pub/Sub. The aggregate statistics are then streamed into BigQuery, where they can be analyzed even as the data are streaming in.

What you need

To complete this lab, you need:

Access to a supported Internet browser

A Google Cloud Platform project

What you learn

In this lab, you learn how to:

Use Cloud Pub/Sub as a real-time streaming source into a Dataflow pipeline

Use BigQuery as a streaming sink for the aggregates that the pipeline computes

Step 1

Open the Google Cloud Console (in the incognito window) and, using the menu, navigate to the BigQuery web UI. Then click the blue arrow next to your project name (in the left-hand panel), click Create new dataset, and, if you do not already have a dataset named demos, create one.
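
If you prefer the command line, the same dataset can also be created from Cloud Shell with the bq tool (a minimal sketch, assuming bq is already configured for your project):

bq mk demos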

Step 2

Back in the Cloud Console, visit the Pub/Sub section of the GCP Console and click Create Topic. Give your new topic the name streamdemo and click Create.
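
Alternatively, the topic can be created from Cloud Shell with gcloud (a sketch, assuming the Pub/Sub API is enabled for your project):

gcloud pubsub topics create streamdemo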

Step 1

Start Cloud Shell and navigate to the directory for this lab:

cd ~/training-data-analyst/courses/data_analysis/lab2 

If this directory doesn't exist, you may need to git clone the repository:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

Step 2

View the pipeline code using nano and answer the questions in the following steps (a reference sketch of a comparable pipeline appears after Step 7):

cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
nano src/main/java/com/google/cloud/training/dataanalyst/javahelp/StreamDemoConsumer.java

Step 3

What are the fields in the BigQuery table? _______________________________

Step 4

What is the pipeline source? ________________________________________________

Step 5

How often will aggregates be computed? ___________________________________________

Aggregates will be computed over what time period? _________________________________

Step 6

What aggregate is being computed in this pipeline? ____________________________

How would you change it to compute the average number of words in each message over the time period? ____________________________

Step 7

What is the output sink for the pipeline? ____________________________
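
The questions above can be answered directly from StreamDemoConsumer.java. For orientation only, here is a minimal sketch of a comparable Beam pipeline; it is not the repository's code, and the class name, project ID, window durations, and transform names are illustrative assumptions:

package com.example;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class StreamDemoSketch {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    // Placeholder project ID; the lab's script fills this in for you.
    String topic = "projects/YOUR-PROJECT/topics/streamdemo";
    String table = "YOUR-PROJECT:demos.streamdemo";

    // BigQuery schema: one row per window, holding a timestamp and a word count.
    List<TableFieldSchema> fields = Arrays.asList(
        new TableFieldSchema().setName("timestamp").setType("TIMESTAMP"),
        new TableFieldSchema().setName("num_words").setType("INTEGER"));
    TableSchema schema = new TableSchema().setFields(fields);

    p.apply("GetMessages", PubsubIO.readStrings().fromTopic(topic))
        // Sliding windows: the durations here are illustrative, not necessarily the lab's values.
        .apply("Window", Window.into(
            SlidingWindows.of(Duration.standardMinutes(2)).every(Duration.standardSeconds(30))))
        // Emit the number of words in each Pub/Sub message.
        .apply("CountWords", ParDo.of(new DoFn<String, Integer>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(c.element().split(" ").length);
          }
        }))
        // Sum.integersGlobally() totals the words in each window; Mean.<Integer>globally()
        // would instead compute the average number of words per message (Step 6's variation).
        .apply("SumPerWindow", Sum.integersGlobally().withoutDefaults())
        // Convert each per-window total into a BigQuery row.
        .apply("ToBQRow", ParDo.of(new DoFn<Integer, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(new TableRow()
                .set("timestamp", Instant.now().toString())
                .set("num_words", c.element()));
          }
        }))
        .apply(BigQueryIO.writeTableRows().to(table).withSchema(schema)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

    p.run();
  }
}

Use the actual file, not this sketch, to answer the questions: the field names, source topic, window settings, aggregate, and sink in StreamDemoConsumer.java are authoritative.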

Step 1

If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.
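
A bucket can also be created from Cloud Shell with gsutil (a sketch; the bucket name and location are placeholders you should change):

gsutil mb -l us-central1 gs://<YOUR-BUCKET-NAME>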

Step 2

Execute the pipeline by typing the following (make sure to replace <PROJECT> with your project ID and <YOUR-BUCKET-NAME> with the name of the bucket you created in the previous step):

cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
./run_oncloud4.sh <PROJECT> <YOUR-BUCKET-NAME>

Monitor the job from the Dataflow section of the GCP Console. Note that this pipeline will not exit on its own.
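
You can also confirm that the job is running from Cloud Shell (a sketch, assuming your gcloud installation includes the Dataflow commands):

gcloud dataflow jobs list --status=active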

Step 3

Visit the Pub/Sub section of the GCP Console and click on your streamdemo topic. Notice that it has a Dataflow subscription. Click the Publish button, type in a message (any message), and click Publish:

Step 4

Publish a few more messages.
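
If you prefer the command line, messages can also be published from Cloud Shell (a sketch; the message text is arbitrary):

for i in 1 2 3; do
  gcloud pubsub topics publish streamdemo --message "test message $i"
done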

Step 1

Open the Google Cloud Console (in the incognito window) and, using the menu, navigate to the BigQuery web UI. Compose a new query and type in the following (replacing PROJECTID with your project ID):

SELECT timestamp, num_words FROM [PROJECTID:demos.streamdemo] LIMIT 10
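
The same query can be run from Cloud Shell with the bq tool (a sketch; bq defaults to legacy SQL, which is what the bracketed table name above uses; again, replace PROJECTID with your project ID):

bq query 'SELECT timestamp, num_words FROM [PROJECTID:demos.streamdemo] LIMIT 10'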

Step 1

Cancel the job from the Dataflow section of the GCP Console.
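
The job can also be cancelled from Cloud Shell (a sketch; <JOB_ID> is a placeholder for the ID shown by gcloud dataflow jobs list):

gcloud dataflow jobs cancel <JOB_ID>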

Step 2

Delete the streamdemo topic from the Pub/Sub section of the GCP Console.
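
Or, from Cloud Shell:

gcloud pubsub topics delete streamdemo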

Step 3

Delete the streamdemo table from the left-hand panel of the BigQuery console.
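
Or, from Cloud Shell with the bq tool (a sketch; the -t flag tells bq the target is a table):

bq rm -t demos.streamdemo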

In this lab, you:

Used Cloud Pub/Sub as a real-time streaming source for a Dataflow pipeline

Computed aggregate statistics over the streaming data in Dataflow

Streamed the aggregates into BigQuery and queried them while the pipeline was still running

© Google, Inc. or its affiliates. All rights reserved. Do not distribute.