In this lab, you learn how to use Dataflow to aggregate records received in real time from Cloud Pub/Sub. The aggregate statistics are then streamed into BigQuery, where they can be analyzed even as the data stream in.
To complete this lab, you need:
Access to a supported Internet browser
A Google Cloud Platform project
The goal of this lab is to learn how to use Pub/Sub as a real-time streaming source into Dataflow, and BigQuery as a streaming sink.
Open the Google Cloud Console (in the incognito window) and use the menu to navigate to the BigQuery web UI. Click the blue arrow next to your project name in the left-hand panel, then click Create new dataset. If you do not already have a dataset named demos, create one.
Back in the Cloud Console, visit the Pub/Sub section of the GCP Console and click Create Topic. Name your new topic streamdemo and click Create.
Start CloudShell and navigate to the directory for this lab:
If this directory doesn't exist, you may need to git clone the repository:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
View the pipeline code using nano and answer the following questions:
cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
nano src/main/java/com/google/cloud/training/dataanalyst/javahelp/StreamDemoConsumer.java
What are the fields in the BigQuery table? _______________________________
What is the pipeline source? ________________________________________________
How often will aggregates be computed? ___________________________________________
Aggregates will be computed over what time period? _________________________________
What aggregate is being computed in this pipeline? ____________________________
How would you change it to compute the average number of words in each message over the time period? ____________________________
What is the output sink for the pipeline? ____________________________
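One of the questions above asks how you might change the pipeline to compute the average number of words per message. The actual change belongs in the Beam transforms in StreamDemoConsumer.java, but the core arithmetic can be sketched in plain Java (the class and method names below are hypothetical, used only for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class AverageWordsSketch {
    // Hypothetical helper: counts whitespace-separated words in one message.
    static int countWords(String message) {
        String trimmed = message.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Average word count across all messages that fall in one window.
    static double averageWords(List<String> messages) {
        if (messages.isEmpty()) return 0.0;
        int total = 0;
        for (String m : messages) {
            total += countWords(m);
        }
        return (double) total / messages.size();
    }

    public static void main(String[] args) {
        List<String> window = Arrays.asList("hello world", "a b c", "one");
        System.out.println(averageWords(window)); // prints 2.0
    }
}
```

In the real pipeline, this would correspond to replacing a sum-style combiner with a mean-style combiner over the per-message word counts; check the transforms in StreamDemoConsumer.java for the exact place to make the change.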
If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP Console. Bucket names must be globally unique.
Execute the pipeline by typing in the following (make sure to replace <YOUR-BUCKET-NAME> with the name of the bucket you created in the previous step):
cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
./run_oncloud4.sh <PROJECT> <YOUR-BUCKET-NAME>
Monitor the job from the Dataflow section of the GCP Console. Note that this pipeline will not exit on its own.
Visit the Pub/Sub section of the GCP Console and click on your streamdemo topic. Notice that it has a Dataflow subscription. Click the Publish button, type in a message (any message), and click Publish.
Publish a few more messages.
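Because the pipeline aggregates over sliding windows, each message you publish is counted in several overlapping windows. The actual window size and slide period are defined in StreamDemoConsumer.java; assuming, purely for illustration, a 2-minute window sliding every 30 seconds, a plain-Java sketch of which windows a single message falls into looks like this:

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowSketch {
    // Illustrative values only; the real pipeline's windowing is configured
    // in StreamDemoConsumer.java.
    static final long WINDOW_SECONDS = 120;
    static final long PERIOD_SECONDS = 30;

    // Returns the start times (in seconds) of every sliding window that
    // contains an event published at eventSeconds.
    static List<Long> windowStarts(long eventSeconds) {
        List<Long> starts = new ArrayList<>();
        // Last window start at or before the event, aligned to the period.
        long last = (eventSeconds / PERIOD_SECONDS) * PERIOD_SECONDS;
        for (long s = last; s > eventSeconds - WINDOW_SECONDS; s -= PERIOD_SECONDS) {
            if (s >= 0) starts.add(0, s);
        }
        return starts;
    }

    public static void main(String[] args) {
        // A message published 130 s into the stream belongs to 4 windows.
        System.out.println(windowStarts(130)); // prints [30, 60, 90, 120]
    }
}
```

This is why, after you publish a few messages, you will see multiple aggregate rows arriving in BigQuery rather than just one per message.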
Open the Google Cloud Console (in the incognito window) and use the menu to navigate to the BigQuery web UI. Compose a new query and type in the following (changing PROJECTID appropriately):
SELECT timestamp, num_words FROM [PROJECTID:demos.streamdemo] LIMIT 10
Cancel the job from the Dataflow section of the GCP Console.
Delete the streamdemo topic from the Pub/Sub section of the GCP Console.
Delete the streamdemo table from the left-hand panel of the BigQuery console.
In this lab, you used Dataflow to aggregate records received in real time from Cloud Pub/Sub, streamed the aggregate statistics into BigQuery, and analyzed the results even as the data were streaming in.
©Google, Inc. or its affiliates. All rights reserved. Do not distribute.