Google Cloud Dataflow

Last Updated: 2020-May-26

What is Dataflow?

Dataflow is a managed service for executing a wide variety of data processing patterns. The documentation on this site shows you how to deploy your batch and streaming data processing pipelines using Dataflow, including directions for using service features.

The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service. The Apache Beam documentation provides in-depth conceptual information and reference material for the Apache Beam programming model, SDKs, and other runners.

Streaming data analytics with speed

Dataflow enables fast, simplified streaming data pipeline development with lower data latency.

Simplify operations and management

Dataflow's serverless approach removes operational overhead from data engineering workloads, so teams can focus on programming instead of managing server clusters.

Reduce total cost of ownership

Resource autoscaling paired with cost-optimized batch processing capabilities means Dataflow offers virtually limitless capacity to manage your seasonal and spiky workloads without overspending.

Key features

Automated resource management and dynamic work rebalancing

Dataflow automates provisioning and management of processing resources to minimize latency and maximize utilization so that you do not need to spin up instances or reserve them by hand. Work partitioning is also automated and optimized to dynamically rebalance lagging work. No need to chase down "hot keys" or preprocess your input data.

Horizontal autoscaling

Horizontal autoscaling of worker resources for optimum throughput results in better overall price-to-performance.

Flexible resource scheduling pricing for batch processing

For processing with flexibility in job scheduling time, such as overnight jobs, flexible resource scheduling (FlexRS) offers a lower price for batch processing. These flexible jobs are placed into a queue with a guarantee that they will be retrieved for execution within a six-hour window.

What you will run as part of this codelab

In this codelab, you begin using Dataflow SQL by submitting a SQL statement through the Dataflow SQL UI. You then explore the running pipeline by using the Dataflow monitoring UI.

What you'll learn

What you'll need

Ensure that you have the Dataflow API and Cloud Pub/Sub API enabled. You can verify this on the APIs & Services page.

Go to the APIs & Services page
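If you prefer to verify the APIs from a terminal, the check above can be sketched in Python around the gcloud command-line tool. This is a minimal sketch, assuming gcloud is installed, authenticated, and set to your project; the helper names `missing_apis` and `list_enabled_apis` are hypothetical and introduced only for illustration.

```python
import subprocess

# Service names of the two APIs this codelab needs.
REQUIRED_APIS = {"dataflow.googleapis.com", "pubsub.googleapis.com"}


def missing_apis(enabled_output, required=REQUIRED_APIS):
    """Return the required services that are absent from the gcloud output.

    `enabled_output` is expected to hold one service name per line.
    """
    enabled = {line.strip() for line in enabled_output.splitlines() if line.strip()}
    return set(required) - enabled


def list_enabled_apis():
    """Ask gcloud for the enabled services, one service name per line.

    Assumption: gcloud is on PATH and already configured for your project.
    """
    result = subprocess.run(
        ["gcloud", "services", "list", "--enabled", "--format=value(config.name)"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling `missing_apis(list_enabled_apis())` returns an empty set when both APIs are enabled. Keeping the parsing separate from the gcloud call lets you test the logic without touching a real project.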

The Dataflow SQL UI is a BigQuery web UI setting for creating Dataflow SQL jobs. You can access the Dataflow SQL UI from the BigQuery web UI.

1. Go to the BigQuery web UI.

2. Switch to the Cloud Dataflow engine.

[Image: The More drop-down menu in the BigQuery web UI with the Query settings option selected]

[Image: The Query settings menu with the Enable APIs prompt]

You can also access the Dataflow SQL UI from the Dataflow monitoring interface.

Go to the Dataflow monitoring interface

Writing Dataflow SQL queries

Dataflow SQL queries use the Dataflow SQL query syntax. The Dataflow SQL query syntax is similar to BigQuery standard SQL. You can use the Dataflow SQL streaming extensions to aggregate data from continuously updating Dataflow sources like Pub/Sub. For example, the following query counts the passengers in a Pub/Sub stream of taxi rides every minute:

SELECT
  TUMBLE_START('INTERVAL 1 MINUTE') AS period_start,
  SUM(passenger_count) AS pickup_count
FROM pubsub.topic.`pubsub-public-data`.`taxirides-realtime`
WHERE
  ride_status = "pickup"
GROUP BY
  TUMBLE(event_timestamp, 'INTERVAL 1 MINUTE')
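To build intuition for what the tumbling-window aggregation computes, here is a minimal sketch in plain Python that groups a small, made-up list of rides into fixed one-minute windows. It is a simplified batch simulation, not how Dataflow executes the query: it ignores watermarks, late data, and distribution. The sample rides and the helper names `tumble_start` and `pickup_counts` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumble_start(ts, size):
    """Return the start of the fixed-size window that contains `ts`."""
    return ts - ((ts - EPOCH) % size)


def pickup_counts(rides, size=timedelta(minutes=1)):
    """Sum passenger_count per window, keeping only rides with status 'pickup'."""
    totals = defaultdict(int)
    for ride in rides:
        if ride["ride_status"] == "pickup":
            window = tumble_start(ride["event_timestamp"], size)
            totals[window] += ride["passenger_count"]
    return dict(sorted(totals.items()))


# Made-up sample data that mimics three fields of the taxi ride messages.
t0 = datetime(2020, 5, 26, 12, 0, tzinfo=timezone.utc)
sample_rides = [
    {"event_timestamp": t0 + timedelta(seconds=5), "ride_status": "pickup", "passenger_count": 2},
    {"event_timestamp": t0 + timedelta(seconds=40), "ride_status": "pickup", "passenger_count": 1},
    {"event_timestamp": t0 + timedelta(seconds=50), "ride_status": "dropoff", "passenger_count": 3},
    {"event_timestamp": t0 + timedelta(seconds=70), "ride_status": "pickup", "passenger_count": 4},
]

counts = pickup_counts(sample_rides)
for period_start, pickup_count in counts.items():
    print(period_start.isoformat(), pickup_count)
```

Each ride lands in exactly one window because tumbling windows do not overlap, which is why the first two pickups are summed together and the dropoff is filtered out before aggregation.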

Running Dataflow SQL queries

When you run a Dataflow SQL query, Dataflow turns the query into an Apache Beam pipeline and executes the pipeline.

You can run a Dataflow SQL query from the Cloud Console or with the gcloud command-line tool.

In the Cloud Console, run the query from the Dataflow SQL UI.

For more information about querying data and writing Dataflow SQL query results, see Using data sources and destinations.
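For the gcloud route mentioned above, the following sketch assembles a `gcloud dataflow sql query` command line in Python. Treat the flag names (`--job-name`, `--region`, `--bigquery-dataset`, `--bigquery-table`) as assumptions and confirm them with `gcloud dataflow sql query --help` before relying on them. The job name, region, dataset, and table values are hypothetical placeholders.

```python
import shlex


def build_sql_query_command(query, job_name, region, bq_dataset, bq_table):
    """Build the argument list for submitting a Dataflow SQL query with gcloud.

    Assumption: the flag names match your installed gcloud version.
    """
    return [
        "gcloud", "dataflow", "sql", "query", query,
        f"--job-name={job_name}",
        f"--region={region}",
        f"--bigquery-dataset={bq_dataset}",
        f"--bigquery-table={bq_table}",
    ]


QUERY = """
SELECT
  TUMBLE_START('INTERVAL 1 MINUTE') AS period_start,
  SUM(passenger_count) AS pickup_count
FROM pubsub.topic.`pubsub-public-data`.`taxirides-realtime`
WHERE
  ride_status = "pickup"
GROUP BY
  TUMBLE(event_timestamp, 'INTERVAL 1 MINUTE')
""".strip()

# Hypothetical job name, region, dataset, and table; replace with your own.
command = build_sql_query_command(
    QUERY, "dfsql-taxi-pickups", "us-central1", "taxirides", "pickups_per_minute"
)

# Print a shell-safe version of the command instead of running it.
print(" ".join(shlex.quote(part) for part in command))
```

Passing the arguments as a list (for example to `subprocess.run(command, check=True)`) avoids shell quoting problems with the backticks and quotes inside the query.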

When you execute your pipeline using the Dataflow managed service, you can view that job and any others by using Dataflow's web-based monitoring user interface. The monitoring interface lets you see and interact with your Dataflow jobs.

You can access the Dataflow monitoring interface by using the Google Cloud Console. The monitoring interface can show you:

You can view job monitoring charts within the Dataflow monitoring interface. These charts display metrics over the duration of a pipeline job and include the following information:

Accessing the Dataflow monitoring interface

To access the Dataflow monitoring interface, follow these steps:

Go to the Dataflow monitoring interface

A list of Dataflow jobs appears along with their status.

[Image: A list of Dataflow jobs in the Cloud Console with jobs in the Running, Failed, and Succeeded states]

A job can have the following statuses:

Look for the job with "dfsql" in its name and click the job name.

The Job details page opens.

On the Job details page, you can switch between views by using the Job graph and Job metrics tabs.
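The job lookup described above (finding the job whose name contains "dfsql") can also be scripted. The sketch below filters a JSON job list of the kind you might get from `gcloud dataflow jobs list --format=json`. The field names `id`, `name`, and `state` are assumptions about that output, and the sample jobs are made up for illustration.

```python
import json


def find_sql_jobs(jobs_json, marker="dfsql"):
    """Return the jobs whose name contains `marker`.

    Assumption: `jobs_json` is a JSON array of objects with a "name" field.
    """
    jobs = json.loads(jobs_json)
    return [job for job in jobs if marker in job.get("name", "")]


# Made-up sample output; real job IDs and state strings may look different.
sample_jobs_json = json.dumps([
    {"id": "job-1", "name": "dfsql-1a2b3c", "state": "Running"},
    {"id": "job-2", "name": "wordcount-example", "state": "Succeeded"},
    {"id": "job-3", "name": "dfsql-4d5e6f", "state": "Failed"},
])

sql_jobs = find_sql_jobs(sample_jobs_json)
for job in sql_jobs:
    print(job["id"], job["name"], job["state"])
```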

To stop Dataflow SQL jobs, use the Cancel command. Stopping a Dataflow SQL job with Drain is not supported.
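As a command-line counterpart to the Cancel command, this sketch builds and runs `gcloud dataflow jobs cancel`. It assumes gcloud is installed and authenticated; the helper names, job ID, and region shown in the comment are hypothetical. It deliberately does not offer a drain option, because draining is not supported for Dataflow SQL jobs.

```python
import subprocess


def build_cancel_command(job_id, region):
    """Build the argument list that cancels one Dataflow job."""
    return ["gcloud", "dataflow", "jobs", "cancel", job_id, f"--region={region}"]


def cancel_job(job_id, region):
    """Run the cancel command and raise CalledProcessError if gcloud fails.

    Assumption: gcloud is on PATH and configured for the job's project.
    """
    subprocess.run(build_cancel_command(job_id, region), check=True)


# Example call with a hypothetical job ID and region (not executed here):
# cancel_job("2020-05-26_00_00_00-1234567890", "us-central1")
```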