Google Cloud Dataflow
Last Updated: 2020-May-26
Dataflow is a managed service for executing a wide variety of data processing patterns. The documentation on this site shows you how to deploy your batch and streaming data processing pipelines using Dataflow, including directions for using service features.
The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service. The Apache Beam documentation provides in-depth conceptual information and reference material for the Apache Beam programming model, SDKs, and other runners.
Dataflow enables fast, simplified streaming data pipeline development with lower data latency.
Dataflow's serverless approach removes operational overhead from data engineering workloads, so teams can focus on programming instead of managing server clusters.
Resource autoscaling paired with cost-optimized batch processing capabilities means Dataflow offers virtually limitless capacity to manage your seasonal and spiky workloads without overspending.
Dataflow automates provisioning and management of processing resources to minimize latency and maximize utilization, so you do not need to spin up instances or reserve them by hand. Work partitioning is also automated and optimized to dynamically rebalance lagging work. You do not need to chase down "hot keys" or preprocess your input data.
Horizontal autoscaling of worker resources for optimum throughput results in better overall price-to-performance.
For processing with flexibility in job scheduling time, such as overnight jobs, flexible resource scheduling (FlexRS) offers a lower price for batch processing. These flexible jobs are placed into a queue with a guarantee that they will be retrieved for execution within a six-hour window.
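As a rough illustration of the six-hour window, the following Python sketch computes the latest time a queued FlexRS job should be retrieved for execution. The function name and the sample submission time are hypothetical; only the six-hour figure comes from the text above.

```python
from datetime import datetime, timedelta

# Six-hour queueing guarantee stated in the text above.
FLEXRS_MAX_QUEUE_DELAY = timedelta(hours=6)


def latest_flexrs_start(submitted_at: datetime) -> datetime:
    """Return the latest time a job submitted at `submitted_at` should be retrieved."""
    return submitted_at + FLEXRS_MAX_QUEUE_DELAY


# Hypothetical overnight job submitted at 22:00.
submitted = datetime(2020, 5, 26, 22, 0)
print(latest_flexrs_start(submitted))  # 2020-05-27 04:00:00
```

An overnight job submitted at 22:00 would therefore be picked up by 04:00 the next morning at the latest, which is why FlexRS suits work without a tight start deadline.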
In this codelab, you're going to begin using Dataflow SQL by submitting a SQL statement through the Dataflow SQL UI. You will then explore the running pipeline by using the Dataflow monitoring UI.
Ensure that you have the Dataflow API and Cloud Pub/Sub API enabled. You can verify this by checking the APIs & Services page.
The Dataflow SQL UI is a BigQuery web UI setting for creating Dataflow SQL jobs. You can access the Dataflow SQL UI from the BigQuery web UI.
1. Go to the BigQuery web UI.
2. Switch to the Cloud Dataflow engine.
You can also access the Dataflow SQL UI from the Dataflow monitoring interface.
Dataflow SQL queries use the Dataflow SQL query syntax. The Dataflow SQL query syntax is similar to BigQuery standard SQL. You can use the Dataflow SQL streaming extensions to aggregate data from continuously updating Dataflow sources like Pub/Sub. For example, the following query counts the passengers in a Pub/Sub stream of taxi rides every minute:
SELECT
  TUMBLE_START('INTERVAL 1 MINUTE') AS period_start,
  SUM(passenger_count) AS pickup_count
FROM pubsub.topic.`pubsub-public-data`.`taxirides-realtime`
WHERE
  ride_status = "pickup"
GROUP BY
  TUMBLE(event_timestamp, 'INTERVAL 1 MINUTE')
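To make the windowing logic concrete, here is a minimal Python sketch of the aggregation the query describes: filter to pickups, assign each ride to a fixed one-minute window, and sum the passenger counts per window. It runs over an in-memory list and is not how Dataflow executes the query. It ignores streaming concerns such as late data, and the helper names and sample rides are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=1)
EPOCH = datetime(1970, 1, 1)


def tumble_start(ts: datetime, window: timedelta = WINDOW) -> datetime:
    """Return the start of the fixed (tumbling) window that contains `ts`."""
    offset = (ts - EPOCH) % window  # time elapsed since the window opened
    return ts - offset


def pickup_counts(rides):
    """Sum passenger_count per one-minute window, counting only pickups."""
    totals = defaultdict(int)
    for ride in rides:
        if ride["ride_status"] == "pickup":
            totals[tumble_start(ride["event_timestamp"])] += ride["passenger_count"]
    return dict(sorted(totals.items()))


# Hypothetical sample data standing in for the Pub/Sub stream.
rides = [
    {"event_timestamp": datetime(2020, 5, 26, 12, 0, 5), "ride_status": "pickup", "passenger_count": 2},
    {"event_timestamp": datetime(2020, 5, 26, 12, 0, 40), "ride_status": "pickup", "passenger_count": 1},
    {"event_timestamp": datetime(2020, 5, 26, 12, 0, 50), "ride_status": "dropoff", "passenger_count": 3},
    {"event_timestamp": datetime(2020, 5, 26, 12, 1, 10), "ride_status": "pickup", "passenger_count": 4},
]

for period_start, pickup_count in pickup_counts(rides).items():
    print(period_start, pickup_count)
```

With this sample, the 12:00 window totals 3 passengers (the dropoff is filtered out) and the 12:01 window totals 4.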
When you run a Dataflow SQL query, Dataflow turns the query into an Apache Beam pipeline and executes the pipeline.
You can run a Dataflow SQL query using the Cloud Console or the gcloud command-line tool.
To run a Dataflow SQL query, use the Dataflow SQL UI.
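For the command-line route, the sketch below assembles a `gcloud dataflow sql query` invocation as an argument list without running it. The flag names reflect our understanding of the gcloud reference and may differ by gcloud version, so confirm them with `gcloud dataflow sql query --help`. The job name, region, dataset, and table values are hypothetical placeholders.

```python
import shlex


def build_dataflow_sql_command(query, job_name, region, dataset, table):
    """Build (but do not run) a gcloud command that submits a Dataflow SQL query.

    Flag names are assumptions based on the gcloud reference; verify locally.
    """
    return [
        "gcloud", "dataflow", "sql", "query",
        f"--job-name={job_name}",
        f"--region={region}",
        f"--bigquery-dataset={dataset}",  # destination dataset for results
        f"--bigquery-table={table}",      # destination table for results
        query,
    ]


# Hypothetical values for illustration only.
cmd = build_dataflow_sql_command(
    query="SELECT 1",
    job_name="dfsql-example",
    region="us-central1",
    dataset="example_dataset",
    table="example_table",
)
print(shlex.join(cmd))  # shell-quoted form you could paste into a terminal
```

Keeping the command as a list avoids shell-quoting problems if you later pass it to `subprocess.run`, because SQL text often contains quotes and backticks.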
For more information about querying data and writing Dataflow SQL query results, see Using data sources and destinations.
When you execute your pipeline using the Dataflow managed service, you can view that job and any others by using Dataflow's web-based monitoring user interface. The monitoring interface lets you see and interact with your Dataflow jobs.
You can access the Dataflow monitoring interface by using the Google Cloud Console. The monitoring interface can show you:
You can view job monitoring charts within the Dataflow monitoring interface. These charts display metrics over the duration of a pipeline job and include the following information:
To access the Dataflow monitoring interface, follow these steps:
A list of Dataflow jobs appears along with their status.
A list of Dataflow jobs in the Cloud Console with jobs in the Running, Failed, and Succeeded states.
A job can have the following statuses:
Look for the job with "dfsql" as part of the job title and click on its name.
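If you have many jobs, the same lookup can be expressed as a simple substring filter. The sketch below works on a plain list of job names. The names are hypothetical, and obtaining the list from the Dataflow service is outside its scope.

```python
def find_dfsql_jobs(job_names):
    """Return the job names that contain "dfsql", preserving their order."""
    return [name for name in job_names if "dfsql" in name]


# Hypothetical job names as they might appear in the jobs list.
jobs = ["wordcount-batch-01", "dfsql-abc123", "etl-nightly", "my-dfsql-taxi-job"]
print(find_dfsql_jobs(jobs))  # ['dfsql-abc123', 'my-dfsql-taxi-job']
```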
The Job details page appears, which contains the following:
Within the Job details page, you can switch your job view by using the Job graph and Job metrics tabs.
To stop Dataflow SQL jobs, use the Cancel command. Stopping a Dataflow SQL job with Drain is not supported.