Google Cloud Dataflow
Last Updated: 2020-May-28
Dataflow is a managed service for executing a wide variety of data processing patterns. The documentation on this site shows you how to deploy your batch and streaming data processing pipelines using Dataflow, including directions for using service features.
The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service. The Apache Beam documentation provides in-depth conceptual information and reference material for the Apache Beam programming model, SDKs, and other runners.
Dataflow enables fast, simplified streaming data pipeline development with lower data latency.
Teams can focus on programming instead of managing server clusters, because Dataflow's serverless approach removes operational overhead from data engineering workloads.
Resource autoscaling, paired with cost-optimized batch processing, lets Dataflow scale to handle seasonal and spiky workloads without overspending.
Dataflow automates provisioning and management of processing resources to minimize latency and maximize utilization, so you do not need to spin up instances or reserve them by hand. Work partitioning is also automated and optimized to dynamically rebalance lagging work, so you do not need to chase down "hot keys" or preprocess your input data.
Horizontal autoscaling of worker resources for optimum throughput results in better overall price-to-performance.
For jobs whose start time is flexible, such as overnight jobs, flexible resource scheduling (FlexRS) offers a lower price for batch processing. These flexible jobs are placed into a queue with a guarantee that they will be retrieved for execution within a six-hour window.
Using the Apache Beam interactive runner with JupyterLab notebooks lets you iteratively develop pipelines, inspect your pipeline graph, and parse individual PCollections in a read-eval-print-loop (REPL) workflow. These Apache Beam notebooks are made available through AI Platform Notebooks, a managed service that hosts notebook virtual machines pre-installed with the latest data science and machine learning frameworks.
This codelab focuses on the functionality introduced by Apache Beam notebooks.
Ensure that you have the Dataflow API and Cloud Pub/Sub API enabled. You can verify this on the APIs & Services page.
Navigate to File > New > Notebook and select a kernel that is Apache Beam 2.20 or later.
Apache Beam is installed on your notebook instance, so include the interactive_runner and interactive_beam modules in your notebook.
import apache_beam as beam
from apache_beam.runners.interactive import interactive_runner
import apache_beam.runners.interactive.interactive_beam as ib
If your notebook uses other Google APIs, add the following import statements:
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
import google.auth
The following sets the data capture duration to 60 seconds.
from datetime import timedelta
ib.options.capture_duration = timedelta(seconds=60)
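As a rough illustration of what a bounded capture means, the following plain-Python sketch keeps only the events seen before a 60-second deadline. It does not use Apache Beam and is not how the interactive runner is implemented. The within_capture helper and the sample timestamps are hypothetical, and treating the deadline as exclusive is an assumption.

```python
from datetime import datetime, timedelta


def within_capture(start, event_times, capture_duration):
    """Keep only events observed before start + capture_duration.

    Hypothetical helper for illustration; the exclusive deadline is an assumption.
    """
    deadline = start + capture_duration
    return [t for t in event_times if t < deadline]


# Made-up sample data: events 5, 30, 59, 60, and 90 seconds after the start.
start = datetime(2020, 5, 28, 12, 0, 0)
events = [start + timedelta(seconds=s) for s in (5, 30, 59, 60, 90)]

captured = within_capture(start, events, timedelta(seconds=60))
print(len(captured))  # events at 5, 30, and 59 seconds fall inside the window
```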
For additional interactive options, see the interactive_beam.options class.
Initialize the pipeline using an InteractiveRunner object.
options = pipeline_options.PipelineOptions()
# Set the pipeline mode to stream the data from Pub/Sub.
options.view_as(pipeline_options.StandardOptions).streaming = True
p = beam.Pipeline(interactive_runner.InteractiveRunner(), options=options)
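The following example shows an Apache Beam pipeline that creates a subscription to the given Pub/Sub topic and reads from the subscription. The topic variable is assumed to hold the full topic path, in the form projects/<project>/topics/<topic>.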
words = p | "read" >> beam.io.ReadFromPubSub(topic=topic)
The pipeline counts the words from the source by window. It applies fixed windowing, with each window 10 seconds long.
windowed_words = (words | "window" >> beam.WindowInto(beam.window.FixedWindows(10)))
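To make the effect of 10-second fixed windows concrete, here is a minimal plain-Python sketch of how a timestamp maps to a window. It runs without Apache Beam. The fixed_window helper is hypothetical, and it assumes windows are half-open intervals [start, end) aligned to an offset of 0.

```python
def fixed_window(timestamp, size=10, offset=0):
    """Return the (start, end) bounds of the fixed window containing timestamp.

    Hypothetical helper: timestamps and sizes are in seconds, and each
    window is treated as the half-open interval [start, end).
    """
    start = timestamp - (timestamp - offset) % size
    return (start, start + size)


# Made-up timestamps, in seconds.
for ts in (0, 3, 9, 10, 27):
    print(ts, fixed_window(ts))
```

Under these assumptions, every element with a timestamp from 0 up to (but not including) 10 lands in the first window, and an element at exactly 10 seconds starts the next one.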
After the data is windowed, the words are counted by window.
windowed_words_counts = (windowed_words | "count" >> beam.combiners.Count.PerElement())
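The per-window count can be pictured as a tally keyed by window and word. The sketch below is plain Python rather than Beam code. The count_per_window helper and the sample events are made up for illustration, and windows are assumed to start at multiples of the window size.

```python
from collections import Counter


def count_per_window(events, size=10):
    """Count each word separately within each fixed window.

    events is an iterable of (timestamp_seconds, word) pairs. The result
    maps (window_start, word) to the number of occurrences.
    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for ts, word in events:
        window_start = ts - ts % size
        counts[(window_start, word)] += 1
    return counts


# Made-up sample events.
events = [(1, "to"), (4, "be"), (8, "to"), (12, "to"), (19, "or")]
counts = count_per_window(events)
for key in sorted(counts):
    print(key, counts[key])
```

The same word is counted independently in each window, so "to" appears twice in the window starting at 0 and once in the window starting at 10.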
The show() method of the interactive_beam module (imported above as ib) visualizes the resulting PCollection in the notebook, for example ib.show(windowed_words_counts).
To display visualizations of your data, pass visualize_data=True into the show() method. You can apply multiple filters to your visualizations, for example by label and axis.
Another useful visualization in Apache Beam notebooks is a Pandas DataFrame. The following example first converts the words to lowercase and then computes the frequency of each word.
windowed_lower_word_counts = (windowed_words
    | beam.Map(lambda word: word.lower())
    # A label must be unique within a pipeline, so this differs from the earlier "count".
    | "count lowercase" >> beam.combiners.Count.PerElement())
The collect() method provides the output in a Pandas DataFrame.
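The effect of lowercasing before counting, and the row-shaped result a table view would show, can be sketched in plain Python. This is not Beam or pandas code. The lower_word_counts helper, the column names, and the sample words are hypothetical.

```python
from collections import Counter


def lower_word_counts(words):
    """Lowercase each word, count occurrences, and return table-like rows.

    Rows are sorted by descending count, then alphabetically.
    Hypothetical helper for illustration only.
    """
    counts = Counter(word.lower() for word in words)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"word": word, "count": count} for word, count in ordered]


# Made-up sample words that differ only in case.
rows = lower_word_counts(["King", "king", "Lear", "KING", "lear"])
for row in rows:
    print(row)
```

Without the lowercase step, "King", "king", and "KING" would be counted as three distinct words.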
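To run the pipeline on the Dataflow service instead of in the notebook, first remove the interactive runner import statements:
from apache_beam.runners.interactive import interactive_runner
import apache_beam.runners.interactive.interactive_beam as ib
Then import the DataflowRunner:
from apache_beam.runners import DataflowRunner
Finally, replace the InteractiveRunner in your pipeline with the DataflowRunner:
p = beam.Pipeline(DataflowRunner(), options=options)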
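A job submitted to the Dataflow service typically also needs Google Cloud settings such as a project, a region, and Cloud Storage locations. The sketch below only builds the list of command-line style flags that Apache Beam pipeline options accept. The dataflow_flags helper is hypothetical, and the project, region, and bucket values are placeholders, not values from this codelab.

```python
def dataflow_flags(project, region, bucket, streaming=True):
    """Build pipeline option flags for a Dataflow run.

    Hypothetical helper: the temp/ and staging/ paths under the bucket
    are an assumed layout, not a requirement.
    """
    flags = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location=gs://{bucket}/temp",
        f"--staging_location=gs://{bucket}/staging",
    ]
    if streaming:
        # Needed for unbounded sources such as Pub/Sub.
        flags.append("--streaming")
    return flags


# Placeholder values; substitute your own project, region, and bucket.
flags = dataflow_flags("my-project", "us-central1", "my-bucket")
print(flags)
```

In a notebook where Apache Beam is installed, a list like this could be passed to pipeline_options.PipelineOptions(flags) to produce the options object used when constructing the pipeline.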
For an example of how to perform this conversion in an interactive notebook, see the Dataflow Word Count notebook in your notebook instance.
Alternatively, you can export your notebook as an executable script, modify the generated .py file using the previous steps, and then deploy your pipeline to the Dataflow service.
Notebooks you create are saved locally in your running notebook instance. If you reset or shut down the notebook instance during development, those new notebooks are deleted. To keep your notebooks for future use, download them locally to your workstation, save them to GitHub, or export them to a different file format.
After you've finished using your Apache Beam notebook instance, clean up the resources you created on Google Cloud by shutting down the notebook instance.