Google Cloud Dataflow
Last Updated: 2020-Sept-22
What is Dataflow?
Dataflow is a managed service for executing a wide variety of data processing patterns. The documentation on this site shows you how to deploy your batch and streaming data processing pipelines using Dataflow, including directions for using service features.
The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service. The Apache Beam documentation provides in-depth conceptual information and reference material for the Apache Beam programming model, SDKs, and other runners.
Streaming data analytics with speed
Dataflow enables fast, simplified streaming data pipeline development with lower data latency.
Simplify operations and management
Dataflow's serverless approach removes operational overhead from data engineering workloads, so teams can focus on programming instead of managing server clusters.
Reduce total cost of ownership
Resource autoscaling, paired with cost-optimized batch processing, lets Dataflow handle seasonal and spiky workloads without overspending.
Automated resource management and dynamic work rebalancing
Dataflow automates provisioning and management of processing resources to minimize latency and maximize utilization, so you do not need to spin up instances or reserve them by hand. Work partitioning is also automated and optimized to dynamically rebalance lagging work, so you do not need to chase down "hot keys" or preprocess your input data.
Horizontal autoscaling of worker resources for optimum throughput results in better overall price-to-performance.
Flexible resource scheduling pricing for batch processing
For processing with flexibility in job scheduling time, such as overnight jobs, flexible resource scheduling (FlexRS) offers a lower price for batch processing. These flexible jobs are placed into a queue with a guarantee that they will be retrieved for execution within a six-hour window.
What you will run as part of this codelab
Using the Apache Beam interactive runner with JupyterLab notebooks lets you iteratively develop pipelines, inspect your pipeline graph, and parse individual PCollections in a read-eval-print-loop (REPL) workflow. These Apache Beam notebooks are made available through AI Platform Notebooks, a managed service that hosts notebook virtual machines pre-installed with the latest data science and machine learning frameworks.
This codelab focuses on the functionality introduced by Apache Beam notebooks.
What you'll learn
- How to create a notebook instance
- How to create a basic pipeline
- How to read data from an unbounded source
- How to visualize the data
- How to launch a Dataflow job from the notebook
- How to save a notebook
What you'll need
- A Google Cloud Platform project with Billing enabled.
- Google Cloud Dataflow and Google Cloud Pub/Sub enabled.
2. Getting set up
- In the Cloud Console, on the project selector page, select or create a Cloud project.
Ensure that the Dataflow API and the Cloud Pub/Sub API are enabled. You can verify this on the APIs & Services page.
3. Getting started with Apache Beam notebooks
Launching an Apache Beam notebooks instance
- Open Dataflow in the Cloud Console.
- In the toolbar, click New Instance.
- Select Apache Beam.
- On the New notebook instance page, select a network for the notebook VM and click Create.
- Click Open JupyterLab when the link becomes active. AI Platform Notebooks creates a new Apache Beam notebook instance.
4. Creating the pipeline
Creating a notebook instance
Navigate to File > New > Notebook and select a kernel that is Apache Beam 2.20 or later.
Start adding code to your notebook
- Copy and paste the code from each section into a new cell in your notebook
- Run the cell
Apache Beam is installed on your notebook instance, so include the interactive_runner and interactive_beam modules in your notebook.
import apache_beam as beam
from apache_beam.runners.interactive import interactive_runner
import apache_beam.runners.interactive.interactive_beam as ib
If your notebook uses other Google APIs, add the following import statements:
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
import google.auth
Setting interactivity options
The following sets the data capture duration to 60 seconds.
from datetime import timedelta

ib.options.capture_duration = timedelta(seconds=60)
For additional interactive options, see the interactive_beam.options class.
Initialize the pipeline using an InteractiveRunner object.
options = pipeline_options.PipelineOptions()

# Set the pipeline mode to stream the data from Pub/Sub.
options.view_as(pipeline_options.StandardOptions).streaming = True

# InteractiveRunner lives in the interactive_runner module imported earlier.
p = beam.Pipeline(interactive_runner.InteractiveRunner(), options=options)
Reading and visualizing the data
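The following example shows an Apache Beam pipeline that creates a subscription to the given Pub/Sub topic and reads from the subscription. The topic variable must already hold a fully qualified topic name of the form projects/<project-id>/topics/<topic-name>.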
words = p | "read" >> beam.io.ReadFromPubSub(topic=topic)
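The topic argument above expects a fully qualified resource name. As a minimal sketch in plain Python, the helper below builds such a name; pubsub_topic_path, "my-project", and "my-topic" are hypothetical names used for illustration and are not part of Apache Beam or of the original codelab.

```python
def pubsub_topic_path(project_id: str, topic_name: str) -> str:
    """Return a topic name in the form projects/<project>/topics/<topic>."""
    if not project_id or not topic_name:
        raise ValueError("project_id and topic_name must be non-empty")
    if "/" in project_id or "/" in topic_name:
        raise ValueError("pass bare IDs, not full resource paths")
    return f"projects/{project_id}/topics/{topic_name}"


# Placeholder values; replace them with your own project and topic.
topic = pubsub_topic_path("my-project", "my-topic")
print(topic)  # projects/my-project/topics/my-topic
```

Building the string in one place makes it harder to pass a bare topic ID to ReadFromPubSub by mistake.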
The pipeline counts the words from the source per window. It first assigns the elements to fixed windows, each 10 seconds long.
windowed_words = (words | "window" >> beam.WindowInto(beam.window.FixedWindows(10)))
After the data is windowed, the words are counted by window.
windowed_words_counts = (windowed_words | "count" >> beam.combiners.Count.PerElement())
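To see what the windowing and counting steps compute, here is a rough plain-Python sketch that does not use Beam at all. It assumes each word carries an event time in seconds; window_start, count_per_window, and the sample events are invented for illustration, and a real streaming pipeline also deals with triggers and late data that this sketch ignores.

```python
from collections import Counter

WINDOW_SIZE = 10  # seconds, matching FixedWindows(10)


def window_start(timestamp, size=WINDOW_SIZE):
    """Start of the fixed window [start, start + size) containing timestamp."""
    return timestamp - (timestamp % size)


def count_per_window(events):
    """Count each word separately within each fixed window.

    events is an iterable of (word, event_time_in_seconds) pairs.
    """
    counts = Counter()
    for word, timestamp in events:
        counts[(window_start(timestamp), word)] += 1
    return counts


# Made-up sample events: (word, event time in seconds).
events = [("king", 1), ("lear", 4), ("king", 9), ("king", 12), ("lear", 19)]
for (start, word), n in sorted(count_per_window(events).items()):
    print(f"[{start}, {start + WINDOW_SIZE}) {word}: {n}")
```

The same word is counted independently in each window, which is why "king" appears once per window in the output rather than as a single total.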
Visualizing the data
The show() method visualizes the resulting PCollection in the notebook, for example ib.show(windowed_words_counts, include_window_info=True).
To display visualizations of your data, pass visualize_data=True into the show() method. You can apply multiple filters to your visualizations, for example filtering by label and axis.
5. Using a Pandas DataFrame
Another useful visualization in Apache Beam notebooks is a Pandas DataFrame. The following example first converts the words to lowercase and then computes the frequency of each word.
windowed_lower_word_counts = (
    windowed_words
    | beam.Map(lambda word: word.lower())
    | "count" >> beam.combiners.Count.PerElement())
The collect() method provides the output in a Pandas DataFrame, for example ib.collect(windowed_lower_word_counts, include_window_info=True).
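The effect of lowercasing before counting can be sketched in plain Python, ignoring windowing for brevity. The function name lower_word_counts and the sample words are made up for illustration; each resulting (word, count) pair corresponds roughly to one row of the collected output.

```python
from collections import Counter


def lower_word_counts(words):
    """Lowercase each word, then count occurrences of each distinct word."""
    return Counter(word.lower() for word in words)


# Made-up sample input; "King", "king", and "KING" collapse into one key.
sample = ["King", "king", "Lear", "KING", "lear"]
rows = sorted(lower_word_counts(sample).items())
print(rows)  # [('king', 3), ('lear', 2)]
```

Without the lowercase step, the three spellings of "king" would be counted as three separate words.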
6. (Optional) Launching Dataflow jobs from your notebook
- (Optional) Before using your notebook to run Dataflow jobs, restart the kernel, rerun all cells, and verify the output.
- Remove the following import statements:
from apache_beam.runners.interactive import interactive_runner
import apache_beam.runners.interactive.interactive_beam as ib
- Add the following import statement:
from apache_beam.runners import DataflowRunner
- Pass in your pipeline options.
# Set up Apache Beam pipeline options.
options = pipeline_options.PipelineOptions()

# Set the project to the default project in your current Google Cloud
# environment.
_, options.view_as(GoogleCloudOptions).project = google.auth.default()

# Set the Google Cloud region to run Dataflow.
options.view_as(GoogleCloudOptions).region = 'us-central1'

# Choose a Cloud Storage location.
dataflow_gcs_location = 'gs://<change me>/dataflow'

# Set the staging location. This location is used to stage the
# Dataflow pipeline and SDK binary.
options.view_as(GoogleCloudOptions).staging_location = '%s/staging' % dataflow_gcs_location

# Set the temporary location. This location is used to store temporary files
# or intermediate results before outputting to the sink.
options.view_as(GoogleCloudOptions).temp_location = '%s/temp' % dataflow_gcs_location

# Set the SDK location. This is used by Dataflow to locate the
# SDK needed to run the pipeline.
options.view_as(pipeline_options.SetupOptions).sdk_location = (
    '/root/apache-beam-custom/packages/beam/sdks/python/dist/apache-beam-%s0.tar.gz' %
    beam.version.__version__)
- You can adjust the parameter values. For example, you can change the region value from us-central1.
- In the constructor of beam.Pipeline(), replace InteractiveRunner with DataflowRunner.
runner = DataflowRunner()
runner.run_pipeline(p, options=options)
p is the pipeline object from "Creating the pipeline".
- Remove the interactive calls from your code. For example, remove show(), collect(), head(), show_graph(), and watch() from your code.
- Add p.run() at the end of your pipeline code.
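The staging and temporary locations in the pipeline options above are both derived from one Cloud Storage prefix. A minimal sketch of that derivation with a basic sanity check follows; dataflow_locations and "my-bucket" are hypothetical names, and the function only builds strings without checking that the bucket exists.

```python
def dataflow_locations(gcs_prefix: str) -> dict:
    """Derive staging and temp locations from one Cloud Storage prefix."""
    scheme = "gs://"
    if not gcs_prefix.startswith(scheme):
        raise ValueError("expected a Cloud Storage path starting with gs://")
    bucket_and_path = gcs_prefix[len(scheme):].strip("/")
    if not bucket_and_path:
        raise ValueError("the Cloud Storage path must name a bucket")
    base = scheme + bucket_and_path
    return {
        "staging_location": f"{base}/staging",
        "temp_location": f"{base}/temp",
    }


# "my-bucket" is a placeholder bucket name.
locations = dataflow_locations("gs://my-bucket/dataflow/")
print(locations["staging_location"])  # gs://my-bucket/dataflow/staging
print(locations["temp_location"])     # gs://my-bucket/dataflow/temp
```

Rejecting a path without the gs:// scheme up front avoids submitting a job whose staging location silently points at a local directory.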
For an example of how to perform this conversion on an interactive notebook, see the Dataflow Word Count notebook in your notebook instance.
Alternatively, you can export your notebook as an executable script, modify the generated .py file using the previous steps, and then deploy your pipeline to the Dataflow service.
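In practice the export is usually done with a tool such as jupyter nbconvert --to script. As a rough illustration of what the export and the follow-up cleanup involve, the sketch below joins the code cells of a notebook file (assuming the nbformat 4 JSON layout) and lists lines that still call an interactive helper. It assumes the interactive module is imported as ib; notebook_to_script, find_interactive_calls, and the sample notebook are hypothetical.

```python
import json
import re

# Interactive helpers that should be removed before running on Dataflow.
INTERACTIVE_CALLS = ("show", "collect", "head", "show_graph", "watch")
_PATTERN = re.compile(r"\bib\.(" + "|".join(INTERACTIVE_CALLS) + r")\s*\(")


def notebook_to_script(notebook_json: str) -> str:
    """Concatenate the code cells of an nbformat 4 notebook into one script."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # A cell's source is stored as one string or as a list of lines.
        chunks.append(source if isinstance(source, str) else "".join(source))
    return "\n\n".join(chunks) + "\n"


def find_interactive_calls(script: str):
    """Return (line_number, line) pairs that still call an interactive helper."""
    return [
        (number, line.strip())
        for number, line in enumerate(script.splitlines(), start=1)
        if _PATTERN.search(line)
    ]


# Made-up notebook content for demonstration.
sample_notebook = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Word count"]},
        {"cell_type": "code",
         "source": ["words = p | beam.Create(['a'])\n", "ib.show(words)"]},
        {"cell_type": "code", "source": "result = p.run()"},
    ],
})
script = notebook_to_script(sample_notebook)
print(find_interactive_calls(script))  # [(2, 'ib.show(words)')]
```

The line-based check is deliberately simple: it will miss calls made through a different alias, so treat its output as a reminder list rather than a guarantee.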
7. Saving your notebook
Notebooks you create are saved locally in your running notebook instance. If you reset or shut down the notebook instance during development, those new notebooks are deleted. To keep your notebooks for future use, download them locally to your workstation, save them to GitHub, or export them to a different file format.
8. Cleaning up
After you've finished using your Apache Beam notebook instance, clean up the resources you created on Google Cloud by shutting down the notebook instance.