Running BigQuery jobs in parallel with Workflows

1. Introduction

Workflows is a fully managed orchestration service that executes Google Cloud and external services in an order that you define.

BigQuery is a fully managed enterprise data warehouse that helps you manage and analyze terabytes of data with built-in features like machine learning, geospatial analysis, and business intelligence.

In this codelab, you will run some BigQuery queries against the public Wikipedia dataset. You will then see how to run multiple BigQuery queries serially, one after another, as part of a Workflows orchestration. Finally, you will parallelize the queries using the parallel iteration feature of Workflows for up to a 5x speed improvement.

What you'll learn

  • How to run BigQuery queries against the Wikipedia dataset.
  • How to run multiple queries serially as part of a Workflows orchestration.
  • How to parallelize queries using Workflows parallel iteration for up to a 5x speed improvement.

2. Setup and Requirements

Self-paced environment setup

  1. Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.

  • The Project name is the display name for this project's participants. It is a character string not used by Google APIs. You can update it at any time.
  • The Project ID must be unique across all Google Cloud projects and cannot be changed after it is set. The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs you'll need to reference the Project ID (typically identified as PROJECT_ID), so if you don't like the generated ID, you can generate another random one, or you can try your own and see if it's available. Whichever you pick, it remains for the duration of the project.
  • For your information, there is a third value, a Project Number, which some APIs use. Learn more about all three of these values in the documentation.
  2. Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab shouldn't cost much, if anything at all. To shut down resources so you don't incur billing beyond this tutorial, you can delete the resources you created or delete the whole project. New users of Google Cloud are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will use Google Cloud Shell, a command-line environment running in the cloud.

From the Google Cloud Console, click the Cloud Shell icon on the top right toolbar.

It should only take a few moments to provision and connect to the environment. Once connected, a Cloud Shell terminal session opens at the bottom of the console.

This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory and runs on Google Cloud, greatly enhancing network performance and authentication. All of your work in this codelab can be done within a browser; you do not need to install anything.
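
Once connected, you can verify that you are authenticated and that gcloud points at your project. A quick sanity check (replace <PROJECT_ID> with your own project ID if needed):

# Confirm you are authenticated:
gcloud auth list

# Confirm the active project:
gcloud config list project

# Set it if necessary:
gcloud config set project <PROJECT_ID>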

3. Explore the Wikipedia dataset

First, explore the Wikipedia dataset in BigQuery.

Go to the BigQuery section of the Google Cloud Console.

Under the bigquery-samples project, you should see various public datasets, including some Wikipedia-related datasets.

Under the wikipedia_pageviews dataset, you can see various tables for page views from different years.

You can select one of the tables (e.g. 201207) and preview the data.

You can also run queries against the table. For example, this query selects the top 100 titles with the most views:

SELECT TITLE, SUM(views)
FROM `bigquery-samples.wikipedia_pageviews.201207h`
GROUP BY TITLE
ORDER BY SUM(VIEWS) DESC
LIMIT 100

Once you run the query, it takes about 20 seconds for BigQuery to return the results.
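
If you prefer the command line, you can run the same query from Cloud Shell with the bq tool (shown here as an optional alternative; the console isn't required):

bq query --use_legacy_sql=false \
'SELECT TITLE, SUM(views)
FROM `bigquery-samples.wikipedia_pageviews.201207h`
GROUP BY TITLE
ORDER BY SUM(VIEWS) DESC
LIMIT 100'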

4. Define a workflow to run multiple queries

Running a query against a single table is easy. However, running multiple queries against multiple tables and collating the results can get quite tedious. This is where Workflows and its iteration syntax can help!
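
As a quick primer on that syntax, a for loop in Workflows iterates over a list and runs the nested steps once per element. A minimal, illustrative sketch (the step and variable names here are made up for the example):

- loopOverItems:
    for:
        value: item                 # current element of the list
        in: ${["a", "b", "c"]}      # list to iterate over
        steps:
        - logItem:
            call: sys.log           # Workflows standard library logging
            args:
                text: ${item}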

In Cloud Shell, create a workflow-serial.yaml file for a workflow that runs multiple queries against multiple tables:

touch workflow-serial.yaml

You can then edit the file with the built-in editor in Cloud Shell.

Inside the workflow-serial.yaml file, in the first init step, create a results map to keep track of each iteration's result, keyed by table name. Also define a tables array with the list of tables you want to run queries against. In this case, we choose 5 tables:

main:
    steps:
    - init:
        assign:
            - results: {}
            - tables:
                - 201201h
                - 201202h
                - 201203h
                - 201204h
                - 201205h

Next, define a runQueries step. This step iterates over each table and uses Workflows' BigQuery connector to run a query that finds the top 100 titles with the most page views in each table. It then saves the top title and its view count from each table in the results map:

    - runQueries:
        for:
            value: table
            in: ${tables}
            steps:
            - runQuery:
                call: googleapis.bigquery.v2.jobs.query
                args:
                    projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                    body:
                        useLegacySql: false
                        useQueryCache: false
                        timeoutMs: 30000
                        # Find the top 100 titles with most views on Wikipedia
                        query: ${
                            "SELECT TITLE, SUM(views)
                            FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                            WHERE LENGTH(TITLE) > 10
                            GROUP BY TITLE
                            ORDER BY SUM(VIEWS) DESC
                            LIMIT 100"
                            }
                result: queryResult
            - returnResult:
                assign:
                    # Return the top title from each table
                    - results[table]: {}
                    - results[table].title: ${queryResult.rows[0].f[0].v}
                    - results[table].views: ${queryResult.rows[0].f[1].v}
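
For reference, the connector returns the standard BigQuery query response: each entry in rows contains an f array of field values, one per selected column and in the same order, each wrapped in a v. An abridged, illustrative sketch of the shape (values are placeholders):

queryResult:
    rows:
        - f:
            - v: "<top title>"      # first column: TITLE
            - v: "<view count>"     # second column: SUM(views)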

In the final step, return the results map:

    - returnResults:
        return: ${results}

5. Run multiple queries with Workflows

Before you can deploy and run the workflow, make sure the Workflows API is enabled. You can enable it from the Google Cloud Console or with gcloud in Cloud Shell:

gcloud services enable workflows.googleapis.com

Create a service account for Workflows:

SERVICE_ACCOUNT=workflows-bigquery-sa
gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="Workflows BigQuery service account"

Grant the service account the roles it needs to write logs and run BigQuery jobs. Note that gcloud projects add-iam-policy-binding accepts only a single --role per invocation, so run it once for each role:

PROJECT_ID=your-project-id
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --role roles/logging.logWriter \
  --member serviceAccount:$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --role roles/bigquery.jobUser \
  --member serviceAccount:$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
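
You can optionally confirm that the bindings are in place (a sketch using gcloud's policy filtering; you may prefer to check in the console instead):

gcloud projects get-iam-policy $PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com" \
  --format="table(bindings.role)"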

Deploy the workflow with the service account:

gcloud workflows deploy bigquery-serial \
    --source=workflow-serial.yaml \
    --service-account=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com

Finally, you're ready to run the workflow.

Find the bigquery-serial workflow under the Workflows section of the Cloud Console and click the Execute button.

Alternatively, you can run the workflow with gcloud in Cloud Shell:

gcloud workflows run bigquery-serial

The workflow execution should last about 1 minute (roughly 20 seconds for each of the 5 tables).
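
gcloud workflows run waits for the execution to complete and prints the result. To look at past executions and their timings, you can list them and describe one (the execution ID below is a placeholder):

gcloud workflows executions list bigquery-serial

gcloud workflows executions describe <EXECUTION_ID> \
    --workflow bigquery-serial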

When the execution finishes, you will see the output from each table with the top titles and view counts.

6. Parallelize multiple queries with parallel steps

The workflow in the previous step took around 1 minute because it ran 5 queries serially, each taking about 20 seconds. Since these are independent queries, you can run them in parallel using the parallel iteration feature of Workflows.

Copy the workflow-serial.yaml file into a new workflow-parallel.yaml file. In the new file, you will make a couple of changes to turn the serial steps into parallel steps.
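
In Cloud Shell:

cp workflow-serial.yaml workflow-parallel.yaml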

In the workflow-parallel.yaml file, change the runQueries step. First, add the parallel keyword; this allows each iteration of the for loop to run in parallel. Second, declare the results variable as shared; this makes the variable writable by the parallel branches, with each iteration appending its result to it:

- runQueries:
    parallel:
        shared: [results]
        for:
            value: table
            in: ${tables}
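
The rest of the loop body (the runQuery and returnResult steps) stays exactly as it was in the serial version. For reference, the complete runQueries step now looks like this:

- runQueries:
    parallel:
        shared: [results]
        for:
            value: table
            in: ${tables}
            steps:
            - runQuery:
                call: googleapis.bigquery.v2.jobs.query
                args:
                    projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                    body:
                        useLegacySql: false
                        useQueryCache: false
                        timeoutMs: 30000
                        # Find the top 100 titles with most views on Wikipedia
                        query: ${
                            "SELECT TITLE, SUM(views)
                            FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                            WHERE LENGTH(TITLE) > 10
                            GROUP BY TITLE
                            ORDER BY SUM(VIEWS) DESC
                            LIMIT 100"
                            }
                result: queryResult
            - returnResult:
                assign:
                    # Return the top title from each table
                    - results[table]: {}
                    - results[table].title: ${queryResult.rows[0].f[0].v}
                    - results[table].views: ${queryResult.rows[0].f[1].v}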

Deploy the parallel workflow:

gcloud workflows deploy bigquery-parallel \
    --source=workflow-parallel.yaml \
    --service-account=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com

Run the workflow:

gcloud workflows run bigquery-parallel

This time, the workflow execution should last only about 20 seconds, because all 5 queries run in parallel. That's up to a 5x speed improvement with only a couple of changed lines!

In the end, you will see the same output from each table with top titles and views, but with a much shorter execution time.

7. Congratulations

Congratulations, you finished the codelab! To learn more, check out the Workflows documentation on parallel steps.

What we've covered

  • How to run BigQuery queries against the Wikipedia dataset.
  • How to run multiple queries serially as part of a Workflows orchestration.
  • How to parallelize queries using Workflows parallel iteration for up to a 5x speed improvement.