1. Introduction
Workflows is a fully managed orchestration service that executes Google Cloud services and external services in an order that you define.
BigQuery is a fully managed enterprise data warehouse that helps you manage and analyze terabytes of data with built-in features like machine learning, geospatial analysis, and business intelligence.
In this codelab, you will run some BigQuery queries against the public Wikipedia dataset. You will then see how to run multiple BigQuery queries one after another, serially, as part of a Workflows orchestration. Finally, you will parallelize the queries using the parallel iteration feature of Workflows for up to a 5x speed improvement.
What you'll learn
- How to run BigQuery queries against the Wikipedia dataset.
- How to run multiple queries as part of a Workflows orchestration serially.
- How to parallelize queries using Workflows parallel iteration for up to a 5x speed improvement.
2. Setup and Requirements
Self-paced environment setup
- Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.
- The Project name is the display name for this project's participants. It is a character string not used by Google APIs. You can update it at any time.
- The Project ID must be unique across all Google Cloud projects and is immutable (it cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference the Project ID (it is typically identified as PROJECT_ID). If you don't like the generated ID, you may generate another random one. Alternatively, you can try your own and see if it's available. It cannot be changed after this step and will remain for the duration of the project.
- For your information, there is a third value, a Project Number, which some APIs use. Learn more about all three of these values in the documentation.
- Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab shouldn't cost much, if anything at all. To shut down resources so you don't incur billing beyond this tutorial, you can delete the resources you created or delete the whole project. New users of Google Cloud are eligible for the $300 USD Free Trial program.
Start Cloud Shell
While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Google Cloud Shell, a command line environment running in the Cloud.
From the Google Cloud Console, click the Cloud Shell icon on the top right toolbar:
It should only take a few moments to provision and connect to the environment. When it is finished, you should see something like this:
This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory, and runs on Google Cloud, greatly enhancing network performance and authentication. All of your work in this codelab can be done within a browser. You do not need to install anything.
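Once connected to Cloud Shell, you can double-check that you are authenticated and that the project is set to your project ID:

gcloud auth list
gcloud config list project

If, for some reason, the project is not set, run the following command (where <PROJECT_ID> is a placeholder for your own project ID):

gcloud config set project <PROJECT_ID>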
3. Explore the Wikipedia dataset
First, explore the Wikipedia dataset in BigQuery.
Go to the BigQuery section of Google Cloud Console:
Under bigquery-samples, you should see various public datasets, including some Wikipedia-related datasets:
Under the wikipedia_pageviews dataset, you can see various tables for page views from different years:
You can select one of the tables (e.g. 201207h) and preview the data:
You can also run queries against the table. For example, this query selects the top 100 titles with the most views:
SELECT TITLE, SUM(views)
FROM `bigquery-samples.wikipedia_pageviews.201207h`
GROUP BY TITLE
ORDER BY SUM(views) DESC
LIMIT 100
Once you run the query, it takes about 20 seconds for the results to load:
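If you prefer the command line, you can run the same query with the bq tool preinstalled in Cloud Shell. This is a sketch of the equivalent command; the --use_legacy_sql=false flag selects standard SQL:

bq query --use_legacy_sql=false \
'SELECT TITLE, SUM(views)
FROM `bigquery-samples.wikipedia_pageviews.201207h`
GROUP BY TITLE
ORDER BY SUM(views) DESC
LIMIT 100'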
4. Define a workflow to run multiple queries
Running a query against a single table is easy. However, running multiple queries against multiple tables and collating the results can get quite tedious. This is where Workflows and its iteration syntax can help!
In Cloud Shell, create a workflow-serial.yaml file to build a workflow that runs multiple queries against multiple tables:
touch workflow-serial.yaml
You can then edit the file with the editor in Cloud Shell:
Inside the workflow-serial.yaml file, in the first init step, create a results map to keep track of each iteration, keyed by table name. Also define a tables array with the list of tables you want to run queries against. In this case, we are choosing 5 tables:
main:
  steps:
    - init:
        assign:
          - results: {}
          - tables:
              - 201201h
              - 201202h
              - 201203h
              - 201204h
              - 201205h
Next, define a runQueries step. This step iterates over each table and uses Workflows' BigQuery connector to run a query to find the top 100 titles with the most page views in each table. It then saves the top title and views from each table in the results map:
- runQueries:
    for:
      value: table
      in: ${tables}
      steps:
        - runQuery:
            call: googleapis.bigquery.v2.jobs.query
            args:
              projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
              body:
                useLegacySql: false
                useQueryCache: false
                timeoutMs: 30000
                # Find the top 100 titles with most views on Wikipedia
                query: ${ "SELECT TITLE, SUM(views) FROM `bigquery-samples.wikipedia_pageviews." + table + "` WHERE LENGTH(TITLE) > 10 GROUP BY TITLE ORDER BY SUM(VIEWS) DESC LIMIT 100" }
            result: queryResult
        - returnResult:
            assign:
              # Return the top title from each table
              - results[table]: {}
              - results[table].title: ${queryResult.rows[0].f[0].v}
              - results[table].views: ${queryResult.rows[0].f[1].v}
In the final step, return the results map:
- returnResults:
    return: ${results}
5. Run multiple queries with Workflows
Before you can deploy and run the workflow, you need to make sure the Workflows API is enabled. You can enable it from the Google Cloud Console or using gcloud in Cloud Shell:
gcloud services enable workflows.googleapis.com
Create a service account for Workflows:
SERVICE_ACCOUNT=workflows-bigquery-sa

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="Workflows BigQuery service account"
Make sure the service account has the roles it needs to write logs and run BigQuery jobs. Note that gcloud projects add-iam-policy-binding accepts a single --role per invocation, so grant each role with a separate command:
PROJECT_ID=your-project-id

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --role roles/logging.logWriter \
  --member serviceAccount:$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --role roles/bigquery.jobUser \
  --member serviceAccount:$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
Deploy the workflow with the service account:
gcloud workflows deploy bigquery-serial \
  --source=workflow-serial.yaml \
  --service-account=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
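Optionally, you can verify that the workflow deployed as expected:

gcloud workflows describe bigquery-serial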
Finally, you're ready to run the workflow.
Find the bigquery-serial workflow in the Workflows section of the Cloud Console and click the Execute button:
Alternatively, you can also run the workflow with gcloud in Cloud Shell:
gcloud workflows run bigquery-serial
You should see the workflow execution last about 100 seconds (roughly 20 seconds for each of the 5 tables, run one after another).
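If you want to dig into a run afterwards, you can list executions and describe one of them (EXECUTION_ID is a placeholder; copy a real ID from the list output):

gcloud workflows executions list bigquery-serial
gcloud workflows executions describe EXECUTION_ID --workflow=bigquery-serial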
In the end, you will see the output from each table with top titles and views:
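The exact titles and view counts will differ when you run it, but the returned results map has roughly this shape (the values below are placeholders, not real results), with one entry per table:

{
  "201201h": {
    "title": "<top title for that month>",
    "views": "<view count>"
  },
  "201202h": {
    "title": "<top title for that month>",
    "views": "<view count>"
  }
}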
6. Parallelize multiple queries with parallel steps
The workflow in the previous step took around 100 seconds because it ran 5 queries serially, each taking about 20 seconds. Since these are independent queries, you can actually run them in parallel using the parallel iteration feature of Workflows.
Copy the workflow-serial.yaml file into a new workflow-parallel.yaml file. In the new file, you will make a couple of changes to turn the serial steps into parallel steps.
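You can make the copy in Cloud Shell:

cp workflow-serial.yaml workflow-parallel.yaml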
In the workflow-parallel.yaml file, change the runQueries step. First, add the parallel keyword; this allows each iteration of the for loop to run in parallel. Second, declare the results variable as a shared variable; this makes the variable writable by the parallel branches, so each iteration can write its result to it.
- runQueries:
    parallel:
      shared: [results]
      for:
        value: table
        in: ${tables}
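The rest of the runQueries step is unchanged: the existing runQuery and returnResult steps simply move under the parallel for loop. As a sketch of the final structure (the elided parts are identical to workflow-serial.yaml):

- runQueries:
    parallel:
      shared: [results]
      for:
        value: table
        in: ${tables}
        steps:
          - runQuery:
              # ... same call/args/result as in workflow-serial.yaml ...
          - returnResult:
              # ... same assign block as in workflow-serial.yaml ...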
Deploy the parallel workflow:
gcloud workflows deploy bigquery-parallel \
  --source=workflow-parallel.yaml \
  --service-account=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
Run the workflow:
gcloud workflows run bigquery-parallel
You should see the workflow execution last about 20 seconds, because all 5 queries run in parallel. Up to a 5x speed improvement with only a couple of lines changed!
In the end, you will see the same output from each table with top titles and views but with much shorter execution time:
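If you'd like to compare the two runs side by side, you can list the most recent executions of each workflow; the output includes start and end times (these use standard gcloud list flags):

gcloud workflows executions list bigquery-serial --limit=3
gcloud workflows executions list bigquery-parallel --limit=3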
7. Congratulations
Congratulations, you finished the codelab! To learn more, check out the Workflows documentation on parallel steps.
What we've covered
- How to run BigQuery queries against the Wikipedia dataset.
- How to run multiple queries as part of a Workflows orchestration serially.
- How to parallelize queries using Workflows parallel iteration for up to a 5x speed improvement.