Partitioning and Clustering in BigQuery

1. Introduction

BigQuery is a fully-managed, petabyte-scale, low-cost enterprise data warehouse for analytics. BigQuery is serverless. You do not need to set up and manage clusters.

A BigQuery dataset resides in a GCP project and contains one or more tables. You can query these datasets with SQL.

In this codelab, you will use the BigQuery web UI in the GCP Console to understand partitioning and clustering in BigQuery. BigQuery's table partitioning and clustering helps structuring your data to match common data access patterns. Partition and clustering is key to fully maximize BigQuery performance and cost when querying over a specific data range. It results in scanning less data per query, and pruning is determined before query start time.

For more information about BigQuery, see BigQuery documentation.

What you'll learn

  • How to create and query partitioned and clustered tables
  • Compare query performance with partitioned and clustered tables

What you'll need

To complete this lab, you need:

  • The latest version of Google Chrome
  • A Google Cloud Platform billing account

2. Getting set up

To work with BigQuery, you need to create a GCP project or select an existing project.

Create a project

To create a new project, follow these steps:

  1. If you don't already have a Google Account (Gmail or Google Apps), create one.
  2. Sign-in to Google Cloud Platform console ( console.cloud.google.com) and create a new project.
  3. If you don't have any projects, click the create project button:

870a3cbd6541ee86.png

Otherwise, create a new project from the project selection menu:

f6dff3437a20cf2.png

  1. Enter a project name and select Create. Note the project ID, is a unique name across all Google Cloud projects.

1884405a64ce5765.png

3. Working with public datasets

BigQuery allows you to work with public datasets, including BBC News, GitHub repos, Stack Overflow, and the US National Oceanic and Atmospheric Administration (NOAA) datasets. You do not need to load these datasets into BigQuery. You just need to open the datasets to browse and query them in BigQuery. In this codelab, you will work with the Stack Overflow public dataset.

Browse the Stack Overflow dataset

The Stack Overflow dataset contains information about posts, tags, badges, comments, users, and more. To browse the Stack Overflow dataset in the BigQuery web UI, follow these steps:

  1. Open the Stack Overflow dataset. The BigQuery web UI opens in the GCP Console and displays information about the Stackoverflow dataset.
  2. In the navigation panel , select bigquery-public-data. The menu expands to list public datasets. Each dataset comprises one or more tables.
  3. Scroll down and select stackoverflow. The menu expands to list the tables in the Stack Overflow dataset.
  4. Select badges to see the schema for the badges table. Note the names of the fields in the table.
  5. Above the Field names, click Preview to see sample data for the badges table.

For more information about all the public datasets available in BigQuery, see Google BigQuery Public Datasets.

Query the Stackoverflow dataset

Browsing a dataset is a good way to understand the data that you are working with, but querying datasets is where BigQuery really shines. This section teaches you how to run BigQuery queries. You do not need to know any SQL at this point. You can copy and paste the queries below.

To run a query, complete the following steps:

  1. Near the top right of the GCP console, select Compose new query.
  2. In the Query editor text area, copy and paste the following SQL query. BigQuery validates the query and the web UI displays a green checkmark below the text area to indicate that the syntax is valid.
SELECT
  EXTRACT(YEAR FROM creation_date) AS creation_year,
  COUNT(*) AS total_posts
FROM `bigquery-public-data.stackoverflow.posts_questions`
GROUP BY creation_year
ORDER BY total_posts DESC
LIMIT 10
  1. Select Run. The query returns the number of Stack Overflow posts or questions posted every year.

4. Creating a New Table

In the previous section, you queried public datasets that BigQuery makes available to you. In this section, you will create a new table in BigQuery from an existing table. You will create a new table with data sampled from the Stack Overflow public dataset posts_questions table and then query the table.

Create a New Dataset

To create and load table data into BigQuery, first create a BigQuery dataset to hold the data by completing the following steps:

  1. In the GCP console navigation panel, select the project name created as part of the setup.
  2. On the right side, in the details panel, select Create dataset.

acc6378c49622323.png

  1. In the Create dataset dialog, for Dataset ID, type stackoverflow. Leave all of the other default settings in place and click OK.

7a2dfd8bcb8f259a.png

Create a New Table with 2018 StackOverflow Posts

Now that you have created a BigQuery dataset, you can create a new table in BigQuery. To create a table with data from an existing table, you will query the 2018 Stack Overflow posts dataset and write the results to a new table, by completing the following steps:

  1. Near the top right of the GCP console, select Compose new query.

9ca55f544e8da8bd.png

  1. In the Query editor text area, copy and paste the following SQL query to create a new table, which is a DDL statement.
CREATE OR REPLACE TABLE `stackoverflow.questions_2018` AS
SELECT id, title, accepted_answer_id, creation_date, answer_count , comment_count , favorite_count, view_count, tags
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE creation_date BETWEEN '2018-01-01' AND '2019-01-01';
  1. Select Run. The query creates a new table questions_2018 in the stackoverflow dataset in your project with data resulting from running a query on the BigQuery Stack Overflow dataset bigquery-public-data.stackoverflow.posts_questions.

Query the New Table with 2018 Stack Overflow Posts

Now that you have created a BigQuery table, let's run a query to return Stack Overflow posts with questions and titles along with a few other statistics such as number of answers, comments, views and favorites. Complete the following steps:

  1. Near the top right of the GCP console, select Compose new query.
  2. In the Query editor text area, copy and paste the following SQL query
SELECT id, title, accepted_answer_id, creation_date, answer_count , comment_count , favorite_count, view_count 
FROM  `stackoverflow.questions_2018` 
WHERE creation_date BETWEEN '2018-01-01' AND '2018-02-01'
AND tags = 'android';
  1. Select Run. The query returns Stack Overflow questions created in the month of January 2018 that are tagged as android along with the question and a few other statistics.
  2. By default, BigQuery caches the query results. Run the same query and you will see BigQuery took way less time to return the results because it returns results from the cache.
  3. Run the same query again but this time with BigQuery caching disabled. We will disable the cache for the rest of the lab to be fair in performance comparison against partitioned and clustered tables, that will be run in next sections. In the query editor, click More and select Query settings. Query settings
  4. Under Cache preference, uncheck Use cached results. Cached results option
  5. In the query results, you should see the time it took for the query to complete and volume of data processed to get the results.

f197b022b4276338.png

5. Creating and Querying a Partitioned Table

In the previous section, you created a new table in BigQuery with data from posts_questions table using the Stack Overflow public dataset. We queried this dataset with caching disabled and observed the query performance. In this section, you will create a new partitioned table from the same Stack Overflow public dataset's posts_questions table and observe the query performance.

A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. You can typically split large tables into many smaller partitions using data ingestion time or TIMESTAMP/DATE column or an INTEGER column. We will create a DATE partitioned table.

Learn more about partitioned tables here.

Create a New Partitioned Table with 2018 StackOverflow Posts

To create a partitioned table with data from an existing table or query, you will query the 2018 Stackoverflow posts dataset and write results to a new table, complete the following steps:

b9d0ca4df0881f58.png

  1. Near the top right of the GCP console, select Compose new query.

9ca55f544e8da8bd.png

  1. In the Query editor text area, copy and paste the following SQL query to create a new table, which is a DDL statement.
CREATE OR REPLACE TABLE `stackoverflow.questions_2018_partitioned` 
PARTITION BY DATE(creation_date) AS
SELECT id, title, accepted_answer_id, creation_date, answer_count , comment_count , favorite_count, view_count, tags
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE creation_date BETWEEN '2018-01-01' AND '2019-01-01';
  1. Select Run. The query creates a new table questions_2018_partitioned in the stackoverflow dataset in your project with data resulting from running a query on the BigQuery Stack Overflow dataset bigquery-public-data.stackoverflow.posts_questions

Query the Partitioned Table with 2018 Stack Overflow Posts

Now that you have created a BigQuery partitioned table, let's run the same query, this time on the partitioned table, to return Stack Overflow posts with questions and titles along with a few other statistics such as number of answers, comments, views and favorites. Complete the following steps:

  1. Near the top right of the GCP console, select Compose new query.
  2. In the Query editor text area, copy and paste the following SQL query
SELECT id, title, accepted_answer_id, creation_date, answer_count , comment_count , favorite_count, view_count 
FROM  `stackoverflow.questions_2018_partitioned` 
WHERE creation_date BETWEEN '2018-01-01' AND '2018-02-01'
AND tags = 'android';
  1. Select Run with BigQuery caching disabled (check previous section for disabling BigQuery cache). The query returns Stack Overflow questions created in the month of January 2018 that are tagged as android along with the question and a few other statistics.
  2. In the query results, you should see the time it took for the query to complete and volume of data processed to get the results.

ef01144374069823.png

You should see that the performance of the query with partitioned table is better than the non-partitioned table since BigQuery prunes the partitions i.e. scans only the required partitions processing less data and running faster. This optimizes the query costs and query performance.

6. Creating and Querying a Clustered Table

In the previous section, you created a partitioned table in BigQuery with data from the posts_questions table in the Stack Overflow public dataset. We queried this table with caching disabled and observed the query performance with both non-partitioned and partitioned tables. In this section, you will create a new clustered table from the same Stack Overflow public dataset's posts_questions table and observe the query performance.

When a table is clustered in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table's schema. The columns you specify are used to collocate related data. When data is written to a clustered table, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. The order of clustered columns determines the sort order of the data. When new data is added to a table or a specific partition, BigQuery performs automatic re-clustering in the background to restore the sort property of the table or partition.

Learn more about working with clustered tables here.

Create a New Clustered Table with 2018 Stack Overflow Posts

In this section, you will create a new table partitioned on creation_date and clustered on the tags column based on the query access pattern. To create a clustered table with data from an existing table or query, you will query the 2018 Stack Overflow posts table and write the results to a new table, by completing the following steps:

e7d9acc0dc3b9d79.png

  1. Near the top right of the GCP console, select Compose new query.

9ca55f544e8da8bd.png

  1. In the Query editor text area, copy and paste the following SQL query to create a new table, which is a DDL statement.
#standardSQL
CREATE OR REPLACE TABLE `stackoverflow.questions_2018_clustered`
PARTITION BY
  DATE(creation_date)
CLUSTER BY
  tags AS
SELECT
  id, title, accepted_answer_id, creation_date, answer_count , comment_count , favorite_count, view_count, tags
FROM
  `bigquery-public-data.stackoverflow.posts_questions`
WHERE
  creation_date BETWEEN '2018-01-01' AND '2019-01-01';
  1. Select Run. The query creates a new table questions_2018_clustered in the stackoverflow dataset in your project with data resulting from running a query on the BigQuery Stack Overflow tablebigquery-public-data.stackoverflow.posts_questions. The new table is partitioned on creation_date and clustered on the tags column.

Query the Clustered Table with 2018 Stack Overflow Posts

Now that you have created a BigQuery clustered table, let's run the same query again, this time on the partitioned and clustered table, to return Stack Overflow posts with questions and titles along with a few other statistics such as number of answers, comments, views and favorites. Complete the following steps:

  1. Near the top right of the GCP console, select Compose new query.
  2. In the Query editor text area, copy and paste the following SQL query
SELECT id, title, accepted_answer_id, creation_date, answer_count , comment_count , favorite_count, view_count 
FROM  `stackoverflow.questions_2018_clustered` 
WHERE creation_date BETWEEN '2018-01-01' AND '2018-02-01'
AND tags = 'android';
  1. Select Run with BigQuery caching disabled (check previous section for disabling BigQuery cache). The query returns Stack Overflow questions created in the month of January 2018 that are tagged as android along with the question and a few other statistics.
  2. In the query results, you should see the time it took for the query to complete and volume of data processed to get the results.

85e3c30d6fb3d547.png

With a partitioned and clustered table, the query scanned less data than a partitioned table or a non-partitioned table. The way data is organized by partitioning and clustering minimizes the amount of data scanned by slot workers thereby improving query performance and optimizing costs.

7. Cleaning up

Unless you plan to continue working with your stackoverflow dataset, you should delete it and delete the project that you created for this codelab.

Delete the BigQuery dataset

To delete the BigQuery dataset, perform the following steps:

  1. Select the stackoverflow dataset from the left side navigation panel in BigQuery .
  2. In the details panel, select Delete dataset. 67b0f5cb740cb2ec.png
  3. In the Delete dataset dialog, enter stackoverflow and select Delete to confirm that you want to delete the dataset.

Delete the project

To delete the GCP project that you created for this codelab, perform the following steps:

  1. In the GCP navigation menu, select IAM & Admin.
  2. In the navigation panel, select Settings.
  3. In the details panel, confirm that your current project is the project you created for this codelab and select Shut down.
  4. In the Shut down project dialog, enter the project ID (not project name) for your project and select Shut down to confirm.

Congratulations! You have now learned

  • How to use the BigQuery web UI to create a new table from existing tables
  • How to create and query partitioned and clustered tables
  • How partitioning and clustering optimizes query performance and costs

Note that you did not have to set up or manage clusters to work with datasets.