Using BigQuery with Python

1. Overview

BigQuery is Google's fully managed, petabyte scale, low cost analytics data warehouse. BigQuery is NoOps—there is no infrastructure to manage and you don't need a database administrator—so you can focus on analyzing data to find meaningful insights, use familiar SQL, and take advantage of our pay-as-you-go model.

In this codelab, you will use Google Cloud Client Libraries for Python to query BigQuery public datasets with Python.

What you'll learn

  • How to use Cloud Shell
  • How to enable the BigQuery API
  • How to authenticate API requests
  • How to install the Python client library
  • How to query the works of Shakespeare
  • How to query the GitHub dataset
  • How to adjust caching and display statistics

What you'll need

  • A Google Cloud Project
  • A Browser, such as Chrome or Firefox
  • Familiarity using Python

Survey

How will you use this tutorial?

  • Read it through only
  • Read it and complete the exercises

How would you rate your experience with Python?

  • Novice
  • Intermediate
  • Proficient

How would you rate your experience with using Google Cloud services?

  • Novice
  • Intermediate
  • Proficient

2. Setup and requirements

Self-paced environment setup

  1. Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.

  • The Project name is the display name for this project's participants. It is a character string not used by Google APIs, and you can update it at any time.
  • The Project ID must be unique across all Google Cloud projects and is immutable (it cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference the Project ID (it is typically identified as PROJECT_ID), so if you don't like the generated one, you can generate another random one, or you can try your own and see if it's available. It is "frozen" after the project is created.
  • There is a third value, a Project Number which some APIs use. Learn more about all three of these values in the documentation.
  2. Next, you'll need to enable billing in the Cloud Console in order to use Cloud resources/APIs. Running through this codelab shouldn't cost much, if anything at all. To shut down resources so you don't incur billing beyond this tutorial, follow any "clean-up" instructions found at the end of the codelab. New users of Google Cloud are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Google Cloud Shell, a command line environment running in the Cloud.

Activate Cloud Shell

  1. From the Cloud Console, click Activate Cloud Shell.

If you've never started Cloud Shell before, you're presented with an intermediate screen (below the fold) describing what it is. If that's the case, click Continue (and you won't ever see it again).

It should only take a few moments to provision and connect to Cloud Shell.

This virtual machine is loaded with all the development tools you need. It offers a persistent 5GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with just a browser or your Chromebook.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your project ID.

  2. Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`
  3. Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If the project is not set, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

3. Enable the API

The BigQuery API should be enabled by default in all Google Cloud projects. You can check whether this is the case with the following command in Cloud Shell:

gcloud services list

You should see BigQuery listed:

NAME                              TITLE
bigquery.googleapis.com           BigQuery API

...

If the BigQuery API is not enabled, you can enable it with the following command in Cloud Shell:

gcloud services enable bigquery.googleapis.com

4. Authenticate API requests

In order to make requests to the BigQuery API, you need to use a service account. A service account belongs to your project, and the Google Cloud Python client library uses it to make BigQuery API requests. Like any other user account, a service account is represented by an email address. In this section, you will use the Cloud SDK to create a service account and then create the credentials you need to authenticate as that service account.

First, set a PROJECT_ID environment variable:

export PROJECT_ID=$(gcloud config get-value core/project)

Next, create a new service account to access the BigQuery API by using:

gcloud iam service-accounts create my-bigquery-sa \
  --display-name "my bigquery service account"

Next, create credentials that your Python code will use to log in as your new service account. Create these credentials and save them as a JSON file, ~/key.json, by using the following command:

gcloud iam service-accounts keys create ~/key.json \
  --iam-account my-bigquery-sa@${PROJECT_ID}.iam.gserviceaccount.com

Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which the BigQuery Python client library (covered in the next step) uses to find your credentials. Set the environment variable to the full path of the credentials JSON file you created:

export GOOGLE_APPLICATION_CREDENTIALS=~/key.json

You can read more about authenticating the BigQuery API.
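
If you prefer not to rely on an environment variable, the client library can also load the service account key file explicitly. Here is a minimal sketch, assuming the key was saved to ~/key.json as above (the client library itself is installed in a later step):

import os
from google.cloud import bigquery

# Load credentials directly from the key file created above
# (an alternative to setting GOOGLE_APPLICATION_CREDENTIALS).
key_path = os.path.expanduser('~/key.json')
client = bigquery.Client.from_service_account_json(key_path)
print(client.project)  # should print your project ID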

5. Set up access control

BigQuery uses Identity and Access Management (IAM) to manage access to resources. BigQuery has a number of predefined roles (user, dataOwner, dataViewer, etc.) that you can assign to the service account you created in the previous step. You can read more about Access Control in the BigQuery docs.

Before you can query public datasets, you need to make sure the service account has at least the roles/bigquery.user role. In Cloud Shell, run the following command to assign the user role to the service account:

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member "serviceAccount:my-bigquery-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role "roles/bigquery.user"

You can run the following command to verify that the service account has the user role:

gcloud projects get-iam-policy $PROJECT_ID

You should see the following:

bindings:
- members:
  - serviceAccount:my-bigquery-sa@<PROJECT_ID>.iam.gserviceaccount.com
  role: roles/bigquery.user
...

6. Install the client library

Install the BigQuery Python client library:

pip3 install --user --upgrade google-cloud-bigquery
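
To confirm that the library installed correctly, you can import it and print its version (the exact version number will vary):

python3 -c "import google.cloud.bigquery as bq; print(bq.__version__)"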

You're now ready to code with the BigQuery API!

7. Query the works of Shakespeare

A public dataset is any dataset that's stored in BigQuery and made available to the general public. There are many public datasets available for you to query. While some datasets are hosted by Google, most are hosted by third parties. For more info, see the Public Datasets page.

In addition to public datasets, BigQuery provides a limited number of sample tables that you can query. These tables are contained in the bigquery-public-data:samples dataset. The shakespeare table in the samples dataset contains a word index of the works of Shakespeare. It gives the number of times each word appears in each corpus.
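
If you'd like to peek at the table's structure before querying it, the client library can fetch table metadata. A small optional sketch (run it only after the authentication and installation steps above):

from google.cloud import bigquery

client = bigquery.Client()

# Fetch metadata for the public shakespeare sample table.
table = client.get_table('bigquery-public-data.samples.shakespeare')
print([field.name for field in table.schema])
print(f'{table.num_rows} rows')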

In this step, you will query the shakespeare table.

First, in Cloud Shell create a simple Python application that you'll use to run the BigQuery API samples:

mkdir bigquery-demo
cd bigquery-demo
touch app.py

Open the code editor from the top right side of Cloud Shell.

Navigate to the app.py file inside the bigquery-demo folder and replace the code with the following.

from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT corpus AS title, COUNT(word) AS unique_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY title
    ORDER BY unique_words DESC
    LIMIT 10
"""
results = client.query(query)

for row in results:
    title = row['title']
    unique_words = row['unique_words']
    print(f'{title:<20} | {unique_words}')

Take a minute or two to study the code and see how the table is being queried.
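
One detail worth noting as you read the loop: rows returned by the client library can be read by key (row['title']) or by attribute (row.title), so the body of the loop could equally be written like this (a stylistic alternative, not a requirement):

for row in results:
    # Equivalent to row['title'] and row['unique_words'] above.
    print(f'{row.title:<20} | {row.unique_words}')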

Back in Cloud Shell, run the app:

python3 app.py

You should see a list of Shakespeare corpora and their unique word counts:

hamlet               | 5318
kinghenryv           | 5104
cymbeline            | 4875
troilusandcressida   | 4795
kinglear             | 4784
kingrichardiii       | 4713
2kinghenryvi         | 4683
coriolanus           | 4653
2kinghenryiv         | 4605
antonyandcleopatra   | 4582

8. Query the GitHub dataset

To get more familiar with BigQuery, you'll now issue a query against the GitHub public dataset. You will find the most common commit messages on GitHub. You'll also use BigQuery's web console to preview and run ad hoc queries.

To see what the data looks like, open the GitHub dataset in the BigQuery web UI:

Open the github_repos table

Click the Preview button to see what the data looks like.

Navigate to the app.py file inside the bigquery-demo folder and replace the code with the following.

from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT subject AS subject, COUNT(*) AS num_duplicates
    FROM `bigquery-public-data.github_repos.commits`
    GROUP BY subject
    ORDER BY num_duplicates DESC
    LIMIT 10
"""
results = client.query(query)

for row in results:
    subject = row['subject']
    num_duplicates = row['num_duplicates']
    print(f'{subject:<20} | {num_duplicates:>9,}')

Take a minute or two to study the code and see how the table is being queried for the most common commit messages.

Back in Cloud Shell, run the app:

python3 app.py

You should see a list of commit messages and their occurrences:

Update README.md     | 1,685,515
Initial commit       | 1,577,543
update               |   211,017
                     |   155,280
Create README.md     |   153,711
Add files via upload |   152,354
initial commit       |   145,224
first commit         |   110,314
Update index.html    |    91,893
Update README        |    88,862

9. Caching and statistics

BigQuery caches the results of queries, so subsequent runs of the same query take less time. It's possible to disable caching with query options. BigQuery also keeps track of statistics about queries, such as creation time, end time, and total bytes processed.

In this step, you will disable caching and also display stats about the queries.

Navigate to the app.py file inside the bigquery-demo folder and replace the code with the following.

from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT subject AS subject, COUNT(*) AS num_duplicates
    FROM `bigquery-public-data.github_repos.commits`
    GROUP BY subject
    ORDER BY num_duplicates DESC
    LIMIT 10
"""
job_config = bigquery.job.QueryJobConfig(use_query_cache=False)
results = client.query(query, job_config=job_config)

for row in results:
    subject = row['subject']
    num_duplicates = row['num_duplicates']
    print(f'{subject:<20} | {num_duplicates:>9,}')

print('-'*60)
print(f'Created: {results.created}')
print(f'Ended:   {results.ended}')
print(f'Bytes:   {results.total_bytes_processed:,}')

A couple of things to note about the code. First, caching is disabled by introducing QueryJobConfig and setting use_query_cache to False. Second, the statistics about the query are accessed from the job object.
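
If you want to see the effect of caching directly, the completed job also exposes a cache_hit attribute. An optional sketch that reuses the client and query from the code above, running the same query once with the cache and once without:

# Run with the cache enabled (the default), then with it disabled.
cached = client.query(query)
list(cached)  # wait for the job to finish
print(f'cache_hit={cached.cache_hit}, bytes={cached.total_bytes_processed}')

uncached = client.query(query, job_config=bigquery.job.QueryJobConfig(use_query_cache=False))
list(uncached)
print(f'cache_hit={uncached.cache_hit}, bytes={uncached.total_bytes_processed}')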

Back in Cloud Shell, run the app:

python3 app.py

Like before, you should see a list of commit messages and their occurrences. In addition, you should also see some statistics about the query at the end:

Update README.md     | 1,685,515
Initial commit       | 1,577,543
update               |   211,017
                     |   155,280
Create README.md     |   153,711
Add files via upload |   152,354
initial commit       |   145,224
first commit         |   110,314
Update index.html    |    91,893
Update README        |    88,862
------------------------------------------------------------
Created: 2020-04-03 13:30:08.801000+00:00
Ended:   2020-04-03 13:30:15.334000+00:00
Bytes:   2,868,251,894

10. Loading data into BigQuery

If you want to query your own data, you need to load your data into BigQuery. BigQuery supports loading data from many sources including Cloud Storage, other Google services, and other readable sources. You can even stream your data using streaming inserts. For more info see the Loading data into BigQuery page.
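
For comparison with the load job used in this step, a streaming insert with the Python client looks roughly like the sketch below. The table ID is hypothetical, and the table is assumed to already exist with a matching schema:

from google.cloud import bigquery

client = bigquery.Client()

# Stream a couple of rows into an existing table (hypothetical table ID).
table = client.get_table('your-project.your_dataset.your_table')
rows = [
    {'name': 'Alabama', 'post_abbr': 'AL'},
    {'name': 'Alaska', 'post_abbr': 'AK'},
]
errors = client.insert_rows_json(table, rows)
print(errors or 'Rows streamed successfully')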

In this step, you will load a JSON file stored on Cloud Storage into a BigQuery table. The JSON file is located at gs://cloud-samples-data/bigquery/us-states/us-states.json

If you're curious about the contents of the JSON file, you can use the gsutil command-line tool to download it in Cloud Shell:

gsutil cp gs://cloud-samples-data/bigquery/us-states/us-states.json .

You can see that it contains a list of US states, with each state as a JSON document on a separate line:

head us-states.json
{"name": "Alabama", "post_abbr": "AL"}
{"name": "Alaska", "post_abbr":  "AK"}
...

To load this JSON file into BigQuery, navigate to the app.py file inside the bigquery-demo folder and replace the code with the following.

from google.cloud import bigquery

client = bigquery.Client()

gcs_uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.json'

dataset = client.create_dataset('us_states_dataset')
table = dataset.table('us_states_table')

job_config = bigquery.job.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING'),
]
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON

load_job = client.load_table_from_uri(gcs_uri, table, job_config=job_config)
load_job.result()  # Wait for the load job to complete.

print('JSON file loaded to BigQuery')

Take a minute or two to study how the code loads the JSON file and creates a table with a schema under a dataset.

Back in Cloud Shell, run the app:

python3 app.py

A dataset and a table are created in BigQuery.
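
You can also verify the load programmatically. A quick optional check, assuming the dataset and table names used in the code above:

from google.cloud import bigquery

client = bigquery.Client()

# Confirm that the load job populated the new table.
table = client.get_table(f'{client.project}.us_states_dataset.us_states_table')
print(f'{table.num_rows} rows loaded into {table.full_table_id}')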

To verify that the dataset was created, go to the BigQuery console. You should see a new dataset and table. Switch to the preview tab of the table to see your data.

11. Congratulations!

You learned how to use BigQuery with Python!

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:

  • In the Cloud Console, go to the Manage resources page.
  • In the project list, select your project then click Delete.
  • In the dialog, type the project ID and then click Shut down to delete the project.

Learn more

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.