BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes of data without needing a database administrator or any infrastructure to manage. BigQuery uses familiar SQL and a pay-only-for-what-you-use charging model. BigQuery allows you to focus on analyzing data to find meaningful insights.

In this lab we'll see how to query the GitHub public dataset, one of many available public datasets available on BigQuery.

What you'll learn

What you'll need

Open the GitHub dataset in the BigQuery web UI.

https://bigquery.cloud.google.com/table/bigquery-public-data:github_repos.commits

Get a quick preview of how the data looks.

Click "Compose Query"

enter this query into the resulting query entry box, to find the most common commit messages,

#standardSQL
SELECT subject AS subject,
  COUNT(*) AS num_duplicates
FROM `bigquery-public-data.github_repos.commits`
GROUP BY subject
ORDER BY num_duplicates DESC
LIMIT 100

and click Run Query:

In a few seconds, the result will be listed in the bottom, and it'll also tell you how much data was processed and how long it took:

In seconds, we queried a 750 GB table, but we only needed to process 8.7GB of data to get the result! This is because BigQuery is a columnar database. Because we only queried a single column, the total amount of data processed is significantly less than the total table size.

Now try running a query on your own.

For example, this query finds the repository names with the largest number of authors.

#standardSQL
SELECT COUNT(DISTINCT author.email) AS num_authors,
  REGEXP_EXTRACT(repo_name[ORDINAL(1)], r"([^/]+)$") AS repo
FROM `bigquery-public-data.github_repos.commits`
GROUP BY repo
ORDER BY num_authors DESC
LIMIT 1000

Now try this query to find the most popular languages on Github based on number of pull requests in the GitHub Archive dataset:

#standardSQL
SELECT COUNT(*) pr_count,
   JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') lang 
FROM `githubarchive.month.201801` 
WHERE JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') 
IS NOT NULL 
GROUP BY lang 
ORDER BY pr_count DESC 
LIMIT 10

Now try querying one of the other public datasets.

BigQuery is that simple! With basic SQL knowledge, you are now able to query terabytes of data!

What we've covered

Learn More