BigQuery is Google's fully managed, NoOps, low-cost analytics database. With BigQuery you can query terabytes of data without needing a database administrator or any infrastructure to manage. BigQuery uses familiar SQL and a pay-only-for-what-you-use pricing model, letting you focus on analyzing data to find meaningful insights.
In this lab we'll see how to query the GitHub public dataset, one of the many public datasets available on BigQuery.
Click "Compose Query"
enter this query into the query editor to find the most common commit messages,
#standardSQL
SELECT
  subject AS subject,
  COUNT(*) AS num_duplicates
FROM
  `bigquery-public-data.github_repos.commits`
GROUP BY
  subject
ORDER BY
  num_duplicates DESC
LIMIT 100
and click Run Query:
In a few seconds, the results appear at the bottom, along with how much data was processed and how long the query took:
In seconds, we queried a 750 GB table, yet only 8.7 GB of data needed to be processed to get the result! That's because BigQuery is a columnar database: since we queried only a single column, the amount of data processed is far less than the total table size.
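You can see the columnar model for yourself: adding a second column to the SELECT list increases the "bytes processed" figure, because BigQuery now has to scan that column's data too. (This is an illustrative sketch; the exact byte counts you see will vary as the dataset grows. Note that LIMIT reduces the rows returned, not the data scanned.)

```sql
#standardSQL
-- Selecting committer.name in addition to subject scans both columns,
-- so "bytes processed" rises compared to the single-column query above.
SELECT
  subject,
  committer.name AS committer_name
FROM
  `bigquery-public-data.github_repos.commits`
LIMIT 10
```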
Now try running a query on your own.
For example, this query finds the repository names with the largest number of authors.
#standardSQL
SELECT
  COUNT(DISTINCT author.email) AS num_authors,
  REGEXP_EXTRACT(repo_name[ORDINAL(1)], r"([^/]+)$") AS repo
FROM
  `bigquery-public-data.github_repos.commits`
GROUP BY
  repo
ORDER BY
  num_authors DESC
LIMIT 1000
Now try this query to find the most popular languages on GitHub, based on the number of pull requests in the GitHub Archive dataset:
#standardSQL
SELECT
  COUNT(*) AS pr_count,
  JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') AS lang
FROM
  `githubarchive.month.201801`
WHERE
  JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') IS NOT NULL
GROUP BY
  lang
ORDER BY
  pr_count DESC
LIMIT 10
Now try querying one of the other public datasets.
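For example, here is a sketch of a query against the USA names public dataset (`bigquery-public-data.usa_names.usa_1910_2013`, which has `name`, `number`, `year`, `gender`, and `state` columns), listing the most common baby names on record:

```sql
#standardSQL
-- Sum the per-state, per-year counts for each name across the whole dataset.
SELECT
  name,
  SUM(number) AS total
FROM
  `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY
  name
ORDER BY
  total DESC
LIMIT 10
```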
BigQuery is that simple! With basic SQL knowledge, you are now able to query terabytes of data!