BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes of data without needing a database administrator or any infrastructure to manage. BigQuery uses familiar SQL and a pay-only-for-what-you-use charging model. BigQuery allows you to focus on analyzing data to find meaningful insights.

In this lab we'll see how to query the GitHub public dataset, one of many available public datasets available on BigQuery.

What you'll learn

What you'll need

Open the GitHub dataset in the BigQuery web UI.

https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=github_repos&t=commits&page=table

Get a quick preview of how the data looks.

Open the Query editor,

enter this query to find the most common commit messages in the GitHub public dataset,

SELECT subject AS subject,
  COUNT(*) AS num_duplicates
FROM `bigquery-public-data.github_repos.commits`
GROUP BY subject
ORDER BY num_duplicates DESC
LIMIT 100

and click the Run query button.

In a few seconds, the result will be listed in the bottom, and it'll also tell you how much data was processed and how long it took:

In seconds, we queried a 730 GB table, but we only needed to process 9.26 GB of data to get the result! This is because BigQuery is a columnar database. Because we only queried a single column, the total amount of data processed is significantly less than the total table size.

Now try querying another dataset, such as one of the other public datasets.

For example, this query finds popular deprecated or unmaintained projects in the Libraries.io public dataset that are still used as a dependency in other projects.

SELECT
  name,
  dependent_projects_count,
  language,
  status
FROM
  `bigquery-public-data.libraries_io.projects_with_repository_fields`
WHERE status IN ('Deprecated', 'Unmaintained')
ORDER BY dependent_projects_count DESC
LIMIT 100

Other organizations have also made their data available publicly on BigQuery. For example, the GitHub Archive dataset can be used to analyze public events on GitHub such as pull requests, repository stars, and issues opened. The Python Software Foundation's PyPI dataset can be used to analyze download requests for Python packages.

BigQuery is that simple! With basic SQL knowledge, you are now able to query terabytes of data!

What we've covered

Learn More