BigQuery is Google's fully managed, NoOps, low-cost analytics database. With BigQuery you can query terabytes of data without a database administrator or any infrastructure to manage. BigQuery uses familiar SQL and a pay-only-for-what-you-use pricing model, so you can focus on analyzing data to find meaningful insights.
In this lab we'll see how to query the GitHub public dataset, one of many public datasets available on BigQuery.
If you don't already have a Google Account (Gmail or Google Apps), you must create one.
This codelab uses BigQuery resources within the BigQuery sandbox limits. A billing account is not required. If you later want to remove the sandbox limits, you can add a billing account by signing up for the Google Cloud Platform free trial.
Open the query editor and enter this query to find the most common commit messages in the GitHub public dataset:
SELECT
  subject AS subject,
  COUNT(*) AS num_duplicates
FROM
  `bigquery-public-data.github_repos.sample_commits`
GROUP BY
  subject
ORDER BY
  num_duplicates DESC
LIMIT
  100
Since the GitHub dataset is large, it helps to use a smaller sample dataset while you're experimenting to save on costs. Use the estimate of bytes processed, shown below the editor, to gauge the query cost before you run it.
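To turn a bytes-processed figure into a rough dollar amount, a small helper like the one below can be used. This is a minimal sketch, not an official calculator: the function name is hypothetical, and the price of 6.25 USD per TiB is an assumption that should be checked against the current BigQuery pricing page, since rates vary by region and over time. It also ignores any free monthly allowance.

```python
# Assumed on-demand price in USD per TiB scanned. This value is an assumption:
# check the current BigQuery pricing page before relying on it.
ASSUMED_USD_PER_TIB = 6.25

BYTES_PER_TIB = 2 ** 40


def estimate_query_cost_usd(bytes_processed, usd_per_tib=ASSUMED_USD_PER_TIB):
    """Return a rough on-demand cost for a query that scans bytes_processed bytes."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    return bytes_processed / BYTES_PER_TIB * usd_per_tib


if __name__ == "__main__":
    # Hypothetical figure of 500 MiB, as it might be read off below the editor.
    example_bytes = 500 * 2 ** 20
    print(f"Estimated cost: ${estimate_query_cost_usd(example_bytes):.6f}")
```

Passing the price as a parameter keeps the arithmetic reusable if the rate you are actually charged differs from the assumed default.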
Click the Run query button.
In a few seconds, the result will be listed at the bottom, along with how much data was processed and how long the query took:
Even though the sample_commits table is 2.49 GB, the query processed only 35.8 MB. BigQuery processes only the bytes from the columns used in the query, so the total amount of data processed can be significantly less than the table size. With clustering and partitioning, the amount of data processed can be reduced even further.
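To put those two figures side by side, the short calculation below works out what share of the table the query actually scanned. It is only an illustrative sketch: the function name is hypothetical, and the conversion of 1024 MB per GB is an assumption about how the sizes are reported.

```python
# Figures quoted in the text above.
TABLE_SIZE_GB = 2.49
PROCESSED_MB = 35.8

# Assumption: sizes are reported in binary units (1 GB = 1024 MB).
MB_PER_GB = 1024


def fraction_scanned(processed_mb, table_gb, mb_per_gb=MB_PER_GB):
    """Return the share of the table's bytes that the query read (0.0 to 1.0)."""
    return processed_mb / (table_gb * mb_per_gb)


if __name__ == "__main__":
    share = fraction_scanned(PROCESSED_MB, TABLE_SIZE_GB)
    print(f"The query scanned about {share:.1%} of the table")
```

Under that assumption the query read roughly 1.4% of the table, which is the effect of reading only the subject column rather than every column.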
Now try querying another dataset, such as one of the other public datasets.
For example, this query finds popular deprecated or unmaintained projects in the Libraries.io public dataset that are still used as a dependency in other projects.
SELECT
  name,
  dependent_projects_count,
  language,
  status
FROM
  `bigquery-public-data.libraries_io.projects_with_repository_fields`
WHERE
  status IN ('Deprecated', 'Unmaintained')
ORDER BY
  dependent_projects_count DESC
LIMIT
  100
Other organizations have also made their data available publicly on BigQuery. For example, the GitHub Archive dataset can be used to analyze public events on GitHub such as pull requests, repository stars, and issues opened. The Python Software Foundation's PyPI dataset can be used to analyze download requests for Python packages.
You've used BigQuery and SQL to query the GitHub public dataset. You have the power to query petabyte-scale datasets!