In this lab you use some advanced SQL concepts to answer the question: what programming languages do open-source programmers program in on weekends?

What you need

To complete this lab, you need:

Access to a supported Internet browser:

A Google Cloud Platform project

What you learn

In this lab, you write a query that uses advanced SQL concepts:

In this lab, you use some advanced SQL concepts to answer the question: what programming languages do open-source programmers program in on weekends?

To answer this question, we will use a BigQuery public dataset that has information on all GitHub commits.

In this section, you will learn how to work with nested fields.

Step 1

Open the Google Cloud Console (in the incognito window) and using the menu, navigate into BigQuery web UI.

Step 2

Compose a new query, making sure that the "Legacy SQL" option is not checked (you are using Standard SQL).

SELECT
  author.email,
  diff.new_path AS path,
  author.date
FROM
  `bigquery-public-data.github_repos.commits`,
  UNNEST(difference) diff
WHERE
  EXTRACT(YEAR
  FROM
    author.date)=2016
LIMIT 10

Step 3

Play a little with the query above to understand what it is doing. For example, instead of author.email, try just author. What type of field is author?

Step 4

Change diff.new_path to difference.new_path. Why does it not work? Replace difference.new_path by difference[OFFSET(0)].new_path. Does this work? Why? What is the UNNEST doing?

In this section, you will learn how to use regular expressions. Let's assume that the filename extension is the programming language, i.e., a file that ends in .py has the language "py". How will you pull out the extension from the path?

Step 1

Type the following query:

SELECT
  author.email,
  LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^\./\(~_ \- #]*)$')) lang,
  diff.new_path AS path,
  author.date
FROM
  `bigquery-public-data.github_repos.commits`,
  UNNEST(difference) diff
WHERE
  EXTRACT(YEAR
  FROM
    author.date)=2016
LIMIT
  10

Step 2

Modify the query above to only use lang if the language consists purely of letters and has a length that is fewer than 8 characters.

Step 3

Modify the query above to group by language and list in descending order of the number of commits. Here's a potential solution:

WITH
  commits AS (
  SELECT
    author.email,
    LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^\./\(~_ \- #]*)$')) lang,
    diff.new_path AS path,
    author.date
  FROM
    `bigquery-public-data.github_repos.commits`,
    UNNEST(difference) diff
  WHERE
    EXTRACT(YEAR
    FROM
      author.date)=2016 )
SELECT
  lang,
  COUNT(path) AS numcommits
FROM
  commits
WHERE
  LENGTH(lang) < 8
  AND lang IS NOT NULL
  AND REGEXP_CONTAINS(lang, '[a-zA-Z]')
GROUP BY
  lang
HAVING
  numcommits > 100
ORDER BY
  numcommits DESC

Now, group the commits based on whether or not it happened on a weekend. How would you do it?

Step 1

Modify the query above to extract the day of the week from author.date. Days 2 to 6 are weekdays.

Step 2

Here's a potential solution:

WITH
  commits AS (
  SELECT
    author.email,
    EXTRACT(DAYOFWEEK
    FROM
      author.date) BETWEEN 2
    AND 6 is_weekday,
    LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^\./\(~_ \- #]*)$')) lang,
    diff.new_path AS path,
    author.date
  FROM
    `bigquery-public-data.github_repos.commits`,
    UNNEST(difference) diff
  WHERE
    EXTRACT(YEAR
    FROM
      author.date)=2016)
SELECT
  lang,
  is_weekday,
  COUNT(path) AS numcommits
FROM
  commits
WHERE
  LENGTH(lang) < 8
  AND lang IS NOT NULL
  AND REGEXP_CONTAINS(lang, '[a-zA-Z]')
GROUP BY
  lang,
  is_weekday
HAVING
  numcommits > 100
ORDER BY
  numcommits DESC

Ignoring file extensions that do not correspond to programming languages, it appears that the most popular weekend programming languages are JavaScript, PHP and C.

In this lab, you wrote a complex query using several advanced SQL features.

Acknowledgment:

This lab (and query) is based on an article by Felipe Hoffa: https://medium.com/@hoffa/the-top-weekend-languages-according-to-githubs-code-6022ea2e33e8#.8oj2rp804

┬ęGoogle, Inc. or its affiliates. All rights reserved. Do not distribute.