BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage and don't need a database administrator. BigQuery uses familiar SQL and it can take advantage of pay-as-you-go model. BigQuery allows you to focus on analyzing data to find meaningful insights.
In this lab, we will explore the Wikipedia dataset using BigQuery.
If you don't already have a Google Account (Gmail or G Suite), you must create one. Sign-in to Google Cloud Platform console (console.cloud.google.com) and create a new project from the Manage resources page :
Remember the project ID, a unique name across all Google Cloud projects (the name above has already been taken and will not work for you, sorry!). It will be referred to later in this codelab as
Next, you'll need to enable billing in the Developers Console in order to use Google Cloud resources.
Running through this codelab shouldn't cost you more than a few dollars, but it could be more if you decide to use more resources or if you leave them running (see "cleanup" section at the end of this document). The Google Cloud Platform pricing calculator is available here.
New users of Google Cloud Platform are eligible for a $300 free trial.
If you see a "request account button" at the top of the main Codelabs window, click it to obtain a temporary account. Otherwise ask one of the staff for a coupon with username/password.
These temporary accounts have existing projects that are set up with billing so that there are no costs associated for you with running this codelab.
Note that all these accounts will be disabled soon after the codelab is over.
Use these credentials to log into the machine or to open a new Google Cloud Console window https://console.cloud.google.com/. Accept the new account Terms of Service and any updates to Terms of Service.
Here's what you should see once logged in:
When presented with this console landing page, please select the only project available. Alternatively, from the console home page, click on "Select a Project" :
In the Google Developer Console, click the Menu icon on the top left of the screen.
Scroll down the menu to the bottom, and click BigQuery:
This will open up the BigQuery console in a new browser tab, that looks like this:
But, there is nothing in here! Luckily, there are tons of Open Datasets available in BigQuery for you to query. We'll query the Wikipedia dataset in the next section.
We need a dataset to query with. We'll talk about loading data into BigQuery in another lab. In this section, we'll query the Wikimedia pageviews dataset that's part of many Public Datasets available in BigQuery today, including Wikimedia, Hacker News, GitHub, GDELT (News events), and many more.
For this lab, we'll use the Wikimedia public data set. To add the data set, visit this URL:
bigquery-samples:wikimedia_pageviews, this is the dataset we will use.
Scroll down and find and click the table 201112:
You can see the table schema in the Schema view on the right:
You can find out how much data is in the table, by navigating to the Details view on the right:
Alright! Let's write a query to see what's the most popular Wikipedia page in December, 2011 using BigQuery.
Click Compose Query on the top left:
This will bring up the New Query view:
Let's find out the total number of Wikimedia views in December, 2011, by writing this query:
SELECT SUM(views) FROM `bigquery-samples.wikimedia_pageviews.201112`
Before we run the query, for the purpose of this lab, let's disable data caching so that we are not using any cached results. Click Show Options:
Then, uncheck Use Cached Results:
Also, uncheck Use legacy SQL:
Click Run Query:
In a few seconds, the result will be listed in the bottom, and it'll also tell you how much data was proccessed:
In seconds, we queried over a 105 GB table, but we only needed to process 12.2GB of data to get to the result! This is because BigQuery is a columnar database. Because we only queried a single column, the total amount of data processed is significantly less than the total table size.
The Wikimedia dataset contains page views for all of the Wikimedia projects (including Wikipedia, Wikitionary, Wikibooks, Wikiquotes, etc). Let's narrow down the query to just Wikipedia pages by adding a
SELECT SUM(views), wikimedia_project FROM `bigquery-samples.wikimedia_pageviews.201112` WHERE wikimedia_project = "wp" GROUP BY wikimedia_project
Notice that, by querying an additional column, wikimedia_project, the amount of data processed increased from 12.2 GB to 18 GB.
BigQuery supports many of the familiar SQL clauses, such as
order by, and a number of aggregation functions. In addition, you can also use regular expressions to query text fields! Let's try one:
SELECT title, SUM(views) views FROM `bigquery-samples.wikimedia_pageviews.201112` WHERE wikimedia_project = "wp" AND REGEXP_CONTAINS(title, 'Red.*t') GROUP BY title ORDER BY views DESC
You can select a range of tables to form the union using a wildcard table. Let's query over the entire year of 2011 by querying tables with "2011" as a prefix:
This query will query over a total dataset size of 1 TB, but process only 672 GB of data.
SELECT title, SUM(views) views FROM `bigquery-samples.wikimedia_pageviews.2011*` WHERE wikimedia_project = "wp" AND REGEXP_CONTAINS(title, 'Red.*t') GROUP BY title ORDER BY views DESC
That wasn't too hard to query that much data!
You can filter the tables more selectively with the _TABLE_SUFFIX pseudo column. This query will limit to tables corresponding with the last two months of the year.
SELECT title, SUM(views) views FROM `bigquery-samples.wikimedia_pageviews.2011*` WHERE (_TABLE_SUFFIX = '11' OR _TABLE_SUFFIX = '12') AND wikimedia_project = "wp" AND REGEXP_CONTAINS(title, 'Red.*t') GROUP BY title ORDER BY views DESC
Because the query limited the tables scanned with the _TABLE_SUFFIX pseudo column, it only processed 113 GB of data.
BigQuery is that simple! With basic SQL knowledge, you are now able to query terabytes and terabytes of data!