In this lab, you load a CSV file into a BigQuery table. After loading the data, you query it using the BigQuery web user interface, the CLI, and the BigQuery shell.

What you need

To complete this lab, you need:

Access to a supported Internet browser

A Google Cloud Platform project

What you learn

In this lab, you:

Load the contents of a CSV file into a BigQuery table

Query the table using the BigQuery web UI, the bq command-line interface, and the BigQuery shell

Delete the resources used in the lab

Using BigQuery involves interacting with a number of Google Cloud Platform resources, including projects, datasets, tables, and jobs. This brief introduction summarizes the role each of these resources plays when you interact with BigQuery.

Projects. A project contains information such as subscribed service APIs, authentication information, billing information, and Access Control Lists (ACLs) that determine access to datasets and jobs. Projects are created and managed using the APIs Console. For information about the related API type, see Projects.

Datasets. A dataset is a grouping mechanism that holds zero or more tables. A dataset is the lowest-level unit of access control; you cannot control access at the table level. Datasets are owned by projects, which control billing and serve as a global namespace root: all object names in BigQuery are relative to the project. Each dataset can be shared with individual users. Datasets are also referenced in SQL statements when interacting with BigQuery. For information about the related API type, see Datasets.

Tables. Row-column structures that contain the actual data. Tables belong to a dataset; you cannot control access at the table level, only at the dataset level. For information about the related API type, see Tables.
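
Because access is controlled per dataset, a quick way to review who can read or modify a dataset is to inspect its metadata with the bq command-line tool. The following is a minimal sketch; it assumes the Cloud SDK is installed and uses a hypothetical dataset named mydataset:

# Show dataset metadata, including its access control entries
bq show --format=prettyjson mydataset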

Before you can query your data, you must first load it into BigQuery or set up the data as a federated data source. You can bulk load the data by using a job, or stream records individually. Alternatively, you can skip the loading process by setting up a table as a federated data source.

Load jobs support three data sources:

  1. Objects in Google Cloud Storage
  2. Data sent with the job or streaming insert
  3. A Google Cloud Datastore backup

Loaded data can be added to a new table, appended to a table, or can overwrite a table. Data can be represented as a flat or nested/repeated schema.
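
For illustration, here is how a load job that reads a CSV object from Google Cloud Storage into a new table could be started from the bq command-line tool. This is a minimal sketch: the bucket path is hypothetical, and the inline schema matches the namedata table you create later in this lab.

# Start a load job: CSV from Cloud Storage into cp100.namedata
bq load --source_format=CSV \
  cp100.namedata \
  gs://example-bucket/names_2014.csv \
  name:string,gender:string,count:integer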

Jobs. Jobs are used to start all potentially long-running actions, such as queries, table imports, and export requests. Shorter actions, such as list or get requests, are not managed by a job resource. For information about the related API type, see Jobs. Each job has a job ID. Job IDs are especially useful when you load a large dataset: BigQuery rejects a load job that reuses an existing job ID, which guarantees that the same data is not loaded twice.
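
For example, supplying your own job ID makes a load idempotent. The sketch below reuses the hypothetical bucket path from above; if you submit the same command twice, the second submission is rejected because the job ID already exists.

# The global --job_id flag goes before the command name
bq --job_id=load_names_2014 load --source_format=CSV \
  cp100.namedata \
  gs://example-bucket/names_2014.csv \
  name:string,gender:string,count:integer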

You do not need to duplicate data across many clusters to achieve resource separation. Just use ACLs.

The following diagram illustrates the relationship between projects, datasets, tables, and jobs.

BigQuery supports the following data formats when loading data into tables: CSV, JSON, Avro, and Cloud Datastore backups. As described above, data can be loaded into BigQuery using a job or by streaming records individually.

In this lab, you load the contents of a CSV file (from Google Cloud Storage) into a BigQuery table using the BigQuery web UI. After loading the data, you query it using the web UI, the CLI, and the BigQuery shell.

The CSV file used in the lab contains the top 1,000 most popular baby names in the United States from 2014. You can download an archive of the data compiled for each year by the US government from the Social Security Online website.

To load the data in the CSV file into a BigQuery table:

Step 1

Open the Google Cloud Platform Console, and if necessary, select the cp100 project.

Step 2

Click Big Data > BigQuery.


Step 3

Click the blue arrow to the right of your project name and choose Create new dataset.

Step 4

In the Create Dataset dialog, for Dataset ID, type cp100 and then click OK.
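
If you prefer the command line, the same dataset can be created from Cloud Shell with bq (shown here only as a sketch; this lab follows the web UI steps):

bq mk cp100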

Step 5

When the dataset is created, to the right of cp100, click the add table icon (the blue + icon).

Step 6

In the Source Data section, for Location, select Google Cloud Storage and enter the path to the lab's CSV file. Leave the file format set to CSV.

Step 7

In the Destination Table section, for Table name, type namedata.

Step 8

In the Specify schema section, define the three columns in the CSV file (the names and types below are inferred from the queries used later in this lab):

Enter name for the first field's name and set its type to STRING.

Click Add field, and then enter gender with type STRING.

Click Add field, and then enter count with type INTEGER.

Step 9

In the Options section, leave the default settings and then click Create Table to start the load job.

Step 10

Once the load job is complete, click cp100 > namedata.

Step 11

On the Table Details page, click Details to view the table properties and then click Preview to view the table data.
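
You can perform the same check from the command line: bq head prints the first rows of a table without running a query. A sketch, assuming the load job above has completed:

# Preview the first 10 rows of the new table
bq head -n 10 cp100.namedata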


Dremel turns your SQL query into a dynamic execution tree consisting of a Master and a number of shards (groupings of compute resources).

The Master determines the number of shards needed to retrieve data and performs record assembly, ordering, and final filtering. The shards read data and perform aggregate and scalar functions against the data.
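
You can see a reflection of this execution tree in a query job's statistics: the queryPlan section of the job resource lists the execution stages. A sketch, where <job_id> is a placeholder for the ID of a completed query job:

# Inspect a completed query job, including its query plan
bq show --format=prettyjson -j <job_id>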

BigQuery relies on Colossus, Google's latest generation distributed file system. Each Google datacenter has its own Colossus cluster, and each Colossus cluster has enough disks to give every BigQuery user thousands of dedicated disks at a time. Colossus also handles replication, recovery (when disks crash) and distributed management (so there is no single point of failure). Colossus is fast enough to allow BigQuery to provide similar performance to many in-memory databases, but leveraging much cheaper yet highly parallelized, scalable, durable and performant infrastructure.

BigQuery leverages a columnar storage format and compression algorithm to store data in Colossus in the most optimal way for reading large amounts of structured data. Colossus allows BigQuery users to scale to dozens of Petabytes in storage seamlessly, without paying the penalty of attaching much more expensive compute resources — typical with most traditional databases.

To give you thousands of CPU cores dedicated to processing your task, BigQuery takes advantage of Borg, Google's large-scale cluster management system. Borg clusters run on tens of thousands of machines and hundreds of thousands of cores.

Besides the obvious need for resource coordination and compute resources, Big Data workloads are often throttled by networking throughput. Google's Jupiter network can deliver 1 Petabit/sec of total bisection bandwidth, allowing large workloads to be distributed quickly and efficiently. Jupiter networking infrastructure might be the single biggest differentiator in Google Cloud Platform: it provides enough bandwidth for 100,000 machines to communicate with any other machine at 10 Gbps.

To help you get started with querying, BigQuery includes a number of public datasets.
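
For example, you can query one of the public sample tables immediately, without loading any data yourself. A sketch using the shakespeare sample table and the legacy SQL syntax that this lab uses elsewhere:

bq query \
"SELECT word, word_count FROM [bigquery-public-data:samples.shakespeare] ORDER BY word_count DESC LIMIT 5"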

In this section of the lab you query the namedata table you created previously using the BigQuery web UI, the CLI, and the BigQuery shell.

To query your data:

Step 1

In the BigQuery web UI, click Compose Query.

Step 2

In the New Query window, type the following query to display the 5 most popular girls' names in the United States in 2014.

SELECT
  name,
  COUNT
FROM
  cp100.namedata
WHERE
  gender = 'F'
ORDER BY
  COUNT DESC
LIMIT
  5

Notice the Query Validator tells you the query syntax is valid (indicated by the green check mark) and indicates how much data the query will process. The amount of data processed allows you to determine the price of the query using the Cloud Platform Pricing Calculator.
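
You can get the same estimate from the command line without running the query by passing bq's --dry_run flag (a sketch; the query matches the one above):

bq query --dry_run \
"SELECT name, count FROM cp100.namedata WHERE gender = 'F' ORDER BY count DESC LIMIT 5"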

Step 3

Click Run Query and examine the results.

Step 4

Open the Google Cloud Platform Console.

Step 5

In the top right corner of the console window, click the Activate Google Cloud Shell button.

Step 6

Type the following query using the bq CLI. This query displays the 5 least popular boys' names (out of the top 1,000 included in the dataset) in 2014.

bq query \
"SELECT name,count FROM cp100.namedata WHERE gender = 'M' ORDER BY count ASC LIMIT 5"

Step 7

Type the following command to launch the BigQuery shell.

bq shell

Step 8

Type the following command to display the 5 most popular boys' names in 2014.

SELECT name,count FROM cp100.namedata WHERE gender = 'M' ORDER BY count DESC LIMIT 5

Notice that when you use the BigQuery shell, you do not need to preface the query with bq query and you do not need to place the query in quotes.

Step 9

Type exit to close the BigQuery shell.

Step 10

Type exit to close the Cloud Shell window.

Step 11

Leave the BigQuery web UI and the Cloud Platform Console open.

To delete the resources used in the lab (and the remaining resources used in the course):

Step 1

Switch to the BigQuery web UI.

Step 2

To the right of the cp100 dataset, click the blue drop-down arrow icon and choose Delete dataset.

Step 3

In the 'Delete Dataset' dialog, type the dataset ID (cp100) in the field and then click OK.
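
The same cleanup can be done from Cloud Shell with the bq command-line tool; the -r flag removes the dataset's tables along with the dataset, and -f skips the confirmation prompt:

bq rm -r -f cp100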

Step 4

Close the BigQuery web UI.

Step 5

(Optional) Follow the remaining steps to delete the project you created for this course.

In the top bar of the Google Cloud Platform Console, the name of the current project, cp100, is displayed. Click cp100, and then click the Manage resources icon (just to the right of the Search projects and folders bar).

Step 6

Select the cp100 project from the list, then click Delete.

Step 7

In the 'Shut down project' dialog, type the project ID and then click Shut down.

Step 8

(Optional) Follow the remaining steps if you wish to deactivate your billing account.

In the Google Cloud Platform Console's left-side menu, click Billing.

Step 9

If you have more than one billing account, click the billing account linked to your project.

Step 10

On the Overview page, at the top of the window, click Close Billing Account.

Step 11

Click Close billing account to confirm the action.

Step 12

Close the Cloud Platform Console.

Step 13

(Optional) You may also delete the zip file containing the Bookshelf cover images from your computer.

©Google, Inc. or its affiliates. All rights reserved. Do not distribute.