Load and query data with the bq command-line tool for BigQuery

BigQuery is Google's fully managed, NoOps, low-cost analytics database. With BigQuery, you can query terabytes of data without a database administrator or infrastructure. BigQuery uses familiar SQL and a pay-only-for-what-you-use charging model. BigQuery allows you to focus on analyzing data to find meaningful insights. In this codelab, you'll use the bq command-line tool to load a local CSV file into a new BigQuery table.

What you'll learn

  • How to use the bq command-line tool for BigQuery
  • How to load local data files into a BigQuery table

What you'll need

Enable BigQuery

If you don't already have a Google Account (Gmail or Google Apps), you must create one.

  • Sign-in to Google Cloud Platform console ( console.cloud.google.com) and navigate to BigQuery. You can also open the BigQuery web UI directly by entering the following URL in your browser.
https://console.cloud.google.com/bigquery
  • Accept the terms of service.
  • Before you can use BigQuery, you must create a project. Follow the prompts to create your new project.

Choose a project name and make note of the project ID. 5dHf3myqCTd3rm-fowZ_aU3An-T_NTgNnIZtQILio27us0xB3StjnSNnQraAnllEQCH4N2nMwLU1mnELwbNN85tbwNC_DbIdbxU8ufzJYW1MWpYu0hnbSrAajpAaRNs8UBeWFu68Aw

The project ID is a unique name across all Google Cloud projects. It will be referred to later in this codelab as PROJECT_ID.

This codelab uses BigQuery resources withing the BigQuery sandbox limits. A billing account is not required. If you later want to remove the sandbox limits, you can add a billing account by signing up for the Google Cloud Platform free trial.

Cloud Shell

You will use Cloud Shell, a command-line environment running in Google Cloud.

Activate Cloud Shell

  1. From the Cloud Console, click Activate Cloud Shell H7JlbhKGHITmsxhQIcLwoe5HXZMhDlYue4K-SPszMxUxDjIeWfOHBfxDHYpmLQTzUmQ7Xx8o6OJUlANnQF0iBuUyfp1RzVad_4nCa0Zz5LtwBlUZFXFCWFrmrWZLqg1MkZz2LdgUDQ.

zlNW0HehB_AFW1qZ4AyebSQUdWm95n7TbnOr7UVm3j9dFcg6oWApJRlC0jnU1Mvb-IQp-trP1Px8xKNwt6o3pP6fyih947sEhOFI4IRF0W7WZk6hFqZDUGXQQXrw21GuMm2ecHrbzQ

If you've never started Cloud Shell before, you'll be presented with an intermediate screen (below the fold) describing what it is. If that's the case, click Continue (and you won't ever see it again). Here's what that one-time screen looks like:

kEPbNAo_w5C_pi9QvhFwWwky1cX8hr_xEMGWySNIoMCdi-Djx9AQRqWn-__DmEpC7vKgUtl-feTcv-wBxJ8NwzzAp7mY65-fi2LJo4twUoewT1SUjd6Y3h81RG3rKIkqhoVlFR-G7w

It should only take a few moments to provision and connect to Cloud Shell.

pTv5mEKzWMWp5VBrg2eGcuRPv9dLInPToS-mohlrqDASyYGWnZ_SwE-MzOWHe76ZdCSmw0kgWogSJv27lrQE8pvA5OD6P1I47nz8vrAdK7yR1NseZKJvcxAZrPb8wRxoqyTpD-gbhA

This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with simply a browser or your Chromebook.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your project ID.

  1. Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`
gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If it is not, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

Create a dataset to contain your tables.

What is a dataset?

A BigQuery dataset is a collection of tables. All tables in a dataset are stored in the same data location. You can also attach custom access controls to limit access to a dataset and its tables.

Create a dataset

In Cloud Shell, use the bq mk command to create a dataset called "bq_load_codelab."

bq mk bq_load_codelab

View dataset properties

Verify that you created the dataset by viewing the dataset's properties with the bq show command.

bq show bq_load_codelab

You should see output similar to the following:

Dataset my-project:bq_load_codelab

   Last modified           ACLs          Labels
 ----------------- -------------------- --------
  15 Jun 14:12:49   Owners:
                      projectOwners,
                      your-email@example.com
                    Writers:
                      projectWriters
                    Readers:
                      projectReaders

BigQuery can load data from several data formats, including newline-delimited JSON, Avro, and CSV. For simplicity, you'll use CSV.

Create a CSV file

In Cloud Shell, create an empty CSV file.

touch customer_transactions.csv

Open the CSV file in code editor in Cloud Shell by running the cloudshell edit command, which will open a new browser window with a code editor and Cloud Shell panel.

cloudshell edit customer_transactions.csv

In the code editor, enter some comma-separated values to load into BigQuery.

ID,Zipcode,Timestamp,Amount,Feedback,SKU
c123,78757,2018-02-14 17:01:39Z,1.20,4.7,he4rt5
c456,10012,2018-03-14 15:09:26Z,53.60,3.1,ppiieee
c123,78741,2018-04-01 05:59:47Z,5.98,2.0,ch0c0

Save the CSV file by clicking File > Edit.

Use the bq load command to load your CSV file into a BigQuery table.

bq load \
    --source_format=CSV \
    --skip_leading_rows=1 \
    bq_load_codelab.customer_transactions \
    ./customer_transactions.csv \
    id:string,zip:string,ttime:timestamp,amount:numeric,fdbk:float,sku:string

You used the following options:

  • --source_format=CSV uses the CSV data format when parsing the data file.
  • --skip_leading_rows=1 skips the first line in the CSV file because it is a header row.
  • Bq_load_codelab.customer_transactions—the first positional argument—defines which table the data should be loaded into.
  • ./customer_transactions.csv—the second positional argument—defines which file to load. In addition to local files, the bq load command can load files from Cloud Storage with gs://my_bucket/path/to/file URIs.
  • A schema, which can be defined in a JSON schema file or as a comma-separated list. (You used a comma-separated list for simplicity.)

You used the following schema in the customer_transactions table:

  • Id:string: A customer identifier
  • Zip:string: A US postal zip code
  • Ttime:timestamp: The date and time that the transaction took place
  • Amount:numeric: The amount of a transaction (A numeric column stores data in decimal form, which is useful for monetary values.)
  • Fdbk:float: The rating from a feedback survey about the transaction
  • Sku:string: An identifier for the item purchased

Get the table details

Verify that the table loaded by showing the table properties.

bq show bq_load_codelab.customer_transactions

Output:

Table my-project:bq_load_codelab.customer_transactions

   Last modified          Schema          Total Rows   Total Bytes
 ----------------- --------------------- ------------ -------------
  15 Jun 15:13:55   |- id: string         3            159
                    |- zip: string
                    |- ttime: timestamp
                    |- amount: numeric
                    |- fdbk: float
                    |- sku: string

Now that your data is loaded, you can query it by using the BigQuery web UI, the bq command, or the API. Your queries can join your data against any dataset (or datasets, so long as they're in the same location) that you have permission to read.

Run a standard SQL query that joins your dataset with the US ZIP codes dataset and sums up transactions by US state. Use the bq query command to execute the query.

bq query --nouse_legacy_sql '
SELECT SUM(c.amount) AS amount_total, z.state_code AS state_code
FROM `bq_load_codelab.customer_transactions` c
JOIN `bigquery-public-data.utility_us.zipcode_area` z
ON c.zip = z.zipcode
GROUP BY state_code
'

That command should output something like this:

Waiting on bqjob_r26...05a15b38_1 ... (1s) Current status: DONE   
+--------------+------------+
| amount_total | state_code |
+--------------+------------+
|         53.6 | NY         |
|         7.18 | TX         |
+--------------+------------+

The query that you ran used a public dataset and your private dataset. Learn more by reading this commented version of the same query:

#standardSQL
SELECT
  /* Total of all transactions in the state. */
  SUM(c.amount) AS amount_total,

  /* State corresponding to the transaction's zipcode. */
  z.state_code AS state_code

/* Query the table you just constructed.
 * Note: If you omit the project from the table ID,
 *       the dataset is read from your project. */
FROM `bq_load_codelab.customer_transactions` c

/* Join the table to the zipcode public dataset. */
JOIN `bigquery-public-data.utility_us.zipcode_area` z

/* Find the state corresponding to the transaction's zipcode. */
ON c.zip = z.zipcode

/* Group over all transactions by state. */
GROUP BY state_code

Delete the dataset that you created with the bq rm command. Use the -r flag to remove any tables that it contains.

bq rm -r bq_load_codelab

You uploaded a table to BigQuery and queried it!

What you covered

  • Using the bq command-line tool to interact with BigQuery.
  • Joining your data and a public dataset with a BigQuery query.

What's next

Learn more about: