You learn how to work with batch ingestion and do some query analysis. The interesting twist is that we're working with structured data, but we don't know the nature of the data up front.

What you learn

In this lab, you learn how to:

- Create a BigQuery dataset and batch-ingest CSV data from Cloud Storage
- Use SQL queries to explore tables whose contents you don't know in advance
- Reload data with named columns and transform it into a new partitioned table using an ELT workflow

Step 1: Create dataset

Create a new BigQuery dataset using the web UI and name it sme_import_codelab:
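If you prefer to script this step instead of using the web UI, the dataset can also be created with a DDL statement; a minimal sketch (the IF NOT EXISTS clause is just a convenience, not part of the lab instructions):

#standardSQL
-- Create the dataset used throughout this lab (no-op if it already exists)
CREATE SCHEMA IF NOT EXISTS sme_import_codelab;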

Step 2: Load the items data

Create an items table by ingesting a CSV file from Cloud Storage, using the settings below (a scripted alternative is sketched after them).

CSV Source data: gs://retail-training-example/items.csv

Table name: items

Schema: autodetect
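If you'd like to script the load rather than use the web UI, a sketch using the LOAD DATA statement follows; it assumes the CSV has no header row (consistent with the auto-generated field names you'll see) and relies on schema auto-detection since the table doesn't exist yet:

#standardSQL
-- Batch-ingest the CSV from Cloud Storage; the schema is auto-detected
LOAD DATA INTO sme_import_codelab.items
FROM FILES (
  format = 'CSV',
  uris = ['gs://retail-training-example/items.csv']
);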

Step 3: Preview the Data

Once the load job completes, we should have our items table present. Click on it in the nav, and hit the "preview" button to get a sample of the records:

It's entirely possible that you'll see a different sample of data.

We have some suspicions about what these fields might contain. Let's probe those suspicions with some query analysis.

Step 1: Total records

How many records are there? ____________________________

(Hint: click on the Details button or write a SQL query against the table metadata.)
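If you go the query route, a minimal sketch with two options, an exact count and a metadata lookup via the __TABLES__ meta-table:

#standardSQL
-- Exact count (scans the table)
SELECT COUNT(*) AS record_count
FROM sme_import_codelab.items;

-- Row count from table metadata (no table scan)
SELECT table_id, row_count
FROM `sme_import_codelab.__TABLES__`
WHERE table_id = 'items';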

Step 2: Uniqueness

Is int64_field_0 an identifier column? ________________

(Hint: Is there a unique ID for each row? What are the bounds on the ID? Write a SQL query to discover the answer.

Hint 2: Consider the DISTINCT keyword.)
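One way to probe this, as a sketch: compare the total row count with the distinct count of the field, and check its bounds:

#standardSQL
-- If distinct_ids equals total_rows, the column could serve as a row identifier
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT int64_field_0) AS distinct_ids,
  MIN(int64_field_0) AS min_id,
  MAX(int64_field_0) AS max_id
FROM sme_import_codelab.items;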

Step 3: Cardinality

What's the approximate cardinality for the three string columns? _________

(Hint: are there approximation functions that you can use? Standard SQL function reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators)
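A sketch using BigQuery's approximate aggregation functions; the string_field_* names are assumptions based on the auto-detected schema, so adjust them to match your table:

#standardSQL
-- Approximate distinct counts are cheaper than exact counts on large tables
SELECT
  APPROX_COUNT_DISTINCT(string_field_1) AS approx_distinct_1,
  APPROX_COUNT_DISTINCT(string_field_2) AS approx_distinct_2,
  APPROX_COUNT_DISTINCT(string_field_3) AS approx_distinct_3
FROM sme_import_codelab.items;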

What are the min/avg/max values for double_field_4? _________________
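A minimal sketch for the numeric column:

#standardSQL
SELECT
  MIN(double_field_4) AS min_val,
  AVG(double_field_4) AS avg_val,
  MAX(double_field_4) AS max_val
FROM sme_import_codelab.items;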

Step 1: Find tip to reimport

Click on the table in the left nav. You should see a helpful tip in the main pane recommending that you reload the data with new column names.

Step 2: Reimport

Use that link to reimport the file, naming the columns in this order:

Step 3: Reload

Run the load job again, and it should overwrite the "items" table with the same data using the new column names.
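If you'd rather script the reload, a sketch along these lines should work; the column names here (item_id and so on) are purely hypothetical stand-ins for the names given in the lab, while the types follow what auto-detection reported:

#standardSQL
-- Overwrite the items table, this time with explicit column names
-- (these names are hypothetical; substitute the ones from the lab)
LOAD DATA OVERWRITE sme_import_codelab.items
  (item_id INT64, name STRING, category STRING, description STRING, price FLOAT64)
FROM FILES (
  format = 'CSV',
  uris = ['gs://retail-training-example/items.csv']
);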

Step 1: Load actions

Load the data for a second table, which we'll call the "actions" table.

CSV Source File: gs://retail-training-example/actions.csv

Table name: actions

Schema, name and types:
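The original lab lists the exact column names and types here; as a sketch, assuming an epoch_secs INT64 column (which the transform query below relies on) plus hypothetical user_id and item_id columns, the load could be scripted as:

#standardSQL
-- Load the actions CSV with an explicit schema; only epoch_secs is known
-- from the transform query below, the other column names are hypothetical
LOAD DATA INTO sme_import_codelab.actions
  (user_id INT64, item_id INT64, epoch_secs INT64)
FROM FILES (
  format = 'CSV',
  uris = ['gs://retail-training-example/actions.csv']
);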

The source data came in with a timestamp, but it was represented as epoch seconds. What we'd like is for that to be a native TIMESTAMP, so let's write a new table called actions_v2 using a query to convert the field's type and name. This is sometimes referred to as an ELT workflow (extract, load, transform). We're also going to leverage DDL support in BigQuery to set up additional aspects of the new table, including partitioning and partition expiration:

Step 2: Transform

Run the following query:

#standardSQL
-- Create a new table partitioned by the date of the converted timestamp,
-- with partitions expiring after one year
CREATE TABLE sme_import_codelab.actions_v2
PARTITION BY DATE(ts)
OPTIONS(
  partition_expiration_days=365
)
AS
SELECT
  -- Convert the epoch-seconds column to a native TIMESTAMP named ts
  TIMESTAMP_SECONDS(epoch_secs) AS ts,
  * EXCEPT(epoch_secs)
FROM sme_import_codelab.actions

Step 3: Load users

Load the data for a third table, the "users" table.

CSV Source File: gs://retail-training-example/users.csv

Table name: users

Schema, name and types: