In this lab you'll work with batch ingestion and do some query analysis. The interesting twist is that we're working with structured data, but we don't know the nature of the data in advance.
In this lab, you learn how to:
Create a new BigQuery dataset using the web UI and name it sme_import_codelab:
Create an items table by ingesting a CSV file from Cloud Storage.
CSV Source data: gs://retail-training-example/items.csv
Table name: items
Once the load job completes, the items table should be present. Click on it in the left nav and hit the Preview button to see a sample of the records:
Preview returns an arbitrary subset of rows, so you may see a different sample of data.
We have some suspicions about what these fields might contain. Let's probe those suspicions with some query analysis.
How many records are there? ____________________________
(Hint: click on the Details button or write a SQL query on the table meta data.)
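One possible approach, assuming the dataset name sme_import_codelab from the earlier step; the first query scans the table, while the second reads only table metadata:

```sql
-- Count all rows by scanning the table.
SELECT COUNT(*) AS row_count
FROM sme_import_codelab.items;

-- Or read the row count from the dataset's table metadata.
SELECT row_count
FROM sme_import_codelab.__TABLES__
WHERE table_id = 'items';
```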
Is int64_field_0 an identifier column? ________________
(Hint: Is there a unique ID for each row? What are the bounds on the ID? Write a SQL query to discover the answer.
Hint 2: Consider the DISTINCT keyword.)
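A sketch of one way to test the identifier hypothesis: if the distinct count equals the total row count, every value is unique, and the min/max show the bounds.

```sql
-- Compare total rows to distinct values and inspect the bounds.
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT int64_field_0) AS distinct_ids,
  MIN(int64_field_0) AS min_id,
  MAX(int64_field_0) AS max_id
FROM sme_import_codelab.items;
```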
What's the approximate cardinality for the three string columns? _________
(Hint: are there approximation functions you can use? See the Standard SQL function reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators)
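For example, APPROX_COUNT_DISTINCT gives a fast, approximate cardinality. This sketch assumes the auto-generated column names string_field_1 through string_field_3; substitute the names you see in your table's schema.

```sql
-- Approximate the number of distinct values in each string column.
SELECT
  APPROX_COUNT_DISTINCT(string_field_1) AS card_field_1,
  APPROX_COUNT_DISTINCT(string_field_2) AS card_field_2,
  APPROX_COUNT_DISTINCT(string_field_3) AS card_field_3
FROM sme_import_codelab.items;
```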
What are the min/avg/max values for double_field_4? _________________
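A minimal sketch using the standard aggregate functions:

```sql
-- Summary statistics for the numeric column.
SELECT
  MIN(double_field_4) AS min_val,
  AVG(double_field_4) AS avg_val,
  MAX(double_field_4) AS max_val
FROM sme_import_codelab.items;
```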
Click on the table in the left nav. You should see a helpful tip in the main pane recommending that you reload the data with new column names.
Use that link to re-import the file, naming the columns in order:
Run the load job again, and it should overwrite the "items" table with the same data using the new column names.
Load the data for a second table, which we'll call the "actions" table.
CSV Source File: gs://retail-training-example/actions.csv
Table name: actions
Schema (column names and types):
The source data came in with a timestamp, but it was represented as epoch seconds. We'd like it in the native TIMESTAMP format, so let's write a new table called actions_v2, using a query to convert the field's type and name. This is sometimes referred to as an ELT workflow (extract, load, transform). We'll also leverage BigQuery's DDL support to set up additional aspects of the new table, including partitioning and a partition expiration:
Run the following query:
#standardSQL
CREATE TABLE sme_import_codelab.actions_v2
PARTITION BY DATE(ts)
OPTIONS (
  partition_expiration_days = 365
) AS
SELECT
  TIMESTAMP_SECONDS(epoch_secs) AS ts,
  * EXCEPT (epoch_secs)
FROM sme_import_codelab.actions
Finally, load the data for a third table, the "users" table.
CSV Source File: gs://retail-training-example/users.csv
Table name: users
Schema (column names and types):