In this lab, you learn how to use Dataflow and the Import/Export features in Cloud Spanner to import, query, and optimize a database.

What you learn

In this lab, you will:

  1. Create and delete a scratch database in the Cloud Spanner demo instance.
  2. Export BigQuery tables to Google Cloud Storage (GCS) in Avro format.
  3. Import the Avro data into Cloud Spanner using a Dataflow template.
  4. Query the imported data and optimize queries with secondary indexes and interleaved tables.
  5. Modify data using DML statements.

Prerequisites

You should be familiar with Google Cloud Storage (GCS), BigQuery and Dataflow in the Console UI.

Step 1

Request membership in the Google Group cloud-spanner-da-sme. This gives you IAM access to the DA SME demo instance.

Step 2

After your group membership is approved, make sure you can access the Cloud Spanner demo instance for this SME event at go/cloud-spanner-da-sme-demo-instance.

Step 3

Try creating an empty database. Click Create Database.

Name your database ${USER}_empty, namespaced by your own username (for example, biswa_empty), and click Continue.

Toggle Edit as text and type in the following:

CREATE TABLE temp_table (
        col1 INT64,
) PRIMARY KEY (col1)

Click Create.

You have now set up a database in the demo instance.
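As an aside, you can create the same database from a shell with gcloud. This is only a sketch: it assumes your gcloud project is already set to the demo project, and INSTANCE_ID is a placeholder for the demo instance's ID.

gcloud spanner databases create ${USER}_empty \
  --instance=INSTANCE_ID \
  --ddl='CREATE TABLE temp_table (col1 INT64) PRIMARY KEY (col1)'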

Step 4

Try out the Query Runner UI.

Click Query on the database page.

Enter this query:

SELECT * FROM temp_table;

Click Run Query.

You should see the output No Results.

Step 5

Delete your database.

Click on your database name in the left sidebar and click Delete Database.

Type your database name into the text field to confirm, and then click Delete.

Step 6

If you haven't already done so, you can pin Cloud Spanner to the top of the Navigation menu as follows.

Open the Navigation menu, hover over Spanner, and click the pin icon to its right.

Scenario

A retail store has a set of analytical data in BigQuery that has to be loaded into Cloud Spanner. Your job is to accomplish this using the Import/Export feature.

For this lab, we will create two different databases: one with a conventional relational schema, and a second with interleaved tables, a Spanner feature.

Step 1

Access the retail training dataset (sme_training) that has been uploaded to BigQuery. There should be three tables:

  1. Users
  2. Items
  3. Actions

Click on each table in the left sidebar to review its schema and column data types.

Step 2

Next you will export the data from the Retail dataset to Google Cloud Storage (GCS).

Create your own folder called ${USER}_export in the bucket gs://spanner-training/.

For each of the tables, use the BigQuery console to export the table to your GCS directory in Avro format (a command-line alternative is sketched below).
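If you prefer a shell, something along these lines should also work. It is only a sketch: it assumes your default bq project contains the sme_training dataset and that each table fits in a single export file (larger tables would need wildcard URIs and matching dataFiles patterns in the next step).

# Export each BigQuery table to GCS as Avro.
for TABLE in users items actions; do
  bq extract --destination_format=AVRO \
    "sme_training.${TABLE}" \
    "gs://spanner-training/${USER}_export/${TABLE}"
done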

Step 3

Create a file spanner-export.json on your local machine with the contents:

{
  "tables": [
   {
    "name": "items",
    "dataFiles": ["items"]
   },
   {
    "name": "users",
    "dataFiles": ["users"]
   },
   {
    "name": "actions",
    "dataFiles": ["actions"]
   }
  ]
}

This file contains the metadata needed to import the Avro files into Cloud Spanner: it maps each Avro file name to its destination table.

Upload the file to the GCS directory you created in Step 2 by clicking Upload Files in the GCS console.

You can also use

gsutil cp spanner-export.json gs://spanner-training/${USER}_export/spanner-export.json

to do the copy in a shell terminal. Remember to update your gcloud project to google.com:cloud-spanner-demo beforehand.
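For example, the following sets the active gcloud project for your shell session:

gcloud config set project google.com:cloud-spanner-demo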

Step 4

Create an empty tmp/ directory in your GCS directory for Dataflow temporary objects (the Create Folder button in the GCS console works for this). The resulting gs://spanner-training/${USER}_export/ directory should now contain the three Avro exports (users, items, actions), spanner-export.json, and the empty tmp/ folder.
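If you are working from a shell instead, note that GCS has no true directories; one workaround (a sketch only, the placeholder file name is arbitrary) is to upload an empty object under the tmp/ prefix so it shows up as a folder:

touch placeholder
gsutil cp placeholder gs://spanner-training/${USER}_export/tmp/placeholder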

Step 5

Create your own Cloud Spanner databases ${USER}_retail and ${USER}_retail_interleaved using the GCP console with the following schemas:

${USER}_retail

CREATE TABLE items (
  item_ID INT64 NOT NULL,
  size STRING(MAX) NOT NULL,
  color STRING(MAX) NOT NULL,
  type STRING(MAX) NOT NULL,
  price FLOAT64 NOT NULL,
) PRIMARY KEY(item_ID);

CREATE TABLE users (
  user_ID INT64 NOT NULL,
  first_name STRING(MAX) NOT NULL,
  last_name STRING(MAX) NOT NULL,
  country STRING(MAX) NOT NULL,
) PRIMARY KEY(user_ID);

CREATE TABLE actions (
  user_ID INT64 NOT NULL,
  timestamp TIMESTAMP NOT NULL,
  action STRING(MAX) NOT NULL,
  item_ID INT64 NOT NULL,
) PRIMARY KEY(user_ID, timestamp, action, item_ID);

${USER}_retail_interleaved

CREATE TABLE items (
  item_ID INT64 NOT NULL,
  size STRING(MAX) NOT NULL,
  color STRING(MAX) NOT NULL,
  type STRING(MAX) NOT NULL,
  price FLOAT64 NOT NULL,
) PRIMARY KEY(item_ID);

CREATE TABLE users (
  user_ID INT64 NOT NULL,
  first_name STRING(MAX) NOT NULL,
  last_name STRING(MAX) NOT NULL,
  country STRING(MAX) NOT NULL,
) PRIMARY KEY(user_ID);

CREATE TABLE actions (
  user_ID INT64 NOT NULL,
  timestamp TIMESTAMP NOT NULL,
  action STRING(MAX) NOT NULL,
  item_ID INT64 NOT NULL,
) PRIMARY KEY(user_ID, timestamp, action, item_ID),
  INTERLEAVE IN PARENT users ON DELETE NO ACTION;

Step 6

Run Cloud Dataflow jobs to import the Avro data from GCS into both of the databases with the Console UI.

Click Dataflow in the Navigation menu.

Click Create Job From Template.

Fill in the fields:

Job name: ${USER}-import

Cloud Dataflow template: GCS Avro to Cloud Spanner

Regional endpoint: Pick one closest to your current region

Cloud Spanner instance id: You can find this on the Spanner instance Console UI.

Cloud Spanner database id: ${USER}_retail

Cloud storage input directory: gs://spanner-training/${USER}_export

Temporary Location: gs://spanner-training/${USER}_export/tmp


Click Run job.

Repeat the above for your interleaved database, using a different job name. The jobs will take about 30 minutes to finish, so we will come back to them before the second lab. You can monitor their progress in the Dataflow UI.
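If the UI is busy, a roughly equivalent classic-template invocation from a shell looks like the following sketch. INSTANCE_ID and the region are placeholders you must fill in, and the template path may differ slightly depending on your region and gcloud version.

gcloud dataflow jobs run ${USER}-import \
  --gcs-location=gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Spanner \
  --region=us-central1 \
  --staging-location=gs://spanner-training/${USER}_export/tmp \
  --parameters=instanceId=INSTANCE_ID,databaseId=${USER}_retail,inputDir=gs://spanner-training/${USER}_export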

Scenario A

You have a database containing users, items, and all actions taken by the users on the item inventory.

Management wants to run a cron job to do daily dumps of the latest 100 BUY actions for further analysis. Each of the actions should contain the user ID, first name, last name, item ID, and price.

Use database ${USER}_retail for this.

Step 1

Use the Query Runner in the GCP Console UI to run your queries. You can start with the following query.

SELECT users.user_ID, users.first_name, users.last_name, items.item_ID, items.price, actions.timestamp 
FROM actions
JOIN users ON users.user_ID=actions.user_ID
JOIN items ON items.item_ID=actions.item_ID
WHERE actions.action='BUY'
ORDER BY timestamp DESC LIMIT 100;

Step 2

You can use the Explanation tab to check query times and use the query plan to optimize your queries.

Try to get the query elapsed time to less than 10 seconds before going to Step 3.

Step 3

Create an index on the actions table, if you haven't already done so.

Click on the actions table on the left sidebar under your own database.

Click Create Index to define an index optimized for your use case.

Try to get the query elapsed time to less than 100 milliseconds.
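If you get stuck, here is one possible index for this access pattern; treat it as a hint, not the only answer. Because user_ID, timestamp, and item_ID are part of the actions primary key, they are stored in the index automatically, so no STORING clause is needed.

CREATE INDEX actions_by_action_and_time ON actions(action, timestamp DESC)

-- If the query plan still shows a full table scan, hint the index explicitly:
-- ... FROM actions@{FORCE_INDEX=actions_by_action_and_time} ...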

Scenario B

You are now tasked with querying all actions by the user with user_ID value 1234. Use both of your databases for this.

Step 1

Run the following query on both databases:

SELECT * FROM actions WHERE user_ID=1234;

Step 2

Note the elapsed time for the query on both databases. Which one was faster? Why?

Use the documentation at https://cloud.google.com/spanner/docs/dml-tasks for DML SQL reference.

Scenario 1

Your manager just realized that the POS system has a bug that prevented a user with the last name Google from registering. You have been tasked with inserting the following information into the database:

User ID - 1337133713371337

First Name - Best

Last Name - Google

Country - Singapore
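One possible DML statement, assuming the users schema created earlier in this lab:

INSERT INTO users (user_ID, first_name, last_name, country)
VALUES (1337133713371337, 'Best', 'Google', 'Singapore');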

Scenario 2

Your manager realized the information you were given about the user is incorrect. The first name should have been Okay. Update the row you just added with the new information.
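A possible update, again assuming the schema above:

UPDATE users
SET first_name = 'Okay'
WHERE user_ID = 1337133713371337;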

Scenario 3

Thanos snapped his fingers and transported everyone with the last name Dana to another dimension. Don't worry, the new dimension is awesome.

Alas, we no longer require the dimensional travellers in our system. Delete all rows with the last name Dana from your users table.
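One possible delete, run against ${USER}_retail. Note that in ${USER}_retail_interleaved, the ON DELETE NO ACTION clause will block the delete for any user who still has child rows in actions.

DELETE FROM users
WHERE last_name = 'Dana';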

Cleanup

Delete your GCS directory.

Stop any ongoing Dataflow jobs you have created.

Delete the Spanner databases you created during this codelab. If you are in the Data SME course, you may leave everything as is; the demo instance will be deleted right after the event, erasing all participants' codelab databases.