Image Data Classification with BigQuery ML

1. Introduction

In this codelab, we will discuss the use case of storing and analyzing images of Yoga Poses in BigQuery and implementing a classification model with BigQuery ML to label the poses using only SQL constructs and no other form of code.

BigQuery and BQML

BigQuery is a serverless, multi-cloud data warehouse that can scale from bytes to petabytes with zero operational overhead. This makes it a great choice for storing ML training data. Besides, the built-in BigQuery Machine Learning ( BQML) and analytics capabilities allow you to create no-code predictions using just SQL queries. And you can access data from external sources with federated queries, eliminating the need for complicated ETL pipelines. You can read more about everything BigQuery has to offer in the BigQuery page.

So far, we know BigQuery as this fully managed cloud data warehouse that helps users analyze structured and semi-structured data. But,

  • BigQuery has expanded to perform all analytics and ML on unstructured data as well
  • We can use SQL queries to perform insightful analysis, analytics and ML on images, videos, audio etc. at scale without having to write additional code
  • We have the ability to combine structured and unstructured data as if they all existed together in a table

We will discuss these in our Yoga Pose Classification use case covered in the next section.

Image Data Classification with BigQuery ML

Having the ability to process and analyze images using structured queries as if they were structured data is first of its kind. Now we can even predict results using machine learning classification models using BigQuery ML. I have narrowed down the stages involved into 5 steps for easy understanding:

fe97945bce996e1.jpeg

The above steps could be complicated if we are just looking at them as labels. The details of each of the components involved like BigQuery Dataset, BigLake connection, Cloud Storage Buckets (Containers), Object Table (External data source), BQML etc. are all defined in the implementation section. So don't be discouraged if you are not familiar with these terms yet.

What you'll build

You'll create an Image Data Classification model with BQML covering the below:

  • A BigQuery Dataset to contain the table and model components
  • Google Cloud Storage (GCS) Bucket to store Yoga Images for the model
  • An External Table for accessing Cloud Storage Images
  • A BigLake connection for the external table to access the images in the GCS
  • ResNet Model in BigQuery ML
  • Inference using the created model
  • BigQuery SQL for analyzing image data
  • BigQuery SQL to query the structured and unstructured data together

What you'll learn

  • How to create a Cloud Storage Bucket and store images
  • How to create BigQuery dataset, table and connection
  • How to create an image data classification model using BQML
  • How to predict with the created model using BigQuery ML
  • How to query images and combine with structured data using BigQuery SQLs

2. Requirements

  • A browser, such as Chrome or Firefox
  • A Google Cloud project with billing enabled that contains your BigQuery, Cloud Storage and BigLake Connection services
  • The next section has the list of steps to create the image data classification application

3. Create Dataset and a BigLake connection

For our use case of image detection of 5 Yoga poses, I have used publicly available dataset and you can access the dataset from this repo. The Yoga poses we are identifying are limited to Downdog, Goddess, Plank, Tree and Warrior2. Before you begin with the BigQuery Dataset creation, make sure to select or create a Google Cloud Project and check if billing is enabled on the project. Enable BigQuery API and BigQuery Connection API. Please note all the services used in this implementation should be in the same region of choice.

a. Create the dataset "yoga_set" using the steps shown below:

Go to BigQuery Editor and type the command:

CREATE SCHEMA `<<project_id>>.yoga_set`;

b. BigLake Connection allows us to connect the external data source while retaining fine-grained BigQuery access control and security, which in our case is the Cloud Storage for the image data. We will use this connection to read objects from Cloud Storage. Follow steps below to create the BigLake Connection.

Click ADD DATA on the Explorer pane of the BigQuery page:

4cb42b1245bb0ba6.pngBigQuery "Add External Data" screen

Click Connections to external data sources and select BigLake and Remote functions option:

9ffec2b2bfcc3cd5.pngConfigure External Data Source Connection

Provide Connection Id and create the connection. Remember to make a note of the Service Account id that will be displayed on the screen once connection is created <<SERVICE_ACCOUNT>>. In our example, the connection id is "yoga-pose-conn". Remember to note the region.

4. Create Google Cloud Storage Bucket and Grant Permissions

We are going to use Google Cloud Storage bucket to contain the image files of Yoga poses that we want to create the model on. Buckets are Cloud Storage containers to contain the images we are going to analyze.

a. Go to Google Cloud Storage by searching for it in console and then click Buckets to land on the Buckets home page and click CREATE

a6f6b26cffb53ae0.pngGoogle Cloud Storage Buckets Page

b. On the Create a bucket page, enter your bucket information (a unique name) and continue, make sure it is in the same region as the dataset and the connection discussed in above steps and click create

1280366a42b7bdf6.pngGoogle Cloud Storage Create a Bucket Page

Before going to the next step, make sure you have noted your service account, bucket name and path.

c. Once the bucket is created, store your images (through console or Cloud Shell commands or programmatically) and grant necessary permissions for the connection's service account (that we saved earlier) to access the images

> export sa=<<"SERVICE_ACCOUNT">>
> gsutil iam ch serviceAccount:$sa:objectViewer "gs://<<bucket>>"

5. Create an Object Table

Create an external object table from BigQuery to access the unstructured data in the bucket using the connection we created. Run the below CREATE SQL from BigQuery editor:

CREATE OR REPLACE EXTERNAL TABLE `<<dataset>>.<<table_name>>`
WITH CONNECTION `us.<<connection-name>>`
OPTIONS(
object_metadata="SIMPLE", uris=["gs://<<bucket>>/<<folder_if_exists>>/*.jpg"]);

External Table is created as shown below:

bda48f566e0c292f.png

Let's quickly query a pose from the newly created external table:

SELECT data , uri
FROM `yoga_set.yoga_poses`
WHERE REGEXP_CONTAINS(uri, 'gs://yoga_images/Downdog')
Limit 1;

As you can see in the screenshot below, you can create and operate on unstructured images as if they are structured data:

7d1784122b5013f.png

Now let's export the query result from above into a small Python snippet to visualize the result:

Click SAVE RESULTS and select the "CSV Localfile" option to export the result. Then open your Colab Notebook (or create one) and type the code below

from IPython.display import display
from PIL import Image
import io
import pandas as pd
import base64
df = pd.read_csv('/content/sample_data/<<your_csv>>')
imgdata = base64.b64decode(str(df.data[0]))
image = Image.open(io.BytesIO(imgdata))
display(image)

Execute to see the result as below:

b8edd68cb281786a.png

Now that we have created the external table and accessed images from Cloud Storage only using SQL queries, let us move on to the next section that is to create the Classification Model.

6. Create the Model and Upload it to Google Cloud Storage

For this implementation, we are going to use the pre-trained ResNet 50 Model to run inference on the object table we just created. The ResNet 50 model analyzes image files and outputs a batch of vectors representing the likelihood that an image belongs to the corresponding class (logits).

Before moving on to this step, make sure you have all the necessary permissions in place. Then follow the below steps:

  1. Download the model from this location and save it in your local
  2. It should unpackage into saved_model.pb and a variables folder
  3. Upload these two (the file and the folder) into the bucket we created in previous section

2629ff3eda214946.pngGoogle Cloud Storage Bucket "yoga_images" with the ResNet Model files uploaded

Once this step is completed, your model related files should be present in the same bucket as your images as seen in the image above.

7. Load the Model into BQML and Infer

In this step, we are going to load the model into the same BigQuery Dataset as the external table we created earlier and apply it for the images we have stored in the Cloud Storage.

a. From BigQuery Editor, run the following SQL statement

CREATE MODEL `<<Dataset>>.<<Model_Name>>`
OPTIONS(
model_type = 'TENSORFLOW',
model_path = 'gs://<<Bucket>>/*');

Once the execution is completed (which could take a while depending on your dataset), you would see the model listed in your Dataset section in BigQuery.

435fa0919aeb57a6.pngBigQuery Dataset listing the created model

b. Inspect the model to see its input and output fields.

Expand the dataset and click on the model we just created "yoga_poses_resnet". Click the Schema tab:

e88928764f10f6ff.pngBigQuery Model Definition Schema Tab

In the Labels section, you see the "activation_49" field that represents the output field. In the Features section, you can see "input_1" that represents the field that is expected to be input to the model. You will reference "input_1" in your inference query (or prediction query) as the field you are passing in for your "test" data.

c. Infer your Yoga Pose!

Let's use the model we just created to classify our test image data. Make sure you have some test images (Yoga poses) identified from your Cloud Storage Bucket that made it into the External Table when we created it. We are going to selectively query for those test images in BigQuery to perform the inference using the BQML model we just created. Use the below query to trigger the test.

SELECT *
FROM ML.PREDICT(
MODEL yoga_set.yoga_poses_resnet,
(SELECT uri, ML.DECODE_IMAGE(data) AS input_1
FROM yoga_set.yoga_poses where REGEXP_CONTAINS(uri,
'gs://yoga_images/Downdog/00000097.jpg')));

In the above query, we select one test image that is identified to contain a specific URI value (00000097.jpg) in the external table. Also, the SELECT part uses the ML.DECODE_IMAGE construct as field "input_1" in order for the ML.PREDICT function to work.

Once execution is completed, you will see the result as shown below:

867018993845e943.png

Now for those who know the ResNet model in depth, this should help understand the classification. Otherwise, let's code a small snippet to understand the classification visually.

d. Flattening the result

One way of visualizing the above output is to flatten the activation_49 field values using BigQuery SQL's UNNEST construct. Please refer to the query below for flattening the result from the earlier step. If you want to further textually label the resulting class, you can introduce the logic in place of the placeholder <<LABEL_LOGIC>> in the query (uncomment when using).

with predictions as (
SELECT
Uri, data, SPLIT(uri, "/")[OFFSET(ARRAY_LENGTH(SPLIT(uri, "/")) - 1)] as img,  
i as label_i,
<<LABEL_LOGIC>> label,
Score
FROM ML.PREDICT(
MODEL yoga_set.yoga_poses_resnet,
(SELECT data, uri, ML.DECODE_IMAGE(data) AS input_1
FROM yoga_set.yoga_poses 
WHERE
REGEXP_CONTAINS(uri,'gs://yoga_images/Goddess/00000007.jpg'))),
UNNEST(activation_49) as score WITH OFFSET i)
SELECT * FROM predictions 
ORDER BY score DESC
LIMIT  5;

Without the class labeling logic, below is the output to the query:

71f580f41f0811f3.png

However in my case, I have applied a sample logic and below is the result:

1c6df6ecd14fba1.png

You can read further about the model and apply the logic that works best with your data and the model output.

e. Visualizing the inference

Finally, a quick Python snippet to visualize the result from the classification! Export the above query result to a CSV file and reference it in the Python code.

68756e7e4b8d7a29.png

The above image output refers to the Yoga Pose "Downward Dog" which is exactly the same test input we passed into the ML.PREDICT query for classification using BQML!

8. Unifying Structured and Unstructured Data

Lastly, my favorite part of this implementation is to unify the fields from my structured relational table with this unstructured image data. I created a structured BigQuery table in the same dataset as the external table to hold the pose and its health related data.

125bdf848c86fbe.pngBigQuery Structured Table "yoga_health" Schema

The image above represents the schema of the structured data table named "yoga_health" and the fields are pose, focus, health_benefit and breath. The query below joins both Structured and Unstructured data:

SELECT SPLIT(uri, "/")[OFFSET(ARRAY_LENGTH(SPLIT(uri, "/")) - 2)] as pose,
a.health_benefit, breath, focus, data
FROM `abis-345004.yoga_set.yoga_health` a, yoga_set.yoga_poses b
WHERE a.pose = SPLIT(uri, "/")[OFFSET(ARRAY_LENGTH(SPLIT(uri, "/")) - 2)];

Below is the result:

469bdfcffa9e19fd.png

Note: All of the queries we have covered in this blog can be run directly from your Python Notebook using the BigQuery Magic commands.

9. Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this post, follow these steps.

  1. In the Google Cloud console, go to the Manage resources page
  2. In the project list, select the project that you want to delete, and then click Delete
  3. In the dialog, type the project ID, and then click Shut down to delete the project

10. Congratulations

Congratulations! You have successfully stored, queried unstructured data in BigQuery, created a Classification Model using BQML and predicted test yoga poses with the model. If you like to implement this, get started with your Google Cloud project. Also, if you like to learn more about databases or other end to end application implementations in Google Cloud, please head to my blogs.