Automated Classification of Data Uploaded to Cloud Storage with the DLP API and Cloud Functions

1. Overview

In a modern organization, an ever-increasing amount of data comes in from a variety of sources. This often requires quarantining and classifying that data in order to store and protect it strategically, a task that rapidly becomes costly and impractical if it remains manual.

In this codelab, we'll see how we can automatically classify data uploaded to Cloud Storage and move it to a corresponding storage bucket. We'll accomplish this using Cloud Pub/Sub, Cloud Functions, Cloud Data Loss Prevention, and Cloud Storage.

What you'll do

Create Cloud Storage buckets to be used as part of the quarantine and classification pipeline.
Create a simple Cloud Function that invokes the DLP API when files are uploaded.
Create a Pub/Sub topic and subscription to notify you when file processing is completed.
Upload sample files to the quarantine bucket to invoke a Cloud Function.
Use the DLP API to inspect and classify the files and move them to the appropriate bucket.

What you'll need

A Google Cloud project with billing set up. If you don't have one, you'll need to create one.

2. Getting set up

Throughout this codelab, we'll provision and manage different cloud resources and services using the command line via Cloud Shell. The following will open Cloud Shell along with Cloud Shell Editor and clone the companion project repository:

Make sure you're using the correct project by setting it with gcloud config set project [PROJECT_ID]

Enable APIs

Enable the required APIs on your Google Cloud project:

Cloud Functions API - Manages lightweight user-provided functions executed in response to events.
Cloud Data Loss Prevention (DLP) API - Provides methods for detection, risk analysis, and de-identification of privacy-sensitive fragments in text, images, and Google Cloud Platform storage repositories.
Cloud Storage - Google Cloud Storage is a RESTful service for storing and accessing your data on Google's infrastructure.

Service Account Permissions

A service account is a special type of account that is used by applications and virtual machines to make authorized API calls.

App Engine Default Service Account

The App Engine default service account is used to execute tasks in your Cloud project on behalf of your apps running in App Engine. This service account exists in your project by default with the Editor role assigned.

First, we'll grant our service account the DLP Administrator role that is required to administer data loss prevention jobs:

gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
--member serviceAccount:$GOOGLE_CLOUD_PROJECT@appspot.gserviceaccount.com \
--role roles/dlp.admin

Then grant the DLP API Service Agent role, which gives the service account permissions for BigQuery, Cloud Storage, Datastore, Pub/Sub, and Key Management Service:

gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
--member serviceAccount:$GOOGLE_CLOUD_PROJECT@appspot.gserviceaccount.com \
--role roles/dlp.serviceAgent

DLP Service Account

In addition to the App Engine service account, we'll also use a DLP service account. This service account was automatically created when the DLP API was enabled and is initially granted no roles. Let's grant it the Viewer role:

gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
--member serviceAccount:service-`gcloud projects list --filter="PROJECT_ID:$GOOGLE_CLOUD_PROJECT" --format="value(PROJECT_NUMBER)"`@dlp-api.iam.gserviceaccount.com \
--role roles/viewer
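
The backtick expression in the command above looks up the project number and splices it into the DLP service account's address. As a minimal sketch of that naming pattern in Python (the helper name and the sample project number are hypothetical and only for illustration):

```python
def dlp_service_agent_member(project_number: str) -> str:
    """Build the IAM member string for the DLP service account.

    Follows the address format used in the gcloud command above:
    service-<PROJECT_NUMBER>@dlp-api.iam.gserviceaccount.com
    """
    if not project_number.isdigit():
        # The address uses the numeric project number, not the project ID.
        raise ValueError("expected a numeric project number, not a project ID")
    return f"serviceAccount:service-{project_number}@dlp-api.iam.gserviceaccount.com"


# 123456789012 is a made-up project number.
print(dlp_service_agent_member("123456789012"))
```

A common slip is passing the project ID where the project number is expected, which is why the sketch rejects non-numeric input.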

3. Cloud Storage Buckets

Now we'll need to create 3 Cloud Storage buckets to store our data:

Quarantine bucket: our data will initially be uploaded here.
Sensitive data bucket: the data determined by the DLP API to be sensitive will be moved here.
Non-sensitive data bucket: the data determined by the DLP API not to be sensitive will be moved here.

We can use the gsutil command to create all three of our buckets in one go:

gsutil mb gs://[YOUR_QUARANTINE_BUCKET] \
gs://[YOUR_SENSITIVE_DATA_BUCKET] \
gs://[YOUR_NON_SENSITIVE_DATA_BUCKET]

Take note of the names of the buckets you just created - we'll need them later on.

4. Pub/Sub Topic and Subscription

Cloud Pub/Sub provides many-to-many asynchronous messaging between applications. A publisher creates a message and publishes it to a feed of messages called a topic. A subscriber receives these messages by way of a subscription. In our case, we'll have a Cloud Function move files to their respective buckets after a DLP job runs.

First, let's create a topic. A message will be published here each time a DLP job completes for a file added to our quarantine storage bucket. We'll name it 'classify-topic':

gcloud pubsub topics create classify-topic

A subscription will be notified when the topic publishes a message. Let's create a Pub/Sub subscription named 'classify-sub':

gcloud pubsub subscriptions create classify-sub --topic classify-topic

Messages published to this topic will trigger a second Cloud Function that picks up the results of the DLP job and moves the file to its proper place.
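
A Pub/Sub-triggered background function receives each message as an event dictionary with a base64-encoded 'data' field and an 'attributes' map. The sketch below shows one way such a function might pull out the DLP job name. It assumes the notification carries the name in an attribute called 'DlpJobName'. The function names and the sample event are hypothetical.

```python
import base64


def job_name_from_event(event: dict) -> str:
    """Pull the DLP job name out of a Pub/Sub-triggered function's event.

    Assumption: the notification carries the job name in an attribute
    called 'DlpJobName'; check a real message before relying on it.
    """
    attributes = event.get("attributes") or {}
    job_name = attributes.get("DlpJobName")
    if not job_name:
        raise KeyError("event has no DlpJobName attribute")
    return job_name


def decode_payload(event: dict) -> str:
    """Decode the optional base64-encoded message body into text."""
    data = event.get("data")
    return base64.b64decode(data).decode("utf-8") if data else ""


# Hypothetical event, shaped like what a background function receives.
sample_event = {
    "data": base64.b64encode(b"job finished").decode("ascii"),
    "attributes": {"DlpJobName": "projects/my-project/dlpJobs/i-1234"},
}
print(job_name_from_event(sample_event))
print(decode_payload(sample_event))
```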

5. Cloud Functions

Cloud Functions allow us to deploy lightweight, event-based, asynchronous, single-purpose functions without the need to manage a server or a runtime environment. We're going to deploy two Cloud Functions using the provided main.py file, located in dlp-cloud-functions-tutorials/gcs-dlp-classification-python/.

Replace Variables

Before we can create our functions, we'll need to replace some variables in our main.py file.

In the Cloud Shell Editor, adjust main.py by replacing the values for the project ID and the bucket variables on lines 28 to 34 using the corresponding buckets created earlier:

main.py

PROJECT_ID = '[PROJECT_ID_HOSTING_STAGING_BUCKET]'
"""The bucket the to-be-scanned files are uploaded to."""
STAGING_BUCKET = '[YOUR_QUARANTINE_BUCKET]'
"""The bucket to move "sensitive" files to."""
SENSITIVE_BUCKET = '[YOUR_SENSITIVE_DATA_BUCKET]'
"""The bucket to move "non sensitive" files to."""
NONSENSITIVE_BUCKET = '[YOUR_NON_SENSITIVE_DATA_BUCKET]'

Additionally, replace the value for the pub/sub topic variable with the pub/sub topic created in the previous step:

""" Pub/Sub topic to notify once the  DLP job completes."""
PUB_SUB_TOPIC = 'classify-topic'

Deploy Functions

In your Cloud Shell, change directories to gcs-dlp-classification-python where the main.py file exists:

cd ~/cloudshell_open/dlp-cloud-functions-tutorials/gcs-dlp-classification-python

It's time to deploy some functions.

First up, deploy the create_DLP_job function, replacing [YOUR_QUARANTINE_BUCKET] with the correct bucket name. This function is triggered when new files are uploaded to the designated Cloud Storage quarantine bucket and will create a DLP job for each uploaded file:

gcloud functions deploy create_DLP_job --runtime python37 \
--trigger-event google.storage.object.finalize \
--trigger-resource [YOUR_QUARANTINE_BUCKET]
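
To make the "DLP job for each uploaded file" idea concrete, here is a rough sketch of the kind of inspect-job configuration such a function could assemble: the uploaded file as the storage target, a list of infoTypes to look for, and a Pub/Sub action pointing at our topic. The helper name, the chosen infoTypes, and the sample values are assumptions for illustration. The tutorial's main.py may build its job differently.

```python
def build_inspect_job(project_id, bucket, file_name, topic, info_types):
    """Assemble a DLP inspect-job configuration as a plain dictionary."""
    # Pub/Sub topics are addressed by their full resource path.
    topic_path = f"projects/{project_id}/topics/{topic}"
    return {
        # What to look for.
        "inspect_config": {
            "info_types": [{"name": name} for name in info_types],
        },
        # Where to look: the single file that was just uploaded.
        "storage_config": {
            "cloud_storage_options": {
                "file_set": {"url": f"gs://{bucket}/{file_name}"}
            }
        },
        # What to do when the job completes: notify the topic.
        "actions": [{"pub_sub": {"topic": topic_path}}],
    }


# Hypothetical values; substitute your own project and bucket names.
job = build_inspect_job(
    project_id="my-project",
    bucket="my-quarantine-bucket",
    file_name="sample_s20.csv",
    topic="classify-topic",
    info_types=["US_SOCIAL_SECURITY_NUMBER", "EMAIL_ADDRESS"],
)
print(job["storage_config"]["cloud_storage_options"]["file_set"]["url"])
print(job["actions"][0]["pub_sub"]["topic"])
```

The Pub/Sub action is what ties the two functions together: the first function only creates the job, and the notification on completion is what wakes up the second.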

Next, deploy the resolve_DLP function, specifying our topic as its trigger. This function listens for the Pub/Sub notification published when a DLP job created by the function above completes. As soon as it receives a notification, it picks up the results of the DLP job and moves the file to the sensitive or non-sensitive bucket accordingly:

gcloud functions deploy resolve_DLP --runtime python37 \
--trigger-topic classify-topic
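
The routing decision this function makes can be sketched in a few lines: if the job reported any findings, the file goes to the sensitive bucket; otherwise it goes to the non-sensitive one. The sketch below uses plain dictionaries as a stand-in for the per-infoType statistics a DLP job returns. The function name and the data shape are assumptions, not the tutorial's actual code.

```python
def choose_destination(info_type_stats, sensitive_bucket, nonsensitive_bucket):
    """Pick the destination bucket from a job's per-infoType finding counts.

    info_type_stats is assumed to be a list of dicts such as
    {"info_type": "US_SOCIAL_SECURITY_NUMBER", "count": 8}.
    """
    total_findings = sum(stat["count"] for stat in info_type_stats)
    return sensitive_bucket if total_findings > 0 else nonsensitive_bucket


# Placeholder bucket names, matching the ones used throughout this codelab.
SENSITIVE = "[YOUR_SENSITIVE_DATA_BUCKET]"
NONSENSITIVE = "[YOUR_NON_SENSITIVE_DATA_BUCKET]"

with_findings = [{"info_type": "US_SOCIAL_SECURITY_NUMBER", "count": 8}]
print(choose_destination(with_findings, SENSITIVE, NONSENSITIVE))
print(choose_destination([], SENSITIVE, NONSENSITIVE))
```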

Verify

Verify that both of our cloud functions were successfully deployed with the gcloud functions describe command:

gcloud functions describe create_DLP_job

gcloud functions describe resolve_DLP

The output will read ACTIVE for the status when it's been deployed successfully.

6. Test with Sample Data

With all the parts in place, we can now test things out with some sample files. In your Cloud Shell change your current working directory to sample_data:

cd ~/cloudshell_open/dlp-cloud-functions-tutorials/sample_data

Our sample files consist of txt and csv files containing various pieces of data. The files prefixed with 'sample_s' contain sensitive data, while those prefixed with 'sample_n' do not. For example, sample_s20.csv contains what is formatted to look like US Social Security numbers:

sample_s20.csv

Name,SSN,metric 1,metric 2
Maria Johnson,284-73-5110,5,43
Tyler Parker,284-73-5110,8,17
Maria Johnson,284-73-5110,54,63
Maria Johnson,245-25-8698,53,19
Tyler Parker,475-15-8499,6,67
Maria Johnson,719-12-6560,75,83
Maria Johnson,616-69-3226,91,13
Tzvika Roberts,245-25-8698,94,61

On the other hand, the data in sample_n15.csv would not be considered sensitive:

sample_n15.csv

record id,metric 1,metric 2,metric 3
1,59,93,100
2,53,13,17
3,59,67,53
4,52,93,34
5,14,22,88
6,18,88,3
7,32,49,5
8,93,46,14
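
For a rough intuition of why the first file looks sensitive and the second does not, the sketch below counts cells that match the three-two-four digit layout of a US Social Security number. This is only a simple pattern match for illustration. It is not how the DLP API detects sensitive data, and the function name is hypothetical.

```python
import csv
import io
import re

# Three digits, two digits, four digits, separated by hyphens.
SSN_LIKE = re.compile(r"\d{3}-\d{2}-\d{4}")


def count_ssn_like(csv_text: str) -> int:
    """Count cells in a CSV (excluding the header row) shaped like an SSN."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    return sum(1 for row in reader for cell in row if SSN_LIKE.fullmatch(cell))


# The first two data rows of each sample file shown above.
sensitive_sample = (
    "Name,SSN,metric 1,metric 2\n"
    "Maria Johnson,284-73-5110,5,43\n"
    "Tyler Parker,284-73-5110,8,17\n"
)
nonsensitive_sample = (
    "record id,metric 1,metric 2,metric 3\n"
    "1,59,93,100\n"
    "2,53,13,17\n"
)
print(count_ssn_like(sensitive_sample))     # prints 2
print(count_ssn_like(nonsensitive_sample))  # prints 0
```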

To see how our setup will treat our files, let's upload all of our test files to our quarantine bucket:

gsutil -m cp * gs://[YOUR_QUARANTINE_BUCKET]

Initially, our files will sit in the quarantine bucket that we uploaded them to. To verify this, immediately after uploading the files, list the content of the quarantine bucket:

gsutil ls gs://[YOUR_QUARANTINE_BUCKET]

To check out the series of events we've kicked off, start by navigating to the Cloud Functions page:

Click the Actions menu for the create_DLP_job function, and select View Logs:

In our log for this function we see at least 4 entries for each of our files indicating:

The function execution started
The function had been triggered for a particular file
A job had been created
The function had finished executing

Once the create_DLP_job function completes for each file, a corresponding DLP job is initiated. Navigate to the DLP Jobs Page to see a list of the DLP jobs in the queue:

You'll see a list of jobs that are Pending, Running, or Done. Each of them corresponds to one of the files we've uploaded:

You can click the ID of any of these jobs to see more details.

If you go back to the Cloud Functions page and check the logs out for the resolve_DLP function, you'll see at least 8 entries for each file, indicating:

The function execution started
A pub/sub notification was received
The name of the corresponding DLP job
A status code
The number of instances of sensitive data (if any)
The bucket that the file will be moved to
The DLP job has finished parsing the file
The function had finished executing

As soon as all of the calls to the resolve_DLP function have finished running, check out the contents of the quarantine bucket once again:

gsutil ls gs://[YOUR_QUARANTINE_BUCKET]

This time, it should be completely empty. If you run the same command above for the other buckets, however, you'll find our files perfectly separated into their corresponding buckets!
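
If you'd like to check the separation programmatically rather than by eye, one option is to compare each object name in a bucket listing against the prefix you expect there. The sketch below works on lines shaped like the output of gsutil ls. The helper name is hypothetical, and the sample listing is made up, with one file deliberately misplaced to show what a failure would look like.

```python
import posixpath


def misplaced_objects(listing, expected_prefix):
    """Return object names in a bucket listing that lack the expected prefix."""
    names = [posixpath.basename(line.strip()) for line in listing if line.strip()]
    return [name for name in names if not name.startswith(expected_prefix)]


# Made-up listing for the sensitive bucket; the second entry does not belong.
sensitive_listing = [
    "gs://[YOUR_SENSITIVE_DATA_BUCKET]/sample_s20.csv",
    "gs://[YOUR_SENSITIVE_DATA_BUCKET]/sample_n15.csv",
]
print(misplaced_objects(sensitive_listing, "sample_s"))
```

An empty list for both the sensitive bucket (prefix 'sample_s') and the non-sensitive bucket (prefix 'sample_n') would indicate that every file landed where expected.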

7. Cleanup

Now that we've seen how to use the DLP API in conjunction with Cloud Functions to classify data, let's clean up our project of all the resources we've created.

Delete the Project

If you prefer, you can delete the entire project. In the GCP Console, go to the Cloud Resource Manager page:

In the project list, select the project we've been working in and click Delete. You'll be prompted to type in the project ID. Enter it and click Shut Down.

Alternatively, you can delete the entire project directly from Cloud Shell with gcloud:

gcloud projects delete [PROJECT_ID]

If you prefer to delete the different components one by one, proceed to the next section.

Cloud Functions

Delete both of our cloud functions with gcloud:

gcloud functions delete -q create_DLP_job && gcloud functions delete -q resolve_DLP

Storage Buckets

Remove all of the uploaded files and delete the buckets with gsutil:

gsutil rm -r gs://[YOUR_QUARANTINE_BUCKET] \
gs://[YOUR_SENSITIVE_DATA_BUCKET] \
gs://[YOUR_NON_SENSITIVE_DATA_BUCKET]

Pub/Sub

First delete the pub/sub subscription with gcloud:

gcloud pubsub subscriptions delete classify-sub

And finally, delete the pub/sub topic with gcloud:

gcloud pubsub topics delete classify-topic

8. Congratulations!

Woo hoo! You did it. You've learned how to utilize the DLP API along with Cloud Functions to automate the classification of files!

What we've covered

We created Cloud Storage buckets to store our sensitive and non-sensitive data.
We created a Pub/Sub topic and subscription to trigger a Cloud Function.
We created Cloud Functions designed to kick off a DLP job that categorizes files based on the sensitive data they contain.
We uploaded test data and checked out our Cloud Functions' Stackdriver logs to see the process in action.