In the modern-day organization, there is an ever-increasing amount of data coming in from a variety of sources. This often requires quarantining and classifying that data in order to strategically store and protect it, a task that rapidly becomes costly and unmanageable if it remains manual.
In this codelab, we'll see how we can automatically classify data uploaded to Cloud Storage and move it to a corresponding storage bucket. We'll accomplish this using Cloud Pub/Sub, Cloud Functions, Cloud Data Loss Prevention, and Cloud Storage.
What you'll do
- Create Cloud Storage buckets to be used as part of the quarantine and classification pipeline.
- Create a simple Cloud Function that invokes the DLP API when files are uploaded.
- Create a Pub/Sub topic and subscription to notify you when file processing is completed.
- Upload sample files to the quarantine bucket to invoke a Cloud Function.
- Use the DLP API to inspect and classify the files and move them to the appropriate bucket.
What you'll need
- A Google Cloud project with billing set up. If you don't have one you'll have to create one.
Throughout this codelab, we'll provision and manage different cloud resources and services using the command line via Cloud Shell. The following will open Cloud Shell along with Cloud Shell Editor and clone the companion project repository:
Make sure you're using the correct project by setting it with
gcloud config set project [PROJECT_ID]
Enable the required APIs on your Google Cloud project:
- Cloud Functions API - Manages lightweight user-provided functions executed in response to events.
- Cloud Data Loss Prevention (DLP) API - Provides methods for detection, risk analysis, and de-identification of privacy-sensitive fragments in text, images, and Google Cloud Platform storage repositories.
- Cloud Storage - Google Cloud Storage is a RESTful service for storing and accessing your data on Google's infrastructure.
Service Accounts Permissions
A service account is a special type of account that is used by applications and virtual machines to make authorized API calls.
App Engine Default Service Account
The App Engine default service account is used to execute tasks in your Cloud project on behalf of your apps running in App Engine. This service account exists in your project by default with the Editor role assigned.
First, we'll grant our service account the DLP Administrator role that is required to administer data loss prevention jobs:
gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
  --member serviceAccount:$GOOGLE_CLOUD_PROJECT@appspot.gserviceaccount.com \
  --role roles/dlp.admin
And finally, grant the DLP API Service Agent role, which gives the service account permissions for BigQuery, Cloud Storage, Datastore, Pub/Sub, and Cloud Key Management Service:
gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
  --member serviceAccount:$GOOGLE_CLOUD_PROJECT@appspot.gserviceaccount.com \
  --role roles/dlp.serviceAgent
DLP Service Account
In addition to the App Engine service account, we'll also use a DLP service account. This service account was automatically created when the DLP API was enabled and is initially granted no roles. Let's grant it the viewer role:
gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
  --member serviceAccount:service-`gcloud projects list --filter="PROJECT_ID:$GOOGLE_CLOUD_PROJECT" --format="value(PROJECT_NUMBER)"`@dlp-api.iam.gserviceaccount.com \
  --role roles/viewer
Now we'll need to create 3 Cloud Storage buckets to store our data:
- Quarantine bucket: our data will initially be uploaded here.
- Sensitive data bucket: the data determined by the DLP API to be sensitive will be moved here.
- Non-sensitive data bucket: the data determined by the DLP API not to be sensitive will be moved here.
We can use the gsutil command to create all three of our buckets in one swoop:
gsutil mb gs://[YOUR_QUARANTINE_BUCKET] \
  gs://[YOUR_SENSITIVE_DATA_BUCKET] \
  gs://[YOUR_NON_SENSITIVE_DATA_BUCKET]
Take note of the names of the buckets you just created - we'll need them later on.
Cloud Pub/Sub provides many-to-many asynchronous messaging between applications. A publisher creates a message and publishes it to a feed of messages called a topic. A subscriber receives these messages by way of a subscription. In our case, a Cloud Function triggered by that messaging will move files to their respective buckets after a DLP job runs.
First, let's create a topic. A message will be published here each time a DLP job finishes inspecting a file from our quarantine storage bucket. We'll name it 'classify-topic':
gcloud pubsub topics create classify-topic
A subscription will be notified when the topic publishes a message. Let's create a Pub/Sub subscription named 'classify-sub':
gcloud pubsub subscriptions create classify-sub --topic classify-topic
Messages published to the topic will trigger our second Cloud Function, which picks up the results of the DLP job and moves the file to its proper place.
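To make the messaging step more concrete, here is a minimal sketch of how a Pub/Sub-triggered Python function might unpack the event it receives. It assumes the background-function event shape (a base64-encoded `data` field plus an `attributes` dict) and that the DLP notification carries the job name in a `DlpJobName` attribute. The helper name `parse_pubsub_event` and the sample values are hypothetical and are not taken from the companion main.py.

```python
import base64


def parse_pubsub_event(event):
    """Return (payload_text, attributes) from a Pub/Sub-triggered event dict."""
    raw = event.get("data")
    # The message body arrives base64-encoded; an empty body is allowed.
    payload = base64.b64decode(raw).decode("utf-8") if raw else ""
    attributes = event.get("attributes") or {}
    return payload, attributes


# Simulated event; the job name below is a made-up placeholder.
sample_event = {
    "data": base64.b64encode(b"job finished").decode("ascii"),
    "attributes": {"DlpJobName": "projects/my-project/dlpJobs/i-123"},
}

payload, attributes = parse_pubsub_event(sample_event)
print(payload)
print(attributes.get("DlpJobName"))
```

Reading the job name from the attributes rather than the body keeps the handler independent of whatever text the publisher puts in the payload.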
Cloud Functions allow us to deploy lightweight, event-based, asynchronous single-purpose functions without the need to manage a server or a runtime environment. We're going to deploy two Cloud Functions using the provided main.py file, located in
Before we can create our functions, we'll need to replace some variables in our
In the Cloud Shell Editor, adjust main.py by replacing the values for the project ID and the bucket variables on lines 28 to 34 using the corresponding buckets created earlier:
PROJECT_ID = '[PROJECT_ID_HOSTING_STAGING_BUCKET]'
"""The bucket the to-be-scanned files are uploaded to."""
STAGING_BUCKET = '[YOUR_QUARANTINE_BUCKET]'
"""The bucket to move "sensitive" files to."""
SENSITIVE_BUCKET = '[YOUR_SENSITIVE_DATA_BUCKET]'
"""The bucket to move "non sensitive" files to."""
NONSENSITIVE_BUCKET = '[YOUR_NON_SENSITIVE_DATA_BUCKET]'
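Since a forgotten placeholder is an easy mistake here, a small check like the following can catch it before deploying. This is only a sketch: the helper `unreplaced_placeholders` and the example values are hypothetical, and it assumes placeholders always look like `[UPPER_CASE_NAME]`, as they do in this codelab.

```python
import re

# Matches values that are still a bracketed template such as [YOUR_QUARANTINE_BUCKET].
PLACEHOLDER = re.compile(r"^\[[A-Z_]+\]$")


def unreplaced_placeholders(settings):
    """Return the sorted names of settings whose value is still a placeholder."""
    return sorted(
        name for name, value in settings.items() if PLACEHOLDER.match(value)
    )


# Example values; the bucket and project names are made up.
settings = {
    "PROJECT_ID": "my-project",
    "STAGING_BUCKET": "[YOUR_QUARANTINE_BUCKET]",
    "SENSITIVE_BUCKET": "my-sensitive-bucket",
    "NONSENSITIVE_BUCKET": "my-nonsensitive-bucket",
}

print(unreplaced_placeholders(settings))
```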
Additionally, replace the value for the Pub/Sub topic variable with the Pub/Sub topic created in the previous step:
""" Pub/Sub topic to notify once the DLP job completes."""
PUB_SUB_TOPIC = 'classify-topic'
In your Cloud Shell, change directories to gcs-dlp-classification-python where the main.py file exists:
It's time to deploy some functions.
First up, deploy the create_DLP_job function, replacing [YOUR_QUARANTINE_BUCKET] with the correct bucket name. This function is triggered when new files are uploaded to the designated Cloud Storage quarantine bucket and will create a DLP job for each uploaded file:
gcloud functions deploy create_DLP_job --runtime python37 \
  --trigger-event google.storage.object.finalize \
  --trigger-resource [YOUR_QUARANTINE_BUCKET]
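To give a feel for what creating a DLP job for an uploaded file involves, here is a minimal sketch that builds the kind of request body such a function might send to the DLP API. It only constructs a plain dict and makes no API call. The helper name `build_inspect_job_request`, the chosen info types, the `POSSIBLE` likelihood threshold, and the sample project and bucket names are illustrative assumptions, not the companion main.py's actual settings.

```python
def build_inspect_job_request(project_id, bucket, file_name, topic,
                              info_types=("US_SOCIAL_SECURITY_NUMBER",
                                          "EMAIL_ADDRESS")):
    """Build an inspect-job request body for one Cloud Storage object."""
    return {
        "parent": f"projects/{project_id}",
        "inspect_job": {
            # What to look for and how confident a match must be.
            "inspect_config": {
                "info_types": [{"name": name} for name in info_types],
                "min_likelihood": "POSSIBLE",
            },
            # Which object to scan.
            "storage_config": {
                "cloud_storage_options": {
                    "file_set": {"url": f"gs://{bucket}/{file_name}"}
                }
            },
            # Publish a notification to the topic when the job finishes.
            "actions": [
                {"pub_sub": {"topic": f"projects/{project_id}/topics/{topic}"}}
            ],
        },
    }


# Shape of the event passed by a google.storage.object.finalize trigger
# (only the two fields used here; the values are made up).
sample_event = {"bucket": "my-quarantine-bucket", "name": "sample_s20.csv"}

request = build_inspect_job_request(
    "my-project", sample_event["bucket"], sample_event["name"], "classify-topic"
)
print(request["inspect_job"]["storage_config"]["cloud_storage_options"]["file_set"]["url"])
```

The `actions` entry is what ties the two functions together: it asks DLP to publish to the topic once the job completes.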
Next up, deploy the resolve_DLP function, indicating our topic as the trigger for it. This function listens for the Pub/Sub notification published when a DLP job created by the function above completes. As soon as it receives the notification, it picks up the results from the DLP job and moves the file to the sensitive or non-sensitive bucket accordingly:
gcloud functions deploy resolve_DLP --runtime python37 \
  --trigger-topic classify-topic
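The routing decision itself can be sketched in a few lines. The example below assumes the job results have already been reduced to a list of per-info-type counts, and that any finding at all makes a file sensitive. Both the helper name `choose_destination` and that rule are assumptions for illustration; the companion main.py may apply a different threshold.

```python
def choose_destination(info_type_stats, sensitive_bucket, nonsensitive_bucket):
    """Pick the destination bucket from per-info-type finding counts.

    Returns a (bucket, total_findings) tuple.
    """
    total = sum(int(stat.get("count", 0)) for stat in info_type_stats)
    bucket = sensitive_bucket if total > 0 else nonsensitive_bucket
    return bucket, total


# Made-up results for two files: one with findings, one without.
with_findings = [{"info_type": {"name": "US_SOCIAL_SECURITY_NUMBER"}, "count": 8}]
without_findings = []

print(choose_destination(with_findings, "my-sensitive", "my-nonsensitive"))
print(choose_destination(without_findings, "my-sensitive", "my-nonsensitive"))
```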
Verify that both of our Cloud Functions were successfully deployed with the gcloud functions describe command:
gcloud functions describe create_DLP_job
gcloud functions describe resolve_DLP
The output will read ACTIVE for the status when it's been deployed successfully.
With all the parts in place, we can now test things out with some sample files. In your Cloud Shell change your current working directory to
Our sample files consist of txt and csv files containing various pieces of data. The files prefixed with 'sample_s' will contain sensitive data while those prefixed with 'sample_n' will not. For example, sample_s20.csv contains what is formatted to look like US social security numbers:
Name,SSN,metric 1,metric 2
Maria Johnson,284-73-5110,5,43
Tyler Parker,284-73-5110,8,17
Maria Johnson,284-73-5110,54,63
Maria Johnson,245-25-8698,53,19
Tyler Parker,475-15-8499,6,67
Maria Johnson,719-12-6560,75,83
Maria Johnson,616-69-3226,91,13
Tzvika Roberts,245-25-8698,94,61
On the other hand, the data in sample_n15.csv would not be considered sensitive:
record id,metric 1,metric 2,metric 3
1,59,93,100
2,53,13,17
3,59,67,53
4,52,93,34
5,14,22,88
6,18,88,3
7,32,49,5
8,93,46,14
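To see why the two samples differ, the rough sketch below counts cells that merely look like a US social security number. This is a deliberately crude stand-in: the DLP API relies on much more than a single regular expression, and the helper `count_ssn_like_cells` is hypothetical. The embedded rows are abbreviated excerpts of the two samples.

```python
import csv
import io
import re

# Three digits, two digits, four digits, separated by hyphens.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def count_ssn_like_cells(csv_text):
    """Count CSV cells whose entire content is formatted like an SSN."""
    reader = csv.reader(io.StringIO(csv_text))
    return sum(
        1 for row in reader for cell in row if SSN_PATTERN.match(cell.strip())
    )


SENSITIVE_SAMPLE = """Name,SSN,metric 1,metric 2
Maria Johnson,284-73-5110,5,43
Tyler Parker,475-15-8499,6,67
"""

NONSENSITIVE_SAMPLE = """record id,metric 1,metric 2,metric 3
1,59,93,100
2,53,13,17
"""

print(count_ssn_like_cells(SENSITIVE_SAMPLE))     # 2
print(count_ssn_like_cells(NONSENSITIVE_SAMPLE))  # 0
```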
To see how our setup will treat our files, let's upload all of our test files to our quarantine bucket:
gsutil -m cp * gs://[YOUR_QUARANTINE_BUCKET]
Initially, our files will sit in the quarantine bucket that we uploaded them to. To verify this, immediately after uploading the files, list the content of the quarantine bucket:
gsutil ls gs://[YOUR_QUARANTINE_BUCKET]
To check out the series of events we've kicked off, start by navigating to the Cloud Functions page:
Click the Actions menu for the create_DLP_job function, and select View Logs:
In our log for this function we see at least 4 entries for each of our files indicating:
- The function execution started
- The function had been triggered for a particular file
- A job had been created
- The function had finished executing
Once the create_DLP_job function completes for each file, a corresponding DLP job is initiated. Navigate to the DLP Jobs Page to see a list of the DLP jobs in the queue:
You'll see a list of jobs marked Pending, Running, or Done. Each of them corresponds to one of the files we've uploaded:
You can click the ID of any of these jobs to see more details.
If you go back to the Cloud Functions page and check the logs out for the resolve_DLP function, you'll see at least 8 entries for each file, indicating:
- The function execution started
- A pub/sub notification was received
- The name of the corresponding DLP job
- A status code
- The number of instances of sensitive data (if any)
- The bucket that the file will be moved to
- The DLP job has finished parsing the file
- The function had finished executing
As soon as all of the calls to the resolve_DLP function have finished running, check out the contents of the quarantine bucket once again:
gsutil ls gs://[YOUR_QUARANTINE_BUCKET]
This time, it should be completely empty. If you run the same command above for the other buckets, however, you'll find our files perfectly separated into their corresponding buckets!
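If you'd rather not eyeball the listings, the check can be sketched as a small script. It assumes you paste in the object URLs printed by gsutil ls and relies on the naming convention described earlier ('sample_s' for sensitive, 'sample_n' for non-sensitive). The helper names, bucket names, and the file sample_s1.txt are hypothetical.

```python
def expected_bucket(file_name, sensitive_bucket, nonsensitive_bucket):
    """Return the bucket a sample file should land in, or None if unknown."""
    if file_name.startswith("sample_s"):
        return sensitive_bucket
    if file_name.startswith("sample_n"):
        return nonsensitive_bucket
    return None


def misplaced_files(object_urls, sensitive_bucket, nonsensitive_bucket):
    """Return the gs:// URLs whose bucket does not match the file's prefix."""
    misplaced = []
    for url in object_urls:
        # Split "gs://bucket/name" into its bucket and object name.
        bucket, _, file_name = url[len("gs://"):].partition("/")
        expected = expected_bucket(file_name, sensitive_bucket, nonsensitive_bucket)
        if expected is not None and bucket != expected:
            misplaced.append(url)
    return misplaced


# Made-up listing with one deliberately misplaced file.
listing = [
    "gs://my-sensitive/sample_s20.csv",
    "gs://my-nonsensitive/sample_n15.csv",
    "gs://my-nonsensitive/sample_s1.txt",
]

print(misplaced_files(listing, "my-sensitive", "my-nonsensitive"))
```

An empty list means every sample file ended up where its prefix says it should.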
Now that we've seen how to use the DLP API in conjunction with Cloud Functions to classify data, let's clean up our project of all the resources we've created.
Delete the Project
If you prefer, you can delete the entire project. In the GCP Console, go to the Cloud Resource Manager page:
In the project list, select the project we've been working in and click Delete. You'll be prompted to type in the project ID. Enter it and click Shut Down.
Alternatively, you can delete the entire project directly from Cloud Shell with gcloud:
gcloud projects delete [PROJECT_ID]
If you prefer to delete the different components one by one, proceed to the next section.
Delete both of our cloud functions with gcloud:
gcloud functions delete -q create_DLP_job && gcloud functions delete -q resolve_DLP
Remove all of the uploaded files and delete the buckets with gsutil:
gsutil rm -r gs://[YOUR_QUARANTINE_BUCKET] \
  gs://[YOUR_SENSITIVE_DATA_BUCKET] \
  gs://[YOUR_NON_SENSITIVE_DATA_BUCKET]
Next, delete the Pub/Sub subscription with gcloud:
gcloud pubsub subscriptions delete classify-sub
And finally, delete the pub/sub topic with gcloud:
gcloud pubsub topics delete classify-topic
Woo hoo! You did it. You've learned how to utilize the DLP API along with Cloud Functions to automate the classification of files!
What we've covered
- We created Cloud Storage Buckets to store our sensitive and non-sensitive data
- We created a Pub/Sub topic and subscription to trigger a cloud function
- We created Cloud Functions designed to kick off a DLP job that categorizes files based on sensitive data contained in them
- We uploaded test data and checked out our Cloud Functions' Stackdriver logs to see the process in action