1. Overview
Cloud Data Loss Prevention (DLP) is a fully managed service designed to help discover, classify, and protect sensitive information. This codelab will introduce some of the basic capabilities of the Cloud DLP API and demonstrate the various ways it can be used to protect data.
What you'll do
- Use DLP to inspect strings and files for matching info types
- Learn about deidentification techniques and use DLP to de-identify data
- Learn how to reidentify data that has been deidentified using format preserving encryption (FPE)
- Use DLP to redact info types from strings and images
What you'll need
- A Google Cloud project with billing set up. If you don't have one you'll have to create one.
2. Getting set up
This codelab can run completely on Google Cloud Platform without any local installation or configuration.
Cloud Shell
Throughout this codelab, we'll provision and manage different cloud resources and services using the command line via Cloud Shell.
Download the companion project repository:
git clone https://github.com/googleapis/nodejs-dlp
Once the project code is downloaded, change into the samples directory and install the required Node.js packages:
cd samples && npm install
Make sure you're using the correct project by setting it with the following gcloud command:
gcloud config set project [PROJECT_ID]
Enable APIs
Here are the APIs we'll need to enable on our project:
- Cloud Data Loss Prevention API - Provides methods for detection, risk analysis, and de-identification of privacy-sensitive fragments in text, images, and Google Cloud Platform storage repositories
- Cloud Key Management Service (KMS) API - Google Cloud KMS allows customers to manage encryption keys and perform cryptographic operations with those keys.
Enable the required APIs with the following gcloud command:
gcloud services enable dlp.googleapis.com cloudkms.googleapis.com \
  --project ${GOOGLE_CLOUD_PROJECT}
3. Inspect strings and files
The samples directory of the project downloaded in the preceding step contains several JavaScript files that use different features of Cloud DLP. inspect.js inspects a provided string or file for sensitive info types.
To test this out, provide the string option and a sample string with some potentially sensitive information:
node inspect.js -c $GOOGLE_CLOUD_PROJECT \
  string 'My email address is jenny@somedomain.com and you can call me at 555-867-5309'
The output lists the findings for each matched info type. Each finding includes:
- Quote: the part of the input string that matched
- InfoType: the information type detected for that part of the string. A full list of possible info types is available in the Cloud DLP documentation. By default, inspect.js only inspects for the info types CREDIT_CARD_NUMBER, PHONE_NUMBER, and EMAIL_ADDRESS
- Likelihood: how likely it is that the result is a true match. Likelihood ranges from VERY_UNLIKELY to VERY_LIKELY
The findings for the command above are:
Findings:
  Quote: jenny@somedomain.com
  Info type: EMAIL_ADDRESS
  Likelihood: LIKELY
  Quote: 555-867-5309
  Info type: PHONE_NUMBER
  Likelihood: VERY_LIKELY
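To make the shape of a finding concrete, here is a minimal sketch of how output like the above could be produced from a list of findings. formatFindings is a hypothetical helper, not part of the samples, and it assumes each finding carries a quote, an infoType object with a name, and a likelihood given as its string name. The sample data is hand-written to mirror the output above, not a live API response.

```javascript
// Hypothetical helper (not part of the codelab samples): turns a list of
// findings into printable lines. Assumes each finding looks like
// {quote: string, infoType: {name: string}, likelihood: string}.
function formatFindings(findings) {
  if (!findings || findings.length === 0) {
    return 'No findings.';
  }
  const lines = ['Findings:'];
  for (const finding of findings) {
    // The quote is only expected when it was requested via includeQuote.
    if (finding.quote !== undefined) {
      lines.push(`  Quote: ${finding.quote}`);
    }
    lines.push(`  Info type: ${finding.infoType.name}`);
    lines.push(`  Likelihood: ${finding.likelihood}`);
  }
  return lines.join('\n');
}

// Hand-written sample data, not a live response.
const sampleFindings = [
  {
    quote: 'jenny@somedomain.com',
    infoType: {name: 'EMAIL_ADDRESS'},
    likelihood: 'LIKELY',
  },
  {
    quote: '555-867-5309',
    infoType: {name: 'PHONE_NUMBER'},
    likelihood: 'VERY_LIKELY',
  },
];

console.log(formatFindings(sampleFindings));
```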
Similarly, we can inspect files for info types. Check out the sample accounts.txt file:
resources/accounts.txt
My credit card number is 1234 5678 9012 3456, and my CVV is 789.
Run inspect.js again, this time with the file option:
node inspect.js -c $GOOGLE_CLOUD_PROJECT file resources/accounts.txt
The results:
Findings:
  Quote: 5678 9012 3456
  Info type: CREDIT_CARD_NUMBER
  Likelihood: VERY_LIKELY
For either kind of query, we could limit the results by likelihood or info type. For example:
node inspect.js -c $GOOGLE_CLOUD_PROJECT \
  string 'Call 900-649-2568 or email me at anthony@somedomain.com' \
  -m VERY_LIKELY
By specifying VERY_LIKELY as the minimum likelihood, any matches with a likelihood below VERY_LIKELY are excluded:
Findings:
  Quote: 900-649-2568
  Info type: PHONE_NUMBER
  Likelihood: VERY_LIKELY
The full results without the limitation would be:
Findings:
  Quote: 900-649-2568
  Info type: PHONE_NUMBER
  Likelihood: VERY_LIKELY
  Quote: anthony@somedomain.com
  Info type: EMAIL_ADDRESS
  Likelihood: LIKELY
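The effect of a minimum likelihood can be illustrated locally. The sketch below is not how the service is implemented; the filtering happens on the API side. It only shows the ordering of the likelihood levels and which findings survive a given threshold. filterByMinLikelihood is a hypothetical helper, and the findings are hand-written to mirror the output above.

```javascript
// Likelihood levels ordered from least to most likely.
const LIKELIHOOD_ORDER = [
  'VERY_UNLIKELY',
  'UNLIKELY',
  'POSSIBLE',
  'LIKELY',
  'VERY_LIKELY',
];

// Hypothetical local helper: keeps only findings at or above minLikelihood.
function filterByMinLikelihood(findings, minLikelihood) {
  const threshold = LIKELIHOOD_ORDER.indexOf(minLikelihood);
  if (threshold === -1) {
    throw new Error(`Unknown likelihood: ${minLikelihood}`);
  }
  return findings.filter(
    finding => LIKELIHOOD_ORDER.indexOf(finding.likelihood) >= threshold
  );
}

// Hand-written sample data, not a live response.
const allFindings = [
  {quote: '900-649-2568', infoType: {name: 'PHONE_NUMBER'}, likelihood: 'VERY_LIKELY'},
  {quote: 'anthony@somedomain.com', infoType: {name: 'EMAIL_ADDRESS'}, likelihood: 'LIKELY'},
];

// Only the phone number remains when the threshold is VERY_LIKELY.
console.log(filterByMinLikelihood(allFindings, 'VERY_LIKELY'));
```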
Similarly, we could specify the info type we're checking for:
node inspect.js -c $GOOGLE_CLOUD_PROJECT \
  string 'Call 900-649-2568 or email me at anthony@somedomain.com' \
  -t EMAIL_ADDRESS
Only the specified info type is returned if found:
Findings:
  Quote: anthony@somedomain.com
  Info type: EMAIL_ADDRESS
  Likelihood: LIKELY
Below is the asynchronous function that uses the API to inspect the input:
inspect.js
async function inspectString(
  callingProjectId,
  string,
  minLikelihood,
  maxFindings,
  infoTypes,
  customInfoTypes,
  includeQuote
) {
  ...
}
The arguments provided for the parameters above are used to construct a request object. That request is then provided to the inspectContent function to get a response that results in our output:
inspect.js
// Construct item to inspect
const item = {value: string};
// Construct request
const request = {
  parent: dlp.projectPath(callingProjectId),
  inspectConfig: {
    infoTypes: infoTypes,
    customInfoTypes: customInfoTypes,
    minLikelihood: minLikelihood,
    includeQuote: includeQuote,
    limits: {
      maxFindingsPerRequest: maxFindings,
    },
  },
  item: item,
};
...
...
const [response] = await dlp.inspectContent(request);
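The request construction above can be sketched as a standalone helper that needs no client library, which makes the expected shape easy to check. buildInspectRequest and its option names are hypothetical. The sketch assumes that info types are passed as objects with a name field and that a parent of the form projects/PROJECT_ID is equivalent to what dlp.projectPath returns.

```javascript
// Hypothetical helper: builds an inspect request as a plain object.
// Assumption: the parent string matches what dlp.projectPath() would return.
function buildInspectRequest(projectId, text, options = {}) {
  const {
    infoTypeNames = ['CREDIT_CARD_NUMBER', 'PHONE_NUMBER', 'EMAIL_ADDRESS'],
    minLikelihood = 'LIKELIHOOD_UNSPECIFIED',
    maxFindings = 0, // 0 is the default value, so no explicit limit is requested
    includeQuote = true,
  } = options;

  return {
    parent: `projects/${projectId}`,
    inspectConfig: {
      // Each info type is an object with a name, e.g. {name: 'EMAIL_ADDRESS'}
      infoTypes: infoTypeNames.map(name => ({name})),
      minLikelihood,
      includeQuote,
      limits: {maxFindingsPerRequest: maxFindings},
    },
    item: {value: text},
  };
}

const exampleRequest = buildInspectRequest('my-project', 'Email me at anthony@somedomain.com', {
  infoTypeNames: ['EMAIL_ADDRESS'],
  minLikelihood: 'LIKELY',
});
console.log(JSON.stringify(exampleRequest, null, 2));
```

An object like this could then be passed to dlp.inspectContent in place of the hand-built request.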
4. Deidentification
Beyond inspecting and detecting sensitive data, Cloud DLP can perform deidentification. Deidentification is the process of removing identifying information from data. The API detects sensitive data as defined by info types, and then uses a de-identification transformation to mask, delete, or otherwise obscure the data.
deid.js demonstrates deidentification in several ways. The simplest method of deidentification is with a mask:
node deid.js deidMask -c $GOOGLE_CLOUD_PROJECT \
  "My order number is F12312399. Email me at anthony@somedomain.com"
With a mask the API will replace the characters of the matching info type with a different character, * by default. The output will be:
My order number is F12312399. Email me at *****************************
Notice that the email address in the string is obfuscated while the arbitrary order number is intact. (Custom info types are possible but are out of scope for this codelab.)
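To build intuition for what a character mask does, here is a purely local sketch. The real masking is performed by the API on the spans it detects; maskRange is a hypothetical helper that masks a span we locate ourselves. It assumes that a numberToMask of 0 means the whole match is masked, and otherwise masks only that many leading characters.

```javascript
// Local illustration only: the actual masking is done by the DLP API.
// Masks characters of text between start (inclusive) and end (exclusive).
function maskRange(text, start, end, maskingCharacter = '*', numberToMask = 0) {
  const span = text.slice(start, end);
  // Assumption: 0 means "mask the whole match".
  const count = numberToMask > 0 ? Math.min(numberToMask, span.length) : span.length;
  const masked = maskingCharacter.repeat(count) + span.slice(count);
  return text.slice(0, start) + masked + text.slice(end);
}

const sampleText = 'Email me at anthony@somedomain.com';
const email = 'anthony@somedomain.com';
const emailStart = sampleText.indexOf(email);

// Mask the whole email address.
console.log(maskRange(sampleText, emailStart, emailStart + email.length));
// Mask only the first 7 characters with '#'.
console.log(maskRange(sampleText, emailStart, emailStart + email.length, '#', 7));
```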
Let's see the function that uses the DLP API to deidentify with a mask:
deid.js
async function deidentifyWithMask(
  callingProjectId,
  string,
  maskingCharacter,
  numberToMask
) {
  ...
}
Once again, these arguments are used to construct a request object. This time it's provided to the deidentifyContent function:
deid.js
// Construct deidentification request
const item = {value: string};
const request = {
  parent: dlp.projectPath(callingProjectId),
  deidentifyConfig: {
    infoTypeTransformations: {
      transformations: [
        {
          primitiveTransformation: {
            characterMaskConfig: {
              maskingCharacter: maskingCharacter,
              numberToMask: numberToMask,
            },
          },
        },
      ],
    },
  },
  item: item,
};
...
...
const [response] = await dlp.deidentifyContent(request);
Deidentify with Format Preserving Encryption
The DLP API also offers the ability to encrypt sensitive data values using a cryptographic key.
We'll start by using Cloud KMS to create a key ring:
gcloud kms keyrings create dlp-keyring --location global
Now we can create a key that we'll use to encrypt the data:
gcloud kms keys create dlp-key \
  --purpose='encryption' \
  --location=global \
  --keyring=dlp-keyring
The DLP API will accept a wrapped key encrypted with the KMS key we created. We can generate a random string that will be wrapped. We'll need this later to reidentify:
export AES_KEY=`head -c16 < /dev/random | base64 -w 0`
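For readers who prefer to stay in Node.js, a comparable random key could be generated with the built-in crypto module. This is an alternative sketch and not part of the codelab's samples; generateBase64Key is a hypothetical helper. It produces 16 random bytes and base64-encodes them, mirroring the shell command above.

```javascript
const crypto = require('crypto');

// Hypothetical helper: returns numBytes of cryptographically secure random
// data, encoded as a base64 string.
function generateBase64Key(numBytes = 16) {
  return crypto.randomBytes(numBytes).toString('base64');
}

const aesKey = generateBase64Key();
// 16 random bytes encode to a 24-character base64 string.
console.log(`Generated a key of ${aesKey.length} base64 characters`);
```

The rest of the codelab keeps using the AES_KEY environment variable set by the shell command.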
Now we can encrypt the string with our KMS key. This will generate a binary file that contains the encrypted string as ciphertext:
echo -n $AES_KEY | gcloud kms encrypt \
  --location global \
  --keyring dlp-keyring \
  --key dlp-key \
  --plaintext-file - \
  --ciphertext-file ./ciphertext.bin
Using deid.js we can now deidentify the phone number in the sample string below using encryption:
node deid.js deidFpe -c $GOOGLE_CLOUD_PROJECT \
  "My client's cell is 9006492568" `base64 -w 0 ciphertext.bin` \
  projects/${GOOGLE_CLOUD_PROJECT}/locations/global/keyRings/dlp-keyring/cryptoKeys/dlp-key \
  -s PHONE_NUMBER
The output will return the string with the matched info types replaced by an encrypted string and preceded by the info type indicated by the -s flag:
My client's cell is PHONE_NUMBER(10):vSt55z79nR
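The surrogate annotation in the output above follows the pattern INFO_TYPE(length):value. As a rough sketch, the tokens could be located in a deidentified string like this. findSurrogates is a hypothetical helper, and it assumes the number in parentheses is the character count of the encrypted value that follows the colon.

```javascript
// Hypothetical helper: finds surrogate-annotated tokens of the form
// SURROGATE_TYPE(length):value inside a deidentified string.
function findSurrogates(text, surrogateType) {
  // Escape regex metacharacters in the surrogate type name.
  const escaped = surrogateType.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const prefix = new RegExp(`${escaped}\\((\\d+)\\):`, 'g');
  const results = [];
  let match;
  while ((match = prefix.exec(text)) !== null) {
    const length = Number(match[1]);
    const valueStart = match.index + match[0].length;
    results.push({
      start: match.index,
      length,
      // Assumption: the declared length counts the characters after the colon.
      value: text.slice(valueStart, valueStart + length),
    });
  }
  return results;
}

const deidentified = "My client's cell is PHONE_NUMBER(10):vSt55z79nR";
console.log(findSurrogates(deidentified, 'PHONE_NUMBER'));
```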
Let's take a look at the function we're using to deidentify the string:
deid.js
async function deidentifyWithFpe(
  callingProjectId,
  string,
  alphabet,
  surrogateType,
  keyName,
  wrappedKey
) {
  ...
}
The arguments are used to construct a cryptoReplaceFfxFpeConfig object:
deid.js
const cryptoReplaceFfxFpeConfig = {
  cryptoKey: {
    kmsWrapped: {
      wrappedKey: wrappedKey,
      cryptoKeyName: keyName,
    },
  },
  commonAlphabet: alphabet,
};
if (surrogateType) {
  cryptoReplaceFfxFpeConfig.surrogateInfoType = {
    name: surrogateType,
  };
}
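The commonAlphabet value tells the API which character set the encrypted output should stay within. As a small sketch, a value's characters could be checked against the common alphabets before choosing one. pickCommonAlphabet is a hypothetical helper, and it assumes the alphabet names NUMERIC, HEXADECIMAL, UPPER_CASE_ALPHA_NUMERIC, and ALPHA_NUMERIC with the character ranges shown in the comments.

```javascript
// Hypothetical helper: returns the narrowest common alphabet that contains
// every character of value, or null if none of them fits.
function pickCommonAlphabet(value) {
  if (/^[0-9]+$/.test(value)) return 'NUMERIC'; // 0-9
  if (/^[0-9A-F]+$/.test(value)) return 'HEXADECIMAL'; // 0-9, A-F
  if (/^[0-9A-Z]+$/.test(value)) return 'UPPER_CASE_ALPHA_NUMERIC'; // 0-9, A-Z
  if (/^[0-9A-Za-z]+$/.test(value)) return 'ALPHA_NUMERIC'; // 0-9, A-Z, a-z
  return null;
}

console.log(pickCommonAlphabet('9006492568')); // NUMERIC
console.log(pickCommonAlphabet('vSt55z79nR')); // ALPHA_NUMERIC
console.log(pickCommonAlphabet('555-867-5309')); // null, because of the dashes
```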
The cryptoReplaceFfxFpeConfig object is in turn used in the request to the API via the deidentifyContent function:
deid.js
// Construct deidentification request
const item = {value: string};
const request = {
  parent: dlp.projectPath(callingProjectId),
  deidentifyConfig: {
    infoTypeTransformations: {
      transformations: [
        {
          primitiveTransformation: {
            cryptoReplaceFfxFpeConfig: cryptoReplaceFfxFpeConfig,
          },
        },
      ],
    },
  },
  item: item,
};
try {
  // Run deidentification request
  const [response] = await dlp.deidentifyContent(request);
Re-identify data
In order to re-identify the data, the DLP API will use the ciphertext we created in the previous step:
node deid.js reidFpe -c $GOOGLE_CLOUD_PROJECT \
  "<YOUR_DEID_OUTPUT>" \
  PHONE_NUMBER `base64 -w 0 ciphertext.bin` \
  projects/${GOOGLE_CLOUD_PROJECT}/locations/global/keyRings/dlp-keyring/cryptoKeys/dlp-key
The output will be the original string with no redactions or surrogate type indicated:
My client's cell is 9006492568
The function used to reidentify data is similar to the one used to deidentify it:
deid.js
async function reidentifyWithFpe(
  callingProjectId,
  string,
  alphabet,
  surrogateType,
  keyName,
  wrappedKey
) {
  ...
}
And once again, the arguments are used in a request to the API, this time to the reidentifyContent function:
deid.js
// Construct reidentification request
const item = {value: string};
const request = {
  parent: dlp.projectPath(callingProjectId),
  reidentifyConfig: {
    infoTypeTransformations: {
      transformations: [
        {
          primitiveTransformation: {
            cryptoReplaceFfxFpeConfig: {
              cryptoKey: {
                kmsWrapped: {
                  wrappedKey: wrappedKey,
                  cryptoKeyName: keyName,
                },
              },
              commonAlphabet: alphabet,
              surrogateInfoType: {
                name: surrogateType,
              },
            },
          },
        },
      ],
    },
  },
  inspectConfig: {
    customInfoTypes: [
      {
        infoType: {
          name: surrogateType,
        },
        surrogateType: {},
      },
    ],
  },
  item: item,
};
try {
  // Run reidentification request
  const [response] = await dlp.reidentifyContent(request);
Deidentify Dates with Date Shifting
In certain contexts, dates can be considered sensitive data that we might want to obfuscate. Date shifting lets us shift dates by a random increment while preserving the sequence and duration of a period of time. Each date in a set is shifted by an amount of time unique to that entry. To demonstrate deidentification via date shifting, first take a look at the sample CSV file that contains date data:
resources/dates.csv
name,birth_date,register_date,credit_card
Ann,01/01/1980,07/21/1996,4532908762519852
James,03/06/1988,04/09/2001,4301261899725540
Dan,08/14/1945,11/15/2011,4620761856015295
Laura,11/03/1992,01/04/2017,4564981067258901
The data contains two fields that we could apply a date shift to: birth_date and register_date. deid.js accepts a lower bound and an upper bound that define the range from which a random number of days is selected to shift the dates:
node deid.js deidDateShift -c $GOOGLE_CLOUD_PROJECT resources/dates.csv datesShifted.csv 30 90 birth_date
A file called datesShifted.csv will be generated with the dates randomly shifted by a number of days between 30 and 90. Here's an example of the generated output:
name,birth_date,register_date,credit_card
Ann,2/6/1980,7/21/1996,4532908762519852
James,5/18/1988,4/9/2001,4301261899725540
Dan,9/16/1945,11/15/2011,4620761856015295
Laura,12/16/1992,1/4/2017,4564981067258901
Notice that we were also able to specify which date column in the CSV file we wanted to shift. The birth_date field is shifted, while the register_date field remains unchanged.
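The arithmetic behind a date shift can be sketched locally. In the codelab the API chooses the number of days; here the shift is passed in by hand. shiftDate is a hypothetical helper that assumes MM/DD/YYYY input and prints M/D/YYYY, matching the sample output. Shifting Ann's birth date by 36 days reproduces the first row of the example output.

```javascript
// Local illustration only: shifts an MM/DD/YYYY date by a number of days.
// UTC is used so that daylight saving changes cannot affect the result.
function shiftDate(dateString, days) {
  const [month, day, year] = dateString.split('/').map(Number);
  // Date.UTC normalizes day values that overflow the month.
  const shifted = new Date(Date.UTC(year, month - 1, day + days));
  return [
    shifted.getUTCMonth() + 1,
    shifted.getUTCDate(),
    shifted.getUTCFullYear(),
  ].join('/');
}

console.log(shiftDate('01/01/1980', 36)); // prints 2/6/1980
console.log(shiftDate('03/06/1988', 73)); // prints 5/18/1988
```

Applying the same number of days to every date that belongs to one entry keeps the intervals between those dates intact, which is why the sequence and duration of events are preserved.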
Let's take a look at the function that handles deidentification with a date shift:
deid.js
async function deidentifyWithDateShift(
  callingProjectId,
  inputCsvFile,
  outputCsvFile,
  dateFields,
  lowerBoundDays,
  upperBoundDays,
  contextFieldId,
  wrappedKey,
  keyName
) {
  ...
}
Notice that this function could accept a wrapped key and a key name, similar to the deidentification with FPE, so that we have the option of providing an encryption key to reidentify a date shift. The arguments we provide build a dateShiftConfig object:
deid.js
// Construct DateShiftConfig
const dateShiftConfig = {
  lowerBoundDays: lowerBoundDays,
  upperBoundDays: upperBoundDays,
};
if (contextFieldId && keyName && wrappedKey) {
  dateShiftConfig.context = {name: contextFieldId};
  dateShiftConfig.cryptoKey = {
    kmsWrapped: {
      wrappedKey: wrappedKey,
      cryptoKeyName: keyName,
    },
  };
} else if (contextFieldId || keyName || wrappedKey) {
  throw new Error(
    'You must set either ALL or NONE of {contextFieldId, keyName, wrappedKey}!'
  );
}
// Construct deidentification request
const request = {
  parent: dlp.projectPath(callingProjectId),
  deidentifyConfig: {
    recordTransformations: {
      fieldTransformations: [
        {
          fields: dateFields,
          primitiveTransformation: {
            dateShiftConfig: dateShiftConfig,
          },
        },
      ],
    },
  },
  item: tableItem,
};
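The request above passes a tableItem whose construction is not shown in the excerpt. As a rough sketch, a CSV like dates.csv could be turned into a table item as follows. csvToTableItem is a hypothetical helper, not the sample's actual code. It assumes a simple CSV without quoted commas, MM/DD/YYYY dates, and a table structure of headers with a name plus rows of values that are either a stringValue or a dateValue with year, month, and day.

```javascript
// Hypothetical helper: converts simple CSV text into a DLP-style table item.
// Assumptions: no quoted fields, and date cells use MM/DD/YYYY.
function csvToTableItem(csvText, dateFieldNames) {
  const [headerLine, ...rowLines] = csvText.trim().split(/\r?\n/);
  const headerNames = headerLine.split(',');

  const rows = rowLines.map(line => ({
    values: line.split(',').map((cell, index) => {
      if (dateFieldNames.includes(headerNames[index])) {
        const [month, day, year] = cell.split('/').map(Number);
        return {dateValue: {year, month, day}};
      }
      return {stringValue: cell};
    }),
  }));

  return {
    table: {
      headers: headerNames.map(name => ({name})),
      rows,
    },
  };
}

const sampleCsv = [
  'name,birth_date,register_date,credit_card',
  'Ann,01/01/1980,07/21/1996,4532908762519852',
].join('\n');

console.log(JSON.stringify(csvToTableItem(sampleCsv, ['birth_date']), null, 2));
```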
5. Redact strings and images
Another method of obfuscating sensitive information is redaction. Redaction replaces each match with the name of the info type it was identified as. redact.js demonstrates redaction:
node redact.js -c $GOOGLE_CLOUD_PROJECT \
  string "Please refund the purchase to my credit card 4012888888881881" \
  -t 'CREDIT_CARD_NUMBER'
The output replaces the sample credit card number with the info type CREDIT_CARD_NUMBER:
Please refund the purchase to my credit card [CREDIT_CARD_NUMBER]
This is useful if you'd like to hide sensitive information but still identify the type of information that's being removed. The DLP API can similarly redact information from images that contain text. To demonstrate, let's take a look at a sample image:
resources/test.png
To redact the phone number and email address from the image above:
node redact.js -c $GOOGLE_CLOUD_PROJECT \
  image resources/test.png ./redacted.png \
  -t PHONE_NUMBER -t EMAIL_ADDRESS
As specified, a new image named redacted.png will be generated with the requested information blacked out:
Here is the function that is used to redact from a string:
redact.js
async function redactText(
  callingProjectId,
  string,
  minLikelihood,
  infoTypes
) {
  ...
}
And here is the request that will be provided to the deidentifyContent function:
redact.js
const request = {
  parent: dlp.projectPath(callingProjectId),
  item: {
    value: string,
  },
  deidentifyConfig: {
    infoTypeTransformations: {
      transformations: [replaceWithInfoTypeTransformation],
    },
  },
  inspectConfig: {
    minLikelihood: minLikelihood,
    infoTypes: infoTypes,
  },
};
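The request above references replaceWithInfoTypeTransformation without showing its definition. A plausible definition, assuming the API's replaceWithInfoTypeConfig primitive transformation (which takes no options), is sketched below. It is followed by a purely local illustration of the effect; replaceQuotesWithInfoType is a hypothetical helper, and the finding is hand-written rather than returned by the API.

```javascript
// Assumed shape of the transformation referenced in the request above.
const replaceWithInfoTypeTransformation = {
  primitiveTransformation: {
    replaceWithInfoTypeConfig: {},
  },
};

// Local illustration only: replaces each finding's quote with its info type
// name in brackets, the way the redacted output is displayed.
function replaceQuotesWithInfoType(text, findings) {
  return findings.reduce(
    (result, finding) =>
      result.replace(finding.quote, `[${finding.infoType.name}]`),
    text
  );
}

// Hand-written finding, not a live response.
const cardFindings = [
  {quote: '4012888888881881', infoType: {name: 'CREDIT_CARD_NUMBER'}},
];

console.log(
  replaceQuotesWithInfoType(
    'Please refund the purchase to my credit card 4012888888881881',
    cardFindings
  )
);
```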
Similarly, here is the function for redacting an image:
redact.js
async function redactImage(
  callingProjectId,
  filepath,
  minLikelihood,
  infoTypes,
  outputPath
) {
  ...
}
And here is the request that will be provided to the redactImage function:
redact.js
// Construct image redaction request
const request = {
  parent: dlp.projectPath(callingProjectId),
  byteItem: {
    type: fileTypeConstant,
    data: fileBytes,
  },
  inspectConfig: {
    minLikelihood: minLikelihood,
    infoTypes: infoTypes,
  },
  imageRedactionConfigs: imageRedactionConfigs,
};
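The request above uses fileTypeConstant and imageRedactionConfigs, neither of which is shown in the excerpt. The following sketch shows one way they could be derived. Both helpers are hypothetical. The sketch assumes that the byte item type can be given as an enum name such as IMAGE_PNG and that each image redaction config wraps a single info type.

```javascript
const path = require('path');

// Assumed mapping from file extension to the byte item type enum name.
const BYTES_TYPE_BY_EXTENSION = {
  '.jpg': 'IMAGE_JPEG',
  '.jpeg': 'IMAGE_JPEG',
  '.bmp': 'IMAGE_BMP',
  '.png': 'IMAGE_PNG',
  '.svg': 'IMAGE_SVG',
};

// Hypothetical helper: picks the byte item type for an image path.
function bytesTypeForFile(filepath) {
  const extension = path.extname(filepath).toLowerCase();
  return BYTES_TYPE_BY_EXTENSION[extension] || 'BYTES_TYPE_UNSPECIFIED';
}

// Hypothetical helper: one redaction config per info type to black out.
function buildImageRedactionConfigs(infoTypes) {
  return infoTypes.map(infoType => ({infoType}));
}

const imageInfoTypes = [{name: 'PHONE_NUMBER'}, {name: 'EMAIL_ADDRESS'}];
console.log(bytesTypeForFile('resources/test.png'));
console.log(JSON.stringify(buildImageRedactionConfigs(imageInfoTypes)));
```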
6. Clean up
We've explored how we can use the DLP API to mask, deidentify, and redact sensitive information from our data. Now it's time to clean up our project of any resources we've created.
Delete the Project
In the GCP Console, go to the Cloud Resource Manager page:
In the project list, select the project we've been working in and click Delete. You'll be prompted to type in the project ID. Enter it and click Shut Down.
Alternatively, you can delete the entire project directly from Cloud Shell with gcloud:
gcloud projects delete $GOOGLE_CLOUD_PROJECT
7. Congratulations!
Woo hoo! You did it! Cloud DLP is a powerful tool that provides access to a sensitive data inspection, classification, and de-identification platform.
What we've covered
- We saw how the Cloud DLP API can be used to inspect strings and files for multiple info types
- We learned how the DLP API can deidentify strings with a mask to hide data matching info types
- We used the DLP API to use an encryption key to deidentify and then reidentify data
- We used the DLP API to redact data from a string as well as an image