Cloud Data Loss Prevention Overview

1. Overview

Cloud Data Loss Prevention (DLP) is a fully managed service designed to help discover, classify, and protect sensitive information. This codelab will introduce some of the basic capabilities of the Cloud DLP API and demonstrate the various ways it can be used to protect data.

What you'll do

Use DLP to inspect strings and files for matching info types
Learn about deidentification techniques and use DLP to deidentify data
Learn how to reidentify data that has been deidentified using format-preserving encryption (FPE)
Use DLP to redact info types from strings and images

What you'll need

A Google Cloud project with billing set up. If you don't have one, you'll need to create one.

2. Getting set up

This codelab can run completely on Google Cloud Platform without any local installation or configuration.

Cloud Shell

Throughout this codelab, we'll provision and manage different cloud resources and services using the command line via Cloud Shell.

Download the companion project repository:

git clone https://github.com/googleapis/nodejs-dlp

Once the project code is downloaded, change into the samples directory and install the required Node.js packages:

cd nodejs-dlp/samples && npm install

Make sure you're using the correct project by setting it with the following gcloud command:

gcloud config set project [PROJECT_ID]

Enable APIs

Here are the APIs we'll need to enable on our project:

Cloud Data Loss Prevention API - Provides methods for detection, risk analysis, and de-identification of privacy-sensitive fragments in text, images, and Google Cloud Platform storage repositories
Cloud Key Management Service (KMS) API - Google Cloud KMS allows customers to manage encryption keys and perform cryptographic operations with those keys.

Enable the required APIs with the following gcloud command:

gcloud services enable dlp.googleapis.com cloudkms.googleapis.com \
--project ${GOOGLE_CLOUD_PROJECT}

3. Inspect strings and files

The samples directory of the project downloaded in the preceding step contains several JavaScript files that use different features of Cloud DLP. inspect.js inspects a provided string or file for sensitive info types.

To test this out, you can provide the string option and a sample string with some potentially sensitive information:

node inspect.js -c $GOOGLE_CLOUD_PROJECT \
string 'My email address is jenny@somedomain.com and you can call me at 555-867-5309'

The output should tell us the findings for each matched info type, which includes:

Quote: the part of the string that matched an info type

InfoType: the information type detected for that part of the string. The Cloud DLP documentation has a full list of possible info types. By default, inspect.js only inspects for the info types CREDIT_CARD_NUMBER, PHONE_NUMBER, and EMAIL_ADDRESS.

Likelihood: the results are categorized based on how likely each one is to represent a match. Likelihood can range from VERY_UNLIKELY to VERY_LIKELY.

The findings for the command request above are:

Findings:
        Quote: jenny@somedomain.com
        Info type: EMAIL_ADDRESS
        Likelihood: LIKELY
        Quote: 555-867-5309
        Info type: PHONE_NUMBER
        Likelihood: VERY_LIKELY

Similarly, we can inspect files for info types. Check out the sample accounts.txt file:

resources/accounts.txt

My credit card number is 1234 5678 9012 3456, and my CVV is 789.

Run inspect.js again, this time with the file option:

node inspect.js -c $GOOGLE_CLOUD_PROJECT file resources/accounts.txt

The results:

Findings:
        Quote: 5678 9012 3456
        Info type: CREDIT_CARD_NUMBER
        Likelihood: VERY_LIKELY

For either kind of query, we could limit the results by likelihood or info type. For example:

node inspect.js -c $GOOGLE_CLOUD_PROJECT \
string 'Call 900-649-2568 or email me at anthony@somedomain.com' \
-m VERY_LIKELY

By specifying VERY_LIKELY as the minimum likelihood, any matches with a likelihood lower than VERY_LIKELY are excluded:

Findings:
        Quote: 900-649-2568
        Info type: PHONE_NUMBER
        Likelihood: VERY_LIKELY

The full results without the limitation would be:

Findings:
        Quote: 900-649-2568
        Info type: PHONE_NUMBER
        Likelihood: VERY_LIKELY
        Quote: anthony@somedomain.com
        Info type: EMAIL_ADDRESS
        Likelihood: LIKELY
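
To picture what the -m flag does, here is a minimal local sketch. It assumes the likelihood names form an ordered scale and that a finding is a plain object with quote, infoType, and likelihood fields. The helper name filterByMinLikelihood is hypothetical and is not part of inspect.js; the API applies the minimum likelihood on the server side.

```javascript
// Likelihood levels in ascending order of confidence.
const LIKELIHOOD_ORDER = [
  'VERY_UNLIKELY',
  'UNLIKELY',
  'POSSIBLE',
  'LIKELY',
  'VERY_LIKELY',
];

// Hypothetical helper: keep only findings at or above the minimum likelihood.
function filterByMinLikelihood(findings, minLikelihood) {
  const threshold = LIKELIHOOD_ORDER.indexOf(minLikelihood);
  if (threshold === -1) {
    throw new Error(`Unknown likelihood: ${minLikelihood}`);
  }
  return findings.filter(
    finding => LIKELIHOOD_ORDER.indexOf(finding.likelihood) >= threshold
  );
}

// Hand-written findings that mirror the full results shown above.
const findings = [
  {quote: '900-649-2568', infoType: 'PHONE_NUMBER', likelihood: 'VERY_LIKELY'},
  {quote: 'anthony@somedomain.com', infoType: 'EMAIL_ADDRESS', likelihood: 'LIKELY'},
];

console.log(filterByMinLikelihood(findings, 'VERY_LIKELY'));
```

With VERY_LIKELY as the threshold only the phone number remains, which matches the filtered output of the command.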

Similarly, we could specify the info type we're checking for:

node inspect.js -c $GOOGLE_CLOUD_PROJECT \
string 'Call 900-649-2568 or email me at anthony@somedomain.com' \
-t EMAIL_ADDRESS

Only the specified info type is returned if found:

Findings:
        Quote: anthony@somedomain.com
        Info type: EMAIL_ADDRESS
        Likelihood: LIKELY

Below is the asynchronous function that uses the API to inspect the input:

inspect.js

async function inspectString(
  callingProjectId,
  string,
  minLikelihood,
  maxFindings,
  infoTypes,
  customInfoTypes,
  includeQuote
) {
...
}

The arguments provided for the parameters above are used to construct a request object. That request is then provided to the inspectContent function to get a response that results in our output:

inspect.js

  // Construct item to inspect
  const item = {value: string};

  // Construct request
  const request = {
    parent: dlp.projectPath(callingProjectId),
    inspectConfig: {
      infoTypes: infoTypes,
      customInfoTypes: customInfoTypes,
      minLikelihood: minLikelihood,
      includeQuote: includeQuote,
      limits: {
        maxFindingsPerRequest: maxFindings,
      },
    },
    item: item,
  };
...
...
 const [response] = await dlp.inspectContent(request);
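
As a rough sketch of what happens around that call, the helpers below convert info type names into the objects used in the request and print the findings from a response. Both helper names are hypothetical, and the response object is written by hand for illustration under the assumption that findings are found at response.result.findings; no API call is made here.

```javascript
// Hypothetical helper: turn info type names into the objects used in the request.
function toInfoTypes(names) {
  return names.map(name => ({name}));
}

// Hypothetical helper: render the findings of a response as text.
function formatFindings(response) {
  const findings = (response.result && response.result.findings) || [];
  if (findings.length === 0) {
    return 'No findings.';
  }
  const lines = ['Findings:'];
  for (const finding of findings) {
    // The quote is only present when includeQuote was set in the request.
    if (finding.quote !== undefined) {
      lines.push(`\tQuote: ${finding.quote}`);
    }
    lines.push(`\tInfo type: ${finding.infoType.name}`);
    lines.push(`\tLikelihood: ${finding.likelihood}`);
  }
  return lines.join('\n');
}

// A response-shaped object written by hand; its shape is an assumption.
const sampleResponse = {
  result: {
    findings: [
      {
        quote: 'jenny@somedomain.com',
        infoType: {name: 'EMAIL_ADDRESS'},
        likelihood: 'LIKELY',
      },
    ],
  },
};

console.log(toInfoTypes(['PHONE_NUMBER', 'EMAIL_ADDRESS']));
console.log(formatFindings(sampleResponse));
```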

4. Deidentification

Beyond inspecting and detecting sensitive data, Cloud DLP can perform deidentification. Deidentification is the process of removing identifying information from data. The API detects sensitive data as defined by info types, and then uses a de-identification transformation to mask, delete, or otherwise obscure the data.

deid.js will demonstrate deidentification in several ways. The simplest method of deidentification is with a mask:

node deid.js deidMask -c $GOOGLE_CLOUD_PROJECT \
"My order number is F12312399. Email me at anthony@somedomain.com"

With a mask, the API replaces the characters of the matching info type with a different character (* by default). The output will be:

My order number is F12312399. Email me at *****************************

Notice that the email address in the string is obfuscated while the arbitrary order number is intact. (Custom info types are possible but are out of scope for this codelab.)

Let's see the function that uses the DLP API to deidentify with a mask:

deid.js

async function deidentifyWithMask(
  callingProjectId,
  string,
  maskingCharacter,
  numberToMask
) {
...
}

Once again, these arguments are used to construct a request object. This time it's provided to the deidentifyContent function:

deid.js

  // Construct deidentification request
  const item = {value: string};
  const request = {
    parent: dlp.projectPath(callingProjectId),
    deidentifyConfig: {
      infoTypeTransformations: {
        transformations: [
          {
            primitiveTransformation: {
              characterMaskConfig: {
                maskingCharacter: maskingCharacter,
                numberToMask: numberToMask,
              },
            },
          },
        ],
      },
    },
    item: item,
  };
... 
... 
const [response] = await dlp.deidentifyContent(request);
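
To make the two masking parameters more concrete, here is a small local illustration. It is not the DLP implementation; it assumes that a numberToMask of 0 (or an omitted value) masks every character, and that a positive value masks only the first N characters. The function name maskValue is hypothetical.

```javascript
// Hypothetical local illustration of character masking; not the DLP implementation.
// A numberToMask of 0 masks every character, a positive value masks the first N.
function maskValue(value, maskingCharacter = '*', numberToMask = 0) {
  const count =
    numberToMask > 0 ? Math.min(numberToMask, value.length) : value.length;
  return maskingCharacter.repeat(count) + value.slice(count);
}

// Mask the whole value with the default character.
console.log(maskValue('anthony@somedomain.com'));

// Mask only the first seven characters with a custom character.
console.log(maskValue('anthony@somedomain.com', '#', 7));
```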

Deidentify with Format Preserving Encryption

The DLP API also offers the ability to encrypt sensitive data values using a cryptographic key.

We'll start by using Cloud KMS to create a key ring:

gcloud kms keyrings create dlp-keyring --location global

Now we can create a key that we'll use to encrypt the data:

gcloud kms keys create dlp-key \
--purpose='encryption' \
--location=global \
--keyring=dlp-keyring

The DLP API will accept a wrapped key encrypted with the KMS key we created. We can generate a random string that will be wrapped. We'll need this later to reidentify:

export AES_KEY=`head -c16 < /dev/random | base64 -w 0`
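
If you would rather generate the random value from Node.js than from the shell, a minimal sketch using the built-in crypto module could look like the following. The function name generateAesKeyBase64 is hypothetical; it produces the same kind of value as the shell command above, 16 random bytes encoded as base64.

```javascript
const crypto = require('crypto');

// Generate random key material and return it base64-encoded.
// AES keys are 16, 24, or 32 bytes long (AES-128, AES-192, AES-256).
function generateAesKeyBase64(byteLength = 16) {
  if (![16, 24, 32].includes(byteLength)) {
    throw new Error('AES keys must be 16, 24, or 32 bytes long');
  }
  return crypto.randomBytes(byteLength).toString('base64');
}

const aesKey = generateAesKeyBase64();
// 16 bytes encode to 24 base64 characters, padding included.
console.log(aesKey.length);
```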

Now we can encrypt the string with our KMS key. This will generate a binary file that contains the encrypted string as ciphertext:

echo -n $AES_KEY | gcloud kms encrypt \
--location global \
--keyring dlp-keyring  \
--key dlp-key \
--plaintext-file - \
--ciphertext-file ./ciphertext.bin

Using deid.js we can now deidentify the phone number in the sample string below using encryption:

node deid.js deidFpe -c $GOOGLE_CLOUD_PROJECT \
"My client's cell is 9006492568" `base64 -w 0 ciphertext.bin` \
projects/${GOOGLE_CLOUD_PROJECT}/locations/global/keyRings/dlp-keyring/cryptoKeys/dlp-key \
-s PHONE_NUMBER

The output will return the string with the matched info types replaced by an encrypted string and preceded by the info type indicated by the -s flag:

My client's cell is PHONE_NUMBER(10):vSt55z79nR
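
Judging from the sample output, a surrogate appears to take the form NAME(length):value, where the number in parentheses is the length of the encrypted value. Under that assumption, a hypothetical helper such as findSurrogates could locate these tokens in a deidentified string:

```javascript
// Hypothetical helper: find surrogate tokens of the form NAME(length):value.
function findSurrogates(text) {
  const pattern = /([A-Z][A-Z0-9_]*)\((\d+)\):/g;
  const tokens = [];
  let match;
  while ((match = pattern.exec(text)) !== null) {
    const length = Number(match[2]);
    // The encrypted value starts right after the "NAME(length):" prefix.
    const start = match.index + match[0].length;
    tokens.push({
      surrogateType: match[1],
      length,
      value: text.slice(start, start + length),
    });
  }
  return tokens;
}

console.log(findSurrogates("My client's cell is PHONE_NUMBER(10):vSt55z79nR"));
```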

Let's take a look at the function we're using to deidentify the string:

deid.js

async function deidentifyWithFpe(
  callingProjectId,
  string,
  alphabet,
  surrogateType,
  keyName,
  wrappedKey
) {
...
}

The arguments are used to construct a cryptoReplaceFfxFpeConfig object:

deid.js

  const cryptoReplaceFfxFpeConfig = {
    cryptoKey: {
      kmsWrapped: {
        wrappedKey: wrappedKey,
        cryptoKeyName: keyName,
      },
    },
    commonAlphabet: alphabet,
  };
  if (surrogateType) {
    cryptoReplaceFfxFpeConfig.surrogateInfoType = {
      name: surrogateType,
    };
  }

The cryptoReplaceFfxFpeConfig object is in turn used in the request to the API via the deidentifyContent function:

deid.js

  // Construct deidentification request
  const item = {value: string};
  const request = {
    parent: dlp.projectPath(callingProjectId),
    deidentifyConfig: {
      infoTypeTransformations: {
        transformations: [
          {
            primitiveTransformation: {
              cryptoReplaceFfxFpeConfig: cryptoReplaceFfxFpeConfig,
            },
          },
        ],
      },
    },
    item: item,
  };

  try {
    // Run deidentification request
    const [response] = await dlp.deidentifyContent(request);

Re-identify data

In order to re-identify the data, the DLP API will use the ciphertext we created in the previous step:

node deid.js reidFpe -c $GOOGLE_CLOUD_PROJECT \
"<YOUR_DEID_OUTPUT>" \
PHONE_NUMBER `base64 -w 0 ciphertext.bin`  \
projects/${GOOGLE_CLOUD_PROJECT}/locations/global/keyRings/dlp-keyring/cryptoKeys/dlp-key

The output will be the original string with no redactions or surrogate type indicated:

My client's cell is 9006492568

The function used to reidentify data is similar to the one used to deidentify it:

deid.js

async function reidentifyWithFpe(
  callingProjectId,
  string,
  alphabet,
  surrogateType,
  keyName,
  wrappedKey
) {
...
}

And once again, the arguments are used in a request to the API, this time to the reidentifyContent function:

deid.js

  // Construct reidentification request
  const item = {value: string};
  const request = {
    parent: dlp.projectPath(callingProjectId),
    reidentifyConfig: {
      infoTypeTransformations: {
        transformations: [
          {
            primitiveTransformation: {
              cryptoReplaceFfxFpeConfig: {
                cryptoKey: {
                  kmsWrapped: {
                    wrappedKey: wrappedKey,
                    cryptoKeyName: keyName,
                  },
                },
                commonAlphabet: alphabet,
                surrogateInfoType: {
                  name: surrogateType,
                },
              },
            },
          },
        ],
      },
    },
    inspectConfig: {
      customInfoTypes: [
        {
          infoType: {
            name: surrogateType,
          },
          surrogateType: {},
        },
      ],
    },
    item: item,
  };

  try {
    // Run reidentification request
    const [response] = await dlp.reidentifyContent(request);
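
Because the deidentify and reidentify requests carry the same cryptoReplaceFfxFpeConfig, one possible refactoring is to build that object in a single place. The sketch below only constructs plain objects and makes no API call. The helper names, the project name, and the key values are hypothetical placeholders, and ALPHA_NUMERIC is assumed as the alphabet.

```javascript
// Hypothetical helper: build the FPE config shared by both request types.
function buildFpeConfig({alphabet, keyName, wrappedKey, surrogateType}) {
  const config = {
    cryptoKey: {
      kmsWrapped: {wrappedKey, cryptoKeyName: keyName},
    },
    commonAlphabet: alphabet,
  };
  if (surrogateType) {
    config.surrogateInfoType = {name: surrogateType};
  }
  return config;
}

// Hypothetical helper: build a reidentification request around that config.
function buildReidentifyRequest(parent, text, options) {
  return {
    parent,
    reidentifyConfig: {
      infoTypeTransformations: {
        transformations: [
          {
            primitiveTransformation: {
              cryptoReplaceFfxFpeConfig: buildFpeConfig(options),
            },
          },
        ],
      },
    },
    // The surrogate is declared as a custom info type so it can be located.
    inspectConfig: {
      customInfoTypes: [
        {infoType: {name: options.surrogateType}, surrogateType: {}},
      ],
    },
    item: {value: text},
  };
}

// Placeholder values for illustration only.
const reidRequest = buildReidentifyRequest(
  'projects/my-project',
  "My client's cell is PHONE_NUMBER(10):vSt55z79nR",
  {
    alphabet: 'ALPHA_NUMERIC',
    keyName:
      'projects/my-project/locations/global/keyRings/dlp-keyring/cryptoKeys/dlp-key',
    wrappedKey: 'BASE64_WRAPPED_KEY',
    surrogateType: 'PHONE_NUMBER',
  }
);
console.log(JSON.stringify(reidRequest, null, 2));
```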

Deidentify Dates with Date Shifting

In certain contexts, dates can be considered sensitive data that we might want to obfuscate. Date shifting lets us shift dates by a random increment while preserving the sequence and duration of a period of time. Each date in a set is shifted by an amount of time unique to that entry. To demonstrate deidentification via date shifting, first take a look at the sample CSV file that contains date data:

resources/dates.csv

name,birth_date,register_date,credit_card
Ann,01/01/1980,07/21/1996,4532908762519852
James,03/06/1988,04/09/2001,4301261899725540
Dan,08/14/1945,11/15/2011,4620761856015295
Laura,11/03/1992,01/04/2017,4564981067258901

The data contains two fields that we could apply a date shift to: birth_date and register_date. deid.js will accept a lower bound value and an upper bound value that define the range from which a random number of days is selected to shift the dates:

node deid.js deidDateShift -c $GOOGLE_CLOUD_PROJECT resources/dates.csv datesShifted.csv 30 90 birth_date

A file called datesShifted.csv will be generated with the dates randomly shifted by a number of days between 30 and 90. Here's an example of the generated output:

name,birth_date,register_date,credit_card
Ann,2/6/1980,7/21/1996,4532908762519852
James,5/18/1988,4/9/2001,4301261899725540
Dan,9/16/1945,11/15/2011,4620761856015295
Laura,12/16/1992,1/4/2017,4564981067258901

Notice that we were also able to specify which date column in the CSV file we wanted to shift. The birth_date field is shifted, while the register_date field remains unchanged.
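
The idea behind date shifting can be sketched locally without calling the API. The code below is an illustration only, not how Cloud DLP implements it: it assumes MM/DD/YYYY input, picks one random shift per row between the two bounds (inclusive), and applies that shift to each listed field, so intervals between shifted dates in the same row are preserved. The helper names are hypothetical.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Shift an MM/DD/YYYY date string by a whole number of days (UTC arithmetic).
function shiftDate(dateString, days) {
  const [month, day, year] = dateString.split('/').map(Number);
  const shifted = new Date(Date.UTC(year, month - 1, day) + days * MS_PER_DAY);
  return [
    shifted.getUTCMonth() + 1,
    shifted.getUTCDate(),
    shifted.getUTCFullYear(),
  ].join('/');
}

// Pick one shift per row and apply it to every listed field of that row.
// The random source is injectable so the behaviour can be checked.
function shiftRow(row, fields, lowerBoundDays, upperBoundDays, random = Math.random) {
  const span = upperBoundDays - lowerBoundDays + 1;
  const days = lowerBoundDays + Math.floor(random() * span);
  const result = {...row};
  for (const field of fields) {
    result[field] = shiftDate(row[field], days);
  }
  return result;
}

const ann = {name: 'Ann', birth_date: '01/01/1980', register_date: '07/21/1996'};

// A 36-day shift reproduces Ann's birth_date from the sample output above.
console.log(shiftDate(ann.birth_date, 36));

// A random shift between 30 and 90 days, applied to birth_date only.
console.log(shiftRow(ann, ['birth_date'], 30, 90));
```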

Let's take a look at the function that handles deidentification with a date shift:

deid.js

async function deidentifyWithDateShift(
  callingProjectId,
  inputCsvFile,
  outputCsvFile,
  dateFields,
  lowerBoundDays,
  upperBoundDays,
  contextFieldId,
  wrappedKey,
  keyName
) {
...
}

Notice that this function could accept a wrapped key and a key name, similar to the deidentification with FPE, so that we have the option of providing an encryption key to reidentify a date shift. The arguments we provide build a dateShiftConfig object:

deid.js

  // Construct DateShiftConfig
  const dateShiftConfig = {
    lowerBoundDays: lowerBoundDays,
    upperBoundDays: upperBoundDays,
  };

  if (contextFieldId && keyName && wrappedKey) {
    dateShiftConfig.context = {name: contextFieldId};
    dateShiftConfig.cryptoKey = {
      kmsWrapped: {
        wrappedKey: wrappedKey,
        cryptoKeyName: keyName,
      },
    };
  } else if (contextFieldId || keyName || wrappedKey) {
    throw new Error(
      'You must set either ALL or NONE of {contextFieldId, keyName, wrappedKey}!'
    );
  }

  // Construct deidentification request
  const request = {
    parent: dlp.projectPath(callingProjectId),
    deidentifyConfig: {
      recordTransformations: {
        fieldTransformations: [
          {
            fields: dateFields,
            primitiveTransformation: {
              dateShiftConfig: dateShiftConfig,
            },
          },
        ],
      },
    },
    item: tableItem,
  };
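
The request above refers to a tableItem whose construction is not shown. As a sketch under stated assumptions, the CSV could be converted into a table of headers and rows along these lines. The helper names are hypothetical, the parser only handles simple CSV without quoted commas, and cells that look like MM/DD/YYYY are assumed to be sent as dateValue objects while all other cells are sent as strings.

```javascript
// Convert an MM/DD/YYYY string to a dateValue cell, or fall back to a string cell.
function toCell(text) {
  const match = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(text);
  if (!match) {
    return {stringValue: text};
  }
  return {
    dateValue: {
      year: Number(match[3]),
      month: Number(match[1]),
      day: Number(match[2]),
    },
  };
}

// Hypothetical helper: build a table item from simple CSV text (no quoted commas).
function csvToTableItem(csvText) {
  const [headerLine, ...rowLines] = csvText.trim().split(/\r?\n/);
  return {
    table: {
      headers: headerLine.split(',').map(name => ({name})),
      rows: rowLines.map(line => ({values: line.split(',').map(toCell)})),
    },
  };
}

const sampleTableItem = csvToTableItem(
  'name,birth_date,register_date,credit_card\n' +
    'Ann,01/01/1980,07/21/1996,4532908762519852\n'
);
console.log(JSON.stringify(sampleTableItem, null, 2));
```

In the same spirit, dateFields would be a list of field identifiers such as [{name: 'birth_date'}], matching the header objects built here.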

5. Redact strings and images

Another method of obfuscating sensitive information is redaction. Redaction replaces each match with the name of the info type it was identified as. redact.js demonstrates redaction:

node redact.js -c $GOOGLE_CLOUD_PROJECT \
string "Please refund the purchase to my credit card 4012888888881881" \
-t 'CREDIT_CARD_NUMBER'

The output replaces the sample credit card number with the info type CREDIT_CARD_NUMBER:

Please refund the purchase to my credit card [CREDIT_CARD_NUMBER]

This is useful if you'd like to hide sensitive information but still identify the type of information that's being removed. The DLP API can similarly redact information from images that contain text. To demonstrate, let's take a look at a sample image:

resources/test.png

To redact the phone number and email address from the image above:

node redact.js -c $GOOGLE_CLOUD_PROJECT \
image resources/test.png ./redacted.png \
-t PHONE_NUMBER -t EMAIL_ADDRESS

As specified, a new image named redacted.png will be generated with the requested information blacked out.

Here is the function that is used to redact from a string:

redact.js

async function redactText(
  callingProjectId, 
  string,
  minLikelihood,
  infoTypes
) {
...
}

And here is the request that will be provided to the deidentifyContent function:

redact.js

const request = {
    parent: dlp.projectPath(callingProjectId),
    item: {
      value: string,
    },
    deidentifyConfig: {
      infoTypeTransformations: {
        transformations: [replaceWithInfoTypeTransformation],
      },
    },
    inspectConfig: {
      minLikelihood: minLikelihood,
      infoTypes: infoTypes,
    },
  };
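
The request refers to a replaceWithInfoTypeTransformation object that is not shown. A plausible shape, assuming the transformation uses an empty replaceWithInfoTypeConfig, is sketched below together with a hypothetical helper that assembles the whole request from plain values. Nothing here calls the API, and the project name is a placeholder.

```javascript
// Assumed shape of the transformation referenced in the request above:
// an empty replaceWithInfoTypeConfig replaces each match with its info type name.
const replaceWithInfoTypeTransformation = {
  primitiveTransformation: {
    replaceWithInfoTypeConfig: {},
  },
};

// Hypothetical helper: assemble a text redaction request from plain values.
function buildRedactTextRequest(parent, text, minLikelihood, infoTypeNames) {
  return {
    parent,
    item: {value: text},
    deidentifyConfig: {
      infoTypeTransformations: {
        transformations: [replaceWithInfoTypeTransformation],
      },
    },
    inspectConfig: {
      minLikelihood,
      infoTypes: infoTypeNames.map(name => ({name})),
    },
  };
}

const redactRequest = buildRedactTextRequest(
  'projects/my-project', // placeholder project
  'Please refund the purchase to my credit card 4012888888881881',
  'LIKELY',
  ['CREDIT_CARD_NUMBER']
);
console.log(JSON.stringify(redactRequest, null, 2));
```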

Similarly, here is the function for redacting an image:

redact.js

async function redactImage(
  callingProjectId,
  filepath,
  minLikelihood,
  infoTypes,
  outputPath
) {
...
}

And here is the request that will be provided to the redactImage function:

redact.js

// Construct image redaction request
  const request = {
    parent: dlp.projectPath(callingProjectId),
    byteItem: {
      type: fileTypeConstant,
      data: fileBytes,
    },
    inspectConfig: {
      minLikelihood: minLikelihood,
      infoTypes: infoTypes,
    },
    imageRedactionConfigs: imageRedactionConfigs,
  };
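
The fileTypeConstant and imageRedactionConfigs values are built elsewhere in redact.js. As a sketch of how they might be derived, the hypothetical helpers below map a file extension to a byte type name and wrap each info type in its own redaction config. Using string enum names and falling back to the generic IMAGE type are assumptions of this sketch.

```javascript
// Assumed mapping from file extension to a byte content type name.
const BYTES_TYPE_BY_EXTENSION = {
  '.jpg': 'IMAGE_JPEG',
  '.jpeg': 'IMAGE_JPEG',
  '.png': 'IMAGE_PNG',
  '.bmp': 'IMAGE_BMP',
  '.svg': 'IMAGE_SVG',
};

// Hypothetical helper: pick a byte type from the file extension,
// falling back to the generic IMAGE type when the extension is unknown.
function bytesTypeFor(filepath) {
  const dot = filepath.lastIndexOf('.');
  const extension = dot === -1 ? '' : filepath.slice(dot).toLowerCase();
  return BYTES_TYPE_BY_EXTENSION[extension] || 'IMAGE';
}

// Hypothetical helper: one redaction config per info type object.
function toImageRedactionConfigs(infoTypes) {
  return infoTypes.map(infoType => ({infoType}));
}

console.log(bytesTypeFor('resources/test.png'));
console.log(
  toImageRedactionConfigs([{name: 'PHONE_NUMBER'}, {name: 'EMAIL_ADDRESS'}])
);
```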

6. Clean up

We've explored how we can use the DLP API to mask, deidentify, and redact sensitive information from our data. Now it's time to clean up our project of any resources we've created.

Delete the Project

In the GCP Console, go to the Cloud Resource Manager page.

In the project list, select the project we've been working in and click Delete. You'll be prompted to type in the project ID. Enter it and click Shut Down.

Alternatively, you can delete the entire project directly from Cloud Shell with gcloud:

gcloud projects delete $GOOGLE_CLOUD_PROJECT

7. Congratulations!

Woo hoo! You did it! Cloud DLP is a powerful platform for inspecting, classifying, and de-identifying sensitive data.

What we've covered

We saw how the Cloud DLP API can be used to inspect strings and files for multiple info types
We learned how the DLP API can deidentify strings with a mask to hide data matching info types
We used the DLP API with an encryption key to deidentify and then reidentify data
We used the DLP API to redact data from a string as well as an image