Cloud Data Loss Prevention Overview

About this codelab

Last updated: October 8, 2020

Written by Roger Martinez

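1. Overview

Cloud Data Loss Prevention (DLP) is a fully managed service designed to help discover, classify, and protect sensitive information. This codelab introduces some of the basic capabilities of the Cloud DLP API and demonstrates several ways to use it to protect data.

What you'll do

Use DLP to inspect strings and files for matching info types
Learn about de-identification techniques and use DLP to de-identify data
Learn how to re-identify data that has been de-identified using format-preserving encryption (FPE)
Use DLP to redact info types from strings and images

What you'll need

A Google Cloud project with billing set up. If you don't have one, you'll have to create one.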

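2. Getting set up

This codelab can run completely on Google Cloud Platform without any local installation or configuration.

Cloud Shell

Throughout this codelab, we'll provision and manage different cloud resources and services from the command line via Cloud Shell.

Download the companion project repository: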

git clone https://github.com/googleapis/nodejs-dlp

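Once the project code is downloaded, change into the samples directory of the cloned repository and install the required Node.js packages: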

cd nodejs-dlp/samples && npm install

Make sure you're using the correct project by setting it with the following gcloud command:

gcloud config set project [PROJECT_ID]

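Enable the APIs

Here are the APIs we'll need to enable on our project:

Cloud Data Loss Prevention API - Provides methods for detection, risk analysis, and de-identification of privacy-sensitive fragments in text, images, and Google Cloud Platform storage repositories
Cloud Key Management Service (KMS) API - Google Cloud KMS allows customers to manage encryption keys and perform cryptographic operations with those keys

Enable the required APIs with the following gcloud command: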

gcloud services enable dlp.googleapis.com cloudkms.googleapis.com \
--project ${GOOGLE_CLOUD_PROJECT}

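3. Inspect strings and files

The samples directory of the project downloaded in the previous step contains several JavaScript files that exercise different features of Cloud DLP. inspect.js inspects a provided string or file for sensitive info types.

To test this, provide the string option along with a sample string containing some potentially sensitive information: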

node inspect.js -c $GOOGLE_CLOUD_PROJECT \
string 'My email address is jenny@somedomain.com and you can call me at 555-867-5309'

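The output should tell us the findings for each matched info type, which include:

Quote: the part of the string that matched an info type.

InfoType: the info type detected for that part of the string. The Cloud DLP documentation has a full list of possible info types. By default, inspect.js inspects only for the info types CREDIT_CARD_NUMBER, PHONE_NUMBER, and EMAIL_ADDRESS.

Likelihood: results are categorized by how likely each one is to be a match. Likelihood ranges from VERY_UNLIKELY to VERY_LIKELY.

The findings for the command above are: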

Findings:
        Quote: jenny@somedomain.com
        Info type: EMAIL_ADDRESS
        Likelihood: LIKELY
        Quote: 555-867-5309
        Info type: PHONE_NUMBER
        Likelihood: VERY_LIKELY

Similarly, we can inspect files for info types. Take a look at the sample accounts.txt file:

resources/accounts.txt

My credit card number is 1234 5678 9012 3456, and my CVV is 789.

Run inspect.js again, this time with the file option:

node inspect.js -c $GOOGLE_CLOUD_PROJECT file resources/accounts.txt

The results:

Findings:
        Quote: 5678 9012 3456
        Info type: CREDIT_CARD_NUMBER
        Likelihood: VERY_LIKELY

For either kind of query, we can limit the results by likelihood or by info type. For example:

node inspect.js -c $GOOGLE_CLOUD_PROJECT \
string 'Call 900-649-2568 or email me at anthony@somedomain.com' \
-m VERY_LIKELY

Specifying VERY_LIKELY as the minimum likelihood excludes any match below VERY_LIKELY:

Findings:
        Quote: 900-649-2568
        Info type: PHONE_NUMBER
        Likelihood: VERY_LIKELY

The full results without the restriction would be:

Findings:
        Quote: 900-649-2568
        Info type: PHONE_NUMBER
        Likelihood: VERY_LIKELY
        Quote: anthony@somedomain.com
        Info type: EMAIL_ADDRESS
        Likelihood: LIKELY
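
The effect of the minimum-likelihood option can be sketched locally. The snippet below is only an illustration of the filtering rule described above, not the API's implementation: the `filterByMinLikelihood` helper and the `LIKELIHOOD_ORDER` array are hypothetical names, and the findings are copied by hand from the output shown in this section.

```javascript
// Likelihood levels from least to most likely, using the names seen in the output.
const LIKELIHOOD_ORDER = [
  'VERY_UNLIKELY',
  'UNLIKELY',
  'POSSIBLE',
  'LIKELY',
  'VERY_LIKELY',
];

// Keep only the findings at or above the given minimum likelihood.
function filterByMinLikelihood(findings, minLikelihood) {
  const threshold = LIKELIHOOD_ORDER.indexOf(minLikelihood);
  if (threshold === -1) {
    throw new Error(`Unknown likelihood: ${minLikelihood}`);
  }
  return findings.filter(
    finding => LIKELIHOOD_ORDER.indexOf(finding.likelihood) >= threshold
  );
}

// Findings copied from the unrestricted output above.
const unrestrictedFindings = [
  {quote: '900-649-2568', infoType: 'PHONE_NUMBER', likelihood: 'VERY_LIKELY'},
  {quote: 'anthony@somedomain.com', infoType: 'EMAIL_ADDRESS', likelihood: 'LIKELY'},
];

// Only the phone number survives a VERY_LIKELY threshold.
console.log(filterByMinLikelihood(unrestrictedFindings, 'VERY_LIKELY'));
```

In the real sample the threshold is sent to the API, so lower-likelihood matches are never returned in the first place.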

Similarly, we can specify the info type to inspect for:

node inspect.js -c $GOOGLE_CLOUD_PROJECT \
string 'Call 900-649-2568 or email me at anthony@somedomain.com' \
-t EMAIL_ADDRESS

Only the specified info type is returned, if found:

Findings:
        Quote: anthony@somedomain.com
        Info type: EMAIL_ADDRESS
        Likelihood: LIKELY

Here is the asynchronous function that uses the API to inspect the input:

inspect.js

async function inspectString(
  callingProjectId,
  string,
  minLikelihood,
  maxFindings,
  infoTypes,
  customInfoTypes,
  includeQuote
) {
...
}

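The arguments supplied for the parameters above are used to construct a request object. That request is then passed to the inspectContent function to get a response that produces our output: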

inspect.js

  // Construct item to inspect
  const item = {value: string};

  // Construct request
  const request = {
    parent: dlp.projectPath(callingProjectId),
    inspectConfig: {
      infoTypes: infoTypes,
      customInfoTypes: customInfoTypes,
      minLikelihood: minLikelihood,
      includeQuote: includeQuote,
      limits: {
        maxFindingsPerRequest: maxFindings,
      },
    },
    item: item,
  };
...
...
 const [response] = await dlp.inspectContent(request);
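
The elided part of the sample turns the response into the "Findings:" listing shown earlier. A minimal sketch of that step might look like the following. It assumes the response carries its findings under `result.findings`, each with `quote`, `infoType.name`, and `likelihood` fields; the `formatFindings` helper and the mock response are hypothetical and stand in for a real API call.

```javascript
// Turn an inspectContent-style response into printable lines.
// `response` is assumed to look like {result: {findings: [...]}}.
function formatFindings(response) {
  const findings = (response.result && response.result.findings) || [];
  if (findings.length === 0) {
    return 'No findings.';
  }
  const lines = ['Findings:'];
  for (const finding of findings) {
    // A quote is only expected when includeQuote was set in the request.
    if (finding.quote) {
      lines.push(`\tQuote: ${finding.quote}`);
    }
    lines.push(`\tInfo type: ${finding.infoType.name}`);
    lines.push(`\tLikelihood: ${finding.likelihood}`);
  }
  return lines.join('\n');
}

// Hand-written stand-in for a real API response.
const mockInspectResponse = {
  result: {
    findings: [
      {
        quote: 'jenny@somedomain.com',
        infoType: {name: 'EMAIL_ADDRESS'},
        likelihood: 'LIKELY',
      },
    ],
  },
};

console.log(formatFindings(mockInspectResponse));
```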

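4. De-identification

Beyond inspecting and detecting sensitive data, Cloud DLP can perform de-identification. De-identification is the process of removing identifying information from data. The API detects sensitive data as defined by info types, then uses a de-identification transformation to mask, delete, or otherwise obscure the data.

deid.js demonstrates de-identification in several ways. The simplest method of de-identification is with a mask: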

node deid.js deidMask -c $GOOGLE_CLOUD_PROJECT \
"My order number is F12312399. Email me at anthony@somedomain.com"

With a mask, the API replaces the characters of the matching info type with a different character (* by default). The output would be:

My order number is F12312399. Email me at *****************************

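Notice that the email address in the string is obfuscated while the arbitrary order number is left intact. (Custom info types are possible, but they are outside the scope of this codelab.)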

Let's look at the function that uses the DLP API to de-identify with a mask:

deid.js

async function deidentifyWithMask(
  callingProjectId,
  string,
  maskingCharacter,
  numberToMask
) {
...
}

Once again, these arguments are used to construct a request object. This time it is passed to the deidentifyContent function:

deid.js

  // Construct deidentification request
  const item = {value: string};
  const request = {
    parent: dlp.projectPath(callingProjectId),
    deidentifyConfig: {
      infoTypeTransformations: {
        transformations: [
          {
            primitiveTransformation: {
              characterMaskConfig: {
                maskingCharacter: maskingCharacter,
                numberToMask: numberToMask,
              },
            },
          },
        ],
      },
    },
    item: item,
  };
... 
... 
const [response] = await dlp.deidentifyContent(request);
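
To get a feel for what the two mask settings control, here is a purely local sketch. It is not how the API is implemented: `maskValue` is a hypothetical helper, and it assumes that a `numberToMask` of 0 means "mask every character" and that masking starts from the beginning of the matched value.

```javascript
// Local illustration of character masking; the real masking is done by the API.
// Assumption: numberToMask = 0 masks the whole value.
function maskValue(value, maskingCharacter = '*', numberToMask = 0) {
  const count =
    numberToMask > 0 ? Math.min(numberToMask, value.length) : value.length;
  return maskingCharacter.repeat(count) + value.slice(count);
}

// Every character of the matched value is replaced.
console.log(maskValue('jenny@somedomain.com'));

// Only the first five characters are replaced, using a custom character.
console.log(maskValue('jenny@somedomain.com', '#', 5));
```

In the request above, the same two settings travel to the API as maskingCharacter and numberToMask inside characterMaskConfig.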

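De-identify with format-preserving encryption

The DLP API also offers the ability to encrypt sensitive data values using a cryptographic key.

First, create a keyring with Cloud KMS: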

gcloud kms keyrings create dlp-keyring --location global

Now we can create a key that we'll use to encrypt the data:

gcloud kms keys create dlp-key \
--purpose='encryption' \
--location=global \
--keyring=dlp-keyring

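The DLP API accepts a wrapped key that has been encrypted with the KMS key we created. We can generate a random string that will be wrapped. We'll need it later to re-identify: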

export AES_KEY=`head -c16 < /dev/random | base64 -w 0`
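
For readers who would rather stay in Node.js, roughly the same random string can be produced with the built-in crypto module. This is only a sketch of the shell one-liner above; `generateBase64Key` is a hypothetical helper and is not part of the sample repository.

```javascript
const crypto = require('crypto');

// Generate `byteLength` random bytes and return them base64-encoded on one line.
function generateBase64Key(byteLength = 16) {
  return crypto.randomBytes(byteLength).toString('base64');
}

const aesKey = generateBase64Key();

// 16 random bytes always encode to 24 base64 characters.
console.log(aesKey.length);
```

The codelab itself keeps using the AES_KEY environment variable, so this variant only matters if you script the setup in JavaScript.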

Now we can encrypt the string with our KMS key. This generates a binary file that contains the encrypted string as ciphertext:

echo -n $AES_KEY | gcloud kms encrypt \
--location global \
--keyring dlp-keyring  \
--key dlp-key \
--plaintext-file - \
--ciphertext-file ./ciphertext.bin

Using deid.js, we can now de-identify the phone number in the sample string below with encryption:

node deid.js deidFpe -c $GOOGLE_CLOUD_PROJECT \
"My client's cell is 9006492568" `base64 -w 0 ciphertext.bin` \
projects/${GOOGLE_CLOUD_PROJECT}/locations/global/keyRings/dlp-keyring/cryptoKeys/dlp-key \
-s PHONE_NUMBER

The output returns the string with the matching info type replaced by an encrypted string and preceded by the info type indicated by the -s flag:

My client's cell is PHONE_NUMBER(10):vSt55z79nR
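
Judging from the sample output above, each surrogate has the shape INFO_TYPE(length):value. The sketch below pulls those pieces out of a de-identified string. The `findSurrogates` helper and its regular expression are assumptions based only on that one output line, not a documented parser.

```javascript
// Find surrogate tokens of the form INFO_TYPE(length):value in de-identified text.
function findSurrogates(text) {
  const pattern = /([A-Z][A-Z0-9_]*)\((\d+)\):/g;
  const surrogates = [];
  let match;
  while ((match = pattern.exec(text)) !== null) {
    const length = Number(match[2]);
    const start = pattern.lastIndex; // index just past the colon
    surrogates.push({
      infoType: match[1],
      length,
      value: text.slice(start, start + length),
    });
  }
  return surrogates;
}

console.log(findSurrogates("My client's cell is PHONE_NUMBER(10):vSt55z79nR"));
```

Because the length is embedded in the token, the encrypted value can be separated from any text that follows it, which is what makes re-identification of the exact span possible.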

Let's look at the function we use to de-identify the string:

deid.js

async function deidentifyWithFpe(
  callingProjectId,
  string,
  alphabet,
  surrogateType,
  keyName,
  wrappedKey
) {
...
}

The arguments are used to construct a cryptoReplaceFfxFpeConfig object:

deid.js

  const cryptoReplaceFfxFpeConfig = {
    cryptoKey: {
      kmsWrapped: {
        wrappedKey: wrappedKey,
        cryptoKeyName: keyName,
      },
    },
    commonAlphabet: alphabet,
  };
  if (surrogateType) {
    cryptoReplaceFfxFpeConfig.surrogateInfoType = {
      name: surrogateType,
    };
  }

The cryptoReplaceFfxFpeConfig object is in turn used in the request to the API through the deidentifyContent function:

deid.js

  // Construct deidentification request
  const item = {value: string};
  const request = {
    parent: dlp.projectPath(callingProjectId),
    deidentifyConfig: {
      infoTypeTransformations: {
        transformations: [
          {
            primitiveTransformation: {
              cryptoReplaceFfxFpeConfig: cryptoReplaceFfxFpeConfig,
            },
          },
        ],
      },
    },
    item: item,
  };

  try {
    // Run deidentification request
    const [response] = await dlp.deidentifyContent(request);

Re-identify data

To re-identify the data, the DLP API uses the ciphertext we created in the previous step:

node deid.js reidFpe -c $GOOGLE_CLOUD_PROJECT \
"<YOUR_DEID_OUTPUT>" \
PHONE_NUMBER `base64 -w 0 ciphertext.bin`  \
projects/${GOOGLE_CLOUD_PROJECT}/locations/global/keyRings/dlp-keyring/cryptoKeys/dlp-key

The output will be the original string with no redaction or surrogate type indicated:

My client's cell is 9006492568

The function used to re-identify data is similar to the one used to de-identify it:

deid.js

async function reidentifyWithFpe(
  callingProjectId,
  string,
  alphabet,
  surrogateType,
  keyName,
  wrappedKey
) {
...
}

Once again, these arguments are used in a request to the API, this time to the reidentifyContent function:

deid.js

  // Construct reidentification request
  const item = {value: string};
  const request = {
    parent: dlp.projectPath(callingProjectId),
    reidentifyConfig: {
      infoTypeTransformations: {
        transformations: [
          {
            primitiveTransformation: {
              cryptoReplaceFfxFpeConfig: {
                cryptoKey: {
                  kmsWrapped: {
                    wrappedKey: wrappedKey,
                    cryptoKeyName: keyName,
                  },
                },
                commonAlphabet: alphabet,
                surrogateInfoType: {
                  name: surrogateType,
                },
              },
            },
          },
        ],
      },
    },
    inspectConfig: {
      customInfoTypes: [
        {
          infoType: {
            name: surrogateType,
          },
          surrogateType: {},
        },
      ],
    },
    item: item,
  };

  try {
    // Run reidentification request
    const [response] = await dlp.reidentifyContent(request);

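De-identify dates with date shifting

In certain contexts, dates can be considered sensitive data that we may want to obfuscate. Date shifting lets us shift dates by a random increment while preserving the sequence and duration of a period of time. Each date in a set is shifted by an amount of time unique to that entry. To demonstrate de-identification with date shifting, first take a look at a sample CSV file that contains date data: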

resources/dates.csv

name,birth_date,register_date,credit_card
Ann,01/01/1980,07/21/1996,4532908762519852
James,03/06/1988,04/09/2001,4301261899725540
Dan,08/14/1945,11/15/2011,4620761856015295
Laura,11/03/1992,01/04/2017,4564981067258901

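The data contains two fields to which we could apply a date shift: birth_date and register_date. deid.js accepts a lower bound and an upper bound that define the range from which a random number of days is selected to shift the dates: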

node deid.js deidDateShift -c $GOOGLE_CLOUD_PROJECT resources/dates.csv datesShifted.csv 30 90 birth_date

A file called datesShifted.csv is generated with the dates shifted randomly by a number of days between 30 and 90. Here is an example of the generated output:

name,birth_date,register_date,credit_card
Ann,2/6/1980,7/21/1996,4532908762519852
James,5/18/1988,4/9/2001,4301261899725540
Dan,9/16/1945,11/15/2011,4620761856015295
Laura,12/16/1992,1/4/2017,4564981067258901

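Notice that we were also able to specify which date column in the CSV file to shift. The birth_date field is shifted, while the register_date field remains unchanged.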
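
The arithmetic behind a date shift is easy to reproduce locally. The sketch below is only an illustration: `shiftDate` and `pickShiftDays` are hypothetical helpers, and `Math.random` stands in for whatever the API uses to choose the number of days. Shifting Ann's birth date by 36 days reproduces the value in the sample output above.

```javascript
// Shift an M/D/YYYY date string by a whole number of days.
// UTC is used so daylight-saving changes cannot move the result by a day.
function shiftDate(dateString, days) {
  const [month, day, year] = dateString.split('/').map(Number);
  const shifted = new Date(Date.UTC(year, month - 1, day + days));
  return [
    shifted.getUTCMonth() + 1,
    shifted.getUTCDate(),
    shifted.getUTCFullYear(),
  ].join('/');
}

// Pick a random whole number of days in [lowerBoundDays, upperBoundDays].
function pickShiftDays(lowerBoundDays, upperBoundDays) {
  const span = upperBoundDays - lowerBoundDays + 1;
  return lowerBoundDays + Math.floor(Math.random() * span);
}

// 36 days after 01/01/1980 is 2/6/1980, as in Ann's row of the sample output.
console.log(shiftDate('01/01/1980', 36));

// A random shift within the same bounds used on the command line.
console.log(shiftDate('03/06/1988', pickShiftDays(30, 90)));
```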

Let's take a look at the function that handles de-identification with a date shift:

deid.js

async function deidentifyWithDateShift(
  callingProjectId,
  inputCsvFile,
  outputCsvFile,
  dateFields,
  lowerBoundDays,
  upperBoundDays,
  contextFieldId,
  wrappedKey,
  keyName
) {
...
}

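Notice that this function can accept a wrapped key and a key name, similar to de-identification with FPE, so that we have the option of providing an encryption key to re-identify a date shift. The arguments we provide are used to build a dateShiftConfig object: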

deid.js

  // Construct DateShiftConfig
  const dateShiftConfig = {
    lowerBoundDays: lowerBoundDays,
    upperBoundDays: upperBoundDays,
  };

  if (contextFieldId && keyName && wrappedKey) {
    dateShiftConfig.context = {name: contextFieldId};
    dateShiftConfig.cryptoKey = {
      kmsWrapped: {
        wrappedKey: wrappedKey,
        cryptoKeyName: keyName,
      },
    };
  } else if (contextFieldId || keyName || wrappedKey) {
    throw new Error(
      'You must set either ALL or NONE of {contextFieldId, keyName, wrappedKey}!'
    );
  }

  // Construct deidentification request
  const request = {
    parent: dlp.projectPath(callingProjectId),
    deidentifyConfig: {
      recordTransformations: {
        fieldTransformations: [
          {
            fields: dateFields,
            primitiveTransformation: {
              dateShiftConfig: dateShiftConfig,
            },
          },
        ],
      },
    },
    item: tableItem,
  };
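
The request above refers to a tableItem that the excerpt does not show being built. One plausible way to derive it from CSV text is sketched below. It assumes a table item made of `headers` (objects with a `name`) and `rows` (objects with a `values` array), with date cells expressed as `dateValue` objects and everything else as `stringValue`. The `csvToTableItem` helper is hypothetical, and its comma splitting is deliberately naive (no quoted fields).

```javascript
// Convert CSV text into a table item made of headers and rows.
// Cells in the `dateFieldNames` columns become dateValue objects; others stay strings.
function csvToTableItem(csvText, dateFieldNames) {
  const [headerLine, ...rowLines] = csvText.trim().split('\n');
  const headerNames = headerLine.split(',');
  const rows = rowLines.map(line => ({
    values: line.split(',').map((cell, index) => {
      if (!dateFieldNames.includes(headerNames[index])) {
        return {stringValue: cell};
      }
      const [month, day, year] = cell.split('/').map(Number);
      return {dateValue: {year, month, day}};
    }),
  }));
  return {table: {headers: headerNames.map(name => ({name})), rows}};
}

// Two rows taken from resources/dates.csv.
const sampleCsv = [
  'name,birth_date,register_date,credit_card',
  'Ann,01/01/1980,07/21/1996,4532908762519852',
  'James,03/06/1988,04/09/2001,4301261899725540',
].join('\n');

const sampleTableItem = csvToTableItem(sampleCsv, ['birth_date']);
console.log(JSON.stringify(sampleTableItem.table.rows[0], null, 2));
```

Only the columns named in the second argument are parsed as dates, which mirrors how the command line selects birth_date for shifting.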

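5. Redact strings and images

Another method of obfuscating sensitive information is redaction. Redaction replaces a match with the info type it was identified as matching. redact.js demonstrates redaction: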

node redact.js -c $GOOGLE_CLOUD_PROJECT \
string "Please refund the purchase to my credit card 4012888888881881" \
-t 'CREDIT_CARD_NUMBER'

The output replaces the sample credit card number with the info type CREDIT_CARD_NUMBER:

Please refund the purchase to my credit card [CREDIT_CARD_NUMBER]

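This is useful if you want to hide sensitive information but still identify the type of information being removed. The DLP API can similarly redact information from images that contain text. To demonstrate, let's take a look at a sample image: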

resources/test.png

To redact the phone number and email address from the image above:

node redact.js -c $GOOGLE_CLOUD_PROJECT \
image resources/test.png ./redacted.png \
-t PHONE_NUMBER -t EMAIL_ADDRESS

As specified, a new image named redacted.png is generated with the requested information blacked out.

Here is the function used to redact from a string:

redact.js

async function redactText(
  callingProjectId, 
  string,
  minLikelihood,
  infoTypes
) {
...
}

Here is the request that is provided to the deidentifyContent function:

redact.js

const request = {
    parent: dlp.projectPath(callingProjectId),
    item: {
      value: string,
    },
    deidentifyConfig: {
      infoTypeTransformations: {
        transformations: [replaceWithInfoTypeTransformation],
      },
    },
    inspectConfig: {
      minLikelihood: minLikelihood,
      infoTypes: infoTypes,
    },
  };
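
The request above references replaceWithInfoTypeTransformation without showing it. A plausible definition, together with a local imitation of its effect, is sketched below. The object shape is an assumption based on the API's replace-with-info-type transformation, and `redactLocally` is a hypothetical helper that only mimics the result on findings supplied by hand.

```javascript
// Assumed shape of the transformation referenced in the request above:
// replace each finding with the name of its info type.
const replaceWithInfoTypeTransformation = {
  primitiveTransformation: {
    replaceWithInfoTypeConfig: {},
  },
};

// Local illustration only: replace each quoted finding with [INFO_TYPE].
function redactLocally(text, findings) {
  return findings.reduce(
    (result, finding) =>
      result.split(finding.quote).join(`[${finding.infoType.name}]`),
    text
  );
}

// Hand-written finding for the sample string used on the command line.
const cardFindings = [
  {quote: '4012888888881881', infoType: {name: 'CREDIT_CARD_NUMBER'}},
];

console.log(
  redactLocally(
    'Please refund the purchase to my credit card 4012888888881881',
    cardFindings
  )
);
```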

Similarly, here is the function for redacting an image:

redact.js

async function redactImage(
  callingProjectId,
  filepath,
  minLikelihood,
  infoTypes,
  outputPath
) {
...
}

Here is the request that is provided to the redactImage function:

redact.js

// Construct image redaction request
  const request = {
    parent: dlp.projectPath(callingProjectId),
    byteItem: {
      type: fileTypeConstant,
      data: fileBytes,
    },
    inspectConfig: {
      minLikelihood: minLikelihood,
      infoTypes: infoTypes,
    },
    imageRedactionConfigs: imageRedactionConfigs,
  };
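
Two values in the request above, fileTypeConstant and imageRedactionConfigs, are built in the elided part of the sample. The sketch below shows one way they could be derived. The extension-to-type mapping and both helper names are assumptions for illustration; the sample repository may compute them differently.

```javascript
const path = require('path');

// Map a file extension to a bytes type name for byteItem.type.
// Treat this mapping as an assumption of the sketch.
const BYTES_TYPE_BY_EXTENSION = {
  '.png': 'IMAGE_PNG',
  '.jpg': 'IMAGE_JPEG',
  '.jpeg': 'IMAGE_JPEG',
  '.bmp': 'IMAGE_BMP',
  '.svg': 'IMAGE_SVG',
};

function bytesTypeFor(filepath) {
  const extension = path.extname(filepath).toLowerCase();
  return BYTES_TYPE_BY_EXTENSION[extension] || 'BYTES_TYPE_UNSPECIFIED';
}

// One redaction config per info type, so each requested type is covered.
function buildImageRedactionConfigs(infoTypeNames) {
  return infoTypeNames.map(name => ({infoType: {name}}));
}

console.log(bytesTypeFor('resources/test.png'));
console.log(buildImageRedactionConfigs(['PHONE_NUMBER', 'EMAIL_ADDRESS']));
```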

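6. Clean up

We've explored how we can use the DLP API to mask, de-identify, and redact sensitive information in our data. It's now time to clean up any resources we created.

Delete the project

In the GCP Console, go to the Cloud Resource Manager page.

In the project list, select the project we've been working in and click Delete. You'll be prompted to type in the project ID. Enter it and click Shut down.

Alternatively, you can delete the entire project directly from Cloud Shell with gcloud: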

gcloud projects delete $GOOGLE_CLOUD_PROJECT

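7. Congratulations!

Woohoo! You did it! Cloud DLP is a powerful tool that gives you access to a robust platform for sensitive data inspection, classification, and de-identification.

What we've covered

We saw how the Cloud DLP API can be used to inspect strings and files for multiple info types
We learned how the DLP API can de-identify strings with a mask to hide data matching info types
We used the DLP API with an encryption key to de-identify and then re-identify data
We used the DLP API to redact data from strings and images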
