Process handwritten forms intelligently with Document AI (Python)

About this codelab

Last updated: Apr 26, 2021

Written by Rajat Gupta

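1. Overview

What is Document AI?

The Document AI API is a document understanding solution that takes unstructured data, such as documents and emails, and makes that data easier to understand, analyze, and consume. The API provides structure through content classification, entity extraction, advanced search, and more.

This tutorial focuses on using the Document AI API with Python. It shows how to parse a simple medical intake form.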

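What you'll learn

How to enable the Document AI API
How to authenticate API requests
How to install the client library for Python
How to parse data from a scanned form

What you'll need

A Google Cloud project
A browser, such as Chrome or Firefox
Knowledge of Python 3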

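2. Setup and requirements

Self-paced environment setup

Remember the project ID, a name that is unique across all Google Cloud projects (once a name is taken, nobody else can use it). It is referred to later in this codelab as PROJECT_ID.

Next, you must enable billing in the Cloud Console to use Google Cloud resources.

Be sure to follow the instructions in the "Cleanup" section, which explains how to shut down resources so that you don't incur charges beyond this tutorial. New users of Google Cloud are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, this codelab uses Google Cloud Shell, a command-line environment running in the cloud.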

Activate Cloud Shell

In the Cloud Console, click Activate Cloud Shell.

If you have never started Cloud Shell before, you are shown an intermediate screen (below the fold) that describes what it is. If that is the case, click Continue (you won't see the screen again).

It should only take a few moments to provision and connect to Cloud Shell.

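Cloud Shell gives you terminal access to a virtual machine hosted in the cloud. The virtual machine includes all the development tools that you need. It offers a persistent 5GB home directory and runs in Google Cloud, which greatly enhances network performance and authentication. Much, if not all, of your work in this codelab can be done with just a browser or a Google Chromebook.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your project ID.

Run the following command in Cloud Shell to confirm that you are authenticated: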

gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*      <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If the project is not set as shown above, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

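3. Enable the Cloud Document AI API

You must enable the API before you can begin using Document AI. Open the Cloud Console in your browser.

Click Navigation menu ☰ > APIs & Services > Library.
Search for "Document AI API", then click Enable to use the API in your Google Cloud project.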

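4. Create and test a processor

You must first create an instance of the Form Parser processor to use in the Document AI Platform for this tutorial.

In the console, navigate to the Document AI Platform Overview.
Click Create Processor and select Form Parser.
Specify a processor name and select your region from the list.
Click Create to create your processor.
Copy your processor ID. You will need it in your code later.

(Optional) You can test the processor in the console by uploading a document. Click Upload Document and select a form to parse. If you don't have a form available, you can download and use the sample health form.

Your output should show the parsed form.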

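5. Authenticate API requests

To make requests to the Document AI API, you must use a service account. A service account belongs to your project and is used by the Google Client Python library to make API requests. Like any other user account, a service account is represented by an email address. In this section, you will use the Cloud SDK to create a service account and then create the credentials you need to authenticate as that service account.

First, set an environment variable with your PROJECT_ID, which you will use throughout this codelab: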

export GOOGLE_CLOUD_PROJECT=$(gcloud config get-value core/project)

Next, create a new service account to access the Document AI API with the following command:

gcloud iam service-accounts create my-docai-sa \
  --display-name "my-docai-service-account"

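Next, create the credentials that your Python code uses to log in as the new service account. Create these credentials and save them as the JSON file "~/key.json" with the following command: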

gcloud iam service-accounts keys create ~/key.json \
  --iam-account  my-docai-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com

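Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which the library uses to find your credentials. To read more about this form of authentication, see the guide. Set the environment variable to the full path of the credentials JSON file you created with the following command, replacing the placeholder path (for the key created above, the full path is $HOME/key.json):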

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
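Before calling the API, it can help to confirm that the variable points at a usable key file. The following is a minimal sketch using only the standard library; the helper name check_credentials_file is hypothetical and not part of any Google library. It relies on the fact that a service account key file is JSON with a "type" of "service_account" and a "client_email" field.

```python
import json
from pathlib import Path


def check_credentials_file(path):
    """Return the service account email stored in a key file.

    Hypothetical helper for illustration; raises ValueError when the file
    is missing or does not look like a service account key.
    """
    key_path = Path(path).expanduser()
    if not key_path.is_file():
        raise ValueError(f"No key file found at {key_path}")
    with key_path.open() as key_file:
        key = json.load(key_file)
    if key.get("type") != "service_account" or "client_email" not in key:
        raise ValueError(f"{key_path} is not a service account key")
    return key["client_email"]
```

In a Python session you could call it as check_credentials_file(os.environ["GOOGLE_APPLICATION_CREDENTIALS"]) after importing os; the returned address should be the one for my-docai-sa.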

6. Download the sample form

A sample form is stored in Google Cloud Storage. Download it to your working directory with the following command.

gsutil cp gs://cloud-samples-data/documentai/form.pdf .

Confirm that the file was downloaded to your Cloud Shell with the following command:

ls -ltr form.pdf
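Beyond checking that the file exists, you may want to confirm that it really is a PDF before sending it to the API. A minimal sketch, assuming only that PDF files begin with the bytes %PDF- (the helper name looks_like_pdf is hypothetical):

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True when the file starts with the PDF signature bytes."""
    with Path(path).open("rb") as pdf_file:
        return pdf_file.read(5) == b"%PDF-"
```

Calling looks_like_pdf("form.pdf") from the directory that holds the download should return True for the sample form.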

7. Install the client libraries

Install the client libraries:

pip3 install --upgrade google-cloud-documentai
pip3 install --upgrade google-cloud-storage

You should see output similar to the following:

...
Installing collected packages: google-cloud-documentai
Successfully installed google-cloud-documentai-0.3.0
.
.
Installing collected packages: google-cloud-storage
Successfully installed google-cloud-storage-1.35.0

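Now you're ready to use the Document AI API!

Start interactive Python

In this tutorial, you'll use an interactive Python interpreter called IPython. Start a session by running ipython in Cloud Shell. This command runs the Python interpreter in an interactive session.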

ipython

You should see output similar to the following:

Python 3.7.3 (default, Jul 25 2020, 13:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

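8. Make a synchronous process document request

In this step, you'll make a process document call using the synchronous endpoint. To process large numbers of documents at once, you can also use the asynchronous API. To learn more about using the Form Parser APIs, see the guide.

Copy the following code into your IPython session: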

project_id = 'YOUR_PROJECT_ID'
location = 'YOUR_PROJECT_LOCATION'  # Format is 'us' or 'eu'
processor_id = 'YOUR_PROCESSOR_ID'  # Create processor in Cloud Console
file_path = 'form.pdf'  # The local file in your current working directory

from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage

def process_document(
    project_id=project_id, location=location, processor_id=processor_id,  file_path=file_path
):

    # Instantiates a client
    client = documentai.DocumentProcessorServiceClient()

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Wrap the raw bytes and their MIME type for the request
    document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "document": document}

    # Use the Document AI client to process the sample form
    result = client.process_document(request=request)

    document = result.document
    document_text = document.text
    print("Document processing complete.")
    print("Text: {}".format(document_text))

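Run your code now; you should see the extracted text printed in the console. In the next steps, you will extract structured data that is easier to store in a database or use in other applications.

Call the function: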

process_document()
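The request above depends on two strings derived from your settings: the processor resource name and, for processors outside the US, a regional endpoint. The sketch below builds both as plain strings. The helper names are hypothetical, and the endpoint pattern location-documentai.googleapis.com is stated here as an assumption to check against the Document AI documentation for your region.

```python
def processor_name(project_id, location, processor_id):
    """Build the full processor resource name used in the process request."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


def regional_endpoint(location):
    """Build a regional API endpoint host (assumed pattern, e.g. for 'eu')."""
    return f"{location}-documentai.googleapis.com"


# Hypothetical values for illustration only
print(processor_name("my-project", "eu", "abc123"))
print(regional_endpoint("eu"))
```

If your processor lives in 'eu', one option is to pass the endpoint when creating the client, for example client_options={"api_endpoint": regional_endpoint("eu")} as an argument to DocumentProcessorServiceClient.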

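9. Extract form key/value pairs

Now you can extract the key-value pairs from the form along with their confidence scores. The Document response object contains a list of pages from the input document. Each page object contains a list of form fields and their locations in the text.

The following code iterates through each page, then extracts and prints each key, value, and confidence score.

Paste the following code at the bottom of the process_document() function: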

    document_pages = document.pages

    for page in document_pages:
        print("Page Number:{}".format(page.page_number))
        for form_field in page.form_fields:
            # Field names and values are stored as offsets into document.text
            field_name = get_text(form_field.field_name, document)
            name_confidence = form_field.field_name.confidence
            field_value = get_text(form_field.field_value, document)
            value_confidence = form_field.field_value.confidence
            # Show confidence scores with four decimal places
            print(f"{field_name}{field_value}  (Confidence Scores: {name_confidence:.4f}, {value_confidence:.4f})")

def get_text(doc_element, document):
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.

    doc_element is a layout object such as form_field.field_name;
    document is the processed Document that holds the full text.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        # An unset start_index reads as 0, the start of the text
        start_index = int(segment.start_index)
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response
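To see how the offset lookup behaves without calling the API, the sketch below applies the same slicing to stand-in objects. The text and segment offsets are made up for illustration, and SimpleNamespace merely imitates the start_index and end_index attributes of a real text segment.

```python
from types import SimpleNamespace


def text_from_segments(full_text, segments):
    """Join the slices of full_text described by each segment's offsets."""
    return "".join(
        full_text[int(segment.start_index):int(segment.end_index)]
        for segment in segments
    )


# Made-up document text; a real response stores it in document.text
full_text = "Name:\nSally Walker\nDOB:\n09/04/1986\n"

# One segment covering the value that follows "Name:"
value_segments = [SimpleNamespace(start_index=6, end_index=18)]
print(repr(text_from_segments(full_text, value_segments)))  # 'Sally Walker'

# Two segments are concatenated in order
split_segments = [
    SimpleNamespace(start_index=0, end_index=5),
    SimpleNamespace(start_index=6, end_index=18),
]
print(repr(text_from_segments(full_text, split_segments)))  # 'Name:Sally Walker'
```

Because a field can span several lines, its text may arrive as more than one segment, which is why the slices are joined rather than read from a single pair of offsets.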

Now run your code by calling the function:

process_document()

If you use our sample document, you should see output similar to the following:

Document processing complete.
Page Number:1
Marital Status: Single  (Confidence Scores: 1.0000, 1.0000)
DOB: 09/04/1986 (Confidence Scores: 0.9999, 0.9999)
City: Towalo  (Confidence Scores: 0.9996, 0.9996)
Address: 24 Barney Lane  (Confidence Scores: 0.9994, 0.9994)
Referred By: None (Confidence Scores: 0.9968, 0.9968)
Phone #:  (906) 917-3486 (Confidence Scores: 0.9961, 0.9961)
State: NJ  (Confidence Scores: 0.9960, 0.9960)
Emergency Contact Phone: (906) 334-8926 (Confidence Scores: 0.9924, 0.9924)
Name: Sally Walker (Confidence Scores: 0.9922, 0.9922)
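Printed lines are convenient for inspection, but storing the results usually calls for a cleaner structure. The following is a minimal sketch that turns (name, value, name confidence, value confidence) tuples into a dictionary and drops low-confidence fields. The function name and the 0.995 threshold are assumptions chosen for illustration; the sample tuples reuse values from the output above.

```python
def fields_to_record(fields, min_confidence=0.995):
    """Convert extracted form fields into a {name: value} dictionary.

    fields is an iterable of (name, value, name_confidence, value_confidence).
    Fields whose lower confidence falls below min_confidence are skipped.
    """
    record = {}
    for name, value, name_confidence, value_confidence in fields:
        if min(name_confidence, value_confidence) < min_confidence:
            continue
        # Strip surrounding whitespace and the trailing colon from the label
        clean_name = name.strip().rstrip(":").strip()
        record[clean_name] = value.strip()
    return record


sample_fields = [
    ("Marital Status: ", "Single ", 1.0, 1.0),
    ("DOB: ", "09/04/1986", 0.9999, 0.9999),
    ("Name: ", "Sally Walker", 0.9922, 0.9922),
]
print(fields_to_record(sample_fields))
# {'Marital Status': 'Single', 'DOB': '09/04/1986'}
```

With this threshold the Name field is left out because its score of 0.9922 is below 0.995; lowering min_confidence would include it.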

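10. Congratulations

Congratulations, you have successfully used the Document AI API to extract data from a handwritten form. We encourage you to try other form images.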
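When you try other files, remember that the request in this codelab hardcodes the MIME type application/pdf. One way to choose the type from the file name is sketched below with the standard mimetypes module. The allow-list is an assumption for illustration, not the authoritative list of formats the processor accepts, and the helper name is hypothetical.

```python
import mimetypes

# Assumption: illustrative allow-list; check the Document AI documentation
# for the formats your processor actually supports.
ASSUMED_SUPPORTED_TYPES = {
    "application/pdf",
    "image/png",
    "image/jpeg",
    "image/tiff",
    "image/gif",
}


def guess_mime_type(file_path):
    """Guess the MIME type from the file extension, or raise ValueError."""
    mime_type, _ = mimetypes.guess_type(file_path)
    if mime_type not in ASSUMED_SUPPORTED_TYPES:
        raise ValueError(f"Unsupported or unknown file type: {file_path}")
    return mime_type
```

The result could replace the hardcoded value when building the document dictionary, for example "mime_type": guess_mime_type(file_path).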

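Cleanup

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:

In the Cloud Console, go to the Manage resources page.
In the project list, select your project, then click Delete.
In the dialog, type the project ID, then click Shut down to delete the project.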

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.
