使用 Python 管理 Document AI 处理器

1. 概览

c6d2ea69b1ba0eff.png

Document AI 是什么?

Document AI 是一个平台,可让您从文档中提取实用信息。从本质上讲,它提供了一个不断扩充的文档处理器列表(也称为解析器或拆分器,具体取决于其功能)。

您可以通过以下两种方式管理 Document AI 处理器:

  • 手动安装;
  • 使用 Document AI API 以编程方式创建。

下面是一个屏幕截图示例,显示了来自 Web 控制台和 Python 代码的处理器列表:

312f0e9b3a8b8291

在本实验中,您将学习如何使用 Python 客户端库以编程方式管理 Document AI 处理器。

您将看到的内容

  • 如何设置环境
  • 如何提取处理器类型
  • 如何创建处理器
  • 如何列出项目处理方
  • 如何使用处理器
  • 如何启用/停用处理器
  • 如何管理处理器版本
  • 如何删除处理器

所需条件

  • Google Cloud 项目
  • 一个浏览器,例如 ChromeFirefox
  • 熟悉 Python

调查问卷

您将如何使用本教程?

仅阅读教程内容 阅读并完成练习

您如何评价使用 Python 的体验?

新手水平 中等水平 熟练水平

您如何评价自己在 Google Cloud 服务方面的经验水平?

<ph type="x-smartling-placeholder"></ph> 新手 中级 熟练

2. 设置和要求

自定进度的环境设置

  1. 登录 Google Cloud 控制台,然后创建一个新项目或重复使用现有项目。如果您还没有 Gmail 或 Google Workspace 账号,则必须创建一个

295004821bab6a87

37d264871000675d

96d86d3d5655cdbe.png

  • 项目名称是此项目参与者的显示名称。它是 Google API 尚未使用的字符串。您可以随时对其进行更新。
  • 项目 ID 在所有 Google Cloud 项目中是唯一的,并且是不可变的(一经设置便无法更改)。Cloud 控制台会自动生成一个唯一字符串;通常情况下,您无需关注该字符串。在大多数 Codelab 中,您都需要引用项目 ID(通常用 PROJECT_ID 标识)。如果您不喜欢生成的 ID,可以再随机生成一个 ID。或者,您也可以尝试自己的项目 ID,看看是否可用。完成此步骤后便无法更改该 ID,并且此 ID 在项目期间会一直保留。
  • 此外,还有第三个值,即部分 API 使用的项目编号,供您参考。如需详细了解所有这三个值,请参阅文档
  1. 接下来,您需要在 Cloud 控制台中启用结算功能,以便使用 Cloud 资源/API。运行此 Codelab 应该不会产生太多的费用(如果有的话)。若要关闭资源以避免产生超出本教程范围的结算费用,您可以删除自己创建的资源或删除项目。Google Cloud 新用户符合参与 300 美元免费试用计划的条件。

启动 Cloud Shell

虽然 Google Cloud 可以通过笔记本电脑进行远程操作,但在本实验中,您使用的是 Cloud Shell,这是一个在云端运行的命令行环境。

激活 Cloud Shell

  1. 在 Cloud Console 中,点击激活 Cloud Shelld1264ca30785e435.png

cb81e7c8e34bc8d.png

如果这是您第一次启动 Cloud Shell,系统会显示一个中间屏幕,说明它是什么。如果您看到中间屏幕,请点击继续

d95252b003979716.png

预配和连接到 Cloud Shell 只需花几分钟时间。

7833d5e1c5d18f54

这个虚拟机装有所需的所有开发工具。它提供了一个持久的 5 GB 主目录,并在 Google Cloud 中运行,大大增强了网络性能和身份验证功能。您在此 Codelab 中的大部分(即使不是全部)工作都可以通过浏览器完成。

在连接到 Cloud Shell 后,您应该会看到自己已通过身份验证,并且相关项目已设为您的项目 ID。

  1. 在 Cloud Shell 中运行以下命令以确认您已通过身份验证:
gcloud auth list

命令输出

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`
  1. 在 Cloud Shell 中运行以下命令,以确认 gcloud 命令了解您的项目:
gcloud config list project

命令输出

[core]
project = <PROJECT_ID>

如果不是上述结果,您可以使用以下命令进行设置:

gcloud config set project <PROJECT_ID>

命令输出

Updated property [core/project].

3. 环境设置

在开始使用 Document AI 之前,请在 Cloud Shell 中运行以下命令以启用 Document AI API:

gcloud services enable documentai.googleapis.com

您应该会看到与以下类似的内容:

Operation "operations/..." finished successfully.

现在,您可以使用 Document AI 了!

导航到您的主目录:

cd ~

创建一个 Python 虚拟环境来隔离依赖项:

virtualenv venv-docai

激活此虚拟环境:

source venv-docai/bin/activate

安装 IPython、Document AI 客户端库和 python-tabulate(您将使用它们来精确输出请求结果):

pip install ipython google-cloud-documentai tabulate

您应该会看到与以下类似的内容:

...
Installing collected packages: ..., tabulate, ipython, google-cloud-documentai
Successfully installed ... google-cloud-documentai-2.15.0 ...

现在,您可以使用 Document AI 客户端库了!

设置以下环境变量:

export PROJECT_ID=$(gcloud config get-value core/project)
# Choose "us" or "eu"
export API_LOCATION="us"

从现在开始,所有步骤都应在同一会话中完成。

确保正确定义环境变量:

echo $PROJECT_ID
echo $API_LOCATION

在接下来的步骤中,您将使用刚刚安装的名为 IPython 的交互式 Python 解释器。在 Cloud Shell 中运行 ipython 来启动会话:

ipython

您应该会看到与以下类似的内容:

Python 3.9.2 (default, Feb 28 2021, 17:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

将以下代码复制到您的 IPython 会话中:

import os
from typing import Iterator, MutableSequence, Optional, Sequence, Tuple

import google.cloud.documentai_v1 as docai
from tabulate import tabulate

PROJECT_ID = os.getenv("PROJECT_ID", "")
API_LOCATION = os.getenv("API_LOCATION", "")

assert PROJECT_ID, "PROJECT_ID is undefined"
assert API_LOCATION in ("us", "eu"), "API_LOCATION is incorrect"

# Test processors
document_ocr_display_name = "document-ocr"
form_parser_display_name = "form-parser"

test_processor_display_names_and_types = (
    (document_ocr_display_name, "OCR_PROCESSOR"),
    (form_parser_display_name, "FORM_PARSER_PROCESSOR"),
)

def get_client() -> docai.DocumentProcessorServiceClient:
    client_options = {"api_endpoint": f"{API_LOCATION}-documentai.googleapis.com"}
    return docai.DocumentProcessorServiceClient(client_options=client_options)

def get_parent(client: docai.DocumentProcessorServiceClient) -> str:
    return client.common_location_path(PROJECT_ID, API_LOCATION)

def get_client_and_parent() -> Tuple[docai.DocumentProcessorServiceClient, str]:
    client = get_client()
    parent = get_parent(client)
    return client, parent
    

您现在可以发出第一个请求并获取处理器类型了。

4. 提取处理器类型

在下一步创建处理器之前,请提取可用的处理器类型。您可以使用 fetch_processor_types 检索此列表。

将以下函数添加到 IPython 会话中:

def fetch_processor_types() -> MutableSequence[docai.ProcessorType]:
    client, parent = get_client_and_parent()
    response = client.fetch_processor_types(parent=parent)

    return response.processor_types

def print_processor_types(processor_types: Sequence[docai.ProcessorType]):
    def sort_key(pt):
        return (not pt.allow_creation, pt.category, pt.type_)

    sorted_processor_types = sorted(processor_types, key=sort_key)
    data = processor_type_tabular_data(sorted_processor_types)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processor types: {len(sorted_processor_types)}")

def processor_type_tabular_data(
    processor_types: Sequence[docai.ProcessorType],
) -> Iterator[Tuple[str, str, str, str]]:
    def locations(pt):
        return ", ".join(sorted(loc.location_id for loc in pt.available_locations))

    yield ("type", "category", "allow_creation", "locations")
    yield ("left", "left", "left", "left")
    if not processor_types:
        yield ("-", "-", "-", "-")
        return
    for pt in processor_types:
        yield (pt.type_, pt.category, f"{pt.allow_creation}", locations(pt))
        

列出处理器类型:

processor_types = fetch_processor_types()
print_processor_types(processor_types)

您应该会看到如下内容:

+---------------------------------+-------------+----------------+------------+
| type                            | category    | allow_creation | locations  |
+---------------------------------+-------------+----------------+------------+
| CUSTOM_CLASSIFICATION_PROCESSOR | CUSTOM      | True           | eu, us...  |
| CUSTOM_EXTRACTION_PROCESSOR     | CUSTOM      | True           | eu, us...  |
| FORM_PARSER_PROCESSOR           | GENERAL     | True           | eu, us...  |
| OCR_PROCESSOR                   | GENERAL     | True           | eu, us...  |
| EXPENSE_PROCESSOR               | SPECIALIZED | True           | eu, us...  |
...
+---------------------------------+-------------+----------------+------------+
→ Processor types: 40

现在,您已经掌握了在下一步中创建处理方所需的全部信息。

5. 创建处理器

如需创建处理器,请使用显示名和处理器类型调用 create_processor

添加以下函数:

def create_processor(display_name: str, type: str) -> docai.Processor:
    client, parent = get_client_and_parent()
    processor = docai.Processor(display_name=display_name, type_=type)

    return client.create_processor(parent=parent, processor=processor)
    

创建测试处理器:

separator = "=" * 80
for display_name, type in test_processor_display_names_and_types:
    print(separator)
    print(f"Creating {display_name} ({type})...")
    try:
        create_processor(display_name, type)
    except Exception as err:
        print(err)
print(separator)
print("Done")

您应该会看到以下内容:

================================================================================
Creating document-ocr (OCR_PROCESSOR)...
================================================================================
Creating form-parser (FORM_PARSER_PROCESSOR)...
================================================================================
Done

您已创建新的处理器!

接下来,了解如何列出处理器。

6. 列出项目处理器

list_processors 会返回属于您的项目的所有处理器的列表。

添加以下函数:

def list_processors() -> MutableSequence[docai.Processor]:
    client, parent = get_client_and_parent()
    response = client.list_processors(parent=parent)

    return list(response.processors)

def print_processors(processors: Optional[Sequence[docai.Processor]] = None):
    def sort_key(processor):
        return processor.display_name

    if processors is None:
        processors = list_processors()
    sorted_processors = sorted(processors, key=sort_key)
    data = processor_tabular_data(sorted_processors)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processors: {len(sorted_processors)}")

def processor_tabular_data(
    processors: Sequence[docai.Processor],
) -> Iterator[Tuple[str, str, str]]:
    yield ("display_name", "type", "state")
    yield ("left", "left", "left")
    if not processors:
        yield ("-", "-", "-")
        return
    for processor in processors:
        yield (processor.display_name, processor.type_, processor.state.name)
        

调用函数:

processors = list_processors()
print_processors(processors)

您应该会看到以下内容:

+--------------+-----------------------+---------+
| display_name | type                  | state   |
+--------------+-----------------------+---------+
| document-ocr | OCR_PROCESSOR         | ENABLED |
| form-parser  | FORM_PARSER_PROCESSOR | ENABLED |
+--------------+-----------------------+---------+
→ Processors: 2

如需按处理器的显示名称检索处理器,请添加以下函数:

def get_processor(
    display_name: str,
    processors: Optional[Sequence[docai.Processor]] = None,
) -> Optional[docai.Processor]:
    if processors is None:
        processors = list_processors()
    for processor in processors:
        if processor.display_name == display_name:
            return processor
    return None
    

测试函数:

processor = get_processor(document_ocr_display_name, processors)

assert processor is not None
print(processor)

您应该会看到与以下类似的内容:

name: "projects/PROJECT_NUM/locations/LOCATION/processors/PROCESSOR_ID"
type_: "OCR_PROCESSOR"
display_name: "document-ocr"
state: ENABLED
...

现在,您已经知道如何列出项目处理器并按其显示名检索它们。接下来,了解如何使用处理器。

7. 使用处理器

可以通过以下两种方式处理文件:

  • 同步:调用 process_document 以分析单个文档并直接使用结果。
  • 异步:调用 batch_process_documents 以对多个或大型文档启动批处理。

您的考试文件 ( PDF) 是一份已手写回答的调查问卷扫描版。直接从 IPython 会话将其下载到您的工作目录中:

!gsutil cp gs://cloud-samples-data/documentai/form.pdf .

检查工作目录的内容:

!ls

您应具备以下条件:

...  form.pdf  ...  venv-docai  ...

您可以使用同步 process_document 方法来分析本地文件。添加以下函数:

def process_file(
    processor: docai.Processor,
    file_path: str,
    mime_type: str,
) -> docai.Document:
    client = get_client()
    with open(file_path, "rb") as document_file:
        document_content = document_file.read()
    document = docai.RawDocument(content=document_content, mime_type=mime_type)
    request = docai.ProcessRequest(raw_document=document, name=processor.name)

    response = client.process_document(request)

    return response.document
    

由于您的文档是调查问卷,因此请选择表单解析器。除了提取所有处理器都执行的文本(打印和手写)之外,这种通用处理器还能够检测表单字段。

分析文档:

processor = get_processor(form_parser_display_name)
assert processor is not None

file_path = "./form.pdf"
mime_type = "application/pdf"

document = process_file(processor, file_path, mime_type)

所有处理器都对文档运行光学字符识别 (OCR) 第一遍。查看 OCR 通道检测到的文本:

document.text.split("\n")

您应看到类似下图的内容:

['FakeDoc M.D.',
 'HEALTH INTAKE FORM',
 'Please fill out the questionnaire carefully. The information you provide will be used to complete',
 'your health profile and will be kept confidential.',
 'Date:',
 'Sally',
 'Walker',
 'Name:',
 '9/14/19',
 'DOB: 09/04/1986',
 'Address: 24 Barney Lane City: Towaco State: NJ Zip: 07082',
 'Email: Sally, waller@cmail.com Phone #: (906) 917-3486',
 'Gender: F',
 'Single Occupation: Software Engineer',
 'Referred By: None',
 'Emergency Contact: Eva Walker Emergency Contact Phone: (906) 334-8976',
 'Marital Status:',
  ...
]

添加以下函数以输出检测到的表单字段:

def print_form_fields(document: docai.Document):
    sorted_form_fields = form_fields_sorted_by_ocr_order(document)
    data = form_field_tabular_data(sorted_form_fields, document)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Form fields: {len(sorted_form_fields)}")

def form_field_tabular_data(
    form_fields: Sequence[docai.Document.Page.FormField],
    document: docai.Document,
) -> Iterator[Tuple[str, str, str]]:
    yield ("name", "value", "confidence")
    yield ("right", "left", "right")
    if not form_fields:
        yield ("-", "-", "-")
        return
    for form_field in form_fields:
        name_layout = form_field.field_name
        value_layout = form_field.field_value
        name = text_from_layout(name_layout, document)
        value = text_from_layout(value_layout, document)
        confidence = value_layout.confidence
        yield (name, value, f"{confidence:.1%}")
        

还应添加以下实用函数:

def form_fields_sorted_by_ocr_order(
    document: docai.Document,
) -> MutableSequence[docai.Document.Page.FormField]:
    def sort_key(form_field):
        # Sort according to the field name detected position
        text_anchor = form_field.field_name.text_anchor
        return text_anchor.text_segments[0].start_index if text_anchor else 0

    fields = (field for page in document.pages for field in page.form_fields)

    return sorted(fields, key=sort_key)


def text_from_layout(
    layout: docai.Document.Page.Layout,
    document: docai.Document,
) -> str:
    full_text = document.text
    segs = layout.text_anchor.text_segments
    text = "".join(full_text[seg.start_index : seg.end_index] for seg in segs)
    if text.endswith("\n"):
        text = text[:-1]

    return text
    

输出检测到的表单字段:

print_form_fields(document)

您应该会看到如下所示的输出:

+--------------+-------------------------+------------+
|         name | value                   | confidence |
+--------------+-------------------------+------------+
|        Date: | 9/14/19                 |     100.0% |
|        Name: | Sally                   |      99.7% |
|              | Walker                  |            |
|         DOB: | 09/04/1986              |     100.0% |
|     Address: | 24 Barney Lane          |      99.9% |
|        City: | Towaco                  |      99.8% |
|       State: | NJ                      |      99.7% |
|         Zip: | 07082                   |      99.5% |
|       Email: | Sally, waller@cmail.com |      99.6% |
|     Phone #: | (906) 917-3486          |     100.0% |
|      Gender: | F                       |     100.0% |
|  Occupation: | Software Engineer       |     100.0% |
| Referred By: | None                    |     100.0% |
...
+--------------+-------------------------+------------+
→ Form fields: 17

查看检测到的字段名称和值 ( PDF)。以下是调查问卷的上半部分:

ea7370f0bb0cc494.png

您分析了一个表单,其中既包含印刷文本又包含手写文本。您还检测到了具有高置信度的其字段。结果就是,您的像素已转换为结构化数据!

8. 启用和停用处理器

借助 disable_processorenable_processor,您可以控制是否可以使用处理器。

添加以下函数:

def update_processor_state(processor: docai.Processor, enable_processor: bool):
    client = get_client()
    if enable_processor:
        request = docai.EnableProcessorRequest(name=processor.name)
        operation = client.enable_processor(request)
    else:
        request = docai.DisableProcessorRequest(name=processor.name)
        operation = client.disable_processor(request)
    operation.result()  # Wait for operation to complete

def enable_processor(processor: docai.Processor):
    update_processor_state(processor, True)

def disable_processor(processor: docai.Processor):
    update_processor_state(processor, False)
    

停用表单解析器处理器,并检查处理器的状态:

processor = get_processor(form_parser_display_name)
assert processor is not None

disable_processor(processor)
print_processors()

您应该会看到以下内容:

+--------------+-----------------------+----------+
| display_name | type                  | state    |
+--------------+-----------------------+----------+
| document-ocr | OCR_PROCESSOR         | ENABLED  |
| form-parser  | FORM_PARSER_PROCESSOR | DISABLED |
+--------------+-----------------------+----------+
→ Processors: 2

重新启用表单解析器处理器:

enable_processor(processor)
print_processors()

您应该会看到以下内容:

+--------------+-----------------------+---------+
| display_name | type                  | state   |
+--------------+-----------------------+---------+
| document-ocr | OCR_PROCESSOR         | ENABLED |
| form-parser  | FORM_PARSER_PROCESSOR | ENABLED |
+--------------+-----------------------+---------+
→ Processors: 2

接下来,了解如何管理处理器版本。

9. 管理处理器版本

处理器有多种版本。查看如何使用 list_processor_versionsset_default_processor_version 方法。

添加以下函数:

def list_processor_versions(
    processor: docai.Processor,
) -> MutableSequence[docai.ProcessorVersion]:
    client = get_client()
    response = client.list_processor_versions(parent=processor.name)

    return list(response)


def get_sorted_processor_versions(
    processor: docai.Processor,
) -> MutableSequence[docai.ProcessorVersion]:
    def sort_key(processor_version: docai.ProcessorVersion):
        return processor_version.name

    versions = list_processor_versions(processor)

    return sorted(versions, key=sort_key)


def print_processor_versions(processor: docai.Processor):
    versions = get_sorted_processor_versions(processor)
    default_version_name = processor.default_processor_version
    data = processor_versions_tabular_data(versions, default_version_name)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processor versions: {len(versions)}")


def processor_versions_tabular_data(
    versions: Sequence[docai.ProcessorVersion],
    default_version_name: str,
) -> Iterator[Tuple[str, str, str]]:
    yield ("version", "display name", "default")
    yield ("left", "left", "left")
    if not versions:
        yield ("-", "-", "-")
        return
    for version in versions:
        mapping = docai.DocumentProcessorServiceClient.parse_processor_version_path(
            version.name
        )
        processor_version = mapping["processor_version"]
        is_default = "Y" if version.name == default_version_name else ""
        yield (processor_version, version.display_name, is_default)
        

列出 OCR 处理器的可用版本:

processor = get_processor(document_ocr_display_name)
assert processor is not None
print_processor_versions(processor)

您将获得处理器版本:

+--------------------------------+--------------------------+---------+
| version                        | display name             | default |
+--------------------------------+--------------------------+---------+
| pretrained-ocr-v1.0-2020-09-23 | Google Stable            | Y       |
| pretrained-ocr-v1.1-2022-09-12 | Google Release Candidate |         |
| pretrained-ocr-v1.2-2022-11-10 | Google Release Candidate |         |
+--------------------------------+--------------------------+---------+
→ Processor versions: 3

现在,添加一个函数来更改默认处理器版本:

def set_default_processor_version(processor: docai.Processor, version_name: str):
    client = get_client()
    request = docai.SetDefaultProcessorVersionRequest(
        processor=processor.name,
        default_processor_version=version_name,
    )

    operation = client.set_default_processor_version(request)
    operation.result()  # Wait for operation to complete
    

切换到最新的处理器版本:

processor = get_processor(document_ocr_display_name)
assert processor is not None
versions = get_sorted_processor_versions(processor)

new_version = versions[-1]  # Latest version
set_default_processor_version(processor, new_version.name)

# Update the processor info
processor = get_processor(document_ocr_display_name)
assert processor is not None
print_processor_versions(processor)

您将获得新版本配置:

+--------------------------------+--------------------------+---------+
| version                        | display name             | default |
+--------------------------------+--------------------------+---------+
| pretrained-ocr-v1.0-2020-09-23 | Google Stable            |         |
| pretrained-ocr-v1.1-2022-09-12 | Google Release Candidate |         |
| pretrained-ocr-v1.2-2022-11-10 | Google Release Candidate | Y       |
+--------------------------------+--------------------------+---------+
→ Processor versions: 3

接下来是处理方管理的最终方法(删除)。

10. 删除处理器

最后,查看如何使用 delete_processor 方法。

添加以下函数:

def delete_processor(processor: docai.Processor):
    client = get_client()
    operation = client.delete_processor(name=processor.name)
    operation.result()  # Wait for operation to complete
    

删除测试处理器:

processors_to_delete = [dn for dn, _ in test_processor_display_names_and_types]
print("Deleting processors...")

for processor in list_processors():
    if processor.display_name not in processors_to_delete:
        continue
    print(f"  Deleting {processor.display_name}...")
    delete_processor(processor)

print("Done\n")
print_processors()

您应该会看到以下内容:

Deleting processors...
  Deleting form-parser...
  Deleting document-ocr...
Done

+--------------+------+-------+
| display_name | type | state |
+--------------+------+-------+
| -            | -    | -     |
+--------------+------+-------+
→ Processors: 0

您已经了解了所有处理方管理方法!即将大功告成...

11. 恭喜!

c6d2ea69b1ba0eff.png

您已了解如何使用 Python 管理 Document AI 处理器!

清理

如需在 Cloud Shell 中清理开发环境,请执行以下操作:

  • 如果您仍处于 IPython 会话,请返回到 shell:exit
  • 停止使用 Python 虚拟环境:deactivate
  • 删除虚拟环境文件夹:cd ~ ; rm -rf ./venv-docai

如需从 Cloud Shell 中删除 Google Cloud 项目,请执行以下操作:

  • 检索当前项目 ID:PROJECT_ID=$(gcloud config get-value core/project)
  • 确保这是您要删除的项目:echo $PROJECT_ID
  • 删除项目:gcloud projects delete $PROJECT_ID

了解详情

许可

此作品已获得 Creative Commons Attribution 2.0 通用许可授权。