Managing Document AI processors with Python

About this codelab

Last updated: May 13, 2025

Written by Laurent Picard

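1. Overview

What is Document AI?

Document AI is a platform that lets you extract insights from your documents. At its core, it offers a growing list of document processors (also called parsers or splitters, depending on their functionality).

There are two ways to manage Document AI processors:

manually, from the web console;
programmatically, using the Document AI API.

The following example screenshot shows the processor list both in the web console and from Python code:

In this lab, you will focus on managing Document AI processors programmatically with the Python client library.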

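What you'll learn

How to set up your environment
How to fetch processor types
How to create processors
How to list project processors
How to use processors
How to enable/disable processors
How to manage processor versions
How to delete processors

What you'll need

A Google Cloud project
A browser, such as Chrome or Firefox
Familiarity with Python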

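2. Setup and requirements

Self-paced environment setup

The project name is the display name for this project's participants. It is a character string that is not used by Google APIs. You can update it at any time.
The project ID is unique across all Google Cloud projects and is immutable (it cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't need to care what it is. In most codelabs, you'll need to reference your project ID (typically identified as PROJECT_ID). If you don't like the generated ID, you can generate another random one. Alternatively, you can try your own and see if it's available. The ID can't be changed after this step and remains for the duration of the project.
For your information, there is a third value, the project number, which some APIs use. Learn more about all three of these values in the documentation.

Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab should cost little, if anything. To shut down resources and avoid incurring billing beyond this tutorial, you can delete the resources you created or delete the project. New Google Cloud users are eligible for the $300 USD free trial program.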

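Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this lab you will use Cloud Shell, a command-line environment running in the cloud.

Activate Cloud Shell

From the Cloud Console, click Activate Cloud Shell.

If this is your first time starting Cloud Shell, you're presented with an intermediate screen describing what it is. If so, click Continue.

Provisioning and connecting to Cloud Shell takes only a few minutes.

This virtual machine is loaded with all the development tools needed. It offers a persistent 5 GB home directory and runs in Google Cloud, which greatly enhances network performance and authentication. Much, if not all, of your work in this codelab can be done with a browser.

Once connected to Cloud Shell, you should see that you are authenticated and that the project is set to your project ID.

Run the following command in Cloud Shell to confirm that you are authenticated: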

gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If it is not, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

3. Environment setup

Before you can begin using Document AI, run the following command in Cloud Shell to enable the Document AI API:

gcloud services enable documentai.googleapis.com

You should see something like this:

Operation "operations/..." finished successfully.

Now, you can use Document AI!

Navigate to your home directory:

cd ~

Create a Python virtual environment to isolate the dependencies:

virtualenv venv-docai

Activate the virtual environment:

source venv-docai/bin/activate

Install IPython, the Document AI client library, and python-tabulate (which you'll use to pretty-print the request results):

pip install ipython google-cloud-documentai tabulate

You should see something like this:

...
Installing collected packages: ..., tabulate, ipython, google-cloud-documentai
Successfully installed ... google-cloud-documentai-2.15.0 ...

Now, you're ready to use the Document AI client library!

Set the following environment variables:

export PROJECT_ID=$(gcloud config get-value core/project)

# Choose "us" or "eu"
export API_LOCATION="us"

From now on, all steps should be completed in the same session.

Make sure your environment variables are correctly defined:

echo $PROJECT_ID

echo $API_LOCATION

In the next steps, you'll use IPython, the interactive Python interpreter you just installed. Start a session by running ipython in Cloud Shell:

ipython

You should see something like this:

Python 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

Copy the following code into your IPython session:

import os
from typing import Iterator, MutableSequence, Optional, Sequence, Tuple

import google.cloud.documentai_v1 as docai
from tabulate import tabulate

PROJECT_ID = os.getenv("PROJECT_ID", "")
API_LOCATION = os.getenv("API_LOCATION", "")

assert PROJECT_ID, "PROJECT_ID is undefined"
assert API_LOCATION in ("us", "eu"), "API_LOCATION is incorrect"

# Test processors
document_ocr_display_name = "document-ocr"
form_parser_display_name = "form-parser"

test_processor_display_names_and_types = (
    (document_ocr_display_name, "OCR_PROCESSOR"),
    (form_parser_display_name, "FORM_PARSER_PROCESSOR"),
)

def get_client() -> docai.DocumentProcessorServiceClient:
    client_options = {"api_endpoint": f"{API_LOCATION}-documentai.googleapis.com"}
    return docai.DocumentProcessorServiceClient(client_options=client_options)

def get_parent(client: docai.DocumentProcessorServiceClient) -> str:
    return client.common_location_path(PROJECT_ID, API_LOCATION)

def get_client_and_parent() -> Tuple[docai.DocumentProcessorServiceClient, str]:
    client = get_client()
    parent = get_parent(client)
    return client, parent
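
For reference, the parent value and the API endpoint used by these helpers are plain strings. The following is a minimal sketch of their shape, assuming the standard projects/{project}/locations/{location} resource pattern. The helper names location_path and regional_endpoint are hypothetical and are not part of the client library:

```python
def location_path(project_id: str, location: str) -> str:
    # Same shape as the string returned by client.common_location_path()
    return f"projects/{project_id}/locations/{location}"


def regional_endpoint(location: str) -> str:
    # Same shape as the api_endpoint built in get_client() above
    return f"{location}-documentai.googleapis.com"


# "my-project" is a placeholder project ID used for illustration only
print(location_path("my-project", "eu"))
print(regional_endpoint("eu"))
```

Keeping the endpoint and the parent path on the same location matters here, because both are derived from API_LOCATION.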

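You're ready to make your first request and fetch the processor types.

4. Fetching processor types

Before creating a processor in the next step, fetch the available processor types. You can retrieve this list with fetch_processor_types.

Add the following functions to your IPython session: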

def fetch_processor_types() -> MutableSequence[docai.ProcessorType]:
    client, parent = get_client_and_parent()
    response = client.fetch_processor_types(parent=parent)

    return response.processor_types

def print_processor_types(processor_types: Sequence[docai.ProcessorType]):
    def sort_key(pt):
        return (not pt.allow_creation, pt.category, pt.type_)

    sorted_processor_types = sorted(processor_types, key=sort_key)
    data = processor_type_tabular_data(sorted_processor_types)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processor types: {len(sorted_processor_types)}")

def processor_type_tabular_data(
    processor_types: Sequence[docai.ProcessorType],
) -> Iterator[Tuple[str, str, str, str]]:
    def locations(pt):
        return ", ".join(sorted(loc.location_id for loc in pt.available_locations))

    yield ("type", "category", "allow_creation", "locations")
    yield ("left", "left", "left", "left")
    if not processor_types:
        yield ("-", "-", "-", "-")
        return
    for pt in processor_types:
        yield (pt.type_, pt.category, f"{pt.allow_creation}", locations(pt))

List the processor types:

processor_types = fetch_processor_types()
print_processor_types(processor_types)

You should get something like this:

+--------------------------------------+-------------+----------------+-----------+
| type                                 | category    | allow_creation | locations |
+--------------------------------------+-------------+----------------+-----------+
| CUSTOM_CLASSIFICATION_PROCESSOR      | CUSTOM      | True           | eu, us    |
...
| FORM_PARSER_PROCESSOR                | GENERAL     | True           | eu, us    |
| LAYOUT_PARSER_PROCESSOR              | GENERAL     | True           | eu, us    |
| OCR_PROCESSOR                        | GENERAL     | True           | eu, us    |
| BANK_STATEMENT_PROCESSOR             | SPECIALIZED | True           | eu, us    |
| EXPENSE_PROCESSOR                    | SPECIALIZED | True           | eu, us    |
...
+--------------------------------------+-------------+----------------+-----------+
→ Processor types: 19
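
If you only care about the types you can actually create, the list can be filtered and grouped before printing. This is a minimal sketch that relies only on the allow_creation, category, and type_ attributes already used above. The function name creatable_types_by_category is hypothetical, and the sample objects are stand-ins, not real API responses:

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Dict, Iterable, List


def creatable_types_by_category(processor_types: Iterable) -> Dict[str, List[str]]:
    """Group the types that allow creation by category."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for pt in processor_types:
        if pt.allow_creation:
            grouped[pt.category].append(pt.type_)
    return {category: sorted(types) for category, types in sorted(grouped.items())}


# Stand-in objects mimicking only the three attributes used above;
# HYPOTHETICAL_PROCESSOR is a made-up type for illustration.
sample_types = [
    SimpleNamespace(type_="OCR_PROCESSOR", category="GENERAL", allow_creation=True),
    SimpleNamespace(type_="FORM_PARSER_PROCESSOR", category="GENERAL", allow_creation=True),
    SimpleNamespace(type_="HYPOTHETICAL_PROCESSOR", category="SPECIALIZED", allow_creation=False),
]
print(creatable_types_by_category(sample_types))
```

In your session, the same function should work on the real list, for example creatable_types_by_category(processor_types).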

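Now, you have all the information needed to create processors in the next step.

5. Creating processors

To create a processor, call create_processor with a display name and a processor type.

Add the following function: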

def create_processor(display_name: str, type: str) -> docai.Processor:
    client, parent = get_client_and_parent()
    processor = docai.Processor(display_name=display_name, type_=type)

    return client.create_processor(parent=parent, processor=processor)

Create the test processors:

separator = "=" * 80
for display_name, type in test_processor_display_names_and_types:
    print(separator)
    print(f"Creating {display_name} ({type})...")
    try:
        create_processor(display_name, type)
    except Exception as err:
        print(err)
print(separator)
print("Done")

You should get the following:

================================================================================
Creating document-ocr (OCR_PROCESSOR)...
================================================================================
Creating form-parser (FORM_PARSER_PROCESSOR)...
================================================================================
Done
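
Running the loop above a second time attempts to create the same processors again. One possible way to make the step repeatable is to create only the display names that do not exist yet. This is a minimal sketch; the helper missing_display_names is hypothetical and works on plain strings, so it makes no API call by itself:

```python
from typing import Iterable, List, Tuple


def missing_display_names(
    existing_display_names: Iterable[str],
    wanted: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return the (display_name, type) pairs that are not created yet."""
    existing = set(existing_display_names)
    return [(name, type_) for name, type_ in wanted if name not in existing]


wanted = (
    ("document-ocr", "OCR_PROCESSOR"),
    ("form-parser", "FORM_PARSER_PROCESSOR"),
)
# Pretend that only "document-ocr" already exists in the project
print(missing_display_names(["document-ocr"], wanted))
```

Once list_processors from the next step is defined, the existing names could come from the display_name of each returned processor.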

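You have created new processors!

Next, see how to list the processors.

6. Listing project processors

list_processors returns the list of all the processors belonging to your project.

Add the following functions: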

def list_processors() -> MutableSequence[docai.Processor]:
    client, parent = get_client_and_parent()
    response = client.list_processors(parent=parent)

    return list(response)  # Iterating the pager fetches every page

def print_processors(processors: Optional[Sequence[docai.Processor]] = None):
    def sort_key(processor):
        return processor.display_name

    if processors is None:
        processors = list_processors()
    sorted_processors = sorted(processors, key=sort_key)
    data = processor_tabular_data(sorted_processors)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processors: {len(sorted_processors)}")

def processor_tabular_data(
    processors: Sequence[docai.Processor],
) -> Iterator[Tuple[str, str, str]]:
    yield ("display_name", "type", "state")
    yield ("left", "left", "left")
    if not processors:
        yield ("-", "-", "-")
        return
    for processor in processors:
        yield (processor.display_name, processor.type_, processor.state.name)

Call the functions:

processors = list_processors()
print_processors(processors)

You should get the following:

+--------------+-----------------------+---------+
| display_name | type                  | state   |
+--------------+-----------------------+---------+
| document-ocr | OCR_PROCESSOR         | ENABLED |
| form-parser  | FORM_PARSER_PROCESSOR | ENABLED |
+--------------+-----------------------+---------+
→ Processors: 2

To retrieve a processor by its display name, add the following function:

def get_processor(
    display_name: str,
    processors: Optional[Sequence[docai.Processor]] = None,
) -> Optional[docai.Processor]:
    if processors is None:
        processors = list_processors()
    for processor in processors:
        if processor.display_name == display_name:
            return processor
    return None

Test the function:

processor = get_processor(document_ocr_display_name, processors)

assert processor is not None
print(processor)

You should see something like this:

name: "projects/PROJECT_NUM/locations/LOCATION/processors/PROCESSOR_ID"
type_: "OCR_PROCESSOR"
display_name: "document-ocr"
state: ENABLED
...
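
The name field is a resource path that encodes the project number, the location, and the processor ID. The following is a minimal sketch of how it could be split into its parts with a regular expression, assuming the path shape shown above. The function parse_processor_name is hypothetical; the client library offers comparable static helpers, such as the parse_processor_version_path method used later in this codelab:

```python
import re
from typing import Dict

PROCESSOR_NAME_PATTERN = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/processors/(?P<processor>[^/]+)$"
)


def parse_processor_name(name: str) -> Dict[str, str]:
    """Split a processor resource name into its components."""
    match = PROCESSOR_NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected processor name: {name!r}")
    return match.groupdict()


# "123" and "abc" are placeholder values for illustration only
print(parse_processor_name("projects/123/locations/us/processors/abc"))
```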

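Now, you know how to list your project processors and retrieve them by their display names. Next, see how to use a processor.

7. Using processors

Documents can be processed in two ways:

Synchronously: call process_document to analyze a single document and use the results directly.
Asynchronously: call batch_process_documents to launch batch processing on multiple or large documents.

Your test document (PDF) is a scanned questionnaire completed with handwritten answers. Download it to your working directory, directly from your IPython session: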

!gsutil cp gs://cloud-samples-data/documentai/form.pdf .

Check the contents of your working directory:

!ls

You should have the following:

...  form.pdf  ...  venv-docai  ...

You can use the synchronous process_document method to analyze a local file. Add the following function:

def process_file(
    processor: docai.Processor,
    file_path: str,
    mime_type: str,
) -> docai.Document:
    client = get_client()
    with open(file_path, "rb") as document_file:
        document_content = document_file.read()
    document = docai.RawDocument(content=document_content, mime_type=mime_type)
    request = docai.ProcessRequest(raw_document=document, name=processor.name)

    response = client.process_document(request)

    return response.document

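As your document is a questionnaire, choose the form parser. In addition to extracting the text (printed and handwritten), which all processors do, this general processor detects form fields.

Analyze the document: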

processor = get_processor(form_parser_display_name)
assert processor is not None

file_path = "./form.pdf"
mime_type = "application/pdf"

document = process_file(processor, file_path, mime_type)
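
The MIME type is hardcoded above because the test file is known to be a PDF. For other local files, one option is to guess it from the file extension with the standard library. This is a minimal sketch; the helper guess_mime_type and its fallback value are assumptions for illustration, and the guess says nothing about whether Document AI accepts that file type:

```python
import mimetypes


def guess_mime_type(file_path: str, default: str = "application/octet-stream") -> str:
    """Guess a MIME type from the file extension only (no content sniffing)."""
    mime_type, _encoding = mimetypes.guess_type(file_path)
    return mime_type or default


print(guess_mime_type("./form.pdf"))
```

The result could then be passed as the mime_type argument of process_file.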

All processors run a first Optical Character Recognition (OCR) pass on the document. Review the text detected by the OCR pass:

document.text.split("\n")

You should see something like the following:

['FakeDoc M.D.',
 'HEALTH INTAKE FORM',
 'Please fill out the questionnaire carefully. The information you provide will be used to complete',
 'your health profile and will be kept confidential.',
 'Date:',
 '9/14/19',
 'Name:',
 'Sally Walker',
 'DOB: 09/04/1986',
 'Address: 24 Barney Lane',
 'City: Towaco',
 'State: NJ Zip: 07082',
 'Email: Sally, walker@cmail.com',
 '_Phone #: (906) 917-3486',
 'Gender: F',
 'Marital Status:',
  ...
]

Add the following functions to print out the detected form fields:

def print_form_fields(document: docai.Document):
    sorted_form_fields = form_fields_sorted_by_ocr_order(document)
    data = form_field_tabular_data(sorted_form_fields, document)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Form fields: {len(sorted_form_fields)}")

def form_field_tabular_data(
    form_fields: Sequence[docai.Document.Page.FormField],
    document: docai.Document,
) -> Iterator[Tuple[str, str, str]]:
    yield ("name", "value", "confidence")
    yield ("right", "left", "right")
    if not form_fields:
        yield ("-", "-", "-")
        return
    for form_field in form_fields:
        name_layout = form_field.field_name
        value_layout = form_field.field_value
        name = text_from_layout(name_layout, document)
        value = text_from_layout(value_layout, document)
        confidence = value_layout.confidence
        yield (name, value, f"{confidence:.1%}")

Also add the following utility functions:

def form_fields_sorted_by_ocr_order(
    document: docai.Document,
) -> MutableSequence[docai.Document.Page.FormField]:
    def sort_key(form_field):
        # Sort according to the field name detected position
        text_anchor = form_field.field_name.text_anchor
        return text_anchor.text_segments[0].start_index if text_anchor else 0

    fields = (field for page in document.pages for field in page.form_fields)

    return sorted(fields, key=sort_key)


def text_from_layout(
    layout: docai.Document.Page.Layout,
    document: docai.Document,
) -> str:
    full_text = document.text
    segs = layout.text_anchor.text_segments
    text = "".join(full_text[seg.start_index : seg.end_index] for seg in segs)
    if text.endswith("\n"):
        text = text[:-1]

    return text
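
To see how text anchors relate to the document text, here is a small worked example that applies the same slicing logic to stand-in objects. The text, the indexes, and the helper names (segment, layout, text_from_segments) are made up for illustration and do not come from the API:

```python
from types import SimpleNamespace

# Stand-in for document.text (made-up content)
full_text = "Name:\nSally Walker\n"


def segment(start: int, end: int) -> SimpleNamespace:
    # Mimics a text segment: end_index is exclusive, like a Python slice
    return SimpleNamespace(start_index=start, end_index=end)


def layout(*segments: SimpleNamespace) -> SimpleNamespace:
    # Mimics only the text_anchor.text_segments part of a layout
    return SimpleNamespace(text_anchor=SimpleNamespace(text_segments=list(segments)))


def text_from_segments(layout_like, text: str) -> str:
    # Same slicing logic as text_from_layout above
    segs = layout_like.text_anchor.text_segments
    joined = "".join(text[seg.start_index : seg.end_index] for seg in segs)
    return joined[:-1] if joined.endswith("\n") else joined


name_layout = layout(segment(0, 6))    # covers "Name:\n"
value_layout = layout(segment(6, 19))  # covers "Sally Walker\n"

print(repr(text_from_segments(name_layout, full_text)))
print(repr(text_from_segments(value_layout, full_text)))
```

Each layout stores offsets into the full text instead of its own copy of the string, which is why the document is passed along with the layout.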

Print out the detected form fields:

print_form_fields(document)

You should get a printout like the following:

+-----------------+-------------------------+------------+
|            name | value                   | confidence |
+-----------------+-------------------------+------------+
|           Date: | 9/14/19                 |      83.0% |
|           Name: | Sally Walker            |      87.3% |
|            DOB: | 09/04/1986              |      88.5% |
|        Address: | 24 Barney Lane          |      82.4% |
|           City: | Towaco                  |      90.0% |
|          State: | NJ                      |      89.4% |
|            Zip: | 07082                   |      91.4% |
|          Email: | Sally, walker@cmail.com |      79.7% |
|       _Phone #: | walker@cmail.com        |      93.2% |
|                 | (906                    |            |
|         Gender: | F                       |      88.2% |
| Marital Status: | Single                  |      85.2% |
|     Occupation: | Software Engineer       |      81.5% |
|    Referred By: | None                    |      76.9% |
...
+-----------------+-------------------------+------------+
→ Form fields: 17
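
Confidence scores can also drive a review step, for example by flagging the fields that fall below a threshold. This is a minimal sketch on plain tuples; the helper fields_needing_review and the 0.8 threshold are assumptions chosen for illustration, not a recommendation from the API:

```python
from typing import Iterable, List, Tuple

FormFieldRow = Tuple[str, str, float]  # (name, value, confidence between 0 and 1)


def fields_needing_review(
    rows: Iterable[FormFieldRow],
    threshold: float = 0.8,  # arbitrary cutoff for this sketch
) -> List[FormFieldRow]:
    """Return the rows whose confidence is below the threshold."""
    return [row for row in rows if row[2] < threshold]


# A few rows copied from the sample output above
rows = [
    ("Date:", "9/14/19", 0.830),
    ("Email:", "Sally, walker@cmail.com", 0.797),
    ("Referred By:", "None", 0.769),
]
for name, value, confidence in fields_needing_review(rows):
    print(f"{name} {value!r} ({confidence:.1%})")
```

With these sample rows, the Email and Referred By fields are flagged, which matches the misread email address visible in the table.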

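Review the field names and values that have been detected (PDF). Here is the upper half of the questionnaire:

You have analyzed a form that contains both printed and handwritten text. You also detected its fields with high confidence. As a result, your pixels have been transformed into structured data!

8. Enabling and disabling processors

With disable_processor and enable_processor, you can control whether a processor can be used.

Add the following functions: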

def update_processor_state(processor: docai.Processor, enable_processor: bool):
    client = get_client()
    if enable_processor:
        request = docai.EnableProcessorRequest(name=processor.name)
        operation = client.enable_processor(request)
    else:
        request = docai.DisableProcessorRequest(name=processor.name)
        operation = client.disable_processor(request)
    operation.result()  # Wait for operation to complete

def enable_processor(processor: docai.Processor):
    update_processor_state(processor, True)

def disable_processor(processor: docai.Processor):
    update_processor_state(processor, False)

Disable the form parser processor, and check the state of your processors:

processor = get_processor(form_parser_display_name)
assert processor is not None

disable_processor(processor)
print_processors()

You should get the following:

+--------------+-----------------------+----------+
| display_name | type                  | state    |
+--------------+-----------------------+----------+
| document-ocr | OCR_PROCESSOR         | ENABLED  |
| form-parser  | FORM_PARSER_PROCESSOR | DISABLED |
+--------------+-----------------------+----------+
→ Processors: 2

Re-enable the form parser processor:

enable_processor(processor)
print_processors()

You should get the following:

+--------------+-----------------------+---------+
| display_name | type                  | state   |
+--------------+-----------------------+---------+
| document-ocr | OCR_PROCESSOR         | ENABLED |
| form-parser  | FORM_PARSER_PROCESSOR | ENABLED |
+--------------+-----------------------+---------+
→ Processors: 2

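Next, see how to manage processor versions.

9. Managing processor versions

Processors can be available in multiple versions. See how to use the list_processor_versions and set_default_processor_version methods.

Add the following functions: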

def list_processor_versions(
    processor: docai.Processor,
) -> MutableSequence[docai.ProcessorVersion]:
    client = get_client()
    response = client.list_processor_versions(parent=processor.name)

    return list(response)


def get_sorted_processor_versions(
    processor: docai.Processor,
) -> MutableSequence[docai.ProcessorVersion]:
    def sort_key(processor_version: docai.ProcessorVersion):
        return processor_version.name

    versions = list_processor_versions(processor)

    return sorted(versions, key=sort_key)


def print_processor_versions(processor: docai.Processor):
    versions = get_sorted_processor_versions(processor)
    default_version_name = processor.default_processor_version
    data = processor_versions_tabular_data(versions, default_version_name)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processor versions: {len(versions)}")


def processor_versions_tabular_data(
    versions: Sequence[docai.ProcessorVersion],
    default_version_name: str,
) -> Iterator[Tuple[str, str, str]]:
    yield ("version", "display name", "default")
    yield ("left", "left", "left")
    if not versions:
        yield ("-", "-", "-")
        return
    for version in versions:
        mapping = docai.DocumentProcessorServiceClient.parse_processor_version_path(
            version.name
        )
        processor_version = mapping["processor_version"]
        is_default = "Y" if version.name == default_version_name else ""
        yield (processor_version, version.display_name, is_default)

List the available versions for the OCR processor:

processor = get_processor(document_ocr_display_name)
assert processor is not None
print_processor_versions(processor)

You get the following processor versions:

+--------------------------------+--------------------------+---------+
| version                        | display name             | default |
+--------------------------------+--------------------------+---------+
| pretrained-ocr-v1.0-2020-09-23 | Google Stable            |         |
| pretrained-ocr-v1.1-2022-09-12 | Google Release Candidate |         |
| pretrained-ocr-v1.2-2022-11-10 | Google Release Candidate |         |
| pretrained-ocr-v2.0-2023-06-02 | Google Stable            | Y       |
| pretrained-ocr-v2.1-2024-08-07 | Google Release Candidate |         |
+--------------------------------+--------------------------+---------+
→ Processor versions: 5

Now, add a function to change the default processor version:

def set_default_processor_version(processor: docai.Processor, version_name: str):
    client = get_client()
    request = docai.SetDefaultProcessorVersionRequest(
        processor=processor.name,
        default_processor_version=version_name,
    )

    operation = client.set_default_processor_version(request)
    operation.result()  # Wait for operation to complete

Switch to the latest processor version:

processor = get_processor(document_ocr_display_name)
assert processor is not None
versions = get_sorted_processor_versions(processor)

new_version = versions[-1]  # Latest version
set_default_processor_version(processor, new_version.name)

# Update the processor info
processor = get_processor(document_ocr_display_name)
assert processor is not None
print_processor_versions(processor)

You get the new version configuration:

+--------------------------------+--------------------------+---------+
| version                        | display name             | default |
+--------------------------------+--------------------------+---------+
| pretrained-ocr-v1.0-2020-09-23 | Google Stable            |         |
| pretrained-ocr-v1.1-2022-09-12 | Google Release Candidate |         |
| pretrained-ocr-v1.2-2022-11-10 | Google Release Candidate |         |
| pretrained-ocr-v2.0-2023-06-02 | Google Stable            |         |
| pretrained-ocr-v2.1-2024-08-07 | Google Release Candidate | Y       |
+--------------------------------+--------------------------+---------+
→ Processor versions: 5
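
The versions[-1] lookup above relies on the lexicographic order of version names, which happens to match the release order in this list. A more explicit alternative is to parse the version number and release date out of each ID. This is a minimal sketch, assuming the pretrained-<model>-v<major>.<minor>-<YYYY-MM-DD> shape seen in the sample output. The function parse_version_id is hypothetical, and IDs that follow another scheme yield None:

```python
import re
from datetime import date
from typing import Optional, Tuple

VERSION_ID_PATTERN = re.compile(
    r"^pretrained-(?P<model>.+)-v(?P<major>\d+)\.(?P<minor>\d+)"
    r"-(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})$"
)


def parse_version_id(version_id: str) -> Optional[Tuple[str, Tuple[int, int], date]]:
    """Return (model, (major, minor), release_date), or None if the ID differs."""
    match = VERSION_ID_PATTERN.match(version_id)
    if match is None:
        return None
    release_date = date(int(match["year"]), int(match["month"]), int(match["day"]))
    return match["model"], (int(match["major"]), int(match["minor"])), release_date


# IDs copied from the sample output above
version_ids = [
    "pretrained-ocr-v2.0-2023-06-02",
    "pretrained-ocr-v1.0-2020-09-23",
    "pretrained-ocr-v2.1-2024-08-07",
]
parsed = {vid: parse_version_id(vid) for vid in version_ids}
# Compare by release date, then by numeric version
latest = max(parsed, key=lambda vid: (parsed[vid][2], parsed[vid][1]))
print(latest)
```

Comparing numeric tuples and dates avoids surprises such as a hypothetical v10.0 sorting before v2.0 in plain string order.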

Next, see the final processor management method: deletion.

10. Deleting processors

Finally, see how to use the delete_processor method.

Add the following function:

def delete_processor(processor: docai.Processor):
    client = get_client()
    operation = client.delete_processor(name=processor.name)
    operation.result()  # Wait for operation to complete

Delete your test processors:

processors_to_delete = [dn for dn, _ in test_processor_display_names_and_types]
print("Deleting processors...")

for processor in list_processors():
    if processor.display_name not in processors_to_delete:
        continue
    print(f"  Deleting {processor.display_name}...")
    delete_processor(processor)

print("Done\n")
print_processors()

You should get the following:

Deleting processors...
  Deleting form-parser...
  Deleting document-ocr...
Done

+--------------+------+-------+
| display_name | type | state |
+--------------+------+-------+
| -            | -    | -     |
+--------------+------+-------+
→ Processors: 0

You've covered all the processor management methods! You're almost done...

11. Congratulations!

You learned how to manage Document AI processors using Python!

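Clean up

To clean up your development environment, from Cloud Shell:

If you're still in your IPython session, go back to the shell: exit
Stop using the Python virtual environment: deactivate
Delete your virtual environment folder: cd ~ ; rm -rf ./venv-docai

To delete your Google Cloud project, from Cloud Shell:

Retrieve your current project ID: PROJECT_ID=$(gcloud config get-value core/project)
Make sure this is the project you want to delete: echo $PROJECT_ID
Delete the project: gcloud projects delete $PROJECT_ID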

Learn more

Try Document AI in your browser: https://cloud.google.com/document-ai/docs/drag-and-drop
Document AI processor details: https://cloud.google.com/document-ai/docs/processors-list
Python on Google Cloud: https://cloud.google.com/python
Cloud Client Libraries for Python: https://github.com/googleapis/google-cloud-python

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.
