Optical Character Recognition (OCR) with Document AI (Python)

About this codelab

Last updated: Jun 20, 2023

Written by Holt Skinner


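1. Overview

What is Document AI?

Document AI is a document understanding solution that takes unstructured data (for example documents, emails, invoices, and forms) and makes the data easier to understand, analyze, and consume. The API provides structure through content classification, entity extraction, advanced search, and more.

In this lab, you will learn how to perform Optical Character Recognition using the Document AI API with Python.

We will use a PDF file of the classic novel "Winnie the Pooh" by A.A. Milne, which recently entered the public domain in the United States. This file was scanned and digitized by Google Books.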

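What you'll learn

How to enable the Document AI API
How to authenticate API requests
How to install the client library for Python
How to use the online and batch processing APIs
How to parse text from a PDF file

What you'll need

A Google Cloud project
A browser, such as Chrome or Firefox
Familiarity with Python (3.9+)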

2. Setup and requirements

Self-paced environment setup

In the Cloud console, select an existing project or create a new one, then look up its project ID.

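Remember the project ID, a name that is unique across all Google Cloud projects. (The project ID shown in the original screenshot is already taken and will not work for you, sorry!) You must provide this ID later as PROJECT_ID.

Next, you must enable billing in the Cloud console in order to use Google Cloud resources.

Be sure to follow the instructions in the "Cleanup" section. It explains how to shut down resources so that you don't incur charges beyond this tutorial. New users of Google Cloud are eligible for the $300 USD free trial program.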

Start Cloud Shell

While you can operate Google Cloud remotely from your laptop, this codelab uses Google Cloud Shell, a command-line environment running in the Cloud.

Activate Cloud Shell

In the Cloud console, click Activate Cloud Shell.

If you have never started Cloud Shell before, you are shown an intermediate screen (below the fold) that describes what it is. If that happens, click Continue; the screen is shown only once.

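Provisioning and connecting to Cloud Shell should take only a few minutes.

Cloud Shell gives you terminal access to a virtual machine hosted in the cloud. The virtual machine includes all the development tools you need. It offers a persistent 5 GB home directory and runs in Google Cloud, which greatly improves network performance and authentication. Most of your work in this codelab can be done with just a browser.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is set to your project ID.

Run the following command in Cloud Shell to confirm that you are authenticated:

gcloud auth list

Command output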

 Credentialed Accounts
ACTIVE  ACCOUNT
*      <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

Run the following command to confirm that gcloud is set to your project:

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If it is not, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

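3. Enable the Document AI API

Before you can begin using Document AI, you must enable the API. You can do this with the gcloud command-line interface or the Cloud console.

Using the `gcloud` CLI

If you are not using Cloud Shell, follow the steps in Install the gcloud CLI on your local machine.
You can enable the APIs with the following gcloud command.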

gcloud services enable documentai.googleapis.com storage.googleapis.com

You should see something like this:

Operation "operations/..." finished successfully.

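Using the Cloud console

Open the Cloud console in your browser.

Using the search bar at the top of the console, search for "Document AI API", then click Enable to use the API in your Google Cloud project.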

Repeat the previous step for the Google Cloud Storage API.

Now you can use Document AI!

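4. Create and test a processor

You must first create an instance of the Document OCR processor that will perform the extraction. You can do this with the Cloud console or the Processor Management API.

Cloud console

In the console, go to the Document AI Platform Overview
Click Explore Processors and select Document OCR
Give it the name codelab-ocr (or something else you will remember) and select the closest region in the list.
Click Create to create your processor
Copy your processor ID. You must use this ID in your code later.

You can test your processor in the console by uploading a document. Click Upload Test Document and select a document to parse.

You can download the PDF file below, which contains the first 3 pages of the novel.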

The output should show the text parsed from the book.

Python client library

Follow this codelab to learn how to manage Document AI processors with the Python client library:

Managing Document AI processors with Python - Codelab

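5. Authenticate API requests

To make requests to the Document AI API, you must use a service account. A service account belongs to your project, and the Python client library uses it to make API requests. Like any other user account, a service account is represented by an email address. In this section, you will use the Cloud SDK to create a service account and then create the credentials you need to authenticate as that service account.

First, open Cloud Shell and set an environment variable with your PROJECT_ID, which you will use throughout this codelab: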

export GOOGLE_CLOUD_PROJECT=$(gcloud config get-value core/project)

Next, create a new service account to access the Document AI API with this command:

gcloud iam service-accounts create my-docai-sa \
  --display-name "my-docai-service-account"

Next, give your service account permissions to access Document AI and Cloud Storage in your project.

gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} \
    --member="serviceAccount:my-docai-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com" \
    --role="roles/documentai.admin"

gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} \
    --member="serviceAccount:my-docai-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com" \
    --role="roles/storage.admin"

gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} \
    --member="serviceAccount:my-docai-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com" \
    --role="roles/serviceusage.serviceUsageConsumer"

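Next, create the credentials that your Python code uses to log in as the new service account. The following command creates the credentials and saves them as the JSON file ~/key.json: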

gcloud iam service-accounts keys create ~/key.json \
  --iam-account  my-docai-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com

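Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which the library uses to find your credentials. To read more about this form of authentication, see the related guide. Set the environment variable to the full path of the credentials JSON file you created, using this command: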

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
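
Before calling the API, it can help to confirm that the variable really points at a service account key. The following is a minimal sketch that uses only the Python standard library; the helper name check_credentials_file is hypothetical, and the check relies on the "type" and "client_email" fields found in service account key files.

```python
import json
import os
from pathlib import Path


def check_credentials_file(path: str) -> str:
    """Return the service account email stored in a key file.

    Raises ValueError if the file is missing or is not a service account key.
    (Hypothetical helper, not part of any Google library.)
    """
    key_path = Path(path).expanduser()
    if not key_path.is_file():
        raise ValueError(f"Credentials file not found: {key_path}")

    with key_path.open(encoding="utf-8") as handle:
        key_data = json.load(handle)

    # Service account key files carry a "type" and a "client_email" field.
    email = key_data.get("client_email")
    if key_data.get("type") != "service_account" or not email:
        raise ValueError(f"{key_path} does not look like a service account key")
    return email


if __name__ == "__main__":
    env_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if env_path is None:
        print("GOOGLE_APPLICATION_CREDENTIALS is not set")
    else:
        try:
            print("Using service account:", check_credentials_file(env_path))
        except ValueError as error:
            print(error)
```

If the printed email does not match my-docai-sa@<PROJECT_ID>.iam.gserviceaccount.com, the variable is probably pointing at a different key file.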

6. Install the client library

Install the Python client libraries for Document AI, Cloud Storage, and Document AI Toolbox:

pip3 install --upgrade google-cloud-documentai
pip3 install --upgrade google-cloud-storage
pip3 install --upgrade google-cloud-documentai-toolbox

You should see something like this:

...
Installing collected packages: google-cloud-documentai
Successfully installed google-cloud-documentai-2.15.0
.
.
Installing collected packages: google-cloud-storage
Successfully installed google-cloud-storage-2.9.0
.
.
Installing collected packages: google-cloud-documentai-toolbox
Successfully installed google-cloud-documentai-toolbox-0.6.0a0

Now you're ready to use the Document AI API!
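
If you want to confirm from Python which versions ended up in your environment, the sketch below reads the installed package metadata with importlib.metadata from the standard library. The helper name installed_versions is hypothetical; the package names are the three installed above.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# The three distributions installed with pip3 in this step.
PACKAGES = (
    "google-cloud-documentai",
    "google-cloud-storage",
    "google-cloud-documentai-toolbox",
)


def installed_versions(packages: Iterable[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for package, version in installed_versions().items():
    print(f"{package}: {version or 'not installed'}")
```

A package reported as "not installed" usually means pip3 installed into a different Python environment than the one running the script.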

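7. Download the sample PDF

We have a sample document that contains the first 3 pages of the novel.

You can download the PDF using the following link, then upload it to your Cloud Shell instance.

You can also download it from our public Google Cloud Storage bucket using gsutil.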

gsutil cp gs://cloud-samples-data/documentai/codelabs/ocr/Winnie_the_Pooh_3_Pages.pdf .

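8. Make an online processing request

In this step, you process the first 3 pages of the novel with the online processing (synchronous) API. This method is best suited for smaller documents that are stored locally. See the full processor list for the maximum number of pages and the maximum file size for each processor type.

Use the Cloud Shell Editor, or a text editor on your local machine, to create a file called online_processing.py with the code below.

Replace YOUR_PROJECT_ID, YOUR_PROJECT_LOCATION, YOUR_PROCESSOR_ID, and FILE_PATH with values appropriate for your environment.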

online_processing.py

from google.api_core.client_options import ClientOptions
from google.cloud import documentai


PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_PROJECT_LOCATION"  # Format is 'us' or 'eu'
PROCESSOR_ID = "YOUR_PROCESSOR_ID"  # Create processor in Cloud Console

# The local file in your current working directory
FILE_PATH = "Winnie_the_Pooh_3_Pages.pdf"
# Refer to https://cloud.google.com/document-ai/docs/file-types
# for supported file types
MIME_TYPE = "application/pdf"

# Instantiates a client
docai_client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{LOCATION}-documentai.googleapis.com")
)

# The full resource name of the processor, e.g.:
# projects/project-id/locations/location/processors/processor-id
# You must create new processors in the Cloud Console first
RESOURCE_NAME = docai_client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

# Read the file into memory
with open(FILE_PATH, "rb") as image:
    image_content = image.read()

# Load Binary Data into Document AI RawDocument Object
raw_document = documentai.RawDocument(content=image_content, mime_type=MIME_TYPE)

# Configure the process request
request = documentai.ProcessRequest(name=RESOURCE_NAME, raw_document=raw_document)

# Use the Document AI client to process the sample form
result = docai_client.process_document(request=request)

document_object = result.document
print("Document processing complete.")
print(f"Text: {document_object.text}")

Run the code, which extracts the text and prints it to the console.

If you use our sample document, you should see the following output:

Document processing complete.
Text: CHAPTER I
IN WHICH We Are Introduced to
Winnie-the-Pooh and Some
Bees, and the Stories Begin
Here is Edward Bear, coming
downstairs now, bump, bump, bump, on the back
of his head, behind Christopher Robin. It is, as far
as he knows, the only way of coming downstairs,
but sometimes he feels that there really is another
way, if only he could stop bumping for a moment
and think of it. And then he feels that perhaps there
isn't. Anyhow, here he is at the bottom, and ready
to be introduced to you. Winnie-the-Pooh.
When I first heard his name, I said, just as you
are going to say, "But I thought he was a boy?"
"So did I," said Christopher Robin.
"Then you can't call him Winnie?"
"I don't."
"But you said "

...

Digitized by
Google
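
The online sample derives two strings from your settings: the regional API endpoint and the processor's full resource name. The sketch below rebuilds both by hand so you can inspect them before making a request. The helper names and the example values ("my-project", "abc123") are hypothetical; the string formats are the ones shown in the sample's code and comments.

```python
def api_endpoint(location: str) -> str:
    """Regional endpoint, in the same format passed to ClientOptions in the sample."""
    return f"{location}-documentai.googleapis.com"


def processor_name(project_id: str, location: str, processor_id: str) -> str:
    """Full processor resource name, following the format in the sample's comments."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


# Hypothetical values, for illustration only.
print(api_endpoint("eu"))
print(processor_name("my-project", "eu", "abc123"))
```

Printing these values is a quick way to catch a location mismatch, for example a processor created in "eu" while the endpoint is built from "us".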

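9. Make a batch processing request

Now, suppose that you want to read the text of the entire novel.

Online processing limits the number of pages and the file size that can be sent, and it allows only one document file per API call.
Batch processing lets you process larger files, or multiple files, asynchronously.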
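
The rule above can be expressed as a small decision helper. This is only a sketch: the function name is hypothetical, it looks at page counts but not file size, and ASSUMED_ONLINE_PAGE_LIMIT is a placeholder value, so check the processor list for the real limits of your processor type.

```python
from typing import List

# Placeholder only; the actual limit depends on the processor type.
ASSUMED_ONLINE_PAGE_LIMIT = 15


def choose_processing_mode(
    page_counts: List[int], online_page_limit: int = ASSUMED_ONLINE_PAGE_LIMIT
) -> str:
    """Return "online" for one small document, otherwise "batch".

    page_counts holds the number of pages of each document you want to process.
    """
    if not page_counts:
        raise ValueError("At least one document is required")
    # Online processing takes a single document per call, within the page limit.
    if len(page_counts) == 1 and page_counts[0] <= online_page_limit:
        return "online"
    return "batch"


print(choose_processing_mode([3]))        # the 3-page sample PDF
print(choose_processing_mode([1, 1, 1]))  # several separate files
```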

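In this step, we will process the entire "Winnie the Pooh" novel with the Document AI Batch Processing API and write the text to a Google Cloud Storage bucket.

Batch processing uses long-running operations to manage requests asynchronously, so we have to make the request and retrieve the output differently than with online processing. However, the output is in the same Document object format whether you use online or batch processing.

This step shows how to provide specific documents for Document AI to process. A later step shows how to process an entire directory of documents.

Upload the PDF to Cloud Storage

The batch_process_documents() method currently accepts files from Google Cloud Storage. You can refer to documentai_v1.types.BatchProcessRequest for more information on the object structure.

For this example, you can read the file directly from our sample bucket.

You can also copy the file into your own bucket using gsutil.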

gsutil cp gs://cloud-samples-data/documentai/codelabs/ocr/Winnie_the_Pooh.pdf gs://YOUR_BUCKET_NAME/

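Alternatively, you can download the sample file of the novel from the link below and upload it to your own bucket.

You also need a GCS bucket to store the output of the API.

Follow the Cloud Storage documentation to learn how to create a bucket.

Using the `batch_process_documents()` method

Create a file called batch_processing.py and use the code below.

Replace YOUR_PROJECT_ID, YOUR_PROCESSOR_LOCATION, YOUR_PROCESSOR_ID, YOUR_INPUT_URI, and YOUR_OUTPUT_URI with the appropriate values for your environment.

Make sure that YOUR_INPUT_URI points directly to the PDF file, for example: gs://cloud-samples-data/documentai/codelabs/ocr/Winnie_the_Pooh.pdf.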

batch_processing.py

"""
Makes a Batch Processing Request to Document AI
"""

import re

from google.api_core.client_options import ClientOptions
from google.api_core.exceptions import InternalServerError
from google.api_core.exceptions import RetryError
from google.cloud import documentai
from google.cloud import storage

# TODO(developer): Fill these variables before running the sample.
project_id = "YOUR_PROJECT_ID"
location = "YOUR_PROCESSOR_LOCATION"  # Format is "us" or "eu"
processor_id = "YOUR_PROCESSOR_ID"  # Create processor before running sample
gcs_output_uri = "YOUR_OUTPUT_URI"  # Must end with a trailing slash `/`. Format: gs://bucket/directory/subdirectory/
processor_version_id = (
    "YOUR_PROCESSOR_VERSION_ID"  # Optional. Example: pretrained-ocr-v1.0-2020-09-23
)

# TODO(developer): If `gcs_input_uri` is a single file, `mime_type` must be specified.
gcs_input_uri = "YOUR_INPUT_URI"  # Format: `gs://bucket/directory/file.pdf` or `gs://bucket/directory/`
input_mime_type = "application/pdf"
field_mask = "text,entities,pages.pageNumber"  # Optional. The fields to return in the Document object.


def batch_process_documents(
    project_id: str,
    location: str,
    processor_id: str,
    gcs_input_uri: str,
    gcs_output_uri: str,
    processor_version_id: str = None,
    input_mime_type: str = None,
    field_mask: str = None,
    timeout: int = 400,
):
    # You must set the api_endpoint if you use a location other than "us".
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    if not gcs_input_uri.endswith("/") and "." in gcs_input_uri:
        # Specify specific GCS URIs to process individual documents
        gcs_document = documentai.GcsDocument(
            gcs_uri=gcs_input_uri, mime_type=input_mime_type
        )
        # Load GCS Input URI into a List of document files
        gcs_documents = documentai.GcsDocuments(documents=[gcs_document])
        input_config = documentai.BatchDocumentsInputConfig(gcs_documents=gcs_documents)
    else:
        # Specify a GCS URI Prefix to process an entire directory
        gcs_prefix = documentai.GcsPrefix(gcs_uri_prefix=gcs_input_uri)
        input_config = documentai.BatchDocumentsInputConfig(gcs_prefix=gcs_prefix)

    # Cloud Storage URI for the Output Directory
    gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
        gcs_uri=gcs_output_uri, field_mask=field_mask
    )

    # Where to write results
    output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)

    if processor_version_id:
        # The full resource name of the processor version, e.g.:
        # projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}
        name = client.processor_version_path(
            project_id, location, processor_id, processor_version_id
        )
    else:
        # The full resource name of the processor, e.g.:
        # projects/{project_id}/locations/{location}/processors/{processor_id}
        name = client.processor_path(project_id, location, processor_id)

    request = documentai.BatchProcessRequest(
        name=name,
        input_documents=input_config,
        document_output_config=output_config,
    )

    # BatchProcess returns a Long Running Operation (LRO)
    operation = client.batch_process_documents(request)

    # Continually polls the operation until it is complete.
    # This could take some time for larger files
    # Format: projects/{project_id}/locations/{location}/operations/{operation_id}
    try:
        print(f"Waiting for operation {operation.operation.name} to complete...")
        operation.result(timeout=timeout)
    # Catch exception when operation doesn't finish before timeout
    except (RetryError, InternalServerError) as e:
        print(e.message)

    # NOTE: Can also use callbacks for asynchronous processing
    #
    # def my_callback(future):
    #   result = future.result()
    #
    # operation.add_done_callback(my_callback)

    # Once the operation is complete,
    # get output document information from operation metadata
    metadata = documentai.BatchProcessMetadata(operation.metadata)

    if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:
        raise ValueError(f"Batch Process Failed: {metadata.state_message}")

    storage_client = storage.Client()

    print("Output files:")
    # One process per Input Document
    for process in list(metadata.individual_process_statuses):
        # output_gcs_destination format: gs://BUCKET/PREFIX/OPERATION_NUMBER/INPUT_FILE_NUMBER/
        # The Cloud Storage API requires the bucket name and URI prefix separately
        matches = re.match(r"gs://(.*?)/(.*)", process.output_gcs_destination)
        if not matches:
            print(
                "Could not parse output GCS destination:",
                process.output_gcs_destination,
            )
            continue

        output_bucket, output_prefix = matches.groups()

        # Get List of Document Objects from the Output Bucket
        output_blobs = storage_client.list_blobs(output_bucket, prefix=output_prefix)

        # Document AI may output multiple JSON files per source file
        for blob in output_blobs:
            # Document AI should only output JSON files to GCS
            if blob.content_type != "application/json":
                print(
                    f"Skipping non-supported file: {blob.name} - Mimetype: {blob.content_type}"
                )
                continue

            # Download JSON File as bytes object and convert to Document Object
            print(f"Fetching {blob.name}")
            document = documentai.Document.from_json(
                blob.download_as_bytes(), ignore_unknown_fields=True
            )

            # For a full list of Document object attributes, please reference this page:
            # https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document

            # Read the text recognition output from the processor
            print("The document contains the following text:")
            print(document.text)


if __name__ == "__main__":
    batch_process_documents(
        project_id=project_id,
        location=location,
        processor_id=processor_id,
        gcs_input_uri=gcs_input_uri,
        gcs_output_uri=gcs_output_uri,
        input_mime_type=input_mime_type,
        field_mask=field_mask,
    )

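Run the code, and you should see the full text of the novel extracted and printed in your console.

This may take some time to complete, because the file is much larger than the previous example. (Oh, bother...)

However, with the Batch Processing API you receive an operation ID, which can be used to get the output from GCS once the task has completed.

Your output should look something like this: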

Waiting for operation projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_NUMBER to complete...
Document processing complete.
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-0.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-1.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-10.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-11.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-12.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-13.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-14.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-15.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-16.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-17.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-18.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-2.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-3.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-4.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-5.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-6.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-7.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-8.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-9.json

This is a reproduction of a library book that was digitized
by Google as part of an ongoing effort to preserve the
information in books and make it universally accessible.
TM
Google books
https://books.google.com

.....

He nodded and went
out ... and in a moment
I heard Winnie-the-Pooh
-bump, bump, bump-go-ing up the stairs behind
him.
Digitized by
Google
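
The "Fetching" lines above are listed in string order (…-1.json, …-10.json, …-11.json, …, …-2.json), not numeric order. If you need the output files in numeric order, one option is to sort the blob names by their trailing number before downloading them. The sketch below assumes that the number before ".json" reflects the shard order; the helper names are hypothetical, and the file names are taken from the output above.

```python
import re
from typing import List

# Matches the numeric suffix in names such as "Winnie_the_Pooh-10.json".
_SHARD_SUFFIX = re.compile(r"-(\d+)\.json$")


def shard_index(blob_name: str) -> int:
    """Extract the trailing shard number from an output blob name."""
    match = _SHARD_SUFFIX.search(blob_name)
    if match is None:
        raise ValueError(f"No shard number found in: {blob_name}")
    return int(match.group(1))


def sort_shards(blob_names: List[str]) -> List[str]:
    """Return blob names ordered by shard number instead of string order."""
    return sorted(blob_names, key=shard_index)


names = [
    "docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-1.json",
    "docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-10.json",
    "docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-2.json",
]
for name in sort_shards(names):
    print(name)
```

In batch_processing.py, the same key could be applied to the blobs returned by list_blobs(), for example by sorting on shard_index(blob.name).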

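10. Make a batch processing request for a directory

Sometimes you may want to process an entire directory of documents without listing each document individually. The batch_process_documents() method accepts either a list of specific documents or a directory path as input.

This step shows how to process a full directory of document files. Most of the code works the same as in the previous step; the only difference is the GCS URI sent with the BatchProcessRequest.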
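
The batch sample decides between the two input styles with a simple string test on the URI. The sketch below isolates that test so you can see how a given URI will be treated before you run a batch job. The function name and the "gs://" validation are additions for illustration; the file-versus-prefix condition is the one used in batch_processing.py.

```python
def describe_gcs_input(gcs_input_uri: str) -> str:
    """Return "document" or "prefix", using the same test as the batch sample."""
    if not gcs_input_uri.startswith("gs://"):
        raise ValueError(f"Not a Cloud Storage URI: {gcs_input_uri}")
    # No trailing slash and a "." somewhere in the URI: treated as a single file.
    if not gcs_input_uri.endswith("/") and "." in gcs_input_uri:
        return "document"
    # Anything else is sent as a prefix, that is, a whole directory.
    return "prefix"


print(describe_gcs_input("gs://cloud-samples-data/documentai/codelabs/ocr/Winnie_the_Pooh.pdf"))
print(describe_gcs_input("gs://cloud-samples-data/documentai/codelabs/ocr/multi-document/"))
```

Because the test only looks for a "." anywhere in the URI, a directory such as gs://my.bucket/dir (no trailing slash, dot in the bucket name) would be treated as a single document, so it is safest to end directory URIs with "/".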

Our sample bucket has a directory that contains multiple pages of the novel in separate files.

gs://cloud-samples-data/documentai/codelabs/ocr/multi-document/

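You can read the files directly or copy them into your own Cloud Storage bucket.

Rerun the code from the previous step, replacing YOUR_INPUT_URI with a directory in Cloud Storage.

Run the code, and you should see the text extracted from all of the document files in the Cloud Storage directory.

Your output should look something like this: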

Waiting for operation projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_NUMBER to complete...
Document processing complete.
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh_Page_0-0.json
Fetching docai-output/OPERATION_NUMBER/1/Winnie_the_Pooh_Page_1-0.json
Fetching docai-output/OPERATION_NUMBER/2/Winnie_the_Pooh_Page_10-0.json
Fetching docai-output/OPERATION_NUMBER/3/Winnie_the_Pooh_Page_12-0.json
Fetching docai-output/OPERATION_NUMBER/4/Winnie_the_Pooh_Page_16-0.json
Fetching docai-output/OPERATION_NUMBER/5/Winnie_the_Pooh_Page_7-0.json

Introduction
(I₂
F YOU happen to have read another
book about Christopher Robin, you may remember
th
CHAPTER I
IN WHICH We Are Introduced to
Winnie-the-Pooh and Some
Bees, and the Stories Begin
HERE is
10
WINNIE-THE-POOH
"I wonder if you've got such a thing as a balloon
about you?"
"A balloon?"
"Yes, 
12
WINNIE-THE-POOH
and you took your gun with you, just in case, as
you always did, and Winnie-the-P
16
WINNIE-THE-POOH
this song, and one bee sat down on the nose of the
cloud for a moment, and then g
WE ARE INTRODUCED
7
"Oh, help!" said Pooh, as he dropped ten feet on
the branch below him.
"If only

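11. Handle the batch processing response with Document AI Toolbox

Batch processing takes several steps to complete because of its integration with Cloud Storage. The Document output may also be "sharded" into multiple .json files, depending on the size of the input document.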
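
One of those steps in the earlier batch sample is splitting each output_gcs_destination into a bucket name and a prefix, because the Cloud Storage API takes them separately. As a minimal sketch, the helper below (hypothetical name split_gcs_uri) wraps the same regular expression used in batch_processing.py; the example URI uses the placeholder names from this codelab.

```python
import re
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split "gs://bucket/some/prefix/" into ("bucket", "some/prefix/")."""
    # Same pattern as in batch_processing.py: lazy bucket group, then the rest.
    match = re.match(r"gs://(.*?)/(.*)", gcs_uri)
    if match is None:
        raise ValueError(f"Could not parse GCS URI: {gcs_uri}")
    bucket, prefix = match.groups()
    return bucket, prefix


print(split_gcs_uri("gs://YOUR_BUCKET_NAME/docai-output/OPERATION_NUMBER/0/"))
```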

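The Document AI Toolbox Python SDK was created to simplify post-processing and other common tasks with Document AI. This library is intended to supplement the Document AI client library, not replace it. See the reference documentation for the full specification.

This step shows how to make a batch processing request and retrieve the output using Document AI Toolbox.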

batch_processing_toolbox.py

"""
Makes a Batch Processing Request to Document AI using Document AI Toolbox
"""

from google.api_core.client_options import ClientOptions
from google.cloud import documentai
from google.cloud import documentai_toolbox

# TODO(developer): Fill these variables before running the sample.
project_id = "YOUR_PROJECT_ID"
location = "YOUR_PROCESSOR_LOCATION"  # Format is "us" or "eu"
processor_id = "YOUR_PROCESSOR_ID"  # Create processor before running sample
gcs_output_uri = "YOUR_OUTPUT_URI"  # Must end with a trailing slash `/`. Format: gs://bucket/directory/subdirectory/
processor_version_id = (
    "YOUR_PROCESSOR_VERSION_ID"  # Optional. Example: pretrained-ocr-v1.0-2020-09-23
)

# TODO(developer): If `gcs_input_uri` is a single file, `mime_type` must be specified.
gcs_input_uri = "YOUR_INPUT_URI"  # Format: `gs://bucket/directory/file.pdf` or `gs://bucket/directory/`
input_mime_type = "application/pdf"
field_mask = "text,entities,pages.pageNumber"  # Optional. The fields to return in the Document object.


def batch_process_toolbox(
    project_id: str,
    location: str,
    processor_id: str,
    gcs_input_uri: str,
    gcs_output_uri: str,
    processor_version_id: str = None,
    input_mime_type: str = None,
    field_mask: str = None,
):
    # You must set the api_endpoint if you use a location other than "us".
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    if not gcs_input_uri.endswith("/") and "." in gcs_input_uri:
        # Specify specific GCS URIs to process individual documents
        gcs_document = documentai.GcsDocument(
            gcs_uri=gcs_input_uri, mime_type=input_mime_type
        )
        # Load GCS Input URI into a List of document files
        gcs_documents = documentai.GcsDocuments(documents=[gcs_document])
        input_config = documentai.BatchDocumentsInputConfig(gcs_documents=gcs_documents)
    else:
        # Specify a GCS URI Prefix to process an entire directory
        gcs_prefix = documentai.GcsPrefix(gcs_uri_prefix=gcs_input_uri)
        input_config = documentai.BatchDocumentsInputConfig(gcs_prefix=gcs_prefix)

    # Cloud Storage URI for the Output Directory
    gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
        gcs_uri=gcs_output_uri, field_mask=field_mask
    )

    # Where to write results
    output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)

    if processor_version_id:
        # The full resource name of the processor version, e.g.:
        # projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}
        name = client.processor_version_path(
            project_id, location, processor_id, processor_version_id
        )
    else:
        # The full resource name of the processor, e.g.:
        # projects/{project_id}/locations/{location}/processors/{processor_id}
        name = client.processor_path(project_id, location, processor_id)

    request = documentai.BatchProcessRequest(
        name=name,
        input_documents=input_config,
        document_output_config=output_config,
    )

    # BatchProcess returns a Long Running Operation (LRO)
    operation = client.batch_process_documents(request)

    # Operation Name Format: projects/{project_id}/locations/{location}/operations/{operation_id}
    documents = documentai_toolbox.document.Document.from_batch_process_operation(
        location=location, operation_name=operation.operation.name
    )

    for document in documents:
        # Read the text recognition output from the processor
        print("The document contains the following text:")
        # Truncated at 100 characters for brevity
        print(document.text[:100])


if __name__ == "__main__":
    batch_process_toolbox(
        project_id=project_id,
        location=location,
        processor_id=processor_id,
        gcs_input_uri=gcs_input_uri,
        gcs_output_uri=gcs_output_uri,
        input_mime_type=input_mime_type,
        field_mask=field_mask,
    )

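12. Congratulations

You have successfully used Document AI to extract text from a novel using online processing, batch processing, and Document AI Toolbox.

We encourage you to experiment with other documents and explore the other processors available on the platform.

Cleanup

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:

In the Cloud console, go to the Manage resources page.
In the project list, select your project, then click Delete.
In the dialog, type the project ID, then click Shut down to delete the project.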

Learn more

Continue learning about Document AI with the follow-up codelabs.

Resources

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.
