使用 Python 管理 Document AI 處理器

還剩 11 分鐘

程式碼研究室簡介

上次更新時間：5月 13, 2025

作者：Laurent Picard

本頁面由 Cloud Translation API 翻譯而成。

1. 總覽

什麼是 Document AI？

Document AI 是可從文件中擷取洞察資料的平台，本質上，它提供日益龐大的文件處理器清單 (視功能而定，也稱為剖析器或拆分器)。

您可以透過兩種方式管理 Document AI 處理器：

透過網頁控制台手動設定
使用 Document AI API 透過程式處理。

以下是顯示處理器清單的螢幕截圖範例，包括網頁控制台和 Python 程式碼：

在本實驗室中，您將著重於透過程式設計使用 Python 用戶端程式庫管理 Document AI 處理器。

您會看到的畫面

如何設定環境
如何擷取處理器類型
如何建立處理器
如何列出專案處理器
如何使用處理器
如何啟用/停用處理器
如何管理處理器版本
如何刪除處理器

軟硬體需求

Google Cloud 專案
Chrome 或 Firefox 等瀏覽器
熟悉 Python 的使用方式

問卷調查

您要如何使用這個教學課程？

只閱讀閱讀並完成練習

請評估您使用 Python 的體驗。

新手中級熟練

請評分你使用 Google Cloud 服務的體驗。

新手中級熟練

2. 設定和需求

自助式環境設定

登入 Google Cloud 控制台，然後建立新專案或重複使用現有專案。如果您還沒有 Gmail 或 Google Workspace 帳戶，請務必建立帳戶。

「Project name」是這個專案參與者的顯示名稱。這是 Google API 不會使用的字元字串。您隨時可以更新。
專案 ID 在所有 Google Cloud 專案中都是不重複的值，且無法變更 (設定後即無法變更)。Cloud 控制台會自動產生唯一字串，您通常不需要理會這個字串。在大多數程式碼研究室中，您需要參照專案 ID (通常會標示為 PROJECT_ID)。如果您不喜歡產生的 ID，可以產生另一個隨機 ID。或者，您也可以自行嘗試，看看是否可用。在這個步驟後就無法變更，且會在專案期間維持不變。
提醒您，有些 API 會使用第三個值「專案編號」。如要進一步瞭解這三個值，請參閱說明文件。

接下來，您需要在 Cloud 控制台中啟用帳單功能，才能使用 Cloud 資源/API。執行這個程式碼研究室不會產生任何費用，如要關閉資源，避免在本教學課程結束後繼續產生費用，您可以刪除建立的資源或專案。Google Cloud 新使用者可享有 $300 美元的免費試用期。

啟動 Cloud Shell

雖然 Google Cloud 可透過筆記型電腦遠端操作，但在本實驗室中，您將使用 Cloud Shell，這是在雲端運作的指令列環境。

啟用 Cloud Shell

在 Cloud 控制台中，按一下「啟用 Cloud Shell」圖示。

如果這是您首次啟動 Cloud Shell，系統會顯示中介畫面，說明 Cloud Shell 的功能。如果您看到中介畫面，請按一下「繼續」。

佈建並連線至 Cloud Shell 的作業只需幾分鐘的時間。

這個虛擬機器會載入所有必要的開發工具。提供永久的 5 GB 主目錄，而且在 Google Cloud 中運作，大幅提升網路效能和驗證功能。您可以在瀏覽器中完成本程式碼研究室的大部分工作，甚至是全部工作。

連線至 Cloud Shell 後，您應會發現自己通過驗證，且專案已設為您的專案 ID。

在 Cloud Shell 中執行下列指令，確認您已通過驗證：

gcloud auth list

指令輸出

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

在 Cloud Shell 中執行下列指令，確認 gcloud 指令知道您的專案：

gcloud config list project

指令輸出

[core]
project = <PROJECT_ID>

如未設定，請輸入下列指令設定專案：

gcloud config set project <PROJECT_ID>

指令輸出

Updated property [core/project].

3. 環境設定

您必須先在 Cloud Shell 中執行下列指令，啟用 Document AI API，才能開始使用 Document AI：

gcloud services enable documentai.googleapis.com

畫面應如下所示：

Operation "operations/..." finished successfully.

你現在可以使用 Document AI 了！

前往主目錄：

cd ~

建立 Python 虛擬環境，以便隔離依附元件：

virtualenv venv-docai

啟用虛擬環境：

source venv-docai/bin/activate

安裝 IPython、Document AI 用戶端程式庫和 python-tabulate (用於以美觀格式顯示要求結果)：

pip install ipython google-cloud-documentai tabulate

畫面應如下所示：

...
Installing collected packages: ..., tabulate, ipython, google-cloud-documentai
Successfully installed ... google-cloud-documentai-2.15.0 ...

您現在可以使用 Document AI 用戶端程式庫了！

設定下列環境變數：

export PROJECT_ID=$(gcloud config get-value core/project)

# Choose "us" or "eu"
export API_LOCATION="us"

從現在起，所有步驟都應在同一個工作階段中完成。

確認環境變數已正確定義：

echo $PROJECT_ID

echo $API_LOCATION

在後續步驟中，您將使用剛安裝的互動式 Python 解譯器 IPython。在 Cloud Shell 中執行 ipython，啟動工作階段：

ipython

畫面應如下所示：

Python 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

將下列程式碼複製到 IPython 工作階段：

import os
from typing import Iterator, MutableSequence, Optional, Sequence, Tuple

import google.cloud.documentai_v1 as docai
from tabulate import tabulate

PROJECT_ID = os.getenv("PROJECT_ID", "")
API_LOCATION = os.getenv("API_LOCATION", "")

assert PROJECT_ID, "PROJECT_ID is undefined"
assert API_LOCATION in ("us", "eu"), "API_LOCATION is incorrect"

# Test processors
document_ocr_display_name = "document-ocr"
form_parser_display_name = "form-parser"

test_processor_display_names_and_types = (
    (document_ocr_display_name, "OCR_PROCESSOR"),
    (form_parser_display_name, "FORM_PARSER_PROCESSOR"),
)

def get_client() -> docai.DocumentProcessorServiceClient:
    client_options = {"api_endpoint": f"{API_LOCATION}-documentai.googleapis.com"}
    return docai.DocumentProcessorServiceClient(client_options=client_options)

def get_parent(client: docai.DocumentProcessorServiceClient) -> str:
    return client.common_location_path(PROJECT_ID, API_LOCATION)

def get_client_and_parent() -> Tuple[docai.DocumentProcessorServiceClient, str]:
    client = get_client()
    parent = get_parent(client)
    return client, parent

您已準備好提出第一個要求，並擷取處理器類型。

4. 擷取處理器類型

在下一個步驟中建立處理器前，請先擷取可用的處理器類型。您可以使用 fetch_processor_types 擷取這份清單。

將下列函式新增至 IPython 工作階段：

def fetch_processor_types() -> MutableSequence[docai.ProcessorType]:
    client, parent = get_client_and_parent()
    response = client.fetch_processor_types(parent=parent)

    return response.processor_types

def print_processor_types(processor_types: Sequence[docai.ProcessorType]):
    def sort_key(pt):
        return (not pt.allow_creation, pt.category, pt.type_)

    sorted_processor_types = sorted(processor_types, key=sort_key)
    data = processor_type_tabular_data(sorted_processor_types)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processor types: {len(sorted_processor_types)}")

def processor_type_tabular_data(
    processor_types: Sequence[docai.ProcessorType],
) -> Iterator[Tuple[str, str, str, str]]:
    def locations(pt):
        return ", ".join(sorted(loc.location_id for loc in pt.available_locations))

    yield ("type", "category", "allow_creation", "locations")
    yield ("left", "left", "left", "left")
    if not processor_types:
        yield ("-", "-", "-", "-")
        return
    for pt in processor_types:
        yield (pt.type_, pt.category, f"{pt.allow_creation}", locations(pt))

列出處理器類型：

processor_types = fetch_processor_types()
print_processor_types(processor_types)

您應該會看到類似下列的內容：

+--------------------------------------+-------------+----------------+-----------+
| type                                 | category    | allow_creation | locations |
+--------------------------------------+-------------+----------------+-----------+
| CUSTOM_CLASSIFICATION_PROCESSOR      | CUSTOM      | True           | eu, us    |
...
| FORM_PARSER_PROCESSOR                | GENERAL     | True           | eu, us    |
| LAYOUT_PARSER_PROCESSOR              | GENERAL     | True           | eu, us    |
| OCR_PROCESSOR                        | GENERAL     | True           | eu, us    |
| BANK_STATEMENT_PROCESSOR             | SPECIALIZED | True           | eu, us    |
| EXPENSE_PROCESSOR                    | SPECIALIZED | True           | eu, us    |
...
+--------------------------------------+-------------+----------------+-----------+
→ Processor types: 19

您現在已取得在下一個步驟中建立處理器所需的所有資訊。

5. 建立處理器

如要建立處理器，請呼叫 create_processor，並提供顯示名稱和處理器類型。

新增下列函式：

def create_processor(display_name: str, type: str) -> docai.Processor:
    client, parent = get_client_and_parent()
    processor = docai.Processor(display_name=display_name, type_=type)

    return client.create_processor(parent=parent, processor=processor)

建立測試處理器：

separator = "=" * 80
for display_name, type in test_processor_display_names_and_types:
    print(separator)
    print(f"Creating {display_name} ({type})...")
    try:
        create_processor(display_name, type)
    except Exception as err:
        print(err)
print(separator)
print("Done")

您應該會看到以下內容：

================================================================================
Creating document-ocr (OCR_PROCESSOR)...
================================================================================
Creating form-parser (FORM_PARSER_PROCESSOR)...
================================================================================
Done

您已建立新的處理器！

接著，請瞭解如何列出處理器。

6. 列出專案處理器

list_processors 會傳回屬於專案的所有處理器清單。

新增下列函式：

def list_processors() -> MutableSequence[docai.Processor]:
    client, parent = get_client_and_parent()
    response = client.list_processors(parent=parent)

    return list(response.processors)

def print_processors(processors: Optional[Sequence[docai.Processor]] = None):
    def sort_key(processor):
        return processor.display_name

    if processors is None:
        processors = list_processors()
    sorted_processors = sorted(processors, key=sort_key)
    data = processor_tabular_data(sorted_processors)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processors: {len(sorted_processors)}")

def processor_tabular_data(
    processors: Sequence[docai.Processor],
) -> Iterator[Tuple[str, str, str]]:
    yield ("display_name", "type", "state")
    yield ("left", "left", "left")
    if not processors:
        yield ("-", "-", "-")
        return
    for processor in processors:
        yield (processor.display_name, processor.type_, processor.state.name)

呼叫函式：

processors = list_processors()
print_processors(processors)

您應該會看到以下內容：

+--------------+-----------------------+---------+
| display_name | type                  | state   |
+--------------+-----------------------+---------+
| document-ocr | OCR_PROCESSOR         | ENABLED |
| form-parser  | FORM_PARSER_PROCESSOR | ENABLED |
+--------------+-----------------------+---------+
→ Processors: 2

如要依顯示名稱擷取處理器，請新增下列函式：

def get_processor(
    display_name: str,
    processors: Optional[Sequence[docai.Processor]] = None,
) -> Optional[docai.Processor]:
    if processors is None:
        processors = list_processors()
    for processor in processors:
        if processor.display_name == display_name:
            return processor
    return None

測試函式：

processor = get_processor(document_ocr_display_name, processors)

assert processor is not None
print(processor)

畫面應如下所示：

name: "projects/PROJECT_NUM/locations/LOCATION/processors/PROCESSOR_ID"
type_: "OCR_PROCESSOR"
display_name: "document-ocr"
state: ENABLED
...

您現在已瞭解如何列出專案處理器，並根據顯示名稱擷取這些處理器。接下來，請參閱如何使用處理器。

7. 使用處理器

文件處理方式有兩種：

同步：呼叫 process_document 來分析單一文件，並直接使用結果。
非同步：呼叫 batch_process_documents 即可針對多個或大型文件啟動批次處理作業。

您的測試文件 ( PDF) 是已完成的掃描問卷，其中包含手寫答案。直接從 IPython 工作階段下載至工作目錄：

!gsutil cp gs://cloud-samples-data/documentai/form.pdf .

檢查工作目錄的內容：

!ls

你應該具備以下條件：

...  form.pdf  ...  venv-docai  ...

您可以使用同步 process_document 方法分析本機檔案。新增下列函式：

def process_file(
    processor: docai.Processor,
    file_path: str,
    mime_type: str,
) -> docai.Document:
    client = get_client()
    with open(file_path, "rb") as document_file:
        document_content = document_file.read()
    document = docai.RawDocument(content=document_content, mime_type=mime_type)
    request = docai.ProcessRequest(raw_document=document, name=processor.name)

    response = client.process_document(request)

    return response.document

由於您的文件是問卷，請選擇表單剖析器。除了擷取文字 (印刷和手寫) 之外，這個一般處理器還會偵測表單欄位。

分析文件：

processor = get_processor(form_parser_display_name)
assert processor is not None

file_path = "./form.pdf"
mime_type = "application/pdf"

document = process_file(processor, file_path, mime_type)

所有處理器都會在文件上執行光學字元辨識 (OCR) 第一階段。查看 OCR 檢查作業偵測到的文字：

document.text.split("\n")

畫面應如下所示：

['FakeDoc M.D.',
 'HEALTH INTAKE FORM',
 'Please fill out the questionnaire carefully. The information you provide will be used to complete',
 'your health profile and will be kept confidential.',
 'Date:',
 '9/14/19',
 'Name:',
 'Sally Walker',
 'DOB: 09/04/1986',
 'Address: 24 Barney Lane',
 'City: Towaco',
 'State: NJ Zip: 07082',
 'Email: Sally, walker@cmail.com',
 '_Phone #: (906) 917-3486',
 'Gender: F',
 'Marital Status:',
  ...
]

新增下列函式，以便列印偵測到的單一表單欄位：

def print_form_fields(document: docai.Document):
    sorted_form_fields = form_fields_sorted_by_ocr_order(document)
    data = form_field_tabular_data(sorted_form_fields, document)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Form fields: {len(sorted_form_fields)}")

def form_field_tabular_data(
    form_fields: Sequence[docai.Document.Page.FormField],
    document: docai.Document,
) -> Iterator[Tuple[str, str, str]]:
    yield ("name", "value", "confidence")
    yield ("right", "left", "right")
    if not form_fields:
        yield ("-", "-", "-")
        return
    for form_field in form_fields:
        name_layout = form_field.field_name
        value_layout = form_field.field_value
        name = text_from_layout(name_layout, document)
        value = text_from_layout(value_layout, document)
        confidence = value_layout.confidence
        yield (name, value, f"{confidence:.1%}")

並新增下列公用函式：

def form_fields_sorted_by_ocr_order(
    document: docai.Document,
) -> MutableSequence[docai.Document.Page.FormField]:
    def sort_key(form_field):
        # Sort according to the field name detected position
        text_anchor = form_field.field_name.text_anchor
        return text_anchor.text_segments[0].start_index if text_anchor else 0

    fields = (field for page in document.pages for field in page.form_fields)

    return sorted(fields, key=sort_key)


def text_from_layout(
    layout: docai.Document.Page.Layout,
    document: docai.Document,
) -> str:
    full_text = document.text
    segs = layout.text_anchor.text_segments
    text = "".join(full_text[seg.start_index : seg.end_index] for seg in segs)
    if text.endswith("\n"):
        text = text[:-1]

    return text

列印偵測到的表單欄位：

print_form_fields(document)

您應該會看到類似以下的列印內容：

+-----------------+-------------------------+------------+
|            name | value                   | confidence |
+-----------------+-------------------------+------------+
|           Date: | 9/14/19                 |      83.0% |
|           Name: | Sally Walker            |      87.3% |
|            DOB: | 09/04/1986              |      88.5% |
|        Address: | 24 Barney Lane          |      82.4% |
|           City: | Towaco                  |      90.0% |
|          State: | NJ                      |      89.4% |
|            Zip: | 07082                   |      91.4% |
|          Email: | Sally, walker@cmail.com |      79.7% |
|       _Phone #: | walker@cmail.com        |      93.2% |
|                 | (906                    |            |
|         Gender: | F                       |      88.2% |
| Marital Status: | Single                  |      85.2% |
|     Occupation: | Software Engineer       |      81.5% |
|    Referred By: | None                    |      76.9% |
...
+-----------------+-------------------------+------------+
→ Form fields: 17

查看系統偵測到的欄位名稱和值 ( PDF)。以下是問卷上半部：

您分析的表單含有印刷文字和手寫文字。您也已以高可信度偵測到該表單的欄位。結果是，您的像素已轉換為結構化資料！

8. 啟用及停用處理器

您可以使用 disable_processor 和 enable_processor 控管是否可以使用處理器。

新增下列函式：

def update_processor_state(processor: docai.Processor, enable_processor: bool):
    client = get_client()
    if enable_processor:
        request = docai.EnableProcessorRequest(name=processor.name)
        operation = client.enable_processor(request)
    else:
        request = docai.DisableProcessorRequest(name=processor.name)
        operation = client.disable_processor(request)
    operation.result()  # Wait for operation to complete

def enable_processor(processor: docai.Processor):
    update_processor_state(processor, True)

def disable_processor(processor: docai.Processor):
    update_processor_state(processor, False)

停用表單剖析器處理器，並檢查處理器的狀態：

processor = get_processor(form_parser_display_name)
assert processor is not None

disable_processor(processor)
print_processors()

您應該會看到以下內容：

+--------------+-----------------------+----------+
| display_name | type                  | state    |
+--------------+-----------------------+----------+
| document-ocr | OCR_PROCESSOR         | ENABLED  |
| form-parser  | FORM_PARSER_PROCESSOR | DISABLED |
+--------------+-----------------------+----------+
→ Processors: 2

重新啟用表單剖析器處理器：

enable_processor(processor)
print_processors()

您應該會看到以下內容：

+--------------+-----------------------+---------+
| display_name | type                  | state   |
+--------------+-----------------------+---------+
| document-ocr | OCR_PROCESSOR         | ENABLED |
| form-parser  | FORM_PARSER_PROCESSOR | ENABLED |
+--------------+-----------------------+---------+
→ Processors: 2

接下來，我們將說明如何管理處理器版本。

9. 管理處理器版本

處理器可提供多個版本。瞭解如何使用 list_processor_versions 和 set_default_processor_version 方法。

新增下列函式：

def list_processor_versions(
    processor: docai.Processor,
) -> MutableSequence[docai.ProcessorVersion]:
    client = get_client()
    response = client.list_processor_versions(parent=processor.name)

    return list(response)


def get_sorted_processor_versions(
    processor: docai.Processor,
) -> MutableSequence[docai.ProcessorVersion]:
    def sort_key(processor_version: docai.ProcessorVersion):
        return processor_version.name

    versions = list_processor_versions(processor)

    return sorted(versions, key=sort_key)


def print_processor_versions(processor: docai.Processor):
    versions = get_sorted_processor_versions(processor)
    default_version_name = processor.default_processor_version
    data = processor_versions_tabular_data(versions, default_version_name)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processor versions: {len(versions)}")


def processor_versions_tabular_data(
    versions: Sequence[docai.ProcessorVersion],
    default_version_name: str,
) -> Iterator[Tuple[str, str, str]]:
    yield ("version", "display name", "default")
    yield ("left", "left", "left")
    if not versions:
        yield ("-", "-", "-")
        return
    for version in versions:
        mapping = docai.DocumentProcessorServiceClient.parse_processor_version_path(
            version.name
        )
        processor_version = mapping["processor_version"]
        is_default = "Y" if version.name == default_version_name else ""
        yield (processor_version, version.display_name, is_default)

列出 OCR 處理器的可用版本：

processor = get_processor(document_ocr_display_name)
assert processor is not None
print_processor_versions(processor)

您會取得以下處理器版本：

+--------------------------------+--------------------------+---------+
| version                        | display name             | default |
+--------------------------------+--------------------------+---------+
| pretrained-ocr-v1.0-2020-09-23 | Google Stable            |         |
| pretrained-ocr-v1.1-2022-09-12 | Google Release Candidate |         |
| pretrained-ocr-v1.2-2022-11-10 | Google Release Candidate |         |
| pretrained-ocr-v2.0-2023-06-02 | Google Stable            | Y       |
| pretrained-ocr-v2.1-2024-08-07 | Google Release Candidate |         |
+--------------------------------+--------------------------+---------+
→ Processor versions: 5

接著，新增函式來變更預設處理器版本：

def set_default_processor_version(processor: docai.Processor, version_name: str):
    client = get_client()
    request = docai.SetDefaultProcessorVersionRequest(
        processor=processor.name,
        default_processor_version=version_name,
    )

    operation = client.set_default_processor_version(request)
    operation.result()  # Wait for operation to complete

切換至最新的處理器版本：

processor = get_processor(document_ocr_display_name)
assert processor is not None
versions = get_sorted_processor_versions(processor)

new_version = versions[-1]  # Latest version
set_default_processor_version(processor, new_version.name)

# Update the processor info
processor = get_processor(document_ocr_display_name)
assert processor is not None
print_processor_versions(processor)

您會取得新版設定：

+--------------------------------+--------------------------+---------+
| version                        | display name             | default |
+--------------------------------+--------------------------+---------+
| pretrained-ocr-v1.0-2020-09-23 | Google Stable            |         |
| pretrained-ocr-v1.1-2022-09-12 | Google Release Candidate |         |
| pretrained-ocr-v1.2-2022-11-10 | Google Release Candidate |         |
| pretrained-ocr-v2.0-2023-06-02 | Google Stable            |         |
| pretrained-ocr-v2.1-2024-08-07 | Google Release Candidate | Y       |
+--------------------------------+--------------------------+---------+
→ Processor versions: 5

接下來，是最終的處理器管理方法 (刪除)。

10. 刪除處理器

最後，請查看如何使用 delete_processor 方法。

新增下列函式：

def delete_processor(processor: docai.Processor):
    client = get_client()
    operation = client.delete_processor(name=processor.name)
    operation.result()  # Wait for operation to complete

刪除測試處理器：

processors_to_delete = [dn for dn, _ in test_processor_display_names_and_types]
print("Deleting processors...")

for processor in list_processors():
    if processor.display_name not in processors_to_delete:
        continue
    print(f"  Deleting {processor.display_name}...")
    delete_processor(processor)

print("Done\n")
print_processors()

您應該會看到以下內容：

Deleting processors...
  Deleting form-parser...
  Deleting document-ocr...
Done

+--------------+------+-------+
| display_name | type | state |
+--------------+------+-------+
| -            | -    | -     |
+--------------+------+-------+
→ Processors: 0

您已瞭解所有處理器管理方法！就快完成了...

11. 恭喜！

您已瞭解如何使用 Python 管理 Document AI 處理器！

清除所用資源

如要清理開發環境，請透過 Cloud Shell 執行下列操作：

如果您仍在 IPython 工作階段中，請返回 Shell：exit
停止使用 Python 虛擬環境：deactivate
刪除虛擬環境資料夾：cd ~ ; rm -rf ./venv-docai

如要刪除 Google Cloud 專案，請在 Cloud Shell 中執行下列操作：

擷取目前的專案 ID：PROJECT_ID=$(gcloud config get-value core/project)
請確認這是您要刪除的專案：echo $PROJECT_ID
刪除專案：gcloud projects delete $PROJECT_ID

瞭解詳情

在瀏覽器中試用 Document AI：https://cloud.google.com/document-ai/docs/drag-and-drop
Document AI 處理器詳細資訊：https://cloud.google.com/document-ai/docs/processors-list
Google Cloud 上的 Python：https://cloud.google.com/python
Python 適用的 Cloud 用戶端程式庫：https://github.com/googleapis/google-cloud-python

授權

這項內容採用的授權為 Creative Commons 姓名標示 2.0 通用授權。

回報錯誤