使用 Python 管理 Document AI 處理器

1. 總覽

c6d2ea69b1ba0eff.png

什麼是 Document AI?

Document AI 這個平台可讓您從文件中擷取深入分析結果。本核心提供一系列不斷成長的文件處理器 (視其功能而定,這又稱為剖析器或分割器)。

管理 Document AI 處理器的方式有兩種:

  • 從網路控制台手動執行
  • 透過 Document AI API 編寫程式。

以下是從網路控制台和 Python 程式碼顯示的處理器清單範例螢幕截圖:

312f0e9b3a8b8291.png

在本研究室中,您將專注於使用 Python 用戶端程式庫以程式輔助方式管理 Document AI 處理器。

顯示資料

  • 如何設定環境
  • 如何擷取處理器類型
  • 如何建立處理器
  • 如何列出專案處理器
  • 如何使用處理器
  • 如何啟用/停用處理器
  • 如何管理處理器版本
  • 如何刪除處理器

軟硬體需求

  • Google Cloud 專案
  • 瀏覽器,例如 ChromeFirefox
  • 熟悉使用 Python

問卷調查

您會如何使用這個教學課程?

僅供閱讀 閱讀並完成練習

您對 Python 的使用體驗有何評價?

新手 中級 還算容易

針對使用 Google Cloud 服務的經驗,您會給予什麼評價?

新手 中級 還算容易

2. 設定和需求

自修環境設定

  1. 登入 Google Cloud 控制台,建立新專案或重複使用現有專案。如果您還沒有 Gmail 或 Google Workspace 帳戶,請先建立帳戶

295004821bab6a87.png

37d264871000675d.png

96d86d3d5655cdbe.png

  • 「專案名稱」是這項專案參與者的顯示名稱。這是 Google API 未使用的字元字串。您可以隨時更新付款方式。
  • 所有 Google Cloud 專案的專案 ID 均不得重複,而且設定後即無法變更。Cloud 控制台會自動產生一個不重複的字串。但通常是在乎它何在在大部分的程式碼研究室中,您必須參照專案 ID (通常為 PROJECT_ID)。如果您對產生的 ID 不滿意,可以隨機產生一個 ID。或者,您也可以自行嘗試,看看是否支援。在這個步驟後,這個名稱即無法變更,而且在專案期間內仍會保持有效。
  • 資訊中的第三個值是專案編號,部分 API 會使用這個編號。如要進一步瞭解這三個值,請參閱說明文件
  1. 接下來,您需要在 Cloud 控制台中啟用計費功能,才能使用 Cloud 資源/API。執行本程式碼研究室不會產生任何費用 (如果有的話)。如要關閉資源,以免產生本教學課程結束後產生的費用,您可以刪除自己建立的資源或刪除專案。新使用者符合 $300 美元免費試用計畫的資格。

啟動 Cloud Shell

雖然 Google Cloud 可以從筆記型電腦遠端操作,但在本研究室中,您將使用 Cloud Shell,這是 Cloud 環境中執行的指令列環境。

啟用 Cloud Shell

  1. 在 Cloud 控制台中,按一下「啟用 Cloud Shell」圖示 d1264ca30785e435.png

cb81e7c8e34bc8d.png

如果您是第一次啟動 Cloud Shell,系統會顯示中繼畫面,說明這項服務的內容。如果系統顯示中繼畫面,請按一下「繼續」

d95252b003979716.png

佈建並連線至 Cloud Shell 只需幾分鐘的時間。

7833d5e1c5d18f54.png

這個虛擬機器已載入所有必要的開發工具。提供永久的 5 GB 主目錄,而且在 Google Cloud 中運作,大幅提高網路效能和驗證能力。在本程式碼研究室中,您的大部分作業都可透過瀏覽器完成。

連線至 Cloud Shell 後,您應會發現自己通過驗證,且專案已設為您的專案 ID。

  1. 在 Cloud Shell 中執行下列指令,確認您已通過驗證:
gcloud auth list

指令輸出

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`
  1. 在 Cloud Shell 中執行下列指令,確認 gcloud 指令知道您的專案:
gcloud config list project

指令輸出

[core]
project = <PROJECT_ID>

如果尚未設定,請使用下列指令進行設定:

gcloud config set project <PROJECT_ID>

指令輸出

Updated property [core/project].

3. 環境設定

開始使用 Document AI 之前,請在 Cloud Shell 中執行下列指令,啟用 Document AI API:

gcloud services enable documentai.googleapis.com

畫面應如下所示:

Operation "operations/..." finished successfully.

您現在可以使用 Document AI 了!

前往主目錄:

cd ~

建立 Python 虛擬環境來區隔依附元件:

virtualenv venv-docai

啟用虛擬環境:

source venv-docai/bin/activate

安裝 IPython、Document AI 用戶端程式庫和 python-tabulate (這可協助您美化要求結果):

pip install ipython google-cloud-documentai tabulate

畫面應如下所示:

...
Installing collected packages: ..., tabulate, ipython, google-cloud-documentai
Successfully installed ... google-cloud-documentai-2.15.0 ...

您現在可以使用 Document AI 用戶端程式庫了!

設定下列環境變數:

export PROJECT_ID=$(gcloud config get-value core/project)
# Choose "us" or "eu"
export API_LOCATION="us"

從現在起,所有步驟應在同一個工作階段中完成。

確認環境變數定義正確:

echo $PROJECT_ID
echo $API_LOCATION

在後續步驟中,您將使用剛才安裝的互動式 Python 解譯器 IPython。在 Cloud Shell 中執行 ipython 即可啟動工作階段:

ipython

畫面應如下所示:

Python 3.9.2 (default, Feb 28 2021, 17:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

將下列程式碼複製到您的 IPython 工作階段:

import os
from typing import Iterator, MutableSequence, Optional, Sequence, Tuple

import google.cloud.documentai_v1 as docai
from tabulate import tabulate

PROJECT_ID = os.getenv("PROJECT_ID", "")
API_LOCATION = os.getenv("API_LOCATION", "")

assert PROJECT_ID, "PROJECT_ID is undefined"
assert API_LOCATION in ("us", "eu"), "API_LOCATION is incorrect"

# Test processors
document_ocr_display_name = "document-ocr"
form_parser_display_name = "form-parser"

test_processor_display_names_and_types = (
    (document_ocr_display_name, "OCR_PROCESSOR"),
    (form_parser_display_name, "FORM_PARSER_PROCESSOR"),
)

def get_client() -> docai.DocumentProcessorServiceClient:
    client_options = {"api_endpoint": f"{API_LOCATION}-documentai.googleapis.com"}
    return docai.DocumentProcessorServiceClient(client_options=client_options)

def get_parent(client: docai.DocumentProcessorServiceClient) -> str:
    return client.common_location_path(PROJECT_ID, API_LOCATION)

def get_client_and_parent() -> Tuple[docai.DocumentProcessorServiceClient, str]:
    client = get_client()
    parent = get_parent(client)
    return client, parent
    

您現在可以發出第一個要求並擷取處理器類型。

4. 正在擷取處理器類型

在下一個步驟建立處理器前,請先擷取可用的處理器類型。你可以使用 fetch_processor_types 擷取這份清單。

將下列函式新增至 IPython 工作階段:

def fetch_processor_types() -> MutableSequence[docai.ProcessorType]:
    client, parent = get_client_and_parent()
    response = client.fetch_processor_types(parent=parent)

    return response.processor_types

def print_processor_types(processor_types: Sequence[docai.ProcessorType]):
    def sort_key(pt):
        return (not pt.allow_creation, pt.category, pt.type_)

    sorted_processor_types = sorted(processor_types, key=sort_key)
    data = processor_type_tabular_data(sorted_processor_types)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processor types: {len(sorted_processor_types)}")

def processor_type_tabular_data(
    processor_types: Sequence[docai.ProcessorType],
) -> Iterator[Tuple[str, str, str, str]]:
    def locations(pt):
        return ", ".join(sorted(loc.location_id for loc in pt.available_locations))

    yield ("type", "category", "allow_creation", "locations")
    yield ("left", "left", "left", "left")
    if not processor_types:
        yield ("-", "-", "-", "-")
        return
    for pt in processor_types:
        yield (pt.type_, pt.category, f"{pt.allow_creation}", locations(pt))
        

列出處理器類型:

processor_types = fetch_processor_types()
print_processor_types(processor_types)

畫面應如下所示:

+---------------------------------+-------------+----------------+------------+
| type                            | category    | allow_creation | locations  |
+---------------------------------+-------------+----------------+------------+
| CUSTOM_CLASSIFICATION_PROCESSOR | CUSTOM      | True           | eu, us...  |
| CUSTOM_EXTRACTION_PROCESSOR     | CUSTOM      | True           | eu, us...  |
| FORM_PARSER_PROCESSOR           | GENERAL     | True           | eu, us...  |
| OCR_PROCESSOR                   | GENERAL     | True           | eu, us...  |
| EXPENSE_PROCESSOR               | SPECIALIZED | True           | eu, us...  |
...
+---------------------------------+-------------+----------------+------------+
→ Processor types: 40

現在您已擁有建立處理器所需的所有資訊。

5. 建立處理器

如要建立處理器,請以顯示名稱和處理器類型呼叫 create_processor

新增下列函式:

def create_processor(display_name: str, type: str) -> docai.Processor:
    client, parent = get_client_and_parent()
    processor = docai.Processor(display_name=display_name, type_=type)

    return client.create_processor(parent=parent, processor=processor)
    

建立測試處理器:

separator = "=" * 80
for display_name, type in test_processor_display_names_and_types:
    print(separator)
    print(f"Creating {display_name} ({type})...")
    try:
        create_processor(display_name, type)
    except Exception as err:
        print(err)
print(separator)
print("Done")

畫面應會顯示以下內容:

================================================================================
Creating document-ocr (OCR_PROCESSOR)...
================================================================================
Creating form-parser (FORM_PARSER_PROCESSOR)...
================================================================================
Done

您已建立新的處理器!

接著,瞭解如何列出處理器。

6. 列出專案處理器

list_processors 會傳回專案中所有處理器的清單。

新增下列函式:

def list_processors() -> MutableSequence[docai.Processor]:
    client, parent = get_client_and_parent()
    response = client.list_processors(parent=parent)

    return list(response.processors)

def print_processors(processors: Optional[Sequence[docai.Processor]] = None):
    def sort_key(processor):
        return processor.display_name

    if processors is None:
        processors = list_processors()
    sorted_processors = sorted(processors, key=sort_key)
    data = processor_tabular_data(sorted_processors)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processors: {len(sorted_processors)}")

def processor_tabular_data(
    processors: Sequence[docai.Processor],
) -> Iterator[Tuple[str, str, str]]:
    yield ("display_name", "type", "state")
    yield ("left", "left", "left")
    if not processors:
        yield ("-", "-", "-")
        return
    for processor in processors:
        yield (processor.display_name, processor.type_, processor.state.name)
        

呼叫函式:

processors = list_processors()
print_processors(processors)

畫面應會顯示以下內容:

+--------------+-----------------------+---------+
| display_name | type                  | state   |
+--------------+-----------------------+---------+
| document-ocr | OCR_PROCESSOR         | ENABLED |
| form-parser  | FORM_PARSER_PROCESSOR | ENABLED |
+--------------+-----------------------+---------+
→ Processors: 2

如要依顯示名稱擷取處理器,請新增下列函式:

def get_processor(
    display_name: str,
    processors: Optional[Sequence[docai.Processor]] = None,
) -> Optional[docai.Processor]:
    if processors is None:
        processors = list_processors()
    for processor in processors:
        if processor.display_name == display_name:
            return processor
    return None
    

測試函式:

processor = get_processor(document_ocr_display_name, processors)

assert processor is not None
print(processor)

畫面應如下所示:

name: "projects/PROJECT_NUM/locations/LOCATION/processors/PROCESSOR_ID"
type_: "OCR_PROCESSOR"
display_name: "document-ocr"
state: ENABLED
...

現在,您已瞭解如何列出專案處理器,並透過顯示名稱擷取這些處理器。接著來看看如何使用處理器。

7. 使用處理器

系統可以透過下列兩種方式處理文件:

  • 同步:呼叫 process_document 可分析單一文件並直接使用結果。
  • 非同步:呼叫 batch_process_documents 即可在多份或大型文件上啟動批次處理。

您的測試文件 ( PDF) 是一份經過掃描的問卷,其中包含手寫答案。直接從 IPython 工作階段將其下載至工作目錄:

!gsutil cp gs://cloud-samples-data/documentai/form.pdf .

檢查工作目錄的內容:

!ls

畫面應具備下列項目:

...  form.pdf  ...  venv-docai  ...

您可以使用同步 process_document 方法分析本機檔案。新增下列函式:

def process_file(
    processor: docai.Processor,
    file_path: str,
    mime_type: str,
) -> docai.Document:
    client = get_client()
    with open(file_path, "rb") as document_file:
        document_content = document_file.read()
    document = docai.RawDocument(content=document_content, mime_type=mime_type)
    request = docai.ProcessRequest(raw_document=document, name=processor.name)

    response = client.process_document(request)

    return response.document
    

你的文件是問卷,因此請選擇表單剖析器。除了擷取紙本和手寫文字,所有處理器都能擷取文字,這個一般處理器還會偵測表單欄位。

分析文件:

processor = get_processor(form_parser_display_name)
assert processor is not None

file_path = "./form.pdf"
mime_type = "application/pdf"

document = process_file(processor, file_path, mime_type)

所有處理器都會先對文件執行光學字元辨識 (OCR)。查看 OCR 票證偵測到的文字:

document.text.split("\n")

畫面應如下所示:

['FakeDoc M.D.',
 'HEALTH INTAKE FORM',
 'Please fill out the questionnaire carefully. The information you provide will be used to complete',
 'your health profile and will be kept confidential.',
 'Date:',
 'Sally',
 'Walker',
 'Name:',
 '9/14/19',
 'DOB: 09/04/1986',
 'Address: 24 Barney Lane City: Towaco State: NJ Zip: 07082',
 'Email: Sally, waller@cmail.com Phone #: (906) 917-3486',
 'Gender: F',
 'Single Occupation: Software Engineer',
 'Referred By: None',
 'Emergency Contact: Eva Walker Emergency Contact Phone: (906) 334-8976',
 'Marital Status:',
  ...
]

新增下列函式,輸出偵測到的表單欄位:

def print_form_fields(document: docai.Document):
    sorted_form_fields = form_fields_sorted_by_ocr_order(document)
    data = form_field_tabular_data(sorted_form_fields, document)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Form fields: {len(sorted_form_fields)}")

def form_field_tabular_data(
    form_fields: Sequence[docai.Document.Page.FormField],
    document: docai.Document,
) -> Iterator[Tuple[str, str, str]]:
    yield ("name", "value", "confidence")
    yield ("right", "left", "right")
    if not form_fields:
        yield ("-", "-", "-")
        return
    for form_field in form_fields:
        name_layout = form_field.field_name
        value_layout = form_field.field_value
        name = text_from_layout(name_layout, document)
        value = text_from_layout(value_layout, document)
        confidence = value_layout.confidence
        yield (name, value, f"{confidence:.1%}")
        

同時新增以下公用程式函式:

def form_fields_sorted_by_ocr_order(
    document: docai.Document,
) -> MutableSequence[docai.Document.Page.FormField]:
    def sort_key(form_field):
        # Sort according to the field name detected position
        text_anchor = form_field.field_name.text_anchor
        return text_anchor.text_segments[0].start_index if text_anchor else 0

    fields = (field for page in document.pages for field in page.form_fields)

    return sorted(fields, key=sort_key)


def text_from_layout(
    layout: docai.Document.Page.Layout,
    document: docai.Document,
) -> str:
    full_text = document.text
    segs = layout.text_anchor.text_segments
    text = "".join(full_text[seg.start_index : seg.end_index] for seg in segs)
    if text.endswith("\n"):
        text = text[:-1]

    return text
    

列印偵測到的表單欄位:

print_form_fields(document)

您應該會看到如下的輸出內容:

+--------------+-------------------------+------------+
|         name | value                   | confidence |
+--------------+-------------------------+------------+
|        Date: | 9/14/19                 |     100.0% |
|        Name: | Sally                   |      99.7% |
|              | Walker                  |            |
|         DOB: | 09/04/1986              |     100.0% |
|     Address: | 24 Barney Lane          |      99.9% |
|        City: | Towaco                  |      99.8% |
|       State: | NJ                      |      99.7% |
|         Zip: | 07082                   |      99.5% |
|       Email: | Sally, waller@cmail.com |      99.6% |
|     Phone #: | (906) 917-3486          |     100.0% |
|      Gender: | F                       |     100.0% |
|  Occupation: | Software Engineer       |     100.0% |
| Referred By: | None                    |     100.0% |
...
+--------------+-------------------------+------------+
→ Form fields: 17

檢查偵測到的欄位名稱和值 ( PDF)。以下是問卷的上半部:

ea7370f0bb0cc494.png

您已經分析了一份包含列印和手寫文字的表單,以及可信度高的欄位。如此一來,像素就已轉換成結構化資料!

8. 啟用及停用處理器

您可以利用 disable_processorenable_processor 控制是否可使用處理器。

新增下列函式:

def update_processor_state(processor: docai.Processor, enable_processor: bool):
    client = get_client()
    if enable_processor:
        request = docai.EnableProcessorRequest(name=processor.name)
        operation = client.enable_processor(request)
    else:
        request = docai.DisableProcessorRequest(name=processor.name)
        operation = client.disable_processor(request)
    operation.result()  # Wait for operation to complete

def enable_processor(processor: docai.Processor):
    update_processor_state(processor, True)

def disable_processor(processor: docai.Processor):
    update_processor_state(processor, False)
    

停用表單剖析器處理器,並檢查處理器狀態:

processor = get_processor(form_parser_display_name)
assert processor is not None

disable_processor(processor)
print_processors()

畫面應會顯示以下內容:

+--------------+-----------------------+----------+
| display_name | type                  | state    |
+--------------+-----------------------+----------+
| document-ocr | OCR_PROCESSOR         | ENABLED  |
| form-parser  | FORM_PARSER_PROCESSOR | DISABLED |
+--------------+-----------------------+----------+
→ Processors: 2

重新啟用表單剖析器處理器:

enable_processor(processor)
print_processors()

畫面應會顯示以下內容:

+--------------+-----------------------+---------+
| display_name | type                  | state   |
+--------------+-----------------------+---------+
| document-ocr | OCR_PROCESSOR         | ENABLED |
| form-parser  | FORM_PARSER_PROCESSOR | ENABLED |
+--------------+-----------------------+---------+
→ Processors: 2

接著來看看如何管理處理器版本。

9. 管理處理器版本

處理器支援多個版本。瞭解如何使用 list_processor_versionsset_default_processor_version 方法。

新增下列函式:

def list_processor_versions(
    processor: docai.Processor,
) -> MutableSequence[docai.ProcessorVersion]:
    client = get_client()
    response = client.list_processor_versions(parent=processor.name)

    return list(response)


def get_sorted_processor_versions(
    processor: docai.Processor,
) -> MutableSequence[docai.ProcessorVersion]:
    def sort_key(processor_version: docai.ProcessorVersion):
        return processor_version.name

    versions = list_processor_versions(processor)

    return sorted(versions, key=sort_key)


def print_processor_versions(processor: docai.Processor):
    versions = get_sorted_processor_versions(processor)
    default_version_name = processor.default_processor_version
    data = processor_versions_tabular_data(versions, default_version_name)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processor versions: {len(versions)}")


def processor_versions_tabular_data(
    versions: Sequence[docai.ProcessorVersion],
    default_version_name: str,
) -> Iterator[Tuple[str, str, str]]:
    yield ("version", "display name", "default")
    yield ("left", "left", "left")
    if not versions:
        yield ("-", "-", "-")
        return
    for version in versions:
        mapping = docai.DocumentProcessorServiceClient.parse_processor_version_path(
            version.name
        )
        processor_version = mapping["processor_version"]
        is_default = "Y" if version.name == default_version_name else ""
        yield (processor_version, version.display_name, is_default)
        

列出 OCR 處理器的可用版本:

processor = get_processor(document_ocr_display_name)
assert processor is not None
print_processor_versions(processor)

就能取得處理器版本:

+--------------------------------+--------------------------+---------+
| version                        | display name             | default |
+--------------------------------+--------------------------+---------+
| pretrained-ocr-v1.0-2020-09-23 | Google Stable            | Y       |
| pretrained-ocr-v1.1-2022-09-12 | Google Release Candidate |         |
| pretrained-ocr-v1.2-2022-11-10 | Google Release Candidate |         |
+--------------------------------+--------------------------+---------+
→ Processor versions: 3

現在,請新增函式,變更預設處理器版本:

def set_default_processor_version(processor: docai.Processor, version_name: str):
    client = get_client()
    request = docai.SetDefaultProcessorVersionRequest(
        processor=processor.name,
        default_processor_version=version_name,
    )

    operation = client.set_default_processor_version(request)
    operation.result()  # Wait for operation to complete
    

改用最新的處理器版本:

processor = get_processor(document_ocr_display_name)
assert processor is not None
versions = get_sorted_processor_versions(processor)

new_version = versions[-1]  # Latest version
set_default_processor_version(processor, new_version.name)

# Update the processor info
processor = get_processor(document_ocr_display_name)
assert processor is not None
print_processor_versions(processor)

取得新版本的設定:

+--------------------------------+--------------------------+---------+
| version                        | display name             | default |
+--------------------------------+--------------------------+---------+
| pretrained-ocr-v1.0-2020-09-23 | Google Stable            |         |
| pretrained-ocr-v1.1-2022-09-12 | Google Release Candidate |         |
| pretrained-ocr-v1.2-2022-11-10 | Google Release Candidate | Y       |
+--------------------------------+--------------------------+---------+
→ Processor versions: 3

接著是終極處理器管理方法 (刪除)。

10. 正在刪除處理器

最後,瞭解如何使用 delete_processor 方法。

新增下列函式:

def delete_processor(processor: docai.Processor):
    client = get_client()
    operation = client.delete_processor(name=processor.name)
    operation.result()  # Wait for operation to complete
    

刪除測試處理器:

processors_to_delete = [dn for dn, _ in test_processor_display_names_and_types]
print("Deleting processors...")

for processor in list_processors():
    if processor.display_name not in processors_to_delete:
        continue
    print(f"  Deleting {processor.display_name}...")
    delete_processor(processor)

print("Done\n")
print_processors()

畫面應會顯示以下內容:

Deleting processors...
  Deleting form-parser...
  Deleting document-ocr...
Done

+--------------+------+-------+
| display_name | type | state |
+--------------+------+-------+
| -            | -    | -     |
+--------------+------+-------+
→ Processors: 0

您已介紹完所有處理器管理方法!就快完成了...

11. 恭喜!

c6d2ea69b1ba0eff.png

您已瞭解如何使用 Python 管理 Document AI 處理器!

清除所用資源

如要清除開發環境,請透過 Cloud Shell 執行下列操作:

  • 如果您目前仍在 IPython 工作階段,請返回殼層:exit
  • 停止使用 Python 虛擬環境:deactivate
  • 刪除虛擬環境資料夾:cd ~ ; rm -rf ./venv-docai

如要刪除 Google Cloud 專案,請透過 Cloud Shell 進行:

  • 擷取目前的專案 ID:PROJECT_ID=$(gcloud config get-value core/project)
  • 請確認這是要刪除的專案:echo $PROJECT_ID
  • 刪除專案:gcloud projects delete $PROJECT_ID

瞭解詳情

授權

這項內容採用的是創用 CC 姓名標示 2.0 通用授權。