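Specialized Processors with Document AI (Python)

1. Introduction

In this codelab, you will learn how to use Document AI specialized processors to classify and parse specialized documents with Python. For classification and splitting, we will use a sample PDF file containing invoices, receipts, and utility statements. For parsing and entity extraction, we will then use an invoice as the example.

The procedure and sample code work with any specialized document that Document AI supports.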

Prerequisites

This codelab builds on content presented in other Document AI codelabs.

We recommend completing the following codelabs before you continue:

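What you'll learn

How to classify specialized documents and identify their split points.
How to extract schematized entities using specialized processors.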

What you'll need

A Google Cloud project
A browser, such as Chrome or Firefox
Knowledge of Python 3

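2. Getting set up

This codelab assumes you have completed the Document AI setup steps listed in the introductory codelab.

Complete the following steps before you continue:

You will also need to install Pandas, a popular data analysis library for Python.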

pip3 install --upgrade pandas

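3. Create specialized processors

You must first create an instance of each processor you will use in this tutorial.

In the console, go to the Document AI Platform Overview.
Click "Create Processor", scroll down to "Specialized", and select "Procurement Doc Splitter".
Name it "codelab-procurement-splitter" (or something else you will remember) and select the closest region from the list.
Click "Create" to create the processor.
Copy the processor ID. You will need it later in your code.
Repeat steps 2 to 5 with the "Invoice Parser" (you can name it "codelab-invoice-parser").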
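The processor ID you copied is combined with your project ID and location to form the processor's full resource name, and the location also selects the regional API endpoint. The scripts later in this codelab obtain both through the client library; the sketch below only illustrates the string formats. The helper names and the example values are hypothetical.

```python
def processor_resource_name(project_id: str, location: str, processor_id: str) -> str:
    """Build the full resource name that identifies one processor."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


def api_endpoint(location: str) -> str:
    """Build the regional endpoint, using the same pattern as the scripts in this codelab."""
    return f"{location}-documentai.googleapis.com"


# Hypothetical values, for illustration only.
name = processor_resource_name("my-project", "us", "abc123")
print(name)
print(api_endpoint("us"))
```

Keeping the splitter ID and the parser ID in separate, clearly named variables makes it harder to send a document to the wrong processor.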

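Test the processor in the console

You can test the Invoice Parser in the console by uploading a document.

Click "Upload Document" and select an invoice to parse. If you do not have a sample invoice available, you can download and use this sample document.

The output should look like this: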

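4. Download sample documents

A few sample documents are available for this lab.

You can download the PDFs using the following links, then upload them to your Cloud Shell instance.

Alternatively, you can download them from the public Cloud Storage bucket using gsutil.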

gsutil cp gs://cloud-samples-data/documentai/codelabs/specialized-processors/procurement_multi_document.pdf .

gsutil cp gs://cloud-samples-data/documentai/codelabs/specialized-processors/google_invoice.pdf .
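Each gs:// URI in the commands above names a bucket followed by an object path. As a small illustration, the sketch below splits such a URI using only the Python standard library. The function name is hypothetical, and this does not download anything.

```python
import posixpath
from typing import Tuple
from urllib.parse import urlparse


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into (bucket, object_path)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a Cloud Storage URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


uri = "gs://cloud-samples-data/documentai/codelabs/specialized-processors/google_invoice.pdf"
bucket, object_path = split_gcs_uri(uri)
# The last path component is the local file name that the copy command produces.
local_name = posixpath.basename(object_path)
print(bucket, local_name)
```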

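5. Classify and split the documents

In this step, you will use the online processing API to classify a multi-page document and detect its logical split points.

You can also use the batch processing API if you want to send multiple files or if a file exceeds the maximum page count for online processing. The Document AI OCR codelab shows how to do this.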
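The choice between online and batch processing described above can be expressed as a small decision helper. This is a minimal sketch: the function name is hypothetical, and the page limit is passed in as a parameter because the actual limit depends on the processor and should be taken from the Document AI documentation.

```python
from typing import List


def choose_processing_mode(page_counts: List[int], online_page_limit: int) -> str:
    """Return "online" for a single file within the page limit, otherwise "batch"."""
    if not page_counts:
        raise ValueError("At least one file is required")
    if len(page_counts) > 1:
        # Several files are sent together with the batch processing API.
        return "batch"
    if page_counts[0] > online_page_limit:
        # A single file that is too long for online processing.
        return "batch"
    return "online"


# Placeholder limit for illustration only; it is NOT the real quota.
HYPOTHETICAL_PAGE_LIMIT = 15

print(choose_processing_mode([5], HYPOTHETICAL_PAGE_LIMIT))
print(choose_processing_mode([5, 1], HYPOTHETICAL_PAGE_LIMIT))
```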

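The code for making the API request is the same as for a general processor, apart from the processor ID.

Procurement Splitter/Classifier

Create a file called classification.py and use the code below.

Replace PROCUREMENT_SPLITTER_ID with the ID of the Procurement Splitter processor you created earlier. Replace YOUR_PROJECT_ID and YOUR_PROJECT_LOCATION with your Cloud project ID and processor location, respectively.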

classification.py

import pandas as pd
from google.cloud import documentai_v1 as documentai


def online_process(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str,
) -> documentai.Document:
    """
    Processes a document using the Document AI Online Processing API.
    """

    opts = {"api_endpoint": f"{location}-documentai.googleapis.com"}

    # Instantiates a client
    documentai_client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    resource_name = documentai_client.processor_path(project_id, location, processor_id)

    # Read the file into memory
    with open(file_path, "rb") as file:
        file_content = file.read()

    # Load Binary Data into Document AI RawDocument Object
    raw_document = documentai.RawDocument(content=file_content, mime_type=mime_type)

    # Configure the process request
    request = documentai.ProcessRequest(name=resource_name, raw_document=raw_document)

    # Use the Document AI client to process the sample form
    result = documentai_client.process_document(request=request)

    return result.document


PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_PROJECT_LOCATION"  # Format is 'us' or 'eu'
PROCESSOR_ID = "PROCUREMENT_SPLITTER_ID"  # Create processor in Cloud Console

# The local file in your current working directory
FILE_PATH = "procurement_multi_document.pdf"
# Refer to https://cloud.google.com/document-ai/docs/processors-list
# for supported file types
MIME_TYPE = "application/pdf"

document = online_process(
    project_id=PROJECT_ID,
    location=LOCATION,
    processor_id=PROCESSOR_ID,
    file_path=FILE_PATH,
    mime_type=MIME_TYPE,
)

print("Document processing complete.")

types = []
confidence = []
pages = []

# Each Document.entity is a classification
for entity in document.entities:
    classification = entity.type_
    types.append(classification)
    confidence.append(f"{entity.confidence:.0%}")

    # entity.page_anchor.page_refs contains the pages that match the classification
    pages_list = []
    for page_ref in entity.page_anchor.page_refs:
        pages_list.append(page_ref.page)
    pages.append(pages_list)

# Create a Pandas Dataframe to print the values in tabular format.
df = pd.DataFrame({"Classification": types, "Confidence": confidence, "Pages": pages})

print(df)

The output should look like this:

$ python3 classification.py
Document processing complete.
         Classification Confidence Pages
0     invoice_statement       100%   [0]
1     receipt_statement        98%   [1]
2                 other        81%   [2]
3     utility_statement       100%   [3]
4  restaurant_statement       100%   [4]

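Note that the Procurement Splitter/Classifier correctly identified the document types on pages 0-1 and 3-4.

Page 2 contains a generic medical intake form, so the classifier correctly identified it as "other".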
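Once each classification comes with its page list, a natural next step is to turn the result into a split plan and flag low-confidence segments for review. The sketch below works on plain tuples copied from the sample output (using the rounded percentages as printed) instead of live API objects. The function name and the 0.9 threshold are assumptions chosen for illustration.

```python
# (classification, confidence, pages) rows copied from the sample output above.
rows = [
    ("invoice_statement", 1.00, [0]),
    ("receipt_statement", 0.98, [1]),
    ("other", 0.81, [2]),
    ("utility_statement", 1.00, [3]),
    ("restaurant_statement", 1.00, [4]),
]


def split_plan(rows, min_confidence):
    """Describe each logical document by its page range and a review flag."""
    plan = []
    for classification, confidence, pages in rows:
        plan.append(
            {
                "type": classification,
                "first_page": min(pages),
                "last_page": max(pages),
                "needs_review": confidence < min_confidence,
            }
        )
    return plan


plan = split_plan(rows, min_confidence=0.9)  # hypothetical threshold
for segment in plan:
    print(segment)
```

With this threshold, only the "other" segment on page 2 is flagged for review.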

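6. Extract the entities

You can now extract the schematized entities from the file, including confidence scores, properties, and normalized values.

The code for making the API request is the same as in the previous step, and it can be done with online or batch requests.

We will access the following information from the entities: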

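Entity type
- (for example: invoice_date, receiver_name, total_amount)
Raw values
- Data values as presented in the original document file.
Normalized values
- Data values in a normalized, standard format, if applicable.
- Can also include enrichment from Enterprise Knowledge Graph.
Confidence values
- How "sure" the model is that the values are accurate.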
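Normalized values are easier to work with in code than raw text. The sketch below parses the formats that appear in this codelab's sample output, such as "1767.97 USD" and "2019-09-24". It assumes those exact formats; other fields or processors may normalize differently, and the function name is hypothetical.

```python
from datetime import date
from decimal import Decimal


def parse_normalized_money(text: str):
    """Split a normalized money string such as '1767.97 USD' into (amount, currency).

    The currency is None when the string contains only a number.
    """
    parts = text.split()
    amount = Decimal(parts[0])
    currency = parts[1] if len(parts) > 1 else None
    return amount, currency


# Normalized dates in the sample output use the ISO 8601 YYYY-MM-DD form.
invoice_date = date.fromisoformat("2019-09-24")
tax = parse_normalized_money("1767.97 USD")
total = parse_normalized_money("19647.68")
print(invoice_date, tax, total)
```

Decimal is used instead of float so that monetary amounts are not subject to binary rounding.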

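Some entity types, such as line_item, can also include properties, which are nested entities such as line_item/unit_price and line_item/description.

This example flattens the nested structure for easier viewing.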
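The flattening step can be illustrated without calling the API. In the sketch below, plain dictionaries stand in for the Document AI entity objects; the dictionary keys and the function name are assumptions for illustration, and the values are taken from the sample output in this section.

```python
def flatten_entities(entities):
    """Return (type, mention_text) rows: each entity first, then its properties."""
    rows = []
    for entity in entities:
        rows.append((entity["type"], entity["mention_text"]))
        # Properties are nested entities; emit them directly after their parent.
        for prop in entity.get("properties", []):
            rows.append((prop["type"], prop["mention_text"]))
    return rows


# Stand-in data shaped like a parsed invoice.
entities = [
    {"type": "invoice_date", "mention_text": "Sep 24, 2019"},
    {
        "type": "line_item",
        "mention_text": "9.99 12 12 ft HDMI cable 119.88",
        "properties": [
            {"type": "line_item/unit_price", "mention_text": "9.99"},
            {"type": "line_item/quantity", "mention_text": "12"},
        ],
    },
]

flat_rows = flatten_entities(entities)
for row in flat_rows:
    print(row)
```

Because each parent row is emitted before its properties, the flat table keeps every line item's fields grouped together.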

Invoice Parser

Create a file called extraction.py and use the code below.

Replace INVOICE_PARSER_ID with the ID of the Invoice Parser processor you created earlier, and use the file google_invoice.pdf.

extraction.py

import pandas as pd
from google.cloud import documentai_v1 as documentai


def online_process(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str,
) -> documentai.Document:
    """
    Processes a document using the Document AI Online Processing API.
    """

    opts = {"api_endpoint": f"{location}-documentai.googleapis.com"}

    # Instantiates a client
    documentai_client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    resource_name = documentai_client.processor_path(project_id, location, processor_id)

    # Read the file into memory
    with open(file_path, "rb") as file:
        file_content = file.read()

    # Load Binary Data into Document AI RawDocument Object
    raw_document = documentai.RawDocument(content=file_content, mime_type=mime_type)

    # Configure the process request
    request = documentai.ProcessRequest(name=resource_name, raw_document=raw_document)

    # Use the Document AI client to process the sample form
    result = documentai_client.process_document(request=request)

    return result.document


PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_PROJECT_LOCATION"  # Format is 'us' or 'eu'
PROCESSOR_ID = "INVOICE_PARSER_ID"  # Create processor in Cloud Console

# The local file in your current working directory
FILE_PATH = "google_invoice.pdf"
# Refer to https://cloud.google.com/document-ai/docs/processors-list
# for supported file types
MIME_TYPE = "application/pdf"

document = online_process(
    project_id=PROJECT_ID,
    location=LOCATION,
    processor_id=PROCESSOR_ID,
    file_path=FILE_PATH,
    mime_type=MIME_TYPE,
)

types = []
raw_values = []
normalized_values = []
confidence = []

# Grab each key/value pair and their corresponding confidence scores.
for entity in document.entities:
    types.append(entity.type_)
    raw_values.append(entity.mention_text)
    normalized_values.append(entity.normalized_value.text)
    confidence.append(f"{entity.confidence:.0%}")

    # Get Properties (Sub-Entities) with confidence scores
    for prop in entity.properties:
        types.append(prop.type_)
        raw_values.append(prop.mention_text)
        normalized_values.append(prop.normalized_value.text)
        confidence.append(f"{prop.confidence:.0%}")

# Create a Pandas Dataframe to print the values in tabular format.
df = pd.DataFrame(
    {
        "Type": types,
        "Raw Value": raw_values,
        "Normalized Value": normalized_values,
        "Confidence": confidence,
    }
)

print(df)

The output should look like this:

$ python3 extraction.py
                     Type                                         Raw Value Normalized Value Confidence
0                     vat                                         $1,767.97                        100%
1          vat/tax_amount                                         $1,767.97      1767.97 USD         0%
2            invoice_date                                      Sep 24, 2019       2019-09-24        99%
3                due_date                                      Sep 30, 2019       2019-09-30        99%
4            total_amount                                         19,647.68         19647.68        97%
5        total_tax_amount                                         $1,767.97      1767.97 USD        92%
6              net_amount                                         22,379.39         22379.39        91%
7           receiver_name                                       Jane Smith,                         83%
8              invoice_id                                         23413561D                         67%
9        receiver_address  1600 Amphitheatre Pkway\nMountain View, CA 94043                         66%
10         freight_amount                                           $199.99       199.99 USD        56%
11               currency                                                 $              USD        53%
12          supplier_name                                        John Smith                         19%
13         purchase_order                                         23413561D                          1%
14        receiver_tax_id                                         23413561D                          0%
15          supplier_iban                                         23413561D                          0%
16              line_item                   9.99 12 12 ft HDMI cable 119.88                        100%
17   line_item/unit_price                                              9.99             9.99        90%
18     line_item/quantity                                                12               12        77%
19  line_item/description                                  12 ft HDMI cable                         39%
20       line_item/amount                                            119.88           119.88        92%
21              line_item           12 399.99 27" Computer Monitor 4,799.88                        100%
22     line_item/quantity                                                12               12        80%
23   line_item/unit_price                                            399.99           399.99        91%
24  line_item/description                              27" Computer Monitor                         15%
25       line_item/amount                                          4,799.88          4799.88        94%
26              line_item                Ergonomic Keyboard 12 59.99 719.88                        100%
27  line_item/description                                Ergonomic Keyboard                         32%
28     line_item/quantity                                                12               12        76%
29   line_item/unit_price                                             59.99            59.99        92%
30       line_item/amount                                            719.88           719.88        94%
31              line_item                     Optical mouse 12 19.99 239.88                        100%
32  line_item/description                                     Optical mouse                         26%
33     line_item/quantity                                                12               12        78%
34   line_item/unit_price                                             19.99            19.99        91%
35       line_item/amount                                            239.88           239.88        94%
36              line_item                      Laptop 12 1,299.99 15,599.88                        100%
37  line_item/description                                            Laptop                         83%
38     line_item/quantity                                                12               12        76%
39   line_item/unit_price                                          1,299.99          1299.99        90%
40       line_item/amount                                         15,599.88         15599.88        94%
41              line_item              Misc processing fees 899.99 899.99 1                        100%
42  line_item/description                              Misc processing fees                         22%
43   line_item/unit_price                                            899.99           899.99        91%
44       line_item/amount                                            899.99           899.99        94%
45     line_item/quantity                                                 1                1        63%
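Extracted values can be cross-checked against each other before they are used downstream. As one possible sanity check, the sketch below sums the normalized line_item/amount values from the sample output and compares the result with the extracted net_amount. The values are copied from the table above; the check itself is an illustration, not part of the codelab's scripts.

```python
from decimal import Decimal

# Normalized line_item/amount values copied from the sample output above.
line_item_amounts = ["119.88", "4799.88", "719.88", "239.88", "15599.88", "899.99"]

# Normalized net_amount value from the same output.
net_amount = Decimal("22379.39")

# Decimal avoids the rounding drift that float addition would introduce.
line_item_total = sum((Decimal(a) for a in line_item_amounts), Decimal("0"))

print(line_item_total, line_item_total == net_amount)  # prints: 22379.39 True
```

In this sample the line items add up to the net_amount value. A mismatch on a real document would be a reason to review the low-confidence fields by hand.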

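7. Optional: Try other specialized processors

You have successfully used Document AI for Procurement to classify documents and parse an invoice. Document AI also supports the following other specialized solutions:

You can follow the same procedure and use the same code with any specialized processor.

If you would like to try the other specialized solutions, you can rerun the lab with other processor types and specialized sample documents.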

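Sample documents

Here are some sample documents you can use to try out the other specialized processors.

Solution	Processor type	Document
Identity	US Driver License Parser
Lending	Lending Splitter & Classifier
Lending	W9 Parser
Contract	Contract Parser

You can find other sample documents and processor output in the documentation.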

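8. Congratulations

Congratulations! You have successfully used Document AI to classify specialized documents and extract data from them. We encourage you to experiment with other specialized document types.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:

In the Cloud Console, go to the "Manage resources" page.
In the project list, select your project, then click "Delete".
In the dialog, type the project ID, then click "Shut down" to delete the project.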
