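Optical Character Recognition (OCR) with Document AI (Python)

1. Overview

What is Document AI?

Document AI is a document understanding solution that takes unstructured data such as documents, emails, invoices, and forms and makes it easier to understand, analyze, and consume. The API provides structure through content classification, entity extraction, advanced search, and more.

In this lab, you will learn how to perform optical character recognition with the Document AI API in Python.

The lab uses a PDF of A.A. Milne's children's novel "Winnie the Pooh", which recently entered the public domain in the United States. The file was scanned and digitized by Google Books.

What you'll learn

How to enable the Document AI API
How to authenticate API requests
How to install the client library for Python
How to use the online and batch processing APIs
How to parse text from a PDF file

What you'll need

A Google Cloud project
A browser, such as Chrome or Firefox
Familiarity with Python (3.9 or later)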

2. Setup and requirements

Self-paced environment setup

Sign in to the Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.

Remember the project ID, a name that is unique across all Google Cloud projects. It is referred to as PROJECT_ID in the rest of this codelab.

Next, you must enable billing in the Cloud Console in order to use Google Cloud resources.

Be sure to follow the instructions in the "Cleanup" section, which explains how to shut down resources so that you are not billed after finishing this tutorial. New users of Google Cloud are eligible for the $300 USD free trial program.

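Starting Cloud Shell

While Google Cloud can be operated remotely from your laptop, this codelab uses Google Cloud Shell, a command-line environment running in the cloud.

Activate Cloud Shell

In the Cloud Console, click "Activate Cloud Shell".

If you have never started Cloud Shell before, an intermediate screen describing it appears (below the fold). If so, click [Continue]; you won't see it again.

It should take only a few moments to provision and connect to Cloud Shell.

Cloud Shell gives you terminal access to a virtual machine hosted in the cloud. The virtual machine is loaded with all the development tools you need. It offers a persistent 5 GB home directory and runs in Google Cloud, which greatly improves network performance and authentication. Most of your work in this lab can be done from a browser.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is set to your project ID.

Run the following command in Cloud Shell to confirm that you are authenticated: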

gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*      <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

Next, confirm that gcloud knows about your project:

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If it is not, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

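3. Enable the Document AI API

Before you can begin using Document AI, you must enable the API. You can do this with the gcloud command-line interface or the Cloud Console.

Using the `gcloud` CLI

If you are not using Cloud Shell, follow the steps in "Install the gcloud CLI" on your local machine.
You can enable the APIs with the following gcloud command: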

gcloud services enable documentai.googleapis.com storage.googleapis.com

You should see something like this:

Operation "operations/..." finished successfully.

Using the Cloud Console

Open the Cloud Console in your browser.

Using the search bar at the top of the console, search for "Document AI API", then click [Enable] to use the API in your Google Cloud project.

Repeat the same steps for the Google Cloud Storage API.

You can now use Document AI.

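4. Create and test a processor

First, create an instance of the Document OCR processor that will perform the extraction. You can do this with the Cloud Console or the Processor Management API.

Cloud Console

In the console, navigate to the Document AI Platform [Overview].
Click [Explore Processors] and select [Document OCR].
Name it codelab-ocr (or another name you will remember) and select the closest region from the list.
Click [Create] to create the processor.
Copy the processor ID. You will use it later in your code.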
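To show where the copied processor ID ends up, here is a minimal sketch of the two strings the later scripts depend on: the regional API endpoint and the processor's full resource name. The formats follow the comments in this codelab's scripts; the helper names `api_endpoint` and `processor_resource_name` and the sample IDs are hypothetical. In the actual scripts, the client library's `processor_path()` builds the resource name for you.

```python
def api_endpoint(location: str) -> str:
    """Build the regional endpoint, e.g. 'eu-documentai.googleapis.com'."""
    return f"{location}-documentai.googleapis.com"


def processor_resource_name(project_id: str, location: str, processor_id: str) -> str:
    """Build projects/{project}/locations/{location}/processors/{processor}."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


if __name__ == "__main__":
    # Placeholder values; substitute your own project, region, and processor ID.
    print(api_endpoint("us"))
    print(processor_resource_name("my-project", "us", "abc123"))
```

The location in the endpoint and the location in the resource name must match the region you chose when creating the processor.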

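You can test the processor in the console by uploading a document. Click [Upload Test Document] and select a document to parse.

Use the sample PDF, which contains the first 3 pages of the novel (a download command appears in step 7).

When processing finishes, the console displays the parsed document.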

Python client library

To learn how to manage Document AI processors with the Python client library, see the following codelab:

Managing Document AI processors with Python - Codelab

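5. Authenticate API requests

To make requests to the Document AI API, you must use a service account. A service account belongs to your project and is used by the Python client library to make API requests. Like any other user account, a service account is represented by an email address. In this section, you will use the Cloud SDK to create a service account and then create the credentials you need to authenticate as that service account.

First, open Cloud Shell and set an environment variable with your PROJECT_ID, which you will use throughout this codelab: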

export GOOGLE_CLOUD_PROJECT=$(gcloud config get-value core/project)

Next, create a new service account to access the Document AI API:

gcloud iam service-accounts create my-docai-sa \
  --display-name "my-docai-service-account"

Grant the service account permissions to access Document AI and Cloud Storage in your project:

gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} \
    --member="serviceAccount:my-docai-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com" \
    --role="roles/documentai.admin"

gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} \
    --member="serviceAccount:my-docai-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com" \
    --role="roles/storage.admin"

gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} \
    --member="serviceAccount:my-docai-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com" \
    --role="roles/serviceusage.serviceUsageConsumer"

Create the credentials that your Python code will use to log in as the new service account. The following command creates the credentials and saves them as a JSON file, ~/key.json:

gcloud iam service-accounts keys create ~/key.json \
  --iam-account  my-docai-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com

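Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which the library uses to find your credentials. To read more about this form of authentication, see the guide. Set the variable to the full path of the credentials JSON file you just created: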

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
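Before running any of the scripts, it can help to confirm that the variable points at a readable service account key. The sketch below is one way to do that with the standard library only; the function name `describe_service_account_key` is hypothetical, and it relies on the `type`, `client_email`, and `project_id` fields found in service account key files. It deliberately never prints the private key.

```python
import json
import os
from pathlib import Path


def describe_service_account_key(path: str) -> dict:
    """Return the non-secret identifying fields of a service account key file."""
    key_path = Path(path).expanduser()
    if not key_path.is_file():
        raise FileNotFoundError(f"No key file at {key_path}")
    data = json.loads(key_path.read_text(encoding="utf-8"))
    if data.get("type") != "service_account":
        raise ValueError(f"{key_path} is not a service account key")
    return {
        "client_email": data.get("client_email"),
        "project_id": data.get("project_id"),
    }


if __name__ == "__main__":
    key_file = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_file is None:
        print("GOOGLE_APPLICATION_CREDENTIALS is not set")
    else:
        print(describe_service_account_key(key_file))
```

If the printed `client_email` is not the my-docai-sa account created above, the variable is probably pointing at a different key file.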

6. Install the client library

Install the Python client libraries for Document AI, Cloud Storage, and Document AI Toolbox:

pip3 install --upgrade google-cloud-documentai
pip3 install --upgrade google-cloud-storage
pip3 install --upgrade google-cloud-documentai-toolbox

You should see something like this:

...
Installing collected packages: google-cloud-documentai
Successfully installed google-cloud-documentai-2.15.0
.
.
Installing collected packages: google-cloud-storage
Successfully installed google-cloud-storage-2.9.0
.
.
Installing collected packages: google-cloud-documentai-toolbox
Successfully installed google-cloud-documentai-toolbox-0.6.0a0

You are now ready to use the Document AI API.
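If the pip output has scrolled away, the installed versions can also be checked from Python. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the three pip commands above.

```python
from importlib import metadata

# The three distributions installed in this step.
PACKAGES = (
    "google-cloud-documentai",
    "google-cloud-storage",
    "google-cloud-documentai-toolbox",
)


def installed_versions(packages=PACKAGES) -> dict:
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```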

7. Download the sample PDF

A sample document containing the first 3 pages of the novel is provided.

You can download the PDF and then upload it to your Cloud Shell instance.

You can also download it from Google's public Cloud Storage bucket using gsutil:

gsutil cp gs://cloud-samples-data/documentai/codelabs/ocr/Winnie_the_Pooh_3_Pages.pdf .

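8. Make an online processing request

In this step, you will process the first 3 pages of the novel using the online processing (synchronous) API. This method is best suited to smaller documents that are stored locally. See the full processor list for the maximum number of pages and the file size for each processor type.

Use the Cloud Shell editor or a text editor on your local machine to create a file called online_processing.py with the code below.

Replace YOUR_PROJECT_ID, YOUR_PROJECT_LOCATION, YOUR_PROCESSOR_ID, and FILE_PATH with values appropriate for your environment.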
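If FILE_PATH points at something other than a PDF, the MIME_TYPE constant in the script must change with it. The sketch below derives the MIME type from the file extension. The extension table is an illustrative subset and an assumption on our part; the authoritative list is the supported file types page referenced in the script's comments. The helper name `guess_mime_type` is hypothetical.

```python
from pathlib import Path

# Illustrative subset (assumption); confirm against the supported file types page.
MIME_TYPES = {
    ".pdf": "application/pdf",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
    ".gif": "image/gif",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".bmp": "image/bmp",
    ".webp": "image/webp",
}


def guess_mime_type(file_path: str) -> str:
    """Look up the MIME type for a file by its (case-insensitive) extension."""
    suffix = Path(file_path).suffix.lower()
    try:
        return MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported or unknown file extension: {suffix!r}") from None


if __name__ == "__main__":
    print(guess_mime_type("Winnie_the_Pooh_3_Pages.pdf"))
```

Failing loudly on an unknown extension is a deliberate choice here: a wrong MIME type would otherwise only surface later as an API error.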

online_processing.py

from google.api_core.client_options import ClientOptions
from google.cloud import documentai


PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_PROJECT_LOCATION"  # Format is 'us' or 'eu'
PROCESSOR_ID = "YOUR_PROCESSOR_ID"  # Create processor in Cloud Console

# The local file in your current working directory
FILE_PATH = "Winnie_the_Pooh_3_Pages.pdf"
# Refer to https://cloud.google.com/document-ai/docs/file-types
# for supported file types
MIME_TYPE = "application/pdf"

# Instantiates a client
docai_client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{LOCATION}-documentai.googleapis.com")
)

# The full resource name of the processor, e.g.:
# projects/project-id/locations/location/processors/processor-id
# You must create new processors in the Cloud Console first
RESOURCE_NAME = docai_client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

# Read the file into memory
with open(FILE_PATH, "rb") as image:
    image_content = image.read()

# Load Binary Data into Document AI RawDocument Object
raw_document = documentai.RawDocument(content=image_content, mime_type=MIME_TYPE)

# Configure the process request
request = documentai.ProcessRequest(name=RESOURCE_NAME, raw_document=raw_document)

# Use the Document AI client to process the sample form
result = docai_client.process_document(request=request)

document_object = result.document
print("Document processing complete.")
print(f"Text: {document_object.text}")

Run the code. The text is extracted and printed to the console.

If you use the sample document, the output looks like this:

Document processing complete.
Text: CHAPTER I
IN WHICH We Are Introduced to
Winnie-the-Pooh and Some
Bees, and the Stories Begin
Here is Edward Bear, coming
downstairs now, bump, bump, bump, on the back
of his head, behind Christopher Robin. It is, as far
as he knows, the only way of coming downstairs,
but sometimes he feels that there really is another
way, if only he could stop bumping for a moment
and think of it. And then he feels that perhaps there
isn't. Anyhow, here he is at the bottom, and ready
to be introduced to you. Winnie-the-Pooh.
When I first heard his name, I said, just as you
are going to say, "But I thought he was a boy?"
"So did I," said Christopher Robin.
"Then you can't call him Winnie?"
"I don't."
"But you said "

...

Digitized by
Google
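The script prints the whole `text` string, but the Document object also ties each page element back to that string through text anchors: a paragraph's `layout.text_anchor.text_segments` hold `start_index` and `end_index` offsets into `document.text`. The sketch below slices the text per paragraph. The helper names are hypothetical, and the demo uses a small stand-in object shaped like a Document so that it runs without calling the API; with a real response you would pass `result.document` instead.

```python
from types import SimpleNamespace


def layout_to_text(layout, text: str) -> str:
    """Join the slices of `text` that a layout's text anchor points at."""
    return "".join(
        text[int(segment.start_index) : int(segment.end_index)]
        for segment in layout.text_anchor.text_segments
    )


def paragraphs_by_page(document) -> dict:
    """Map page number -> list of paragraph strings."""
    return {
        page.page_number: [
            layout_to_text(paragraph.layout, document.text)
            for paragraph in page.paragraphs
        ]
        for page in document.pages
    }


if __name__ == "__main__":
    # Stand-in shaped like a Document (assumption for the demo only).
    def _layout(start, end):
        segment = SimpleNamespace(start_index=start, end_index=end)
        return SimpleNamespace(text_anchor=SimpleNamespace(text_segments=[segment]))

    fake_document = SimpleNamespace(
        text="CHAPTER I\nHere is Edward Bear\n",
        pages=[
            SimpleNamespace(
                page_number=1,
                paragraphs=[
                    SimpleNamespace(layout=_layout(0, 10)),
                    SimpleNamespace(layout=_layout(10, 30)),
                ],
            )
        ],
    )
    print(paragraphs_by_page(fake_document))
```

The same slicing applies to other page elements that carry a layout, such as lines and tokens.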

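9. Make a batch processing request

Now, suppose you want to read in the text of the entire novel.

Online processing limits the number of pages and the file size that can be sent, and allows only one document file per API call.
Batch processing lets you process larger files or multiple files asynchronously.

In this step, you will process the entire "Winnie the Pooh" novel with the Document AI Batch Processing API and write the text to a Google Cloud Storage bucket.

Batch processing uses long-running operations to manage requests asynchronously, so making the request and retrieving the output work differently from online processing. In both cases, however, the output uses the same Document object format.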
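To make "long-running operation" concrete, the sketch below models what waiting on one involves: poll a status check, back off between polls, and give up after a timeout. This is only an illustration of the pattern, not the client library's implementation; in the script for this step, `operation.result(timeout=...)` does the waiting for you. The function `wait_until_done` and its parameters are hypothetical.

```python
import time
from typing import Callable


def wait_until_done(
    is_done: Callable[[], bool],
    timeout: float = 400.0,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `is_done` with exponential backoff and return the total time slept.

    Raises TimeoutError once `timeout` seconds have been spent sleeping
    without `is_done` returning True.
    """
    waited = 0.0
    delay = initial_delay
    while not is_done():
        if waited >= timeout:
            raise TimeoutError(f"Operation not finished after {timeout} seconds")
        pause = min(delay, timeout - waited)
        sleep(pause)
        waited += pause
        delay = min(delay * 2, max_delay)
    return waited


if __name__ == "__main__":
    # Simulated operation that completes on the fourth status check.
    checks = iter([False, False, False, True])
    total = wait_until_done(lambda: next(checks), sleep=lambda seconds: None)
    print(f"Simulated wait: {total} seconds")
```

Passing `sleep` as a parameter keeps the demo instant and makes the backoff schedule easy to inspect.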

This step shows how to provide specific documents for Document AI to process. A later step shows how to process an entire directory of documents.

Upload the PDF to Cloud Storage

The batch_process_documents() method currently accepts files from Google Cloud Storage. For details on the object structure, see documentai_v1.types.BatchProcessRequest.

For this example, you can read the file directly from the sample bucket.

You can also copy the file into your own bucket using gsutil:

gsutil cp gs://cloud-samples-data/documentai/codelabs/ocr/Winnie_the_Pooh.pdf gs://YOUR_BUCKET_NAME/

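Alternatively, you can download the sample file of the novel and upload it to your own bucket.

You will also need a GCS bucket to store the output of the API.

To learn how to create storage buckets, see the Cloud Storage documentation.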
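The batch script expects the output location as a `gs://` URI that ends with a trailing slash, and later splits such URIs into a bucket name and a prefix for the Cloud Storage client. The sketch below shows both steps with the standard library only; the helper names `split_gcs_uri` and `as_output_prefix` and the bucket name in the demo are hypothetical.

```python
import re

# gs://BUCKET/optional/prefix
_GCS_URI = re.compile(r"^gs://([^/]+)/?(.*)$")


def split_gcs_uri(uri: str) -> tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    match = _GCS_URI.match(uri)
    if not match:
        raise ValueError(f"Not a Cloud Storage URI: {uri!r}")
    return match.group(1), match.group(2)


def as_output_prefix(uri: str) -> str:
    """Validate an output URI and make sure it ends with a trailing slash."""
    split_gcs_uri(uri)  # Raises ValueError on malformed input.
    return uri if uri.endswith("/") else uri + "/"


if __name__ == "__main__":
    output_uri = as_output_prefix("gs://my-bucket/docai-output")
    print(output_uri)
    print(split_gcs_uri(output_uri))
```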

Using the `batch_process_documents()` method

Create a file called batch_processing.py with the code below.

Replace YOUR_PROJECT_ID, YOUR_PROCESSOR_LOCATION, YOUR_PROCESSOR_ID, YOUR_INPUT_URI, and YOUR_OUTPUT_URI with values appropriate for your environment.

Make sure that YOUR_INPUT_URI points directly to the PDF file, for example: gs://cloud-samples-data/documentai/codelabs/ocr/Winnie_the_Pooh.pdf.
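The reason the URI shape matters is that batch_processing.py inspects it to choose between a single-document request and a directory (prefix) request. The sketch below isolates that test so you can try your own URIs against it; the condition is copied from the script, while the function name `input_kind` is hypothetical.

```python
def input_kind(gcs_input_uri: str) -> str:
    """Return 'file' or 'prefix' using the same test as batch_processing.py."""
    if not gcs_input_uri.endswith("/") and "." in gcs_input_uri:
        return "file"
    return "prefix"


if __name__ == "__main__":
    for uri in (
        "gs://cloud-samples-data/documentai/codelabs/ocr/Winnie_the_Pooh.pdf",
        "gs://cloud-samples-data/documentai/codelabs/ocr/multi-document/",
        "gs://my-bucket/v1.0/scans",  # No trailing slash, but contains a dot.
    ):
        print(f"{input_kind(uri)}: {uri}")
```

Note the third URI: a directory whose path contains a dot and lacks a trailing slash is treated as a single file, so ending directory URIs with `/` is the safer habit.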

batch_processing.py

"""
Makes a Batch Processing Request to Document AI
"""

import re

from google.api_core.client_options import ClientOptions
from google.api_core.exceptions import InternalServerError
from google.api_core.exceptions import RetryError
from google.cloud import documentai
from google.cloud import storage

# TODO(developer): Fill these variables before running the sample.
project_id = "YOUR_PROJECT_ID"
location = "YOUR_PROCESSOR_LOCATION"  # Format is "us" or "eu"
processor_id = "YOUR_PROCESSOR_ID"  # Create processor before running sample
gcs_output_uri = "YOUR_OUTPUT_URI"  # Must end with a trailing slash `/`. Format: gs://bucket/directory/subdirectory/
processor_version_id = (
    "YOUR_PROCESSOR_VERSION_ID"  # Optional. Example: pretrained-ocr-v1.0-2020-09-23
)

# TODO(developer): If `gcs_input_uri` is a single file, `mime_type` must be specified.
gcs_input_uri = "YOUR_INPUT_URI"  # Format: `gs://bucket/directory/file.pdf` or `gs://bucket/directory/`
input_mime_type = "application/pdf"
field_mask = "text,entities,pages.pageNumber"  # Optional. The fields to return in the Document object.


def batch_process_documents(
    project_id: str,
    location: str,
    processor_id: str,
    gcs_input_uri: str,
    gcs_output_uri: str,
    processor_version_id: str = None,
    input_mime_type: str = None,
    field_mask: str = None,
    timeout: int = 400,
):
    # You must set the api_endpoint if you use a location other than "us".
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    if not gcs_input_uri.endswith("/") and "." in gcs_input_uri:
        # Specify specific GCS URIs to process individual documents
        gcs_document = documentai.GcsDocument(
            gcs_uri=gcs_input_uri, mime_type=input_mime_type
        )
        # Load GCS Input URI into a List of document files
        gcs_documents = documentai.GcsDocuments(documents=[gcs_document])
        input_config = documentai.BatchDocumentsInputConfig(gcs_documents=gcs_documents)
    else:
        # Specify a GCS URI Prefix to process an entire directory
        gcs_prefix = documentai.GcsPrefix(gcs_uri_prefix=gcs_input_uri)
        input_config = documentai.BatchDocumentsInputConfig(gcs_prefix=gcs_prefix)

    # Cloud Storage URI for the Output Directory
    gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
        gcs_uri=gcs_output_uri, field_mask=field_mask
    )

    # Where to write results
    output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)

    if processor_version_id:
        # The full resource name of the processor version, e.g.:
        # projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}
        name = client.processor_version_path(
            project_id, location, processor_id, processor_version_id
        )
    else:
        # The full resource name of the processor, e.g.:
        # projects/{project_id}/locations/{location}/processors/{processor_id}
        name = client.processor_path(project_id, location, processor_id)

    request = documentai.BatchProcessRequest(
        name=name,
        input_documents=input_config,
        document_output_config=output_config,
    )

    # BatchProcess returns a Long Running Operation (LRO)
    operation = client.batch_process_documents(request)

    # Continually polls the operation until it is complete.
    # This could take some time for larger files
    # Format: projects/{project_id}/locations/{location}/operations/{operation_id}
    try:
        print(f"Waiting for operation {operation.operation.name} to complete...")
        operation.result(timeout=timeout)
    # Catch exception when operation doesn't finish before timeout
    except (RetryError, InternalServerError) as e:
        print(e.message)

    # NOTE: Can also use callbacks for asynchronous processing
    #
    # def my_callback(future):
    #   result = future.result()
    #
    # operation.add_done_callback(my_callback)

    # Once the operation is complete,
    # get output document information from operation metadata
    metadata = documentai.BatchProcessMetadata(operation.metadata)

    if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:
        raise ValueError(f"Batch Process Failed: {metadata.state_message}")

    storage_client = storage.Client()

    print("Output files:")
    # One process per Input Document
    for process in list(metadata.individual_process_statuses):
        # output_gcs_destination format: gs://BUCKET/PREFIX/OPERATION_NUMBER/INPUT_FILE_NUMBER/
        # The Cloud Storage API requires the bucket name and URI prefix separately
        matches = re.match(r"gs://(.*?)/(.*)", process.output_gcs_destination)
        if not matches:
            print(
                "Could not parse output GCS destination:",
                process.output_gcs_destination,
            )
            continue

        output_bucket, output_prefix = matches.groups()

        # Get List of Document Objects from the Output Bucket
        output_blobs = storage_client.list_blobs(output_bucket, prefix=output_prefix)

        # Document AI may output multiple JSON files per source file
        for blob in output_blobs:
            # Document AI should only output JSON files to GCS
            if blob.content_type != "application/json":
                print(
                    f"Skipping non-supported file: {blob.name} - Mimetype: {blob.content_type}"
                )
                continue

            # Download JSON File as bytes object and convert to Document Object
            print(f"Fetching {blob.name}")
            document = documentai.Document.from_json(
                blob.download_as_bytes(), ignore_unknown_fields=True
            )

            # For a full list of Document object attributes, please reference this page:
            # https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document

            # Read the text recognition output from the processor
            print("The document contains the following text:")
            print(document.text)


if __name__ == "__main__":
    batch_process_documents(
        project_id=project_id,
        location=location,
        processor_id=processor_id,
        gcs_input_uri=gcs_input_uri,
        gcs_output_uri=gcs_output_uri,
        input_mime_type=input_mime_type,
        field_mask=field_mask,
    )

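Run the code. The text of the entire novel is extracted and printed to the console.

This may take some time to complete because the file is much larger than the previous sample.

The Batch Processing API returns an operation ID, which you can use to retrieve the output from GCS once the task is complete.

Your output should look something like this: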

Waiting for operation projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_NUMBER to complete...
Document processing complete.
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-0.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-1.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-10.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-11.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-12.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-13.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-14.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-15.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-16.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-17.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-18.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-2.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-3.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-4.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-5.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-6.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-7.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-8.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-9.json

This is a reproduction of a library book that was digitized
by Google as part of an ongoing effort to preserve the
information in books and make it universally accessible.
TM
Google books
https://books.google.com

.....

He nodded and went
out ... and in a moment
I heard Winnie-the-Pooh
-bump, bump, bump-go-ing up the stairs behind
him.
Digitized by
Google

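10. Make a batch processing request for a directory

Sometimes you want to process a whole directory of documents instead of listing each document individually. The batch_process_documents() method accepts either a list of specific documents or a directory path.

This step shows how to process an entire directory of document files. Most of the code is the same as in the previous step; the only difference is the GCS URI sent with the BatchProcessRequest.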
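One practical consequence of sending a directory is on the output side: according to the path format noted in batch_processing.py (`PREFIX/OPERATION_NUMBER/INPUT_FILE_NUMBER/`), each input file's results land in their own numbered folder. The sketch below groups output object names by that folder number. It works on plain strings, the function name `group_by_input_document` is hypothetical, and the object names in the demo are made up to match the pattern.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_input_document(blob_names) -> dict:
    """Map input file number -> output JSON object names in that folder."""
    groups = defaultdict(list)
    for name in blob_names:
        path = PurePosixPath(name)
        folder = path.parent.name
        # Skip anything that is not a JSON file inside a numbered folder.
        if path.suffix != ".json" or not folder.isdigit():
            continue
        groups[int(folder)].append(name)
    return dict(groups)


if __name__ == "__main__":
    names = [
        "docai-output/123/0/Winnie_the_Pooh_Page_0-0.json",
        "docai-output/123/1/Winnie_the_Pooh_Page_1-0.json",
        "docai-output/123/2/Winnie_the_Pooh_Page_10-0.json",
    ]
    for number, outputs in sorted(group_by_input_document(names).items()):
        print(number, outputs)
```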

The sample bucket has a directory that contains pages of the novel in separate files:

gs://cloud-samples-data/documentai/codelabs/ocr/multi-document/

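You can read the files directly or copy them into your own Cloud Storage bucket.

Re-run the code from the previous step, replacing YOUR_INPUT_URI with a directory in Cloud Storage.

Run the code. The text is extracted from every document file in the Cloud Storage directory.

Your output should look something like this: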

Waiting for operation projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_NUMBER to complete...
Document processing complete.
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh_Page_0-0.json
Fetching docai-output/OPERATION_NUMBER/1/Winnie_the_Pooh_Page_1-0.json
Fetching docai-output/OPERATION_NUMBER/2/Winnie_the_Pooh_Page_10-0.json
Fetching docai-output/OPERATION_NUMBER/3/Winnie_the_Pooh_Page_12-0.json
Fetching docai-output/OPERATION_NUMBER/4/Winnie_the_Pooh_Page_16-0.json
Fetching docai-output/OPERATION_NUMBER/5/Winnie_the_Pooh_Page_7-0.json

Introduction
(I₂
F YOU happen to have read another
book about Christopher Robin, you may remember
th
CHAPTER I
IN WHICH We Are Introduced to
Winnie-the-Pooh and Some
Bees, and the Stories Begin
HERE is
10
WINNIE-THE-POOH
"I wonder if you've got such a thing as a balloon
about you?"
"A balloon?"
"Yes, 
12
WINNIE-THE-POOH
and you took your gun with you, just in case, as
you always did, and Winnie-the-P
16
WINNIE-THE-POOH
this song, and one bee sat down on the nose of the
cloud for a moment, and then g
WE ARE INTRODUCED
7
"Oh, help!" said Pooh, as he dropped ten feet on
the branch below him.
"If only

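11. Handle the batch processing response with Document AI Toolbox

Because of its integration with Cloud Storage, batch processing takes quite a few steps to complete. Depending on the size of the input document, the Document output may also be "sharded" into multiple .json files.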
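Sharding has an ordering pitfall when handled by hand. The listing in step 9 shows the shards arriving in lexicographic order (`-1`, `-10`, `-11`, ..., `-2`), so concatenating their text in that order would scramble the book. The sketch below sorts shard names numerically instead. It assumes the trailing `-N.json` number is the shard index, which is inferred from the sample output rather than taken from a specification; the helper names are hypothetical.

```python
import re

# Trailing "-<number>.json", as seen in the sample output (assumed shard index).
_SHARD = re.compile(r"-(\d+)\.json$")


def shard_index(blob_name: str) -> int:
    """Extract the numeric shard index from an output object name."""
    match = _SHARD.search(blob_name)
    if not match:
        raise ValueError(f"No shard index in {blob_name!r}")
    return int(match.group(1))


def in_shard_order(blob_names) -> list:
    """Return the names sorted by numeric shard index."""
    return sorted(blob_names, key=shard_index)


if __name__ == "__main__":
    listed = [
        "docai-output/123/0/Winnie_the_Pooh-0.json",
        "docai-output/123/0/Winnie_the_Pooh-1.json",
        "docai-output/123/0/Winnie_the_Pooh-10.json",
        "docai-output/123/0/Winnie_the_Pooh-2.json",
    ]
    for name in in_shard_order(listed):
        print(name)
```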

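The Document AI Toolbox Python SDK was created to simplify post-processing and other common tasks with Document AI. The library is intended to supplement the Document AI client library, not replace it. For the full specification, see the reference documentation.

This step shows how to make a batch processing request and retrieve the output with Document AI Toolbox.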

batch_processing_toolbox.py

"""
Makes a Batch Processing Request to Document AI using Document AI Toolbox
"""

from google.api_core.client_options import ClientOptions
from google.cloud import documentai
from google.cloud import documentai_toolbox

# TODO(developer): Fill these variables before running the sample.
project_id = "YOUR_PROJECT_ID"
location = "YOUR_PROCESSOR_LOCATION"  # Format is "us" or "eu"
processor_id = "YOUR_PROCESSOR_ID"  # Create processor before running sample
gcs_output_uri = "YOUR_OUTPUT_URI"  # Must end with a trailing slash `/`. Format: gs://bucket/directory/subdirectory/
processor_version_id = (
    "YOUR_PROCESSOR_VERSION_ID"  # Optional. Example: pretrained-ocr-v1.0-2020-09-23
)

# TODO(developer): If `gcs_input_uri` is a single file, `mime_type` must be specified.
gcs_input_uri = "YOUR_INPUT_URI"  # Format: `gs://bucket/directory/file.pdf` or `gs://bucket/directory/`
input_mime_type = "application/pdf"
field_mask = "text,entities,pages.pageNumber"  # Optional. The fields to return in the Document object.


def batch_process_toolbox(
    project_id: str,
    location: str,
    processor_id: str,
    gcs_input_uri: str,
    gcs_output_uri: str,
    processor_version_id: str = None,
    input_mime_type: str = None,
    field_mask: str = None,
):
    # You must set the api_endpoint if you use a location other than "us".
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    if not gcs_input_uri.endswith("/") and "." in gcs_input_uri:
        # Specify specific GCS URIs to process individual documents
        gcs_document = documentai.GcsDocument(
            gcs_uri=gcs_input_uri, mime_type=input_mime_type
        )
        # Load GCS Input URI into a List of document files
        gcs_documents = documentai.GcsDocuments(documents=[gcs_document])
        input_config = documentai.BatchDocumentsInputConfig(gcs_documents=gcs_documents)
    else:
        # Specify a GCS URI Prefix to process an entire directory
        gcs_prefix = documentai.GcsPrefix(gcs_uri_prefix=gcs_input_uri)
        input_config = documentai.BatchDocumentsInputConfig(gcs_prefix=gcs_prefix)

    # Cloud Storage URI for the Output Directory
    gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
        gcs_uri=gcs_output_uri, field_mask=field_mask
    )

    # Where to write results
    output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)

    if processor_version_id:
        # The full resource name of the processor version, e.g.:
        # projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}
        name = client.processor_version_path(
            project_id, location, processor_id, processor_version_id
        )
    else:
        # The full resource name of the processor, e.g.:
        # projects/{project_id}/locations/{location}/processors/{processor_id}
        name = client.processor_path(project_id, location, processor_id)

    request = documentai.BatchProcessRequest(
        name=name,
        input_documents=input_config,
        document_output_config=output_config,
    )

    # BatchProcess returns a Long Running Operation (LRO)
    operation = client.batch_process_documents(request)

    # Operation Name Format: projects/{project_id}/locations/{location}/operations/{operation_id}
    documents = documentai_toolbox.document.Document.from_batch_process_operation(
        location=location, operation_name=operation.operation.name
    )

    for document in documents:
        # Read the text recognition output from the processor
        print("The document contains the following text:")
        # Truncated at 100 characters for brevity
        print(document.text[:100])


if __name__ == "__main__":
    batch_process_toolbox(
        project_id=project_id,
        location=location,
        processor_id=processor_id,
        gcs_input_uri=gcs_input_uri,
        gcs_output_uri=gcs_output_uri,
        input_mime_type=input_mime_type,
        field_mask=field_mask,
    )

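12. Congratulations

You have used Document AI to extract text from a novel with online processing, batch processing, and Document AI Toolbox.

We encourage you to experiment with other documents and to explore the other processors available on the platform.

Cleanup

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:

In the Cloud Console, go to the [Manage resources] page.
In the project list, select your project, then click [Delete].
In the dialog, type the project ID, then click [Shut down] to delete the project.

Learn more

Continue learning about Document AI with the follow-on codelabs.

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.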
