Optical Character Recognition (OCR) with Document AI (Python)

1. Overview

Lab Intro

What is Document AI?

Document AI is a document understanding solution that takes unstructured data (e.g. documents, emails, invoices, forms, etc.) and makes the data easier to understand, analyze, and consume. The API provides structure through content classification, entity extraction, advanced searching, and more.

In this lab, you will learn how to perform Optical Character Recognition using the Document AI API with Python.

We will utilize a PDF file of the classic novel "Winnie the Pooh" by A.A. Milne, which has recently become part of the Public Domain in the United States. This file was scanned and digitized by Google Books.

What you'll learn

  • How to enable the Document AI API
  • How to authenticate API requests
  • How to install the client library for Python
  • How to use the online and batch processing APIs
  • How to parse text from a PDF file

What you'll need

  • A Google Cloud Project
  • A Browser, such as Chrome or Firefox
  • Familiarity using Python (3.9+)

Survey

How will you use this tutorial?

Read it through only Read it and complete the exercises

How would you rate your experience with Python?

Novice Intermediate Proficient

How would you rate your experience with using Google Cloud services?

Novice Intermediate Proficient

2. Setup and Requirements

Self-paced environment setup

  1. Sign in to Cloud Console and create a new project or reuse an existing one. (If you don't already have a Gmail or Google Workspace account, you must create one.)

Select Project

New Project

Get Project ID

Remember the Project ID, a unique name across all Google Cloud projects. (The Project ID above has already been taken and will not work for you, sorry!). You must provide this ID later on as PROJECT_ID.

  1. Next, you must enable billing in Cloud Console in order to use Google Cloud resources.

Be sure to to follow any instructions in the "Cleaning up" section. The section advises you how to shut down resources so you don't incur billing beyond this tutorial. New users of Google Cloud are eligible for the $300USD Free Trial program.

Start Cloud Shell

While Google Cloud you can operate Google Cloud remotely from your laptop, this codelab uses Google Cloud Shell, a command line environment running in the Cloud.

Activate Cloud Shell

  1. From the Cloud Console, click Activate Cloud Shell Activate Cloud Shell

Activate Cloud Shell

If you've never started Cloud Shell before, you are presented with an intermediate screen (below the fold) describing what it is. If that's the case, click Continue (and you won't ever see it again). Here's what that one-time screen looks like:

Cloud Shell Intro

It should only take a few moments to provision and connect to Cloud Shell. Cloud Shell

Cloud Shell provides you with terminal access to a virtual machine hosted in the cloud. The virtual machine includes all the development tools that you'll need. It offers a persistent 5GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with simply a browser.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your Project ID.

  1. Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*      <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`
gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If it is not, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

3. Enable the Document AI API

Before you can begin using Document AI, you must enable the API. Open the Cloud Console in your browser.

  1. Using the Search Bar at the top of the console, search for "Document AI API", then click Enable to use the API in your Google Cloud project

Search API

  1. Repeat the previous step for the Google Cloud Storage API.
  2. Alternatively, the APIs can be enabled using the following gcloud commands.
gcloud services enable documentai.googleapis.com
gcloud services enable storage.googleapis.com

You should see something like this:

Operation "operations/..." finished successfully.

Now, you can use Document AI!

4. Create and Test a Processor

You must first create an instance of the Document OCR processor that will perform the extraction. This can be completed using the Cloud Console or the Processor Management API.

Cloud Console

  1. In the console, navigate to the Document AI Platform OverviewDocument AI Overview Console
  2. Click Create Processor and select Document OCRProcessors
  3. Give it the name codelab-ocr (Or something else you'll remember) and select the closest region on the list.
  4. Click Create to create your processor
  5. Copy your Processor ID. You must use this in your code later. Processor ID

You can test out your processor in the console by uploading a document. Click Upload Test Document and select a document to parse.

You can download the PDF file below, which contains the first 3 pages of our novel.

Title Page

Your output should look this: Parsed Book

Python Client Library

Follow this codelab to learn how to manage Document AI processors with the Python Client Library:

Managing Document AI processors with Python - Codelab

5. Authenticate API requests

In order to make requests to the Document AI API, you must use a Service Account. A Service Account belongs to your project and it is used by the Python Client library to make API requests. Like any other user account, a service account is represented by an email address. In this section, you will use the Cloud SDK to create a service account and then create credentials you need to authenticate as the service account.

First, open Cloud Shell and set an environment variable with your PROJECT_ID which you will use throughout this codelab:

export GOOGLE_CLOUD_PROJECT=$(gcloud config get-value core/project)

Next, create a new service account to access the Document AI API by using:

gcloud iam service-accounts create my-docai-sa \
  --display-name "my-docai-service-account"

Next, give your service account permissions to access Document AI and Cloud Storage in your project.

gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} \
    --member="serviceAccount:my-docai-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com" \
    --role="roles/documentai.admin"

gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} \
    --member="serviceAccount:my-docai-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com" \
    --role="roles/storage.admin"

Next, create credentials that your Python code uses to login as your new service account. Create these credentials and save it as a JSON file ~/key.json by using the following command:

gcloud iam service-accounts keys create ~/key.json \
  --iam-account  my-docai-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com

Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which is used by the library to find your credentials. To read more about this form of authentication, see the guide. The environment variable should be set to the full path of the credentials JSON file you created, by using:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"

6. Install the client library

Install the Python client libraries for Document AI and Cloud Storage:

pip3 install --upgrade google-cloud-documentai
pip3 install --upgrade google-cloud-storage

You should see something like this:

...
Installing collected packages: google-cloud-documentai
Successfully installed google-cloud-documentai-1.4.1
.
.
Installing collected packages: google-cloud-storage
Successfully installed google-cloud-storage-1.43.0

Now, you're ready to use the Document AI API!

7. Download the Sample PDF

We have a sample document which contains the first 3 pages of the novel.

You can download the PDF using the following link. Then upload it to the cloudshell instance.

You can also download it from our public Google Cloud Storage Bucket using gsutil.

gsutil cp gs://cloud-samples-data/documentai/codelabs/ocr/Winnie_the_Pooh_3_Pages.pdf .

8. Make an Online Processing Request

In this step, you'll process the first 3 pages of the novel using the online processing (synchronous) API. This method is best suited for smaller documents that are stored locally. Check out the full processor list for the maximum pages and file size for each processor type.

Create a file called online_processing.py and use the code below.

Replace YOUR_PROJECT_ID, YOUR_PROJECT_LOCATION, YOUR_PROCESSOR_ID, and the FILE_PATH with appropriate values for your environment.

online_processing.py

from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1 as documentai


PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_PROJECT_LOCATION"  # Format is 'us' or 'eu'
PROCESSOR_ID = "YOUR_PROCESSOR_ID"  # Create processor in Cloud Console

# The local file in your current working directory
FILE_PATH = "Winnie_the_Pooh_3_Pages.pdf"
# Refer to https://cloud.google.com/document-ai/docs/processors-list
# for supported file types
MIME_TYPE = "application/pdf"

# Instantiates a client
docai_client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{LOCATION}-documentai.googleapis.com")
)

# The full resource name of the processor, e.g.:
# projects/project-id/locations/location/processor/processor-id
# You must create new processors in the Cloud Console first
RESOURCE_NAME = docai_client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

# Read the file into memory
with open(FILE_PATH, "rb") as image:
    image_content = image.read()

# Load Binary Data into Document AI RawDocument Object
raw_document = documentai.RawDocument(content=image_content, mime_type=MIME_TYPE)

# Configure the process request
request = documentai.ProcessRequest(name=RESOURCE_NAME, raw_document=raw_document)

# Use the Document AI client to process the sample form
result = docai_client.process_document(request=request)

document_object = result.document
print("Document processing complete.")
print(f"Text: {document_object.text}")

Run the code, which will extract the text and print it to the console.

You should see the following output if using our sample document:

Document processing complete.
Text: CHAPTER I
IN WHICH We Are Introduced to
Winnie-the-Pooh and Some
Bees, and the Stories Begin
Here is Edward Bear, coming
downstairs now, bump, bump, bump, on the back
of his head, behind Christopher Robin. It is, as far
as he knows, the only way of coming downstairs,
but sometimes he feels that there really is another
way, if only he could stop bumping for a moment
and think of it. And then he feels that perhaps there
isn't. Anyhow, here he is at the bottom, and ready
to be introduced to you. Winnie-the-Pooh.
When I first heard his name, I said, just as you
are going to say, "But I thought he was a boy?"
"So did I," said Christopher Robin.
"Then you can't call him Winnie?"
"I don't."
"But you said "

...

Digitized by
Google

9. Make a Batch Processing Request

Now, suppose that you want to read in the text from the entire novel.

  • Online Processing has limits on the number of pages and file size that can be sent and it only allows for one document file per API call.
  • Batch Processing allows for processing of larger/multiple files in an asynchronous method.

In this step, we will process the entire "Winnie the Pooh" novel with the Document AI Batch Processing API and output the text into a Google Cloud Storage Bucket.

Batch processing uses Long Running Operations to manage requests in an asynchronous manner, so we have to make the request and retrieve the output in a different manner than online processing. However, the output will be in the same Document object format whether using online or batch processing.

This step shows how to provide specific documents for Document AI to process. A later step will show how to process an entire directory of documents.

Upload PDF to Cloud Storage

The batch_process_documents() method currently accepts files from Google Cloud Storage. You can reference documentai_v1.types.BatchProcessRequest for more information on the object structure.

For this example, you can read the file directly from our sample bucket.

You can also copy the file into your own bucket using gsutil...

gsutil cp gs://cloud-samples-data/documentai/codelabs/ocr/Winnie_the_Pooh.pdf gs://YOUR_BUCKET_NAME/

...or you can download the sample file of the novel from the link below and upload it to your own bucket.

You will also need a GCS Bucket to store the output of the API.

You can follow the Cloud Storage Documentation to learn how to create storage buckets.

Using the batch_process_documents() method

Create a file called batch_processing.py and use the code below.

Replace the PROJECT_ID, LOCATION, PROCESSOR_ID, GCS_INPUT_URI and GCS_OUTPUT_URI with the appropriate values for your environment.

batch_processing.py

import re
from typing import List

from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1 as documentai
from google.cloud import storage


PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_PROJECT_LOCATION"  # Format is 'us' or 'eu'
PROCESSOR_ID = "YOUR_PROCESSOR_ID"  # Create processor in Cloud Console

# Format 'gs://input_bucket/directory/file.pdf'
GCS_INPUT_URI = "gs://cloud-samples-data/documentai/codelabs/ocr/Winnie_the_Pooh.pdf"
INPUT_MIME_TYPE = "application/pdf"

# Format 'gs://output_bucket/directory'
GCS_OUTPUT_URI = "YOUR_OUTPUT_BUCKET_URI"

# Instantiates a client
docai_client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{LOCATION}-documentai.googleapis.com")
)

# The full resource name of the processor, e.g.:
# projects/project-id/locations/location/processor/processor-id
# You must create new processors in the Cloud Console first
RESOURCE_NAME = docai_client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

# Cloud Storage URI for the Input Document
input_document = documentai.GcsDocument(
    gcs_uri=GCS_INPUT_URI, mime_type=INPUT_MIME_TYPE
)

# Load GCS Input URI into a List of document files
input_config = documentai.BatchDocumentsInputConfig(
    gcs_documents=documentai.GcsDocuments(documents=[input_document])
)

# Cloud Storage URI for Output directory
gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
    gcs_uri=GCS_OUTPUT_URI
)

# Load GCS Output URI into OutputConfig object
output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)

# Configure Process Request
request = documentai.BatchProcessRequest(
    name=RESOURCE_NAME,
    input_documents=input_config,
    document_output_config=output_config,
)

# Batch Process returns a Long Running Operation (LRO)
operation = docai_client.batch_process_documents(request)

# Continually polls the operation until it is complete.
# This could take some time for larger files
# Format: projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_ID
print(f"Waiting for operation {operation.operation.name} to complete...")
operation.result()

# NOTE: Can also use callbacks for asynchronous processing
#
# def my_callback(future):
#   result = future.result()
#
# operation.add_done_callback(my_callback)

print("Document processing complete.")

# Once the operation is complete,
# get output document information from operation metadata
metadata = documentai.BatchProcessMetadata(operation.metadata)

if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:
    raise ValueError(f"Batch Process Failed: {metadata.state_message}")

documents: List[documentai.Document] = []

# Storage Client to retrieve the output files from GCS
storage_client = storage.Client()

# One process per Input Document
for process in metadata.individual_process_statuses:

    # output_gcs_destination format: gs://BUCKET/PREFIX/OPERATION_NUMBER/0
    # The GCS API requires the bucket name and URI prefix separately
    output_bucket, output_prefix = re.match(
        r"gs://(.*?)/(.*)", process.output_gcs_destination
    ).groups()

    # Get List of Document Objects from the Output Bucket
    output_blobs = storage_client.list_blobs(output_bucket, prefix=output_prefix)

    # DocAI may output multiple JSON files per source file
    for blob in output_blobs:
        # Document AI should only output JSON files to GCS
        if ".json" not in blob.name:
            print(f"Skipping non-supported file type {blob.name}")
            continue

        print(f"Fetching {blob.name}")

        # Download JSON File and Convert to Document Object
        document = documentai.Document.from_json(
            blob.download_as_bytes(), ignore_unknown_fields=True
        )

        documents.append(document)

# Print Text from all documents
# Truncated at 100 characters for brevity
for document in documents:
    print(document.text[:100])

Run the code, and you should see the full novel text extracted and printed in your console.

This may take some time to complete as the file is much larger than the previous example. (Oh, bother...)

However, with the Batch Processing API, you will receive an Operation ID which can be used to get the output from GCS once the task is completed.

Your output should look something like this:

Waiting for operation projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_NUMBER to complete...
Document processing complete.
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-0.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-1.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-10.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-11.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-12.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-13.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-14.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-15.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-16.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-17.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-18.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-2.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-3.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-4.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-5.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-6.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-7.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-8.json
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh-9.json

This is a reproduction of a library book that was digitized
by Google as part of an ongoing effort to preserve the
information in books and make it universally accessible.
TM
Google books
https://books.google.com

.....

He nodded and went
out ... and in a moment
I heard Winnie-the-Pooh
-bump, bump, bump-go-ing up the stairs behind
him.
Digitized by
Google

10. Make a Batch Processing Request for a directory

Sometimes, you may want to process an entire directory of documents, without listing each document individually. The batch_process_documents() method supports input of a list of specific documents or a directory path.

This step will show how to process a full directory of document files. Most of the code is the same as the previous step, the only difference is the GCS URI sent with the BatchProcessRequest.

We have a directory in our sample bucket that contains multiple pages of the novel in separate files.

  • gs://cloud-samples-data/documentai/codelabs/ocr/multi-document

You can read the files directly or copy them into your own Cloud Storage bucket.

Create a file called batch_processing_directory.py and use the code below.

Replace the PROJECT_ID, LOCATION, PROCESSOR_ID, GCS_INPUT_PREFIX and GCS_OUTPUT_URI with the appropriate values for your environment.

batch_processing_directory.py

import re
from typing import List

from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1 as documentai
from google.cloud import storage


PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_PROJECT_LOCATION"  # Format is 'us' or 'eu'
PROCESSOR_ID = "YOUR_PROCESSOR_ID"  # Create processor in Cloud Console

# Format 'gs://input_bucket/directory'
GCS_INPUT_PREFIX = "gs://cloud-samples-data/documentai/codelabs/ocr/multi-document"

# Format 'gs://output_bucket/directory'
GCS_OUTPUT_URI = "YOUR_OUTPUT_BUCKET_URI"

# Instantiates a client
docai_client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{LOCATION}-documentai.googleapis.com")
)

# The full resource name of the processor, e.g.:
# projects/project-id/locations/location/processor/processor-id
# You must create new processors in the Cloud Console first
RESOURCE_NAME = docai_client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

# Cloud Storage URI for the Input Directory
gcs_prefix = documentai.GcsPrefix(gcs_uri_prefix=GCS_INPUT_PREFIX)

# Load GCS Input URI into Batch Input Config
input_config = documentai.BatchDocumentsInputConfig(gcs_prefix=gcs_prefix)

# Cloud Storage URI for Output directory
gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
    gcs_uri=GCS_OUTPUT_URI
)

# Load GCS Output URI into OutputConfig object
output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)

# Configure Process Request
request = documentai.BatchProcessRequest(
    name=RESOURCE_NAME,
    input_documents=input_config,
    document_output_config=output_config,
)

# Batch Process returns a Long Running Operation (LRO)
operation = docai_client.batch_process_documents(request)

# Continually polls the operation until it is complete.
# This could take some time for larger files
# Format: projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_ID
print(f"Waiting for operation {operation.operation.name} to complete...")
operation.result()

# NOTE: Can also use callbacks for asynchronous processing
#
# def my_callback(future):
#   result = future.result()
#
# operation.add_done_callback(my_callback)

print("Document processing complete.")

# Once the operation is complete,
# get output document information from operation metadata
metadata = documentai.BatchProcessMetadata(operation.metadata)

if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:
    raise ValueError(f"Batch Process Failed: {metadata.state_message}")

documents: List[documentai.Document] = []

# Storage Client to retrieve the output files from GCS
storage_client = storage.Client()

# One process per Input Document
for process in metadata.individual_process_statuses:

    # output_gcs_destination format: gs://BUCKET/PREFIX/OPERATION_NUMBER/0
    # The GCS API requires the bucket name and URI prefix separately
    output_bucket, output_prefix = re.match(
        r"gs://(.*?)/(.*)", process.output_gcs_destination
    ).groups()

    # Get List of Document Objects from the Output Bucket
    output_blobs = storage_client.list_blobs(output_bucket, prefix=output_prefix)

    # DocAI may output multiple JSON files per source file
    for blob in output_blobs:
        # Document AI should only output JSON files to GCS
        if ".json" not in blob.name:
            print(f"Skipping non-supported file type {blob.name}")
            continue

        print(f"Fetching {blob.name}")

        # Download JSON File and Convert to Document Object
        document = documentai.Document.from_json(
            blob.download_as_bytes(), ignore_unknown_fields=True
        )

        documents.append(document)

# Print Text from all documents
# Truncated at 100 characters for brevity
for document in documents:
    print(document.text[:100])

Run the code, and you should see the extracted text from all of the document files in the Cloud Storage directory.

Your output should look something like this:

Waiting for operation projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_NUMBER to complete...
Document processing complete.
Fetching docai-output/OPERATION_NUMBER/0/Winnie_the_Pooh_Page_0-0.json
Fetching docai-output/OPERATION_NUMBER/1/Winnie_the_Pooh_Page_1-0.json
Fetching docai-output/OPERATION_NUMBER/2/Winnie_the_Pooh_Page_10-0.json
Fetching docai-output/OPERATION_NUMBER/3/Winnie_the_Pooh_Page_12-0.json
Fetching docai-output/OPERATION_NUMBER/4/Winnie_the_Pooh_Page_16-0.json
Fetching docai-output/OPERATION_NUMBER/5/Winnie_the_Pooh_Page_7-0.json

Introduction
(I₂
F YOU happen to have read another
book about Christopher Robin, you may remember
th
CHAPTER I
IN WHICH We Are Introduced to
Winnie-the-Pooh and Some
Bees, and the Stories Begin
HERE is
10
WINNIE-THE-POOH
"I wonder if you've got such a thing as a balloon
about you?"
"A balloon?"
"Yes, 
12
WINNIE-THE-POOH
and you took your gun with you, just in case, as
you always did, and Winnie-the-P
16
WINNIE-THE-POOH
this song, and one bee sat down on the nose of the
cloud for a moment, and then g
WE ARE INTRODUCED
7
"Oh, help!" said Pooh, as he dropped ten feet on
the branch below him.
"If only 

11. Congratulations

You've successfully used Document AI to extract text from a novel using the Online Processing and Batch Processing APIs.

We encourage you to experiment with other documents and explore the other processors available on the platform.

Clean Up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:

  • In the Cloud Console, go to the Manage resources page.
  • In the project list, select your project then click Delete.
  • In the dialog, type the project ID and then click Shut down to delete the project.

Learn More

Continue learning about Document AI with these follow-up Codelabs.

Resources

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.