Managing Document AI processors with Python

What is Document AI?

Document AI is a platform that lets you extract insights from your documents. At its heart, it offers a growing list of document processors (also called parsers or splitters, depending on their functionality).

There are two ways you can manage Document AI processors:

  • manually, from the web console;
  • programmatically, using the Document AI API.

Here is an example screenshot showing your processor list, both from the web console and from Python code:

[Screenshot: the processor list, as shown in the web console and as listed from Python code]

In this lab, you will focus on managing Document AI processors programmatically with the Python client library.

What you'll learn

  • How to enable Document AI
  • How to set up your project
  • How to set up Python
  • How to list processor types
  • How to create processors
  • How to list processors
  • How to analyze a document
  • How to enable/disable processors
  • How to delete processors

What you'll need

  • A Google Cloud project
  • A browser, such as Chrome or Firefox
  • Familiarity using Python (3.7+)

Survey

How will you use this tutorial?

  • Read it through only
  • Read it and complete the exercises

How would you rate your experience with Python?

  • Novice
  • Intermediate
  • Proficient

How would you rate your experience with using Google Cloud services?

  • Novice
  • Intermediate
  • Proficient

Self-paced environment setup

  1. Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.


  • The Project name is the display name for this project's participants. It is a character string not used by Google APIs, and you can update it at any time.
  • The Project ID must be unique across all Google Cloud projects and is immutable (it cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't need to care what it is. In most codelabs, you'll need to reference the Project ID (typically identified as PROJECT_ID). If you don't like the generated ID, you can generate another random one, or try your own and see if it's available. It is then "frozen" once the project is created.
  • There is a third value, a Project Number which some APIs use. Learn more about all three of these values in the documentation.
  2. Next, you'll need to enable billing in the Cloud Console in order to use Cloud resources/APIs. Running through this codelab shouldn't cost much, if anything at all. To shut down resources so you don't incur billing beyond this tutorial, follow any "clean-up" instructions found at the end of the codelab. New users of Google Cloud are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this lab you will be using Cloud Shell, a command line environment running in the Cloud.

Activate Cloud Shell

  1. From the Cloud Console, click Activate Cloud Shell.


If you've never started Cloud Shell before, you're presented with an intermediate screen (below the fold) describing what it is. If that's the case, click Continue (and you won't ever see it again). Here's what that one-time screen looks like:

[Screenshot: one-time Cloud Shell introduction screen]

It should only take a few moments to provision and connect to Cloud Shell.


This virtual machine is loaded with all the development tools you need. It offers a persistent 5GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with simply a browser or your Chromebook.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your project ID.

  2. Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`
  3. Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If the project is not set, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

Before you can begin using Document AI, run the following command in Cloud Shell to enable the Document AI API:

gcloud services enable documentai.googleapis.com

You should see something like this:

Operation "operations/..." finished successfully.

Now, you can use Document AI!

First, create and navigate to your working directory:

mkdir ~/documentai-processors
cd ~/documentai-processors

Then, set the PROJECT_ID and GOOGLE_APPLICATION_CREDENTIALS environment variables:

export PROJECT_ID=$(gcloud config get-value core/project)
export GOOGLE_APPLICATION_CREDENTIALS=~/documentai-processors/key.json

For your code to make authorized API calls from your current shell session, a best practice is to use a service account. A service account is a special kind of account, owned by your project, and identified by its email address (like any other user account).

Create a service account:

SERVICE_ACCOUNT_NAME="documentai-processors-codelab"
SERVICE_ACCOUNT="$SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com"

gcloud iam service-accounts create $SERVICE_ACCOUNT_NAME

Grant it the "Document AI Editor" role (so your code can manage processors):

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member "serviceAccount:$SERVICE_ACCOUNT" \
  --role "roles/documentai.editor"

Then, create and download the service account key (a JSON file):

gcloud iam service-accounts keys create $GOOGLE_APPLICATION_CREDENTIALS \
  --iam-account $SERVICE_ACCOUNT

Verify that the GOOGLE_APPLICATION_CREDENTIALS environment variable (used by the Document AI client library) points to the full path of the JSON file:

cat $GOOGLE_APPLICATION_CREDENTIALS

You should see the content of the service account key:

{
  "type": "service_account",
  ...
}

Your project and credentials are configured! Next, set up your Python environment...

Create a Python virtual environment to isolate dependencies:

virtualenv venv

Activate the virtual environment:

source venv/bin/activate

Install IPython, the Document AI client library, and python-tabulate (which you'll use to pretty-print the request results):

pip install ipython google-cloud-documentai tabulate

You should see something like this:

...
Installing collected packages: ..., tabulate, ipython, google-cloud-documentai
Successfully installed ... cloud-documentai-1.0.0 ...

Now, you're ready to use the Document AI client library!

In the next steps, you'll use an interactive Python interpreter called IPython, which you installed in the previous step. Start a session by running ipython in Cloud Shell:

ipython

You should see something like this:

Python 3.7.3 (default, Jan 22 2021, 20:04:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.27.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

Copy the following code into your IPython session:

import os
from typing import Iterator, Optional, Sequence, Tuple

import google.cloud.documentai_v1beta3 as documentai
from tabulate import tabulate

PROJECT_ID = os.getenv("PROJECT_ID", "")
API_LOCATION = "eu"  # Choose "us" or "eu"

assert PROJECT_ID, "PROJECT_ID is undefined"
assert API_LOCATION in ("us", "eu"), "API_LOCATION is incorrect"

# Test processors
document_ocr_display_name = "document-ocr"
form_parser_display_name = "form-parser"

test_processor_display_names_and_types = (
    (document_ocr_display_name, "OCR_PROCESSOR"),
    (form_parser_display_name, "FORM_PARSER_PROCESSOR"),
)


def get_client() -> documentai.DocumentProcessorServiceClient:
    client_options = dict(api_endpoint=f"{API_LOCATION}-documentai.googleapis.com")
    return documentai.DocumentProcessorServiceClient(client_options=client_options)


def get_parent() -> str:
    return f"projects/{PROJECT_ID}/locations/{API_LOCATION}"
    

You're ready to make your first request and list the processor types...

Before creating a processor, you need to choose its type among the growing list of processor types. You can retrieve this list with fetch_processor_types.

Add the following functions into your IPython session:

def fetch_processor_types() -> Sequence[documentai.ProcessorType]:
    client = get_client()
    response = client.fetch_processor_types(parent=get_parent())
    return response.processor_types


def print_processor_types(processor_types: Sequence[documentai.ProcessorType]):
    def sort_key(pt):
        return (not pt.allow_creation, pt.category, pt.type_)

    sorted_processor_types = sorted(processor_types, key=sort_key)
    data = processor_type_tabular_data(sorted_processor_types)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processor types: {len(sorted_processor_types)}")


def processor_type_tabular_data(
    processor_types: Sequence[documentai.ProcessorType],
) -> Iterator[Tuple[str, str, str, str]]:
    def locations(pt):
        return ", ".join(sorted(loc.location_id for loc in pt.available_locations))

    yield ("type", "category", "allow_creation", "locations")
    yield ("left", "left", "left", "left")
    if not processor_types:
        yield ("-", "-", "-", "-")
        return
    for pt in processor_types:
        yield (pt.type_, pt.category, f"{pt.allow_creation}", locations(pt))
    

List the processor types:

processor_types = fetch_processor_types()
print_processor_types(processor_types)

You should see something like this:

+--------------------------+-------------+----------------+-----------+
| type                     | category    | allow_creation | locations |
+--------------------------+-------------+----------------+-----------+
| FORM_PARSER_PROCESSOR    | GENERAL     | True           | eu, us    |
| OCR_PROCESSOR            | GENERAL     | True           | eu, us    |
...
| EXPENSE_PROCESSOR        | SPECIALIZED | False          | eu, us    |
...
| FR_NATIONAL_ID_PROCESSOR | SPECIALIZED | False          | eu, us    |
| INVOICE_PROCESSOR        | SPECIALIZED | False          | eu, us    |
...
| US_PASSPORT_PROCESSOR    | SPECIALIZED | False          | eu, us    |
+--------------------------+-------------+----------------+-----------+
→ Processor types: 28
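
The allow_creation column tells you which processor types you can create in your project. As an optional aside, here is a small snippet, reusing the attributes shown above, to keep only those types:

creatable_types = [pt.type_ for pt in processor_types if pt.allow_creation]
print(creatable_types)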

Now, you have all the info needed to create processors...

To create a processor, call create_processor with a display name and a processor type.

Add the following function:

def create_processor(display_name: str, type: str) -> documentai.Processor:
    client = get_client()
    processor_info = documentai.Processor(display_name=display_name, type_=type)
    return client.create_processor(parent=get_parent(), processor=processor_info)
    

Create the test processors:

for display_name, type in test_processor_display_names_and_types:
    print(f"Creating {display_name} ({type})...")
    create_processor(display_name, type)
print("Done")

You should get the following:

Creating document-ocr (OCR_PROCESSOR)...
Creating form-parser (FORM_PARSER_PROCESSOR)...
Done

You have created new processors!

Next, see how to list the processors...

list_processors returns the list of all the processors belonging to your project.

Add the following functions:

def list_processors() -> Sequence[documentai.Processor]:
    client = get_client()
    response = client.list_processors(parent=get_parent())
    return response.processors


def print_processors(processors: Optional[Sequence[documentai.Processor]] = None):
    def sort_key(processor):
        return processor.display_name

    if processors is None:
        processors = list_processors()
    sorted_processors = sorted(processors, key=sort_key)
    data = processor_tabular_data(sorted_processors)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processors: {len(sorted_processors)}")


def processor_tabular_data(
    processors: Sequence[documentai.Processor],
) -> Iterator[Tuple[str, str, str]]:
    yield ("display_name", "type", "state")
    yield ("left", "left", "left")
    if not processors:
        yield ("-", "-", "-")
        return
    for processor in processors:
        yield (processor.display_name, processor.type_, processor.state.name)
    

Call the functions:

processors = list_processors()
print_processors(processors)

You should get the following:

+--------------+-----------------------+---------+
| display_name | type                  | state   |
+--------------+-----------------------+---------+
| document-ocr | OCR_PROCESSOR         | ENABLED |
| form-parser  | FORM_PARSER_PROCESSOR | ENABLED |
+--------------+-----------------------+---------+
→ Processors: 2

To retrieve a processor by its display name, add the following function:

def get_processor(
    display_name: str, processors: Optional[Sequence[documentai.Processor]] = None
) -> Optional[documentai.Processor]:
    if processors is None:
        processors = list_processors()
    for processor in processors:
        if processor.display_name == display_name:
            return processor
    return None
    

Test the function:

processor = get_processor(document_ocr_display_name, processors)

assert processor is not None
print(processor)

You should see something like this:

name: "projects/PROJECT_NUM/locations/LOCATION/processors/PROCESSOR_ID"
type_: "OCR_PROCESSOR"
display_name: "document-ocr"
state: ENABLED
...
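
Note that the last segment of the name path is the processor ID. If you ever need it on its own (for example, to build a resource name manually), you can extract it like this:

processor_id = processor.name.split("/")[-1]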

Now, you know how to list your processors and retrieve one of them from its display name. Next, see how to use a processor...

Documents can be processed in two ways:

  • synchronously with process_document, to analyze a single document and directly use the results;
  • asynchronously with batch_process_documents, to launch batch processing of multiple or large documents (a minimal sketch follows below).
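
This codelab uses the synchronous method. For reference only, here is a minimal sketch of what a batch request could look like, assuming a recent version of the client library and input/output locations in Cloud Storage buckets you own (the function name and gcs_* parameters below are placeholders):

def batch_process_file(
    processor: documentai.Processor,
    gcs_input_uri: str,
    input_mime_type: str,
    gcs_output_uri: str,
):
    client = get_client()

    # Reference the input document and the output prefix in Cloud Storage
    input_config = documentai.BatchDocumentsInputConfig(
        gcs_documents=documentai.GcsDocuments(
            documents=[
                documentai.GcsDocument(gcs_uri=gcs_input_uri, mime_type=input_mime_type)
            ]
        )
    )
    output_config = documentai.DocumentOutputConfig(
        gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
            gcs_uri=gcs_output_uri
        )
    )
    request = documentai.BatchProcessRequest(
        name=processor.name,
        input_documents=input_config,
        document_output_config=output_config,
    )

    # batch_process_documents returns a long-running operation;
    # results are written as Document JSON files under gcs_output_uri
    operation = client.batch_process_documents(request)
    operation.result()

The rest of this codelab focuses on the synchronous path.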

Your test document (PDF) is the scan of a questionnaire filled out with handwritten answers. Download it into your working directory from your IPython session:

!gsutil cp gs://cloud-samples-data/documentai/form.pdf .

Check the content of your working directory:

!ls

You should have the following:

form.pdf  key.json  venv

As you have a single local file to analyze, you can use the synchronous process_document method. Add the following function:

def process_file(
    processor: documentai.Processor, file_path: str, mime_type: str
) -> documentai.Document:
    client = get_client()

    with open(file_path, "rb") as document_file:
        document_content = document_file.read()

    document = documentai.RawDocument(content=document_content, mime_type=mime_type)
    request = documentai.ProcessRequest(raw_document=document, name=processor.name)

    response = client.process_document(request)
    return response.document
    

As your document is a questionnaire, choose the form parser. This general processor detects form fields, in addition to extracting the text (printed and handwritten), which all processors do.

Analyze the document:

processor = get_processor(form_parser_display_name)
assert processor is not None

file_path = "./form.pdf"
mime_type = "application/pdf"

document = process_file(processor, file_path, mime_type)

All processors run an Optical Character Recognition (OCR) first pass on the whole document. Check out the text detected by the OCR pass:

document.text.split("\n")

You should see something like the following:

['FakeDoc M.D.',
 'HEALTH INTAKE FORM',
 'Please fill out the questionnaire carefully. The information you provide will be used to complete',
 'your health profile and will be kept confidential.',
 'Date:',
 'Sally',
 'Walker',
 'Name:',
 '9/14/19',
 'DOB: 09/04/1986',
 'Address: 24 Barney Lane City: Towaco State: NJ Zip: 07082',
 'Email: Sally, waller@cmail.com Phone #: (906) 917-3486',
 'Gender: F',
 'Single Occupation: Software Engineer',
 'Referred By: None',
 'Emergency Contact: Eva Walker Emergency Contact Phone: (906) 334-8976',
 'Marital Status:',
  ...
]

Add the following functions to print out the detected form fields:

def print_form_fields(document: documentai.Document):
    sorted_form_fields = form_fields_sorted_by_ocr_order(document)
    data = form_field_tabular_data(sorted_form_fields, document)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Form fields: {len(sorted_form_fields)}")


def form_field_tabular_data(
    form_fields: Sequence[documentai.Document.Page.FormField],
    document: documentai.Document,
) -> Iterator[Tuple[str, str, str]]:
    yield ("name", "value", "confidence")
    yield ("right", "left", "right")
    if not form_fields:
        yield ("-", "-", "-")
        return
    for form_field in form_fields:
        name = text_from_anchor(form_field.field_name.text_anchor, document)
        value = text_from_anchor(form_field.field_value.text_anchor, document)
        confidence = form_field.field_value.confidence
        yield (name, value, f"{confidence:.1%}")
    

Also add these utility functions:

def form_fields_sorted_by_ocr_order(
    document: documentai.Document,
) -> Sequence[documentai.Document.Page.FormField]:
    def sort_key(form_field):
        # Sort according to the field name detected position
        text_anchor = form_field.field_name.text_anchor
        return text_anchor.text_segments[0].start_index if text_anchor else 0

    form_fields = (
        form_field for page in document.pages for form_field in page.form_fields
    )
    return sorted(form_fields, key=sort_key)


def text_from_anchor(
    text_anchor: documentai.Document.TextAnchor, document: documentai.Document
) -> str:
    text = "".join(
        document.text[segment.start_index : segment.end_index]
        for segment in text_anchor.text_segments
    )
    return text[:-1] if text.endswith("\n") else text
    

Print the detected form fields:

print_form_fields(document)

You should get a printout like the following:

+--------------+-------------------------+------------+
|         name | value                   | confidence |
+--------------+-------------------------+------------+
|        Date: | 9/14/19                 |     100.0% |
|        Name: | Sally                   |      99.7% |
|              | Walker                  |            |
|         DOB: | 09/04/1986              |     100.0% |
|     Address: | 24 Barney Lane          |      99.9% |
|        City: | Towaco                  |      99.8% |
|       State: | NJ                      |      99.7% |
|         Zip: | 07082                   |      99.5% |
|       Email: | Sally, waller@cmail.com |      99.6% |
|     Phone #: | (906) 917-3486          |     100.0% |
|      Gender: | F                       |     100.0% |
|  Occupation: | Software Engineer       |     100.0% |
| Referred By: | None                    |     100.0% |
...
+--------------+-------------------------+------------+
→ Form fields: 17

Check out how the field names and values have been detected (PDF). Here is the top half of the questionnaire:

[Screenshot: top half of the questionnaire, with the detected field names and values]

You have analyzed a form (containing both printed and handwritten text) and detected its fields with high confidence: your pixels have been transformed into structured data!
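
If you want to reuse this structured data in your own code, you could, for example, collect the detected fields into a plain Python dictionary, reusing the helper functions defined above (the exact post-processing is up to you):

# Map each detected field name to its detected value
# (keys and values are the raw detected strings, punctuation included)
form_fields_dict = {
    text_from_anchor(form_field.field_name.text_anchor, document): text_from_anchor(
        form_field.field_value.text_anchor, document
    )
    for page in document.pages
    for form_field in page.form_fields
}

print(form_fields_dict)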

With disable_processor and enable_processor, you can control whether a processor can be used.

Add the following functions:

def update_processor_state(processor: documentai.Processor, enable_processor: bool):
    client = get_client()
    if enable_processor:
        request = documentai.EnableProcessorRequest(name=processor.name)
        operation = client.enable_processor(request)
    else:
        request = documentai.DisableProcessorRequest(name=processor.name)
        operation = client.disable_processor(request)
    operation.result()


def enable_processor(processor: documentai.Processor):
    update_processor_state(processor, True)


def disable_processor(processor: documentai.Processor):
    update_processor_state(processor, False)
    

Disable the form parser processor and check the state of your processors:

processor = get_processor(form_parser_display_name)
assert processor is not None

disable_processor(processor)
print_processors()

You should get the following:

+--------------+-----------------------+----------+
| display_name | type                  | state    |
+--------------+-----------------------+----------+
| document-ocr | OCR_PROCESSOR         | ENABLED  |
| form-parser  | FORM_PARSER_PROCESSOR | DISABLED |
+--------------+-----------------------+----------+
→ Processors: 2

Re-enable the form parser processor:

enable_processor(processor)
print_processors()

You should get the following:

+--------------+-----------------------+---------+
| display_name | type                  | state   |
+--------------+-----------------------+---------+
| document-ocr | OCR_PROCESSOR         | ENABLED |
| form-parser  | FORM_PARSER_PROCESSOR | ENABLED |
+--------------+-----------------------+---------+
→ Processors: 2

And next, the ultimate processor management method (deletion)...

Finally, check out how to use the delete_processor method.

Add the following function:

def delete_processor(processor: documentai.Processor):
    client = get_client()
    operation = client.delete_processor(name=processor.name)
    operation.result()
    

Delete your test processors:

processors_to_delete = [dn for dn, _ in test_processor_display_names_and_types]

print(f"Processors to delete: {len(processors_to_delete)}")
for processor in list_processors():
    if processor.display_name not in processors_to_delete:
        continue
    print(f"  Deleting {processor.display_name}...")
    delete_processor(processor)

print()
print_processors()

You should get the following:

Processors to delete: 2
  Deleting form-parser...
  Deleting document-ocr...

+--------------+------+-------+
| display_name | type | state |
+--------------+------+-------+
| -            | -    | -     |
+--------------+------+-------+
→ Processors: 0

You've covered all the processor management methods! You're almost done...

You learned how to manage Document AI processors using Python!

Clean up

To clean up your development environment, from Cloud Shell:

  • If you're still in your IPython session, enter the exit command to go back to the shell.
  • Stop using the Python virtual environment with the deactivate command.
  • Delete your working directory: cd ~ ; rm -rf ~/documentai-processors/

To delete your Google Cloud project, from Cloud Shell:

  • Retrieve your current project ID: PROJECT_ID=$(gcloud config get-value core/project)
  • Make sure this is the project you wish to delete: echo $PROJECT_ID
  • Delete the project: gcloud projects delete $PROJECT_ID

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.