Managing Document AI processors with Python

1. Overview


What is Document AI?

Document AI is a platform that lets you extract insights from your documents. At heart, it offers a growing list of document processors (also called parsers or splitters, depending on their functionality).

There are two ways you can manage Document AI processors:

  • manually, from the web console;
  • programmatically, using the Document AI API.

Here is an example screenshot showing your processor list, both from the web console and from Python code:


In this lab, you will focus on managing Document AI processors programmatically with the Python client library.

What you'll see

  • How to enable Document AI
  • How to set up your project
  • How to set up Python
  • How to list processor types
  • How to create processors
  • How to list processors
  • How to analyze a document
  • How to enable/disable processors
  • How to delete processors

What you'll need

  • A Google Cloud project
  • A browser, such as Chrome or Firefox
  • Familiarity using Python



2. Setup and requirements

Self-paced environment setup

  1. Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.




  • The Project name is the display name for this project's participants. It is a character string not used by Google APIs. You can update it at any time.
  • The Project ID must be unique across all Google Cloud projects and is immutable (cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference the Project ID (it is typically identified as PROJECT_ID). If you don't like the generated ID, you may generate another random one. Alternatively, you can try your own and see if it's available. It cannot be changed after this step and will remain for the duration of the project.
  • For your information, there is a third value, a Project Number which some APIs use. Learn more about all three of these values in the documentation.
  2. Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab shouldn't cost much, if anything at all. To shut down resources so you don't incur billing beyond this tutorial, you can delete the resources you created or delete the whole project. New users of Google Cloud are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this lab you will be using Cloud Shell, a command line environment running in the Cloud.

Activate Cloud Shell

  1. From the Cloud Console, click Activate Cloud Shell.


If you've never started Cloud Shell before, you're presented with an intermediate screen (below the fold) describing what it is. If that's the case, click Continue (and you won't ever see it again). Here's what that one-time screen looks like:


It should only take a few moments to provision and connect to Cloud Shell.


This virtual machine is loaded with all the development tools you need. It offers a persistent 5GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with simply a browser or your Chromebook.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your project ID.

  2. Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list

Command output

 Credentialed Accounts
*       <my_account>@<>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`
  3. Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
gcloud config list project

Command output

project = <PROJECT_ID>

If it is not, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

3. Enabling Document AI

Before you can begin using Document AI, run the following command in Cloud Shell to enable the Document AI API:

gcloud services enable documentai.googleapis.com

You should see something like this:

Operation "operations/..." finished successfully.

Now, you can use Document AI!

4. Project setup

First, create and navigate to your working directory:

mkdir ~/documentai-processors
cd ~/documentai-processors

Then, set the PROJECT_ID and GOOGLE_APPLICATION_CREDENTIALS environment variables:

export PROJECT_ID=$(gcloud config get-value core/project)
export GOOGLE_APPLICATION_CREDENTIALS=~/documentai-processors/key.json

For your code to make authorized API calls from your current shell session, a best practice is to use a service account. A service account is a special kind of account, owned by your project, and identified by its email address (like any other user account).

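Create a service account. The commands that follow expect the SERVICE_ACCOUNT_NAME and SERVICE_ACCOUNT variables to be defined; the name below is only an example, so pick any valid name you like:

export SERVICE_ACCOUNT_NAME="docai-processors-lab"
export SERVICE_ACCOUNT="$SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com"

gcloud iam service-accounts create $SERVICE_ACCOUNT_NAME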

Grant it the "Document AI Editor" role (so your code can manage processors):

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member "serviceAccount:$SERVICE_ACCOUNT" \
  --role "roles/documentai.editor"

Then, create and download the service account key (a JSON file):

gcloud iam service-accounts keys create $GOOGLE_APPLICATION_CREDENTIALS \
  --iam-account $SERVICE_ACCOUNT

Verify that the GOOGLE_APPLICATION_CREDENTIALS environment variable (used by the Document AI client library) points to the full path of the JSON file:

cat $GOOGLE_APPLICATION_CREDENTIALS

You should see the content of the service account key:

{
  "type": "service_account",
  ...
}

Your project and credentials are configured! Next, set up your Python environment...
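
If you'd like to double-check the key file from Python before continuing, here is a minimal sketch using only the standard library. It is not part of the original lab: check_key_file is a hypothetical helper name, and it relies only on what the output above shows, namely that the file is JSON with a "type" field set to "service_account".

```python
import json
import os


def check_key_file(path: str) -> bool:
    """Return True if `path` is a JSON file whose "type" is "service_account".

    Hypothetical helper: it checks nothing else about the key.
    """
    if not os.path.isfile(path):
        return False
    try:
        with open(path, encoding="utf-8") as key_file:
            key = json.load(key_file)
    except (OSError, ValueError):
        # Unreadable file or invalid JSON
        return False
    return isinstance(key, dict) and key.get("type") == "service_account"


# An unset or wrong variable simply yields False
key_path = os.path.expanduser(os.getenv("GOOGLE_APPLICATION_CREDENTIALS", ""))
print(check_key_file(key_path))
```

A False result usually means the environment variable is unset in the current shell or points to the wrong path.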

5. Python setup

Create a Python virtual environment to isolate dependencies:

virtualenv venv

Activate the virtual environment:

source venv/bin/activate

Install IPython, the Document AI client library, and python-tabulate (which you'll use to pretty-print the request results):

pip install ipython google-cloud-documentai tabulate

You should see something like this:

Installing collected packages: ..., tabulate, ipython, google-cloud-documentai
Successfully installed ... google-cloud-documentai-2.0.1 ...

Now, you're ready to use the Document AI client library!

In the next steps, you'll use an interactive Python interpreter called IPython, which you installed in the previous step. Start a session by running ipython in Cloud Shell:

ipython

You should see something like this:

Python 3.9.2 (default, Feb 28 2021, 17:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

Copy the following code into your IPython session:

import os
from typing import Iterator, Optional, Sequence, Tuple, cast

import google.cloud.documentai_v1 as docai
from google.api_core.client_options import ClientOptions
from tabulate import tabulate

PROJECT_ID = os.getenv("PROJECT_ID", "")
API_LOCATION = "eu"  # Choose "us" or "eu"

assert PROJECT_ID, "PROJECT_ID is undefined"
assert API_LOCATION in ("us", "eu"), "API_LOCATION is incorrect"

# Test processors
document_ocr_display_name = "document-ocr"
form_parser_display_name = "form-parser"

test_processor_display_names_and_types = (
    (document_ocr_display_name, "OCR_PROCESSOR"),
    (form_parser_display_name, "FORM_PARSER_PROCESSOR"),
)

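def get_client() -> docai.DocumentProcessorServiceClient:
    # Use the regional endpoint matching API_LOCATION
    client_options = ClientOptions(
        api_endpoint=f"{API_LOCATION}-documentai.googleapis.com"
    )
    return docai.DocumentProcessorServiceClient(client_options=client_options)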

def get_parent(client: docai.DocumentProcessorServiceClient) -> str:
    return client.common_location_path(PROJECT_ID, API_LOCATION)

def get_client_and_parent() -> Tuple[docai.DocumentProcessorServiceClient, str]:
    client = get_client()
    parent = get_parent(client)
    return client, parent
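
To make these helpers less opaque, here is a minimal pure-Python sketch of the two strings they rely on. It assumes the regional endpoint pattern {location}-documentai.googleapis.com and the location path pattern projects/{project}/locations/{location}. The function names api_endpoint and location_path and the project ID my-project are hypothetical and not part of the client library.

```python
def api_endpoint(location: str) -> str:
    """Regional API endpoint for a location such as "us" or "eu" (assumed pattern)."""
    return f"{location}-documentai.googleapis.com"


def location_path(project: str, location: str) -> str:
    """Parent resource path under which processors are created and listed."""
    return f"projects/{project}/locations/{location}"


# "my-project" is a placeholder project ID
print(api_endpoint("eu"))
print(location_path("my-project", "eu"))
```

In the lab code itself, the path is built by the client's common_location_path method, so you don't need these helpers in your session.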

You're ready to make your first request and list the processor types...

6. Listing processor types

Before creating a processor, you need to choose its type among the growing list of processor types. You can retrieve this list with fetch_processor_types.

Add the following functions into your IPython session:

def fetch_processor_types() -> Sequence[docai.ProcessorType]:
    client, parent = get_client_and_parent()
    response = client.fetch_processor_types(parent=parent)
    return cast(Sequence[docai.ProcessorType], response.processor_types)

def print_processor_types(processor_types: Sequence[docai.ProcessorType]):
    def sort_key(pt):
        return (not pt.allow_creation, pt.category, pt.type_)

    sorted_processor_types = sorted(processor_types, key=sort_key)
    data = processor_type_tabular_data(sorted_processor_types)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processor types: {len(sorted_processor_types)}")

def processor_type_tabular_data(
    processor_types: Sequence[docai.ProcessorType],
) -> Iterator[Tuple[str, str, str, str]]:
    def locations(pt):
        return ", ".join(sorted(loc.location_id for loc in pt.available_locations))

    yield ("type", "category", "allow_creation", "locations")
    yield ("left", "left", "left", "left")
    if not processor_types:
        yield ("-", "-", "-", "-")
    for pt in processor_types:
        yield (
            cast(str, pt.type_),
            cast(str, pt.category),
            str(pt.allow_creation),
            locations(pt),
        )

List the processor types:

processor_types = fetch_processor_types()
print_processor_types(processor_types)

You should see something like this:

| type                  | category    | allow_creation | locations  |
| FORM_PARSER_PROCESSOR | GENERAL     | True           | eu, us     |
| OCR_PROCESSOR         | GENERAL     | True           | eu, us     |
| EXPENSE_PROCESSOR     | SPECIALIZED | True           | eu, us     |
→ Processor types: 35
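
The row order above comes from the sort key in print_processor_types. The following sketch replays that key on stand-in tuples to show why creatable types come first. FakeProcessorType and the HYPOTHETICAL_RESTRICTED_PROCESSOR entry are made up for illustration and are not real API objects or processor types.

```python
from typing import List, NamedTuple


class FakeProcessorType(NamedTuple):
    """Stand-in for docai.ProcessorType with only the fields used for sorting."""

    type_: str
    category: str
    allow_creation: bool


def sort_key(pt: FakeProcessorType):
    # False sorts before True, so types with allow_creation=True come first;
    # ties are then broken by category and by type name.
    return (not pt.allow_creation, pt.category, pt.type_)


samples: List[FakeProcessorType] = [
    FakeProcessorType("EXPENSE_PROCESSOR", "SPECIALIZED", True),
    FakeProcessorType("HYPOTHETICAL_RESTRICTED_PROCESSOR", "GENERAL", False),
    FakeProcessorType("OCR_PROCESSOR", "GENERAL", True),
    FakeProcessorType("FORM_PARSER_PROCESSOR", "GENERAL", True),
]

sorted_types = [pt.type_ for pt in sorted(samples, key=sort_key)]
print(sorted_types)
```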

Now, you have all the info needed to create processors...

7. Creating processors

To create a processor, call create_processor with a display name and a processor type.

Add the following function:

def create_processor(display_name: str, type: str) -> docai.Processor:
    client, parent = get_client_and_parent()
    processor = docai.Processor(display_name=display_name, type_=type)
    return client.create_processor(parent=parent, processor=processor)

Create the test processors:

for display_name, type in test_processor_display_names_and_types:
    print(f"Creating {display_name} ({type})...")
    create_processor(display_name, type)

You should get the following:

Creating document-ocr (OCR_PROCESSOR)...
Creating form-parser (FORM_PARSER_PROCESSOR)...

You have created new processors!

Next, see how to list the processors...

8. Listing processors

list_processors returns the list of all the processors belonging to your project.

Add the following functions:

def list_processors() -> Sequence[docai.Processor]:
    client, parent = get_client_and_parent()
    response = client.list_processors(parent=parent)
    # Iterate over the pager so that all result pages are fetched
    return list(response)

def print_processors(processors: Optional[Sequence[docai.Processor]] = None):
    def sort_key(processor):
        return processor.display_name

    if processors is None:
        processors = list_processors()
    sorted_processors = sorted(processors, key=sort_key)
    data = processor_tabular_data(sorted_processors)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processors: {len(sorted_processors)}")

def processor_tabular_data(
    processors: Sequence[docai.Processor],
) -> Iterator[Tuple[str, str, str]]:
    yield ("display_name", "type", "state")
    yield ("left", "left", "left")
    if not processors:
        yield ("-", "-", "-")
    for processor in processors:
        yield (
            cast(str, processor.display_name),
            cast(str, processor.type_),
            processor.state.name,
        )

Call the functions:

processors = list_processors()
print_processors(processors)

You should get the following:

| display_name | type                  | state   |
| document-ocr | OCR_PROCESSOR         | ENABLED |
| form-parser  | FORM_PARSER_PROCESSOR | ENABLED |
→ Processors: 2

To retrieve a processor by its display name, add the following function:

def get_processor(
    display_name: str, processors: Optional[Sequence[docai.Processor]] = None
) -> Optional[docai.Processor]:
    if processors is None:
        processors = list_processors()
    for processor in processors:
        if processor.display_name == display_name:
            return processor
    return None

Test the function:

processor = get_processor(document_ocr_display_name, processors)

assert processor is not None
print(processor)

You should see something like this:

name: "projects/PROJECT_NUM/locations/LOCATION/processors/PROCESSOR_ID"
display_name: "document-ocr"
state: ENABLED
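
The name field is the processor's full resource name. If you need its individual parts (for example the processor ID), one option is to split it with a regular expression. This is a minimal sketch based only on the name format shown above; parse_processor_name and the sample values 123456 and abc123 are hypothetical.

```python
import re
from typing import Dict

# Matches "projects/<project>/locations/<location>/processors/<processor>"
PROCESSOR_NAME_PATTERN = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/processors/(?P<processor>[^/]+)$"
)


def parse_processor_name(name: str) -> Dict[str, str]:
    """Split a processor resource name into its project, location, and ID."""
    match = PROCESSOR_NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected processor name: {name!r}")
    return match.groupdict()


# Placeholder values, not a real processor
parts = parse_processor_name("projects/123456/locations/eu/processors/abc123")
print(parts)
```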

Now, you know how to list your processors and retrieve one of them from its display name. Next, see how to use a processor...

9. Analyzing a document

Documents can be processed in two ways:

  • synchronously with process_document, to analyze a single document and directly use the results;
  • asynchronously with batch_process_documents, to launch a batch processing on multiple or large documents.

Your test document (PDF) is the scan of a questionnaire filled out with handwritten answers. Download it into your working directory, from your IPython session:

!gsutil cp gs://cloud-samples-data/documentai/form.pdf .

Check the content of your working directory:

!ls

You should have the following:

form.pdf  key.json  venv

As you have a single local file to analyze, you can use the synchronous process_document method. Add the following function:

def process_file(
    processor: docai.Processor, file_path: str, mime_type: str
) -> docai.Document:
    client = get_client()

    with open(file_path, "rb") as document_file:
        document_content = document_file.read()

    document = docai.RawDocument(content=document_content, mime_type=mime_type)
    request = docai.ProcessRequest(raw_document=document, name=processor.name)

    response = client.process_document(request)
    return cast(docai.Document, response.document)

As your document is a questionnaire, choose the form parser. This general processor detects form fields, in addition to extracting the text (printed and handwritten), which all processors do.

Analyze the document:

processor = get_processor(form_parser_display_name)
assert processor is not None

file_path = "./form.pdf"
mime_type = "application/pdf"

document = process_file(processor, file_path, mime_type)
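
Here the MIME type is hard-coded. As an optional convenience, the standard library's mimetypes module can guess it from the file extension. This is a minimal sketch: guess_mime_type is a hypothetical helper, and it doesn't check whether the chosen processor accepts the guessed type.

```python
import mimetypes


def guess_mime_type(file_path: str) -> str:
    """Guess a MIME type from the file extension, or raise if it is unknown."""
    mime_type, _encoding = mimetypes.guess_type(file_path)
    if mime_type is None:
        raise ValueError(f"Cannot guess the MIME type of {file_path!r}")
    return mime_type


print(guess_mime_type("./form.pdf"))
```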

All processors run an Optical Character Recognition (OCR) first pass on the whole document. Check out the text detected by the OCR pass:

document.text.split("\n")

You should see something like the following:

['FakeDoc M.D.',
 'Please fill out the questionnaire carefully. The information you provide will be used to complete',
 'your health profile and will be kept confidential.',
 'DOB: 09/04/1986',
 'Address: 24 Barney Lane City: Towaco State: NJ Zip: 07082',
 'Email: Sally, Phone #: (906) 917-3486',
 'Gender: F',
 'Single Occupation: Software Engineer',
 'Referred By: None',
 'Emergency Contact: Eva Walker Emergency Contact Phone: (906) 334-8976',
 'Marital Status:',

Add the following functions to print out the detected form fields:

def print_form_fields(document: docai.Document):
    sorted_form_fields = form_fields_sorted_by_ocr_order(document)
    data = form_field_tabular_data(sorted_form_fields, document)
    headers = next(data)
    colalign = next(data)

    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Form fields: {len(sorted_form_fields)}")

def form_field_tabular_data(
    form_fields: Sequence[docai.Document.Page.FormField],
    document: docai.Document,
) -> Iterator[Tuple[str, str, str]]:
    yield ("name", "value", "confidence")
    yield ("right", "left", "right")
    if not form_fields:
        yield ("-", "-", "-")
    for form_field in form_fields:
        name_layout = cast(docai.Document.Page.Layout, form_field.field_name)
        value_layout = cast(docai.Document.Page.Layout, form_field.field_value)
        name_anchor = cast(docai.Document.TextAnchor, name_layout.text_anchor)
        value_anchor = cast(docai.Document.TextAnchor, value_layout.text_anchor)
        name = text_from_anchor(name_anchor, document)
        value = text_from_anchor(value_anchor, document)
        confidence = value_layout.confidence
        yield (name, value, f"{confidence:.1%}")

Also add these utility functions:

def form_fields_sorted_by_ocr_order(
    document: docai.Document,
) -> Sequence[docai.Document.Page.FormField]:
    def sort_key(form_field):
        # Sort according to the field name detected position
        text_anchor = form_field.field_name.text_anchor
        return text_anchor.text_segments[0].start_index if text_anchor else 0

    pages = cast(Sequence[docai.Document.Page], document.pages)
    form_fields = (
        form_field
        for page in pages
        for form_field in cast(
            Sequence[docai.Document.Page.FormField], page.form_fields
        )
    )
    return sorted(form_fields, key=sort_key)

def text_from_anchor(
    text_anchor: docai.Document.TextAnchor, document: docai.Document
) -> str:
    full_text = cast(str, document.text)
    segments = cast(
        Sequence[docai.Document.TextAnchor.TextSegment], text_anchor.text_segments
    )
    text = "".join(
        full_text[segment.start_index : segment.end_index] for segment in segments
    )
    return text[:-1] if text.endswith("\n") else text
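
To see how text anchors work without calling the API, here is a minimal sketch that applies the same slicing logic to stand-in objects. Segment is a hypothetical dataclass that mimics only the start_index and end_index fields of a text segment, and the sample text is made up to resemble the OCR output of the form.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Stand-in for docai.Document.TextAnchor.TextSegment."""

    start_index: int
    end_index: int


def text_from_segments(full_text: str, segments: List[Segment]) -> str:
    # Each segment is a [start_index, end_index) slice of the full document text
    text = "".join(full_text[s.start_index : s.end_index] for s in segments)
    # Drop a single trailing newline, as text_from_anchor does
    return text[:-1] if text.endswith("\n") else text


full_text = "Name:\nSally\nWalker\nDOB:\n09/04/1986\n"

field_name = text_from_segments(full_text, [Segment(0, 6)])
field_value = text_from_segments(full_text, [Segment(6, 19)])
print(repr(field_name), repr(field_value))
```

A value spanning several lines keeps its inner newlines, which is why the Name field is printed on two rows in the table.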

Print the detected form fields:

print_form_fields(document)

You should get a printout like the following:

|         name | value                   | confidence |
|        Date: | 9/14/19                 |     100.0% |
|        Name: | Sally                   |      99.7% |
|              | Walker                  |            |
|         DOB: | 09/04/1986              |     100.0% |
|     Address: | 24 Barney Lane          |      99.9% |
|        City: | Towaco                  |      99.8% |
|       State: | NJ                      |      99.7% |
|         Zip: | 07082                   |      99.5% |
|       Email: | Sally, |      99.6% |
|     Phone #: | (906) 917-3486          |     100.0% |
|      Gender: | F                       |     100.0% |
|  Occupation: | Software Engineer       |     100.0% |
| Referred By: | None                    |     100.0% |
→ Form fields: 17
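
Once the fields are extracted, you may want them as a plain dictionary and ignore low-confidence detections. This is a minimal sketch under stated assumptions: fields_to_dict is a hypothetical helper, it works on (name, value, confidence) tuples rather than on API objects, and the sample rows reuse a few values from the printout above with their rounded confidences.

```python
from typing import Dict, Iterable, Tuple

FormFieldRow = Tuple[str, str, float]  # (name, value, confidence in [0, 1])


def fields_to_dict(
    rows: Iterable[FormFieldRow], min_confidence: float = 0.0
) -> Dict[str, str]:
    """Map cleaned field names to values, skipping low-confidence fields."""
    result: Dict[str, str] = {}
    for name, value, confidence in rows:
        if confidence < min_confidence:
            continue
        # "Date:" becomes "Date"
        key = name.strip().rstrip(":")
        result[key] = value
    return result


rows = [
    ("Date:", "9/14/19", 1.0),
    ("DOB:", "09/04/1986", 1.0),
    ("State:", "NJ", 0.997),
    ("Zip:", "07082", 0.995),
]
fields = fields_to_dict(rows, min_confidence=0.996)
print(fields)
```

The threshold is arbitrary here; a suitable value depends on your documents and on how costly a wrong value is for your use case.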

Check out how the field names and values have been detected (PDF). Here is the top half of the questionnaire:


You have analyzed a form (containing both printed and handwritten text) and detected its fields with high confidence: your pixels have been transformed into structured data!

10. Enabling/disabling processors

With disable_processor and enable_processor, you can control whether a processor can be used.

Add the following functions:

def update_processor_state(processor: docai.Processor, enable_processor: bool):
    client = get_client()
    if enable_processor:
        request = docai.EnableProcessorRequest(name=processor.name)
        operation = client.enable_processor(request)
    else:
        request = docai.DisableProcessorRequest(name=processor.name)
        operation = client.disable_processor(request)
    operation.result()  # Wait for the operation to complete

def enable_processor(processor: docai.Processor):
    update_processor_state(processor, True)

def disable_processor(processor: docai.Processor):
    update_processor_state(processor, False)
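
These helpers send a request whatever the processor's current state is. A possible refinement is to skip the call when the state already matches the target. The sketch below shows only that decision logic, using a stand-in enum limited to the two states that appear in this lab's output. State and needs_update are hypothetical names, and how the API reacts to a redundant request is not covered here.

```python
from enum import Enum


class State(Enum):
    """Stand-in for the processor state, limited to the two values used here."""

    ENABLED = "ENABLED"
    DISABLED = "DISABLED"


def needs_update(current: State, enable: bool) -> bool:
    """Return True if a request is needed to reach the target state."""
    target = State.ENABLED if enable else State.DISABLED
    return current is not target


print(needs_update(State.ENABLED, enable=False))
print(needs_update(State.ENABLED, enable=True))
```

With real processors, the current state could be read from processor.state.name, which is the value shown in the state column of print_processors.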

Disable the form parser processor and check the state of your processors:

processor = get_processor(form_parser_display_name)
assert processor is not None

disable_processor(processor)
print_processors()

You should get the following:

| display_name | type                  | state    |
| document-ocr | OCR_PROCESSOR         | ENABLED  |
| form-parser  | FORM_PARSER_PROCESSOR | DISABLED |
→ Processors: 2

Re-enable the form parser processor:

enable_processor(processor)
print_processors()

You should get the following:

| display_name | type                  | state   |
| document-ocr | OCR_PROCESSOR         | ENABLED |
| form-parser  | FORM_PARSER_PROCESSOR | ENABLED |
→ Processors: 2

And next, the ultimate processor management method (deletion)...

11. Deleting processors

Finally, check out how to use the delete_processor method.

Add the following function:

def delete_processor(processor: docai.Processor):
    client = get_client()
    operation = client.delete_processor(name=cast(str, processor.name))
    operation.result()  # Wait for the operation to complete

Delete your test processors:

processors_to_delete = [dn for dn, _ in test_processor_display_names_and_types]

print(f"Processors to delete: {len(processors_to_delete)}")
for processor in list_processors():
    if processor.display_name not in processors_to_delete:
        continue
    print(f"  Deleting {processor.display_name}...")
    delete_processor(processor)

print_processors()

You should get the following:

Processors to delete: 2
  Deleting form-parser...
  Deleting document-ocr...

| display_name | type | state |
| -            | -    | -     |
→ Processors: 0

You've covered all the processor management methods! You're almost done...

12. Congratulations!


You learned how to manage Document AI processors using Python!

Clean up

To clean up your development environment, from Cloud Shell:

  • If you're still in your IPython session, enter the exit command to go back to the shell.
  • Stop using the Python virtual environment with the deactivate command.
  • Delete your working directory: cd ~ ; rm -rf ~/documentai-processors/

To delete your Google Cloud project, from Cloud Shell:

  • Retrieve your current project ID: PROJECT_ID=$(gcloud config get-value core/project)
  • Make sure this is the project you wish to delete: echo $PROJECT_ID
  • Delete the project: gcloud projects delete $PROJECT_ID


This work is licensed under a Creative Commons Attribution 2.0 Generic License.