Form Parsing with Document AI (Python)

1. Introduction

c65b9ae04aa1853.png

In this codelab, you will learn how to use the Document AI Form Parser to parse a handwritten form with Python.

We will use a simple medical intake form as an example, but this procedure will work with any generalized form supported by DocAI.

Prerequisites

This codelab builds upon content presented in other Document AI Codelabs.

It is recommended that you complete the following Codelabs before proceeding.

What you'll learn

  • How to parse and extract data from a scanned form using the Document AI Form Parser.

What you'll need

  • A Google Cloud Project
  • A Browser, such as Chrome or Firefox
  • Knowledge of Python 3

Survey

How will you use this tutorial?

Read it through only Read it and complete the exercises

How would you rate your experience with Python?

Novice Intermediate Proficient

How would you rate your experience with using Google Cloud services?

Novice Intermediate Proficient

2. Setup and Requirements

This codelab assumes you have completed the Document AI setup steps listed in the Document AI OCR Codelab.

Please complete the following steps before proceeding:

You will also need to install Pandas, an Open Source Data Analysis library for Python.

pip3 install --upgrade pandas

3. Create a Form Parser processor

You must first create a Form Parser processor instance to use in the Document AI Platform for this tutorial.

  1. In the console, navigate to the Document AI Platform Overview
  2. Click Create Processor and select Form ParserProcessors
  3. Specify a processor name and select your region from the list.
  4. Click Create to create your processor
  5. Copy your processor ID. You must use this in your code later.

Test processor in the Cloud Console

You can test out your processor in the console by uploading a document. Click Upload Document and select a form to parse. You can download and use this sample form if you do not have one available to use.

Health Form

Your output should look this: Parsed Form

4. Download the Sample Form

We have a sample document which contains a simple medical intake form.

You can download the PDF using the following link. Then upload it to the cloudshell instance.

Alternatively, you can download it from our public Google Cloud Storage Bucket using gsutil.

gsutil cp gs://cloud-samples-data/documentai/codelabs/form-parser/intake-form.pdf .

Confirm the file is downloaded to your cloudshell using the below command:

ls -ltr intake-form.pdf

5. Extract Form Key/Value Pairs

In this step you will use the online processing API to call the form parser processor you created previously. Then, you will extract the key value pairs found in the document.

Online processing is for sending a single document and waiting for the response. You can also use batch processing if you want to send multiple files or if the file size exceeds the online processing maximum pages. You can review how to do this in the OCR Codelab.

The code for making a process request is identical for every processor type aside from the Processor ID.

The Document response object contains a list of pages from the input document.

Each page object contains a list of form fields and their locations in the text.

The following code iterates through each page and extracts each key, value and confidence score. This is structured data that can more easily stored in databases or used in other applications.

Create a file called form_parser.py and use the code below.

form_parser.py

import pandas as pd
from google.cloud import documentai_v1 as documentai


def online_process(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str,
) -> documentai.Document:
    """
    Processes a document using the Document AI Online Processing API.
    """

    opts = {"api_endpoint": f"{location}-documentai.googleapis.com"}

    # Instantiates a client
    documentai_client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processor/processor-id
    # You must create new processors in the Cloud Console first
    resource_name = documentai_client.processor_path(project_id, location, processor_id)

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

        # Load Binary Data into Document AI RawDocument Object
        raw_document = documentai.RawDocument(
            content=image_content, mime_type=mime_type
        )

        # Configure the process request
        request = documentai.ProcessRequest(
            name=resource_name, raw_document=raw_document
        )

        # Use the Document AI client to process the sample form
        result = documentai_client.process_document(request=request)

        return result.document


def trim_text(text: str):
    """
    Remove extra space characters from text (blank, newline, tab, etc.)
    """
    return text.strip().replace("\n", " ")


PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_PROJECT_LOCATION"  # Format is 'us' or 'eu'
PROCESSOR_ID = "FORM_PARSER_ID"  # Create processor in Cloud Console

# The local file in your current working directory
FILE_PATH = "form.pdf"
# Refer to https://cloud.google.com/document-ai/docs/processors-list
# for supported file types
MIME_TYPE = "application/pdf"

document = online_process(
    project_id=PROJECT_ID,
    location=LOCATION,
    processor_id=PROCESSOR_ID,
    file_path=FILE_PATH,
    mime_type=MIME_TYPE,
)

names = []
name_confidence = []
values = []
value_confidence = []

for page in document.pages:
    for field in page.form_fields:
        # Get the extracted field names
        names.append(trim_text(field.field_name.text_anchor.content))
        # Confidence - How "sure" the Model is that the text is correct
        name_confidence.append(field.field_name.confidence)

        values.append(trim_text(field.field_value.text_anchor.content))
        value_confidence.append(field.field_value.confidence)

# Create a Pandas Dataframe to print the values in tabular format.
df = pd.DataFrame(
    {
        "Field Name": names,
        "Field Name Confidence": name_confidence,
        "Field Value": values,
        "Field Value Confidence": value_confidence,
    }
)

print(df)

Run your code now and you should see the text extracted and printed in your console.

You should see the following output if using our sample document:

$ python3 form_parser.py
                                           Field Name  Field Name Confidence                                        Field Value  Field Value Confidence
0                                            Phone #:               0.999982                                     (906) 917-3486                0.999982
1                                  Emergency Contact:               0.999972                                         Eva Walker                0.999972
2                                     Marital Status:               0.999951                                             Single                0.999951
3                                             Gender:               0.999933                                                  F                0.999933
4                                         Occupation:               0.999914                                  Software Engineer                0.999914
5                                        Referred By:               0.999862                                               None                0.999862
6                                               Date:               0.999858                                            9/14/19                0.999858
7                                                DOB:               0.999716                                         09/04/1986                0.999716
8                                            Address:               0.999147                                     24 Barney Lane                0.999147
9                                               City:               0.997718                                             Towaco                0.997718
10                                              Name:               0.997345                                       Sally Walker                0.997345
11                                             State:               0.996944                                                 NJ                0.996944
...

6. Congratulations

Congratulations, you've successfully used the Document AI API to extract data from a handwritten form. We encourage you to experiment with other form documents.

Clean Up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:

  • In the Cloud Console, go to the Manage resources page.
  • In the project list, select your project then click Delete.
  • In the dialog, type the project ID and then click Shut down to delete the project.

Learn More

Continue learning about Document AI with these follow-up Codelabs.

Resources

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.