Form Parsing with Document AI (Python)

1. Introduction

In this codelab, you will learn how to use the Document AI Form Parser to parse a handwritten form with Python.

We will use a simple medical intake form as an example, but this procedure will work with any generalized form supported by DocAI.

Prerequisites

This codelab builds upon content presented in other Document AI Codelabs.

It is recommended that you complete the following Codelab before proceeding:

Optical Character Recognition (OCR) with Document AI (Python)

What you'll learn

How to parse and extract data from a scanned form using the Document AI Form Parser.

What you'll need

A Google Cloud Project
A Browser, such as Chrome or Firefox
Knowledge of Python 3

2. Setup and Requirements

This codelab assumes you have completed the Document AI setup steps listed in the Document AI OCR Codelab.

Please complete those steps before proceeding.

You will also need to install Pandas, an Open Source Data Analysis library for Python.

pip3 install --upgrade pandas

3. Create a Form Parser processor

You must first create a Form Parser processor instance to use in the Document AI Platform for this tutorial.

In the console, navigate to the Document AI Platform Overview
Click Create Processor and select Form Parser
Specify a processor name and select your region from the list.
Click Create to create your processor
Copy your processor ID. You must use this in your code later.

Test processor in the Cloud Console

You can test out your processor in the console by uploading a document. Click Upload Document and select a form to parse. You can download and use this sample form if you do not have one available to use.

Health Form

Your output should look like this: Parsed Form

4. Download the Sample Form

We have a sample document which contains a simple medical intake form.

You can download the PDF using the following link. Then upload it to the Cloud Shell instance.

Alternatively, you can download it from our public Google Cloud Storage Bucket using gsutil.

gsutil cp gs://cloud-samples-data/documentai/codelabs/form-parser/intake-form.pdf .

Confirm that the file was downloaded to your Cloud Shell using the command below:

ls -ltr intake-form.pdf

5. Extract Form Key/Value Pairs

In this step, you will use the online processing API to call the Form Parser processor you created previously. Then, you will extract the key-value pairs found in the document.

Online processing sends a single document and waits for the response. You can use batch processing instead if you want to send multiple files or if the document exceeds the page limit for online processing. You can review how to do this in the OCR Codelab.

The code for making a process request is identical for every processor type aside from the Processor ID.
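To make that concrete, here is a minimal sketch of the two strings that identify which processor a request targets. It assumes the resource name follows the projects/.../locations/.../processors/... pattern and that the endpoint is built the same way as in form_parser.py below. The helper names and the project and processor IDs are hypothetical placeholders, not part of the client library.

```python
def build_processor_name(project_id: str, location: str, processor_id: str) -> str:
    """Build a processor resource name (hypothetical helper for illustration)."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


def build_api_endpoint(location: str) -> str:
    """Build the regional API endpoint string used in the client options."""
    return f"{location}-documentai.googleapis.com"


# Placeholder values: swapping the processor ID is the only change needed
# to point the same request code at a different processor.
form_parser_name = build_processor_name("my-project", "us", "form-parser-id")
ocr_name = build_processor_name("my-project", "us", "ocr-id")

print(form_parser_name)
print(ocr_name)
print(build_api_endpoint("us"))
```

In the real code, the client's processor_path method builds the resource name for you, so these helpers are only meant to show what changes between processor types.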

The Document response object contains a list of pages from the input document.

Each page object contains a list of form fields and their locations in the text.

The following code iterates through each page and extracts each key, value, and confidence score. This is structured data that can be more easily stored in databases or used in other applications.

Create a file called form_parser.py and use the code below.

form_parser.py

import pandas as pd
from google.cloud import documentai_v1 as documentai


def online_process(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str,
) -> documentai.Document:
    """
    Processes a document using the Document AI Online Processing API.
    """

    opts = {"api_endpoint": f"{location}-documentai.googleapis.com"}

    # Instantiates a client
    documentai_client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    resource_name = documentai_client.processor_path(project_id, location, processor_id)

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Load Binary Data into Document AI RawDocument Object
    raw_document = documentai.RawDocument(content=image_content, mime_type=mime_type)

    # Configure the process request
    request = documentai.ProcessRequest(name=resource_name, raw_document=raw_document)

    # Use the Document AI client to process the sample form
    result = documentai_client.process_document(request=request)

    return result.document


def trim_text(text: str) -> str:
    """
    Strip leading and trailing whitespace and replace newlines with spaces.
    """
    return text.strip().replace("\n", " ")


PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_PROJECT_LOCATION"  # Format is 'us' or 'eu'
PROCESSOR_ID = "FORM_PARSER_ID"  # Create processor in Cloud Console

# The local file in your current working directory
FILE_PATH = "intake-form.pdf"
# Refer to https://cloud.google.com/document-ai/docs/processors-list
# for supported file types
MIME_TYPE = "application/pdf"

document = online_process(
    project_id=PROJECT_ID,
    location=LOCATION,
    processor_id=PROCESSOR_ID,
    file_path=FILE_PATH,
    mime_type=MIME_TYPE,
)

names = []
name_confidence = []
values = []
value_confidence = []

for page in document.pages:
    for field in page.form_fields:
        # Get the extracted field names
        names.append(trim_text(field.field_name.text_anchor.content))
        # Confidence - How "sure" the Model is that the text is correct
        name_confidence.append(field.field_name.confidence)

        values.append(trim_text(field.field_value.text_anchor.content))
        value_confidence.append(field.field_value.confidence)

# Create a Pandas Dataframe to print the values in tabular format.
df = pd.DataFrame(
    {
        "Field Name": names,
        "Field Name Confidence": name_confidence,
        "Field Value": values,
        "Field Value Confidence": value_confidence,
    }
)

print(df)

Run your code now and you should see the text extracted and printed in your console.

You should see the following output if using our sample document:

$ python3 form_parser.py
                                           Field Name  Field Name Confidence                                        Field Value  Field Value Confidence
0                                            Phone #:               0.999982                                     (906) 917-3486                0.999982
1                                  Emergency Contact:               0.999972                                         Eva Walker                0.999972
2                                     Marital Status:               0.999951                                             Single                0.999951
3                                             Gender:               0.999933                                                  F                0.999933
4                                         Occupation:               0.999914                                  Software Engineer                0.999914
5                                        Referred By:               0.999862                                               None                0.999862
6                                               Date:               0.999858                                            9/14/19                0.999858
7                                                DOB:               0.999716                                         09/04/1986                0.999716
8                                            Address:               0.999147                                     24 Barney Lane                0.999147
9                                               City:               0.997718                                             Towaco                0.997718
10                                              Name:               0.997345                                       Sally Walker                0.997345
11                                             State:               0.996944                                                 NJ                0.996944
...
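Because each value comes with a confidence score, a common next step is to keep only the fields you trust before storing them. The following is a minimal sketch of that idea, not part of the codelab's own code: the fields_to_dict helper and the 0.997 threshold are hypothetical, and the sample lists reuse a few rows from the output above.

```python
from typing import Dict, List


def fields_to_dict(
    names: List[str],
    values: List[str],
    confidences: List[float],
    min_confidence: float,
) -> Dict[str, str]:
    """Map field names to values, skipping low-confidence values."""
    record: Dict[str, str] = {}
    for name, value, confidence in zip(names, values, confidences):
        if confidence >= min_confidence:
            # Drop the trailing colon that the form prints after each label
            key = name.rstrip(":").strip()
            record[key] = value
    return record


# A few rows copied from the sample output above
names = ["Phone #:", "City:", "Name:", "State:"]
values = ["(906) 917-3486", "Towaco", "Sally Walker", "NJ"]
value_confidence = [0.999982, 0.997718, 0.997345, 0.996944]

# The threshold is an arbitrary value chosen for illustration
record = fields_to_dict(names, values, value_confidence, min_confidence=0.997)
print(record)
```

With this threshold, the State field is left out because its confidence falls just below 0.997. A suitable threshold depends on your documents and on how costly a wrong value is for your application.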

6. Parse Tables

The Form Parser is also able to extract data from tables within documents. In this step, we will download a new sample document and extract data from the table. Since we are loading the data into Pandas, this data can be output to a CSV file and many other formats with a single method call.

Download the Sample Form with Tables

We have a sample document which contains a sample form and a table.

You can download the PDF using the following link. Then upload it to the Cloud Shell instance.

Alternatively, you can download it from our public Google Cloud Storage Bucket using gsutil.

gsutil cp gs://cloud-samples-data/documentai/codelabs/form-parser/form_with_tables.pdf .

Confirm that the file was downloaded to your Cloud Shell using the command below:

ls -ltr form_with_tables.pdf

Extract Table Data

The processing request for table data is exactly the same as for extracting key-value pairs. The difference is which fields we extract the data from in the response. Table data is stored in the pages[].tables[] field.

This example extracts the text from the header rows and body rows of each table on each page, then prints each table and saves it as a CSV file.

Create a file called table_parsing.py and use the code below.

table_parsing.py

# type: ignore[1]
"""
Uses Document AI online processing to call a form parser processor
Extracts the tables and data in the document.
"""
from os.path import splitext
from typing import List, Sequence

import pandas as pd
from google.cloud import documentai


def online_process(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str,
) -> documentai.Document:
    """
    Processes a document using the Document AI Online Processing API.
    """

    opts = {"api_endpoint": f"{location}-documentai.googleapis.com"}

    # Instantiates a client
    documentai_client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    resource_name = documentai_client.processor_path(project_id, location, processor_id)

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Load Binary Data into Document AI RawDocument Object
    raw_document = documentai.RawDocument(content=image_content, mime_type=mime_type)

    # Configure the process request
    request = documentai.ProcessRequest(name=resource_name, raw_document=raw_document)

    # Use the Document AI client to process the sample form
    result = documentai_client.process_document(request=request)

    return result.document


def get_table_data(
    rows: Sequence[documentai.Document.Page.Table.TableRow], text: str
) -> List[List[str]]:
    """
    Get Text data from table rows
    """
    all_values: List[List[str]] = []
    for row in rows:
        current_row_values: List[str] = []
        for cell in row.cells:
            current_row_values.append(
                text_anchor_to_text(cell.layout.text_anchor, text)
            )
        all_values.append(current_row_values)
    return all_values


def text_anchor_to_text(text_anchor: documentai.Document.TextAnchor, text: str) -> str:
    """
    Document AI identifies table data by their offsets in the entirety of the
    document's text. This function converts offsets to a string.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in text_anchor.text_segments:
        start_index = int(segment.start_index)
        end_index = int(segment.end_index)
        response += text[start_index:end_index]
    return response.strip().replace("\n", " ")


PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_PROJECT_LOCATION"  # Format is 'us' or 'eu'
PROCESSOR_ID = "FORM_PARSER_ID"  # Create processor before running sample

# The local file in your current working directory
FILE_PATH = "form_with_tables.pdf"
# Refer to https://cloud.google.com/document-ai/docs/file-types
# for supported file types
MIME_TYPE = "application/pdf"

document = online_process(
    project_id=PROJECT_ID,
    location=LOCATION,
    processor_id=PROCESSOR_ID,
    file_path=FILE_PATH,
    mime_type=MIME_TYPE,
)

header_row_values: List[List[str]] = []
body_row_values: List[List[str]] = []

# Input Filename without extension
output_file_prefix = splitext(FILE_PATH)[0]

for page in document.pages:
    for index, table in enumerate(page.tables):
        header_row_values = get_table_data(table.header_rows, document.text)
        body_row_values = get_table_data(table.body_rows, document.text)

        # Create a Pandas Dataframe to print the values in tabular format.
        df = pd.DataFrame(
            data=body_row_values,
            columns=pd.MultiIndex.from_arrays(header_row_values),
        )

        print(f"Page {page.page_number} - Table {index}")
        print(df)

        # Save each table as a CSV file
        output_filename = f"{output_file_prefix}_pg{page.page_number}_tb{index}.csv"
        df.to_csv(output_filename, index=False)
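
To see how the offset lookup in text_anchor_to_text behaves without calling the API, here is a minimal sketch that uses stand-in classes in place of the Document AI types. FakeTextSegment, FakeTextAnchor, and the sample text are hypothetical and exist only for this illustration. The slicing logic mirrors the function above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeTextSegment:
    """Stand-in for a text segment: a [start_index, end_index) range."""

    start_index: int
    end_index: int


@dataclass
class FakeTextAnchor:
    """Stand-in for a text anchor: an ordered list of segments."""

    text_segments: List[FakeTextSegment] = field(default_factory=list)


def anchor_to_text(anchor: FakeTextAnchor, text: str) -> str:
    """Join the slices of the full document text that the anchor points to."""
    parts = [text[seg.start_index : seg.end_index] for seg in anchor.text_segments]
    return "".join(parts).strip().replace("\n", " ")


# Hypothetical full text of a document, as one long string
document_text = "Item\nDescription\nItem 1\nPencil\nbox\n"

# A header cell stored as a single segment
header_cell = FakeTextAnchor([FakeTextSegment(0, 5)])
# A body cell whose text wraps onto two lines, stored as two segments
wrapped_cell = FakeTextAnchor([FakeTextSegment(24, 31), FakeTextSegment(31, 35)])

print(anchor_to_text(header_cell, document_text))
print(anchor_to_text(wrapped_cell, document_text))
```

The wrapped cell shows why the function loops over every segment and replaces newlines: the two slices are joined and then flattened into a single-line cell value.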