1. Introduction
In this codelab, you will learn how to use the Document AI Form Parser to parse a handwritten form with Python.
We will use a simple medical intake form as an example, but this procedure will work with any generalized form supported by DocAI.
Prerequisites
This codelab builds upon content presented in other Document AI Codelabs.
It is recommended that you complete the following Codelabs before proceeding.
What you'll learn
- How to parse and extract data from a scanned form using the Document AI Form Parser.
What you'll need
Survey
How will you use this tutorial?
How would you rate your experience with Python?
How would you rate your experience with using Google Cloud services?
2. Setup and Requirements
This codelab assumes you have completed the Document AI setup steps listed in the Document AI OCR Codelab.
Please complete the following steps before proceeding:
You will also need to install Pandas, an Open Source Data Analysis library for Python.
pip3 install --upgrade pandas
3. Create a Form Parser processor
You must first create a Form Parser processor instance to use in the Document AI Platform for this tutorial.
- In the console, navigate to the Document AI Platform Overview
- Click Create Processor and select Form Parser
- Specify a processor name and select your region from the list.
- Click Create to create your processor
- Copy your processor ID. You must use this in your code later.
Test processor in the Cloud Console
You can test out your processor in the console by uploading a document. Click Upload Document and select a form to parse. You can download and use this sample form if you do not have one available to use.
Your output should look this:
4. Download the Sample Form
We have a sample document which contains a simple medical intake form.
You can download the PDF using the following link. Then upload it to the Cloud Shell instance.
Alternatively, you can download it from our public Google Cloud Storage Bucket using gsutil
.
gsutil cp gs://cloud-samples-data/documentai/codelabs/form-parser/intake-form.pdf .
Confirm the file is downloaded to your Cloud Shell using the below command:
ls -ltr intake-form.pdf
5. Extract Form Key/Value Pairs
In this step you will use the online processing API to call the form parser processor you created previously. Then, you will extract the key value pairs found in the document.
Online processing is for sending a single document and waiting for the response. You can also use batch processing if you want to send multiple files or if the file size exceeds the online processing maximum pages. You can review how to do this in the OCR Codelab.
The code for making a process request is identical for every processor type aside from the Processor ID.
The Document response object contains a list of pages from the input document.
Each page
object contains a list of form fields and their locations in the text.
The following code iterates through each page and extracts each key, value and confidence score. This is structured data that can more easily stored in databases or used in other applications.
Create a file called form_parser.py
and use the code below.
form_parser.py
import pandas as pd
from google.cloud import documentai_v1 as documentai
def online_process(
project_id: str,
location: str,
processor_id: str,
file_path: str,
mime_type: str,
) -> documentai.Document:
"""
Processes a document using the Document AI Online Processing API.
"""
opts = {"api_endpoint": f"{location}-documentai.googleapis.com"}
# Instantiates a client
documentai_client = documentai.DocumentProcessorServiceClient(client_options=opts)
# The full resource name of the processor, e.g.:
# projects/project-id/locations/location/processor/processor-id
# You must create new processors in the Cloud Console first
resource_name = documentai_client.processor_path(project_id, location, processor_id)
# Read the file into memory
with open(file_path, "rb") as image:
image_content = image.read()
# Load Binary Data into Document AI RawDocument Object
raw_document = documentai.RawDocument(
content=image_content, mime_type=mime_type
)
# Configure the process request
request = documentai.ProcessRequest(
name=resource_name, raw_document=raw_document
)
# Use the Document AI client to process the sample form
result = documentai_client.process_document(request=request)
return result.document
def trim_text(text: str):
"""
Remove extra space characters from text (blank, newline, tab, etc.)
"""
return text.strip().replace("\n", " ")
PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_PROJECT_LOCATION" # Format is 'us' or 'eu'
PROCESSOR_ID = "FORM_PARSER_ID" # Create processor in Cloud Console
# The local file in your current working directory
FILE_PATH = "intake-form.pdf"
# Refer to https://cloud.google.com/document-ai/docs/processors-list
# for supported file types
MIME_TYPE = "application/pdf"
document = online_process(
project_id=PROJECT_ID,
location=LOCATION,
processor_id=PROCESSOR_ID,
file_path=FILE_PATH,
mime_type=MIME_TYPE,
)
names = []
name_confidence = []
values = []
value_confidence = []
for page in document.pages:
for field in page.form_fields:
# Get the extracted field names
names.append(trim_text(field.field_name.text_anchor.content))
# Confidence - How "sure" the Model is that the text is correct
name_confidence.append(field.field_name.confidence)
values.append(trim_text(field.field_value.text_anchor.content))
value_confidence.append(field.field_value.confidence)
# Create a Pandas Dataframe to print the values in tabular format.
df = pd.DataFrame(
{
"Field Name": names,
"Field Name Confidence": name_confidence,
"Field Value": values,
"Field Value Confidence": value_confidence,
}
)
print(df)
Run your code now and you should see the text extracted and printed in your console.
You should see the following output if using our sample document:
$ python3 form_parser.py Field Name Field Name Confidence Field Value Field Value Confidence 0 Phone #: 0.999982 (906) 917-3486 0.999982 1 Emergency Contact: 0.999972 Eva Walker 0.999972 2 Marital Status: 0.999951 Single 0.999951 3 Gender: 0.999933 F 0.999933 4 Occupation: 0.999914 Software Engineer 0.999914 5 Referred By: 0.999862 None 0.999862 6 Date: 0.999858 9/14/19 0.999858 7 DOB: 0.999716 09/04/1986 0.999716 8 Address: 0.999147 24 Barney Lane 0.999147 9 City: 0.997718 Towaco 0.997718 10 Name: 0.997345 Sally Walker 0.997345 11 State: 0.996944 NJ 0.996944 ...
6. Parse Tables
The Form Parser is also able to extract data from tables within documents. In this step, we will download a new sample document and extract data from the table. Since we are loading the data into Pandas, this data can be output to a CSV file and many other formats with a single method call.
Download the Sample Form with Tables
We have a sample document which contains a sample form and a table.
You can download the PDF using the following link. Then upload it to the Cloud Shell instance.
Alternatively, you can download it from our public Google Cloud Storage Bucket using gsutil
.
gsutil cp gs://cloud-samples-data/documentai/codelabs/form-parser/form_with_tables.pdf .
Confirm the file is downloaded to your Cloud Shell using the below command:
ls -ltr form_with_tables.pdf
Extract Table Data
The processing request for table data is exactly the same as for extracting key-value pairs. The difference is which fields we extract the data from in the response. Table data is stored in the pages[].tables[]
field.
This example extracts information about from the table header rows and body rows for each table and page, then prints out the table and saves the table as a CSV file.
Create a file called table_parsing.py
and use the code below.
table_parsing.py
# type: ignore[1]
"""
Uses Document AI online processing to call a form parser processor
Extracts the tables and data in the document.
"""
from os.path import splitext
from typing import List, Sequence
import pandas as pd
from google.cloud import documentai
def online_process(
project_id: str,
location: str,
processor_id: str,
file_path: str,
mime_type: str,
) -> documentai.Document:
"""
Processes a document using the Document AI Online Processing API.
"""
opts = {"api_endpoint": f"{location}-documentai.googleapis.com"}
# Instantiates a client
documentai_client = documentai.DocumentProcessorServiceClient(client_options=opts)
# The full resource name of the processor, e.g.:
# projects/project-id/locations/location/processor/processor-id
# You must create new processors in the Cloud Console first
resource_name = documentai_client.processor_path(project_id, location, processor_id)
# Read the file into memory
with open(file_path, "rb") as image:
image_content = image.read()
# Load Binary Data into Document AI RawDocument Object
raw_document = documentai.RawDocument(
content=image_content, mime_type=mime_type
)
# Configure the process request
request = documentai.ProcessRequest(
name=resource_name, raw_document=raw_document
)
# Use the Document AI client to process the sample form
result = documentai_client.process_document(request=request)
return result.document
def get_table_data(
rows: Sequence[documentai.Document.Page.Table.TableRow], text: str
) -> List[List[str]]:
"""
Get Text data from table rows
"""
all_values: List[List[str]] = []
for row in rows:
current_row_values: List[str] = []
for cell in row.cells:
current_row_values.append(
text_anchor_to_text(cell.layout.text_anchor, text)
)
all_values.append(current_row_values)
return all_values
def text_anchor_to_text(text_anchor: documentai.Document.TextAnchor, text: str) -> str:
"""
Document AI identifies table data by their offsets in the entirety of the
document's text. This function converts offsets to a string.
"""
response = ""
# If a text segment spans several lines, it will
# be stored in different text segments.
for segment in text_anchor.text_segments:
start_index = int(segment.start_index)
end_index = int(segment.end_index)
response += text[start_index:end_index]
return response.strip().replace("\n", " ")
PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_PROJECT_LOCATION" # Format is 'us' or 'eu'
PROCESSOR_ID = "FORM_PARSER_ID" # Create processor before running sample
# The local file in your current working directory
FILE_PATH = "form_with_tables.pdf"
# Refer to https://cloud.google.com/document-ai/docs/file-types
# for supported file types
MIME_TYPE = "application/pdf"
document = online_process(
project_id=PROJECT_ID,
location=LOCATION,
processor_id=PROCESSOR_ID,
file_path=FILE_PATH,
mime_type=MIME_TYPE,
)
header_row_values: List[List[str]] = []
body_row_values: List[List[str]] = []
# Input Filename without extension
output_file_prefix = splitext(FILE_PATH)[0]
for page in document.pages:
for index, table in enumerate(page.tables):
header_row_values = get_table_data(table.header_rows, document.text)
body_row_values = get_table_data(table.body_rows, document.text)
# Create a Pandas Dataframe to print the values in tabular format.
df = pd.DataFrame(
data=body_row_values,
columns=pd.MultiIndex.from_arrays(header_row_values),
)
print(f"Page {page.page_number} - Table {index}")
print(df)
# Save each table as a CSV file
output_filename = f"{output_file_prefix}_pg{page.page_number}_tb{index}.csv"
df.to_csv(output_filename, index=False)
Run your code now and you should see the text extracted and printed in your console.
You should see the following output if using our sample document:
$ python3 table_parsing.py Page 1 - Table 0 Item Description 0 Item 1 Description 1 1 Item 2 Description 2 2 Item 3 Description 3 Page 1 - Table 1 Form Number: 12345678 0 Form Date: 2020/10/01 1 Name: First Last 2 Address: 123 Fake St
You should also have two new CSV files in the directory you are running the code from.
$ ls form_with_tables_pg1_tb0.csv form_with_tables_pg1_tb1.csv table_parsing.py
7. Congratulations
Congratulations, you've successfully used the Document AI API to extract data from a handwritten form. We encourage you to experiment with other form documents.
Clean Up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:
- In the Cloud Console, go to the Manage resources page.
- In the project list, select your project then click Delete.
- In the dialog, type the project ID and then click Shut down to delete the project.
Learn More
Continue learning about Document AI with these follow-up Codelabs.
- Specialized Processors with Document AI (Python)
- Managing Document AI processors with Python
- Document AI: Human in the Loop
Resources
License
This work is licensed under a Creative Commons Attribution 2.0 Generic License.