1. Overview
What is Document AI?
Document AI is a platform that lets you extract insights from your documents. At heart, it offers a growing list of document processors (also called parsers or splitters, depending on their functionality).
There are two ways you can manage Document AI processors:
- manually, from the web console;
- programmatically, using the Document AI API.
Here is an example screenshot showing your processor list, both from the web console and from Python code:
In this lab, you will focus on managing Document AI processors programmatically with the Python client library.
What you'll see
- How to enable Document AI
- How to set up your project
- How to set up Python
- How to list processor types
- How to create processors
- How to list processors
- How to analyze a document
- How to enable/disable processors
- How to delete processors
What you'll need
2. Setup and requirements
Self-paced environment setup
- Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.
- The Project name is the display name for this project's participants. It is a character string not used by Google APIs, and you can update it at any time.
- The Project ID must be unique across all Google Cloud projects and is immutable (it cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference the Project ID (typically identified as PROJECT_ID). If you don't like the generated ID, you can generate another random one, or try your own and see if it's available. It is then "frozen" once the project is created.
- There is a third value, a Project Number, which some APIs use. Learn more about all three of these values in the documentation.
- Next, you'll need to enable billing in the Cloud Console in order to use Cloud resources/APIs. Running through this codelab shouldn't cost much, if anything at all. To shut down resources so you don't incur billing beyond this tutorial, follow any "clean-up" instructions found at the end of the codelab. New users of Google Cloud are eligible for the $300 USD Free Trial program.
Start Cloud Shell
While Google Cloud can be operated remotely from your laptop, in this lab you will be using Cloud Shell, a command line environment running in the Cloud.
Activate Cloud Shell
- From the Cloud Console, click Activate Cloud Shell.
If you've never started Cloud Shell before, you're presented with an intermediate screen (below the fold) describing what it is. If that's the case, click Continue (and you won't ever see it again). Here's what that one-time screen looks like:
It should only take a few moments to provision and connect to Cloud Shell.
This virtual machine is loaded with all the development tools you need. It offers a persistent 5GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with simply a browser or your Chromebook.
Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your project ID.
- Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list
Command output
Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`
- Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
gcloud config list project
Command output
[core]
project = <PROJECT_ID>
If it is not, you can set it with this command:
gcloud config set project <PROJECT_ID>
Command output
Updated property [core/project].
3. Enabling Document AI
Before you can begin using Document AI, run the following command in Cloud Shell to enable the Document AI API:
gcloud services enable documentai.googleapis.com
You should see something like this:
Operation "operations/..." finished successfully.
Now, you can use Document AI!
4. Project setup
First, create and navigate to your working directory:
mkdir ~/documentai-processors
cd ~/documentai-processors
Then, set the PROJECT_ID and GOOGLE_APPLICATION_CREDENTIALS environment variables:
export PROJECT_ID=$(gcloud config get-value core/project)
export GOOGLE_APPLICATION_CREDENTIALS=~/documentai-processors/key.json
For your code to make authorized API calls from your current shell session, a best practice is to use a service account. A service account is a special kind of account, owned by your project, and identified by its email address (like any other user account).
Create a service account:
SERVICE_ACCOUNT_NAME="documentai-processors-codelab"
SERVICE_ACCOUNT="$SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com"
gcloud iam service-accounts create $SERVICE_ACCOUNT_NAME
Grant it the "Document AI Editor" role (so your code can manage processors):
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member "serviceAccount:$SERVICE_ACCOUNT" \
  --role "roles/documentai.editor"
Then, create and download the service account key (a JSON file):
gcloud iam service-accounts keys create $GOOGLE_APPLICATION_CREDENTIALS \
  --iam-account $SERVICE_ACCOUNT
Verify that the GOOGLE_APPLICATION_CREDENTIALS environment variable (used by the Document AI client library) points to the full path of the JSON file:
cat $GOOGLE_APPLICATION_CREDENTIALS
You should see the content of the service account key:
{ "type": "service_account", ... }
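If you would rather check the key file from Python than read it by eye, here is a minimal, optional sketch. It only assumes what the output above shows: that the file is JSON with a "type" field set to "service_account". The helper name looks_like_service_account_key is made up for this example and is not part of any Google library.

```python
import json
import os
from pathlib import Path


def looks_like_service_account_key(path: str) -> bool:
    """Return True if `path` is a JSON file whose "type" is "service_account".

    Hypothetical helper for illustration; it does not validate the key itself.
    """
    key_file = Path(path).expanduser()
    if not key_file.is_file():
        return False
    try:
        data = json.loads(key_file.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Unreadable file or invalid JSON
        return False
    return isinstance(data, dict) and data.get("type") == "service_account"


if __name__ == "__main__":
    key_path = os.getenv("GOOGLE_APPLICATION_CREDENTIALS", "")
    print(looks_like_service_account_key(key_path))
```

This is only a sanity check on the file's shape; it says nothing about whether the key is still valid or has the right role.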
Your project and credentials are configured! Next, set up your Python environment...
5. Python setup
Create a Python virtual environment to isolate dependencies:
virtualenv venv
Activate the virtual environment:
source venv/bin/activate
Install IPython, the Document AI client library, and python-tabulate (which you'll use to pretty-print the request results):
pip install ipython google-cloud-documentai tabulate
You should see something like this:
...
Installing collected packages: ..., tabulate, ipython, google-cloud-documentai
Successfully installed ... cloud-documentai-1.0.0 ...
Now, you're ready to use the Document AI client library!
In the next steps, you'll use an interactive Python interpreter called IPython, which you installed in the previous step. Start a session by running ipython in Cloud Shell:
ipython
You should see something like this:
Python 3.7.3 (default, Jan 22 2021, 20:04:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.27.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
Copy the following code into your IPython session:
import os
from typing import Iterator, Optional, Sequence, Tuple

import google.cloud.documentai_v1beta3 as documentai
from tabulate import tabulate

PROJECT_ID = os.getenv("PROJECT_ID", "")
API_LOCATION = "eu"  # Choose "us" or "eu"

assert PROJECT_ID, "PROJECT_ID is undefined"
assert API_LOCATION in ("us", "eu"), "API_LOCATION is incorrect"

# Test processors
document_ocr_display_name = "document-ocr"
form_parser_display_name = "form-parser"
test_processor_display_names_and_types = (
    (document_ocr_display_name, "OCR_PROCESSOR"),
    (form_parser_display_name, "FORM_PARSER_PROCESSOR"),
)

def get_client() -> documentai.DocumentProcessorServiceClient:
    client_options = dict(api_endpoint=f"{API_LOCATION}-documentai.googleapis.com")
    return documentai.DocumentProcessorServiceClient(client_options=client_options)

def get_parent() -> str:
    return f"projects/{PROJECT_ID}/locations/{API_LOCATION}"
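To see what get_client and get_parent actually send, here is a small standalone sketch of the two strings they build. It mirrors the f-strings above but takes the project and location as parameters; the function names api_endpoint and parent_path, and the project ID "my-project", are placeholders chosen for this example.

```python
def api_endpoint(location: str) -> str:
    """Build the regional endpoint, following the pattern used in get_client."""
    if location not in ("us", "eu"):
        raise ValueError(f"Unsupported location in this lab: {location!r}")
    return f"{location}-documentai.googleapis.com"


def parent_path(project_id: str, location: str) -> str:
    """Build the parent resource path, following the pattern used in get_parent."""
    return f"projects/{project_id}/locations/{location}"


# "my-project" is a placeholder project ID
print(api_endpoint("eu"))
print(parent_path("my-project", "eu"))
```

The endpoint location and the location in the parent path come from the same API_LOCATION value in the lab code, so they always match.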
You're ready to make your first request and list the processor types...
6. Listing processor types
Before creating a processor, you need to choose its type among the growing list of processor types. You can retrieve this list with fetch_processor_types.
Add the following functions into your IPython session:
def fetch_processor_types() -> Sequence[documentai.ProcessorType]:
    client = get_client()
    response = client.fetch_processor_types(parent=get_parent())
    return response.processor_types

def print_processor_types(processor_types: Sequence[documentai.ProcessorType]):
    def sort_key(pt):
        return (not pt.allow_creation, pt.category, pt.type_)
    sorted_processor_types = sorted(processor_types, key=sort_key)
    data = processor_type_tabular_data(sorted_processor_types)
    headers = next(data)
    colalign = next(data)
    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processor types: {len(sorted_processor_types)}")

def processor_type_tabular_data(
    processor_types: Sequence[documentai.ProcessorType],
) -> Iterator[Tuple[str, str, str, str]]:
    def locations(pt):
        return ", ".join(sorted(loc.location_id for loc in pt.available_locations))
    yield ("type", "category", "allow_creation", "locations")
    yield ("left", "left", "left", "left")
    if not processor_types:
        yield ("-", "-", "-", "-")
        return
    for pt in processor_types:
        yield (pt.type_, pt.category, f"{pt.allow_creation}", locations(pt))
List the processor types:
processor_types = fetch_processor_types()
print_processor_types(processor_types)
You should see something like this:
+--------------------------+-------------+----------------+-----------+
| type                     | category    | allow_creation | locations |
+--------------------------+-------------+----------------+-----------+
| FORM_PARSER_PROCESSOR    | GENERAL     | True           | eu, us    |
| OCR_PROCESSOR            | GENERAL     | True           | eu, us    |
...
| EXPENSE_PROCESSOR        | SPECIALIZED | False          | eu, us    |
...
| FR_NATIONAL_ID_PROCESSOR | SPECIALIZED | False          | eu, us    |
| INVOICE_PROCESSOR        | SPECIALIZED | False          | eu, us    |
...
| US_PASSPORT_PROCESSOR    | SPECIALIZED | False          | eu, us    |
+--------------------------+-------------+----------------+-----------+
→ Processor types: 28
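The row order in this table comes from the sort key in print_processor_types: creatable types first, then by category, then by type name. The sketch below reproduces that ordering without calling the API. FakeProcessorType is a stand-in dataclass invented for this example, and the sample rows are taken from the table above.

```python
from dataclasses import dataclass


@dataclass
class FakeProcessorType:
    """Stand-in for documentai.ProcessorType with only the fields used for sorting."""
    type_: str
    category: str
    allow_creation: bool


def sort_key(pt: FakeProcessorType):
    # False sorts before True, so `not allow_creation` puts creatable types first
    return (not pt.allow_creation, pt.category, pt.type_)


sample = [
    FakeProcessorType("INVOICE_PROCESSOR", "SPECIALIZED", False),
    FakeProcessorType("OCR_PROCESSOR", "GENERAL", True),
    FakeProcessorType("EXPENSE_PROCESSOR", "SPECIALIZED", False),
    FakeProcessorType("FORM_PARSER_PROCESSOR", "GENERAL", True),
]

ordered = [pt.type_ for pt in sorted(sample, key=sort_key)]
print(ordered)
```

Sorting on a tuple compares the first element, then the second, and so on, which is why a single key function is enough for this three-level ordering.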
Now, you have all the info needed to create processors...
7. Creating processors
To create a processor, call create_processor with a display name and a processor type.
Add the following function:
def create_processor(display_name: str, type: str) -> documentai.Processor:
    client = get_client()
    processor_info = documentai.Processor(display_name=display_name, type_=type)
    return client.create_processor(parent=get_parent(), processor=processor_info)
Create the test processors:
for display_name, type in test_processor_display_names_and_types:
    print(f"Creating {display_name} ({type})...")
    create_processor(display_name, type)
print("Done")
You should get the following:
Creating document-ocr (OCR_PROCESSOR)...
Creating form-parser (FORM_PARSER_PROCESSOR)...
Done
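If you expect to rerun the creation loop, you may prefer to skip display names that already exist in your project. The sketch below shows one way to compute the names still to create; it works on plain strings, so where the existing names come from (the web console, or an API listing) is up to you. The helper name missing_display_names is made up for this example.

```python
from typing import Iterable, List


def missing_display_names(wanted: Iterable[str], existing: Iterable[str]) -> List[str]:
    """Return the wanted display names that are not in `existing`, keeping order."""
    existing_set = set(existing)
    return [name for name in wanted if name not in existing_set]


wanted = ["document-ocr", "form-parser"]
already_there = ["form-parser"]  # sample data for illustration
print(missing_display_names(wanted, already_there))
```

You could then loop over the returned names only, instead of over every test processor.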
You have created new processors!
Next, see how to list the processors...
8. Listing processors
list_processors returns the list of all the processors belonging to your project.
Add the following functions:
def list_processors() -> Sequence[documentai.Processor]:
    client = get_client()
    response = client.list_processors(parent=get_parent())
    return response.processors

def print_processors(processors: Optional[Sequence[documentai.Processor]] = None):
    def sort_key(processor):
        return processor.display_name
    if processors is None:
        processors = list_processors()
    sorted_processors = sorted(processors, key=sort_key)
    data = processor_tabular_data(sorted_processors)
    headers = next(data)
    colalign = next(data)
    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Processors: {len(sorted_processors)}")

def processor_tabular_data(
    processors: Sequence[documentai.Processor],
) -> Iterator[Tuple[str, str, str]]:
    yield ("display_name", "type", "state")
    yield ("left", "left", "left")
    if not processors:
        yield ("-", "-", "-")
        return
    for processor in processors:
        yield (processor.display_name, processor.type_, processor.state.name)
Call the functions:
processors = list_processors()
print_processors(processors)
You should get the following:
+--------------+-----------------------+---------+
| display_name | type                  | state   |
+--------------+-----------------------+---------+
| document-ocr | OCR_PROCESSOR         | ENABLED |
| form-parser  | FORM_PARSER_PROCESSOR | ENABLED |
+--------------+-----------------------+---------+
→ Processors: 2
To retrieve a processor by its display name, add the following function:
def get_processor(
    display_name: str, processors: Optional[Sequence[documentai.Processor]] = None
) -> Optional[documentai.Processor]:
    if processors is None:
        processors = list_processors()
    for processor in processors:
        if processor.display_name == display_name:
            return processor
    return None
Test the function:
processor = get_processor(document_ocr_display_name, processors)
assert processor is not None
print(processor)
You should see something like this:
name: "projects/PROJECT_NUM/locations/LOCATION/processors/PROCESSOR_ID"
type_: "OCR_PROCESSOR"
display_name: "document-ocr"
state: ENABLED
...
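The name field is the processor's full resource path, in the projects/.../locations/.../processors/... shape shown above. If you ever need the individual parts (for logging, say), a plain string split is enough. This is a sketch based only on that printed format; parse_processor_name and ProcessorName are names invented for this example, not part of the client library.

```python
from typing import NamedTuple


class ProcessorName(NamedTuple):
    project: str
    location: str
    processor_id: str


def parse_processor_name(name: str) -> ProcessorName:
    """Split 'projects/P/locations/L/processors/ID' into its three parts."""
    parts = name.split("/")
    expected_labels = ("projects", "locations", "processors")
    if len(parts) != 6 or tuple(parts[0::2]) != expected_labels or not all(parts):
        raise ValueError(f"Unexpected processor name: {name!r}")
    return ProcessorName(parts[1], parts[3], parts[5])


# Placeholder values, following the format printed above
parsed = parse_processor_name("projects/123456/locations/eu/processors/abcdef")
print(parsed.processor_id)
```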
Now, you know how to list your processors and retrieve one of them from its display name. Next, see how to use a processor...
9. Analyzing a document
Documents can be processed in two ways:
- synchronously with process_document, to analyze a single document and directly use the results;
- asynchronously with batch_process_documents, to launch batch processing on multiple or large documents.
Your test document (PDF) is the scan of a questionnaire filled out with handwritten answers. Download it into your working directory, from your IPython session:
!gsutil cp gs://cloud-samples-data/documentai/form.pdf .
Check the content of your working directory:
!ls
You should have the following:
form.pdf key.json venv
As you have a single local file to analyze, you can use the synchronous process_document method. Add the following function:
def process_file(
    processor: documentai.Processor, file_path: str, mime_type: str
) -> documentai.Document:
    client = get_client()
    with open(file_path, "rb") as document_file:
        document_content = document_file.read()
    document = documentai.RawDocument(content=document_content, mime_type=mime_type)
    request = documentai.ProcessRequest(raw_document=document, name=processor.name)
    response = client.process_document(request)
    return response.document
As your document is a questionnaire, choose the form parser. This general processor detects form fields, in addition to extracting the text (printed and handwritten), which all processors do.
Analyze the document:
processor = get_processor(form_parser_display_name)
assert processor is not None
file_path = "./form.pdf"
mime_type = "application/pdf"
document = process_file(processor, file_path, mime_type)
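The MIME type above is hard-coded because there is a single known PDF. If you later process files of different kinds, you could derive the MIME type from the file extension with Python's standard mimetypes module, as in this minimal sketch. The helper guess_mime_type and its fallback value are choices made for this example; whether a given MIME type is accepted by a processor is not checked here.

```python
import mimetypes


def guess_mime_type(file_path: str, default: str = "application/octet-stream") -> str:
    """Guess a MIME type from the file extension, falling back to `default`."""
    mime_type, _encoding = mimetypes.guess_type(file_path)
    return mime_type or default


print(guess_mime_type("./form.pdf"))
print(guess_mime_type("scan.png"))
```

The guess is based on the file name only, not on the file's content, so a misnamed file gets the wrong MIME type.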
All processors run an Optical Character Recognition (OCR) first pass on the whole document. Check out the text detected by the OCR pass:
document.text.split("\n")
You should see something like the following:
['FakeDoc M.D.',
'HEALTH INTAKE FORM',
'Please fill out the questionnaire carefully. The information you provide will be used to complete',
'your health profile and will be kept confidential.',
'Date:',
'Sally',
'Walker',
'Name:',
'9/14/19',
'DOB: 09/04/1986',
'Address: 24 Barney Lane City: Towaco State: NJ Zip: 07082',
'Email: Sally, waller@cmail.com Phone #: (906) 917-3486',
'Gender: F',
'Single Occupation: Software Engineer',
'Referred By: None',
'Emergency Contact: Eva Walker Emergency Contact Phone: (906) 334-8976',
'Marital Status:',
...
]
Add the following functions to print out the detected form fields:
def print_form_fields(document: documentai.Document):
    sorted_form_fields = form_fields_sorted_by_ocr_order(document)
    data = form_field_tabular_data(sorted_form_fields, document)
    headers = next(data)
    colalign = next(data)
    print(tabulate(data, headers, tablefmt="pretty", colalign=colalign))
    print(f"→ Form fields: {len(sorted_form_fields)}")

def form_field_tabular_data(
    form_fields: Sequence[documentai.Document.Page.FormField],
    document: documentai.Document,
) -> Iterator[Tuple[str, str, str]]:
    yield ("name", "value", "confidence")
    yield ("right", "left", "right")
    if not form_fields:
        yield ("-", "-", "-")
        return
    for form_field in form_fields:
        name = text_from_anchor(form_field.field_name.text_anchor, document)
        value = text_from_anchor(form_field.field_value.text_anchor, document)
        confidence = form_field.field_value.confidence
        yield (name, value, f"{confidence:.1%}")
Also add these utility functions:
def form_fields_sorted_by_ocr_order(
    document: documentai.Document,
) -> Sequence[documentai.Document.Page.FormField]:
    def sort_key(form_field):
        # Sort according to the field name detected position
        text_anchor = form_field.field_name.text_anchor
        return text_anchor.text_segments[0].start_index if text_anchor else 0
    form_fields = (
        form_field for page in document.pages for form_field in page.form_fields
    )
    return sorted(form_fields, key=sort_key)

def text_from_anchor(
    text_anchor: documentai.Document.TextAnchor, document: documentai.Document
) -> str:
    text = "".join(
        document.text[segment.start_index : segment.end_index]
        for segment in text_anchor.text_segments
    )
    return text[:-1] if text.endswith("\n") else text
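To make the text anchor mechanism concrete, here is a standalone sketch of the same slicing logic as text_from_anchor, run on a tiny hand-made text. Segment is a stand-in dataclass invented for this example (the real segments come from the Document AI response), and the sample text and indices are made up to resemble the form's first fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Stand-in for a text segment: a [start_index, end_index) slice of the text."""
    start_index: int
    end_index: int


def text_from_segments(text: str, segments: List[Segment]) -> str:
    """Join the slices of `text` designated by `segments`, dropping one trailing newline."""
    joined = "".join(text[s.start_index : s.end_index] for s in segments)
    return joined[:-1] if joined.endswith("\n") else joined


full_text = "Name:\nSally\nWalker\nDOB: 09/04/1986\n"
field_name = text_from_segments(full_text, [Segment(0, 6)])
field_value = text_from_segments(full_text, [Segment(6, 12), Segment(12, 19)])
print(repr(field_name), repr(field_value))
```

A value can span several segments, which is how a two-line handwritten answer ends up as a single field value containing a newline.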
Print the detected form fields:
print_form_fields(document)
You should get a printout like the following:
+--------------+-------------------------+------------+
|         name | value                   | confidence |
+--------------+-------------------------+------------+
|        Date: | 9/14/19                 |     100.0% |
|        Name: | Sally                   |      99.7% |
|              | Walker                  |            |
|         DOB: | 09/04/1986              |     100.0% |
|     Address: | 24 Barney Lane          |      99.9% |
|        City: | Towaco                  |      99.8% |
|       State: | NJ                      |      99.7% |
|         Zip: | 07082                   |      99.5% |
|       Email: | Sally, waller@cmail.com |      99.6% |
|     Phone #: | (906) 917-3486          |     100.0% |
|      Gender: | F                       |     100.0% |
|  Occupation: | Software Engineer       |     100.0% |
| Referred By: | None                    |     100.0% |
...
+--------------+-------------------------+------------+
→ Form fields: 17
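Each field comes with a confidence score, so one natural follow-up is to flag fields that fall below a threshold for manual review. The sketch below works on plain (name, value, confidence) tuples rather than on the API objects. The 0.9 threshold, the helper name low_confidence_fields, and the low score in the sample data are all assumptions made for this illustration.

```python
from typing import List, Tuple

Field = Tuple[str, str, float]  # (name, value, confidence between 0.0 and 1.0)


def low_confidence_fields(fields: List[Field], threshold: float = 0.9) -> List[Field]:
    """Return the fields whose confidence is strictly below `threshold`."""
    return [field for field in fields if field[2] < threshold]


sample_fields: List[Field] = [
    ("Date:", "9/14/19", 1.0),
    ("Name:", "Sally\nWalker", 0.997),
    ("Zip:", "07082", 0.62),  # hypothetical low score, for illustration only
]
to_review = low_confidence_fields(sample_fields)
print(to_review)
```

The right threshold depends on your documents and on how costly a wrong value is, so treat 0.9 as a starting point rather than a recommendation.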
Check out how the field names and values have been detected (PDF). Here is the top half of the questionnaire:
You have analyzed a form (containing both printed and handwritten text) and detected its fields with high confidence: your pixels have been transformed into structured data!
10. Enabling/disabling processors
With disable_processor and enable_processor, you can control whether a processor can be used.
Add the following functions:
def update_processor_state(processor: documentai.Processor, enable_processor: bool):
    client = get_client()
    if enable_processor:
        request = documentai.EnableProcessorRequest(name=processor.name)
        operation = client.enable_processor(request)
    else:
        request = documentai.DisableProcessorRequest(name=processor.name)
        operation = client.disable_processor(request)
    operation.result()

def enable_processor(processor: documentai.Processor):
    update_processor_state(processor, True)

def disable_processor(processor: documentai.Processor):
    update_processor_state(processor, False)
Disable the form parser processor and check the state of your processors:
processor = get_processor(form_parser_display_name)
assert processor is not None
disable_processor(processor)
print_processors()
You should get the following:
+--------------+-----------------------+----------+
| display_name | type                  | state    |
+--------------+-----------------------+----------+
| document-ocr | OCR_PROCESSOR         | ENABLED  |
| form-parser  | FORM_PARSER_PROCESSOR | DISABLED |
+--------------+-----------------------+----------+
→ Processors: 2
Re-enable the form parser processor:
enable_processor(processor)
print_processors()
You should get the following:
+--------------+-----------------------+---------+
| display_name | type                  | state   |
+--------------+-----------------------+---------+
| document-ocr | OCR_PROCESSOR         | ENABLED |
| form-parser  | FORM_PARSER_PROCESSOR | ENABLED |
+--------------+-----------------------+---------+
→ Processors: 2
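Because each state change waits on operation.result(), you may want to skip the call when a processor is already in the state you are asking for. Here is a minimal sketch of that check, working on the state name as a plain string (the value printed in the state column). The helper needs_state_change is made up for this example, and it only reasons about the two states shown in this lab, ENABLED and DISABLED.

```python
def needs_state_change(current_state: str, enable: bool) -> bool:
    """Return True if the processor is not already in the requested state.

    Simplification: any state other than the target one is treated as needing a change.
    """
    target_state = "ENABLED" if enable else "DISABLED"
    return current_state != target_state


print(needs_state_change("ENABLED", enable=True))
print(needs_state_change("ENABLED", enable=False))
```

In the lab code, the string to pass would be processor.state.name, the same value used to fill the state column.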
And next, the ultimate processor management method (deletion)...
11. Deleting processors
Finally, check out how to use the delete_processor method.
Add the following function:
def delete_processor(processor: documentai.Processor):
    client = get_client()
    operation = client.delete_processor(name=processor.name)
    operation.result()
Delete your test processors:
processors_to_delete = [dn for dn, _ in test_processor_display_names_and_types]
print(f"Processors to delete: {len(processors_to_delete)}")
for processor in list_processors():
    if processor.display_name not in processors_to_delete:
        continue
    print(f"  Deleting {processor.display_name}...")
    delete_processor(processor)
print()
print_processors()
You should get the following:
Processors to delete: 2
  Deleting form-parser...
  Deleting document-ocr...

+--------------+------+-------+
| display_name | type | state |
+--------------+------+-------+
| -            | -    | -     |
+--------------+------+-------+
→ Processors: 0
You've covered all the processor management methods! You're almost done...
12. Congratulations!
You learned how to manage Document AI processors using Python!
Clean up
To clean up your development environment, from Cloud Shell:
- If you're still in your IPython session, enter the exit command to go back to the shell.
- Stop using the Python virtual environment with the deactivate command.
- Delete your working directory:
cd ~ ; rm -rf ~/documentai-processors/
To delete your Google Cloud project, from Cloud Shell:
- Retrieve your current project ID:
PROJECT_ID=$(gcloud config get-value core/project)
- Make sure this is the project you wish to delete:
echo $PROJECT_ID
- Delete the project:
gcloud projects delete $PROJECT_ID
Learn more
- Try Document AI in your browser: https://cloud.google.com/document-ai/docs/drag-and-drop
- Document AI processor details: https://cloud.google.com/document-ai/docs/processors-list
- Python on Google Cloud: https://cloud.google.com/python
- Cloud Client Libraries for Python: https://googlecloudplatform.github.io/google-cloud-python
License
This work is licensed under a Creative Commons Attribution 2.0 Generic License.