Use Document AI to Intelligently Split Documents on Logical Boundaries. (Python)

What is Document AI?

The Document AI API is a document understanding solution that takes unstructured data, such as documents, emails, and so on, and makes the data easier to understand, analyze, and consume. The API provides structure through content classification, entity extraction, advanced searching, and more.

This tutorial focuses on using the Document AI API with Python. It demonstrates how to use the Document Splitter processor to parse a PDF file that contains several scanned documents and to detect the logical page boundaries between them.

What you'll learn

  • How to enable the Document AI API
  • How to authenticate API requests
  • How to install the client library for Python
  • How to parse data from a multipage document and detect logical page boundaries

What you'll need

  • A Google Cloud Project
  • A Browser, such as Chrome or Firefox
  • Knowledge of Python 3

Self-paced environment setup

  1. Sign in to Cloud Console and create a new project or reuse an existing one. (If you don't already have a Gmail or G Suite account, you must create one.)

Remember the project ID, a unique name across all Google Cloud projects. You must provide this ID later on as PROJECT_ID.

  2. Next, you must enable billing in Cloud Console in order to use Google Cloud resources.

Be sure to follow any instructions in the "Clean Up" section. That section explains how to shut down resources so you don't incur billing beyond this tutorial. New users of Google Cloud are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While you can operate Google Cloud remotely from your laptop, this codelab uses Google Cloud Shell, a command-line environment running in the cloud.

Activate Cloud Shell

  1. From the Cloud Console, click Activate Cloud Shell.

If you've never started Cloud Shell before, you are presented with an intermediate screen (below the fold) describing what it is. If that's the case, click Continue (and you won't ever see it again).

It should only take a few moments to provision and connect to Cloud Shell.

Cloud Shell provides you with terminal access to a virtual machine hosted in the cloud. The virtual machine includes all the development tools that you'll need. It offers a persistent 5GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with simply a browser or your Chromebook.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your project ID.

  1. Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*      <my_account>@<my_domain.com>

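To set the active account, run:
    $ gcloud config set account `ACCOUNT`

  2. Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:

gcloud config list project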

Command output

[core]
project = <PROJECT_ID>

If the project is not set, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

Before you can begin using Document AI, you must enable the API. Open the Cloud Console in your browser.

  1. Click Navigation menu ☰ > APIs & Services > Library.
  2. Search for "Document AI API", then click Enable to use the API in your Google Cloud project.

You must first create an instance of the Document Splitter processor to use in the Document AI Platform for this tutorial.

  1. In the console, navigate to the Document AI Platform Overview
  2. Click Create Processor and select Document Splitter.
  3. Specify a processor name and select your region from the list.
  4. Click Create to create your processor
  5. Copy your processor ID. You must use this in your code later.
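
The processor ID is later combined with your project ID and region into a full resource name, and the region also selects the API host. The following is a minimal sketch of that string construction. The helper names build_processor_name and build_api_endpoint are hypothetical (they are not part of the client library); the two formats are the ones used by the sample code later in this tutorial.

```python
def build_processor_name(project_id: str, location: str, processor_id: str) -> str:
    """Build the full resource name of a processor (hypothetical helper)."""
    parts = (
        ("project_id", project_id),
        ("location", location),
        ("processor_id", processor_id),
    )
    for label, value in parts:
        # Reject empty values and stray slashes, which would corrupt the path
        if not value or "/" in value:
            raise ValueError(f"{label} must be a non-empty string without '/'")
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


def build_api_endpoint(location: str) -> str:
    """Regional API host, in the format used by this tutorial's sample code."""
    return f"{location}-documentai.googleapis.com"


# Placeholder values for illustration only
print(build_processor_name("my-project", "us", "abc123"))
print(build_api_endpoint("us"))
```

Validating the three values up front makes a mistyped ID fail locally instead of as an API error.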

To make requests to the Document AI API, you must use a service account. A service account belongs to your project and is used by the Google client library for Python to make API requests. Like any other user account, a service account is represented by an email address. In this section, you use the Cloud SDK to create a service account and then create the credentials that you need to authenticate as that service account.

First, set an environment variable with your PROJECT_ID which you will use throughout this codelab:

export GOOGLE_CLOUD_PROJECT=$(gcloud config get-value core/project)

Next, create a new service account to access the Document AI API by using:

gcloud iam service-accounts create my-docai-sa \
  --display-name "my-docai-service-account"

Next, create the credentials that your Python code uses to log in as your new service account. Create these credentials and save them as a JSON file at ~/key.json by using the following command:

gcloud iam service-accounts keys create ~/key.json \
  --iam-account  my-docai-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com

Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which the library uses to find your credentials. To read more about this form of authentication, see the guide. Set the environment variable to the full path of the credentials JSON file that you created. If you saved the key as ~/key.json in the previous step, that is:

export GOOGLE_APPLICATION_CREDENTIALS="$HOME/key.json"
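
Before calling the API, it can help to confirm that the variable points at a readable service account key. The sketch below uses only the standard library and makes no API calls. The helper name check_credentials_file is hypothetical, and the check assumes the usual layout of a service account key file, which carries a "type" of "service_account" and a "client_email" field.

```python
import json
import os
from typing import Optional


def check_credentials_file(path: Optional[str] = None) -> str:
    """Return the service account email stored in the key file.

    Hypothetical helper: it only inspects the local file and raises
    ValueError when the variable is unset or the file is not usable.
    """
    path = path or os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise ValueError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        raise ValueError(f"No credentials file at {path}")
    with open(path, "r", encoding="utf-8") as handle:
        info = json.load(handle)  # invalid JSON also raises a ValueError subclass
    if info.get("type") != "service_account":
        raise ValueError("Key file is not a service account key")
    return info.get("client_email", "")
```

Calling check_credentials_file() with no argument reads the path from the environment variable; the returned email should match the service account you created above.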

A sample multipage document is stored in Google Cloud Storage. Use the following command to download it to your working directory.

gsutil cp gs://cloud-samples-data/documentai/multi-document.pdf .

Confirm that the file was downloaded to your Cloud Shell by using the following command:

ls -ltr multi-document.pdf
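
The same check can be made from Python, which also catches a truncated or mislabeled download. This is a minimal sketch with a hypothetical helper name, looks_like_pdf; it relies only on the fact that PDF files begin with the bytes %PDF-.

```python
import os


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists, is non-empty, and starts with the PDF signature."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


# Prints True once the sample file is in the current working directory
print(looks_like_pdf("multi-document.pdf"))
```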

Install the client library:

pip3 install --upgrade google-cloud-documentai

You should see something like this:

...
Installing collected packages: google-cloud-documentai
Successfully installed google-cloud-documentai-0.3.0
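
If you want to confirm from Python that the library is importable in the interpreter you are about to use, the sketch below checks for a module without importing it. The helper name module_available is hypothetical; the module name passed at the end is the one imported by the sample code later in this tutorial.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be found by the import system."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # ImportError covers a missing parent package such as `google`
        return False


print(module_available("google.cloud.documentai_v1beta3"))
```

A result of False usually means that pip3 installed the package for a different Python interpreter than the one running the check.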

Now, you're ready to use the Document AI API!

Start Interactive Python

In this tutorial, you'll use an interactive Python interpreter called IPython. Start a session by running ipython in Cloud Shell. This command runs the Python interpreter in an interactive session.

ipython

You should see something like this:

Python 3.7.3 (default, Jul 25 2020, 13:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

In this step, you make a process document call using the synchronous endpoint. To process large numbers of documents at a time, you can also use the asynchronous API. To learn more about using the Document Splitter APIs, read the guide.

Copy the following code into your IPython session:

project_id = 'YOUR_PROJECT_ID'
location = 'YOUR_PROJECT_LOCATION'  # Format is 'us' or 'eu'
processor_id = 'YOUR_PROCESSOR_ID'  # Create processor in Cloud Console
file_path = 'multi-document.pdf'  # The local file in your current working directory

from google.cloud import documentai_v1beta3 as documentai

def process_document_sample(
    project_id: str, location: str, processor_id: str, file_path: str
):
    # Instantiates a client
    opts = {"api_endpoint": f"{location}-documentai.googleapis.com"}
    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Wrap the raw bytes and their MIME type in the document payload
    document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "document": document}

    # Send the request to the processor and wait for the response
    result = client.process_document(request=request)

    document = result.document

    print("Document processing complete.")

    # For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document

    document_pages = document.pages

    # Read the text recognition output from the processor
    text = document.text
    print("The document contains the following text (first 100 characters):")
    print(text[:100])

# Run document processing
process_document_sample(project_id, location, processor_id, file_path)

Run your code now and you should see the extracted text printed in your console. In the next steps, you extract structured data that suggests where to split the multipage file into separate documents.

Now you can extract the suggested page split locations for the multipage document and their corresponding confidence scores. The Document response object contains a list of entities detected from the input document. Each entity object contains a page_anchor field with a list of page reference fields suggesting the split locations in the text.

The following code iterates through each detected entity and prints its confidence score, the pages it covers, and a short text snippet.

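At the bottom of your process_document_sample() function, paste the first three lines of the code below. Then define print_pages_split() as a separate top-level function, before the line that calls process_document_sample():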

    # Read the detected page split from the processor
    print("\nThe processor detected the following page split entities:")
    print_pages_split(text, document)


def print_pages_split(text: str, document: documentai.Document):
    """
    Document AI identifies possible page splits
    in the document. This function converts the page
    splits to text snippets and prints them.
    """
    for i, entity in enumerate(document.entities):
        confidence = entity.confidence
        text_entity = ''
        for segment in entity.text_anchor.text_segments:
            start = segment.start_index
            end = segment.end_index
            text_entity += text[start:end]
        pages = [p.page for p in entity.page_anchor.page_refs]
        print(f"*** Entity number: {i}, Split Confidence: {confidence} ***")
        print(f"*** Pages numbers: {pages} ***\nText snippet: {text_entity[:100]}")
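
To experiment with this traversal without calling the API, the same fields can be read from stand-in objects. The sketch below is illustrative only: collect_page_splits is a hypothetical variant that returns the data instead of printing it, and the SimpleNamespace objects mimic just the attributes that the function reads (entities, confidence, text_anchor.text_segments, page_anchor.page_refs). They are not a real API response.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def collect_page_splits(text: str, document: Any) -> List[Dict[str, Any]]:
    """Return one dict per entity: confidence, page numbers, and anchored text."""
    splits = []
    for entity in document.entities:
        # Each text segment is a [start_index, end_index) slice of the full text
        snippet = "".join(
            text[segment.start_index:segment.end_index]
            for segment in entity.text_anchor.text_segments
        )
        splits.append({
            "confidence": entity.confidence,
            "pages": [ref.page for ref in entity.page_anchor.page_refs],
            "text": snippet,
        })
    return splits


# Stand-in data for illustration, not output from the processor
fake_text = "Intake form page one. Intake form page two. Invoice."
fake_document = SimpleNamespace(entities=[
    SimpleNamespace(
        confidence=0.9,
        text_anchor=SimpleNamespace(
            text_segments=[SimpleNamespace(start_index=0, end_index=43)]),
        page_anchor=SimpleNamespace(
            page_refs=[SimpleNamespace(page=0), SimpleNamespace(page=1)]),
    ),
    SimpleNamespace(
        confidence=0.8,
        text_anchor=SimpleNamespace(
            text_segments=[SimpleNamespace(start_index=44, end_index=52)]),
        page_anchor=SimpleNamespace(page_refs=[SimpleNamespace(page=2)]),
    ),
])

for split in collect_page_splits(fake_text, fake_document):
    print(split)
```

Returning plain dictionaries keeps the parsing separate from the printing, which makes the result easier to reuse in later processing.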

Now run your code. You should see the following output if using our sample document:

Document processing complete.
The document contains the following text (first 100 characters):
FakeDoc M.D.
HEALTH INTAKE FORM
Please fill out the questionnaire carefully. The information you pro

The processor detected the following page split entities:
*** Entity number: 0, Split Confidence: 0.21864357590675354 ***
*** Pages numbers: [0, 1] ***
Text snippet: FakeDoc M.D.
HEALTH INTAKE FORM
Please fill out the questionnaire carefully. The information you pro
*** Entity number: 1, Split Confidence: 0.970017671585083 ***
*** Pages numbers: [2] ***
Text snippet: Invoice
DATE: 01/01/1970
INVOICE: NO. 001
FROM: Company ABC
user@companyabc.com
TO: John Doe
johndoe
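
In the sample output, the first entity has a low confidence score and the second a high one, so a caller will usually want to decide which suggestions to trust. The following sketch turns (confidence, pages) pairs into 1-based page ranges and flags those at or above a threshold. The helper to_split_ranges, the SplitRange type, and the 0.5 default threshold are all assumptions made for illustration; the sketch also assumes that page numbers are zero-based, as the output above suggests.

```python
from typing import List, NamedTuple, Sequence, Tuple


class SplitRange(NamedTuple):
    first_page: int    # 1-based, inclusive
    last_page: int     # 1-based, inclusive
    confidence: float
    trusted: bool      # True when confidence >= the chosen threshold


def to_split_ranges(
    splits: Sequence[Tuple[float, Sequence[int]]],
    min_confidence: float = 0.5,  # arbitrary illustrative threshold
) -> List[SplitRange]:
    """Convert (confidence, zero-based pages) pairs into 1-based page ranges."""
    ranges = []
    for confidence, pages in splits:
        if not pages:
            continue  # skip entities that carry no page references
        ranges.append(SplitRange(
            first_page=min(pages) + 1,
            last_page=max(pages) + 1,
            confidence=confidence,
            trusted=confidence >= min_confidence,
        ))
    return ranges


# Values copied from the sample output above
sample = [(0.21864357590675354, [0, 1]), (0.970017671585083, [2])]
for item in to_split_ranges(sample):
    print(item)
```

A suitable threshold depends on your documents; review low-confidence suggestions manually before relying on them.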

Congratulations, you've successfully used the Document AI API to extract logical page boundaries from a multipage document. We encourage you to experiment with other documents.

Clean Up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:

  • In the Cloud Console, go to the Manage resources page.
  • In the project list, select your project then click Delete.
  • In the dialog, type the project ID and then click Shut down to delete the project.

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.