Document AI Workbench - Uptraining

1. Introduction

Document AI is a document understanding solution that takes unstructured data, such as documents, emails, and so on, and makes the data easier to understand, analyze, and consume.

By using uptraining through Document AI Workbench, you can achieve higher document processing accuracy by providing additional labeled examples for Specialized Document Types and creating a new model version.

In this lab, you will create an Invoice Parser processor, configure the processor for uptraining, label example documents, and uptrain the processor.

The document dataset used in this lab consists of randomly-generated invoices for a fictional piping company.

Prerequisites

This codelab builds upon content presented in other Document AI Codelabs.

It is recommended that you complete the following Codelabs before proceeding.

What you'll learn

  • Configure Uptraining for an Invoice Parser processor.
  • Label Document AI training data using the annotation tool.
  • Train a new model version.
  • Evaluate the accuracy of the new model version.

What you'll need

2. Getting set up

This codelab assumes you have completed the Document AI Setup steps listed in the Introductory Codelab.

Please complete the following steps before proceeding:

3. Create a Processor

You must first create an Invoice Parser processor to use for this lab.

  1. In the console, navigate to the Document AI Overview page.

docai-uptraining-codelab-01

  1. Click Create Processor, scroll down to Specialized (or type "Invoice Parser" in the search bar) and select Invoice Parser.

docai-uptraining-codelab-02

  1. Give it the name codelab-invoice-uptraining (or something else you'll remember) and select the closest region on the list.

docai-uptraining-codelab-03

  1. Click Create to create your processor. You should then see the Processor Overview page.

docai-uptraining-codelab-04

4. Create a Dataset

To train the processor, we need to create a dataset with training and test data that helps the processor identify the entities we want to extract.

You will need to create a new bucket in Cloud Storage to store the dataset. Note: This should not be the same bucket where your documents are currently stored.

  1. Open Cloud Shell and run the following commands to create a bucket. Alternatively, create a new bucket in the Cloud Console. Save this bucket name; you will need it later.
# Look up the ID of the currently configured project.
export PROJECT_ID=$(gcloud config get-value project)

# Create a bucket named after the project to hold the dataset.
gsutil mb -p "$PROJECT_ID" "gs://${PROJECT_ID}-uptraining-codelab"
  1. Go to the Dataset tab and click Create Dataset.

docai-uptraining-codelab-05

  1. Paste the name of the bucket you created in step 1 into the Destination Path field. (Don't include the gs:// prefix.)

docai-uptraining-codelab-06

  1. Wait for the dataset to be created. You should then be directed to the Dataset management page.

docai-uptraining-codelab-07
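
The bucket naming used by the shell command above, and the rule that the Destination Path must not include gs://, can be sketched in Python. This is only an illustration and not part of the lab: `bucket_name_for` and `destination_path` are hypothetical helper names, and `my-project` is a placeholder project ID.

```python
def bucket_name_for(project_id, suffix="uptraining-codelab"):
    """Mirror the shell command: the bucket is named "<project id>-<suffix>"."""
    return f"{project_id}-{suffix}"


def destination_path(bucket_uri):
    """Return the value to paste into Destination Path (no gs:// scheme)."""
    path = bucket_uri.strip()
    if path.startswith("gs://"):
        path = path[len("gs://"):]
    # A trailing slash is not part of the bucket name either.
    return path.rstrip("/")


if __name__ == "__main__":
    name = bucket_name_for("my-project")  # placeholder project ID
    print(name)
    print(destination_path(f"gs://{name}/"))
```

Either form of the name (with or without the scheme) goes through `destination_path` unchanged in meaning, so the helper is safe to apply to whatever you saved in step 1.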

5. Import a Test Document

Now, let's import a sample invoice PDF into our dataset.

  1. Click Import Documents.

docai-uptraining-codelab-08

  1. We have a sample PDF for you to use in this lab. Copy and paste the following Cloud Storage path into the Source Path box. Leave the "Data split" as "Unassigned" for now. Click Import.
cloud-samples-data/documentai/codelabs/uptraining/pdfs

docai-uptraining-codelab-09

  1. Wait for the document to import. This took less than 1 minute in my tests.

docai-uptraining-codelab-10

  1. When the import completes, you should see the document in the Dataset management UI. Click on it to enter the labeling console.

docai-uptraining-codelab-11

6. Label the Test Document

Next, we will identify text elements and labels for the entities we would like to extract. These labels will be used to train our model to parse this specific document structure and identify the correct types.

  1. You should now be in the labeling console, which will look something like this.

docai-uptraining-codelab-12

  1. Click on the "Select Text" Tool, then highlight the text "McWilliam Piping International Piping Company" and assign the label supplier_name. You can use the text filter to search for label names.

docai-uptraining-codelab-13

  1. Highlight the text "14368 Pipeline Ave Chino, CA 91710" and assign the label supplier_address.

docai-uptraining-codelab-14

  1. Highlight the text "10001" and assign the label invoice_id.

docai-uptraining-codelab-15

  1. Highlight the text "2020-01-02" and assign the label due_date.

docai-uptraining-codelab-16

  1. Switch to the "Bounding Box" tool. Highlight the text "Knuckle Couplers" and assign the label line_item/description.

docai-uptraining-codelab-17

  1. Highlight the text "9" and assign the label line_item/quantity.

docai-uptraining-codelab-18

  1. Highlight the text "74.43" and assign the label line_item/unit_price.

docai-uptraining-codelab-19

  1. Highlight the text "669.87" and assign the label line_item/amount.

docai-uptraining-codelab-20

  1. Repeat the previous four steps (description, quantity, unit price, and amount) for the next two line items. It should look like this when complete.

docai-uptraining-codelab-21

  1. Highlight the text "1,419.57" (next to Subtotal) and assign the label net_amount.

docai-uptraining-codelab-22

  1. Highlight the text "113.57" (next to Tax) and assign the label total_tax_amount.

docai-uptraining-codelab-23

  1. Highlight the text "1,533.14" (next to Total) and assign the label total_amount.

docai-uptraining-codelab-24

  1. Highlight one of the "$" characters and assign the label currency.

docai-uptraining-codelab-25

  1. The labeled document should look like this when complete. Note that you can adjust these labels by clicking the bounding box in the document or the label name/value in the left-side menu. Click Save when you are finished labeling.

docai-uptraining-codelab-26

  1. Here is the full list of labels and values:

Label Name               Text
supplier_name            McWilliam Piping International Piping Company
supplier_address         14368 Pipeline Ave Chino, CA 91710
invoice_id               10001
due_date                 2020-01-02
line_item/description    Knuckle Couplers
line_item/quantity       9
line_item/unit_price     74.43
line_item/amount         669.87
line_item/description    PVC Pipe 12 Inch
line_item/quantity       7
line_item/unit_price     15.90
line_item/amount         111.30
line_item/description    Copper Pipe
line_item/quantity       7
line_item/unit_price     91.20
line_item/amount         638.40
net_amount               1,419.57
total_tax_amount         113.57
total_amount             1,533.14
currency                 $
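
A quick arithmetic check on the values above can catch a mistyped or mislabeled amount before you save. The sketch below is an optional sanity check and not something the lab requires: it verifies that quantity times unit price equals each line amount, that the line amounts sum to net_amount, and that net_amount plus total_tax_amount equals total_amount. The function and constant names are hypothetical.

```python
from decimal import Decimal

# Values copied from the label table: (description, quantity, unit_price, amount).
LINE_ITEMS = [
    ("Knuckle Couplers", "9", "74.43", "669.87"),
    ("PVC Pipe 12 Inch", "7", "15.90", "111.30"),
    ("Copper Pipe", "7", "91.20", "638.40"),
]
NET_AMOUNT = "1,419.57"
TOTAL_TAX_AMOUNT = "113.57"
TOTAL_AMOUNT = "1,533.14"


def to_decimal(text):
    """Parse a labeled amount such as "1,419.57" into an exact Decimal."""
    return Decimal(text.replace(",", ""))


def check_invoice(line_items, net_amount, tax_amount, total_amount):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    subtotal = Decimal("0")
    for description, quantity, unit_price, amount in line_items:
        expected = to_decimal(quantity) * to_decimal(unit_price)
        if expected != to_decimal(amount):
            problems.append(f"{description}: expected {expected}, labeled {amount}")
        subtotal += to_decimal(amount)
    if subtotal != to_decimal(net_amount):
        problems.append(f"net_amount: line items sum to {subtotal}, labeled {net_amount}")
    if to_decimal(net_amount) + to_decimal(tax_amount) != to_decimal(total_amount):
        problems.append(f"total_amount: net + tax does not equal {total_amount}")
    return problems


if __name__ == "__main__":
    print(check_invoice(LINE_ITEMS, NET_AMOUNT, TOTAL_TAX_AMOUNT, TOTAL_AMOUNT))
```

`Decimal` is used instead of `float` so that currency values compare exactly.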

7. Assign Document to Training Set

You should now be back at the Dataset management console. Notice that the numbers of Labeled and Unlabeled documents have changed, as has the number of active labels.

docai-uptraining-codelab-27

  1. We need to assign this document to either the "Training" or "Test" set. Click on the Document.

docai-uptraining-codelab-28

  1. Click Assign to Set, then click on Training.

docai-uptraining-codelab-29

  1. Notice the Data Split numbers have changed.

docai-uptraining-codelab-30

8. Import Pre-Labeled Data

Document AI Uptraining requires a minimum of 10 documents in both the training and test sets, along with 10 instances of each label in each set.

It's recommended to have at least 50 documents in each set with 50 instances of each label for best performance. More training data generally equates to higher accuracy.
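
The minimums stated above (10 documents per set and 10 instances of each label per set) can be expressed as a simple check. The sketch below assumes you already have a document count and per-label instance counts for a set, for example read off the Dataset management page. `check_split` and the sample counts are hypothetical.

```python
MIN_DOCUMENTS = 10        # minimum documents in each set, per the text above
MIN_LABEL_INSTANCES = 10  # minimum instances of each label in each set


def check_split(split_name, document_count, label_counts):
    """Return a list of reasons why a set falls short of the minimums."""
    problems = []
    if document_count < MIN_DOCUMENTS:
        problems.append(
            f"{split_name}: {document_count} documents, need at least {MIN_DOCUMENTS}"
        )
    for label, count in sorted(label_counts.items()):
        if count < MIN_LABEL_INSTANCES:
            problems.append(
                f"{split_name}: label {label} has {count} instances, "
                f"need at least {MIN_LABEL_INSTANCES}"
            )
    return problems


if __name__ == "__main__":
    # Made-up counts for a training set that holds only one labeled document.
    print(check_split("training", 1, {"invoice_id": 1, "line_item/amount": 3}))
```

An empty list means the set meets the minimums; it says nothing about the recommended 50 documents per set.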

It would take a long time to manually label 100 documents, so we have some pre-labeled documents that you can import for this lab.

You can import pre-labeled document files in the Document.json format. These can be results from calling a processor and verifying the accuracy using Human in the Loop (HITL).
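
To see which labels a pre-labeled file carries, you can count the entity types in its JSON. The sketch below assumes the file has been loaded into a Python dict and that it uses the `entities`, `type`, `mentionText`, and `properties` fields of the Document AI Document object in its JSON form. `SAMPLE_DOCUMENT` is a hand-written, heavily trimmed fragment, not one of the lab's files, and nesting the line item fields under a parent `line_item` entity is an assumption about how the files are structured.

```python
from collections import Counter

# Hand-written fragment for illustration only; real files contain many more fields.
SAMPLE_DOCUMENT = {
    "text": "Invoice 10001 Knuckle Couplers 9",
    "entities": [
        {"type": "invoice_id", "mentionText": "10001"},
        {
            "type": "line_item",
            "mentionText": "Knuckle Couplers 9",
            "properties": [
                {"type": "line_item/description", "mentionText": "Knuckle Couplers"},
                {"type": "line_item/quantity", "mentionText": "9"},
            ],
        },
    ],
}


def count_labels(document):
    """Count entity types, including child entities nested under properties."""
    counts = Counter()

    def visit(entity):
        counts[entity.get("type", "")] += 1
        for child in entity.get("properties", []):
            visit(child)

    for entity in document.get("entities", []):
        visit(entity)
    return counts


if __name__ == "__main__":
    for label, count in sorted(count_labels(SAMPLE_DOCUMENT).items()):
        print(label, count)
```

Summing these counters over all files in a set gives the per-label instance totals for that set. Note that the parent `line_item` entity is counted as its own type.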

  1. Click on Import Documents.

docai-uptraining-codelab-30

  1. Copy and paste the following Cloud Storage path and assign it to the Training set.
cloud-samples-data/documentai/codelabs/uptraining/training
  1. Click Add Another Bucket, then copy and paste the following Cloud Storage path and assign it to the Test set.
cloud-samples-data/documentai/codelabs/uptraining/test

docai-uptraining-codelab-31

  1. Click Import and wait for the documents to import. This will take longer than last time because there are more documents to process. In my tests, this took about 6 minutes. You can leave this page and return later.

docai-uptraining-codelab-32

  1. Once complete, you should see the documents in the Dataset management page.

docai-uptraining-codelab-33

9. Edit Labels

The sample documents we are using for this example do not contain every label supported by the Invoice Parser. We will need to mark the labels we are not using as inactive before training. You can also follow similar steps to add a custom label before Uptraining.

  1. Click on Manage Labels in the bottom-left corner.

docai-uptraining-codelab-33

  1. You should now be in the Label Management console.

docai-uptraining-codelab-34

  1. Use the Checkboxes and the Disable/Enable buttons to mark ONLY the following labels as Enabled.
    • currency
    • due_date
    • invoice_id
    • line_item/amount
    • line_item/description
    • line_item/quantity
    • line_item/unit_price
    • net_amount
    • supplier_address
    • supplier_name
    • total_amount
    • total_tax_amount
  2. The Console should look like this when complete. Click Save when finished.

docai-uptraining-codelab-35

  1. Click on the Back arrow to return to the Dataset management console. Notice that the labels with 0 instances have been marked as Inactive.

docai-uptraining-codelab-36
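
The enable/disable decision in this section amounts to comparing the processor's label list against the 12 labels used in this lab. The sketch below is illustrative only: `plan_label_changes` is a hypothetical helper, and the three extra names in `SCHEMA_SAMPLE` merely stand in for the Invoice Parser labels that the sample documents do not use.

```python
# The 12 labels that should remain enabled, copied from the list above.
ENABLED_LABELS = {
    "currency", "due_date", "invoice_id",
    "line_item/amount", "line_item/description",
    "line_item/quantity", "line_item/unit_price",
    "net_amount", "supplier_address", "supplier_name",
    "total_amount", "total_tax_amount",
}

# Illustrative schema: the lab's labels plus a few stand-ins for unused ones.
SCHEMA_SAMPLE = sorted(ENABLED_LABELS) + ["invoice_date", "purchase_order", "receiver_name"]


def plan_label_changes(schema_labels, enabled=ENABLED_LABELS):
    """Split schema labels into (keep enabled, disable, wanted but missing)."""
    schema = set(schema_labels)
    keep = sorted(schema & set(enabled))
    disable = sorted(schema - set(enabled))
    missing = sorted(set(enabled) - schema)
    return keep, disable, missing


if __name__ == "__main__":
    keep, disable, missing = plan_label_changes(SCHEMA_SAMPLE)
    print("enable: ", keep)
    print("disable:", disable)
    print("missing:", missing)
```

A non-empty `missing` list would point to a label that has to be added as a custom label before uptraining.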

10. Optional: Auto-label newly imported documents

When importing unlabeled documents for a processor with an existing deployed processor version, you can use Auto-labeling to save time on labeling.

  1. On the Train page, click Import Documents.
  2. Copy and paste the following Cloud Storage path. This directory contains 5 unlabeled invoice PDFs. From the Data split dropdown list, select Training.
    cloud-samples-data/documentai/Custom/Invoices/PDF_Unlabeled
    
  3. In the Auto-labeling section, select the Import with auto-labeling checkbox.
  4. Select an existing processor version to label the documents.
  • For example: pretrained-invoice-v1.3-2022-07-15
  1. Click Import and wait for the documents to import. You can leave this page and return later.
  • When complete, the documents appear in the Train page in the Auto-labeled section.
  1. You cannot use auto-labeled documents for training or testing without marking them as labeled. Go to the Auto-labeled section to view the auto-labeled documents.
  2. Select the first document to enter the labeling console.
  3. Verify the labels, bounding boxes, and values to ensure they are correct. Label any values that were omitted.
  4. Select Mark as labeled when finished.
  5. Repeat the label verification for each auto-labeled document, then return to the Train page to use the data for training.

11. Uptrain the Model

Now, we are ready to begin training our Invoice Parser.

  1. Click Train New Version

docai-uptraining-codelab-36

  1. Give your version a name that you'll remember, such as codelab-uptraining-test-1. The Base version is the model version this new version will be built from. If you're using a new processor, the only option should be Google Pretrained Next with Uptraining.

docai-uptraining-codelab-37

  1. (Optional) You can also select View Label Stats to see metrics about the labels in your dataset.

docai-uptraining-codelab-38

  1. Click on Start Training to begin the Uptraining process. You should be redirected to the Dataset management page. You can view the training status on the right side. Training will take a few hours to complete. You can leave this page and return later.

docai-uptraining-codelab-39

  1. If you click on the version name, you will be directed to the Manage Versions page, which shows the Version ID and the current status of the Training Job.

docai-uptraining-codelab-40
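
The Version ID shown on the Manage Versions page is what identifies the uptrained version when you call the Document AI API. The sketch below assembles the full resource name in the `projects/.../locations/.../processors/.../processorVersions/...` form. `processor_version_name` is a hypothetical helper, and every ID in the usage lines is a placeholder.

```python
def processor_version_name(project_id, location, processor_id, version_id):
    """Build the resource name that identifies one processor version."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/processors/{processor_id}/processorVersions/{version_id}"
    )


if __name__ == "__main__":
    # All four values are placeholders; substitute the ones from your console.
    print(processor_version_name("my-project", "us", "abc123", "def456"))
```

The location is the region you selected when creating the processor.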

12. Test the New Model Version

Once the training job is complete (it took about 1 hour in my tests), you can test the new model version and start using it for predictions.

  1. Go to the Manage Versions page. Here you can see the current status and F1 Score.

docai-uptraining-codelab-41

  1. We will need to deploy this model version before it can be used. Click on the vertical dots on the right side and select Deploy Version.

docai-uptraining-codelab-42

  1. Select Deploy from the pop-up window, then wait for the version to deploy. This will take a few minutes to complete. After it's deployed, you can also set this version as the Default Version.

docai-uptraining-codelab-43

  1. Once it's finished deploying, go to the Evaluate tab. Then click the Version dropdown and select the newly created version.

docai-uptraining-codelab-44

  1. On this page, you can view evaluation metrics including the F1 score, Precision and Recall for the full document as well as individual labels. You can read more about these metrics in the AutoML Documentation.
  2. Download the PDF file linked below. This is a sample document that was not included in the Training or Test set.

  1. Click on Upload Test Document and select the PDF file.

docai-uptraining-codelab-45

  1. The extracted entities should look something like this.

docai-uptraining-codelab-46
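
For reference, the Precision, Recall, and F1 score shown on the Evaluate tab follow the standard definitions, which can be sketched as below. The counts in the usage lines are made up, and how Document AI decides whether a predicted entity matches the ground truth is not covered here.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from match counts.

    precision = TP / (TP + FP), recall = TP / (TP + FN),
    F1 = harmonic mean of precision and recall.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up counts: 8 correct extractions, 2 spurious, 2 missed.
    precision, recall, f1 = precision_recall_f1(8, 2, 2)
    print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, a label with high precision but low recall (or the reverse) still gets a low F1 score, which is why it is a useful single number for comparing model versions.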

13. Conclusion

Congratulations, you've successfully used Document AI to uptrain an Invoice Parser. You can now use this processor to parse invoices just as you would for any Specialized Processor.

You can refer to the Specialized Processors Codelab to review how to handle the processing response.

Cleanup

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:

  • In the Cloud Console, go to the Manage resources page.
  • In the project list, select your project, then click Delete.
  • In the dialog, type the project ID and then click Shut down to delete the project.

Resources

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.