Document AI doesn't see some numbers on training screen

Hello.

I am trying to improve the "Invoice Parser" processor so that it recognizes some additional labels. The major problem is that it does not see some numbers. I tested the "OCR processor" and it extracted these numbers without issue, but on the training screen, when I select these numbers as in the screenshot, the value comes back empty. Even if I enter the correct value manually, it will not let me train the model, because it says the selected labels are empty in the documents. How can I fix this? I have multiple documents and all of them have this issue. It does not skip all numbers, only some of them (especially "0" in this table).

 

Screenshot 2024-07-02 at 14.57.19.png


Hi oleks_vasyliev,

There are several reasons why Document AI might be struggling to recognize zeros in your training data. Here's a consolidated view of the potential causes and solutions:

Labeling Issues:

  • Incorrect Labeling: Double-check your labeling to ensure zeros are accurately selected. They might be accidentally skipped, labeled as spaces, or merged with adjacent characters.
  • Label Format: Verify that the labeling tool uses the expected format (bounding boxes or text annotations) and positions them correctly around the zeros.

Label documents - Labeled documents are required to train, up-train, or evaluate a processor version.

Data Quality and Preprocessing:

 

 

  • Document Clarity: Analyze your training documents. Are zeros clear and well-formatted, or are there issues with blurry scans, small font sizes, or low contrast with the background?
  • Preprocessing Options: If available, explore Document AI's preprocessing settings. Look for options that might improve small character recognition, adjust contrast specifically for numbers, or filter out noise that might obscure zeros.

Data considerations and recommendations - The quality and amount of your data determine the quality of the training, uptraining, and evaluation.

In addition, you may refer to the below items:

  • Clear Labeling: Ensure consistent and accurate labeling of zeros throughout your training data.
  • Data Quality: Use high-quality training documents with clear and well-formatted zeros.
  • Balanced Training: Balance your training data set to avoid biasing the model towards more frequent characters.
  • Iterative Training: Train, test, and refine your model iteratively, adjusting labeling, preprocessing, or the data set based on your findings.

By addressing these factors and implementing the appropriate solutions, you should be able to improve Document AI's ability to recognize zeros in your training data and successfully train your model.
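The "label is empty" condition described in this thread can also be checked outside the console. Below is a minimal Python sketch, assuming each labeled document is available as Document JSON with `entities`, `mentionText`, and nested `properties` fields. The helper name `empty_label_counts`, the sample data, and the label names are made up for illustration.

```python
from collections import Counter


def empty_label_counts(document: dict) -> Counter:
    """Count, per label type, the entities whose mentionText is missing or blank.

    Assumes Document JSON field names (entities, properties, type, mentionText).
    Only one level of nesting is inspected, which covers table rows whose
    fields are stored as properties of a parent entity.
    """
    counts = Counter()
    for entity in document.get("entities", []):
        for item in [entity, *entity.get("properties", [])]:
            if not item.get("mentionText", "").strip():
                counts[item.get("type", "<untyped>")] += 1
    return counts


# Illustrative sample only: one table row whose quantity "0" was labeled
# but came back without a value, plus a label with no text at all.
sample = {
    "text": "Qty 0 Total 15\n",
    "entities": [
        {
            "type": "line_item",
            "mentionText": "0 15",
            "properties": [
                {"type": "line_item/quantity", "mentionText": ""},
                {"type": "line_item/amount", "mentionText": "15"},
            ],
        },
        {"type": "total_amount"},
    ],
}

for label, count in empty_label_counts(sample).items():
    print(f"{label}: {count} empty value(s)")
```

Running this over every document in the dataset would show which labels are affected and how often, which may help when reporting the problem.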

I hope this helps.

 

  • Label Format: Verify that the labeling tool uses the expected format (bounding boxes or text annotations) and positions them correctly around the zeros - I tried both, and neither lets me select these numbers.
  • Clear Labeling: Ensure consistent and accurate labeling of zeros throughout your training data. - Checked. This is a number field in a table, so it may contain zeros.
  • Data Quality: Use high-quality training documents with clear and well-formatted zeros - If OCR can extract these numbers, the documents themselves look fine.
  • Balanced Training: Balance your training data set to avoid biasing the model towards more frequent characters - I need these numbers, so I am training the model to extract them.
  • Iterative Training: Train, test, and refine your model iteratively, adjusting labeling, preprocessing, or the data set based on your findings - I cannot train if I cannot select the values.
  • Preprocessing Options: If available, explore Document AI's preprocessing settings. Look for options that might improve small character recognition, adjust contrast specifically for numbers, or filter out noise that might obscure zeros - Where can I activate these settings on the training screen? This is the Invoice Parser, not the OCR parser.

 

Here is what the videos show:

  • first zero in row cannot be selected
  • second zero in row can be selected
  • third zero in row cannot be selected again

Here are examples using both tools:

Bounding boxes: https://youtube.com/shorts/kTBfqhKMT4A?feature=share

Text annotations (you can even see that these numbers have no grey background boxes and cannot be selected): https://youtube.com/shorts/CCm7uCVpSnA?feature=share

I hope the videos explain the issue.
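The difference between what the OCR processor detects and what the parser exposes can also be measured directly. Below is a minimal Python sketch, assuming both processors' results for the same page are saved as Document JSON with `text`, `pages`, `tokens`, and `textAnchor.textSegments` fields. It also assumes that indexes may be serialized as strings and that a `startIndex` of 0 may be omitted. The helper names and the sample data are hypothetical.

```python
from collections import Counter


def token_counts(document: dict) -> Counter:
    """Count how often each token text occurs in a Document JSON dict."""
    text = document.get("text", "")
    counts = Counter()
    for page in document.get("pages", []):
        for token in page.get("tokens", []):
            anchor = token.get("layout", {}).get("textAnchor", {})
            # Rebuild the token text from its segments of the full text.
            piece = "".join(
                text[int(seg.get("startIndex", 0)):int(seg.get("endIndex", 0))]
                for seg in anchor.get("textSegments", [])
            ).strip()
            if piece:
                counts[piece] += 1
    return counts


def tokens_missing_from(reference: dict, candidate: dict) -> Counter:
    """Tokens that occur more often in `reference` than in `candidate`."""
    return token_counts(reference) - token_counts(candidate)


def _token(start: int, end: int) -> dict:
    """Build a token the way the assumed JSON form would serialize it."""
    segment = {"endIndex": str(end)}
    if start:
        segment["startIndex"] = str(start)
    return {"layout": {"textAnchor": {"textSegments": [segment]}}}


# Illustrative data: the OCR result contains three zeros, the parser result one.
ocr_doc = {
    "text": "MAR/23 0 0 0\n",
    "pages": [{"tokens": [_token(0, 7), _token(7, 9), _token(9, 11), _token(11, 13)]}],
}
parser_doc = {
    "text": "MAR/23 0\n",
    "pages": [{"tokens": [_token(0, 7), _token(7, 9)]}],
}

for text, count in tokens_missing_from(ocr_doc, parser_doc).items():
    print(f"{text!r} is missing {count} time(s) from the parser output")
```

This compares token counts only, not positions, so it shows that zeros are missing but not which cells they belong to.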

 

I tried a different parser, the Expense Parser to be specific. Here is the result in the Evaluate and Test tab:

Screenshot 2024-07-05 8.55.13 PM.png

Hope this helps.

Thanks for the reply. Sorry, but I have already spent weeks training the "Invoice Parser" for new labels.

Are you saying that "Expense Parser" is better at recognizing numbers during training than "Invoice Parser"? Why would the training playground differ between these parsers, if both are used by humans rather than robots? Also, does this mean that even if I need to parse bills and the Invoice Parser is the logical choice, I should still use "Expense Parser" because it is somehow better? I just need to know which one is best to select in the future.

No worries oleks_vasyliev, I am here to help you with this. I understand that switching parsers may take a lot of time and effort. Here is some documentation on the Invoice Parser that might help you understand why zeros are not recognized by the processor, as well as the limits of Document AI for the processor you are using.

If none of these suggestions resolve the issue, consider reaching out to Google Cloud support for further assistance. Thank you.

Thanks. 

Last question, @McMaco. If we switch to "Custom Extractor", can it handle such cases (with these numbers) better than the Invoice/Expense parsers, even though it will cost us 3 times more for that functionality? Thanks.

You're right, @oleks_vasyliev.

When I replicated this on my end, the Custom Extractor did better at recognizing numbers such as 0.

McMaco_0-1720464425209.png

For pricing, you may refer to the Document AI Pricing page for more information.

Hope this helps.

Thanks. Too bad that even in your example it misses the zero in the second row (MAR/23) 😞