
How are OCR texts post-processed to increase accuracy of recognition?

Viewed 397 times

Has anyone worked in a company where they extract large amounts of text using OCR and then clean the text to be as accurate as possible?

How is this done?

Say I digitize a lot of legal documents and run OCR on them. We know that OCR accuracy is not great. How can I make sure the digitized text ends up as close to the original scanned copy as possible?

How do you build systems around this? Are there any companies working on these problems?

Many OCR models misrecognize specific characters in consistent ways. I believe this is a pattern that an NLP model could learn, and that a combined model could greatly improve accuracy. Do you think there is any merit to this idea? How would I get started working on it?

7 replies

76988479
3

Hi, we have a system that extracts text from design documents. Right now we are using two OCR engines that fit our use case:

  1. Pytesseract: some might argue, but it works better than many other well-known OCR libraries.
  2. Google Vision API: for documents where we cannot compromise on accuracy.

We did compare a lot of different OCR libraries, but almost all of them failed to extract long text content. We also tried a couple of paid APIs apart from Google Vision, but ended up sticking with it.
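
For anyone wondering how the two engines can be combined, here is a rough sketch of one possible routing approach: run Tesseract first and fall back to Vision when its confidence is low. The 60-point threshold and the file handling are illustrative assumptions, not our production code.

    import io
    import pytesseract
    from PIL import Image
    from google.cloud import vision  # assumes GOOGLE_APPLICATION_CREDENTIALS is configured

    CONF_THRESHOLD = 60  # arbitrary cut-off; tune it on your own documents

    def tesseract_with_confidence(path):
        """Run Tesseract and return (text, mean word confidence)."""
        img = Image.open(path)
        data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
        confs = [float(c) for c in data["conf"] if float(c) >= 0]  # -1 marks non-word boxes
        mean_conf = sum(confs) / len(confs) if confs else 0.0
        return pytesseract.image_to_string(img), mean_conf

    def vision_ocr(path):
        """Fall back to Google Vision document text detection."""
        client = vision.ImageAnnotatorClient()
        with io.open(path, "rb") as f:
            image = vision.Image(content=f.read())
        response = client.document_text_detection(image=image)
        return response.full_text_annotation.text

    def extract(path):
        text, conf = tesseract_with_confidence(path)
        if conf < CONF_THRESHOLD:  # low confidence: send the page to the paid engine
            text = vision_ocr(path)
        return text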

76993444
1
Author

Thanks for responding, Rajdeep. I will try the Google Vision API. Apart from this, do you do any post-processing on the data? My work is on Indic-language documents, and there are a lot of errors in the OCR output; for example, it recognizes प्र as प, स as रा, and so on. At first glance this looks like a systematic problem to me, and I am thinking about applying some NLP technique to get rid of these errors. What are your thoughts on this?
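
To make the idea concrete, here is a minimal sketch of what I have in mind: a hand-built confusion map plus a dictionary of valid words. The map, the tiny lexicon, and the function names are all hypothetical; a real version would need a proper Hindi wordlist and many more confusion pairs.

    import itertools
    import re

    # Hypothetical confusion map: what the OCR printed -> what it may actually have been.
    # (Based on the errors above: प्र read as प, स read as रा.)
    CONFUSIONS = {"प": ["प", "प्र"], "रा": ["रा", "स"]}

    def candidates(word):
        """Yield every variant of `word` obtained by swapping confusable substrings."""
        pattern = "(" + "|".join(map(re.escape, CONFUSIONS)) + ")"
        parts = re.split(pattern, word)
        options = [CONFUSIONS.get(p, [p]) for p in parts]
        for combo in itertools.product(*options):
            yield "".join(combo)

    def correct(word, lexicon):
        """Return the first candidate that is a known word, otherwise the word unchanged."""
        for cand in candidates(word):
            if cand in lexicon:
                return cand
        return word

    # A real lexicon would come from a Hindi wordlist; this tiny set is only for illustration.
    lexicon = {"प्रकाश", "समय"}
    print(correct("पकाश", lexicon))   # -> प्रकाश
    print(correct("रामय", lexicon))   # -> समय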

78076965
1

You are asking a lot of questions here, and I can't answer all of them, but I can describe my own situation.

I have been using the Google Drive API for OCR conversion of old New Zealand Electoral Rolls.

Here's an example:

One area's Electoral Roll contains around 50 JPG images. Each image contains around 400 name/address/job-description entries.

My testing convinced me that Google's OCR gave the best results, significantly better than, say, Tesseract.

Basically speaking my setup is:

OCR conversion - Google script using Drive API

Google Drive Export - Node Javascript with Google Authentication

File Naming/Administration - PowerShell script

Error correction/Formatting - Python script using libraries: difflib, re, and many others. I have also used publicly available street name files to build my own Python dictionaries for street-names, street-types & suburbs. These are used with the above Python modules to correct a lot of text errors.
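
As a rough illustration of that correction step (the street names, cutoff, and function here are illustrative placeholders, not my actual script), difflib can snap a noisy OCR'd street name onto the nearest entry in such a dictionary:

    import difflib

    # In practice the lexicon is loaded from publicly available street-name files;
    # this small set is only for illustration.
    STREET_NAMES = {"QUEEN STREET", "KARANGAHAPE ROAD", "DOMINION ROAD", "PONSONBY ROAD"}

    def correct_street(ocr_value, cutoff=0.8):
        """Replace an OCR'd street name with its closest dictionary entry, if any."""
        match = difflib.get_close_matches(ocr_value.upper(), STREET_NAMES, n=1, cutoff=cutoff)
        return match[0] if match else ocr_value

    print(correct_street("Karangahape Rcad"))   # -> KARANGAHAPE ROAD
    print(correct_street("Unknown Lane"))       # -> Unknown Lane (left as-is)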

I am close to finishing. My objective is to output the data as SQL inserts (files formatted for loading into an SQL database).

Usage example:

(i) I want to be able to search for all people who lived in a certain street back in the day

(ii) All people who had a particular job description back then

Note to anyone planning to develop using the Google Drive API:

The drawback of using a Google API is that you are at the mercy of Google. I found this out around October 2022, when my Google Drive output changed significantly. Google eventually admitted to what they described as a 'meta bug' in the API. They gave me updates for about six months saying they were fixing it, but eventually they gave up and told me they were closing the issue. This created a lot of extra work for me.

78077002
1
Author

Hello @Dave, thanks for taking the time to share your work. Since I asked the question, I have come across a lot of literature on post-OCR correction. I will share some videos which I found very helpful, at least from a theoretical standpoint.

Basically, I was trying to ask whether there are methods to correct OCR errors using the context of the recognized text (the words surrounding the recognized word). There are some, but they are quite limited and either expensive or inaccessible.

Post OCR Correction at Adobe talk
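
To illustrate what I mean by using context: candidate spellings for a suspicious word can be generated by edit distance and then ranked by how well they fit the neighbouring words. This toy sketch (the bigram counts, vocabulary, and function are made up for illustration, and it is not the method from the Adobe talk) scores candidates with bigram frequencies from a clean corpus:

    from collections import Counter
    import difflib

    # Toy corpus statistics; in practice these come from a large, clean text corpus.
    bigrams = Counter({
        ("high", "court"): 120,
        ("tennis", "court"): 40,
        ("high", "coat"): 1,
    })
    vocabulary = ["court", "coat", "count", "high", "tennis"]

    def correct_with_context(prev_word, ocr_word):
        """Pick the candidate spelling that occurs most often after `prev_word`."""
        candidates = difflib.get_close_matches(ocr_word, vocabulary, n=5, cutoff=0.6)
        if not candidates:
            return ocr_word  # nothing close enough; leave the word alone
        return max(candidates, key=lambda cand: bigrams[(prev_word, cand)])

    # The OCR read "high cowrt"; the context word "high" pushes the correction to "court".
    print(correct_with_context("high", "cowrt"))  # -> court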

78078910
1

Hi,

Thank you for the reply, and for the Adobe talk video. Perhaps my thinking is simplistic, but I have tested Adobe's OCR (Acrobat Pro), and the Google Drive equivalent OCR conversion was significantly superior.

Surely, in the first place, the main factor to consider with OCR error correction is the quality of the OCR result. The better the OCR result, the less OCR correction is required.

I think I should know, because my project is OCR conversion of old (historic) documents. The data was entered on a typewriter, the paper has crumpled in places, and there are even stains here and there masking the text. I am still getting a high accuracy rate of conversion, which I would estimate at better than 95% true to the original text, with most errors caused by physical damage to the print or paper of the original rather than by the OCR conversion itself.

My suggestion is to try it.

(i) Store a scanned/photographic JPG image of text on Google Drive

(ii) Right-click the image in the browser (not in a synced Drive folder)

(iii) Choose "Open with Google Docs"

If you have a good sample, you should get a good result. For an English example with simple formatting you should get close to 100% accuracy; it's just the formatting that needs cleaning up.
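
The same conversion can also be scripted rather than done by hand. Here is a rough sketch using the Drive API v3 Python client; the file names are placeholders, and the credentials setup is assumed to be a service account (any OAuth flow would do).

    from google.oauth2 import service_account
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaFileUpload

    # Placeholder credentials file; any authorised google-auth credential works here.
    creds = service_account.Credentials.from_service_account_file(
        "service_account.json", scopes=["https://www.googleapis.com/auth/drive"]
    )
    service = build("drive", "v3", credentials=creds)

    metadata = {
        "name": "electoral_roll_page_001",
        # Asking Drive to store the upload as a Google Doc triggers OCR on the image.
        "mimeType": "application/vnd.google-apps.document",
    }
    media = MediaFileUpload("page_001.jpg", mimetype="image/jpeg")
    doc = service.files().create(
        body=metadata, media_body=media, ocrLanguage="en", fields="id"
    ).execute()

    # Export the recognized text back out as plain text.
    text = service.files().export(fileId=doc["id"], mimeType="text/plain").execute()
    print(text.decode("utf-8"))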

78146680
1
Author

Dave, I would like to collaborate with you on your work. Please send me a connection request on LinkedIn: https://linkedin.com/in/anmoldeep1

I will schedule a meeting to discuss further and collaborate.

78570039
1

Hello, thank you for the discussion. I want to ask: for a language like Arabic, are there any post-processing OCR tools? I am working with a PDF book that I analyze using ABBYY FineReader OCR, and I still need a way to post-process the book without adjusting it manually. Any help, please?