Improve redaction using OCR Advantage - OCR Advantage is not recognizing Social Security numeric characters accurately

This topic and its replies were posted before the current version of LEADTOOLS was released and may no longer be applicable.

#1 Posted : Wednesday, November 1, 2017 1:49:52 PM(UTC)

Rob Cline

Groups: Registered
Posts: 52

Thanks: 14 times

We use the OCR portion of Leadtools to recognize documents submitted to be recorded in our county.
For public viewing, we are required to redact all Social Security numbers in any document submitted to us.
However, the SSN portions of most Death Certificates are NOT being recognized.
I believe that OCR Advantage is not recognizing all of the numeric characters as numbers.
I am hoping this is either a settings or pre-processing issue.
All our documents are CCITT4 tif multi-page formatted.

I have tried testing the attached heavily redacted certificate against the demo app "C# OCR Advantage Demo".
Please do not make this attachment available to the forum.
It is for your internal use only.

Nothing I do seems to get the demo to accurately recognize the SSN portion of the certificate.
It continually recognizes the last 9 of the SSN as a g, even though the internal 9 is recognized as a 9.
Is it possible to just recognize number characters only, and would that improve the process?
We don't care about converting images to text per se.

Can you recommend how I should pre-process this type of TIFF image to get better results?
To further clarify, forms processing can only partially help with Death Certificates because there are many different forms of this type of document.

Can you specifically advise me using the demo application and this sample, just to keep things as simple as possible?

However, if there are settings that are not covered by the demo that I could use in my application, then please let me know.
I will then attempt to incorporate the changes to our batch redaction application.



#2 Posted : Wednesday, November 1, 2017 2:28:42 PM(UTC)

Joe Z

Groups: Registered, Tech Support, Administrators
Posts: 63

Thanks: 2 times
Was thanked: 4 time(s) in 4 post(s)

Rob,

Within the SSN OCR zone, you can set the zone to recognize only numerical digits. You can do this by setting the CharacterFilters property for that particular zone. To recognize only numerals, you would use the "OcrZoneCharacterFilters.Digit" item from the OcrZoneCharacterFilters enumeration. Note that the OCR Advantage Engine only supports the "Digit" and "Plus" items from this list. I'll provide links to our documentation pages below.

https://www.leadtools.com/help/leadtools/v19/dh/fo/ocrzone-characterfilters.html
https://www.leadtools.com/help/leadtools/v19/dh/fo/ocrzonecharacterfilters.html

Here is a link to another forum post which contains an example project of setting the CharacterFilters property.
https://www.leadtools.com/support/forum/posts/t10959-HOW-TO--Set-Character-Filters-in-OCR-Professional-Engine
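
As a complement to a digit-only zone filter, text that has already been recognized can be post-processed to undo common letter/digit confusions such as the "g" for "9" described above. The following is a minimal Python sketch and does not use any LEADTOOLS API; the look-alike table and the function name are assumptions made up for illustration, and the SSN shown is a dummy value.

```python
import re
from typing import Optional

# Hypothetical look-alike table (an assumption, tune it to your own error patterns):
# letters that an OCR engine may return in place of digits.
LOOKALIKES = str.maketrans({
    "g": "9", "q": "9", "O": "0", "o": "0", "D": "0",
    "l": "1", "I": "1", "|": "1", "Z": "2", "S": "5", "B": "8",
})

# NNN-NN-NNNN, the usual printed SSN layout.
SSN_FORMAT = re.compile(r"\d{3}-\d{2}-\d{4}")

def normalize_ssn_candidate(text: str) -> Optional[str]:
    """Map look-alike letters to digits and return the result only if it
    then matches the SSN layout; otherwise return None."""
    cleaned = text.strip().translate(LOOKALIKES)
    return cleaned if SSN_FORMAT.fullmatch(cleaned) else None

# Dummy value with the trailing 9 misread as "g".
print(normalize_ssn_candidate("123-45-678g"))
```

Only apply a mapping like this to text that is expected to be numeric, since it would corrupt ordinary words.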

Joe Zhan
Developer Support Engineer
LEAD Technologies, Inc.

#3 Posted : Thursday, November 2, 2017 11:40:47 AM(UTC)

Rob Cline

Groups: Registered
Posts: 52

Thanks: 14 times

Thank you for your reply.

I was not able to get the sample application that uses character filtering to run.
However, I was able to figure out how to apply character filtering on a page in the demo app "C# OCR Advantage Demo".
But that was on a zone-by-zone basis and seemed very impractical.
I was not able to determine how to apply the character filter at the page level when recognizing.

But I was able to determine through trial and error that using the cleanup tool "Despeckle" followed by "Fix broken letters" greatly improved the accuracy.
We were already using the "Accurate" recognition setting in our current program.

It took me a bit to determine that "Fix broken letters" was actually running the image processing command "MinimumCommand(2)".
Evidently, this is a form of dilation that thickens the dark parts of the image or, in my case, the characters.
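
To illustrate why a minimum filter can repair broken characters, here is a small library-independent Python sketch. It is not the LEADTOOLS implementation; the 3x3 window is an assumption and is not necessarily how MinimumCommand interprets its argument. Each pixel is replaced by the darkest value in its neighborhood, so dark strokes grow and small gaps close.

```python
def minimum_filter(image, size=3):
    """Replace each pixel with the lowest (darkest) value found in the
    size x size window around it. Windows are clipped at the image edges."""
    half = size // 2
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            neighborhood = [
                image[ny][nx]
                for ny in range(max(0, y - half), min(height, y + half + 1))
                for nx in range(max(0, x - half), min(width, x + half + 1))
            ]
            row.append(min(neighborhood))
        result.append(row)
    return result

# 0 = black ink, 255 = white paper.
# A one-pixel-wide vertical stroke with a one-pixel break in the middle.
stroke = [
    [255, 255, 0, 255, 255],
    [255, 255, 0, 255, 255],
    [255, 255, 255, 255, 255],  # the break
    [255, 255, 0, 255, 255],
    [255, 255, 0, 255, 255],
]

thickened = minimum_filter(stroke)
for row in thickened:
    print(row)
```

After filtering, the break is filled and the stroke is three pixels wide. The same growth can also merge neighboring characters, which is one reason the setting needs testing against many documents.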

I will attempt to add the two image process commands above to a simplified version of our redaction program and see if I can get a significantly higher accuracy on our Death Certificates.

#4 Posted : Tuesday, November 7, 2017 4:02:56 PM(UTC)

Joe Z

Groups: Registered, Tech Support, Administrators
Posts: 63

Thanks: 2 times
Was thanked: 4 time(s) in 4 post(s)

Rob,

Let us know if you have any questions regarding our image processing functions. The two commands that you mentioned, "Despeckle" and "Minimum", are linked below for your reference.

https://www.leadtools.com/help/leadtools/v19/dh/po/despecklecommand.html
https://www.leadtools.com/help/leadtools/v19/dh/po/minimumcommand.html

Joe Zhan
Developer Support Engineer
LEAD Technologies, Inc.

1 user thanked Joe Z for this useful post.

Rob Cline on 11/7/2017(UTC)

#5 Posted : Tuesday, November 7, 2017 5:06:47 PM(UTC)

Rob Cline

Groups: Registered
Posts: 52

Thanks: 14 times

The more familiar I get with the prior programmer's code, the more I have been able to understand it and refactor it based on what I have learned from the demo app "C# OCR Advantage Demo".

I also had to add some page-level preprocessing methods, since I discovered that a few of our documents come in with some individual pages in landscape mode, so they need to be rotated and/or deskewed as well.

However, I am coming to the realization that no matter how I try to tweak this batch process, there will always be some document pages that are just so corrupted as received in the system, that the OCR process will not be able to read the SSN.
There always seems to be a need for a human to review the output.

Is this your experience as well when it comes to performing OCR on non-form based projects (i.e. images can be almost any kind of document at all)?
Or are there batch systems based on your tools, that perform at a near 100% accuracy with some kind of automated adjustments?

#6 Posted : Thursday, November 9, 2017 10:32:33 AM(UTC)

Joe Z

Groups: Registered, Tech Support, Administrators
Posts: 63

Thanks: 2 times
Was thanked: 4 time(s) in 4 post(s)

Rob,

The difficulty with OCR is that input files vary widely. Because of this, the OCR process cannot reach 100% accuracy. However, several steps can be taken to improve the recognition accuracy.

One way to improve your recognition accuracy is to test against more files and scenarios. With more test files, you can evaluate your set of image preprocessing commands more thoroughly. Once a set of image processing techniques improves the recognition results for many files, you will know that this particular set works for that batch of documents.

Additionally, you can perform multiple passes to obtain more recognition results. You can then combine the results from those passes to obtain a more accurate recognition result. Finally, you can validate your output to ensure it is in the proper format.
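
The multi-pass and validation ideas above can be sketched as follows. This is a library-independent Python illustration under stated assumptions: the per-character majority vote is one possible way to combine passes, not a LEADTOOLS feature, and the function names and sample readings are made up.

```python
import re
from collections import Counter

# NNN-NN-NNNN, the usual printed SSN layout.
SSN_FORMAT = re.compile(r"\d{3}-\d{2}-\d{4}")

def vote_per_character(readings):
    """Combine several readings of the same field by keeping the most
    common character at each position. Readings whose length differs
    from the most common length are ignored."""
    if not readings:
        return None
    length = Counter(len(r) for r in readings).most_common(1)[0][0]
    usable = [r for r in readings if len(r) == length]
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*usable)
    )

def best_ssn(readings):
    """Return the voted reading if it has the SSN layout, else None so
    the page can be routed to a human reviewer."""
    combined = vote_per_character(readings)
    if combined is not None and SSN_FORMAT.fullmatch(combined):
        return combined
    return None

# Three hypothetical passes over the same zone, e.g. with different
# preprocessing; each pass makes a different mistake.
passes = ["123-45-678g", "12B-45-6789", "123-45-6789"]
print(best_ssn(passes))
```

A result of None does not mean the page has no SSN, only that no pass combination produced a well-formed one, so such pages still need manual review.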

Joe Zhan
Developer Support Engineer
LEAD Technologies, Inc.

#7 Posted : Thursday, November 9, 2017 11:15:11 AM(UTC)

Rob Cline

Groups: Registered
Posts: 52

Thanks: 14 times

Thanks. That confirms my suspicions.

I am now testing against hundreds of different documents (we process around 700 documents a day).
I will start applying a set of business rules to the OCR process to maximize the accuracy on an ongoing basis.
