OCR: More Than Just Text

Posted on 2015-06-15 Greg

OCR is a hot topic around LEAD at the moment. In addition to our recent Code Project Tutorial and blog post about using OCR in a distributed application, LEAD just released some major updates to its Advantage OCR engine.

Most customers are using OCR for one of three reasons:

1. They want to extract the raw text from an image. For example, using Forms Recognition and Processing to extract a specific field such as an address, name, or phone number.
2. They want to archive a digital copy of the document with searchable text. For example, a text-only or image-over-text PDF or PDF/A file.
3. They want to create an editable document from an image. For example, Microsoft Word DOC and DOCX files support both images and text within the same file and are easily editable by a number of proprietary and open-source applications.

To handle these typical scenarios, any OCR engine needs two primary components. First and foremost, it must be able to accurately recognize all of the text in the image. If all you need is the raw text data, as in the first scenario above, you could stop there. However, this level of recognition is insufficient if your objective is to archive or publish a well-formatted PDF file or to save a Word document for editing. That calls for the second component: formatted text recognition. This is a much more complex problem, because the engine must determine the font face, size, styling (e.g. italics, bold), line information, spacing and more. In my experience, formatted text recognition is where you really start to see different engines show their strengths and weaknesses.
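
To make the difference between the two kinds of output concrete, here is a minimal Python sketch of what each one might carry. It is not LEADTOOLS code: the `Word` class, its fields, and the `raw_text` and `styled_text` helpers are hypothetical names invented for illustration, and the bold/italic tags are just a stand-in for a real document format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Word:
    """One recognized word (hypothetical structure, not a LEADTOOLS type)."""
    text: str
    line: int               # index of the text line the word belongs to
    font_face: str = ""     # formatting fields stay at defaults when only raw text is recognized
    font_size: float = 0.0
    bold: bool = False
    italic: bool = False


def _join_lines(pieces_by_line: Dict[int, List[str]]) -> str:
    # Emit one output line per recognized line, in top-to-bottom order.
    return "\n".join(" ".join(pieces_by_line[i]) for i in sorted(pieces_by_line))


def raw_text(words: List[Word]) -> str:
    """Raw text output: only the characters and line breaks survive."""
    lines: Dict[int, List[str]] = {}
    for w in words:
        lines.setdefault(w.line, []).append(w.text)
    return _join_lines(lines)


def styled_text(words: List[Word]) -> str:
    """Formatted output: styling is preserved, shown here as simple tags."""
    lines: Dict[int, List[str]] = {}
    for w in words:
        piece = w.text
        if w.italic:
            piece = f"<i>{piece}</i>"
        if w.bold:
            piece = f"<b>{piece}</b>"
        lines.setdefault(w.line, []).append(piece)
    return _join_lines(lines)


words = [
    Word("Invoice", 0, "Courier", 12.0, bold=True),
    Word("total", 0, "Courier", 12.0),
    Word("due", 1, "Courier", 12.0, italic=True),
]
print(raw_text(words))
print(styled_text(words))
```

A field extraction workflow only needs what `raw_text` returns, while a PDF or Word export needs every attribute on `Word` to be right, which is why errors in font or spacing show up only in the formatted case.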

As an example, check out the screenshots below. Figure 1 represents the original image to be recognized; it uses a serifed, mono-spaced font with a slight skew angle, stray dots to clean up, and two columns of text. Notice in Figure 2, which shows the output of a third-party OCR engine, that the text was recognized correctly but the formatting has several errors. The most obvious are the spacing between words and the incorrect carriage returns. Less noticeable, but just as important, is that it returned a sans-serif, proportional font with errors in the styling. Figure 3 shows the output from LEADTOOLS' latest Advantage OCR engine, which is nearly perfect.

Figure 1: Original TIFF

Figure 2: Low quality formatting from some engines

Figure 3: LEADTOOLS Advantage OCR Engine

"That's great, but why should I care if I only need raw text?"

That's an intuitive and valid question if you are only concerned with raw text extraction. If getting the font, style and spacing correct isn't a deciding factor, what is? Nine times out of ten it will come down to speed. Thankfully, we have programmers like you in mind! The LEADTOOLS Advantage OCR Engine provides optimization options that bypass the formatted text recognition, resulting in an average speed increase of 15-20%. Who said you can't make everyone happy?
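
The idea behind such an option can be sketched as a two-stage pipeline in which the second stage is optional. This is only an illustration in Python: `recognize_characters`, `detect_formatting`, `run_ocr`, and the `raw_text_only` flag are hypothetical names, not LEADTOOLS settings, and the stages are trivial stand-ins, so the sketch says nothing about the actual speed gain.

```python
from typing import Dict, List


def recognize_characters(image_lines: List[str]) -> List[Dict]:
    # Stand-in for the text recognition stage: each "image line" is already a string.
    return [
        {"text": word, "line": i}
        for i, line in enumerate(image_lines)
        for word in line.split()
    ]


def detect_formatting(words: List[Dict]) -> List[Dict]:
    # Stand-in for the formatting stage: attaches placeholder font attributes.
    for w in words:
        w["font_face"] = "Courier"
        w["bold"] = w["text"].isupper()
    return words


def run_ocr(image_lines: List[str], raw_text_only: bool = False) -> List[Dict]:
    words = recognize_characters(image_lines)
    if not raw_text_only:
        # The formatting stage is skipped entirely when the caller only wants raw text.
        words = detect_formatting(words)
    return words


fast = run_ocr(["TOTAL due"], raw_text_only=True)
full = run_ocr(["TOTAL due"])
print(fast)
print(full)
```

Both calls return the same words; the raw-text call simply never pays for the attributes it would throw away.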

Thanks,
Otis Goodwin