Most customers are using OCR for one of three reasons:
- They want to extract the raw text from an image. For example, using Forms Recognition and Processing to extract a specific field such as an address, name, or phone number.
- They want to archive a digital copy of the document with searchable text. For example: text-only or image-over-text PDF and PDF/A.
- They want to create an editable document from an image. For example, the Microsoft Word DOC and DOCX formats support both images and text within the same file and are easily editable by a number of proprietary and open-source applications.
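To make the first use case concrete, here is a minimal Python sketch of pulling a single field out of raw recognized text. It is not the LEADTOOLS Forms Recognition and Processing API: the sample text, the `extract_field` helper, and the `Label: value` layout are all assumptions made for illustration, and a plain regular expression stands in for real form processing.

```python
import re

# Hypothetical raw text, standing in for what an OCR engine might return.
raw_text = """\
Name: Jane Doe
Address: 123 Main St, Springfield
Phone: (555) 010-4477
"""

def extract_field(text, label):
    """Return the value that follows 'label:' on its own line, or None.

    Assumes each field sits on a single line in 'Label: value' form.
    """
    pattern = rf"^{re.escape(label)}\s*:\s*(.+)$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None

phone = extract_field(raw_text, "Phone")
address = extract_field(raw_text, "Address")
```

Note that a lookup like this only needs the characters to be right; fonts, styling, and column layout play no part in it.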
As an example, check out the screenshots below. Figure 1 shows the original image to be recognized; it uses a serifed, monospaced font and has a slight skew angle, stray dots to clean up, and two columns of text. Notice in Figure 2, which shows the output of a third-party OCR engine, that the text was recognized correctly but with several formatting errors. The most obvious are the spacing between words and the incorrect carriage returns. Less noticeable, but just as important, is that it returned a sans-serif, proportional font with errors in the styling. Figure 3 shows the output of LEADTOOLS’ latest Advantage OCR engine, which is nearly perfect.
“That’s great, but why should I care if I only need raw text?”
That’s a natural and valid question if you are only concerned with raw text extraction. If getting the font, style, and spacing correct isn’t a deciding factor, what is? Nine times out of ten, it will come down to speed. Thankfully, we have programmers like you in mind! The LEADTOOLS Advantage OCR Engine provides optimization options that bypass formatted text recognition, resulting in an average speed increase of 15-20%. Who said you can’t make everyone happy?