
OCR Confidence Reporting

Some applications need to know how reliable the recognized text generated by the engine is. Such applications may require additional confidence information for the recognized characters and/or words.

The engine can provide confidence information for the correctness of the recognized text in two different ways:

The Engine's output-marking feature (see: OCR Engine Specific Settings) enables the IOcrDocument.Save, IOcrDocument.SaveXml or IOcrPage.GetText methods to place a user-defined character sequence into the final output document before suspicious characters and/or words (recognition results with low confidence). Alternatively, the suspicious characters and/or words can be rendered in a particular color in the output document.

In the second approach, the Engine generates structured output with one structure (record) for each recognized character. The character code of the recognized entity is the primary field. Other fields include the coordinates of the character on the image, the zone to which the character belongs, the font information for the character, and the confidence information.

Output-marking is supported by most output converters. Marking low-confidence recognition results with color requires an output format that supports colored text (e.g. MS Word).

A possible output for the marking feature might be as follows:

"We would like to ask you some questions, ta*king around 15 minutes"

The previous text extract was generated using the output-marking feature, in which the asterisk ('*') character was set to mark the suspiciously recognized characters in the output.
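The effect of the marker can be illustrated with a minimal sketch. This is not the Engine's implementation and it calls no SDK function: the `mark_suspicious` helper and the `(character, is_suspicious)` input pairs are hypothetical names used only to show where the marker lands relative to a suspicious character.

```python
def mark_suspicious(chars, marker="*"):
    """Join characters, placing `marker` before each suspicious one.

    `chars` is an iterable of (character, is_suspicious) pairs. Both the
    input shape and this helper are assumptions made for illustration.
    """
    out = []
    for ch, suspicious in chars:
        if suspicious:
            out.append(marker)  # the marker goes in front of the character
        out.append(ch)
    return "".join(out)


# The word "taking" with the 'k' flagged as suspicious.
sample = [("t", False), ("a", False), ("k", True),
          ("i", False), ("n", False), ("g", False)]
print(mark_suspicious(sample))  # prints ta*king
```

In the real feature the Engine decides which characters are suspicious; the flags above are supplied by hand.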

More information can be retrieved directly into application memory by calling IOcrPage.GetRecognizedCharacters, just after calling IOcrPage.Recognize and IOcrPage.GetText. The IOcrPage.GetRecognizedCharacters call provides the most detailed information about the recognized data. It returns an OcrCharacter structure for each recognized character.

Three properties of the OcrCharacter structure provide recognition confidence information: OcrCharacter.Confidence, OcrCharacter.WordIsCertain and OcrCharacter.LeadingSpacesConfidence.

The OcrCharacter.WordIsCertain property indicates whether the Engine is certain about the word that this character is part of.

The OcrCharacter.Confidence property expresses the certainty of the character recognition, and ranges between 0 and 100. A value of 100 means that the Engine recognized the character with high confidence. In some cases, some or all of the characters in a word may be individually suspicious even though OcrCharacter.WordIsCertain does not mark the word as suspicious. This is usually a result of language or user dictionary checking: the word was validated by the checking subsystem.

The OcrCharacter.LeadingSpacesConfidence property ranges between 0 and 100, and it expresses the confidence of the value in the OcrCharacter.LeadingSpaces property of the structure (i.e., whether the Engine is certain about the amount of space estimated to be in front of the recognized character).
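As a rough illustration of how an application might use the spacing confidence, the sketch below lists the characters whose preceding space estimate is doubtful. The `RecognizedChar` class is a hypothetical Python stand-in for the relevant OcrCharacter fields, not the SDK type. Modeling LeadingSpaces as an integer count is an assumption, and the cutoff of 50 is an arbitrary value chosen for the example, not a documented recommendation.

```python
from dataclasses import dataclass


@dataclass
class RecognizedChar:
    """Hypothetical stand-in for the spacing-related OcrCharacter fields."""
    code: str
    leading_spaces: int             # assumed: estimated spaces before the character
    leading_spaces_confidence: int  # 0..100


def uncertain_spacing(chars, min_confidence):
    """Return indexes of characters preceded by a doubtful space estimate."""
    return [
        i for i, c in enumerate(chars)
        if c.leading_spaces > 0 and c.leading_spaces_confidence < min_confidence
    ]


# "to day": the Engine is unsure whether a space really precedes the 'd'.
chars = [
    RecognizedChar("t", 0, 100),
    RecognizedChar("o", 0, 100),
    RecognizedChar("d", 1, 30),
    RecognizedChar("a", 0, 100),
    RecognizedChar("y", 0, 100),
]
print(uncertain_spacing(chars, min_confidence=50))  # prints [2]
```

An application could route such positions to manual review, for example to decide between "to day" and "today".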

Applications that examine character confidence information can use a threshold value, below which a character is treated as a suspicious result. A threshold of 64 is recommended for this purpose: a confidence of 64 or more indicates high confidence that the character was recognized correctly, and a confidence below 64 marks the character as suspicious.
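The threshold test can be sketched as follows. `RecognizedChar` is again a hypothetical Python stand-in for OcrCharacter rather than the SDK type, and the extra "word validated" category is one possible way for an application to combine Confidence with WordIsCertain, not behavior prescribed by the Engine.

```python
from dataclasses import dataclass

SUSPICIOUS_THRESHOLD = 64  # threshold value given in this topic


@dataclass
class RecognizedChar:
    """Hypothetical stand-in for the confidence fields of OcrCharacter."""
    code: str
    confidence: int        # 0..100
    word_is_certain: bool


def classify(ch, threshold=SUSPICIOUS_THRESHOLD):
    """Classify one character using the threshold and the word-level flag."""
    if ch.confidence >= threshold:
        return "confident"
    if ch.word_is_certain:
        # Low character confidence, but the word was accepted as certain
        # (for example by dictionary checking). Application-level choice.
        return "low confidence, word validated"
    return "suspicious"


for ch in (RecognizedChar("a", 98, True),
           RecognizedChar("k", 41, False),
           RecognizedChar("l", 52, True)):
    print(ch.code, classify(ch))
```

Note that a confidence of exactly 64 counts as confident, matching the "64 or more" rule above.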