Recognizing Document Pages

Each zone on a page is associated with a recognition module through the ZONEDATA.RecogModule member. The recognition module indicates the type of information contained in the zone and how that information should be recognized. Depending on the type of recognition module, additional options may be available for use during recognition. For example, if a zone is associated with a Multi-lingual Omnifont Recognition module (MOR), the options for this module can be set using L_DocSetMOROptions. To get the current MOR options, use L_DocGetMOROptions.

Similarly, if a zone is associated with a Hand Printed Numeral Recognition module, then other recognition options can be set using L_DocSetHandPrintOptions or retrieved using L_DocGetHandPrintOptions. If the zone is associated with an Optical Mark Recognition module (OMR), other recognition options can be set using L_DocSetOMROptions and retrieved using L_DocGetOMROptions.

For some general information about available recognition modules, refer to An Overview of Recognition Modules.

Depending on the type of recognition module associated with a zone, it may be beneficial to trade recognition accuracy against recognition speed. Using the L_DocSetRecognizeModuleTradeOff function, you can tell the OCR engine to perform the most accurate recognition, the fastest recognition, or a balanced recognition. To get the current trade-off setting for the OCR engine, call L_DocGetRecognizeModuleTradeOff.

If the host PC has two processors or a hyper-threaded one, using the Parallel Recognition Mode can speed up the recognition process by allowing the two recognition engines to run in parallel. The Parallel Recognition Mode may be used when a zone is associated with any of the following recognition modules: RECOGMODULE_MTEXT_OMNIFONT, RECOGMODULE_OMNIFONT_FRX, or RECOGMODULE_OMNIFONT_PLUS3W. To determine whether the Parallel Recognition Mode is enabled, call L_DocIsParallelRecognitionEnabled. To enable or disable the Parallel Recognition Mode, call L_DocEnableParallelRecognition.

When all necessary recognition options have been set, the page(s) can be recognized by calling L_DocRecognize. To get information on the status of the recognition process during recognition, pass a valid pointer to a RECOGNIZESTATUSCALLBACK function to the L_DocRecognize function.
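The status-callback idea described above can be sketched in a self-contained way. This is only an illustration of the general pattern (a callback plus an opaque user-data pointer), not the toolkit's actual declarations: the StatusCallback type, the ProgressState structure, and the MockRecognize function are hypothetical stand-ins, and the real RECOGNIZESTATUSCALLBACK and L_DocRecognize signatures should be taken from the toolkit headers.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a status callback type. The real
   RECOGNIZESTATUSCALLBACK signature is defined by the toolkit headers. */
typedef int (*StatusCallback)(int pageIndex, int percentDone, void *userData);

/* Caller-owned state handed to the callback through the user-data pointer. */
typedef struct {
    int callsSeen;        /* number of status notifications received */
    int lastPercent;      /* most recent progress value */
    int cancelAtPercent;  /* negative means "never cancel" */
} ProgressState;

/* Returns 1 to continue, 0 to ask the (mock) engine to stop. */
static int OnRecognizeStatus(int pageIndex, int percentDone, void *userData)
{
    ProgressState *state = (ProgressState *)userData;
    assert(state != NULL);

    state->callsSeen++;
    state->lastPercent = percentDone;
    printf("page %d: %d%%\n", pageIndex, percentDone);

    if (state->cancelAtPercent >= 0 && percentDone >= state->cancelAtPercent)
        return 0;
    return 1;
}

/* Mock engine (assumption, not the real API): reports progress in 25% steps.
   Returns 1 when the page completes, 0 when the callback cancels. */
static int MockRecognize(int pageIndex, StatusCallback callback, void *userData)
{
    int percent;
    for (percent = 0; percent <= 100; percent += 25) {
        if (callback != NULL && !callback(pageIndex, percent, userData))
            return 0;
    }
    return 1;
}
```

A caller would declare a ProgressState, pass its address as the user-data argument, and inspect it after the call returns. Passing a null callback in this sketch simply skips the notifications.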

After recognition is complete, the recognized characters can be obtained and the recognition results can be saved to a file or to memory.

The collection of characters recognized for a specific page can be obtained using L_DocGetRecognizedCharacters. To add any characters to this collection of recognized characters, call L_DocSetRecognizedCharacters. When this collection of recognized characters is no longer needed, it should be freed by calling L_DocFreeRecognizedCharacters.
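The get, modify, set, free sequence described above can be illustrated with a small self-contained mock. Everything here is hypothetical: MockPage, MockGetChars, MockSetChars, and MockFreeChars stand in for the toolkit's document handle and the L_DocGetRecognizedCharacters, L_DocSetRecognizedCharacters, and L_DocFreeRecognizedCharacters functions, whose real signatures and copy semantics are not shown here. The sketch assumes that "get" returns a caller-owned copy and that "set" copies its input.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical engine-side storage for one page's recognized characters. */
typedef struct {
    char  *chars;
    size_t count;
} MockPage;

/* Mock "get": hands back a copy that the caller must release
   with MockFreeChars. Returns 1 on success, 0 on allocation failure. */
static int MockGetChars(const MockPage *page, char **outChars, size_t *outCount)
{
    assert(page != NULL && outChars != NULL && outCount != NULL);
    *outChars = NULL;
    *outCount = 0;
    if (page->count == 0)
        return 1;
    *outChars = malloc(page->count);
    if (*outChars == NULL)
        return 0;
    memcpy(*outChars, page->chars, page->count);
    *outCount = page->count;
    return 1;
}

/* Mock "set": replaces the page's collection with a copy of the input. */
static int MockSetChars(MockPage *page, const char *chars, size_t count)
{
    char *copy = NULL;
    assert(page != NULL);
    if (count > 0) {
        copy = malloc(count);
        if (copy == NULL)
            return 0;
        memcpy(copy, chars, count);
    }
    free(page->chars);
    page->chars = copy;
    page->count = count;
    return 1;
}

/* Mock "free": releases a collection obtained from MockGetChars. */
static void MockFreeChars(char **chars)
{
    free(*chars);
    *chars = NULL;
}

/* Appends one character using the get / modify / set / free sequence. */
static int AppendChar(MockPage *page, char extra)
{
    char  *chars = NULL;
    char  *grown;
    size_t count = 0;
    int    ok;

    if (!MockGetChars(page, &chars, &count))
        return 0;
    grown = realloc(chars, count + 1);
    if (grown == NULL) {
        MockFreeChars(&chars);
        return 0;
    }
    chars = grown;
    chars[count] = extra;
    ok = MockSetChars(page, chars, count + 1);
    MockFreeChars(&chars);   /* the caller's copy is freed whether or not set succeeded */
    return ok;
}
```

The point of the sketch is the ownership rule: the collection obtained from the "get" call is released exactly once, after the updated collection has been handed back.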

Once the characters for a specific page have been determined using L_DocGetRecognizedCharacters, L_DocGetRecognizedWords can be called to combine the recognized characters into words. To change the contents of the recognized words, change the set of recognized characters by calling L_DocSetRecognizedCharacters. To save the updated recognized characters to a file, call L_DocSaveResultsToFile. To save the results into memory, call L_DocSaveResultsToMemory. When the collection of recognized words is no longer needed, it should be freed by calling L_DocFreeRecognizedWords.
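To make the relationship between recognized characters and recognized words concrete, the following sketch groups a flat character array into words by splitting on whitespace. It is not how L_DocGetRecognizedWords is implemented, and the SketchChar and SketchWord structures are hypothetical simplifications of whatever the toolkit actually returns. It only shows why changing the character collection changes the resulting words.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one recognized character. */
typedef struct {
    char code;        /* recognized character */
    int  confidence;  /* 0..100, illustrative only */
} SketchChar;

/* Hypothetical word record: a run of characters inside the char array. */
typedef struct {
    size_t first;          /* index of the word's first character */
    size_t length;         /* number of characters in the word */
    int    minConfidence;  /* weakest character confidence in the word */
} SketchWord;

/* Splits chars into whitespace-delimited words.
   Writes at most maxWords entries and returns the number written. */
static size_t GroupCharsIntoWords(const SketchChar *chars, size_t charCount,
                                  SketchWord *words, size_t maxWords)
{
    size_t i;
    size_t wordCount = 0;
    int    inWord = 0;

    assert(chars != NULL || charCount == 0);
    assert(words != NULL || maxWords == 0);

    for (i = 0; i < charCount; i++) {
        char c = chars[i].code;
        int  isSeparator = (c == ' ' || c == '\t' || c == '\n');

        if (isSeparator) {
            inWord = 0;
            continue;
        }
        if (!inWord) {
            if (wordCount == maxWords)
                break;   /* output buffer is full */
            words[wordCount].first = i;
            words[wordCount].length = 0;
            words[wordCount].minConfidence = chars[i].confidence;
            wordCount++;
            inWord = 1;
        }
        words[wordCount - 1].length++;
        if (chars[i].confidence < words[wordCount - 1].minConfidence)
            words[wordCount - 1].minConfidence = chars[i].confidence;
    }
    return wordCount;
}
```

Because the words are derived from the characters, editing a word in this model means editing the underlying characters and regrouping, which mirrors the workflow described above.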

The recognition results can be saved to a file by calling L_DocSaveResultsToFile. The type of material exported, the way it is stored, and the output file type can all be controlled using L_DocSetRecognitionResultOptions. To get the current recognition results settings, call L_DocGetRecognitionResultOptions.

When saving recognition results to a file, you can use L_DocEnumOutputFileFormats to enumerate all available output file formats supported by the OCR engine. This function will report each file format to an ENUMOUTPUTFILEFORMATS callback function. To get specific information about a particular output file format, call L_DocGetTextFormatInfo.

The recognition results can also be saved to memory by calling L_DocSaveResultsToMemory. When the memory is no longer needed, it should be freed by calling L_DocFreeMemoryResults.

To get or set special characters used in the recognition process, use L_DocGetSpecialChar and L_DocSetSpecialChar.

Finally, to get the status of the OCR engine at any time, use L_DocGetStatus.