Parsing Text with the Document Library

The LEADDocument class supports parsing the text of any page using LEADTOOLS SVG or OCR technologies. This allows applications to perform actions such as full-text search, highlighting text on the document, and creating text-based annotation review objects. The LEADTOOLS Document Viewer Library and the Document Viewer Demo are two such examples.

Text can be parsed in one of two ways: from the SVG data of the page, or with OCR.

It is preferable to extract the text data using the SVG engine for 100% accuracy and maximum speed. If the SVG data is not available, then OCR should be used. The LEADDocument class provides support for performing the above automatically while hiding all the internal details. The user of the class will obtain the text data in the same manner regardless of whether SVG or OCR was used.
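The SVG-first, OCR-fallback choice can be pictured with a small sketch. This is an illustration only: the `Page` type and the `extract_from_svg`, `extract_with_ocr`, and `get_text` functions are hypothetical stand-ins written in Python, not part of the LEADTOOLS API, which makes this choice internally.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Page:
    """Hypothetical stand-in for a document page (not a LEADTOOLS type)."""
    svg_text: Optional[str]  # text carried by SVG data, or None if unavailable
    raster_text: str         # what a simulated OCR pass would recognize


def extract_from_svg(page: Page) -> Optional[str]:
    # Hypothetical: returns the text when SVG data exists, otherwise None.
    return page.svg_text


def extract_with_ocr(page: Page) -> str:
    # Hypothetical: simulates the slower OCR pass.
    return page.raster_text


def get_text(page: Page) -> Tuple[str, str]:
    """Return (text, source), preferring SVG and falling back to OCR."""
    text = extract_from_svg(page)
    if text is not None:
        return text, "svg"
    return extract_with_ocr(page), "ocr"
```

The caller receives the same kind of result in both cases; only the `source` label, included here for illustration, reveals which path was taken.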

The text can be obtained per page using the DocumentPage.GetText method. This will return a DocumentPageText object that contains information on each character found on the page including its location, size and code. This information is unified regardless of whether SVG or OCR was used. The class also contains helper methods to organize these characters into words, lines or a simple string object. For more information, refer to DocumentPageText.
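To make the character-level model concrete, here is a minimal sketch of grouping per-character records into words and a plain string. The `Char` record, its `ends_word` flag, and the `build_words` and `build_text` helpers are assumptions made for illustration in Python; they are not the DocumentPageText API, which provides its own helpers for this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    """Hypothetical per-character record: character code plus bounding box."""
    code: str
    x: float
    y: float
    width: float
    height: float
    ends_word: bool = False  # assumed flag marking the last character of a word


def build_words(chars: List[Char]) -> List[str]:
    """Group consecutive characters into words using the ends_word flag."""
    words: List[str] = []
    current: List[str] = []
    for ch in chars:
        current.append(ch.code)
        if ch.ends_word:
            words.append("".join(current))
            current = []
    if current:  # flush a trailing word that was never explicitly closed
        words.append("".join(current))
    return words


def build_text(chars: List[Char]) -> str:
    """Join the words into a single space-separated string."""
    return " ".join(build_words(chars))
```

Keeping the location and size on every character is what makes features such as highlighting a search hit possible, since the bounding boxes of the matched characters can be combined into a highlight region.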

If caching is used with the document, subsequent calls to DocumentPage.GetText fetch the data from the cache instead of parsing the page again, which speeds up the operation.
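The effect of caching can be sketched as simple memoization keyed by document and page. The `PageTextCache` class below is a hypothetical Python illustration of the idea, not the LEADTOOLS caching mechanism, and its `parse` callback stands in for whatever performs the actual SVG or OCR parsing.

```python
from typing import Callable, Dict, Tuple


class PageTextCache:
    """Hypothetical cache: parse a page once, then serve the stored text."""

    def __init__(self, parse: Callable[[str, int], str]) -> None:
        self._parse = parse
        self._store: Dict[Tuple[str, int], str] = {}
        self.parse_calls = 0  # counts how often the expensive parse actually ran

    def get_text(self, document_id: str, page_number: int) -> str:
        key = (document_id, page_number)
        if key not in self._store:
            # Cache miss: run the expensive parse and remember the result.
            self.parse_calls += 1
            self._store[key] = self._parse(document_id, page_number)
        return self._store[key]
```

In this sketch, a second request for the same page returns the stored text without invoking `parse` again.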

When DocumentPage.GetText is called, the LEADDocument object will use the options set in DocumentText to determine how the text is parsed. These settings are in the LEADDocument.Text property and are global to all the pages of the document. These include:

LEADTOOLS OCR and SVG technologies are completely thread safe, and the user can parse any number of pages at the same time from any number of threads.
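Because page parsing is thread safe, an application can fan the work out over a thread pool. The sketch below shows the pattern in Python; `parse_page` is a hypothetical stand-in for a per-page text call, and `parse_pages` and `max_workers` are names chosen here for illustration, not LEADTOOLS identifiers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def parse_page(page_number: int) -> str:
    # Hypothetical stand-in for parsing the text of a single page.
    return f"text of page {page_number}"


def parse_pages(
    page_numbers: Iterable[int],
    parse: Callable[[int], str] = parse_page,
    max_workers: int = 4,
) -> List[str]:
    """Parse several pages concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which worker finishes first.
        return list(pool.map(parse, page_numbers))
```

A bounded pool is used here so that the number of simultaneous parses stays predictable; the right size depends on the workload and hardware.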

For an example, refer to DocumentText and DocumentPage.GetText.

Help Version 20.0.2018.9.5
© 1991-2018 LEAD Technologies, Inc. All Rights Reserved.
