Parsing Text with the Document Library

The LEADDocument class supports parsing the text of any page using LEADTOOLS SVG or OCR technologies. This allows applications to perform actions such as full-text searching, highlighting text on the document, and creating text-based annotation review objects. The LEADTOOLS Document Viewer Library and the Document Viewer Demo are examples of this.

Text can be parsed in one of two ways:

If the document type supports SVG (Scalable Vector Graphics), then the text can be parsed from the SVG data directly. This provides 100% accuracy, speed, and support for any language, and it excludes logos and other graphics items from the text result.

Searchable PDF files, Microsoft Office documents (DOC/DOCX, XLS/XLSX, PPT/PPTX), HTML, ePub, Text, SVG, CAD files (DWG, DXF, DWF), and IOCA/MODCA are just a few of the file formats that LEADTOOLS can parse using the SVG engine.

If the document type does not support SVG, then the LEADTOOLS OCR engine can be used to parse the text. The LEADDocument class will perform the recognition operation internally using the OCR settings provided by the user (such as which languages and spell check engines to use) to parse the text and return it.

Raster PDF files, TIFF, JPEG, and PNG are examples of such formats. These are raster image formats that do not contain any text data. However, OCR can be used to recognize and read any text from the image.

It is preferable to extract the text data using the SVG engine for 100% accuracy and maximum speed. If the SVG data is not available, then OCR should be used. The LEADDocument class provides support for performing the above automatically while hiding all the internal details. The user of the class will obtain the text data in the same manner regardless of whether SVG or OCR was used.

The text can be obtained per page using GetText. This will return a DocumentPageText object that contains information about each character found on the page including its location, size, and code. This information is uniform regardless of whether SVG or OCR is used. The class also contains helper methods to organize these characters into words, lines, or a simple string object. Refer to DocumentPageText for more information.
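As a rough illustration of how per-character data can be organized into words and a plain string, the following Python sketch uses a hypothetical Char record and helper functions. None of these names belong to the LEADTOOLS API, and the end-of-word and end-of-line flags are assumptions made for the example. Refer to DocumentPageText for the actual members.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical per-character record (not a LEADTOOLS type)."""
    code: str                 # the character itself
    x: float                  # left edge of the character box
    y: float                  # top edge of the character box
    width: float
    height: float
    ends_word: bool = False   # assumed flag: last character of a word
    ends_line: bool = False   # assumed flag: last character of a line


def build_words(chars):
    """Group characters into words using the end-of-word/line flags."""
    words, current = [], []
    for ch in chars:
        current.append(ch)
        if ch.ends_word or ch.ends_line:
            words.append(current)
            current = []
    if current:  # trailing characters with no closing flag
        words.append(current)
    return words


def word_text(word):
    """Return the string spelled by one word."""
    return "".join(ch.code for ch in word)


def word_bounds(word):
    """Return (x, y, width, height) of the box enclosing a word."""
    left = min(ch.x for ch in word)
    top = min(ch.y for ch in word)
    right = max(ch.x + ch.width for ch in word)
    bottom = max(ch.y + ch.height for ch in word)
    return (left, top, right - left, bottom - top)


def build_text(chars):
    """Flatten the characters into a single string."""
    parts = []
    for ch in chars:
        parts.append(ch.code)
        if ch.ends_line:
            parts.append("\n")
        elif ch.ends_word:
            parts.append(" ")
    return "".join(parts).rstrip()


# Two short lines: "Hi all" and "ok"
sample = [
    Char("H", 0, 0, 5, 10),
    Char("i", 5, 0, 3, 10, ends_word=True),
    Char("a", 12, 0, 5, 10),
    Char("l", 17, 0, 2, 10),
    Char("l", 19, 0, 2, 10, ends_line=True),
    Char("o", 0, 12, 5, 10),
    Char("k", 5, 12, 5, 10, ends_line=True),
]

print(build_text(sample))
```

Keeping the location and size on every character is what makes features such as highlighting a search hit possible: the box for a word is simply the union of its character boxes.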

If caching is used with the document, then subsequent calls to GetText will fetch the data from the cache instead of parsing the page again, which speeds up the operation.
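The caching behavior can be pictured with a minimal Python sketch. The CachedPageText class, its in-memory dictionary, and the parse_page callback are all hypothetical stand-ins. The sketch shows only the idea of parsing a page once and serving later requests from the cache, not how LEADTOOLS implements its cache.

```python
class CachedPageText:
    """Hypothetical wrapper that parses each page at most once."""

    def __init__(self, parse_page):
        self._parse_page = parse_page   # expensive parser (SVG or OCR stand-in)
        self._cache = {}                # page number -> parsed text
        self.parse_count = 0            # how many real parses were performed

    def get_text(self, page_number):
        if page_number not in self._cache:
            # First request for this page: parse and remember the result.
            self.parse_count += 1
            self._cache[page_number] = self._parse_page(page_number)
        # Later requests are answered from the cache.
        return self._cache[page_number]


reader = CachedPageText(lambda n: f"text of page {n}")
first = reader.get_text(1)
second = reader.get_text(1)
print(first == second, reader.parse_count)
```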

When GetText is called, the LEADDocument object will use the options set in DocumentText to determine how to parse the text. These settings are in the Text property and are global to all the pages of the document. These settings include:

DocumentText.TextExtractionMode. This is set to DocumentTextExtractionMode.Auto by default, which means SVG is used if supported; otherwise, OCR is used. Change this value to disable SVG or OCR if your application requires it. Note that if you set the value to a mode that is not available (for example, DocumentTextExtractionMode.SvgOnly when the document type does not support SVG), then GetText will succeed but will return an empty DocumentPageText object.
DocumentText.ImagesRecognitionMode. This is set to DocumentTextImagesRecognitionMode.Auto by default, and indicates how to treat the image elements encountered in the SVG representation of the page during text extraction.

In all cases, a DocumentPageText object will be returned to the user with the same exact information regardless of the extraction mode.
The OCR engine instance set in the service. This is an instance of any LEADTOOLS OCR Engine that will be used when OCR is invoked by the document. Internally, the engine will create an OCR page for the image of the page, call recognize, and then parse the results into a DocumentPageText object.
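The selection logic described for the extraction mode can be sketched as follows. This is a simplified Python model under stated assumptions, not LEADTOOLS code: ExtractionMode and extract_text are hypothetical names, the OCR_ONLY member is assumed as the counterpart of the SvgOnly value mentioned above, and an empty string stands in for an empty DocumentPageText object.

```python
from enum import Enum


class ExtractionMode(Enum):
    """Hypothetical stand-in for the extraction mode setting."""
    AUTO = "auto"          # SVG if supported, otherwise OCR
    SVG_ONLY = "svg_only"  # never fall back to OCR
    OCR_ONLY = "ocr_only"  # assumed member: never use SVG


def extract_text(mode, svg_text, ocr):
    """Pick a text source for one page.

    svg_text: text parsed from SVG, or None if the format has no SVG support.
    ocr: zero-argument callable that runs recognition, or None if no engine is set.
    """
    if mode in (ExtractionMode.AUTO, ExtractionMode.SVG_ONLY) and svg_text is not None:
        return svg_text
    if mode in (ExtractionMode.AUTO, ExtractionMode.OCR_ONLY) and ocr is not None:
        return ocr()
    # The requested mode is not available: succeed with an empty result.
    return ""


# A raster page (no SVG data) with an OCR engine available.
print(repr(extract_text(ExtractionMode.AUTO, None, lambda: "recognized text")))
print(repr(extract_text(ExtractionMode.SVG_ONLY, None, lambda: "recognized text")))
```

Passing OCR as a callable keeps the expensive recognition step from running unless the selected mode actually needs it.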

LEADTOOLS OCR and SVG technologies are completely thread-safe and any number of pages can be parsed at the same time from any number of threads.
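Because pages can be parsed concurrently, an application might distribute the work across a thread pool. The Python sketch below shows that pattern only. Here parse_page is a hypothetical placeholder for whatever call retrieves the text of one page, and the worker count is an arbitrary choice for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_page(page_number):
    """Hypothetical placeholder for parsing one page (SVG or OCR)."""
    return f"text of page {page_number}"


def parse_all_pages(page_numbers, max_workers=4):
    """Parse several pages in parallel and return {page number: text}."""
    page_numbers = list(page_numbers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with page_numbers.
        texts = pool.map(parse_page, page_numbers)
        return dict(zip(page_numbers, texts))


results = parse_all_pages(range(1, 4))
for number, text in results.items():
    print(number, text)
```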

For an example, refer to GetText.
