Directory Word Search for Images and Documents: 25 Projects in 25 Days

Directory Word Search Screenshot
Directory Word Search Folder Selection
Directory Word Search In Progress
Directory Word Search Complete

As part of the LEAD Technologies 25th anniversary, we are creating 25 projects in 25 days to celebrate LEAD’s depth of features and ease of use. Today’s project comes from Nathan.

What it Does

Using LEADTOOLS Version 19, this application searches both raster image and document files for text and returns the files that match.

Features Used

Development Progress Journal

Hello, my name is Nathan and I am going to write an application that will search a directory of files for a specific word, regardless of whether the files are raster images or documents. With LEADTOOLS I should be able to find the word either way.

I am going to start by building a UI, and then I will work on getting the word search to work with just document formats using our Document class.
Documentation: Document class
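The post does not show how the word is matched once the text of a page is available, so here is a minimal sketch of one way to do it in C#. It assumes the page text has already been extracted into a plain string (by the Document class or otherwise); the WordMatcher class and ContainsWord method are hypothetical names, and whole-word, case-insensitive matching is an assumption rather than the project's confirmed behavior.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of LEADTOOLS or the sample project.
public static class WordMatcher
{
    // Returns true when 'word' occurs in 'pageText' as a whole word, ignoring case.
    public static bool ContainsWord(string pageText, string word)
    {
        if (string.IsNullOrEmpty(pageText) || string.IsNullOrWhiteSpace(word))
            return false;

        // Regex.Escape makes characters such as '.' or '+' match literally.
        // The lookarounds require that no letter, digit, or underscore touches
        // the term on either side, so "quick" does not match inside "quickly".
        string pattern = @"(?<!\w)" + Regex.Escape(word.Trim()) + @"(?!\w)";

        return Regex.IsMatch(
            pageText,
            pattern,
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }
}
```

A file would count as a match as soon as any one of its pages returns true. If substring matches are wanted instead, the two lookarounds can simply be dropped.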

That didn’t take long at all, maybe 30 minutes. Now I need to add OCR capabilities so that we can check raster images for the word as well. Since OCR is slower, I will add a check box that lets the user opt in to OCR.
Documentation: IOcrEngine

That only took about an hour, plus I had to move some pieces around. Everything currently runs on my UI thread, which is going to cause some complications, since a long search blocks the interface. I am going to create a background thread to do all of the work, and add some UI states that limit what the user can do while a search is running.

That took about 2 hours; I’m relatively new to multi-threading in .NET. Now the UI thread is separate from all the work that is being done, and the user can pause or cancel the operation at hand.
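For readers who are also new to this, the pause and cancel behavior described above can be sketched with standard .NET primitives. This is an illustration under assumptions, not the code from the sample project: the SearchWorker class and its members are hypothetical names, Task.Run stands in for whatever background thread mechanism the project uses, and processFile is a placeholder for the real per-file search.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs a per-file action off the UI thread and lets
// the UI pause, resume, or cancel the work between files.
public sealed class SearchWorker : IDisposable
{
    private readonly CancellationTokenSource _cts = new CancellationTokenSource();

    // Set = running, reset = paused. Starts in the running state.
    private readonly ManualResetEventSlim _resume = new ManualResetEventSlim(true);

    public Task Start(IEnumerable<string> files, Action<string> processFile)
    {
        CancellationToken token = _cts.Token;

        return Task.Run(() =>
        {
            foreach (string file in files)
            {
                // Blocks while paused and throws OperationCanceledException
                // once Cancel() has been called.
                _resume.Wait(token);
                processFile(file);
            }
        }, token);
    }

    public void Pause()  { _resume.Reset(); }
    public void Resume() { _resume.Set(); }
    public void Cancel() { _cts.Cancel(); }

    public void Dispose()
    {
        _cts.Dispose();
        _resume.Dispose();
    }
}
```

In this sketch a Pause button would call Pause(), a Resume button Resume(), and a Cancel button Cancel(); the returned task ends in the Canceled state after a cancel. Pausing and cancelling take effect between files, not in the middle of one. Any UI update made from inside processFile still has to be marshaled back to the UI thread, for example with Control.Invoke in Windows Forms.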

I’d like to speed things up, since going through a large directory one file at a time can take quite a while. I’m going to use Parallel.ForEach loops to process not only multiple documents but also multiple pages within each document at the same time.

That took about 2 hours. I ran into a few complications I had to work out, but luckily LEADTOOLS OCR and the Document class are thread-safe, so processing multiple documents and multiple pages at the same time didn’t require any extra coding.
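The nested loop structure described above might look roughly like the following. This is a sketch under assumptions and not the project's source: ParallelSearch and FindMatchingFiles are hypothetical names, and the getPageNumbers and pageContainsWord delegates are placeholders for the LEADTOOLS calls that enumerate a document's pages and check a page for the search term. Stopping a document at its first matching page is also an assumption.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper showing an outer parallel loop over files and an
// inner parallel loop over the pages of each file.
public static class ParallelSearch
{
    public static List<string> FindMatchingFiles(
        IEnumerable<string> files,
        Func<string, IEnumerable<int>> getPageNumbers,   // placeholder: pages of a file
        Func<string, int, bool> pageContainsWord)        // placeholder: search one page
    {
        // Thread-safe collection, because several files can match at once.
        var matches = new ConcurrentBag<string>();

        Parallel.ForEach(files, file =>
        {
            int found = 0; // 0 = no match yet, 1 = at least one page matched

            Parallel.ForEach(getPageNumbers(file), (pageNumber, loopState) =>
            {
                if (pageContainsWord(file, pageNumber))
                {
                    Interlocked.Exchange(ref found, 1);
                    // One hit is enough for this file; do not start more pages.
                    loopState.Stop();
                }
            });

            if (found == 1)
                matches.Add(file);
        });

        // ConcurrentBag has no defined order, so sort for a stable result.
        return matches.OrderBy(f => f, StringComparer.OrdinalIgnoreCase).ToList();
    }
}
```

Both delegates are called from several threads at once, so whatever they do has to be thread-safe. If the nested loops use more of the machine than is comfortable, Parallel.ForEach also accepts a ParallelOptions argument whose MaxDegreeOfParallelism property caps the number of concurrent operations.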

Now we can search a directory full of documents for a specific word and open matching documents as they are found, without interrupting the search. It is also much faster than before, since several documents, and several pages within each document, are processed at the same time.

I’m going to do some code cleanup and then I’ll be done.

After about 20 minutes of code cleanup, I am finished! In less than 6 hours I was able to create a complete multi-threaded application that searches a directory full of raster images or documents for text. That’s pretty astounding, and there is no way I would have been able to accomplish this without LEADTOOLS.

A future improvement for this demo would be to load the images into the viewer and highlight the area(s) where the search term is found.

Download the Project

The source code for this sample project can be downloaded from here. To run the project, extract it to the C:\LEADTOOLS 19\Examples\DotNet\CS directory.
