
How to Clean up a Document Image

Few things have as great an impact on document imaging as cleanup. Its benefits reach far beyond improved visuals and readability. Document image cleanup is the salt to the meatier document imaging technologies like OCR, barcode, PDF, forms recognition, and archival: it enhances them.

  • Accuracy – Images are rarely perfect. Colors, angles, and imperfections in the original document all have an effect on the accuracy of recognition technologies. Once the image is correctly aligned and the obstacles around the important areas are removed, the recognition processes can scan the image for the predictable patterns that make up the text and data you wish to extract.
  • Compression – Most compression algorithms work by grouping pixels in ways that use fewer bits yet still reconstruct the original uncompressed data (or something close to it, for lossy methods). In the document world, this is especially true of black and white images. With fewer unnecessary artifacts like dots, hole punches, and borders, the single-color runs are longer and therefore compress better.
  • Speed – With fewer unnecessary pixels getting in the way, nearly every algorithm can do its work more efficiently.
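
To see why longer runs compress better, here is a minimal sketch of run-length encoding for a single row of a bitonal image. It is only an illustration: RunLengthSketch and EncodeRow are made-up names, not part of LEADTOOLS, and real bitonal codecs are considerably more sophisticated than this.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not a LEADTOOLS class.
public static class RunLengthSketch
{
    // Encodes one scan line as a list of run lengths.
    // true = black pixel, false = white pixel.
    // Runs alternate in color and always start with white, so a row that
    // begins with a black pixel gets a leading white run of length 0.
    public static List<int> EncodeRow(bool[] row)
    {
        var runs = new List<int>();
        bool current = false; // color of the run being counted
        int length = 0;

        foreach (bool pixel in row)
        {
            if (pixel == current)
            {
                length++;
            }
            else
            {
                runs.Add(length); // close the finished run
                current = pixel;
                length = 1;
            }
        }

        runs.Add(length); // close the final run
        return runs;
    }
}
```

A 12-pixel white row with a single black speck at index 5 encodes as three runs (5, 1, 6). With the speck removed, the same row encodes as one run (12), which is the kind of saving cleanup buys you on every scan line.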

Using LEADTOOLS for your document image cleanup

Enough of the why; you probably knew a lot of that already. Let’s get into the fun stuff: how to do it and what’s available in LEADTOOLS! Here are some of the most popular cleanup functions, which can easily be applied to any image to give your more advanced functions a better base to run on.

Inverted Image

Bitonal images can become inverted for many reasons. Scanner settings, an inverted palette, color masks, or conversion from one format to another can all cause the pixels that should be black to be white and vice versa. This function is a great first step that can be run on every image just in case.


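// img is the RasterImage to clean up. The Process flag corrects the page if it is detected as inverted.
InvertedPageCommand invertedPage = new InvertedPageCommand(InvertedPageCommandFlags.Process);
invertedPage.Run(img);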
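
For intuition only, one simple way to guess that a bitonal page is inverted is to check whether black pixels outnumber white ones, since text pages are mostly background. The sketch below assumes that heuristic. It does not describe how InvertedPageCommand makes its decision, and the class and method names are hypothetical.

```csharp
// Hypothetical helper for illustration only; not a LEADTOOLS class.
public static class InvertedPageSketch
{
    // Returns true when more than half of the pixels are black.
    // true = black pixel, false = white pixel.
    public static bool LooksInverted(bool[,] pixels)
    {
        long black = 0;
        foreach (bool pixel in pixels)
        {
            if (pixel) black++;
        }
        return black * 2 > pixels.Length;
    }

    // Flips every pixel in place: black becomes white and vice versa.
    public static void Invert(bool[,] pixels)
    {
        for (int y = 0; y < pixels.GetLength(0); y++)
        {
            for (int x = 0; x < pixels.GetLength(1); x++)
            {
                pixels[y, x] = !pixels[y, x];
            }
        }
    }
}
```

A caller would test with LooksInverted first and call Invert only when it returns true, which is why running the check on every image is harmless for pages that are already correct.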

Despeckle

Speckles often come from dust on the original document or the scanner, or from half-toning. Despeckle works on both black speckles on a white background and white speckles on a black background. Run it to remove them and create longer runs of like-colored pixels.


DespeckleCommand despecklePage = new DespeckleCommand();
despecklePage.Run(img);
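
As a rough picture of what despeckling means, the sketch below flips any pixel whose neighbors all have the opposite color, so it treats black specks on white and white specks on black the same way. This is a deliberately naive stand-in, not the algorithm DespeckleCommand uses, and DespeckleSketch and RemoveIsolatedPixels are hypothetical names.

```csharp
// Hypothetical helper for illustration only; not a LEADTOOLS class.
public static class DespeckleSketch
{
    // Returns a copy of the image in which every isolated pixel (one whose
    // existing 8-connected neighbors all have the opposite color) is flipped.
    // true = black pixel, false = white pixel.
    public static bool[,] RemoveIsolatedPixels(bool[,] source)
    {
        int height = source.GetLength(0);
        int width = source.GetLength(1);
        var result = (bool[,])source.Clone();

        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                bool value = source[y, x];
                bool isolated = true;
                int neighbors = 0;

                for (int dy = -1; dy <= 1; dy++)
                {
                    for (int dx = -1; dx <= 1; dx++)
                    {
                        if (dy == 0 && dx == 0) continue;
                        int ny = y + dy, nx = x + dx;
                        if (ny < 0 || ny >= height || nx < 0 || nx >= width) continue;
                        neighbors++;
                        if (source[ny, nx] == value) isolated = false;
                    }
                }

                // Flip only pixels that have neighbors and differ from all of them.
                if (isolated && neighbors > 0) result[y, x] = !value;
            }
        }
        return result;
    }
}
```

A single black pixel in the middle of a white 3x3 block is removed, while two adjacent black pixels are left alone because each has a neighbor of its own color. Real despeckle filters also handle specks larger than one pixel.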

Line Removal

The two most common sources of lines are tables and folded paper. In both cases, the narrow horizontal or vertical line can be detected and removed, even when it intersects with machine-printed or handwritten text. This is a must-have for any recognition technology.


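LineRemoveCommand lnPage = new LineRemoveCommand();
lnPage.Type = LineRemoveCommandType.Horizontal;   // look for horizontal lines
lnPage.Flags = LineRemoveCommandFlags.UseGap;     // honor GapLength below
lnPage.GapLength = 2;                             // largest break tolerated within a line
lnPage.MaximumLineWidth = 5;                      // thicker marks are not treated as lines
lnPage.MinimumLineLength = 200;                   // shorter marks are not treated as lines
lnPage.MaximumWallPercent = 10;                   // share of a line that may consist of wall slices
lnPage.Wall = 7;                                  // slices taller than this count as walls
lnPage.Run(img);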
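
To make the gap and minimum-length settings more concrete, here is a small sketch that finds line candidates on one row of pixels. It assumes, as the property names suggest, that GapLength is the largest break tolerated inside a line and MinimumLineLength is the shortest span that still counts as a line. LineSketch and FindLineSpans are hypothetical names, and the real command does far more, such as handling line width and walls.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration only; not a LEADTOOLS class.
public static class LineSketch
{
    // Scans one row (true = black) and returns the spans that qualify as
    // horizontal line candidates. Breaks of up to gapLength white pixels are
    // bridged; spans shorter than minimumLineLength are discarded.
    public static List<(int Start, int Length)> FindLineSpans(
        bool[] row, int minimumLineLength, int gapLength)
    {
        var spans = new List<(int Start, int Length)>();
        int start = -1;     // first pixel of the current span, -1 when none is open
        int lastBlack = -1; // most recent black pixel in the current span

        for (int x = 0; x < row.Length; x++)
        {
            if (!row[x]) continue;

            // A break wider than gapLength ends the current span.
            if (start >= 0 && x - lastBlack - 1 > gapLength)
            {
                AddIfLongEnough(spans, start, lastBlack, minimumLineLength);
                start = -1;
            }

            if (start < 0) start = x;
            lastBlack = x;
        }

        if (start >= 0) AddIfLongEnough(spans, start, lastBlack, minimumLineLength);
        return spans;
    }

    private static void AddIfLongEnough(
        List<(int Start, int Length)> spans, int start, int end, int minimumLineLength)
    {
        int length = end - start + 1;
        if (length >= minimumLineLength) spans.Add((start, length));
    }
}
```

Take a row with black pixels at 2 through 9 and 12 through 17. With a gap tolerance of 2 the two pieces merge into one 16-pixel span, which passes a minimum length of 10. With a gap tolerance of 1 they stay separate at 8 and 6 pixels, and neither qualifies. That is why a broken table rule can be missed when the gap setting is too strict.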

Border Removal

If a page is scanned at an angle or does not cover the whole flatbed, the scanner fills the uncovered area with a color. If that fill is black, it can be removed.


BorderRemoveCommand borderPage = new BorderRemoveCommand();
borderPage.Run(img);

Hole Punch Removal

This one is pretty cut and dried. If any hole punches were picked up by the scanner and made black, you can eliminate them to restore those areas to match the background.


HolePunchRemoveCommand holePage = new HolePunchRemoveCommand();
holePage.Run(img);

Additional cleanup

The functions above are great general-purpose cleanup functions for any document image. LEADTOOLS offers even more document image processing functions, like Deskew, Min, and Max, which can be used for more precise tuning or which benefit one technology more than another.

Download an Example

One of our support guys recently posted a great image cleanup sample project, which you can download here. If you need LEADTOOLS as well, you can download the latest installer here.

Now get out there and create some amazing document imaging applications while eating some bacon-wrapped red meat. The programmer-friendliness of LEADTOOLS toolkits and the time saved on your project should offset any blood pressure concerns!

Greg: Developer Advocate