Search and Redact Documents and Images with Regular Expressions and OCR


One of the most important aspects of documents and record-keeping is the ability to redact sensitive information. Originally done permanently with black ink on paper, redaction carried over into document imaging as processes and archives went paperless. LEADTOOLS has many ways to redact documents, but today we’re going to look at one you might not have thought of before, and one that would be right at home within any ECM or eDiscovery application.

Our Annotations functionality is the easiest approach and the most like the old days: load the image, use the mouse to draw a big black line over the text that needs redacting, and save the image and/or annotations. The annotation can be permanently burned into the image data, or saved as a layer on top of it so that users with appropriate permissions can remove the redaction and see the contents.

Examples and Uses

What if you need something faster that can be applied to a large batch of documents? Here’s where Optical Character Recognition (OCR) can come in handy. LEADTOOLS can load any image or document and get the text directly if it’s already in a searchable format like PDF or DOC, or use OCR to extract the text if it’s a raster format like TIFF or JPEG. Then, search all the documents within your archive for the text you want to hide, get the bounding rectangle of each match, and either permanently add a redaction to the raster data or save an annotation layer to the document.
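The search-then-locate step can be sketched in a few lines of Python. This is not LEADTOOLS code: the Word class, the find_phrase_rects helper, and the sample coordinates are all hypothetical, and the sketch simply assumes that an OCR engine has already produced a list of recognized words with pixel bounding boxes.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Word:
    """One recognized word and its bounding box (hypothetical shape)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def union(words: List[Word]) -> Rect:
    """Smallest rectangle that covers every word in the list."""
    return (
        min(w.left for w in words),
        min(w.top for w in words),
        max(w.right for w in words),
        max(w.bottom for w in words),
    )


def find_phrase_rects(words: List[Word], phrase: str) -> List[Rect]:
    """Return one rectangle per case-insensitive match of the phrase."""
    target = phrase.lower().split()
    rects: List[Rect] = []
    if not target:
        return rects
    # Slide a window of len(target) words across the page.
    for i in range(len(words) - len(target) + 1):
        window = words[i:i + len(target)]
        if [w.text.lower() for w in window] == target:
            rects.append(union(window))
    return rects


# Made-up OCR output for a single line of text.
page = [
    Word("Account", 10, 10, 80, 30),
    Word("holder:", 85, 10, 150, 30),
    Word("Jane", 155, 10, 200, 30),
    Word("Doe", 205, 12, 245, 32),
]
rects = find_phrase_rects(page, "jane doe")
print(rects)
```

Each returned rectangle is what you would then fill with black on the raster data or hand to a redaction annotation. Note that this sketch takes a simple union of the matched words, so a phrase that wraps across two lines would produce one oversized rectangle; a real implementation would likely split such matches per line.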

A simple text search would be useful, but adding Regular Expressions to the mix really turns it up a notch. For example, rather than searching for each known Social Security Number in your organization, use a regex to search for any and every SSN. This is just one practical use of searching document images with OCR, and you can go beyond redacting. You could also add a highlight annotation to draw attention to certain details within your document repository.
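As a rough illustration of the regex idea, here is a minimal Python sketch that finds SSN-like strings in already-extracted text. The pattern is an assumption: it only matches the common hyphenated form (three digits, two digits, four digits) and does not check whether a number is actually valid. The find_ssn_spans and mask_spans helpers and the sample text are made up for this example and are not part of LEADTOOLS.

```python
import re
from typing import List, Tuple

# Assumed format: 3 digits, 2 digits, 4 digits, separated by hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of every SSN-like match."""
    return [m.span() for m in SSN_PATTERN.finditer(text)]


def mask_spans(text: str, spans: List[Tuple[int, int]], fill: str = "X") -> str:
    """Replace the digits inside each span, keeping the hyphens."""
    chars = list(text)
    for start, end in spans:
        for i in range(start, end):
            if chars[i].isdigit():
                chars[i] = fill
    return "".join(chars)


sample = "Employee 1: 123-45-6789, phone 555-123-4567, employee 2: 987-65-4321."
spans = find_ssn_spans(sample)
masked = mask_spans(sample, spans)
print(masked)
```

The phone number in the sample is left alone because its digit grouping does not fit the pattern. In a document imaging workflow, the character offsets of each match would be mapped back to the bounding rectangles of the recognized text, and those rectangles would be redacted or highlighted instead of masking the string.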

Download the Sample Project

For the full downloadable example project, check out the forum post.
