Check if a Folder of PDFs Needs OCR

Receiving a mix of raster and searchable PDFs can be a pain, especially for those who get these files from clients daily. The LEADTOOLS OCR and PDF SDK makes it easy for developers to check whether a file needs OCR.

Within PDFs are three different object types:

  • Text
  • Rectangle
  • Image

For this blog post, I’ll show you how to check all object types in stored PDFs, then calculate whether each PDF needs to be converted to a searchable PDF using the OCR SDK.

At the start of your application, start up the OCR engine in case any files do need OCR. Once the engine is started, loop through each PDF in a specified folder. For each file found, load it as a PDFDocument and parse the pages using ParsePages, setting PDFParsePagesOptions so that only the objects are parsed.

Once all the objects have been counted, compare the number of non-text objects to the total number of objects found. If more than 10% of the objects are non-text objects, then it’s fair to assume that the majority of the PDF is not searchable and you will want to OCR the document.


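// Namespaces used: System, System.IO, System.Collections.Generic,
// Leadtools, Leadtools.Codecs, Leadtools.Ocr, Leadtools.Pdf, Leadtools.Document.Writer
string[] pdfFolder = Directory.GetFiles(@"D:\temp\PDFs", "*.pdf");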

// Start OCR Engine
using (IOcrEngine ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD, false))
{
    ocrEngine.Startup(null, null, null, @"C:\LEADTOOLS 20\Bin\Common\OcrLEADRuntime");

    foreach (string file in pdfFolder)
    {
        Console.WriteLine($"Reading {file}");
        using (PDFDocument document = new PDFDocument(file))
        {
            // Parse only the page objects, from page 1 through the last page (-1)
            PDFParsePagesOptions options = PDFParsePagesOptions.Objects;
            document.ParsePages(options, 1, -1);

            int totalPdfObjects = 0;
            int totalNonTextObjects = 0;

            foreach (PDFDocumentPage page in document.Pages)
            {
                IList<PDFObject> objects = page.Objects;

                foreach (PDFObject obj in objects)
                {
                    totalPdfObjects++;

                    if (obj.ObjectType != PDFObjectType.Text)
                    {
                        totalNonTextObjects++;
                    }
                }
            }

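            // Fraction of non-text objects; a PDF with no parsed objects counts as 0 to avoid dividing by zero
            double percentage = totalPdfObjects == 0 ? 0 : (double)totalNonTextObjects / totalPdfObjects;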
            if (percentage > .1)
            {
                Console.ForegroundColor = ConsoleColor.Green;
                Console.WriteLine($"Performing OCR on {file}\n");
                using (RasterCodecs codecs = new RasterCodecs())
                using (RasterImage image = codecs.Load(file, 0, CodecsLoadByteOrder.RgbOrGray, 1, -1))
                    DoOCR(image, ocrEngine, file);
            }
            else
            {
                Console.WriteLine($"{file} does not need OCR\n");
            }
        }
        Console.ForegroundColor = ConsoleColor.White;
    }

}
Console.WriteLine("Finished");
Console.ReadLine();
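
If you want to reuse or unit test the 10% rule outside of the loop above, it can be pulled out into a small helper. This is only a sketch: OcrDecision and NeedsOcr are hypothetical names of my own and not part of LEADTOOLS, and returning false when no objects were parsed is an assumption that simply mirrors what the loop does in that case. For example, 5 non-text objects out of 40 is 12.5%, so the file would be sent to OCR, while 3 out of 40 is 7.5% and would be left alone.

```csharp
using System;

// Hypothetical helper (not a LEADTOOLS type) that isolates the threshold check.
public static class OcrDecision
{
    // Returns true when the share of non-text objects exceeds the threshold.
    // The 0.10 default matches the 10% rule described in this post.
    public static bool NeedsOcr(int totalObjects, int nonTextObjects, double threshold = 0.10)
    {
        if (totalObjects < 0 || nonTextObjects < 0 || nonTextObjects > totalObjects)
        {
            throw new ArgumentOutOfRangeException(
                nameof(nonTextObjects),
                "Counts must satisfy 0 <= nonTextObjects <= totalObjects.");
        }

        // Assumption: with no parsed objects there is nothing to judge, so skip OCR.
        if (totalObjects == 0)
        {
            return false;
        }

        double nonTextRatio = (double)nonTextObjects / totalObjects;
        return nonTextRatio > threshold;
    }
}
```

In the loop, the if statement could then read if (OcrDecision.NeedsOcr(totalPdfObjects, totalNonTextObjects)), and the threshold can be tuned per project by passing a third argument.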

Below is the code for the DoOCR(image, ocrEngine, file) method above.

private static void DoOCR(RasterImage image, IOcrEngine ocrEngine, string fileName)
{
    using (IOcrDocument ocrDocument = ocrEngine.DocumentManager.CreateDocument())
    {
        ocrDocument.Pages.AddPages(image, 1, -1, null);
        ocrDocument.Pages.AutoPreprocess(OcrAutoPreprocessPageCommand.All, null);
        ocrDocument.Pages.Recognize(null);

        // Save the document we have as PDF
        ocrDocument.Save($@"D:\temp\NewPDFs\{Path.GetFileNameWithoutExtension(fileName)}_OCR.pdf", DocumentFormat.Pdf, null);
    }
}
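
DoOCR writes into a fixed output folder, so that folder has to exist before the save is attempted. I have not checked how Save behaves when the folder is missing, so as a precaution the output path can be built with plain System.IO calls first. The sketch below uses hypothetical names of my own (OcrOutput and BuildOcrOutputPath) and keeps the same _OCR.pdf naming as the method above.

```csharp
using System.IO;

// Hypothetical helper (not a LEADTOOLS type) for building the OCR output path.
public static class OcrOutput
{
    // Creates the output folder if needed and returns "<folder>/<name>_OCR.pdf".
    public static string BuildOcrOutputPath(string sourceFile, string outputFolder)
    {
        // CreateDirectory does nothing if the folder already exists.
        Directory.CreateDirectory(outputFolder);

        string baseName = Path.GetFileNameWithoutExtension(sourceFile);
        return Path.Combine(outputFolder, baseName + "_OCR.pdf");
    }
}
```

Inside DoOCR, the first argument to Save could then be OcrOutput.BuildOcrOutputPath(fileName, @"D:\temp\NewPDFs") instead of the interpolated string.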

Support

Need help getting this sample up and going? Contact our support team for free technical support! For pricing or licensing questions, you can contact our sales team (sales@leadtools.com) or call us at 704-332-5532.
