OCR and Extract Data from Scanned Invoices and Forms using C#

In our previous blog post, we discussed how the LEADTOOLS Document SDK can recognize hundreds of different forms and invoices. This post builds on that by showing how to automatically extract information and data from the recognized forms.

First, we will identify and then process the fields. Fields are the locations on the form that were defined when we created the master form (or template) for each invoice or form type, and the processing engine looks for data in those designated areas. There are many types of data fields, including text, images, tables, OMR bubbles, and barcodes. The engine loads and processes all of the fields, leaving the developer to decide what to do with the extracted information. It is good practice to check which type of field was processed and then write code for that type accordingly.

The code snippets below show, in order: how to determine which master form a filled form matches, how to load and process the fields within that form, how to identify each field’s type, and how to display the extracted information in the console.

Find the Right Match

For this example, we are going to make a couple of changes to the RecognizeForms function from the previous post.

private static void RecognizeForms()
{
    Console.WriteLine("Recognizing Forms\n");

    string[] formsToRecognize = Directory.GetFiles(filledFormsDirectory, "*.tif", SearchOption.AllDirectories);

    string[] masterFileNames = Directory.GetFiles(masterFormsDirectory, "*.bin", SearchOption.AllDirectories);

    foreach (string filledFormName in formsToRecognize)
    {
        RasterImage currentForm = codecs.Load(filledFormName, 0, CodecsLoadByteOrder.BgrOrGray, 1, -1);
        FormRecognitionAttributes filledFormAttributes = LoadFilledFormAttributes(currentForm);

        string resultMessage = "";

        foreach (string masterFileName in masterFileNames)
        {
            FormRecognitionAttributes masterFormAttributes = LoadMasterFormAttributes(masterFileName);

            //Compares the master form to the filled form
            FormRecognitionResult recognitionResult = recognitionEngine.CompareForm(masterFormAttributes, filledFormAttributes, null);

            //When the Recognition Engine compares the two documents it also sets a confidence level for how closely the engine thinks the two documents match
            if (recognitionResult.Confidence >= AllowedConfidenceLevel)
            {
                resultMessage = $"Form {Path.GetFileNameWithoutExtension(filledFormName)} has been recognized as a(n) {Path.GetFileNameWithoutExtension(masterFileName)} with a confidence level of {recognitionResult.Confidence}";

                //Once we found the right master form we can read the filled form
                FormPages filledFormData = ProcessForm(recognitionResult, masterFileName, currentForm);
                PrintFormData(filledFormData);

                break;
            }

            resultMessage = $"The form {Path.GetFileNameWithoutExtension(filledFormName)} failed to be recognized with a confidence level of {recognitionResult.Confidence}";
        }

        Console.WriteLine(resultMessage);
        Console.WriteLine("=============================================================\n");
    }
}

Notice that in this example we process the filled form data and print that information to the console.
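RecognizeForms stops at the first master form whose confidence reaches AllowedConfidenceLevel. If several master forms look alike, one possible variation is to score the filled form against every master form and keep the highest score. The sketch below illustrates only that selection step using plain data, without calling the SDK. The helper name PickBestMaster, the sample form names, and the assumption that confidence is an integer from 0 to 100 are illustrative and not part of the original sample.

```csharp
using System;
using System.Collections.Generic;

public static class BestMatchSketch
{
    // Hypothetical helper: returns the name of the highest-scoring master form,
    // or null when no score reaches the allowed confidence level.
    public static string PickBestMaster(IDictionary<string, int> confidenceByMaster, int allowedConfidenceLevel)
    {
        string bestName = null;
        int bestConfidence = -1;

        foreach (KeyValuePair<string, int> entry in confidenceByMaster)
        {
            // Keep a candidate only if it passes the threshold and beats the current best
            if (entry.Value >= allowedConfidenceLevel && entry.Value > bestConfidence)
            {
                bestName = entry.Key;
                bestConfidence = entry.Value;
            }
        }

        return bestName;
    }

    public static void Main()
    {
        // Assumed scores, as if collected from one comparison per master form
        var scores = new Dictionary<string, int>
        {
            { "Invoice_A", 62 },
            { "Invoice_B", 91 },
            { "W9", 85 }
        };

        Console.WriteLine(PickBestMaster(scores, 80) ?? "(no match)"); // prints "Invoice_B"
    }
}
```

The trade-off is speed: every master form is compared before any processing starts, whereas the first-match loop above can stop early.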

Process the Form Data

Extracting text from the scanned form requires us to pass information gathered by the Recognition Engine to the Processing Engine. Additionally, we will load the master form’s XML file into the Processing Engine to tell it which fields to collect data from. Once the Processing Engine is prepared, we then read the information off of the filled form.

private static FormPages ProcessForm(FormRecognitionResult recognitionResult, string masterFormFileName, RasterImage filledForm)
{
    // The Recognition Engine records how the master form and the filled form align page by page
    List<PageAlignment> alignment = new List<PageAlignment>();
    for (int k = 0; k < recognitionResult.PageResults.Count; k++)
    {
        alignment.Add(recognitionResult.PageResults[k].Alignment);
    }

    // Load the Processing Engine with the fields of the matched master form.
    // The fields XML file is assumed to sit next to the master form's .bin file,
    // which also covers master forms found in subdirectories.
    string fieldsFullPath = Path.ChangeExtension(masterFormFileName, ".xml");

    processingEngine.LoadFields(fieldsFullPath);

    // Processing Engine reads filled form
    processingEngine.Process(filledForm, alignment);
    return processingEngine.Pages;
}

Print the Form Data

For this example, we are simply reading the information off of the filled form and printing it to the console.

private static void PrintFormData(FormPages formData)
{
    foreach (FormPage formPage in formData)
    {
        foreach (FormField field in formPage)
        {
            List<string> row = new List<string>();

            // Choose a reader based on the type of field that was processed
            if (field is TextFormField)
                row.AddRange(ReadTextFormField(field));
            else if (field is TableFormField)
                row.AddRange(ReadTableFormField(field));
            else if (field is UnStructuredTextFormField)
                row.AddRange(ReadUnStructuredTextFormField(field));

            row.Insert(0, "Field Name: " + field.Name);
            row.Add("Field Bounds: " + field.Bounds.ToString() + "\n------------------------------------------------------------");

            foreach (string line in row)
                Console.WriteLine(line);
        }
    }
}

As you can see, we break down how to present the filled form information based on the type of field that is being read.

Text Form Fields

Here we read the text from the text form field result, as well as the confidence the OCR engine has that it read the correct characters.

private static List<string> ReadTextFormField(FormField field)
{
    List<string> row = new List<string>();

    // Cast once instead of repeating the cast on every access
    TextFormFieldResult result = (field as TextFormField).Result as TextFormFieldResult;

    row.Add("Field Type: Text");
    row.Add("Field Value: " + result.Text);

    if (result.AverageConfidence < AllowedConfidenceLevel)
    {
        row.Add("Field Confidence: " + result.AverageConfidence.ToString() + "% ---> Needs manual review");
        // Count the field for manual review, as the table and unstructured text readers do
        manualReviewCount++;
    }
    else
    {
        row.Add("Field Confidence: " + result.AverageConfidence.ToString() + "%");
    }

    return row;
}
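The text reader and the unstructured text reader further down both build the same "Field Confidence" line and apply the same threshold check. As a rough sketch, that shared logic could live in one small helper. The helper name FormatConfidence and the sample values are hypothetical, and the sketch assumes the confidence is an integer percentage.

```csharp
using System;

public static class ConfidenceLineSketch
{
    // Hypothetical helper: builds the "Field Confidence" line and reports
    // whether the value falls below the allowed confidence level.
    public static string FormatConfidence(int averageConfidence, int allowedConfidenceLevel, out bool needsReview)
    {
        needsReview = averageConfidence < allowedConfidenceLevel;

        string line = "Field Confidence: " + averageConfidence + "%";
        return needsReview ? line + " ---> Needs manual review" : line;
    }

    public static void Main()
    {
        bool needsReview;

        // Below the threshold of 80: flagged for review
        Console.WriteLine(FormatConfidence(72, 80, out needsReview));
        // prints "Field Confidence: 72% ---> Needs manual review"

        // At or above the threshold: reported as-is
        Console.WriteLine(FormatConfidence(95, 80, out needsReview));
        // prints "Field Confidence: 95%"
    }
}
```

A caller could increment its manual review counter whenever needsReview comes back true, which keeps the counting in one place.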

Table Form Fields

Reading table data is similar to reading the data from a text form field. However, with a table we have to read the results from every row and column present on the table. Here is an example of how to read table values.

private static List<string> ReadTableFormField(FormField field)
{
    List<string> row = new List<string>();

    TableFormField tableField = field as TableFormField;
    List<TableColumn> col = tableField.Columns;
    TableFormFieldResult results = tableField.Result as TableFormFieldResult;
    row.Add("Field Type: Table");

    for (int i = 0; i < results.Rows.Count; i++)
    {
        TableFormRow tableRow = results.Rows[i];

        row.Add($"------------------Table Row Number: {i + 1}-----------------------\n");

        // Number of text lines in the tallest cell of this row
        int lineCounter = 1;
        string[] rowInfo = new string[tableRow.Fields.Count];
        for (int j = 0; j < tableRow.Fields.Count; j++)
        {
            OcrFormField ocrField = tableRow.Fields[j];
            TextFormFieldResult txtResults = ocrField.Result as TextFormFieldResult;

            // Flag cells whose OCR confidence falls below the threshold
            if (txtResults.AverageConfidence < AllowedConfidenceLevel)
            {
                row.Add(col[j].OcrField.Name + ": " + txtResults.AverageConfidence + "% ---> Needs manual review\n");
                manualReviewCount++;
            }

            rowInfo[j] = txtResults.Text;

            // A multi-line cell contains one '\n' per extra line
            int counter = 1;
            if (txtResults.Text != null)
                counter += CountCharacterInString(txtResults.Text, '\n');

            if (counter > lineCounter)
                lineCounter = counter;
        }

        for (int k = 0; k < rowInfo.Length; k++)
        {
            row.Add(col[k].OcrField.Name + ": " + rowInfo[k]);
        }
    }
    row.Add("------------------------------------------------------------");

    // The table as a whole reports a status rather than a confidence percentage
    if (results.Status == FormFieldStatus.Failed)
    {
        row.Add("Field Status: Failed ---> Needs manual review");
        manualReviewCount++;
    }
    else
    {
        row.Add("Field Status: Successful");
    }

    return row;
}

private static int CountCharacterInString(String str, char c)
{
    int counter = 0;

    for (int i = 0; i < str.Length; i++) if (str[i] == c) counter++;

    return counter;
}
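The table reader uses CountCharacterInString to work out how many text lines a cell spans: one line, plus one for each newline character, with a null value treated as a single line. For comparison, here is a minimal sketch of the same calculation written with LINQ. The name CountLines and the sample strings are illustrative only.

```csharp
using System;
using System.Linq;

public static class LineCountSketch
{
    // Hypothetical helper: a cell spans one line plus one line per '\n'.
    // A null value counts as a single (empty) line, matching the table reader.
    public static int CountLines(string text)
    {
        return 1 + (text?.Count(ch => ch == '\n') ?? 0);
    }

    public static void Main()
    {
        Console.WriteLine(CountLines("Widget\nBlue\nLarge")); // prints 3
        Console.WriteLine(CountLines("Widget"));              // prints 1
        Console.WriteLine(CountLines(null));                  // prints 1
    }
}
```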

Unstructured Text Form Fields

Unstructured text form fields just represent a rectangular region on a form that doesn’t have a predefined structure like a text field or a table would. For this example, we are going to read the unstructured data the same way we read the text field data.

private static List<string> ReadUnStructuredTextFormField(FormField field)
{
    List<string> row = new List<string>();

    // Cast once instead of repeating the cast on every access
    TextFormFieldResult result = (field as UnStructuredTextFormField).Result as TextFormFieldResult;

    row.Add("Field Type: UnStructuredText");
    row.Add("Field Value: " + result.Text);

    if (result.AverageConfidence < AllowedConfidenceLevel)
    {
        row.Add("Field Confidence: " + result.AverageConfidence.ToString() + "% ---> Needs manual review");
        manualReviewCount++;
    }
    else
    {
        row.Add("Field Confidence: " + result.AverageConfidence.ToString() + "%");
    }

    return row;
}

These are just a few examples of the many different field types that the Processing Engine can handle. Our documentation has a full list of all of the supported form field input types.

Free Evaluation, Free Technical Support, Free Demos, & More!

Download our FREE 60-day evaluation to test all these features and actually program with LEADTOOLS before a purchase is even made. Gain access to our extensive documentation, sample source code, demos, and tutorials.

Need Assistance? LEAD Is Here For You

Contact our support team for free technical support! For pricing or licensing questions, you can contact our sales team via email or call us at 704-332-5532.
