Extracting Text from PDF

#1 Posted : Friday, April 27, 2018 8:28:24 AM(UTC)

jnethercutt

Groups: Registered
Posts: 26

Thanks: 3 times

I've had excellent results extracting text from PDFs in my application. However, I am having issues with one particular vendor's PDF invoice and do not understand why. Understanding the challenges with this specific vendor's PDFs will help me going forward as my test base of vendors and their invoices grows.

I will attach a page of the PDF and the OCR word results file. I'm particularly trying to understand why the OCR results do not contain the "INVOICE NO" label from the upper right hand corner of the invoice as well as the PAGE label from that same area. There are other text areas that are not showing in the OCR results file, but the missing INVOICE NO and PAGE label text areas are particularly problematic for my application.

I'm using the following commands in my code to get access to the extracted text/zone data:

RasterImage image;
image = ocrEngine.RasterCodecsInstance.Load(imageFile, 0, CodecsLoadByteOrder.Rgb, 1, -1);
var ocrPage = ocrDocument.Pages.AddPage(image, null);
ocrDocument.Pages.AutoZone(null);
ocrDocument.Pages.Recognize(null);
var ocrPageWords = ocrPage.GetRecognizedCharacters();

Can you enlighten me as to why I may be having challenges with text from this PDF? I was thinking that PDFs with rendered text would pretty much give me 100% accuracy.

#2 Posted : Friday, April 27, 2018 8:40:10 AM(UTC)

Hadi

Groups: Manager, Tech Support, Administrators
Posts: 218

Was thanked: 12 time(s) in 12 post(s)

You mention that your PDFs have rendered text - do you mean that your PDFs are vector text based already (you can select the text)? If so, I would recommend that you use the Document SDK to extract the text from the PDF without having to Rasterize and OCR the documents. The Document SDK also has support for OCRing rasterized files and has an option to turn on 'Auto' Text extraction, which uses SVG if the input is already Document based (such as vector PDFs) and uses OCR if the input is Raster-based.

Here is some sample code on how to achieve that:

Code:

List<DocumentPageText> documentText = new List<DocumentPageText>();
var inputDocument = DocumentFactory.LoadFromFile(input, documentOptions);
inputDocument.Text.TextExtractionMode = DocumentTextExtractionMode.Auto;
inputDocument.Text.OcrEngine = ocrEngine;
foreach (var page in inputDocument.Pages)
{
    var pageText = page.GetText();
    // Build the words if you want the word information (bounds and each word value)
    // pageText.BuildWords();
    pageText.BuildText();
    documentText.Add(pageText);
}

As for the issue you are seeing with this particular PDF, it could be caused by a variety of things. If you could share the PDF with us for testing, we will be able to debug it and get back to you with why that particular text is not being extracted properly.

If you do not want to share the PDF on our public forums, please let me know and I will set up a case so we can continue to work on this via email.

Have a great day!

Hadi Chami
Developer Support Manager
LEAD Technologies, Inc.

1 user thanked Hadi for this useful post.

jnethercutt on 4/27/2018(UTC)

#3 Posted : Friday, April 27, 2018 8:46:47 AM(UTC)

jnethercutt

Groups: Registered
Posts: 26

Thanks: 3 times

I attached the two files in a reply to your email. Thanks.

#4 Posted : Friday, April 27, 2018 8:47:53 AM(UTC)

jnethercutt

Groups: Registered
Posts: 26

Thanks: 3 times

And yes, text can be selected from the PDF.

#5 Posted : Friday, April 27, 2018 9:25:58 AM(UTC)

Anthony Northrup

Groups: Registered, Tech Support, Administrators
Posts: 199

Was thanked: 28 time(s) in 28 post(s)

Hello Judy,

If you save out the IOcrDocument to PDF instead of just pulling the recognized text, you can see the issue. When running the AutoZone method, the engine marks the upper right table as an image, so it doesn't perform OCR on that area when recognizing. If you use the document-based method posted by Hadi, you will get the correct results. Here is the snippet I used to compare the two methods:

Code:


using (IOcrEngine engine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD, false))
{
	engine.Startup(null, null, null, null);

	// Using OCR
	Console.WriteLine("============ OCR ============");
	using (RasterImage image = engine.RasterCodecsInstance.Load(fileName, 0, CodecsLoadByteOrder.Rgb, 1, -1))
	using (IOcrDocument document = engine.DocumentManager.CreateDocument())
	{
		document.Pages.AddPages(image, 1, -1, null);
		document.Pages.AutoZone(null);
		document.Pages.Recognize(null);
		for (int i = 0; i < document.Pages.Count; i++)
		{
			Console.WriteLine("====== Page {0} ======", i + 1);
			IOcrPage page = document.Pages[i];
			Console.WriteLine(page.GetText(-1));
		}
	}

	// Spacing
	Console.WriteLine();
	Console.WriteLine();
	Console.WriteLine();

	// Pulling from the SVG
	Console.WriteLine("============ SVG ============");
	LoadDocumentOptions documentOptions = new LoadDocumentOptions()
	{
		FirstPageNumber = 1,
		LastPageNumber = -1
	};
	LEADDocument inputDocument = DocumentFactory.LoadFromFile(fileName, documentOptions);
	inputDocument.Text.TextExtractionMode = DocumentTextExtractionMode.Auto;
	inputDocument.Text.OcrEngine = engine;
	foreach (DocumentPage page in inputDocument.Pages)
	{
		Console.WriteLine("====== Page {0} ======", page.PageNumber);
		DocumentPageText pageText = page.GetText();
		pageText.BuildText();
		Console.WriteLine(pageText.Text);
	}
}

And here is a snippet from the results:

Code:


============ OCR ============
====== Page 1 ======
BroadBclnd lnternQtioncll`

Code:


============ SVG ============
====== Page 1 ======
INVOICE NO PAGE

As you can see, using the document based extraction properly recognizes the text you were attempting to read from the document. Let me know if you have any questions about using this method.

Thanks,

Anthony Northrup
Developer Support Engineer
LEAD Technologies, Inc.

1 user thanked Anthony Northrup for this useful post.

jnethercutt on 4/27/2018(UTC)

#6 Posted : Friday, April 27, 2018 10:01:28 AM(UTC)

jnethercutt

Groups: Registered
Posts: 26

Thanks: 3 times

Great, I'll switch to the document based method as described. Thanks!

#7 Posted : Friday, April 27, 2018 10:10:26 AM(UTC)

Anthony Northrup

Groups: Registered, Tech Support, Administrators
Posts: 199

Was thanked: 28 time(s) in 28 post(s)

Hello Judy,

Perfect, I'm glad we could help. One note about the document method: while it will reliably pull the text from a vector-based document (such as a searchable PDF), it generally will not perform OCR on images containing text inside a vector-based document. The PDF you sent me uses an image for the company title in the upper left of the first page. If reading the text in that image isn't important, the document method will work well for you. However, if you would also like to pull the text from the various images within a vector-based document, let me know and I can send you some code that will handle that.

Thanks,

Anthony Northrup
Developer Support Engineer
LEAD Technologies, Inc.

#8 Posted : Monday, April 30, 2018 12:01:26 PM(UTC)

jnethercutt

Groups: Registered
Posts: 26

Thanks: 3 times

Since my application was developed using the Rasterize/OCR approach, the code makes heavy use of ZoneIndex and ZoneType. If I switch to the Document SDK, I'm not seeing any references to Zone available to me. I may want to rewrite my application to use the Document SDK, but before I commit to this I wanted to double-check that I am not overlooking something.

#9 Posted : Monday, April 30, 2018 1:17:29 PM(UTC)

Hadi

Groups: Manager, Tech Support, Administrators
Posts: 218

Was thanked: 12 time(s) in 12 post(s)

If you need the bounds of the words / Characters, then you can get them using the BuildWords method and access them via the Words / Characters properties.

Here are some helpful links:

Parsing Text with the Documents Library
https://www.leadtools.co...he-document-library.html

Document Page Text Class
https://www.leadtools.co...ox/documentpagetext.html

Document Word Structure
https://www.leadtools.co...dh/dox/documentword.html

Document Character Structure
https://www.leadtools.co...x/documentcharacter.html

Note that the bounds on the Words are given in Document Units. If you need to convert these to Pixels, you will have to do the following:

Code:


using (var inputDocument = DocumentFactory.LoadFromFile(input, documentOptions))
{
    inputDocument.Text.TextExtractionMode = DocumentTextExtractionMode.Auto;
    inputDocument.Text.OcrEngine = ocrEngine;
    foreach (var page in inputDocument.Pages)
    {
        var pageText = page.GetText();

        pageText.BuildWords();

        for (int i = 0; i < pageText.Words.Count; i++)
        {
            var word = pageText.Words[i];
            word.Bounds = inputDocument.RectToPixels(word.Bounds).ToLeadRectD();
            pageText.Words[i] = word;
        }

        for (int i = 0; i < pageText.Characters.Count; i++)
        {
            var character = pageText.Characters[i];
            character.Bounds = inputDocument.RectToPixels(character.Bounds).ToLeadRectD().ToLeadRect().ToLeadRectD();
            pageText.Characters[i] = character;
        }

        pageText.BuildText();
    }
}
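
For anyone who wants to sanity-check the converted values, the arithmetic behind a units-to-pixels conversion can be sketched in plain C#. This is only an illustration, not the SDK's implementation: the DocumentUnits class and its methods are hypothetical names, and the figure of 720 document units per inch is an assumption (it is what I understand LEADDocument.UnitsPerInch to be), so please verify it against your SDK version.

```csharp
using System;

// Hypothetical helper, not part of LEADTOOLS.
public static class DocumentUnits
{
    // Assumption: 720 document units per inch. Verify against your SDK version.
    public const double UnitsPerInch = 720.0;

    // Converts a length in document units to pixels at the given resolution (DPI).
    public static double ToPixels(double units, double dpi)
    {
        return units * dpi / UnitsPerInch;
    }

    // Converts a length in pixels at the given resolution (DPI) back to document units.
    public static double ToUnits(double pixels, double dpi)
    {
        return pixels * UnitsPerInch / dpi;
    }
}
```

Under that assumption, an 8.5 inch page width is 6120 document units, and DocumentUnits.ToPixels(6120, 300) gives 2550 pixels at 300 DPI.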

Edited by user Monday, November 23, 2020 4:08:55 PM(UTC) | Reason: Not specified

Hadi Chami
Developer Support Manager
LEAD Technologies, Inc.

#10 Posted : Friday, June 1, 2018 12:47:06 PM(UTC)

jnethercutt

Groups: Registered
Posts: 26

Thanks: 3 times

Hi,

When using OCR, I use SaveXml to generate an XML report that shows me the lines and words with their left/top/right/bottom bounds.
Example: ocrDocument.SaveXml(ocrResultsFile, OcrXmlOutputOptions.None);

Output looks like:
<?xml version="1.0" encoding="UTF-16" standalone="yes"?>
<pages>
<page horizontal_resolution="300" vertical_resolution="300" width="2550" height="3300">
<zone id="0" name="" type="Text" left="117" top="141" right="631" bottom="424" subtype="Text" language="">
<paragraph>
<line left="118" top="142" right="630" bottom="175" base="21">
<word left="118" top="142" right="301" bottom="168" base="21">National</word>
<word left="330" top="143" right="442" bottom="168" base="22">Cable</word>
<word left="470" top="145" right="515" bottom="168" base="23">TV</word>
<word left="541" top="144" right="630" bottom="175" base="20">Coop</word>
</line>
<line left="121" top="192" right="442" bottom="220" base="24">
<word left="121" top="194" right="163" bottom="218" base="24">PO</word>
<word left="189" top="192" right="278" bottom="220" base="24">BOX#</word>
<word left="308" top="193" right="442" bottom="218" base="25">414826</word>

Is there a similar command that will give me this type of output when using the Document SDK with my vector text based PDFs?

#11 Posted : Monday, June 4, 2018 9:52:36 AM(UTC)

Anthony Northrup

Groups: Registered, Tech Support, Administrators
Posts: 199

Was thanked: 28 time(s) in 28 post(s)

Hello Judy,

The Document SDK abstracts the text from the page so that it can be used the same way whether it came from an OCR'd image or a vector document. As such, some of the information (such as zones or paragraphs) is lost; however, the important information (such as bounds and word/line breaks) is kept. I've attached a sample application that contains a similar SaveXml method for the LEADDocument. It uses the Characters property of the DocumentPageText class to generate the output. I manually iterate over all the characters per page and divide them up into words and lines using the IsEndOfWord and IsEndOfLine properties. While this isn't as complex (or perhaps as general purpose) as the SaveXml method available for our OCR document, it will allow you to generate a file similar to the one you mentioned. Let me know if you have any questions or run into any issues when using that method in your own project.
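
For readers without the attachment, the word/line splitting described above can be sketched roughly as follows. This is not the code from Project.zip, just a minimal illustration in plain C#: CharInfo and PageXmlSketch are hypothetical names, and CharInfo is a stand-in for the per-character data (character code, bounds, and the IsEndOfWord / IsEndOfLine flags) that you would copy out of the Characters collection yourself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical stand-in for one extracted character: its code, its bounds,
// and the end-of-word / end-of-line flags.
public struct CharInfo
{
    public char Code;
    public double Left, Top, Right, Bottom;
    public bool IsEndOfWord, IsEndOfLine;
}

public static class PageXmlSketch
{
    // Groups characters into <word> elements inside <line> elements.
    public static XElement BuildPageXml(IEnumerable<CharInfo> characters)
    {
        var page = new XElement("page");
        var lineWords = new List<XElement>();
        var wordChars = new List<CharInfo>();

        foreach (var ch in characters)
        {
            wordChars.Add(ch);
            // The last character of a line also closes the current word.
            if (ch.IsEndOfWord || ch.IsEndOfLine)
            {
                lineWords.Add(MakeWord(wordChars));
                wordChars.Clear();
            }
            if (ch.IsEndOfLine)
            {
                page.Add(MakeLine(lineWords));
                lineWords.Clear();
            }
        }

        // Flush any trailing text that no flag terminated.
        if (wordChars.Count > 0) lineWords.Add(MakeWord(wordChars));
        if (lineWords.Count > 0) page.Add(MakeLine(lineWords));
        return page;
    }

    // A word's bounds are the union of its characters' bounds.
    private static XElement MakeWord(List<CharInfo> chars)
    {
        return new XElement("word",
            new XAttribute("left", (int)Math.Round(chars.Min(c => c.Left))),
            new XAttribute("top", (int)Math.Round(chars.Min(c => c.Top))),
            new XAttribute("right", (int)Math.Round(chars.Max(c => c.Right))),
            new XAttribute("bottom", (int)Math.Round(chars.Max(c => c.Bottom))),
            new string(chars.Select(c => c.Code).ToArray()));
    }

    // A line's bounds are the union of its words' bounds.
    private static XElement MakeLine(List<XElement> words)
    {
        return new XElement("line",
            new XAttribute("left", words.Min(w => (int)w.Attribute("left"))),
            new XAttribute("top", words.Min(w => (int)w.Attribute("top"))),
            new XAttribute("right", words.Max(w => (int)w.Attribute("right"))),
            new XAttribute("bottom", words.Max(w => (int)w.Attribute("bottom"))),
            words);
    }
}
```

Calling PageXmlSketch.BuildPageXml(chars).Save(path) would then write the file. If the bounds are still in document units, convert them to pixels first so the numbers are comparable with the OCR report.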

File Attachment(s):

Project.zip (4kb) downloaded 319 time(s).

Thanks,

Anthony Northrup
Developer Support Engineer
LEAD Technologies, Inc.

#12 Posted : Monday, June 4, 2018 2:17:17 PM(UTC)

jnethercutt

Groups: Registered
Posts: 26

Thanks: 3 times

Thanks for your fast response.
jn

#13 Posted : Tuesday, July 24, 2018 3:29:37 AM(UTC)

Mark Evans

Groups: Registered
Posts: 4

Originally Posted by: Anthony Northrup

Hello Judy,

Perfect, I'm glad we could help. One note about the document method: while it will reliably pull the text from a vector-based document (such as a searchable PDF), it generally will not perform OCR on images containing text inside a vector-based document. The PDF you sent me uses an image for the company title in the upper left of the first page. If reading the text in that image isn't important, the document method will work well for you. However, if you would also like to pull the text from the various images within a vector-based document, let me know and I can send you some code that will handle that.

Thanks,

I have a use case where I would like to use the document method, but my documents also have embedded images (customer logo, address, etc.) whose text I would like to capture as well.

Is it possible to share the method which you mention here please?

All the best

Mark

#14 Posted : Tuesday, July 24, 2018 1:55:48 PM(UTC)

Anthony Northrup

Groups: Registered, Tech Support, Administrators
Posts: 199

Was thanked: 28 time(s) in 28 post(s)

EDIT: Removed.

The previous answer should have just been setting the SvgImagesRecognitionMode to Always.

Edited by user Tuesday, August 7, 2018 2:00:39 PM(UTC) | Reason: Previous answer was awful

Anthony Northrup
Developer Support Engineer
LEAD Technologies, Inc.

#15 Posted : Tuesday, August 7, 2018 8:08:44 AM(UTC)

Mark Evans

Groups: Registered
Posts: 4

Thanks Anthony for the detailed response. A quick follow-up question if I may: is there a way to detect whether a document contains an image of a certain size or dimension?

For example, in my case, I only want to perform OCR conditionally, when a document contains a large image (e.g. a header or footer image), rather than for all documents.

Thanks in advance!

Mark

#16 Posted : Tuesday, August 7, 2018 2:04:38 PM(UTC)

Anthony Northrup

Groups: Registered, Tech Support, Administrators
Posts: 199

Was thanked: 28 time(s) in 28 post(s)

Hello Mark,

I have modified my previous reply, because after looking at it again I realized there was a much easier method. The edit should also answer your new question as it will preserve the existing SVG text but still perform OCR on the images in the header/footer.

Thanks,

Anthony Northrup
Developer Support Engineer
LEAD Technologies, Inc.

#17 Posted : Wednesday, August 8, 2018 3:35:06 AM(UTC)

Mark Evans

Groups: Registered
Posts: 4

Thanks Anthony, that's extremely useful.

Sorry, one other quick question on the subject. We haven't yet migrated to the Document SDK, so are still using DocumentReaders for text extraction.

Is there a similar way to force this with DocumentReaders, or can we detect the image size and force OCR, until we upgrade to the suggested method?

Many thanks

Mark

#18 Posted : Thursday, August 9, 2018 8:43:45 AM(UTC)

Anthony Northrup

Groups: Registered, Tech Support, Administrators
Posts: 199

Was thanked: 28 time(s) in 28 post(s)

Hello Mark,

Unfortunately it doesn't look like we have anything like that using the old deprecated DocumentReader assembly. You'll have to upgrade to the Document SDK for this functionality.

Thanks,

Anthony Northrup
Developer Support Engineer
LEAD Technologies, Inc.

#19 Posted : Monday, August 13, 2018 6:37:20 AM(UTC)

Mark Evans

Groups: Registered
Posts: 4

OK, thanks. That example seems to be using the DocumentConverter. Is there similar functionality exposed via Leadtools.Documents? Specifically, we were looking at using the DocumentFactory approach.

