#1 Posted : Wednesday, July 25, 2018 4:07:30 PM(UTC)


My application processes a lot of PDFs and I’m using the code below to extract the text from the PDFs.

// Collect the extracted text of every page in the document.
List<DocumentPageText> documentText = new List<DocumentPageText>();
LoadDocumentOptions documentOptions = new LoadDocumentOptions();
var inputDocument = DocumentFactory.LoadFromFile(imageFile, documentOptions);
inputDocument.Text.TextExtractionMode = DocumentTextExtractionMode.Auto;
foreach (var page in inputDocument.Pages)
{
    // Add each page's text to the list instead of overwriting a single variable.
    documentText.Add(page.GetText());
}

Occasionally I come across a PDF where the text is not extracted correctly, even though opening the same PDF in Acrobat, choosing Edit / Select All / Copy, and pasting into Notepad gives me the right result.

For example, in the PDF I have attached, the word “Total” at the bottom of the document is split into two words:
<word left="1741" top="2886" right="1790" bottom="2922">Tot</word>
<word left="1801" top="2886" right="1828" bottom="2922">al</word>

As well as the word “Balance”:
<word left="1741" top="3024" right="1788" bottom="3060">Bal</word>
<word left="1801" top="3024" right="1880" bottom="3060">ance</word>

I see this type of issue on random documents. Most other documents from this particular vendor don’t have this issue.

Do you have any suggestions on how this might be resolved so that “Total” and “Balance” are extracted as single words?

(I’ll send you the PDF via email when I get a reply.)


#2 Posted : Wednesday, July 25, 2018 4:29:48 PM(UTC)
Anthony Northrup


Hello Judy,

That is interesting. I'm not entirely sure how we extract the text information from a PDF, but the text might actually be stored in the file the way we are returning it. You'll notice from the bounds listed that the two "words" are only about 11 to 13 pixels apart, so visually they'll appear together. If you could send me the PDF via email, I could look into this issue further for you.
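If the PDF really does store each word as two separate text runs, one possible workaround is to post-process the extracted words and join fragments that sit on the same line with only a small horizontal gap between them. The following is a minimal sketch of that idea, not LEADTOOLS functionality: `WordBox`, `WordMerger.MergeFragments`, and the `maxGap` threshold of 15 are hypothetical names and values introduced for illustration, and the sample data is copied from the bounds quoted in the first post.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an extracted word and its bounds; not a LEADTOOLS type.
public sealed class WordBox
{
    public string Text;
    public int Left, Top, Right, Bottom;

    public WordBox(string text, int left, int top, int right, int bottom)
    {
        Text = text; Left = left; Top = top; Right = right; Bottom = bottom;
    }
}

public static class WordMerger
{
    // Joins consecutive fragments that share the same top and bottom (same line)
    // and whose horizontal gap is at most maxGap. maxGap is an assumed threshold
    // that would need tuning per document and font size.
    public static List<WordBox> MergeFragments(IEnumerable<WordBox> words, int maxGap)
    {
        var merged = new List<WordBox>();
        foreach (var word in words)
        {
            WordBox last = merged.Count > 0 ? merged[merged.Count - 1] : null;
            bool sameLine = last != null && last.Top == word.Top && last.Bottom == word.Bottom;
            if (sameLine && word.Left >= last.Right && word.Left - last.Right <= maxGap)
            {
                // Close enough to the previous fragment: extend it.
                last.Text += word.Text;
                last.Right = word.Right;
            }
            else
            {
                // Start a new word (copied so the input list is not modified).
                merged.Add(new WordBox(word.Text, word.Left, word.Top, word.Right, word.Bottom));
            }
        }
        return merged;
    }
}

public static class Program
{
    public static void Main()
    {
        // Bounds taken from the fragments quoted in the first post.
        var words = new List<WordBox>
        {
            new WordBox("Tot", 1741, 2886, 1790, 2922),
            new WordBox("al", 1801, 2886, 1828, 2922),
            new WordBox("Bal", 1741, 3024, 1788, 3060),
            new WordBox("ance", 1801, 3024, 1880, 3060),
        };
        foreach (var w in WordMerger.MergeFragments(words, 15))
            Console.WriteLine(w.Text + " [" + w.Left + ", " + w.Right + "]");
    }
}
```

With the sample bounds, the gaps are 11 and 13 units, so both pairs are joined into "Total" and "Balance". A threshold this loose could also join words that are genuinely separate when the font is small, so comparing the gap against the average character width on the line may be a safer rule than a fixed number.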

Anthony Northrup
Developer Support Engineer
LEAD Technologies, Inc.
