Question

Can Document OCR not get text out of a scanned PDF?

I have tried processToXml and ProcessToPdf, tried putting ProcessToPdf before each of these, and tried everything with and without ocrImagesAndText set to true. Everything just returns false. I am trying to get text out of a PDF produced by scanning a paper document, but there are even some PDFs the regular PDF connector can read that Document OCR cannot, unless I just cannot sort out how to use it. I can make it get text from images in Word documents, and it can get text out of a PDF I make with print-to-PDF, so I know I am not doing everything wrong. Can this component actually not get text from a scanned PDF?

Comments

Pega
August 23, 2019 - 2:04pm

It can, but low scan quality may be the reason it is failing. I would suggest you open a support request so that they can examine your specific PDF, unless you can attach it here for the community to examine.

August 23, 2019 - 4:54pm
Response to tsasnett

The document is sensitive, but it is very high quality. Document OCR actually fails on some PDFs that can be read with the regular PDF connector. I made a thing to do what I want anyway; here you go.

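using System;
using System.Globalization;
using System.Reflection;
using Acrobat; // Acrobat SDK interop; requires Acrobat Pro

// Saves the PDF at 'path' as PNG through Acrobat's JavaScript saveAs method.
public void pngThat(string path)
{
    AcroPDDoc pdfd = new AcroPDDoc();
    if (!pdfd.Open(path))
        throw new InvalidOperationException("Could not open " + path);
    try
    {
        object jsObj = pdfd.GetJSObject();
        // Reflect on the JavaScript object itself, not on the AcroPDDoc wrapper.
        Type jsType = jsObj.GetType();
        // Arguments: cPath, cConvID, cFS, bCopy, bPromptToOverwrite
        object[] saveAsParam = { "out.png", "com.adobe.acrobat.png", "", false, false };
        jsType.InvokeMember("saveAs", BindingFlags.InvokeMethod | BindingFlags.Public | BindingFlags.Instance, null, jsObj, saveAsParam, CultureInfo.InvariantCulture);
    }
    finally
    {
        pdfd.Close(); // release the document even if saveAs fails
    }
}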

August 23, 2019 - 5:00pm
Response to tsasnett

Forgot to say: that code uses the Acrobat SDK, so it is only helpful to people with Acrobat Pro DC. We need to bring a freer, easier solution to the people, tsasnett, sir. The Acrobat SDK is rough and not free.