View Full Version : [all variants] Image file to PDF (w/ OCR)



mvip
June 27th, 2009, 04:17 PM
Hi,

I'm sitting on a collection of scanned documents (10k or so) that I'm looking to convert to PDFs with OCR. The documents are currently stored as PNG files and I want to convert them to searchable PDF files (to be used with IBM's OmniFind).

I've found a number of tools that can extract text out of an image file (such as tesseract-ocr (http://code.google.com/p/tesseract-ocr/)), but none that can actually generate a PDF in the end.

There are a number of commercial Windows apps that can do this, but I would really consider that a last resort.

Thanks.

mvip
June 28th, 2009, 12:04 AM
*bump*

Chemical Imbalance
June 28th, 2009, 12:08 AM
Just double click on the picture so that it opens with Ubuntu's default picture viewer. Then click on File, Print. Then choose "print to file".
Make sure you rename it to "something".pdf and choose the .PDF option, not Postscript.
Take note of what directory it saves to (e.g. your home folder or Desktop).

I'm not sure if the text is searchable or not however.

mvip
June 28th, 2009, 12:58 AM
I suppose that's one approach. Unfortunately it falls short on two points:

It is (probably) not searchable.
It does not scale. We're talking about 10k+ files.
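
For the batch side, here is a rough sketch of how the loop could be scripted. It assumes a Tesseract build recent enough to accept the `pdf` output config, which writes a PDF with a searchable text layer; the releases available when this thread was started did not have it. The function names and the `scans`/`pdfs` directory names are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_command(png_path, out_dir):
    """Return the tesseract command line for one PNG.

    Assumes a Tesseract version that supports the "pdf" config.
    Tesseract appends ".pdf" to the output base by itself.
    """
    png_path = Path(png_path)
    out_base = Path(out_dir) / png_path.stem
    return ["tesseract", str(png_path), str(out_base), "pdf"]


def convert_all(src_dir, out_dir):
    """Convert every PNG in src_dir and return the names that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for png in sorted(Path(src_dir).glob("*.png")):
        # Skip files that already have a PDF so an interrupted run can resume.
        if (out_dir / (png.stem + ".pdf")).exists():
            continue
        result = subprocess.run(build_command(png, out_dir))
        if result.returncode != 0:
            failed.append(png.name)
    return failed
```

Calling `convert_all("scans", "pdfs")` would walk the hypothetical `scans` directory one file at a time. With 10k+ files it is probably worth trying it on a handful first to check the OCR quality before running the whole set.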