Howto: Make scanned PDFs searchable (OCR) using pdfocr

**tuxcantfly** · April 17th, 2010

What pdfocr is for

Suppose you have a PDF document that was made using a scanner, or that otherwise consists of image data with no text data. PDF readers and desktop search applications can't search such a file. pdfocr is a simple utility I made that takes a PDF file and generates a new one with a text layer added, so it can be searched in your PDF reader and indexed by your desktop search application, while still looking identical when printed.

What pdfocr is not for

This is only useful if your PDF was made from a scanned source. If you exported your PDF from OpenOffice or the like, it already has a text layer, so this is unnecessary.
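
If you aren't sure whether a PDF already has a text layer, here is a minimal sketch of a check. It assumes pdftotext (from poppler-utils) is installed; has_text_layer is just a name I made up for illustration and is not part of pdfocr.

```shell
# Sketch: succeed if the PDF already contains extractable text.
# Assumes pdftotext (poppler-utils) is installed; has_text_layer is a
# hypothetical helper name, not something pdfocr provides.
has_text_layer() {
    # "-" makes pdftotext write the extracted text to stdout.
    # A scan with no text layer yields no letters or digits.
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Usage:
#   if has_text_layer input.pdf; then
#       echo "already searchable"
#   else
#       echo "needs OCR"
#   fi
```

The same check can be run on the output file afterwards to confirm that a text layer was actually added.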

If you simply want to extract the plain text from a PDF file, without embedding the text back into the PDF, see this guide.

Compatibility

This guide will work on Ubuntu Karmic (9.10) or Lucid (10.04); the dependencies for this software don't build on older versions.

Installing pdfocr

The easiest way to install pdfocr is to add my PPA and use apt-get. If you would prefer to install it manually, see here for instructions.

Code:

sudo add-apt-repository ppa:gezakovacs/pdfocr
sudo apt-get update
sudo apt-get install pdfocr
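
To confirm the install worked, a quick sketch like the following checks that pdfocr and the tools it automates (listed under Credits) can be found on your PATH. check_tools is a hypothetical helper name, not part of the package.

```shell
# Sketch: report which of the given commands are missing from PATH.
# check_tools is a hypothetical helper name used for illustration only.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    # Non-zero status if at least one tool was not found.
    return "$missing"
}

# Usage:
#   check_tools pdfocr pdftk pdfimages cuneiform hocr2pdf && echo "all tools found"
```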

Using pdfocr to add a text layer to your scanned PDF file

Open a terminal, go to the directory that contains the PDF file you want to convert, and enter the following (replacing input.pdf with the name of your input PDF file, and output.pdf with the name you want for the output PDF file):

Code:

pdfocr -i input.pdf -o output.pdf

Now wait as OCR is performed on the PDF file page-by-page, and the output file is generated. This should take a few seconds per page, depending on the resolution of your PDF file (high-res PDF files get better accuracy, but will take longer). Once done, you should now have a searchable PDF at output.pdf.
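
If you have a whole directory of scans, the command above can be wrapped in a loop. This is only a sketch: the -i and -o options are the ones shown above, while the "-ocr" output suffix and the batch_ocr name are my own illustrative choices.

```shell
# Sketch: run pdfocr on every PDF in a directory (default: current directory).
# The "-ocr.pdf" suffix and the function name are illustrative assumptions.
batch_ocr() (
    # The body runs in a subshell, so the cd does not affect the caller.
    cd "${1:-.}" || exit 1
    for f in *.pdf; do
        [ -e "$f" ] || continue                  # the glob matched nothing
        case $f in *-ocr.pdf) continue ;; esac   # skip outputs of earlier runs
        out="${f%.pdf}-ocr.pdf"
        [ -e "$out" ] && continue                # already converted
        pdfocr -i "$f" -o "$out" || echo "pdfocr failed on $f" >&2
    done
)

# Usage:
#   batch_ocr ~/scans
```

Because existing outputs are skipped, the loop can be interrupted and rerun without redoing finished files.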

Credits

pdfocr was written by me (Geza Kovacs). It is simply a script which automates the following process:

1. Splitting the PDF file into separate pages using pdftk
2. Extracting the image data using pdfimages
3. Doing OCR (optical character recognition) using cuneiform
4. Embedding the detected text back into the PDF file using hocr2pdf
5. Merging the files together using pdftk

Hence, if you want more fine-grained control than the defaults give you, you can just invoke these utilities manually. Source is available on GitHub. Feedback is welcome.
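
For anyone who wants to try the manual route, here is a rough sketch of the five steps above. It is not the pdfocr source: the option choices are my assumptions, manual_ocr is a hypothetical name, and it assumes one scanned image per page and a cuneiform build that can read the image files pdfimages writes.

```shell
# Sketch of the five steps by hand. NOT the pdfocr source; the options are
# assumptions. Assumes one scanned image per page and a cuneiform build that
# can read the images written by pdfimages.
manual_ocr() (
    src=$1
    out=$2
    lang=${3:-eng}                      # cuneiform language code
    work=$(mktemp -d) || exit 1
    trap 'rm -rf "$work"' EXIT          # clean up the temporary directory

    # 1. Split into one PDF per page (pdftk may also leave a doc_data.txt
    #    report in the current directory).
    pdftk "$src" burst output "$work/page_%04d.pdf" || exit 1

    for page in "$work"/page_*.pdf; do
        base=${page%.pdf}
        # 2. Extract the page image (PPM, PGM or PBM depending on the scan).
        pdfimages "$page" "$base" || exit 1
        set -- "$base"-000.*
        img=$1
        [ -e "$img" ] || { echo "no image found in $page" >&2; exit 1; }
        # 3. Recognize the text and write it as hOCR.
        cuneiform -l "$lang" -f hocr -o "$base.hocr" "$img" || exit 1
        # 4. Build a one-page PDF with the text laid over the image.
        hocr2pdf -i "$img" -o "$base.ocr.pdf" < "$base.hocr" || exit 1
    done

    # 5. Merge the OCRed pages in page order.
    pdftk "$work"/page_*.ocr.pdf cat output "$out"
)

# Usage:
#   manual_ocr input.pdf output.pdf eng
```

Doing it this way lets you change the OCR language or swap out individual steps, at the cost of handling the temporary files yourself.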
