How To: OCR any PDF file

**RequinB4** · August 5th, 2008

As anyone who has tried knows, using optical character recognition on pdf files can be confusing, especially since Tesseract, repeatedly hailed as the best free ocr software, can only read *.tif files.

Step 1: Install needed packages

Code:

sudo apt-get install tesseract-ocr tesseract-ocr-eng xpdf-reader xpdf imagemagick xpdf-utils

Side Note: You will need to install a tesseract language package for every other language you wish to use. For example, the package tesseract-ocr-fra lets you ocr French. Check Synaptic Package Manager for the full list.

Step 2: See if you actually need ocr

xpdf-utils (which you just installed) provides a pdftotext utility:

Code:

pdftotext yourfile.pdf yourfile.txt

If it works and you get readable text, congratulations: you're done and can skip Step 3.
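If you'd rather have a script decide whether "it works", here is a minimal sketch. The function names has_text and needs_ocr and the 20-character threshold are my own assumptions, not part of xpdf-utils; it only counts the non-whitespace characters that pdftotext manages to extract.

```shell
#!/bin/sh
# has_text FILE [MIN]: succeed if FILE contains more than MIN
# non-whitespace characters. MIN defaults to 20, an arbitrary threshold.
has_text() {
  count=$(( $(tr -d '[:space:]' < "$1" | wc -c) ))
  [ "$count" -gt "${2:-20}" ]
}

# needs_ocr PDF: succeed if pdftotext extracts no usable text from PDF.
# Note: a missing or failing pdftotext also counts as "needs ocr" here.
needs_ocr() {
  out=$(mktemp) || return 2
  pdftotext "$1" "$out" 2>/dev/null
  if has_text "$out"; then rc=1; else rc=0; fi
  rm -f "$out"
  return "$rc"
}
```

Something like `needs_ocr scan.pdf && echo "no text layer, go on to Step 3"` would then tell you whether to continue.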

Step 3: OCR it

The following shell script will attempt to ocr your file. I suggest placing it somewhere in your $PATH so you can run it from the same directory as the pdf file and not have weird filenames.

NOTE: This program works by converting each page of the PDF file into a TIFF image of roughly 100MB. That means a temporary 100MB increase in hard drive usage per page of your pdf while the program is running. If that's an issue, edit the shell script: either lower the resolution (the -r value passed to pdftoppm) or make it process only a few pages at a time with -f and -l. Check man pdftoppm.

Code:

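#!/bin/sh
# Work in a scratch directory so the intermediate images are easy to clean up.
mkdir tmp || exit 1
cp "$@" tmp
cd tmp || exit 1
# -f 1 -l 10 limits the run to pages 1-10; -r 600 renders at 600 DPI.
pdftoppm * -f 1 -l 10 -r 600 ocrbook
for i in *.ppm; do convert "$i" "`basename "$i" .ppm`.tif"; done
# Change "eng" to match the tesseract language package you installed.
for i in *.tif; do tesseract "$i" "`basename "$i" .tif`" -l eng; done
for i in *.txt; do cat "$i" >> pdf-ocr-output.txt; echo "[pagebreak]" >> pdf-ocr-output.txt; done
mv pdf-ocr-output.txt ..
rm *
cd ..
rmdir tmp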
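
On the disk usage note above: a rough back-of-the-envelope check of the per-page size, assuming a US Letter page (8.5 x 11 inches) stored as uncompressed 24-bit color, which is 3 bytes per pixel. The page size and the helper name page_bytes are assumptions of this sketch; your pages and the TIFF compression convert picks may differ.

```shell
#!/bin/sh
# page_bytes DPI: bytes for one uncompressed 24-bit RGB US Letter page
# (8.5 x 11 in) rendered at DPI dots per inch. The page size is an assumption.
page_bytes() {
  width_px=$(( 85 * $1 / 10 ))
  height_px=$(( 110 * $1 / 10 ))
  echo $(( width_px * height_px * 3 ))
}

page_bytes 600   # 100980000 bytes, roughly 100 MB
page_bytes 300   # 25245000 bytes, roughly 25 MB
```

Halving the resolution cuts the size to about a quarter, so trying a lower -r value first is a cheap way to save space, at some possible cost in recognition accuracy.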

Usage:
1) Copy the script into gedit or your favorite text editing program and save it.
2) Make the script executable:

Code:

cd /path/to/saved/file/
chmod +x filename

3) Run the script: If it is saved in your $PATH, just type the filename in the console, followed by the name of the file you wish to ocr. Otherwise, cd to /path/to/saved/file/ and run:

Code:

./filename "PDF file you wish to ocr"
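
Once you have the output, the "[pagebreak]" lines make it easy to pull out a single page. A small sketch, assuming the output file holds each page's text followed by a line containing only [pagebreak]; the function name ocr_page is made up for this example.

```shell
#!/bin/sh
# ocr_page N FILE: print the text of page N from FILE, where pages are
# separated by lines consisting only of "[pagebreak]".
ocr_page() {
  awk -v want="$1" '
    BEGIN { page = 1 }
    $0 == "[pagebreak]" { page++; next }
    page == want { print }
  ' "$2"
}
```

For example, `ocr_page 3 pdf-ocr-output.txt` would print only the text recognized on the third page.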

Edit: Moving to tutorials and tips -my bad

**waspinator** · April 28th, 2009

Is there a way to integrate the ocr layer back into the pdf so that you create a searchable pdf instead of a separate text file?

**nortexoid** · October 1st, 2009

That's what I'm interested in.

**userid** · December 15th, 2009

Probably via openoffice.org 3.0 / editable pdf?

Thanks for the excellent script btw!

pdftoppm * -f 1 -l 10 -r 600 ocrbook

Seems like it's limited to the first 10 pages? I also think the quality of the ppm is rather high!

**AlwaysLearning** · January 4th, 2010

Originally Posted by userid

Probably via openoffice.org 3.0 / editable pdf?

Alright, you've got my attention here. Karmic has OO.org v3.1.1 now and yet in Writer I'm still only seeing an "Export as PDF" option... is there extra-magic (such as Extensions) that you need to edit an existing PDF to insert the text layer?

**nortexoid** · January 27th, 2010

I tried the script out and it doesn't always work. The only usable solution I've found--and it's hardly a solution!--is to run a VM of Windows and Acrobat Standard/Pro!

Pretty lame.

**khaeru** · February 5th, 2010

Originally Posted by nortexoid

Pretty lame.

Well, complaining about free scripts is also "pretty lame."

RequinB4, I edited your script a bit. It works great! For further improvement, to cut down on the disk usage one could use "pdfinfo" to get the number of pages in the file and extract them one at a time, passing the same page number to both the -f and -l parameters of pdftoppm.

Also note if you want to run this, all you really need to install are tesseract-ocr-[LANG] and imagemagick (tesseract-ocr is a dependency of each of the individual tesseract language packages). Change "-l eng" in the file to match the tesseract language you've installed/want to use.

Code:

#!/bin/sh
SCRIPT_NAME=`basename "$0" .sh`
TMP_DIR=${SCRIPT_NAME}-tmp
OUTPUT_FILE=${SCRIPT_NAME}-output.txt

mkdir "$TMP_DIR" || exit 1
cp "$@" "$TMP_DIR"
cd "$TMP_DIR" || exit 1

# Render every page at 600 DPI as ocrbook-*.ppm
pdftoppm -r 600 * ocrbook

for i in *.ppm
do
  BASE=`basename "$i" .ppm`
  convert "$i" "${BASE}.tif"
  tesseract "${BASE}.tif" "${BASE}" -l eng
  # Show each page on screen while appending it to the output file
  cat "${BASE}.txt" | tee -a "$OUTPUT_FILE"
  echo "[pagebreak]" | tee -a "$OUTPUT_FILE"
  # Remove this page's image and text files right away
  rm "${BASE}".*
done

mv "$OUTPUT_FILE" ..
rm *
cd ..
rmdir "$TMP_DIR"
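
The page-at-a-time idea mentioned above might look roughly like this. It is an untested sketch, not a drop-in replacement: the function names parse_pages and ocr_by_page and the "page" scratch prefix are invented here, and it assumes pdfinfo prints a "Pages:" line and that pdftoppm names its output page-<number>.ppm. Run it from an empty scratch directory.

```shell
#!/bin/sh
# parse_pages: read pdfinfo output on stdin and print the page count.
parse_pages() {
  awk '/^Pages:/ { print $2 }'
}

# ocr_by_page PDF OUT [LANG]: render, ocr and delete one page at a time,
# appending each page's text to OUT. LANG defaults to eng.
ocr_by_page() {
  pdf=$1; out=$2; lang=${3:-eng}
  total=$(pdfinfo "$pdf" | parse_pages)
  [ -n "$total" ] || return 1
  p=1
  while [ "$p" -le "$total" ]; do
    # Same page for -f and -l, so only one image exists at a time
    pdftoppm -f "$p" -l "$p" -r 600 "$pdf" page
    for ppm in page-*.ppm; do
      convert "$ppm" page.tif
      tesseract page.tif page -l "$lang"
      cat page.txt >> "$out"
      echo "[pagebreak]" >> "$out"
    done
    rm -f page-*.ppm page.tif page.txt
    p=$(( p + 1 ))
  done
}
```

With this approach the temporary disk usage should stay at about one page's worth of images, however long the pdf is.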

**clevertomato** · February 9th, 2010

I could use some clarification. I'd like to be able to scan documents and end up with a text-based PDF (searchable, selectable), but when I read "insert text layer" in this thread, I am a little lost.

**Koppie** · February 11th, 2010

Gscan2pdf is GREAT. It installs pretty much every package you need at once, although I also had to install the tesseract English module.

**clevertomato** · February 12th, 2010

More clarification, please? Bear with me if this seems elementary.

Let's say I type a document in Open Office and export to PDF. When I open the PDF, AFAIK, I don't see a "picture" of the text with a hidden text layer that can be indexed. It appears to be ACTUAL text that I'm looking at. I can select text, search text, etc, right there in the document that appears before me.

What I'm reading so far about these OCR tools is that I'm still left with an IMAGE of the page of text, but with an extra hidden layer of actual text somewhere that can be searched. Am I understanding this correctly? Is there no way to use OCR to end up with essentially the same kind of PDF that I get when I export a word processing document to PDF? I don't want both an image of text and actual text hidden somewhere behind it.
