Hi, I have implemented the same thing in /bin/sh (i.e. no need to install ruby). By default, it combines the recognized text with the original pdf page rather than with the rasterized image (as pdfocr does); the rasterized image can be used instead with the --ppm option.
It uses tesseract (version >3.0 is required), which works very nicely, in contrast to cuneiform, which recognized practically nothing whatever I tried (I did not try to tweak it, though).
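Since the script relies on several external tools, it can help to check for them before starting. The following is only a minimal sketch and not part of the script below: the function name missing_deps is made up for illustration, and the tool list is copied from the dependency comment in the script. It only checks that the commands are on the PATH; it does not verify that tesseract is newer than 3.0.

```shell
#!/bin/sh
# Print the name of every listed command that is not found on PATH.
# (missing_deps is a hypothetical helper, not part of the original script.)
missing_deps() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Tool list taken from the "Dependencies" comment of the script.
missing=$(missing_deps pdftoppm pdftk tesseract hocr2pdf)
if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names are printed on one line.
    echo "Missing tools:" $missing >&2
fi
```

In the real script one would probably exit with a non-zero status at this point instead of just printing the message.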
I have a problem: the output looks ok on the screen, it is searchable, etc., but when I print it, the recognized text also appears on the printout. How can this be solved?
...start edit: well, I discovered that I had skipped a large part of this thread (and of another one referenced here), and I am not the only one who implemented their own version for this task. What I do differently is that I pass the -n option to hocr2pdf (i.e. do not overlay the raster image over the recognized text) and then do the overlay later with pdftk, placing the original pdf page over the recognized text. That should be, I think, better.
However, all the implementations (including pdfsandwich, which also works quite nicely) have the same problem: the OCR-ed document looks ok on the screen, in a pdf reader, but when printed, the background text gets printed as well. Any idea how to solve this?
... end edit.
Code:
#!/bin/sh
# Dependencies: pdftoppm, pdftk, tesseract (>3.0), hocr2pdf
# see http://code.google.com/p/tesseract-ocr/wiki/ReadMe how to install tesseract (>3.0)
input=""
output=""
from=1
to=0
verbose=false
useppm=false
resolution=450
help()
{
cat <<EOF
Usage: pdf-ocr <-i input> [options]
-o <output> Specify output filename
-f|--from <number> Specify starting page
-t|--to <number> Specify last page
-p|--ppm Use the ppm image instead of the original pdf
-r <resolution> Specify resolution (image given to the OCR program)
Default is $resolution
-v|--verbose
EOF
}
while [ $# -gt 0 ] ; do
case $1 in
-i) input=$2; shift;;
-o) output=$2; shift;;
-f|--from) from=$2; shift;;
-t|--to) to=$2; shift;;
-v|--verbose) verbose=true;;
-r) resolution=$2; shift;;
-p|--ppm) useppm=true;;
-h|--help) help; exit;;
*) echo "Unknown argument: $1" >&2; exit 1;;
esac
shift
done
if [ "$input" = "" ] ; then
help
exit 1
fi
if [ "$output" = "" ] ; then
output=`echo "$input" | sed 's/\.[pP][dD][fF]$//'`
output="${output}-ocr.pdf"
fi
echo "### $input --> $output"
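tmp=`mktemp -d` || exit 1
# Remove the temporary directory on any exit, including the early error exits
trap 'rm -rf "$tmp"' EXIT
pdfinfo=${tmp}/pdfinfo.txt
# Quote the input name so that files with spaces work
pdftk "${input}" dump_data > "${pdfinfo}"
pages=`awk '$1=="NumberOfPages:" { print $2 }' "${pdfinfo}"`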
echo "### Input file has $pages pages"
# Check the page count first: the numeric tests below fail if $pages is empty
if [ "$pages" = "" -o "$pages" = 0 ] ; then
echo "### input file has no pages"
exit 2
fi
if [ "$from" -lt 1 ] ; then from=1; fi
if [ "$to" = 0 -o "$to" -gt "$pages" ] ; then to=$pages; fi
i=1
if [ "$from" -gt $i ] ; then
i="$from"
fi
if [ "$to" -lt $pages ] ; then
pages="$to"
fi
pagelist=""
while [ $i -le $pages ] ; do
echo " >> Processing page $i"
page_pdf=${tmp}/${i}.pdf
page_base=`echo $page_pdf | sed 's/\.pdf$//g'`
page_ppm=${page_base}.ppm
page_hocr=${page_base}.html
page_ocr=${page_base}.tmp.pdf
page_final=${page_base}.ocr.pdf
echo -n " Extracting page... "
pdftk "$input" cat $i output "$page_pdf"
echo "DONE, return status: $?"
echo -n " Converting to ppm... "
pdftoppm -r ${resolution} $page_pdf > $page_ppm
echo "DONE, return status: $?"
echo -n " Running tesseract (OCR)... "
tesseract_output=`tesseract ${page_ppm} ${page_base} hocr 2>&1`
tesseract_status=$?
if [ $verbose = true -o $tesseract_status != 0 ] ; then echo; echo "$tesseract_output"; fi
echo "DONE, return status: $tesseract_status"
if [ $useppm = false ] ; then
echo -n " Creating text layer (hocr2pdf)... "
hocr2pdf -r ${resolution} -n -s -i ${page_ppm} -o ${page_ocr} < ${page_hocr}
echo "DONE, status: $?"
echo -n " Underlaying text below original image (pdftk)... "
pdftk ${page_pdf} background ${page_ocr} output ${page_final}
echo "DONE, status: $?"
else
echo -n " Creating page with recognized text (hocr2pdf)... "
hocr2pdf -r ${resolution} -s -i ${page_ppm} -o ${page_final} < ${page_hocr}
echo "DONE, status: $?"
fi
pagelist="${pagelist} ${page_final}"
i=$((i+1))
done
echo "Merging pages: pdftk ${pagelist} cat output ${tmp}/final.pdf"
pdftk ${pagelist} cat output ${tmp}/final.pdf
echo Updating PDF info
pdftk ${tmp}/final.pdf update_info ${pdfinfo} output "${output}"
rm -rf ${tmp}
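
The handling of the -f/--from and -t/--to options above can be looked at in isolation. The following is a minimal sketch of the same clamping rules, written as a standalone function; the name clamp_range and its argument order are made up for illustration and do not appear in the script.

```shell
#!/bin/sh
# clamp_range FROM TO PAGES
# Prints "FIRST LAST", the page range that would actually be processed.
# TO=0 means "up to the last page", mirroring the script's default.
# (clamp_range is a hypothetical helper, not part of the original script.)
clamp_range() {
    first=$1
    last=$2
    total=$3
    if [ "$first" -lt 1 ]; then first=1; fi
    if [ "$last" -eq 0 ] || [ "$last" -gt "$total" ]; then last=$total; fi
    echo "$first $last"
}

clamp_range 1 0 12    # whole document
clamp_range 3 5 12    # pages 3 to 5
clamp_range 0 99 12   # out-of-range values are clamped
```

As in the script, nothing here guards against a starting page that is larger than the last page; in that case the page loop never runs and no pages are produced.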