Howto: Make scanned PDFs searchable (OCR) using pdfocr

**joehill** · August 16th, 2010

I hope I can get this working soon. I was excited when I saw that I could actually select text in an OCRed document, but then I realized it was only detecting something like 15 to 30 words per page. Any ideas on how to get it to see all the text?

**clasikowski** · August 18th, 2010

Hi!

There seem to be problems when hocr2pdf processes the hocr files produced by cuneiform 0.9.0 or higher. The recognition is pretty good, but when the hocr file is added to the pdf, the font size used in the hidden layer is sometimes blown way out of proportion, so the hidden layer doesn't fit the image layout at all, and some parts of the ocr result are not saved at all.

I haven't found out why this happens, but it seems that hocr2pdf cannot handle the new hocr format used by cuneiform 0.9.0 or higher.

pdfsandwich ( http://www.tobias-elze.de/pdfsandwich/index.html ) could be worth looking at. It works with cuneiform 0.7.0, the version in the standard Ubuntu packages, since it first converts the pdf image to bmp3 (the format cuneiform was written for). The resulting hocr files are slightly larger than those produced by cuneiform 0.9.0 or higher, and the results match much better!
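To make the bmp3 route a bit more concrete, here is a rough sketch of pushing a single page image through it by hand. It is not pdfsandwich's actual code: the names `run` and `ocr_page` and the `DRY_RUN` switch are made up for illustration, and it assumes ImageMagick's `convert`, `cuneiform` and ExactImage's `hocr2pdf` are installed.

```shell
#!/bin/bash
# Hypothetical helper: with DRY_RUN=1, print a command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical sketch: OCR one page image via a BMP3 intermediate.
ocr_page() {
    local img=$1
    local base=${img%.*}
    # Convert the page image to BMP3 first.
    run convert "$img" "BMP3:${base}.bmp" || return 1
    # Let cuneiform write its recognition result as hOCR.
    run cuneiform -l eng -f hocr -o "${base}.hocr" "${base}.bmp" || return 1
    # hocr2pdf reads the hOCR from stdin and combines it with the image.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "hocr2pdf -i ${base}.bmp -o ${base}.pdf < ${base}.hocr"
    else
        hocr2pdf -i "${base}.bmp" -o "${base}.pdf" < "${base}.hocr"
    fi
}
```

Setting `DRY_RUN=1` before calling `ocr_page page.ppm` only prints the three commands, which is handy for checking the file names before running anything on a real scan.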

so long
clasikowski aka hank

**snoozy23** · August 24th, 2010

Hi, the program doesn't work for me. I'm on Ubuntu 10.04 with the current kernel.

I've tried a large ebook. The quality is ok.

here is the output:

Code:

Error while running OCR on page 275
==========
Extracting page 276
Converting page 276 to ppm
pdftoppm version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
Usage: pdftoppm [options] <PDF-file> <PPM-root>
  -f <int>          : first page to print
  -l <int>          : last page to print
  -r <int>          : resolution, in DPI (default is 150)
  -mono             : generate a monochrome PBM file
  -gray             : generate a grayscale PGM file
  -t1lib <string>   : enable t1lib font rasterizer: yes, no
  -freetype <string>: enable FreeType font rasterizer: yes, no
  -aa <string>      : enable font anti-aliasing: yes, no
  -aaVector <string>: enable vector anti-aliasing: yes, no
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -cfg <string>     : configuration file to use in place of .xpdfrc
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information
Running OCR on page 276
Magick: Improper image header `276.ppm' @ pnm.c/ReadPNMImage/297
Cuneiform for Linux 0.9.0
Error while running OCR on page 276
Merging together PDF files
/tmp/d20100824-6962-1fqv3qc/*-new.pdf not found as file or resource.
Error: Failed to open PDF file: 
   /tmp/d20100824-6962-1fqv3qc/*-new.pdf
Errors encountered.  No output created.
Done.  Input errors, so no output created.
Updating PDF info for /home/steffen/tmp/morgen.pdf
/tmp/d20100824-6962-1fqv3qc/merged.pdf not found as file or resource.
Error: Failed to open PDF file: 
   /tmp/d20100824-6962-1fqv3qc/merged.pdf
Errors encountered.  No output created.
Done.  Input errors, so no output created.
Cleaning up temporary files

Maybe there is already a solution. Thank you.
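The `Improper image header` line in the log suggests that `276.ppm` was missing or not a real image by the time it reached the OCR step, which would fit pdftoppm printing its usage text instead of converting the page. One quick way to check a page file by hand is to look at its first two bytes. This is only a diagnostic sketch; `is_pnm` is a made-up name and not part of pdfocr.

```shell
#!/bin/bash
# Hypothetical diagnostic: succeed if the file starts with a binary
# PBM/PGM/PPM magic number (P4, P5 or P6).
is_pnm() {
    local magic
    [ -s "$1" ] || return 1          # missing or empty file
    magic=$(head -c 2 -- "$1")
    case "$magic" in
        P4|P5|P6) return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example, `is_pnm 276.ppm || echo "276.ppm is not a valid image"` would flag the page before cuneiform is ever started.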

**old robots never rust** · August 25th, 2010

Hank, thanks for this suggestion. Unfortunately, it gives me the same mismatched results that others have had.

I tried the -noimage option and can see that the output has randomly sized text that often overlaps, which is why it never lines up with the original image.

Has anyone come up with a solution to this yet?

**geodanny** · October 4th, 2010

Like the others, I find that the text and image layers do not match up. Do you need samples of our files?

btw: thanks for providing this program. Things are better than before.

**gourneau** · October 22nd, 2010

Originally Posted by snoozy23

Hi, the program doesn't work for me. I'm on Ubuntu 10.04 with the current kernel.

I modified your solution to work with Ubuntu 9.10 and pdftoppm 3.02 (which looks like what you used too). The problem I had was that pdftoppm pads the page number in the output .ppm name with zeros to six digits (for example ocrbook-000001.ppm), so the script builds the file name with printf.

Code:

#!/bin/bash
# OCR every page of the given PDFs with tesseract and collect the text
# in <name>.txt in the current directory.
# Usage: script.sh [-r degrees] file.pdf ...

TESS_LANG=eng
rflag=
# first figure out what args we have (optional -r <degrees> rotation)
while getopts 'r:' OPT
do
    if [ "$OPT" = "r" ]
    then
        rflag="-rotate $OPTARG"
    fi
done
shift $((OPTIND - 1))

CURRENT_DIR=$(pwd)
SCRIPT_NAME=$(basename "$0" .sh)
TMP_DIR="${SCRIPT_NAME}-tmp"
mkdir "${TMP_DIR}" || exit 1

for thisfile in "$@"
do
    NAME=$(basename "${thisfile}" .pdf)
    # name of the copy inside the temp dir (the original path may be relative)
    PDF=$(basename "${thisfile}")
    cp "${thisfile}" "${TMP_DIR}" || continue
    cd "${TMP_DIR}" || exit 1

    echo "Examining: ${thisfile}"
    pgs=$(pdfinfo "${PDF}" | awk '/^Pages:/ {print $2}')
    echo "Found ${pgs} pages; converting..."
    # it's only fair, since we're suppressing it later...
    echo "Tesseract Open Source OCR Engine"
    for x in $(seq 1 "${pgs}")
    do
        echo -en "  Page ${x}..."
        pdftoppm -f "$x" -l "$x" -r 600 "${PDF}" ocrbook
        # pdftoppm 3.02 zero-pads the page number to six digits
        BASE=$(printf 'ocrbook-%06d' "$x")
        # rflag is left unquoted on purpose so it splits into two arguments
        convert "${BASE}.ppm" ${rflag} "${BASE}.tif"
        tesseract "${BASE}.tif" "${BASE}" -l "${TESS_LANG}" > /dev/null 2>&1
        cat "${BASE}.txt" >> "${NAME}.txt"
        echo "[pagebreak]" >> "${NAME}.txt"
        rm ocrbook*
        echo "done"
    done

    echo "Conversion complete"

    mv "${NAME}.txt" "${CURRENT_DIR}"
    rm -- *
    cd "${CURRENT_DIR}" || exit 1
done

rmdir "${TMP_DIR}"
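
In case the zero padding is different on your pdftoppm, another option is to not compute the name at all and just pick up whatever .ppm file was written for the page. A rough sketch, assuming the page was rendered with the `ocrbook` prefix as in the script above and that no other `ocrbook-*.ppm` files are lying around; `find_page_base` is a made-up helper name:

```shell
#!/bin/bash
# Hypothetical helper: print the base name (without .ppm) of the first
# page image written with the given prefix, whatever its zero padding.
find_page_base() {
    local prefix=$1 f
    for f in "${prefix}"-*.ppm; do
        [ -e "$f" ] || return 1      # the glob matched nothing
        echo "${f%.ppm}"
        return 0
    done
    return 1
}
```

Inside the page loop this could replace the padding logic with something like `BASE=$(find_page_base ocrbook) || continue`, which also skips pages that pdftoppm failed to render.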

**gourneau** · October 22nd, 2010

wrong thread, sorry for the post. Please delete.

**fuzzyworbles** · November 13th, 2010

this app is really promising, especially because there's nothing out there like it for linux. it would be extremely useful for any organization that images a grip of documents.

in our office, for example, we image all our files. the fancy expensive feed scanners we have scan into pdf, but aren't OCR'd.

if this app were accurate, i'd use it to image tens of thousands of pdf's that i otherwise have to rely on effective file names to find what i'm looking for.

my experience, however, is like the rest: poor recognition and badly misaligned. sure, tesseract is out there, but OCRing to a separate text file isn't very useful in this context.

**potrick** · November 30th, 2010

I'm really wanting to try this out and write it up, but the PPA doesn't work for 10.10. Is there any chance of that happening soon?

**nortexoid** · December 3rd, 2010

I just tried installing from the PPA in Maverick 10.10 too, but the dependencies don't allow one to install it. Quite unfortunate. Will there be updated packages soon?
