Howto: Make scanned PDFs searchable (OCR) using pdfocr

Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

Hi!

It still doesn't work properly; the best solution I have found so far for PDF files with a searchable text layer is gscan2pdf 0.9.31. With ocropus as the engine, recognition is pretty good, and the text lines up very accurately with the image.

I have developed another solution that produces DjVu files: see http://wiki.ubuntuusers.de/xsane2djvu , a wrapper for XSane's text recognition. The page is in German, but the script is annotated, so it should not be too difficult to use...

so long
clasikowski AKA hank

Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

Honestly, this failed miserably, but I wish it hadn't. I'm on 11.04 64-bit, and here is the output I received while running the script:

Code:

dave@dave:~/Public$ pdfocr -i chapter1.pdf -o ocrChapter1.pdf
Input file is /home/dave/Public/chapter1.pdf
Output file is /home/dave/Public/ocrChapter1.pdf
Using working dir /tmp/d20110616-18134-1fnw6dx
Getting info from PDF file
InfoKey: Creator
InfoValue: PScript5.dll Version 5.2.2
InfoKey: Title
InfoValue: C:\Documents and Settings\dave\Desktop\FP00001.SPL
InfoKey: Author
InfoValue: me
InfoKey: Producer
InfoValue: GPL Ghostscript 8.15
InfoKey: ModDate
InfoValue: D:20110512182529
InfoKey: CreationDate
InfoValue: D:20110512182529
PdfID0: 1e3052408a834d039f6d4a01a63f4d7
PdfID1: 1e3052408a834d039f6d4a01a63f4d7
NumberOfPages: 43
Converting 43 pages
==========
Extracting page 1
Converting page 1 to ppm
Running OCR on page 1
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 1
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 2
Converting page 2 to ppm
Running OCR on page 2
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 2
Warning: Image x/y resolution not set, defaulting to: 300
Warning: tag mismatch: 'b' can not close last open: 'i'
Warning: tag mismatch: 'span' can not close last open: 'b'
Warning: tag mismatch: 'p' can not close last open: 'b'
Warning: tag mismatch: 'i' can not close last open: 'b'
Warning: tag mismatch: 'span' can not close last open: 'i'
Warning: tag mismatch: 'p' can not close last open: 'i'
Warning: tag mismatch: 'i' can not close last open: 'b'
Warning: tag mismatch: 'span' can not close last open: 'i'
Warning: tag mismatch: 'p' can not close last open: 'i'
Warning: tag mismatch: 'b' can not close last open: 'i'
Warning: tag mismatch: 'span' can not close last open: 'b'
Warning: tag mismatch: 'b' can not close last open: 'i'
Warning: tag mismatch: 'span' can not close last open: 'b'
Warning: tag mismatch: 'p' can not close last open: 'b'
Warning: tag mismatch: 'div' can not close last open: 'b'
Warning: tag mismatch: 'body' can not close last open: 'b'
Warning: tag mismatch: 'html' can not close last open: 'b'
Warning: unclosed tag: 'b'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'b'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'i'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'i'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'b'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'div'
Warning: unclosed tag: 'body'
Warning: unclosed tag: 'html'
==========
Extracting page 3
Converting page 3 to ppm
Running OCR on page 3
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 3
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 4
Converting page 4 to ppm
Running OCR on page 4
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 4
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 5
Converting page 5 to ppm
Running OCR on page 5
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 5
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 6
Converting page 6 to ppm
Running OCR on page 6
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 6
Warning: Image x/y resolution not set, defaulting to: 300
Warning: tag mismatch: 'i' can not close last open: 'b'
Warning: tag mismatch: 'i' can not close last open: 'b'
Warning: tag mismatch: 'span' can not close last open: 'i'
Warning: tag mismatch: 'p' can not close last open: 'i'
Warning: tag mismatch: 'b' can not close last open: 'i'
Warning: tag mismatch: 'span' can not close last open: 'b'
Warning: tag mismatch: 'p' can not close last open: 'b'
Warning: tag mismatch: 'b' can not close last open: 'i'
Warning: tag mismatch: 'span' can not close last open: 'b'
Warning: tag mismatch: 'p' can not close last open: 'b'
Warning: tag mismatch: 'i' can not close last open: 'b'
Warning: tag mismatch: 'span' can not close last open: 'i'
Warning: tag mismatch: 'p' can not close last open: 'i'
Warning: tag mismatch: 'i' can not close last open: 'b'
Warning: tag mismatch: 'span' can not close last open: 'i'
Warning: tag mismatch: 'p' can not close last open: 'i'
Warning: tag mismatch: 'div' can not close last open: 'i'
Warning: tag mismatch: 'body' can not close last open: 'i'
Warning: tag mismatch: 'html' can not close last open: 'i'
Warning: unclosed tag: 'i'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'i'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'b'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'b'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'i'
Warning: unclosed tag: 'i'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'div'
Warning: unclosed tag: 'body'
Warning: unclosed tag: 'html'
==========
Extracting page 7
Converting page 7 to ppm
Running OCR on page 7
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 7
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 8
Converting page 8 to ppm
Running OCR on page 8
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 8
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 9
Converting page 9 to ppm
Running OCR on page 9
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 9
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 10
Converting page 10 to ppm
Running OCR on page 10
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 10
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 11
Converting page 11 to ppm
Running OCR on page 11
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 11
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 12
Converting page 12 to ppm
Running OCR on page 12
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 12
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 13
Converting page 13 to ppm
Running OCR on page 13
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 13
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 14
Converting page 14 to ppm
Running OCR on page 14
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 14
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 15
Converting page 15 to ppm
Running OCR on page 15
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 15
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 16
Converting page 16 to ppm
Running OCR on page 16
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 16
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 17
Converting page 17 to ppm
Running OCR on page 17
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 17
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 18
Converting page 18 to ppm
Running OCR on page 18
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 18
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 19
Converting page 19 to ppm
Running OCR on page 19
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 19
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 20
Converting page 20 to ppm
Running OCR on page 20
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 20
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 21
Converting page 21 to ppm
Running OCR on page 21
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 21
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 22
Converting page 22 to ppm
Running OCR on page 22
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 22
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 23
Converting page 23 to ppm
Running OCR on page 23
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 23
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 24
Converting page 24 to ppm
Running OCR on page 24
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 24
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 25
Converting page 25 to ppm
Running OCR on page 25
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 25
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 26
Converting page 26 to ppm
Running OCR on page 26
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 26
Warning: Image x/y resolution not set, defaulting to: 300
Warning: tag mismatch: 'i' can not close last open: 'b'
Warning: tag mismatch: 'b' can not close last open: 'i'
Warning: tag mismatch: 'span' can not close last open: 'b'
Warning: tag mismatch: 'p' can not close last open: 'b'
Warning: tag mismatch: 'div' can not close last open: 'b'
Warning: tag mismatch: 'body' can not close last open: 'b'
Warning: tag mismatch: 'html' can not close last open: 'b'
Warning: unclosed tag: 'b'
Warning: unclosed tag: 'i'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'div'
Warning: unclosed tag: 'body'
Warning: unclosed tag: 'html'
==========
Extracting page 27
Converting page 27 to ppm
Running OCR on page 27
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 27
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 28
Converting page 28 to ppm
Running OCR on page 28
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 28
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 29
Converting page 29 to ppm
Running OCR on page 29
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 29
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 30
Converting page 30 to ppm
Running OCR on page 30
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 30
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 31
Converting page 31 to ppm
Running OCR on page 31
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 31
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 32
Converting page 32 to ppm
Running OCR on page 32
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 32
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 33
Converting page 33 to ppm
Running OCR on page 33
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 33
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 34
Converting page 34 to ppm
Running OCR on page 34
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 34
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 35
Converting page 35 to ppm
Running OCR on page 35
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 35
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 36
Converting page 36 to ppm
Running OCR on page 36
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 36
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 37
Converting page 37 to ppm
Running OCR on page 37
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 37
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 38
Converting page 38 to ppm
Running OCR on page 38
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 38
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 39
Converting page 39 to ppm
Running OCR on page 39
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 39
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 40
Converting page 40 to ppm
Running OCR on page 40
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 40
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 41
Converting page 41 to ppm
Running OCR on page 41
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 41
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 42
Converting page 42 to ppm
Running OCR on page 42
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 42
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 43
Converting page 43 to ppm
Running OCR on page 43
Cuneiform for Linux 1.0.0
PUMA_XFinalrecognition failed.
Error while running OCR on page 43
Merging together PDF files
Updating PDF info for /home/dave/Public/ocrChapter1.pdf
Cleaning up temporary files
/usr/bin/pdfocr:287:in `delete': Is a directory - /tmp/d20110616-18134-1fnw6dx/20_files (Errno::EISDIR)
        from /usr/bin/pdfocr:287
        from /usr/bin/pdfocr:283:in `foreach'
        from /usr/bin/pdfocr:283
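
The traceback at the end suggests that the cleanup step deletes entries in the working directory one by one and stops when it meets a subdirectory (here 20_files, presumably left behind by the OCR step). A minimal sketch of a cleanup that tolerates subdirectories, assuming the whole working directory can simply be removed; the helper name clean_workdir is hypothetical and this is not the actual pdfocr code:

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: remove a working directory and everything in it.
# FileUtils.rm_rf removes files and nested directories alike, whereas
# File.delete fails when it is handed a directory (Errno::EISDIR in the
# output above).
def clean_workdir(dir)
  FileUtils.rm_rf(dir)
end

# Demonstration with a throwaway directory that mimics the failing layout.
workdir = Dir.mktmpdir('pdfocr-demo')
FileUtils.mkdir_p(File.join(workdir, '20_files'))
File.write(File.join(workdir, '20.hocr'), '<html></html>')

clean_workdir(workdir)
puts File.exist?(workdir)  # prints "false"
```

Whether removing everything is acceptable depends on whether the script keeps anything else in that directory, so treat this as a starting point only.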

Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

It seems the script does not handle spaces in either the file name or the path to the file. Are you still supporting updates? Is there any way one could access your repo to fix this bug?
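
If spaces are the culprit, a likely cause is that file names are interpolated unquoted into shell command strings. A minimal sketch of one possible fix using Ruby's standard Shellwords module; the helper name and the sample path are made up for illustration, and this is not taken from pdfocr itself:

```ruby
require 'shellwords'

# Hypothetical helper: build a page-extraction command with each path
# escaped, so a name containing spaces stays a single shell argument.
def extract_page_command(infile, page, outfile)
  "pdftk #{Shellwords.escape(infile)} cat #{page} output #{Shellwords.escape(outfile)}"
end

cmd = extract_page_command('/home/dave/My Scans/chapter 1.pdf', 1, '01.pdf')
puts cmd
# Splitting the command the way a shell would shows the path is still one word.
puts Shellwords.split(cmd).inspect
```

The same escaping would have to be applied wherever the input or output path reaches a shell command.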

Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

The new version of PDF Xchange Viewer can OCR scanned pages very well. It's not a native Linux app and it's proprietary, but it runs very well under Wine and supports multiple languages.

Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

If command-line tools are not an absolute must for you, give PDF Xchange Viewer a try (http://www.tracker-software.com/prod...xchange-viewer). It is a Windows program, but it runs flawlessly under Wine, and it has OCR and a whole lot of other features. It has a free and a pro version; I decided to pay for the pro version, which offers more tools.

Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

Quote:

Originally Posted by the_summer

It's a great helper. I tried it on some files. The OCR seems to work, but when I search for words, the highlights are often quite far away from the searched word.
Maybe it's because there is always the warning:

Code:

Warning: Image x/y resolution not set, defaulting to: 300

Is there a way to manually set different values for the resolution?
I didn't find anything about it in the man page.

Best wishes

Jan

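At least part of the problem here is that pdftoppm and hocr2pdf (used by pdfocr) fall back to default resolutions when none is given: pdftoppm renders at 150 dpi, while hocr2pdf, as the warning says, assumes 300. If those values disagree with each other or with the input file's actual resolution, nothing lines up. Here is how to hard-code a 600 dpi resolution in /usr/bin/pdfocr: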

Code:

1.upto(pagenum) { |i|
  puts "=========="
  puts "Extracting page #{i}"
  basefn = i.to_s.rjust(numdigits, '0')
  sh "pdftk #{infile} cat #{i} output #{basefn+'.pdf'}"
  if not File.file?(basefn+'.pdf')
    puts "Error while extracting page #{i}"
    next
  end
  puts "Converting page #{i} to ppm"
  sh "pdftoppm -r 600 #{basefn+'.pdf'} > #{basefn+'.ppm'}"
  if not File.file?(basefn+'.ppm')
    puts "Error while converting page #{i} to ppm"
    next
  end
  puts "Running OCR on page #{i}"
  sh "cuneiform -l #{language} -f hocr -o #{basefn+'.hocr'} #{basefn+'.ppm'}"
  if not File.file?(basefn+'.hocr')
    puts "Error while running OCR on page #{i}"
    next
  end
  puts "Embedding text into PDF for page #{i}"
  sh "hocr2pdf -r 600 -i #{basefn+'.ppm'} -s -o #{basefn+'-new.pdf'} < #{basefn+'.hocr'}"
  if not File.file?(basefn+'-new.pdf')
    puts "Error while embedding text into PDF for page #{i}"
    next
  end
}

Add "-r 600" (coincidentally, both tools use the same option) in two places: the pdftoppm and hocr2pdf commands in the block of code above, as shown. This assumes 600 dpi input.

This gives significantly better results.
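
Rather than hard-coding the value twice, the two commands could take the resolution from a single variable so that they cannot drift apart. A minimal sketch, assuming a hypothetical dpi setting (pdfocr as shown above has no such option) and reusing the command lines from the block above; the helper names are made up:

```ruby
# Hypothetical helpers: build the two resolution-sensitive commands from one
# dpi value so that pdftoppm and hocr2pdf always agree.
def pdftoppm_command(basefn, dpi)
  "pdftoppm -r #{dpi} #{basefn}.pdf > #{basefn}.ppm"
end

def hocr2pdf_command(basefn, dpi)
  "hocr2pdf -r #{dpi} -i #{basefn}.ppm -s -o #{basefn}-new.pdf < #{basefn}.hocr"
end

dpi = 600  # assumed scan resolution; adjust to match the input file
puts pdftoppm_command('01', dpi)
puts hocr2pdf_command('01', dpi)
```

In the script itself, the returned strings would be passed to sh in place of the two literal command lines.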

Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

I have found that Tesseract (3.02) is more accurate than Cuneiform (1.1.0).

Tesseract is much slower though; ~10-15 seconds a page on a Core i5.

To use Tesseract instead of Cuneiform change

Code:

sh "cuneiform -l #{language} -f hocr -o #{basefn+'.hocr'} #{basefn+'.ppm'}"

to

Code:

sh "tesseract #{basefn+'.ppm'} #{basefn+'.hocr'} -l #{language} hocr"

and

Code:

sh "hocr2pdf -i #{basefn+'.ppm'} -s -o #{basefn+'-new.pdf'} < #{basefn+'.hocr'}"

to

Code:

sh "hocr2pdf -i #{basefn+'.ppm'} -s -o #{basefn+'-new.pdf'} < #{basefn+'.hocr.html'}"

YMMV but this also cleared up some formatting errors and inconsistencies I was getting with Cuneiform's hOCR.
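
To switch between the two engines without editing the script each time, the change above could be wrapped in a small helper. A minimal sketch using only the command lines quoted in this post; the helper name ocr_step and the engine argument are hypothetical, not part of pdfocr:

```ruby
# Hypothetical helper: return the OCR command and the hOCR file it should
# produce. With the invocation quoted above, Tesseract's output ends up in
# "<output base>.html", hence the different file name passed to hocr2pdf.
def ocr_step(engine, basefn, language)
  case engine
  when 'cuneiform'
    ["cuneiform -l #{language} -f hocr -o #{basefn}.hocr #{basefn}.ppm",
     "#{basefn}.hocr"]
  when 'tesseract'
    ["tesseract #{basefn}.ppm #{basefn}.hocr -l #{language} hocr",
     "#{basefn}.hocr.html"]
  else
    raise ArgumentError, "unknown OCR engine: #{engine}"
  end
end

command, hocr_file = ocr_step('tesseract', '01', 'eng')
puts command
puts "hocr2pdf -i 01.ppm -s -o 01-new.pdf < #{hocr_file}"
```

If the script checks for basefn+'.hocr' after the OCR step, as in the block quoted earlier in this thread, that check would presumably need to use the returned file name as well.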

Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

I tried both pdfocr (which uses Cuneiform) and Tesseract on scanned PDF files and could not make them work.

Then I installed Lios, which has a GUI and lets you choose Cuneiform or Tesseract in its preferences. Using OCR-Pdf from the menu, I tried the same file and got good results with both engines.

There is also a Lios thread on this forum.

Kamalakar