Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr
Hi!
It still doesn't work properly. The best solution I have found so far for PDF files with a searchable text layer is gscan2pdf 0.9.31; with ocropus as the engine, recognition is pretty good and the alignment between image and text is very accurate.
I have developed another solution that produces DjVu files, see http://wiki.ubuntuusers.de/xsane2djvu , a wrapper for xsane text recognition. The page is in German, but the script is annotated, so it should not be too difficult to use...
so long
clasikowski AKA hank
Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr
Honestly, this failed miserably, though I wish it hadn't. I'm on 11.04 64-bit, and here is the output I received while running the script:
Code:
dave@dave:~/Public$ pdfocr -i chapter1.pdf -o ocrChapter1.pdf
Input file is /home/dave/Public/chapter1.pdf
Output file is /home/dave/Public/ocrChapter1.pdf
Using working dir /tmp/d20110616-18134-1fnw6dx
Getting info from PDF file
InfoKey: Creator
InfoValue: PScript5.dll Version 5.2.2
InfoKey: Title
InfoValue: C:\Documents and Settings\dave\Desktop\FP00001.SPL
InfoKey: Author
InfoValue: me
InfoKey: Producer
InfoValue: GPL Ghostscript 8.15
InfoKey: ModDate
InfoValue: D:20110512182529
InfoKey: CreationDate
InfoValue: D:20110512182529
PdfID0: 1e3052408a834d039f6d4a01a63f4d7
PdfID1: 1e3052408a834d039f6d4a01a63f4d7
NumberOfPages: 43
Converting 43 pages
==========
Extracting page 1
Converting page 1 to ppm
Running OCR on page 1
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 1
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 2
Converting page 2 to ppm
Running OCR on page 2
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 2
Warning: Image x/y resolution not set, defaulting to: 300
Warning: tag mismatch: 'b' can not close last open: 'i'
Warning: tag mismatch: 'span' can not close last open: 'b'
Warning: tag mismatch: 'p' can not close last open: 'b'
Warning: tag mismatch: 'i' can not close last open: 'b'
Warning: tag mismatch: 'span' can not close last open: 'i'
Warning: tag mismatch: 'p' can not close last open: 'i'
Warning: tag mismatch: 'i' can not close last open: 'b'
Warning: tag mismatch: 'span' can not close last open: 'i'
Warning: tag mismatch: 'p' can not close last open: 'i'
Warning: tag mismatch: 'b' can not close last open: 'i'
Warning: tag mismatch: 'span' can not close last open: 'b'
Warning: tag mismatch: 'b' can not close last open: 'i'
Warning: tag mismatch: 'span' can not close last open: 'b'
Warning: tag mismatch: 'p' can not close last open: 'b'
Warning: tag mismatch: 'div' can not close last open: 'b'
Warning: tag mismatch: 'body' can not close last open: 'b'
Warning: tag mismatch: 'html' can not close last open: 'b'
Warning: unclosed tag: 'b'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'b'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'i'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'i'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'b'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'div'
Warning: unclosed tag: 'body'
Warning: unclosed tag: 'html'
==========
Extracting page 3
Converting page 3 to ppm
Running OCR on page 3
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 3
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 4
Converting page 4 to ppm
Running OCR on page 4
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 4
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 5
Converting page 5 to ppm
Running OCR on page 5
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 5
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 6
Converting page 6 to ppm
Running OCR on page 6
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 6
Warning: Image x/y resolution not set, defaulting to: 300
Warning: tag mismatch: 'i' can not close last open: 'b'
Warning: tag mismatch: 'i' can not close last open: 'b'
Warning: tag mismatch: 'span' can not close last open: 'i'
Warning: tag mismatch: 'p' can not close last open: 'i'
Warning: tag mismatch: 'b' can not close last open: 'i'
Warning: tag mismatch: 'span' can not close last open: 'b'
Warning: tag mismatch: 'p' can not close last open: 'b'
Warning: tag mismatch: 'b' can not close last open: 'i'
Warning: tag mismatch: 'span' can not close last open: 'b'
Warning: tag mismatch: 'p' can not close last open: 'b'
Warning: tag mismatch: 'i' can not close last open: 'b'
Warning: tag mismatch: 'span' can not close last open: 'i'
Warning: tag mismatch: 'p' can not close last open: 'i'
Warning: tag mismatch: 'i' can not close last open: 'b'
Warning: tag mismatch: 'span' can not close last open: 'i'
Warning: tag mismatch: 'p' can not close last open: 'i'
Warning: tag mismatch: 'div' can not close last open: 'i'
Warning: tag mismatch: 'body' can not close last open: 'i'
Warning: tag mismatch: 'html' can not close last open: 'i'
Warning: unclosed tag: 'i'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'i'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'b'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'b'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'i'
Warning: unclosed tag: 'i'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'div'
Warning: unclosed tag: 'body'
Warning: unclosed tag: 'html'
==========
Extracting page 7
Converting page 7 to ppm
Running OCR on page 7
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 7
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 8
Converting page 8 to ppm
Running OCR on page 8
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 8
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 9
Converting page 9 to ppm
Running OCR on page 9
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 9
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 10
Converting page 10 to ppm
Running OCR on page 10
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 10
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 11
Converting page 11 to ppm
Running OCR on page 11
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 11
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 12
Converting page 12 to ppm
Running OCR on page 12
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 12
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 13
Converting page 13 to ppm
Running OCR on page 13
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 13
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 14
Converting page 14 to ppm
Running OCR on page 14
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 14
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 15
Converting page 15 to ppm
Running OCR on page 15
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 15
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 16
Converting page 16 to ppm
Running OCR on page 16
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 16
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 17
Converting page 17 to ppm
Running OCR on page 17
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 17
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 18
Converting page 18 to ppm
Running OCR on page 18
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 18
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 19
Converting page 19 to ppm
Running OCR on page 19
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 19
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 20
Converting page 20 to ppm
Running OCR on page 20
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 20
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 21
Converting page 21 to ppm
Running OCR on page 21
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 21
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 22
Converting page 22 to ppm
Running OCR on page 22
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 22
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 23
Converting page 23 to ppm
Running OCR on page 23
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 23
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 24
Converting page 24 to ppm
Running OCR on page 24
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 24
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 25
Converting page 25 to ppm
Running OCR on page 25
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 25
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 26
Converting page 26 to ppm
Running OCR on page 26
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 26
Warning: Image x/y resolution not set, defaulting to: 300
Warning: tag mismatch: 'i' can not close last open: 'b'
Warning: tag mismatch: 'b' can not close last open: 'i'
Warning: tag mismatch: 'span' can not close last open: 'b'
Warning: tag mismatch: 'p' can not close last open: 'b'
Warning: tag mismatch: 'div' can not close last open: 'b'
Warning: tag mismatch: 'body' can not close last open: 'b'
Warning: tag mismatch: 'html' can not close last open: 'b'
Warning: unclosed tag: 'b'
Warning: unclosed tag: 'i'
Warning: unclosed tag: 'span'
Warning: unclosed tag: 'p'
Warning: unclosed tag: 'div'
Warning: unclosed tag: 'body'
Warning: unclosed tag: 'html'
==========
Extracting page 27
Converting page 27 to ppm
Running OCR on page 27
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 27
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 28
Converting page 28 to ppm
Running OCR on page 28
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 28
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 29
Converting page 29 to ppm
Running OCR on page 29
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 29
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 30
Converting page 30 to ppm
Running OCR on page 30
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 30
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 31
Converting page 31 to ppm
Running OCR on page 31
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 31
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 32
Converting page 32 to ppm
Running OCR on page 32
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 32
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 33
Converting page 33 to ppm
Running OCR on page 33
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 33
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 34
Converting page 34 to ppm
Running OCR on page 34
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 34
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 35
Converting page 35 to ppm
Running OCR on page 35
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 35
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 36
Converting page 36 to ppm
Running OCR on page 36
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 36
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 37
Converting page 37 to ppm
Running OCR on page 37
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 37
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 38
Converting page 38 to ppm
Running OCR on page 38
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 38
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 39
Converting page 39 to ppm
Running OCR on page 39
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 39
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 40
Converting page 40 to ppm
Running OCR on page 40
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 40
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 41
Converting page 41 to ppm
Running OCR on page 41
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 41
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 42
Converting page 42 to ppm
Running OCR on page 42
Cuneiform for Linux 1.0.0
Embedding text into PDF for page 42
Warning: Image x/y resolution not set, defaulting to: 300
==========
Extracting page 43
Converting page 43 to ppm
Running OCR on page 43
Cuneiform for Linux 1.0.0
PUMA_XFinalrecognition failed.
Error while running OCR on page 43
Merging together PDF files
Updating PDF info for /home/dave/Public/ocrChapter1.pdf
Cleaning up temporary files
/usr/bin/pdfocr:287:in `delete': Is a directory - /tmp/d20110616-18134-1fnw6dx/20_files (Errno::EISDIR)
from /usr/bin/pdfocr:287
from /usr/bin/pdfocr:283:in `foreach'
from /usr/bin/pdfocr:283
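The traceback at the end of this log shows `delete` raising Errno::EISDIR on a `20_files` subdirectory during cleanup, which suggests the cleanup loop calls File.delete on every entry of the working directory and trips over a directory left behind by the OCR step. A minimal sketch of a more tolerant cleanup, assuming that is what happens around line 283 of /usr/bin/pdfocr (`clean_workdir` is a hypothetical helper, not part of pdfocr):

```ruby
require 'fileutils'

# Hypothetical helper, not part of pdfocr: remove every entry in the working
# directory, whether it is a plain file or a subdirectory such as "20_files".
def clean_workdir(dir)
  Dir.foreach(dir) do |entry|
    next if entry == '.' || entry == '..'
    # rm_rf removes files and directories alike and ignores missing entries.
    FileUtils.rm_rf(File.join(dir, entry))
  end
end
```

This leaves the working directory itself in place; whether pdfocr removes it afterwards is not visible from the log.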
Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr
It seems the script does not handle spaces in either the file name or the path to the file. Are you still supporting updates? Is there any way one could access your repo to fix this bug?
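One plausible cause is that the script interpolates file names directly into shell command strings, so a space splits one path into two arguments. A minimal sketch of the usual fix, using Ruby's standard Shellwords module (`extract_page_cmd` is a hypothetical helper for illustration; the pdftk invocation mirrors the one used by pdfocr):

```ruby
require 'shellwords'

# Hypothetical helper, not part of pdfocr: build the page-extraction command
# with both paths escaped so the shell treats each as a single argument.
def extract_page_cmd(infile, page, outfile)
  "pdftk #{Shellwords.escape(infile)} cat #{page} output #{Shellwords.escape(outfile)}"
end

puts extract_page_cmd('/home/dave/My Documents/chapter 1.pdf', 1, '01.pdf')
```

The same escaping would have to be applied to every command the script builds from a user-supplied path.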
Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr
The new version of PDF Xchange Viewer can OCR scanned pages very well. It's not a native Linux app and it's proprietary, but it runs very well under Wine and supports multiple languages.
Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr
If command-line tools are not an absolute must for you, give PDF Xchange Viewer a try (http://www.tracker-software.com/prod...xchange-viewer). It is a Windows program, but it runs flawlessly under Wine and has OCR and a whole lot of other features. There is a free and a pro version; I decided to pay for the pro version, which offers more tools.
Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr
Quote:
Originally Posted by
the_summer
It's a great helper. I tried it on some files. The OCR seems to work, but when I search for words, the highlight marks are often quite far away from the searched word.
Maybe it's because there is always the warning:
Code:
Warning: Image x/y resolution not set, defaulting to: 300
Is there a way to manually set different values for the resolution?
I didn't find anything about it in the man page.
Best wishes
Jan
At least part of the problem here is that pdftoppm and hocr2pdf (used by pdfocr) default to low resolutions, like 150. If the input file has higher resolution, nothing lines up. Here is how to hard-code a 600dpi resolution in /usr/bin/pdfocr:
Code:
1.upto(pagenum) { |i|
  puts "=========="
  puts "Extracting page #{i}"
  basefn = i.to_s.rjust(numdigits, '0')
  sh "pdftk #{infile} cat #{i} output #{basefn+'.pdf'}"
  if not File.file?(basefn+'.pdf')
    puts "Error while extracting page #{i}"
    next
  end
  puts "Converting page #{i} to ppm"
  # "-r 600" added here: render the page at 600 dpi
  sh "pdftoppm -r 600 #{basefn+'.pdf'} > #{basefn+'.ppm'}"
  if not File.file?(basefn+'.ppm')
    puts "Error while converting page #{i} to ppm"
    next
  end
  puts "Running OCR on page #{i}"
  sh "cuneiform -l #{language} -f hocr -o #{basefn+'.hocr'} #{basefn+'.ppm'}"
  if not File.file?(basefn+'.hocr')
    puts "Error while running OCR on page #{i}"
    next
  end
  puts "Embedding text into PDF for page #{i}"
  # "-r 600" added here: tell hocr2pdf the image resolution
  sh "hocr2pdf -r 600 -i #{basefn+'.ppm'} -s -o #{basefn+'-new.pdf'} < #{basefn+'.hocr'}"
  if not File.file?(basefn+'-new.pdf')
    puts "Error while embedding text into PDF for page #{i}"
    next
  end
}
Add "-r 600" (coincidentally the same parameter) in two places, to the pdftoppm and hocr2pdf sections of the above block of code (assuming 600dpi input), as shown.
This gives significantly better results.
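If the scan resolution is not known in advance, it can be estimated from the image width in pixels and the page width in PDF points (1 point = 1/72 inch), and then passed to both commands from one place instead of being hard-coded twice. A rough sketch under those assumptions; `estimate_dpi`, `render_cmd` and `embed_cmd` are hypothetical helpers, not part of pdfocr, and the pixel and page widths are example values:

```ruby
# Hypothetical helpers, not part of pdfocr.

# Estimate the scan resolution from the image width in pixels and the
# PDF page width in points (72 points per inch).
def estimate_dpi(pixel_width, page_width_pt)
  (pixel_width * 72.0 / page_width_pt).round
end

# Build the rendering command with the given resolution.
def render_cmd(basefn, dpi)
  "pdftoppm -r #{dpi} #{basefn}.pdf > #{basefn}.ppm"
end

# Build the text-embedding command with the same resolution.
def embed_cmd(basefn, dpi)
  "hocr2pdf -r #{dpi} -i #{basefn}.ppm -s -o #{basefn}-new.pdf < #{basefn}.hocr"
end

# Example values: a US Letter page (612 pt wide) scanned 5100 pixels wide.
dpi = estimate_dpi(5100, 612)
puts render_cmd('01', dpi)
puts embed_cmd('01', dpi)
```

Keeping the value in one variable ensures that pdftoppm and hocr2pdf always agree, which is the condition for the text layer lining up with the image.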
Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr
I have found that Tesseract (3.02) is more accurate than Cuneiform (1.1.0).
Tesseract is much slower though; ~10-15 seconds a page on a Core i5.
To use Tesseract instead of Cuneiform, change
Code:
sh "cuneiform -l #{language} -f hocr -o #{basefn+'.hocr'} #{basefn+'.ppm'}"
to
Code:
sh "tesseract #{basefn+'.ppm'} #{basefn+'.hocr'} -l #{language} hocr"
and
Code:
sh "hocr2pdf -i #{basefn+'.ppm'} -s -o #{basefn+'-new.pdf'} < #{basefn+'.hocr'}"
to
Code:
sh "hocr2pdf -i #{basefn+'.ppm'} -s -o #{basefn+'-new.pdf'} < #{basefn+'.hocr.html'}"
YMMV but this also cleared up some formatting errors and inconsistencies I was getting with Cuneiform's hOCR.
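The two substitutions above can also be expressed as a single switch, so the engine is chosen in one place. A minimal sketch, assuming (as the change above implies) that Tesseract 3.02 appends ".html" to the output base name; `ocr_step` is a hypothetical helper, not part of pdfocr:

```ruby
# Hypothetical helper, not part of pdfocr: return the OCR command together
# with the name of the hOCR file that command is expected to produce.
def ocr_step(engine, basefn, language)
  case engine
  when :cuneiform
    ["cuneiform -l #{language} -f hocr -o #{basefn}.hocr #{basefn}.ppm",
     "#{basefn}.hocr"]
  when :tesseract
    # Assumption taken from the post above: output lands in <base>.hocr.html.
    ["tesseract #{basefn}.ppm #{basefn}.hocr -l #{language} hocr",
     "#{basefn}.hocr.html"]
  else
    raise ArgumentError, "unknown OCR engine: #{engine}"
  end
end

cmd, hocr_file = ocr_step(:tesseract, '01', 'eng')
puts cmd
puts "hocr2pdf -i 01.ppm -s -o 01-new.pdf < #{hocr_file}"
```

Any check in the script that tests for the plain `.hocr` file after the OCR step would presumably need to use the returned file name as well.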
Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr
I tried both pdfocr (which uses Cuneiform) and Tesseract on scanned PDF files and could not make them work.
Then I installed Lios, which has a GUI and lets you choose Cuneiform or Tesseract in its preferences. Using OCR-Pdf from the menu, I tried the same file and got good results with both engines.
There is also a Lios thread on this forum.
Kamalakar