Thread: Howto: Make scanned PDFs searchable (OCR) using pdfocr

  1. #41
    Join Date
    Mar 2008
    Beans
    27

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Hi!

    It still doesn't work properly. The best solution I have found so far for PDF files with a searchable text layer is gscan2pdf 0.9.31; with ocropus as the engine, the recognition is pretty good and the matching between image and text is very accurate.

    I have developed another solution that produces DjVu files, see http://wiki.ubuntuusers.de/xsane2djvu , a wrapper for xsane text recognition. The page is in German, but the script is annotated, so it should not be too difficult to use...

    so long
    clasikowski AKA hank

  2. #42
    Join Date
    Jun 2008
    Beans
    Hidden!

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Honestly, this failed miserably, but I wish it hadn't. I'm on 11.04 64-bit, and here is the output I received while running the script:

    Code:
    dave@dave:~/Public$ pdfocr -i chapter1.pdf -o ocrChapter1.pdf
    Input file is /home/dave/Public/chapter1.pdf
    Output file is /home/dave/Public/ocrChapter1.pdf
    Using working dir /tmp/d20110616-18134-1fnw6dx
    Getting info from PDF file
    
    InfoKey: Creator
    InfoValue: PScript5.dll Version 5.2.2
    InfoKey: Title
    InfoValue: C:\Documents and Settings\dave\Desktop\FP00001.SPL
    InfoKey: Author
    InfoValue: me
    InfoKey: Producer
    InfoValue: GPL Ghostscript 8.15
    InfoKey: ModDate
    InfoValue: D:20110512182529
    InfoKey: CreationDate
    InfoValue: D:20110512182529
    PdfID0: 1e3052408a834d039f6d4a01a63f4d7
    PdfID1: 1e3052408a834d039f6d4a01a63f4d7
    NumberOfPages: 43
    
    Converting 43 pages
    ==========
    Extracting page 1
    Converting page 1 to ppm
    Running OCR on page 1
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 1
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 2
    Converting page 2 to ppm
    Running OCR on page 2
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 2
    Warning: Image x/y resolution not set, defaulting to: 300
    Warning: tag mismatch: 'b' can not close last open: 'i'
    Warning: tag mismatch: 'span' can not close last open: 'b'
    Warning: tag mismatch: 'p' can not close last open: 'b'
    Warning: tag mismatch: 'i' can not close last open: 'b'
    Warning: tag mismatch: 'span' can not close last open: 'i'
    Warning: tag mismatch: 'p' can not close last open: 'i'
    Warning: tag mismatch: 'i' can not close last open: 'b'
    Warning: tag mismatch: 'span' can not close last open: 'i'
    Warning: tag mismatch: 'p' can not close last open: 'i'
    Warning: tag mismatch: 'b' can not close last open: 'i'
    Warning: tag mismatch: 'span' can not close last open: 'b'
    Warning: tag mismatch: 'b' can not close last open: 'i'
    Warning: tag mismatch: 'span' can not close last open: 'b'
    Warning: tag mismatch: 'p' can not close last open: 'b'
    Warning: tag mismatch: 'div' can not close last open: 'b'
    Warning: tag mismatch: 'body' can not close last open: 'b'
    Warning: tag mismatch: 'html' can not close last open: 'b'
    Warning: unclosed tag: 'b'
    Warning: unclosed tag: 'span'
    Warning: unclosed tag: 'b'
    Warning: unclosed tag: 'span'
    Warning: unclosed tag: 'p'
    Warning: unclosed tag: 'i'
    Warning: unclosed tag: 'span'
    Warning: unclosed tag: 'p'
    Warning: unclosed tag: 'i'
    Warning: unclosed tag: 'span'
    Warning: unclosed tag: 'p'
    Warning: unclosed tag: 'b'
    Warning: unclosed tag: 'span'
    Warning: unclosed tag: 'p'
    Warning: unclosed tag: 'div'
    Warning: unclosed tag: 'body'
    Warning: unclosed tag: 'html'
    ==========
    Extracting page 3
    Converting page 3 to ppm
    Running OCR on page 3
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 3
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 4
    Converting page 4 to ppm
    Running OCR on page 4
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 4
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 5
    Converting page 5 to ppm
    Running OCR on page 5
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 5
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 6
    Converting page 6 to ppm
    Running OCR on page 6
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 6
    Warning: Image x/y resolution not set, defaulting to: 300
    Warning: tag mismatch: 'i' can not close last open: 'b'
    Warning: tag mismatch: 'i' can not close last open: 'b'
    Warning: tag mismatch: 'span' can not close last open: 'i'
    Warning: tag mismatch: 'p' can not close last open: 'i'
    Warning: tag mismatch: 'b' can not close last open: 'i'
    Warning: tag mismatch: 'span' can not close last open: 'b'
    Warning: tag mismatch: 'p' can not close last open: 'b'
    Warning: tag mismatch: 'b' can not close last open: 'i'
    Warning: tag mismatch: 'span' can not close last open: 'b'
    Warning: tag mismatch: 'p' can not close last open: 'b'
    Warning: tag mismatch: 'i' can not close last open: 'b'
    Warning: tag mismatch: 'span' can not close last open: 'i'
    Warning: tag mismatch: 'p' can not close last open: 'i'
    Warning: tag mismatch: 'i' can not close last open: 'b'
    Warning: tag mismatch: 'span' can not close last open: 'i'
    Warning: tag mismatch: 'p' can not close last open: 'i'
    Warning: tag mismatch: 'div' can not close last open: 'i'
    Warning: tag mismatch: 'body' can not close last open: 'i'
    Warning: tag mismatch: 'html' can not close last open: 'i'
    Warning: unclosed tag: 'i'
    Warning: unclosed tag: 'span'
    Warning: unclosed tag: 'p'
    Warning: unclosed tag: 'i'
    Warning: unclosed tag: 'span'
    Warning: unclosed tag: 'p'
    Warning: unclosed tag: 'b'
    Warning: unclosed tag: 'span'
    Warning: unclosed tag: 'p'
    Warning: unclosed tag: 'b'
    Warning: unclosed tag: 'span'
    Warning: unclosed tag: 'p'
    Warning: unclosed tag: 'i'
    Warning: unclosed tag: 'i'
    Warning: unclosed tag: 'span'
    Warning: unclosed tag: 'p'
    Warning: unclosed tag: 'div'
    Warning: unclosed tag: 'body'
    Warning: unclosed tag: 'html'
    ==========
    Extracting page 7
    Converting page 7 to ppm
    Running OCR on page 7
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 7
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 8
    Converting page 8 to ppm
    Running OCR on page 8
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 8
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 9
    Converting page 9 to ppm
    Running OCR on page 9
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 9
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 10
    Converting page 10 to ppm
    Running OCR on page 10
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 10
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 11
    Converting page 11 to ppm
    Running OCR on page 11
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 11
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 12
    Converting page 12 to ppm
    Running OCR on page 12
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 12
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 13
    Converting page 13 to ppm
    Running OCR on page 13
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 13
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 14
    Converting page 14 to ppm
    Running OCR on page 14
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 14
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 15
    Converting page 15 to ppm
    Running OCR on page 15
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 15
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 16
    Converting page 16 to ppm
    Running OCR on page 16
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 16
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 17
    Converting page 17 to ppm
    Running OCR on page 17
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 17
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 18
    Converting page 18 to ppm
    Running OCR on page 18
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 18
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 19
    Converting page 19 to ppm
    Running OCR on page 19
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 19
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 20
    Converting page 20 to ppm
    Running OCR on page 20
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 20
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 21
    Converting page 21 to ppm
    Running OCR on page 21
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 21
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 22
    Converting page 22 to ppm
    Running OCR on page 22
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 22
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 23
    Converting page 23 to ppm
    Running OCR on page 23
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 23
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 24
    Converting page 24 to ppm
    Running OCR on page 24
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 24
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 25
    Converting page 25 to ppm
    Running OCR on page 25
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 25
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 26
    Converting page 26 to ppm
    Running OCR on page 26
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 26
    Warning: Image x/y resolution not set, defaulting to: 300
    Warning: tag mismatch: 'i' can not close last open: 'b'
    Warning: tag mismatch: 'b' can not close last open: 'i'
    Warning: tag mismatch: 'span' can not close last open: 'b'
    Warning: tag mismatch: 'p' can not close last open: 'b'
    Warning: tag mismatch: 'div' can not close last open: 'b'
    Warning: tag mismatch: 'body' can not close last open: 'b'
    Warning: tag mismatch: 'html' can not close last open: 'b'
    Warning: unclosed tag: 'b'
    Warning: unclosed tag: 'i'
    Warning: unclosed tag: 'span'
    Warning: unclosed tag: 'p'
    Warning: unclosed tag: 'div'
    Warning: unclosed tag: 'body'
    Warning: unclosed tag: 'html'
    ==========
    Extracting page 27
    Converting page 27 to ppm
    Running OCR on page 27
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 27
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 28
    Converting page 28 to ppm
    Running OCR on page 28
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 28
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 29
    Converting page 29 to ppm
    Running OCR on page 29
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 29
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 30
    Converting page 30 to ppm
    Running OCR on page 30
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 30
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 31
    Converting page 31 to ppm
    Running OCR on page 31
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 31
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 32
    Converting page 32 to ppm
    Running OCR on page 32
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 32
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 33
    Converting page 33 to ppm
    Running OCR on page 33
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 33
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 34
    Converting page 34 to ppm
    Running OCR on page 34
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 34
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 35
    Converting page 35 to ppm
    Running OCR on page 35
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 35
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 36
    Converting page 36 to ppm
    Running OCR on page 36
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 36
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 37
    Converting page 37 to ppm
    Running OCR on page 37
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 37
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 38
    Converting page 38 to ppm
    Running OCR on page 38
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 38
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 39
    Converting page 39 to ppm
    Running OCR on page 39
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 39
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 40
    Converting page 40 to ppm
    Running OCR on page 40
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 40
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 41
    Converting page 41 to ppm
    Running OCR on page 41
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 41
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 42
    Converting page 42 to ppm
    Running OCR on page 42
    Cuneiform for Linux 1.0.0
    Embedding text into PDF for page 42
    Warning: Image x/y resolution not set, defaulting to: 300
    ==========
    Extracting page 43
    Converting page 43 to ppm
    Running OCR on page 43
    Cuneiform for Linux 1.0.0
    PUMA_XFinalrecognition failed.
    Error while running OCR on page 43
    Merging together PDF files
    Updating PDF info for /home/dave/Public/ocrChapter1.pdf
    Cleaning up temporary files
    /usr/bin/pdfocr:287:in `delete': Is a directory - /tmp/d20110616-18134-1fnw6dx/20_files (Errno::EISDIR)
    	from /usr/bin/pdfocr:287
    	from /usr/bin/pdfocr:283:in `foreach'
    	from /usr/bin/pdfocr:283
    Last edited by bonfire89; June 16th, 2011 at 05:27 PM.
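
    The traceback at the end of the log shows `delete` raising Errno::EISDIR inside a `foreach` loop when it reaches the `20_files` directory. A minimal sketch of a cleanup step that tolerates subdirectories follows; it assumes the original loop calls File.delete on every entry of the working directory, which the traceback suggests but does not prove. The name clean_workdir is hypothetical and not part of pdfocr.

```ruby
require 'fileutils'
require 'tmpdir'

# Remove every entry inside dir, whether it is a file or a directory,
# then remove dir itself. File.delete raises Errno::EISDIR on a directory;
# FileUtils.rm_rf removes files and directory trees alike.
# (clean_workdir is a hypothetical helper, not a pdfocr function.)
def clean_workdir(dir)
  Dir.foreach(dir) do |entry|
    next if entry == '.' || entry == '..'
    FileUtils.rm_rf(File.join(dir, entry))
  end
  Dir.rmdir(dir)
end

# Demo: a working directory holding a plain file and a subdirectory.
workdir = Dir.mktmpdir('pdfocr-demo')
File.write(File.join(workdir, '20.hocr'), '<html></html>')
FileUtils.mkdir_p(File.join(workdir, '20_files'))
File.write(File.join(workdir, '20_files', 'picture.bmp'), '')
clean_workdir(workdir)
puts "workdir removed: #{!File.exist?(workdir)}"
```

    This only addresses the crash during cleanup; the OCR failure on page 43 is a separate problem.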

  3. #43
    Join Date
    Nov 2010
    Beans
    4

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    It seems the script does not handle spaces in either the file name or the path to the file. Are you still supporting updates? Is there any way one could access your repo to fix this bug?
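
    One way to make the shell commands tolerate spaces is to escape every path before interpolating it. The sketch below uses Ruby's standard Shellwords module on the pdftk extraction command that pdfocr runs; the helper name extract_page_cmd and the example path are hypothetical, and the other commands in the script would need the same treatment.

```ruby
require 'shellwords'

# Build the pdftk page-extraction command with both paths shell-escaped,
# so a space in the file name or directory is passed as part of one argument.
# (extract_page_cmd is a hypothetical helper, not a pdfocr function.)
def extract_page_cmd(infile, page, outfile)
  "pdftk #{Shellwords.escape(infile)} cat #{page} output #{Shellwords.escape(outfile)}"
end

# Hypothetical input path containing spaces.
puts extract_page_cmd('/home/dave/My Scans/chapter 1.pdf', 1, '01.pdf')
```

    Passing the arguments as a list to Kernel#system would avoid the shell entirely, but that does not work for the commands in the script that rely on `<` and `>` redirection.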

  4. #44
    Join Date
    Oct 2009
    Location
    Würzburg, GER
    Beans
    85
    Distro
    Ubuntu 15.10 Wily Werewolf

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    The new version of PDF Xchange Viewer can OCR scanned pages very well. It's not a native Linux app and it's proprietary, but it runs very well under Wine and has multiple language support.
    Ubuntu 16.04 'Xenial' 64-bit on a Thinkpad T430s (2356AB2) - Intel Core i5 3320M - 16 GB RAM - on-board Intel Ivybridge graphics

  5. #45
    Join Date
    Apr 2006
    Beans
    213

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    If command line tools are not an absolute must for you, give PDF Xchange Viewer a try (http://www.tracker-software.com/prod...xchange-viewer). It is a Windows program, but it runs flawlessly under Wine and has OCR and a whole lot of other features. It comes in a free and a pro version; I decided to pay for the pro version, which offers more tools.

  6. #46
    Join Date
    May 2009
    Location
    Toronto
    Beans
    9

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Originally posted by the_summer:
    It's a great helper. I tried it on some files. The OCR seems to work, but when I search for words, the highlights are often quite far away from the word I searched for.
    Maybe it's because there is always the warning:

    Code:
    Warning: Image x/y resolution not set, defaulting to: 300
    Is there a way to manually set different values for the resolution?
    I didn't find anything about it in the man page.

    Best wishes

    Jan
    At least part of the problem here is that pdftoppm and hocr2pdf (both used by pdfocr) fall back to default resolutions when none is given: pdftoppm renders at 150 dpi, and hocr2pdf assumes 300 dpi, as the warning says. If the input file has a higher resolution, nothing lines up. Here is how to hard-code a 600 dpi resolution in /usr/bin/pdfocr:

    Code:
    1.upto(pagenum) { |i|
    	puts "=========="
    	puts "Extracting page #{i}"
    	basefn = i.to_s.rjust(numdigits, '0')
    	sh "pdftk #{infile} cat #{i} output #{basefn+'.pdf'}"
    	if not File.file?(basefn+'.pdf')
    		puts "Error while extracting page #{i}"
    		next
    	end
    	puts "Converting page #{i} to ppm"
    	sh "pdftoppm -r 600 #{basefn+'.pdf'} > #{basefn+'.ppm'}"
    	if not File.file?(basefn+'.ppm')
    		puts "Error while converting page #{i} to ppm"
    		next
    	end
    	puts "Running OCR on page #{i}"
    	sh "cuneiform -l #{language} -f hocr -o #{basefn+'.hocr'} #{basefn+'.ppm'}"
    	if not File.file?(basefn+'.hocr')
    		puts "Error while running OCR on page #{i}"
    		next
    	end
    	puts "Embedding text into PDF for page #{i}"
    	sh "hocr2pdf -r 600 -i #{basefn+'.ppm'} -s -o #{basefn+'-new.pdf'} < #{basefn+'.hocr'}"
    	if not File.file?(basefn+'-new.pdf')
    		puts "Error while embedding text into PDF for page #{i}"
    		next
    	end
    }
    Add "-r 600" (coincidentally the same parameter) in two places, to the pdftoppm and hocr2pdf sections of the above block of code (assuming 600dpi input), as shown.

    This gives significantly better results.
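
    Rather than hard-coding 600 in two places, the same change can be sketched with a single DPI value feeding both commands, so the two cannot drift apart. The helper names render_cmd and embed_cmd and the PDFOCR_DPI environment variable are hypothetical; only the `-r` flags and the command lines themselves come from the block above.

```ruby
# Build the two resolution-sensitive commands from one DPI value.
# (render_cmd, embed_cmd and PDFOCR_DPI are hypothetical names for illustration.)

# pdftoppm renders the single-page PDF to a PPM image at the given DPI.
def render_cmd(basefn, dpi)
  "pdftoppm -r #{dpi} #{basefn}.pdf > #{basefn}.ppm"
end

# hocr2pdf must be told the same DPI so the text layer lines up with the image.
def embed_cmd(basefn, dpi)
  "hocr2pdf -r #{dpi} -i #{basefn}.ppm -s -o #{basefn}-new.pdf < #{basefn}.hocr"
end

# Read the resolution once; default to 600 as in the post above.
dpi = Integer(ENV.fetch('PDFOCR_DPI', '600'))
puts render_cmd('01', dpi)
puts embed_cmd('01', dpi)
```

    The value still has to match the real resolution of the scan; this sketch only makes it easier to change.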

  7. #47
    Join Date
    Mar 2011
    Beans
    2

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    I have found that Tesseract (3.02) is more accurate than Cuneiform (1.1.0).

    Tesseract is much slower though; ~10-15 seconds a page on a Core i5.


    To use Tesseract instead of Cuneiform change

    Code:
    sh "cuneiform -l #{language} -f hocr -o #{basefn+'.hocr'} #{basefn+'.ppm'}"
    to

    Code:
    sh "tesseract #{basefn+'.ppm'} #{basefn+'.hocr'} -l #{language} hocr"
    and

    Code:
    sh "hocr2pdf -i #{basefn+'.ppm'} -s -o #{basefn+'-new.pdf'} < #{basefn+'.hocr'}"
    to

    Code:
    sh "hocr2pdf -i #{basefn+'.ppm'} -s -o #{basefn+'-new.pdf'} < #{basefn+'.hocr.html'}"
    YMMV but this also cleared up some formatting errors and inconsistencies I was getting with Cuneiform's hOCR.
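
    The two edits above can also be kept side by side, so that switching engines is a one-word change. The following is a minimal sketch; the helper name ocr_step is hypothetical, and the `.hocr.html` output name assumes the Tesseract 3.02 behaviour described in this post (other versions may name the file differently). Language codes may also differ between the two engines.

```ruby
# Return the OCR command and the name of the hOCR file it produces
# for the chosen engine.
# (ocr_step is a hypothetical helper, not part of pdfocr.)
def ocr_step(engine, basefn, language)
  case engine
  when 'cuneiform'
    ["cuneiform -l #{language} -f hocr -o #{basefn}.hocr #{basefn}.ppm",
     "#{basefn}.hocr"]
  when 'tesseract'
    # Assumption from the post: Tesseract 3.02 appends ".html" to the
    # output base name given on the command line.
    ["tesseract #{basefn}.ppm #{basefn}.hocr -l #{language} hocr",
     "#{basefn}.hocr.html"]
  else
    raise ArgumentError, "unknown OCR engine: #{engine}"
  end
end

ocr_cmd, hocr_file = ocr_step('tesseract', '01', 'eng')
puts ocr_cmd
puts "hocr2pdf -i 01.ppm -s -o 01-new.pdf < #{hocr_file}"
```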

  8. #48
    Join Date
    Jun 2005
    Location
    Delhi, India
    Beans
    565
    Distro
    Ubuntu 22.04 Jammy Jellyfish

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    I tried both pdfocr (which uses cuneiform) and tesseract on scanned PDF files and could not make either work.

    Then I installed Lios, which has a GUI and lets you choose cuneiform or tesseract in its preferences. Using OCR-Pdf from the menu, I tried the same file and got good results with both engines.

    There is also a Lios thread on this forum.

    Kamalakar
    Acer Aspire 3 A315-21 with pre-installed Endless OS & Ubuntu 22.04 in dual boot
    Linux Registered User #395189
    Ubuntu user number # 345
