
Thread: Howto: Make scanned PDFs searchable (OCR) using pdfocr

  1. #1
    Join Date
    Mar 2006
    Location
    Palo Alto, CA
    Beans
    1,226
    Distro
    Ubuntu 12.04 Precise Pangolin

    Howto: Make scanned PDFs searchable (OCR) using pdfocr

    What pdfocr is for

    Suppose you have a PDF document that was made using a scanner, or otherwise consists of image data but doesn't have text data. Such a PDF can't be searched by PDF readers or desktop search applications. pdfocr is a simple utility I made that takes a PDF file, then generates a new one that has the text layer added, so it's searchable by your PDF reader and can be indexed by your desktop search application, but is still identical when printed.

    What pdfocr is not for

    This is only of use if your PDF was made from a scanned source. If you exported your PDF from OpenOffice or the like, it already has a text layer, so pdfocr is unnecessary.

    If what you're looking for is to simply extract the plain text from a PDF file, but not to embed the text into the PDF file, see this guide.

    Compatibility

    This guide will work on Ubuntu Karmic (9.10) or Lucid (10.04); the dependencies for this software don't build on older versions.

    Installing pdfocr

    The easiest way to install pdfocr is to add my PPA and use apt-get. If you would prefer to install it manually, see here for instructions.

    Code:
    sudo add-apt-repository ppa:gezakovacs/pdfocr
    sudo apt-get update
    sudo apt-get install pdfocr
    Using pdfocr to add a text layer to your scanned PDF file

    Open a terminal, go to the directory that contains the PDF file you want to convert, and enter the following (replacing input.pdf with the name of the input PDF file, and output.pdf with the name of the output PDF file):

    Code:
    pdfocr -i input.pdf -o output.pdf
    Now wait as OCR is performed on the PDF file page-by-page, and the output file is generated. This should take a few seconds per page, depending on the resolution of your PDF file (high-res PDF files get better accuracy, but will take longer). Once done, you should now have a searchable PDF at output.pdf.
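For several files at once, the same command can be wrapped in a small loop. This is only a sketch: the function name ocr_all and the "-ocr" suffix are made up for illustration, and it assumes that a failed run leaves no output file, so it checks for the file rather than relying on pdfocr's exit status.

```shell
# Run pdfocr over every PDF in a directory, writing NAME-ocr.pdf next to it.
# Assumption: the "-ocr" suffix is an arbitrary choice for this sketch.
ocr_all() {
    dir=${1:-.}
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue                    # no PDFs matched the glob
        case "$f" in *-ocr.pdf) continue ;; esac   # skip outputs of earlier runs
        out="${f%.pdf}-ocr.pdf"
        pdfocr -i "$f" -o "$out" || true           # exit status is not relied on
        if [ -s "$out" ]; then
            echo "ok: $out"
        else
            echo "failed: $f" >&2
        fi
    done
}
```

After defining the function, `ocr_all ~/scans` would process every PDF in ~/scans (a hypothetical directory).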

    Credits

    pdfocr was written by me (Geza Kovacs). It is simply a script which automates the following process:

    1. Splitting the PDF file into separate pages using pdftk
    2. Extracting out the image data using pdfimages
    3. Doing OCR (optical character recognition) using cuneiform
    4. Embedding the detected text back into the PDF file using hocr2pdf
    5. Merging together the files using pdftk.

    Hence, if you want more fine-grained control than the defaults, you can just invoke these utilities manually. Source is available on github. Feedback is welcome.
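For anyone who wants to try the manual route, the five steps could be strung together roughly as follows. This is a sketch under assumptions, not the actual pdfocr source: the function name and the working-file names (pg_0001.pdf, *-new.pdf) are invented here, it assumes each page holds one scanned image that pdfimages writes as a .ppm or .pbm file, and it assumes cuneiform was built with support for that image format.

```shell
# Manual version of the five steps for one input file.
# Assumption: all working-file names below are illustrative only.
ocr_pdf_by_hand() {
    src=$1; dst=$2
    work=$(mktemp -d) || return 1      # left in place if a step fails

    # 1. Split into single-page PDFs: pg_0001.pdf, pg_0002.pdf, ...
    pdftk "$src" burst output "$work/pg_%04d.pdf" || return 1

    for page in "$work"/pg_*.pdf; do
        base=${page%.pdf}
        # 2. Extract the page image (named $base-000.ppm or $base-000.pbm)
        pdfimages "$page" "$base" || return 1
        img=$(ls "$base"-000.p?m 2>/dev/null | head -n 1)
        [ -n "$img" ] || { echo "no image found in $page" >&2; return 1; }
        # 3. OCR the image into an hOCR file
        cuneiform -f hocr -o "$base.hocr" "$img" || return 1
        # 4. Combine image and hOCR text into a one-page searchable PDF
        hocr2pdf -i "$img" -o "$base-new.pdf" < "$base.hocr" || return 1
    done

    # 5. Merge the per-page results in page order
    pdftk "$work"/pg_*-new.pdf cat output "$dst" || return 1
    rm -rf "$work"
}
```

Running the steps one at a time like this also makes it easier to see which tool a given error message comes from.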

  2. #2
    Join Date
    Dec 2008
    Beans
    2

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Sounds like an awesome program. Unfortunately, I got the following errors. Any idea why?

    $ pdfocr -i Acceptance.pdf -o test.pdf
    Input file is /mnt/data/tmp/Acceptance.pdf
    Output file is /mnt/data/tmp/test.pdf
    Using working dir /tmp/d20100419-4215-ohxah0
    Getting info from PDF file

    Warning: no info dictionary found
    NumberOfPages: 1

    Converting 1 pages
    ==========
    Extracting page 1
    Converting page 1 to ppm
    Running OCR on page 1
    PUMA_XFinalrecognition failed.
    Cuneiform for Linux 0.9.0
    Error while running OCR on page 1
    Merging together PDF files
    Error: Failed to open PDF file:
    /tmp/d20100419-4215-ohxah0/*-new.pdf
    Errors encountered. No output created.
    Done. Input errors, so no output created.
    Updating PDF info for /mnt/data/tmp/test.pdf
    Error: Failed to open PDF file:
    /tmp/d20100419-4215-ohxah0/merged.pdf
    Errors encountered. No output created.
    Done. Input errors, so no output created.
    Cleaning up temporary files

  3. #3
    Join Date
    Apr 2008
    Location
    England
    Beans
    603
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Very nice. And certainly extremely useful.

    Thanks.

    I did get a few errors, but none of them stopped the program working.

    Errors below.
    Code:
    Warning: Image x/y resolution not set, defaulting to: 300
    Warning: tag mismatch: 'b' can not close last open: 'i'
    Warning: tag mismatch: 'span' can not close last open: 'b'
    Warning: tag mismatch: 'p' can not close last open: 'b'
    Warning: tag mismatch: 'div' can not close last open: 'b'
    Warning: tag mismatch: 'body' can not close last open: 'b'
    Warning: tag mismatch: 'html' can not close last open: 'b'
    Warning: unclosed tag: 'b'
    Warning: unclosed tag: 'span'
    Warning: unclosed tag: 'p'
    Warning: unclosed tag: 'div'
    Warning: unclosed tag: 'body'
    Warning: unclosed tag: 'html'

  4. #4
    Join Date
    Mar 2010
    Beans
    1

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    I am excited about this script; I think it will be very useful. However, I ran into a different error from the one posted above:

    Code:
    aaron@aaron-desktop:~$ pdfocr -i 0175.pdf -o out3.pdf
    Input file is /home/aaron/0175.pdf
    Output file is /home/aaron/out3.pdf
    Using working dir /tmp/d20100421-20950-4g11i8
    Getting info from PDF file
    
    InfoKey: Creator
    InfoValue: XSane version 0.996 (sane 1.0) - by Oliver Rauch
    InfoKey: Title
    InfoValue: XSane scanned image
    InfoKey: Producer
    InfoValue: XSane 0.996
    InfoKey: CreationDate
    InfoValue: D:20100421215339+00'00'
    NumberOfPages: 1
    
    Converting 1 pages
    ==========
    Extracting page 1
    Converting page 1 to ppm
    Running OCR on page 1
    1.ppm is not a BMP file.
    Cuneiform for Linux 0.9.0
    Error while running OCR on page 1
    Merging together PDF files
    Error: Failed to open PDF file: 
       /tmp/d20100421-20950-4g11i8/*-new.pdf
    Errors encountered.  No output created.
    Done.  Input errors, so no output created.
    Updating PDF info for /home/aaron/out3.pdf
    Error: Failed to open PDF file: 
       /tmp/d20100421-20950-4g11i8/merged.pdf
    Errors encountered.  No output created.
    Done.  Input errors, so no output created.
    Cleaning up temporary files
    Notice the "1.ppm is not a BMP file." line. I get a similar error if I run cuneiform by itself:

    Code:
    aaron@aaron-desktop:~$ cuneiform -f hocr -o out.hocr 0175.pdf
    Cuneiform for Linux 0.9.0
    0175.pdf is not a BMP file.
    It's as if cuneiform defaults to assuming the input file is a bmp. Is there a way to change this?

    I should also mention that I am on amd64, and I initially had problems with cuneiform throwing an error. I had to add /usr/local/lib64 to a .conf file in /etc/ld.so.conf.d and then run ldconfig (per https://answers.launchpad.net/cuneif...uestion/100695). After making this change cuneiform seems to work, but there could still be other unseen issues.

    Thanks.
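The library-path change described above might look something like this. It is a sketch only: the function name add_lib_dir and the file name cuneiform.conf are arbitrary choices, and the second argument exists just so the function can be tried against a scratch directory before touching /etc.

```shell
# Add a directory to the dynamic linker's search path via ld.so.conf.d.
# Assumption: "cuneiform.conf" is an arbitrary file name for this sketch.
add_lib_dir() {
    lib_dir=$1
    conf_dir=${2:-/etc/ld.so.conf.d}
    conf="$conf_dir/cuneiform.conf"
    # Do nothing if some .conf file in that directory already lists it.
    if grep -qsxF "$lib_dir" "$conf_dir"/*.conf; then
        echo "already listed: $lib_dir"
        return 0
    fi
    printf '%s\n' "$lib_dir" >> "$conf" || return 1
    echo "added $lib_dir to $conf; now run: sudo ldconfig"
}
```

Writing to /etc/ld.so.conf.d needs root, so on a real system `add_lib_dir /usr/local/lib64` would have to be run as root, followed by `sudo ldconfig` to rebuild the linker cache.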

  5. #5
    Join Date
    Apr 2007
    Location
    BiH
    Beans
    Hidden!
    Distro
    Ubuntu 11.10 Oneiric Ocelot

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Thanks a lot, great utility.
    ...

  6. #6
    Join Date
    Mar 2008
    Beans
    27

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Hi!

    Quote: Originally posted by a.w.howell
    It's as if cuneiform defaults to assuming the input file is a bmp. Is there a way to change this?
    Yes; you just have to install

    libmagick++-dev

    before compiling cuneiform, then it'll be able to recognize just about every input format (all imagemagick is able to use, to be precise)

    so long
    clasikowski aka hank
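One way to check whether an installed cuneiform binary was actually built with ImageMagick support is to look at what it links against. A rough sketch, assuming a dynamically linked binary; the function name cuneiform_has_magick is invented here:

```shell
# Report whether a cuneiform binary is dynamically linked against ImageMagick.
# With no argument, the cuneiform found in PATH is checked.
cuneiform_has_magick() {
    bin=${1:-$(command -v cuneiform)}
    [ -n "$bin" ] || { echo "cuneiform not found in PATH" >&2; return 2; }
    if ldd "$bin" | grep -qi 'magick'; then
        echo "linked against ImageMagick: $bin"
    else
        echo "no ImageMagick linkage: $bin"
        return 1
    fi
}
```

If this reports no linkage, the binary probably only reads BMP files, which would match the "is not a BMP file" message quoted above.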

  7. #7
    Join Date
    Apr 2010
    Beans
    1

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    It's a great helper. I tried it on some files. The OCR seems to work, but when I search for words, the highlights are often quite far away from the word I searched for.
    Maybe it's because this warning always appears:

    Code:
    Warning: Image x/y resolution not set, defaulting to: 300
    Is there a way to manually set different values for the resolution?
    I didn't find anything about it in the man page.

    Best wishes

    Jan

  8. #8
    Join Date
    Dec 2007
    Location
    Los Angeles
    Beans
    35
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Great utility! I just used it on a 60-page document. Many thanks. In evince, I also find that the placement of the search term highlights is off. Any suggestions on how to improve this would be appreciated.

  9. #9
    Join Date
    Apr 2008
    Location
    New Haven, CT
    Beans
    111
    Distro
    Ubuntu 10.10 Maverick Meerkat

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Quote: Originally posted by clasikowski
    Hi!



    Yes; you just have to install

    libmagick++-dev

    before compiling cuneiform, then it'll be able to recognize just about every input format (all imagemagick is able to use, to be precise)

    so long
    clasikowski aka hank
    Hank,
    Does that mean install libmagick++-dev before doing "sudo add-apt-repository ppa:gezakovacs/pdfocr
    sudo apt-get update
    sudo apt-get install pdfocr"?

    I do have libmagick++-dev installed and did a reinstall of Geza's pdfocr but I still get the following error output:
    pdfocr -i 1.pdf -o 2.pdf
    Input file is /home/nils/1.pdf
    Output file is /home/nils/2.pdf
    Using working dir /tmp/d20100520-15606-f6vqow
    Getting info from PDF file

    InfoKey: Title
    InfoValue: /tmp/simple-scan-RDLXCV.pdf
    InfoKey: Producer
    InfoValue: ImageMagick 6.5.7-8 2009-11-26 Q16 http://www.imagemagick.org
    InfoKey: ModDate
    InfoValue: D:20100520074947
    InfoKey: CreationDate
    InfoValue: D:20100520074947
    NumberOfPages: 1

    Converting 1 pages
    ==========
    Extracting page 1
    Converting page 1 to ppm
    Running OCR on page 1
    *** buffer overflow detected ***: cuneiform terminated
    ======= Backtrace: =========
    /lib/libc.so.6(__fortify_fail+0x37)[0x7f742e9131a7]
    /lib/libc.so.6(+0xfe060)[0x7f742e912060]
    /usr/lib/cuneiform/libfon32.so.0.9.0(+0x21a5c)[0x7f7429d93a5c]
    /usr/lib/cuneiform/libfon32.so.0.9.0(+0x2226a)[0x7f7429d9426a]
    /usr/lib/cuneiform/libfon32.so.0.9.0(FONRecog2Glue+0x1e7)[0x7f7429d81f47]
    /usr/lib/cuneiform/libpass2.so.0.9.0(+0x73b3)[0x7f742a8793b3]
    /usr/lib/cuneiform/libpass2.so.0.9.0(+0x7592)[0x7f742a879592]
    /usr/lib/cuneiform/libpass2.so.0.9.0(+0xab63)[0x7f742a87cb63]
    /usr/lib/cuneiform/libpass2.so.0.9.0(p2_proc+0xa7a)[0x7f742a87d85a]
    /usr/lib/cuneiform/librstr.so.0.9.0(+0x98e01)[0x7f742ad4de01]
    /usr/lib/cuneiform/librstr.so.0.9.0(RSTRRecognizeMain+0x224)[0x7f742ad60184]
    /usr/lib/cuneiform/librstr.so.0.9.0(RSTRRecognize+0x19)[0x7f742ad60dc9]
    /usr/lib/cuneiform/libcuneiform.so.0.9.0(+0xcafe)[0x7f742f343afe]
    /usr/lib/cuneiform/libcuneiform.so.0.9.0(PUMA_XFinalRecognition+0xd1)[0x7f742f345091]
    cuneiform[0x403beb]
    /lib/libc.so.6(__libc_start_main+0xfd)[0x7f742e832c4d]
    cuneiform[0x402a19]
    ======= Memory map: ========
    00400000-00405000 r-xp 00000000 08:01 59229185 /usr/bin/cuneiform
    00604000-00605000 r--p 00004000 08:01 59229185 /usr/bin/cuneiform
    00605000-00606000 rw-p 00005000 08:01 59229185 /usr/bin/cuneiform
    010d0000-04c92000 rw-p 00000000 00:00 0 [heap]
    7f741ef2d000-7f741ef31000 r-xp 00000000 08:01 59232850 /usr/lib/ImageMagick-6.5.7/modules-Q16/coders/dib.so
    7f741ef31000-7f741f130000 ---p 00004000 08:01 59232850 /usr/lib/ImageMagick-6.5.7/modules-Q16/coders/dib.so
    7f741f130000-7f741f131000 r--p 00003000 08:01 59232850 /usr/lib/ImageMagick-6.5.7/modules-Q16/coders/dib.so
    7f741f131000-7f741f132000 rw-p 00004000 08:01 59232850 /usr/lib/ImageMagick-6.5.7/modules-Q16/coders/dib.so
    7f74236b5000-7f74236b6000 ---p 00000000 00:00 0
    7f74236b6000-7f7423eb6000 rw-p 00000000 00:00 0
    7f7423eb6000-7f7423eb7000 ---p 00000000 00:00 0
    7f7423eb7000-7f74246b7000 rw-p 00000000 00:00 0
    7f74246b7000-7f74246b8000 ---p 00000000 00:00 0
    7f74246b8000-7f7424eb8000 rw-p 00000000 00:00 0
    7f7424eb8000-7f7424ec1000 r-xp 00000000 08:01 59232948 /usr/lib/ImageMagick-6.5.7/modules-Q16/coders/pnm.so
    7f7424ec1000-7f74250c0000 ---p 00009000 08:01 59232948 /usr/lib/ImageMagick-6.5.7/modules-Q16/coders/pnm.so
    7f74250c0000-7f74250c1000 r--p 00008000 08:01 59232948 /usr/lib/ImageMagick-6.5.7/modules-Q16/coders/pnm.so
    7f74250c1000-7f74250c2000 rw-p 00009000 08:01 59232948 /usr/lib/ImageMagick-6.5.7/modules-Q16/coders/pnm.so
    7f74250c2000-7f74250c7000 r-xp 00000000 08:01 59231793 /usr/lib/libXdmcp.so.6.0.0
    7f74250c7000-7f74252c6000 ---p 00005000 08:01 59231793 /usr/lib/libXdmcp.so.6.0.0
    7f74252c6000-7f74252c7000 r--p 00004000 08:01 59231793 /usr/lib/libXdmcp.so.6.0.0
    7f74252c7000-7f74252c8000 rw-p 00005000 08:01 59231793 /usr/lib/libXdmcp.so.6.0.0
    7f74252c8000-7f74252ca000 r-xp 00000000 08:01 59231782 /usr/lib/libXau.so.6.0.0
    7f74252ca000-7f74254ca000 ---p 00002000 08:01 59231782 /usr/lib/libXau.so.6.0.0
    7f74254ca000-7f74254cb000 r--p 00002000 08:01 59231782 /usr/lib/libXau.so.6.0.0
    7f74254cb000-7f74254cc000 rw-p 00003000 08:01 59231782 /usr/lib/libXau.so.6.0.0
    7f74254cc000-7f74254d3000 r-xp 00000000 08:01 50824095 /lib/librt-2.11.1.so
    7f74254d3000-7f74256d2000 ---p 00007000 08:01 50824095 /lib/librt-2.11.1.so
    7f74256d2000-7f74256d3000 r--p 00006000 08:01 50824095 /lib/librt-2.11.1.so
    7f74256d3000-7f74256d4000 rw-p 00007000 08:01 50824095 /lib/librt-2.11.1.so
    7f74256d4000-7f74256ef000 r-xp 00000000 08:01 59232788 /usr/lib/libxcb.so.1.1.0
    7f74256ef000-7f74258ee000 ---p 0001b000 08:01 59232788 /usr/lib/libxcb.so.1.1.0
    7f74258ee000-7f74258ef000 r--p 0001a000 08:01 59232788 /usr/lib/libxcb.so.1.1.0
    7f74258ef000-7f74258f0000 rw-p 0001b000 08:01 59232788 /usr/lib/libxcb.so.1.1.0
    7f74258f0000-7f74258f4000 r-xp 00000000 08:01 50824123 /lib/libuuid.so.1.3.0
    7f74258f4000-7f7425af3000 ---p 00004000 08:01 50824123 /lib/libuuid.so.1.3.0
    7f7425af3000-7f7425af4000 r--p 00003000 08:01 50824123 /lib/libuuid.so.1.3.0
    7f7425af4000-7f7425af5000 rw-p 00004000 08:01 50824123 /lib/libuuid.so.1.3.0
    7f7425af5000-7f7425b02000 r-xp 00000000 08:01 59232191 /usr/lib/libgomp.so.1.0.0
    7f7425b02000-7f7425d01000 ---p 0000d000 08:01 59232191 /usr/lib/libgomp.so.1.0.0
    7f7425d01000-7f7425d02000 r--p 0000c000 08:01 59232191 /usr/lib/libgomp.so.1.0.0
    7f7425d02000-7f7425d03000 rw-p 0000d000 08:01 59232191 /usr/lib/libgomp.so.1.0.0
    7f7425d03000-7f7425e34000 r-xp 00000000 08:01 59231778 /usr/lib/libX11.so.6.3.0
    7f7425e34000-7f7426034000 ---p 00131000 08:01 59231778 /usr/lib/libX11.so.6.3.0
    7f7426034000-7f7426035000 r--p 00131000 08:01 59231778 /usr/lib/libX11.so.6.3.0
    7f7426035000-7f7426039000 rw-p 00132000 08:01 59231778 /usr/lib/libX11.so.6.3.0
    7f7426039000-7f7426050000 r-xp 00000000 08:01 59231747 /usr/lib/libICE.so.6.3.0
    7f7426050000-7f742624f000 ---p 00017000 08:01 59231747 /usr/lib/libICE.so.6.3.0
    7f742624f000-7f7426250000 r--p 00016000 08:01 59231747 /usr/lib/libICE.so.6.3.0
    7f7426250000-7f7426251000 rw-p 00017000 08:01 59231747 /usr/lib/libICE.so.6.3.0
    7f7426251000-7f7426254000 rw-p 00000000 00:00 0
    7f7426254000-7f742625c000 r-xp 00000000 08:01 59231776 /usr/lib/libSM.so.6.0.1
    7f742625c000-7f742645b000 ---p 00008000 08:01 59231776 /usr/lib/libSM.so.6.0.1
    7f742645b000-7f742645c000 r--p 00007000 08:01 59231776 /usr/lib/libSM.so.6.0.1
    7f742645c000-7f742645d000 rw-p 00008000 08:01 59231776 /usr/lib/libSM.so.6.0.1
    7f742645d000-7f7426465000 r-xp 00000000 08:01 59232423 /usr/lib/libltdl.so.7.2.1
    7f7426465000-7f7426665000 ---p 00008000 08:01 59232423 /usr/lib/libltdl.so.7.2.1
    Cuneiform for Linux 0.9.0
    Error while running OCR on page 1
    Merging together PDF files
    /tmp/d20100520-15606-f6vqow/*-new.pdf not found as file or resource.
    Error: Failed to open PDF file:
    /tmp/d20100520-15606-f6vqow/*-new.pdf
    Errors encountered. No output created.
    Done. Input errors, so no output created.
    Updating PDF info for /home/nils/2.pdf
    /tmp/d20100520-15606-f6vqow/merged.pdf not found as file or resource.
    Error: Failed to open PDF file:
    /tmp/d20100520-15606-f6vqow/merged.pdf
    Errors encountered. No output created.
    Done. Input errors, so no output created.
    Cleaning up temporary files

  10. #10
    Join Date
    Aug 2009
    Beans
    169
    Distro
    Ubuntu 11.04 Natty Narwhal

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Quote: Originally posted by tuxcantfly
    What pdfocr is for

    Installing pdfocr

    The easiest way to install pdfocr is to add my PPA and use apt-get. If you would instead prefer to install it manually, see here for instructions

    Code:
    sudo add-apt-repository ppa:gezakovacs/pdfocr
    sudo apt-get update
    sudo apt-get install pdfocr
    Using pdfocr to add a text layer to your scanned PDF file

    Open a terminal, go to the directory that has the PDF file you want to convert, and enter (substituting input.pdf with the input PDF file, and output.pdf with the output PDF file)

    Code:
    pdfocr -i input.pdf -o output.pdf
    Now wait as OCR is performed on the PDF file page-by-page, and the output file is generated. This should take a few seconds per page, depending on the resolution of your PDF file (high-res PDF files get better accuracy, but will take longer). Once done, you should now have a searchable PDF at output.pdf.
    Worked great for me - thanks! I had to go in and fix about 20% of the text but this helped a lot - will be using this more...
