Thread: Howto: Make scanned PDFs searchable (OCR) using pdfocr

  1. #11
    Join Date
    Jul 2010
    Beans
    1

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    OK, I'm new to Linux so this may be some config problem of mine. I've installed pdfocr from the PPA and when I run the script it fails on the line:

    sh "pdftoppm #{basefn+'.pdf'} > #{basefn+'.ppm'}"

    The line runs, but pdftoppm does not accept redirection via ">" into the output .ppm file. It just creates a 0-byte file and prints the pdftoppm help message. pdfocr carries on because it only checks whether a .ppm file exists, which it does, but it's empty.

    If I run pdftoppm by hand using the same syntax, it fails the same way. If I run it with a PPM-root file name (instead of ">"), it works fine. The pdftoppm manpage doesn't say anything about redirection.
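
    A possible workaround, sketched in Ruby since pdfocr is a Ruby script: pass pdftoppm a PPM-root argument instead of redirecting stdout, then check that a non-empty .ppm file actually came out. The helper names below are hypothetical and not taken from pdfocr, and the suffix pdftoppm appends to the root is assumed to vary by version, so the sketch only globs for files that start with the root.

```ruby
# Hypothetical helper names; none of this is pdfocr's actual code.

# Build the argument list for the documented form:
#   pdftoppm [options] <PDF-file> <PPM-root>
def pdftoppm_command(pdf_path, ppm_root)
  ['pdftoppm', pdf_path, ppm_root]
end

# Return the non-empty .ppm files whose names start with ppm_root.
# File.size? is nil for missing or 0-byte files, so empty output is dropped.
def nonempty_ppms(ppm_root)
  Dir.glob("#{ppm_root}*.ppm").select { |path| File.size?(path) }.sort
end

# Convert basefn.pdf and fail loudly instead of carrying on with an empty file.
def convert_page(basefn)
  # An argument list (rather than one string) keeps the shell out of it.
  ok = system(*pdftoppm_command("#{basefn}.pdf", basefn))
  pages = nonempty_ppms(basefn)
  raise "pdftoppm produced no usable .ppm for #{basefn}.pdf" unless ok && !pages.empty?
  pages
end
```

    Calling `convert_page('1')` would run `pdftoppm 1.pdf 1` and return whatever image files it wrote, or raise if there are none.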

    FYI Ubuntu Lucid 10.04

    Ideas??

  2. #12
    Join Date
    Apr 2008
    Location
    New Haven, CT
    Beans
    111
    Distro
    Ubuntu 10.10 Maverick Meerkat

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Of note, Google Docs now allows uploading PDFs and running OCR on them (you must check a box). This works surprisingly well even with low-resolution PDF files. The disappointing thing is that the PDF pages appear as images in a Google Doc with the recognized text wrapping around them, rather than the text being embedded in the pages. The text can obviously be searched, but you won't be pointed to the actual position in the PDF file, just to the surrounding text.

    It is a killer feature if you (hypothetically) wanted to convert a paper book into text: simply snap pictures with your camera, bundle them into a single PDF, convert it into text and delete the pictures of the pages. Manual reformatting is required.
    Last edited by nanonils; July 2nd, 2010 at 06:50 PM.

  3. #13
    Join Date
    Jun 2010
    Beans
    2

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Great info. I just used it and it works great.

    Cheers,
    D.

  4. #14
    Join Date
    May 2009
    Beans
    2

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Thanks, this sounds like just what I was looking for. Unfortunately, I haven't had any success. I got the following error:

    Extracting page 1
    Converting page 1 to ppm
    Running OCR on page 1
    Magick: NoDecodeDelegateForThisImageFormat `1.ppm' @ constitute.c/ReadImage/530
    Cuneiform for Linux 0.9.0
    Error while running OCR on page 1

    I think I installed all of the ImageMagick packages correctly. Any ideas on what's going wrong?

  5. #15
    Join Date
    Jul 2007
    Beans
    39

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Thank you very much for posting this useful information. This worked really well for me for a few documents, but then I installed pdftotext per the other instruction thread from that post, and now it seems I can't process any more. I get the following errors at the console.

    Any guidance appreciated.

    john@ubuntu:~$ pdfocr -i /home/john/firsttest.pdf -o output.pdf
    Input file is /home/john/firsttest.pdf
    Output file is /home/john/output.pdf
    Using working dir /tmp/d20100708-8664-k9qzbg
    Getting info from PDF file

    InfoKey: Creator
    InfoValue: PScript5.dll Version 5.2
    InfoKey: Title
    InfoValue: 74-120-155 092970.pub
    InfoKey: Author
    InfoValue: Cindy
    InfoKey: Producer
    InfoValue: GPL Ghostscript 8.64
    InfoKey: ModDate
    InfoValue: D:20100708211330-07'00'
    InfoKey: CreationDate
    InfoValue: D:20100503094726-05'00'
    PdfID0: 56aa2475327b6351a435090fb8a5489
    PdfID1: 9e788838d0c37c7723539b8c4afeffdd
    NumberOfPages: 1

    Converting 1 pages
    ==========
    Extracting page 1
    Converting page 1 to ppm
    pdftoppm version 3.02
    Copyright 1996-2007 Glyph & Cog, LLC
    Usage: pdftoppm [options] <PDF-file> <PPM-root>
    -f <int> : first page to print
    -l <int> : last page to print
    -r <int> : resolution, in DPI (default is 150)
    -mono : generate a monochrome PBM file
    -gray : generate a grayscale PGM file
    -t1lib <string> : enable t1lib font rasterizer: yes, no
    -freetype <string>: enable FreeType font rasterizer: yes, no
    -aa <string> : enable font anti-aliasing: yes, no
    -aaVector <string>: enable vector anti-aliasing: yes, no
    -opw <string> : owner password (for encrypted files)
    -upw <string> : user password (for encrypted files)
    -q : don't print any messages or errors
    -cfg <string> : configuration file to use in place of .xpdfrc
    -v : print copyright and version info
    -h : print usage information
    -help : print usage information
    --help : print usage information
    -? : print usage information
    Running OCR on page 1
    ImageMagick: Memory allocation failed `' @ dib.c/WriteDIBImage/1066
    Cuneiform for Linux 0.9.0
    Error while running OCR on page 1
    Merging together PDF files
    Error: Failed to open PDF file:
    /tmp/d20100708-8664-k9qzbg/*-new.pdf
    Errors encountered. No output created.
    Done. Input errors, so no output created.
    Updating PDF info for /home/john/output.pdf
    Error: Failed to open PDF file:
    /tmp/d20100708-8664-k9qzbg/merged.pdf
    Errors encountered. No output created.
    Done. Input errors, so no output created.
    Cleaning up temporary files

  6. #16
    Join Date
    Jul 2006
    Beans
    39
    Distro
    Ubuntu 6.06

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Great idea! I'm trying it right now. My first error came when running your script in a directory with spaces in its name, or on a file with a space in its filename.

    Your script seems to cut off the directory name or filename at the space. You can fix this by escaping spaces in the paths the script passes to the shell.
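
    A minimal sketch of that kind of fix, assuming the script builds its shell lines by string interpolation as in the `sh "pdftoppm ..."` line quoted earlier in this thread. `Shellwords.escape` is from Ruby's standard library; the helper name `extract_page_line` and the example paths are made up for illustration and are not pdfocr's actual code.

```ruby
require 'shellwords'

# Hypothetical helper: build a shell line in which each path is escaped,
# so a space no longer splits one path into two arguments.
def extract_page_line(pdf_path, ppm_root)
  "pdftoppm #{Shellwords.escape(pdf_path)} #{Shellwords.escape(ppm_root)}"
end

# Example paths are invented; both contain spaces on purpose.
line = extract_page_line('/home/john/my scans/first test.pdf', '/tmp/work dir/1')
puts line
# Splitting the line the way a shell would yields exactly three words again.
p Shellwords.split(line)
```

    An alternative with the same effect is to skip the shell entirely and pass the arguments as a list, as in `system('pdftoppm', pdf_path, ppm_root)`.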

    Thanks! I'm excited to see how my first OCR'ed PDF comes out.

  7. #17
    Join Date
    Mar 2009
    Location
    Brazil
    Beans
    475
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Thank you!
    I've been looking for something like this for a long time.

    I'm installing it right now.
    Ubuntu User #27453 | Linux User #490358
    "Don't preach Linux, mention it"
    "Linux is not Windows"
    73% of statistics in forums are made up on the spot

  8. #18
    Join Date
    Nov 2009
    Beans
    5

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Thanks!!!
    I've been looking for a long time for something like this (as an alternative to FineReader) and it works flawlessly!!!!

  9. #19
    Join Date
    Oct 2006
    Location
    Desert
    Beans
    75

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    I seem to be having this common issue:
    Code:
    Warning: Image x/y resolution not set, defaulting to: 300
    I am running 64-bit 10.04.
    Any way to fix this?

    texla was able to fix this, but how?

    My ocr'd text is not where it is supposed to be.

  10. #20
    Join Date
    Mar 2009
    Location
    Brazil
    Beans
    475
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Originally posted by trumpeteersman:
    I seem to be having this common issue:
    Code:
    Warning: Image x/y resolution not set, defaulting to: 300
    I am running 64-bit 10.04.
    Any way to fix this?

    texla was able to fix this, but how?

    My ocr'd text is not where it is supposed to be.

    It also happens here: the generated text is much bigger than the original, so it ends up out of place.
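
    For reference, a rough sketch of the arithmetic behind a resolution mismatch. It is not a confirmed diagnosis of this problem. It assumes the page image was rendered at 150 DPI (the pdftoppm default listed in the usage output posted earlier in this thread) while the tool printing the warning falls back to 300. Whether that is what happens here is not established. The function names and the page width are made up for illustration.

```ruby
# Convert a length in pixels to PDF points (1 point = 1/72 inch).
def pixels_to_points(pixels, dpi)
  pixels * 72.0 / dpi
end

# Factor by which sizes are off when an image rendered at true_dpi
# is interpreted as if it had assumed_dpi.
def scale_error(true_dpi, assumed_dpi)
  true_dpi.to_f / assumed_dpi
end

width_px = 1275                       # hypothetical: 8.5 in rendered at 150 DPI
puts pixels_to_points(width_px, 150)  # 612.0 points, i.e. 8.5 in
puts pixels_to_points(width_px, 300)  # 306.0 points, i.e. 4.25 in
puts scale_error(150, 300)            # 0.5
```

    If the image layer and the text layer end up using different DPI values, their sizes and positions would differ by this kind of ratio.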
