
Thread: How To: OCR any PDF file

  1. #1
    Join Date
    Mar 2008
    Beans
    349
    Distro
    Ubuntu 8.04 Hardy Heron

    How To: OCR any PDF file

    As anyone who has tried knows, running optical character recognition (OCR) on PDF files can be confusing, especially since Tesseract, repeatedly hailed as the best free OCR software, can only read TIFF (*.tif) files.

    Step 1: Install needed packages
    Code:
    sudo apt-get install tesseract-ocr tesseract-ocr-eng xpdf-reader xpdf imagemagick xpdf-utils
    Side note: You will need to install a tesseract language package for every additional language you wish to use. For example, the package tesseract-ocr-fra lets you OCR French text. Check Synaptic Package Manager for the other language packages.

    Step 2: See if you actually need ocr

    xpdf-utils (which you just installed) provides a pdftotext utility:
    Code:
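    pdftotext yourfile.pdf
    This writes yourfile.txt next to the PDF (yourfile.pdf is a placeholder for your own file). If that file contains the text you expected, congratulations: you don't need OCR and can stop here.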
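The check in Step 2 can also be scripted. The following is a minimal sketch, not part of the original how-to: the function name has_text_layer and the MIN_CHARS cutoff are made up for illustration, and the cutoff of 50 non-whitespace characters is only an assumption about what counts as "usable" text.

```shell
#!/bin/sh
# Sketch: decide whether a PDF already has a usable text layer.
# has_text_layer and MIN_CHARS are hypothetical names, not from the post.

MIN_CHARS=${MIN_CHARS:-50}   # assumed cutoff; tune for your documents

# has_text_layer FILE.pdf -> exit status 0 if pdftotext finds enough text
has_text_layer() {
  # "-" makes pdftotext write to stdout instead of a .txt file.
  # Whitespace (including page-break form feeds) is not counted.
  chars=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c | tr -d ' ')
  [ "$chars" -ge "$MIN_CHARS" ]
}
```

After sourcing it, something like `has_text_layer book.pdf && echo "no OCR needed"` tells you whether to continue to Step 3. A scanned PDF typically yields little or no text here, which is the case the rest of this how-to deals with.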

    Step 3: OCR it

    The following shell script will attempt to ocr your file. I suggest placing it somewhere in your $PATH so you can run it from the same directory as the pdf file and not have weird filenames.

    NOTE: This script works by converting each page of the PDF file into a TIFF image of roughly 100MB. That means a temporary increase in hard drive usage of about 100MB per page while the script is running. If that is an issue, edit the shell script to lower the resolution passed to pdftoppm (the -r value) or to process only a few pages at a time (the -f and -l values); check man pdftoppm.
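The 100MB figure can be checked with a little arithmetic. This sketch assumes a US Letter page (8.5 x 11 inches) rendered as uncompressed 24-bit colour, which is an assumption about the input rather than something the post states; ppm_bytes is a hypothetical helper name, and the page size is passed in tenths of an inch to stay within integer arithmetic.

```shell
#!/bin/sh
# Sketch: estimate the size of one uncompressed 24-bit PPM page.
# ppm_bytes is a hypothetical helper; page size is in tenths of an inch.

# ppm_bytes DPI [WIDTH_TENTHS] [HEIGHT_TENTHS]  (defaults: US Letter)
ppm_bytes() {
  dpi=$1
  w=${2:-85}
  h=${3:-110}
  # pixels per side = inches * dpi; 3 bytes per pixel (RGB), header ignored
  echo $(( (w * dpi / 10) * (h * dpi / 10) * 3 ))
}
```

`ppm_bytes 600` prints 100980000 (5100 x 6600 pixels x 3 bytes), and `ppm_bytes 300` prints 25245000. Halving the resolution therefore cuts the per-page size to a quarter, at the cost of less detail for tesseract to work with.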

    Code:
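    #!/bin/sh
    # OCR a PDF: render pages to images, run tesseract on each, join the text.
    mkdir tmp
    cp "$@" tmp
    cd tmp || exit 1
    # Render pages 1-10 at 600 dpi; remove "-f 1 -l 10" to process every page.
    pdftoppm -f 1 -l 10 -r 600 * ocrbook
    for i in *.ppm; do convert "$i" "`basename "$i" .ppm`.tif"; done
    # "-l eng" matches the tesseract-ocr-eng package installed in Step 1.
    for i in *.tif; do tesseract "$i" "`basename "$i" .tif`" -l eng; done
    # Append every page's text to a single output file.
    for i in *.txt; do cat "$i" >> pdf-ocr-output.txt; echo "[pagebreak]" >> pdf-ocr-output.txt; done
    mv pdf-ocr-output.txt ..
    rm *
    cd ..
    rmdir tmp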
    Usage:
    1) Copy the script into gedit or your favorite text editing program and save it.
    2)
    Code:
    cd /path/to/saved/file/
    chmod +x filename
    3) Run the script. If it is saved in a directory in your $PATH, just type the filename in the console, followed by the name of the PDF file you wish to OCR. Otherwise, cd to /path/to/saved/file and run:
    Code:
    ./filename "PDF file you wish to ocr"
    Edit: Moving to tutorials and tips -my bad
    Last edited by RequinB4; August 7th, 2008 at 05:53 PM.

  2. #2
    Join Date
    Mar 2007
    Beans
    148
    Distro
    Ubuntu 9.10 Karmic Koala

    Re: How To: OCR any PDF file

    Is there a way to integrate the OCR layer back into the PDF so that you create a searchable PDF instead of a separate text file?

  3. #3
    Join Date
    Mar 2009
    Beans
    232

    Re: How To: OCR any PDF file

    That's what I'm interested in.

  4. #4
    Join Date
    Aug 2006
    Beans
    69

    Re: How To: OCR any PDF file

    Probably via openoffice.org 3.0 / editable pdf?

    Thanks for the excellent script btw!

    pdftoppm * -f 1 -l 10 -r 600 ocrbook
    Seems like it's limited to the first 10 pages? I also think the quality of the ppm is rather high!
    Last edited by userid; December 17th, 2009 at 04:31 PM.

  5. #5
    Join Date
    Mar 2007
    Beans
    Hidden!

    Re: How To: OCR any PDF file

    Quote (originally posted by userid):
    Probably via openoffice.org 3.0 / editable pdf?
    Alright, you've got my attention here. Karmic has OO.org v3.1.1 now and yet in Writer I'm still only seeing an "Export as PDF" option... is there extra-magic (such as Extensions) that you need to edit an existing PDF to insert the text layer?

  6. #6
    Join Date
    Mar 2009
    Beans
    232

    Re: How To: OCR any PDF file

    I tried the script out and it doesn't always work. The only usable solution I've found--and it's hardly a solution!--is to run a VM of Windows and Acrobat Standard/Pro!

    Pretty lame.

  7. #7
    Join Date
    Oct 2006
    Location
    Toronto, Canada
    Beans
    15

    Re: How To: OCR any PDF file

    Quote (originally posted by nortexoid):
    Pretty lame.
    Well, complaining about free scripts is also "pretty lame."

    RequinB4, I edited your script a bit. It works great! As a further improvement, to cut down on the disk usage one could use "pdfinfo" to get the number of pages in the file and extract them one at a time, passing the same page number for the -f and -l parameters to pdftoppm.

    Also note if you want to run this, all you really need to install are tesseract-ocr-[LANG] and imagemagick (tesseract-ocr is a dependency of each of the individual tesseract language packages). Change "-l eng" in the file to match the tesseract language you've installed/want to use.
    Code:
    #!/bin/sh
    SCRIPT_NAME=`basename "$0" .sh`
    TMP_DIR="${SCRIPT_NAME}-tmp"
    OUTPUT_FILE="${SCRIPT_NAME}-output.txt"
    
    mkdir "$TMP_DIR"
    cp "$@" "$TMP_DIR"
    cd "$TMP_DIR" || exit 1
    
    # Render every page at 600 dpi as ocrbook-*.ppm
    pdftoppm -r 600 * ocrbook
    
    for i in *.ppm
    do
      BASE=`basename "$i" .ppm`
      convert "$i" "${BASE}.tif"
      tesseract "${BASE}.tif" "${BASE}" -l eng
      # Show each page's text on screen while appending it to the output file
      cat "${BASE}.txt" | tee -a "$OUTPUT_FILE"
      echo "[pagebreak]" | tee -a "$OUTPUT_FILE"
      rm "${BASE}".*
    done
    
    mv "$OUTPUT_FILE" ..
    rm *
    cd ..
    rmdir "$TMP_DIR"
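
The page-at-a-time idea mentioned in this post (pdfinfo for the page count, then the same page number for -f and -l) might look roughly like the sketch below. It is untested against real scans and is not the poster's script: the function names pdf_page_count and ocr_by_page, the argument order, and the use of mktemp -d for the scratch directory are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: OCR one page at a time to cap temporary disk usage.
# pdf_page_count and ocr_by_page are hypothetical names, not from the thread.

# Print the page count reported by pdfinfo ("Pages:   12" -> 12).
pdf_page_count() {
  pdfinfo "$1" | awk '/^Pages:/ {print $2}'
}

# ocr_by_page FILE.pdf OUTPUT.txt [LANG]   (LANG defaults to eng)
ocr_by_page() {
  pdf=$1
  out=$2
  lang=${3:-eng}
  pages=$(pdf_page_count "$pdf")
  [ -n "$pages" ] || return 1
  work=$(mktemp -d) || return 1
  : > "$out"                      # start with an empty output file
  page=1
  while [ "$page" -le "$pages" ]; do
    # Same number for -f and -l renders exactly one page.
    pdftoppm -f "$page" -l "$page" -r 600 "$pdf" "$work/page"
    for ppm in "$work"/page*.ppm; do
      [ -e "$ppm" ] || continue   # nothing was rendered for this page
      convert "$ppm" "$work/page.tif"
      tesseract "$work/page.tif" "$work/page" -l "$lang"
      cat "$work/page.txt" >> "$out"
      echo "[pagebreak]" >> "$out"
    done
    rm -f "$work"/page*           # only one page's images exist at a time
    page=$((page + 1))
  done
  rmdir "$work"
}
```

With the functions sourced, `ocr_by_page book.pdf book.txt eng` would leave at most one page's PPM and TIFF on disk at any moment, instead of one pair per page of the whole document.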

  8. #8
    Join Date
    Sep 2009
    Beans
    266

    Re: How To: OCR any PDF file

    I could use some clarification. I'd like to be able to scan documents and end up with a text-based PDF (searchable, selectable), but when I read "insert text layer" in this thread, I am a little lost.
    I'd rather die fighting lions than being trampled by geese.

  9. #9
    Join Date
    Apr 2006
    Location
    San Francisco
    Beans
    22
    Distro
    Ubuntu 9.10 Karmic Koala

    Re: How To: OCR any PDF file

    Gscan2pdf is GREAT. It installs pretty much every package you need at once, although I also had to install the tesseract English module.

  10. #10
    Join Date
    Sep 2009
    Beans
    266

    Re: How To: OCR any PDF file

    More clarification, please? Bear with me if this seems elementary.

    Let's say I type a document in Open Office and export to PDF. When I open the PDF, AFAIK, I don't see a "picture" of the text with a hidden text layer that can be indexed. It appears to be ACTUAL text that I'm looking at. I can select text, search text, etc, right there in the document that appears before me.

    What I'm reading so far about these OCR tools is that I'm still left with an IMAGE of the page of text, but with an extra hidden layer of actual text somewhere that can be searched. Am I understanding this correctly? Is there no way to use OCR to end up with essentially the same kind of PDF that I get when I export a word processing document to PDF? I don't want both an image of text and actual text hidden somewhere behind it.
    I'd rather die fighting lions than being trampled by geese.
