
Thread: Howto: Make scanned PDFs searchable (OCR) using pdfocr

  1. #21
    Join Date
    Jan 2006
    Location
    Cairo, Egypt
    Beans
    109
    Distro
    Xubuntu Development Release

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    I hope I can get this working soon. I was excited when I saw that I could actually select text in an OCRed document, but then I realized it was only detecting something like 15 to 30 words per page. Any ideas on how to get it to see all the text?

  2. #22
    Join Date
    Mar 2008
    Beans
    27

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Hi!

    hocr2pdf seems to have problems with the hOCR files produced by cuneiform 0.9.0 or higher. The recognition is pretty good, but when the hOCR file is added to the PDF, the font size used in the hidden layer is sometimes blown way out of proportion, so the hidden layer doesn't fit the image layout at all, and some parts of the OCR result are not saved at all.

    I haven't found out why this happens; it seems that hocr2pdf cannot handle the new hOCR format used in cuneiform 0.9.0 or higher.

    pdfsandwich ( http://www.tobias-elze.de/pdfsandwich/index.html ) could be worth looking at. It works with cuneiform 0.7.0, the standard Ubuntu package, since it first converts the PDF image to bmp3 (the format cuneiform was written for). The resulting hOCR files are slightly larger than those produced by cuneiform 0.9.0 or higher, and the results match much better!


    so long
    clasikowski aka hank
    Last edited by clasikowski; August 18th, 2010 at 10:46 AM. Reason: spelling

  3. #23
    Join Date
    Apr 2010
    Beans
    4

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Hi, the program doesn't work for me. I'm on Ubuntu 10.04 with the current kernel.

    I've tried it on a large ebook. The scan quality is OK.

    Here is the output:

    Code:
    Error while running OCR on page 275
    ==========
    Extracting page 276
    Converting page 276 to ppm
    pdftoppm version 3.02
    Copyright 1996-2007 Glyph & Cog, LLC
    Usage: pdftoppm [options] <PDF-file> <PPM-root>
      -f <int>          : first page to print
      -l <int>          : last page to print
      -r <int>          : resolution, in DPI (default is 150)
      -mono             : generate a monochrome PBM file
      -gray             : generate a grayscale PGM file
      -t1lib <string>   : enable t1lib font rasterizer: yes, no
      -freetype <string>: enable FreeType font rasterizer: yes, no
      -aa <string>      : enable font anti-aliasing: yes, no
      -aaVector <string>: enable vector anti-aliasing: yes, no
      -opw <string>     : owner password (for encrypted files)
      -upw <string>     : user password (for encrypted files)
      -q                : don't print any messages or errors
      -cfg <string>     : configuration file to use in place of .xpdfrc
      -v                : print copyright and version info
      -h                : print usage information
      -help             : print usage information
      --help            : print usage information
      -?                : print usage information
    Running OCR on page 276
    Magick: Improper image header `276.ppm' @ pnm.c/ReadPNMImage/297
    Cuneiform for Linux 0.9.0
    Error while running OCR on page 276
    Merging together PDF files
    /tmp/d20100824-6962-1fqv3qc/*-new.pdf not found as file or resource.
    Error: Failed to open PDF file: 
       /tmp/d20100824-6962-1fqv3qc/*-new.pdf
    Errors encountered.  No output created.
    Done.  Input errors, so no output created.
    Updating PDF info for /home/steffen/tmp/morgen.pdf
    /tmp/d20100824-6962-1fqv3qc/merged.pdf not found as file or resource.
    Error: Failed to open PDF file: 
       /tmp/d20100824-6962-1fqv3qc/merged.pdf
    Errors encountered.  No output created.
    Done.  Input errors, so no output created.
    Cleaning up temporary files
    Maybe there is already a solution. Thank you.
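
The log above shows ImageMagick rejecting `276.ppm` with "Improper image header" right after pdftoppm printed its usage text instead of converting the page. A minimal sketch of a guard that could catch this before OCR runs is below. The function name `is_valid_pnm` is hypothetical and is not part of pdfocr; it only checks that the file is non-empty and starts with a binary PNM magic number (P4, P5, or P6), which is an assumption about what a usable pdftoppm output looks like.

```shell
#!/bin/bash
# Hypothetical helper (not part of pdfocr): succeed only if the file is
# non-empty and begins with a binary PBM/PGM/PPM magic number.
is_valid_pnm() {
    local file=$1
    local magic
    # -s is false for missing or zero-length files
    [ -s "$file" ] || return 1
    # Read the first two bytes; drop NULs so bash does not warn about them
    magic=$(head -c 2 "$file" | tr -d '\0')
    case "$magic" in
        P4|P5|P6) return 0 ;;
        *)        return 1 ;;
    esac
}
```

A wrapper could call `is_valid_pnm 276.ppm` after each pdftoppm run and skip or report the page when the check fails, rather than passing a bad file on to the OCR step.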

  4. #24
    Join Date
    Aug 2010
    Location
    Nashville, TN
    Beans
    2
    Distro
    Ubuntu 11.04 Natty Narwhal

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Quote Originally Posted by clasikowski View Post
    Hank, thanks for this suggestion. It is giving me the same mismatched results that others have had.

    I tried the option -noimage and can see that the output has randomly sized text that often overlaps, which is why it never lines up with the original image.

    Has anyone come up with a solution to this yet?

  5. #25
    Join Date
    May 2008
    Location
    Mountain View, CA
    Beans
    25
    Distro
    Ubuntu 10.10 Maverick Meerkat

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Like the others, I find that the text and image layers do not match up. Do you need samples of our files?

    btw: thanks for providing this program. Things are better than before.

  6. #26
    Join Date
    Apr 2005
    Beans
    4

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Quote Originally Posted by snoozy23 View Post
    Hi, the program doesn't work for me. I'm on Ubuntu 10.04 with the current kernel.
    I modified your solution to work with Ubuntu 9.10 and pdftoppm 3.02 (which looks like what you used too). The problem I had was that pdftoppm zero-pads the page number in the output .ppm file name (five zeros for a single-digit page). Mine will only work for PDFs of fewer than 100 pages.

    Code:
    #!/bin/bash
    # OCR each PDF given on the command line into a plain-text file,
    # using pdftoppm, ImageMagick convert, and tesseract.
    # Usage: script.sh [-r degrees] file.pdf ...
    
    TESS_LANG=eng
    rflag=
    # first figure out what args we have
    while getopts 'r:' OPT
    do
        if [ "$OPT" = "r" ]
        then
            rflag="-rotate $OPTARG"
        fi
    done
    shift $((OPTIND - 1))
    
    CURRENT_DIR=$(pwd)
    SCRIPT_NAME=$(basename "$0" .sh)
    TMP_DIR="${CURRENT_DIR}/${SCRIPT_NAME}-tmp"
    mkdir "${TMP_DIR}" || exit 1
    
    for thisfile in "$@"
    do
        NAME=$(basename "${thisfile}" .pdf)
        cp "$thisfile" "${TMP_DIR}/${NAME}.pdf"
        cd "${TMP_DIR}" || exit 1
    
        echo "Examining: ${thisfile}"
        # work on the copy: the original path may be relative to CURRENT_DIR
        pgs=$(pdfinfo "${NAME}.pdf" | awk '/^Pages:/ {print $2}')
        echo "Found ${pgs} pages; converting..."
        # it's only fair, since we're suppressing it later...
        echo "Tesseract Open Source OCR Engine"
        for x in $(seq 1 "${pgs}")
        do
            echo -en "  Page ${x}..."
            pdftoppm -f "$x" -l "$x" -r 600 "${NAME}.pdf" ocrbook
            # pdftoppm 3.02 pads the page number to six digits;
            # this only handles one- and two-digit page numbers
            if [ "$x" -gt 9 ]; then
                BASE=ocrbook-0000${x}
            else
                BASE=ocrbook-00000${x}
            fi
            # rflag is left unquoted on purpose so it splits into two words
            convert "${BASE}.ppm" ${rflag} "${BASE}.tif"
            tesseract "${BASE}.tif" "${BASE}" -l "${TESS_LANG}" > /dev/null 2>&1
            cat "${BASE}.txt" >> "${NAME}.txt"
            echo "[pagebreak]" >> "${NAME}.txt"
            rm ocrbook*
            echo "done"
        done
    
        echo "Conversion complete"
    
        mv "${NAME}.txt" "${CURRENT_DIR}"
        rm ./*
        cd "${CURRENT_DIR}" || exit 1
    done
    
    rmdir "${TMP_DIR}"
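
The 100-page limit in the script above comes from building the zero-padded file name by hand. A minimal sketch of one way around it is to let `printf` do the padding. This assumes, based only on the two cases in the script, that pdftoppm 3.02 pads the page number to six digits; the helper name `ppm_base` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: build the base name that pdftoppm 3.02 appears to
# produce, ASSUMING the page number is zero-padded to six digits.
ppm_base() {
    local root=$1 page=$2
    # 10# forces base 10 so that an input like 08 is not parsed as octal
    printf '%s-%06d\n' "$root" "$((10#$page))"
}

ppm_base ocrbook 7      # prints ocrbook-000007
ppm_base ocrbook 123    # prints ocrbook-000123
```

Inside the page loop, `BASE=$(ppm_base ocrbook "$x")` would then stand in for the `if`/`else` that picks between four and five leading zeros.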

  7. #27
    Join Date
    Apr 2005
    Beans
    4

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    wrong thread, sorry for the post. Please delete.

  8. #28
    Join Date
    Jul 2007
    Location
    San Antone, Teksis
    Beans
    Hidden!
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    this app is really promising, especially because there's nothing out there like it for linux. it would be extremely useful for any organization that images a grip of documents.

    in our office, for example, we image all our files. the fancy expensive feed scanners we have scan into PDF, but the output isn't OCR'd.

    if this app were accurate, i'd use it to OCR tens of thousands of PDFs; otherwise i have to rely on effective file names to find what i'm looking for.

    my experience, however, is like the rest: poor recognition and badly misaligned. sure, tesseract is out there, but OCRing to a separate text file isn't very useful in this context.

  9. #29
    Join Date
    Nov 2005
    Location
    Boulder, Colorado
    Beans
    89
    Distro
    Ubuntu 11.04 Natty Narwhal

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    I'd really like to try this out and write it up, but the PPA doesn't work for 10.10. Is there any chance of a 10.10 build happening soon?

  10. #30
    Join Date
    Mar 2009
    Beans
    232

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    I just tried installing from the PPA in Maverick 10.10 too, but the dependencies don't allow one to install it. Quite unfortunate. Will there be updated packages soon?
