Results 31 to 40 of 48

Thread: Howto: Make scanned PDFs searchable (OCR) using pdfocr

  1. #31
    Join Date
    Jan 2007
    Beans
    6,539
    Distro
    Ubuntu 13.04 Raring Ringtail

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    The dependency problem is because it depends on libmagick++2, while recent versions of Ubuntu only have libmagick++3.

    You can get a copy of libmagick++2 from the lucid repos here. Install the .deb then go ahead and install pdfocr as normal.

  2. #32
    Join Date
    Dec 2010
    Beans
    1

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    I have one problem. After recognition, the created PDF file is so small that it is readable only at 300% zoom, and at 400% it already shows artefacts. Is this normal? How can I increase its size?

  3. #33
    Join Date
    Feb 2010
    Beans
    6

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    The script errors out with
    Code:
    "Error: there are 0 pages in the input PDF File."
    But the output of pdftk dump_data shows "NumberOfPages: 11"

    Code:
    InfoKey: Creator
    InfoValue: PScript5.dll Version 5.2
    InfoKey: Title
    InfoValue: PartList_Parts_Explosion_LST2.pdf
    InfoKey: Producer
    InfoValue: Acrobat Distiller 7.0 (Windows)
    InfoKey: Author
    InfoValue: jspencer
    InfoKey: ModDate
    InfoValue: D:20050929120602-05'00'
    InfoKey: CreationDate
    InfoValue: D:20050929120602-05'00'
    PdfID0: 20ac307166dce2f6b8e7982e42c9eae
    PdfID1: 2818188aa0687043a69a97e172b78b3a
    NumberOfPages: 11
    PageLabelNewIndex: 1
    PageLabelStart: 1
    PageLabelNumStyle: DecimalArabicNumerals


    I have tried editing the script to correctly define "pagenum", but I have no experience with Ruby.

    Any help is appreciated.


    Jeremy
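For comparison, the page count can be pulled out of the `pdftk dump_data` output with a single awk filter, which is the approach the shell script posted later in this thread takes. The following is only a minimal sketch: `count_pages` is a hypothetical helper name, not part of pdfocr, and it assumes the dump is supplied on standard input.

```shell
#!/bin/sh
# count_pages: hypothetical helper (not part of pdfocr).
# Reads `pdftk <file> dump_data` output on stdin and prints the value
# of the NumberOfPages field, or nothing if the field is absent.
count_pages() {
  awk '$1 == "NumberOfPages:" { print $2 }'
}

# Example using a fragment of the dump shown above
printf 'InfoKey: Author\nInfoValue: jspencer\nNumberOfPages: 11\nPageLabelNewIndex: 1\n' | count_pages
```

With a real file this would be run as `pdftk input.pdf dump_data | count_pages`, where `input.pdf` is a placeholder name.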

  4. #34
    Join Date
    Nov 2009
    Beans
    9
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    I tried this script, but did not get great results with English text.

    For those having problems here, you may want to try the solutions in this thread:

    http://ubuntuforums.org/showthread.php?t=1647350

    The thread provides instructions on how to install Tesseract 3.00--which now produces hOCR output--and I've posted a shell script there that will take an image-type PDF, process it with Tesseract, and then convert it back into a single PDF.

    I can't take credit for the script (acknowledgements included in the script itself), but it seems to work pretty well for me, except for a file size issue that sometimes arises (sometimes it will save about 10-15% over the original file size, sometimes it will double it - still can't figure that out). But it's a shell script, and not too tough to modify to suit your needs/wants.

    Alternatively, it may be worth seeing if it's possible to use Tesseract in the script in this thread, but that may be tricky for those of us completely unfamiliar with Ruby, particularly if there's a shell script that offers similar functionality.....

  5. #35
    Join Date
    Oct 2007
    Beans
    20
    Distro
    Ubuntu 7.04 Feisty Fawn

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Quote Originally Posted by drewsimon View Post
    Great idea! I'm trying it right now. My first error was when running your script in a directory that contains spaces in the name. Or when running your script on a file that has a space in the filename.

    Your script seems to cut off the directoryname/filename at the space. You can fix your script by escaping spaces in your script.

    Thanks! I'm excited to see how my first OCR'ed PDF comes out.
    I see Drewsimon has already noted this back in July. Not escaping the filenames to handle spaces and other characters really is putting a damper on this tool. Will you be fixing this?

    Otherwise it seems a really useful tool.
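The usual fix for the space problem described above is to quote every expansion that holds a path. pdfocr itself is written in Ruby, so the following is not a patch to it, only a shell sketch of the general principle; `ocr_name` and the example path are hypothetical.

```shell
#!/bin/sh
# ocr_name: hypothetical helper that derives "<name>-ocr.pdf" from an
# input path. The double quotes around "$1" and "${base}" keep spaces
# and other special characters in the path intact.
ocr_name() {
  base=$(printf '%s\n' "$1" | sed 's/\.[pP][dD][fF]$//')
  printf '%s\n' "${base}-ocr.pdf"
}

# A path with spaces in both the directory and the file name
ocr_name "/home/user/My Scans/tax return 2010.pdf"
```

Without the quotes, the shell would split the path at each space and pass the pieces on as separate arguments, which matches the truncation reported in the quoted post.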

  6. #36
    Join Date
    Jul 2007
    Location
    San Antone, Teksis
    Beans
    Hidden!
    Distro
    Ubuntu 16.04 Xenial Xerus

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Quote Originally Posted by fungiblename View Post

    The thread provides instructions on how to install Tesseract 3.00--which now produces hOCR output--and I've posted a shell script there that will take an image-type PDF, process it with Tesseract, and then convert it back into a single PDF.
    ....
    cool. i'll try the tesseract solution when i've got a chance.

  7. #37
    Join Date
    Apr 2006
    Beans
    213

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Hi guys, there doesn't seem to be a repository for Maverick (Ubuntu 10.10). Does pdfocr exist as a package for Maverick? Or can I just get the source and compile it myself?
    Thanks

  8. #38
    Join Date
    Apr 2006
    Beans
    213

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    Hi, I have implemented the same thing in /bin/sh (i.e. no need to install ruby). By default, it overlays the recognized text with the original pdf, and not the rasterized image (as in the case of pdfocr), although that can also be enforced with the --ppm option.

    It uses tesseract (version 3.0 or later is needed), which works very nicely, in contrast to cuneiform, which recognized practically nothing whatever I tried (I did not try to tweak it, though).

    I have a problem: the output looks ok on the screen, it is searchable, etc. But when I print it, the recognized text also appears in the printout. How can this be solved?

    ...start edit: well, I discovered that I had skipped a large part of this thread (and of another thread referenced here), and I am not the only one who implemented their own version for this task. What I do differently is that I pass the -n option to hocr2pdf (i.e. do not overlay the raster image over the recognized text), and then do the overlay later with pdftk, placing the original pdf page over the recognized text. That should be, I think, better.

    However, all the implementations (including pdfsandwich, which also works quite nicely) have the same problem: the OCR-ed document looks ok on the screen in a pdf reader, but when it is printed, the background text gets printed too. Any idea how to solve this?

    ... end edit.

    Code:
    #!/bin/sh
    
    # Dependencies: pdftoppm, pdftk, tesseract (>3.0), hocr2pdf
    # see http://code.google.com/p/tesseract-ocr/wiki/ReadMe how to install tesseract (>3.0)
    
    input=""
    output=""
    from=1
    to=0
    verbose=false
    useppm=false
    resolution=450
    
    help()
    {
    cat <<EOF
    
    Usage: pdf-ocr <-i input> [options]
    
      -o <output>         Specify output filename
    
      -f|--from <number>  Specify starting page
    
      -t|--to <number>    Specify last page
    
      -p|--ppm            Use the ppm image instead of the original pdf
    
      -r <resolution>     Specify resolution (image given to the OCR program)
                          Default is $resolution
    
      -v|--verbose
    
    EOF
    }
    
    while [ $# -gt 0 ] ; do
      case $1 in
        -i) input=$2; shift;;
        -o) output=$2; shift;;
        -f|--from) from=$2; shift;;
        -t|--to) to=$2; shift;;
        -v|--verbose) verbose=true;;
        -r) resolution=$2; shift;;
        -p|--ppm) useppm=true;;
        -h|--help) help; exit;;
        *) echo "Unknown argument: $1" >&2; exit 1;;
      esac
      shift
    done
    
    if [ "$input" = "" ] ; then
      help
      exit 1
    fi
    
    if [ "$output" = "" ] ; then
      output=`echo "$input" | sed 's/\.[pP][dD][fF]$//'`
      output="${output}-ocr.pdf"
    fi
    
    echo "###  $input -->  $output"
    
    tmp=`mktemp -d`
    
    pdfinfo="${tmp}/pdfinfo.txt"
    # Quote the input path so that file names with spaces work
    pdftk "${input}" dump_data > "${pdfinfo}"
    pages=`awk '$1=="NumberOfPages:" { print $2 }' "${pdfinfo}"`
    
    # Validate the page count before using it in numeric comparisons
    if [ "$pages" = "" -o "$pages" = 0 ] ; then
      echo "### input file has no pages"
      rm -rf "${tmp}"
      exit 2
    fi
    
    echo "###  Input file has $pages pages"
    if [ "$from" -lt 1 ] ; then from=1; fi
    if [ "$to" = 0 -o "$to" -gt "$pages" ] ; then to=$pages; fi
    
    i=1
    if [ "$from" -gt $i ] ; then
      i="$from"
    fi
    if [ "$to" -lt $pages ] ; then
      pages="$to"
    fi
    
    pagelist=""
    while [ $i -le $pages ] ; do
      echo " >> Processing page $i"
      page_pdf=${tmp}/${i}.pdf
      page_base=`echo $page_pdf | sed 's/\.pdf$//g'`
      page_ppm=${page_base}.ppm
      page_hocr=${page_base}.html
      page_ocr=${page_base}.tmp.pdf
      page_final=${page_base}.ocr.pdf
      echo -n "   Extracting page... "
      pdftk "$input" cat $i output "$page_pdf"
      echo "DONE, return status: $?"
      echo -n "   Converting to ppm... "
      pdftoppm -r ${resolution} $page_pdf > $page_ppm
      echo "DONE, return status: $?"
      echo -n "   Running tesseract (OCR)... "
      tesseract_output=`tesseract ${page_ppm} ${page_base} hocr 2>&1`
      tesseract_status=$?
      if [ $verbose = true -o $tesseract_status != 0 ] ; then echo; echo "$tesseract_output"; fi
      echo "DONE, return status: $tesseract_status"
      if [ $useppm = false ] ; then
        echo -n "   Creating text layer (hocr2pdf)... "
        hocr2pdf -r ${resolution} -n -s -i ${page_ppm} -o ${page_ocr} < ${page_hocr}
        echo "DONE, status: $?"
        echo -n "   Underlaying text below original image (pdftk)... "
        pdftk ${page_pdf} background ${page_ocr} output ${page_final}
        echo "DONE, status: $?"
      else
        echo -n "   Creating page with recognized text (hocr2pdf)... "
        hocr2pdf -r ${resolution}  -s -i ${page_ppm} -o ${page_final} < ${page_hocr}
        echo "DONE, status: $?"
      fi
      pagelist="${pagelist} ${page_final}"
      i=$((i+1))
    done
    
    echo Merging pages: pdftk ${pagelist} cat output "${output}"
    pdftk ${pagelist} cat output ${tmp}/final.pdf
    
    echo Updating PDF info
    pdftk ${tmp}/final.pdf update_info ${pdfinfo} output "${output}"
    
    rm -rf ${tmp}
    Last edited by barna; March 4th, 2011 at 12:52 AM.
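One thing the script above leaves open is cleanup on early exit: the temporary directory is removed only by the final `rm -rf`, so a run that stops earlier can leave it behind. A common shell idiom for this is an `EXIT` trap. The sketch below is an illustration under that assumption, not part of the posted script; `run_with_cleanup` is a hypothetical name and the failure is simulated.

```shell
#!/bin/sh
# run_with_cleanup: hypothetical wrapper illustrating trap-based cleanup.
# The body runs in a subshell, so the trap and the exit do not affect
# the calling shell.
run_with_cleanup() (
  tmp=$(mktemp -d) || exit 1
  trap 'rm -rf "$tmp"' EXIT   # fires on any exit from this subshell
  echo "$tmp"                 # report the directory so the caller can check it
  exit 2                      # simulate an early failure
)

dir=$(run_with_cleanup) || echo "exited with status $?"
if [ -d "$dir" ]; then echo "leaked: $dir"; else echo "cleaned up: $dir"; fi
```

In the posted script the equivalent change would be a single `trap` line directly after the `mktemp -d` call.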

  9. #39
    Join Date
    Apr 2006
    Beans
    213

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    It seems that the problem mentioned above may be related to the printing system. (The problem: the recognized pdf, with the text layer behind the original scanned document, displays correctly in a reader on screen, where the text layer is invisible, but both layers appear in the printout.) At least when printing the same document from Acrobat on MacOS, it was OK. Printing from both acroread and okular on linux produced wrong output.

  10. #40
    Join Date
    Jan 2007
    Beans
    120

    Re: Howto: Make scanned PDFs searchable (OCR) using pdfocr

    I would say this is the closest I have come to a replacement of Adobe Acrobat for Ubuntu or linux in general.

    I too have to say that the resulting OCRed text is placed quite a bit off from where it belongs.

    I also see that the default is 300 dpi, which may have something to do with it.

    Being so close... and also using PDF Studio for all other replacement of Adobe Acrobat, can't those final fixes, like position of merged text, and an install routine, be created by someone out there? Please?

    I know I would have a hard time getting success again if I had to try and install and use via command line.

    Just hoping, certainly not complaining!

    This pdfocr works better than anything else on Ubuntu for OCRing like Acrobat does.

    zeddock
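Whether resolution is really behind the misplaced text is not confirmed anywhere in this thread, but the arithmetic shows why a mismatch would matter. PDF coordinates are in points (1/72 inch), so a pixel position only maps to the right spot if the same dpi is used for rasterizing and for building the text layer. A minimal sketch, with `px_to_pt` as a hypothetical helper and the pixel value chosen purely for illustration:

```shell
#!/bin/sh
# px_to_pt: hypothetical helper converting a pixel coordinate to PDF
# points (1 point = 1/72 inch) for an image at the given resolution.
px_to_pt() {
  awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.0f\n", px * 72 / dpi }'
}

px_to_pt 1500 450   # pixel 1500 in an image rasterized at 450 dpi
px_to_pt 1500 300   # the same pixel interpreted as if the image were 300 dpi
```

The two results differ by a factor of 1.5, so text positioned with the wrong resolution would drift further from its true place the farther it is from the page origin.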
