Thread: Tesseract 3.0 + Ubuntu 10.04 Installation Guide

  1. #1
    Join Date
    Nov 2009
    Beans
    18

    Tesseract 3.0 + Ubuntu 10.04 Installation Guide

    Hi guys,
    I am using the Tesseract package, which provides OCR (Optical Character Recognition) for electronic document images. The 10.04 repositories currently only have version 2.04, but the latest version, 3.00, is out with many new features. Here is how I got everything working:

    1. Install ImageMagick
    ImageMagick converts the document images to a format Tesseract likes. We use PDFs.
    "sudo apt-get install imagemagick"

    Usage would be something like:
    "convert -density 300 scanneddocument.pdf -depth 8 scanneddocument.tif"

    This converts to a good-quality TIFF image with 8-bit depth (required by Tesseract). You can experiment with the density value, as a different setting may give you better results.
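
If you want to compare a few density values side by side, something like the following might help. This is only a sketch: the function name density_cmds and the output naming scheme are my own inventions, and it just prints the convert commands so you can look them over before running anything.

```shell
#!/bin/bash
# Print one convert command per candidate density for a scanned PDF.
# Nothing is executed here; the commands are only written to stdout.
# (density_cmds and the "-<N>dpi.tif" naming are illustrative choices.)
density_cmds() {
    local pdf="$1"
    shift
    local base="${pdf%.pdf}"
    local d
    for d in "$@"; do
        printf 'convert -density %s "%s" -depth 8 "%s-%sdpi.tif"\n' "$d" "$pdf" "$base" "$d"
    done
}
```

For example, "density_cmds scanneddocument.pdf 200 300 400" prints three commands, one per density, which you can then run and compare by eye or by OCR result.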

    2. Install Tesseract
    Get the required packages available in the repositories:
    Code:
    sudo apt-get install libpng12-dev
    sudo apt-get install libjpeg62-dev
    sudo apt-get install libtiff4-dev
    (The Tesseract readme also suggests "sudo apt-get install zlibg-dev", but no package exists under that name; on Ubuntu it is called zlib1g-dev. I found I didn't need this.)
    As a commenter pointed out, you also need the tools to compile and build the software. Many of you will already have these installed, but it doesn't hurt to make sure. (libtool is needed by ./runautoconf, and checkinstall by the install steps below.)
    Code:
    sudo apt-get install gcc
    sudo apt-get install g++
    sudo apt-get install automake
    sudo apt-get install libtool
    sudo apt-get install checkinstall
    Download Leptonica, which can't be installed with apt-get:
    http://www.leptonica.org/
    The version I used: http://www.leptonica.org/source/leptonlib-1.67.tar.gz
    Code:
    wget http://www.leptonica.org/source/leptonlib-1.67.tar.gz
    tar -zxvf leptonlib-1.67.tar.gz
    cd leptonlib-1.67
    ./configure
    make
    sudo checkinstall   # follow the prompts: type "y" to create the documentation directory, enter a brief description, then press Enter twice
    sudo ldconfig
    Now we can actually get and install Tesseract! Remember to go back one directory from the above install of Leptonica:
    Code:
    cd ..
    Code:
    wget http://tesseract-ocr.googlecode.com/files/tesseract-3.00.tar.gz
    tar -zxvf tesseract-3.00.tar.gz
    cd tesseract-3.00
    ./runautoconf
    ./configure
    make
    sudo checkinstall   # follow the prompts: type "y" to create the documentation directory, enter a brief description, then press Enter twice
    sudo ldconfig
    Now for whatever reason the training data isn't installed with this. I originally got mine straight from Tesseract's SVN, but that copy later stopped working, so use the download below instead:
    (This is from rjwinder)
    Code:
    cd /usr/local/share/tessdata
    sudo wget http://tesseract-ocr.googlecode.com/files/eng.traineddata.gz
    sudo gunzip eng.traineddata.gz
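
Before moving on, it may be worth confirming that the unpacked file really is where Tesseract will look for it. A minimal sketch, assuming the default /usr/local/share/tessdata location used above (check_traineddata is just a name I made up):

```shell
#!/bin/bash
# Report whether <lang>.traineddata exists and is non-empty in a tessdata
# directory. Defaults match the paths used in this guide.
check_traineddata() {
    local dir="${1:-/usr/local/share/tessdata}"
    local lang="${2:-eng}"
    local file="$dir/$lang.traineddata"
    if [ -s "$file" ]; then
        echo "ok: $file"
    else
        echo "missing or empty: $file"
        return 1
    fi
}
```

Running "check_traineddata" with no arguments checks the English data in the default directory. Note that this only checks the file is present, not that its contents are valid.
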
    Also, we really wanted to use hOCR, which records where each piece of recognized text sits on the original image. You could use something like hocr2pdf ("sudo apt-get install exactimage") to merge the PDF and hOCR output back together into searchable PDFs. Anyway, to get this:
    Code:
    cd /usr/local/share/tessdata/configs
    sudo vi hocr
    
    You need to know how to use Vim to do this bit
    Put this in: "tessedit_create_hocr 1"
    Save with ":x"
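
If you would rather not use Vim, the same one-line config file can be written from the shell. This is a sketch under the assumption that the configs directory is the one shown above; write_hocr_config and its optional second argument are hypothetical names of mine.

```shell
#!/bin/bash
# Write the one-line "hocr" config into a tessdata configs directory.
# Pass "sudo" as the second argument when the directory is owned by root.
write_hocr_config() {
    local configs_dir="${1:-/usr/local/share/tessdata/configs}"
    local run="${2:-}"
    # $run is deliberately unquoted so that an empty value disappears.
    echo "tessedit_create_hocr 1" | $run tee "$configs_dir/hocr" > /dev/null
}
```

For the system-wide install in this guide that would be: write_hocr_config /usr/local/share/tessdata/configs sudo
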
    That's it! To use Tesseract, go into the directory with your scanned PDF (or whatever it is). The commands below produce both plain-text and hOCR output:

    Code:
    cd /home/Zeon
    convert -density 300 scanpage1.pdf -depth 8 scanpage1.tif
    tesseract scanpage1.tif outputtext
    tesseract scanpage1.tif outputtext hocr
    That's it!
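
If you have more than one scan, the three commands above can be wrapped in a loop. The sketch below is one way to do it, not the only one; the helper names run and ocr_pdfs and the DRY_RUN switch are my own additions. With Tesseract 3.00 the plain output lands in a .txt file and the hOCR output in a .html file.

```shell
#!/bin/bash
# Run a command, or just print it when DRY_RUN is set (useful for checking
# what would happen before touching any files).
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Convert and OCR every PDF named on the command line.
ocr_pdfs() {
    local pdf base
    for pdf in "$@"; do
        base="${pdf%.pdf}"
        run convert -density 300 "$pdf" -depth 8 "$base.tif"
        run tesseract "$base.tif" "$base"        # plain text output
        run tesseract "$base.tif" "$base" hocr   # hOCR output
    done
}
```

Try "DRY_RUN=1 ocr_pdfs *.pdf" first to see the commands, then drop the DRY_RUN=1 prefix to actually run them.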
    Last edited by Zeon100; March 10th, 2011 at 01:54 AM.

  2. #2
    Join Date
    Feb 2010
    Beans
    14

    Re: Tesseract 3.0 + Ubuntu 10.04 Installation Guide

    hi, I tried what you've posted but I get the following after I do "tesseract scanpage1.tif outputtext hocr"

    actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
    Segmentation fault

    any ideas?

  3. #3
    Join Date
    Dec 2010
    Beans
    1

    Re: Tesseract 3.0 + Ubuntu 10.04 Installation Guide

    Quote Originally Posted by Br11
    hi, I tried what you've posted but I get the following after I do "tesseract scanpage1.tif outputtext hocr"

    actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
    Segmentation fault

    any ideas?
    Hi Br11,

    I replaced the eng.traineddata with one from this site: http://code.google.com/p/tesseract-o...traineddata.gz

    That seemed to eliminate the TESSDATA_NUM_ENTRIES error for me.
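
When swapping the file it may be safer to keep the old copy around in case the new one doesn't help. A rough sketch, assuming you have already downloaded the replacement .gz somewhere; replace_traineddata is a name I made up, and for the system directory you would need to run it as root.

```shell
#!/bin/bash
# Back up the existing eng.traineddata (if any) and unpack a downloaded
# replacement in its place.
replace_traineddata() {
    local new_gz="$1"                            # path to the downloaded .gz
    local dir="${2:-/usr/local/share/tessdata}"  # tessdata directory
    local target="$dir/eng.traineddata"
    if [ -f "$target" ]; then
        mv "$target" "$target.bak"
    fi
    gunzip -c "$new_gz" > "$target"
}
```

If the new file turns out to be worse, moving eng.traineddata.bak back over eng.traineddata restores the previous state.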

  4. #4
    Join Date
    Feb 2010
    Beans
    14

    Re: Tesseract 3.0 + Ubuntu 10.04 Installation Guide

    It worked! Thanks!

    Now I need something that lets me have the text as a layer of the PDF and my life will be perfect

  5. #5
    Join Date
    Nov 2009
    Beans
    9
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: Tesseract 3.0 + Ubuntu 10.04 Installation Guide

    Zeon100, thank you! This has dramatically simplified my life.

    Br11, this (with many thanks to Konrad Voelkel) will superimpose a text layer. It's not 100% accurate (at least with the files I'm using), but it works pretty well:

    Code:
    #!/bin/bash
    echo "usage: pdfocr.sh document.pdf \"author\" \"title\""
    # Adapted from http://blog.konradvoelkel.de/2010/01/linux-ocr-and-pdf-problem-solved/ 
    # NOTE: This script has been substantially modified/simplified from the original. 
    # This version does not allow rotation, language selection or cropping.
    # Those parameters were all required in the original, but I don't really need them.
    # If you can think of a way to make them optional, please share. 
    # This version also uses Tesseract, which I find to be substantially more
    # accurate than Cuneiform for English text. 
    # usage example: pdfocr.sh document.pdf "Author Name" "Document Title"
    pdftk "$1" burst dont_ask
    for f in pg_*.pdf
    do
        echo "pre-processing $f ..."
        convert -quiet -density 300 -depth 8 "$f" "$f.tif"
        echo no splitting
    done
    for f in pg_*.tif
    do
        echo "processing $f ..."
        # $f already ends in .tif, so Tesseract writes its hOCR output to "$f.html"
        tesseract "$f" "$f" hocr
        echo "Merging TIFF and hOCR into PDF file at 150 DPI..."
        # Downsample to cut down on file bloat
        hocr2pdf -r 150 -i "$f" -o "$f-ocr.pdf" <"$f.html"
    done
    echo "InfoKey: Author" > in.info
    echo "InfoValue: $2" >> in.info
    echo "InfoKey: Title" >> in.info
    echo "InfoValue: $3" >> in.info
    echo "InfoKey: Creator" >> in.info
    echo "InfoValue: PDF OCR scan script" >> in.info
    pdfjoin --fitpaper --tidy --outfile "$1.ocr1.pdf" pg_*-ocr.pdf
    rm -f pg_*
    pdftk "$1.ocr1.pdf" update_info doc_data.txt output "$1.ocr2.pdf"
    pdftk "$1.ocr2.pdf" update_info in.info output "$1-ocr.pdf"
    rm -f "$1.ocr1.pdf" "$1.ocr2.pdf" doc_data.txt in.info
    Konrad's blog has the full version of the original script. As noted in the script comment above, if anyone can figure out how to change languages and make orientation, rotation, and cropping parameters optional, please share that info here.
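
On making parameters optional: one common approach in bash is getopts with default values. The sketch below is not taken from Konrad's script; the flag letters, the parse_args name and the variables it sets are all my own assumptions, and rotation or cropping options could be added the same way.

```shell
#!/bin/bash
# Parse optional flags, then the required PDF name.
#   -l LANG    Tesseract language code (default: eng)
#   -a AUTHOR  PDF author metadata     (default: empty)
#   -t TITLE   PDF title metadata      (default: empty)
# Sets the variables lang, author, title and pdf; fails if no PDF is given.
parse_args() {
    lang="eng"; author=""; title=""
    local opt
    OPTIND=1
    while getopts "l:a:t:" opt; do
        case "$opt" in
            l) lang="$OPTARG" ;;
            a) author="$OPTARG" ;;
            t) title="$OPTARG" ;;
            *) return 1 ;;
        esac
    done
    shift $((OPTIND - 1))
    pdf="${1:-}"
    [ -n "$pdf" ]
}
```

At the top of the script you would call something like: parse_args "$@" || { echo "usage: pdfocr.sh [-l lang] [-a author] [-t title] document.pdf"; exit 1; } and then use "$pdf", "$author" and "$title" in place of "$1", "$2" and "$3".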

    For some reason, I can't seem to get Tesseract to accept both "hocr" and "-l en" as parameters, so I took out the language because all the documents I work with are written in English.
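
A possible explanation, which has not been verified on this exact setup: the usage message of Tesseract 3.00 reads "tesseract imagename outputbase [-l lang] [configfile ...]", so the language option goes before the config name, and the language code is the three-letter prefix of the traineddata file (eng for eng.traineddata) rather than "en". A small sketch that prints the command in that order (tess_cmd is a made-up helper name):

```shell
#!/bin/bash
# Print a tesseract command line with the arguments in the order given by
# the 3.00 usage message: image, output base, -l LANG, then the config name.
tess_cmd() {
    local image="$1"
    local lang="${2:-eng}"
    echo tesseract "$image" "$image" -l "$lang" hocr
}
```

For example, "tess_cmd pg_0001.pdf.tif" prints the line you would put inside the loop in place of the current tesseract call.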

    I also had to tweak some of the "$f" variables from the original because my bash shell seemed to recursively and progressively substitute them (resulting in errors such as "pg_0001.pdf.tif.pdf.html does not exist"). As written, on my system - a stock 10.04.1 install - this works.

    Edit: I also added a step to downsample the hocr2pdf output to 150 dpi to prevent file bloat. Before this, the script was taking some 40 KB pages and turning them into 400+ KB each: A big problem on a 100+ page document. I've also been able to get away with 200 DPI and 4-bit depth on the initial conversion, but it really depends on your source material. Your results may differ, but (at least for the files I'm working with) these settings seem to (sometimes) keep the new OCR'd version of the PDF roughly the same size as the original. There is almost certainly some loss in quality, but it's tolerable for what I'm doing. (There was previously a PNG conversion step in there, but that just fouled up everything....)

    However, it still doubles the size of some files - it's probably possible to downsample more, but this is not great for quality. Any other thoughts on how to reduce the final file size? (within the script, that is: I know there are external tools for post-processing, but I've had mixed success with those).

    Hope this helps someone....
    Last edited by fungiblename; January 7th, 2011 at 02:30 PM. Reason: Edited script to re-sample images to prevent file bloat; removed unnecessary PNG conversion

  6. #6
    Join Date
    Jan 2011
    Beans
    1

    Re: Tesseract 3.0 + Ubuntu 10.04 Installation Guide

    Thank you for this useful tutorial. I installed all the packages successfully. But when I use the command (like $ tesseract test.tif outputtext hocr ), I receive this notice:

    tesseract: symbol lookup error: tesseract: undefined symbol: page_imag

    What does that mean? How can I fix it?
    Any help will be appreciated.

  7. #7
    Join Date
    Nov 2009
    Beans
    18

    Re: Tesseract 3.0 + Ubuntu 10.04 Installation Guide

    Quote Originally Posted by Br11
    It worked! Thanks!

    Now I need something that lets me have the text as a layer of the PDF and my life will be perfect
    Hey there,
    That's what ExactImage is for! It has a subprogram called hocr2pdf, which overlays the hOCR output on the page image and gives you a fully OCR'd PDF. Usage is documented on their website:

    http://www.exactcode.de/site/open_so...mage/hocr2pdf/

    hocr2pdf: Is a command line front-end for the image processing library to create perfectly layouted, searchable PDF files from hOCR, annotated HTML, input obtained from an OCR system.

    hOCR, annotated HTML, input must be provided to STDIN, and the image data is read using the filename from the -i or --input argument. For example:

    hocr2pdf -i scan.tiff -o test.pdf < cuneiform-out.hocr

    By default the text layer is hidden by the real image data. Including image data can be disabled via the -n, --no-image, so that just the recognized text from the OCR is visible - e.g. for debugging or to save storage space:

    hocr2pdf -i scan.tiff -n -o test.pdf < cuneiform-out.hocr
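
With Tesseract instead of Cuneiform, the hOCR input is the .html file that "tesseract image.tif base hocr" writes. A sketch of how the two fit together, assuming the .tif and .html share the same base name (make_searchable_pdf and the "-ocr.pdf" suffix are names I picked):

```shell
#!/bin/bash
# Combine a TIFF and the matching Tesseract hOCR file into a searchable PDF.
# Expects scan.tif and scan.html side by side; writes scan-ocr.pdf.
make_searchable_pdf() {
    local tif="$1"
    local base="${1%.tif}"
    if [ ! -f "$base.html" ]; then
        echo "no hOCR file: $base.html (run: tesseract $tif $base hocr)" >&2
        return 1
    fi
    hocr2pdf -i "$tif" -o "$base-ocr.pdf" < "$base.html"
}
```

So after "tesseract scanpage1.tif scanpage1 hocr", calling "make_searchable_pdf scanpage1.tif" should leave you with scanpage1-ocr.pdf.
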
    Also thanks to rjwinder for pointing out the problem with the trained data from SVN. When I first downloaded it, it worked great but I came into the office yesterday after my holiday and it didn't work! Rjwinder's method fixed the problem.

  8. #8
    Join Date
    Nov 2009
    Beans
    18

    Re: Tesseract 3.0 + Ubuntu 10.04 Installation Guide

    Quote Originally Posted by Gafs
    Thank you for this useful tutorial. I installed all the packages successfully. But when I use the command (like $ tesseract test.tif outputtext hocr ), I receive this notice:

    tesseract: symbol lookup error: tesseract: undefined symbol: page_imag

    What does that mean? How can I fix it?
    Any help will be appreciated.
    Did you install Leptonica properly? Also, does it work normally (i.e. without hocr)?
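
One way to check is to look at which shared libraries the tesseract binary actually resolves. An "undefined symbol" error often means the binary is loading a different library version than the one it was built against (for example, leftovers from the 2.04 package), though that is only a guess for this case. The filter below is a sketch; lept_lines is a made-up name.

```shell
#!/bin/bash
# Read `ldd` output on stdin and keep only the lines that are unresolved
# ("not found") or that mention Leptonica.
lept_lines() {
    grep -E 'not found|liblept'
}
```

Usage: ldd "$(command -v tesseract)" | lept_lines
If nothing is printed, the binary is not linked against Leptonica at all; a "not found" line usually means "sudo ldconfig" has not been run since the install.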

  9. #9
    Join Date
    Jul 2007
    Beans
    83

    Re: Tesseract 3.0 + Ubuntu 10.04 Installation Guide

    I am running Linux Echo 2.6.35-25-generic #44-Ubuntu SMP Fri Jan 21 17:40:48 UTC 2011 i686 GNU/Linux.
    Following directions in this seemingly simple tutorial (thanks for making it clear what to do and why), when I get to the part about doing
    ./runautoconf
    I get this result.
    ~/tesseract-3.00$ ./runautoconf
    Running libtoolize
    ./runautoconf: 31: libtoolize: not found
    Running aclocal
    ./runautoconf: 38: aclocal: not found
    Running autoheader
    ./runautoconf: 44: autoheader: not found
    Running autoconf
    ./runautoconf: 51: autoconf: not found
    Running automake --add-missing --copy
    ./runautoconf: 64: automake: not found
    All done.
    To build the software now, do something like:

    $ ./configure [--with-debug] [...other options]
    $ make


    Being somewhat new, I didn't understand [--with-debug] [...other options],
    so I ran:
    ~/tesseract-3.00$ ./configure --with-debug
    configure: WARNING: unrecognized options: --with-debug
    checking build system type... i686-pc-linux-gnu
    checking host system type... i686-pc-linux-gnu
    checking --enable-graphics argument... yes
    checking for cl.exe... no
    checking for g++... no
    checking whether the C++ compiler works... no
    configure: error: in `/home/jim/tesseract-3.00':
    configure: error: C++ compiler cannot create executables
    See `config.log' for more details

    Does this mean I must install/reinstall a C++ compiler?
    Any help appreciated.
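
The output above shows that g++ and the autotools programs (libtoolize, aclocal, autoheader, autoconf, automake) are not on the PATH. On Ubuntu these come from the g++, libtool, automake and autoconf packages. A quick sketch for checking what is still missing before re-running ./runautoconf (missing_tools is a made-up helper name):

```shell
#!/bin/bash
# Print each named tool that cannot be found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}
```

Usage: missing_tools gcc g++ libtoolize aclocal autoheader autoconf automake
An empty result means everything the build scripts complained about is now installed.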

  10. #10
    Join Date
    Dec 2007
    Location
    Romania
    Beans
    4
    Distro
    Xubuntu 10.04 Lucid Lynx

    Re: Tesseract 3.0 + Ubuntu 10.04 Installation Guide

    After installing libpng12-dev, libjpeg62-dev, libtiff4-dev and Leptonica,
    I used the deb packages from http://notesalexp.net/lucid/main/t/tesseract/

    It works for me in Lucid.
