Hi guys,
I am using the Tesseract package, which provides OCR (Optical Character Recognition) for electronic document images. The 10.04 repositories currently only have 2.04, but the latest version, 3.00, is out with many new features. Here is how I got everything working:
1. Install ImageMagick
ImageMagick converts the document images to a format Tesseract likes. We use PDFs.
"sudo apt-get install imagemagick"
Usage looks like this:
"convert -density 300 scanneddocument.pdf -depth 8 scanneddocument.tif"
This converts the PDF to a good-quality TIFF image with 8-bit depth (required by Tesseract). You can change the density value, as a different setting may give you better results.
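If you want to experiment with the density, a small wrapper like the one below may save some typing. This is only a sketch: the function names tif_name and pdf_to_tif are made up for this post, and 300 is just the default taken from the command above.

```shell
# Hypothetical helper: derive the .tif name from a .pdf name.
tif_name() {
    printf '%s.tif\n' "${1%.pdf}"
}

# Hypothetical wrapper around the convert command shown above.
# $1 = input PDF, $2 = density (defaults to 300 if omitted).
pdf_to_tif() {
    convert -density "${2:-300}" "$1" -depth 8 "$(tif_name "$1")"
}

# Example: pdf_to_tif scanneddocument.pdf 400
```

Running it with a few different densities and comparing the OCR output is probably the quickest way to find a value that suits your scans.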
2. Install Tesseract
Get the required packages available in the repositories:
Code:
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
("sudo apt-get install zlibg-dev" is suggested in the Tesseract readme, but no package by that name is available; the Ubuntu package is called zlib1g-dev. I found I didn't need this.)
I picked this up from a comment: you also need to be able to compile and build the software, and Ubuntu needs some packages for that. Many of you will already have these installed, but it doesn't hurt to make sure.
Code:
sudo apt-get install gcc
sudo apt-get install g++
sudo apt-get install automake
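To see which of these are still missing before (or after) running the installs, something like the following could be used. It is a rough sketch; missing_tools is a made-up helper name, not part of any package.

```shell
# Hypothetical helper: print each named command that is NOT on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: list any of the build tools that are still missing.
# missing_tools gcc g++ automake
```

If it prints nothing, all the listed commands were found.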
Download Leptonica, which can't be obtained with apt-get:
http://www.leptonica.org/
My link was here: http://www.leptonica.org/source/leptonlib-1.67.tar.gz
Code:
wget http://www.leptonica.org/source/leptonlib-1.67.tar.gz
tar -zxvf leptonlib-1.67.tar.gz
cd leptonlib-1.67
./configure
make
sudo checkinstall   # follow the prompts: type "y" to create the documentation directory, enter a brief description, then press Enter twice
sudo ldconfig
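Before moving on, it may be worth checking that the linker can actually see the new library. The sketch below assumes the Leptonica library shows up under the name liblept in the linker cache; lib_in_cache is a made-up helper name.

```shell
# Hypothetical helper: succeed if the linker cache lists a library
# whose name contains $1.
lib_in_cache() {
    ldconfig -p | grep -q "$1"
}

# Example (assumes the library is named liblept):
# lib_in_cache liblept && echo "Leptonica found" || echo "Leptonica not in linker cache"
```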
Now we can actually get and install Tesseract! Remember to go back up one directory ("cd ..") from the Leptonica install above:
Code:
wget http://tesseract-ocr.googlecode.com/files/tesseract-3.00.tar.gz
tar -zxvf tesseract-3.00.tar.gz
cd tesseract-3.00
./runautoconf
./configure
make
sudo checkinstall   # follow the prompts: type "y" to create the documentation directory, enter a brief description, then press Enter twice
sudo ldconfig
Now for whatever reason the training data isn't installed with this. I got mine straight from Tesseract's SVN:
(This is from rjwinder)
Code:
cd /usr/local/share/tessdata
sudo wget http://tesseract-ocr.googlecode.com/files/eng.traineddata.gz
sudo gunzip eng.traineddata.gz
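A quick way to confirm the download and gunzip worked is to check that the unpacked file exists and isn't empty. This is a minimal sketch; has_traineddata is a made-up helper name, and the path in the example is the one used above.

```shell
# Hypothetical helper: succeed if <dir>/<lang>.traineddata exists and is not empty.
# $1 = tessdata directory, $2 = language code (defaults to eng if omitted).
has_traineddata() {
    [ -s "$1/${2:-eng}.traineddata" ]
}

# Example:
# has_traineddata /usr/local/share/tessdata eng && echo "eng data present"
```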
Also, we really wanted to use hOCR, which records where each piece of recognized text sits on the original image. You could use something like hocr2pdf ("sudo apt-get install exactimage") to remerge the scanned page and the hOCR output into searchable PDFs. Anyway, to get this:
Code:
cd /usr/local/share/tessdata/configs
sudo vi hocr
You need to know how to use Vim to do this bit
Put this in: "tessedit_create_hocr 1"
Save with ":x"
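If you would rather not use Vim, the same one-line file can be written from the shell. A minimal sketch: write_hocr_config is a made-up helper name, and it writes into whatever directory you pass it, so for the system directory it would need to run as root (the commented "sudo tee" line does the same job directly).

```shell
# Hypothetical helper: write the one-line hocr config file into
# the configs directory given as $1.
write_hocr_config() {
    printf 'tessedit_create_hocr 1\n' > "$1/hocr"
}

# For the system-wide directory used above, this does the same thing as root:
# echo "tessedit_create_hocr 1" | sudo tee /usr/local/share/tessdata/configs/hocr
```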
That's it! To use Tesseract, go into the directory with your scanned PDF (or whatever it is). I generate both plain-text and hOCR output:
Code:
cd /home/Zeon
convert -density 300 scanpage1.pdf -depth 8 scanpage1.tif
tesseract scanpage1.tif outputtext
tesseract scanpage1.tif outputtext hocr
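With more than a handful of scans, the three commands above could be wrapped up roughly like this. It is a sketch rather than a tested workflow: ocr_pdf is a made-up name, and it uses the PDF's own base name for the outputs instead of "outputtext".

```shell
# Hypothetical helper: convert one PDF to TIFF, then produce
# plain-text and hOCR output named after the PDF.
ocr_pdf() {
    base="${1%.pdf}"
    convert -density 300 "$1" -depth 8 "$base.tif" &&
    tesseract "$base.tif" "$base" &&
    tesseract "$base.tif" "$base" hocr
}

# Example: process every PDF in the current directory.
# for f in *.pdf; do ocr_pdf "$f"; done
```

The && chaining means Tesseract is only run if the conversion step succeeded.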
That's it!