I hope I can get this working soon. I was excited when I saw that I could actually select text in an OCRed document, but then I realized it was only detecting something like 15 to 30 words per page. Any ideas on how to get it to see all the text?
Hi!
hocr2pdf seems to have problems with the hocr files produced by cuneiform 0.9.0 or higher. The recognition is pretty good, but when the hocr file is added to the pdf, the font size used in the hidden layer is sometimes blown way out of proportion, so the hidden layer doesn't fit the image layout at all, and some parts of the OCR result are not saved at all.
I haven't found out why this happens; it seems that hocr2pdf is not able to work with the new hocr format used in cuneiform 0.9.0 or higher.
pdfsandwich ( http://www.tobias-elze.de/pdfsandwich/index.html ) could be worth looking at. It works with cuneiform 0.7.0 (the standard Ubuntu package), since it first converts the pdf image to bmp3 (the format cuneiform was written for). The resulting hocr files are slightly larger than those produced by cuneiform 0.9.0 or higher, and the results match much better!
so long
clasikowski aka hank
Last edited by clasikowski; August 18th, 2010 at 10:46 AM. Reason: spelling
Hi, the program doesn't work for me. I'm on Ubuntu 10.04 with the current kernel.
I've tried a large ebook. The quality is ok.
Here is the output. Maybe there is already a solution. Thank you.
Code:
Error while running OCR on page 275
==========
Extracting page 276
Converting page 276 to ppm
pdftoppm version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
Usage: pdftoppm [options] <PDF-file> <PPM-root>
  -f <int>          : first page to print
  -l <int>          : last page to print
  -r <int>          : resolution, in DPI (default is 150)
  -mono             : generate a monochrome PBM file
  -gray             : generate a grayscale PGM file
  -t1lib <string>   : enable t1lib font rasterizer: yes, no
  -freetype <string>: enable FreeType font rasterizer: yes, no
  -aa <string>      : enable font anti-aliasing: yes, no
  -aaVector <string>: enable vector anti-aliasing: yes, no
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -cfg <string>     : configuration file to use in place of .xpdfrc
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information
Running OCR on page 276
Magick: Improper image header `276.ppm' @ pnm.c/ReadPNMImage/297
Cuneiform for Linux 0.9.0
Error while running OCR on page 276
Merging together PDF files
/tmp/d20100824-6962-1fqv3qc/*-new.pdf not found as file or resource.
Error: Failed to open PDF file: /tmp/d20100824-6962-1fqv3qc/*-new.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.
Updating PDF info for /home/steffen/tmp/morgen.pdf
/tmp/d20100824-6962-1fqv3qc/merged.pdf not found as file or resource.
Error: Failed to open PDF file: /tmp/d20100824-6962-1fqv3qc/merged.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.
Cleaning up temporary files
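Going by that log, it looks like pdftoppm printed its usage text instead of writing a valid 276.ppm, and every later step then failed on the bad image. A minimal sketch of a guard that could run before the OCR step is below. The function name page_image_ok is made up for illustration and is not part of the program discussed here; the only assumption is that a good page image starts with a PNM magic number (P4, P5 or P6).

```shell
#!/bin/bash
# Hypothetical helper (not part of the original program): succeed only if
# the file named in $1 looks like a binary PNM image, i.e. it is non-empty
# and starts with the magic number P4 (PBM), P5 (PGM) or P6 (PPM).
page_image_ok() {
    local file=$1 magic
    [ -s "$file" ] || return 1        # missing or empty file
    magic=$(head -c 2 "$file")        # the first two bytes hold the magic number
    case $magic in
        P4|P5|P6) return 0 ;;
        *)        return 1 ;;
    esac
}
```

A page loop could then call something like `page_image_ok 276.ppm || { echo "skipping page 276"; continue; }` so that one bad page is reported early instead of breaking the OCR and merge steps.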
Hank, thanks for this suggestion. It is giving me the same mis-matching results that others have had.
I tried the option -noimage and can see that the output has randomly sized text that often overlaps, which is why the text never lines up with the original image.
Has anyone come up with a solution to this yet?
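One way to narrow this down might be to look at the box sizes in the hocr file itself, to see whether the odd sizes are already in the OCR output or only appear once the PDF is built. The sketch below assumes the hocr file records boxes as "bbox x0 y0 x1 y1", as the hOCR format describes; the function name bbox_heights is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the pixel height (y1 - y0) of every bounding
# box found in the hOCR file named in $1, one height per line.
# Assumes boxes are written as "bbox x0 y0 x1 y1".
bbox_heights() {
    grep -o 'bbox [0-9][0-9]* [0-9][0-9]* [0-9][0-9]* [0-9][0-9]*' "$1" |
        awk '{ print $5 - $3 }'
}
```

Piping the result through `sort -n | uniq -c` gives a rough histogram; a handful of heights far larger than the rest would suggest the oversized text comes from the boxes in the hocr file rather than from the PDF step.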
Like the others, the text and image layers do not match up. Do you need samples of our files?
btw: thanks for providing this program. Things are better than before.
I modified your solution to work with Ubuntu 9.10 and pdftoppm 3.02 (which looks like what you used too). The problem I had was that pdftoppm pads the page number in the output .ppm file name with leading zeros (up to five of them). My version only works for PDFs of fewer than 100 pages.
Code:
#!/bin/bash
TESS_LANG=eng
rflag=

# first figure out what args we have
getopts 'r:' OPT; shift $(($OPTIND - 1))
if [ $OPT == "r" ]
then
    rflag="-rotate $OPTARG";
fi

CURRENT_DIR=`pwd`
SCRIPT_NAME=`basename "$0" .sh`
TMP_DIR=${SCRIPT_NAME}-tmp
mkdir ${TMP_DIR}

for thisfile in "$@"
do
    NAME=`basename "${thisfile}" .pdf`
    cp "$thisfile" ${TMP_DIR}
    cd ${TMP_DIR}
    echo "Examining: ${thisfile}";
    pgs=`pdfinfo "${thisfile}" | grep Pages | awk '{print $2}'`
    echo "Found ${pgs} pages; converting...";
    # it's only fair, since we're suppressing it later...
    echo "Tesseract Open Source OCR Engine";
    for x in `seq 1 ${pgs}`
    do
        echo -en " Page ${x}...";
        pdftoppm -f $x -l $x -r 600 "$thisfile" ocrbook;
        if [ $x -gt 9 ]; then
            BASE=ocrbook-0000${x};
        else
            BASE=ocrbook-00000${x};
        fi
        convert ${BASE}.ppm ${rflag} ${BASE}.tif;
        tesseract ${BASE}.tif ${BASE} -l ${TESS_LANG} > /dev/null 2>&1;
        cat ${BASE}.txt >> "${NAME}.txt";
        echo "[pagebreak]" >> "${NAME}.txt";
        rm ocrbook*;
        echo "done";
    done;
    echo "Conversion complete";
    mv "${NAME}.txt" ${CURRENT_DIR}
    rm *
    cd ${CURRENT_DIR}
done
rmdir ${TMP_DIR}
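The page limit comes from building the zero-padded name by hand. A possible way around it, sketched below, is to let printf do the padding. This assumes the six-digit page numbers that the BASE names in the script imply for pdftoppm 3.02; other pdftoppm builds may pad differently, so it is worth checking the actual file names first. The function name ppm_base is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: build the base name of a page image from the
# PPM root ($1) and the page number ($2), assuming the page number is
# zero-padded to six digits as in the script above.
ppm_base() {
    local root=$1 page=$2
    printf '%s-%06d' "$root" "$page"
}
```

Inside the page loop, `BASE=$(ppm_base ocrbook "$x")` could stand in for the if/else that sets BASE, and would then also cover page numbers of 100 and above.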
wrong thread, sorry for the post. Please delete.
this app is really promising, especially because there's nothing out there like it for linux. it would be extremely useful for any organization that images a grip of documents.
in our office, for example, we image all our files. the fancy expensive feed scanners we have scan into pdf, but aren't OCR'd.
if this app were accurate, i'd use it to OCR tens of thousands of pdfs. as it is, i have to rely on descriptive file names to find what i'm looking for.
my experience, however, is like the rest: poor recognition and badly misaligned. sure, tesseract is out there, but OCRing to a separate text file isn't very useful in this context.
I'm really wanting to try this out and write it up, but the PPA doesn't work for 10.10. Is there any chance of that happening soon?
I just tried installing from the PPA in Maverick 10.10 too, but the dependencies don't allow one to install it. Quite unfortunate. Will there be updated packages soon?