Thread: Searchable PDFs on Linux

  1. #1
    Join Date
    Oct 2007
    Beans
    220
    Distro
    Ubuntu 10.04 Lucid Lynx

    Searchable PDFs on Linux

    Hi,

    Does anyone know of a command-line tool for Linux that can convert a scanned PDF document into a searchable PDF document?

    I don't need to convert them to editable documents; I just need to be able to search through them once scanned. It's a network scanner that will just dump scanned PDFs into a specific location on the Ubuntu File Server.

    Rgds
    Chris

  2. #2
    Join Date
    Apr 2012
    Beans
    Hidden!
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: Searchable PDFs on Linux

    Recoll does this;

    http://www.lesbonscomptes.com/recoll/

    you can use it from the command line if you wish.
    There is always a way, but it might not be the best way, or your way!

  3. #3
    Join Date
    Oct 2007
    Beans
    220
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: Searchable PDFs on Linux

    Hi, thanks for the link. I will have a look at that.

    So if I scan 100 documents as PDF to the server, this could be configured to make them all searchable by the text contained inside each document? That is, if a user opens a file on their Windows PC, they can in theory search for specific words, etc.

    Rgds
    Chris

  4. #4
    Join Date
    Apr 2012
    Beans
    Hidden!
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: Searchable PDFs on Linux

    Quote Originally Posted by chrislynch8
    Hi, thanks for the link. I will have a look at that.

    So if I scan 100 documents as PDF to the server, this could be configured to make them all searchable by the text contained inside each document? That is, if a user opens a file on their Windows PC, they can in theory search for specific words, etc.

    Rgds
    Chris
    Yes.

    Last edited by traditionalist; May 31st, 2012 at 04:33 PM. Reason: More info

  5. #5
    Join Date
    Jun 2007
    Location
    Puerto Rico
    Beans
    159
    Distro
    Ubuntu 16.04 Xenial Xerus

    Re: Searchable PDFs on Linux

    I think what you are after is a tool that performs optical character recognition (OCR) on an image PDF. An image PDF is a PDF that has no text layer, even if the scanned document was, for example, a letter. From what I saw of recoll, I think the text layer has to exist already.
    If OCR is what you mean, you can try gscan2pdf (http://gscan2pdf.sourceforge.net/). It is in the repositories. OCR is done by Tesseract (http://code.google.com/p/tesseract-ocr/). gscan2pdf gets the scan from a regular scanner, or you can import images such as TIFF or PNG, and even PDFs. Afterwards, you tell the software to perform OCR on the document. The text can be saved as a separate text file or inserted as a comment in a PDF. That last bit is the sad part: people will not be able to select text from the image, which might be confusing to some. Tesseract also supports several languages. If you are not doing OCR on English text, look for the appropriate language pack.
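
    For a box without a GUI, Tesseract can also be called directly. Below is a minimal Python sketch, assuming a Tesseract release that supports the "pdf" output config (3.03 or later, as far as I know), which places the recognised text under the page image so it can be selected and searched. The function names are made up for illustration, and the input is assumed to be a single image file (PNG or TIFF), not a PDF.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def tesseract_pdf_command(image: Path, output_base: Path, lang: str = "eng") -> List[str]:
    """Build the argv for: tesseract <image> <output_base> -l <lang> pdf

    Tesseract appends ".pdf" to output_base itself.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_image_to_pdf(image: Path, output_base: Path, lang: str = "eng") -> Path:
    """Run Tesseract on one image and return the path of the PDF it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    subprocess.run(tesseract_pdf_command(image, output_base, lang), check=True)
    return output_base.with_name(output_base.name + ".pdf")
```

    For Spanish text, for example, pass lang="spa" once the matching language pack is installed.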
    ¡Levántate!, ¡Revuélvete!, ¡Resiste!
    Haz como el toro acorralado: ¡muge!
    O como el toro que no muge: ¡EMBISTE!
    - José de Diego, En la brecha -

  6. #6
    Join Date
    Apr 2012
    Beans
    Hidden!
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: Searchable PDFs on Linux

    Recoll extracts the text from scanned PDF images. Although you could of course use gscan2pdf to begin with if you are scanning the stuff yourself.
    Last edited by traditionalist; May 31st, 2012 at 06:30 PM.

  7. #7
    Join Date
    Jul 2010
    Beans
    10

    Re: Searchable PDFs on Linux

    Quote Originally Posted by traditionalist
    Recoll extracts the text from scanned PDF images. Although you could of course use gscan2pdf to begin with if you are scanning the stuff yourself.
    Hi, Recoll author here.

    While I am very flattered by your trust in recoll, I don't want the original poster to be disappointed: no, recoll does not currently run OCR on image PDFs to extract text.

    I think that the suggestion above to run a scan+OCR program is the right approach.

    Additionally, if the OCR'd text is stored by the suggested program in a PDF comment, I am not too sure that it would be indexed (maybe it would, I just don't know). There could be a need to fix the pdf filter, but this would be quite easy as it is just a shell script which calls pdftotext (from poppler). I hope that a combination of pdftotext and pdfinfo should be able to extract whatever text exists inside a pdf.
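
    A quick way to check whether a given PDF carries any extractable text is to run pdftotext on it and look at what comes back. This is only a rough Python sketch: the function names and the 20-character threshold are arbitrary choices for illustration, and it says nothing about text stored in PDF comments.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Return whatever text pdftotext (poppler) finds in the PDF.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_text_layer(text: str, min_chars: int = 20) -> bool:
    """Guess whether the extracted text is more than page-break debris.

    pdftotext emits form feeds between pages even for image-only PDFs,
    so all whitespace is stripped before counting characters.
    """
    visible = "".join(text.split())
    return len(visible) >= min_chars
```

    Calling has_text_layer(extract_text("scan.pdf")) on a file straight from the scanner should normally come back False, and True once an OCR step has added a text layer.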

    Cheers,

    jf

  8. #8
    Join Date
    Oct 2007
    Beans
    220
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: Searchable PDFs on Linux

    Hi All,

    Thanks for the comments,

    As its author has pointed out, recoll is not suitable for this. gscan2pdf seems to require a GUI to work; we use XP workstations connected to the Ubuntu File Server. I will play around with the command line and see if it will be suitable.

    So as to not confuse things this is the goal.

    1: The user loads all the paper into the scanner and enters their keycode. All pages are scanned as PDF files to a specific location on the File Server.

    2: Some application monitors this folder, runs OCR on all files and converts them to searchable PDF files.

    3: The user then accesses the shared folder, moves all the converted files into the relevant locations and accesses them as and when needed with some PDF reader.
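
    Step 2 could be sketched roughly as below in Python. This is an illustration under assumptions, not a tested setup: it assumes the ocrmypdf command-line tool is installed (it is not one of the tools mentioned in this thread, and its basic form is "ocrmypdf input.pdf output.pdf"), and the function names, folder layout and polling interval are all made up. Any other OCR command can be swapped in through the ocr parameter.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, List


def pending_scans(inbox: Path, outbox: Path) -> List[Path]:
    """List PDFs in inbox that have no same-named file in outbox yet.

    Note: the "*.pdf" pattern is case-sensitive on Linux.
    """
    return sorted(
        p for p in inbox.glob("*.pdf") if not (outbox / p.name).exists()
    )


def ocr_with_ocrmypdf(src: Path, dst: Path) -> None:
    """Assumption: ocrmypdf is on PATH; it writes a searchable copy to dst."""
    subprocess.run(["ocrmypdf", str(src), str(dst)], check=True)


def run_once(
    inbox: Path,
    outbox: Path,
    ocr: Callable[[Path, Path], None] = ocr_with_ocrmypdf,
) -> int:
    """Convert every pending scan once and return how many were handled."""
    outbox.mkdir(parents=True, exist_ok=True)
    handled = 0
    for src in pending_scans(inbox, outbox):
        ocr(src, outbox / src.name)
        handled += 1
    return handled


def watch(inbox: Path, outbox: Path, interval: float = 30.0) -> None:
    """Poll the scanner folder forever, converting new files as they appear."""
    while True:
        run_once(inbox, outbox)
        time.sleep(interval)
```

    Something like watch(Path("/srv/scans/incoming"), Path("/srv/scans/searchable")) would then run as a background job on the server (both paths are placeholders). A real setup would also need to make sure the scanner has finished writing a file before picking it up, and to handle a failed OCR run without stopping the loop.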


    If this automatic approach is not possible, it will be a case of finding the best Windows-based software to achieve this. I have already tested PaperPort + OmniPage, which works perfectly, but it is expensive.

    Thanks again for the replies and I will post back if gscan2pdf achieves what I'm looking for.

    Rgds
    Chris

    From a quick look at gscan2pdf, it seems that a GUI is required. I only have Ubuntu Server, CLI only...
    Last edited by chrislynch8; June 1st, 2012 at 03:32 PM.
