[SOLVED] Search a keyword in many PDF documents at the same time, is it possible?



LastWho
January 27th, 2009, 05:43 AM
Hi,
Is it possible to look for a keyword in many PDF documents (in a folder) at the same time?
I'm using Evince document viewer 2.22.2 or ePDFViewer 0.1.6.

Thanks for your help

HermanAB
January 27th, 2009, 06:15 AM
It won't be any faster than doing them in sequence.

Cheers,

Herman

emarkd
January 27th, 2009, 06:21 AM
I'll take a shot. I didn't test this so let me know if this works. The idea here is to convert each pdf to text using 'pdftotext' and then grep for the search keyword in each file.

From the directory containing the pdfs, run:



ls *.pdf|xargs pdftotext|grep "keyword"

LastWho
January 27th, 2009, 06:33 AM
This is all I got :p


~/Documents/test$ ls *.pdf|xargs pdftotext|grep négligence
pdftotext version 3.00
Copyright 1996-2004 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
-f <int> : first page to convert
-l <int> : last page to convert
-layout : maintain original physical layout
-raw : keep strings in content stream order
-htmlmeta : generate a simple HTML file, including the meta information
-enc <string> : output text encoding name
-eol <string> : output end-of-line convention (unix, dos, or mac)
-nopgbrk : don't insert page breaks between pages
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
-q : don't print any messages or errors
-v : print copyright and version info
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information

Can I ask why you used ls first?
And what do you use xargs for?
Thanks

LastWho
January 27th, 2009, 08:05 AM
OK, I installed Acrobat Reader for Linux and it does the job. This is how you install it:

1) Add Medibuntu to the repositories: https://help.ubuntu.com/community/Medibuntu

2) Install Acrobat Reader:

sudo apt-get install acroread


Then use its search tool ;)

Thanks

PS: If there is another option, I'm still interested!

unutbu
January 27th, 2009, 05:35 PM
This seems to work:

ls *.pdf | xargs -I{} pdftotext {} - | grep "keyword"

"ls *.pdf" lists all the pdf files in the current directory.
This list is "piped" ("|") to the xargs command.

xargs is a command-line tool for building and executing other commands. It takes the list generated by ls and feeds them to pdftotext one at a time. So if you have files called a.pdf, b.pdf, c.pdf, the xargs command will run pdftotext 3 times:


pdftotext a.pdf -
pdftotext b.pdf -
pdftotext c.pdf -

"-I{}" tells xargs that every time it sees {} it should replace {} with the name of a file.
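
This probably also explains the usage message in the earlier attempt: without -I, xargs puts all the file names on one command line, and the usage text quoted above shows that pdftotext accepts only one PDF file plus an optional text file. A small sketch of the difference, using echo so that nothing is actually converted (a.pdf, b.pdf, c.pdf and the function names are stand-ins for illustration):

```shell
# Stand-in file names; echo prints each command instead of running it.
names() { printf '%s\n' a.pdf b.pdf c.pdf; }

# Without -I, xargs appends every name to a single command line.
plain() { names | xargs echo pdftotext; }

# With -I{}, xargs runs one command per input line.
per_file() { names | xargs -I{} echo pdftotext {} -; }
```

Calling plain shows a single pdftotext command with all three names, while per_file shows one command per file, each ending in "-".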

The "-" at the end of the command tells pdftotext to send its output to stdout (standard output). The output from stdout is piped to the grep command.

The grep command searches the output for "keyword".

To learn more about each of these commands, type


man ls
man xargs
man pdftotext
man grep
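
The same search can also be written as an explicit loop, which avoids parsing the output of ls and so copes with unusual file names. This is only a sketch, assuming a POSIX shell; search_pdfs is a hypothetical helper name, not part of any package:

```shell
# Sketch: search every PDF in the current directory for a keyword.
search_pdfs() {
    keyword=$1
    for f in ./*.pdf; do
        [ -e "$f" ] || continue    # skip the literal pattern if no PDF exists
        # "-" sends the extracted text to stdout; grep filters it
        pdftotext -q "$f" - | grep -- "$keyword"
    done
}
```

Usage, from the directory containing the PDFs: search_pdfs "keyword"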

LastWho
January 29th, 2009, 07:05 AM
Many, many thanks for these explanations! I feel less stupid than I did 2 minutes ago.

vyo
August 20th, 2009, 11:54 AM
Hello,

just a little comment:

The previous command returns only the lines containing the keyword, without the names of the files that contain it. If I am interested in the names, I can find them with, for example, this command (bash):


# -print0 and read -d '' keep file names with spaces intact
find . -name '*.pdf' -print0 | while IFS= read -r -d '' f; do
    pdftotext -q "$f" - | grep -i -q 'keyword' && echo "$f"
done

maxomenia
July 9th, 2010, 10:49 AM
Where do I put the -R to search subfolders?
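
A note on -R: it is a grep option for recursing into directories, but in this pipeline grep reads the output of pdftotext from a pipe rather than from files, so adding it there is unlikely to help. One way to cover subfolders is to let find collect the files instead of ls. A minimal sketch, assuming a POSIX shell; search_pdfs_recursive is a hypothetical helper name, and file names containing newlines are not handled:

```shell
# Sketch: search PDFs under a directory (default: current one), recursively,
# and prefix every matching line with the file it came from.
search_pdfs_recursive() {
    keyword=$1
    dir=${2:-.}
    find "$dir" -type f -name '*.pdf' | while IFS= read -r f; do
        pdftotext -q "$f" - 2>/dev/null | grep -i -- "$keyword" |
            while IFS= read -r line; do
                printf '%s: %s\n' "$f" "$line"
            done
    done
}
```

Usage: search_pdfs_recursive "keyword" ~/Documents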

