Need a helping hand



jesuisbenjamin
November 8th, 2009, 03:35 PM
Hello forum,

I am new to programming and Python, and I am working on a small project that is also an exercise to help me learn more.

So I began with a program to build a personal library that gives an overview of all kinds of readable documents and lets me search through them. There is an option to filter specified document types by their extension and by checking the file type itself.

I have a list of the most commonly used file types, but I suppose it is not complete. Besides, there is a difficulty with plain text documents without an extension, which can have many possible file types. I would like to exclude program files (e.g. .py), since this program is meant for literature. Also, I am not familiar with e-book files like FictionBook (.fb2) or DjVu (cf. Wikipedia (http://en.wikipedia.org/wiki/Document_file_format)).

The list I have so far is:

rtf : Rich Text Format
odp : OpenDocument Presentation
htm : HTML document
ods : OpenDocument Spreadsheet
odt : OpenDocument Text
ppt : Microsoft Office Document
wpd : Word Processor
txt : UTF-8
ott : OpenDocument Text Template
doc : Microsoft Office Document
html : HTML document
pdf : PDF document
xls : Microsoft Office Document

If you can help me extend this list or give me tips on how to deal with plain text or e-book formats, it would rock.

Thanks

kavon89
November 8th, 2009, 06:41 PM
Have you already started on this project? Many of those file formats seem like they would be quite difficult to parse.

Plain text files should be the easiest to search/index because there is no extra formatting.
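
As a rough sketch of that point, assuming the files are UTF-8 encoded and a case-insensitive substring match is enough, searching a plain text file could look like the following. The function name find_in_text_file is made up for illustration and is not from the original project.

```python
def find_in_text_file(path, term, encoding="utf-8"):
    """Return (line_number, line) pairs for lines that contain term.

    The match ignores case. The UTF-8 default is an assumption;
    undecodable bytes are replaced rather than raising an error.
    """
    needle = term.lower()
    hits = []
    with open(path, encoding=encoding, errors="replace") as handle:
        # Read line by line so large files are not loaded in one go.
        for number, line in enumerate(handle, start=1):
            if needle in line.lower():
                hits.append((number, line.rstrip("\n")))
    return hits
```

Calling find_in_text_file("some_book.txt", "chapter") would give back the numbered lines that mention the word. The other formats in the list would need a conversion step before something like this could be applied.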

insanecrazy4
November 8th, 2009, 09:26 PM
With a little bit of googling I found a compilation of file formats on Wikipedia:

http://en.wikipedia.org/wiki/List_of_file_formats

jesuisbenjamin
November 8th, 2009, 09:29 PM
Yes, I have started, although I have not begun with the search through documents. But I guess it can be done by temporarily converting a given file to plain text and searching through it.

For now, though, I am concentrating on scanning the drive and gathering the desired document types.
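
For the scanning step, a minimal sketch could look something like this. It is not the actual project code. The extension set is copied from the list earlier in the thread. The names DOCUMENT_EXTENSIONS, looks_like_text and collect_documents are hypothetical, and the check for files without an extension is only a guess-based heuristic (no NUL bytes, start of the file decodes as UTF-8), not a real file type detection.

```python
import codecs
import os

# Extensions from the list earlier in the thread (lowercase, no dot).
DOCUMENT_EXTENSIONS = {
    "rtf", "odp", "htm", "ods", "odt", "ppt", "wpd",
    "txt", "ott", "doc", "html", "pdf", "xls",
}


def looks_like_text(path, sample_size=4096):
    """Heuristic guess: True if the start of the file looks like UTF-8 text."""
    try:
        with open(path, "rb") as handle:
            sample = handle.read(sample_size)
    except OSError:
        # Unreadable files are skipped instead of stopping the scan.
        return False
    if b"\x00" in sample:
        # NUL bytes almost never occur in plain text.
        return False
    try:
        # final=False tolerates a multi-byte character cut off at the
        # end of the sample.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def collect_documents(root):
    """Walk root and return a sorted list of paths that look like documents."""
    found = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            extension = os.path.splitext(name)[1].lower().lstrip(".")
            if extension in DOCUMENT_EXTENSIONS:
                found.append(path)
            elif not extension and looks_like_text(path):
                # No extension: fall back to the content heuristic.
                found.append(path)
    return sorted(found)
```

Program files such as .py are left out simply because their extension is not in the set. Files without an extension that happen to be scripts would still pass the text check, so that case would need an extra rule.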