I'm writing a simple web indexer in Python that can:
1. Download entire websites.
2. Do simple keyword searches through them.
Currently I'm writing a module that simply downloads and caches a website locally; no parsing is done yet. Next I'll implement a module that parses the cached documents for keywords and stores the results in a database. The last module will search this database based on input provided by the user. All of these will work independently and be usable as standalone modules.
A "main" module binds them all together. I'm also thinking of adding a Tkinter GUI, so there will be two interfaces: text-based and GUI-based.
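To show roughly what I mean by the caching module, here's a sketch of my current plan (Python 3 names; the function names and cache directory are just placeholders I made up, and I haven't tested this against a real site yet):

```python
# Sketch of the caching downloader module: fetch_page() downloads a
# URL and saves the raw bytes under a cache directory, keyed by a
# hash of the URL, so repeated runs can skip pages already on disk.
import hashlib
import os
import urllib.request

CACHE_DIR = "cache"  # placeholder location for the local copy

def cache_path(url):
    """Map a URL to a stable filename inside CACHE_DIR."""
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    return os.path.join(CACHE_DIR, name)

def fetch_page(url):
    """Return the page body, downloading only on a cache miss."""
    path = cache_path(url)
    if os.path.exists(path):          # cache hit: read the local copy
        with open(path, "rb") as f:
            return f.read()
    os.makedirs(CACHE_DIR, exist_ok=True)
    with urllib.request.urlopen(url) as resp:  # cache miss: download
        body = resp.read()
    with open(path, "wb") as f:       # save for next time
        f.write(body)
    return body
```

The parsing and search modules would then read files out of CACHE_DIR instead of hitting the network again.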
I need to know a couple of things:
1. What variable contains the path to the directory in which the current script resides?
2. I need to walk the element tree of the HTML documents. There seem to be two standard modules for this: HTMLParser and htmllib. Which one should I use? I can figure out the usage from the module docs, but a link to a tutorial or similar would make my work easier.
3. How do I pause for a couple of seconds before downloading the next page? I want my program to play nice with websites and not hog their bandwidth.
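To make the questions concrete, here's what I've pieced together so far from skimming the docs; corrections are very welcome (this uses the Python 3 spelling, where HTMLParser lives in html.parser):

```python
# Notes-to-self on the three questions above (untested sketch).
import os
import time
from html.parser import HTMLParser

# 1. There is no ready-made variable for the script's directory,
#    but it can be derived from __file__:
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))

# 2. A parser is written by subclassing HTMLParser and overriding
#    the handle_* callbacks; feed() drives them over the document.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# 3. Politeness delay between downloads:
# time.sleep(2)  # pause two seconds before the next request
```

For (2) I'd call `LinkCollector().feed(html_text)` on each cached page and then read the collected `links` list.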
Any hints, suggestions, etc. will be welcome. Since I'm doing this for a school project (a science fair of sorts), I won't be using third-party modules; everything has to be done from scratch.