[SOLVED] Help Needed: Making My Own Search Engine



coljohnhannibalsmith
April 2nd, 2011, 01:39 AM
Hello,

I'm interested in the possibility of making my own search engine rather than relying on the existing ones like Google, Bing, and Dogpile. I'm interested in doing this for a variety of reasons: to overcome censorship and geographical and language filtering, and to maximize privacy. Does anyone have any ideas about how to do this?

I suspect I would want to dedicate a machine to this purpose, and of course some kind of web-crawler application would have to be installed to crawl the web, looking for web sites to index. Also, I suspect that I would have to let such an application run uninterrupted, pretty much continuously. I've tried searching the web for pre-existing applications or scripts to do this, but I just haven't found anything. Perhaps I don't know the vocabulary and am not entering the correct search terms.

Any help or suggestions appreciated.


Thanks, Hannibal

jennybrew
April 2nd, 2011, 01:42 AM
Gosh!!!
You must be a genius.
Do you not have to come up with search algorithms etc., etc.?

matt_symes
April 2nd, 2011, 01:45 AM
"Gosh!!! You must be a genius. Do you not have to come up with search algorithms etc., etc.?"

and a big, fat, optimised database....

Stick to google.

coljohnhannibalsmith
April 2nd, 2011, 02:54 AM
For the aforementioned reasons, I'd prefer not to. Also, I suspect many of the things you've mentioned have already been done by others, so I don't think I'd have to completely re-invent the wheel, as you seem to suggest. Additionally, I would expect that this forum would attract individuals of that calibre. So for the moment at least, I don't think I'm in the wrong place. :confused: Also, for the moment, this is just a thought experiment...
Why such strong reactions?

Any 'constructive' replies appreciated.


Thanks, Hannibal

matt_symes
April 2nd, 2011, 03:03 AM
Hi

As I said, start off by setting up a large, optimised database. MySQL is a good start.

Read this.

http://www.webopedia.com/DidYouKnow/Internet/2003/HowWebSearchEnginesWork.asp

If you are worried about privacy, have you considered Tor?

Can you administer a database and write code?

Kind regards

juancarlospaco
April 2nd, 2011, 03:05 AM
You don't need to do that; there are already open source search engines.
For example, see Apache Lucene and its derivatives... :D

uRock
April 2nd, 2011, 03:06 AM
Moved to programming talk.

Please do not create support threads in the Community Cafe.

|{urse
April 2nd, 2011, 03:08 AM
The undertaking of creating your own search engine would be ridiculously complex to nigh impossible to implement unless you have tons of dollars and some amazing coders at your disposal.

That's as constructive as I can be on this.

matt_symes
April 2nd, 2011, 03:13 AM
Hi


"... the moment, this is just a thought experiment...
why such strong reactions?"

No strong reactions. I think you misunderstand me.

It is just a huge, huge job and you need more than one skill set to do it.

Kind regards

coljohnhannibalsmith
April 2nd, 2011, 03:26 PM
Thank you Matt,

Perhaps I did misunderstand you. You're probably right that this is a very complex undertaking. I am, however, very interested in how this can be done, and I have admitted my naivety about this topic. Also, these kinds of challenges are my 'raison d'être,' or something. I've always found the available search engines to be not just a potential source of insecurity, but also a massive and distracting generator of advertising, and disappointing in terms of relevant results.

Thank you for the informative link you provided; I will study it as soon as I get the chance. BTW, the poster from Buenos Aires suggested something close to what I was hoping for:

"You don't need to do that; there are already open source search engines.
For example, see Apache Lucene and its derivatives... :grin:"

I'm sure this is still a complex solution, and I'll try not to underestimate it. BTW, I can code a 'little.' I'm very comfortable compiling code, and somewhat comfortable fixing small errors in other people's code.

'uRock,' sorry about posting in the wrong forum. I didn't know where else to post this thread, and I figured that it would be better to 'start' here, and then get moved, than to start the post in a more specific forum. Perhaps 'General Help' would have been a better starting point?


Thanks, Hannibal

rg4w
April 2nd, 2011, 04:26 PM
Really, Lucene is the answer. I've toyed with all manner of custom search engines, and while doing so can provide a good education and even be fun, performance and storage requirements make them unsuitable for all but a very few highly vertical tasks. There's a reason Google invented their own file system. ;)

If you want to proceed, Lucene is an excellent start. That, and several terabytes of storage, which will let you index about 0.0001% of the web, which may be enough to learn what you want to learn.

coljohnhannibalsmith
April 2nd, 2011, 04:45 PM
Oh BTW,

I've tracked this down a little more and discovered the following site:

http://lucene.apache.org/java/docs/features.html

Also, I'm in the process of downloading 'Lucene in Action' in PDF format. I'll give this a good and thorough read.


Thanks, Hannibal


PS, what does 'rg4w (http://ubuntuforums.org/member.php?u=962490)' stand for, if you don't mind?

Arndt
April 2nd, 2011, 05:30 PM
Maybe this is interesting: http://webglimpse.net/

rg4w
April 2nd, 2011, 06:26 PM
"PS, what does 'rg4w (http://ubuntuforums.org/member.php?u=962490)' stand for, if you don't mind?"
Not at all: it's a combination of my initials and the initials of my company name.

simeon87
April 3rd, 2011, 02:16 AM
Anyone remotely capable of building a search engine wouldn't ask about the basics in these forums. Crawling the web, ranking webpages and quickly answering queries are vast subjects in their own right. The size of the internet also makes it very hard for an individual to crawl even a small part of the web without some decent hardware purchases. If you still want to go ahead, you should read about PageRank and how other search engines are ranking pages, about web crawling technology and how to produce a ranking of pages based on a query.
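For anyone reading up on PageRank as suggested above, here is a minimal sketch of the basic idea using plain power iteration. The function name `pagerank`, the toy link graph `toy_web`, and the parameter defaults are all made up for illustration; a production ranker deals with far larger graphs and with spam, which this does not attempt.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outgoing links."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every other page splits its rank equally among the pages it links to.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: nothing links to d.example, most pages link to c.example.
toy_web = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the page with the most inbound links ends up on top and the page nobody links to ends up last, which is the intuition behind link-based ranking.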

Some Penguin
April 3rd, 2011, 02:22 AM
Lucene isn't suited for general-purpose web search engines unless you're willing to shove a LOT of hard work and magic into the analyzer. It's reasonably suited for intranet documentation where you have reasonable expectations that everything you're indexing is worthy and you're just interested in inverted indexing, but the Web is full of spam and garbage.
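To make "inverted indexing" concrete, here is a toy sketch in plain Python. It is not Lucene's implementation or API; the names `tokenize`, `build_index`, and `search`, and the three sample documents, are invented for illustration. It only shows the core data structure: a map from each term to the documents containing it.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical documents, keyed by an arbitrary id.
docs = {
    1: "Apache Lucene is a search library",
    2: "A web crawler fetches pages for a search engine",
    3: "Spam pages try to fool the crawler",
}

index = build_index(docs)
print(sorted(search(index, "search")))
print(sorted(search(index, "crawler pages")))
```

Note that the spam document is returned just as readily as the useful one. An index like this says which pages match, not which ones are worth reading, and that gap is where the hard work mentioned above comes in.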

NathanB
April 3rd, 2011, 03:48 AM
Plenty of open-source options exist.

http://swish-e.org/

http://xapian.org/

http://www.htdig.org/

to name just a few from the huge list of search engines...

Can't find it now, but there is one out there that will let you download their compressed database of everything their crawler/bot has collected. With a little searching, you can stumble into lots of neat tools/resources.

coljohnhannibalsmith
April 7th, 2011, 05:23 PM
Thanks to all of you for your kind and generous help.

While stumbling through Synaptic, I've discovered that all of these packages, with the exception of Webglimpse, are available in the repository. Oh, happy day. :guitar:

That's not to say that it won't be a challenge to configure and run these applications.


One of you suggested that I do the following:

"If you still want to go ahead, you should read about PageRank and how other search engines are ranking pages, about web crawling technology and how to produce a ranking of pages based on a query."

I think this is sound advice, since for web documents there will be, as I was warned, plenty of spam and other useless material.
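As a rough illustration of "producing a ranking of pages based on a query," one classic starting point is TF-IDF scoring: a page scores higher when it contains the query terms often, and rarer terms count for more. The sketch below is an assumption-laden toy, not what any particular engine does; the function `rank_pages`, the sample pages, and the specific IDF formula (one of several common variants) are chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_pages(documents, query):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(documents)
    query_terms = set(tokenize(query))

    scores = {}
    for doc_id, counts in term_counts.items():
        length = sum(counts.values())
        score = 0.0
        for term in query_terms:
            # Number of documents that contain this term at least once.
            doc_freq = sum(1 for c in term_counts.values() if term in c)
            if doc_freq == 0 or counts[term] == 0:
                continue
            tf = counts[term] / length             # how prominent the term is in this page
            idf = math.log(1.0 + n_docs / doc_freq)  # rarer terms weigh more (assumed variant)
            score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical pages, keyed by a short id.
pages = {
    "a": "open source search engine library",
    "b": "search engine spam spam spam buy now",
    "c": "cooking recipes and kitchen tips",
}

for doc_id, score in rank_pages(pages, "open source search"):
    print(doc_id, round(score, 3))
```

Text-only scoring like this is easy to game by stuffing pages with keywords, which is one reason link-based signals such as PageRank are usually combined with it.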


Thanks, Hannibal

coljohnhannibalsmith
April 7th, 2011, 08:47 PM
While searching for information about web crawlers, I discovered the following:

http://en.wikipedia.org/wiki/Web_crawler


The ones that look the most interesting are:


crawler4j

http://code.google.com/p/crawler4j/




Nutch, which is used with Lucene:

http://en.wikipedia.org/wiki/Nutch


and,


YaCy (Yah, See) which claims to do most of the things I'm interested in:

http://yacy.net/en/

I'm downloading this now. I can't wait to try it.:lolflag:
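For context, the basic loop that crawlers like these are built around can be sketched in a few lines: keep a queue of URLs to visit, fetch each one, extract its links, and queue any that have not been seen yet. This is only a sketch under stated assumptions, not how crawler4j, Nutch, or YaCy are actually implemented. The `crawl` function, the `LinkExtractor` class, and the in-memory `FAKE_WEB` dictionary (standing in for real HTTP requests so the example runs offline) are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML text or None."""
    frontier = deque([seed])  # URLs waiting to be fetched
    seen = {seed}             # URLs already queued, to avoid loops
    pages = {}                # url -> HTML for pages fetched successfully
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:      # broken link or unreachable page
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages


# Hypothetical three-page site used instead of real network access.
FAKE_WEB = {
    "http://site.example/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "http://site.example/about": '<a href="/">Home</a>',
    "http://site.example/blog": '<a href="/missing">Broken</a> <a href="/about">About</a>',
}

crawled = crawl("http://site.example/", FAKE_WEB.get)
print(list(crawled))
```

A crawler pointed at the real web would also need to honour robots.txt, rate-limit its requests per host, and cope with timeouts and malformed pages, none of which is shown here.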

coljohnhannibalsmith
April 7th, 2011, 09:03 PM
YaCy Rocks!!!:guitar:
http://yacy.net/en/

This is an out-of-the-box solution, and it even searches the 'hidden web,' and just like they promise, NO ADVERTISING and NO CENSORSHIP!!!

Hoo-Yah.


I'm still going to experiment with Lucene & Nutch though!


Hannibal

coljohnhannibalsmith
May 17th, 2011, 03:53 PM
It appears that this problem is now solved!!!

http://tshirtgroove.com/wp-content/uploads/2008/10/problem-solved-tshirt.jpg