how does google work...indexing pages



badperson
December 12th, 2008, 01:57 AM
Okay,
I should probably know this, and I think I have an idea; but google indexes the content of webpages (dynamic sites also) and places that content on their own servers, so when you do a google search, you're searching through their database, correct?
bp

jimi_hendrix
December 12th, 2008, 02:25 AM
I think you are correct... I think it goes:

webcrawler rips page -> parse for title -> stick in server
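
As a rough illustration of the "parse for title -> stick in server" steps, here is a minimal sketch using only the Python standard library. The names `TitleParser`, `index_page` and `store` are made up for this example, the page is a hard-coded string rather than a real download, and an in-memory dict stands in for the server. Nothing here claims to be how Google does it.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def index_page(store, url, html):
    """Parse one already-fetched page and record it under its URL."""
    parser = TitleParser()
    parser.feed(html)
    store[url] = {"title": parser.title.strip(), "html": html}
    return store[url]


# Hypothetical page; a real crawler would download this over HTTP.
store = {}
page = "<html><head><title> Example </title></head><body>hi</body></html>"
index_page(store, "http://example.com/", page)
print(store["http://example.com/"]["title"])
```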

jespdj
December 12th, 2008, 09:33 AM
Yes, that's basically how Google works. They have hundreds of thousands of servers around the world, and they use very sophisticated software and algorithms, for example MapReduce (http://en.wikipedia.org/wiki/MapReduce). They only hire the smartest elite programmers - it's very hard to get a job at Google.

See: Google spotlights data center inner workings (http://news.cnet.com/8301-10784_3-9955184-7.html)

From that article:

...that would mean Google has more than 200,000 servers, and I'd guess it's far beyond that and growing every day.
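
To give a feel for the MapReduce programming model mentioned above, here is a single-process sketch of the classic word-count job. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are invented for this example. The real system distributes these phases over many machines, which this toy version does not attempt.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (word, 1) pair per word in a document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_job(docs):
    """Run map, shuffle and reduce in order over a dict of documents."""
    pairs = [pair
             for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(word, counts)
                for word, counts in shuffle(pairs).items())


# Hypothetical documents, just for illustration.
docs = {"a.html": "google indexes pages", "b.html": "pages change and pages move"}
word_counts = run_job(docs)
print(word_counts["pages"])
```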

Cracauer
December 13th, 2008, 06:15 PM
badperson wrote:
> Okay,
> I should probably know this, and I think I have an idea; but google indexes the content of webpages (dynamic sites also) and places that content on their own servers, so when you do a google search, you're searching through their database, correct?
> bp

Of course they use a suitable index representation so that they don't run
`grep <searchterm> /mnt/bigdrive/the/internet/*.*`
every time you hit them.

The algorithm to support the word-based search is pretty trivial.
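
One common "suitable index representation" for word-based search is an inverted index: a map from each word to the documents that contain it. The sketch below is a minimal in-memory version. `build_index` and `search` are names chosen for this example, and the tokenizer is a plain whitespace split, which a real engine would replace with something much more careful.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def search(index, query):
    """Return the documents containing every word of the query (AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical documents, just for illustration.
docs = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog",
    "c.txt": "quick lazy fox",
}
index = build_index(docs)
print(sorted(search(index, "quick fox")))
```

A lookup only touches the sets for the query words, so nothing has to be scanned the way `grep` would.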

The meat of google's cleverness is in the ranking.

For small web pages they actually store a local copy of the whole thing. For bigger ones they index on the fly.
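
On the ranking point: the thread does not say how the ranking works, but the publicly described PageRank idea is one well-known ingredient. Below is a small power-iteration sketch of it. The `pagerank` function, the damping value of 0.85, the iteration count and the three-page link graph are all assumptions for illustration, not Google's production setup.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank; links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: a and b both link to c, and c links back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Pages with more (and better-ranked) inbound links end up with a higher score, which is the core of the idea.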

pmasiar
December 13th, 2008, 09:45 PM
Cracauer wrote:
> The meat of google's cleverness is in the ranking.
>
> For small web pages they actually store a local copy of the whole thing. For bigger ones they index on the fly.

Index on the fly? Kidding, right?

The meat of Google's cleverness is Bigtable (http://en.wikipedia.org/wiki/Bigtable), a proprietary database system that stores the results, plus the Google File System (http://en.wikipedia.org/wiki/Google_File_System), which stores the data on computer clusters, and the scripts (a big part written in Python, BTW) that manage the deployment.

Ranking is only the icing on the cake.

slavik
December 13th, 2008, 10:09 PM
Google also keeps track of how often pages change, so if a page changes often, it will be re-indexed more often.
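
One simple way to get that behaviour is to halve the revisit interval whenever a page has changed since the last fetch and double it when it has not. The sketch below does exactly that. The `RecrawlSchedule` class, the halving/doubling rule and the default interval bounds are assumptions made up for this example. How Google actually schedules recrawls is not described in this thread.

```python
import hashlib


class RecrawlSchedule:
    """Tracks one page and adapts how long to wait before refetching it."""

    def __init__(self, interval=24.0, min_interval=1.0, max_interval=720.0):
        self.interval = interval          # hours until the next visit
        self.min_interval = min_interval
        self.max_interval = max_interval
        self._last_hash = None            # digest of the last fetched content

    def record_fetch(self, content):
        """Update the interval after a fetch; return True if the page changed."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        first_visit = self._last_hash is None
        changed = not first_visit and digest != self._last_hash
        if changed:
            # Page changed: come back sooner next time.
            self.interval = max(self.min_interval, self.interval / 2)
        elif not first_visit:
            # Page unchanged: back off.
            self.interval = min(self.max_interval, self.interval * 2)
        self._last_hash = digest
        return changed


schedule = RecrawlSchedule(interval=24.0)
schedule.record_fetch("version 1")   # first visit, nothing to compare against
schedule.record_fetch("version 2")   # content changed, interval is halved
print(schedule.interval)
```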

Mickeysofine1972
December 13th, 2008, 10:15 PM
If you're interested in building a search that you can use on your own site, then I recommend using MySQL's FULLTEXT indexing features alongside the soundex functions available in PHP.

Here's an example from one of the sites I did lately:

http://www.northumberland.ac.uk

Take a look at the course finder on the left, which uses exactly that. You can even misspell stuff and it will suggest the right spelling if it exists.

Mike
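
To show the idea behind soundex-based spelling suggestions, here is a small Python sketch of the American Soundex code plus a lookup against a word list. It is not the code behind the site above. `soundex` and `suggest` are written for this example, the course names are made up, and the output may differ from PHP's `soundex()` in edge cases. The MySQL FULLTEXT side is not shown.

```python
# Consonant groups of American Soundex; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return a 4-character Soundex code, or "" if word has no letters."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
        if len(result) == 4:
            break
    return result.ljust(4, "0")


def suggest(misspelled, vocabulary):
    """Return the vocabulary words that sound like the misspelled word."""
    target = soundex(misspelled)
    return [w for w in vocabulary if soundex(w) == target]


# Hypothetical course names, just for illustration.
courses = ["plumbing", "engineering", "catering"]
print(suggest("enginering", courses))
```

Because only the first letter and the consonant pattern are compared, this catches dropped or doubled letters but not a wrong first letter.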

aszxcv
December 13th, 2008, 10:33 PM
http://highscalability.com/google-architecture