Google Search Architecture Overview
Wall Script
Thursday, September 25, 2008


Friends, just try to imagine being without Google Search! Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux.

In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page.

The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits.
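To make the crawl-and-store flow concrete, here is a minimal Python sketch. It is only an illustration, not Google's code: the `StoreServer` class, the `crawl` function, and the `FAKE_WEB` pages are hypothetical names, `zlib` stands in for whatever compression the real storeserver uses, and docIDs are handed out when a page is stored rather than when its URL is first parsed out of another page.

```python
import zlib


class StoreServer:
    """Compresses fetched pages and keeps them in an in-memory repository."""

    def __init__(self):
        self.doc_ids = {}      # URL -> docID
        self.repository = {}   # docID -> compressed page bytes

    def doc_id_for(self, url):
        # Assign the next free docID the first time a URL is seen.
        if url not in self.doc_ids:
            self.doc_ids[url] = len(self.doc_ids)
        return self.doc_ids[url]

    def store(self, url, html):
        doc_id = self.doc_id_for(url)
        self.repository[doc_id] = zlib.compress(html.encode("utf-8"))
        return doc_id

    def load(self, doc_id):
        # What the indexer would do: read the repository and uncompress.
        return zlib.decompress(self.repository[doc_id]).decode("utf-8")


def crawl(url_list, fetch, store_server):
    """Play the crawler: fetch each URL handed out by the URL server."""
    return [store_server.store(url, fetch(url)) for url in url_list]


# Hypothetical pages standing in for real network fetches.
FAKE_WEB = {
    "http://a.example/": "<html>Hello <a href='/b'>B page</a></html>",
    "http://a.example/b": "<html>World</html>",
}

store = StoreServer()
ids = crawl(list(FAKE_WEB), FAKE_WEB.__getitem__, store)
```

Passing `fetch` in as a function keeps the sketch runnable offline; a real crawler would replace it with an HTTP client.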

High Level Google Architecture


The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
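As a rough illustration of hits and barrels, the Python sketch below turns plain text into hits and spreads them over a few barrels keyed by docID. Several things here are assumptions made for the example: the number of barrels, choosing a barrel by `word_id % NUM_BARRELS`, and the function names. Font size is left out because plain text carries none.

```python
import re
from collections import defaultdict

NUM_BARRELS = 4  # assumed for the example; the article gives no number


def extract_hits(text):
    """Return (word, position, capitalized) tuples for each word in text."""
    hits = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        word = match.group()
        hits.append((word.lower(), position, word[0].isupper()))
    return hits


def index_document(doc_id, text, lexicon, barrels):
    """Add one document's hits to the forward index."""
    for word, position, capitalized in extract_hits(text):
        # The lexicon maps each distinct word to a wordID.
        word_id = lexicon.setdefault(word, len(lexicon))
        # Each barrel is a forward index: docID -> list of hits.
        barrel = barrels[word_id % NUM_BARRELS]
        barrel[doc_id].append((word_id, position, capitalized))


lexicon = {}
barrels = [defaultdict(list) for _ in range(NUM_BARRELS)]
index_document(0, "Google indexes the Web", lexicon, barrels)
index_document(1, "the web is large", lexicon, barrels)
```

Because documents are indexed in docID order, each barrel ends up ordered by docID, which matches the "partially sorted forward index" described above.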

The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.
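The two steps in this paragraph, resolving links into docID pairs and computing PageRank from them, can be sketched in Python as follows. This is a hedged example, not the original implementation: `resolve_links` and `pagerank` are hypothetical names, links to URLs without a known docID are simply skipped, and the PageRank shown is the common normalized power-iteration form with an assumed damping factor of 0.85.

```python
from urllib.parse import urljoin


def resolve_links(anchors, doc_ids):
    """anchors: (source_url, href, anchor_text) records from the anchors file.

    Returns a list of (source docID, target docID) pairs and a map from
    target docID to the anchor texts pointing at it.
    """
    links, anchor_text = [], {}
    for source_url, href, text in anchors:
        target_url = urljoin(source_url, href)  # relative -> absolute URL
        if source_url in doc_ids and target_url in doc_ids:
            src, dst = doc_ids[source_url], doc_ids[target_url]
            links.append((src, dst))
            anchor_text.setdefault(dst, []).append(text)
    return links, anchor_text


def pagerank(links, num_docs, damping=0.85, iterations=50):
    """Power iteration over docID pairs; the ranks sum to 1."""
    out_degree = [0] * num_docs
    for src, _ in links:
        out_degree[src] += 1
    ranks = [1.0 / num_docs] * num_docs
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(r for r, d in zip(ranks, out_degree) if d == 0)
        base = (1 - damping) / num_docs + damping * dangling / num_docs
        new_ranks = [base] * num_docs
        for src, dst in links:
            new_ranks[dst] += damping * ranks[src] / out_degree[src]
        ranks = new_ranks
    return ranks


doc_ids = {"http://a.example/": 0, "http://a.example/b": 1, "http://a.example/c": 2}
anchors = [
    ("http://a.example/", "b", "B page"),
    ("http://a.example/", "/c", "C page"),
    ("http://a.example/b", "c", "C again"),
    ("http://a.example/c", "http://a.example/", "home"),
]
links, anchor_text = resolve_links(anchors, doc_ids)
ranks = pagerank(links, num_docs=len(doc_ids))
```

In this toy graph, document 2 is linked from both other pages, so it ends up with a higher rank than document 1, which is linked only from document 0.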



The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
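The idea of turning a forward index into an inverted index, and then answering a query from it, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `invert` and `search` are hypothetical names, the inversion builds a new dictionary instead of sorting in place, and results are ordered by PageRank alone, whereas a real ranking would also weigh the hits themselves.

```python
from collections import defaultdict


def invert(forward_index):
    """forward_index: docID -> list of (wordID, position).

    Returns wordID -> list of (docID, position), ordered by docID.
    """
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for word_id, position in forward_index[doc_id]:
            inverted[word_id].append((doc_id, position))
    return dict(inverted)


def search(query, lexicon, inverted, pageranks):
    """Return docIDs containing every query word, highest PageRank first."""
    matching = None
    for word in query.lower().split():
        word_id = lexicon.get(word)  # None for words not in the lexicon
        docs = {doc_id for doc_id, _ in inverted.get(word_id, [])}
        matching = docs if matching is None else matching & docs
    return sorted(matching or [], key=lambda d: pageranks[d], reverse=True)


lexicon = {"google": 0, "web": 1, "search": 2}
forward_index = {
    0: [(0, 0), (2, 1)],          # "google search"
    1: [(1, 0), (2, 1)],          # "web search"
    2: [(0, 0), (1, 1), (2, 2)],  # "google web search"
}
pageranks = {0: 0.2, 1: 0.5, 2: 0.3}  # made-up values for the example
inverted = invert(forward_index)
results = search("web search", lexicon, inverted, pageranks)
```

Here `results` lists documents 1 and 2, the only ones containing both words, with document 1 first because its made-up PageRank is higher.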



Next article: Google Search Major Data Structures Documentation

18 comments:

  1. Very nice high-level Architectural Analysis - Krishna Reddy

  2. It's nice, please send more details. If possible, send them to [email protected]

  3. Great, thanks for your valuable information.

    Keep sending valuable information to our mails.

  4. Good Understanding

    How do I write a program implementing the PageRank algorithm?

  5. Just a basic question: where do you know this from? Did you work at Google, get hold of some documentation, or did you just guess?

  6. It's from the PhD thesis paper of the creators of Google!

  7. Very nice explanation.

  8. More details here:

    http://infolab.stanford.edu/~backrub/google.html

  9. Good info. Please give a brief on the architecture style for Google.

  10. Please explain specifically how each part of the architecture works...

  11. Cool one. Then what about Google website layout analysis process?

