
How Google Search Works

By Naveen Krishna

Contents
- Search Engine?
- Facts
- Types?
- Basic Understanding?
- The Working of Google Search Engine
  > Spiders
  > Indexer
  > Query Processor
- Ranking Pages
- Query Expansion
- Summing up the working
- SERP
- Platform
- Search Tips
- Google SMS Service

History
Google began in March 1996 as a research project by Larry Page and Sergey Brin, Ph.D. students at Stanford University. Google's name is a variation of the word "googol," which is a mathematical term for a one followed by 100 zeros. Page and Brin felt the name helped illustrate Google's monumental mission: Organizing billions of bytes of data found on the Web.

Search Engine?
A search engine is a software program that searches for sites based on the words you designate as search terms. Search engines look through their own databases of information to find what you are looking for. "Search engine" is the popular term for an Information Retrieval (IR) system.

FACTS!
- Archie: the first search tool for the Internet.
- Gopher: indexed plain-text documents.
- Jughead: searched the files stored in Gopher index systems.
- Wandex: the first Web search engine.

Types of Search Engines


1. Crawler-based Search Engines
Crawler-based search engines use automated software programs to survey and categorize web pages. The programs that search engines use to access your web pages are called spiders, crawlers, robots or bots.
Ex:

Types of Search Engines(Contd.)


2. Directory-based Search Engines
A directory uses human editors who decide which category a site belongs to; they place websites within specific categories in the directory's database. The human editors comprehensively check each website and rank it, based on the information they find, using a predefined set of rules.
Ex: www.dmoz.org

Types of Search Engines(Contd.)


3. Hybrid Search Engines
Hybrid search engines use a combination of crawler-based results and directory results. More and more search engines are moving to a hybrid model.
Ex:

Types of Search Engines(Contd.)

4. Meta search engines


Meta search engines take the results from several other search engines and combine them into one large listing.
Ex:

What the User Sees


Like an iceberg, two-thirds of a search engine is below the water.

User Interface

Content

Search functionality

Basic Understanding:

Actual Topic of Discussion

GOOGLE SEARCH ENGINE

The Working
Google search happens in three major steps, each handled by its own component:
1. Googlebot, a web crawler that finds and fetches web pages.
2. The indexer, which sorts every word on every page and stores the resulting index of words in a huge database.
3. The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.

1. Googlebot, Google's Web Crawler (Spider)

Spiders (Contd.)
Googlebot is Google's web-crawling robot. It finds and retrieves pages on the web and hands them off to the Google indexer.

Despite the name, these bots never actually crawl. Googlebot functions much like a web browser: it sends a request to a web server for a page.

It downloads the entire page and hands it over to Google's index servers.

Spiders (Contd.)
Googlebot finds pages in two ways:
- Through an add-URL form, www.google.com/addurl.html
- By finding links while crawling the web

Spiders (Contd.)
When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling.

Deep crawling: by following links in this way, Googlebot can reach almost every page on the web.

Duplicates in the queue are eliminated to prevent Googlebot from fetching the same page again.

Frequently changing websites must be re-crawled to keep the index current.
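
The queue-and-deduplicate behaviour described above can be sketched in a few lines of Python. This is only an illustration, not Googlebot's actual code: the `LINKS` dictionary is a hypothetical stand-in for fetching real pages, and the `crawl` function name is an assumption made for this example.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page URL -> links found on that page.
LINKS = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/", "b.example/"],
    "b.example/": ["a.example/", "c.example/"],
    "c.example/": [],
}

def crawl(seed_urls, fetch_links):
    """Breadth-first crawl that never queues the same URL twice."""
    queue = deque(seed_urls)   # URLs waiting to be fetched
    seen = set(seed_urls)      # every URL ever queued, used to drop duplicates
    order = []                 # URLs in the order they were fetched
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

crawl_order = crawl(["a.example/"], lambda url: LINKS.get(url, []))
print(crawl_order)
```

Starting from a single seed page, every reachable page is fetched exactly once, even though several pages link back to each other.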

Spiders (Contd.)
These figures describe the original Google prototype, as reported by Brin and Page in 1998. The URL server and the crawlers were implemented in Python and ran on Solaris or Linux. Each crawler kept roughly 300 connections open at once, which was necessary to retrieve web pages at a fast enough pace. At peak speeds, the system could crawl over 100 web pages per second using four crawlers. This amounts to roughly 600 KB of data per second.

Spiders (Contd.)
Does Googlebot crawl everything on the Web?
- Robots Exclusion Protocol: a site's robots.txt file lists the parts of that site that should not be crawled. Ex: .mil sites.
- Password-protected pages (those that require a username and password) are not crawled.
- Pages with a NOINDEX meta tag are not indexed.
- Searching is limited for certain kinds of non-text files, including images, audio and video files. Googlebot fetches the text around such files instead.
- For PDF and DOC files, only about the first 120 KB of data is fetched.
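
As a rough illustration of how a crawler can honour the Robots Exclusion Protocol, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent and the `may_fetch` helper are assumptions made for this example; a real crawler would download robots.txt from the site before fetching any page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_fetch(url, user_agent="ExampleBot"):
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

print(may_fetch("https://www.example.com/index.html"))
print(may_fetch("https://www.example.com/private/report.html"))
```

A polite crawler would call a check like `may_fetch` before adding a URL to its queue.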

2. Indexer
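The spiders give the indexer the full text of the pages they find. These pages are stored in the index database servers.

How? The index is sorted alphabetically by search term, with each index entry storing a list of the documents in which the term appears and the location within the text where it occurs.

In the original prototype described by Brin and Page, parsing was done with a flex-generated lexical analyzer (chosen instead of a YACC-generated parser, for speed), and an in-memory hash table mapped each word to its word ID for indexing.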
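
The structure described here is usually called an inverted index. The sketch below is a minimal version under simplifying assumptions: the `DOCS` sample data and the `build_index` name are hypothetical, words are split on whitespace only, and positions are word offsets.

```python
from collections import defaultdict

def build_index(documents):
    """Map each term to {doc_id: [word positions]}, sorted alphabetically by term."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return dict(sorted(index.items()))

DOCS = {  # hypothetical sample documents
    1: "google indexes the web",
    2: "the web is large",
}

index = build_index(DOCS)
print(index["web"])   # documents containing "web" and where it occurs in each
```

Looking up a term is then a single dictionary access that returns every document containing it, together with the positions of the term in each document.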

A Simple Index Diagram

What Does the Index Need?

Basic information for each document or record:
- File name / URL / record ID
- Title or equivalent
- Size, date, MIME type

Full text of the item

More metadata:
- Product name, picture ID
- Category, topic, or subject
- Other attributes, for relevance ranking and display

Advanced Index Diagram

Indexer (Contd.)
To improve search performance, stop words are ignored. Ex: the, is, on, or, of, how, why, as well as certain single digits and single letters.

The indexer also ignores some punctuation and multiple spaces, and converts all letters to lowercase, to improve Google's performance.
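
A minimal sketch of this normalization step follows. The stop-word set contains only the examples listed above, and dropping every single-character token is a simplifying assumption (the text says only that certain single digits and letters are ignored). The `normalize` name is hypothetical.

```python
import re

# Only the example stop words given in the text; a real list would be longer.
STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why"}

def normalize(text):
    """Lowercase the text, strip punctuation and extra spaces, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Assumption: all single-character tokens are dropped, not just "certain" ones.
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]

print(normalize("How is  the Weather, on Mars?"))
```

Only the content-bearing words survive, which keeps the index smaller and lookups faster.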

3. Query Processor
The query processor has several parts:
- The user interface (search box).
- The engine that evaluates queries and matches them to relevant documents.
- The results formatter.

Ranking Pages
How does Google find the pages of high importance? Google considers over 200 factors when ranking web pages, commonly referred to as Search Engine Optimization (SEO) factors. They can be categorized into four groups:
1. Positive On-Page SEO Factors
2. Negative On-Page SEO Factors
3. Positive Off-Page SEO Factors
4. Negative Off-Page SEO Factors

1. Positive ON-Page SEO Factors


- Keyword in URL
- Keyword in Title tag
- Keyword density in body text
- Keyword in H1, H2 and H3
- Keyword font size
- Keyword phrase order

2. Negative ON-Page SEO Factors


- Link to a bad neighborhood
- Poison words
- Flash page (graphics)
- Robot exclusion "no index" tag
- Poor spelling and grammar

3. Positive OFF-Page SEO Factors


- PageRank
- Age of link
- Frequency of change of anchor text
- Link from an "expert" site
- Site age (an old site shows stability)
- Page selection rate (CTR)
- Bookmark adds
- Domain registration time
- Legitimacy of associated sites

4. Negative OFF-Page SEO Factors


- Zero links to a site
- Cloaking
- Server reliability/up-time
- Invisible text
- Illegal sites

Most Important: PageRank


PageRank is a link analysis algorithm used by the Google search engine that assigns a numerical weight to each document in its huge collection. A hyperlink to a page counts as a vote of support for that page.

Google assigns each web page on the Internet a numeric weighting from 0 to 10; this PageRank denotes the page's importance in the eyes of Google.

Example:
Assume four web pages: A, B, C and D. Each page begins with an estimated PageRank of 0.25. If pages B, C and D each link only to A, each confers its 0.25 PageRank on A. All of that PageRank thus gathers at A, because all links point to A, giving A a PageRank of 0.25 + 0.25 + 0.25 = 0.75.

A web crawler may also use PageRank as one of several important metrics for deciding which URL to visit next during a crawl of the web.

[Figure: pages B, C and D (PageRank 0.25 each) all link to page A, which receives 0.75.]
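
The arithmetic of this example can be reproduced with one step of a deliberately simplified PageRank update. This sketch assumes the simplified model used in the example only: there is no damping factor and no repeated iteration, both of which a realistic implementation would need. The `pagerank_step` name is hypothetical.

```python
def pagerank_step(ranks, links):
    """One simplified update: each page splits its rank evenly among its outgoing links."""
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in links.items():
        for target in targets:
            new_ranks[target] += ranks[page] / len(targets)
    return new_ranks

# The four-page example: B, C and D each link only to A; A links to nothing.
links = {"A": [], "B": ["A"], "C": ["A"], "D": ["A"]}
ranks = {page: 0.25 for page in links}

after_one_step = pagerank_step(ranks, links)
print(after_one_step["A"])
```

Page A ends up with the combined 0.75 conferred by B, C and D, matching the example.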

Query Expansion
Query expansion (QE) is the process of reformulating a seed query to improve retrieval performance in information retrieval operations. It includes:
- Finding synonyms of words, and searching for the synonyms as well.
- Fixing spelling errors and automatically searching for the corrected form, or suggesting it in the results.
- Re-weighting the terms in the original query.
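
Two of these techniques, spelling correction and synonym expansion, can be illustrated with a toy sketch. The `SYNONYMS` and `CORRECTIONS` tables and the `expand_query` name are hypothetical; a production system would derive such data from large corpora and query logs rather than hand-written dictionaries.

```python
# Hypothetical lookup tables, for illustration only.
SYNONYMS = {"car": ["automobile"], "cheap": ["inexpensive"]}
CORRECTIONS = {"chaep": "cheap"}

def expand_query(query):
    """Return, for each query word, the corrected word followed by its synonyms."""
    expanded = []
    for word in query.lower().split():
        word = CORRECTIONS.get(word, word)              # fix known misspellings
        expanded.append([word] + SYNONYMS.get(word, []))  # add synonyms, if any
    return expanded

print(expand_query("chaep car rental"))
```

Each inner list can be read as an OR group: a document matches that position of the query if it contains any of the listed alternatives.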

Search Engine Results Page (SERP)


Google's Platform
Google requires large computational resources in order to provide its services.

Google's first production server rack, circa 1998

Original hardware
The original hardware (circa 1998) that Google used when it was located at Stanford University included:
- Sun Ultra II with dual 200 MHz processors and 256 MB of RAM. This was the main machine for the original Backrub system.
- 2 × 300 MHz dual Pentium II servers donated by Intel, with 512 MB of RAM and 10 × 9 GB hard drives between the two. The main search ran on these.
- F50 IBM RS/6000 donated by IBM, with 4 processors, 512 MB of memory and 8 × 9 GB hard drives.
- Two additional boxes with 3 × 9 GB hard drives and 6 × 4 GB hard drives respectively (the original storage for Backrub). These were attached to the Sun Ultra II.
- IBM disk expansion box with another 8 × 9 GB hard drives, donated by IBM.
- Homemade disk box containing 10 × 9 GB SCSI hard drives.

Current hardware (2010)
Servers are commodity-class x86 PCs running customized versions of Linux. The goal is to purchase CPU generations that offer the best performance per dollar, not the best absolute performance.

Servers as of 2009-2010 were custom-made, open-top systems containing:
- Two processors (each with 2 cores)
- A considerable amount of RAM spread over 8 DIMM slots housing double-height DIMMs
- Two SATA hard drives connected through a non-standard ATX-sized power supply

Each server has a novel 12-volt battery to reduce costs and improve power efficiency.

Software Platform
C++, Java, and Python are favored over other programming languages.

For example, the back end of Gmail is written in Java and the back end of Google Search is written in C++.
Google has acknowledged that Python has played an important role from the beginning, and that it continues to do so as the system grows and evolves.

The software that runs the Google infrastructure includes:

- Google Web Server (GWS): custom Linux-based web server that Google uses for its online services.
- Storage systems:
  - Google File System (GFS) and its successor, Colossus
  - BigTable: structured storage built upon GFS/Colossus
  - Spanner: planet-scale structured storage system, the next generation of the BigTable stack
  - Google F1: a distributed, quasi-SQL DBMS based on Spanner, substituting a custom version of MySQL
- Borg: job scheduling and monitoring system
- MapReduce and the Sawzall programming language
- Indexing/search systems:
  - TeraGoogle: Google's large search index (launched in early 2006), designed by Anna Patterson of Cuil fame
  - Caffeine (Percolator): continuous indexing system (launched in 2010)

Server Types
- Google web servers coordinate the execution of queries sent by users, then format the result into an HTML page. The execution consists of sending queries to index servers, merging the results, computing their rank, retrieving a summary for each hit (using the document server), asking for suggestions from the spelling servers, and finally getting a list of advertisements from the ad server.
- Data-gathering servers are permanently dedicated to spidering the Web. Google's web crawler is known as Googlebot. These servers update the index and document databases and apply Google's algorithms to assign ranks to pages.
- Index servers each contain a set of index shards. They return a list of document IDs ("docids") such that the document corresponding to each docid contains the query word. These servers need less disk space, but carry the greatest CPU workload.

Server Types (Contd.)


- Document servers store documents. Each document is stored on dozens of document servers. When performing a search, a document server returns a summary for the document based on the query words. It can also fetch the complete document when asked. These servers need more disk space.
- Ad servers manage advertisements offered by services like AdWords and AdSense.
- Spelling servers make suggestions about the spelling of queries.
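
The division of labour between a web server and the index servers can be sketched as follows. This is a toy model, not Google's implementation: each shard is assumed to be a plain dictionary from word to a set of docids, each shard is assumed to cover a disjoint set of documents, and the `SHARDS` data and the function names are hypothetical. Ranking, summaries, spelling and ads are left out.

```python
def search_shard(shard, words):
    """Index-server side: docids in this shard whose documents contain every query word."""
    postings = [shard.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()

def search(shards, query):
    """Web-server side: send the query to every shard and merge the returned docids."""
    words = query.lower().split()
    hits = set()
    for shard in shards:
        hits |= search_shard(shard, words)
    return sorted(hits)

# Hypothetical shards: word -> docids of documents held by that shard.
SHARDS = [
    {"google": {1, 2}, "search": {2}},
    {"google": {7}, "search": {7, 9}},
]

print(search(SHARDS, "google search"))
```

In a full system the merged docids would then be ranked and passed to the document servers to obtain a summary for each hit.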

Tips
Either/or: Google normally searches for pages that contain all the words you type in the search box, but if you want pages that have one term or another (or both), use the OR operator, or use the "|" symbol (pipe symbol) to save you a keystroke. [dumb | little | man]

Quotes: If you want to search for an exact phrase, use quotes. ["dumb little man"] will only find that exact phrase. [dumb "little man"] will find pages that contain the word dumb and the exact phrase "little man".

Not: If you don't want a term or phrase, use the "-" symbol. [-dumb little man] will return pages that contain "little" and "man" but that don't contain "dumb".

Tips(Contd)
Similar terms: Use the "~" symbol to return similar terms. [~dumb little man -dumb] will get you pages that contain "funny little man" and "stupid little man" but not "dumb little man".

Wildcard: The "*" symbol is a wildcard. This is useful if you're trying to find the lyrics to a song but can't remember the exact lyrics. [can't * me love lyrics] will return the Beatles song you're looking for. It's also useful for finding material only in certain domains, such as educational information: ["dumb little man" research *.edu].

Tips(Contd)
Calculator: One of the handiest uses of Google is to type a quick calculation in the search box and get an answer. It's faster than calling up your computer's calculator in most cases. Use the +, -, *, / symbols and parentheses to do a simple equation.

Numrange: This little-known feature searches for a range of numbers. For example, [best books 2002..2007] will return lists of best books for each of the years from 2002 to 2007 (note the two periods between the two numbers).

Tips(Contd)
Movies: Use the "movie:" operator to search for a movie title along with either a zip code or a U.S. city and state to get a list of movie theaters in the area and show times.

Music: The "music:" operator returns content related to music only.

Unit converter: Use Google for a quick conversion, from yards to meters for example, or between currencies: [12 meters in yards]

Tips(Contd)
File types: If you just want to search for PDF files, Word documents or Excel spreadsheets, for example, use the "filetype:" operator.

Location of term: By default, Google searches for your term throughout a web page. But if you just want it to search certain locations, you can use operators such as "inurl:", "intitle:", "intext:", and "inanchor:". These search for a term only within the URL, the title, the body text, and the anchor text (the text used to describe a link), respectively.

Cached pages: Looking for the version of a page that Google stores on its own servers? This can help with outdated or updated pages. Use the "cache:" operator.
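
Several of these operators can be combined in one query. The sketch below assembles such a query string and URL-encodes it with Python's standard `urllib.parse.urlencode`. The `build_query` helper and its parameters are hypothetical names for this example, and the `https://www.google.com/search?q=` address is assumed to be the usual search endpoint.

```python
from urllib.parse import urlencode

def build_query(words=(), phrase=None, exclude=(), filetype=None):
    """Assemble a query string from plain words and a few of the operators above."""
    parts = list(words)
    if phrase:
        parts.append(f'"{phrase}"')                  # exact phrase, in quotes
    parts.extend(f"-{term}" for term in exclude)    # "-" excludes a term
    if filetype:
        parts.append(f"filetype:{filetype}")        # restrict to one file type
    return " ".join(parts)

query = build_query(words=["research"], phrase="dumb little man",
                    exclude=["funny"], filetype="pdf")
url = "https://www.google.com/search?" + urlencode({"q": query})
print(query)
print(url)
```

The printed query is what you would type into the search box; the URL is the same query in a form that can be bookmarked or shared.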

SMS Service
Send keywords to 9773300000.
- Cricket Scores: cri
- PNR: pnr number
- Availability: train avail 1018 Delhi to Agra on 31-03
- Schedule: train 1018
- Stock Quotes: reliance stock
- Horoscope: virgo
- Local Business Search: taxi vasant kunj
- News: news, business news
- Movies: movies Delhi
- Flight Status: 9w502
- Dictionary: define zenith
- Weather: weather Delhi
- Calculator: 1 kilo in pound, 5*3/4
- Currency Conversion: 5000 inr in usd
- Facts: GDP of india

Example of SMS Service


Keyword: Rooman Technologies Rajajinagar Address

Web: * Contact Us - Rooman Technologies www.rooman.net Rooman Technologies Pvt. Ltd. H.O: # 130, 1st Floor, Dr. Rajkumar Road, 1st Block, Rajajinagar, Bangalore - 560 010. Telephone : +91 (0)80 23127771, +91 ...

Reply with 'NEXT' to get more results.


View all results: http://m.google.com/u/R4Fiw0

Any Questions?

Thank You