
Project Presentation

Subject: Professional Practises-3


Faculty: Mudassir Mahadik Sir

Project Details:
Group Members:

1. Qasim Dadan
2. Rashid Shaikh
3. Saif Khan
4. Wahaj Shaikh

Project Topic: Search Engine and Web Crawler (Spider)

Search Engine Introduction


Search Engine Definition:

A web search engine is a software system that is designed to search for
information on the World Wide Web. The search results are generally presented in
a line of results often referred to as search engine results pages (SERPs). The
information may be a mix of web pages, images, and other types of files. Some
search engines also mine data available in databases or open directories. Unlike
web directories, which are maintained only by human editors, search engines also
maintain real-time information by running an algorithm on a web crawler.

Purpose of Search Engine


Helping people find what they're looking for:

Starts with an information need

Converts it to a query

Gets back results

Materials available on these Systems

Web pages

Other formats

Deep Web

A search engine operates in the following order:

1. Web Crawling:
A bot visits web pages for the purpose of indexing them into the
database.

2. Indexing:
It decides the rank (priority) of the indexed results.

3. Searching:
It is the process of looking up results in the search database by
firing a simple query.
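The three steps can be tied together in a short sketch. The page contents, URLs, and function names below are invented purely for illustration; a toy in-memory "web" stands in for real HTTP fetching.

```python
from collections import defaultdict, deque

# Toy "web": URL -> (page text, outgoing links). Purely illustrative data.
WEB = {
    "http://a.example": ("search engines crawl the web", ["http://b.example"]),
    "http://b.example": ("crawlers feed pages to the index", []),
}

def crawl(seed):
    """Step 1: visit pages starting from the seed, following links."""
    frontier, pages = deque([seed]), {}
    while frontier:
        url = frontier.popleft()
        if url in pages or url not in WEB:
            continue
        text, links = WEB[url]
        pages[url] = text
        frontier.extend(links)
    return pages

def index(pages):
    """Step 2: build an inverted index mapping each word to its pages."""
    inv = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            inv[word].add(url)
    return inv

def search(inv, query):
    """Step 3: return pages containing every keyword of the query."""
    sets = [inv.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

pages = crawl("http://a.example")
inv = index(pages)
print(search(inv, "crawl web"))   # {'http://a.example'}
```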

Working Diagram of Search Engine

Explanation of Working Diagram:


Indexing Process:
The search engine analyzes the contents of each page to determine
how it should be indexed (for example, words can be extracted from the
titles, page content, headings, or special fields called meta tags). Data
about web pages are stored in an index database for use in later queries. A
query from a user can be a single word. The index helps find information
relating to the query as quickly as possible.
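A minimal sketch of this indexing step, assuming pages arrive as HTML strings: words are pulled from the title, headings, body text, and meta keyword tags, and stored in an inverted index (here just a dictionary). The sample page and names are illustrative.

```python
from collections import defaultdict
from html.parser import HTMLParser

class WordExtractor(HTMLParser):
    """Collect the visible text of an HTML page plus its meta keywords."""
    def __init__(self):
        super().__init__()
        self.words = []
    def handle_starttag(self, tag, attrs):
        if tag == "meta" and dict(attrs).get("name") == "keywords":
            self.words += dict(attrs).get("content", "").lower().split(",")
    def handle_data(self, data):
        self.words += data.lower().split()

def index_page(index, url, html):
    """Extract words from one page and record them in the index database."""
    parser = WordExtractor()
    parser.feed(html)
    for word in parser.words:
        index[word.strip()].add(url)

index = defaultdict(set)
index_page(index, "http://example.com/a",
           "<html><head><title>Web search</title>"
           "<meta name='keywords' content='engine,index'></head>"
           "<body><h1>How indexing works</h1></body></html>")
print(index["indexing"])   # {'http://example.com/a'}
```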

Searching Process:
When a user enters a query into a search engine (typically by
using keywords), the engine examines its index and provides a listing of
best-matching web pages according to its criteria, usually with a short
summary containing the document's title and sometimes parts of the text.
The index is built from the information stored with the data and the
method by which the information is indexed.
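A minimal sketch of the searching step, assuming an inverted index that maps words to sets of URLs plus a separate store of page titles for the short result summary; the data and names are illustrative.

```python
def search(index, titles, query):
    """Return (url, title) pairs for pages containing every query keyword."""
    matches = None
    for word in query.lower().split():
        urls = index.get(word, set())
        matches = urls if matches is None else matches & urls
    return [(url, titles.get(url, "")) for url in (matches or set())]

index = {
    "web":     {"http://example.com/a", "http://example.com/b"},
    "crawler": {"http://example.com/b"},
}
titles = {
    "http://example.com/a": "Web search basics",
    "http://example.com/b": "How a Web crawler works",
}
print(search(index, titles, "web crawler"))
# [('http://example.com/b', 'How a Web crawler works')]
```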

Crawling Process:
A Web crawler starts with a list of URLs to visit, called the seeds. As the
crawler visits these URLs, it identifies all the hyperlinks in the page and
adds them to the list of URLs to visit, called the crawl frontier. URLs from
the frontier are recursively visited according to a set of policies. If the
crawler is performing archiving of websites, it copies and saves the
information as it goes.
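A minimal sketch of this crawl loop using only the Python standard library: it starts from seed URLs, fetches each page, extracts the hyperlinks, and pushes unseen links onto the crawl frontier. Politeness rules (robots.txt, crawl delays) and detailed error handling are deliberately omitted.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href target of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seeds, max_pages=10):
    frontier = deque(seeds)   # the crawl frontier, seeded with the start URLs
    seen = set(seeds)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue          # unreachable page: skip it
        fetched += 1
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen

print(crawl(["http://example.com/"]))
```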

Search is Mostly Invisible


Like an iceberg, roughly two-thirds of a search system sits below the water
line: the visible user interface rests on top of the underlying content and
search functionality.

Web Crawling Introduction

A Web crawler is an Internet bot which systematically browses the
World Wide Web, typically for the purpose of Web indexing. A Web crawler
may also be called a Web spider, an ant, an automatic indexer, or a Web
scutter. Web search engines and some other sites use Web crawling or
spidering software to update their web content or indexes of other sites' web
content. Web crawlers can copy all the pages they visit for later processing by
a search engine which indexes the downloaded pages so the users can search
much more efficiently.

Crawlers can validate hyperlinks and HTML code. They can also be used for
web scraping.
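As a small illustration of link validation, the sketch below sends a HEAD request to each URL and reports links whose servers answer with an error or cannot be reached; the URLs shown are placeholders.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def validate_links(urls):
    """Return (url, status) pairs for links that appear to be broken."""
    broken = []
    for url in urls:
        try:
            urlopen(Request(url, method="HEAD"), timeout=5)
        except HTTPError as err:     # server answered with a 4xx/5xx status
            broken.append((url, err.code))
        except URLError:             # DNS failure, refused connection, ...
            broken.append((url, None))
    return broken

print(validate_links(["http://example.com/", "http://example.com/missing"]))
```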

Utilities of a crawler

Web Crawling Definition:

A Web crawler is a computer program that browses the World Wide


Web in a methodical, automated manner. (Wikipedia)

Web Crawling Utilities:

Gather pages from the Web.

Support a search engine, perform data mining and so on.

Objects Processed by Crawler:

Text, video, image and so on.

Overview of Crawler

A Web crawler starts with a list of URLs to visit, called the seeds. As the
crawler visits these URLs, it identifies all the hyperlinks in the page and adds
them to the list of URLs to visit, called the crawl frontier. URLs from the
frontier are recursively visited according to a set of policies. If the crawler is
performing archiving of websites, it copies and saves the information as it
goes. The archives are usually stored in such a way they can be viewed, read
and navigated as they were on the live web, but are preserved as snapshots.

The large volume of the Web implies the crawler can only download a limited
number of pages within a given time, so it needs to prioritize its downloads.
The high rate of change implies that pages may already have been updated or
even deleted by the time the crawler revisits them.
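One simple way to prioritize downloads is a priority queue over the frontier, as sketched below; the numeric priority (for example link depth or an importance score) is an assumption for illustration, not something specified in the slides.

```python
import heapq

class Frontier:
    """Crawl frontier that hands out URLs in priority order (lowest first)."""
    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url, priority):
        if url not in self._seen:     # never enqueue the same URL twice
            self._seen.add(url)
            heapq.heappush(self._heap, (priority, url))

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None

frontier = Frontier()
frontier.add("http://example.com/", priority=0)         # seed: fetch first
frontier.add("http://example.com/archive", priority=2)  # less important page
print(frontier.next_url())   # 'http://example.com/'
```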

Thank You
