PERFORMANCE ANALYSIS OF CACHE BASED WEB SEARCH ENGINE

ABSTRACT
Over the past decade, much research has addressed the technical challenges of the web
search engine, such as crawling web documents, high-performance indexing, and ranking
systems based on hyperlink analysis. However, the implementation details of its query processing
system are rarely dealt with in the literature. In this paper we present a distributed architecture
for the query processing system and its hierarchical cache scheme. Our paper is based on the
development experience of a commercial web search engine designed to answer 5 million user
queries per day against over 6.5 million web pages. Using the hierarchical cache scheme, we keep
a portion of the query results in multi-level caches so that excessive I/O or CPU time is not spent
on query processing. With this scheme, it is possible to reduce server costs by around 70%. This
project analyzes the performance of a cache-based web search engine so that the same principles
can be applied to large-scale web search engines.

Tools Used

Java JDK 1.6

MyEclipse 7.5 (IDE for application development)

Tomcat 6.0 (web server)

Rational Rose 2000 (system design using the Unified Process)

Flash 8 (2D animation tool)

MS Access 2007 (database)

Table of Contents

CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
1. INTRODUCTION
2. AIM OF THE PROJECT
3. APPLICATION OF THE PROJECT
4. LITERATURE SURVEY
4.1 What is Search Engine ?
4.2 Search engine survey
4.3 Basic Concepts
4.4 Types of Search Engines
4.4.1 General Search Engines
4.4.2 Meta search Engines
4.4.3 Media Search Engines
4.4.4 Genre Oriented Search Engines
4.4.5 Defunct Search Engines
4.5 How Search Engine Works
4.5.1 Basic Building Blocks of Search Engine
4.5.1.1 Web Crawling
4.5.1.2 Building the Index
4.5.1.3 Building a Search
5. FEASIBILITY STUDY
5.1 Economic Feasibility
5.2 Technical Feasibility
5.3 Operational Feasibility
6. REQUIREMENT ANALYSIS
6.1 Purpose of the System
6.2 Scope of the system
6.3 Current System
6.4 Proposed System
6.4.1 Functional Requirements
6.4.2 Non Functional Requirements
6.4.2.1 Usability
6.4.2.2 Reliability
6.4.2.3 Performance
6.4.2.4 Supportability
6.5 Software Requirements and Specifications
6.5.1 Software Interfaces (Development Tools)
6.5.2 Hardware Interfaces
6.5.3 Communication Interfaces
6.5.4 User Interfaces
7. SYSTEM DESIGN
7.1 Design Modeling Tools
7.1.1.1 Object Oriented Analysis
7.1.1.2 Object Oriented Design
7.1.2 System Modeling with UML Diagrams
7.1.2.1 About UML
7.1.2.2 Use case Model for the Query Processing System
7.1.2.3 Sequence Diagram for the Query Processing System
7.1.2.4 Activity diagram for the Query Processing System
7.1.2.5 Collaboration Diagram for Query Processing System
7.1.2.6 Classes required for Query Processing System
7.1.2.7 System overall Class Diagram with Associations
7.2 System Development Environment
7.2.1 JSP (JavaServer Pages) Role in this Project
7.2.2 Java Servlets Role in this Project
7.2.3 About MyEclipse 7.5
7.3 Proposed Software Architecture
7.3.1 Sub System Decomposition
7.3.1.1 User Authentication Module
7.3.1.2 Query processing system (QPS) Module
7.3.1.3 Algorithm for Query processing steps of the coordinator server
7.3.1.4 Multi-level Cache Module
7.3.1.5 Algorithm for the CL1 cache (Level 1 Cache)
8. IMPLEMENTATION
8.1 Implementation of Data base connectivity Class
8.2 Implementation of Util Class
8.3 Implementation of TextClassification Class
8.4 Implementation of Loading Class
8.5 Implementation of Controller Class (Client Servlet)
9. TESTING
9.0.1 White-box and black-box testing
9.0.2 Test levels
9.0.3 A sample Test Cycle
9.1 Test Cases for LOGIN Module
9.1.1 Test case of New User Login Forms
9.1.2 Test case of Existing User Login Forms
9.2 Test cases With Search Engine Interface Forms
9.2.1 Search Engine Interface Form
9.2.2 Test case with the key word 1985 in Search Engine Interface Form
9.2.3 Test case with the key word DMW in Search Engine Interface Form
9.2.4 Test case with the key word TPO in Search Engine Interface Form
9.2.5 Test case with the key word mobile in Search Engine Interface Form
9.2.6 Test case with the key word research mining in Search Engine Interface Form
9.2.7 Test case with the key word sfdsadfsdhdf in Search Engine Interface Form (No record Found Case)
9.3 Test cases with Time Performance Analysis of Web Search Engine
9.3.1 First Time applying Query without Cache of Search Engine Interface
9.3.1.1 Time Performance Ratio graph for the above result
9.3.2 Second Time applying Same Query with Cache of Search Engine Interface
9.3.2.1 Time Performance Ratio graph for the above result
10. PERFORMANCE ANALYSIS
11. FUTURE ENHANCEMENT
12. CONCLUSION
13. REFERENCES
APPENDIX
GLOSSARY

LIST OF FIGURES

Figure No    Description
1.1          Web Crawling Process
1.2          Basic Rational Rose 2000 IDE
2.1          Use case Model for the Query Processing System
2.2          Sequence Diagram for the Query Processing System
3.1          Activity diagram for the Query Processing System
3.2          Collaboration Diagram for Query Processing System
4.1          Class Diagram for Database Accessing
4.2          Util Class diagram
4.3          Controller Class Diagram
4.4          Text Classification Class Diagram
4.5          Loading Class diagram
5.1          System overall Class Diagram with Associations
5.2          Query processing system (QPS) Module

1. Introduction
Nowadays, the web search engine is widely used as a common way to find
information of interest, and the number of indexed documents has reached the scale of
multiple billions. A web search engine is software that indexes web documents collected
from the Internet and ranks them by their relevance to the query entered by the user.
Much research has been done to solve various problems related to the web search
engine, such as crawling web documents, high-performance indexing, hyperlink
analysis, and topic-sensitive searching. However, there is not enough information about
how to implement a query processing system (QPS) suitable for large-scale web search
engines. Since the system has to index a huge amount of data, the cost of producing a
query result can be very high.
This project mainly focuses on the QPS of a large-scale web search engine. We
first design a distributed architecture for the QPS. Since query processing consumes so
much CPU and I/O, more than one server has to work in parallel even to process a
single query. To make such cooperation efficient, the QPS is designed as clustered
servers, and the server clusters are connected to each other via high-speed LANs.
Next, we describe the hierarchical cache scheme, which is the main topic of this paper.
The cache scheme is organized as a hierarchy of four cache levels. In the top-level
cache, the most recent search result pages are stored in main memory, while the lower
cache levels reside on disk so that more query results can be kept. Using the multi-level
caches, we can save 70% of the server cost for query processing. In this way, our system
indexes 65 million web documents and can answer 5 million user queries against them
per day at low cost.
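
The lookup order of such a hierarchical cache can be sketched as follows. This is a minimal two-level illustration, not the engine's actual implementation: the class name TwoLevelCache, the LRU eviction policy, and the second in-memory map that stands in for the disk-resident levels are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical two-level query-result cache. A small LRU map plays the role of
// the in-memory top level; a plain map stands in for a disk-resident lower level.
class TwoLevelCache {
    private final Map<String, String> level1;                                  // "memory" level, LRU-evicted
    private final Map<String, String> level2 = new HashMap<String, String>();  // stand-in for "disk"
    int backendCalls = 0;                                                      // counts full query evaluations

    TwoLevelCache(final int level1Capacity) {
        // accessOrder = true keeps entries ordered from least to most recently used
        this.level1 = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > level1Capacity;
            }
        };
    }

    String search(String query) {
        String result = level1.get(query);
        if (result != null) {
            return result;                 // top-level hit: no index work at all
        }
        result = level2.get(query);
        if (result == null) {
            result = evaluate(query);      // miss in every level: run the full query
            level2.put(query, result);
        }
        level1.put(query, result);         // promote the result to the top level
        return result;
    }

    // Placeholder for the expensive index lookup and ranking step.
    private String evaluate(String query) {
        backendCalls++;
        return "results for " + query;
    }

    public static void main(String[] args) {
        TwoLevelCache cache = new TwoLevelCache(2);
        cache.search("mobile");
        cache.search("mobile");            // second call is answered from the cache
        System.out.println("backend calls: " + cache.backendCalls);
    }
}
```

In this sketch a result evicted from the top level is still available in the lower level, so a repeated query costs a cheaper lookup instead of a full evaluation.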

2. Aim of the project


The main objective of the system is to use memory and disk space efficiently. With the
hierarchical cache scheme, it is possible to reduce server costs by around 70%.

3. Application of this Project


The strategy presented in this project can be adopted to improve the efficiency of a
search system. It gives search engine developers ideas for doing things in a better way,
and it lets us analyze the performance of a typical cache-based web search engine.

4. Literature Survey
A literature survey has been carried out on search engines and the related mechanisms
of a typical search engine.

4.1 What is Search Engine ?


A search engine is a software system that assists people in locating resources. An
important class of search engines consists of those whose primary function is to search
for information. They are by no means the only kind: a large (and growing) number of
search engines assist people in locating things other than information, such as jobs,
services and physical objects (e.g. books, CDs, online auction items, houses, cars and
even other people). Such search engines fall outside the scope of this survey.

4.2 Search engine survey


At the time of writing (July 2000), more than 700 search engines can be accessed
through the Internet, most of them by means of a web page that allows users to set up
searches and view a ranked result.

This web page is part of a continuing effort to create a survey of some of these search
engines (in particular those targeted towards locating information) and the multifarious
data repositories and information spaces they allow people to search in.

4.3 Basic Concepts


There is a lot of information (or data) in the world. Not all of it can be located through
the Internet (at least not yet), but a lot of it can now (at least in principle) be found by
means of some Internet search engine.

Search engines can be thought of as mapmakers in information space. They explore the
landscape of information and create maps in the shape of internal structures that are
supposed to help travelers find their way in the chaotic warehouse of superabundant
information that most people see when they first are confronted with the World Wide
Web.

Not all search engines map the same information space. Some search engines map the
resources that can be downloaded from open ftp (file transfer protocol) repositories
around the globe, others the resources that are resident on the World Wide Web
(technically, this means that they map resources visible through http, the hypertext
transfer protocol, since this protocol is what technically defines which part of the
Internet landscape belongs to the World Wide Web). A third protocol that also defines a
clearly delineated information space is nntp (network news transfer protocol), which
technically defines what is known as "network news" (not to be confused with what
constitutes "news" in "old media" such as television and periodicals) or, more
descriptively, "Usenet discussion groups". To make things even more complicated, some
search engines explore information spaces delineated not only by technical protocol but
also by genre. For instance, they may monitor and map breaking news (in the "old media"
sense) distributed by wire services and/or news-oriented media on the World Wide
Web.

In the survey, I've indicated this by creating categories accordingly. There are three
clear-cut categories corresponding to technical protocol (web/http, usenet/nntp and
ftp/ftp). In addition, some search engines provide access to proprietary data from sources
outside what is openly available on the Internet.

Brief description of the 10 information spaces:

1. web: information accessible through the World Wide Web
2. usenet: Usenet discussion groups
3. ftp: archives that can be accessed through so-called anonymous ftp
4. prop.: proprietary information not openly available on the Internet
5. legal: collections of laws, patents, rulings and other legal sources
6. wire: wire services, newspapers, news-oriented periodicals
7. science: scientific papers
8. reports: business-oriented reports and surveys
9. ref.: reference works such as bibliographies, encyclopedias and dictionaries
10. dir.: directories and catalogue listings

Being aware of which information space a particular engine maps is crucial if you want
to make efficient use of the engine. If you are looking for a specific company's home
page on the World Wide Web, you have a much higher chance of success if you use a
search engine that maps the World Wide Web, rather than one that lets you search the
information space made up of newswire telegrams and newspaper articles.

4.4 Types of Search Engines


4.4.1 General Search Engines
AltaVista Search Engine :
AltaVista was originally created at Digital Equipment Corporation's research lab in
Palo Alto as an on-the-web showcase for the 64-bit Alpha CPU developed by DEC
(now a part of Compaq). Its presence on the web helped establish search engines as
part of the Internet landscape. Following the restructuring of DEC, AltaVista was set up
as a separate company and is now a major Internet portal with Internet searching as only
one of its many services. Among its special features is a technology called RealNames
that checks the search terms entered by the user against an internal database of registered
and common-law company, product and concept names and marketing slogans. If
RealNames finds a match, it points you to the name owner's web page. AltaVista also
licenses Babelfish from the company Systran. This is useful for quick translations of
foreign web pages into (something resembling) English. AltaVista also offers a media
finder that lets users search for particular media types (e.g. images, video and audio).
This is described below, in the section on media search engines.

Google Search Engine :


Google (earlier name WebBase) started out as a research project at Stanford University. Google
uses a simple, yet strikingly efficient, approach to ranking, known as "link cardinality". Google
has trademarked this under the name PageRank and applied for a patent. "Link cardinality"
means that Google counts how many times other sites, especially respected and well-known
sites such as Yahoo!, link to a particular page from elsewhere on the web. This is a
variation of the recommender system approach, but instead of relying on explicit voting, Google
extracts the "votes" implicitly by counting links. Google has another useful feature. Like most
web search engines, it provides a hyperlink that the user may click on to go to the information
at the place of origin (the non-hosted approach). In addition, it retains a local copy of the page
as it was originally downloaded and analyzed (the hosted approach). This local copy can be
used if the original page is no longer available, or if it has changed so much since the time
it was analyzed by Google that the relevance ranking of the page no longer applies.

Invisible Web Search Engine :


The Invisible Web consists of searchable information resources whose contents cannot
be indexed by traditional search engines. These include databases, archived material,
and interactive tools such as calculators and dictionaries. Since these resources are
embedded within thousands of individual Web sites, they are not visible to the search
engines of today.

Northern Light Search Engine :


In addition to searching the open Internet, Northern Light hosts a database called
the Special Collection, which it describes as "an online business library comprising
6,700 trusted, full-text journals, books, magazines, newswires, and reference sources."
Access to items in the Special Collection is on a pay-per-view (PPV) basis.

4.4.2 Meta search Engines


Meta search engines forward queries to other search engines, extract the results,
perform a re-ranking and present them through a standardized user interface.
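
The forward, extract and re-rank steps can be illustrated with a small merge routine. This is only a sketch under stated assumptions: the class name MetaSearchMerger is hypothetical, and the scoring rule (each engine contributes 1/rank for a URL it returns) is one simple choice for illustration, not the method used by any particular meta search engine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical re-ranking step of a meta search engine.
class MetaSearchMerger {
    // Each inner list is the ranked URL list returned by one underlying engine.
    // A URL scores the sum of 1/rank over all engines that returned it.
    static List<String> merge(List<List<String>> rankedLists) {
        final Map<String, Double> score = new HashMap<String, Double>();
        for (List<String> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                String url = list.get(i);
                Double old = score.get(url);
                score.put(url, (old == null ? 0.0 : old) + 1.0 / (i + 1));
            }
        }
        List<String> merged = new ArrayList<String>(score.keySet());
        Collections.sort(merged, new Comparator<String>() {
            public int compare(String a, String b) {
                int byScore = Double.compare(score.get(b), score.get(a)); // higher score first
                return byScore != 0 ? byScore : a.compareTo(b);           // stable tie-break by URL
            }
        });
        return merged;
    }
}
```

A URL that appears near the top of several result lists ends up ahead of one that a single engine ranked first.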

AskJeeves Search Engine :

The AskJeeves site accepts questions in plain English: How do I scan photographs?
Where can I find recipes for apple pie? AskJeeves compares your question with its
internal knowledge base of questions and answers (compiled by human editors), and
finds those that most closely match your question. You are then presented with a list of
questions it knows how to answer, and should pick one of these (if it still bears
resemblance to your original question). This question is then transformed into a set of
appropriate search requests and submitted to a number of search engines as well as (if
appropriate) reference works such as the Encyclopedia Britannica.

Go2Net Search Engine :

Both MetaCrawler and DogPile are part of the Go2Net Network.

4.4.3 Media Search Engines

AltaVista Search Engine :

AltaVista also features a media finder that lets users search by keyword for particular
media types, such as images, video or audio files. The search results in a pointer to the
media file, and a presentation of derived metadata, such as file format and file size. For
images and videos, the results page includes thumbnail illustrations. In addition to
searching for media files on the web, AltaVista lets users search for premium media
content sold through partners.

MP3 Search Engine :

Searches for audio files in the MP3 file format in open ftp archives.

4.4.4 Genre Oriented Search Engines


The genre oriented search engines are characterized by restricting their information
space to a particular genre (such as scientific papers or newswire telegrams). Many of
these provide extensive coverage of proprietary sources (often carefully maintained by
human ontologists), while others are set up to locate genre-specific resources that are
freely available on the World Wide Web. I've removed the Usenet and Ftp categories,
since these do not apply in this context.

4.4.5 Defunct Search Engines


Cora
Cora and its companion service Sara search the World Wide Web for research papers
in PostScript format from universities and laboratories. Cora's focus is computer
science.
Sara
Sara and its companion service Cora search the World Wide Web for research papers
in PostScript format from universities and laboratories. Sara's focus is the field of
statistics.

4.5 How Search Engine Works


Internet search engines are special sites on the Web that are designed to help people find
information stored on other sites. There are differences in the ways various search engines
work, but they all perform three basic tasks:

They search the Internet -- or select pieces of the Internet -- based on important
words.

They keep an index of the words they find, and where they find them.

They allow users to look for words or combinations of words found in that
index.

Early search engines held an index of a few hundred thousand pages and documents,
and received maybe one or two thousand inquiries each day. Today, a top search engine
will index hundreds of millions of pages, and respond to tens of millions of queries per
day.

4.5.1 Basic Building Blocks of Search Engine

A search engine has three basic building blocks:

Web Crawling

Building The Index

Building a Search

4.5.1.1 Web Crawling

Before a search engine can tell you where a file or document is, it must be found. To
find information on the hundreds of millions of Web pages that exist, a search engine
employs special software robots, called spiders, to build lists of the words found on
Web sites. When a spider is building its lists, the process is called Web crawling.
(There are some disadvantages to calling part of the Internet the World Wide Web -- a
large set of arachnid-centric names for tools is one of them.) In order to build and
maintain a useful list of words, a search engine's spiders have to look at a lot of pages.

How does any spider start its travels over the Web? The usual starting points are lists of
heavily used servers and very popular pages. The spider will begin with a popular site,
indexing the words on its pages and following every link found within the site. In this
way, the spidering system quickly begins to travel, spreading out across the most widely
used portions of the Web.

Figure 1.1 Web Crawling Process

"Spiders" take a Web page's content and create key search words that enable online
users to find pages they're looking for.
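
The way a spider spreads out from a few popular starting pages can be sketched as a breadth-first traversal. The sketch below makes several assumptions for the sake of a self-contained example: the class name CrawlSketch is hypothetical, and an in-memory map from page to outgoing links replaces real page fetching and link extraction.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical spider over a toy link graph; no network access is performed.
class CrawlSketch {
    // Visits pages breadth-first from the seeds and returns them in visiting order.
    // `links` maps a page to the links found on it (a stand-in for fetching and parsing).
    static List<String> crawl(List<String> seeds, Map<String, List<String>> links, int maxPages) {
        List<String> visitedOrder = new ArrayList<String>();
        Set<String> seen = new HashSet<String>();
        Queue<String> frontier = new ArrayDeque<String>();
        for (String seed : seeds) {
            if (seen.add(seed)) {
                frontier.add(seed);
            }
        }
        while (!frontier.isEmpty() && visitedOrder.size() < maxPages) {
            String page = frontier.remove();
            visitedOrder.add(page);            // a real spider would index the page's words here
            List<String> outgoing = links.get(page);
            if (outgoing == null) {
                continue;                      // page has no known links
            }
            for (String next : outgoing) {
                if (seen.add(next)) {          // follow each link only once
                    frontier.add(next);
                }
            }
        }
        return visitedOrder;
    }
}
```

The `seen` set is what keeps the traversal from looping when pages link back to each other, and `maxPages` bounds the crawl.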

4.5.1.2 Building the Index


Once the spiders have completed the task of finding information on Web pages (and
we should note that this is a task that is never actually completed -- the constantly
changing nature of the Web means that the spiders are always crawling), the search
engine must store the information in a way that makes it useful. There are two key
components involved in making the gathered data accessible to users

The information stored with the data

The method by which the information is indexed

In the simplest case, a search engine could just store the word and the URL where it
was found. In reality, this would make for an engine of limited use, since there would
be no way of telling whether the word was used in an important or a trivial way on
the page, whether the word was used once or many times or whether the page
contained links to other pages containing the word. In other words, there would be no
way of building the ranking list that tries to present the most useful pages at the top
of the list of search results.

To make for more useful results, most search engines store more than just the word
and URL. An engine might store the number of times that the word appears on a
page. The engine might assign a weight to each entry, with increasing values
assigned to words as they appear near the top of the document, in sub-headings, in
links, in the meta tags or in the title of the page. Each commercial search engine has a
different formula for assigning weight to the words in its index. This is one of the
reasons that a search for the same word on different search engines will produce
different lists, with the pages presented in different orders. Regardless of the precise
combination of additional pieces of information stored by a search engine, the data
will be encoded to save storage space. For example, the original Google paper
describes using 2 bytes, of 8 bits each, to store information on weighting -- whether
the word was capitalized, its font size, position, and other information to help in
ranking the hit. Each factor might take up 2 or 3 bits within the 2-byte grouping (8
bits = 1 byte). As a result, a great deal of information can be stored in a very compact
form. After the information is compacted, it's ready for indexing.
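
Packing several small factors into a 2-byte value can be sketched with bit shifts and masks. The field layout below (1 bit for capitalization, 3 bits for font size, 12 bits for position) is an assumption chosen for illustration, and the class name HitCodec is hypothetical.

```java
// Hypothetical 16-bit encoding of one word occurrence ("hit").
// Assumed layout, high to low bits: [capitalized:1][fontSize:3][position:12]
class HitCodec {
    static int pack(boolean capitalized, int fontSize, int position) {
        if (fontSize < 0 || fontSize > 7) {
            throw new IllegalArgumentException("fontSize must fit in 3 bits");
        }
        if (position < 0 || position > 4095) {
            throw new IllegalArgumentException("position must fit in 12 bits");
        }
        return ((capitalized ? 1 : 0) << 15) | (fontSize << 12) | position;
    }

    static boolean isCapitalized(int hit) {
        return ((hit >>> 15) & 0x1) == 1;
    }

    static int fontSize(int hit) {
        return (hit >>> 12) & 0x7;
    }

    static int position(int hit) {
        return hit & 0xFFF;
    }

    public static void main(String[] args) {
        int hit = pack(true, 5, 300);
        System.out.println(isCapitalized(hit) + " " + fontSize(hit) + " " + position(hit));
    }
}
```

Every packed value stays within the range 0 to 65535, so it can be stored in two bytes.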

An index has a single purpose: It allows information to be found as quickly as
possible. There are quite a few ways for an index to be built, but one of the most
effective ways is to build a hash table. In hashing, a formula is applied to attach a
numerical value to each word. The formula is designed to evenly distribute the
entries across a predetermined number of divisions. This numerical distribution is
different from the distribution of words across the alphabet, and that is the key to a
hash table's effectiveness.

In English, there are some letters that begin many words, while others begin fewer.
You'll find, for example, that the "M" section of the dictionary is much thicker than
the "X" section. This inequity means that finding a word beginning with a very
"popular" letter could take much longer than finding a word that begins with a less
popular one. Hashing evens out the difference, and reduces the average time it takes
to find an entry. It also separates the index from the actual entry. The hash table
contains the hashed number along with a pointer to the actual data, which can be
sorted in whichever way allows it to be stored most efficiently. The combination of
efficient indexing and effective storage makes it possible to get results quickly, even
when the user creates a complicated search.
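
A hash-based index of this kind can be sketched as a fixed number of buckets, each holding entries that point to the stored data for a word. The class name HashedIndex, the use of Java's built-in String.hashCode as the hashing formula, and the list of URLs as the stored data are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hash-table index: word -> list of URLs where the word was found.
class HashedIndex {
    private static class Entry {
        final String word;
        final List<String> urls = new ArrayList<String>();   // the "actual data" the entry points to
        Entry(String word) { this.word = word; }
    }

    private final List<List<Entry>> buckets = new ArrayList<List<Entry>>();

    HashedIndex(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<Entry>());
        }
    }

    // Maps a word to one of the predetermined divisions. The result depends on the
    // whole word, not on its first letter.
    int bucketOf(String word) {
        return (word.hashCode() & 0x7fffffff) % buckets.size();
    }

    void add(String word, String url) {
        List<Entry> bucket = buckets.get(bucketOf(word));
        for (Entry e : bucket) {
            if (e.word.equals(word)) {
                if (!e.urls.contains(url)) {
                    e.urls.add(url);
                }
                return;
            }
        }
        Entry created = new Entry(word);
        created.urls.add(url);
        bucket.add(created);
    }

    // Only the entries of one bucket are scanned, never the whole index.
    List<String> lookup(String word) {
        for (Entry e : buckets.get(bucketOf(word))) {
            if (e.word.equals(word)) {
                return e.urls;
            }
        }
        return new ArrayList<String>();
    }
}
```

A lookup touches a single bucket, which is why the average search time stays low as the index grows.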

4.5.1.3 Building a Search


Searching through an index involves a user building a query and submitting it
through the search engine. The query can be quite simple, a single word at minimum.
Building a more complex query requires the use of Boolean operators that allow you
to refine and extend the terms of the search.

The Boolean operators most often seen are:

AND - All the terms joined by "AND" must appear in the pages or
documents. Some search engines substitute the operator "+" for the word
AND.

OR - At least one of the terms joined by "OR" must appear in the pages or
documents.

NOT - The term or terms following "NOT" must not appear in the pages or
documents. Some search engines substitute the operator "-" for the word
NOT.

FOLLOWED BY - One of the terms must be directly followed by the other.

NEAR - One of the terms must be within a specified number of words of the other.
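
The AND, OR and NOT operators above correspond to set operations on the lists of pages that contain each term. The following sketch assumes each term has already been resolved to a set of document identifiers; the class name BooleanQuery and the identifiers are hypothetical, and NOT is modeled as a binary "a NOT b".

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical evaluation of Boolean operators over sets of document identifiers.
class BooleanQuery {
    // AND: documents that contain both terms (set intersection).
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<String>(a);
        result.retainAll(b);
        return result;
    }

    // OR: documents that contain at least one of the terms (set union).
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<String>(a);
        result.addAll(b);
        return result;
    }

    // NOT: documents that contain the first term but not the excluded one (set difference).
    static Set<String> not(Set<String> a, Set<String> excluded) {
        Set<String> result = new TreeSet<String>(a);
        result.removeAll(excluded);
        return result;
    }
}
```

FOLLOWED BY and NEAR need word positions in addition to document identifiers, so they are left out of this sketch.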


5. Feasibility Study

A feasibility study is a compressed, capsule version of the entire system analysis and
design process. The study begins by clarifying the problem definition. The purpose of a
feasibility study is not to solve the problem but to determine whether it is worth solving.
Once an acceptable problem definition has been generated, the analyst develops a
logical model of the system as a reference. Next, the alternatives are carefully
analyzed for feasibility. At least three different types of feasibility are considered.

5.1 Economic Feasibility


A system that can be developed technically and that will be used if installed must
still be a good investment for the organization. Financial benefits must equal or
exceed the costs. The cost of the feasibility study should be approximately 5 to 10
percent of the estimated cost of the project.

5.2 Technical Feasibility


The technical issues usually raised during a feasibility study are: Does the
necessary technology exist to do what is suggested? Can the system be expanded once
it is developed? The present project is undertaken after all the software requirements
have been met, and there is provision for further enhancement. A language that can
reach the system level is needed to solve this problem, and a scripting language such
as JSP provides this option. Minimal hardware is enough to fulfill the requirements
for developing this project; hence we conclude that the project is technically feasible.

5.3 Operational Feasibility


This test of feasibility asks whether the system will work when it is developed and
installed. The following questions help test the operational feasibility of a project.
Is there sufficient support for the project from management and users? Has the system
been tested to work in all conditions? The proposed project has been developed with
the involvement of management and users, and it has been tested to work in all
conditions, so it can be considered operationally feasible.


6. Requirement Analysis
6.1 Purpose of the System
This project mainly focuses on the Query Processing System (QPS) of a large-scale web
search engine. We first design a distributed architecture for the QPS. Because query
processing consumes so much CPU and I/O, more than one server has to work in parallel
even to process a single query.

6.2 Scope of the system

This project is based on the development experience of a commercial web search
engine designed to answer 5 million user queries against over 6.5 million web pages
per day.

The benefits of the implemented multi-level cache scheme are:

Shorter response times.

Efficient use of memory and disk space: the cache data are managed across caches
of four different levels. Using the multi-level caching scheme, we can save around
70% of the server cost.


6.3 Current System

The web search engine is widely used as a common way to find information of
interest, and its indexed documents have reached the scale of multiple billions.
A web search engine is software that indexes web documents collected from the
Internet and ranks them according to their relevance to an entered user query. Much
research has been done to solve various problems related to the web search engine,
such as crawling web documents, high-performance indexing, hyperlink analysis, and
topic-sensitive searching. However, there is not enough information about how to
implement a query processing system suitable for large-scale web search engines.
Since the system has to index a huge amount of data, the cost of producing a query
result can be very high.

6.4 Proposed System


We mainly focus on the QPS of a large-scale web search engine. We first design a
distributed architecture for the QPS. Because query processing consumes so much CPU
and I/O, more than one server has to work in parallel even to process a single query.
To make such cooperation efficient, the QPS is designed as clustered servers, and the
server clusters are connected to each other via high-speed LANs.
Using the multi-level caches, we can save 70% of the server cost for query
processing. In this way, our system indexes 65 million web documents and can
answer 5 million user queries against them per day at a low cost.

6.4.1 Functional Requirements


The functional requirements include the user interfaces for both the client and the
server. The system first performs user authentication and then opens the search
engine, which implements the web search with the cache scheme. Finally, it generates
graphs of the query results against the response time of the retrieved links, which
help us analyze the performance of the web search engine.
A user can log in to the system and enter a keyword into the search box interface,
where he will get the related hyperlinks.

6.4.2 Non Functional Requirements


6.4.2.1 Usability
The system is used by two classes of users, namely the client and the server.

6.4.2.2 Reliability
The system is considered reliable because it was built entirely in Java, a language
designed for robustness. Reliability refers to the standards of the system.

6.4.2.3 Performance
The system is highly functional and performs well. It should use a minimal set of
variables and control structures, which improves the performance of the system.

6.4.2.4 Supportability
The system is supported on different platforms and a wide range of machines,
because the Java code used in this project is platform independent.


6.5 Software Requirements and Specifications


6.5.1 Software Interfaces (Development Tools)

Operating System : Microsoft Windows XP
Languages : Java (JDK 1.5.0_06), HTML
IDE : MyEclipse 7.5
Web Server : Apache Tomcat 6.0
Database Interface : MS Access
Database Drivers : ODBC
UML Tool : Rational Rose 2000

6.5.2 Hardware Interfaces

CPU : Pentium IV 2.2 GHz
Memory : 256 MB RAM
Hard Disk Drive : 20 GB
Monitor : VGA color monitor

6.5.3 Communication Interfaces


HTTP ( Hyper Text Transfer Protocol )

6.5.4 User Interfaces


Login Interface form : The user can log in and access the search engine services.
Search Engine Interface : The user can enter a word and get the related URLs.
The interface is easy for the user to work with. It is built from HTML and JSP
components.

7. System Design
7.1 Design Modeling Tools
Object-Oriented Analysis and Design (OOAD) is often part of the development of
large-scale systems and programs, often using the Unified Modeling Language
(UML). OOAD applies object-modeling techniques to analyze the requirements for a
context (for example, a system, a set of system modules, an organization, or a
business unit) and to design a solution. Most modern object-oriented analysis and
design methodologies are use-case driven across requirements, design,
implementation, testing, and deployment.

Use cases were invented with object-oriented programming, but they are also
very well suited for systems that will be implemented in the procedural paradigm.
The Unified Modeling Language (UML) has become the standard modeling language
used in object-oriented analysis and design to graphically illustrate system concepts.
The basic reason for using OOAD is that it can be used in developing programs that
will have an extended lifetime.

7.1.1.1 Object Oriented Analysis


An object-oriented system is composed of objects. The behavior of the system is
achieved through collaboration between these objects, and the state of the system is
the combined state of all the objects in it. Collaboration between objects involves
them sending messages to each other. The exact semantics of message sending
between objects varies depending on what kind of system is being modeled. In some
systems, "sending a message" is the same as "invoking a method".
Object-oriented analysis aims to model the problem domain, the problem we want to
solve by developing an object-oriented (OO) system. The source of the analysis is a
written requirements statement and/or written use cases; UML diagrams can be used
to illustrate the statements.
An analysis model does not take into account implementation constraints, such as
concurrency, distribution, persistence, or inheritance, nor how the system will be
built. The model of a system can be divided into multiple domains, each of which is
separately analyzed and represents a separate business, technological, or conceptual
area of interest. The result of object-oriented analysis is a description of what is to
be built, using concepts and relationships between concepts, often expressed as a
conceptual model. Any other documentation needed to describe what is to be built
is also included in the result of the analysis; that can include a detailed user
interface mock-up document. The implementation constraints are decided during the
object-oriented design (OOD) process.

7.1.1.2 Object Oriented Design


Object-oriented design (OOD) is an activity in which the designers look for logical
solutions to a problem, using objects. Object-oriented design takes the conceptual
model that is the result of object-oriented analysis and adds implementation
constraints imposed by the environment, the programming language, and the chosen
tools, as well as the architectural assumptions chosen as the basis of the design.
The concepts in the conceptual model are mapped to concrete classes, to abstract
interfaces in APIs, and to roles that the objects take in various situations. The
interfaces and their implementations for stable concepts can be made available as
reusable services. Concepts identified as unstable in object-oriented analysis form
the basis for policy classes that make decisions and implement environment-specific
or situation-specific logic or algorithms.
The result of object-oriented design is a detailed description of how the system can
be built, using objects.
Object-oriented software engineering (OOSE) is an object modeling language and
methodology. OOSE was developed by Ivar Jacobson in 1992 while at Objectory AB. It
is the first object-oriented design methodology to employ use cases to drive software
design. It also uses other design products similar to those used by OMT.
The tool Objectory was created by the team at Objectory AB to implement the OOSE
methodology. After its success in the marketplace, other tool vendors also supported
OOSE. After Rational bought Objectory AB, the OOSE notation, methodology, and tools
were superseded:

As one of the primary sources of the Unified Modeling Language (UML), concepts
and notation from OOSE have been incorporated into UML.

The methodology part of OOSE has since evolved into the Rational Unified
Process (RUP).

The OOSE tools have been replaced by tools supporting UML and RUP.

OOSE has been largely replaced by the UML notation and by the RUP methodology.

7.1.2 System Modeling with UML Diagrams


7.1.2.1 About UML
UML is a modeling language used mainly to design a system in graphical notation
and to show a blueprint of the overall design of the system.
The Unified Modeling Language (UML) is used to specify, visualize, modify,
construct, and document the artifacts of an object-oriented, software-intensive system
under development. UML offers a standard way to visualize a system's architectural
blueprints.

The following diagrams help us design the system models. Five kinds of UML
diagrams are used here:
1. Use Case Diagram
2. Sequence Diagram
3. Activity Diagram
4. Collaboration Diagram
5. Class Diagram

1. Use Case Diagram: A use case diagram shows a set of use cases and actors and
their relationships. You apply use case diagrams to illustrate the static use case view
of a system. Use case diagrams are especially important in organizing and modeling
the behaviors of a system.
2. Sequence Diagram: A sequence diagram is an interaction diagram that
emphasizes the time ordering of messages. A sequence diagram shows a set of objects
and the messages sent and received by those objects. The objects are typically named
or anonymous instances of classes, but may also represent instances of other
things, such as collaborations, components, and nodes. You use sequence diagrams to
illustrate the dynamic view of a system.
3. Activity Diagram: An activity diagram shows the flow from activity to activity
within a system. It shows a set of activities, the sequential or branching flow
from activity to activity, and the objects that act and are acted upon. You use activity
diagrams to illustrate the dynamic view of a system. Activity diagrams are especially
important in modeling the function of a system, and they emphasize the flow
of control among the objects.
4. Collaboration Diagram: This diagram is an alternative to the sequence diagram.
It takes less space because it represents the sequence with message numbers, and it
shows the collaboration among all the objects.
5. Class Diagram: A class diagram shows how a system is implemented through
real-world objects modeled as classes. It is drawn just before the system is coded; it
depicts the domain model of the entire system and gives the various relations among
all the classes.

About Rational Rose 2000:

Rational Rose 2000 is a software tool that helps us design a system with visual
tools and can generate language code from the model.

Figure 1.2 Basic IDE (Integrated Development Environment) of Rational Rose 2000

7.1.2.2 Use case Model for the Query Processing System

Figure 2.1 Use case Model for the Query Processing System
(Actor: User. Use cases: Login, Create new user, Enter the Query, Get result, View Graph.)

7.1.2.3 Sequence Diagram for the Query Processing System

Figure 2.2 Sequence Diagram for the Query Processing System
(Participants: User, Firewall, Load balancer, Web servers, Coordinator Servers,
Rankers, DST Servers.)

Messages in time order:
1: Access Firewall ()
2: Load ()
3: Access Server ()
4: Query Process ()
5: Assign Ranker ()
6: Calculate Rank ()
7: Query Result ()
8: Show Result to User ()

7.1.2.4 Activity diagram for the Query Processing System

Figure 3.1 Activity diagram for the Query Processing System
(Nodes shown: Start; User name and Password; Validate (Yes/No); User Login Process;
Create new User; Create new user process; Enter Search Query; Search From Cache;
Search key word; Show Result; Web Server; Coordinator Server; Ranker; DST Server;
Store new Search Result to Cache; View Result; View Graph; Stop.)

7.1.2.5 Collaboration Diagram for Query Processing System

Figure 3.2 Collaboration Diagram for Query Processing System
(Objects: User, Firewall, Load balancer, Web servers, Coordinator Servers, Rankers,
DST Servers.)

Numbered messages:
1: Access Firewall ()
2: Load ()
3: Access Server ()
4: Query Process ()
5: Assign Ranker ()
6: Calculate Rank ()
7: Query Result ()
8: Show Result to User ()

7.1.2.6 Classes required for Query Processing System

Figure 4.1 Class Diagram for Database Access

Figure 4.2 Util Class Diagram

Figure 4.3 Controller Class Diagram

Figure 4.4 Text Classification Class Diagram

Figure 4.5 Loading Class Diagram

7.1.2.7 System overall Class Diagram with Associations

Figure 5.1 System overall Class Diagram with Associations

7.2 System Development Environment


About System Development Tools
About Java

Initially the language was called Oak, but it was renamed Java in 1995. The
primary motivation for the language was the need for a platform-independent (i.e.,
architecture-neutral) language that could be used to create software to be embedded in
various consumer electronic devices.

Java is a programmer's language.

Java is cohesive and consistent.

Except for those constraints imposed by the Internet environment, Java gives
the programmer full control.

Finally, Java is to Internet programming what C was to systems programming.

Importance of Java to the Internet

Java has had a profound effect on the Internet because it expands the universe of
objects that can move about freely in cyberspace. In a network, two categories of
objects are transmitted between the server and the personal computer: passive
information and dynamic, active programs. Dynamic, self-executing programs cause
serious problems in the areas of security and portability. Java addresses those
concerns and, by doing so, has opened the door to an exciting new form of program
called the applet.

Java can be used to create two types of programs

Applications and Applets: An application is a program that runs on our computer
under the operating system of that computer. It is more or less like one created using
C or C++. Java's ability to create applets makes it important. An applet is an
application designed to be transmitted over the Internet and executed by a
Java-compatible web browser. An applet is actually a tiny Java program, dynamically
downloaded across the network, just like an image. The difference is that it is an
intelligent program, not just a media file. It can react to user input and change
dynamically.
Features Of Java Security
Every time you download a normal program, you are risking a viral
infection. Prior to Java, most users did not download executable programs frequently,
and those who did scanned them for viruses prior to execution. Most users still worried
about the possibility of infecting their systems with a virus. In addition, another type of
malicious program exists that must be guarded against. This type of program can gather
private information, such as credit card numbers, bank account balances, and
passwords. Java answers both these concerns by providing a firewall between a
network application and your computer.
When you use a Java-compatible web browser, applets run inside this restricted
environment, which greatly reduces the risk of virus infection or malicious intent.
Portability
For programs to be dynamically downloaded to all the various types of platforms
connected to the Internet, some means of generating portable executable code is
needed. The same mechanism that helps ensure security also helps create
portability. Indeed, Java's solution to these two problems is both elegant and efficient.
The Byte code
The key that allows Java to solve the security and portability problems is that the
output of the Java compiler is byte code. Byte code is a highly optimized set of
instructions designed to be executed by the Java run-time system, which is called the
Java Virtual Machine (JVM). That is, in its standard form, the JVM is an interpreter
for byte code. Translating a Java program into byte code makes it much easier to run
a program in a wide variety of environments. The reason is that, once the run-time
package exists for a given system, any Java program can run on it.
Although Java was designed for interpretation, there is technically nothing about Java
that prevents on-the-fly compilation of byte code into native code. Sun supplies a
Just In Time (JIT) compiler for byte code. When the JIT compiler is part of the JVM,
it compiles byte code into executable code in real time, on a piece-by-piece, demand
basis. The JIT compiler does not compile an entire Java program into executable code
all at once, because Java performs various checks that can be done only at run
time. The JIT compiles code as it is needed, during execution.
Java Virtual Machine (JVM)
Beyond the language, there is the Java virtual machine. The Java virtual machine is an
important element of Java technology. The virtual machine can be embedded within
a web browser or an operating system. Once a piece of Java code is loaded onto a
machine, it is verified. As part of the loading process, a class loader is invoked and
performs byte code verification, which makes sure that the code generated by the
compiler will not corrupt the machine on which it is loaded. Byte code verification
takes place when a class is loaded, before it is executed, to make sure that the code is
accurate and correct. Byte code verification is therefore integral to the executing of
Java code.

Overall Description

Figure: Development process of a Java program
(Java source -> Java compiler -> Java byte code in a .class file -> Java VM)

Java source code is compiled into byte code, which is then executed. The first box
indicates that the Java source code is located in a .java file, which is processed with
a Java compiler called javac. The Java compiler produces a file called a .class file,
which contains the byte code. The .class file is then loaded, either across the network
or locally on your machine, into the execution environment, the Java virtual machine,
which interprets and executes the byte code.
Java Architecture

Java architecture provides a portable, robust, high-performing environment for
development. Java provides portability by compiling source code to byte codes for the
Java Virtual Machine, which are then interpreted on each platform by the run-time
environment. Java is a dynamic system, able to load code when needed from a machine
in the same room or across the planet.
Compilation of code
When you compile the code, the Java compiler creates machine code (called byte code)
for a hypothetical machine called the Java Virtual Machine (JVM). The JVM executes
the byte code. The JVM was created to overcome the issue of portability: the code is
written and compiled for one machine, the Java Virtual Machine, and interpreted on
all machines.
Compiling and interpreting Java Source Code

During run time, the Java interpreter makes the byte code file behave as if it were
running on a Java Virtual Machine. In reality this could be an Intel Pentium machine
running Windows 95, a Sun SPARCstation running Solaris, or an Apple Macintosh
running its own system, and all of them could receive code from any computer through
the Internet and run the applets.
Simple
Java was designed to be easy for the professional programmer to learn and to use
effectively. If you are an experienced C++ programmer, learning Java will be even
easier, because Java inherits the C/C++ syntax and many of the object-oriented features
of C++. Most of the confusing concepts from C++ are either left out of Java or
implemented in a cleaner, more approachable manner. In Java there are a small number
of clearly defined ways to accomplish a given task.

Object-Oriented
Java was not designed to be source-code compatible with any other language. This
allowed the Java team the freedom to design with a blank slate. One outcome of this
was a clean, usable, pragmatic approach to objects. The object model in Java is simple
and easy to extend, while simple types, such as integers, are kept as high-performance
non-objects.
Robust
The multi-platform environment of the Web places extraordinary demands on a
program, because the program must execute reliably in a variety of systems. The ability
to create robust programs was given a high priority in the design of Java. Java is a
strictly typed language; it checks your code at compile time and at run time.
Java virtually eliminates the problems of memory management and deallocation, which
is completely automatic. In a well-written Java program, all run-time errors can and
should be managed by your program.

7.2.1 JSP (JavaServer Pages) Role in this Project

JSP (JavaServer Pages) is a server-side technology that helps us develop server-side
applications. In this project, JSP is used to build the user interface and to generate
graphs for the given time slots of data retrieval from the database.
JSP covers both the Model and the View of the MVC (Model View Controller) architecture.

7.2.2 Java Servlets Role in this Project

Java servlets play a vital role in the project: five classes are implemented in this
project, and all the business logic is implemented using servlets only. Servlets cover
the Controller part of MVC (Model View Controller).
A servlet is a Java program that handles HTTP requests and responses, and with it we
can extend the functionality of a web server.

Advantages of Servlets

No CGI limitations
Abundant third-party tools and web servers supporting servlets
Access to the entire family of Java APIs
Reliable, better performance and scalability
Platform and server independent
Secure
Most servers allow automatic reloading of servlets by administrative action

7.2.3 About MyEclipse 7.5


MyEclipse is an IDE for Java application development.

MyEclipse has the following features:

Browser-side tools (e.g., AJAX Web Browser, AJAX Monitor)
Server-side tools (e.g., Visual Designer for HTML, JSP, JSF, Struts)
Enterprise service tools (e.g., Web Services)
Database tools (e.g., DB Browser)
UML and XML tools (e.g., UML Designer)

An Eclipse-based product is structured as a collection of plug-ins. Each plug-in contains the
code that provides some of the product's functionality. The code and other files for a
plug-in are installed on the local computer, and get activated automatically as required.
A product's plug-ins are grouped together into features. A feature is a unit of separately
downloadable and installable functionality.

The fundamentally modular nature of the Eclipse platform makes it easy to install
additional features and plug-ins into an Eclipse based product, and to update the
product's existing features and plug-ins. You can do this either by using traditional
native installers running separately from Eclipse, or by using the Eclipse platform's own
update manager. The Eclipse update manager can be used to discover, download, and
install updated features and plug-ins from special web based Eclipse update sites.


7.3 Proposed Software Architecture

7.3.1 Sub System Decomposition


The system can be divided as follows

7.3.1.1 User Authentication Module

This subsystem checks the user name and password against the existing data in the
database. If both are verified, it opens the search engine page. There is also an
option to create new users.

7.3.1.2 Query processing system (QPS) Module


Figure 5.2 Query processing system (QPS) Module

To design the architecture of the QPS, we need to understand how a user query is
answered in the system. When a user inputs a web query, the query is sent to a
web server in the system, where it is represented by a URL containing the user's
keyword(s) and retrieval range. Here, the retrieval range specifies the rank range of
query-matching documents to be shown in the query result page returned to the user.
The retrieval range is shifted when the user clicks the next/previous page link given in
the current query result page. By parsing the query URL, the web server extracts the
keyword(s) and retrieval range, and then sends them to another query processing server.
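
The URL-parsing step can be sketched as follows. This is an illustration only: the
source does not give the URL layout, so the parameter names q, from, and to, the
default range of 1 to 10, and the class name QueryUrlSketch are all assumptions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: extracts keywords and a retrieval range from a query URL. */
final class QueryUrlSketch {
    final String[] keywords;
    final int from;   // first rank to show (assumed 1-based)
    final int to;     // last rank to show

    private QueryUrlSketch(String[] keywords, int from, int to) {
        this.keywords = keywords;
        this.from = from;
        this.to = to;
    }

    /** Parses e.g. "/search?q=web+cache&from=11&to=20" (hypothetical layout). */
    static QueryUrlSketch parse(String url) {
        int mark = url.indexOf('?');
        String query = mark < 0 ? "" : url.substring(mark + 1);
        Map<String, String> params = new HashMap<String, String>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(decode(pair.substring(0, eq)), decode(pair.substring(eq + 1)));
            }
        }
        String q = params.containsKey("q") ? params.get("q").trim() : "";
        String[] keywords = q.isEmpty() ? new String[0] : q.split("\\s+");
        // Assumed default: the first result page shows ranks 1 to 10.
        int from = params.containsKey("from") ? Integer.parseInt(params.get("from")) : 1;
        int to = params.containsKey("to") ? Integer.parseInt(params.get("to")) : 10;
        return new QueryUrlSketch(keywords, from, to);
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");   // also turns '+' into a space
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);     // UTF-8 is always supported
        }
    }
}
```

Clicking a next-page link would then simply produce a URL with a shifted from/to pair
for the same keywords.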

The query processing server is responsible for performing the equijoin and rank
evaluation. The equijoin selects the query-matching documents with respect to the given
keywords. For this, inverted files are used to identify the documents in which all of
the given keywords occur at least once. After the equijoin operation, rank evaluation is
performed to give rank scores to the query-matching documents according to the
documents' query relevancy.
At this point, our ranking system refers to various index data, including the
hyperlink analysis results, keyword occurrence positions, and HTML-tag related
information. This index data is also stored in inverted files along with the other data
used for the equijoin. From the equijoin and ranking steps, we obtain a set of <DID,
rank score> pairs, where DID stands for the document ID. By sorting them in decreasing
order of rank score, we can finally determine the set of DIDs pertaining to the user's
retrieval range.
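
The equijoin, ranking, and range selection can be sketched in a few lines of Java.
The sketch is hypothetical: the class EquiJoinRankSketch, the in-memory map standing
in for the inverted files, and the use of a summed per-keyword weight as the rank
score are all assumptions. The real system ranks with hyperlink analysis, keyword
positions, and HTML-tag information, which are not modeled here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: equijoin over an inverted index, then rank and slice. */
final class EquiJoinRankSketch {
    /** One <DID, rank score> pair. */
    static final class Scored {
        final int did;
        final double score;
        Scored(int did, double score) { this.did = did; this.score = score; }
    }

    // Stand-in for the inverted files: keyword -> (DID -> weight in that document).
    private final Map<String, Map<Integer, Double>> inverted =
            new HashMap<String, Map<Integer, Double>>();

    void post(String keyword, int did, double weight) {
        Map<Integer, Double> postings = inverted.get(keyword);
        if (postings == null) {
            postings = new HashMap<Integer, Double>();
            inverted.put(keyword, postings);
        }
        postings.put(did, weight);
    }

    /** Returns ranks [from, to) (zero-based) of documents containing every keyword. */
    List<Scored> search(List<String> keywords, int from, int to) {
        Map<Integer, Double> joined = null;
        for (String keyword : keywords) {
            Map<Integer, Double> postings = inverted.get(keyword);
            if (postings == null) {
                return new ArrayList<Scored>();   // one unmatched keyword empties the join
            }
            if (joined == null) {
                joined = new HashMap<Integer, Double>(postings);
            } else {
                // Equijoin on DID: keep only documents present in both sides.
                Map<Integer, Double> next = new HashMap<Integer, Double>();
                for (Map.Entry<Integer, Double> e : joined.entrySet()) {
                    Double weight = postings.get(e.getKey());
                    if (weight != null) {
                        next.put(e.getKey(), e.getValue() + weight);
                    }
                }
                joined = next;
            }
        }
        List<Scored> ranked = new ArrayList<Scored>();
        if (joined == null) {
            return ranked;                        // no keywords were given
        }
        for (Map.Entry<Integer, Double> e : joined.entrySet()) {
            ranked.add(new Scored(e.getKey(), e.getValue()));
        }
        // Decreasing rank score; ties broken by DID to keep the order deterministic.
        Collections.sort(ranked, new Comparator<Scored>() {
            @Override
            public int compare(Scored a, Scored b) {
                int byScore = Double.compare(b.score, a.score);
                return byScore != 0 ? byScore : Integer.compare(a.did, b.did);
            }
        });
        int lo = Math.min(Math.max(from, 0), ranked.size());
        int hi = Math.min(Math.max(to, lo), ranked.size());
        return new ArrayList<Scored>(ranked.subList(lo, hi));
    }
}
```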


7.3.1.3 Algorithm for Query processing steps of the coordinator server.


Step 1. Accept a query of keywords and a retrieval range from a web server.
Step 2. forall s ∈ S /* S is the set of ranker servers toward which the query will be sent. */
    Send keywords to s for equijoin and ranking operations.
    Receive the ranking results of <DID, rank score> from s, and merge them into R.
Step 3. Sort R in the decreasing order of rank scores and select the ranking results with
the DUP_K highest rank scores. Let Top-K be the selected subset.
Step 4. Divide Top-K into DUP_K/20 subsets, send every subset to a DST server,
and receive DSTs from them.
Step 5. Merge the DSTs returned from the DUP_K/20 DST servers. Let T be the
merged DSTs, which have the format <DID, DST> for each document.
Step 6. Perform duplication detection on T. Then delete the detected duplicates from R
as well as T, and let R' and T' be the modified ones, respectively.
Step 7. Make the query results using R' and T'.
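
Steps 2 to 4 can be sketched as below. This is a simplified, single-process
illustration rather than the project's implementation: the names CoordinatorSketch,
topK, and split are hypothetical, the ranker results are passed in as plain lists
instead of being fetched over the network, and reading "DUP_K/20 subsets" as subsets
of 20 documents each is an interpretation of Step 4.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: merge ranker results, keep the best DUP_K, split for DST servers. */
final class CoordinatorSketch {
    static final int SUBSET_SIZE = 20;   // implied by "DUP_K/20 subsets" in Step 4

    /** One <DID, rank score> pair returned by a ranker server. */
    static final class Ranked {
        final int did;
        final double score;
        Ranked(int did, double score) { this.did = did; this.score = score; }
    }

    /** Steps 2-3: merge all ranker results into R, sort, and keep the DUP_K best. */
    static List<Ranked> topK(List<List<Ranked>> perRanker, int dupK) {
        List<Ranked> r = new ArrayList<Ranked>();
        for (List<Ranked> part : perRanker) {
            r.addAll(part);
        }
        Collections.sort(r, new Comparator<Ranked>() {
            @Override
            public int compare(Ranked a, Ranked b) {
                return Double.compare(b.score, a.score);   // decreasing rank score
            }
        });
        return new ArrayList<Ranked>(r.subList(0, Math.min(dupK, r.size())));
    }

    /** Step 4: divide Top-K into subsets of 20, one subset per DST server. */
    static List<List<Ranked>> split(List<Ranked> topK) {
        List<List<Ranked>> subsets = new ArrayList<List<Ranked>>();
        for (int i = 0; i < topK.size(); i += SUBSET_SIZE) {
            int end = Math.min(i + SUBSET_SIZE, topK.size());
            subsets.add(new ArrayList<Ranked>(topK.subList(i, end)));
        }
        return subsets;
    }
}
```

Each subset would then be sent to a DST server, and the returned <DID, DST> pairs
merged and de-duplicated as in Steps 5 and 6.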

7.3.1.4 Multi-level Cache Module

We design a hierarchical cache scheme composed of four cache levels. In this
paper, we name those four levels CL1, CL2, CL3, and CL4, from the top level
to the bottom level. In the hierarchical cache, cache look-ups proceed in order from
CL1 to CL4.
CL1 is managed in the main memory of the web server in the QPS, and the cache
records of CL1 are compressed to save memory space. For fast access in CL1, a
hashing technique is used to determine the address of the cache record for a
given query URL. For this, we make a hash table mapping a query URL to an address
pointing to a memory bucket, where multiple fixed-size memory slots reside. The
web server first accesses the associated memory bucket and then explores the memory
slots within it in search of the matching cache record. If a matching cache record is
found, it is uncompressed for use. The uncompressed data is the HTML-coded
data to be transferred to the user's web browser to display a query result page. With
a cache hit on CL1, the user query is completed without any connection to the
coordinator server. Otherwise, a request for query processing is sent to a
coordinator server.

7.3.1.5 Algorithm for the CL1 cache (Level 1 Cache )


(1) h_key ← get_key(keywords, retrieval range).
(2) id ← h_key mod BUCKET_NO. /* bucket id */
(3) Let S be the set of cache slots in the bucket with identifier id.
(4) record = nil.
(5) foreach s ∈ S
(6)     if (s.h_key == h_key) then /* cache hit */
(7)         record ← pointer to s.
(8)         s.popularity += 1.
(9)     else
(10)        s.popularity -= 1.
(11) endfor
(12) if (record != nil) then
(13)     Uncompress the cache record saved in record for returning a search result page.
(14) else
(15)     Forward the user query to a coordinator server and
         receive the query result saved in q_result.
(16)     Select a free slot s from S and save the compressed q_result into s.data, and set
         s.popularity to 50, if the current result is to be cached.
(17) endif
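
The CL1 algorithm can be turned into a small runnable sketch. It follows the numbered
steps above under several stated assumptions that are not in the source: the
coordinator is a hypothetical callback interface, GZIP stands in for the unspecified
compression method, a slot counts as free when it is unused or its popularity has
dropped to zero or below, and the full query key is compared in addition to h_key to
guard against hash collisions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative only: an in-memory, bucketed CL1 cache with compressed records. */
final class Cl1CacheSketch {
    /** Hypothetical stand-in for the coordinator server. */
    interface Coordinator {
        String process(String keywords, String range);
    }

    private static final class Slot {
        boolean used;
        int hKey;
        String queryKey;
        byte[] data;
        int popularity;
    }

    static final int INITIAL_POPULARITY = 50;   // value given in step (16)

    private final Slot[][] buckets;
    private final Coordinator coordinator;

    Cl1CacheSketch(int bucketNo, int slotsPerBucket, Coordinator coordinator) {
        this.buckets = new Slot[bucketNo][slotsPerBucket];
        for (Slot[] bucket : buckets) {
            for (int i = 0; i < bucket.length; i++) {
                bucket[i] = new Slot();
            }
        }
        this.coordinator = coordinator;
    }

    String lookup(String keywords, String range) {
        String queryKey = keywords + "|" + range;
        int hKey = queryKey.hashCode();                          // step (1)
        int id = (hKey & 0x7fffffff) % buckets.length;           // step (2)
        Slot[] slots = buckets[id];                              // step (3)
        Slot record = null;                                      // step (4)
        for (Slot s : slots) {                                   // steps (5)-(11)
            if (!s.used) {
                continue;
            }
            if (s.hKey == hKey && s.queryKey.equals(queryKey)) {
                record = s;                                      // cache hit
                s.popularity += 1;
            } else {
                s.popularity -= 1;
            }
        }
        if (record != null) {                                    // steps (12)-(13)
            return uncompress(record.data);
        }
        String qResult = coordinator.process(keywords, range);   // step (15)
        for (Slot s : slots) {                                   // step (16)
            if (!s.used || s.popularity <= 0) {                  // assumed meaning of "free"
                s.used = true;
                s.hKey = hKey;
                s.queryKey = queryKey;
                s.data = compress(qResult);
                s.popularity = INITIAL_POPULARITY;
                break;
            }
        }
        return qResult;   // if no slot was free, the result is simply not cached
    }

    private static byte[] compress(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            GZIPOutputStream gzip = new GZIPOutputStream(bytes);
            gzip.write(text.getBytes("UTF-8"));
            gzip.close();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    private static String uncompress(byte[] data) {
        try {
            GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            gzip.close();
            return out.toString("UTF-8");
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because every miss on a bucket lowers the popularity of the records stored there, rarely
requested results gradually become eligible for replacement, while each hit pushes a
record's popularity back up.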


8. Implementation
8.1 Implementation of Database Connectivity Class
Class Name : Database.java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Data-access helper: opens an ODBC connection and runs the login,
// registration, and search queries of the application.
public class Database {
    Connection con;
    Statement st;
    ResultSet rs;
    String s;        // result of the last login check ("Valid" / "Invalid")
    Vector data;
    Vector subject;
    int slno = 1;    // next serial number, see getSlno()

    public Database() {
        try {
            createCon();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Connects to the ODBC data source named "search".
    public void createCon() {
        try {
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
            con = DriverManager.getConnection("jdbc:odbc:search");
            st = con.createStatement();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

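    // Returns "Valid" if a login row matches the given user name and password,
    // otherwise "Invalid". The connection is closed afterwards, so a Database
    // object serves a single call.
    public String check(String id, String pwd) {
        try {
            // Bind the values as parameters instead of concatenating them
            // into the SQL text, which would allow SQL injection.
            java.sql.PreparedStatement ps = con.prepareStatement(
                    "select * from login where username=? and password=?");
            ps.setString(1, id);
            ps.setString(2, pwd);
            rs = ps.executeQuery();
            if (rs.next()) {
                s = "Valid";
            } else {
                s = "Invalid";
            }
            con.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return s;
    }

    // Starts an asynchronous insert of one record; see the Insert class.
    public boolean insertToDB(Vector vec) {
        return new Insert().reply(vec);
    }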

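    // Background thread that inserts one four-column row into the "for" table.
    class Insert extends Thread {
        Vector vec;

        Insert() {
        }

        // Stores the row and starts the thread. Always returns true, because
        // the insert has not yet run when this method returns.
        public boolean reply(Vector vec) {
            this.vec = vec;
            start();
            return true;
        }

        public void run() {
            try {
                st.executeUpdate("insert into for values('" + vec.get(0)
                        + "','" + vec.get(1) + "','" + vec.get(2)
                        + "','" + vec.get(3) + "')");
                con.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }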

public String getSlno() {


try {
rs = st.executeQuery("select * from for ");
while (rs.next()) {
slno = rs.getInt("sno") + 1;
}
con.close();
} catch (Exception e) {
e.printStackTrace();
}
return String.valueOf(slno);
}
public boolean insertToLogin(Vector<String> vec) {
boolean ret = false;
try {
st.executeUpdate("insert into login values('" + vec.get(0) + "','"
+ vec.get(1) + "','" + vec.get(2) + "','" + vec.get(3)
+ "','" + vec.get(4) + "')");
ret = true;
con.close();
} catch (Exception e) {
e.printStackTrace();
}
return ret;
}
public boolean chkUserNme(String usrNme) {
boolean result = true;
try {
rs = st.executeQuery("select username from login where username='"
+ usrNme + "'");
if (rs.next()) {
result = false;
}
con.close();
} catch (Exception e) {
e.printStackTrace();
}
return result;
}
public Vector<String> getRecord(String srchstr) {
Vector<String> retv = new Vector<String>();

try {
rs = st.executeQuery("select * from datas");
String title;
String data;
long btime = 0;
long atime = 0;
long time = 0;
while (rs.next()) {
btime = System.nanoTime();
System.out.println("btime:" + btime);
title = rs.getString("title");
data = rs.getString("data");
if (compareRegularEx(srchstr, data)) {
atime = System.nanoTime();
System.out.println("atime:" + atime);
time = atime - btime;
System.out.println("time:" + time);
retv.add(title + ":" + data + ":" + time);
}
}
con.close();
} catch (Exception e) {
e.printStackTrace();
}
return getDb2(retv, srchstr);
}
private Vector<String> getDb2(Vector<String> retv, String srchstr) {

try {
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con = DriverManager.getConnection("jdbc:odbc:db2");
st = con.createStatement();
} catch (Exception e) {
e.printStackTrace();
}
try {
rs = st.executeQuery("select * from datas");
String title;
String data;
long btime = 0;
long atime = 0;
long time = 0;
while (rs.next()) {
btime = System.nanoTime();
System.out.println("btime:" + btime);
title = rs.getString("title");

data = rs.getString("data");
if (compareRegularEx(srchstr, data)) {
atime = System.nanoTime();
System.out.println("atime:" + atime);
time = atime - btime;
System.out.println("time:" + time);
retv.add(title + ":" + data + ":" + time);
}
}
con.close();
} catch (Exception e) {
e.printStackTrace();
}
return retv;
}
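private boolean compareRegularEx(String srchstr, String text) {
// Returns true if any space-separated keyword occurs in the text.
String[] srch = srchstr.split(" ");
boolean result = false;
for (int i = 0; i < srch.length; i++) {
// Quote the keyword so that user input such as "(" or "*" is matched
// literally instead of being compiled as a regular expression.
Pattern p = Pattern.compile(Pattern.quote(srch[i]));
Matcher m = p.matcher(text);
if (m.find()) {
result = true;
}
}
return result;
}
}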

8.2 Implementation of Util Class
Class Name : Util.java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Properties;
import java.util.Vector;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;
public class Util {
TextClassification datas;
public Util() {
// TODO Auto-generated constructor stub
datas = new TextClassification();
}
public String getDBLoc(String path) {
// TODO Auto-generated method stub
String ret = "";
try {
Properties properties = new Properties();
FileInputStream fis = new FileInputStream(path
+ "/dataloc.properties");
properties.load(fis);
ret = properties.getProperty("loc");
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return ret;
}
public void insertData(PrintWriter out, HttpServletRequest request,
HttpServletResponse response) {
// TODO Auto-generated method stub
try {
String uname = request.getParameter("username");
String pwd = request.getParameter("password");

String fname = request.getParameter("fname");


String lname = request.getParameter("lname");
String email = request.getParameter("email");
Vector<String> data = new Vector<String>();
data.add(uname);
data.add(pwd);
data.add(fname);
data.add(lname);
data.add(email);
if (new Database().chkUserNme(uname)) {
if (new Database().insertToLogin(data)) {

request.getRequestDispatcher("/search.jsp").include(
request, response);
out.println("<html><head><script type='text/javascript'>"
+ "function logout()" + "{" + "alert('" + uname
+ " Welcome to Wikipedia Text Classification.');"
+ "}" + "</script>" + "</head>"
+ "<body onload='logout()'>" + "</body>"
+ "</html>");

response.sendRedirect(request.getRequestURL().substring(0,
request.getRequestURL().lastIndexOf("servlet"))
+ "home.jsp");
}
} else {
request.getRequestDispatcher("/new_user.jsp").include(request,
response);
out.println("<html><head><script type='text/javascript'>"
+ "function logout()"
+ "{"
+ "alert('Username already Registered. Please choose a different user name');"
+ "}" + "</script>" + "</head>"
+ "<body onload='logout()'>" + "</body>"
+ "</html>");
}

} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServletException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public void login(PrintWriter out, HttpServletRequest request,
HttpServletResponse response) {
// TODO Auto-generated method stub
try {
String usr = request.getParameter("username");
String pwd = request.getParameter("password");
if (new Database().check(usr, pwd).equals("Valid")) {

response.sendRedirect(request.getRequestURL().substring(0,
request.getRequestURL().lastIndexOf("servlet"))
+ "home.jsp");
} else {
request.getRequestDispatcher("/login.jsp").include(request,
response);
out.println("<table align=center cellspacing=3 cellpadding=3 "
+ "style='BORDER-RIGHT: red 2px solid; BORDER-TOP: red 2px solid; "
+ "BORDER-LEFT: red 2px solid; BORDER-BOTTOM: red 2px solid'>"
+ "<tr><td>Invalid Username and Password</td></tr></table>");
}
} catch (ServletException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public void search(PrintWriter out, HttpServletRequest request,
HttpServletResponse response) {
try {
String te = request.getRealPath("/");

te = te.substring(0, te.indexOf("."));
String s = request.getContextPath();
System.out.println("te+s:" + te + s);
Vector<String> resultVec = datas.getData(getDBLoc(te + s),
request
.getParameter("txtsearch"));
HttpSession ses = request.getSession(true);
ses.setAttribute("Search_Res", resultVec);
HttpSession chart = request.getSession(true);
ses.setAttribute("Res_Chart", resultVec);
response.sendRedirect(request.getRequestURL().substring(0,
request.getRequestURL().lastIndexOf("servlet"))
+ "search_Res.jsp");
out.println(resultVec);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public void upload(PrintWriter out, HttpServletRequest request,
HttpServletResponse response) {
// TODO Auto-generated method stub
try {
String te = request.getRealPath("/");
te = te.substring(0, te.indexOf("."));
String s = request.getContextPath();
String path = request.getParameter("txtbrowse");
if (new Loading().upload(te + s, path)) {

request.getRequestDispatcher("/Welcome.jsp").include(request,
response);
out.println("<table align=center cellspacing=3 cellpadding=3 "
+ "style='BORDER-RIGHT: 2px solid; BORDER-TOP: 2px solid; "
+ "BORDER-LEFT: 2px solid; BORDER-BOTTOM: 2px solid'>"
+ "<tr><td>File Uploaded Successfully.</td></tr></table>");
} else {
// A RequestDispatcher path must be relative to the context root, not a full URL.
request.getRequestDispatcher("/Welcome.jsp").include(request, response);
out.println("<table align=center cellspacing=3 cellpadding=3 "
+ "style='BORDER-RIGHT: red 2px solid; BORDER-TOP: red 2px solid; "
+ "BORDER-LEFT: red 2px solid; BORDER-BOTTOM: red 2px solid'>"
+ "<tr><td>File Upload Failed.</td></tr></table>");
}
} catch (ServletException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}

8.3 Implementation of TextClassification Class
Class Name : TextClassification.java
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TextClassification {
int maxcnt = 0;
String path = "";
String finalstr = "";
public TextClassification() {
// TODO Auto-generated constructor stub
}
public Vector<String> getData(String path, String srchstr) {
// TODO Auto-generated method stub
Vector<String> retV = new Vector<String>();
return new Database().getRecord(srchstr);
}
private boolean compareRegularEx(String srchstr, String text, String path) {
String[] srch = srchstr.split(" ");
boolean result = false;
int cnt = 0;
for (int i = 0; i < srch.length; i++) {
Pattern p = Pattern.compile(srch[i]);
Matcher m = p.matcher(text);
if (m.find()) {
result = true;
cnt++;
}
}
System.out.println("cnt:" + cnt);
if (maxcnt < cnt) {
maxcnt = cnt;
this.path = path;
this.finalstr = text;
System.out.println("cPath:" + path);
}
return result;
}
}

8.4 Implementation of Loading Class
Class Name : Loading.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
public class Loading {
Util tools;
public Loading() {
// TODO Auto-generated constructor stub
tools = new Util();
}
public boolean upload(String path, String uploadpath) {
// TODO Auto-generated method stub
boolean flag = true;
try {
File f = new File(uploadpath);
// Read the whole file; a single read() call may return fewer bytes than requested.
byte[] b = new byte[(int) f.length()];
FileInputStream fis = new FileInputStream(f);
int off = 0;
while (off < b.length) {
int n = fis.read(b, off, b.length - off);
if (n < 0) {
break;
}
off += n;
}
fis.close();
String fipt = tools.getDBLoc(path);
// Use the platform separator instead of a hard-coded backslash.
String subpath = File.separator + f.getName();
FileOutputStream fos = new FileOutputStream(fipt + subpath, true);
fos.write(b, 0, off);
fos.close();
flag = true;
} catch (Exception e) {
// TODO Auto-generated catch block
flag = false;
e.printStackTrace();
}
return flag;
}
}

8.5 Implementation of Controller Class (Client Servlet)
Class Name : Controller.java
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
public class Controller extends HttpServlet {
private static final long serialVersionUID = 1L;
Util tools = new Util();
public Controller() {
super();
}
public void destroy() {
super.destroy();
}
public void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
response.setContentType("text/html");
PrintWriter out = response.getWriter();
out.flush();
out.close();
}
public void doPost(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
response.setContentType("text/html");
PrintWriter out = response.getWriter();

String status = request.getParameter("status");
// Compare with the constant on the left so that a missing "status"
// parameter (null) does not raise a NullPointerException.
if ("search".equals(status)) {
tools.search(out, request, response);
} else if ("login".equals(status)) {
tools.login(out, request, response);
} else if ("upload".equals(status)) {
tools.upload(out, request, response);
} else if ("newuser".equals(status)) {
tools.insertData(out, request, response);
}
out.flush();
out.close();
}
public String getServletInfo() {
return "";
}
public void init() throws ServletException {
// Put your code here
}
}

9. Testing

Software testing is the process used to help identify the correctness, completeness, security, and quality of developed computer software.

Testing is a process of technical investigation, performed on behalf of stakeholders, that is intended to reveal quality-related information about the product with respect to the context in which it is intended to operate.

Software testing should be distinguished from the separate discipline of Software Quality Assurance (SQA), which encompasses all business process areas, not just testing.

9.0.1 White-box and black-box testing

White-box and black-box testing are terms used to describe the point of view a test engineer takes when designing test cases. Black-box testing takes an external view of the test object; white-box testing takes an internal view.

To achieve consistency in testing style, it is imperative to have and follow a set of testing principles. This enhances the efficiency of testing among SQA team members and thus contributes to increased productivity. The purpose of this section is to provide an overview of testing and its techniques.

Software testing in the SDLC is carried out at the following levels:

Unit testing: each unit (basic component) of the software is tested to verify that the detailed design for the unit has been correctly implemented.

Integration testing: progressively larger groups of tested software components, corresponding to elements of the architectural design, are integrated and tested until the software works as a whole.

System testing: the software is integrated into the overall product and tested to show that all requirements are met.

Acceptance testing: the test on which acceptance of the complete software is based. The clients often do this.

Regression testing: earlier successful tests are repeated to ensure that changes made to the software have not introduced new bugs or side effects.

9.0.2 Test levels

Unit testing is performed by the programmers on the smallest software components, sub-components, or modules.

Integration testing exposes defects in the interfaces and interactions between integrated components (modules).

Functional testing tests the product according to programmable work.

System testing tests an integrated system to verify and validate that it meets its requirements.

Acceptance testing can be conducted by the client. It allows the end user, customer, or client to decide whether or not to accept the product. Acceptance testing may be performed after the testing and before the implementation phase.
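
As an illustration of the unit level described above, the following sketch checks a keyword-matching routine in isolation, without the servlet or the database. The class KeywordMatchCheck, its method matchesAnyKeyword, and the sample text are hypothetical; the routine only mirrors the idea of the project's compareRegularEx method and is not the project code.

```java
import java.util.regex.Pattern;

// Hypothetical unit-level check of a keyword matcher; names are illustrative.
class KeywordMatchCheck {

    // Returns true if at least one whitespace-separated keyword occurs in the text.
    static boolean matchesAnyKeyword(String srchstr, String text) {
        for (String word : srchstr.trim().split("\\s+")) {
            if (!word.isEmpty()
                    && Pattern.compile(Pattern.quote(word)).matcher(text).find()) {
                return true;
            }
        }
        return false;
    }

    // Fails loudly when an expectation does not hold.
    static void check(boolean condition, String name) {
        if (!condition) {
            throw new AssertionError("failed: " + name);
        }
    }

    public static void main(String[] args) {
        // Assumed sample record; not taken from the project's database.
        String text = "A survey of research on data mining for mobile devices";
        check(matchesAnyKeyword("research mining", text), "two keywords");
        check(matchesAnyKeyword("mobile", text), "single keyword");
        check(!matchesAnyKeyword("sfdsadfsdhdf", text), "no record found");
        check(!matchesAnyKeyword("", text), "empty query");
        System.out.println("all checks passed");
    }
}
```

The keywords are borrowed from the search test cases in this chapter; the point is that each expectation is verified against one small unit before the unit is integrated with the rest of the system.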

9.0.3 A sample Test Cycle

Requirements Analysis: Testing should begin in the requirements phase of the software development life cycle.
o During the design phase, testers work with developers to determine which aspects of a design are testable and under what parameters those tests work.

Test Planning: Test strategy, test plan(s), and test bed creation.

Test Development: Test procedures, test scenarios, test cases, and test scripts to use in testing the software.

Test Execution: Testers execute the software based on the plans and tests and report any errors found to the development team.

Test Reporting: Once testing is completed, testers generate metrics and make final reports on their test effort and on whether or not the software tested is ready for release.

Retesting the Defects

9.1 Test Cases for LOGIN Module

9.1.1 Test case of New User Login Forms

9.1.2 Test case of Existing User Login Forms

9.2 Test Cases with Search Engine Interface Forms

9.2.1 Search Engine Interface Form

9.2.2 Test case with the keyword 1985 in Search Engine Interface Form

9.2.3 Test case with the keyword DMW in Search Engine Interface Form

9.2.4 Test case with the keyword TPO in Search Engine Interface Form

9.2.5 Test case with the keyword mobile in Search Engine Interface Form

9.2.6 Test case with the keywords research mining in Search Engine Interface Form

9.2.7 Test case with the keyword sfdsadfsdhdf in Search Engine Interface Form (No Record Found case)

9.3 Test Cases with Time Performance Analysis of Web Search Engine

9.3.1 First time applying the query, without cache, in Search Engine Interface

9.3.1.1 Time Performance Ratio graph for the above result

9.3.2 Second time applying the same query, with cache, in Search Engine Interface

9.3.2.1 Time Performance Ratio graph for the above result

10. Performance Analysis
The results and graphs above show that the query took 196,673 nanoseconds the first time it was applied, whereas the same query took only 130,464 nanoseconds when it was applied a second time. We therefore conclude that using a cache on the server side can reduce the response time of a typical web search engine.
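
The size of the improvement follows from simple arithmetic on the two timings reported above. The sketch below computes the relative reduction and the speedup ratio; the class and method names (TimingComparison, reductionPercent, speedup) are hypothetical and serve only as an illustration.

```java
// Hypothetical helper for comparing two measured response times.
class TimingComparison {

    // Percentage by which the cached time is lower than the uncached time.
    static double reductionPercent(long uncachedNanos, long cachedNanos) {
        return 100.0 * (uncachedNanos - cachedNanos) / uncachedNanos;
    }

    // How many times faster the cached query is.
    static double speedup(long uncachedNanos, long cachedNanos) {
        return (double) uncachedNanos / cachedNanos;
    }

    public static void main(String[] args) {
        long first = 196673L;  // first query, without cache (ns)
        long second = 130464L; // same query, with cache (ns)
        System.out.println("reduction (%): " + reductionPercent(first, second));
        System.out.println("speedup (x):   " + speedup(first, second));
    }
}
```

For the reported values this gives a reduction of roughly 34% and a speedup of about 1.5. Both figures rest on a single pair of measurements, so repeating the query several times and averaging would give a steadier estimate.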

11. Future Enhancement

In this project only one level of cache is used. This can be extended to multilevel cache mechanisms, and the analysis method can be extended to a large-scale web search engine in the future.

12. Conclusion

The main objective of this project is to design a query processing system with a cache-based technique that reduces the response time of a typical web search engine. Graphs are generated to analyze the performance of the web search engine.

This model can be applied to any large-scale web search engine.

13. References
[1] Search Engine Report, http://www.searchenginewatch.com, 2005.

[2] Soumen Chakrabarti, Kunal Punera, and Mallela Subramanyam. Accelerated Focused Crawling through Online Relevance Feedback. In Proc. of the 11th International Conf. on World Wide Web, pp. 148-159, 2002.

[3] Sriram Raghavan and Hector Garcia-Molina. Crawling the Hidden Web. In Proc. of the VLDB Conference, pp. 129-138, 2001.

[4] Andrei Z. Broder, Marc Najork, and Janet L. Wiener. Efficient URL Caching for World Wide Web Crawling. In Proc. of the 12th WWW Conference, Budapest, Hungary, 2003.

[5] Maxim Lifantsev and Tzi-cker Chiueh. I/O-Conscious Data Preparation for Large-Scale Web Search Engines. In Proc. of the 28th VLDB Conf., Hong Kong, 2002.

[6] Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. Building a Distributed Full-Text Index for the Web. In Proc. of the 10th International World Wide Web Conference, pp. 396-406, 2001.

APPENDIX
Glossary

A
Access : The right to use resources.
Algorithm : A series of steps to complete a task.
Analysis : Examining things and predicting the result.

B
Bandwidth : The amount of information that can be sent between computers through a telephone wire. Bandwidth can also be defined as the range of frequencies used to send data from one place to another over telephone lines.
Broadband : A system that enables many messages or large amounts of information to be sent from one place to another quickly.

C
Cache : A hidden store of things.
Cache Memory : A type of computer memory in which information that is often in use can be stored temporarily and accessed quickly.
Coordinator Server : The coordinator server receives user queries via web servers and performs two-phase query processing in coordination with ranker servers and DST (Document Summarizing Text) servers. During the first phase, the coordinator server sends a query to four ranker servers at once.
Cluster : A group of similar things that are chosen together to perform a given task.

D
DID : Document ID
DST Server : DST stands for Document Summarizing Text. DST servers create DST data for each received DID and return it to the coordinator server. For this, the DST server stores URLs, titles, and tag-free body text of all the crawled web documents on disk, and uses a hash scheme to read each of them. By merging the DSTs, the coordinator server finishes the second phase.
DUP_K : Duplication Key

E
Electronic Documents : All web pages can be called electronic documents, since they can be connected together electronically.

F
Firewall : A specialized device or program that stops people from accessing a computer without permission while it is connected to the Internet.

G
Generate HTML document : A Java program can generate HTML code for a client; such a Java program is called a Servlet.

H
Hash Key : A hash key is a phrase used to extract a value from a hash table.
Hash Table : A data structure that stores key-value pairs.

I
IDC Port : IDC stands for Internet Data Center. An IDC port is used to provide an Internet connection to a computer.
Internet : The Internet is a collection of networks that works as an information highway.

L
Layer 4 Switch : A switch based on the OSI "transport" layer, which allows for policy-based switching (for example, limiting different types of traffic on specific end-user switch ports, or prioritizing certain packet types, such as database or application server traffic).
Load Balancer : The Layer 4 switch works as a load balancer and dispatches user queries to four web servers in round-robin fashion. The performance monitor repeatedly gathers performance statistics such as response times, the rate of incoming queries, and server workloads. If any problem is detected, it sends a warning message to the administrator.

M
Monitoring : Keeping watch on a display screen to track information.

N
Network : A large system consisting of many similar parts that are connected together to allow movement or communication between or along the parts, or between the parts and a control centre.
Node : A computer within a network is called a node.

O
Organize : To make arrangements for something to happen.

P
Process : A series of actions taken in order to achieve a result.
Performance : A metric that measures how well the system works.

Q
Query : A statement that expects some result.
Query Process : A series of actions, carried out by computer software, that work together to produce a result.
Query Processing System : Software that performs query processing and generates the result.

R
Ranker : A program that decides the rank of a web page according to the duplicate word count or word frequency of the page.
Ranker Server : The ranker server calculates a rank score for every DID (Document ID) selected from the equi-join, and thus it has to read additional index data such as keyword occurrence positions and HTML-tag related data.

Request : A request is generally generated by a web browser.

Response : A response is generally generated by a web server.

S
Server : In computing, a server is any combination of hardware or software designed to provide services to clients. When used alone, the term typically refers to a computer which may be running a server operating system, but it is also used to refer to any software or dedicated hardware capable of providing services.

Service : A service can be any web service in the computing environment.

System : A collection of functional units that collectively provide some services.

T
Thread: A thread is a sequence of executing instructions that can run independently of
other threads yet can directly share data with other threads. Java is a multithreaded
language.

Threads resemble independent agents which are at your disposal. You give each one a
list of instructions (method calls) and send it on its way. Each agent works on its own
list of instructions until they are finished or it is told to stop. Thus a thread resembles a
process. Sometimes they are referred to as "lightweight processes".

U
Usability : Usability is a term used to denote the ease with which people can employ a
particular tool or other human-made object in order to achieve a particular goal.
Usability can also refer to the methods of measuring usability and the study of the
principles behind an object's perceived efficiency or elegance.

Usability is a qualitative attribute that assesses how easy user interfaces are to use. The
word "usability" also refers to methods for improving ease-of-use during the design
process. Usability consultant Jakob Nielsen and computer science professor Ben
Shneiderman have written (separately) about a framework of system acceptability,
where usability is a part of "usefulness" and is composed of

Learnability: How easy is it for users to accomplish basic tasks the first time
they encounter the design?

Efficiency: Once users have learned the design, how quickly can they perform
tasks?

Memorability: When users return to the design after a period of not using it, how
easily can they re-establish proficiency?

Errors: How many errors do users make, how severe are these errors, and how
easily can they recover from the errors?

Satisfaction: How pleasant is it to use the design?

W
Web Application : In software engineering, a web application is an application that is
accessed via a web browser over a network such as the Internet or an intranet. The term
may also mean a computer software application that is hosted in a browser-controlled
environment (e.g. a Java applet) or coded in a browser-supported language
(such as JavaScript, combined with a browser-rendered markup language like HTML)
and reliant on a common web browser to render the application executable.

Web Browser : A web browser is a software application for retrieving, presenting, and
traversing information resources on the World Wide Web. An information resource is
identified by a Uniform Resource Identifier (URI) and may be a web page, image,
video, or other piece of content. Hyperlinks present in resources enable users to easily
navigate their browsers to related resources.

Web Server : A Web server is a computer program that delivers (serves) content, such
as Web pages, using the Hypertext Transfer Protocol (HTTP), over the World Wide
Web. The term Web server can also refer to the computer or virtual machine running
the program. In large commercial deployments, a server computer running a Web server
can be rack-mounted in a server rack or cabinet with other servers to operate a Web
farm.

Web Search Engine : A web search engine is designed to search for information on
the World Wide Web. The search results are usually presented in a list of results and are
commonly called hits. The information may consist of web pages, images, information
and other types of files. Some search engines also mine data available in databases or
open directories. Unlike Web directories, which are maintained by human editors,
search engines operate algorithmically or are a mixture of algorithmic and human input.
