Table of Contents
CERTIFICATE .............................................................................................................. i
ACKNOWLEDGEMENT ............................................................................................ ii
ABSTRACT ................................................................................................................ iii
LIST OF FIGURES .................................................................................................. viii
1. INTRODUCTION ................................................................................................................ 1
2. AIM OF THE PROJECT .................................................................................................... 3
3. APPLICATION OF THE PROJECT ................................................................................. 3
4. LITERATURE SURVEY ..................................................................................................... 5
4.1 What is a Search Engine? ........................................................................... 5
4.2 Search engine survey .................................................................................................... 5
4.3 Basic Concepts ............................................................................................................. 5
4.4 Types of Search Engines .............................................................................................. 8
4.4.1 General Search Engines ................................................................................ 8
4.4.2 Meta search Engines .................................................................................... 9
4.4.3 Media Search Engines .................................................................................. 10
4.4.4 Genre Oriented Search Engines .................................................................... 11
4.4.5 Defunct Search Engines ............................................................................... 11
4.5 How Search Engine Works ......................................................................................... 11
4.5.1 Basic Building Blocks of Search Engine ...................................................... 12
4.5.1.1 Web Crawling .................................................................................. 12
4.5.1.2 Building the Index ............................................................................ 14
4.5.1.3 Building a Search ............................................................................. 16
5. FEASIBILITY STUDY ...................................................................................................... 18
5.1 Economic Feasibility ................................................................................................. 18
5.2 Technical Feasibility ................................................................................................. 18
9.3 Test cases with Time Performance Analysis of Web Search Engine .......................... 81
9.3.1 First Time applying Query without Cache of Search Engine Interface ........ 81
9.3.1.1 Time Performance Ratio graph for the above result .................................. 81
9.3.2 Second Time applying Same Query with Cache of Search Engine Interface ... 82
9.3.2.1 Time Performance Ratio graph for the above result .................................. 82
10. PERFORMANCE ANALYSIS......................................................................................... 83
11. FUTURE ENHANCEMENT ............................................................................................ 84
12. CONCLUSION ................................................................................................................. 86
13. REFERENCES ................................................................................................................. 88
APPENDIX ....................................................................................................................... 89
GLOSSARY ...................................................................................................................... 89
LIST OF FIGURES

Figure No    Page No
1.1          13
1.2          34
2.1          35
2.2          36
3.1          37
3.2          38
4.1          39
4.2          40
4.3          40
4.4          40
4.5          41
5.1          42
5.2          51
1. Introduction
Nowadays, the web search engine is widely used as a common way to find
information of interest, and the number of indexed documents has reached the scale
of multiple billions. A web search engine is software that indexes web documents
collected from the Internet and ranks them according to their relevance to an entered
user query. Much research has been done to solve various problems related to web
search engines, such as crawling web documents, high-performance indexing,
hyperlink analysis, and topic-sensitive searching. However, there is not enough
information about how to implement a query processing system suitable for
large-scale web search engines. Since the system has to index a huge amount of data,
the cost of yielding a query result can be very high.
This project mainly focuses on the QPS (Query Processing System) of a large-scale
web search engine. We first design a distributed architecture for the QPS. Since the
amount of CPU and I/O resources used for query processing is so large, more than
one server has to work in parallel even to process a single query. To make such
cooperation efficient, the QPS is designed as clustered servers, and the server clusters
are connected to each other via high-speed LANs. Next, we describe the hierarchical
cache scheme, which is the main topic of this project. The cache scheme is devised to
hold cache data at four hierarchical levels. In the top-level cache, recent search result
pages are stored in main memory, and the remaining lower levels of cache reside on
disk to save more query results. Using the multi-level caches, we can save 70% of the
server cost for query processing. In this way, our system indexes 65 million web
documents and can answer 5 million user queries per day at low cost.
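The hierarchical cache idea above can be sketched as a small two-level cache (the real system uses four levels, with the lower levels on disk); the class and field names here are illustrative assumptions for the sketch, not taken from the project's code:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative two-level result cache: a small, access-ordered in-memory
// level for recent result pages, backed by a larger "disk" level (simulated
// here with a plain map). On eviction, a page is demoted rather than lost.
public class TieredResultCache {
    private final int memCapacity;
    private final Map<String, String> disk = new HashMap<>(); // lower level: larger, slower
    private final LinkedHashMap<String, String> memory;       // top level: recent pages

    public TieredResultCache(int capacity) {
        this.memCapacity = capacity;
        // access order = true makes the map evict least-recently-used entries
        this.memory = new LinkedHashMap<String, String>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                if (size() > memCapacity) {
                    disk.put(eldest.getKey(), eldest.getValue()); // demote to disk level
                    return true;
                }
                return false;
            }
        };
    }

    // Returns the cached result page, or null if full query processing is needed.
    public String get(String query) {
        String page = memory.get(query);
        if (page == null) {
            page = disk.get(query);      // lower-level hit: promote back to memory
            if (page != null) put(query, page);
        }
        return page;
    }

    public void put(String query, String resultPage) {
        memory.put(query, resultPage);
    }
}
```

A query result that misses both levels would trigger full query processing, after which the freshly built page is inserted at the top level.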
4. Literature Survey
A literature survey has been carried out on search engines and the related
mechanisms of a typical search engine.
This survey is part of an ongoing effort to cover some of these search engines (in
particular those targeted towards locating information) and the multifarious data
repositories and information spaces they allow people to search.
Search engines can be thought of as mapmakers in information space. They explore the
landscape of information and create maps in the shape of internal structures that are
supposed to help travelers find their way in the chaotic warehouse of superabundant
information that most people see when they are first confronted with the World Wide
Web.
Not all search engines map the same information space. Some search engines map the
resources that can be downloaded from open ftp (file transfer protocol) repositories
around the globe; others map the resources that are resident on the World Wide Web
(technically, this means that they map resources visible through http, the hypertext
transfer protocol, since this protocol is what technically defines which part of the
Internet landscape belongs to the World Wide Web). A third protocol that also defines a
clearly delineated information space is nntp (network news transfer protocol), which
technically defines what is known as "network news" (not to be confused with what
constitutes "news" in "old media" such as television and periodicals) or, more
descriptively, "Usenet discussion groups". To make things even more complicated, some
search engines explore information spaces delineated not only by technical protocol but
also by genre. For instance, they may monitor and map breaking news (in the "old
media" sense) distributed by wire services and/or news-oriented media on the World
Wide Web.
In the survey, I've indicated this by creating categories accordingly. There are three
clear-cut categories corresponding to technical protocol (web/http, usenet/nntp and
ftp/ftp). In addition, some search engines provide access to proprietary data from
sources outside what is openly available on the Internet.
Being aware of which information space a particular engine maps is crucial if you want
to make efficient use of the engine. If you are looking for a specific company's home
page on the World Wide Web, you have a much higher chance of success if you use a
search engine that maps the World Wide Web, rather than one that lets you search the
information space made up of newswire telegrams and newspaper articles.
at the place of origin (the non-hosted approach); in addition, it retains a local copy of the page
as it was originally downloaded and analyzed (the hosted approach). This local copy can be
used either if the original page is no longer available, or if it has changed so much since the time
it was analyzed by Google that the relevance ranking of the page no longer applies.
The AskJeeves site accepts questions in plain English: How do I scan photographs?
Where can I find recipes for apple pie? AskJeeves compares your question with its
internal knowledge base of questions and answers (compiled by human editors), and
finds those closest matching your question. You are then presented with a list of
questions it knows how to answer, and should pick one of these (if it still bears
resemblance to your original question). This question is then transformed into a set of
appropriate search requests and submitted to a number of search engines as well as (if
appropriate) reference works such as the Encyclopedia Britannica.
Go2Net Search Engine :
AltaVista also features a media finder that lets users search by keyword for particular
media types, such as images, video or audio files. A search results in a pointer to the
media file and a presentation of derived metadata, such as file format and file size. For
images and videos, the results page includes thumbnail illustrations. In addition to
searching for media files on the web, AltaVista lets users search for premium media
content sold through partners.
Searches for audio files in the MP3 file format in open ftp archives.
They search the Internet -- or select pieces of the Internet -- based on important
words. They keep an index of the words they find, and where they find them. And
they allow users to look for words or combinations of words found in that index.
Early search engines held an index of a few hundred thousand pages and documents,
and received maybe one or two thousand inquiries each day. Today, a top search engine
will index hundreds of millions of pages, and respond to tens of millions of queries per
day.
There are three basic building blocks of a search engine, namely:
Web Crawling
Building the Index
Building a Search
Before a search engine can tell you where a file or document is, it must be found. To
find information on the hundreds of millions of Web pages that exist, a search engine
employs special software robots, called spiders, to build lists of the words found on
Web sites. When a spider is building its lists, the process is called Web crawling.
(There are some disadvantages to calling part of the Internet the World Wide Web -- a
large set of arachnid-centric names for tools is one of them.) In order to build and
maintain a useful list of words, a search engine's spiders have to look at a lot of pages.
How does any spider start its travels over the Web? The usual starting points are lists of
heavily used servers and very popular pages. The spider will begin with a popular site,
indexing the words on its pages and following every link found within the site. In this
way, the spidering system quickly begins to travel, spreading out across the most widely
used portions of the Web.
"Spiders" take a Web page's content and create key search words that enable online
users to find pages they're looking for.
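The crawl just described — start from popular seed pages, index each page, and follow every link found — can be sketched as a breadth-first traversal. The `Spider` class, the regex-based link extraction, and the pluggable fetcher below are illustrative assumptions for this sketch; a real crawler would also honor robots.txt, normalize URLs, and throttle its requests:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal breadth-first "spider" sketch. The fetcher is pluggable so the
// traversal logic can be shown without real network access.
public class Spider {
    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");

    // Starts from seed URLs and returns every URL visited, following links
    // found in each fetched page, up to maxPages.
    public static List<String> crawl(List<String> seeds,
                                     Function<String, String> fetcher,
                                     int maxPages) {
        Set<String> seen = new LinkedHashSet<>(seeds);
        Deque<String> frontier = new ArrayDeque<>(seeds);
        List<String> visited = new ArrayList<>();
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            String html = fetcher.apply(url);
            if (html == null) continue;              // fetch failed or disallowed
            visited.add(url);
            Matcher m = LINK.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (seen.add(link)) frontier.add(link); // enqueue unseen links only
            }
        }
        return visited;
    }
}
```

The `seen` set is what keeps the spider from revisiting pages as it spreads outward from the heavily used starting points.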
In the simplest case, a search engine could just store the word and the URL where it
was found. In reality, this would make for an engine of limited use, since there would
be no way of telling whether the word was used in an important or a trivial way on
the page, whether the word was used once or many times or whether the page
contained links to other pages containing the word. In other words, there would be no
way of building the ranking list that tries to present the most useful pages at the top
of the list of search results.
To make for more useful results, most search engines store more than just the word
and URL. An engine might store the number of times that the word appears on a
page. The engine might assign a weight to each entry, with increasing values
assigned to words as they appear near the top of the document, in sub-headings, in
links, in the meta tags or in the title of the page. Each commercial search engine has a
different formula for assigning weight to the words in its index. This is one of the
reasons that a search for the same word on different search engines will produce
different lists, with the pages presented in different orders. Regardless of the precise
combination of additional pieces of information stored by a search engine, the data
will be encoded to save storage space. For example, the original Google paper
describes using 2 bytes, of 8 bits each, to store information on weighting -- whether
the word was capitalized, its font size, position, and other information to help in
ranking the hit. Each factor might take up 2 or 3 bits within the 2-byte grouping (8
bits = 1 byte). As a result, a great deal of information can be stored in a very compact
form. After the information is compacted, it's ready for indexing.
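The 2-byte encoding described above can be illustrated with simple bit packing. The field widths below (1 bit for capitalization, 3 bits for font size, 12 bits for position) are assumptions chosen for the example, not the layout actually used by Google:

```java
// Illustrative packing of per-occurrence word information into 16 bits, in
// the spirit of the 2-byte records described above. Field widths here are
// an assumption for the example: 1 bit capitalization, 3 bits font size,
// 12 bits word position.
public class HitCodec {
    public static short pack(boolean capitalized, int fontSize, int position) {
        int bits = (capitalized ? 1 : 0) << 15   // 1 bit: capitalization flag
                 | (fontSize & 0x7) << 12        // 3 bits: relative font size (0-7)
                 | (position & 0xFFF);           // 12 bits: word position (0-4095)
        return (short) bits;
    }

    public static boolean capitalized(short hit) { return (hit >> 15 & 1) == 1; }
    public static int fontSize(short hit)        { return hit >> 12 & 0x7; }
    public static int position(short hit)        { return hit & 0xFFF; }
}
```

Packing several ranking factors into one short value is what lets the index hold a great deal of information in a very compact form.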
In English, there are some letters that begin many words, while others begin fewer.
You'll find, for example, that the "M" section of the dictionary is much thicker than
the "X" section. This inequity means that finding a word beginning with a very
"popular" letter could take much longer than finding a word that begins with a less
popular one. Hashing evens out the difference, and reduces the average time it takes
to find an entry. It also separates the index from the actual entry. The hash table
contains the hashed number along with a pointer to the actual data, which can be
sorted in whichever way allows it to be stored most efficiently. The combination of
efficient indexing and effective storage makes it possible to get results quickly, even
when the user creates a complicated search.
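The hash-table arrangement described above — a hashed key stored alongside a pointer to the actual entry data — might be sketched like this; the `Lexicon` class and its layout are illustrative, not taken from the project:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a hashed index: the hash table stores, for each word, a pointer
// (an offset) to the actual entry data, which lives in a separate structure
// and can be laid out however is most efficient. Lookup time no longer
// depends on which letter the word starts with.
public class Lexicon {
    private final Map<String, Integer> table = new HashMap<>(); // word -> pointer
    private final List<int[]> postings = new ArrayList<>();     // the actual entries

    public void add(String word, int[] docIds) {
        table.put(word, postings.size()); // store a pointer, not the data itself
        postings.add(docIds);
    }

    public int[] lookup(String word) {
        Integer p = table.get(word);      // O(1) average-case hash lookup
        return p == null ? new int[0] : postings.get(p);
    }
}
```

Looking up "mango" and looking up "xylophone" cost the same here, which is exactly the evening-out effect the text attributes to hashing.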
AND - All the terms joined by "AND" must appear in the pages or documents.
Some search engines substitute the operator "+" for the word AND.
OR - At least one of the terms joined by "OR" must appear in the pages or
documents.
NOT - The term or terms following "NOT" must not appear in the pages or
documents. Some search engines substitute the operator "-" for the word
NOT.
NEAR - One of the terms must be within a specified number of words of the
other.
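Over sets of matching document IDs drawn from the index, the AND, OR and NOT operators reduce to standard set operations; a minimal sketch (the class name and set representation are illustrative):

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of the Boolean operators above, applied to the sets of document
// IDs that the index reports for each term.
public class BooleanQuery {
    // AND: documents containing both terms.
    public static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.retainAll(b);
        return r;
    }

    // OR: documents containing at least one of the terms.
    public static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.addAll(b);
        return r;
    }

    // NOT: documents containing the first term but not the second.
    public static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.removeAll(b);
        return r;
    }
}
```

NEAR is the one operator that cannot be answered from document-ID sets alone; it needs the word-position information stored with each index entry.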
5. Feasibility Study
A feasibility study is a compressed, capsule version of the entire system analysis and
design process. The study begins by clarifying the problem definition. The goal of a
feasibility study is not to solve the problem but to determine whether it is worth
solving. Once an acceptable problem definition has been generated, the analyst
develops a logical model of the system as a reference. Next, the alternatives are
carefully analyzed for feasibility. At least three different types of feasibility are
considered.
Operational feasibility asks whether there is sufficient support for the project from
management and users. The proposed system has been developed with the
involvement of management and users and has been tested to work under all
conditions, so it can be considered operationally feasible.
6. Requirement Analysis
6.1 Purpose of the System
The system mainly focuses on the QPS (Query Processing System) of a large-scale
web search engine. We first design a distributed architecture for the QPS. Since the
amount of CPU and I/O resources used for query processing is so large, more than
one server has to work in parallel even to process a single query.
To use memory and disk space efficiently, the cache data are managed across caches
of four different levels. Using the multi-level caching scheme, we can save around
70% of the server cost.
The web search engine is widely used as a common way to find information of
interest, and the number of indexed documents has reached the scale of multiple
billions. A web search engine is software that indexes web documents collected from
the Internet and ranks them according to their relevance to an entered user query.
Much research has been done to solve various problems related to web search
engines, such as crawling web documents, high-performance indexing, hyperlink
analysis, and topic-sensitive searching. However, there is not enough information
about how to implement a query processing system suitable for large-scale web
search engines. Since the system has to index a huge amount of data, the cost of
yielding a query result can be very high.
answer 5 million user queries against them per day at low cost.
6.4.2.2 Reliability
The system is considered reliable because it was built entirely in Java, which is a
robust language. Reliability refers to how dependably the system meets its standards.
6.4.2.3 Performance
The system is highly functional and performs well. Using a minimal set of variables
and minimal control structures increases the performance of the system.
6.4.2.4 Supportability
The system is supportable on different platforms and a wide range of machines. The
Java code used in this project is flexible and platform independent.
Operating System   : Microsoft Windows XP
Languages          :
IDE                : My Eclipse 7.5
Web Server         :
Database Interface : MS Access
Database Drivers   : ODBC
UML Tool           :
Memory             : 256 MB RAM
Hard Disk          : 20 GB
Monitor            :
7. System Design
7.1 Design Modeling Tools
Object Oriented Analysis and Design (OOAD) is often part of the development of
large-scale systems and programs, often using the Unified Modeling Language
(UML). OOAD applies object-modeling techniques to analyze the requirements for a
context (for example, a system, a set of system modules, an organization, or a
business unit) and to design a solution. Most modern object-oriented analysis and
design methodologies are use case driven across requirements, design,
29
between objects varies depending on what kind of system is being modeled. In some
systems, "sending a message" is the same as "invoking a method".
Object Oriented Analysis
interfaces in APIs and to roles that the objects take in various situations. The
interfaces and their implementations for stable concepts can be made available as
reusable services. Concepts identified as unstable in object-oriented analysis will
form the basis for policy classes that make decisions and implement
environment-specific or situation-specific logic or algorithms.
The result of object-oriented design is a detailed description of how the system can
be built using objects. Object-oriented software engineering (OOSE) is an object
modeling language and methodology.
OOSE was developed by Ivar Jacobson in 1992 while at Objectory AB. It is the first
object-oriented design methodology to employ use cases to drive software design. It
also uses other design products similar to those used by OMT.
The tool Objectory was created by the team at Objectory AB to implement the OOSE
methodology. After success in the marketplace, other tool vendors also supported
OOSE. After Rational bought Objectory AB, the OOSE notation, methodology, and
tools became superseded.
The methodology part of OOSE has since evolved into the Rational Unified Process
(RUP). The OOSE tools have been replaced by tools supporting UML and RUP, and
OOSE has been largely replaced by the UML notation and the RUP methodology.
1. Use Case Diagram: A use case diagram shows a set of use cases and actors and
their relationships. You apply use case diagrams to illustrate the static use case view
of a system. Use case diagrams are especially important in organizing and modeling
the behaviors of a system.
2. Sequence Diagram: A sequence diagram is an interaction diagram that
emphasizes the time ordering of messages. A sequence diagram shows a set of
objects and the messages sent and received by those objects. The objects are typically
named or anonymous instances of classes, but may also represent instances of other
things, such as collaborations, components, and nodes. You use sequence diagrams to
illustrate the dynamic view of a system.
3. Activity Diagram: An activity diagram shows the flow from activity to activity
within a system. It shows a set of activities, the sequential or branching flow from
activity to activity, and the objects that act and are acted upon. You use activity
diagrams to illustrate the dynamic view of a system. Activity diagrams are especially
important in modeling the function of a system, and they emphasize the flow of
control among objects.
4. Collaboration Diagram: This diagram is an alternative to the sequence diagram;
it takes less space by representing the message sequence with numbers, and it shows
the collaboration among all the objects.
5. Class Diagram: A class diagram shows how a system can be implemented in
terms of real objects modeled as classes. It is produced just before the coding of a
system, depicts the domain model of the entire system, and gives the various relations
among all the classes.
Figure 2.1 Use Case Model for the Query Processing System (actor: User; use cases: Login, Get Result, View Graph)
Sequence diagram for the Query Processing System. Participants: User, Firewall, Load Balancer, Web Servers, Coordinator Servers, Rankers, DST Servers. Messages: 1: Access Firewall(), 2: Load(), 3: Access Server(), 4: Query Process(), 5: Assign Ranker(), 6: Calculate Rank(), 7: Query Result().
Activity diagram for the Query Processing System: Start; enter user name and password; validate (if not valid, the create-new-user process is run); enter search query; search from cache or search by keyword across the Web Server, Coordinator Server, Ranker, and DST Server; show result; view result; view graph; Stop.
Collaboration diagram for the Query Processing System. Objects: User, Firewall, Load Balancer, Web Servers, Coordinator Servers, Rankers, DST Servers. Messages: 1: Access Firewall(), 2: Load(), 3: Access Server(), 4: Query Process(), 5: Assign Ranker(), 6: Calculate Rank(), 7: Query Result().
Initially the language was called Oak, but it was renamed Java in 1995. The primary
motivation for this language was the need for a platform-independent (i.e.,
architecture-neutral) language that could be used to create software to be embedded
in various consumer electronic devices.
Except for those constraints imposed by the Internet environment, Java gives the
programmer full control.
Java has had a profound effect on the Internet because Java expands the universe of
objects that can move about freely in cyberspace. In a network, two categories of
objects are transmitted between the server and the personal computer: passive
information and dynamic, active programs. Dynamic, self-executing programs cause
serious problems in the areas of security and portability. But Java addresses those
concerns and, by doing so, has opened the door to an exciting new form of program
called the applet.
As you will see, the same mechanism that helps ensure security also helps create
portability. Indeed, Java's solution to these two problems is both elegant and efficient.
The Byte Code
The key that allows Java to solve the security and portability problems is that the
output of the Java compiler is byte code. Byte code is a highly optimized set of
instructions designed to be executed by the Java run-time system, which is called the
Java Virtual Machine (JVM). That is, in its standard form, the JVM is an interpreter
for byte code. Translating a Java program into byte code makes it much easier to run
the program in a wide variety of environments: once the run-time package exists for a
given system, any Java program can run on it.
Although Java was designed for interpretation, there is technically nothing about Java
that prevents on-the-fly compilation of byte code into native code. Sun has completed
its Just In Time (JIT) compiler for byte code. When the JIT compiler is part of the
JVM, it compiles byte code into executable code in real time, on a piece-by-piece,
demand basis. It is not possible to compile an entire Java program into executable
code all at once, because Java performs various run-time checks that can be done
only at run time. Instead, the JIT compiles code as it is needed, during execution.
Java Virtual Machine (JVM)
Beyond the language, there is the Java Virtual Machine, an important element of the
Java technology. The virtual machine can be embedded within a web browser or an
operating system. Once a piece of Java code is loaded onto a machine, it is verified:
as part of the loading process, a class loader is invoked and performs byte code
verification, which makes sure that the code generated by the compiler will not
corrupt the machine it is loaded on. Byte code verification takes place at the end of
the compilation process to make sure that everything is accurate and correct. So byte
code verification is integral to the compiling and executing of Java code.
Overall Description
Java source code is compiled into a .class file of byte code, which is executed by the Java VM.
code is written and compiled for one machine and interpreted on all machines. This
machine is called the Java Virtual Machine.
Compiling and Interpreting Java Source Code
During run time, the Java interpreter tricks the byte code file into thinking that it is
running on a Java Virtual Machine. In reality, this could be an Intel Pentium running
Windows 95, a Sun SPARCstation running Solaris, or an Apple Macintosh, and all
could receive code from any computer through the Internet and run the applets.
Simple
Java was designed to be easy for the professional programmer to learn and to use
effectively. If you are an experienced C++ programmer, learning Java will be even
easier, because Java inherits the C/C++ syntax and many of the object-oriented
features of C++. Most of the confusing concepts from C++ are either left out of Java
or implemented in a cleaner, more approachable manner. In Java there are a small
number of clearly defined ways to accomplish a given task.
Object-Oriented
Java was not designed to be source-code compatible with any other language. This
allowed the Java team the freedom to design with a blank slate. One outcome of this
was a clean, usable, pragmatic approach to objects. The object model in Java is
simple and easy to extend, while simple types, such as integers, are kept as
high-performance non-objects.
Robust
implemented in this project. All the business logic is implemented using servlets, and
the servlets cover the Controller part in MVC (Model View Controller).
A servlet is a Java program that can handle HTTP requests and responses, allowing
us to enhance the functionality of a web server.
Advantages of Servlets
No CGI limitations
Abundant third-party tools and web servers supporting servlets
Access to the entire family of Java APIs
Scalability
An Eclipse-based product is structured as a collection of plug-ins. Each plug-in
contains the code that provides some of the product's functionality. The code and
other files for a plug-in are installed on the local computer and activated
automatically as required.
A product's plug-ins are grouped together into features. A feature is a unit of separately
downloadable and installable functionality.
The fundamentally modular nature of the Eclipse platform makes it easy to install
additional features and plug-ins into an Eclipse based product, and to update the
product's existing features and plug-ins. You can do this either by using traditional
native installers running separately from Eclipse, or by using the Eclipse platform's own
update manager. The Eclipse update manager can be used to discover, download, and
install updated features and plug-ins from special web based Eclipse update sites.
This subsystem checks the user name and password against the existing data in the
database. If both are verified, the user is allowed to open the search engine page.
There is also an option to create new users.
The query processing server is responsible for performing the equi-join and rank
evaluation. The equi-join selects the query-matching documents with respect to the
given keywords. For this, inverted files are used to identify the documents where all
of the given keywords occur at least once. After the equi-join operation, rank
evaluation is performed to give rank scores to the query-matching documents
according to the documents' query relevancy.
At this point, our ranking system refers to various index data including the hyperlink
analysis results, keyword occurrence positions, and HTML-tag related information.
This index data is also stored in inverted files along with the other data used for the
equi-join. From the steps of equi-join and ranking, we come to have a set of
<DID, rank score> pairs, where DID stands for the document ID. By sorting them in
decreasing order of rank score, we can finally determine the set of DIDs pertaining to
the user's retrieval range.
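The equi-join and rank-evaluation steps just described can be sketched with a toy in-memory inverted index. The per-document term scores and the sum-of-scores ranking below are stand-ins for the real ranking formula, and the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the query processing server's two steps: an equi-join over
// inverted lists keeps only the DIDs containing every keyword, then the
// survivors are sorted by rank score in decreasing order.
public class QueryProcessor {
    // inverted index: keyword -> (DID -> precomputed term score for that document)
    private final Map<String, Map<Integer, Double>> inverted = new HashMap<>();

    public void addPosting(String keyword, int did, double score) {
        inverted.computeIfAbsent(keyword, k -> new HashMap<>()).put(did, score);
    }

    public List<Integer> query(List<String> keywords) {
        // equi-join: intersect the inverted lists of all keywords
        Set<Integer> matches = null;
        for (String kw : keywords) {
            Set<Integer> dids = inverted.getOrDefault(kw, Map.of()).keySet();
            if (matches == null) matches = new HashSet<>(dids);
            else matches.retainAll(dids);
        }
        if (matches == null) return List.of();
        // rank evaluation: sort the <DID, rank score> pairs by decreasing score
        List<Integer> result = new ArrayList<>(matches);
        result.sort(Comparator.comparingDouble(
                (Integer did) -> rank(keywords, did)).reversed());
        return result;
    }

    private double rank(List<String> keywords, int did) {
        double s = 0;
        for (String kw : keywords) s += inverted.get(kw).getOrDefault(did, 0.0);
        return s;
    }
}
```

In the real system the inverted lists live in files and the scores draw on hyperlink analysis, keyword positions, and HTML-tag data, but the join-then-sort shape is the same.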
web server first accesses an associated memory bucket, and then explores the
memory slots within it in search of the matching cache record. If a matching cache
record is found, it is uncompressed for use. The uncompressed data is the
HTML-coded data to be transferred to the user's web browser for the view of a query
result page. With a cache hit on CL1, the user query is completed without any
connection to the coordinator server. Otherwise, if no match is found, a request for
query processing is sent to a coordinator server.
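The CL1 lookup just described might be sketched as follows, with the bucket/slot structure reduced to a single map and GZIP standing in for whatever compression the system actually used; all names here are illustrative assumptions:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of a CL1-style lookup: result pages are kept compressed in memory,
// keyed by the query string. A hit is decompressed into the HTML page for
// the user's browser; a miss (null) means the request must be forwarded to
// a coordinator server.
public class CL1Cache {
    private final Map<String, byte[]> slots = new HashMap<>();

    public void store(String query, String htmlPage) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(htmlPage.getBytes(StandardCharsets.UTF_8)); // held compressed in RAM
            }
            slots.put(query, bos.toByteArray());
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public String lookup(String query) {
        byte[] compressed = slots.get(query);
        if (compressed == null) return null; // CL1 miss: go to the coordinator server
        try (GZIPInputStream gz =
                new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Storing pages compressed is what lets the top level hold many more result pages in the same amount of main memory.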
(16) Select a free slot s from S and save the compressed q_result into s.data, and set
s.popularity to 50, if the current result is to be cached.
(17) endif
8. Implementation
8.1 Implementation of the Database Connectivity Class
Class Name: Database.java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Database {

    Connection con;
    Statement st;
    ResultSet rs;
    String s;
    Vector data;
    Vector subject;
    int slno = 1;

    public Database() {
        try {
            createCon();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Loads the JDBC-ODBC bridge driver and opens a connection to the
    // "search" ODBC data source (the MS Access database).
    public void createCon() {
        try {
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
            con = DriverManager.getConnection("jdbc:odbc:search");
            st = con.createStatement();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
try {
rs = st.executeQuery("select * from datas");
String title;
String data;
long btime = 0;
long atime = 0;
long time = 0;
while (rs.next()) {
btime = System.nanoTime();
System.out.println("btime:" + btime);
title = rs.getString("title");
data = rs.getString("data");
if (compareRegularEx(srchstr, data)) {
atime = System.nanoTime();
System.out.println("atime:" + atime);
time = atime - btime;
System.out.println("time:" + time);
retv.add(title + ":" + data + ":" + time);
}
}
con.close();
} catch (Exception e) {
e.printStackTrace();
}
return getDb2(retv, srchstr);
}
    private Vector<String> getDb2(Vector<String> retv, String srchstr) {
        try {
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
            con = DriverManager.getConnection("jdbc:odbc:db2");
            st = con.createStatement();
        } catch (Exception e) {
            e.printStackTrace();
        }
        try {
            rs = st.executeQuery("select * from datas");
            String title;
            String data;
            long btime = 0;
            long atime = 0;
            long time = 0;
            while (rs.next()) {
                btime = System.nanoTime();
                System.out.println("btime:" + btime);
                title = rs.getString("title");
                data = rs.getString("data");
                if (compareRegularEx(srchstr, data)) {
                    atime = System.nanoTime();
                    System.out.println("atime:" + atime);
                    time = atime - btime;
                    System.out.println("time:" + time);
                    retv.add(title + ":" + data + ":" + time);
                }
            }
            con.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return retv;
    }

    private boolean compareRegularEx(String srchstr, String text) {
        String[] srch = srchstr.split(" ");
        boolean result = false;
        for (int i = 0; i < srch.length; i++) {
            Pattern p = Pattern.compile(srch[i]);
            Matcher m = p.matcher(text);
            if (m.find()) {
                result = true;
            }
        }
        return result;
    }
}
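Note that compareRegularEx returns true when any single space-separated term matches, and it compiles each raw term as a regular expression, so a query containing characters such as + or * is interpreted as pattern syntax rather than literal text. A hedged alternative sketch (the class and method names here are ours, not the project's) that quotes each term as a literal and requires every term to match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TermMatcher {
    // Returns true only when every whitespace-separated search term occurs
    // in the text. Pattern.quote treats each term as a literal string, and
    // CASE_INSENSITIVE ignores letter case.
    public static boolean matchesAllTerms(String srchstr, String text) {
        for (String term : srchstr.trim().split("\\s+")) {
            Pattern p = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(text);
            if (!m.find()) {
                return false; // one missing term fails the whole query
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "Data mining and warehousing";
        System.out.println(matchesAllTerms("data mining", text)); // true
        System.out.println(matchesAllTerms("data cache", text));  // false
    }
}
```

Whether any-term or all-term semantics is preferable depends on the search engine's intended behavior; the all-term version narrows results as more keywords are added.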
request.getRequestDispatcher("/search.jsp").include(request, response);
out.println("<html><head><script type='text/javascript'>"
        + "function logout()" + "{"
        + "alert('" + uname + " Welcome to Wikipedia Text Classification.');"
        + "}" + "</script>" + "</head>"
        + "<body onload='logout()'>" + "</body>"
        + "</html>");
response.sendRedirect(request.getRequestURL().substring(0,
        request.getRequestURL().lastIndexOf("servlet")) + "home.jsp");
}
} else {
    request.getRequestDispatcher("/new_user.jsp").include(request, response);
    out.println("<html><head><script type='text/javascript'>"
            + "function logout()" + "{"
            + "alert('Username already Registered. Please choose a different user name');"
            + "}" + "</script>" + "</head>"
            + "<body onload='logout()'>" + "</body>"
            + "</html>");
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServletException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public void login(PrintWriter out, HttpServletRequest request,
HttpServletResponse response) {
// TODO Auto-generated method stub
try {
String usr = request.getParameter("username");
String pwd = request.getParameter("password");
if (new Database().check(usr, pwd).equals("Valid")) {
response.sendRedirect(request.getRequestURL().substring(0,
request.getRequestURL().lastIndexOf("servlet"))
+ "home.jsp");
} else {
request.getRequestDispatcher("/login.jsp").include(request,
response);
out.println("<table align=center cellspacing=3 cellpadding=3 "
        + "style='BORDER-RIGHT: red 2px solid; BORDER-TOP: red 2px solid; "
        + "BORDER-LEFT: red 2px solid; BORDER-BOTTOM: red 2px solid'>"
        + "<tr><td>Invalid Username and Password</td></tr></table>");
}
} catch (ServletException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public void search(PrintWriter out, HttpServletRequest request,
HttpServletResponse response) {
try {
String te = request.getRealPath("/");
te = te.substring(0, te.indexOf("."));
String s = request.getContextPath();
System.out.println("te+s:" + te + s);
Vector<String> resultVec = datas.getData(getDBLoc(te + s),
request
.getParameter("txtsearch"));
HttpSession ses = request.getSession(true);
ses.setAttribute("Search_Res", resultVec);
HttpSession chart = request.getSession(true);
chart.setAttribute("Res_Chart", resultVec);
response.sendRedirect(request.getRequestURL().substring(0,
request.getRequestURL().lastIndexOf("servlet"))
+ "search_Res.jsp");
out.println(resultVec);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public void upload(PrintWriter out, HttpServletRequest request,
HttpServletResponse response) {
// TODO Auto-generated method stub
try {
String te = request.getRealPath("/");
te = te.substring(0, te.indexOf("."));
String s = request.getContextPath();
String path = request.getParameter("txtbrowse");
if (new Loading().upload(te + s, path)) {
request.getRequestDispatcher("/Welcome.jsp").include(request,
response);
out.println("<table align=center cellspacing=3 cellpadding=3 "
        + "style='BORDER-RIGHT: 2px solid; BORDER-TOP: 2px solid; "
        + "BORDER-LEFT: 2px solid; BORDER-BOTTOM: 2px solid'>"
        + "<tr><td>File Uploaded Successfully.</td></tr></table>");
} else {
request.getRequestDispatcher(
        request.getRequestURL().substring(0,
                request.getRequestURL().lastIndexOf("servlet"))
        + "Welcome.jsp").include(request, response);
out
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
public class Controller extends HttpServlet {
private static final long serialVersionUID = 1L;
Util tools = new Util();
public Controller() {
super();
}
public void destroy() {
super.destroy();
}
public void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
response.setContentType("text/html");
PrintWriter out = response.getWriter();
out.flush();
out.close();
}
public void doPost(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
response.setContentType("text/html");
PrintWriter out = response.getWriter();
9. Testing

White-box and black-box testing are terms used to describe the point of view a test engineer takes when designing test cases: black-box testing takes an external view of the test object, while white-box testing takes an internal view.

Unit Testing: each unit (basic component) of the software is tested to verify that the detailed design for the unit has been correctly implemented.
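As an illustration of unit testing, the redirect-URL expression used throughout the servlet code above can be isolated into a small unit and verified on its own (a sketch; the class and helper names are ours, not the project's):

```java
public class RedirectUrlTest {
    // Mirrors the expression used in the servlet code:
    // requestURL.substring(0, requestURL.lastIndexOf("servlet")) + page
    static String redirectTo(String requestUrl, String page) {
        return requestUrl.substring(0, requestUrl.lastIndexOf("servlet")) + page;
    }

    public static void main(String[] args) {
        String url = "http://localhost:8080/app/servlet/Controller";
        // Strips the trailing "servlet/Controller" and appends the target page.
        System.out.println(redirectTo(url, "home.jsp")); // http://localhost:8080/app/home.jsp
    }
}
```

Extracting the expression into a method like this lets the unit be exercised with many inputs without deploying the servlet container.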
Test Development: test procedures, test scenarios, test cases, and test scripts are developed for use in testing the software.
Test Execution: testers execute the software based on the plans and tests, and report any errors found to the development team.
Test Reporting: once testing is completed, testers generate metrics and make final reports on their test effort and on whether or not the software tested is ready for release.
9.2.2 Test case with the key word 1985 in Search Engine Interface Form
9.2.3 Test case with the key word DMW in Search Engine Interface Form
9.2.4 Test case with the key word TPO in Search Engine Interface Form
9.2.5 Test case with the key word mobile in Search Engine Interface Form
9.2.6 Test case with the key word research mining in Search Engine Interface Form
9.2.7 Test case with the key word sfdsadfsdhdf in Search Engine Interface Form (No record found case)
9.3 Test cases with Time Performance Analysis of Web Search Engine
9.3.1 First Time applying Query without Cache of Search Engine Interface
9.3.2 Second Time applying Same Query with Cache of Search Engine Interface
10. Performance Analysis

From the above results and graphs it is observed that the first application of the query took 196,673 nanoseconds, while the same query applied a second time took only 130,464 nanoseconds, a reduction of roughly 34%. Hence we can conclude that using a cache on the server side can reduce the response time of a typical web search engine.
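The caching effect observed above can be sketched in a few lines (a simplified illustration of the idea, not the project's actual cache implementation; the class name and the "mobile" query are ours):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Vector;
import java.util.function.Function;

public class QueryCache {
    private final Map<String, Vector<String>> cache = new HashMap<>();
    int misses = 0; // counts how often the expensive lookup actually ran

    // Returns cached results when the same query string was seen before;
    // otherwise runs the (expensive) lookup and stores its result.
    public Vector<String> search(String query, Function<String, Vector<String>> lookup) {
        return cache.computeIfAbsent(query, q -> {
            misses++;
            return lookup.apply(q);
        });
    }

    public static void main(String[] args) {
        QueryCache qc = new QueryCache();
        // Stand-in for the database scan performed by getData() above.
        Function<String, Vector<String>> slowLookup =
                q -> new Vector<>(java.util.List.of(q + ":result"));
        qc.search("mobile", slowLookup); // first time: cache miss, lookup runs
        qc.search("mobile", slowLookup); // second time: served from the cache
        System.out.println("lookups executed: " + qc.misses); // lookups executed: 1
    }
}
```

The second call never touches the lookup at all, which is why the repeated query in the measurements above completes faster than the first one.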
12. Conclusion

The main objective of this project is to design a query processing system with a cache-based technique to reduce the response time of a typical web search engine, and graphs are generated to analyze the search engine's performance. This model can be applied to any large-scale web search engine.
13. References

[1] Search Engine Report, http://www.searchenginewatch.com, 2005.
[3] Sriram Raghavan and Hector Garcia-Molina. Crawling the Hidden Web. In Proc. of the VLDB Conference, pp. 129-138, 2001.
[4] Andrei Z. Broder, Marc Najork, and Janet L. Wiener. Efficient URL Caching for World Wide Web Crawling. In Proc. of the 12th WWW Conference, Budapest, Hungary, 2003.
[5] Maxim Lifantsev and Tzi-cker Chiueh. I/O-Conscious Data Preparation for Large-Scale Web Search Engines. In Proc. of the 28th VLDB Conference, Hong Kong, 2002.
[6] Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. Building a Distributed Full-Text Index for the Web. In Proc. of the 10th International World Wide Web Conference, pp. 396-406, 2001.
APPENDIX
Glossary
A
Access
B
Bandwidth: The amount of information that can be sent between computers through a telephone wire. Bandwidth can also be defined as the range of frequencies used to send data from one place to another over a telephone line.
Broadband: A system that enables many messages or large amounts of information to be sent quickly from one place to another.
C
Cache: A hidden store of things.
Cache Memory: A type of computer memory in which information that is often in use can be stored temporarily and accessed quickly.
Coordinator Server: The coordinator server receives user queries via the web servers and performs two-phase query processing together with the ranker and DST (Document Summarizing Text) servers. During the first phase, the coordinator server sends a query to four ranker servers at once.
Cluster: A group of similar things that are chosen together to perform a given task.
D
DID : Document ID
DST Server: DST stands for Document Summarizing Text Server.
DST Server functionality: DST servers create DST data for each received DID and return it to the coordinator server. For this, the DST server stores the URLs, titles, and tag-free body text of all the crawled web documents on disk, and uses a hash scheme to read each of them. By merging the DSTs, the coordinator server finishes the second phase.
DUP_K : Duplication Key
E
Electronic Documents: All web pages can be called electronic documents since they can be connected together electronically.
F
Firewall: A specialized device or program that stops people from accessing a computer without permission while it is connected to the Internet.
G
Generate HTML Document: A Java program can generate HTML code for a client; such a Java program is called a servlet.
H
Hash Key: A key used to look up a value in a hash table.
Hash Table: A data structure that stores key-value pairs.
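A minimal sketch of hash-table usage in Java (the page names and counts below are illustrative only, not taken from the project):

```java
import java.util.HashMap;
import java.util.Map;

public class HashTableDemo {
    // Builds a small hash table mapping page names to hit counts.
    static Map<String, Integer> buildTable() {
        Map<String, Integer> hitCounts = new HashMap<>();
        hitCounts.put("home.jsp", 3);   // key "home.jsp" maps to value 3
        hitCounts.put("search.jsp", 7); // key "search.jsp" maps to value 7
        return hitCounts;
    }

    public static void main(String[] args) {
        Map<String, Integer> table = buildTable();
        // A lookup hashes the key to find its value in (expected) constant time.
        System.out.println(table.get("search.jsp")); // 7
    }
}
```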
I
IDC Port: IDC stands for Internet Data Center; an IDC port is used to provide an Internet connection to a computer.
Internet: The Internet is a collection of networks and works as an information highway.
L
Layer 4 Switch: A switch based on the OSI transport layer, which allows for policy-based switching (for example, limiting different types of traffic on specific end-user switch ports, or prioritizing certain packet types, such as database or application server traffic).
Load Balancer: The Layer 4 switch works as a load balancer and dispatches user queries to the four web servers in round-robin fashion. The performance monitor repeatedly gathers performance statistics such as response times, the rate of entered queries, server workloads, etc. If any problem is detected, it sends a warning message to the administrator.
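The round-robin dispatch described above can be sketched as follows (a simplified illustration; the class name and the server names are placeholders, not the project's):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobin {
    private final List<String> servers;
    private final AtomicInteger next = new AtomicInteger(0);

    public RoundRobin(List<String> servers) {
        this.servers = servers;
    }

    // Each call returns the next server in circular order, so queries are
    // spread evenly across all servers.
    public String pick() {
        return servers.get(Math.floorMod(next.getAndIncrement(), servers.size()));
    }

    public static void main(String[] args) {
        RoundRobin lb = new RoundRobin(List.of("web1", "web2", "web3", "web4"));
        for (int i = 0; i < 5; i++) {
            System.out.println(lb.pick()); // web1, web2, web3, web4, web1
        }
    }
}
```

The AtomicInteger makes the counter safe to increment from multiple request threads, which matters in a servlet environment where many queries arrive concurrently.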
M
Monitoring: Keeping watch on a display screen to track information.
N
Network: A large system consisting of many similar parts that are connected together to allow movement or communication between or along the parts, or between the parts and a control centre.
Node: A computer within a network is called a node.
O
Organize : To make arrangements for something to happen
P
Process: A series of actions that you take in order to achieve a result.
Performance: A metric that measures how well the system works.
Q
Query: A statement that expects some result.
Query Process: A series of actions collectively working together to produce a result, carried out by computer software.
Query Processing System: Software that can perform query processing and generate a result.
R
Ranker: A program that decides the rank of a web page according to the duplicate word count or word frequency of the page.
Ranker Server: The ranker server calculates a rank score for every DID (Document ID) selected from the equi-join, and thus it has to read additional index data such as keyword occurrence positions and HTML-tag-related data.
S
Server: In computing, a server is any combination of hardware or software designed to provide services to clients. When used alone, the term typically refers to a computer which may be running a server operating system, but it is also used to refer to any software or dedicated hardware capable of providing services.
T
Thread: A thread is a sequence of executing instructions that can run independently of other threads yet can directly share data with other threads. Java is a multithreaded language. Threads resemble independent agents at your disposal: you give each one a list of instructions (method calls) and send it on its way, and each agent works on its own list of instructions until it finishes or is told to stop. Thus a thread resembles a process; threads are sometimes referred to as "lightweight processes".
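The description above can be made concrete with a short Java sketch (the class and method names are ours):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ThreadDemo {
    // Starts a worker thread, waits for it to finish, and reports whether
    // the worker's instructions actually ran.
    static boolean runWorker() {
        AtomicBoolean ran = new AtomicBoolean(false);
        // The lambda is the worker's "list of instructions".
        Thread worker = new Thread(() -> ran.set(true));
        worker.start();        // runs concurrently with the calling thread
        try {
            worker.join();     // the caller waits here until the worker finishes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ran.get();      // data shared between the two threads
    }

    public static void main(String[] args) {
        System.out.println("worker ran: " + runWorker()); // worker ran: true
    }
}
```

The AtomicBoolean is the shared data mentioned in the definition: both the worker and the main thread can read and write it directly.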
U
Usability : Usability is a term used to denote the ease with which people can employ a
particular tool or other human-made object in order to achieve a particular goal.
Usability can also refer to the methods of measuring usability and the study of the
principles behind an object's perceived efficiency or elegance.
Usability is a qualitative attribute that assesses how easy user interfaces are to use. The word "usability" also refers to methods for improving ease of use during the design process. Usability consultant Jakob Nielsen and computer science professor Ben Shneiderman have written (separately) about a framework of system acceptability, where usability is a part of "usefulness" and is composed of:
Learnability: How easy is it for users to accomplish basic tasks the first time they encounter the design?
Efficiency: Once users have learned the design, how quickly can they perform tasks?
Memorability: When users return to the design after a period of not using it, how easily can they re-establish proficiency?
Errors: How many errors do users make, how severe are these errors, and how easily can they recover from the errors?
W
Web Application : In software engineering, a web application is an application that is
accessed via a web browser over a network such as the Internet or an intranet. The term
may also mean a computer software application that is hosted in a browser-controlled
environment (e.g. a Java applet) or coded in a browser-supported language
(such as JavaScript, combined with a browser-rendered markup language like HTML)
and reliant on a common web browser to render the application executable.
Web Browser : A web browser is a software application for retrieving, presenting, and
traversing information resources on the World Wide Web. An information resource is
identified by a Uniform Resource Identifier (URI) and may be a web page, image,
video, or other piece of content. Hyperlinks present in resources enable users to easily
navigate their browsers to related resources.
Web Server : A Web server is a computer program that delivers (serves) content, such
as Web pages, using the Hypertext Transfer Protocol (HTTP), over the World Wide
Web. The term Web server can also refer to the computer or virtual machine running
the program. In large commercial deployments, a server computer running a Web server
can be rack-mounted in a server rack or cabinet with other servers to operate a Web
farm.
Web Search Engine: A web search engine is designed to search for information on the World Wide Web. The search results are usually presented in a list and are commonly called hits. The results may consist of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained by human editors, search engines operate algorithmically or use a mixture of algorithmic and human input.