1. Problem Definition ---------------------------------------------------------------------- 4
1.1 Project Overview ---------------------------------------------------------------- 5
1.2 Project Deliverables ------------------------------------------------------------- 6
2. System Architecture --------------------------------------------------------------------- 7-13
2.1 Page Rank Algorithm -------------------------------------------------------------- 7
2.2 Simplified Algorithm ------------------------------------------------------------- 8
2.3 How Page Rank Works -------------------------------------------------------------- 9
2.4 How Page Rank Is Calculated ----------------------------------------------------- 10
2.5 Different Criteria Used in Page Rank ----------------------------------------- 10-12
2.6 Keyword Relevance --------------------------------------------------------------- 12
2.7 Database Connector -------------------------------------------------------------- 13
3. Project Organization ------------------------------------------------------------------- 14-21
3.1 Software Process Model ---------------------------------------------------------- 14
3.2 Roles and Responsibilities ------------------------------------------------------ 17
3.3 Tools and Techniques ------------------------------------------------------------ 19
3.4 Brief Description of Components Used -------------------------------------------- 19
7.1 Introduction -------------------------------------------------------------------- 51
7.2 Test Cases & Results ------------------------------------------------------------ 58
Abstract
Problem statement: Develop a framework (Rules Engine) for popularity-based ranking
algorithms.
Platform: Visual Studio 2003, Microsoft .NET Framework
Detailed information:
What is a page rank?
Page Rank is a numeric value that represents how important a page is on the web. When one
page links to another page, it is effectively casting a vote for the other page. The more votes that
are cast for a page, the more important the page must be. Also, the importance of the page that is
casting the vote determines how important the vote itself is.
How is Page Rank calculated?
To calculate the Page Rank for a page, all of its inbound links are taken into account. These are
links from within the site and links from outside the site.
PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn))
How to make use of a page ranking algorithm:
Every search engine has its own algorithm for ranking pages in order to return relevant
results for the query the user submits at search time. A good searching strategy makes the
search faster and more efficient, and this ranking algorithm will provide the engine with
much more refined searching strategies.
Application of the project: To develop a Rules Engine that accepts the user's input and the
search criteria as specified by the user and returns proper results. The Rules Engine
accepts different criteria along with the algorithm and generates the most popular results
on the basis of the criteria defined. The engine also performs processing of the rules,
which includes indexing, stemming and stop word removal depending on the parameters passed
by the user. The criteria specified are used by the Rules Engine, and accordingly the most
popular results are given to the user.
CHAPTER 1
PROBLEM STATEMENT

PROJECT OVERVIEW
Functional description: The Rules Engine is a user application developed to rank any kind
of data the user wants. It is a framework with two basic elements: the first is a
connector, and the second is the rules. The model can accommodate any type of connector;
the basic aim of a connector is to fetch the particular kind of data for which it is
specialized. The user can then plug in any type of connector and use any ranking algorithm
to rank the data. For example, a user can use a web crawler as a connector and rank the web
pages using either a page ranking algorithm or a keyword relevance algorithm.
As another example, to find the best communities on Orkut, we can generate a connector that
finds various communities and then use certain criteria to decide how to judge the best
community, such as ranking higher the communities with the maximum number of members. Along
the same lines, we develop the ranking algorithm using these criteria and finally obtain
the best community.
The major areas of work are information retrieval, text processing and ranking of data.
PROJECT DELIVERABLES
Sr no | Date        | Deliverable
1     | 27 Aug 2008 | Page Ranking
2     | 29 Sep 2008 | Web crawler
3     | 05 Oct 2008 | Keyword Relevance
4     | 01 Jan 2009 | Merging of PR, KR, Crawler
5     | 15 Jan 2009 | COM DLL
6     | 20 Jan 2009 | Testing the system
7     | 21 Feb 2009 | Database connector
8     | 25 Feb 2009 | Testing the database connector
9     | 05 Mar 2009 | Delivering the entire system
CHAPTER 2
SYSTEM ARCHITECTURE
The Page Ranking Algorithm [23]:
This algorithm is used by search engines. It is a method of ranking web pages by assigning
each page a numeric value that represents its importance. Based on the link structure of
the web, a page X has a high rank if many pages link to X, or if highly ranked pages link
to X.
Basic idea: a page's rank is determined by the number of links to the page (also known as
citations). If a citing page is more important (has a high page rank, i.e. is an authority
page), then the pages it cites are more important. If a citing page has many links, then
each cited page is less important (we normalize for the number of links on the citing
page). If PR(P) is the page rank of page P, T1, ..., Tn are the pages that cite P, C(P) is
the number of links from page P, and d is a decay factor, e.g. 0.85, then:
PR(P) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Page Rank is a probability distribution used to represent the likelihood that a person
randomly clicking on links will arrive at any particular page. Page Rank can be calculated
for any-size collection of documents. It is assumed in several research papers that the
distribution is evenly divided between all documents in the collection at the beginning of
the computational process. The Page Rank computations require several passes, called
"iterations", through the collection to adjust approximate Page Rank values to more
closely reflect the theoretical true value.
A probability is expressed as a numeric value between 0 and 1. A 0.5 probability is
commonly expressed as a "50% chance" of something happening. Hence, a Page Rank of
0.5 means there is a 50% chance that a person clicking on a random link will be directed
to the document with the 0.5 Page Rank.
Simplified algorithm:
Assume a small universe of four web pages: A, B, C and D, each with an initial Page Rank of
0.25. If pages B, C and D each link only to page A, each of them transfers its 0.25 Page
Rank to A, so PR(A) = 0.25 + 0.25 + 0.25. This is 0.75.
Again, suppose page B also has a link to page C, and page D has links to all three pages.
The value of the link-votes is divided among all the outbound links on a page. Thus, page
B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C. Only one third of
D's Page Rank is counted for A's Page Rank (approximately 0.083).
In other words, the Page Rank conferred by an outbound link is equal to the linking
document's own Page Rank score divided by the number of its outbound links L (it is
assumed that links to specific URLs only count once per document).
In the general case, the Page Rank value for any page u can be expressed as:

PR(u) = sum over all v in Bu of PR(v) / L(v)

i.e. the Page Rank value for a page u is dependent on the Page Rank values of each page
v in the set Bu (this set contains all pages linking to page u), divided by the number
L(v) of links from page v.
How is Page Rank calculated?
To calculate the Page Rank for a page, all of its inbound links are taken into account.
These are links from within the site and links from outside the site.
PR (A) = (1-d) + d (PR (t1) / C (t1) + ... + PR (tn) / C (tn))
That's the equation that calculates a page's Page Rank.
In the equation, 't1 - tn' are pages linking to page A, 'C' is the number of outbound links
that a page has, and 'd' is a damping factor, usually set to 0.85; (1-d) is called the
normalization factor.
Different criteria used in page ranking algorithm
Inbound links:
Inbound links (links into the site from the outside) are one way to increase a site's total
Page Rank. The other is to add more pages. The linking page's Page Rank is important,
but so is the number of links going from that page. Once the Page Rank is injected into
your site, the calculations are done again and each page's Page Rank is changed.
Depending on the internal link structure, some pages' Page Rank is increased, some are
unchanged but no pages lose any Page Rank.
It is beneficial to have the inbound links coming to the pages to which you are channeling
your Page Rank. A Page Rank injection to any other page will be spread around the site
through the internal links. The important pages will receive an increase, but not as much
of an increase as when they are linked to directly. The page that receives the inbound link
makes the biggest gain.
It is easy to think of our site as being a small, self-contained network of pages. When we
do the Page Rank calculations we are dealing with our small network. If we make a link
to another site, we lose some of our network's Page Rank, and if we receive a link, our
network's Page Rank is added to. But it isn't like that. For the Page Rank calculations,
there is only one network - every page that Google has in its index. Each iteration of the
calculation is done on the entire network and not on individual websites.
Outbound links:
Outbound links are a drain on a site's total Page Rank. They leak Page Rank. To counter
the drain, try to ensure that the links are reciprocated. Because of the Page Rank of the
pages at each end of an external link, and the number of links out from those pages,
reciprocal links can gain or lose Page Rank. We need to take care when choosing where
to exchange links.
When Page Rank leaks from a site via a link to another site, all the pages in the internal
link structure are affected. The page that you link out from makes a difference to which
pages suffer the most loss. Without a program to perform the calculations on specific link
structures, it is difficult to decide on the right page to link out from, but the generalization
is to link from the one with the lowest Page Rank.
Many websites need to contain some outbound links that have nothing to do with Page Rank.
Unfortunately, all 'normal' outbound links leak Page Rank, but there are 'abnormal' ways of
linking to other sites that do not result in leaks. Page Rank is leaked when Google
recognizes a link to another site, so the answer is to use links that Google does not
recognize or count. These include form actions and links contained in JavaScript code.
Damping factor:
The Page Rank theory holds that even an imaginary surfer who is randomly clicking on
links will eventually stop clicking. The probability, at any step, that the person will
continue is a damping factor d. Various studies have tested different damping factors, but
it is generally assumed that the damping factor will be set around 0.85.
The damping factor is subtracted from 1 (and in some variations of the algorithm, the
result is divided by the number of documents in the collection) and this term is then
added to the product of the damping factor and the sum of the incoming Page Rank
scores.
That is,

PR(A) = (1-d) + d (PR(t1)/C(t1) + ... + PR(tn)/C(tn))
So any page's Page Rank is derived in large part from the Page Ranks of other pages. The
damping factor adjusts the derived value downward. Google recalculates Page Rank
scores each time it crawls the Web and rebuilds its index. As Google increases the
number of documents in its collection, the initial approximation of Page Rank decreases
for all documents.
The formula uses a model of a random surfer who gets bored after several clicks and
switches to a random page. The Page Rank value of a page reflects the chance that the
random surfer will land on that page by clicking on a link. If a page has no links to other
pages, it becomes a sink and therefore terminates the random surfing process. However,
the solution is quite simple. If the random surfer arrives at a sink page, it picks another
URL at random and continues surfing again.
When calculating Page Rank, pages with no outbound links are assumed to link out to all
other pages in the collection. Their Page Rank scores are therefore divided evenly among
all other pages. In other words, to be fair with pages that are not sinks, these random
transitions are added to all nodes in the Web, with a residual probability of usually d =
0.85, estimated from the frequency that an average surfer uses his or her browser's
bookmark feature.
So, the equation is as follows:

PR(pi) = (1-d)/N + d * sum over all pj in M(pi) of PR(pj) / L(pj)

where p1, p2, ..., pN are the pages under consideration, M(pi) is the set of pages that
link to pi, L(pj) is the number of outbound links on page pj, and N is the total number of
pages.
Keyword relevance algorithm:
In the keyword relevance algorithm, the page with the maximum count of the keywords is
ranked highest.
Two terms are used in the keyword relevance algorithm: total count and total keyword
occurrence. Total count stands for the total number of keyword hits that occur in a
web page; for example, if a page has the keywords sun, moon, earth, moon, then the total
count is 4. Total keyword occurrence measures how often a single keyword repeats; if a
page has 5 keyword hits, say sun, moon, earth, sun, sun, then the total keyword
occurrence is 3, since sun has occurred thrice, while the total count is incremented once
for every hit.
Database connector:
The database connector used in the project populates a set of records. The connector is
used to insert the specific records entered by the user. Once the user has entered all the
records, he recommends things such as places, foods or restaurants. The recommendations of
all the users are saved, which in turn are used for the endorsement of a specific thing.
This application can be used to create brand awareness on a social networking site, and it
can be further integrated with any of the ranking algorithms, which can then be used to
rank the data.
CHAPTER 3
PROJECT ORGANISATION
When an incremental model is used, the first increment is often a core product. That is,
basic requirements are addressed, but many supplementary features remain undelivered. The
core product is used by the customer. As a result of use and/or evaluation, a plan is developed
for the next increment. The plan addresses the modification of the core product to better meet
the needs of the customer and the delivery of additional features and functionality. This
process is repeated following the delivery of each increment, until the complete product is
produced.
The incremental model is iterative in nature. It focuses on the delivery of an operational
product with each increment. Early increments are stripped down versions of the final
product, but they do provide capability that serves the user and also provides a platform for
evaluation by the user. In addition, increments can be planned to manage technical risks.
Increment #2
Increment #3
Implementation of Crawler
Implementation of HTML parser
Increment #4
Increment #5
Increment #6
Increment #7
Increment #8
Increment #9
Understanding the requirements, purpose, goals and the scale of the project
a. Dnyaneshwari Chandarana
b. Nitu Singh
1. At Stage 1 we gathered requirements from the client and formulated the requirements
analysis in Microsoft Word 2003.
the cycle. Visual Studio supports languages by means of language services, which allow
any programming language to be supported (to varying degrees) by the code editor and
debugger, provided a language-specific service has been authored. Built-in languages
include C/C++ (via Visual C++), VB.NET (via Visual Basic .NET), and C# (via Visual C#).
Support for other languages such as Chrome, F#, Python, and Ruby among others has been
made available via language services which are to be installed separately. It also
supports XML/XSLT, HTML/XHTML, JavaScript and CSS.
As a functional test suite, it works together with HP QuickTest Professional and
supports enterprise quality assurance. HP WinRunner's intuitive recording process helps
you produce robust functional tests. To create a test, HP WinRunner simply records a
typical business process by emulating user actions, such as ordering an item or opening a
vendor account. During recording, you can directly edit generated scripts to meet the
most complex test requirements. Next, testers can add checkpoints, which compare
expected and actual outcomes from the test run. HP WinRunner offers a variety of
checkpoints, including text, GUI, bitmap and web links. HP WinRunner can also verify
database values to determine transaction accuracy and database integrity, highlighting
records that have been updated, modified, deleted and inserted. With a few mouse clicks,
the Data Driver Wizard feature lets you convert a recorded business process into a
data-driven test that reflects the real-life actions of multiple users. For further test
enhancement, the Function Generator feature presents a quick and reliable way to
program tests, while the Virtual Object Wizard feature lets you teach HP WinRunner to
recognize, record and replay any unknown or custom object. As HP WinRunner executes
tests, it operates the application automatically, as though a real user were performing
each step in the business process. If test execution occurs after hours or in the absence
of a quality assurance (QA) engineer, the Recovery Manager and Exception Handling
mechanisms automatically troubleshoot unexpected events, errors and application crashes
so that tests can complete smoothly. Once tests are run, HP WinRunner's interactive
reporting tools help your team interpret results by providing detailed, easy-to-read
reports that list errors and their origins. HP WinRunner lets your organization build
reusable tests to repeat throughout an application's lifecycle. Thus, if developers modify
an application over time, testers do not need to modify multiple tests. Instead, they can
apply changes to the Graphical User Interface (GUI) Map, a central repository of
test-related information, and HP WinRunner automatically propagates changes to all
relevant scripts.
CHAPTER 4
PROJECT MANAGEMENT PLAN
Task:
A task set is a collection of software engineering work tasks, deliverables and milestones,
resources, dependencies, constraints, risks and contingencies that must be accomplished to
complete a particular project. Our project can be carried out with a structured degree of rigor.
Our project has the following main tasks to be carried out.
Task Name
Description: This algorithm should rank the pages according to inbound and outbound links.
Resources needed:
Project plan:
Timeline chart:

Sr no | Task Name                                      | Duration | Start         | Finish
1     | Sponsorship Search                             | 8 days   | Mon 02/07/08  | Mon 09/07/08
2     | Formalities at Ubiqtas                         | 7 days   | Tues 10/07/08 | Mon 16/07/08
3     | Confirmation letter                            | 1 day    | Tues 17/07/08 | Tues 17/07/08
4     |                                                | 9 days   | Wed 18/07/08  | Thu 26/07/08
5     |                                                | 8 days   | Fri 27/07/08  | Fri 03/08/08
6     | Making synopsis                                | 10 days  | Sat 04/08/08  | Mon 13/08/08
7     |                                                | 1 day    | Tues 14/08/08 | Tues 14/08/08
8     | Confirmation of problem statement from college | 1 day    | Thu 16/08/08  | Thu 16/08/08
9     | Information gathering                          | 15 days  | Fri 17/08/08  | Fri 31/08/08
10    |                                                | 9 days   | Sat 01/09/08  | Sun 09/09/08
11    | Preparation of presentation                    | 8 days   | Mon 10/09/08  | Mon 17/09/08
12    | Delivery of seminar                            | 1 day    | Tues 18/09/08 | Tues 18/09/08
13    | Literature survey                              | 20 days  | Wed 19/09/08  | Mon 18/10/08
14    | Requirement specification                      | 26 days  | Tues 09/10/08 | Sat 03/11/08
15    | Initial design                                 | 25 days  | Sun 04/11/08  | Wed 28/11/08
16    | Verify design                                  | 9 days   | Tues 01/01/09 | Wed 09/01/09
17    |                                                | 60 days  | Thu 10/01/09  | 09/03/09
18    | GUI                                            | 15 days  | Mon 10/03/09  | Mon 24/03/09
19    |                                                | 28 days  | Tues 25/03/09 | Tues 29/04/09
20    |                                                | 10 days  | Wed 30/04/09  | 10/04/09
CHAPTER 5
SOFTWARE REQUIREMENT SPECIFICATION
Hardware Requirement:
Video Monitor (800 x 600 or higher resolution) with at least 256 colors (1024 x 768 High
color 16-bit recommended).
Software Requirement:
Database: MS Access
User Documentation: The user guide or manual should be small and contain all the
information in a format the user can understand. The user manual should also provide
pictures or diagrams to properly guide the user.
System features:
1. Helps in ranking web pages according to their popularity.
2. Helps in ranking web pages according to their relevancy.
3. Provides an interface/tool to create awareness on social networking sites.
User Interfaces:
We designed a simple user interface using the Microsoft Visual Studio 2003
development tool and C# as the programming language. Our user interface is similar to
most standard search engines, and contains buttons for performing the basic
functions as specified in the user requirements.
Most error messages will pop up in a dialog box.
Hardware Interfaces:
A computer with minimum 512 MB of RAM with internet connectivity is required.
Software Interfaces:
The Rules Engine will run only if the server (in our case Authenticator) is running on
the server machine. The server includes the MS Access database.
The COM DLL is needed to load the project at run time.
The page ranking algorithm computes the rank of pages from a specified set of
pages and displays the most highly ranked pages accordingly.
The Keyword relevance algorithm gives result according to the maximum
frequency count of words on a particular page. The page with the highest
frequency count will be on the top rank.
The database connector is used to insert particular information from the user such
as his/her name, likes as well as the recommendation made.
Communication Protocols:
The communication protocol used in our system is FTP.
File Transfer Protocol (FTP) is a network protocol used to exchange and manipulate files over
a TCP computer network, such as the internet. An FTP client may connect to an FTP server to
manipulate files on that server.
FTP runs over TCP [1]. By default it listens on port 21 for incoming connections from FTP clients.
A connection to this port from the FTP Client forms the control stream on which commands are
passed from the FTP client to the FTP server and on occasion from the FTP server to the FTP
client. FTP uses out-of-band control, which means it uses a separate connection for control and
data. Thus, for the actual file transfer to take place, a different connection is required which is
called the data stream. Depending on the transfer mode, the process of setting up the data stream
is different. Port 21 for control (or program), port 20 for data.
In active mode, the FTP client opens a dynamic port, sends the FTP server the dynamic port
number on which it is listening over the control stream and waits for a connection from the FTP
server. When the FTP server initiates the data connection to the FTP client it binds the source
port to port 20 on the FTP server.
The objectives of FTP are:
1. To promote sharing of files (computer programs and/or data).
2. To encourage indirect or implicit use of remote computers.
3. To shield a user from variations in file storage systems among different hosts.
4. To transfer data reliably, and efficiently.
Reliability:
The reliability of the overall program depends on the reliability of the separate
components.
Availability: The system can be made available if you have the specific kind of connector
for the specific kind of data one wants to search. Internet availability is a must.
Security:
Passwords will be saved in the database in order to ensure the user's privacy.
Maintainability:
The maintainability of the project has been addressed by assigning appropriate variable
names, following appropriate naming convention for functions and appropriate coding
standards. The segregation of code makes it easy to understand, maintain and modify.
Portability:
The application is Windows XP-based and should be compatible with other systems. The
end-user part is fully portable, and any system with any operating system should be able
to use the features of the application.
Database Requirements:
A database is maintained in MS access to keep a list of all users, their likes and
recommendation made by them. Following are the tables maintained in the database:
FIELD NAME     | DATATYPE | VALIDATION
Name           | TEXT     |
Likes          | TEXT     | NOT NULL
Recommendation | TEXT     | NOT NULL
CHAPTER 6
SOFTWARE DESIGN DESCRIPTION
Rules engine will perform processing of the rules. Processing includes different functions like
indexing, stemming and stop word removal depending on the parameters passed by the user.
Algorithm: We will develop an algorithm which takes these parameters as input and
generates the most popular result on the basis of the criteria defined.
This algorithm will have many criteria defined which will allow the user to search
specific information according to his own chosen criteria.
Whenever we search for something on a search engine, the results are displayed according to
the popularity of pages, meaning that pages with a high rank are displayed first.
Instead, we can let the user decide the criteria for searching and return the results
according to their own chosen criteria.
a. Memory management:
In Win32, the DLL files are organized into sections. Each section has its own set of
attributes, such as being writable or read-only, executable (for code) or non-executable
(for data), and so on.
The code in a DLL is usually shared among all the processes that use the DLL; that is,
they occupy a single place in physical memory, and do not take up space in the page file.
If the physical memory occupied by a code section is to be reclaimed, its contents are
discarded, and later reloaded directly from the DLL file as necessary.
In contrast to code sections, the data sections of a DLL are usually private; that is, each
process using the DLL has its own copy of all the DLL's data. Optionally, data sections
can be made shared, allowing inter-process communication via this shared memory area.
However, because user restrictions do not apply to the use of shared DLL memory, this
creates a security hole; namely, one process can corrupt the shared data, which will likely
cause all other sharing processes to behave undesirably. For example, a process running
under a guest account can in this way corrupt another process running under a privileged
account. This is an important reason to avoid the use of shared sections in DLLs.
If a DLL is compressed by certain executable packers (e.g. UPX), all of its code sections
are marked as read-and-write, and will be unshared. Read-and-write code sections, much
like private data sections, are private to each process. Thus DLLs with shared data
sections should not be compressed if they are intended to be used simultaneously by
multiple programs, since each program instance would have to carry its own copy of the
DLL, resulting in increased memory consumption.
b. Import libraries
Linking to dynamic libraries is usually handled by linking to an import library when
building or linking to create an executable file. The created executable then contains an
import address table (IAT) by which all DLL function calls are referenced (each
referenced DLL function contains its own entry in the IAT). At run-time, the IAT is filled
with appropriate addresses that point directly to a function in the separately-loaded DLL.
Like static libraries, import libraries for DLLs are noted by the .lib file extension. For
example, kernel32.dll, the primary dynamic library for Windows' base functions such as
file creation and memory management, is linked via kernel32.lib.
Example
Basic use of curl involves simply typing curl at the command line, followed by the URL
of the output you want to retrieve.
To retrieve the Wikipedia homepage, type:
curl www.wikipedia.org
Curl defaults to displaying the output it retrieves to the standard output specified on the
system, which is usually the terminal window. So running the command above would, on
most systems, display the www.wikipedia.org source code in the terminal window.
4. Library file HTMLReader_src is an html parser used to parse HTML web pages [16].
An events-based parser uses the callback mechanism to report parsing events. These
callbacks turn out to be protected virtual member functions that you will override.
Events, such as the detection of an opening tag or the closing tag of an element, will
trigger a call to the corresponding member function of your class. The application
implements and registers an event handler with the reader. It is up to the application to
put some code in the event handlers designed to achieve the objective of the application.
Events-based parsers provide simple, fast, lower-level access to the document
being parsed.
Events-based parsers do not create an in-memory representation of the source document.
They simply parse the document and notify client applications about various elements
they find along the way. What happens next is the responsibility of the client application.
Events-based parsers don't cache information and have an enviably small memory
footprint.
The page ranking algorithm, the keyword relevance algorithm and the web crawler are
integrated to form a web connector. We create a COM DLL and import this DLL into our
Windows application. A database connector, which uses MS Access as the backend, is
created for the same Windows application.
A web crawler (also known as a web spider, web robot, or, especially in the FOAF
community, a web scutter) is a program or automated script which browses the World Wide
Web in a methodical, automated manner. Other less frequently used names for web crawlers
are ants, automatic indexers, bots, and worms. This process is called web crawling. Many
sites, in particular search engines, use crawling as a means of providing up-to-date
data. Web crawlers are mainly used to create a copy of all the visited pages for later
processing by a search engine that will index the downloaded pages to provide fast
searches. Crawlers can also be used for automating maintenance tasks on a website, such
as checking links or validating HTML code, and to gather specific types of information
from web pages, such as harvesting e-mail addresses (usually for spam). Web crawling is
modeled as a multiple-queue, single-server polling system in which the web crawler is
the server and the web sites are the queues. The objective of the crawler is to keep the
average freshness of pages in its collection as high as possible, or to keep the average
age of pages as low as possible; to improve freshness, we should penalize elements that
change too often. A web crawler is one type of bot, or software agent. In general, it
starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs,
it identifies all the hyperlinks in the page and adds them to the list of URLs to visit,
called the crawl frontier. URLs from the frontier are recursively visited according to a
set of policies.
Algorithm for Web Crawler:
1. Go to the URL at the head of the queue, scan the entire page, and find any links
present; if any URLs are found, dump them into a linked list.
2. All the URLs present in the linked list are called child URLs, and the one present
in the queue is called the parent.
3. Pick the first child URL from the linked list and dump it into the queue; this URL
then becomes the parent. Repeat step 1.
4. Repeat the process for each and every child URL present in the linked list.
5. Keep doing so until the depth mentioned at the start of the code is reached.
6. Database Connector:
This connector is used for populating a set of records. The connector is used to
insert the specific records entered by the user.
Once the user has entered all the records, he recommends some things, such as
places, foods or restaurants.
The recommendations of all the users are saved, which in turn are used for the
endorsement of a specific thing.
This application can be further integrated with any of the ranking algorithms,
and the ranking algorithm can be used to rank the data.
The place/thing with the maximum number of votes is the most popular among all
the data.
When a user fires a query to see endorsements of a specific thing, the one with
the highest number of votes will be at the top of the list.
UML diagrams:
1. Use case diagram:
2. Class Diagram:
3. Activity Diagram:
4. Sequence Diagram:
5. Communication Diagram:
6. Component Diagram:
7. Deployment Diagram:
Implementation Details:
1. Page Rank Algorithm:
#include <cstdio>

int main()
{
    const double d = 0.85;   // damping factor
    const double n = 1 - d;
    // linkMap[k][j] is non-zero when page k links to page j
    float linkMap[10][10] = {
        {1,2,3,0,0,0,1,2,3,1},
        {4,5,6,1,0,0,1,2,3,1},
        {0,2,3,1,1,1,1,2,3,0},
        {1,2,3,0,0,0,1,0,0,0},
        {1,3,3,0,0,0,1,2,1,1},
        {1,1,1,1,1,0,0,0,0,0},
        {0,0,0,0,0,0,1,1,1,1},
        {1,1,1,0,0,0,2,2,2,2},
        {1,1,1,0,0,0,1,1,2,1},
        {1,2,2,0,3,2,1,2,3,1}};
    // count the outbound links of each page
    float outboundLinks[10];
    for (int k = 0; k < 10; k++)
    {
        outboundLinks[k] = 0;
        for (int j = 0; j < 10; j++)
            if (linkMap[k][j] != 0)
                outboundLinks[k]++;
    }
    // start every page with an equal value, then apply the PageRank
    // formula PR(j) = (1 - d) + d * sum(PR(k) / outboundLinks(k)) over
    // the pages k that link to j; a few passes are enough to settle
    float pageValue[10];
    for (int j = 0; j < 10; j++)
        pageValue[j] = 1.0f;
    for (int m = 0; m < 10; m++)
    {
        float newValue[10];
        for (int j = 0; j < 10; j++)
        {
            newValue[j] = 0;
            for (int k = 0; k < 10; k++)
                if (linkMap[k][j] != 0)
                    newValue[j] += pageValue[k] / outboundLinks[k];
            newValue[j] = n + d * newValue[j];
        }
        for (int j = 0; j < 10; j++)
        {
            pageValue[j] = newValue[j];
            printf("PageValue[%d] = %f\n", j + 1, pageValue[j]);
        }
        printf("-------------------------------\n");
    }
    return 0;
}
2. Keyword Relevance Algorithm
/* Sort the result list so that pages with more keyword occurrences
   (and, on ties, a higher total count) come first. The node data is
   swapped in place rather than relinking the list. */
for (temp1 = Head[0]; temp1 != NULL; temp1 = temp1->next)
{
    for (temp2 = temp1->next; temp2 != NULL; temp2 = temp2->next)
    {
        if ((temp1->keywordOccurance < temp2->keywordOccurance) ||
            ((temp1->keywordOccurance == temp2->keywordOccurance) &&
             (temp1->TotalCount < temp2->TotalCount)))
        {
            ptr = temp1->nodePtr;
            temp1->nodePtr = temp2->nodePtr;
            temp2->nodePtr = ptr;
            i = temp1->keywordOccurance;
            temp1->keywordOccurance = temp2->keywordOccurance;
            temp2->keywordOccurance = i;
            i = temp1->TotalCount;
            temp1->TotalCount = temp2->TotalCount;
            temp2->TotalCount = i;
        }
    }
}
3. Database connector
private void ExecuteInsertQuery()
{
    // Check whether an identical entry is already present.
    OpenConnection();
    crawlerAdapter.SelectCommand.Connection = crawlerConnection;
    crawlerAdapter.SelectCommand.CommandText =
        "SELECT * FROM table1 WHERE Person = '" + textBox4.Text +
        "' AND Category = '" + textBox6.Text +
        "' AND Object = '" + textBox5.Text + "'";
    DataSet ds = new DataSet();
    crawlerAdapter.Fill(ds);
    CloseConnection();
    if (ds.Tables[0].Rows.Count > 0)
    {
        MessageBox.Show("Entry is already present");
        return;
    }

    // Compute the next ID as MAX(ID) + 1, or 1 for an empty table.
    int ID = 1;
    OpenConnection();
    crawlerAdapter.SelectCommand.CommandText = "SELECT MAX(ID) FROM table1";
    ds = new DataSet();
    crawlerAdapter.Fill(ds);
    CloseConnection();
    if (ds.Tables[0].Rows.Count > 0 && ds.Tables[0].Rows[0][0] != DBNull.Value)
    {
        ID = (int)ds.Tables[0].Rows[0][0] + 1;
    }

    // Insert the new record. Note: building SQL by concatenating user
    // input is vulnerable to SQL injection; parameterized commands
    // would be safer.
    OpenConnection();
    crawlerAdapter.InsertCommand.Connection = crawlerConnection;
    crawlerAdapter.InsertCommand.CommandText =
        "INSERT INTO table1 VALUES (" + ID + ", '" + textBox4.Text +
        "', '" + textBox6.Text + "', '" + textBox5.Text + "')";
    int rows = crawlerAdapter.InsertCommand.ExecuteNonQuery();
    CloseConnection();
    PopulateComboBox();
    MessageBox.Show("Insert Successful");
}
CHAPTER 7
Our goal is to design a series of test cases that have a high likelihood of
finding errors. Software testing techniques provide systematic guidance for
designing tests that exercise the internal logic of software components and
exercise the input and output domains of the program to uncover errors in
program function, behavior, and performance.
Takes a single input, the user id, for the detection of anomalies, which is
used to generate the recommendations. Appropriate alerts are generated as
per the condition for user convenience. This requires the user to be
registered with the system before use.
At least one, and preferably all, of the following types of testing should be
performed before releasing the application to customers.
Performance testing
Load testing
Stress testing
Performance Testing
Performance testing is designed to test the run-time performance of the
application within the context of an integrated system. Proper response time
for user actions is critical to maintaining and enhancing the user base.
Load Testing
Load testing demonstrates how the application performs under concurrent user
sessions for typical user scenarios. Setting up common scenarios that execute
for a short period of time shows how the application operates under a
multiple-user load.
Stress Testing
A stress test examines how the application behaves under a maximum user load.
To stress test the application, remove the think time from the load scripts
and execute the scripts against the server to overload the application. If
there are unhandled exceptions during a stress test, the application may not
be robust enough to handle a sudden, unexpected increase in user activity.
Stress tests generally execute for a longer period of time and can catch
difficult-to-diagnose problems, such as subtle memory leaks, in the
application.
Items to be tested:
The following items constitute the proposed system. Here we ensure that all
the modules, classes, and libraries are integrated properly.
No   Name                          Identifier   Version no
1.   The page rank algorithm       C1           1
2.   Keyword relevance algorithm   C2           1
3.   Web crawler                   C3           1
4.   Database connector            C4           1
Features to be tested:
Here we test all the features provided by the proposed system, to ensure that
the features that distinguish the system are implemented properly.
Approach:
Test Deliverables:
The following are the deliverables of testing:
1. Test plan
2. Test cases
3. Test procedure sections
4. Test summary reports
5. Test logs
Test Tasks:
Test Environment:
Software requirements:
Category (Software tools)     Software Name
Operating System              Microsoft Windows XP
Framework                     .net 2003
Front End                     VC++, C#.net
Back End                      MS Access, Files

Hardware requirements:

Hardware            Minimum Requirement
Microprocessor
RAM                 512 MB
Hard disk           20 GB (min. free usable space)
Network

Responsibilities:

Sr. no   Name           Designation     Task
1.       Dnyaneshwari   Test Manager
2.       Nitu           Test Manager
3.       Dnyaneshwari   Test Engineer
4.       Nitu           Test Engineer
Risks:
1. Power failure.
2. Hardware failure.
3. Server crashes.
4. Site unable to handle the load.
1. GENERAL INFORMATION:
PRODUCT NAME: RULES ENGINE
2. EXECUTION INFORMATION:
Test id: 1
Item to be tested: URL
Input: URL address
Actual output: Display success
Expected output: Display message successful
Pass/fail: Pass

Test id: 2
Item to be tested: System check for proper address entered by the user
Steps: System compares the data entered by the user and the data present in the database.
Input: If address is valid
Actual output: Make connection
Expected output: Make connection
Pass/fail: Pass
Input: If address is invalid
Actual output: Report improper address
Expected output: Report error
Pass/fail: Pass

Test id: 3
Item to be tested: System computes page rank
Steps: System downloads relevant pages from the web.

Test id: 4
Item to be tested: User enters URL to compute rank by keyword relevance algorithm
Steps: System checks if the URL entered is in correct format.
Input: URL address
Actual output: Display message successful
Expected output: Display message successful
Pass/fail: Pass
Test case descriptions:
To check whether the user has selected the correct application assigned to
him/her by the admin.
TEST ID: 1
Item to be tested: User selects application from the application list

TEST ID: 2
Item to be tested: Names of the user
Steps: System checks if duplicated records are present in the database
Input: User name
Actual output: Display error if present
Expected output: Display error if present
Pass/fail: pass

TEST ID: 3, 4
Item to be tested: User fills recommendation
Steps: System updates information and assigns the user's vote to that particular place/thing
Input: User recommendation

TEST ID: 5
Item to be tested: User searches to see the popularity of a particular place/thing
Input: Place/thing
Actual output: The most popular record
Pass/fail: pass
TEST ID   TEST ACTIONS
          To check whether the recommendation of one user is made available to all users.
CHAPTER 8
FURTHER WORK
The user can use different connectors and ranking algorithms to rank
different types of data. To do this, the user just has to add an extra tab
and insert his code inside the framework.
CHAPTER 9
SCREENSHOTS
CHAPTER 10
REFERENCES
BOOKS:
1. Roger Pressman, Software Engineering
2. Information Retrieval

WEBSITES:
1) http://blog.taragana.com/index.php/archive/clean-room-implementation-of-google-page-rank-algorithm/
2) http://www.stanford.edu/group/reputation/ClickThroughAlg_Tutorial.pdf
3) http://kojotovski.diinoweb.com/files/The_mathematical_model_of_Google.pdf
4) http://citeseer.ist.psu.edu/cache/papers/cs/7144/http:zSzzSzwww-db.stanford.eduzSz~backrubzSzpageranksub.pdf/page98pagerank.pdf
5) http://www.suchmaschinen-doktor.de/index.html
6) http://wwwhome.math.utwente.nl/~litvakn/IntMath07.pdf
7) http://www2006.org/programme/files/xhtml/3101/p3101-Richardson.html
8) http://www.texaswebdevelopers.com/docs/pagerank.pdf
9) http://pr.efactory.de/e-pagerank-implementation.shtml
10) http://www.rankforsales.com/n-aa/095-seo-may-31-03.html
11) http://www.pwqsoft.com/search-engine-ranking.htm#case2
12) http://www.webworkshop.net/pagerank.html
13) http://www.ianrogers.net/google-page-rank/
14) http://www.webworkshop.net/pagerank_calculator.html
15) http://www.linkingmatters.com/WhyLinkingIsImportant.html
16) http://www.example-code.com/vcpp/spider_simplecrawler.asp
17) http://en.wikipedia.org/wiki/Web_crawler
18) http://www.codeproject.com/KB/library/GomzyHTMLReader.aspx
19) http://en.wikipedia.org/wiki/CURL
20) http://en.wikipedia.org/wiki/Dynamic-link_library
21) http://en.wikipedia.org/wiki/WinRunne
22) http://www.nokiasoftware.net/general-discussions/19871-net-framework.html
23) http://cache.phazeddl.com/1412686/Microsoft%20Visual%20Studio%206.0
24) www.rocw.raifoundation.org/management/mba/.../lecture-10.pdf
25) http://en.wikipedia.org/wiki/PageRank#Algorithm