
WEB MINING

1. Starter
2. A Taxonomy of Web Mining
2.1 Web Content Mining
2.1.1 Web Crawler
2.1.2 Harvest System
2.1.3 Virtual Web View
2.1.4 Personalization
3. Web Structure Mining
3.1 Page Rank
3.1.1 Important Pages
4. Web Usage Mining
5. The WEBMINER System
5.1 Browsing Behavior Models
5.2 Developer's Model
6. Preprocessing
7. Data Structures
8. Finding Unusual Itemsets
8.1 The DICE Engine
8.2 Books and Authors
8.3 What is a pattern?
8.4 Data Occurrences
8.5 Finding Data Occurrences Given Data
8.6 Building Patterns from Data Occurrences
8.7 Finding Occurrences Given Patterns

1. Starter

On the World Wide Web there are billions of documents that are spread over millions of
different web servers. The data on the web is called web data and is classified as
follows:

(1) Content of Web pages. These pages have the following structures:

(a) Intra-page structures that include HTML or XML code for the page.
(b) Inter-page structure that includes the actual linkage of different web pages.

(2) Usage data that describe how web pages are accessed by visitors.

(3) Usage profiles, which give the characteristics of visitors, including demographics,
psychographics, and technographics.

Demographics are tangible attributes such as home address, income, purchasing
responsibility, or recreational equipment ownership.

Psychographics are personality types that might be revealed in a psychological survey,
such as highly protective feelings toward children (commonly called "gatekeeper
moms"), impulse-buying tendencies, early technology interest, and so on.

Technographics are attributes of the visitor's system, such as operating system,
browser, domain, and modem speed.

With the explosive growth of information sources available on the World Wide Web, it
has become increasingly necessary for users to utilize automated tools in order to find,
extract, filter, and evaluate the desired information and resources. In addition, with the
transformation of the Web into the primary tool for electronic commerce, it is imperative
for organizations and companies, who have invested millions in Internet and Intranet
technologies, to track and analyze user access patterns. These factors give rise to the
necessity of creating server-side and client-side intelligent systems that can effectively
mine for knowledge both across the Internet and in particular Web localities.

Web mining can be broadly defined as:

The discovery and analysis of useful information from the World Wide Web.

This definition describes the automatic search and retrieval of information and resources
available from millions of sites and on-line databases. This is called Web content
mining, and the discovery and analysis of user access patterns from one or more Web
servers or on-line services is called Web usage mining. The structure of the Web
organization is modeled using Web structure mining.

In this chapter, we provide an overview of tools, techniques, and problems associated
with the three dimensions above.


There are several important issues, unique to the Web paradigm, that come into play if
sophisticated types of analyses are to be done on server side data collections. They
include:

o The necessity of integrating various data sources, such as server access logs,
referrer logs, and user registration or profile information

o Resolving difficulties in the identification of users due to missing unique key
attributes in collected data

o The importance of identifying user sessions or transactions from usage data, site
topologies, and models of user behavior.

2. A Taxonomy of Web Mining

In this section we present a taxonomy of Web mining along its three primary
dimensions, namely Web content mining, Web structure mining and Web usage
mining. This taxonomy is depicted in the following figure.

Web Mining
o Web Content Mining
  - Web Page Content Mining
  - Search Result Mining
o Web Structure Mining
o Web Usage Mining
  - General Access Pattern Tracking
  - Customized Usage Mining

2.1 Web Content Mining

The World Wide Web contains information sources that are heterogeneous and largely
unstructured, such as hypertext and extensible markup (XML) documents. This makes it
difficult to locate Web-based information automatically, as well as to organize and
manage it.

Web content mining is helpful for retrieving pages, locating and ranking relevant
web pages, browsing through relevant and related web pages, and extracting and
gathering information from web pages.

Traditional search and indexing tools of the Internet and the World Wide Web such as
Lycos, Alta Vista, WebCrawler, ALIWEB, MetaCrawler, and others provide some
comfort to users, but they do not generally provide structural information nor
categorize, filter, or interpret documents. Here we see few tools which are commonly
used.

2.1.1 Web Crawler

Let us concentrate on searching for Web pages. The significant problems in searching for
a particular web page are as follows:

(a) Scale: The Web grows at a faster rate than machines and disks.

(b) Variety: The Web pages are not always documents.

(c) Duplicates: There are various Web pages that are mirrored or copied.

(d) Domain Name Resolution: Mapping a symbolic address to an IP address can take a
long time.

To tackle these problems, one can use a program called a web crawler.

A web crawler (also called a robot, spider or worm) is a program which automatically
traverses the web by downloading documents and following links from page to page. The
page from which the crawler starts is called the seed URL. The links on this page are
stored in a queue for the search engine; the new pages they point to are searched in turn
and their links added to the queue.

As the crawler moves through the Web, it collects information about each page, extracts
keywords, and stores them in indices for users of the associated search engine.

One such crawler design is implemented in Java and is built to scale to tens of
millions of web pages. The main components of the system are a URL list, a downloader,
and a link extractor.

The crawler was designed so that at most one worker thread will download from a given
server. This was done to avoid overloading any servers. The crawler uses user-specified
URL filters (domain, prefix, and protocol) to decide whether or not to download
documents. It is possible to use the conjunction, disjunction or negation of filters.
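
To make the crawl loop concrete, here is a minimal Python sketch (an illustration only,
not the system described above); the filter helpers and the example domain are
hypothetical:

from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import re

# Hypothetical user-specified URL filters that can be combined.
def domain_filter(domain):
    return lambda url: urlparse(url).netloc.endswith(domain)

def protocol_filter(scheme):
    return lambda url: urlparse(url).scheme == scheme

def negation(f):
    return lambda url: not f(url)

def conjunction(*fs):
    return lambda url: all(f(url) for f in fs)

LINK_RE = re.compile(r'href="(.*?)"', re.IGNORECASE)

def crawl(seed_url, url_filter, max_pages=100):
    queue, seen, index = deque([seed_url]), {seed_url}, {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue                        # skip pages that fail to download
        index[url] = html                   # a real crawler would extract keywords here
        for link in LINK_RE.findall(html):
            link = urljoin(url, link)       # resolve relative links
            if url_filter(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return index

# Example: only follow http links inside an assumed example.com domain.
# crawl("http://www.example.com/", conjunction(domain_filter("example.com"), protocol_filter("http")))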

Crawlers are of different types, as discussed below:

a. Traditional Crawler: A traditional crawler visits the entire Web, gathers the
information, and builds an index that replaces the existing index.


b. Periodic Crawler: A periodic crawler visits a specific number of pages and stops. It
builds an index that replaces the existing one. This crawler is activated periodically.

c. Incremental Crawling: An incremental crawler is one which updates an existing set
of downloaded pages instead of restarting the crawl from scratch each time. This
involves some way of determining whether a page has changed since the last time it
was crawled.

d. Focused Crawling: A general-purpose web crawler normally tries to gather as many
pages as it can from a particular set of sites. In contrast, a focused crawler is designed
to gather only documents on a specific topic, thus reducing the amount of network
traffic and downloads.

Searching hypertext documents can be based on a depth-first search algorithm. This
algorithm uses the school-of-fish metaphor, with multiple processes or threads following
links from pages. The "fish" follow more links from relevant pages, based on keyword and
regular expression matching. This type of system can place heavy demands on the
network, and various caching strategies are used to deal with this.

Here the crawler starts from a canonical topic taxonomy and user-specified starting
points (e.g. bookmarks). A user marks interesting pages as they browse, and these are
then placed in a category in the taxonomy.

The main components of the focused crawler are a classifier, a distiller and a crawler.
The classifier makes relevance judgements on pages to decide on link expansion, and
the distiller determines the centrality of pages to decide visit priorities. This is based on
connectivity analysis and is evaluated using the harvest ratio, which measures the rate
at which relevant pages are acquired and how effectively irrelevant pages are filtered
away.

To use a focused crawler, the user identifies the topic that (s)he is looking for. While
browsing the Web, (s)he marks the documents that are of interest. These are classified
against a hierarchical classification tree, and the corresponding nodes in the tree are
marked as good, indicating that each such node has document(s) of interest associated
with it. These documents are then used as the seed documents to begin the focused
crawling. During the crawling phase, relevant documents are found and the crawler
determines whether it makes sense to follow the links out of these documents. Each
document is classified into a leaf node of the taxonomy tree.

In recent years, intelligent tools have been developed for information retrieval, such as
intelligent Web agents, and database and data mining techniques have been extended to
provide a higher level of organization for the semi-structured data available on the Web.
We summarize these efforts below.

[A] Agent Based Crawling

Many software agents have been developed for web crawling on the Internet, or created
to act as browsing assistants.

The InfoSpiders system (formerly ARACHNID) is based on an ecology of agents which
search through the network for information. As an example, a user's bookmarks could
be used as a starting point, with the agents then analyzing the "local area" around these
start points. Link relevance estimates are used to move to another page, with agents
being rewarded with energy (credit) if documents appear to be relevant. Agents are
charged energy costs for using network resources, and use user assessments if a
document has been previously visited. If an agent moves off-topic it will eventually die
off due to loss of energy.

[B] Database Approach

The database approaches to Web mining have generally focused on techniques for
integrating and organizing the heterogeneous and semi-structured data on the Web into
more structured and high-level collections of resources, such as in relational databases,
and using standard database querying mechanisms and data mining techniques to
access and analyze this information.

(i) Multilevel Databases

The main idea behind it is that the lowest level of the database contains primitive semi-
structured information stored in various Web repositories, such as hypertext documents.
At the higher level(s) meta-data or generalizations are extracted from lower levels and
organized in structured collections such as relational or object-oriented databases.

(ii) Web Query Systems

There have been many Web-based query systems and languages developed recently that
attempt to utilize standard database query languages such as SQL, structural
information about Web documents, and even natural language processing for
accommodating the types of queries that are used in World Wide Web searches. We
mention a few examples of these Web-based query systems here.

W3QL: combines structure queries, based on the organization of hypertext documents,
and content queries, based on information retrieval techniques.

WebLog: Logic-based query language for restructuring extracted information from Web
information sources.

Lorel and UnQL: query heterogeneous and semi-structured information on the Web
using a labeled graph data model.

TSIMMIS: extracts data from heterogeneous and semi-structured information sources
and correlates them to generate an integrated database representation of the extracted
information.

WebML: a query language for accessing documents using data mining operations; it
offers a list of operations based on the use of concept hierarchies, built around the
following keywords.

1. COVERS: One concept covers another.

2. COVERED BY: This is the reverse of COVERS.

3. LIKE: One concept is similar to another.

4. CLOSE TO: One concept is close to another.

Following is an example of a WebML query:

SELECT *
FROM document in www.ed.smm.edu
WHERE ONE OF keywords COVERS “dog”

2.1.2 Harvest System

This system is based on the use of caching, indexing and crawling. It is a set of tools
that is used to collect information from different sources, and it is designed around
collectors and brokers.

A collector obtains the information for indexing from an Internet service provider, while
a broker provides the index and query interface. The relationship between collectors and
brokers can vary: a broker may interface directly with a collector, or may go through
other brokers to get to the collectors. Indices in Harvest are topic specific, as are brokers.

2.1.3 Virtual Web View

This approach is based on the multilevel database approach discussed earlier.

2.1.4. Personalization

With Web personalization, users can get more information on the Internet faster
because Web sites already know their interests and needs. But to gain this convenience,
users must give up some information about themselves and their interests — and give
up some of their privacy. Web personalization is made possible by tools that enable Web
sites to collect information about users.

One of the ways this is accomplished is by having visitors to a site fill out forms with
information fields that populate a database. The Web site then uses the database to
match a user's needs to the products or information provided at the site, with
middleware facilitating the process by passing data between the database and the Web
site.

An example is Amazon.com Inc.'s ability to suggest books or CDs users may want to
purchase based on interests they list when registering with the site.

Customers tend to buy more when they know exactly what's available at the site and
they do not have to hunt around for it.

Cookies may be the most recognizable personalization tools. Cookies are bits of code
that sit in a user's Internet browser memory and tell Web sites who the person is;
that's how a Web site is able to greet users by name.


A less obvious means of Web personalization is collaborative-filtering software that
resides on a Web site and tracks users' movements. Wherever users go on the Internet,
they can't help but leave footprints, and software is getting better at reading the paths
users take across the Web to discern their interests and viewing habits: from the
amount of time they spend on one page to the types of pages they choose.

Collaborative-filtering software compares the information it gains about one user's
behavior against data about other customers with similar interests. In this way, users
get recommendations like Amazon's "Customers who bought this book also bought. . ."

These are "rules-based personalization systems". If you have historical information,
you can buy data-mining tools from a third party to generate the rules. Rules-based
personalization systems are usually deployed in situations where there are limited
products or services offered, such as insurance and financial institutions, where human
marketers can write a small number of rules and walk away.

Other personalization systems, such as Andromedia LikeMinds, emphasize automatic
real-time selection of items to be offered or suggested. Systems that use the idea that
"people like you make good predictors for what you will do" are called "collaborative
filters." These systems are usually deployed in situations where there are many items
offered, such as clothing, entertainment, office supplies, and consumer goods. Human
marketers go insane trying to determine what to offer to whom when there are
thousands of items to offer, so automatic systems are usually more effective in these
environments. Personalizing from large inventories is complex, unintuitive, and
requires processing huge amounts of data.

Let us consider an example. Ms. Heena Mehta purchases items by shopping online
through the web site ABC.com. She first logs in using an ID. This ID is used to keep
track of what she purchases as well as which pages she visits. ABC.com's data mining
tools are used to develop a detailed user profile for Heena from her purchases and web
usage data. This profile is later used to display advertisements. For example, suppose
Heena purchased a bulk order of chocolates last week, and today she logs in and goes to
the page with attractive Barbie dolls. While she is looking at these pages, ABC shows a
banner ad about a special sale on milk chocolates. Heena can't resist: she immediately
follows the link, adds the chocolates to her shopping list, and then returns to the page
with the Barbie she wants.

3. Web Structure Mining

Web structure mining builds a model of the Web's organization, or of a portion of it.
This model can be used to classify web pages or to create similarity measures between
documents. We have already seen some structure mining ideas in the previous article;
those approaches used structure to improve the effectiveness of search engines and
crawlers.

Following are two techniques used for structure mining.


3.1 Page Rank

PageRank is one of the methods Google uses to determine a page's relevance or
importance. The PageRank value for a page is calculated based on the number of pages
that point to it; it is effectively a measure based on the number of backlinks to a page.

PageRank is displayed on the toolbar of your browser if you’ve installed the Google
toolbar (http://toolbar.google.com/). But the Toolbar PageRank only goes from 0 – 10
and seems to be something like a logarithmic scale:

Toolbar PageRank        Real PageRank (log base 10)
0                       0 - 100
1                       100 - 1,000
2                       1,000 - 10,000
3                       10,000 - 100,000
4                       and so on...

Following are some of the terms used:

(1) PR: Shorthand for PageRank: the actual, real, page rank for each page as
calculated by Google.
(2) Toolbar PR: The PageRank displayed in the Google toolbar in your browser. This
ranges from 0 to 10.
(3) Backlink: If page A links out to page B, then page B is said to have a “backlink”
from page A.

We can’t know the exact details of the scale because the maximum PR of all pages on
the web changes every month when Google does its re-indexing! If we presume the
scale is logarithmic then Google could simply give the highest actual PR page a toolbar
PR of 10 and scale the rest appropriately.

Thus the question is “What is PageRank?”. So let’s answer it.

PageRank is a “vote”, by all the other pages on the Web, about how important a page
is. A link to a page counts as a vote of support. If there’s no link there’s no support (but
it’s an abstention from voting rather than a vote against the page).

Another definition, given by Google, is as follows:

We assume page A has pages T1...Tn which point to it (i.e., are citations). The
parameter d is a damping factor, which can be set between 0 and 1; we usually set d to
0.85. Also, C(A) is defined as the number of links going out of page A.

The PageRank of a page A is given as follows:


PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of
all web pages' PageRanks will be one.

Let us break down PageRank or PR(A) into the following sections:

1. PR(Tn) - Each page has a notion of its own self-importance. That’s “PR(T1)” for
the first page in the web all the way up to “PR(Tn)” for the last page

2. C(Tn) - Each page spreads its vote out evenly amongst all of its outgoing links.
The count, or number, of outgoing links for page 1 is "C(T1)", "C(Tn)" for page
n, and so on for all pages.

3. PR(Tn)/C(Tn) - so if our page (page A) has a backlink from page “n” the share
of the vote page A will get is “PR(Tn)/C(Tn)”

4. d(... - All these fractions of votes are added together but, to stop the other
pages having too much influence, this total vote is “damped down” by
multiplying it by 0.85 (the factor “d”)
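
As a rough illustration of the formula (not Google's actual implementation), the
following Python sketch iterates PR(A) = (1-d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
over a small hypothetical link graph:

def pagerank(links, d=0.85, iterations=50):
    # links maps each page to the list of pages it links out to;
    # this sketch assumes every page has at least one outgoing link.
    pages = list(links)
    pr = dict.fromkeys(pages, 1.0)                  # start every page with PR = 1
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # sum PR(T)/C(T) over every page T that has a backlink to `page`
            backlink_share = sum(pr[t] / len(links[t]) for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * backlink_share
        pr = new_pr
    return pr

# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))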

3.1.1 Important Pages

A page is important if important pages link to it. Following is the PageRank method.

Create a stochastic matrix of the Web; that is:

1. Each page i corresponds to row i and column i of the matrix.

2. If page j has n successors (links), then the (i, j)th entry is 1/n if page i is one of these
n successors of page j, and 0 otherwise.

The goal behind this matrix is:

o Imagine that initially each page has one unit of importance. At each round, each
page shares whatever importance it has among its successors, and receives new
importance from its predecessors.

o Eventually, the importance of each page reaches a limit, which happens to be its
component in the principal eigen-vector of this matrix.

o That importance is also the probability that a Web surfer, starting at a random page,
and following random links from each page will be at the page in question after a
long series of links.

Let us consider a few examples:


Example 1: Assume that the Web consists of only three pages: Netscape, Microsoft,
and Amazon. The links among these pages are as shown in the following figure.

[Figure: Netscape links to itself and to Amazon; Microsoft links to Amazon; Amazon
links to Netscape and Microsoft]

Let [n; m; a] be the vector of importance for the three pages Netscape, Microsoft,
Amazon, in that order. Then the equation describing the asymptotic values of these
three variables is:

[ n ]   [ 1/2  0  1/2 ] [ n ]
[ m ] = [  0   0  1/2 ] [ m ]
[ a ]   [ 1/2  1   0  ] [ a ]

For example, the first column of the matrix reflects the fact that Netscape divides its
importance between itself and Amazon. The second column indicates that Microsoft
gives all its importance to Amazon. The third column indicates that Amazon gives all its
importance to Netscape and Microsoft.

We can solve equations like this one by starting with the assumption n = m = a = 1,
and applying the matrix to the current estimate of these values repeatedly. The first four
iterations give the following estimates:

n = 1    1     5/4    9/8    5/4
m = 1    1/2   3/4    1/2    11/16
a = 1    3/2   1      11/8   17/16

In the limit, the solution is n = a = 6/5 ; m = 3/5. That is, Netscape and Amazon each
have the same importance, and twice the importance of Microsoft.

Note that we can never get absolute values of n, m, and a, just their ratios, since the
initial assumption that they were each 1 was arbitrary.
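
The relaxation process is easy to reproduce. The short Python sketch below (an
illustration, not part of the original example) applies the stochastic matrix repeatedly,
starting from n = m = a = 1, and approaches the limit n = a = 6/5, m = 3/5:

# Rows/columns are ordered Netscape, Microsoft, Amazon; each column shows how
# that page splits its one unit of importance among its successors.
M = [[0.5, 0.0, 0.5],
     [0.0, 0.0, 0.5],
     [0.5, 1.0, 0.0]]

v = [1.0, 1.0, 1.0]                 # initial importance n = m = a = 1
for _ in range(50):
    v = [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

print(v)                            # approaches [1.2, 0.6, 1.2], i.e. n = a = 6/5, m = 3/5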


Since the matrix is stochastic (the sum of each column is 1), the above relaxation process
converges to the principal eigenvector.

Following are problems that are faced on the Web:

a. Dead ends: a page that has no successors has nowhere to send its importance.
Eventually, all importance will “leak out of" the Web.

b. Spider traps: a group of one or more pages that have no links out of the group will
eventually accumulate all the importance of the Web.

Let us consider the following example.

Example 2: Suppose Microsoft tries to duck charges that it is a monopoly by removing
all links from its site. The new Web is as shown in the following figure.

[Figure: Microsoft now has no outgoing links; Netscape links to itself and Amazon;
Amazon links to Netscape and Microsoft]

The matrix describing transitions is:

[ n ]   [ 1/2  0  1/2 ] [ n ]
[ m ] = [  0   0  1/2 ] [ m ]
[ a ]   [ 1/2  0   0  ] [ a ]

Microsoft has become a dead end: all the entries in its column are zero.

The first four steps of the iterative solution are:

n = 1    1     3/4    5/8    1/2
m = 1    1/2   1/4    1/4    3/16
a = 1    1/2   1/2    3/8    5/16

Eventually, each of n, m, and a becomes 0; i.e., all the importance has leaked out.


Example 3: Angered by the decision, Microsoft decides it will link only to itself from
now on. Now Microsoft has become a spider trap. The new Web is shown in the following
figure.

[Figure: Microsoft links only to itself; Netscape links to itself and Amazon; Amazon
links to Netscape and Microsoft]

The matrix describing transitions is:

[ n ]   [ 1/2  0  1/2 ] [ n ]
[ m ] = [  0   1  1/2 ] [ m ]
[ a ]   [ 1/2  0   0  ] [ a ]

and the first four steps of the iterative solution are:

n = 1    1     3/4    5/8    1/2
m = 1    3/2   7/4    2      35/16
a = 1    1/2   1/2    3/8    5/16

Now, m converges to 3, and n = a = 0.

Following is the Google Solution to Dead Ends and Spider Traps

Instead of applying the matrix directly, one can "tax" each page a fraction of its current
importance and distribute the taxed importance equally among all pages. Consider the
following example.

Example 4: If we use a 20% tax, the equation of Example 3 becomes:

[ n ]         [ 1/2  0  1/2 ] [ n ]   [ 0.2 ]
[ m ] = 0.8 * [  0   1  1/2 ] [ m ] + [ 0.2 ]
[ a ]         [ 1/2  0   0  ] [ a ]   [ 0.2 ]


The solution to this equation is n = 7/11, m = 21/11, a = 5/11.

Note that Microsoft no longer absorbs all of the importance, so the distribution of
importance is much more reasonable than in Example 3.

The use of PageRank to measure importance, rather than the more naive "number of
links into the page", also protects against spammers. The naive measure can be fooled
by a spammer who creates 1000 pages that mutually link to one another, while
PageRank recognizes that none of the pages have any real importance.

4. Web Usage Mining

Web usage mining is the Web mining activity that involves the automatic discovery of
user access patterns from one or more Web servers. As more and more business is
conducted on the Internet via the World Wide Web, this activity has become very
important: the traditional strategies and techniques for market analysis must be
adapted, and corporations generate and collect large volumes of usage data in their
daily operations.

Web usage mining performs mining on Web usage data, or web logs. A Web log is a
listing of page reference data; it is sometimes termed clickstream data because each
entry corresponds to a mouse click. These logs can be examined from either a client
perspective or a server perspective. When evaluated from a server perspective, mining
uncovers information about the sites where the service resides, and can be used to
improve the design of the sites. By evaluating a client's sequence of clicks, information
about the user (or group of users) is determined, which could be used to perform
prefetching and caching of pages.

For example, the Webmaster of XYZ company found that a high percentage of users
have the following pattern of reference to pages: {A, B, A, D}. This means that a user
accesses page A then page B, then back to page A and finally to page D. Based on this
observation, he determines that a link is needed directly to page D from B. He then adds
this link.

Web usage mining consists of three activity steps given below:

1. Preprocessing activities center around reformatting the Web log data before
processing.

2. Pattern discovery activities form the major portion of the mining activities because
these activities look to find hidden pattern within log data.

3. Pattern analysis is the process of looking at and interpreting the results of discovery
activities.

We will learn these activities in the following articles. It should be noted that the web
application is quite different from other traditional data mining applications, such as
the "goods basket" (market basket) model. We can look at this difference from two
aspects:


1. Weak relations between user and site:

Visitors can access the web site at any time, from any place, and even without any clear
idea about what they want from the Web. On the other hand, it is not easy for the site
to distinguish different users. The WWW brings great freedom and convenience for users
and sites, and great variety among them as well, so the relation between supply and
demand becomes weak and vague.

2. Complicated behaviours:

Hyperlinks and backtracking are two important characteristics of the web environment
which make users' activities more complicated. Different users can access the same
content with different patterns. Also, a user's behaviour is recorded as a visiting
sequence in the web logs, which cannot exactly reflect the user's real behaviour or the
web site structure.

5. The WEBMINER system

This system divides the Web Usage Mining process into three main parts, as shown in
the following figure.

Input data consists of the three server logs - access, referrer, and agent - the HTML files
that make up the site, and any optional data such as registration data or remote agent
logs. The first part of Web Usage Mining, called preprocessing, includes the domain
dependent tasks of data cleaning, user identification, session identification, and path
completion.

Data cleaning is the task of removing log entries that are not needed for the mining
process.

User identification is the process of associating page references, even those with the
same IP address, with different users. The site topology is required in addition to the
server logs in order to perform user identification.

Session identification takes all of the page references for a given user in a log and
breaks them up into user sessions. As with user identification, the site topology is
needed in addition to the server logs for this task.

Path completion fills in page references that are missing due to browser and proxy
server caching. This step differs from the others in that information is being added to
the log.

As shown in the figure, mining for association rules requires the added step of
transaction identification, in addition to the other preprocessing tasks. Transaction
identification is the task of identifying semantically meaningful groupings of page
references. In a domain such as market basket analysis, a transaction has a natural
definition - all of the items purchased by a customer at one time. However, the only
“natural” transaction definition in the Web domain is a user session, which is often too


coarse grained for mining tasks such as the discovery of association rules. Therefore,
specialized algorithms are needed to redefine single user sessions into smaller
transactions.

The knowledge discovery phase uses existing data mining techniques to generate
rules and patterns. Included in this phase is the generation of general usage statistics,
such as number of “hits” per page, page most frequently accessed, most common
starting page and average time spent on each page. Association rule and sequential
pattern generation are the only data mining algorithms currently implemented in the
WEBMINER system, but the open architecture can easily accommodate any data mining

or path analysis algorithm. The discovered information is then fed into various pattern
analysis tools. The site filter is used to identify interesting rules and patterns by
comparing the discovered knowledge with the Web site designer’s view of how the site
should be used, as discussed in the next section. As shown in the figure, the site filter
can be applied either to the data mining algorithms, in order to reduce the computation
time, or to the discovered rules and patterns.

5.1 Browsing Behavior Models

In some respects, Web Usage Mining is the process of reconciling the Web site
developer’s view of how the site should be used with the way users are actually
browsing through the site.

5.2 Developer’s Model

The Web site developer’s view of how the site should be used is inherent in the
structure of the site. Each link between pages exists because the developer believes that
the pages are related in some way. Also, the content of the pages themselves provides


information about how the developer expects the site to be used. Hence, an integral
step of the preprocessing phase is the classifying of the site pages and extracting the
site topology from the HTML files that make up the web site. The topology of a Web site
can be easily obtained by means of a site “crawler”, that parses the HTML files to create
a list of all of the hypertext links on a given page, and then follows each link until all of
the site pages are mapped.

The WEBMINER system recognizes five main types of pages:

Head Page - a page whose purpose is to be the first page that users visit, i.e. “home”
pages.

Content Page - a page that contains a portion of the information content that the Web
site is providing.

Navigation Page - a page whose purpose is to provide links to guide users on to
content pages.

Look-up Page - a page used to provide a definition or acronym expansion.

Personal Page - a page used to present information of a biographical or personal
nature for individuals associated with the organization running the Web site.

Each of these types of pages is expected to exhibit certain physical characteristics.

6. Preprocessing

The tasks of data preparation before processing in web usage mining include the
following:

A. Collection of usage data for web visitors:

Most usage data are recorded as some kind of web server log. Some services additionally
require user registration, or record usage data in other file formats.

Let us first define clicks and logs as follows:

P is a set of literals, called pages or clicks, and U is a set of users. A log is defined
as a set of triples {(u_i, p_i, t_i) : u_i ∈ U, p_i ∈ P}, where t_i is a time stamp.

Standard log data consist of a source site, a destination site and a time stamp. The
source and destination sites can be any URL or IP address. In the definition above, the
user ID identifies the source site and a page ID identifies the destination. Information
about the browser may also be included.

B. User identification:


When a site requires registration it is easy to identify different users, though it cannot
be avoided that some private registration information may be misused, for example by
hackers. For common web sites without registration, however, it is not easy to identify
different users, since a user can visit the web site freely. In this situation the user's IP
address, cookies and other limited client information, such as the agent and the version
of the OS and browser, can be used for user identification. In this step, the usage data
for different users are collected separately.

C. Session construction:

After user identification, different sessions for the same user should be reconstructed
from this user’s usage data collected in the second step. A session is a visit performed
by a user from the time (s)he enters the web site till (s)he leaves. Two time constraints
are needed for this reconstruction: one is that the duration of any session cannot
exceed a defined threshold; the other is that the time gap between any two
consecutively accessed pages cannot exceed another defined threshold.

In web usage mining, the time set, the user set and the web page set are the three key
entities, denoted T, U and P. A session is a visit performed by a user from the time
(s)he enters the web site to the time (s)he leaves.

A session is a page sequence ordered by timestamp in the usage data record, defined as
S = <p_1, p_2, ..., p_m> (p_i ∈ P, 1 ≤ i ≤ m). These pages can also form a page set
S' = {p'_1, p'_2, ..., p'_k} (p'_i ∈ S, p'_i ≠ p'_j for i ≠ j, 1 ≤ i, j ≤ k).

A session is alternatively defined as follows:

Let L be a log. A session S is an ordered list of pages accessed by a single user, i.e.
S = <(p_1, t_1), ..., (p_m, t_m)> with p_i ∈ P, where there is a user u ∈ U such that
{(u, p_i, t_i) : 1 ≤ i ≤ m} ⊆ L.
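
The two time constraints are easy to apply in code. The following Python sketch
(illustrative only; the threshold values are hypothetical) splits one user's time-ordered
clicks into sessions:

def split_sessions(clicks, max_gap=1800, max_duration=28800):
    # clicks: list of (page, timestamp-in-seconds) pairs for one user, sorted by time.
    # A new session starts when the gap between consecutive clicks exceeds max_gap,
    # or when the running session would exceed max_duration.
    sessions, current = [], []
    for page, t in clicks:
        if current and (t - current[-1][1] > max_gap or t - current[0][1] > max_duration):
            sessions.append(current)
            current = []
        current.append((page, t))
    if current:
        sessions.append(current)
    return sessions

# Example: a long pause before the third click starts a second session.
print(split_sessions([("A", 0), ("B", 600), ("C", 4000)]))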

D. Behavior recovery:

Reconstructed sessions are not enough to depict the variety of user navigation
behaviours in web usage mining. In most cases, usage behaviour is recorded only as a
URL sequence within sessions. Revisiting and backtracking add to the complexity of user
navigation, so the task of recovery aims to rebuild the real user behaviour from the
linear URLs in sessions.

User behaviour is recovered from the sessions of a user and is defined as b = (S', R),
where R is the set of relations among the pages in S'. All user behaviours b form a
behaviour set named B.

Consider the following session:

S = <0, 292, 300, 304, 350, 326, 512, 510, 513, 512, 515, 513, 292, 319, 350, 517,
286 >

In this session, the pages are labelled with IDs. The user made 17 page requests. Pages
0 and 286 were the entrance and exit pages. Besides the entrance and exit pages, there
are two groups of pages: one group (300, 304, 326, 510, 513, 319, 517, 515) was
accessed only once, and the other group (292, 350, 512) was accessed more than once.
We will now explain several strategies for recovering different user behaviours.

Following is the simple Behaviours Recovery strategy:

This strategy is the simplest one and ignores all repeated pages in a session. It includes
two variants. In the first, user behaviour is represented only by the set of unique
accessed pages, so the simple user behaviour recovered from this session is:

S’ = {0, 292, 300, 304, 350, 326, 512, 510, 513, 515, 319, 517, 286}.

In the second, user behaviour is represented by the unique accessed pages together
with the access sequence among them. For pages accessed more than once, only the
first occurrence is kept. Based on this, the user behaviour for this session is recovered
as:

<0 – 292 – 300 – 304 – 350 – 326 – 512 – 510 – 513 – 515 – 319 – 517 - 286>

From the user behaviours recovered by the first method, association rules and frequent
itemsets can be mined in a further step. Sequential patterns can be mined from the user
behaviours recovered by the second method.
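
Both variants are easy to express in code. A minimal Python sketch (illustrative only),
applied to the session above:

session = [0, 292, 300, 304, 350, 326, 512, 510, 513, 512, 515, 513, 292, 319, 350, 517, 286]

# Variant 1: user behaviour as the set of unique accessed pages.
unique_pages = set(session)

# Variant 2: unique pages in order of first occurrence (dict preserves insertion order).
first_occurrence_sequence = list(dict.fromkeys(session))

print(first_occurrence_sequence)
# [0, 292, 300, 304, 350, 326, 512, 510, 513, 515, 319, 517, 286]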

7. Data Structures

The simple recovery strategies listed above are important in data mining. In web usage
mining, however, revisiting and backtracking are two important characteristics of user
behaviour, and they show up as pages accessed more than once during a session. Pages
accessed more than once lead to different access directions, which form tree-structured
behaviours. Tree-structured behaviours not only depict the visiting patterns, but also
reveal some of the conceptual hierarchy of the site semantics.

In the tree-structured behaviour, each distinct page appears only once. To recover an
access tree t from a session s, we use a page set P to store the unique pages that
already exist in t, and a pointer pr to the last recovered node in t. The recovery
strategy is:

1. Set t = NULL.
2. Read the first (entrance) page in s as the tree root r, let pr point to r, and insert
this page into P.
3. Read the next page from s and check whether the same page already exists in P.
   i. If it exists in P:
      a. Find the existing node n in t and set pr to point to this node.
      b. Go to step 3.
   ii. If it does not exist in P:
      a. Insert the new page into P.
      b. Create a new node and insert it as a new child of pr.
      c. Let pr point to this new node.
      d. Go to step 3.

The tree-structured behaviour for the above session can be recovered with this strategy;
a sketch of the recovery is given below.
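
The following Python sketch (an illustration, not the authors' implementation) applies
the recovery strategy to the example session and prints the resulting tree by
indentation:

class Node:
    def __init__(self, page):
        self.page = page
        self.children = []

def recover_tree(session):
    root = Node(session[0])                 # step 2: the entrance page becomes the root
    P = {session[0]: root}                  # the page set P, mapping pages to their nodes
    pr = root                               # pointer to the last recovered node
    for page in session[1:]:                # step 3: read the remaining pages
        if page in P:
            pr = P[page]                    # page already in P: move pr to the existing node
        else:
            node = Node(page)               # new page: insert it as a new child of pr
            pr.children.append(node)
            P[page] = node
            pr = node
    return root

def show(node, depth=0):
    print("  " * depth + str(node.page))
    for child in node.children:
        show(child, depth + 1)

session = [0, 292, 300, 304, 350, 326, 512, 510, 513, 512, 515, 513, 292, 319, 350, 517, 286]
show(recover_tree(session))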

Tree-structured behaviours help to mine access patterns with tree structure, and also
help to mine maximal forward sequential patterns or the deepest access paths.

Trees are also used to store strings for pattern matching applications. Each character in
the string is stored on an edge to a node, and common prefixes of strings are shared. A
problem in using such trees for many long strings is the space required. This is shown in
the following figure, which sketches a standard tree for the three strings SAD, DIAL and
DIALOG.

[Figure: a standard tree with one character per edge - the path S-A-D for SAD and the
path D-I-A-L-O-G for DIAL and DIALOG]

In this tree most nodes have degree one, which requires more space than necessary.
The tree can be compressed by collapsing such chains of edges, with an extra node
marking the termination of the string DIAL (which is a prefix of DIALOG). The
compressed tree can be drawn as follows:

[Figure: the compressed tree - edges labelled SAD and DIAL leave the root, an edge
labelled OG continues from the DIAL node, and an extra terminal node below DIAL
marks that DIAL is itself a complete string]
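
A minimal sketch of the uncompressed tree in Python (dictionary-based nodes with a
terminal marker; illustrative only):

def build_trie(strings):
    root = {}
    for s in strings:
        node = root
        for ch in s:                  # one edge per character; common prefixes are shared
            node = node.setdefault(ch, {})
        node["$"] = True              # terminal marker, e.g. distinguishes DIAL from DIALOG
    return root

def contains(trie, s):
    node = trie
    for ch in s:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

trie = build_trie(["SAD", "DIAL", "DIALOG"])
print(contains(trie, "DIAL"), contains(trie, "DIA"))   # True False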


8. Finding Unusual Itemsets

Consider the problem of finding sets of words that appear together "unusually often" on
the Web, e.g., {"New", "York"} or {"Dutchess", "of", "York"}.

“Unusually often" can be defined in various ways, in order to capture the idea that the
number of Web documents containing the set of words is much greater than what one
would expect if words were sprinkled at random, each word with its own probability of
occurrence in a document.

One appropriate measure is based on the entropy per word in the set. Formally, the
interest of a set of words S is

interest(S) = log( P(S) / ∏_{w ∈ S} P(w) ) / |S|

Note that we divide by the size of S because there are so many sets of a given size that
some, by chance alone, will appear to be correlated.

For example, if words a, b, and c each appear in 1% of all documents, and S = {a, b, c}
appears in 0.1% of documents, then the interest of S is

log2( 0.001 / (0.01 × 0.01 × 0.01) ) / 3 = log2(1000) / 3 ≈ 3.3

This shows that a set S can have a high value of interest while some, or even all, of its
immediate proper subsets are not interesting. In contrast, if S has high support, then all
of its subsets have support at least as high. Moreover, with more than 10^8 different
words appearing on the Web, it is not possible even to consider all pairs of words.
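
As a small illustration (not from the original text), the interest of a word set can be
computed directly from document frequencies; the numbers below mirror the example
above:

import math

def interest(set_prob, word_probs):
    # log( P(S) / product of P(w) for w in S ), divided by |S|
    independent = math.prod(word_probs)
    return math.log2(set_prob / independent) / len(word_probs)

# Words a, b, c each appear in 1% of documents; the set {a, b, c} appears in 0.1%.
print(interest(0.001, [0.01, 0.01, 0.01]))   # about 3.32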

8.1 The DICE Engine

DICE (dynamic itemset counting engine) repeatedly visits the pages of the Web, in a
round-robin fashion. At all times, it is counting occurrences of certain sets of words, and
of the individual words in that set. The number of sets being counted is small enough
that the counts fit in main memory.

From time to time, say every 5000 pages, DICE reconsiders the sets that it is counting.
It throws away those sets that have the lowest interest, and replaces them with other
sets.

The choice of new sets is based on the heavy edge property, which is an experimentally
justified observation that those words that appear in a high-interest set are more likely
than others to appear in other high-interest sets. Thus, when selecting new sets to start
counting, DICE is biased in favor of words that already appear in high-interest sets.
However, it does not rely on those words exclusively, or else it could never find
high-interest sets composed of the many words it has never looked at. Some (but not all)
of the constructions that DICE uses to create new sets are:

1. Two random words. This is the only rule that is independent of the heavy edge
assumption, and helps new words get into the pool.

2. A word in one of the interesting sets and one random word.

3. Two words from two different interesting pairs.

4. The union of two interesting sets whose intersection is of size 2 or more.

5. {a, b, c} if all of {a, b}, {a, c}, and {b, c} are found to be interesting.

Of course, there are generally too many options to do all of the above in all possible
ways, so a random selection among options, giving some choices to each of the rules, is
used.

8.2 Books and Authors

The general idea is to search the Web for facts of a given type, typically what might
form the tuples of a relation such as Books(title, author). The computation is suggested
by the following figure.

[Figure: the iterative computation - a sample of tuples seeds the current data; from the
current data, patterns are found; the current patterns are then used to find new data,
which is added back to the current data]

1. Start with a sample of the tuples one would like to find. In the experiment cited, five
examples of book titles and their authors were used.

2. Given a set of known examples, find where that data appears on the Web. If a
pattern is found that identifies several examples of known tuples, and is sufficiently
specific that it is unlikely to identify too much, then accept this pattern.


3. Given a set of accepted patterns, find the data that appears in these patterns and add
it to the set of known data.

4. Repeat steps (2) and (3) several times. In the example cited, four rounds were used,
leading to 15,000 tuples; about 95% were true title-author pairs.

But what exactly is a pattern? Let us answer that now.

8.3 What is a pattern?

The notion suggested consists of five elements:

1. The order: i.e., whether the title appears prior to the author in the text, or vice
versa. In the more general case, where tuples have more than two components, the
order would be the permutation of components.

2. The URL prefix.

3. The prefix: text just prior to the first of the title or author.

4. The middle: text appearing between the two data elements.

5. The suffix: text following the second of the two data elements. Both the prefix and
suffix were limited to 10 characters.

For example, a possible pattern might consist of the following:

1. Order: title then author.
2. URL prefix: www.University_Mumbai.edu/class/
3. Prefix, middle, and suffix of the following form:

<LI><I>title</I> by author<P>

Here the prefix is <LI><I>, the middle is </I> by (including the blank after “by"), and
the suffix is <P>. The title is whatever appears between the prefix and middle; the
author is whatever appears between the middle and suffix.
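
To make the five elements concrete, here is a small Python sketch (the field names are
my own) representing such a pattern, instantiated with the example above:

from dataclasses import dataclass

@dataclass
class Pattern:
    order: str        # e.g. "title-author": which data element comes first
    url_prefix: str   # only pages whose URL starts with this can match
    prefix: str       # text just before the first data element (at most 10 characters)
    middle: str       # text between the two data elements
    suffix: str       # text just after the second data element (at most 10 characters)

example = Pattern(order="title-author",
                  url_prefix="www.University_Mumbai.edu/class/",
                  prefix="<LI><I>",
                  middle="</I> by ",
                  suffix="<P>")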

To focus on patterns that are likely to be accurate, one can use several constraints on
patterns, as follows:

o Let the specificity of a pattern be the product of the lengths of the prefix, middle,
suffix, and URL prefix. Roughly, the specificity measures how likely we are to find
the pattern; the higher the specificity, the fewer occurrences we expect.

o Then a pattern must meet two conditions to be accepted:

1. There must be at least 2 known data items that appear in this pattern.


2. The product of the specificity of the pattern and the number of occurrences of
data items in the pattern must exceed a certain threshold T (not specified).

8.4 Data Occurrences

An occurrence of a tuple is associated with a pattern in which it occurs; i.e., the same
title and author might appear in several different patterns. Thus, a data occurrence
consists of:

1. The particular title and author.

2. The complete URL, not just the prefix as for a pattern.

3. The order, prefix, middle, and suffix of the pattern in which the title and author
occurred.

8.5 Finding Data Occurrences Given Data

If we have some known title-author pairs, our first step in finding new patterns is to
search the Web to see where these titles and authors occur. We assume that there is an
index of the Web, so that given a word, we can find (pointers to) all the pages containing
that word. The method used is essentially a-priori:

1. Find (pointers to) all those pages containing any known author. Since author names
generally consist of 2 words, use the index for each first name and last name, and
check that the occurrences are consecutive in the document.

2. Find (pointers to) all those pages containing any known title. Start by finding pages
with each word of a title, and then checking that the words appear in order on the
page.

3. Intersect the sets of pages that have an author and a title on them. Only these pages
need to be searched to find the patterns in which a known title-author pair is found.
For the prefix and suffix, take the 10 surrounding characters, or fewer if there are not
as many as 10.

8.6 Building Patterns from Data Occurrences

1. Group the data occurrences according to their order and middle. For example, one
group in the "group-by" might correspond to the order "title-then-author" and the
middle "</I> by ".

2. For each group, find the longest common prefix, suffix, and URL prefix.

3. If the specificity test for this pattern is met, then accept the pattern.

4. If the specificity test is not met, then try to split the group into two by extending the
length of the URL prefix by one character, and repeat from step 2. If it is impossible
to split the group (because there is only one URL) then we fail to produce a pattern
from the group.


Consider an example where our group contains the three URLs:

www.University_Mumbai.edu/class/cs345/index.html
www.University_Mumbai.edu/class/cs145/intro.html
www.University_Mumbai.edu/class/cs140/readings.html

where cs345, cs145 and cs140 are the codes for three subjects, say advanced
databases, Java and UML.

The common prefix is www.University_Mumbai.edu/class/cs. If we have to split the
group, then the next character, 3 versus 1, breaks the group into two, with those data
occurrences on the first page (there could be many such occurrences) going into one
group, and those occurrences on the other two pages going into another.

8.7 Finding Occurrences Given Patterns

1. Find all URL's that match the URL prefix in at least one pattern.

2. For each of those pages, scan the text using a regular expression built from the
pattern's prefix, middle, and suffix.

3. Extract from each match the title and author, according to the order specified in the
pattern.
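
A minimal, self-contained Python sketch of these three steps (illustrative only; the page
contents and URL below are hypothetical):

import re

def find_occurrences(url_prefix, prefix, middle, suffix, order, pages):
    # pages maps URL -> page text; order is "title-author" or "author-title"
    regex = re.compile(re.escape(prefix) + "(.*?)" + re.escape(middle) +
                       "(.*?)" + re.escape(suffix))
    results = []
    for url, text in pages.items():
        if not url.startswith(url_prefix):              # step 1: URL must match the URL prefix
            continue
        for first, second in regex.findall(text):       # step 2: scan the text with the regex
            # step 3: assign title and author according to the order in the pattern
            title, author = (first, second) if order == "title-author" else (second, first)
            results.append((title, author))
    return results

pages = {"www.University_Mumbai.edu/class/cs345/index.html":
         "<LI><I>Some Title</I> by Some Author<P>"}
print(find_occurrences("www.University_Mumbai.edu/class/", "<LI><I>", "</I> by ", "<P>",
                       "title-author", pages))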

