Web Data Refining Using Feedback Mechanism and K-Mean Clustering

JOURNAL OF COMPUTING, VOLUME 3, ISSUE 5, May 2011, ISSN 2151-9617
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 81
Prof. D. Jatin Das, S. Arun Kumar, B. Ramakantha Reddy, S. Shiva Prakash
Abstract - Web sites are now developed by almost everyone, and users searching the web often cannot find the accurate data they require. Web mining is commonly done with page ranking algorithms, among many others. In this paper, users refine web pages by giving feedback or a rating, either manually or automatically. The k-means clustering algorithm is a basic and widely used clustering algorithm. We propose a genetic algorithm to improve cluster quality and obtain more accurate clusters. We also apply web logs to refine the results further. Web mining using feedback eliminates unwanted sites from the web and also helps developers improve the data their sites offer to users.
Keywords - Web Mining, Clustering, k-mean, Web-logs.
• Prof. D. Jatin Das, B.E., M.Sc. (Tech. CS), Department of Computer Science and Engineering, Sri Vidyanikethan Engineering College, Tirupati, Andhra Pradesh, India - 517102.
• S. Arun Kumar, M.Tech., Department of Computer Science and Engineering, Sri Vidyanikethan Engineering College, Tirupati, Andhra Pradesh, India.
• B. Ramakantha Reddy, M.Tech., Department of Computer Science and Engineering, Sri Vidyanikethan Engineering College, Tirupati, Andhra Pradesh, India.
• S. Shiva Prakash, M.Tech., Department of Computer Science and Engineering, Sri Vidyanikethan Engineering College, Tirupati, Andhra Pradesh, India.

1. INTRODUCTION

With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge. Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. This describes the automatic search of information resources available online, i.e., Web content mining, and the discovery of user access patterns from Web servers, i.e., Web usage mining.

There are roughly three knowledge discovery domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web document text mining, resource discovery based on concept indexing, and agent-based technology may also fall in this category. Web structure mining is the process of inferring knowledge from the organization of the World Wide Web and the links between references and referents in the Web. Finally, web usage mining, also known as Web Log Mining, is the process of extracting interesting patterns from web access logs [13, 14].

We can broadly categorize Web data clustering into (i) users' sessions-based and (ii) link-based. The former uses the Web log data and tries to group together a set of users' navigation sessions having similar characteristics. In this framework, Web log data provide information about the activities performed by a user from the moment the user enters a Web site to the moment the same user leaves it [8]. The records of users' actions within a Web site are stored in a log file. Each record in the log file contains the client's IP address, the date and time the request was received, the requested object, and some additional information such as the protocol of the request and the size of the object. Figure 2 presents a sample of a Web access log file from a Web server.
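The record layout described above (client address, request time, requested object, protocol, reply code and size) can be read with a short script. The following is a minimal sketch in Python, assuming each line has the layout `host [DD:HH:MM:SS] "METHOD path PROTOCOL" status bytes` seen in the sample of Figure 2; the names `LogRecord`, `LOG_PATTERN` and `parse_log_line` are illustrative and do not come from the paper.

```python
import re
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    """One parsed request from a web server access log."""
    host: str
    timestamp: str
    method: str
    url: str
    protocol: str
    status: int
    size: int


# Assumed layout: host [DD:HH:MM:SS] "METHOD path PROTOCOL" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+)\s+\[(?P<timestamp>[^\]]+)\]\s+'
    r'"(?P<method>[A-Z]+)\s+(?P<url>\S+)\s+(?P<protocol>[^"]+)"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\d+|-)$'
)


def parse_log_line(line: str) -> Optional[LogRecord]:
    """Return a LogRecord, or None when the line does not match the layout."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    raw_size = match.group("size")
    return LogRecord(
        host=match.group("host"),
        timestamp=match.group("timestamp"),
        method=match.group("method"),
        url=match.group("url"),
        protocol=match.group("protocol"),
        status=int(match.group("status")),
        # Some servers log "-" when no body was sent; treat that as 0 bytes.
        size=0 if raw_size == "-" else int(raw_size),
    )


if __name__ == "__main__":
    sample = '141.243.1.172 [29:23:53:25] "GET /Sofware.html HTTP/1.0" 200 1497'
    print(parse_log_line(sample))
```

Lines that do not fit the assumed layout are returned as `None` rather than raising, so a caller can count and skip malformed records before clustering.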
Fig 1: Web Mining Architecture

141.243.1.172 [29:23:53:25] "GET /Sofware.html HTTP/1.0" 200 1497
query2.lycos.cs.cmu.edu [29:23:53:36] "GET /Consumer.html HTTP/1.0" 200 1325
tanuki.twics.com [29:23:53:53] "GET /News.html HTTP/1.0" 200 1014
wpbfl2-45.gate.net [29:23:54:15] "GET / HTTP/1.0" 200 4889

Figure 2: A sample of Web Server Log File

The standard k-means algorithm was used to cluster users' traversal paths in [9]. However, it is not clear how the similarity measure was devised and whether the clusters are meaningful. In [12], associations and sequential patterns between web transactions are discovered based on the Apriori algorithm [1]. A good survey of clustering algorithms can be found in [16]. The k-means algorithm [3] is one of the most widely used clustering algorithms. The algorithm partitions the data points (objects) into k groups (clusters) so as to minimize the sum of the squared distances between the data points and the center (mean) of their clusters.

To apply the k-means algorithm:

• Choose k data points to initialize the cluster centers.
• For each data point, find the nearest cluster center and assign the data point to the corresponding cluster.
• Update each cluster center to the mean of the data points assigned to that cluster.
• Repeat steps 2 and 3 until the values of the means no longer change.

In spite of its simplicity, the k-means algorithm involves a very large number of nearest neighbor queries. The high time complexity of the k-means algorithm makes it impractical when the data set contains a large number of points. Reducing the number of nearest neighbor queries can accelerate the algorithm. In addition, the number of distance calculations
increases exponentially with the increase of the dimensionality of the data [2,7,4].

Many algorithms have been proposed to accelerate k-means. In [6,5], the use of kd-trees [2] is suggested. However, backtracking is required, which increases the computational complexity [10], and kd-trees are not efficient for higher dimensions. Furthermore, it is not guaranteed that an exact match of the nearest neighbor can be found unless some extra search is done, as discussed in [4]. The triangle inequality has also been used to accelerate k-means. In [10], it is suggested to use R-Trees. Nevertheless, R-Trees may not be appropriate for higher dimensional problems. In [8,9,11], the Partial Distance (PD) algorithm has been proposed. The algorithm allows early termination of the distance calculation by introducing a premature exit condition in the search process.

As seen in the literature, researchers have contributed only to accelerating the algorithm; there is no contribution to cluster refinement. In this study, we propose a new algorithm to improve k-means clustering in web usage data mining. The proposed algorithm consists of two steps. In the first step, to avoid local minima, we present a simple and efficient method to select the initial centroids based on the mode values of the data vector, and the k-means algorithm is then applied to cluster the data vectors [12]. In the second step, a Genetic Algorithm (GA) is applied to refine the clusters and improve the quality of the clusters of users' sessions.

The paper is organized as follows. The following section defines the web access logs. Section 3 presents the standard k-means algorithm. Section 4 presents the proposed cluster refinement algorithm, which uses a Genetic Algorithm (GA) to improve the users' session clusters. Section 5 presents the experiments and results, and the work is concluded in Section 6.

2. Web Access Logs:
2.1 Basic access logs

In general, web server logs consist of these fields: (i) User's IP address, (ii) Access time, (iii) Request method ("GET", "POST", etc.), (iv) URL of the page accessed, (v) Protocol (typically HTTP/1.0), (vi) Number of bytes. These fields can be filled in automatically by system programs.

2.2 Modified access logs

The modified web server logs consist of these fields: (i) User's IP address, (ii) Access time, (iii) Request method ("GET", "POST", etc.), (iv) URL of the page accessed, (v) Protocol (typically HTTP/1.0), (vi) Number of bytes, (vii) Rating or feedback.

The last field holds the rating given to the site, indicating whether or not the site was useful for the user's requirements. This helps in refining web data.

Rating sites typically show a series of images (or other content) in random fashion, or chosen by a computer algorithm, rather than allowing users to choose. They then ask users for a rating or assessment, which is generally given quickly and without great deliberation. Users score items on a scale, for example 1 to 10, or with a yes/no answer. Others, such as BabeVsBabe.com, use other formats. Typically, the site gives instant feedback in terms of the item's running score, or the percentage of other users who agree with the assessment. Sites sometimes offer aggregate statistics or "best" and "worst" lists. Most allow users to submit their own image, sample, or other relevant content for others to rate. Some require the submission as a condition of membership.

3. Standard K-means Algorithm

One of the most popular clustering techniques is the k-means clustering algorithm. Starting from a random partitioning, the algorithm repeatedly (i) computes the current cluster centers (i.e., the average vector of each cluster in data space) and (ii) reassigns each data item to the cluster whose center is closest to it. It terminates when no more reassignments take place. By this means, the intra-cluster variance, that is, the sum of squares of the differences between data items and their associated cluster centers, is locally minimized. The strength of k-means is its runtime, which is linear in the number of data elements, and
its ease of implementation. However, the algorithm tends to get stuck in suboptimal solutions (dependent on the initial partitioning and the data ordering) and it works well only for spherically shaped clusters. It requires the number of clusters to be provided or to be determined (semi-)automatically. In our experiments, we run k-means using the correct cluster number.

1. Choose a number of clusters k.
2. Initialize cluster centers n1, …, nk.
   a. Could pick k data points and set the cluster centers to these points.
   b. Or could randomly assign points to clusters and take the means of the clusters.
3. For each data point, compute the cluster center it is closest to (using some distance measure) and assign the data point to this cluster.
4. Re-compute the cluster centers (mean of the data points in each cluster).
5. Repeat steps 3 and 4; stop when there are no new re-assignments.

4. Genetic Algorithm

The initial cluster centers are normally chosen either sequentially or randomly, as given in the standard algorithm. The quality of the final clusters depends on these initial seeds, and a poor choice may lead to a local minimum; this is one of the disadvantages of k-means clustering. To avoid this, our method selects the modes of the data vector as the initial cluster centers. Based on the number of clusters, the modes are selected one after another. The first mode value is selected as the center of the first cluster, and the next most frequently occurring value (the next mode value) is assigned as the center of the next cluster.

A genetic algorithm (GA) is a randomized search and optimization technique guided by the principles of evolution and natural genetics, with a large amount of implicit parallelism. GAs perform search in complex, large and multimodal landscapes, and provide near-optimal solutions for the objective or fitness function of an optimization problem. In this algorithm, the search space is encoded in the form of strings (called chromosomes). The basic reason for our refinement is that the clusters obtained from any clustering algorithm will never give 100% quality. There will be some errors, known as misclustering; that is, a data item can be wrongly clustered. These kinds of errors can be reduced by using our refinement algorithm. GAs have applications in fields as diverse as VLSI design, image processing, neural networks, machine learning, job shop scheduling, etc.

The clusters obtained from the improved k-means clustering are the input to our refinement algorithm. Initially a random point is selected from each cluster, and a chromosome is built from these points. In this way an initial population of 10 chromosomes is built. For each chromosome the entropy is calculated as the fitness value, and the global minimum is extracted. With this initial population, the genetic operators of reproduction, crossover and mutation are applied to produce a new population. When the crossover operator is applied, the cluster points get shuffled, which means that a point can move from one cluster to another. From this new population, the local minimum fitness value is calculated and compared with the global minimum. If the local minimum is less than the global minimum, then the global minimum is assigned the local minimum, and the next iteration continues with the new population. Otherwise, the next iteration continues with the same old population. This process is repeated for N iterations.

The following section shows that our refinement algorithm improves the cluster quality. The algorithm is given as:

1. Choose a number of clusters k.
2. Initialize cluster centers n1, …, nk based on the modes.
3. For each data point, compute the cluster center it is closest to (using some distance measure) and assign the data point to this cluster.
4. Re-compute the cluster centers (mean of the data points in each cluster).
5. Repeat steps 3 and 4; stop when there are no new re-assignments.
6. GA based refinement
   a. Construct the initial population (p1)
   b. Calculate the global minimum (Gmin)
   c. For i = 1 to N do
      i. Perform reproduction
      ii. Apply the crossover operator between each pair of parents
      iii. Perform mutation and get the new population (p2)
      iv. Calculate the local minimum (Lmin)
      v. If Lmin < Gmin then
         a. Gmin = Lmin;
         b. p1 = p2;
   d. Repeat

5. Experiments

We have generated clusters using both algorithms for several different logs obtained from the Internet Traffic Archive (http://ita.ee.lbl.gov/). The following six web access log data sets, collected from various web servers, were used to test our proposed method.

• EPA-HTTP - a day of HTTP logs from a busy WWW server.
• SDSC-HTTP - a day of HTTP logs from a busy WWW server.
• Calgary-HTTP - a year of HTTP logs from a CS departmental WWW server.
• ClarkNet-HTTP - two weeks of HTTP logs from a busy Internet service provider WWW server.
• NASA-HTTP - two months of HTTP logs from a busy WWW server.
• Saskatchewan-HTTP - seven months of HTTP logs from a University WWW server.

Table 1 gives a brief description of web access log sets from the archive.

Server         Location          No. of Requests   Time From                 Time To
Saskatchewan   Canada            2,408,625         00:00:00 June 1, 1995     23:59:59 December 31, 1995
NASA           Florida           3,461,612         00:00:00 July 1, 1995     23:59:59 August 31, 1995
Calgary        Alberta, Canada   726,739           October 24, 1994          October 11, 1995

Table 1: Internet Traffic Archive (Web Usage Data)

All the above logs have timestamps with 1 second resolution. The logs fully preserve the originating host and HTTP request, and these traces can be freely distributed. Each log is an ASCII file with one line per request, with the following columns:
1. Host making the request: a hostname or the Internet address.
2. Timestamp in the format "DAY MON DD HH:MM:SS YYYY".
3. Request, given in quotes.
4. HTTP reply code.
5. Bytes in the reply.

Since various clustering algorithms result in different clusters, it is important to evaluate the results to assess their quality. In clustering, the procedure of evaluating the results is known as cluster validation and can be based on various measures called validity measures. The validity measures are divided into two categories depending on whether they have any reference to external knowledge. By external knowledge we refer to a pre-specified structure which reflects our intuition about the clustering structure of a data set. The measures that have no reference to external knowledge are called internal quality measures, and they are estimated in terms of quantities that involve the data set itself. Dunn's index [28] and the DB index [29] are two internal quality measures that have a close relationship in that they both try to minimize the within-cluster scatter while maximizing the between-cluster separation in order to find compact and well-separated clusters.
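The two internal measures named above can be computed directly from the points of each cluster. The sketch below is illustrative only and makes several assumptions: clusters are lists of numeric tuples, the distance is Euclidean, the diameter of a cluster is taken as its largest pairwise distance, and the distance between two clusters is taken as the smallest distance between their points (the paper's "average distance" could be substituted in `separation`). The function names are hypothetical.

```python
from itertools import combinations
from math import dist


def diameter(cluster):
    """Largest pairwise distance inside one cluster (0.0 for a single point)."""
    return max((dist(x, y) for x, y in combinations(cluster, 2)), default=0.0)


def separation(cluster_a, cluster_b):
    """Smallest distance between a point of cluster_a and a point of cluster_b."""
    return min(dist(x, y) for x in cluster_a for y in cluster_b)


def dunn_index(clusters):
    """Smallest inter-cluster separation divided by the largest diameter.

    Needs at least two clusters. Larger values suggest compact,
    well-separated clusters.
    """
    largest_diameter = max(diameter(c) for c in clusters)
    smallest_separation = min(
        separation(a, b) for a, b in combinations(clusters, 2)
    )
    if largest_diameter == 0.0:
        return float("inf")  # every cluster is a single point
    return smallest_separation / largest_diameter


def db_index(clusters):
    """Average, over clusters, of the worst (diam_i + diam_j) / d(C_i, C_j).

    Assumes no two clusters share a point, so every separation is positive.
    Smaller values suggest better clusterings.
    """
    diameters = [diameter(c) for c in clusters]
    total = 0.0
    for i, cluster_i in enumerate(clusters):
        total += max(
            (diameters[i] + diameters[j]) / separation(cluster_i, cluster_j)
            for j, cluster_j in enumerate(clusters)
            if j != i
        )
    return total / len(clusters)


if __name__ == "__main__":
    example = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 0.0), (5.0, 1.0)]]
    print("Dunn:", dunn_index(example))
    print("DB:", db_index(example))
```

Both functions compare all pairs of points, so the cost grows quadratically with the number of points; for logs of the size listed in Table 1 a sampled subset would be needed in practice.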
S.No   IP address        Access time             Request method   URL                            Protocol   No. of bytes   Rating
1      115.242.159.123   Apr 08, 2002 08:46 PM   GET              http://www.yaledailynews.com   HTTP       800K           3
2      125.242.149.122   Apr 08, 2002 08:43 PM   POST             http://www.waterski.com        HTTP       750K           1
3      234.222.111.152   Apr 08, 2002 08:40 PM   GET              http://www.sony.com            HTTP       925K           5

Table 2: Web log
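Records such as the rows of Table 2 can be grouped by URL so that low-rated sites are kept apart from the rest. The following is a minimal sketch of that idea, not the authors' implementation: the record layout (a dictionary with `url` and `rating` keys), the function names, and the threshold of 2.0 are all assumptions made for illustration.

```python
from collections import defaultdict


def average_ratings(records):
    """Map each URL to the mean of the ratings recorded for it."""
    totals = defaultdict(lambda: [0.0, 0])  # url -> [sum of ratings, count]
    for record in records:
        entry = totals[record["url"]]
        entry[0] += record["rating"]
        entry[1] += 1
    return {url: total / count for url, (total, count) in totals.items()}


def split_by_rating(records, threshold=2.0):
    """Return (kept, low_rated) URL lists; the threshold is an assumed cut-off."""
    averages = average_ratings(records)
    kept = sorted(url for url, avg in averages.items() if avg >= threshold)
    low_rated = sorted(url for url, avg in averages.items() if avg < threshold)
    return kept, low_rated


if __name__ == "__main__":
    # Ratings copied from Table 2; the other log fields are omitted here.
    web_log = [
        {"url": "http://www.yaledailynews.com", "rating": 3},
        {"url": "http://www.waterski.com", "rating": 1},
        {"url": "http://www.sony.com", "rating": 5},
    ]
    kept, low_rated = split_by_rating(web_log)
    print("kept:", kept)
    print("low rated:", low_rated)
```

Averaging per URL means a single poor rating does not remove a site that other users found useful; a different aggregate, such as the median, could be swapped in without changing the rest of the sketch.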

5.1 The Dunn Index

The index is defined by the following equation for a specific number of clusters $n_c$:

$$D_{n_c} = \min_{i=1,\dots,n_c} \left\{ \min_{j=i+1,\dots,n_c} \left( \frac{d(c_i, c_j)}{\max_{k=1,\dots,n_c} \operatorname{diam}(c_k)} \right) \right\}$$

where $d(c_i, c_j)$ is the dissimilarity function between two clusters $c_i$ and $c_j$, defined as

$$d(c_i, c_j) = \min_{x \in c_i,\; y \in c_j} d(x, y)$$

and $\operatorname{diam}(c)$ is the diameter of a cluster, which may be considered a measure of the dispersion of the cluster. The diameter of a cluster $C$ can be defined as follows:

$$\operatorname{diam}(C) = \max_{x, y \in C} d(x, y)$$

It is clear that if the data set contains compact and well-separated clusters, the distance between the clusters is expected to be large and the diameter of the clusters is expected to be small. Thus, based on the definition of Dunn's index, we may conclude that large values of the index indicate the presence of compact and well-separated clusters.

5.2. DB Index

Given that $K$ is the number of clusters, $C_i$ and $C_j$ are the closest clusters according to average distance $d$, and diam is the diameter of a cluster, the DB index is defined as follows:

$$DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \left\{ \frac{\operatorname{diam}(C_i) + \operatorname{diam}(C_j)}{d(C_i, C_j)} \right\}$$

It is clear from the above definition that DB is the average similarity between each cluster and its most similar one. It is desirable for the clusters to have the minimum possible similarity to each other; therefore we seek a clustering that minimizes DB.

Each access to a Web page is recorded in the access log of the Web server that hosts it. The entries of a Web log file consist of fields that follow a predefined format, such as the common log format. By adding the rating to the log file format, we can find out the worth of a site. Using this, the site developer can also put more effort into developing the site. When web mining is performed periodically on the web data, the low-rated sites are kept separately; this is also a type of page ranking algorithm.

6. Conclusion and Future work:

Web usage mining applies data mining techniques to discover usage patterns from Web data. In this paper we have proposed a new method for data logs that adds a rating field, which is helpful both for web mining and for users. In the first step, the initial cluster centers are selected using a statistical mode-based calculation to allow the iterative algorithm to converge to a "better" local minimum. In the second step, we have proposed a novel method to improve the cluster quality using a Genetic Algorithm (GA) based refinement algorithm. The proposal is to add the feedback field to the log format.

With this feedback we can separate out the unwanted sites, for which an effective algorithm can be developed, and also, based on how long a user can
search the data on a single site, an algorithm can automatically generate a rating for that site. Future work is to develop an efficient algorithm for this.

References
[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," Proc. of the 20th VLDB Conference, pp. 487-499, Santiago, Chile, 1994.
[2] I. V. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White, "Model-based clustering and visualization of navigation patterns on a Web site," Data Mining and Knowledge Discovery, 7(4):399-424, 2003.
[3] S. Chakrabarti, Mining the Web. Morgan Kaufmann, 2003.
[4] Z. Chen, A. Wai-Chee Fu, and F. Chi-Hung Tong, "Optimal algorithms for finding user access sessions from very large Web logs," World Wide Web: Internet and Information Systems, 6:259-279, 2003.
[5] D. Cheng, B. Gersho, Y. Ramamurthi, and Y. Shoham, "Fast Search Algorithms for Vector Quantization and Pattern Recognition," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1:1-9, 1984.
[6] N. Eiron and K. S. McCurley, "Untangling compound documents on the Web," in Proceedings of ACM Hypertext, pages 85-94, 2003.
[7] J.L.R. Filho, P.C. Treleaven, C. Alippi, "Genetic algorithm programming environments," IEEE Comput. 27:28-43, 1994.
[8] Y. Fu, K. Sandhu, and M-Y Shih, "Clustering of Web users based on access patterns," in Proceedings of WEBKDD, 1999.
[9] B. Hay, K. Vanhoof, and G. Wets, "Clustering navigation patterns on a Website using a sequence alignment method," in Proceedings of 17th International Joint Conference on Artificial Intelligence, Seattle, Washington, USA, August, 2001.
[10] T. Kanungo, D.M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A.Y. Wu, "An efficient k-means clustering algorithm: Analysis and implementation," IEEE Trans. Pattern Analysis and Machine Intelligence, 24(7):881-892, 2002.
[11] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer, New York, 1992.
[12] O. Nasraoui, H. Frigui, A. Joshi, and R. Krishnapuram, "Mining Web Access Logs Using Relational Competitive Fuzzy Clustering," to be presented at the Eighth International Fuzzy Systems Association World Congress - IFSA 99, Taipei, August 99.
[13] S. Oyanagi, K. Kubota, A. Nakase, "Application of matrix clustering to web log analysis and access prediction," in: WEBKDD 2001 - Mining Web Log Data Across All Customers Touch Points, Third International Workshop, 2001.
[14] C. Shahabi, A. M. Zarkesh, J. Adibi and V. Shah, "Knowledge discovery from user's web-page navigation," Proc. Seventh IEEE Intl. Workshop on Research Issues in Data Engineering (RIDE), 20-29, 1997. WEBKDD 2001 - Mining Web Log Data Across All Customers Touch Points, Third International Workshop, San Francisco, CA, USA, August 26, 2001. Revised papers, vol. 2356 of Lecture Notes in Comp Sc, Springer, 113-144, 2002.
[15] J. Srivastava, R. Cooley, M. Deshpande, and P. Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data," in SIGKDD Explorations, 1(2):1-12, 2000.
[16] R. Xu and D. Wunsch, "Survey of clustering algorithms," IEEE Trans. Neural Networks, 16(3):645-678, 2005.
