
International Journal of Advanced Engineering Research and Technology (IJAERT) 317

Volume 4 Issue 10, October 2016, ISSN No.: 2348-8190

Log Based Web Pages Recommendation using User Clustering


A.R.Sankaliya
Government Polytechnic for Girls, Surat

Abstract

This study aims to recommend web pages based on user browsing patterns through user clustering. A web server log holds a very large amount of data, including some noisy and inconsistent data. To remove such data, preprocessing and a dimensionality reduction technique are used. The reduced dataset is then passed to a clustering algorithm to generate user clusters. The cluster centroids are matched with a new user's log, and web pages are recommended to the new user to visit.

Keywords- Data preprocessing technique, K-means clustering, Recommendation rule generation, Evaluation measure.

I. INTRODUCTION

Web mining is an application of data mining which uses data mining techniques to extract useful information from web documents. Web mining is further divided into three types: Web Usage Mining, Web Content Mining and Web Structure Mining. Web usage mining is the process of mining useful information from server logs. When users use the internet and open different websites, their browsing behavior is automatically saved into a log file. Web usage mining deals with these log files to extract information about users' browsing behavior on the internet.

User future request prediction is a technique of web usage mining for predicting the next requests of users. The main uses of prediction are to increase the user's browsing speed, to decrease the user's latency as far as possible, and to reduce the load on the web server.

Improvements in data preprocessing, modeling, and mining techniques applied to web sources have already led to many effective applications, such as smarter web analytics tools and procedures for content management. As the interaction between users and web resources increases exponentially, the need for smart web usage analysis tools will continue to grow. Likewise, as the complexity of web applications and of users' interaction with these applications increases, the need for intelligent analysis of web usage data will also continue to grow.

The log file format of the server is as shown below:

125.125.125.125 - uche [20/Jul/2008:12:30:45 +0700] "GET /index.html HTTP/1.1" 200 2345

Every log entry conforming to the Common Log Format (CLF) contains these fields:

client IP address or hostname;
user id ('-' if anonymous);
access time;
HTTP request method (GET, POST, HEAD);
path of the resource on the web server (identifying the URL);
protocol used for the transmission (HTTP/1.0, HTTP/1.1);
status code returned by the server as response (200 for OK, 404 for not found, ...);
number of bytes transmitted.

Clustering analysis is a widely used data mining technique that partitions a set of data objects into a number of clusters, where each data object is highly similar to the other objects within the same cluster but quite dissimilar to objects in other clusters [2].

Web usage mining includes data from web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data resulting from interactions. Web usage mining focuses on techniques that can predict user behavior [3] while the user interacts with the web.

II. PREPROCESSING TECHNIQUE

The data preparation process is often the most time-consuming and computationally intensive step in the web usage mining process. Generally, data preprocessing consists of data cleaning, user identification, session identification and path completion [9]. In our proposed work we used data cleaning and user identification.

Table 1: Web log of imaginary website
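The sample log entry and the field list given in the introduction can be turned into structured records before any cleaning is done. The following is a minimal Python sketch of such a parser, not the implementation used in this work; the names `CLF_PATTERN` and `parse_clf_line` are hypothetical, and the regular expression assumes the fields appear exactly in the order of the sample entry.

```python
import re
from datetime import datetime

# Assumed layout of one log line, following the sample entry:
# host ident user [timestamp] "method path protocol" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the textual fields that later steps compare numerically.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


sample = ('125.125.125.125 - uche [20/Jul/2008:12:30:45 +0700] '
          '"GET /index.html HTTP/1.1" 200 2345')
entry = parse_clf_line(sample)
print(entry["host"], entry["user"], entry["method"], entry["path"],
      entry["status"], entry["size"])
```

Lines that do not match the pattern return `None`, so malformed entries can be dropped before the cleaning step.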

Data Cleaning

The purpose of data cleaning is to eliminate irrelevant items. Techniques of this kind are important for any type of web log analysis, not only data mining. Depending on the purpose of the mining application, irrelevant records in the web access log are eliminated during data cleaning. Since the target of web usage mining is to obtain the users' travel patterns, the following two kinds of records are unnecessary and should be removed.

The records of graphics, videos and format information. These records have filename suffixes such as GIF, JPEG, CSS, and so on, which can be found in the URI field of every record.

The records with a failed HTTP status code. By examining the status field of every record in the web access log, the records with status codes over 299 or under 200 are removed.

User Identification

In this step the unique users are distinguished. Different IP addresses distinguish different users; if the IP addresses are the same (as happens when access goes through a proxy server), different browsers and operating systems indicate different users. User identification can be done in various ways, such as using IP addresses, cookies, direct authentication and so on. However, it is difficult because of security and privacy, so the following heuristics are used to identify the user:

Each IP address represents one user;

For more logs, if the IP address is the same but the agent log shows a change in browser software or operating system, the IP address represents a different user.

Here, Table 1 shows the log file after the data cleaning process. From the given table, user identification can be done:

Fig. 1: Site Topology (legend: Navigation Page, Content Page; page levels: A / B C D / E F G H I J / K L M N O P)

The following heuristic is used: if the agent field differs for two web log entries, the requests are from two different users [10].

Using the above heuristic, the following users are obtained from the above log:

User 1: A B E K I O E L
User 2: A C G M H N

From the site structure, the users can be further differentiated as follows:

User 1: A B E K E L
User 2: A C G M H N
User 3: I O
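The two cleaning rules and the IP/agent heuristic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure actually used: the record layout `(ip, agent, path, status)`, the suffix list, and the function names `is_relevant` and `identify_users` are all hypothetical, and the agent field is assumed to come from an extended log format, since the plain CLF entry shown earlier does not carry it.

```python
# Suffixes treated as graphics or format information (assumed list).
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css")


def is_relevant(record):
    """Keep a record only if it is a page request with a 2xx status code."""
    ip, agent, path, status = record
    if path.lower().endswith(IRRELEVANT_SUFFIXES):
        return False
    return 200 <= status <= 299


def identify_users(records):
    """Group requested paths by (IP, agent): same IP with a different
    agent is treated as a different user."""
    users = {}
    for ip, agent, path, status in records:
        users.setdefault((ip, agent), []).append(path)
    return users


# Made-up records for illustration only.
records = [
    ("10.0.0.1", "Mozilla/5.0 (Windows)", "/A.html", 200),
    ("10.0.0.1", "Mozilla/5.0 (Windows)", "/logo.gif", 200),
    ("10.0.0.1", "Opera/9.8 (Linux)", "/A.html", 200),
    ("10.0.0.1", "Mozilla/5.0 (Windows)", "/B.html", 200),
    ("10.0.0.1", "Opera/9.8 (Linux)", "/missing.html", 404),
    ("10.0.0.2", "Mozilla/5.0 (Windows)", "/C.html", 200),
]

cleaned = [r for r in records if is_relevant(r)]
users = identify_users(cleaned)
for key, paths in users.items():
    print(key, paths)
```

The sketch does not cover the site-structure refinement (splitting one user into two when a page is not reachable from the previous one), which would additionally need the link topology of the site.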
III. PROBLEM STATEMENT

K-means is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (k), which are represented by their centroids. Choosing proper initial centroids is the key step of the basic K-means procedure. Here we used K-means clustering. In this algorithm the initial centroids are chosen randomly, and clusters are generated based on the similarity of log entries. For a new user, the log entry is matched with the existing cluster centroids to find the most similar cluster. Recommendations are generated from the matched cluster.

IV. PROPOSED FRAMEWORK

The figure below shows the proposed framework for Automatic Web Personalization (AWP).

Fig. 2: Architecture of Recommendation System

There are six main phases in the system.
In the 1st stage, the raw data are collected from the log files into a CSV file.
In the 2nd stage, data cleaning is performed on the log file; this is the first step of data preprocessing. The cleaned file contains only records with a proper status code, such as 200, which means the request succeeded. In this stage we removed image files such as .jpg, .gif, .png, etc.
The 3rd stage is user identification. It is performed on the cleaned data and is the second step of data preprocessing. From the cleaned CSV file we then generated a user vs. URL matrix using a pivot table.
In the 4th stage, K-means clustering is performed on the user vs. URL matrix.
In the 5th stage, the distances between the new user's log and the cluster centroids are computed. The data of the cluster with the minimum distance to the new user's log are collected into a CSV file together with the new user.
In the 6th stage, from the data collected in the CSV file, we generate the recommendation set of URLs for the new user.

PSEUDO CODE

Here we define the steps of our recommendation system:
1. The collected web data is pre-processed to put it into a format that is compatible with the analysis technique to be used in the next phase. The data is cleaned to remove inconsistencies, and irrelevant information is filtered out according to the goal of the analysis.
2. Identify distinct users.
3. Perform the K-means clustering technique and take the centroid values.
4. Match the new user's log with the generated K cluster centroids.
5. Generate the recommendation rule based on similarity.

The basic outline of the recommendation rule generation algorithm is as follows:
After clustering we have the cluster set C = {c1, c2, ..., ck}, where ci is a cluster centroid.
Input: user cluster set (C), active user (s)
Output: recommended set (RecSet)
RecSet = ∅;
Compare each cluster cj ∈ C with active user s.
Get the most similar cluster centroid
for each item set I of the cluster centroid
    if I does not appear in s then
        RecSet = RecSet ∪ {I}
    end if
end for

A recommendation set can also be generated from the time spent on a particular URL. The input file is filtered and cleaned before it is used for the algorithm.


The web site administrator gives the number of centroids when performing K-means clustering; this value is decided from how many users are identified in the user identification process.

V. EXPERIMENTAL SETUP AND ANALYSIS

Here we report results on two different datasets. Using two different distance functions, we applied the K-means clustering algorithm and generated a recommendation set of URLs for a new user based on similarity match.
We calculated the F-measure on the basis of the generated recommendation set and the similarity of the new user with the recommendation set. From these two we find precision and recall using the formulas below, and from the values of precision and recall we obtain the F-measure for both datasets.

$$\text{Precision} = \frac{\text{Documents correctly retrieved by the system (TP)}}{\text{All documents retrieved by the system (TP + FP)}}$$

$$\text{Recall} = \frac{\text{Documents correctly retrieved by the system (TP)}}{\text{All documents relevant for the human (TP + FN)}}$$

So for dataset 1, using the Manhattan distance function, 4 URLs are recommended for the new user and 2 URLs match with the generated recommendation set. The new user has only 2 URLs and both match with the recommendation set, so the recall value becomes 1. We got a precision value of 0.5 from the given formula.

Using the Euclidean distance function, 4 URLs are recommended for the new user and 2 URLs match with the generated recommendation set. The new user has 3 URLs but only 2 match with the recommendation set, so the recall value becomes 0.67. The precision value is 0.5.

From the formula of the F-measure,

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

we got the values given in Table 2 for both distance functions.

Table 2: F-Measure comparison of different distance functions for Dataset-1

Dataset-1, K-means clustering
            Manhattan distance   Euclidean distance
F-measure   0.6667               0.5726

Table 2 shows the values of the F-Measure for Manhattan distance and Euclidean distance in K-means clustering for Dataset-1. The comparison between both distance functions is represented by the graph.

Fig. 3: F-measure comparison on Dataset-1

Figure 3 shows the result of the F-Measure comparison between the two distance functions for Dataset-1. Our proposed K-means clustering process with Manhattan distance gives a better F-Measure value than Euclidean distance on Dataset-1.

Now for dataset 2, using the Manhattan distance function, 9 URLs are recommended for the new user and 6 URLs match with the generated recommendation set. The new user has 8 URLs but only 6 match with the recommendation set, so the recall value becomes 0.75. We got a precision value of 0.67.

Using the Euclidean distance function, 11 URLs are recommended for the new user and 7 URLs match with the generated recommendation set. The new user has 12 URLs but only 7 match with the recommendation set, so the recall value becomes 0.58. We got a precision value of 0.63.
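The F-measure values can be checked from the precision and recall figures quoted above. The short Python sketch below assumes the standard harmonic-mean form of the F-measure, F = 2PR / (P + R), and takes the precision and recall values exactly as reported in the text; the dictionary `reported` and the function name `f_measure` are introduced only for this illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed F-measure definition)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as quoted in the text for each dataset and distance.
reported = {
    ("Dataset-1", "Manhattan"): (0.5, 1.0),
    ("Dataset-1", "Euclidean"): (0.5, 0.67),
    ("Dataset-2", "Manhattan"): (0.67, 0.75),
    ("Dataset-2", "Euclidean"): (0.63, 0.58),
}

scores = {key: f_measure(p, r) for key, (p, r) in reported.items()}
for (dataset, dist), score in scores.items():
    print(dataset, dist, round(score, 4))
```

Under this assumption the computed values agree with the reported F-measures of 0.6667, 0.5726 and 0.7077; the Dataset-2 Euclidean case comes out at roughly 0.604, which agrees with the reported 0.6039 to three decimal places.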
We got the F-measure values given in Table 3 for both distance functions.

Table 3: F-Measure comparison of different distance functions for Dataset-2

Dataset-2, K-means clustering
            Manhattan distance   Euclidean distance
F-measure   0.7077               0.6039

Table 3 shows the values of the F-Measure for Manhattan distance and Euclidean distance in K-means clustering for Dataset-2. The comparison between both distance functions is represented by the graph.

Fig. 4: F-measure comparison on Dataset-2

Figure 4 shows the result of the F-Measure comparison between the two distance functions for Dataset-2. Our proposed K-means clustering process with Manhattan distance gives a better F-Measure value than Euclidean distance on Dataset-2.

The above results show that our proposed K-means clustering algorithm with Manhattan distance gives a better result for matching a new user with the generated recommendation set of URLs.

VI. CONCLUSION AND FUTURE WORK

Web personalization techniques for web usage mining have been discussed for many years as a way to make web sites adaptive. We proposed a method for adaptive web personalization which consists of four steps: preprocessing, dimensionality reduction, clustering and site recommendation. We proposed the algorithm for K-means clustering and site recommendation. In K-means clustering, we matched the new user's interest with the cluster centroid values and generated recommendations for the new user. Dimensionality reduction techniques are useful not only for lowering the size of the data but also for extracting the underlying semantics of the data. Through our proposed process we obtained good accuracy in the generation of similar user clusters and recommendation rules.

The proposed algorithm uses the K-means clustering algorithm. In future work, instead of giving a random number of clusters, we can find the maximum distance between users and assign clusters accordingly. We can also apply sequence patterns in the generation of recommended URLs, that is, recommend those URLs which are visited in a sequence.

REFERENCES

[1] Geeta R. Bharamagoudar, S. G. (Sep-Oct. 2012), "Literature survey on Web Mining" in IOSR Journal of Computer Engineering, ISSN: 2278-0661, ISBN: 2278-8727, Volume 5, 31-36.

[2] Noor Kamal Kaur, U. K. (April 2014), "K-Medoid Clustering Algorithm- A Review" in International Journal of Computer Application and Technology (IJCAT), ISSN: 2349-1841, Volume 1.

[3] Dilpreet Kaur, A. S. (August 2013), "User Future Request Prediction Using KFCM in Web Usage Mining" in International Journal of Advanced Research in Computer and Communication Engineering, Vol. 2, Issue 8.

[4] Surbhi Anand, R. A. (June 2012), "An Efficient Algorithm for Data Cleaning of Log File using File Extensions" in International Journal of Computer Applications (0975 888), Volume 48.

[5] Nirmala Huidrom, N. B. (January 2013), "Clustering Techniques for the Identification of Web User Session" in International Journal of Scientific and Research Publications, ISSN 2250-3153, Volume 3, Issue 1.


[6] K. A. Abdul Nazeer, M. P. (July 2009), "Improving the Accuracy and Efficiency of the k-means Clustering Algorithm", Proceedings of the World Congress on Engineering 2009, Vol. I, London, U.K.

[7] Fahim A. M., S. A. (2006), "An efficient enhanced k-means clustering algorithm", Journal of Zhejiang University SCIENCE A, ISSN 1009-3095.

[8] Neelam Sain, S. T. (2012), "A Survey of Web Usage Mining based on Fuzzy Clustering and HMM" in International Journal of Computer Science and Information Technologies, ISSN: 0957-9646, Vol. 3.

[9] R. M. Suresh, R. P. (August 2010), "An Overview of Data Preprocessing in Data and Web Usage Mining", UTC from IEEE Xplore.

[10] Zdravko Markov, Daniel T. Larose (2007), "Data Mining the Web", Wiley Publication.

[11] Taowei Wang, Y. R. (January 2009), "Research on Personalized Recommendation Based on Web Usage Mining Using Collaborative Filtering Technique", WSEAS Transactions on Information Science and Applications, Volume 6.

www.ijaert.org
