
International Journal of Computer Trends and Technology (IJCTT) volume 11 number 3 May 2014

ISSN: 2231-5381 http://www.ijcttjournal.org Page 128



Survey on Clustering Algorithm for Document
Clustering
Priyanka Khadse, Student, Computer Science and Engineering, W.C.E.M Nagpur, India
Harshal Chowhan, Asst. Professor, Computer Science and Engineering, W.C.E.M Nagpur, India

Abstract: In today's era, most documents are kept in electronic format to save space and allow easy access. The main challenge is therefore to retrieve documents from a large database. Document clustering is the process of partitioning collected data into subgroups of similar items. Its purpose is to provide techniques that support people in searching for and understanding information. The goal is to provide an approach for extracting unknown patterns from a larger database. Clustering algorithms are the basic tool for document clustering; with them, the database is summarised and presented in a distinct manner.

Keywords: Document clustering, information extraction, clustering algorithm.
I. INTRODUCTION
Document clustering is the process of quickly extracting the important information from a set of files. The goal of this survey is to present and review different clustering techniques. Clustering is the task of grouping a set of objects by similarity and is a form of unsupervised learning. Like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. Document clustering is the process of organising documents into meaningful clusters. In other words, the documents within one cluster share more information with each other than with the documents of another cluster.
Clustering is the most common form of unsupervised learning, and this is the major difference between clustering and classification. Unsupervised means that grouping the objects does not depend on predefined classes or on training data. Many applications, such as segmentation, image processing, pattern recognition, marketing and economics, use clustering, which makes it an important area of research. Document clustering is done using different techniques and models, such as Kohonen's Self Organizing Maps (SOM) [4] and the k-means algorithm [1]. Beebe and Dietrich in [5] proposed a new process model for text string searches that advocated the use of machine learning techniques, clustering being one of them. Clustering algorithms are typically used for exploratory data analysis, where there is little or no prior knowledge about the data [2], [3]. Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. Initially, document clustering was investigated for improving the precision or recall of information retrieval systems [Rij79, Kow97] and as an efficient way of finding the nearest neighbours of a document [BL85]. More recently, clustering has been proposed for browsing a collection of documents [CKPT92] and for organizing the results returned by a search engine in response to a user's query [ZEMK97]. Document clustering has also been used to automatically generate hierarchical clusters of documents [KS97].
There are many clustering algorithms in data mining, where clustering is one of the primary tools for analysis. Clustering algorithms are divided into two families: hierarchical algorithms and partition algorithms. A hierarchical algorithm divides the dataset step by step, for example into two subsets at a time, whereas a partition algorithm divides the dataset into a number of subsets in one step. Hierarchical algorithms fall into two categories: agglomerative and divisive clustering. Agglomerative clustering starts with one-point clusters and repeatedly merges the two or more most appropriate clusters. Divisive clustering starts with one cluster containing all data points and repeatedly splits it into the most appropriate clusters. The process continues until a stopping criterion is met. Hierarchical clustering builds a tree known as a dendrogram, as shown in Fig (a).


Fig (a): example of dendrogram
II. RELATED WORKS
This section reviews the clustering techniques used in this study.
A clustering algorithm used for document clustering has the stages shown in Fig (b). In this study, the hierarchical algorithm gives better results than K-means but is slower. Fuzzy C-Means is based on fuzzy logic and takes less time to execute than the hierarchical algorithm.
Collection of data covers all the documents that are to be clustered, in any file type. The data is passed on to the next step, where it is indexed for storage and filtered to remove extra words.





Fig (b): Stages of clustering: Collection of data, Pre-processing of data, Document clustering, Similarity calculation, Iteration & output

Pre-processing takes plain text as input, filters it, and produces the cleaned text as output. In this step, all stopwords are removed and synonymous words are reduced to a single term.
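The pre-processing stage can be illustrated with a minimal sketch. The stopword set, the synonym map and the function name `preprocess` are hypothetical choices made for this example only; the paper does not specify which lists or tools are used.

```python
import re

# Hypothetical stopword list for illustration; real systems use a much larger one.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Hypothetical synonym map: each variant is replaced by one canonical term.
SYNONYMS = {"automobile": "car", "cars": "car", "documents": "document"}


def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and map synonyms to one term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    return [SYNONYMS.get(t, t) for t in kept]


tokens = preprocess("The automobile is in the documents of the cars")
print(tokens)  # ['car', 'document', 'car']
```

The filtered token list is what the later stages would turn into a document representation before any distances are computed.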
Document clustering is the main part of this paper and is discussed in detail below.
Similarity calculation has two parts: inter-cluster distance and intra-cluster distance. These distances depend on the similarity between and within clusters.
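As a rough illustration of the two quantities, the sketch below measures intra-cluster distance as the mean distance of a cluster's points to their centroid, and inter-cluster distance as the distance between two centroids. These particular definitions, the function names and the toy vectors are assumptions for the example; the paper does not state which formulas it uses.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def euclidean(a, b):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def intra_cluster_distance(points):
    """Mean distance from each point to its own cluster centroid."""
    c = centroid(points)
    return sum(euclidean(p, c) for p in points) / len(points)


def inter_cluster_distance(points_a, points_b):
    """Distance between the centroids of two clusters."""
    return euclidean(centroid(points_a), centroid(points_b))


cluster_a = [[0.0, 0.0], [0.0, 2.0]]
cluster_b = [[4.0, 1.0], [6.0, 1.0]]
print(intra_cluster_distance(cluster_a))             # 1.0
print(inter_cluster_distance(cluster_a, cluster_b))  # 5.0
```

A good clustering generally has small intra-cluster distances and large inter-cluster distances.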
Iteration & output covers the execution time and the number of iterations needed to form the clusters.
A. Hierarchical Algorithm
Hierarchical techniques produce a nested sequence of partitions, with a single, all-inclusive cluster at the top and singleton clusters of individual points at the bottom. Each intermediate level can be viewed as combining two clusters from the next lower level. The result of a hierarchical clustering algorithm can be displayed graphically as a tree, called a dendrogram. This tree shows the merging process and the intermediate clusters. For document clustering, the dendrogram provides a hierarchical index. There are two basic approaches to generating a hierarchical clustering:
Agglomerative: Start with the points as individual clusters and, at each step, merge the most similar or closest pair of clusters. This requires a definition of cluster similarity or distance.
Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain. In this case, we need to decide, at each step, which cluster to split and how to perform the split.
Agglomerative Algorithm:
1. Compute the similarity between all pairs of clusters, i.e.,
calculate a similarity matrix whose ijth entry gives the
similarity between the ith and jth clusters.
2. Merge the most similar (closest) two clusters.
3. Update the similarity matrix to reflect the pairwise similarity between the new cluster and the original clusters.
4. Repeat steps 2 and 3 until only a single cluster remains.
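The four steps can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation evaluated in the paper: it works on distances rather than similarities (the closest pair is the most similar one), it uses the closest-pair distance between clusters, and it recomputes the cluster distances in every round instead of updating a stored matrix. The names `agglomerative` and `closest_pair_distance` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_pair_distance(points, a, b):
    """Distance between clusters a and b (lists of point indices): closest pair."""
    return min(euclidean(points[i], points[j]) for i in a for j in b)


def agglomerative(points):
    """Merge the closest pair of clusters until one remains; return the merge history."""
    clusters = [[i] for i in range(len(points))]  # start from singleton clusters
    merges = []
    while len(clusters) > 1:
        # Steps 1 and 3: (re)compute the distance between every pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = closest_pair_distance(points, clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        # Step 2: merge the closest pair and record the merge.
        merged = sorted(clusters[x] + clusters[y])
        merges.append((clusters[x], clusters[y], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges  # Step 4: the loop ends when a single cluster remains


points = [[0.0], [1.0], [5.0], [6.0], [20.0]]
for left, right, dist in agglomerative(points):
    print(left, right, dist)
```

The returned merge history, read from first to last entry, is the information a dendrogram displays from the bottom up.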
Agglomerative algorithms are distinguished according to the inter-cluster similarity measure they use. The most popular of these are single-link, complete-link and group average. In the single-link method, the distance between clusters is the minimum distance between any pair of elements drawn from these clusters; in the complete-link method it is the maximum distance; and in the average-link method it is correspondingly the average distance.
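The three measures differ only in how they aggregate the pairwise distances between two clusters. A small sketch follows, assuming Euclidean distance between elements; the function names are chosen for this example.

```python
import math
from itertools import product


def euclidean(a, b):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise(cluster_a, cluster_b):
    """All distances between one element of each cluster."""
    return [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]


def single_link(cluster_a, cluster_b):
    """Minimum distance over all cross-cluster pairs."""
    return min(pairwise(cluster_a, cluster_b))


def complete_link(cluster_a, cluster_b):
    """Maximum distance over all cross-cluster pairs."""
    return max(pairwise(cluster_a, cluster_b))


def average_link(cluster_a, cluster_b):
    """Mean distance over all cross-cluster pairs."""
    d = pairwise(cluster_a, cluster_b)
    return sum(d) / len(d)


a = [[0.0], [1.0]]
b = [[4.0], [6.0]]
print(single_link(a, b), complete_link(a, b), average_link(a, b))  # 3.0 6.0 4.5
```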
1) Advantages of hierarchical algorithm:
i) Embedded flexibility regarding the level of granularity.
ii) Ease of handling any form of similarity or distance.
iii) Applicability to any attribute type.
2) Disadvantages of hierarchical algorithm:
i) Vagueness of termination criteria.
ii) Most hierarchical algorithms do not revisit clusters once they are constructed in order to improve them.

B. Fuzzy-c means algorithm
Bezdek [5] introduced the Fuzzy C-Means (FCM) clustering method in 1981 as an extension of the Hard C-Means clustering method. FCM is used to sort out complex, multi-dimensional data in a dataset whose members are only partially related, that is, have fuzzy relations. FCM is an unsupervised clustering method that is applied to a wide range of problems connected with feature analysis. It is used in document clustering, engineering, astronomy and image analysis.
The method builds on Ruspini's fuzzy clustering theory, which was proposed in 1969. The analysis is based on the distances between the input data points. For each cluster, the distance between the data points and the cluster centre is computed.
Fuzzy-C Means Algorithm
1. Initialize the membership matrix $U = [u_{ij}]$ as $U^{(0)}$.
2. At step $k$, calculate the centre vectors $C^{(k)} = [c_j]$ from $U^{(k)}$.
3. Update $U^{(k)}$ to $U^{(k+1)}$.
4. If $\lVert U^{(k+1)} - U^{(k)} \rVert < \varepsilon$, then STOP; otherwise return to step 2.
Where,
$n$ is the number of data points.
$c_j$ represents the jth cluster centre.
$m$ is the fuzziness index, $m \in [1, \infty)$.
$c$ represents the number of cluster centres.
$u_{ij}$ represents the membership of the ith data point in the jth cluster.
$\lVert x_i - c_j \rVert$ is the Euclidean distance between the ith data point and the jth cluster centre.
$\varepsilon$ is the termination threshold.
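The steps above leave the update rules implicit. In the standard formulation of Fuzzy C-Means, given here as the commonly used textbook form and not reproduced from this paper, the algorithm minimises the objective $J_m$, and steps 2 and 3 use the centre and membership updates below. The formulas require $m > 1$, and $k$ in the last line is a summation index over clusters, not the iteration counter.

```latex
\begin{align*}
J_m &= \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m} \, \lVert x_i - c_j \rVert^{2} \\
c_j &= \frac{\sum_{i=1}^{n} u_{ij}^{m} \, x_i}{\sum_{i=1}^{n} u_{ij}^{m}} \\
u_{ij} &= \frac{1}{\sum_{k=1}^{c} \left( \dfrac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}}
\end{align*}
```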
Each FCM iteration moves the cluster centres towards their correct positions within the dataset. In general, Fuzzy C-Means is K-Means clustering with fuzzy logic introduced. FCM works on membership weights, which is a natural way of producing clusters.
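A compact sketch of the FCM iteration in plain Python is given below. It uses the standard centre and membership updates of Fuzzy C-Means rather than anything specific to this paper. The function name `fuzzy_c_means`, the defaults `m=2.0`, `eps=1e-5`, `max_iter=100` and `seed=0`, and the use of the largest absolute membership change as the stopping norm are all assumptions made for illustration.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Return (centres, U), where U[i][j] is the membership of point i in cluster j."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Step 1: random memberships, each row normalised to sum to 1.
    U = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        U.append([v / total for v in row])

    centres = []
    exponent = 2.0 / (m - 1.0)  # requires m > 1
    for _ in range(max_iter):
        # Step 2: centres are membership-weighted means of the data.
        centres = []
        for j in range(c):
            w = [U[i][j] ** m for i in range(n)]
            centres.append([sum(w[i] * points[i][d] for i in range(n)) / sum(w)
                            for d in range(dim)])
        # Step 3: recompute memberships from the distances to the centres.
        new_U = []
        for i in range(n):
            dist = [math.dist(points[i], centres[j]) for j in range(c)]
            if min(dist) == 0.0:
                # The point coincides with a centre: full membership there.
                hit = dist.index(0.0)
                new_U.append([1.0 if j == hit else 0.0 for j in range(c)])
                continue
            new_U.append([1.0 / sum((dist[j] / dist[k]) ** exponent for k in range(c))
                          for j in range(c)])
        # Step 4: stop when the largest membership change falls below eps.
        change = max(abs(new_U[i][j] - U[i][j]) for i in range(n) for j in range(c))
        U = new_U
        if change < eps:
            break
    return centres, U


data = [[0.0], [0.2], [0.1], [5.0], [5.2], [5.1]]
centres, U = fuzzy_c_means(data, c=2)
for point, row in zip(data, U):
    print(point, [round(u, 3) for u in row])
```

Each row of `U` sums to 1, so a point lying between two groups receives a sizeable membership in both, which is the behaviour that distinguishes FCM from the hard assignment of k-means.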
1) Advantages of fuzzy-c means:
i) Gives the best results for overlapping datasets and is comparatively better than the k-means algorithm.
ii) Unlike k-means, where a data point must belong exclusively to one cluster centre, here each data point is assigned a membership to every cluster centre, so a data point may belong to more than one cluster.
2) Disadvantages of fuzzy-c means:
i) A priori specification of the number of clusters.
ii) With a lower value of the termination threshold we get a better result, but at the expense of more iterations.
iii) Euclidean distance measures can unequally weight underlying factors.

III. CONCLUSIONS AND FUTURE WORK
In this survey, various clustering approaches and algorithms for document clustering are presented. There are many open issues in the area of document clustering. The paper gives interested readers a summary and a broad overview of the existing techniques. As future work, we plan to improve existing systems to obtain better results. These results can be used, for example, to improve search results.

REFERENCES
[1] A. L. N. Fred and A. K. Jain, "Combining multiple clusterings using evidence accumulation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 835-850, Jun. 2005.
[2] T. C. Havens, J. C. Bezdek, C. Leckie, and L. O. Hall, "Fuzzy c-means for very large data," ieeexplore.ieee.org, 2012.
[3] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[4] A. Strehl and J. Ghosh, "Cluster ensembles: A knowledge reuse framework for combining multiple partitions," J. Mach. Learning Res., vol. 3, pp. 583-617, 2002.
[5] R. Xu and D. Wunsch II, "Survey of Clustering Algorithms," IEEE Transactions on Neural Networks, vol. 16, no. 3, 2005.
[6] T. Kanungo and D. M. Mount, "An Efficient K-means Clustering Algorithm: Analysis and Implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, 2002.
[7] V. S. Rao and S. Vidyavathi, "Comparative Investigations and Performance Analysis of FCM and MFPCM Algorithms on Iris data," Indian Journal of Computer Science and Engineering, vol. 1, no. 2, pp. 145-151, 2010.
