


Clustering Techniques in Data Mining: A Comparison

Garima, Amity University, Uttar Pradesh, Noida, U.P., INDIA. Email Id: gars.arora@gmail.com
Hina Gulati, Amity University, Uttar Pradesh, Noida, U.P., INDIA. Email Id: hinagu821990@gmail.com
P. K. Singh, Amity University, Uttar Pradesh, Noida, U.P., INDIA. Email Id: pksingh16@amity.edu

Abstract – Clustering is a technique in which a given data set is divided into groups called clusters in such a manner that the data points that are similar lie together in one cluster. Clustering plays an important role in the field of data mining due to the large size of data sets. This paper reviews the various clustering algorithms available for data mining and provides a comparative analysis of clustering algorithms such as DBSCAN, CLARA, CURE, CLARANS and K-Means.

Keywords – Fuzzy C-Means (FCM), Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), Clustering Using Representatives (CURE), Density Based Clustering (DBSCAN), Ordering Points To Identify the Clustering Structure (OPTICS), Distributed Density-Based Clustering (DDC)

I. INTRODUCTION

Clustering is defined as the unsupervised classification of data items or observations: the data sets have not been classified into any group, so they have no class attribute associated with them. Clustering is widely used as one of the important steps in exploratory data analysis [12]. Clustering algorithms are used to find useful and previously unidentified classes of patterns. Clustering divides the data into groups of similar objects; objects that are dissimilar are placed in separate clusters. Depending upon the metric chosen, a data object may belong to a single cluster or to more than one cluster. For example, consider a retail database that contains information about the items purchased by consumers: clustering will group the consumers according to their buying patterns. When we group objects into clusters, simplification is achieved at the cost of losing some information, so the choice of clustering algorithm is a critical step.

Our focus in this paper is to provide a comparative analysis of various clustering algorithms used in data mining, based upon the similarity criterion and the complexity. This paper is organized as follows: Section II provides the literature review. Section III covers the discussion on clustering algorithm techniques. Section IV gives the comparison among the existing clustering algorithms, followed by the conclusion in Section V.

II. LITERATURE REVIEW

In [12], Dalal et al. discussed clustering techniques and divided them into three major categories: partitional clustering, density based clustering and hierarchical clustering, which are further subdivided as shown in Fig. 2.1. According to Lu et al. [8], when the data is large and continuous in nature, the traditional approaches of mining are not applicable, because real-time data changes quickly and also requires a fast response. Random access to a data stream is very expensive, so only a single pass over the streaming data is provided; the storage needed is also very large. Therefore, clustering and mining techniques designed for stream data are required. In Fahad et al. [7], the effectiveness of candidate clustering algorithms is measured through a number of internal and external validity metrics, stability, run time and scalability tests. A big volume of data, or big data, has its own deficiencies: it needs big storage, and this volume makes operations such as analytical, process and retrieval operations very difficult and hugely time consuming. To overcome these problems, big data is clustered into a compact format that is still an informative version of the entire data. DENCLUE, OptiGrid and BIRCH are suitable clustering algorithms for dealing with large datasets; no clustering algorithm performs well for all the evaluation criteria, and future work should be dedicated to addressing the drawbacks of each clustering algorithm for handling big data.

According to Cutting et al. [16], document clustering is usually not received well due to some constraints: firstly, clustering is relatively slow when handling large document collections, and moreover it is thought that clustering does not help in improving retrieval. The authors proposed a document browsing technique that uses fast clustering algorithms as primitive operations. They formulated an interactive browsing technique known as the Scatter/Gather technique, which requires a fast clustering method. Its working can be understood by considering the example of a
textbook: if we need to search for some specific term we go directly to the index, whereas if we are looking for the answer to some general question we search the table of contents. The proposed system, Scatter/Gather, uses a cluster-based navigation method to navigate among the documents, grouping one or more similar documents together for future reference. Initially the scatter process clusters the documents into groups, and then the gather process selects groups to form sub-collections. The technique rests on two facilities, clustering and reclustering, for which the Buckshot and Fractionation algorithms are used.

According to Deepika et al. [9], distributed clustering is defined as the process of extracting classification data from distributed nodes. If the traditional clustering algorithms are applied to distributed databases, they require the transfer of all the datasets to a central site, which is impractical due to the presence of a large dataset at each local site as well as the limited bandwidth and privacy concerns. Distributed clustering is divided into two types, as illustrated by the sketch below:
a) Hard Clustering: In hard clusters (crisp datasets) each data object can exist in only one cluster, i.e. the clusters are disjoint. Examples: PAM, K-Means clustering.
b) Soft Clustering: In soft clusters (fuzzy datasets) every data object belongs to every cluster to some degree. Examples: neural networks and fuzzy clustering.
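To make the hard/soft distinction concrete, the short sketch below (our own illustration in Python/NumPy; neither the language nor the data comes from the cited papers) assigns the same points to two fixed centroids, first crisply and then with fuzzy memberships:

```python
import numpy as np

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.2], [3.5, 3.5]])
centroids = np.array([[1.0, 1.0], [5.0, 5.0]])

# Euclidean distance of every point to every centroid
dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Hard (crisp) clustering: each point belongs to exactly one cluster
hard_labels = dist.argmin(axis=1)

# Soft (fuzzy) clustering: each point belongs to every cluster to a degree;
# here the membership is simply the normalised inverse distance
inv = 1.0 / (dist + 1e-12)
memberships = inv / inv.sum(axis=1, keepdims=True)   # each row sums to 1

print(hard_labels)            # one label per point
print(memberships.round(2))   # one membership vector per point
```

In the hard case each point receives a single label; in the soft case the point midway between the groups receives a split membership rather than being forced into one cluster.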
The classifications of clustering algorithms for distributed data are shown in Fig. 2.2. Distributed clustering works on two types of architectures:
a) Homogeneous datasets: each local site has the same attributes.
b) Heterogeneous datasets: each local site has different attributes, but the linking between the clusters depends on a common key attribute.

According to Zhao et al. [22], fast and high-quality clustering algorithms are important for browsing large amounts of information. Hierarchical clustering approaches such as partitional and agglomerative clustering can be used; experimental evaluation showed that partitional algorithms consistently outperform agglomerative algorithms due to their relatively low computational requirements and good clustering performance. In [17] the authors presented a new class of clustering algorithms, named constrained agglomerative algorithms, which combines both the partitional and the agglomerative approach. In that work a new clustering algorithm is proposed that is robust to outliers and deals with non-spherical clusters. It is a hierarchical clustering algorithm that takes a middle ground between centroid-based and all-point methods, and it deals with the shortcomings of both. The authors provided a comparison of the BIRCH algorithm with CURE; to handle large databases the algorithm uses random partitioning and sampling. Clustering algorithms are used in many applications such as speech and character recognition, image segmentation and vector quantization [21].

The main drawbacks of existing clustering algorithms are the random selection of initial centroids and their limited ability to deal with continuously arriving data. Another reviewed paper presents a novel Particle Swarm Optimization (PSO) based unsupervised clustering algorithm for clustering students from continuously arriving data. PSO is a population-based stochastic optimization technique modeled on a swarm of particles. The authors describe the PSO algorithm and use it to cluster students based on three sample factors: efficiency, accuracy and error count.
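The reviewed paper is not reproduced in detail here, so the following is only a generic sketch of PSO-based clustering under common textbook assumptions: each particle encodes k candidate centroids, the fitness is the sum of squared distances of the points to their nearest centroid, and the usual inertia/cognitive/social velocity update is applied. All parameter values and names below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sse(centroids, data):
    """Fitness: sum of squared distances of each point to its nearest centroid."""
    d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

def pso_cluster(data, k=2, particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    n_dim = k * data.shape[1]
    lo, hi = data.min(axis=0), data.max(axis=0)
    # Each particle is a flattened set of k candidate centroids
    x = rng.uniform(np.tile(lo, k), np.tile(hi, k), size=(particles, n_dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_fit = np.array([sse(p.reshape(k, -1), data) for p in x])
    gbest = pbest[pbest_fit.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        # Inertia + pull toward personal best + pull toward global best
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        fit = np.array([sse(p.reshape(k, -1), data) for p in x])
        better = fit < pbest_fit
        pbest[better], pbest_fit[better] = x[better], fit[better]
        gbest = pbest[pbest_fit.argmin()].copy()
    return gbest.reshape(k, -1)   # best centroids found by the swarm

data = rng.normal(size=(150, 2)) + rng.choice([0, 6], size=(150, 1))
print(pso_cluster(data, k=2))
```

Because the swarm searches the centroid space globally, this style of algorithm is less sensitive to the initial centroid choice than plain K-Means, which is exactly the drawback the reviewed paper targets.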
III. CLUSTERING ALGORITHM TECHNIQUES

Clustering algorithms are classified [1] according to:
• The type of input
• The clustering criterion defining the similarity between the objects
• The concepts on which the clustering analysis techniques are based (e.g. fuzzy theory, numerical data, categorical data etc.)
Several clustering algorithms along with their application areas are shown in Fig. 3.1.

A. Partitional Clustering

In partitional clustering, clusters are represented by a prototype and an iterative control strategy is used to optimise the clustering. The partitioning algorithms divide the data sets into various subsets called partitions, and each partition represents a cluster. The clusters formed have the following characteristics:
• Each cluster must have at least one object
• Each object must be a part of exactly one cluster, i.e. no overlapping

Partitioning techniques are divided into two types:
• Medoid Algorithms: Each cluster contains the instances that are closest to the gravity centre.
• Centroid Algorithms: The gravity centre of the instances is used to represent each cluster. Example: the K-Means clustering technique, where the data set is partitioned into k subsets in such a manner that all points in a given subset are closest to the same gravity centre. The effectiveness of the K-Means technique depends on the objective function used to calculate the distance between the instances. K-Means requires that all the data be available beforehand (see the sketch after this list).
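A minimal Lloyd-style K-Means in NumPy (an illustrative sketch, not the specific variant of any surveyed paper) makes the gravity-centre idea explicit; note that the whole dataset must sit in memory, matching the prior-availability requirement just mentioned:

```python
import numpy as np

def k_means(data, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as the initial gravity centres
    centres = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point joins the subset of its closest centre
        d = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centre moves to the gravity centre of its subset
        new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):   # converged after l < iters iterations
            break
        centres = new
    return centres, labels

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
centres, labels = k_means(data, k=2)
print(centres)
```

The loop also shows where n (points), k (clusters) and l (iterations) enter the O(nkl) time complexity given next.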
Time Complexity: O(nkl), where n is the number of patterns, k is the number of clusters and l is the number of iterations taken to converge. Generally, the values of k and l are fixed in advance to attain linear time complexity.
Space Complexity: O(k+n); additional storage is required for storing the data.
B. Hierarchical Clustering

In the hierarchical technique a tree of clusters, called a dendrogram, is constructed depending on the measure of proximity. Each cluster node contains other nodes called child nodes, and nodes from the same parent are called sibling nodes. Hierarchical techniques have the property of quick termination. Examples of hierarchical clustering include CURE, BIRCH and CHAMELEON. Hierarchical techniques are divided into two categories:
• Agglomerative Method: Works in a bottom-up fashion, merging the closest pair of clusters at each step until all data instances belong to the same cluster. Closeness is defined as single-link, complete-link or average-link (single- and complete-link are contrasted in the sketch below):
  - Single-link closeness: the similarity between the two most similar instances, each of which is in a separate cluster. Clusters of elliptical shape are handled efficiently, but it is sensitive to noise and errors.
  - Complete-link closeness: the similarity between the most dissimilar instances, each belonging to a separate cluster. It is not suitable for convex-shaped clusters, but is less sensitive to noise, errors and outliers.
• Divisive Method: Works in a top-down fashion, in which the data set is split into smaller clusters until each cluster consists of only one data instance.
Time Complexity: O(n² log n), where n is the number of patterns.
Space Complexity: O(n²)
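The sketch below contrasts the single-link and complete-link criteria using SciPy's hierarchical clustering routines; the library choice and the synthetic data are our own assumptions, since the paper does not prescribe any implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

# Bottom-up merging; the linkage method is the closeness definition
Z_single = linkage(data, method='single')      # single-link closeness
Z_complete = linkage(data, method='complete')  # complete-link closeness

# Cut each dendrogram so that exactly two clusters remain
labels_single = fcluster(Z_single, t=2, criterion='maxclust')
labels_complete = fcluster(Z_complete, t=2, criterion='maxclust')
print(labels_single)
print(labels_complete)
```

On well-separated blobs like these the two criteria agree; their behaviour diverges on elongated clusters (where single-link chains through) and on noisy data (where complete-link is the more robust of the two), as described above.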
C. Density Based Clustering (DBSCAN)

In DBSCAN, different distance metrics can be used, and the number of clusters is determined automatically by the algorithm. Data objects are separated depending on their connectivity, boundary or region; each data point either belongs to some cluster or is classified as noise. The data points are either core points, border points or noise points (illustrated in the sketch below):
• Core points: points that lie inside a cluster. A point is considered to be inside a cluster if the number of data points in its neighbourhood exceeds a certain threshold value.
• Border points: points that are not core points but lie in the neighbourhood of a core point.
• Noise points: points that are neither core points nor border points.
Time Complexity: O(m × time taken to find the points in a neighbourhood), where m is the number of points.
Space Complexity: O(m)
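A brief sketch of these point roles, using scikit-learn's DBSCAN on synthetic data (again an assumed library choice, not something the paper specifies); `eps` is the neighbourhood radius and `min_samples` the density threshold for a core point:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
dense = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
noise = rng.uniform(-2, 7, (10, 2))            # scattered outliers
data = np.vstack([dense, noise])

# eps: neighbourhood radius; min_samples: density threshold for a core point
db = DBSCAN(eps=0.5, min_samples=5).fit(data)

core_mask = np.zeros(len(data), dtype=bool)
core_mask[db.core_sample_indices_] = True      # core points
noise_mask = db.labels_ == -1                  # points classified as noise
border_mask = ~core_mask & ~noise_mask         # in a cluster, but not core

print(np.unique(db.labels_))                   # cluster count found automatically
print(core_mask.sum(), border_mask.sum(), noise_mask.sum())
```

Note that no cluster count is passed in: the number of clusters emerges from the density parameters, exactly as stated above.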

D. Neural Networks

Neural networks are non-linear data modelling tools that simulate the working of the brain. They are used to identify the relationships between patterns depending on the input and output values.

Fig. 1. Clustering algorithms and their applications

E. Fuzzy Clustering

Fuzzy clustering is a reasoning-based technique in which the association between the patterns and the clusters is made on the basis of membership functions. Fuzzy clustering generates overlapping clusters; a minimal membership-update sketch follows.
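The sketch below is a minimal Fuzzy C-Means in NumPy, illustrating the membership-function idea; the update formulas are the standard textbook FCM ones with fuzzifier m, not taken from any specific surveyed implementation:

```python
import numpy as np

def fuzzy_c_means(data, c=2, m=2.0, iters=100, seed=4):
    """Minimal FCM sketch: u[i, j] is the membership of point i in cluster j."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(data), c))
    u /= u.sum(axis=1, keepdims=True)           # memberships of a point sum to 1
    for _ in range(iters):
        w = u ** m                              # fuzzified weights
        # Each centre is the membership-weighted mean of all points
        centres = (w.T @ data) / w.sum(axis=0)[:, None]
        d = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2) + 1e-12
        # Standard FCM membership update: closer centres get higher membership
        p = 2.0 / (m - 1.0)
        inv = d ** (-p)
        u = inv / inv.sum(axis=1, keepdims=True)
    return centres, u

rng = np.random.default_rng(5)
data = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
centres, u = fuzzy_c_means(data)
print(u.round(2))   # overlapping clusters: every point has a degree in each
```

Unlike the hard assignment in K-Means, every point here keeps a nonzero degree of membership in every cluster, which is what produces the overlapping clusters described above.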
F. Grid Based Clustering

Grid based techniques are mostly used in spatial data mining. They divide the space into a finite number of cells, and all operations are then performed on the quantised space, as in the sketch below.
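A toy quantisation sketch (our own illustration, far simpler than STING or OptiGrid) shows the idea of working on cells rather than on individual points:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(6)
data = np.vstack([rng.normal(0, 0.5, (60, 2)), rng.normal(4, 0.5, (60, 2))])

cell_size = 1.0
# Quantise the space: map every point to the index of its grid cell
cells = np.floor(data / cell_size).astype(int)
counts = Counter(map(tuple, cells))

# All further work happens on cells, not points: keep only the dense cells
threshold = 5
dense_cells = {c for c, n in counts.items() if n >= threshold}
print(sorted(dense_cells))

# A point's cluster is then decided by the dense cell it falls in
labels = [tuple(c) if tuple(c) in dense_cells else None for c in cells]
```

Because the expensive operations run over the (fixed) number of cells rather than the number of points, the cost scales with the grid resolution, which is why STING's complexity below is stated in terms of grid cells.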
IV. COMPARISON OF CLUSTERING ALGORITHMS

In this section we analyse various clustering algorithms; the analysis is presented in Table I.
TABLE I. ANALYSIS OF CLUSTERING ALGORITHMS

| Authors | Algorithm | Clustering Category | Dataset | Advantages | Disadvantages | Complexity |
|---|---|---|---|---|---|---|
| Dalal, Harale (2011) | K-Means clustering | Partitioning method | Numerical data (crisp data set) | Large datasets are processed easily; simple to implement; results are easy to interpret | Sensitive to noise; depends on initial value of k; poor locally optimal solutions | O(n) |
| Dhillon, Modha (2002) | K-Means, parallel implementation | Partitioning method | Crisp and homogeneous datasets | Computation time is reduced by a factor of p (no. of processors) | Does not support heterogeneous datasets | |
| Vaidya, Clifton (2003) | Privacy-preserving K-Means clustering over vertically partitioned data | Partitioning method | Crisp data sets | Privacy and security | Does not support heterogeneous datasets | |
| Moore, Hall (2004) | DCA | Partitioning method | Both crisp and fuzzy datasets | Random division and distribution of the data set to local sites | Complex due to the collision in centroid mapping | |
| Visalakshi, Thangavel, Parvathi (2008) | MDCA | Partitioning method | Crisp and heterogeneous datasets | Useful with non-uniform schemas | | |
| Coletta, Vendramin, Hruschka, Pedrycz (2012) | PFCM | Fuzzy method | Fuzzy and homogeneous datasets | Overlapping clusters are formed due to the use of a membership function; the number of clusters need not be defined in advance | Data must be of similar type, i.e. homogeneous | O(n) |
| Ghanem, Kechadi, Tari (2011) | IFDFC | Fuzzy method | Fuzzy and homogeneous datasets | Privacy; easy to use and implement | Homogeneous datasets only | |
| Xu, Jager, Kriegel (1999) | Parallel DBSCAN | Density based method | Crisp and heterogeneous data sets | Excellent scalability; noise is differentiated efficiently | | O(n log n) |
| Januzaj, Kriegel, Pfeifle (2004) | SDBDC | Density based method | Crisp and heterogeneous data sets | Privacy | Threshold identification is difficult | |
| Klusch, Lodi, Moro (2003) | KDEC | Density based method | Crisp and homogeneous data sets | Privacy | Does not support heterogeneous datasets | |
| Khac, Aouad, Kechadi (2007) | DDC | Density based method | Crisp and heterogeneous data sets | Handles arbitrarily shaped clusters; deals with noise and errors | | |
| Ghanem, Kechadi, Tari (2011) | OPTICS | Density based method | Crisp and heterogeneous data sets | Communication overhead is low | | |
| Yi-Hong Lu, Yan Huang (2005) | Scalable K-Means | Clustering by point method | | At most one scan of the data is required | Compression schemes used cause a large overhead | |
| Yi-Hong Lu, Yan Huang (2005) | Statistical clustering | Grid based | Threshold based | Saves memory; no nearest-neighbour identification problem in high-dimensional data; capable of finding clusters of arbitrary shape | Assumes initial clusters have data as a line; does not work well when the dimensionality of the data is large | |
| Yi-Hong Lu, Yan Huang (2005) | Regression analysis | Mathematical model | | Does not depend on initial values or on criteria to combine clusters | | |
| Kaufman | PAM | Partitional method | Numerical dataset | | Expensive due to the comparison of each object with the entire dataset | O(k(n-k)²) |
| Kyuseok Shim, Rajeev Rastogi (1998) | BIRCH | Hierarchical based | Numerical dataset | Handles noise easily | Order sensitive | O(n) |
| Sudipto Guha, Rajeev Rastogi (1998) | CURE | Hierarchical based | Numerical dataset | Handles large databases and improves quality | | O(n² log n) time, O(n) space |
| Deepika Singh (2013) | STING | Grid based | Spatial data set | Arbitrarily shaped clusters; handles noise efficiently | | O(k), k is the number of grid cells at the lowest level |
| — | Particle Swarm Optimization | Unsupervised clustering | Categorical data (student data) | Clusters are formed on the basis of three parameters, i.e. quality, efficiency and accuracy | Real-time and continuous data collection is needed | |
| Tzung Hong | Evolutionary Composite Attribute Clustering | Genetic algorithm | Real-world datasets (fuzzy data) | Speeds up classification time; reduces cost; good performance | Large number of attributes | |
| Daniel Jaeger, Johannes Barth (2014) | pyGCluster | Hierarchical based | | Overall data size is reduced | | |
V. CONCLUSION

This paper discusses various clustering algorithms, namely partitional clustering, hierarchical clustering, density based clustering and grid based clustering, together with their time and space complexities.

Partitional clustering generally represents the clusters using a prototype. Partitional clustering algorithms are very useful when the clusters are of convex shape and similar size and when the number of clusters can be identified in advance.

When the number of clusters cannot be predicted in advance, hierarchical clustering algorithms are used. They divide the dataset into several levels of partitioning called dendrograms. These algorithms are very effective in mining, but the cost of forming the dendrograms is very high for large datasets.

Density based clustering techniques are very useful in mining large datasets because they can easily identify noise and can deal with clusters of arbitrary shape.

REFERENCES
[1] M. Halkidi, Y. Batistakis, M. Vazirgiannis, Clustering Algorithms and Validity Measures, IEEE, 2001, pp. 3-22
[2] Mukhopadhyay, S. Bandyopadhyay, Survey of Multiobjective Evolutionary Algorithms for Data Mining: Part I, IEEE Transactions on Evolutionary Computation, Vol. 18, No. 1, February 2014, pp. 4-19
[3] Mukhopadhyay, S. Bandyopadhyay, Survey of Multiobjective Evolutionary Algorithms for Data Mining: Part II, IEEE Transactions on Evolutionary Computation, Vol. 18, No. 1, February 2014, pp. 20-35
[4] Q. Liu, W. Jin, S. Wu, Y. Zhou, Clustering Research using Dynamic Modelling based on Granular Computing, IEEE, 2005, pp. 539-543
[5] M. Halkidi, I. Koutsopoules, Online Clustering of Distributed Streaming Data using Belief Propagation Techniques, IEEE, 2009, pp. 216-225
[6] Barhate Sachin R., Shelake Vijay M., A Survey and Future Vision, Second International Conference on Advanced Computing and Communication Technologies, IEEE, 2012, pp. 96-100
[7] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, A. Zomaya, I. Khalil, S. Foufou, A. Bouras, A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis, IEEE, 2014, pp. 267-279
[8] Y. Hong Lu, Y. Huang, Mining Data Streams using Clustering, Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, 18-21 August 2005, IEEE, 2005, pp. 2079-2083
[9] Deepika Singh, Anjana Gosain, A Comparative Analysis of Distributed Clustering Algorithms: A Survey, 2013 International Symposium on Computational and Business Intelligence, IEEE, 2013, pp. 165-169
[10] R. H. Sheikh, M. M. Raghuwanshi, A. N. Jaiswal, Genetic Algorithm Based Clustering: A Survey, First International Conference on Emerging Trends in Engineering and Technology, IEEE, 2008, pp. 314-319

[11] S. Dhillon, Y. Guan, J. Kogan, Iterative Clustering of High Dimensional Text Data Augmented by Local Search, ICDM, IEEE, 2002, pp. 131-138
[12] M. A. Dalal, N. D. Harale, A Survey on Clustering in Data Mining, International Conference and Workshop on Emerging Trends in Technology, TCET, Mumbai, India, ACM, 2011, pp. 559-562
[13] J. Lin, E. Keogh, W. Truppel, Clustering of Streaming Time Series is Meaningless, ACM, 2003, pp. 56-65
[14] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z. H. Zhou, M. Steinbach, D. J. Hand, D. Steinberg, Top 10 Algorithms in Data Mining: A Survey, Springer, 4 December 2007
[15] D. Jaeger, J. Barth, A. Niehues, C. Fufezan, pyGCluster: A Novel Hierarchical Clustering Approach, Vol. 30, No. 6, 2014, pp. 896-898
[16] D. R. Cutting, D. R. Karger, J. O. Pedersen, J. W. Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, ACM, 1992, pp. 318-329
[17] S. Guha, R. Rastogi, K. Shim, CURE: An Efficient Clustering Algorithm for Large Databases, ACM, 1998, pp. 73-84
[18] M. S. Chen, J. Han, P. S. Yu, Data Mining: An Overview from a Database Perspective, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996, pp. 866-883
[19] R. T. Ng, J. Han, CLARANS: A Method for Clustering Objects for Spatial Data Mining, IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 5, September/October 2002, pp. 1003-1016
[20] M. Ester, H. Kriegel, J. Sander, X. Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, KDD-96 Proceedings, pp. 226-231
[21] P. Berkhin, Survey of Clustering Data Mining Techniques, Accrue Software, Inc.
[22] Y. Zhao, G. Karypis, Evaluation of Hierarchical Clustering Algorithms for Document Datasets, ACM, 2002, pp. 515-524
