ABSTRACT
Clustering is one of the techniques used for extracting structure from data, where it is useful for finding the significant structures in the data. Data can be clustered with different clustering techniques, mostly through an unsupervised process, so evaluation of the clustering is required. The K-means algorithm is one of the most common clustering techniques and has been evaluated against many issues. This paper takes up one of these issues: the sensitivity of clustering, when the data to be clustered contain noise features, to the number of clusters, with the entropy metric used to evaluate that sensitivity. Experimental work is performed by applying the k-means algorithm to the iris data, following a specific procedure in order to consolidate the results.
Keywords: k-means, entropy, number of clusters,
clustering quality, iris data analysis
I. INTRODUCTION
In recent years, the high dimensionality of the modern
massive datasets has provided a considerable challenge
to k-means clustering approaches. First, the curse of
dimensionality can make algorithms for k-means
clustering very slow, and, second, the existence of
many irrelevant features may not allow the
identification of the relevant underlying structure in
the data [1].
Generally, feature subset selection can be viewed as the process of identifying and removing as many irrelevant and redundant features as possible. This is because, firstly, irrelevant features do not contribute to predictive accuracy, and, secondly, redundant features do not help in obtaining a better predictor, since they provide mostly information that is already present in other features [2].
Stability of a learning algorithm with respect to small
input perturbations is an important property, as it
implies that the derived models are robust with respect
to the presence of noisy features and/or data sample
fluctuations [3]. Traditionally, the feature subset
selection research has focused on searching for
relevant features [2]. The criterion used to evaluate the "goodness" of a specific subset of features follows either the wrapper model or the filter model. According to the former, the clustering algorithm, C, itself is run on each candidate feature subset, and the quality of the resulting clusters is used to evaluate that subset; according to the latter, subsets are evaluated by measures independent of the clustering algorithm.
II. K-MEANS CLUSTERING
International Journal of Scientific Research Engineering & Technology (IJSRET), ISSN 2278-0882, Volume 5, Issue 4, April 2016, www.ijsret.org
E = \sum_{i=1}^{K} \sum_{x \in C_i} d(x, m_i)^2    (1)

In the above equation, m_i is the center of cluster C_i, while d(x, m_i) is the Euclidean distance between a point x and m_i. Thus, the criterion function E attempts to minimize the distance of each point from the center of the cluster to which the point belongs [10].
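As a small numeric illustration of criterion (1) (a toy example, not taken from the paper): two one-dimensional clusters whose centers are their means.

```python
# Toy check of the k-means criterion E: the sum of squared distances
# of each point to the center of its own cluster.
C1, C2 = [1.0, 3.0], [9.0, 11.0]      # two 1-D clusters
m1 = sum(C1) / len(C1)                # center of C1 -> 2.0
m2 = sum(C2) / len(C2)                # center of C2 -> 10.0
E = sum((x - m1) ** 2 for x in C1) + sum((x - m2) ** 2 for x in C2)
print(E)  # 4.0
```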
The K-means algorithm requires three user-specified parameters: the number of clusters K, the cluster initialization, and the distance metric. Each cluster is initialized with one object that serves as its center. A cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of its own cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster. Different initializations can lead to different final clusterings because K-means only converges to local minima. One way to overcome the local minima is to run the K-means algorithm, for a given K, with multiple different initial partitions and choose the partition with the smallest squared error [11].
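The multiple-restart strategy described above can be sketched as follows (a minimal NumPy sketch, not the paper's implementation; the function names `kmeans` and `kmeans_restarts` are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=None):
    """One run of k-means: initialize each cluster with one object as its
    center, then alternate assignment and center updates until convergence."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign every point to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the centroid (mean) of its points
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    E = ((X - centers[labels]) ** 2).sum()  # squared-error criterion (1)
    return labels, centers, E

def kmeans_restarts(X, k, restarts=10, seed=0):
    """Run k-means with several initial partitions for a given K and
    keep the partition with the smallest squared error [11]."""
    runs = [kmeans(X, k, seed=seed + r) for r in range(restarts)]
    return min(runs, key=lambda run: run[2])
```

Because K-means only reaches a local minimum, individual runs can return different partitions; the restart wrapper simply keeps the one with the lowest squared error.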
III. ENTROPY

Entropy measures the impurity of the resulting clusters with respect to the given class labels. For a cluster j, the entropy is

E_j = -\sum_{i} p_{ij} \log p_{ij}    (2)

where p_{ij} is the probability that a member of cluster j belongs to class i. The entropy of a clustering is obtained by combining the values E_j over all clusters, and a lower entropy indicates purer clusters.
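The entropy computation can be sketched in Python (an illustrative sketch, not the paper's code; since the paper's exact aggregation across clusters is not reproduced here, this version weights each cluster's entropy by its relative size, a common choice):

```python
import math
from collections import Counter

def cluster_entropy(cluster_labels, class_labels):
    """Size-weighted average of per-cluster class entropies.
    0 means every cluster is pure; larger values mean more mixing."""
    n = len(cluster_labels)
    members = {}
    for c, y in zip(cluster_labels, class_labels):
        members.setdefault(c, []).append(y)
    total = 0.0
    for ys in members.values():
        counts = Counter(ys)
        # per-cluster entropy E_j = -sum_i p_ij * log2(p_ij), as in Eq. (2)
        e_j = -sum((m / len(ys)) * math.log2(m / len(ys))
                   for m in counts.values())
        total += (len(ys) / n) * e_j  # weight by cluster size
    return total
```

For example, `cluster_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])` gives 0.0 (two pure clusters), while putting all four points in one cluster gives 1.0 (maximally mixed two-class cluster).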
IV. EXPERIMENT WORK

Table(1) entropy values of the k-means runs for each feature subset and number of clusters

features subset used                                 | two clusters | three clusters | four clusters | five clusters
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width | 0.83  | 0.89  | 0.84  | 1.17
Sepal.Length, Sepal.Width, Petal.Length              | 0.95  | 1.465 | 1.72  | 1.351
Sepal.Length, Sepal.Width, Petal.Width               | 1.09  | 0.74  | 0.74  | 0.74
Sepal.Length, Petal.Length, Petal.Width              | 0.78  | 0.42  | 0.48  | 0.56
Sepal.Width, Petal.Length, Petal.Width               | 1.41  | 1.16  | 1.62  | 2.25
Sepal.Length, Sepal.Width                            | 0.95  | 0.94  | 1.03  | 0.81
Sepal.Length, Petal.Length                           | 1.33  | 1.3   | 1.34  | 1.48
Sepal.Length, Petal.Width                            | 0.693 | 0.73  | 0.64  | 0.919
Sepal.Width, Petal.Width                             | 0.78  | 0.5   | 0.69  | 1.37
Sepal.Width, Petal.Length                            | 0.78  | 0.44  | 0.44  | 0.69
Petal.Length, Petal.Width                            | 1.61  | 1.99  | 2.46  | 2.63
Sepal.Length                                         | 0.91  | 0.91  | 0.74  | 0.84
Sepal.Width                                          | 1.67  | 2.4   | 3.1   | 3.63
Petal.Length                                         | 0.78  | 0.5   | 0.68  | 0.67
Petal.Width                                          | 1.06  | 0.41  | 0.44  | 0.68

Table(2) statistical metrics of entropy values from different number of clusters

metric  | two clusters | three clusters | four clusters | five clusters
max     | 1.67   | 2.4    | 3.1    | 3.63
min     | 0.693  | 0.41   | 0.44   | 0.56
average | 1.0468 | 0.975  | 1.1373 | 1.3186
Stdv.   | 0.3146 | 0.5966 | 0.7847 | 0.8771
B. EXPERIMENTAL PROCEDURE
The experiment in this paper was done through a specific procedure. The K-means algorithm was executed for many runs, and the runs were performed with different subsets of features. With each subset of features, the k-means algorithm was executed four times, each run with a different number of clusters (two, three, four, and five). Then the entropy of the resulting clusters was calculated based on the class label available with the data, as shown in Table(1). The entropy value increases as more classes are mixed within an individual cluster, which means impurity of that cluster, and vice versa.
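The run schedule of this procedure can be enumerated programmatically (an illustrative sketch: the 15 non-empty subsets of the four features, each clustered with four different numbers of clusters, give the 60 entropy values of Table(1)):

```python
from itertools import combinations

features = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]

# all non-empty feature subsets: 2**4 - 1 = 15
subsets = [list(c) for r in range(1, len(features) + 1)
           for c in combinations(features, r)]

# one k-means run per (subset, number of clusters) pair
runs = [(subset, k) for subset in subsets for k in (2, 3, 4, 5)]
print(len(subsets), len(runs))  # 15 60
```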
C. DISCUSSION
The disparity of the entropy results across runs with different subsets of features indicates the effect of the selected features on clustering. As a result, the significant results point to the noise features, whose inclusion increases the entropy values. It is obvious that the two features "Sepal.Length" and "Sepal.Width" are noise features, because the resulting entropy values are high in comparison with those of the other features, "Petal.Length" and "Petal.Width".
On the other side, the results in Table(1) reveal that the number of clusters plays a role in the entropy values obtained across the different runs, as shown in Table(2), which gives statistical metrics for the runs grouped by number of clusters, and in Figure(1).
V. CONCLUSION
REFERENCES
A. IRIS DATA
This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. There are four variables (which can be used as features) and one variable as class label, as follows: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species as the class variable. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other.

Figure(1) entropy values with (a) two clusters (b) three clusters (c) four clusters (d) five clusters