CHAPTER 1
INTRODUCTION

1.1 Introduction to Clustering
Clustering can be considered the most important unsupervised learning problem; like
every other problem of this kind, it deals with finding structure in a collection of
unlabeled data. A loose definition of clustering is the process of organizing objects
into groups whose members are similar in some way. A cluster is therefore a
collection of objects that are similar to one another and dissimilar to the objects
belonging to other clusters. Two or more objects belong to the same cluster if they
are close according to a given distance (in this case, geometrical distance). This is
called distance-based clustering. Another kind of clustering is conceptual clustering:
two or more objects belong to the same cluster if the cluster defines a concept common
to all of those objects. In other words, objects are grouped according to their fit to
descriptive concepts, not according to simple similarity measures. The goal of
clustering is to determine the intrinsic grouping in a set of unlabeled data. But how
does one decide what constitutes a good clustering? It can be shown that there is no
absolute best criterion that is independent of the final aim of the clustering.
Consequently, it is the user who must supply this criterion, in such a way that the
result of the clustering suits their needs. For instance, we could be interested in
finding representatives for homogeneous groups (data reduction), in finding natural
clusters and describing their unknown properties (natural data types), in finding
useful and suitable groupings (useful data classes), or in finding unusual data objects
(outlier detection) [1].

Possible Applications
Clustering algorithms can be applied in many fields, for instance:
Marketing: finding groups of customers with similar behavior given a large
database of customer data containing their properties and past buying
records;
Biology: classification of plants and animals given their features;
Libraries: book ordering;
Insurance: identifying groups of motor insurance policy holders with a
high average claim cost; identifying frauds;
City-planning: identifying groups of houses according to their house type,
value and geographical location;
Earthquake studies: clustering observed earthquake epicenters to identify
dangerous zones;
WWW: document classification; clustering weblog data to discover groups
of similar access patterns.

Requirements
The main requirements that a clustering algorithm should satisfy are:
scalability;
dealing with different types of attributes;
discovering clusters with arbitrary shape;
minimal requirements for domain knowledge to determine input parameters;
ability to deal with noise and outliers;
insensitivity to the order of input records;
ability to handle high dimensionality;
interpretability and usability.

1.2 Types of Clustering
A large number of clustering algorithms exist in the literature. The choice of
clustering algorithm depends both on the type of data available and on the particular
purpose and application. If cluster analysis is used as a descriptive or exploratory tool,
it is possible to try several algorithms on the same data to see what the data may
disclose. In general, the major clustering methods can be classified into the following
categories [1].

1.2.1 Partitioning methods
Given a database of n objects or data tuples, a partitioning method constructs k
partitions of the data, where each partition represents a cluster and k <= n. That is, it
classifies the data into k groups, which together satisfy the following requirements:

Each group must contain at least one object, and
Each object must belong to exactly one group. Notice that the second
requirement can be relaxed in some fuzzy partitioning techniques.
Given k, the number of partitions to construct, a partitioning method creates an initial
partitioning. It then uses an iterative relocation technique that attempts to improve the
partitioning by moving objects from one group to another. The general criterion of a
good partitioning is that objects in the same cluster are "close" or related to each
other, whereas objects of different clusters are "far apart" or very different. There are
various other kinds of criteria for judging the quality of partitions. To achieve global
optimality in partitioning-based clustering would require the exhaustive enumeration
of all of the possible partitions. Instead, most applications adopt one of two popular
heuristic methods:

1. The k-means algorithm, where each cluster is represented by the mean value of
the objects in the cluster, and
2. The k-medoids algorithm, where each cluster is represented by one of the
objects located near the center of the cluster.

These heuristic clustering methods work well for finding spherical-shaped clusters in
small to medium-sized databases. To find clusters with complex shapes and to cluster
very large data sets, partitioning-based methods need to be extended. Partitioning-
based clustering methods are studied in depth later.

Given a database of objects and k, the number of clusters to form, a partitioning
algorithm organizes the objects into k partitions (k <= n), where each partition
represents a cluster. The clusters are formed to optimize an objective partitioning
criterion, often called a similarity function, such as distance, so that the objects within
a cluster are "similar," whereas the objects of different clusters are "dissimilar" in
terms of the database attributes [2].

1.2.1.1 Classical Partitioning Methods: k-means and k-medoids
The most well-known and commonly used partitioning methods are k-means,
k-medoids, and their variations.



Centroid-Based Technique: The K-Means method
The k-means algorithm takes the input parameter, k, and partitions a set of n objects
into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster
similarity is low. Cluster similarity is measured in regard to the mean value of the
objects in a cluster, which can be viewed as the cluster's center of gravity. "How does
the k-means algorithm work?" The k-means algorithm proceeds as follows. First, it
randomly selects k of the objects, each of which initially represents a cluster mean or
center. For each of the remaining objects, an object is assigned to the cluster to which
it is the most similar, based on the distance between the object and the cluster mean. It
then computes the new mean for each cluster. This process iterates until the criterion
function converges. Typically, the squared-error criterion is used, defined as

E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2

where E is the sum of the squared error for all objects in the database, p is the point in
space representing a given object, and m_i is the mean of cluster C_i (both p and m_i are
multidimensional). This criterion tries to make the resulting k clusters as compact and
as separate as possible. The algorithm attempts to determine k partitions that
minimize the squared-error function. It works well when the clusters are compact clouds
that are rather well separated from one another. The method is relatively scalable and
efficient in processing large data sets because the computational complexity of the
algorithm is O(nkt), where n is the total number of objects, k is the number of
clusters, and t is the number of iterations. Normally, k << n and t << n. The method
often terminates at a local optimum. The k-means method, however, can be applied
only when the mean of a cluster is defined. This may not be the case in some
applications, such as when data with categorical attributes are involved. The necessity
for users to specify k, the number of clusters, in advance can also be seen as a
disadvantage. The k-means method is not suitable for discovering clusters with
nonconvex shapes or clusters of very different size. Moreover, it is sensitive to noise and
outlier data points, since a small number of such data can substantially influence the
mean value [3].
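
To make the procedure above concrete, the following is a minimal sketch of the k-means
loop in Python using only NumPy. The function name, the convergence test, and the
handling of empty clusters are our own illustrative choices, not part of any cited
implementation.
*******************************************************************
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Randomly select k of the objects as the initial cluster means.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each cluster mean (keep the old center if a cluster empties).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # criterion function converged
            break
        centers = new_centers
    # Squared-error criterion: E = sum over clusters of ||p - m_i||^2.
    E = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return centers, labels, E
*******************************************************************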



1.2.2 Hierarchical methods
A hierarchical method creates a hierarchical decomposition of the given set of data
objects. A hierarchical method can be classified as being either agglomerative or
divisive, based on how the hierarchical decomposition is formed. The agglomerative
approach, also called the bottom-up approach, starts with each object forming a
separate group. It successively merges the objects or groups close to one another, until
all of the groups are merged into one (the topmost level of the hierarchy) or until a
termination condition holds. The divisive approach, also called the top-down
approach, starts with all the objects in the same cluster and successively splits a
cluster into smaller clusters, until eventually each object is in one cluster or until a
termination condition holds. Hierarchical methods suffer from the fact that once a
step (merge or split) is done, it can never be undone. This rigidity is useful in that it
leads to smaller computation costs by not worrying about a combinatorial number of
different choices. However, a major problem of such techniques is that they cannot
correct erroneous decisions. There are two approaches to improving the quality of
hierarchical clustering: perform careful analysis of object "linkages" at each
hierarchical partitioning, as in CURE and Chameleon, or integrate hierarchical
agglomeration and iterative relocation by first using a hierarchical agglomerative
algorithm and then refining the result using iterative relocation, as in BIRCH.
A hierarchical clustering method works by grouping data objects into a tree of clusters.
Hierarchical clustering methods can be further classified into agglomerative and
divisive hierarchical clustering, depending on whether the hierarchical decomposition is
formed in a bottom-up or top-down fashion. The quality of a pure hierarchical
clustering method suffers from its inability to perform adjustment once a merge or
split decision has been executed. Recent studies have emphasized the integration of
hierarchical agglomeration with iterative relocation methods [4].









1.2.2.1 Agglomerative and Divisive Hierarchical Clustering
In general, there are two types of hierarchical clustering methods:

Agglomerative hierarchical clustering:
This bottom-up strategy starts by placing each object in its own cluster and then
merges these atomic clusters into larger and larger clusters, until all of the objects are
in a single cluster or until certain termination conditions are satisfied. Most
hierarchical clustering methods belong to this category. They differ only in their
definition of inter-cluster similarity.

Divisive hierarchical clustering:
This top-down strategy does the reverse of agglomerative hierarchical clustering by
starting with all objects in one cluster. It subdivides the cluster into smaller and
smaller pieces, until each object forms a cluster on its own or until it satisfies certain
termination conditions, such as a desired number of clusters is obtained or the distance
between the two closest clusters is above a certain threshold. Four widely
used measures for the distance between clusters are as follows, where |p - p'| is the
distance between two objects or points p and p', m_i is the mean for cluster C_i, and
n_i is the number of objects in C_i [5]:
minimum distance: d_{min}(C_i, C_j) = \min_{p \in C_i, p' \in C_j} |p - p'|
maximum distance: d_{max}(C_i, C_j) = \max_{p \in C_i, p' \in C_j} |p - p'|
mean distance: d_{mean}(C_i, C_j) = |m_i - m_j|
average distance: d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|
A small computational sketch of these four measures is given below.
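
The following Python sketch computes the four inter-cluster distances for two clusters
given as NumPy arrays; the function name is ours, chosen for readability.
*******************************************************************
import numpy as np

def cluster_distances(Ci, Cj):
    """Compute the four classical inter-cluster distance measures.

    Ci, Cj: (n_i, d) and (n_j, d) arrays of points.
    Returns (d_min, d_max, d_mean, d_avg).
    """
    # Pairwise Euclidean distances |p - p'| for all p in Ci, p' in Cj.
    pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)
    d_min = pairwise.min()    # single-link (minimum) distance
    d_max = pairwise.max()    # complete-link (maximum) distance
    d_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))  # |m_i - m_j|
    d_avg = pairwise.mean()   # (1 / (n_i * n_j)) * sum over all pairs
    return d_min, d_max, d_mean, d_avg
*******************************************************************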

1.2.3 Density-Based Methods
Most partitioning methods cluster objects based on the distance between objects. Such
methods can find only spherical-shaped clusters and encounter difficulty at
discovering clusters of arbitrary shapes. DBSCAN's definition of a cluster is based on
the notion of density-reachability. Basically, a point q is directly density-reachable
from a point p if it is not farther away than a given distance ε (i.e., it is part of p's
ε-neighborhood) and if p is surrounded by sufficiently many points that one may
consider p and q to be part of a cluster. q is called density-reachable (note: this is
different from "directly density-reachable") from p if there is a sequence of points
p_1, ..., p_n with p_1 = p and p_n = q, where each p_{i+1} is directly density-reachable
from p_i. Note that the relation of density-reachability is not symmetric, so the notion
of density-connectedness is introduced: two points p and q are density-connected if
there is a point o such that both p and q are density-reachable from o. Other clustering
methods have been developed based on the notion of density. Their general idea is to
continue growing the given cluster as long as the density in the "neighborhood"
exceeds some threshold; that is, for each data point within a given cluster, the
neighborhood of a given radius has to contain at least a minimum number of points.
Such a method can be used to filter out noise and discover clusters of arbitrary shape.
DBSCAN is a typical density-based method that grows clusters according to a density
threshold; OPTICS is a density-based method that computes an augmented cluster
ordering for automatic and interactive cluster analysis.
Grid-based methods: Grid-based methods quantize the object space into a finite
number of cells that form a grid structure. All of the clustering operations are
performed on the grid structure. The main advantage of this approach is its fast
processing time, which is typically independent of the number of data objects and
dependent only on the number of cells in each dimension of the quantized space.
STING is a typical example of a grid-based method. CLIQUE and WaveCluster are
two clustering algorithms that are both grid-based and density-based.
Model-based methods: Model-based methods hypothesize a model for each of the
clusters and find the best fit of the data to the given model. A model-based algorithm
may locate clusters by constructing a density function that reflects the spatial
distribution of the data points. It also leads to a way of automatically determining the
number of clusters based on standard statistics, taking "noise" or outliers into account
and thus yielding robust clustering methods. Model-based clustering methods are
studied below. Some clustering algorithms integrate the ideas of several clustering
methods, so that it is sometimes difficult to classify a given algorithm as uniquely
belonging to only one clustering method category. Furthermore, some applications
may have clustering criteria that require the integration of several clustering
techniques. In the following sections, we examine each of the above five clustering
methods in detail. We also introduce algorithms that integrate the ideas of several
clustering methods. Outlier analysis, which typically involves clustering, is described
at the end of this section [6].
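
To illustrate the two parameters that drive DBSCAN (the radius ε and the minimum
number of points MinPts), here is a compact Python sketch of the core-point test and
the cluster-growing step. This is a didactic reduction of the published algorithm, with
our own naming (eps, min_pts), not a faithful reimplementation.
*******************************************************************
import numpy as np

def region_query(X, i, eps):
    """Indices of all points within distance eps of point i (its eps-neighborhood)."""
    return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

def dbscan(X, eps=0.5, min_pts=5):
    labels = np.full(len(X), -1)   # -1 marks noise / not yet clustered
    cluster = 0
    for i in range(len(X)):
        if labels[i] != -1:
            continue
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:      # not a core point: leave as noise for now
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                      # grow the cluster via density-reachability
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster
                j_neighbors = region_query(X, j, eps)
                if len(j_neighbors) >= min_pts:   # j is also a core point
                    seeds.extend(j_neighbors)
        cluster += 1
    return labels
*******************************************************************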

1.3 Thesis Statement
Several clustering methods are studied, including Density-Based clustering
(DBSCAN), Particle Swarm Optimization (PSO), hierarchical clustering, hierarchical
agglomerative clustering (HAC), C-means, and K-means. Some algorithms integrate
the ideas of several clustering methods, so that it is sometimes difficult to classify a
given algorithm as uniquely belonging to only one clustering method category.
Furthermore, some applications may have clustering criteria that require the
integration of several clustering techniques. In the following sections, we examine the
K-means clustering algorithm combined with a genetic algorithm over the Breast
Cancer dataset, Thyroid dataset, and E. coli dataset in detail.

1.4 Problem Statement
In the field of data mining there are many clustering techniques, such as K-means,
C-means, hierarchical clustering, and DBSCAN. Clustering is used in important areas
like pattern recognition, image analysis, bioinformatics, earthquake studies, and
insurance. But these clustering techniques face three basic problems. The first is the
seed generation problem, the second is the generation of the right number of clusters,
and the third is the cluster validation problem. In this thesis our main interest is to
overcome these problems by using a genetic algorithm.

1.5 Scope of this Thesis
Among the existing clustering techniques, the K-means algorithm is a powerful
clustering algorithm for numeric data sets and is the most widely used in data mining
for clustering. This algorithm randomly generates the cluster centers, called the seeds
of the clusters, and then builds the whole clusters by measuring the minimum
distances between points. The objective of this thesis is therefore to generate an initial
seed that is an optimal seed of the cluster, and then to generate the whole cluster using
the KVG algorithm, so that the result is more accurate and less error-prone compared
to K-means. This work is done by using a Genetic Algorithm through VSM. VSM,
the vector sequence method, maintains the flow of iteration and manipulates the
selection of seeds for the genetic algorithm.




1.6 Document Organization
This dissertation consists of seven chapters.
Chapter 1 Introduction
This chapter introduces some basic concepts, the problems to address, and the thesis
statement and provides the motivation and justification for the work described in this
dissertation.

Chapter 2 Literature Survey
This chapter provides a brief description of different clustering algorithms and
genetic algorithms.

Chapter 3 Theoretical Aspects
This chapter describes the basics of K-means and the genetic algorithm on different
datasets (breast cancer dataset, thyroid dataset, E. coli dataset).

Chapter 4 Setting up Environment
This chapter provides details for setting up the MATLAB environment.

Chapter 5 Proposed Models for Classification
This chapter provides the details of the proposed model for clustering. It also describes
the use of the genetic algorithm with the K-means algorithm.

Chapter 6 Result Analysis
This chapter provides the results on various datasets using clustering algorithms and
the combination of the genetic algorithm with K-means. It also analyses the results obtained.

Chapter 7 Conclusion and Future Work
This chapter includes conclusion and future scope of the dissertation.



CHAPTER 2
LITERATURE SURVEY

2.1 K-Means Clustering
The idea is to choose random cluster centers, one for each cluster. These centers are
preferred to be as far as possible from each other. Starting points affect the clustering
process and results. After that, each point is taken into consideration to calculate its
similarity with all cluster centers through a distance measure, and it is assigned to the
most similar cluster, that of the nearest cluster center. When this assignment process is
over, a new center is calculated for each cluster using the points in it. For each
cluster, the mean value of the coordinates of all the points in that cluster is calculated
and set as the coordinates of the new center. Once we have these k new centroids or
center points, the assignment process must start over. As a result of this loop we may
notice that the k centroids change their locations step by step until no more changes
are made. When the centroids do not move any more and no more errors exist in the
clusters, we say the clustering has reached a minimum. Finally, this algorithm aims at
minimizing an objective function, which in this case is a squared-error function. A
simple approach for choosing k is to compare the results of multiple runs with
different numbers of clusters and choose the best one according to a given criterion.
However, we need to be careful: increasing k results in smaller error-function values
by definition, due to the smaller number of data points each center represents, and
thus the model loses its generalization ability as the risk of overfitting increases.
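
As an illustration of this "compare multiple runs" idea, the loop below runs
scikit-learn's k-means for several values of k on synthetic data and prints the squared
error (inertia) of each run; the toy data and parameter choices are ours, purely for
demonstration.
*******************************************************************
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three Gaussian blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(100, 2))
               for c in ([0, 0], [3, 3], [0, 4])])

# Run k-means for several k and record the squared error (inertia) of each run.
for k in range(1, 8):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k={k}: squared error = {sse:.1f}")
# The error always decreases with k; one looks for the "elbow" where the
# decrease flattens, rather than simply picking the largest k.
*******************************************************************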

2.1.1 Initialization for Clustering Techniques
The main purpose of clustering algorithm modifications is to improve the
performance of the underlying algorithms by fixing their weaknesses. Because
randomness is one of the techniques used in initializing many clustering methods,
giving each point an equal opportunity to be an initial one, it is considered the main
point of weakness that has to be solved. Moreover, because the sensitivity of K-Means
to its initial points is very high, we have to make them as near to the global minimum
as possible in order to improve the clustering performance [3, 5].

K-Means is one of the most common algorithms used for clustering. The algorithm
classifies pixels to a predefined number of clusters (assume k clusters). The idea is to
choose random cluster centers called centroids, one for each cluster. These centroids
are preferred to be as far as possible from each other. Initial points affect the
clustering process and results. After that, each pixel will be taken into consideration to
calculate similarity with all cluster centers through a distance measure, and it will be
assigned to the most similar cluster, the nearest cluster center. When this assignment
process is over, a new centroid is calculated for each cluster using the pixels in it. For
each cluster, the mean value will be calculated for the coordinates of all the points in
that cluster and set as the coordinates of the new center. Once we have these k new
centroids or center points, the assignment process must start over. This process is
repeated until there is no change in centroids. Finally, this algorithm aims at
minimizing an objective function, which in this case is a squared-error function, as
given by Eq. (1):

E = \sum_{k=1}^{K} \sum_{x \in C_k} \sum_{a=1}^{A} (x_a - m_{k,a})^2    (1)


In this formula K is the number of clusters, x represents a data point, Ck represents
cluster k, mk represents the mean of the cluster k, and A is the total number of
attributes for a data point.
The K-means algorithm starts by initializing the k cluster centers. The input data
points are then allocated to one of the existing clusters according to the square of the
Euclidean distance from the clusters, choosing the closest. The mean (centroid) of
each cluster is then computed so as to update the cluster center. This update occurs as
a result of the change in the membership of each cluster. The processes of re-assigning
the input vectors and updating the cluster centers are repeated until there is no more
change in the value of any cluster center. The K-means clustering method can be
considered a cunning method because, to obtain a better result, the centroids are kept
as far as possible from each other.





The steps for the K-means algorithm are given below:
1. Initialization: choose randomly K input vectors (data points) to initialize the
clusters.
2. Nearest-neighbor search: for each input vector, find the cluster center that is
closest, and assign that input vector to the corresponding cluster.
3. Mean update: update the cluster centers in each cluster using the mean
(Centroid) of the input vectors assigned to that cluster.
4. Stopping rule: repeat steps 2 and 3 until no more change in the value of the
means.
One drawback of K-means is that it is sensitive to the initially selected points, and so
it does not always produce the same output. To avoid this problem, the algorithm may
be run many times before taking the average values over all runs, or at least the
median value [3].
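
One common mitigation, sketched below under our own naming, is to restart k-means
several times from different random seeds and keep the run with the lowest squared
error; scikit-learn's n_init parameter automates exactly this restart-and-keep-best loop.
*******************************************************************
import numpy as np
from sklearn.cluster import KMeans

def best_of_restarts(X, k, restarts=10):
    """Run k-means `restarts` times from different random seeds; keep the best."""
    best = None
    for seed in range(restarts):
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        if best is None or km.inertia_ < best.inertia_:
            best = km   # lowest squared error seen so far
    return best

X = np.random.default_rng(1).normal(size=(300, 2))
model = best_of_restarts(X, k=3)
print(model.inertia_, model.cluster_centers_)
*******************************************************************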

2.2 A Fast Genetic K-means Clustering Algorithm
A new clustering algorithm called the Fast Genetic K-means Algorithm (FGKA) was
proposed. FGKA is inspired by the Genetic K-means Algorithm (GKA) but features
several improvements over it, including efficient calculation of the total within-cluster
variations (TWCVs), avoidance of the illegal-string elimination overhead, and a
simplification of the mutation operator. The initialization phase and the three operators
are redefined to achieve these improvements. The experiments indicate that, while the
K-means algorithm might converge to a local optimum, both FGKA and GKA always
converge to the global optimum eventually, but FGKA runs much faster than GKA [7].
FGKA maintains a population (set) of Z coded solutions (partitions), where Z is a
parameter specified by the user. Each solution is coded by a string a1...aN of length N.
Given a solution Sz = a1...aN, the legality ratio of Sz, e(Sz), is defined as the number
of non-empty clusters in Sz divided by K. Sz is legal if e(Sz) = 1, and illegal otherwise.
FGKA starts with the initialization phase, which generates the initial population P0.
The population in the next generation Pi+1 is obtained by applying the following
genetic operators sequentially: the selection, the mutation, and the K-means operator
on the current population Pi. The evolution takes place until the termination condition
is reached. The initialization phase randomly generates the initial population P0 of Z
solutions, which might end up with illegal strings. At first sight, illegal strings are
undesirable, and for this reason the GKA algorithm makes a significant effort to
eliminate them. Illegal strings, however, are permitted in FGKA, but are considered
the most undesirable solutions by defining their TWCVs as +∞ and assigning them
lower fitness values. This flexibility of allowing illegal strings in the evolution process
avoids the overhead of illegal-string elimination, and thus improves the time
performance of the algorithm [7].
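
The following short Python sketch shows, under our own illustrative naming, how a
coded solution (a string a1...aN of cluster labels) maps to the two quantities FGKA
relies on: the total within-cluster variation and the legality ratio. The +inf convention
for illegal strings follows the description above.
*******************************************************************
import numpy as np

def twcv(X, labels, k):
    """Total within-cluster variation of a coded solution; +inf if illegal."""
    if len(np.unique(labels)) < k:        # some cluster is empty: illegal string
        return np.inf
    return sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
               for j in range(k))

def legality_ratio(labels, k):
    """e(Sz): the number of non-empty clusters divided by K."""
    return len(np.unique(labels)) / k

X = np.random.default_rng(0).normal(size=(20, 2))
labels = np.random.default_rng(1).integers(0, 3, size=20)  # a random string a1...aN
print(twcv(X, labels, k=3), legality_ratio(labels, k=3))
*******************************************************************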






2.3 Agglomerative Hierarchical Clustering based on Affinity
Propagation Algorithm
A hierarchical clustering based approach for identifying refactorings that would
improve the class structure of a software system was introduced [8]. In this direction,
a hierarchical agglomerative clustering algorithm (HAC) was developed. The
algorithm suggests the refactorings needed in order to improve the structure of the
software system. The main idea is that clustering is used in order to obtain a better
design, suggesting the needed refactorings. Real applications evolve in time, and new
application classes are added in order to meet new requirements. Obviously, for
obtaining under these conditions a restructuring model of the modified software
system, the clustering algorithm (HAC in this approach) can be applied from scratch
every time the set of application classes changes. This means that every time the
software system changes, the extended system is analyzed starting from the entire set
of classes, methods, and attributes, and HAC is applied to obtain an improved
structure of the system. But this can be inefficient, particularly for large software
systems. The authors address this by proposing an adaptive method to cope with the
evolving set of application classes. The method is based on detecting stable structures
(cores) inside the restructuring model of the system and resuming the clustering
process from these structures when the set of application classes grows. The aim is to
reach the result more efficiently than applying HAC again from scratch on the
extended software system.
In order to implement the hierarchical structure which extends the single-level
classification to a hierarchical multilevel one, the first task is to divide all class
hypotheses into groups level by level. For this, agglomerative clustering can be used
to merge the same classes or groups of classes, level by level, until all classes are
merged together. For example, the classes in Figure 2.1 can be agglomerated level by
level; there, class1 and class2 have greater similarity (smaller distance) and are
merged together in the first level. The distance between classes or groups of them is
performance based. The most straightforward performance-based distance between
two classes or class groups is probably the accuracy of identifying them from each
other. Higher accuracy indicates that they are easier to discriminate (larger distance).
To calculate this distance, a series of pair-wise classification experiments is performed
at each classification level, and the classes in each pair that has the smallest
performance are chosen to be merged.





Figure 2.1. An example of class distribution

The affinity propagation (AP) algorithm does not fix the number of clusters and does
not rely on random sampling [8]. It exhibits fast execution speed with a low error rate.
However, it is hard for it to generate optimal clusters. The paper proposes an
agglomerative clustering method based on affinity propagation to overcome this
limitation. It puts forward a k-cluster closeness measure to merge the clusters yielded
by AP. In comparison to AP, the method has better performance, with quality better
than or equal to that of the AP method, and it has an advantage in time complexity
compared to adaptive affinity propagation. The paper studies properties of the AP
algorithm and then proposes agglomerative hierarchical clustering based on AP. It
generates the initial division by an AP partition, and it defines a novel cluster
closeness based on the neighbor relationship, which can evade the influence of
density. Based on this, the algorithm can quickly and effectively perform
agglomerative hierarchical clustering and generate better clusters. Experiments show
that it works better than the original AP, yields a more accurate division, and has an
advantage in time complexity compared with adaptive affinity propagation. How to
deal with data with complicated structure and noise is a direction for future research.


2.4 Hybridized Improved Genetic Algorithm with Variable Length
Chromosome for Image Clustering
Clustering is a process of putting similar data into groups. This paper presents data
clustering using an improved genetic algorithm (IGA) in which an efficient method of
crossover and mutation is implemented. Further, it is hybridized with the popular
Nelder-Mead (NM) simplex search and K-means to exploit the potential of both in
the hybridized algorithm. The performance of the hybrid approach is evaluated on a
few data clustering problems. Further, a variable-length IGA is proposed which
optimally finds the clusters of benchmark image datasets, and its performance is
compared with K-means and GCUK. The results revealed are very encouraging for
IGA and its hybridization with other algorithms. Although K-means clustering is a
very well established approach, it has some demerits of initialization and falling into
local minima. GA, being a randomized approach, has the capability to alleviate the
problems faced by K-means. In this paper an improved version of GA (IGA) is
discussed and implemented for data clustering, in which a new approach of crossover
and offspring formation is adopted. When applied to data clustering problems, IGA
performs better than K-means on all data sets under study in the paper. However, to
further improve the performance of IGA on data clustering, K-means was hybridized
with it, resulting in KM-IGA; to boost KM-IGA further, it has been hybridized with
Nelder-Mead, resulting in KM-NM-IGA. In the hybrid algorithm (KM-NM-IGA) the
outcome of K-means becomes one of the chromosomes in the initial population of
NM-IGA. The results reveal that the hybrid algorithm gives better results compared
to K-means, IGA, and Nelder-Mead. Since the clustering results achieved by the IGA
are satisfactory, the IGA is applied to the image clustering problem by proposing a
new variable-length IGA (VLIGA) for automatic evolution of clusters. Experiments
were carried out with three standard natural grey-scale images to evaluate the
performance of the proposed VLIGA. It was evident from the results that the VLIGA
algorithm was effective compared to GCUK and the traditional K-means algorithm.
Further enhancements will include the study of higher-dimensional data sets and large
data sets for clustering. Also, datasets with mixed data can be studied. It is also
planned to study the appropriateness of the hybrid algorithm (KM-NM-IGA) for
image clustering and extend the same to color images. The K-means algorithm tends
to converge faster than the IGA as it requires fewer function evaluations, but it
usually results in less accurate clustering. One can take advantage of its speed at the
inception of the clustering process and leave accuracy to be achieved by other
methods at a later stage of the process. This is verified by showing that the results of
clustering by IGA can be further improved by seeding the initial population with the
outcome of the K-means algorithm (denoted KM-IGA and KM-NM-IGA). More
specifically, the hybrid algorithm first executes the K-means algorithm, which
terminates when there is no change in the centroid vectors. In the case of KM-IGA,
the result of the K-means algorithm is used as one of the chromosomes, while the
remaining chromosomes are initialized randomly. The IGA algorithm then proceeds
as presented above. In the case of KM-NM-IGA, the first chromosome is seeded from
the K-means algorithm and the remaining 3N particles (or vertices, as they are
termed) are randomly generated, and NM-IGA is then carried out to its completion [9].

2.5 Gene Expression Analysis Using Clustering
Data mining has become an important topic in the effective analysis of gene
expression data due to its wide application in the biomedical industry. In this paper,
the k-means clustering algorithm is extensively studied for gene expression analysis.
Since the purpose is to demonstrate the effectiveness of the k-means algorithm for a
wide variety of data sets, two pattern recognition data sets and thirteen microarray
data sets with both overlapping and non-overlapping class boundaries were taken for
the study, where the number of features/genes ranges from 4 to 7129 and the number
of samples ranges from 32 to 683. The number of clusters ranges from two to eleven.
For pattern recognition, the IRIS and WBCD data are used, and for microarray data
the serum data, yeast data, leukemia data, breast data, lymphoma data, lung cancer
data, and St. Jude leukemia data are used. To identify common subtypes in
independent disease data, four different types of breast data and four Diffused Large
B-cell Lymphoma (DLBCL) data sets were used. The clustering error rate (or
clustering accuracy) is used as the evaluation metric to measure the performance of
the k-means algorithm. Clustering is an efficient way of analyzing information from
microarray data, and k-means is a basic method for it. K-means can be very easily
applied to microarray data, and its performance varies depending on the nature and
complexity of the data. The maximum accuracy is achieved for the IRIS data, whereas
the lowest is for DLBCL D. K-means has some serious drawbacks, and many papers
have been presented in the past to improve it. In the future, the authors plan to study
k-means clustering with other heuristic-based search methods like SA and GA [10].

2.6 Enhancing Cluster Compactness using GA Initialized K-means
This paper presents a new initialization technique for clustering. A genetic algorithm
is used for optimal centroid selection, and these centroids act as starting points for
k-means. Previous research used GA-initialized K-means (GAIK) for clustering. In
this paper some modification is done and a partition-based GA-initialized K-means
(PGAIK) technique is introduced in order to improve the clustering performance. To
measure cluster compactness, a within-cluster scatter criterion is used. Experimental
results show that PGAIK yields more compact clusters as compared to simple GAIK.
The initialization step is very important for any clustering algorithm. The
experimental results show that the partition-based random initialization method
performs well and yields more compact clusters as compared to normal random
selection [11].
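
As a point of reference, a within-cluster scatter criterion of the kind mentioned above
can be computed as follows. This is a generic formulation (sum of squared deviations
of members from their cluster centroid); the paper's exact normalization is not
reproduced here.
*******************************************************************
import numpy as np

def within_cluster_scatter(X, labels):
    """Sum over clusters of squared distances of members to their centroid.

    Smaller values indicate more compact clusters.
    """
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total
*******************************************************************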
2.7 Ant-based Clustering Algorithms
Ant-based clustering is a biologically inspired data clustering technique. The
clustering task aims at the unsupervised classification of patterns into different
groups. The clustering problem has been approached from different disciplines in
recent years, and many algorithms have been developed for solving numerical and
combinatorial optimization problems. Among the most promising are swarm
intelligence algorithms. Clustering with swarm-based algorithms is emerging as an
alternative to more conventional clustering techniques, and these algorithms have
recently been shown to produce good results in a wide variety of real-world
applications. During the last five years, research on and with ant-based clustering
algorithms has reached a very promising state. This paper presents a brief survey of
ant-based clustering algorithms and some of their applications. Ant-based clustering
algorithms are an appropriate alternative to traditional clustering algorithms, with a
number of features that make them an interesting subject for cluster analysis: they can
automatically discover the number of clusters; they scale linearly with the
dimensionality of the data; and their nature makes them fairly robust to the effects of
outliers within the data. Research on ant-based clustering algorithms is still an
ongoing field. The survey concludes by listing some important future works and
research trends for ant-based clustering algorithms:
a comparative study of ant clustering performance with respect to other clustering algorithms;
applying ant clustering algorithms to real-world applications;
effects of user-defined parameters on performance;
a hierarchical analysis of the input data by varying some of the user-defined parameters;
sensitivity analysis of the various user-defined parameters of ant clustering algorithms;
determining optimal values of parameters other than pick and drop policies;
developing new probabilistic rules for picking and dropping objects;
studying the effect of a reasonably good validity index function to judge the fitness of
several possible partitionings of the data in ant-based clustering schemes, and
validating it mathematically;
studying the possibility of dynamic clustering using ant clustering in data mining applications;
applying ant clustering algorithms to multi-objective optimization problems;
studying the transformation of ant clustering algorithms into supervised algorithms;
developing new theoretical results on the behavior of ant clustering algorithms and
studying hierarchical ant-based clustering algorithms;
analyzing the working principles that ant-based clustering shares with other clustering methods;
hybridization of ant clustering algorithms with alternative clustering methods [12].

2.8 Fuzzy Kernel K-Means Clustering Method Based on Immune GA
A fuzzy kernel k-means clustering method based on an immune genetic algorithm
(IGA-FKKM) is proposed in this paper to overcome the dependence on the shape of
the sample space and the local optimization of the fuzzy k-means algorithm. By
mapping samples from a low-dimensional space into a high-dimensional feature space
with a Mercer kernel, the method eliminates the influence of the shape of the sample
space on clustering accuracy. Meanwhile, the probability of obtaining the global
optimum is also increased by using the immune genetic algorithm. Compared with the
fuzzy k-means clustering method (FKM) and the fuzzy k-means clustering method
based on a genetic algorithm (GA-FKM), IGA-FKKM is validated by experimental
results to achieve higher classification accuracy. Dependence of fuzzy k-means
clustering on the distribution of samples is eliminated with the introduction of the
kernel function. The immune genetic algorithm is also used to suppress fluctuation at
later stages of the evolution and to avoid local optima. Compared with FKM and
GA-FKM, the experimental results show that IGA-FKKM obtains the global optimum
and has higher cluster accuracy. Further study will focus on dealing with the
sensitivity of the clustering algorithm to initial values.
Separately, experiments with genetic k-means variants show that while the K-means
algorithm might converge to a local optimum, both FGKA and GKA always converge
to the global optimum, but FGKA runs almost 20 times faster than GKA. They also
show that the three improvements (efficient calculation of TWCVs, avoidance of the
illegal-string elimination overhead, and the simplification of the mutation operator)
have different improvement impact factors over GKA. More details are available
in [13].

2.9 An IGA for Document Clustering with Semantic Similarity
Measure
This paper proposes a self-organized IGA (improved genetic algorithm) for document
clustering based on a semantic similarity measure. The traditional method of
representing text is to organize the document as a string of words, while conceptual
similarity is ignored. The authors take advantage of a thesaurus-based ontology to
overcome this problem. To investigate how the ontology method can be used
effectively in document clustering, a hybrid strategy is implemented which combines
the thesaurus-based semantic similarity measure and the vector space model (VSM)
measure to provide a more accurate assessment of similarity between documents.
Considering the interplay between the diversity of the population and the selective
pressure, an approach of dynamic evolution operators is put forward in this article. In
the experiments, two data sets of 200 and 600 documents from the Reuter-21578
corpus are excerpted for testing, and the results show that the genetic algorithm in
conjunction with the hybrid semantic strategy, the combination of the thesaurus-based
measure and the VSM-based measure, outperforms the same algorithm with the sole
VSM measure. The clustering algorithm also efficiently enhances precision and recall
in comparison with k-means in the same similarity environments. The common
problem in the field of text clustering is that the document is represented as a bag of
words, while the conceptual similarity between each pair of documents is ignored; a
thesaurus-based ontology is used to overcome this problem. In the experiments, data
set 1 with 200 documents from four topics and data set 2 with 600 documents from
six topics are selected for testing. The results show that the genetic algorithm in
conjunction with the hybrid strategy, the combination of the VSM-based and
thesaurus-based similarity measures, gets the best clustering performance in terms of
precision and recall. Furthermore, the proposed self-organized genetic algorithm,
considering the interplay between the diversity of the population and the selective
pressure, efficiently evolves the clustering of the documents in comparison with the
standard k-means algorithm under the same similarity strategy. As discussed, some
important words which are transformed to incomplete forms after stemming are not
included in the WordNet lexicon and will not be considered as concepts for similarity
evaluation. In the future the algorithm will be refined by using a better parser, for
example Text Analyst, or combined with a corpus-based method to overcome this
problem for clustering [14].
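
To make the hybrid strategy concrete, here is a small illustrative sketch (our own
construction, not the paper's code) that linearly combines a cosine similarity over
term-frequency vectors with a placeholder concept-based similarity; the weight alpha
and the concept_sim function are assumptions made purely for illustration.
*******************************************************************
import numpy as np

def cosine_sim(u, v):
    """VSM similarity: cosine of the angle between two term-frequency vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def concept_sim(doc_a, doc_b):
    """Placeholder for a thesaurus-based (e.g. WordNet-derived) similarity.

    A real implementation would score shared or related concepts; here we
    use plain word overlap purely so the sketch runs end to end.
    """
    a, b = set(doc_a.split()), set(doc_b.split())
    return len(a & b) / max(len(a | b), 1)

def hybrid_sim(u, v, doc_a, doc_b, alpha=0.6):
    """Weighted combination of the VSM measure and the semantic measure."""
    return alpha * cosine_sim(u, v) + (1 - alpha) * concept_sim(doc_a, doc_b)
*******************************************************************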

2.10 Web Clustering Based on GA with Latent Semantic Indexing
Technology
This paper constructs a latent semantic text model using a genetic algorithm (GA) for
web clustering. The main difficulty in the application of GA to text clustering is the
thousands or even tens of thousands of dimensions in the feature space. Latent
semantic indexing (LSI) is a successful technology which attempts to explore the
latent semantic structure in textual data; furthermore, it reduces this large space to a
smaller one and provides a robust space for clustering. GA belongs to the search
techniques that efficiently evolve the optimal solution for a problem. Evolved in the
reduced latent semantic indexing model, GA can improve clustering accuracy and
speed, which makes it suitable for real-time clustering. The SSTRESS criterion is
used to analyze the dissimilarity between the original term-by-document corpus
matrix and the approximate decomposition matrices of different ranks, corresponding
to the performance of the algorithm evolved in the reduced space. The superiority of
GA applied in the LSI model over K-means and a conventional GA in the vector
space model (VSM) is demonstrated by good clustering results on Reuter text. The
analysis of LSI shows that not only does it provide an underlying semantic structure
for the text model, but it also reduces the dimension drastically, which is very suitable
for GA to evolve the optimal clusters. The algorithm is applied to the Reuter-21578
text collection and demonstrates effectiveness superior to that of GA in VSM and the
traditional K-means algorithm. 200 documents from 4 topics are chosen for
simulating real-time web clustering. When the dimensions are reduced to 100, GA
with 900 terms in the LSI model obtains its best performance, with a computational
time of 11.3 seconds. The less important dimensions, corresponding to noise, are
thereby ignored: a reduced-rank approximation matrix to the original matrix is
constructed by dropping these noisy dimensions. Furthermore, the experimental
results verify that the number of terms is successfully reduced. In the future, the
algorithm will be refined by decreasing the computational time of clustering [15].
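
The rank reduction at the heart of LSI can be sketched with a truncated singular value
decomposition; the snippet below (an illustration using NumPy, not the paper's code)
drops all but the top r singular values of a term-by-document matrix.
*******************************************************************
import numpy as np

def lsi_reduce(A, r):
    """Rank-r LSI approximation of a term-by-document matrix A.

    Returns the reduced document representations (r x n_docs) and the
    rank-r approximation of A obtained by dropping the noisy dimensions.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # reduced-rank approximation
    docs_r = np.diag(s[:r]) @ Vt[:r, :]           # documents in the latent space
    return docs_r, A_r

# Toy 6-term x 5-document count matrix; reduce to a rank-2 latent space.
A = np.array([[2, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 2, 1, 0, 1],
              [0, 0, 0, 3, 1],
              [0, 0, 1, 1, 2],
              [1, 0, 0, 2, 0]], dtype=float)
docs_2d, A_2 = lsi_reduce(A, r=2)
*******************************************************************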

2.11 Hierarchical Clustering for Adaptive Refactoring Identification
This paper studies an adaptive refactoring problem. It is well known that improving a
software system's design through refactoring is one of the most important issues
during the evolution of object-oriented software systems. The focus is on identifying
the refactorings needed to improve the class structure of software systems, in an
adaptive manner, when new application classes are added to the system. An adaptive
clustering method is proposed, based on a hierarchical agglomerative approach, that
adjusts the structure of the system that was established by applying a hierarchical
agglomerative clustering algorithm before the set of application classes changed. The
adaptive method identifies, more efficiently, the refactorings that would improve the
structure of the extended software system, without decreasing the accuracy of the
obtained results. An experiment testing the method's efficiency is also reported. The
paper proposes a new method (HAR) for adapting a restructuring scheme of a
software system when new application classes are added to the system. The reported
experiment shows that the result is reached more efficiently using the HAR method
than by running HAC again from scratch on the extended software system. Further
work will be done in the following directions: to isolate conditions for deciding when
it is more effective to adapt the partitioning of the extended software system (using
HAR) than to recalculate it from scratch using the HAC algorithm; to apply the
adaptive algorithm HAR to open-source case studies and real software systems; and
to identify adaptive extensions of other existing automatic methods for refactoring
identification [16].

2.12 Summarization of Text Clustering Based Vector Space Model
Text clustering is an important task of natural language processing and is widely
applicable in areas such as information retrieval and web mining. The representation
of documents and the clustering algorithm are the key issues of text clustering. This
paper reviews Vector Space Model (VSM)-based text clustering algorithms. Text
clustering faces three issues: sparse high dimensionality, multi-word synonymy, and
polysemy. These lead the clustering to very high time complexity, greatly interfere
with the accuracy of the clustering algorithm, and cause a sharp decline in clustering
performance. This is the main difficulty of the technique. The paper describes several
clustering algorithms which are widely used in document clustering based on the
VSM model [17].








2.13 Initializing K-Means using Genetic Algorithms
K-Means (KM) is considered one of the major algorithms widely used in clustering.
However, it still has some problems, one of which is its initialization step, which is
normally done randomly. Another problem is that KM converges to local minima.
Genetic algorithms are evolutionary algorithms inspired by nature and utilized in the
field of clustering. This paper proposes two algorithms to solve the initialization
problem: Genetic Algorithm Initialized KM (GAIK) and KM Initialized Genetic
Algorithm (KIGA). To show the effectiveness and efficiency of the algorithms, a
comparative study was done among GAIK, KIGA, a Genetic-based Clustering
Algorithm (GCA), and FCM. An experimental evaluation scheme was used to provide
a common base of performance assessment and comparison with other methods. The
experiments on the eight data sets show that pre-initialized algorithms work well and
yield meaningful and useful results in terms of finding good clustering configurations
which contain interdependence information within clusters and discriminative
information for clustering. In addition, the approach is more meaningful in selecting,
from each cluster, significant centers with high interdependence with other points
within the cluster. Finally, comparing the experimental results of K-means, GKA,
GAIK, and KIGA shows that KIGA is better than the others. As shown by the results
on all datasets, KIGA achieves high clustering accuracy compared to the other
algorithms [18].

CHAPTER 3
THEORETICAL ASPECTS

3.1 K-mean algorithm
In statistics and data mining, k-means clustering is a method of cluster analysis which
aims to partition n observations into k clusters in which each observation belongs to
the cluster with the nearest mean. This results in a partitioning of the data space into
Voronoi cells. The problem is computationally difficult (NP-hard); however, there are
efficient heuristic algorithms that are commonly employed and that converge quickly
to a local optimum. These are usually similar to the expectation-maximization
algorithm for mixtures of Gaussian distributions in the iterative refinement approach
employed by both algorithms. Additionally, both use cluster centers to model the
data; however, k-means clustering tends to find clusters of comparable spatial extent,
while the expectation-maximization mechanism allows clusters to have different
shapes. Almost all partitional clustering methods are based upon the idea of
optimizing a function F referred to as the clustering criterion which, hopefully,
translates one's intuitive notions of a cluster into a reasonable mathematical formula.
The function value usually depends on the current partition of the database
{C1, ..., CK}. Concretely, the K-Means algorithm finds locally optimal solutions
using as clustering criterion F the sum of the squared L2 distances between each
element and its nearest cluster center (centroid). This criterion is sometimes referred
to as the square-error criterion. Therefore, it follows that

F = \sum_{i=1}^{K} \sum_{j=1}^{K_i} \| w_{ij} - \bar{w}_i \|_2^2

where K is the number of clusters, K_i the number of objects of cluster i, w_{ij} the
j-th object of the i-th cluster, and \bar{w}_i the centroid of the i-th cluster, which is
defined as

\bar{w}_i = \frac{1}{K_i} \sum_{j=1}^{K_i} w_{ij}



As can be seen below where the pseudo-code is presented, the K-Means algorithm is
provided somehow with an initial partition of the database and the centroids of these
initial clusters are calculated. Then, the instances of the database are relocated to the
cluster represented by the nearest centroid in an attempt to reduce the square-error.
This relocation of the instances is done following the instance order. If an instance in
the relocation step (Step 3) changes its cluster membership, then the centroids of the
clusters Cs and Ct and the square-error should be recomputed. This process is
repeated until convergence, that is, until the square-error cannot be further reduced
which means no instance changes its cluster membership.
*******************************************************************

Step 1:- Select in some way an initial partition of the database into K clusters
{C1, ..., CK}
Step 2:- Calculate the cluster centroids \bar{w}_i, i = 1, ..., K
Step 3:- FOR every instance W_i in the database, following the instance order, DO
Step 3.1:- Reassign instance W_i to its closest cluster centroid: W_i is moved from
C_s to C_t if \|W_i - \bar{w}_t\| <= \|W_i - \bar{w}_j\| for all j = 1, ..., K, j != s
Step 3.2:- Recalculate the centroids of clusters C_s and C_t
Step 4:- IF cluster membership is stabilized THEN stop ELSE go to Step 3
*******************************************************************
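
A minimal runnable Python translation of this pseudo-code is sketched below; it
assumes an initial partition given as a label array with every cluster non-empty,
relocates instances in order, and recomputes the affected centroids, as in Steps 3.1
and 3.2. The names are our own.
*******************************************************************
import numpy as np

def kmeans_relocation(X, labels, max_passes=100):
    """Relocation K-means following the pseudo-code: X is (n, d), labels is an
    initial assignment of each instance to one of K clusters (Step 1)."""
    labels = np.asarray(labels)
    K = labels.max() + 1
    centroids = np.array([X[labels == i].mean(axis=0) for i in range(K)])  # Step 2
    for _ in range(max_passes):
        changed = False
        for i in range(len(X)):                                    # Step 3
            s = labels[i]
            t = np.linalg.norm(centroids - X[i], axis=1).argmin()  # Step 3.1
            if t != s:
                labels[i] = t
                changed = True
                for c in (s, t):                                   # Step 3.2
                    if np.any(labels == c):
                        centroids[c] = X[labels == c].mean(axis=0)
        if not changed:                                            # Step 4
            break
    return labels, centroids
*******************************************************************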

3.1.1 Drawbacks of the K-Means algorithm
Despite being used in a wide array of applications, the K-Means algorithm is not
exempt from drawbacks. Some of these drawbacks have been extensively reported in
the literature. The most important ones are listed below:
Like many clustering methods, the K-Means algorithm assumes that the number
of clusters K in the database is known beforehand which, obviously, is not
necessarily true in real-world applications;
As an iterative technique, the K-Means algorithm is especially sensitive to
initial starting conditions (initial clusters and instance order);
The K-Means algorithm converges finitely to a local minimum. The running
of the algorithm defines a deterministic mapping from the initial solution to
the final one.
To overcome the lack of knowledge of the real value of the input parameter K for a
database, we adopt a rough but usual approach: to try clustering with several values
of K. The problem of initial starting conditions is not exclusive to the K-Means
algorithm but is shared with many clustering algorithms that work as a hill-climbing
strategy whose deterministic behavior leads to a local minimum dependent on the
initial solution and on the instance order. Although there is no guarantee of achieving
a global minimum, at least the convergence of the K-Means algorithm is ensured.
Milligan shows the strong dependence of the K-Means algorithm on the initial
clustering and suggests that good final cluster structures can be obtained using
Ward's hierarchical method to provide the K-Means algorithm with initial clusters.
Fisher proposes creating the initial clusters by constructing an initial hierarchical
clustering. Another author suggests using a MaxMin algorithm to select a subset of
the original database as the initial centroids to establish the initial clusters. Others
present experimental results for an instance of the EM algorithm reminiscent of
K-Means with three different initialization methods (one of them a hierarchical
agglomerative clustering method). Most of the initialization methods mentioned
above do not constitute only initialization methods. They are clustering methods
themselves, and when used with the K-Means algorithm they result in a hybrid
clustering algorithm. Thus, these initialization methods suffer from the same problem
as the K-Means algorithm: they have to be provided with an initial clustering. For the
remaining part of this paper, we focus on much simpler and more inexpensive
initialization methods that constitute the first initialization of any other more complex
clustering method. This is the reason that motivates the development of an algorithm
for refining the initial seeds for the K-Means algorithm. To overcome the possible
bad effects of instance order, a procedure to order the instances of the database has
been presented; it shows that ordering instances so that consecutive observations are
dissimilar based on L2 leads to good clusterings. Another author proposes a local
strategy to reduce the effect of the instance-ordering problem. Although they focus on
incremental clustering procedures, their strategy is not coupled to any particular
procedure and may be adapted to the K-Means algorithm [19].


3.2 Genetic algorithms
As a part of our main objective, we aim to find the best and the worst set of initial
starting conditions to approach the extremes of the probability distributions of the
square-error values. Due to the computational expense of performing an exhaustive
search, we tackle the problem using genetic algorithms. Roughly speaking, we can say that Genetic Algorithms (GAs) are a kind of evolutionary algorithm, that is, probabilistic search algorithms which simulate natural evolution. GAs are used to solve combinatorial optimization problems following the rules of natural selection and natural genetics. They are based upon the survival of the fittest among string structures together with a structured yet randomized information exchange. Working in this way and under certain conditions, GAs evolve toward the global optimum with probability arbitrarily close to 1. When dealing with GAs, the search space of a problem is represented as a collection of individuals. The individuals are represented by character strings, and each individual encodes a solution to the problem. In addition, each individual has an associated fitness measure. The part of the space to be examined is called the population. The purpose of using a GA is to find the individual from the search space with the best ``genetic material''. The pseudo-code of the GA that we use is shown below. First, the initial population is chosen and the fitness of each of its individuals is determined. Next, at each iteration, two parents are selected from the population. This parental couple produces children which, with a probability near zero, are mutated, i.e., their hereditary distinctions are changed. After the evaluation of the children, the worst individual of the population is replaced by the fittest of the children. This process is iterated until a convergence criterion is satisfied. The operators which define the children-production process and the mutation process are the crossover operator and the mutation operator, respectively. Both operators are applied with different probabilities and play different roles in the GA. Mutation is needed to explore new areas of the search space and helps the algorithm avoid local optima. Crossover aims to increase the average quality of the population. By choosing adequate crossover and mutation operators as well as an appropriate reduction mechanism, the probability that the GA reaches a near-optimal solution in a reasonable number of iterations increases [20].



*********************************************************************
BEGIN GA
  Choose an initial population at random
  Evaluate the initial population
  WHILE NOT convergence criterion DO
  BEGIN
    Select two parents from the current population
    Produce children from the selected parents
    Mutate the children
    Evaluate the children
    Replace the worst individual of the population by the best child
  END
  Output the best individual found
END GA
*******************************************************************
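As an illustration, the loop above can be sketched in Python as follows, under simplifying assumptions of this sketch: fitness values are strictly positive, a fixed iteration budget stands in for the convergence criterion, and the crossover and mutate operators are supplied by the caller.
*******************************************************************
import random

def steady_state_ga(fitness, random_individual, crossover, mutate,
                    pop_size=50, iterations=500):
    """Steady-state GA: each iteration two parents produce children and
    the best child replaces the worst individual of the population."""
    population = [random_individual() for _ in range(pop_size)]
    scores = [fitness(ind) for ind in population]
    for _ in range(iterations):
        # fitness-proportional choice of the two parents
        p1, p2 = random.choices(population, weights=scores, k=2)
        children = [mutate(c) for c in crossover(p1, p2)]
        child_scores = [fitness(c) for c in children]
        best = max(range(len(children)), key=lambda i: child_scores[i])
        worst = min(range(pop_size), key=lambda i: scores[i])
        # replace the worst individual by the fittest of the children
        if child_scores[best] > scores[worst]:
            population[worst], scores[worst] = children[best], child_scores[best]
    return population[max(range(pop_size), key=lambda i: scores[i])]
*******************************************************************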
Genetic algorithm (GA) is a class of optimization procedures inspired by the biological mechanism of reproduction. In the past, it has been used to solve various problems including target recognition, object recognition, face recognition, and face detection/verification [21]. A GA operates iteratively on a population of structures, each one of which represents a candidate solution to the problem at hand, properly encoded as a string of symbols. A randomly generated set of such strings forms the initial population from which the GA starts its search. Three basic genetic operators guide this search: selection, crossover, and mutation. The genetic search process is iterative: strings in the population are evaluated, selected, and recombined during each iteration (generation) until some termination condition is reached. Evaluation of each string is based on a fitness function that is problem-dependent: it determines which of the candidate solutions are better. This corresponds to the environmental determination of survivability in natural selection. Selection of a string, which represents a point in the search space, depends on the string's fitness relative to those of the other strings in the population. It probabilistically removes from the population those points that have relatively low fitness. Mutation, as in natural systems, is a very low-probability operator that just flips a specific bit; it plays the role of restoring lost genetic material. In contrast, crossover is applied with high probability. It is a randomized yet structured operator that allows information exchange between points; its goal is to preserve the fittest individuals without introducing any new value. The goal of feature subset selection is to use fewer features to achieve the same or better performance. Therefore, the fitness evaluation contains two terms: (1) the classification error and (2) the number of features selected. We used the fitness function shown below to combine the two terms:
Fitness = Error + J · Ones        (1)
where Error corresponds to the classification error rate and Ones corresponds to the number of features selected (i.e., the number of 1s in the chromosome). The Ones term ranges from 1 to L, where L is the length of the chromosome (for the second dataset, L = 21). Finding the best balance between the number of features and the classification error rate is an important issue. According to equation (1), the lower the error rate, the better the fitness; also, the fewer the number of features, the better the fitness. In this study, we prefer to achieve the best accuracy rate with the fewest number of features; therefore, the first and the second term should be on the same scale.
In the figure below, the GA is used to find an optimal binary vector, where each bit is associated with a feature. If the i-th bit of this vector is equal to 1, then the i-th feature is allowed to participate in classification; if the bit is equal to 0, the corresponding feature does not participate.

[Figure: a d-dimensional binary vector, e.g. (0 1 0 ... 1 1), comprising a single member of the GA population for GA-based feature selection; here feature 2 is included in the classifier while feature 1 is not.]
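As a sketch of how equation (1) can be evaluated for one chromosome in Python: the weighting coefficient lam and the error rate passed in are placeholders of this illustration (the error rate would come from training and testing a classifier on the selected columns), not values used in the experiments.
*******************************************************************
import numpy as np

def subset_fitness(chromosome, error_rate, lam=0.01):
    """Equation (1): classification error plus a penalty on the number
    of selected features; lower values are better here."""
    ones = int(np.sum(chromosome))   # number of 1-bits = features kept
    return error_rate + lam * ones

# a chromosome is a binary mask over the d = 21 features
rng = np.random.default_rng(0)
X = rng.random((100, 21))                    # toy data: 100 samples, L = 21
chromosome = rng.integers(0, 2, size=21)
X_selected = X[:, chromosome.astype(bool)]   # only the 1-bits participate
# error_rate would come from a classifier evaluated on X_selected
print(subset_fitness(chromosome, error_rate=0.12))
*******************************************************************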



3.3 Vector space model
The vector-space models for information retrieval are just one subclass of retrieval
techniques that have been studied in recent years. The taxonomy provided in the literature labels the class of techniques that resemble vector-space models ``formal, feature-based, individual, partial match'' retrieval techniques, since they typically rely on an
underlying, formal mathematical model for retrieval, model the documents as sets of
terms that can be individually weighted and manipulated, perform queries by
comparing the representation of the query to the representation of each document in
the space, and can retrieve documents that don't necessarily contain one of the search
terms. Although the vector-space techniques share common characteristics with other
techniques in the information retrieval hierarchy, they all share a core set of
similarities that justify their own class. Vector-space models rely on the premise that
the meaning of a document can be derived from the document's constituent terms.
They represent documents as vectors of terms d = (t_1, t_2, ..., t_n), where t_i (1 ≤ i ≤ n) is a non-negative value denoting the single or multiple occurrences of term i in document d. Thus, each unique term in the document collection corresponds to a dimension in the space. Similarly, a query is represented as a vector q = (t_1, t_2, ..., t_m), where t_j (1 ≤ j ≤ m) is a non-negative value denoting the number of occurrences of term j (or merely a 1 to signify the occurrence of term j) in the query [22]. Both the document vectors and
the query vector provide the locations of the objects in the term-document space. By
computing the distance between the query and other objects in the space, objects with
similar semantic content to the query presumably will be retrieved. Vector-space
models that don't attempt to collapse the dimensions of the space treat each term
independently, essentially mimicking an inverted index. However, vector-space
models are more flexible than inverted indices since each term can be individually
weighted, allowing that term to become more or less important within a document or
the entire document collection as a whole. Also, by applying different similarity
measures to compare queries to terms and documents, properties of the document
collection can be emphasized or deemphasized. For example, the dot product (or inner product) similarity measure is closely tied to the Euclidean distance between the query and a term or document in the space. The cosine similarity measure, on the other hand, by
computing the angle between the query and a term or document rather than the
distance, deemphasizes the lengths of the vectors. In some cases, the directions of the

vectors are a more reliable indication of the semantic similarities of the objects than
the distance between the objects in the term-document space. Vector-space models
were developed to eliminate many of the problems associated with exact, lexical
matching techniques. In particular, since words often have multiple meanings
(polysemy), it is difficult for a lexical matching technique to differentiate between
two documents that share a given word, but use it differently, without understanding
the context in which the word was used. Also, since there are many ways to describe a
given concept (synonymy), related documents may not use the same terminology to
describe their shared concepts. A query using the terminology of one document will
not retrieve the other related documents. In the worst case, a query using terminology
different than that used by related documents in the collection may not retrieve any
documents using lexical matching, even though the collection contains related
documents. Vector-space models, by placing terms, documents, and queries in a term-
document space and computing similarities between the queries and the terms or
documents, allow the results of a query to be ranked according to the similarity
measure used. Unlike lexical matching techniques that provide no ranking or a very
crude ranking scheme (for example, ranking one document before another document
because it contains more occurrences of the search terms), the vector-space models,
by basing their rankings on the Euclidean distance or the angle measure between the
query and terms or documents in the space, are able to automatically guide the user to
documents that might be more conceptually similar and of greater use than other
documents. Also, by representing terms and documents in the same space, vector-
space models often provide an elegant method of implementing relevance feedback.
Relevance feedback, by allowing documents as well as terms to form the query, and
using the terms in those documents to supplement the query, increases the length and
precision of the query, helping the user to more accurately specify what he or she
desires from the search. Information retrieval models typically express the retrieval
performance of the system in terms of two quantities: precision and recall. Precision is
the ratio of the number of relevant documents retrieved by the system to the total
number of documents retrieved. Recall is the ratio of the number of relevant
documents retrieved for a query to the number of documents relevant to that query in
the entire document collection. Both precision and recall are expressed as values
between 0 and 1. An optimal retrieval system would provide precision and recall

values of 1, although precision tends to decrease with greater recall in real-world
systems.
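In symbols, if A denotes the set of documents retrieved for a query and R the set of documents relevant to it, the two measures read

\mathrm{Precision} = \frac{|R \cap A|}{|A|}, \qquad \mathrm{Recall} = \frac{|R \cap A|}{|R|}

For instance, if a query retrieves 10 documents of which 4 are relevant, and the collection contains 8 relevant documents in total, precision is 0.4 and recall is 0.5.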
The VSM has been a standard model of representing documents in information retrieval for almost three decades. Let D be a document collection and Q the set of queries representing users' information needs. Let also t_i symbolize term i used to index the documents in the collection, with i = 1, ..., n. The VSM assumes that for each term t_i there exists a vector \vec{t}_i in the vector space that represents it. It then considers the set of all term vectors {\vec{t}_i} to be the generating set of the vector space, thus the space basis. If each d_k (for k = 1, ..., p) denotes a document of the collection, then there exists a linear combination of the term vectors {\vec{t}_i} which represents each d_k in the vector space. Similarly, any query q can be modeled as a vector \vec{q} that is a linear combination of the term vectors. In the standard VSM, the term vectors are considered pairwise orthogonal, meaning that they are linearly independent. But this assumption is unrealistic, since it enforces lack of relatedness between any pair of terms, whereas the terms in a language often relate to each other. Provided that the orthogonality assumption holds, the similarity between a document vector \vec{d}_k and a query vector \vec{q} in the VSM can be expressed by the cosine measure given below:

\cos(\vec{d}_k, \vec{q}) = \frac{\vec{d}_k \cdot \vec{q}}{\lVert \vec{d}_k \rVert \, \lVert \vec{q} \rVert}
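A small Python sketch of the cosine measure over a toy term-document space (the vocabulary and the counts here are invented for illustration):
*******************************************************************
import numpy as np

def cosine_similarity(d, q):
    """Cosine of the angle between document and query vectors;
    only direction matters, not vector length."""
    denom = np.linalg.norm(d) * np.linalg.norm(q)
    return float(d @ q) / denom if denom else 0.0

# rows are documents, columns are term counts for vocabulary (t1, t2, t3)
docs = np.array([[2.0, 1.0, 0.0],    # d1
                 [0.0, 1.0, 3.0]])   # d2
query = np.array([1.0, 1.0, 0.0])
scores = [cosine_similarity(d, query) for d in docs]
ranking = np.argsort(scores)[::-1]   # documents ranked by similarity
*******************************************************************
Here d1 scores about 0.95 against the query and d2 about 0.22, so d1 is ranked first.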



3.4 Thyroid Dataset
In order to perform the research reported in this manuscript, two thyroid disease
datasets are used. These thyroid datasets are taken from UCI machine learning
repository [23]. The first thyroid dataset consists of 215 patients and 5 features. These
features are T3-resin uptake test (A percentage), Total serum thyroxin as measured by
the isotopic displacement method, Total serum triiodothyronine as measured by
radioimmuno assay, Basal thyroid-stimulating hormone (TSH) as measured by
radioimmuno assay, Maximal absolute difference of TSH value after injection of 200
mg of thyrotropin-releasing hormone as compared to the basal value. This dataset consists of 3 classes, which are normal, hyperthyroidism and hypothyroidism. The
second thyroid dataset consists of 7200 patients of 21 features each, 15 binary (from
x2 to x16) and 6 continuous (x1, x17, x18, x19, x20, x21). The training data and the
test data consist of 3772 and 3428 data, respectively. 92% of samples in this dataset
belong to the normal class. The features are age, sex, on thyroxin, maybe on thyroxin, on antithyroid medication, sick (patient reports malaise), pregnant, thyroid surgery, I131
treatment, test hypothyroid, test hyperthyroid, on lithium, has goiter, has tumor,
hypopituitary, psychological symptoms, TSH, T3, TT4, T4U, and FTI.
This directory contains 6 databases, corresponding test sets, and corresponding documentation. They were left at the University of California at Irvine by Ross Quinlan during his visit in 1987 for the 1987 Machine Learning Workshop. The documentation files (with file extension "names") are formatted to be read by Quinlan's C4 decision tree program. Though briefer than the other documentation files found in this database repository, they should suffice to describe the database, specifically:
1. Source
2. Number and names of attributes (including class names)
3. Types of values that each attribute takes
In general, these databases are quite similar and can be characterized somewhat as follows:
1. Many attributes (29 or so, mostly the same set over all the databases)
2. Mostly numeric or Boolean valued attributes
3. Thyroid disease domains (records provided by the Garvan Institute of Sydney, Australia)
4. Several missing attribute values (signified by "?")
5. A small number of classes (under 10, changes with each database)
6. 2800 instances in each data set
7. 972 instances in each test set (it seems that the test sets' instances are disjoint with respect to the corresponding data sets, but this has not been verified)
This database now also contains an additional two data files, named hypothyroid.data and sick-euthyroid.data. They have approximately the same data format and set of attributes as the other 6 databases, but their integrity is questionable. Ross Quinlan is concerned that they may have been corrupted since they first arrived at UCI, but we have not yet established the validity of this possibility. These 2 databases differ in terms of their number of instances (3163) and lack of corresponding test files. They each have 2 concepts (negative/hypothyroid and sick-euthyroid/negative, respectively). Their source also appears to be the Garvan Institute. Each contains several missing values. Another relatively recent file, thyroid0387.data, has been added; it contains the latest version of an archive of thyroid diagnoses obtained from the Garvan Institute, consisting of 9172 records from 1984 to early 1987. A domain theory related to thyroid disease has also been added recently (thyroid.theory). The files new-thyroid.[names, data] were donated by Stefan Aeberhard [23].

3.5 Breast Cancer Dataset
Breast cancer is one of the most common cancers among women worldwide, and its incidence was estimated at about one million new patients annually by the year 2000. There is an overall increase of 2% in the incidence of breast cancer throughout the world per year. Worldwide, it was estimated that 420,000 deaths would occur annually as a result of breast cancer by the year 2000. Although breast cancer is a potentially fatal condition, early diagnosis of the disease can lead to successful treatment. One of the important steps in diagnosing breast cancer is the classification of the tumor. Tumors can be either benign or malignant, but only the latter is cancer; malignant tumors are therefore generally more serious than benign tumors. Early diagnosis needs a precise and reliable diagnosis
procedure that allows physicians to distinguish between benign breast tumors and
malignant ones. For this purpose, there are various computer-based solutions to serve
as the diagnosis procedure and assist the physicians to specify the type of breast mass.
These systems, called Medical Diagnostic Decision Support (MDDS) systems, can augment the natural capabilities of human diagnosticians by incorporating imprecise models of the incompletely understood and exceptionally complex process of medical diagnosis. For evaluating the model, the Wisconsin Diagnostic Breast Cancer
(WDBC) Dataset is used. Each record of this dataset is represented with 30 numerical
features. Features are computed from a digitized image of a fine needle aspirate
(FNA) of a breast mass. They describe characteristics of the cell nuclei present in the
image. The diagnosis of each record is benign or malignant. This dataset contains 569 instances: 357 are benign and 212 malignant. There is no missing value
in the dataset [24].
Table 3.1 Data set table of Breast Cancer

Data Set Characteristics:   Multivariate
Number of Instances:        569
Area:                       Life
Attribute Characteristics:  Real
Number of Attributes:       32
Date Donated:               1995-11-01
Associated Tasks:           Clustering
Missing Values:             No
Number of Web Hits:         119949

3.6 Ecoli Data Set
This section describes how the benchmark, consisting of 43 E. coli sequence datasets each containing a known motif derived from RegulonDB, is created. RegulonDB is a
database on transcription regulation and operon organization in Escherichia coli. We
start with the file 'TF Binding'. For each of 43 distinct transcription factors, we select
the first target gene of each transcription unit regulated by this transcription factor,
and for each target gene, we select the intergenic region 250 nucleotides upstream and
50 nucleotides downstream of the translation start site (as the transcription start site is
often unknown). This gives for each transcription factor a file in FASTA format with
a list of DNA sequences (separated by '>') for each target gene. Based on the genome
coordinates for all transcription factor binding sites described in 'TF Binding Sites',
we describe the motif model that corresponds to the transcription factor in each
dataset by the relative start and end position, strand and nucleotide description of the
sites in the created sequences dataset. Secondly, we use the positional nucleotide
counts from the file 'Matrices_Alignments' to create a PWM representation of the
motif model [25].

Table 3.2 Data set table of Ecoli

Data Set Characteristics:   Multivariate
Number of Instances:        336
Area:                       Life
Attribute Characteristics:  Real
Number of Attributes:       8
Date Donated:               1996-09-01
Associated Tasks:           Clustering
Missing Values:             No
Number of Web Hits:         31008

CHAPTER 4
PROPOSED WORK

This work evaluates the performance of K-Means combined with the VSM and a genetic algorithm (KVG) on the breast cancer, thyroid and E. coli datasets. The basic idea of selecting initial cluster centers using a genetic algorithm is as follows. In the proposed algorithm, we first use a random function to select K data objects as initial cluster centers to form a chromosome, with a total of M chromosomes selected; we then run K-Means on each group of cluster centers in the initial population to compute fitness, select individuals according to the fitness of each chromosome, apply the crossover and mutation operations to high-fitness chromosomes while eliminating low-fitness ones, and finally form the next generation of the group. In this way, within each new generation of the group the average fitness rises and each cluster center moves closer to the optimal cluster center; finally, the chromosome with the highest fitness is selected as providing the initial cluster centers. The algorithm is as follows:
*********************************************************************
1. Choose the number of clusters K
2. Initialize the cluster centers z_1, ..., z_K (selected by some mode, e.g. at random)
3. For each data point, compute the cluster center it is closest to (using some distance measure) and assign the data point to this cluster
4. Re-compute the cluster centers (mean of the data points in each cluster)
5. Stop when there are no new re-assignments
6. GA-based refinement:
   a) Construct the initial population (p1)
   b) Calculate the global minimum (Gmin)
   c) For i = 1 to N do
      i.   Perform reproduction
      ii.  Apply the crossover operator between each pair of parents
      iii. Perform mutation and get the new population (p2)
      iv.  Calculate the local minimum (Lmin)
      v.   If Gmin < Lmin then
           a. Gmin = Lmin
           b. p1 = p2
   d) Repeat

*******************************************************************
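The flow above can be sketched in Python as follows. This is a compressed illustration rather than the exact implementation evaluated in Chapter 6: fitness is taken as 1/J with J computed directly from the candidate centers (instead of running a full K-Means pass per chromosome), and the duplicate-gene repair described below is omitted.
*******************************************************************
import numpy as np

def square_error(X, centers):
    """Objective J: squared distance of each point to its nearest center."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

def ga_refine_seeds(X, k, pop_size=20, generations=50, pm=0.1, seed=0):
    """Search over index-sets of K initial centers, fitness F = 1/J."""
    rng = np.random.default_rng(seed)
    n = len(X)
    pop = [rng.choice(n, size=k, replace=False) for _ in range(pop_size)]
    fit = lambda idx: 1.0 / (square_error(X, X[idx]) + 1e-12)
    for _ in range(generations):
        scores = np.array([fit(c) for c in pop])
        probs = scores / scores.sum()
        new_pop = [pop[int(np.argmax(scores))]]        # optimal preservation
        while len(new_pop) < pop_size:
            i, j = rng.choice(pop_size, size=2, p=probs, replace=False)
            cut = int(rng.integers(1, k))              # single-point crossover (k >= 2)
            child = np.concatenate([pop[i][:cut], pop[j][cut:]])
            if rng.random() < pm:                      # uniform mutation
                child[rng.integers(k)] = rng.integers(n)
            new_pop.append(child)                      # duplicates not repaired here
        pop = new_pop
    best = max(pop, key=fit)
    return X[best]   # refined initial centers, to seed ordinary K-Means
*******************************************************************
The returned centers would then seed the ordinary K-Means iteration of steps 1-5.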

A. Chromosome Coding. We use real coding; the length of the chromosome corresponds to the number of clusters, and the specific code form is X = (z_1, z_2, ..., z_K), where K is the number of cluster centers of a chromosome.
B. Population Initialization. The range of M is 20-100. The specific operation is as follows: select K cluster centers randomly to form a chromosome; if a randomly selected center already exists in the same chromosome, remove it and reselect until the chromosome reaches K centers; repeat until the population reaches size M = 100. For the fitness function we use the inverse of the objective function J, that is, F = 1/J: the smaller J is, the greater the fitness becomes, and the better the clustering effect is.
C. Genetic Operation. We use a proportional selection operator, a single-point crossover operator and a uniform mutation operator. To avoid the premature or slow convergence caused by fixed probabilities, we use self-adaptive genetic operators, that is, we dynamically adjust the crossover rate and the mutation rate. P_c and P_m are calculated as follows:

P_c = \begin{cases} P_{c1} - \dfrac{(P_{c1} - P_{c2})(f' - f_{avg})}{f_{max} - f_{avg}}, & f' \ge f_{avg} \\ P_{c1}, & f' < f_{avg} \end{cases}

P_m = \begin{cases} P_{m1} - \dfrac{(P_{m1} - P_{m2})(f - f_{avg})}{f_{max} - f_{avg}}, & f \ge f_{avg} \\ P_{m1}, & f < f_{avg} \end{cases}
Among them, f_{avg} denotes the average fitness value of each generation's group; f_{max} denotes the largest individual fitness value in the group; f' denotes the larger fitness value of the two crossing individuals; and f indicates the fitness value of the mutating individual. These formulas make individuals with high fitness have a lower crossover rate and mutation rate, while individuals with small fitness have a higher crossover rate and mutation rate. This helps protect the best individuals, while making individuals with lower fitness cross and mutate at a higher rate, producing excellent new candidates.
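A small Python sketch of this self-adaptive scheme; the parameter defaults follow the experiment settings reported in Chapter 6, while the epsilon guard against f_max = f_avg is an addition of this illustration:
*******************************************************************
def adaptive_rates(f_avg, f_max, f_cross, f_mut,
                   pc1=0.9, pc2=0.6, pm1=0.5, pm2=0.1, eps=1e-12):
    """Crossover/mutation probabilities that fall for fit individuals
    (protecting good solutions) and stay high for weak ones."""
    span = max(f_max - f_avg, eps)   # guard against f_max == f_avg
    pc = pc1 - (pc1 - pc2) * (f_cross - f_avg) / span if f_cross >= f_avg else pc1
    pm = pm1 - (pm1 - pm2) * (f_mut - f_avg) / span if f_mut >= f_avg else pm1
    return pc, pm
*******************************************************************
For example, the fittest crossing pair (f' = f_max) gets the low rate pc2 = 0.6, while any pair below average fitness gets the full pc1 = 0.9.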

D. Loop Termination Condition. We use a termination generation number T as the stopping condition of the genetic algorithm: the algorithm stops running after it has run for the specified number of generations and outputs the best individual in the current group as the optimal solution of the problem. T ranges from 100 to 1000.
E. Description of the Specific Algorithm
a. Set the parameters: population size M, the maximum number of iterations T, the number of clusters K, etc.
b. Generate M chromosomes randomly to form the initial population; each chromosome represents a set of initial cluster centers.
c. According to the initial cluster centers encoded by every chromosome, carry out K-Means clustering (each chromosome corresponds to one K-Means run), then calculate each chromosome's fitness from the clustering result, and implement the optimal-preservation strategy.
d. For the group, carry out the selection, crossover and mutation operators to produce a new generation of the group.
e. Determine whether the genetic termination condition is met; if it is met, withdraw from the genetic operation and go to step f, otherwise go to step c.
f. Calculate the fitness of the new generation of the group; compare the fitness of the best individual in the current group with the best individual's fitness so far to find the individual with the highest fitness.
g. Carry out K-Means clustering according to the initial cluster centers represented by the chromosome with the highest fitness, and then output the clustering result.



Figure 4.1 Generation of clusters by KVG




4.1 Clustering using GA
Clustering using a GA defines a clustering metric as below:

M = \sum_{i=1}^{K} \sum_{x_j \in C_i} \lVert x_j - z_i \rVert

The duty of the GA is to find proper cluster centers z_1, z_2, ..., z_K in such a way that the clustering metric M is minimized.

4.2 String representation
Each string is a collection of real numbers representing the K cluster centers. In an N-dimensional space, the length of each chromosome is therefore N*K.

4.2.1 Population initialization
K cluster centers are randomly selected from the available points and inserted into one chromosome. This is repeated for each of the P chromosomes forming the population, where P is the population size.
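As a sketch, a chromosome can be stored as a flat vector of the K centers and the population as P such vectors (the names and sizes here are illustrative):
*******************************************************************
import numpy as np

def random_chromosome(X, k, rng):
    """Real-coded chromosome: K data points chosen as centers,
    flattened into a single vector of length N*K."""
    idx = rng.choice(len(X), size=k, replace=False)   # K distinct points
    return X[idx].ravel()

def decode(chromosome, n_dims):
    """Recover the K centers (rows) from the flat chromosome."""
    return chromosome.reshape(-1, n_dims)

rng = np.random.default_rng(0)
X = rng.random((150, 4))                        # e.g. 150 points in 4-D
population = [random_chromosome(X, 3, rng) for _ in range(20)]  # P = 20
centers = decode(population[0], n_dims=4)       # shape (3, 4)
*******************************************************************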



4.3 Fitness computation

Fitness computation consists of two phases. In the first phase, the clusters are formed from the centers encoded in the chromosome: each point x_j is assigned to the cluster C_p with center z_p if

\lVert x_j - z_p \rVert \le \lVert x_j - z_q \rVert, \quad q = 1, 2, ..., K, \; q \ne p.

After the clustering is done, the cluster centers encoded in the chromosome are replaced by the mean points of the respective clusters: for cluster C_i, the new center is

z_i^{*} = \frac{1}{n_i} \sum_{x_j \in C_i} x_j,

where n_i is the number of points in cluster C_i. The clustering metric M is then computed as

M = \sum_{i=1}^{K} \sum_{x_j \in C_i} \lVert x_j - z_i^{*} \rVert,

and the fitness is defined as F = 1/M, so that maximizing the fitness minimizes the clustering metric.

4.4 Reproduction (Selection)
The selection process selects chromosomes from the mating pool directed by the survival-of-the-fittest concept of natural genetic systems. In the proportional selection strategy adopted in this work, a chromosome is assigned a number of copies proportional to its fitness in the population, and these copies go into the mating pool for further genetic operations. Roulette wheel selection is one common technique that implements the proportional selection strategy. These processes ultimately result in a next-generation population of chromosomes that is different from the initial generation. Generally, the average fitness of the population will have increased by this procedure, since only the best organisms from the first generation are selected for breeding, along with a small proportion of less fit solutions, for the reasons already mentioned above.
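Roulette wheel selection can be sketched in Python as follows, assuming strictly positive fitness values:
*******************************************************************
import random

def roulette_select(population, fitnesses):
    """Pick one chromosome with probability proportional to its fitness."""
    pick = random.uniform(0.0, sum(fitnesses))
    running = 0.0
    for chromosome, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return chromosome
    return population[-1]   # guard against floating-point round-off
*******************************************************************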

4.5 Crossover
Crossover is a probabilistic process that exchanges information between two parent
chromosomes for generating two child chromosomes. In this dissertation, single-point crossover with a fixed crossover probability of pc = 0.8 is used. For chromosomes of length l, a random integer, called the crossover point, is generated in the range [1, l-1]. The portions of the chromosomes lying to the right of the crossover point are exchanged to produce two offspring.





4.6 Mutation
Each chromosome undergoes mutation with a fixed probability pm = 0.6. For a binary representation of chromosomes, a bit position (or gene) is mutated by simply flipping its value. Since we are considering real numbers in this work, a random position is chosen in the chromosome and replaced by a random number between 0 and 9. After the genetic operators are applied, the local minimum fitness value is calculated and compared with the global minimum. If the local minimum is less than the global minimum, then the global minimum is assigned the local minimum and the next iteration continues with the new population; the cluster points are repositioned according to the chromosome holding the global minimum. Otherwise, the next iteration continues with the same old population. This process is repeated for N iterations. The results reported later show that this refinement algorithm improves the cluster quality.
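The two operators of Sections 4.5 and 4.6 can be sketched in Python as follows; the probabilities pc = 0.8 and pm = 0.6 and the 0-9 replacement range are taken from the text, while representing chromosomes as plain Python lists is an assumption of this illustration:
*******************************************************************
import random

def single_point_crossover(p1, p2, pc=0.8):
    """With probability pc, swap the tails of the two parents after a
    random crossover point drawn from [1, l-1]."""
    if random.random() < pc and len(p1) > 1:
        cut = random.randint(1, len(p1) - 1)
        return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
    return p1[:], p2[:]

def mutate(chromosome, pm=0.6):
    """With probability pm, overwrite one randomly chosen gene with a
    random real number between 0 and 9."""
    chromosome = chromosome[:]
    if random.random() < pm:
        chromosome[random.randrange(len(chromosome))] = random.uniform(0.0, 9.0)
    return chromosome
*******************************************************************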

4.7 Stopping criteria
In this work, the computation of fitness, crossover and mutation is repeated until the maximum number of iterations is reached.

CHAPTER 5
ENVIRONMENT SETUP

5.1 MATLAB Environment
MATLAB is a high-level technical computing language and interactive environment
for algorithm development, data visualization, data analysis, and numeric
computation. Using the MATLAB product, you can solve technical computing
problems faster than with traditional programming languages, such as C, C++, and
FORTRAN.
You can use MATLAB in a wide range of applications, including signal and image
processing, communications, control design, test and measurement, financial
modeling and analysis, and computational biology. Add-on toolboxes (collections of
special-purpose MATLAB functions, available separately) extend the MATLAB
environment to solve particular classes of problems in these application areas.
MATLAB provides a number of features for documenting and sharing your work.
You can integrate your MATLAB code with other languages and applications, and
distribute your MATLAB algorithms and applications. Features include:
High-level language for technical computing
Development environment for managing code, files, and data
Interactive tools for iterative exploration, design, and problem solving
Mathematical functions for linear algebra, statistics, Fourier analysis, filtering,
optimization, and numerical integration
2-D and 3-D graphics functions for visualizing data
Tools for building custom graphical user interfaces
Functions for integrating MATLAB based algorithms with external applications and languages, such as C, C++, Fortran, Java, COM, and Microsoft Excel





5.1.1 Introduction to MATLAB
MATLAB (matrix laboratory) is a numerical computing environment and fourth-
generation programming language. Developed by MathWorks, MATLAB allows
matrix manipulations, plotting of functions and data, implementation of algorithms,
creation of user interfaces, and interfacing with programs written in other languages,
including C, C++, Java, and Fortran. Although MATLAB is intended primarily for
numerical computing, an optional toolbox uses the MuPAD symbolic engine,
allowing access to symbolic computing capabilities. An additional package, Simulink,
adds graphical multi-domain simulation and Model-Based Design for dynamic and
embedded systems. In 2004, MATLAB had around one million users across industry
and academia. MATLAB users come from various backgrounds of engineering,
science, and economics. MATLAB is widely used in academic and research
institutions as well as industrial enterprises.
5.1.2 Requirements of MATLAB




5.1.3 How to Use MATLAB



CHAPTER 6
RESULT ANALYSIS

To verify the effectiveness of selecting initial cluster centers using a genetic algorithm, we compare the K-Means algorithm based on the genetic algorithm with the original K-Means algorithm and two known improved algorithms from the literature (improved algorithm 1 and improved algorithm 2). In order to exclude the impact of isolated points, we use the previously proposed method of taking the average value of the subset of objects closest to the center as the next round's cluster center to improve the K-Means algorithm, and we also apply this method to the original K-Means algorithm, improved algorithm 1 and improved algorithm 2, so that the comparison among them is comprehensive. The experimental data are groups of data from the UCI database, such as the iris data set. Considering that attributes with large values affect the distance between the samples, we added groups of isolated points to the data sets mentioned above. The experiment parameter settings are as follows: k = 0.25; pc1 = 0.9; pc2 = 0.6; pm1 = 0.5; pm2 = 0.1; pc = 0.6; pm = 0.1; M (initial population size) = 50; maxgen (the maximum number of iterations) = 100. The results are shown in Table 6.1, Table 6.2 and Table 6.3, where the initial cluster center values are data object labels. In the experimental results, improvement 1 is not marked because it calculates the mean within each separate section as the initial cluster centers.

As can be seen from the data, the traditional K-Means algorithm is sensitive to the initial cluster centers: different cluster centers give quite different clustering results, the results are sometimes poor, and the algorithm is unstable. The improvement 1 and improvement 2 algorithms select more suitable initial cluster centers by calculation, so they have a better clustering effect and a smaller objective function value. The K-Means algorithm based on the genetic algorithm proposed here finds the optimal objective function value by searching over initial cluster centers; in the three groups of data, its objective function values are smaller than those of the other two algorithms, indicating that the proposed algorithm has a better clustering effect. In all three experiments this algorithm found the optimal objective function value, indicating that the algorithm is relatively stable. It can also be seen from the tables that, when the data have apparent isolated points, the algorithm proposed here significantly reduces the impact of outliers and improves clustering accuracy compared with the other two improved methods; when the data have no apparent isolated points, the proposed algorithm still gives more accurate clustering results than the other two improved methods. The experiments therefore show that the proposed method of selecting initial cluster centers is not affected when the data have apparent isolated points, while the other two improved methods have certain limitations.

6.1 Experiment Design:

6.1.1 Experiment for Thyroid Dataset

Figure 6.1 Result of K-mean algorithm








Figure 6.2 Result of Proposed (KVG) methods

6.1.2 Experimental Graph

Figure 6.3 Parameter graph between k-mean and KVG for thyroid dataset



6.1.3 Experimental Table.
Table 6.1 Base Parameter comparison on thyroid dataset.


S.N  Clustering Algorithm  Threshold  Iteration  Time     Error Rate
1.   K-means Algorithm     0.21       5          2.9328   3.8631
2.   KVG Algorithm         0.21       6          4.92963  1.15938

6.2.1 Experiment for breast cancer dataset

Figure 6.4 Result of K-mean algorithm




Figure 6.5 Result of Proposed (KVG) methods

6.2.2 Experimental Graph


Figure 6.6 Parameter graph between k-mean and KVG for breast cancer dataset



6.2.3 Experimental Table

Table 6.2 Base Parameter comparison on breast cancer dataset

S.N  Clustering Algorithm  Threshold  Iteration  Time     Error Rate
1.   K-means Algorithm     0.51       4          2.58962  4.3169
2.   KVG Algorithm         0.51       5          4.38682  1.34741

6.3.1 Experiment on the E-coli dataset

Figure 6.7 Result of K-mean algorithm

Figure 6.8 Result of Proposed (KVG) methods

6.3.2 Experimental Graph between k-mean and KVG for the E-coli Dataset:



Figure 6.9 Parameter graph between k-mean and KVG for E-coli dataset

6.3.3 Experimental Table

Table 6.3 Base Parameter comparison on E-coli dataset

S.N  Clustering Algorithm  Threshold  Iteration  Time     Error Rate
1.   K-means Algorithm     0.31       9          5.17923  5.16772
2.   KVG Algorithm         0.31       10         7.90002  1.51839

6.4 Performance Measures

Figure 6.10 Comparison of Error rate between k-mean and KVG




Figure 6.11 Comparison of Iteration through cluster between k-mean and KVG

6.5 The result of simulation and comparison of algorithms
Before drawing a conclusion, we give some definitions for the kinds of data collections used. As previously said, the objective was the assessment of the K-Means clustering method and of the proposed GA-clustering approach, and a comparison of their results. For this, we use several collections of data. One example is the Iris data set, whose records fall into 3 classes (labeled 1, 0.5 and 0). Another is the thyroid data set, which consists of 3 classes (normal, hyperthyroidism and hypothyroidism); the second thyroid dataset consists of 7200 patients with 21 features each, 15 binary (x2 to x16) and 6 continuous (x1, x17, x18, x19, x20, x21). Another collection, called Vowel, contains 871 records, each with 3 attributes, to be grouped into 6 classes. A third collection, called Crude Oil, contains 56 records, each with 5 attributes, to be grouped into 3 classes.






GA-Clustering implementation is done with these parameters:
Population size (pop-size): 100
Crossover rate (PC): 0.8
Mutation rate (PM): 0.6
Maximum iteration: 100
K-means clustering is done with these parameters:
Number of clusters K: this differs for each data collection.
Maximum iteration: 100

6.6 Observed Results
The numbers reported in the tables are the clustering metric M; the lower the value, the better the clustering. This conclusion is confirmed repeatedly, with different initial populations, on the different data collections.

6.7 Results obtained from running the two algorithms on the Thyroid Dataset

S.N  Clustering Algorithm  Threshold  Iteration  Time     Error Rate
1.   K-means Algorithm     0.21       5          2.9328   3.8631
2.   KVG Algorithm         0.21       6          4.92963  1.15938



6.8 Results obtained from running the two algorithms on the Breast Cancer Dataset

S.N  Clustering Algorithm  Threshold  Iteration  Time     Error Rate
1.   K-means Algorithm     0.51       4          2.58962  4.3169
2.   KVG Algorithm         0.51       5          4.38682  1.34741

6.9 Results obtained from running the two algorithms on the E-coli Dataset

S.N  Clustering Algorithm  Threshold  Iteration  Time     Error Rate
1.   K-means Algorithm     0.31       9          5.17923  5.16772
2.   KVG Algorithm         0.31       10         7.90002  1.51839

CHAPTER 7
CONCLUSION AND FUTURE WORK

In this dissertation we modified the K-Means algorithm, one of the most popular clustering techniques, by applying an optimization method, the genetic algorithm, to improve the unsupervised clustering procedure. Genetic algorithms are population-based methods that use genetic operators to process the population's chromosomes. In this research, we defined a chromosome-string representation and combined K-Means and the GA. Simulations over different runs show that K-Means clustering based on the genetic algorithm improves the clustering measurement considerably, performing better and more efficiently than pure K-Means.

Future Work:
In this dissertation we have observed that the time complexity of the proposed algorithm is greater than that of the original K-Means algorithm. In future research we want to minimize the execution time of the proposed algorithm. Through the optimization and control of the iterations, the simple K-Means algorithm has been improved, but at the cost of a more complex process; in the future we intend to bind the VSM more tightly to the GA and to simplify some of the complex steps.


REFERENCES
1. Yi Lu, Shiyong Lu, Farshad Fotouhi, "FGKA: A Fast Genetic K-means Clustering Algorithm", IEEE Third International Symposium on Bioinformatics and Bioengineering, ACM, pp. 622-625, 2011.
2. Qinghe Zhang, Xiaoyun Chen, "Agglomerative Hierarchical Clustering based on Affinity Propagation Algorithm", Third International Symposium on Knowledge Acquisition and Modeling, Vol. 22, pp. 56-59, 2010.
3. Venkatesh Katari, Suresh Chandra Satapathy, "Hybridized Improved Genetic Algorithm with Variable Length Chromosome for Image Clustering", IJCSNS International Journal of Computer Science and Network Security, Vol. 7, No. 11, pp. 17-27, 2007.
4. Kumar Dhiraj, Santanu Kumar Rath, "Gene Expression Analysis Using Clustering", International Journal of Computer and Electrical Engineering, Vol. 1, No. 2, pp. 62-66, 2009.
5. Kailash Chander, Dinesh Kumar, Vijay Kumar, "Enhancing Cluster Compactness using Genetic Algorithm Initialized K-means", International Journal of Software Engineering Research & Practices, Vol. 1, Issue 1, pp. 141-145, 2011.
6. O.A. Mohamed Jafar, R. Sivakumar, "Ant-based Clustering Algorithms: A Brief Survey", International Journal of Computer Theory and Engineering, Vol. 2, No. 5, pp. 407-411, 2010.
7. Chengjie Gu, Shunyi Zhang, Kai Liu, He Huang, "Fuzzy Kernel K-Means Clustering Method Based on Immune Genetic Algorithm", Journal of Computational Information Systems, Vol. 3, pp. 56-59, 2011.
8. Wei Song, Soon Cheol Park, "An Improved Genetic Algorithm for Document Clustering with Semantic Similarity Measure", IEEE Fourth International Conference on Natural Computation, Vol. 1, pp. 536-540, 2008.
9. Wei Song, Soon Cheol Park, "Analysis of Web Clustering Based on Genetic Algorithm with Latent Semantic Indexing Technology", IEEE Sixth International Conference on Advanced Language Processing and Web Information Technology, Vol. 1, pp. 81-87, 2007.
10. Istvan Gergely Czibula, Gabriela Czibula, "Hierarchical Clustering for Adaptive Refactoring Identification", IEEE International Conference on Cloud Computing, Vol. 4, pp. 123-126, 2007.
11. Mingzhen Chen, Yu Song, "Summarization of Text Clustering based Vector Space Model", IEEE 10th International Conference, ISSN 1346-8030, Vol. 6, pp. 919-938, 2009.
12. Bashar Al-Shboul, Sung-Hyon Myaeng, "Initializing K-Means using Genetic Algorithms", World Academy of Science, Engineering and Technology, Knowledge and Data Engineering, Vol. 17, No. 10, pp. 1363-1366, 2009.
13. J.M. Pena, J.A. Lozano, P. Larranaga, "An empirical comparison of four initialization methods for the K-Means algorithm", Department of Computer Science and Artificial Intelligence, Intelligent Systems Group, Vol. 3242, pp. 519-547, 1999.
14. Daniel Costa, "An evolutionary tabu search algorithm and the NHL scheduling problem", Vol. 55, pp. 56-59, 1994.
15. J. Grefenstette, "Incorporating Problem Specific Knowledge into Genetic Algorithms", in Genetic Algorithms and Simulated Annealing, ed. Davis, Pitman, London and Morgan Kaufmann Publishers, Inc., 1987, pp. 42-60.
16. N. Belkin, W. Croft, "Retrieval techniques", in M. Williams (ed.), Annual Review of Information Science and Technology (ARIST), Elsevier Science Publishers B.V., Vol. 22, Chapter 4, pp. 109-145, 1987.
17. W. Frakes, R. Baeza-Yates (eds.), Information Retrieval: Data Structures & Algorithms, Prentice Hall, Englewood Cliffs, New Jersey, 1992.
18. http://archive.ics.uci.edu/ml/datasets/Wine
    http://www.mathworks.in/index.html
19. http://homes.esat.kuleuven.be/~bioi_marchal/MotifSuite/benchmarktest.php








PUBLICATION
Amit Dubey, Prof. Anurag Jain and Dr. A.K. Sachan, "A survey: Performance improving of K-mean by Genetic Algorithm", accepted in the International Journal of Computational Intelligence and Information Security (IJCIIS), Australia, Vol. 2, pp. 25-29, 2011.
