
HELSINKI UNIVERSITY OF TECHNOLOGY
Laboratory of Computer and Information Science
T-61.195 Special Assignment 1

1.8.2002

Clustering Algorithms: Basics and Visualization


Jukka Kainulainen 47942F

Clustering Algorithms: Basics and Visualization


Jan Jukka Kainulainen HUT
jkainula@cc.hut.fi

Abstract

This paper discusses clustering algorithms, which are important in many fields of science. It presents the basic concepts and an implementation of a few popular algorithms. The problem of visualizing the result is also discussed, and a simple solution for a dimension-limited case is provided.

1 INTRODUCTION

Clustering is one solution to unsupervised learning, where class labeling information for the data is not available. In clustering, the data is divided into groups (clusters) that seem to make sense. Clustering algorithms are usually fast and quite simple. They need no prior knowledge of the data and form a solution by comparing the given samples to each other and to the clustering criterion. The simplicity of the algorithms is also a drawback: the results may vary greatly with different clustering criteria, so nonsense solutions are also possible. With some algorithms the order in which the original samples are presented can also make a great difference to the result. Despite these drawbacks, clustering is used in many fields of science, including machine vision, the life and medical sciences and information science. One reason for this is that intelligent beings, humans included, are known to use the idea of clustering in many brain functions.

This paper introduces the basics of clustering and the concepts needed to understand and implement the algorithms. A couple of algorithms are implemented and compared. The problem of visualizing the result is also discussed, and one solution is provided in the popular OpenGL framework. The second chapter introduces the basic concepts the reader should know in order to read this paper efficiently. The third chapter introduces the most popular algorithms and discusses their characteristics. The fourth chapter complements the survey of clustering algorithms by introducing some special algorithms. Chapter five then provides an implementation of four algorithms discussed in chapter three, and chapter six provides a way to visualize the result. Finally, chapter seven runs some tests on the implemented algorithms.

2 BASIC CONCEPTS

When classifying different kinds of samples, a way to represent a sample mathematically is needed. From now on we assume that the features are represented in a feature vector. A feature vector is a vector containing the different features of the sample. That is, with l features xi the feature vector is of the form x = [x1, x2, ..., xl]^T, where T denotes transposition and the xi are typically real numbers. The selection of these features is often very hard, because there are usually many candidate features from which the most representative ones should be selected, and the computational complexity of the classification (clustering) algorithm grows with every feature selected. Feature selection and the reduction of the dimensionality of the data are beyond the scope of this document. For additional information about these tasks a good place to start is Theodoridis (1999).

2.1 Linear Independence, Vector Space

A set of n vectors is said to be linearly independent if the equation

k1x1 + k2x2 + ... + knxn = 0    (2.1)

implies ki = 0 for all i. If a nonzero solution can be found, the vectors are said to be linearly dependent. If the vectors are linearly dependent, at least one of them can be expressed as a linear combination of the others. Now, given m vectors xn with l components (as before) we can form the set V of all linear combinations of these vectors. V is called the span of these vectors. The maximum number of linearly independent vectors in V is called the dimension of V. Clearly, if the given m vectors are linearly independent, the dimension is m and the vectors xn are called a basis for V. An n-dimensional vector space Rn is the space of all vectors with n (real) numbers as components. The symbol R itself refers to a single dimension of real numbers. Thus for n = 3 we get R3, for which a basis is, for example, the vectors x1 = [1, 0, 0]; x2 = [0, 1, 0]; x3 = [0, 0, 1], so every vector in R3 can be expressed with these three basis vectors. A more comprehensive examination of vector spaces can be found in almost any engineering mathematics book, for example Kreyszig (1993).

2.2 Data Normalization

The data used is often scaled to be within a certain range. Neural networks, for example, often require this. A classical way to normalize the N available data vectors is with the mean value

\bar{x}_k = \frac{1}{N} \sum_{i=1}^{N} x_{ik}    (2.2)

where k denotes the feature, and the variance

\sigma_k^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ik} - \bar{x}_k)^2    (2.3)

Now, to gain zero mean and unit variance, the features are rescaled to

\hat{x}_{ik} = \frac{x_{ik} - \bar{x}_k}{\sigma_k}    (2.4)
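As a minimal sketch of this normalization in C++ (illustrative only; the function name and the use of std::vector are assumptions, not part of the implementation described in chapter 5):

#include <vector>
#include <cmath>

// Normalize one feature k to zero mean and unit variance, as in (2.2)-(2.4).
// 'values' stands in for one column (feature) taken over all N data vectors.
void NormalizeFeature(std::vector<float>& values)
{
    const std::size_t N = values.size();
    if (N < 2) return;                      // variance is undefined for N < 2

    float mean = 0.f;
    for (std::size_t i = 0; i < N; ++i) mean += values[i];
    mean /= static_cast<float>(N);          // (2.2)

    float var = 0.f;
    for (std::size_t i = 0; i < N; ++i) {
        const float d = values[i] - mean;
        var += d * d;
    }
    var /= static_cast<float>(N - 1);       // (2.3)

    const float sigma = std::sqrt(var);
    if (sigma > 0.f)
        for (std::size_t i = 0; i < N; ++i)
            values[i] = (values[i] - mean) / sigma;   // (2.4)
}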

2.3 Definition of a Cluster

Now, let us define some basic concepts of clusters in a mathematical way. Let X be a set of data, that is X = {x1, x2, ..., xn}. A so-called m-clustering of X is its partition into m parts (clusters) C1, ..., Cm, such that
1. None of the clusters is empty: Ci ≠ ∅
2. Every sample belongs to a cluster: C1 ∪ ... ∪ Cm = X
3. Every sample belongs to a single cluster (crisp clustering): Ci ∩ Cj = ∅, i ≠ j
Naturally, it is assumed that the vectors in cluster Ci are in some way more similar to each other than to the vectors in the other clusters. Figure 1 illustrates a couple of different kinds of clusters: compact, linear and circular.

Figure 1: A couple of different kinds of clusters

2.4 Number of Possible Clusterings

Naturally, the best way to apply clustering would be to enumerate all possible clusterings and select the most suitable one. Unfortunately, due to limited time and the large number of feature vectors this is not usually possible. If we let S(N, m) denote the number of possible clusterings of N vectors into m groups, it holds that
1. S(N, 1) = S(N, N) = 1
2. S(N, m) = 0, if m > N
where the second statement comes from the definition that no cluster may be empty. Actually, it can be shown that the solution for S is given by the Stirling numbers of the second kind:

S(N, m) = \frac{1}{m!} \sum_{i=0}^{m} (-1)^{m-i} \binom{m}{i} i^N    (2.5)
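To get a feeling for how fast (2.5) grows, a small illustrative C++ sketch (not part of the implementation of chapter 5) can evaluate it directly in floating point; exact integer arithmetic would overflow very quickly, which is the point of the example:

#include <cmath>
#include <cstdio>

// Evaluate (2.5) with doubles; binom is C(m, i), built up incrementally.
double Stirling2(int N, int m)
{
    double sum = 0.0, binom = 1.0;
    for (int i = 0; i <= m; ++i) {
        const double sign = ((m - i) % 2 == 0) ? 1.0 : -1.0;
        sum += sign * binom * std::pow(static_cast<double>(i), N);
        binom = binom * (m - i) / (i + 1);  // C(m, i+1) from C(m, i)
    }
    double mfact = 1.0;
    for (int i = 2; i <= m; ++i) mfact *= i;
    return sum / mfact;
}

int main()
{
    // S(15, 3) = 2375101; for larger N the values explode.
    std::printf("%.0f\n%.3e\n", Stirling2(15, 3), Stirling2(100, 5));
    return 0;
}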

It is quite clear that the value of equation 2.5 grows rapidly with N, and if the number of desired clusters m is not known in advance many different values must be tested, so a brute force solution becomes impossible. A more efficient solution must be found.

2.5 Proximity Measure

When clustering is applied, a way to measure the similarities and dissimilarities between the samples is needed. Formally, a dissimilarity measure (DM) d (informally, a distance) on X is a function

d: X × X → R

for which there exists d0 in R such that

−∞ < d0 ≤ d(x, y) < +∞, for all x, y in X    (2.6)
d(x, x) = d0, for all x in X    (2.7)
d(x, y) = d(y, x), for all x, y in X.    (2.8)

If in addition the following hold:

d(x, y) = d0, if and only if x = y    (2.9)
d(x, z) ≤ d(x, y) + d(y, z), for all x, y, z in X (triangular inequality)    (2.10)

then d is called a metric. A similarity measure (SM) is defined correspondingly; see Theodoridis (1999) for details. For example, the Euclidean distance d2 is a metric dissimilarity measure with d0 = 0: the minimum possible distance between any two vectors is 0, and it equals 0 only when the two vectors are the same. The triangular inequality is also known to hold for the Euclidean distance. Another well-known metric DM is the Manhattan norm. The inner product x^T y between two vectors, on the other hand, is a similarity measure. In particular, if the lengths of the vectors x and y are one, the lower and upper limits of the inner product are −1 and +1.
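As a sketch of these measures for the three-component feature vectors used later in this document (the struct Vec3 below is an illustrative stand-in, not the actual vector class of chapter 5):

#include <cmath>

struct Vec3 { float x, y, z; };

// Euclidean distance d2: a metric dissimilarity measure with d0 = 0.
float EuclideanDist(const Vec3& a, const Vec3& b)
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Manhattan (city block) distance: another metric dissimilarity measure.
float ManhattanDist(const Vec3& a, const Vec3& b)
{
    return std::fabs(a.x - b.x) + std::fabs(a.y - b.y) + std::fabs(a.z - b.z);
}

// Inner product: a similarity measure; for unit-length vectors it lies in [-1, 1].
float InnerProduct(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}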

When we extend the above formulas (2.6)-(2.10) to hold for subsets U of X, we get a proximity measure on U as a function from U × U to R. A typical case where proximity between subsets is needed is when a single vector x is measured against a cluster C. Typically a distance to y, the representative of C, is chosen. The representative can be chosen so that the value is, for example, maximized or minimized. If a single vector representative is chosen from among the vectors of C, the method is called a global clustering criterion, and if all the vectors in C have an effect on the representative, a local clustering criterion is being used. The representative of C can also be a curve or a hyperplane in the dimension of x. For example, in figure 2 different kinds of representatives are chosen for the clusters of figure 1. The first cluster is represented by a point, the second by a line and the third by two hyperspheres (the inner and the outer).
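As a small illustration of measuring a vector against a cluster through a point representative (a sketch under the three-feature assumption of chapter 5; names are hypothetical and this is not the actual CCluster code):

#include <vector>
#include <cmath>
#include <cfloat>

struct Vec3 { float x, y, z; };

// d(x, C) with a point representative: the distance from x to the mean of C.
float DistToCluster(const Vec3& x, const std::vector<Vec3>& C)
{
    if (C.empty()) return FLT_MAX;
    Vec3 mean = {0.f, 0.f, 0.f};
    for (std::size_t i = 0; i < C.size(); ++i) {
        mean.x += C[i].x; mean.y += C[i].y; mean.z += C[i].z;
    }
    const float n = static_cast<float>(C.size());
    mean.x /= n; mean.y /= n; mean.z /= n;
    const float dx = x.x - mean.x, dy = x.y - mean.y, dz = x.z - mean.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}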

Figure 2: Different representatives for different clusters

3 POPULAR CLUSTERING ALGORITHMS

As stated in chapter 2.4, calculating all possible combinations of the feature vectors is not generally possible. Clustering algorithms provide means to make a sensible division into clusters using only a fraction of the work needed to calculate all possible combinations. These algorithms usually fall into one of the categories of the subchapters below, Theodoridis (1999).

3.1 Sequential Algorithms

Sequential algorithms are straightforward and fast methods that produce a single clustering. Usually the feature vectors are presented to the algorithm once or a few times. The final result is typically dependent on the order of presentation, and the resulting clusters are often compact and hyperellipsoidally shaped.

3.1.1 Basic Sequential Algorithmic Scheme

A very basic clustering algorithm that is easy to understand is the basic sequential algorithmic scheme (BSAS). In its basic form the vectors are presented only once and the number of clusters is not known a priori. What is needed is the dissimilarity measure d(x, C), the threshold of dissimilarity Θ and the maximum number of clusters allowed, q.

The idea is to assign every newly presented vector to an existing cluster or to create a new cluster for it, depending on its distance to the already defined clusters. In pseudocode the algorithm works as follows:
1. m = 1; Cm = {x1}; // Init first cluster = first sample

2. for every sample x from 2 to N
   a. find the cluster Ck such that d(x, Ck) is minimal
   b. if d(x, Ck) > Θ AND (m < q)
      i. m = m + 1; Cm = {x} // Create a new cluster
   c. else
      i. Ck = Ck + {x} // Add the sample to the nearest cluster
      ii. Update the representative if needed
3. end algorithm

As can be seen, the algorithm is simple but still quite efficient. Different choices for the distance function lead to different results, and unfortunately the order in which the samples are presented can also have a great effect on the final result. What is also very important is a correct value for Θ. This value has a direct effect on the number of clusters formed. If Θ is too small, unnecessary clusters are created, and if too large a value is chosen, fewer clusters than required are formed. One detail is that if q is not defined, the algorithm decides the number of clusters on its own. This might be wanted under some circumstances, but when dealing with limited resources a limited q is usually chosen. It should also be noted that BSAS can be used with a similarity function simply by replacing the min function with max. There exists a modification of BSAS called modified BSAS (MBSAS), which runs through the samples twice. It overcomes the drawback that the final cluster for a single sample is decided before all the clusters have been created. The first phase of the algorithm creates the clusters (just like step 2b in BSAS) and assigns only a single sample to each cluster. The second phase then runs through the remaining samples and classifies them to the created clusters (step 2c in BSAS).

3.1.2 A Two-Threshold Sequential Scheme

The major drawback of BSAS and MBSAS is the dependence on the order of the samples as well as on the correct value of Θ. These drawbacks can be diminished by using two threshold values Θ1 and Θ2. Distances less than the first value Θ1 denote that the two samples most likely belong together, and distances greater than Θ2 denote that the samples do not belong to the same cluster. Values between these two fall into a so-called gray area and are reevaluated at a later stage of the algorithm.

Letting clas(x) be a boolean stating whether a sample has been classified or not, and assuming no bound on the number of clusters, the two-threshold sequential scheme (TTSAS) can be described in pseudocode:
1. m = 0
2. for all x: clas(x) = false
3. prev_change = 0; cur_change = 0; exists_change = 0;
4. while there exists some sample not classified
   a. for every x from 1 to N
      i. if clas(x) = false AND x is the first unclassified sample in this pass AND exists_change = 0
         1. m = m + 1 // Create a new cluster
         2. Cm = {x}; clas(x) = true;
         3. cur_change = cur_change + 1
      ii. else if clas(x) = false
         1. find the cluster Ck minimizing d(x, Ck)
         2. if d(x, Ck) < Θ1
            a. Ck = Ck + {x}; clas(x) = true; // Add to a cluster
            b. cur_change = cur_change + 1
         3. else if d(x, Ck) > Θ2
            a. m = m + 1 // Create a new cluster
            b. Cm = {x}; clas(x) = true;
            c. cur_change = cur_change + 1
      iii. else // if clas(x) = true
         1. cur_change = cur_change + 1
   b. exists_change = |cur_change - prev_change|
   c. prev_change = cur_change; cur_change = 0;

The exists_change variable checks whether at least one vector was classified during the latest pass of the while loop. If no sample was classified, the first unclassified sample is used to form a new cluster (step 4.a.i), and this guarantees that at most N passes of the while loop are performed. In practice the number of passes should naturally be much less than N, but in theory this is an O(N^2) algorithm; see Weiss (1997) for additional information about performance classifications.

3.2 Hierarchical Algorithms

Hierarchical algorithms are further divided into agglomerative and divisive schemes. These algorithms rely on ideas from matrix and graph theory to produce either a decreasing or an increasing number of clusters at each time step, thus producing a hierarchy of clusterings. A problem of its own is the choice of a proper clustering from this hierarchy. One solution is to track the lifetime of all clusters and search for clusters that have a long lifetime. This involves subjectivity and might not be suitable in many cases. Another approach is to measure the self-similarity of clusters by calculating distances between vectors in the same cluster and comparing them to some threshold value. As can be deduced, the problem is difficult overall and no generally correct solution exists.

3.2.1 Agglomerative Algorithms 1: Matrix Theory

Agglomerative algorithms start from an initial clustering where every sample vector has its own cluster; that is, initially there are N clusters with a single element in each of them. At every step of the algorithm two of these clusters are joined together, resulting in one cluster less. This is continued until only a single cluster exists. One simple agglomerative algorithm from which many other algorithms are derived is the general agglomerative scheme (GAS), defined as follows:
1. t = 0; Ci = {xi}, i = 1, ..., N // Initial clustering
2. while more than one cluster is left
   a. t = t + 1
   b. among all the clusters find the pair minimizing d(Ci, Cj) (or maximizing d(Ci, Cj) if d denotes similarity)
   c. create Cq = Ci + Cj and replace the clusters Ci and Cj with it
It should be clear that this creates a hierarchy of N clusterings. The disadvantage here is that once two vectors are assigned to the same cluster there is no way for them to be separated at a later stage. Further, it is quite easy to see that this is an O(N^3) algorithm, not suitable for a large N.
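The choice of d(Ci, Cj) in step 2b is what distinguishes many of the variants; as a sketch (not part of the implementation of chapter 5), the single link and complete link choices mentioned below could be written as follows, with the Euclidean distance standing in for an arbitrary dissimilarity measure:

#include <vector>
#include <cmath>
#include <cfloat>

struct Vec3 { float x, y, z; };

float dist(const Vec3& a, const Vec3& b)   // example DM from chapter 2.5
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Single link: d(Ci, Cj) is the smallest pairwise distance between the clusters.
float SingleLink(const std::vector<Vec3>& Ci, const std::vector<Vec3>& Cj)
{
    float best = FLT_MAX;
    for (std::size_t a = 0; a < Ci.size(); ++a)
        for (std::size_t b = 0; b < Cj.size(); ++b) {
            const float d = dist(Ci[a], Cj[b]);
            if (d < best) best = d;
        }
    return best;
}

// Complete link: d(Ci, Cj) is the largest pairwise distance between the clusters.
float CompleteLink(const std::vector<Vec3>& Ci, const std::vector<Vec3>& Cj)
{
    float best = 0.f;
    for (std::size_t a = 0; a < Ci.size(); ++a)
        for (std::size_t b = 0; b < Cj.size(); ++b) {
            const float d = dist(Ci[a], Cj[b]);
            if (d > best) best = d;
        }
    return best;
}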

There are algorithms such as the matrix updating algorithmic scheme (MUAS) and its single and complete link variations that can all be seen as special cases of GAS. These algorithms are based on matrix theory, and the general idea is to use the pattern matrix D(X) and the dissimilarity matrix P(X) to hold the information needed in a GAS-like updating scheme. The pattern matrix is an N × l matrix whose ith row is the transposed ith vector of X. The dissimilarity matrix is an N × N matrix whose element (j, k) equals the dissimilarity d(xj, xk).

3.2.2 Agglomerative Algorithms 2: Graph Theory

The other form of agglomerative algorithms is those based on graph theory. A graph G is an ordered pair G = (V, E), where V = {vi, i = 1, ..., N} is a set of points (nodes) and E is a set of edges, denoted eij or (vi, vj), connecting some of these points. If the order of the points (vi, vj) is not meaningful the graph is undirected; otherwise it is directed. If no cost is associated with the edges the graph is said to be unweighted, and if any of the edges has a cost the graph is weighted. A more thorough introduction to graphs can be found, for example, in Ilkka (1997). Small graphs can be illustrated easily by drawing them as in figure 3. The first graph, with 5 nodes, is complete (all points are connected to each other) and unweighted. The second graph is a subgraph of the first one, and the third is a path 1, ..., 5 with weights assigned to each edge.

Figure 3: Different kinds of graphs

In clustering algorithms we consider the graph nodes to be the sample vectors of X. Clusters are formed by connecting these nodes together with edges. The basic algorithm in this case is known as the graph theory-based algorithmic scheme (GTAS). It is, again, very similar to GAS. The difference is in step 2b, which now becomes: find the pair minimizing gh(k)(Ci, Cj) (or maximizing gh(k)(Ci, Cj) if g denotes similarity), where gh(k)(Ci, Cj) is defined in terms of proximity and a property h(k) of the subgraphs. This property can differ depending on the desired result. In other words, clusters (subgraphs) are formed based on the distance and on the requirement that the resulting subgraph has the property h(k) or is complete.

3.2.3 Divisive Algorithms

Divisive algorithms are the opposite of the agglomerative ones: they start with a single cluster containing the entire set X and divide it in stages. The final clustering contains N clusters, one sample in each. The idea is to find the division that maximizes the dissimilarity. As an example, the generalized divisive scheme (GDS) may be defined as:
1. t = 0; C0 = {X}
2. while each vector is not in its own distinct cluster
   a. t = t + 1
   b. for i = 1 to t
      i. amongst all possible pairs of clusters find the one maximizing g(Ci, Cj)
   c. form a new clustering by separating the pair (Ci, Cj)
It is easy to see that this is computationally very demanding, and in practice simplifications are required. One way to speed up the process goes as follows. Let Ci be a cluster to be split into C1 and C2. Initially set C1 = ∅ and C2 = Ci. Now, find the vector in C2 whose average dissimilarity with the other vectors is maximal and move it to C1. Next, for each remaining vector x in C2, compute its average dissimilarity with C1 and C2. If for every x the dissimilarity with C2 is smaller than with C1, then stop (the division has been found). Otherwise move to C1 the x maximizing the similarity with C1 and minimizing the similarity with C2. This process is continued until a division has been found, and it can be viewed as step 2.b.i of GDS.

3.3 Cost Function Optimization Algorithms

A third genre of clustering algorithms is those based on a cost function. Cost function algorithms use a cost function J to define the sensibility of their solution. Usually the number of desired clusters is known beforehand, and differential calculus is used to optimize the cost function iteratively. The Bayesian philosophy, Theodoridis (1999), is also often applied. This category also includes fuzzy clustering algorithms, where a vector belongs to a cluster up to a certain degree. As a glance at cost function optimization algorithms, some of the theory of fuzzy clustering is discussed in the subchapter below.

3.3.1 Fuzzy Clustering

Fuzzy schemes have been under a lot of interest and research during recent years. What is characteristic and unique to fuzzy schemes is that a sample belongs simultaneously to many categories. A fuzzy clustering is defined by a set of functions uj : X → A, j = 1, ..., m with A = [0, 1]. A hard clustering scheme can be defined by setting A = {0, 1}.


Let θj be a parameterized representative of cluster j, so that Θ = [θ1^T, ..., θm^T]^T, and let U be an N × m matrix with element (i, j) denoting uj(xi). Then we can define a cost function of the form

J_q(\Theta, U) = \sum_{i=1}^{N} \sum_{j=1}^{m} u_{ij}^q \, d(x_i, \theta_j)    (3.1)

which is to be minimized with respect to Θ and U. The parameter q is called the fuzzyfier. The constraint is that each sample belongs to the clusters at a total rate of 1:

\sum_{j=1}^{m} u_{ij} = 1, \quad i = 1, \dots, N.    (3.2)

Minimizing J with respect to U, see Theodoridis (1999) for details, leads to

u_{rs} = \frac{1}{\sum_{j=1}^{m} \left( \frac{d(x_r, \theta_s)}{d(x_r, \theta_j)} \right)^{\frac{1}{q-1}}}, \quad r = 1, \dots, N, \; s = 1, \dots, m    (3.3)
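As an illustrative sketch of this U-update (names and the zero-distance handling are assumptions for the example, not part of the implementation of chapter 5 or of (3.3) itself):

#include <vector>
#include <cmath>

// One membership value u_rs according to (3.3). dxr[j] holds d(x_r, theta_j)
// for j = 0..m-1; q > 1 is the fuzzyfier.
double MembershipUrs(const std::vector<double>& dxr, int s, double q)
{
    const double drs = dxr[s];
    if (drs == 0.0) return 1.0;            // x_r coincides with representative s
    double sum = 0.0;
    for (std::size_t j = 0; j < dxr.size(); ++j) {
        if (dxr[j] == 0.0) return 0.0;     // all weight goes to the coinciding cluster
        sum += std::pow(drs / dxr[j], 1.0 / (q - 1.0));
    }
    return 1.0 / sum;
}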

With respect to Θ we take the gradient of (3.1) and obtain

\frac{\partial J_q(\Theta, U)}{\partial \theta_j} = \sum_{i=1}^{N} u_{ij}^q \frac{\partial d(x_i, \theta_j)}{\partial \theta_j} = 0, \quad j = 1, \dots, m.    (3.4)

When combined, these two do not give a general closed form solution. One solution is to use, for example, an algorithm known as the generalized fuzzy algorithmic scheme (GFAS) to iteratively estimate U and Θ, see Theodoridis (1999). Finally, if for a point representative we use, for example, a common function d of the form

d(x_i, \theta_j) = (x_i - \theta_j)^T A (x_i - \theta_j)    (3.5)

then substituting this into (3.4) yields

\frac{\partial J_q(\Theta, U)}{\partial \theta_j} = \sum_{i=1}^{N} u_{ij}^q \, 2A(\theta_j - x_i) = 0

which is used in GFAS to obtain new representatives at each time step.

4 OTHER ALGORITHMS

There are also other algorithms not belonging to the groups mentioned in chapter 3. These include, for example, genetic algorithms, stochastic relaxation methods, competitive learning algorithms and morphological transformation techniques. Also, some graph theory algorithms are used, for example algorithms based on the minimum spanning tree (MST), see Ilkka (1997). The following subchapter gives a quick tour of the ideas of competitive learning.

4.1 Competitive Learning

Competitive learning algorithms are a wide branch of algorithms used in many fields of science. What these algorithms actually do is clustering. They typically use a set of representatives wj (as in the previous chapter) which are moved around in the space R^n to match (represent) regions that contain a relatively large number of samples. Every time a new sample is introduced, the representatives compete with each other and the winner is updated (moved). The other representatives can be updated at a slower rate, left alone or punished (moved away from the sample). One of the most basic competitive algorithms is the generalized competitive learning scheme (GCLS), defined as:
1. t = 0; // Time = 0
2. m = minit; // Number of clusters

3. while NOT converged AND (t < tmax)
   a. t = t + 1
   b. present a new sample x and determine the winner wj
   c. if (x is NOT similar to wj) AND (m < mmax)
      i. m = m + 1; // New cluster
      ii. wm = x
   d. else // Update parameters
      i. if winner: wj(t) = wj(t-1) + η h(x, wj(t-1))
      ii. if not winner: wj(t) = wj(t-1) + η' h(x, wj(t-1))
4. The clusters are ready and represented by the wj. Assign each sample to the cluster whose representative is the closest.

The parameters η and η' are called learning rate parameters. They control the rate of change of the winners and losers. The function h is some function, usually dependent on the distance. Convergence can be detected, for example, by calculating the total change in the vectors wj and comparing it to a selected threshold value.
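As a minimal sketch of the winner update in step 3.d, assuming the common choice h(x, w) = x - w (this is an illustration, not part of the implementation of chapter 5):

struct Vec3 { float x, y, z; };

// One GCLS-style update: the winner is pulled towards the sample by the
// learning rate eta. A loser could be updated the same way with a smaller
// (or negative) rate.
void UpdateWinner(Vec3& w, const Vec3& x, float eta)
{
    w.x += eta * (x.x - w.x);
    w.y += eta * (x.y - w.y);
    w.z += eta * (x.z - w.z);
}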


4.1.1 The Self-Organizing Map

By generalizing the definition of a representative and defining a neighborhood Qj of representatives for each wj, we obtain a model called the (Kohonen) self-organizing map (SOM). As the time t increases, the neighborhood shrinks and concentrates around the representative. All representatives in the neighborhood of the winner are updated at each time step. What is important is that the neighborhood is independent of the actual distances in the data space and is defined in terms of the indices j. That is, the geometrical distance in space is not the metric of the neighborhood. The self-organizing map and its properties are formally defined in terms of a neural network in Haykin (1999). The original model, described by Kohonen (1982), is the most popular and general model in use today. The Kohonen model is illustrated in figure 4. The input layer is connected to a two-dimensional array of neurons from which the winner is chosen. The weight of the winner (and of its neighborhood) is then updated at each time step as in GCLS.
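As an illustrative sketch of such an index-based neighborhood update (the grid layout, the shape of the neighborhood and all names are assumptions for the example, not the formal SOM definition of Haykin (1999)):

#include <cstdlib>   // std::abs

struct Vec3 { float x, y, z; };

// A SOM-style update on a W x H grid of representatives: every unit whose
// grid (index) distance to the winner is at most 'radius' is pulled towards
// the sample. Both radius and eta would shrink as t grows.
void SomUpdate(Vec3* grid, int W, int H, int winX, int winY,
               const Vec3& x, float eta, int radius)
{
    for (int gy = 0; gy < H; ++gy)
        for (int gx = 0; gx < W; ++gx)
            if (std::abs(gx - winX) + std::abs(gy - winY) <= radius) {
                Vec3& w = grid[gy * W + gx];
                w.x += eta * (x.x - w.x);
                w.y += eta * (x.y - w.y);
                w.z += eta * (x.z - w.z);
            }
}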

Figure 4: The Kohonen SOM model

4.2 Closing on Algorithm Presentation

This and the preceding chapter introduced the most general algorithms used in clustering. Every algorithm presented is of a basic nature and should not be very hard to understand or implement. The presentation is not complete, and many algorithms of value have been omitted due to the limits set on the length of this document. As a close-up, figure 5 gives an overall view of the families of clustering algorithms mentioned above. The hierarchy is neither absolute nor complete, and other kinds of divisions could also be made. It is provided to clarify the different branches in the way the algorithms operate.


Figure 5: The family of clustering algorithms

5 THE IMPLEMENTATION

This chapter presents an implementation of four of the above algorithms. The implementation works in the Microsoft Windows environment and is coded in the C++ programming language using the Microsoft Visual Studio .NET compiler. C++ was chosen for this task because it is an industry standard, it produces efficient programs, and because of its power of expression. All the relevant code is provided in the appendix and is available, along with the application, from the author. What is needed before the actual implementation is a way to store and handle the vectors used as containers for the features. For this a vector container class was created. From here on it is assumed that there are no more than three features to deal with. This limitation exists because only three dimensions (3D) can be easily projected onto a 2D display. The vector class (Vector3) has these three components, for which all the normal vector arithmetic is implemented as class operators. The next chapter provides an efficient way to display these vectors and their clusterings with the OpenGL API (application programming interface), which is often used in professional 3D graphics. A base class for all clustering algorithms was named CClustAlgorithm. It provides some virtual functions all algorithms must implement. This way there is a unified base from which all the algorithms must be derived. The most important function (Java has methods, C++ has functions) to implement is
virtual ClList* Clusterize(const VList* vectors, CCluster* empty) const = 0;

It returns an std::list of clusters formed from the given vectors. The empty cluster parameter exists so that different kinds of subclasses can be used: Clusterize always creates the same kind of objects as is given to it. That way one can, for example, use clusters with different kinds of representatives. The CCluster class represents a single cluster. It holds a list of all the vectors belonging to it and a representative (the mean value). It can be used as a base class if other kinds of representatives are needed. Vectors can be added and removed, and distances to other clusters and to individual vectors can be calculated. Also, other clusters can be included into a cluster, thus forming a union. This functionality is enough for an efficient implementation of the algorithms in the subchapters below.

5.1 MBSAS

The implementation of the MBSAS algorithm is straightforward. Initialization includes creating an initial cluster and adding the first element of the list to it. After that, the 'create clusters' pass is performed. One iterator goes through all the samples and another iterator goes through the list of clusters (for every sample). The cluster with the minimal distance is retrieved and the distance is compared to the given threshold value. If the distance is greater than the threshold and we may still create a new cluster (iq is the maximum amount), we do so. As code the 'create clusters' pass is
for (; iter != tmplist.end(); iter++) {
    tmp = *iter;
    float mindist = FLT_MAX;
    // Find the minimum distance
    for (iter2 = ClusterList->begin(); iter2 != ClusterList->end(); iter2++) {
        float dist = (*iter2)->Distance(tmp);
        if (dist < mindist)
            mindist = dist;
    }
    // Create a new cluster?
    if ((mindist > fTheta) && ((int)ClusterList->size() < iq)) {
        CCluster* newclust = empty->GetNewCluster();
        newclust->AddVector(tmp);
        ClusterList->push_back(newclust);
    }
}

Now all the samples already assigned to a cluster are removed from the list. After that a second pass is made to classify all the rest. Each sample is added to the cluster with the minimum distance, just like above, but new clusters are no longer created. The implementation works well and is (along with TTSAS) the fastest of the algorithms provided here. The quality of the result depends on a correct choice of the threshold value, as can be seen in chapter 7.

5.2 TTSAS

In the TTSAS algorithm the list of vectors is first transformed into a normal array of vectors. This might not be necessary, but it makes it easy to maintain another array containing the information about whether each sample has been classified or not (clas). The implementation here does not limit the number of clusters like the MBSAS implementation above. What is done is exactly what is described in chapter 3.1.2. Part 4.a.i goes like
if (!clas[i] && existsChange == 0 && !gotOne) {
    // Let's make sure the while ends at some point :)
    CCluster* clust = empty->GetNewCluster();
    clust->AddVector(tmplist[i]);
    ClusterList->push_back(clust);
    clas[i] = true;
    curChange++;
    numDone++;
    gotOne = true;
}

Next, if the sample is not classified (4.a.ii), we search for the minimum distance cluster and, based on the two threshold values, create a new cluster, add the sample to an existing cluster or just leave it for a later pass:
if (mindist < fTheta1) {
    // found the same kind
    minclust->AddVector(tmplist[i]);
    clas[i] = true;
    curChange++;
    numDone++;
} else if (mindist > fTheta2) {
    // need to create a new one
    CCluster* clust = empty->GetNewCluster();
    clust->AddVector(tmplist[i]);
    ClusterList->push_back(clust);
    clas[i] = true;
    curChange++;
    numDone++;
}

All this is done for the entire list, pass by pass, until every sample belongs to a cluster. The first if guarantees that no more than N passes (O(N^2) work in total) are needed. Naturally, in practice, the number of passes is smaller. The result is, again, dependent on the threshold values. The speed of this implementation is on par with the MBSAS implementation or a little better.

5.3 GAS

The GAS algorithm is also quite short and easy to implement. Notice that the entire hierarchy of clusterings is not saved in this implementation. First, an initial clustering of N clusters with one sample in every cluster is created. Then, while there are more than the suggested number of clusters, the two clusters that are closest to each other are sought and combined:
// Seek the two clusters that have min distance (slow)...
for (iter2 = ClusterList->begin(); iter2 != ClusterList->end(); iter2++) {
    iter2++; // Dummy inc
    for (iter3 = iter2--; iter3 != ClusterList->end(); iter3++) {
        float dist = (*iter2)->Distance(*iter3);
        if (dist < mindist) {
            mindist = dist;
            minclust1 = *iter2;
            minclust2 = *iter3;
        }
    }
}
// ...and combine them
if (minclust2 != NULL) {
    minclust1->Include(*minclust2);
    ClusterList->remove(minclust2);
    delete minclust2;
}

The only problem here is actually the slowness of the algorithm. It takes a lot of time to go through all the levels of the clustering hierarchy. This agrees with the theory that this algorithm is quite slow compared to the two algorithms above. On the other hand, there are no threshold values that would need to be carefully selected, as can be seen in chapter 7, and thus this algorithm is easier and safer to use. If the number of clusters is relatively large compared to the number of samples, this algorithm should be more competitive, since fewer iteration steps are then performed.

5.4 GDS

The general divisive scheme is even more demanding than GAS. That is why the general version of the algorithm was not implemented; instead an optimized version is presented (somewhat like the version in chapter 3.2.3). The optimization is based on the assumption that an outlier sample (the one farthest away from the mean) is very likely to belong to a cluster other than the one where it currently is. Also, as with GAS, this algorithm is not driven to the end but is interrupted as soon as the suggested number of clusters has been found. This actually helps a lot if there is a relatively small number of clusters. Initially a single cluster including all samples is created. Then, while there are fewer than the desired number of clusters, the cluster with the farthest outlier is selected as the one to be divided in two. The outlier is moved to a cluster of its own. Then the distance to the new cluster is calculated for every other vector in the old cluster and compared against the distance to its own representative. The vector closest to the new cluster is selected, and if it is nearer to the new cluster than to the old one, it is moved to the new one. This is continued until no vector is moved during one pass (all vectors in the old cluster are nearer to their own representative than to the new one):
while (foundOne) {
    foundOne = false; // Let's see if we find any
    fVector3 vect(0, 0, 0);
    maxdist = FLT_MAX;
    // Go through all the samples in the old cluster
    for (iter = maxclust->GetVectors().begin(); iter != maxclust->GetVectors().end(); iter++) {
        // Dist. to the new clust.
        maxdist2 = newclust->Distance(*iter);
        if (maxclust->Distance(*iter) > maxdist2 && maxdist2 < maxdist) {
            foundOne = true;
            maxdist = maxdist2;
            vect = *iter;
        }
    }
    if (foundOne) { // We did find one sample?
        newclust->AddVector(vect);
        maxclust->RemoveVector(vect);
    }
}

The improvements presented significantly diminish the time needed for a single pass: instead of always evaluating all possible combinations, we select a single vector and attach similar vectors to it from the old cluster. Due to the optimizations, the runtime of this algorithm is closer to MBSAS and TTSAS than it is to GAS. The quality of the result is generally close to that of GAS.

6 THE VISUALIZATION ENGINE

The visualization of clusterings is problematic if the number of features is large. What is provided in this chapter is a way to display three-element vectors and their clusterings with the OpenGL 1.2 API. This engine can also be seen as a tutorial on the use of OpenGL in the Microsoft Windows environment to create 3D visualizations. It actually consists of a single class, CGLRenderer, which acts as a wrapper between OpenGL and the rest of the program. The user, or programmer, needs no knowledge of OpenGL when using other parts of the code. What this class provides are the basic functions needed to initialize OpenGL, draw elements on the screen, move and rotate the camera and display text on the screen. The most important functions of CGLRenderer are a function that draws a list of clusters
void DrawClusters(const ClList* list);

and the two functions that move and rotate the viewpoint (camera) in the coordinate system, allowing the viewer to move freely in space
void MoveCamera(float advance, float sideways, float up = 0.f);
void RotateCamera(float xrot, float yrot);

If selected, the engine draws gray bounding spheres around the clusters so that it is easier to see the limits of a single cluster (key V). Figure 6 shows a typical view of five separate clusters in space. The gray spheres represent the borders of the different clusters, while the colored dots inside the spheres are individual samples. The three white lines represent the coordinate axes. The user can change the viewpoint and the view direction with the mouse and mouse buttons; neither is limited by the application. The user can change the algorithm to show by pressing the number keys. If the selected algorithm is disabled, no clusters are shown (just the coordinate axes). The text display in the bottom left corner of the screen shows the current algorithm, the number of found clusters and the viewer position (the display is not visible in the resized images of this document). If the user minimizes the application, it goes into a so-called idle mode where it consumes much less processor time, allowing other applications to run more efficiently.
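The appendix reproduces the sphere-drawing and camera code; the drawing of the individual samples is not shown there, but as an assumption-laden sketch of the idea, the samples of one cluster could be drawn as colored points with the same immediate mode API (names and parameters here are illustrative, not the engine's actual code):

#include <windows.h>   // required before gl.h on Windows
#include <GL/gl.h>

struct Vec3 { float x, y, z; };

// Draw the samples of one cluster as points of a given color.
void DrawClusterPoints(const Vec3* samples, int count, float r, float g, float b)
{
    glPointSize(3.f);
    glColor3f(r, g, b);
    glBegin(GL_POINTS);
    for (int i = 0; i < count; ++i)
        glVertex3f(samples[i].x, samples[i].y, samples[i].z);
    glEnd();
}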


Figure 6: A typical view of five clusters

6.1 Data Generation and Algorithm Initialization

Data can be generated with a simple generator class, CDataGenerator. It has static functions that can be called from anywhere to get data generated with a simple random function. The data generation setup screen is illustrated in figure 7. The user inputs the number of clusters and vectors, and the data generator generates such material. The compactness is a value indicating how compact the clusters should be. The file section is for future use, when data could be read from a file generated, for example, with Matlab. The user should notice that after this dialog the algorithms are also reinitialized with the given number of clusters. If other parameters are wanted, the algorithm setup dialog in figure 8 must be used. After either of these dialogs the algorithms are run through the data. If the number of sample vectors is large, the response to the user might be slow. This is because the application is single threaded, meaning that while an algorithm is being run the display is not updated. The algorithms and their parameters are initialized with another dialog, illustrated in figure 8. From this dialog the GAS algorithm can also be disabled if a large amount of data needs to be clustered. When all the parameters have been filled in, the user presses OK and the data is reclustered using the new parameters for the algorithms.

Figure 7: The data generation setup.

Figure 8: The algorithm setup.

7 THE TESTS

This chapter presents some results of running the algorithms with different parameters and different amounts of samples and clusters. No solid statistical analysis was made of the results. A more thorough analysis is beyond the scope of this basic document, and the results are provided as a guideline for general interest in the behavior of the algorithms. First the speed of the algorithms was analyzed against the number of samples. Table 1 shows the result (the number of clusters was kept at 20). The tests were run on an AMD Athlon Classic 700 MHz with 512 kB of L2 cache and 512 MB of memory on Windows XP. The GAS algorithm was not run with the three largest data sets, because it would have taken a lot of time and the behavior (O(N^3)) of the algorithm can already be seen from the smaller sets. The performance of the other algorithms seems to be proportional to N^2, where one pass of GDS is relatively long compared to MBSAS or TTSAS. Notice that due to the way the runtime was measured, times below 100 ms cannot be considered accurate.

Table 1. Runtime for different amounts of data

Algorithm   N=1000   N=2000   N=4000   N=8000   N=16000   N=32000
MBSAS       10ms     10ms     20ms     70ms     310ms     960ms
TTSAS       0ms      0ms      20ms     60ms     320ms     900ms
GDS         80ms     360ms    1.5s     7.1s     37.4s     231.6s
GAS         12.4s    185.1s   0.4h     -        -         -

7.1 The Effect of Parameters

The effect of the parameters was also examined with randomly selected sets of data. First, figure 9 shows the result of incorrectly chosen parameters for TTSAS.

Figure 9: The meaning of correct parameters.

The figure on the left shows how TTSAS with incorrectly large theta values (1000, 2000) misclassifies a part of the blue cluster into the red cluster (the cluster on the left). When the theta values are lowered to a more reasonable level (500, 1500), TTSAS creates the correct clusters, as shown in the picture on the right. Since the implementation compares squared distances, the value 500 corresponds roughly to a distance of 20 in space. This problem does not emerge with GAS or GDS, since they have no dissimilarity parameters. Figure 10 illustrates the case where an incorrect number of clusters was guessed and given to the algorithms as a parameter (four instead of five).

Figure 10: The meaning of a correct number of clusters

The top leftmost image shows the behavior of MBSAS: the blue cluster (on the left) is incorrectly large when compared to the top rightmost image of TTSAS. The result of TTSAS is most likely the correct one, since it does not use the number-of-clusters parameter at all. The behavior of GAS, at the bottom left, is similar to MBSAS, while GDS creates a big red cluster (on the right) that includes the samples of the green cluster. The fact that GDS differs from GAS is probably due to the red cluster of GDS being more compact than the blue cluster of GAS, which brings its outlier closer to the average value for the GDS algorithm.

8 CONCLUSION

What was discussed was the problem of clustering and the most popular algorithms of that particular field of science. The basic concepts needed to understand the functionality of the algorithms were discussed in chapter two. Chapter three provided a view of the most popular algorithms and their behavior. Chapter four completed the tour with a couple of special purpose algorithms. Chapter five included an implementation of four of the algorithms discussed, and chapter six presented a way to visualize the product of the algorithms with OpenGL. Finally, in chapter 7 the algorithms were run on the framework and different kinds of parameters and samples were considered. This paper was meant as an introduction to the algorithms, and another purpose was to create an efficient way to display the clusterings of individual data elements. If something remains to be done, a more thorough analysis of the behavior of the algorithms should be made. Especially cases where the samples are badly balanced or presented in some particular order should be generated and analyzed. The application could easily be complemented with a screenshot feature to automatically generate printable figures of the current display. It would also be easy to add a feature to read the sample data from a file. Finally, a complete set of the source code is available from the author by request. The appendix provides all commented key elements of the code. The code is provided for educational purposes. A complete listing would be unacceptably long to include in this document and would serve no purpose.

REFERENCES

Haykin, S. 1999. Neural Networks: A Comprehensive Foundation. Second Edition. Prentice Hall. 842 p.
Ilkka, S. 1997. Diskreetti Matematiikkaa. Fifth Edition. Otatieto. 165 p.
Kohonen, T. 1982. Self-Organized Formation of Topologically Correct Feature Maps. Biological Cybernetics, vol. 43, pp. 59-69.
Kreyszig, E. 1993. Advanced Engineering Mathematics. Seventh Edition. Wiley. 1271 p.
Theodoridis, S.; Koutroumbas, K. 1999. Pattern Recognition. Academic Press. 625 p.
Weiss, M. 1997. Data Structures and Algorithm Analysis in C. Second Edition. Addison Wesley. 511 p.

APPENDIX
/**
 * Cluster.h
 *
 * Provides a basic cluster class including sample vectors.
 *
 * Commercial use without written permission from the author is forbidden.
 * Other use is allowed provided that this original notice is included
 * in any form of distribution.
 * @author Jukka Kainulainen 2002 jkainula@cc.hut.fi
 */

/**
 * The definition of a single cluster. A cluster has a list of vectors
 * belonging to it. Vectors can be added or removed to/from a cluster.
 * Other classes can be derived from this class. For example those with
 * other kind of representatives. This class uses a medium representative.
 */
class CCluster {
protected:
    /** The vectors belonging to this cluster */
    VList Vectors;
    /** The medium representative */
    fVector3 Medium;
    /** The current outlier sample */
    fVector3 Outlier;
    /** The distance of the outlier from representative */
    float fOutlierDist;
    /** Is the outlier valid */
    bool bOutlierValid;
    /** Updates the representative value */
    virtual void UpdateMedium();
    /** Updates the outlier sample (the one farthest from the representative) */
    virtual void UpdateOutlier();
public:
    CCluster(void);
    /** Is this cluster empty? */
    bool IsEmpty() const;
    /** Add a vector to this cluster */
    virtual void AddVector(const fVector3& vec);
    /** Remove a vector from this cluster */
    virtual void RemoveVector(const fVector3& vec);
    /** Includes all vectors from the given cluster to this one also. */
    virtual void Include(CCluster& cluster);
    /** Returns a reference to the list of vectors */
    virtual const VList& GetVectors() const;
    /** Returns the representative vector */
    virtual fVector3 GetRepresentative() const { return Medium; }
    /** Returns the outlier vector */
    virtual fVector3 GetOutlier();
    /** Returns the outlier distance */
    virtual float GetOutlierDist();
    /** Returns a new cluster object. Override for different clusters. */
    virtual CCluster* GetNewCluster() { return new CCluster(); }
    /** Returns the squared distance to given vector. */
    virtual float Distance(const fVector3& vec);
    /** Returns the squared distance to another cluster */
    virtual float Distance(const CCluster* clust);
    /** Clears and deletes this cluster */
    virtual ~CCluster(void);
};

/** A list of cluster pointers */
typedef list<CCluster*> ClList;

// From Cluster.cpp
void CCluster::Include(CCluster& cluster) {
    VList::iterator i;
    for (i = cluster.Vectors.begin(); i != cluster.Vectors.end(); ++i) {
        Vectors.push_back(*i);
    }
    UpdateMedium();
    bOutlierValid = false;
    // The caller should remove the elements from the other one...
}

const VList& CCluster::GetVectors() const {
    return Vectors;
}

float CCluster::Distance(const fVector3& vec) {
    return Medium.SquaredDistance(vec);
}

float CCluster::Distance(const CCluster* clust) {
    return Medium.SquaredDistance(clust->GetRepresentative());
}


/**
 * ClustAlgorithm.h
 *
 * Commercial use without written permission from the author is forbidden.
 * Other use is allowed provided that this original notice is included
 * in any form of distribution.
 * @author Jukka Kainulainen 2002 jkainula@cc.hut.fi
 */
#pragma once
#include "Cluster.h"

/**
 * Provides a base class for all clustering algorithms.
 */
class CClustAlgorithm {
public:
    CClustAlgorithm() {};
    virtual void SetParameters(float theta, int q = 0) = 0;
    /**
     * Creates a clustering from vectors using the given empty cluster class.
     * You can give different kinds of cluster subclasses as parameter.
     */
    virtual ClList* Clusterize(const VList* vectors, CCluster* empty) const = 0;
    /** Returns the number of clusters searched for, 0 if not in use */
    virtual int GetClusters() { return 0; }
    /** Returns the theta parameter used, 0 if not in use */
    virtual float GetTheta() { return 0.f; }
    virtual float GetTheta2() { return 0.f; }
    /** Returns the name of the algorithm */
    virtual const char* GetName() const = 0;
    virtual ~CClustAlgorithm(void) {};
};

// From MBSAS.cpp
ClList* CMBSAS::Clusterize(const VList* vectors, CCluster* empty) const {
    int numVectors = 0;
    int clusters = 0;
    fVector3 tmp;
    if ((vectors == NULL) || (empty == NULL))
        return NULL;
    if ((int)vectors->size() < 1)
        return NULL;
    // This is not the optimum way....
    VList tmplist = *vectors;
    ClList* ClusterList = new ClList();
    VList::iterator iter;
    ClList::iterator iter2;
    // Fill with the initial vector
    iter = tmplist.begin();
    empty->AddVector(*iter);
    iter++;
    ClusterList->push_back(empty);
    // 'Create the clusters' pass
    for (; iter != tmplist.end(); iter++) {
        tmp = *iter;
        float mindist = FLT_MAX;
        // Find the minimum distance
        for (iter2 = ClusterList->begin(); iter2 != ClusterList->end(); iter2++) {
            float dist = (*iter2)->Distance(tmp);
            if (dist < mindist)
                mindist = dist;


        }
        // Create a new cluster?
        if ((mindist > fTheta) && ((int)ClusterList->size() < iq)) {
            CCluster* newclust = empty->GetNewCluster();
            newclust->AddVector(tmp);
            ClusterList->push_back(newclust);
        }
    }
    // Now we have to remove the already taken samples...
    for (iter2 = ClusterList->begin(); iter2 != ClusterList->end(); iter2++) {
        tmp = (*iter2)->GetRepresentative(); // Representative is the only one
        tmplist.remove(tmp);
    }
    // And then we classify the rest...
    for (iter = tmplist.begin(); iter != tmplist.end(); iter++) {
        tmp = *iter;
        float mindist = FLT_MAX;
        CCluster* minclust = NULL;
        // Find the minimum distance cluster...
        for (iter2 = ClusterList->begin(); iter2 != ClusterList->end(); iter2++) {
            float dist = (*iter2)->Distance(tmp);
            if (dist < mindist) {
                mindist = dist;
                minclust = *iter2;
            }
        }
        minclust->AddVector(tmp); // ...and add to it
    }
    return ClusterList;
}

// From TTSAS.cpp...
ClList* CTTSAS::Clusterize(const VList* vectors, CCluster* empty) const {
    if ((vectors == NULL) || (empty == NULL))
        return NULL;
    if ((int)vectors->size() < 1)
        return NULL;
    // We'll do this the old way...
    bool* clas = new bool[vectors->size()];
    fVector3* tmplist = new fVector3[vectors->size()];
    VList::const_iterator iter;
    int i = 0;
    for (iter = vectors->begin(); iter != vectors->end(); iter++, i++) {
        tmplist[i] = *iter;
        clas[i] = false;
    }
    ClList* ClusterList = new ClList();
    int numVectors = (int)vectors->size();
    int numDone = 0;      // Number of classified samples
    int existsChange = 0; // Classified something new during last pass
    int curChange = 0;    // Current number of classified samples
    int prevChange = 0;   // Number of classified samples during last pass
    float mindist = FLT_MAX;
    CCluster* minclust = NULL;
    ClList::iterator iter2;
    while (numDone < numVectors) {
        bool gotOne = false;
        for (i = 0; i < numVectors; i++) {
            if (!clas[i] && existsChange == 0 && !gotOne) {
                // Let's make sure the while ends at some point :)
                CCluster* clust = empty->GetNewCluster();
                clust->AddVector(tmplist[i]);
                ClusterList->push_back(clust);
                clas[i] = true;
                curChange++;
                numDone++;
                gotOne = true;
            } else if (clas[i] == 0) {
                mindist = FLT_MAX;
                minclust = NULL;
                // Find the minimum distance cluster...
                for (iter2 = ClusterList->begin(); iter2 != ClusterList->end(); iter2++) {


                    float dist = (*iter2)->Distance(tmplist[i]);
                    if (dist < mindist) {
                        mindist = dist;
                        minclust = *iter2;
                    }
                }
                if (mindist < fTheta1) {
                    // found the same kind
                    minclust->AddVector(tmplist[i]);
                    clas[i] = true;
                    curChange++;
                    numDone++;
                } else if (mindist > fTheta2) {
                    // need to create a new one
                    CCluster* clust = empty->GetNewCluster();
                    clust->AddVector(tmplist[i]);
                    ClusterList->push_back(clust);
                    clas[i] = true;
                    curChange++;
                    numDone++;
                }
            } else // clas == 1
                curChange++;
        }
        existsChange = abs(curChange - prevChange);
        prevChange = curChange;
        curChange = 0;
    }
    delete empty;
    delete[] clas;
    delete[] tmplist;
    return ClusterList;
}

// From GDS.cpp...
ClList* CGDS::Clusterize(const VList* vectors, CCluster* empty) const {
    if ((vectors == NULL) || (empty == NULL))
        return NULL;
    if ((int)vectors->size() < 1)
        return NULL;
    // Create the initial clustering...
    ClList* ClusterList = new ClList();
    VList::const_iterator iter;
    CCluster* tmp = empty->GetNewCluster();
    for (iter = vectors->begin(); iter != vectors->end(); iter++)
        tmp->AddVector(*iter);
    ClusterList->push_back(tmp);
    float maxdist, maxdist2;
    CCluster* maxclust;
    ClList::iterator iter2;
    while ((int)ClusterList->size() < iq) {
        maxdist = 0;
        maxclust = NULL;
        // Find the cluster that has maximal outlier element...
        for (iter2 = ClusterList->begin(); iter2 != ClusterList->end(); iter2++)
            if ((*iter2)->GetOutlierDist() > maxdist) {
                maxdist = (*iter2)->GetOutlierDist();
                maxclust = *iter2;
            }
        // Move the outlier to a new cluster
        CCluster* newclust = empty->GetNewCluster();
        newclust->AddVector(maxclust->GetOutlier());
        maxclust->RemoveVector(maxclust->GetOutlier());
        ClusterList->push_back(newclust);
        bool foundOne = true;
        // While we found a vector more similar to the new cluster...
        while (foundOne) {
            foundOne = false; // Let's see if we find any
            fVector3 vect(0, 0, 0);
            maxdist = FLT_MAX;
            // Go through all the samples in the old cluster
            for (iter = maxclust->GetVectors().begin(); iter != maxclust->GetVectors().end(); iter++) {
                maxdist2 = newclust->Distance(*iter); // Dist. to the new clust.
                if (maxclust->Distance(*iter) > maxdist2 && maxdist2 < maxdist) {
                    foundOne = true;
                    maxdist = maxdist2; // The closest one to the new cluster
                    vect = *iter;
                }


            }
            if (foundOne) { // We did find one sample?
                newclust->AddVector(vect);
                maxclust->RemoveVector(vect);
            }
        }
    }
    delete empty;
    return ClusterList;
}

// From GAS.cpp...
ClList* CGAS::Clusterize(const VList* vectors, CCluster* empty) const {
    if ((vectors == NULL) || (empty == NULL))
        return NULL;
    if ((int)vectors->size() < 1)
        return NULL;
    // Create the initial clustering...
    ClList* ClusterList = new ClList();
    VList::const_iterator iter;
    for (iter = vectors->begin(); iter != vectors->end(); iter++) {
        CCluster* tmp = empty->GetNewCluster();
        tmp->AddVector(*iter);
        ClusterList->push_back(tmp);
    }
    ClList::iterator iter2;
    ClList::iterator iter3;
    float mindist;
    CCluster* minclust1;
    CCluster* minclust2;
    while ((int)ClusterList->size() > iq) {
        mindist = FLT_MAX;
        minclust1 = NULL;
        minclust2 = NULL;
        // Seek the two clusters that have min distance (slow)...
        for (iter2 = ClusterList->begin(); iter2 != ClusterList->end(); iter2++) {
            iter2++; // Dummy inc
            for (iter3 = iter2--; iter3 != ClusterList->end(); iter3++) {
                float dist = (*iter2)->Distance(*iter3);
                if (dist < mindist) {
                    mindist = dist;
                    minclust1 = *iter2;
                    minclust2 = *iter3;
                }
            }
        }
        // ...and combine them
        if (minclust2 != NULL) {
            minclust1->Include(*minclust2);
            ClusterList->remove(minclust2);
            delete minclust2;
        }
    }
    delete empty;
    return ClusterList;
}

// From GLRenderer.cpp
void CGLRenderer::DrawVolumes(const ClList* list) {
    int c = 0;
    glColor4f(0.5f, 0.5f, 0.5f, 0.4f);
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    for (ClList::const_iterator iter = list->begin(); iter != list->end(); iter++, c += 3) {
        glPushMatrix();
        float dist = (*iter)->GetOutlierDist();
        fVector3 out = (*iter)->GetRepresentative();
        glTranslatef(out.x, out.y, out.z);
        glutWireSphere(sqrt(dist), 8, 8);
        glutSolidSphere(sqrt(dist), 8, 8);
        glPopMatrix();
    }
    glDisable(GL_BLEND);
}


void CGLRenderer::RotateCamera(float xrot, float yrot) {
    // Around the up vector...
    vdir = vdir.Rotate(vup, 3.1415f * yrot / 180.f);
    vleft = vleft.Rotate(vup, 3.1415f * yrot / 180.f);
    vup = vdir.Cross(vleft); // Just to make sure we don't get messed up
    // Around the "left" vector...
    vup = vup.Rotate(vleft, 3.1415f * xrot / 180.f);
    vdir = vdir.Rotate(vleft, 3.1415f * xrot / 180.f);
    vleft = vup.Cross(vdir);
    vdir.Normalize();
    vleft.Normalize();
    vup.Normalize();
}

