
CLUSTERING

Clustering involves grouping data points together according to some measure of similarity. One goal of clustering is to extract trends and information from raw data sets. An alternative goal is to develop a compact representation of a data set by creating a set of models that represent it [1]. Two general types of clustering are used: supervised and unsupervised.

Supervised clustering uses a set of example data to classify the rest of the data set. This is also called classification: the task is to learn to assign instances to pre-defined classes [2]. For example, consider a set of colored balls (all colors) that you want to classify into three groups: red, green, and blue. A logical way to do this is to pick out one example of each class (a red ball, a green ball, and a blue ball) and set each next to a bucket. Then go through the remaining balls, compare each ball to the three examples, and put it in the bucket whose example it matches best.

This example is illustrative because it exposes two potential problems. First, the result depends on the balls you select as examples. If you were to select a red, an orange, and a blue ball, it might be difficult to classify a green ball. Second, unless you select examples carefully, they may not represent the distribution of the data. For example, you might select red, green, and blue balls, only to discover that most of the colored balls are cyan, purple, and magenta (which lie between the three primary colors). This shows the importance of selecting representative samples in supervised clustering.

Unsupervised clustering, on the other hand, tries to discover the natural groupings inside a data set without any input from a trainer. The main input a typical unsupervised clustering algorithm takes is the number of classes it should find. In the colored balls case, this would be like dumping the balls into an automatic sorting machine and telling it to create three piles. The goal is to create three piles where the balls within each pile are very similar, but the piles differ from one another. No pre-defined classification is required; the task is to learn a classification from the data.

One of the most important characteristics of any supervised or unsupervised clustering process is how the similarity of two data points is measured. Clustering algorithms divide a data set into natural groups (clusters). Instances in the same cluster are similar to each other: they share certain properties. Clustering algorithms can have different properties [2]:

Hierarchical: These methods include those techniques where the input data are not partitioned into the desired number of classes in a single step. Instead, a series of successive fusions of data are performed until the final number of clusters is obtained [3].

Non-hierarchical or iterative: These methods include those techniques in which a desired number of clusters is assumed at the start. Instances are reassigned to clusters to improve them.

Hard and soft: Hard clustering assigns each instance to exactly one cluster. Soft clustering assigns each instance a probability of belonging to a cluster.

Disjunctive: Instances can be part of more than one cluster.

The figure below illustrates these properties of clustering.

Figure 1 Illustration of properties of clustering

Unsupervised Clustering:
One of the most commonly used unsupervised clustering algorithms is the K-means algorithm. The algorithm is as follows.

Specify k, the number of clusters

Choose k points randomly as cluster centers

Assign each instance to its closest cluster center using Euclidean distance

Calculate the centroid (mean) of each cluster and use it as the new cluster center

Reassign all instances to the closest cluster center

Iterate until the cluster centers no longer change
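The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, the toy data, and the rule that an empty cluster keeps its previous center are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Choose k distinct data points at random as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assign each instance to its closest center (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Recompute each center as the mean of the instances assigned to it.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[c])
        # Stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Toy data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
        (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, labels = kmeans(data, k=2)
```

Because the initial centers are drawn at random, changing `seed` can change which label each group receives and, on less clearly separated data, the final centers themselves.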

The figure below illustrates the concept of K-means clustering.

Figure 2: Illustration of the K-means algorithm [4]

A demo of the K-means algorithm is shown below. The pictures depict the change of the centers for 4 clusters over 4 iterations.

[Figure: K-means demo, one panel per iteration]

After the fourth iteration, the centers no longer move much, and hence they are fixed at this position. The K-means algorithm has two disadvantages: the number of clusters must be specified in advance, and different sets of initial random centers lead to different final cluster centers.

SUPERVISED CLUSTERING ALGORITHMS:

In this section four different types of supervised clustering algorithms are presented: vector quantization, fuzzy clustering, artificial neural networks, and fuzzy-neural algorithms. Although fuzzy and neural-network methods initially go through unsupervised clustering to determine the cluster centers, only the supervised clustering algorithms are discussed here.

VECTOR QUANTIZATION:

This algorithm originates in Shannon's source coding theory, which is used for the transmission and encoding of data. A vector quantizer maps k-dimensional vectors in the vector space $R^k$ into a finite set of vectors $Y = \{y_i : i = 1, 2, \ldots, N\}$ [5]. Each vector $y_i$ is called a code vector or codeword, and the set of all codewords is called a codebook. Associated with each codeword $y_i$ is a nearest-neighbor region called the Voronoi region, defined by:

$$V_i = \{x \in R^k : \|x - y_i\| \le \|x - y_j\| \text{ for all } j \ne i\}$$

The set of Voronoi regions partitions the entire space $R^k$ such that:

$$\bigcup_{i=1}^{N} V_i = R^k, \qquad V_i \cap V_j = \emptyset \quad \text{for all } i \ne j$$
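Mapping an input vector to the codeword whose Voronoi region contains it amounts to a nearest-neighbor search over the codebook. The sketch below assumes Euclidean distance as the nearest-neighbor measure; the function name `quantize` and the four-entry codebook are hypothetical and chosen only for illustration.

```python
import math


def quantize(x, codebook):
    """Return (index, codeword) of the codeword nearest to x (Euclidean distance)."""
    index = min(range(len(codebook)), key=lambda i: math.dist(x, codebook[i]))
    return index, codebook[index]


# Hypothetical 2-D codebook with four codewords.
codebook = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)]
index, codeword = quantize((3.1, 0.7), codebook)
```

Only the index needs to be stored or transmitted; the receiver recovers the codeword from its own copy of the codebook.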

As an example, consider vectors in the two-dimensional case. Figure 3 shows some vectors in space. Associated with each cluster of vectors is a representative codeword (a cluster center or cluster representative obtained by the k-means algorithm or a similar algorithm). Each codeword resides in its own Voronoi region. These regions are separated by imaginary lines in Figure 3 for illustration. Given an input vector, the codeword chosen to represent it is the one in the same Voronoi region.

Figure 3: Vector quantization illustration in 2-D space, showing the Voronoi regions formed by imaginary lines

The representative codeword (cluster center) is the one closest in Euclidean distance to the input vector (instance). The Euclidean distance is defined by:

$$d(x, y_i) = \sqrt{\sum_{j=1}^{k} (x_j - y_{ij})^2}$$

where $x_j$ is the jth component of the input vector and $y_{ij}$ is the jth component of the codeword $y_i$.

FUZZY SUPERVISED CLUSTERING:

Fuzzy logic is becoming popular in the field of automatic control. Fuzzy logic requires no analytical model of the system, and offers the chance to combine heuristic knowledge with any model knowledge that may be available [6]. Fuzzy logic can also deal with vague or imprecise data. In the field of fault diagnosis, fuzzy logic has been used successfully in many applications, both as a means of residual generation and to aid the decision-making process of residual evaluation.

The idea behind fuzzy clustering is basically that of pattern recognition. Training data are used off-line to determine relevant cluster centers for each of the faults of interest. On-line, the degree to which the current data belong to each of the pre-defined clusters is determined, which results in a degree of membership to each of the pre-determined faults. This method is useful in cases where there are many residuals, or in which no expert knowledge of the system is available.

Fuzzy clustering is different from fuzzy reasoning, which is also used in residual analysis. Fuzzy reasoning mainly consists of IF-THEN reasoning based on the sign of the residual. Example of fuzzy reasoning:

IF residual1 is positive and residual2 is negative THEN fault1 is present
IF residual1 is zero and residual2 is zero THEN system is fault free

And so on.

Clustering is the allocation of data points to a certain number of classes. Each class is represented by a cluster center, or prototype, which can be considered the point that best represents the data points in the cluster. The idea behind fuzzy clustering is that each data point belongs to all classes with a certain degree of membership. The degree to which a data point belongs to a certain class depends on its distance to all cluster centers. For fault diagnosis, each class could correspond to a particular fault. The general principle is shown for three inputs and three clusters in Figure 4.

Figure 4: Fuzzy clustering concept showing the cluster centers and the membership grade of a data point

The fuzzy clustering fault isolation procedure consists of the following two steps:

Off-line phase: This is a learning phase which consists of determining the characteristics (i.e. cluster centers) of the classes. A learning data set is necessary for this phase, and it must contain residuals for all known faults. (For more details on the origin of the idea of fuzzy clustering, refer to [7].)

On-line phase: This phase calculates the membership degree of the current residuals to each of the known classes. In this way each data point does not belong to only one cluster; its membership is distributed among all clusters according to how closely its features resemble each cluster center [8].

It is important that the training data contain all faults of interest; otherwise they cannot be isolated on-line, though unknown faults can in some cases be detected. The fuzzy membership matrix and the cluster centers are computed by minimizing the following partition functional:

$$J_f(C, m) = \sum_{i=1}^{C} \sum_{k=1}^{N} (u_{i,k})^m \, (d_{k,i})^2 \quad \text{subject to} \quad \sum_{i=1}^{C} u_{i,k} = 1 \qquad (1)$$

where $C$ denotes the number of clusters, $N$ the number of data points, $u_{i,k}$ the fuzzy membership of the k-th point to the i-th cluster, $d_{k,i}$ the Euclidean distance between the data point and the cluster center, and $m \in (1, \infty)$ a fuzzy weighting factor which defines the degree of fuzziness of the results. The data classes become more fuzzy and less discriminating with increasing m. In general, m = 2 is chosen (although this value of m does not produce the optimal solution for all problems). The constraint in eq. (1) implies that each point must distribute its entire membership among all the clusters. The cluster centers (centroids or prototypes) are defined as the fuzzy-weighted center of gravity of the data x,

$$v_i = \frac{\sum_{k=1}^{N} (u_{i,k})^m \, x_k}{\sum_{k=1}^{N} (u_{i,k})^m}, \qquad i = 1, 2, \ldots, C \qquad (2)$$

Since $u_{i,k}$ affects the computation of the cluster center $v_i$, data with a high membership influence the prototype location more than points with a low membership. For the fuzzy C-means algorithm, the distance $d_{k,i}$ is defined as follows:

$$(d_{k,i})^2 = \|x_k - v_i\|^2 \qquad (3)$$

The cluster centers $v_i$ represent the typical values of each cluster, whereas the $u_{i,k}$ component of the membership matrix denotes the extent to which the data point $x_k$ is similar to the prototype $v_i$. Minimizing the partition functional (1) gives the following expression for the membership:

$$u_{i,k} = \frac{1}{\sum_{j=1}^{C} \left( \frac{d_{k,i}}{d_{k,j}} \right)^{\frac{2}{m-1}}} \qquad (4)$$

Equation (4) is evaluated iteratively, since the distance $d_{k,i}$ depends on the cluster centers, which in turn depend on the memberships $u_{i,k}$. The fuzzy C-means procedure is:

1. Choose the number of classes C, $2 \le C < N$; choose m, $1 < m < \infty$. Initialise $U^{(0)}$.
2. Calculate the cluster centers $v_i$ using Eq. 2.
3. Calculate the new partition matrix $U^{(1)}$ using Eq. 4.
4. Compare $U^{(j)}$ and $U^{(j+1)}$. If the variation of the membership degree $u_{i,k}$, calculated with an appropriate norm, is smaller than a given threshold, stop the algorithm; otherwise go back to step 2.

The determination of the cluster centers is then complete. On-line, the U matrix is calculated for each data point. The elements of the U matrix then give the degree to which the current data correspond to each of the fault classes.

The choice between fuzzy reasoning and fuzzy clustering depends on the system and on the availability of expert knowledge. If expert knowledge of the system is available, fuzzy reasoning can be used; otherwise it is better to use the fuzzy clustering method.
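The four-step fuzzy C-means procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `fuzzy_c_means`, the random initialisation of U, the maximum-absolute-change stopping norm, the `tol`, `max_iter`, and `seed` parameters, and the small floor on distances are choices made for this example and are not taken from the cited references.

```python
import math
import random


def fuzzy_c_means(points, C, m=2.0, tol=1e-5, max_iter=200, seed=0):
    """Fuzzy C-means on a list of tuples; returns (centers, U) with U[i][k]."""
    rng = random.Random(seed)
    N, dim = len(points), len(points[0])
    # Step 1: random initial partition matrix whose columns sum to 1.
    U = [[rng.random() for _ in range(N)] for _ in range(C)]
    for k in range(N):
        col = sum(U[i][k] for i in range(C))
        for i in range(C):
            U[i][k] /= col
    centers = []
    for _ in range(max_iter):
        # Step 2 (Eq. 2): centers as membership-weighted means of the data.
        centers = []
        for i in range(C):
            w = [U[i][k] ** m for k in range(N)]
            centers.append(tuple(
                sum(w[k] * points[k][d] for k in range(N)) / sum(w)
                for d in range(dim)))
        # Step 3 (Eq. 4): new memberships from the distances to all centers.
        new_U = [[0.0] * N for _ in range(C)]
        for k in range(N):
            # Assumption: a tiny floor avoids division by zero when a point
            # coincides with a center.
            dist = [max(math.dist(points[k], centers[i]), 1e-12)
                    for i in range(C)]
            for i in range(C):
                new_U[i][k] = 1.0 / sum(
                    (dist[i] / dist[j]) ** (2.0 / (m - 1.0)) for j in range(C))
        # Step 4: stop when the largest membership change is below tol.
        change = max(abs(new_U[i][k] - U[i][k])
                     for i in range(C) for k in range(N))
        U = new_U
        if change < tol:
            break
    return centers, U


# Toy data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
        (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, U = fuzzy_c_means(data, C=2)
```

For the on-line phase, only the step-3 computation is needed: with the centers fixed, Eq. 4 gives the membership of a new data point in each class.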

Figure 5: MATLAB Fuzzy Logic Toolbox demo of fuzzy C-means clustering for 4 clusters

ARTIFICIAL NEURAL NET CLUSTERING:

Before discussing the supervised clustering technique in neural nets, the basics of artificial neural networks are discussed. An artificial neural network is a system loosely modeled on the human brain [9]. It is an attempt to simulate, within specialized hardware or sophisticated software, multiple layers of simple processing elements called neurons. Each neuron is linked to certain of its neighbors with varying coefficients of connectivity that represent the strengths of these connections. Learning is accomplished by adjusting these strengths so that the overall network outputs appropriate results.

The most basic components of neural networks are modeled after the structure of the brain. The most basic element of the human brain is a specific type of cell, which provides us with the abilities to remember, think, and apply previous experiences to our every action. These cells are known as neurons; each of them can connect with up to 200000 other neurons. The power of the brain comes from the number of these basic components and the multiple connections between them. All natural neurons have four basic components: dendrites, soma, axon, and synapses. Basically, a biological neuron receives inputs from other sources, combines them in some way, performs a generally nonlinear operation on the result, and then outputs the final result. The figure below shows a simplified biological neuron and the relationship of its four components.

Figure 6: Four main parts of human nerve cells, on which artificial neurons are based

The basic unit of neural networks, the artificial neuron, simulates the four basic functions of natural neurons. Artificial neurons are much simpler than biological neurons; the figure below shows the basics of an artificial neuron.

Figure 7: Structure of an artificial neuron with Hebbian learning ability (weights are adjustable)

D. Hebb postulated a principle for a learning process at the cellular level (Hebb, 1949): if Neuron A is stimulated repeatedly by Neuron B at times when Neuron A is active, then Neuron A will become more sensitive to stimuli from Neuron B (the correlation principle) [10]. It implicitly involves adjustments of the strengths of the synaptic inputs, which led to the incorporation of adjustable synaptic weights on the input lines to excite or inhibit incoming signals.

An input vector $x = (x_1, \ldots, x_N)$, considered to be a column vector, is linearly combined with the weight vector $w = (w_1, \ldots, w_N)$ via the inner (dot) product to form the sum

$$s = \sum_{n=1}^{N} w_n x_n = w^T x$$

If the sum s is greater than the given threshold b, then the output y is 1; otherwise it is 0. The function which gives the output value is called the activation function. The figure below shows some activation functions.

Figure 8: Some common activation functions

Activation functions as in (a) and (b) give binary output (0 or 1, or +1 or -1), whereas the functions in (c) and (d) give non-binary output (the output value varies anywhere between 0 and 1, or between -1 and +1). A function is called unipolar if its output range is 0 to 1, and bipolar if its output range is -1 to +1. The basic artificial neuron unit shown in Figure 7 is called a perceptron.
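The four kinds of activation function distinguished above (binary or continuous, unipolar or bipolar) might look as follows in Python. Figure 8 is not reproduced here, so the specific choices below (hard thresholds, the logistic sigmoid, and the hyperbolic tangent) are assumptions about typical representatives rather than the exact curves of the figure. The argument `s` stands for the weighted sum with the threshold already subtracted.

```python
import math


def step_unipolar(s):
    """Binary output in {0, 1}."""
    return 1.0 if s >= 0 else 0.0


def step_bipolar(s):
    """Binary output in {-1, +1}."""
    return 1.0 if s >= 0 else -1.0


def sigmoid_unipolar(s):
    """Logistic sigmoid: continuous output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))


def sigmoid_bipolar(s):
    """Hyperbolic tangent: continuous output in (-1, +1)."""
    return math.tanh(s)


# Evaluate all four functions below, at, and above the threshold.
for s in (-2.0, 0.0, 2.0):
    print(s, step_unipolar(s), step_bipolar(s),
          round(sigmoid_unipolar(s), 3), round(sigmoid_bipolar(s), 3))
```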

The architecture of a network that consists of a layer of M perceptrons is shown in Figure 9. An input feature vector $x = (x_1, \ldots, x_N)$ is fed into the network via the set of N branching nodes. The lines fan out at the branching nodes so that each perceptron receives an input from each component of x. At each neuron, the lines fan in from all of the input (branching) nodes. Each incoming line is weighted with a synaptic coefficient (weight parameter) from the set $\{w_{nm}\}$, where $w_{nm}$ weights the line from the nth component $x_n$ coming into the mth perceptron.

Figure 9: One-layer perceptron network with N inputs and M perceptrons

The Perceptron as Hyperplane Separator:
Consider a perceptron as shown in Figure 7. The input vector $x = (x_1, \ldots, x_N)$ is linearly combined with the weights to obtain

$$s = w_1 x_1 + \cdots + w_N x_N - b$$

where b is the threshold. Then s is activated by a threshold function T(·) to produce the output y = T(s) = 1 when s >= 0, else y = T(s) = -1. The set of all input vectors x such that

$$s = w_1 x_1 + \cdots + w_N x_N - b = 0$$

forms a hyperplane H in the input vector space. H partitions the feature vector space into right and left half-spaces H+ and H-.

An example: consider a single perceptron with two inputs. Let w1 = 2, w2 = -1, and b = 0; then 2x1 - x2 = 0 determines H. The points (0,0) and (1,2) belong to H. The feature vector x = (x1,x2) = (2,3) is summed into s = 2(2) - 1(3) = 1 > 0, so that the activated output is y = T(1) = 1 (which corresponds to H+, i.e. the right half of the plane). The vector (x1,x2) = (0,2) activates the output y = T(2(0) - 1(2)) = T(-2) = -1, which indicates that (0,2) is in the left half-space H-. The figure below shows these points.
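The worked example can be checked with a short script. The function name `perceptron` is chosen for this sketch; the weights, threshold, and test points are the ones used in the example above.

```python
def perceptron(x, w, b):
    """Bipolar threshold unit: +1 if w.x - b >= 0 (on H or in H+), else -1 (H-)."""
    s = sum(wn * xn for wn, xn in zip(w, x)) - b
    return 1 if s >= 0 else -1


w, b = (2, -1), 0
for x in [(2, 3), (0, 2), (0, 0), (1, 2)]:
    # Print the weighted sum next to the thresholded output for each point.
    s = sum(wn * xn for wn, xn in zip(w, x)) - b
    print(x, "s =", s, "y =", perceptron(x, w, b))
```

Points lying exactly on H give s = 0 and, with the convention y = 1 when s >= 0, are assigned to the H+ side.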

Figure 10: Illustration of the half-spaces H+ and H- defined by the hyperplane

The above example is a simple linear mapping between the input and the output. Now consider another example, which illustrates how a non-linear relation between input and output is implemented. Consider the XOR logic function, or 2-bit parity problem, with N = 2 inputs, M = 1 output, Q = 4 sample vector (input/output) pairs for training, and K = 2 clusters (even and odd).

The table below shows the mapping of input and output for this 2-bit parity data.

Table 1: Logic for 2-bit parity data

However, we see from Figure 11 below that a single hyperplane cannot separate the four feature vectors into the required 2 classes, no matter how it is oriented (rotated and translated) by the weights.

Figure 11: Hyperplane diagram for 2-bit parity data, showing that one hyperplane is not sufficient to separate the data into two clusters

The power of a single neuron can be greatly amplified by using multiple neurons in a network with a layered connectionist architecture, as displayed in Figure 12 below. Such a multiple-layered perceptron (MLP) is also called a feed-forward artificial neural network, abbreviated FANN. The modifier "feed-forward" distinguishes these networks from feedback (recursive) networks. On the left is the layer of input, or branching, nodes, which are not artificial neurons. The hidden layer (the middle layer here) contains neural nodes, as does the output layer on the right. This is the architecture of a two-layered NN (so called because there are two layers of neuronal units).

Figure 12: A typical two-layered network, where the middle layer introduces the required non-linearity between the input and output layers

Neural networks may also have multiple hidden layers for the sake of extra power in learning to separate nonlinearly separable classes. The Hornik-Stinchcombe-White theorem states that a layered artificial neural network with two layers of neurons is sufficient to approximate as closely as desired any piecewise continuous map of a closed bounded subset of a finite-dimensional space into another finite-dimensional space, provided there are sufficiently many neurons in the single hidden layer. There is thus no theoretical need to use more than two layers of neurons; doing so would increase the computational complexity and instability in training, and slow down operation because the extra layers cause delays in processing (the idea is that the neurons in a single layer process in parallel, while the different layers process sequentially). But extra layers can avoid the need for an excessive number of neurons in a single hidden layer to achieve highly nonlinear classification.

Consider the same XOR implementation using the two-layered network shown in the figure below:

Figure 13: A two-layered network for XOR logic implementation

The result is two parallel hyperplanes that yield three convex regions. The hyperplanes are determined by

The threshold at the first neuron in the hidden layer yields

The threshold at the second hidden neuron yields

This forces the results listed in Table 2, where we use 0.1 for 0 and 0.9 for 1 (this is the usual procedure in using neural networks, because 0 and 1 have special properties that inhibit gradient training). The four sets of above outputs yield the three unique vectors (y1,y2) = (0,1), (y1,y2) = (1,1), and (y1,y2) = (0,0) that identify the three linearly separable regions shown in Figure 14. We see from the figure that Regions 1 and 3 make up the odd parity (Class 2), while Region 2 is even parity (Class 1). We saw in the previous example that a network of a single layer cannot output the two correct classes, no matter how we orient the hyperplanes via translation and rotation. In all cases of non-coincident hyperplanes, we obtain three or four convex regions (the lower and upper bounds, respectively).

Table 2: Hidden layer mapping for the 2-bit parity function

To show that the network with a second layer of perceptrons can learn the nonlinearly separable classes of even and odd parity (XOR logic), we take the new weights at the single output neuron to be those given in Figure 13. These weight the lines on which y1 and y2 enter the output neuron (perceptron). Using the hyperplane defined by these weights, we need to map y = (1,1) and y = (0,0) into the same class, Class 1, as shown in Figure 14 below.

Figure 14: The partitioning of the 2-bit parity feature space with two perceptron layers

This is done by choosing the weights (u) as above together with a suitable threshold. The result is shown in Table 3.
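The two-layer construction can be illustrated with a small script. The weights and thresholds below are hypothetical values picked for this sketch (hidden hyperplanes x1 + x2 = 1.5 and x1 + x2 = 0.5, output hyperplane y2 - y1 = 0.5); they are not the values of Figure 13, which are not reproduced here. They do yield the same three hidden vectors (0,0), (0,1), and (1,1) discussed above.

```python
def step(s):
    """Unipolar threshold: 1 if s >= 0, else 0."""
    return 1 if s >= 0 else 0


def xor_net(x1, x2):
    # Hidden layer: two parallel hyperplanes x1 + x2 = 1.5 and x1 + x2 = 0.5
    # (hypothetical weights and thresholds chosen for this sketch).
    y1 = step(x1 + x2 - 1.5)
    y2 = step(x1 + x2 - 0.5)
    # Output neuron: one hyperplane in (y1, y2) space, y2 - y1 = 0.5.
    z = step(y2 - y1 - 0.5)
    return (y1, y2), z


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    hidden, parity = xor_net(x1, x2)
    print((x1, x2), "hidden =", hidden, "odd parity =", parity)
```

In the hidden-layer space the hidden vectors (0,0) and (1,1) fall on one side of the output hyperplane and (0,1) on the other, so a single output perceptron is enough to separate the two parity classes.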

Table 3: The 2-bit parity mapping by two layers of perceptrons

There are many different kinds of learning rules used by neural networks. The most common class of ANNs is called backpropagational neural networks (BPNNs) [11]. Backpropagation is an abbreviation for the backwards propagation of error. Here learning is a supervised process that occurs with each cycle or epoch (i.e. each time the network is presented with a new input pattern). It consists of a forward activation, in which inputs and outputs of the neurons flow through the network, and a backward weight adjustment scheme based on the calculated error. More simply, when a neural network is initially presented with a pattern, it makes a random guess as to what it might be. It then sees how far its answer was from the actual one and makes an appropriate adjustment to its connection weights.

Backpropagation performs a gradient descent within the weight space towards a global minimum. The global minimum is the theoretical solution with the lowest possible error. In most problems, the solution space is quite irregular, with numerous pits and hills which may cause the network to settle in a local minimum that is not the best overall solution. This idea is depicted in the figure below.

Figure 15: The weights versus error space. For clarity this graph is drawn in two dimensions; however, often there are many weights, say n, and the graph would be in n+1 dimensions.

Since the nature of the error versus weights space cannot be known a priori, one has to run several neural network analyses with different parameters to determine the best solution. The speed of learning can be controlled by the learning rate. Another parameter, momentum, helps the network to overcome obstacles (local minima) in the error surface and settle at or near the global minimum. The issue of when to stop the training is non-trivial. Training should not necessarily proceed to the global minimum: this point is by definition optimal for the training set, but that may not be the case for an independent data set.

The math and algorithm are as follows [12]. The main objective in neural model development is to find an optimal set of weight parameters w, such that y = y(x, w) closely represents (approximates) the original problem behavior. This is achieved through a process called training (that is, optimization in w-space). A set of training data is presented to the neural network. The data are pairs $(x_k, d_k)$, k = 1, 2, ..., P, where $d_k$ is the desired output of the neural model for input $x_k$ and P is the total number of training samples. During training, the neural network performance is evaluated by computing the difference between actual network outputs and desired outputs for all the training samples. The difference, also known as the error, is quantified by

--------(1)

where $d_{jk}$ is the jth element of $d_k$, $y_j(x_k, w)$ is the jth neural network output for the input $x_k$, and Tr is an index set of training data. The weight parameters w are adjusted during training such that this error is minimized.

Training Process:

The first step in training is to initialize the weight parameters w; small random values are usually suggested. During training, w is updated along the negative direction of the gradient of E, as $w = w - \eta \, \partial E / \partial w$, until E becomes small enough. Here, the parameter $\eta$ is called the learning rate. If we use just one training sample at a time to update w, then a per-sample error function $E_k$ given by

----(2)

is used, and w is updated as $w = w - \eta \, \partial E_k / \partial w$. The following sub-section describes how the error back propagation process can be used to compute the gradient information $\partial E_k / \partial w$.

Error Back Propagation:

Using the definition of $E_k$ in (2), the derivative of $E_k$ with respect to the weight parameters of the lth layer can be computed by simple differentiation as

------(3) and

-------(4)

The gradient $\partial E_k / \partial z_i^L$ can be initialized at the output layer as

-----(5)

using the error between neural network outputs and desired outputs (training data).

Subsequent derivatives $\partial E_k / \partial z_i^l$ are computed by back-propagating this error from the (l+1)th layer to the lth layer (see the figure below) as

-------(6)

Figure 16: Relationship between ith neuron of the lth layer, with neurons of layer l-1 and l+1

For example, if the MLP uses the sigmoid as hidden neuron activation function,

-------(7)

--------(8) and

--------(9)

For the same MLP network, let $\delta_i^l$ be defined as $\delta_i^l = \partial E_k / \partial \gamma_i^l$, representing the local gradient at the ith neuron of the lth layer, where $\gamma_i^l$ is the weighted-sum input to that neuron. The back propagation process is given by,

-------(10)

--(11)

and the derivative with respect to the weights are

----(12)
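The error back propagation steps can be sketched in Python for a network with one hidden layer. This is an illustration under stated assumptions, not the formulation of [12]: it assumes a per-sample sum-of-squares error $E_k = \frac{1}{2} \sum_j (y_j - d_j)^2$, sigmoid activations in both layers, and a bias weight stored in the first column of each weight row. The names `forward`, `sample_error`, `backprop`, and `train`, and the values of `n_hidden`, `eta`, `epochs`, and `seed`, are hypothetical choices for this example.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def forward(x, W1, W2):
    """Forward pass; each weight row holds [bias, w_1, ..., w_n]."""
    z1 = [sigmoid(row[0] + sum(w * xi for w, xi in zip(row[1:], x)))
          for row in W1]
    y = [sigmoid(row[0] + sum(w * zi for w, zi in zip(row[1:], z1)))
         for row in W2]
    return z1, y


def sample_error(x, d, W1, W2):
    """Per-sample error E_k = 1/2 * sum_j (y_j - d_j)^2 (assumed form)."""
    _, y = forward(x, W1, W2)
    return 0.5 * sum((yj - dj) ** 2 for yj, dj in zip(y, d))


def backprop(x, d, W1, W2):
    """Return (dE_k/dW1, dE_k/dW2) for one training sample."""
    z1, y = forward(x, W1, W2)
    # Local gradients at the output layer: error times sigmoid derivative.
    delta2 = [(yj - dj) * yj * (1.0 - yj) for yj, dj in zip(y, d)]
    # Back-propagate the local gradients to the hidden layer.
    delta1 = [z1[i] * (1.0 - z1[i]) *
              sum(delta2[j] * W2[j][i + 1] for j in range(len(W2)))
              for i in range(len(W1))]
    # Weight derivative = local gradient times the input on that line.
    g2 = [[dj] + [dj * zi for zi in z1] for dj in delta2]
    g1 = [[di] + [di * xi for xi in x] for di in delta1]
    return g1, g2


def train(samples, n_hidden=3, eta=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(samples[0][0]), len(samples[0][1])
    # Small random initial weights.
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
          for _ in range(n_hidden)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
          for _ in range(n_out)]
    for _ in range(epochs):
        for x, d in samples:
            g1, g2 = backprop(x, d, W1, W2)
            # Per-sample update w = w - eta * dE_k/dw.
            for W, g in ((W1, g1), (W2, g2)):
                for r in range(len(W)):
                    for c in range(len(W[r])):
                        W[r][c] -= eta * g[r][c]
    return W1, W2


# 2-bit parity data with targets 0.1 / 0.9 in place of 0 / 1.
xor = [((0, 0), (0.1,)), ((0, 1), (0.9,)), ((1, 0), (0.9,)), ((1, 1), (0.1,))]
W1, W2 = train(xor)
total_error = sum(sample_error(x, d, W1, W2) for x, d in xor)
```

Whether `total_error` ends up small depends on the initial weights, the learning rate, and the number of epochs; gradient descent may settle in a local minimum, so several runs with different settings may be needed.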

The algorithm is shown in pictorial form in the figure below.

Figure 17: Error back propagation algorithm steps

The MATLAB Neural Network Toolbox has a demonstration of the error back propagation algorithm, showing the change of error with respect to different combinations of weights for a two-layered network. It also shows how the weights obtained may correspond to a local minimum. The figures below show the MATLAB demo.

Figure 18: Variation of error with respect to layer-one weights

Figure 19: Two arbitrarily chosen points on the graph, depicting the values of weights that will be obtained by the algorithm

Integration of Fuzzy Systems and Neural Networks:

Neural networks process numerical information and exhibit learning capability. Fuzzy systems can process linguistic information and represent, say, experts' knowledge by fuzzy rules. The fusion of these two technologies is therefore a current research trend. The aim is to be able to create machines with more intelligent behavior [13].

Some of the motivations for considering both fuzzy systems and neural networks:

(1) The knowledge base of a fuzzy system consists of a collection of "If... Then..." rules in which linguistic labels are modeled by membership functions. Neural networks can be used to produce membership functions when the available data are numerical.
(2) Moreover, one can take advantage of the learning capability of neural networks to adjust membership functions, say in control strategies, to enhance control precision.
(3) Neural networks can be used to provide learning methods for fuzzy inference procedures.
(4) In the opposite direction, one can use a fuzzy reasoning architecture to construct new neural networks.
(5) One can also fuzzify the neural network architecture to enlarge the domain of applications.
(6) The fusion of neural networks and fuzzy systems is essentially based upon the fact that neural networks can learn experts' knowledge (through numerical data) and fuzzy systems can represent experts' knowledge (through the representation of the input-output relation by fuzzy reasoning).

The literature describes basically two types of combination:

Neural-fuzzy system: In this type of system the learning ability of neural networks is utilized to realize the key components of a general fuzzy logic inference system. Neural networks are considered for realizing fuzzy membership functions.

Fuzzy-neural network system: These models incorporate fuzzy principles into neural networks to create a more flexible and robust system. Neural network models and algorithms can be fuzzified, for example as fuzzy neurons, fuzzified neural models, and neural networks with fuzzy training. Developments in this field are in progress. There are different proposals for building these integrated systems, and the algorithms are at the proposal stage. For a more detailed explanation of the different types of combinations and proposals, refer to [14].

REFERENCES
[1] http://www.palantir.swarthmore.edu/loicz/help/clustering.htm
[2] Clustering Connections and statistical language processing, Frank Keller, University of Saarlandes
[3] http://cne.gmu.edu/modules/dau/stat/clustgalgs/clust3_frm.html
[4] Refining Initial Points for K-Means Clustering, P. S. Bradley, Computer Sciences Department, University of Wisconsin, Usama M. Fayyad, Microsoft Research, Redmond, WA
[5] http://www.geocities.com/mohamedqasem/vectorquantization/vq.html
[6] Fuzzy Logic In Fault Diagnosis, Dr. Tracy Dalton, University of Duisburg, Germany
[7] Bezdek J.C., Pattern recognition with fuzzy objective functions algorithms, Plenum Press, New York, 1991.
[8] Adaptive Fuzzy Monitoring and Fault Detection, Stefano Marsili-Libelli
[9] An individual project within MISB-420-0, Author: Daniel Klerfors, Professor: Dr Terry L. Huston, St. Louis University ( http://hem.hj.se/~de96klda/NeuralNetworks.htm )
[10] Posted notes of Prof. Carl G. Looney, Computer Science Department, University of Nevada ( http://ultima.cs.unr.edu/cs773b/CHAP3.pdf )
[11] http://www-binf.bio.uu.nl/BPA/NIntro.pdf
[12] http://www.ieee.cz/knihovna/Zhang/Zhang100-ch03.pdf
[13] Collection from various websites
[14] Chin-Teng Lin and C. S. George Lee, Neural Fuzzy Systems, Prentice Hall, NJ, 1996
