
Adaption of Simulated Annealing to Chemical Optimization Problems, Ed. by J.H. Kalivas
© 1995 Elsevier Science B.V. All rights reserved.

Chapter 7

Classification of materials

Ruqin Yu, Lixian Sun and Yizeng Liang

Department of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, People's Republic of China

1. INTRODUCTION

Cluster analysis, as an unsupervised pattern recognition method, is an important tool in exploratory analysis of chemical data. It has found wide application in many fields such as disease diagnosis, food analysis, drug analysis, classification of materials, etc. Hierarchical and optimization-partition algorithms are the most widely used methods of cluster analysis [1]. One of the major difficulties with these conventional clustering algorithms is guaranteeing a globally optimal solution to the corresponding problem. Simulated annealing, as a stochastic optimization algorithm [2], provides a promising way to circumvent such difficulties. Recently, generalized simulated annealing has been introduced into chemometrics for wavelength selection [3] and calibration sample selection [4]. The use of cluster analysis methods based on simulated annealing in chemometric research is therefore of considerable interest. Three modified clustering algorithms based on simulated annealing, the K-means algorithm and principal component analysis (PCA) are proposed and applied to chemometric problems. A modified stopping criterion and perturbation method are also proposed. These algorithms are first tested by using simulated data generated on a computer and then applied to the classification of materials such as Chinese tea, bezoar (the traditional Chinese medicine calculus bovis), beer samples and biological samples. The results compare favourably with those obtained by conventional clustering methods.

2. CLUSTER ANALYSIS BY SIMULATED ANNEALING

2.1. Principle of cluster analysis by simulated annealing


Simulated annealing (SA), which derives its name from the statistical mechanics of simulating atomic equilibrium at a fixed temperature, belongs to the category of stochastic optimization algorithms. According to statistical mechanics, at a given temperature T and under thermal equilibrium, the probability fi of a given configuration i obeys the Boltzmann-Gibbs distribution:

    f_i = k \exp(-E_i / T)                                                    (1)

where k is a normalization constant and Ei is the energy of configuration i [5, 6].


SA was proposed by Kirkpatrick et al. [7] as a method for solving combinatorial optimization problems in which a function of many variables is minimized or maximized. The idea was derived from an algorithm proposed by Metropolis et al. [8], who simulated the process by which atoms reach thermal equilibrium at a given temperature T. The current configuration of the atoms is perturbed randomly, and a trial configuration is obtained according to the method of Metropolis et al. [8]. Let Ec and Et denote the energies of the current and trial configurations, respectively. If Et < Ec, which means that a lower energy has been reached, the trial configuration is accepted as the current configuration. If Et >= Ec, the trial configuration is accepted with a probability proportional to exp(-(Et - Ec)/T). The perturbation process is repeated until the atoms reach thermal equilibrium, i.e. the configuration determined by the Boltzmann distribution at the given temperature. New lower-energy states of the atoms are obtained as T is decreased and the Metropolis simulation process is repeated. When T approaches zero, the atomic lowest-energy state, or ground state, is obtained.
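The Metropolis acceptance rule just described can be stated in a few lines of code. The sketch below is only an illustration (the energy values and the temperature are supplied by the caller):

    import math
    import random

    def metropolis_accept(E_current, E_trial, T):
        # Always accept a trial configuration with lower energy; otherwise
        # accept it with probability exp(-(E_trial - E_current)/T).
        if E_trial < E_current:
            return True
        return random.random() < math.exp(-(E_trial - E_current) / T)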
Cluster analysis can be treated as a combinatorial optimization problem. Selim and Alsultan [5] as well as Brown and Huntley [9] described the analogy between SA and cluster analysis. The atomic configuration in SA corresponds to the assignment of patterns or samples to clusters in cluster analysis. The energy E of the atomic configuration and the temperature T in the SA process correspond to the objective function Φ and the control parameter T in cluster analysis, respectively. Suppose n samples or patterns in d-dimensional space are to be partitioned into k clusters or groups. Different clustering criteria could be adopted; here the sum of the squared Euclidean distances from all samples to their corresponding cluster centers is used as the criterion.

Let A = [aij] (i = 1, 2, ..., n; j = 1, 2, ..., d) be an n x d sample data matrix, and W = [wig] (i = 1, 2, ..., n; g = 1, 2, ..., k) be an n x k cluster membership matrix, where wig = 1 if sample i is assigned to cluster g and wig = 0 otherwise, so that \sum_{g=1}^{k} w_{ig} = 1. Let Z = [zgj] (g = 1, 2, ..., k; j = 1, 2, ..., d) be a k x d matrix of cluster centers, where

    z_{gj} = \sum_{i=1}^{n} w_{ig} a_{ij} \Big/ \sum_{i=1}^{n} w_{ig}         (2)

The sum of the squared Euclidean distances is used as the objective function to be minimized:

    \Phi(w_{11}, \ldots, w_{1k}; \ldots; w_{n1}, \ldots, w_{nk})
        = \sum_{i=1}^{n} \sum_{g=1}^{k} \sum_{j=1}^{d} w_{ig} (a_{ij} - z_{gj})^2     (3)
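Written out in code, the criterion of equations 2 and 3 is compact. The following sketch is an illustration, not the authors' program; it assumes A is an n x d NumPy array and encodes W as a length-n vector of class labels, which satisfies the constraint \sum_g w_{ig} = 1 automatically:

    import numpy as np

    def objective(A, labels, k):
        # Phi of equation 3: the sum of squared Euclidean distances from
        # every sample to the center of the cluster it is assigned to.
        # 'labels' is a length-n integer array of class indices 0..k-1.
        phi = 0.0
        for g in range(k):
            members = A[labels == g]
            if len(members) == 0:
                continue                    # an empty cluster contributes nothing
            center = members.mean(axis=0)   # z_g of equation 2
            phi += ((members - center) ** 2).sum()
        return phi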

The clustering is carried out in the following steps:


Step 1. Set initial values of the parameters. Let T1 be the initial temperature, T2 the final temperature, μ the temperature multiplier, N the desired number of Metropolis iterations, IGM the counter of Metropolis iterations and i the counter of samples in the sample set. Take T1 = 10, T2 = 10^-99, μ = 0.7-0.9, N = 4n, IGM = 0 and i = 0.

Step 2. Randomly assign an initial class label among the k classes to each of the n samples and calculate the corresponding value of the objective function Φ. Let both the optimal objective function value Φb and the current objective function value Φc be Φ, and let the corresponding cluster membership matrix of all samples be Wb. Tb is the temperature corresponding to the optimal objective function value Φb; Tc and Wc are the temperature and cluster membership matrix corresponding to the current objective function value Φc. Let Tc = T1, Tb = T1 and Wc = Wb.
Step 3. While the Metropolis sampling counter is less than N, i.e. IGM < N, go to step 4; otherwise, go to step 7.
Step 4. Let flag = 1 and let p be the threshold probability: if IGM <= N/2, p = 0.80; otherwise, p = 0.95. A trial assignment matrix Wt is obtained from the current assignment Wc in the following way. If i > n, let i = i - n; otherwise, let i = i + 1. Take sample i from the sample set and denote its current class label by f (f is one of the k classes). Draw a random number u (u = rand, a random number uniformly distributed in the interval [0, 1]). If u > p, generate a random integer r in the range [1, k] with r ≠ f, move sample i from class f to class r and set flag = 2; otherwise, take the next sample and repeat the above process until flag = 2.
Step 5. Let the trial assignment obtained by the above perturbation be Wt and calculate its objective function value Φt. If Φt < Φc, let Wc = Wt and Φc = Φt. If Φt < Φb, then Φb = Φt, Wb = Wt and IGM = 0.

Step 6. Produce a random number y (y = rand). If y < exp(-(Φt - Φc)/Tc), then Wc = Wt and Φc = Φt. Otherwise, IGM = IGM + 1. Go to step 3.

Step 7. Let Tc = μTc, IGM = 0, Φc = Φb and Wc = Wb. If Tc < T2 or Tc/Tb < 10^-10, then stop; otherwise, go back to step 3.
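Steps 1-7 can be condensed into a single routine. The following sketch is a hedged reconstruction of the procedure, not the authors' original program; it reuses the objective() function sketched above and the parameter values of step 1, and the optional initial_labels argument anticipates the SAKMC variant of section 3:

    import math
    import random
    import numpy as np

    def sa_cluster(A, k, T1=10.0, T2=1e-99, mu=0.8, initial_labels=None):
        n = len(A)
        N = 4 * n                                    # Metropolis iterations per temperature (step 1)
        if initial_labels is None:                   # step 2: random initial assignment
            labels_c = np.random.randint(0, k, n)
        else:
            labels_c = np.asarray(initial_labels).copy()
        phi_c = objective(A, labels_c, k)
        labels_b, phi_b = labels_c.copy(), phi_c
        Tc = Tb = T1
        igm, i = 0, 0
        while Tc >= T2 and Tc / Tb >= 1e-10:         # step 7: stopping criterion
            while igm < N:                           # step 3
                p = 0.80 if igm <= N / 2 else 0.95   # step 4: threshold probability
                while True:                          # perturb one sample's assignment
                    i = i % n + 1                    # cycle through samples 1..n
                    if random.random() > p:
                        f = labels_c[i - 1]
                        r = random.choice([g for g in range(k) if g != f])
                        break
                labels_t = labels_c.copy()
                labels_t[i - 1] = r
                phi_t = objective(A, labels_t, k)    # step 5
                if phi_t < phi_c:
                    labels_c, phi_c = labels_t, phi_t
                    if phi_t < phi_b:
                        labels_b, phi_b, Tb = labels_t.copy(), phi_t, Tc
                        igm = 0
                        continue
                elif random.random() < math.exp(-(phi_t - phi_c) / Tc):
                    labels_c, phi_c = labels_t, phi_t   # step 6: uphill move accepted
                igm += 1
            Tc *= mu                                 # step 7: cool and restart from the best
            igm = 0
            labels_c, phi_c = labels_b.copy(), phi_b
        return labels_b, phi_b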


A flow chart of the program is shown in Figure 1.

Figure 1. Flow chart of cluster analysis by simulated annealing.

2.2. Treatment of simulated data

The algorithm of cluster analysis based on SA was tested by using simulated data generated on a computer, composed of 30 samples with 2 variables (x, y) each. These samples were supposed to be divided into 3 classes. The data were processed by using cluster analysis based on SA, hierarchical cluster analysis [10] and the K-means algorithm [11-12]. The optimal objective function (Φb) values obtained are shown in Table 1. Comparing the results in column 4 with those of column 7 in Table 1, one notices that for SA there is only one disagreement of the class assignment, for sample No. 6, which is actually misclassified by all three methods. This shows that the SA method is preferable to hierarchical cluster analysis and the K-means algorithm.
The objective function values of the K-means algorithm over 50 iterative cycles are listed in Table 2; the lowest value is 168.7852. One notices that the behavior of the K-means algorithm is influenced by the choice of initial cluster centers, the order in which the samples are taken and, of course, the geometrical properties of the data. The tendency of sinking into local optima is obvious. Clustering by SA provides more stable computational results.

2.3. Classification of tea samples


Liu et al. [13] studied the classification of Chinese tea samples by using hierarchical cluster analysis and principal component analysis. In their study, three categories of tea of Chinese origin were used: green, black and oolong. Each category contains two varieties: Chunmee (C) and Hyson (H) for green tea, Keemun (K) and Feng Quing (F) for black tea, and Tikuanyin (T) and Se Zhong (S) for oolong tea. Each sample in these groups was assigned a number according to its quality as judged by tea experts on the basis of taste, the best quality being numbered 1. One notices that the assignment of quality by the tea experts is valid only for samples belonging to the same category and variety. The names of the samples are composed of the first letter of the variety name, followed by the number indicating the quality.
The data, which comprise the concentrations (% w/w, dry weight) of cellulose, hemicellulose, lignin, polyphenols, caffeine and amino acids for the various tea samples, were processed by using cluster analysis based on SA and the K-means algorithm. The results obtained by using the two methods were compared with those given by Liu et al. [13] using hierarchical cluster analysis. The results are summarized in Table 3, where the numbers 1, 2, 3 refer to the groups into which the tea samples

Table 1
Comparison of results obtained by using different methods in classification of simulated data

                                      Classification results
No.    x         y        Actual class of   Hierarchical   K-means*   Simulated
                          simulated data    clustering                annealing
 1    1.0000    0                1               1             1          1
 2    3.0000    3.0000           1               2             1          1
 3    5.0000    2.0000           1               2             1          1
 4    0.0535    4.0000           1               1             1          1
 5    0.3834    0.4175           1               1             1          1
 6    5.8462    3.0920           1               2             2          2
 7    4.9103    4.2625           1               2             2          1
 8    0         2.6326           1               1             1          1
 9    4.0000    5.0000           1               2             1          1
10    2.0000    1.0000           1               1             1          1
11    11.000    5.6515           2               2             2          2
12    6.0000    5.2727           2               2             2          2
13    8.0000    3.2378           2               2             2          2
14    7.0000    2.4865           2               2             2          2
15    10.000    6.0000           2               2             2          2
16    9.5163    3.9866           2               2             2          2
17    6.9478    1.5007           2               2             2          2
18    11.5297   3.9410           2               2             2          2
19    10.8278   2.0159           2               2             2          2
20    9.0000    4.7362           2               2             2          2
21    7.2332    12.000           3               3             3          3
22    6.5911    9.0000           3               3             3          3
23    7.2693    9.5373           3               3             3          3
24    4.1537    8.0000           3               3             3          3
25    3.5344    11.000           3               3             3          3
26    7.5546    11.6248          3               3             3          3
27    4.7147    8.0910           3               3             3          3
28    5.0269    7.0000           3               3             3          3
29    4.1809    9.8870           3               3             3          3
30    6.3858    9.4997           3               3             3          3

Objective function
value, Φb                     169.3275        189.3531     168.7852   166.9288

* The result with the lowest value of Φb among 50 independent iteration cycles.



Table 2
Φb values obtained by using the K-means algorithm with 50 different random initial clusterings

307.3102  177.2062  168.7852  169.3275  181.4713  181.4713  347.3008
169.3275  306.3492  173.0778  173.0778  178.9398  347.3008  169.3275
173.0778  177.2062  169.3275  173.0778  171.6456  177.2062  181.4713
178.9398  176.4388  168.7852  373.6821  173.0778  202.4826  168.7852
173.0778  171.6456  181.4713  173.0778  306.3492  169.3275  177.2062
202.4826  373.6821  173.0778  202.4826  178.9398  347.3008  169.3275
181.4713  176.4388  202.4826  173.0778  178.9398  177.2062  347.3008
177.2062

are classified (tea samples denoted by the same number are classified into the same group). The objective function value Φb for the hierarchical clustering obtained by Liu et al. [13] was calculated according to equation 3. As shown in Table 3, different results can be obtained by using different methods with the same criterion. The objective function value obtained by using cluster analysis based on SA was the lowest among all methods listed in Table 3. It seems that cluster analysis based on SA can really provide a globally optimal result. Column 4 of Table 3 shows the classification results of the K-means algorithm with the lowest Φb among 50 independent iterative cycles.
Comparing the results in column 5 with those of column 2 in Table 3, one notices that there is only one case of disagreement of the class assignment, for the tea sample K2, which was assigned to class 1 by hierarchical clustering and to class 2 by SA. As mentioned above, the assignment of quality by the tea experts is valid only for samples belonging to the same category and variety.

Table 3
Comparison of results obtained by using different methods in classification of Chinese tea samples

                      Classification results
Sample   Hierarchical    K-means^a   K-means^b   Simulated
         clustering                              annealing
C1            1              1           1           1
C2            1              1           1           1
C3            1              1           1           1
C4            1              2           1           1
C5            2              2           1           2
C6            2              2           1           2
C7            2              2           1           2
H1            1              1           1           1
H2            1              1           1           1
H3            1              2           1           1
H4            2              2           1           2
H5            2              2           1           2
K1            1              2           1           1
K2            1              2           1           2
K3            2              2           1           2
K4            2              2           1           2
F1            1              1           1           1
F2            1              1           1           1
F3            1              2           1           1
F4            1              2           1           1
F5            2              2           1           2
F6            2              2           1           2
F7            2              2           1           2
T1            3              3           2           3
T2            3              3           2           3
T3            3              3           3           3
T4            3              3           3           3
S1            3              3           2           3
S2            3              3           2           3
S3            3              3           3           3
S4            3              3           3           3

Objective function
value, Φb  50.8600       140.6786    119.2311     50.6999

a. The result obtained in one iteration cycle with an arbitrary initial clustering.
b. The result with the lowest value of Φb among 50 independent iteration cycles.

According to hierarchical clustering, sample K2 is classified into class 1, meaning that K2 is at least as good as C4, H3, etc. SA gave a classification of class 2 for K2, qualifying it as of the same quality as C5, H4, etc. As the value of Φb for SA is slightly lower, the results of the proposed algorithm seem more appropriate in describing the real situation. The clustering results obtained by the K-means algorithm seem not sufficiently reliable.

2.4. Some computational aspects of the simulated annealing algorithm


Selim and Alsultan [5] pointed out that no stopping point was computationally available for cluster analysis based on SA. Searching for stopping criteria for use in SA cluster analysis therefore deserves further investigation. It is rather time-consuming to continue the calculation until Tc < T2, as theoretically T2 itself should approach zero (T2 = 10^-99 is taken here). In general, the exact value of T2 is unknown in practical situations. The present authors propose a stopping criterion based on the ratio of the current temperature Tc to the temperature Tb, which corresponds to the optimal objective function value Φb: when Tc/Tb < 10^-10, one stops the computation (step 7, vide supra). For example, during the data treatment, when Tc = 3.8340 x 10^-54 and Tb = 9.6598 x 10^-44, one stops the computation. This is a convenient criterion; it saves computing time substantially compared with the traditional approach using an extremely small T2.
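In code the proposed criterion amounts to a single extra comparison per temperature step (a sketch; the ratio 10^-10 is the value proposed above):

    def should_stop(Tc, Tb, T2=1e-99, ratio=1e-10):
        # Stop on the conventional floor T2, or as soon as the current
        # temperature Tc has fallen ten orders of magnitude below Tb,
        # the temperature at which the best objective value was found.
        return Tc < T2 or Tc / Tb < ratio

    # Example from the text: Tc = 3.8340e-54 and Tb = 9.6598e-44 give
    # Tc/Tb of about 4.0e-11 < 1e-10, so the computation stops.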
The way the perturbation of trial states of the sample class assignment is carried out in cluster analysis, as well as the introduction of the perturbation into the SA process, also deserves consideration. The present authors propose a perturbation method based on changing the class assignment of only one sample at a time (Figure 1). Such a method seems more effective, as usually only the class assignment of one sample is wrong and only the class assignment of this sample has to be changed. Brown and Huntley [9] chose the sample to be perturbed on a random basis; the corresponding perturbation operation is described as follows:

do perturbation operation to obtain Wt
    flag = 1;
    if IGM <= N/2; p = 0.8; else; p = 0.95; end
    while flag ~= 2
        i = rand(1, n);    % i is a random integer in the interval [1, n]
        f = w(i); u = rand;
        if u > p; w(i) = r (r ~= f); flag = 2; end
    end

It seems that in such a method equal opportunity of perturbation for each sample might not really be guaranteed. Every sample has an equal opportunity of perturbation in step 4 (vide supra), which takes less computation time while obtaining comparable results (Table 4). On the other hand, Selim and Alsultan [5] proposed the following perturbation method, in which the class assignments of several samples might be changed simultaneously in each perturbation:

do perturbation operation to obtain Wt
    flag = 1;
    if IGM <= N/2; p = 0.8; else; p = 0.95; end
    i = 0;
    while flag ~= 2 or i <= n
        if i > n; i = i - n; else; i = i + 1; end
        f = w(i); u = rand;
        if u > p; w(i) = r (r ~= f); flag = 2; end
    end
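The difference between the three perturbation operators is easiest to see in code. The sketches below are hedged Python reconstructions of the pseudocode above, not the authors' programs; labels is a length-n integer vector of class assignments and the function names are illustrative:

    import random

    def perturb_present(labels, k, i, p):
        # Present method (step 4): cycle deterministically through the
        # samples and move the first one that passes the probability test.
        n = len(labels)
        while True:
            i = i % n + 1
            if random.random() > p:
                f = labels[i - 1]
                labels[i - 1] = random.choice([g for g in range(k) if g != f])
                return i            # return the cursor so the cycle continues

    def perturb_brown(labels, k, p):
        # Brown and Huntley's method: draw the candidate sample at random,
        # so equal treatment of the samples holds only on average.
        n = len(labels)
        while True:
            i = random.randrange(n)
            if random.random() > p:
                f = labels[i]
                labels[i] = random.choice([g for g in range(k) if g != f])
                return

    def perturb_selim(labels, k, p):
        # Selim and Alsultan's method: sweep over all n samples, so the
        # assignments of several samples may change in one perturbation.
        for i in range(len(labels)):
            if random.random() > p:
                f = labels[i]
                labels[i] = random.choice([g for g in range(k) if g != f])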

A comparison of the different perturbation operations is shown in Table 4. One notices that the method proposed by Selim and Alsultan [5] takes the longest time, i.e. this method converges to the globally optimal solution rather slowly, while the present method converges quite quickly.
Cluster analysis based on SA is a very useful clustering algorithm, although it has some shortcomings. As mentioned above, the modified algorithm is more effective than the K-means algorithm and is also preferable to hierarchical cluster analysis. A globally optimal solution may be obtained by using the algorithm. A feasible stopping criterion and perturbation method are important aspects of the computational algorithm.

Table 4
Comparison of results obtained by using different perturbation methods in classification of simulated data

Actual class of         Classification results of
simulated data    Selim's method   Brown's method   Present method
      1                 1                1                1
      1                 1                1                1
      1                 2                2                1
      1                 1                1                1
      1                 1                1                1
      1                 2                2                2
      1                 2                1                1
      1                 1                1                1
      1                 1                1                1
      1                 1                1                1
      2                 2                2                2
      2                 1                2                2
      2                 2                2                2
      2                 1                2                2
      2                 2                2                2
      2                 2                2                2
      2                 2                2                2
      2                 2                2                2
      2                 2                2                2
      2                 2                2                2
      3                 3                3                3
      3                 3                3                3
      3                 3                3                3
      3                 2                3                3
      3                 3                3                3
      3                 3                3                3
      3                 3                3                3
      3                 2                3                3
      3                 3                3                3
      3                 3                3                3

Objective
function (Φb)       209.4124         173.7650         166.9288
Computation
time (hrs)           23.5069          11.7061           1.1627

The present authors use minimization of the sum of the squared Euclidean distances as the clustering criterion. As is well known, Euclidean distances are suitable only for spherically distributed data sets; searching for other clustering criteria suited to different kinds of data sets for use in SA cluster analysis deserves further investigation.

3. CLUSTER ANALYSIS BY K-MEANS ALGORITHM AND SIMULATED ANNEALING

3.1. Introduction

From the above research results, we notice that cluster analysis based on SA is a very useful algorithm, but it is still time-consuming, especially for large data sets. Although we have made some improvements to the conventional algorithm and proposed a modified cluster analysis based on simulated annealing (SAC), further modification of this sort of algorithm is of considerable interest. On the basis of the above work, a modified clustering algorithm based on a combination of simulated annealing with the K-means algorithm (SAKMC) can be used. In this procedure the initial class labels among the k classes for all n samples are obtained by using the K-means algorithm instead of random assignment, as sketched below. A flow chart of the SAKMC program is shown in Figure 2. The algorithm is first tested on two simulated data sets and then used for the classification of calculus bovis samples and Chinese tea samples. The results show that the algorithm, which obtains a global optimum with a shorter computation time, compares favorably with the original SAC and the K-means algorithm.
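A minimal sketch of the SAKMC idea follows (assuming sa_cluster() from section 2 accepts an initial assignment, and kmeans() is a routine such as the one sketched in the Appendix; both names are illustrative):

    def sakmc(A, k):
        # Run the fast (but only locally optimal) K-means algorithm first
        # and use its assignment, instead of a random one, as the starting
        # configuration of the simulated annealing search.
        initial_labels = kmeans(A, k)
        return sa_cluster(A, k, initial_labels=initial_labels)

Because the K-means solution is usually already close to a good partition, the annealing search starts from a lower objective value and needs fewer Metropolis sweeps, which is consistent with the shorter computation times reported below.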

3.2. Treatment of simulated data


The simulated data sets were composed of 30 samples (data set I) and 60 samples (data set II) with 2 variables (x, y) each (see Figures 3 and 4, respectively). These samples were supposed to be divided into 3 classes. The data were processed by using cluster analysis based on simulated annealing (SAC) and cluster analysis by the K-means algorithm and simulated annealing (SAKMC), respectively. As shown in Table 5, the computation times needed to obtain the corresponding optimal objective function values (Φb) for clustering based on the K-means algorithm, SAC and SAKMC are about 3 min, 70 min and 55 min for data set I, and 5 min, 660 min and 183 min for data set II, respectively. From these results, one notices that the larger the data set, the longer the computation time.

Figure 2. Flow chart of cluster analysis by K-means algorithm and simulated annealing.

Table 5
Comparison of results of different approaches

                        Simulated data set I          Simulated data set II
                    K-means    SAC      SAKMC     K-means    SAC       SAKMC
Objective function
value, Φb           63.7707   60.3346  60.3346   114.6960   114.5048  114.5048
Computation
time (min)             3         70       55         5         660       183

Figure 3. A plot of simulated data set I.


Figure 4. A plot of simulated data set II.

3.3. Classification of calculus bovis samples


Calculus bovis, or bezoar, is a widely used traditional Chinese medicine suitable for the treatment of fever and sore throat. The microelement contents in natural calculus bovis and cultivated calculus bovis samples were determined by using a Jarrell-Ash 96-750 ICP instrument [14]. The data after normalization and the results obtained by using different methods are listed in Table 6. The K-means algorithm takes the shortest time to obtain its final result (Φb = 96.7396), which is in fact a locally optimal solution: cultivated calculus bovis samples No. 4 and No. 7 were misclassified as natural ones by the K-means algorithm. Both SAC and SAKMC reach a globally optimal solution (Φb = 94.3589), in which only sample No. 4, belonging to cultivated calculus bovis, was classified as a natural one. If sample No. 4 is classified as a cultivated one, the corresponding objective function value Φb would be 95.2626; this indicates that sample No. 4 is closer to natural calculus bovis. From the above results, one notices that calculus bovis samples can be correctly classified into natural and cultivated ones on the basis of their microelement contents by means of SAC and SAKMC, except for sample No. 4. The computation times for SAC and SAKMC were 21 and 12 minutes, respectively.

Table 6
The normalized data of microelement contents in natural and cultivated calculus bovis samples

Sample No.*     Cr        Cu        Mn        Ti        Zn        Pb
 1            0.2601    1.2663   -0.3583   -0.8806    2.1462    0.7075
 2           -0.5501   -0.4793    0.4264   -0.7349    1.6575    0.9897
 3           -0.2094    1.4416    1.3222   -0.9609    1.3797    0.2796
 4           -0.1412   -0.7887   -0.3329   -0.9448   -0.4879    2.3681
 5           -0.0352    0.3886    0.9366   -0.8966   -0.3549    0.5646
 6            0.4039   -0.1633    0.3890   -0.2495   -0.4589   -0.6124
 7           -0.8455    1.6040   -0.8126   -0.2655   -0.5768   -0.0425
 8           -0.5539   -0.9086   20.7482    0.2371   -0.4448   -0.0360
 9           -0.5880   -0.6811   -0.4788    1.6784   -0.5340   -1.0765
10           -1.5648   -0.7790   -1.0007   -0.4273   -0.5804   -0.3776
11            0.0178    0.9968   -0.9148    0.6422   -0.5779   -0.4834
12            2.3159   -0.8352   -0.6767    1.9193   -0.5841   -1.1406
13            1.4905   -1.0622    2.2487    0.8831   -0.5835   -1.1406

Sample No.*     Mo        Ca                   K         Na
 1            0.3092    2.4408   -0.8782   -0.7953    2.7953
 2            0.3092    1.1206   -0.9742   -0.8503   -0.7975
 3            1.8715    0.7906   -0.7823   -0.7128   -0.9002
 4           -0.1121    0.8089   -0.3985    0.4146    0.3316
 5           -0.6738   -0.4288    0.7528    1.2395   -0.0790
 6           -0.9546   -0.0437    1.3284    0.6620   -0.3356
 7            1.1693   -0.5755   -0.2066    0.3871   -0.6435
 8           -0.5334   -0.8322   -1.7122    1.9544   -0.4382
 9           -0.9546   -0.4471    1.0406   -0.7953    0.7422
10           -0.1121   -0.8505    0.8487    0.6620   -0.2329
11            1.5906   -0.4838   -0.3026    0.1671    0.7422
12           -0.9546   -0.6580   -1.1660   -1.4827   -0.7462
13           -0.9546   -0.8413   -0.9742   -0.8503   -0.4382

* No. 1-3 are cultivated calculus bovis and No. 4-13 are natural calculus bovis samples.

3.4. Classification of tea samples


The data, which comprise the concentrations (% w/w, dry weight) of cellulose, hemicellulose, lignin, polyphenols, caffeine and amino acids for the various tea samples, were processed by using SAC, SAKMC and the K-means algorithm. The results obtained by using the three methods were compared with those given by Liu et al. [13] using hierarchical cluster analysis. The results obtained are summarized as follows:
Hierarchical clustering:
Class 1: C1-4, H1-3, K1-2, F1-4; Class 2: C5-7, H4-5, K3-4, F5-7; Class 3: T1-4, S1-4.
Objective function value Φb: 50.8600 (calculated according to equation 3)
Computation time: 10 min

K-means:
Class 1: C1-7, H1-5, K1-4, F1-7; Class 2: T1-2, S1-2; Class 3: T3-4, S3-4.
Objective function value Φb: 119.2311
Computation time: 6 min

SAC and SAKMC:
Class 1: C1-4, H1-3, K1, F1-4; Class 2: C5-7, H4-5, K2-4, F5-7; Class 3: T1-4, S1-4.
Objective function value Φb: 50.6999
Computation time: 107 min (SAC); 68 min (SAKMC)

One notices that there is only one case of disagreement of the class assignment, for the tea sample K2: K2 was classified into class 1 by hierarchical clustering and K-means, and into class 2 by SAC and SAKMC. Both SAC and SAKMC give a relatively low objective function value. The K-means algorithm is inferior, as shown by its objective function value: it puts all the green and black teas into the same group and separates the oolong teas into a high- and a low-quality group.

3.5. Comparison of methods with and without simulated annealing


Comparing the results obtained for the simulated data, the calculus bovis samples and the tea samples, one notices that different results can be obtained by using different methods with the same criterion. The K-means algorithm in these cases converges to locally optimal solutions in the shortest time; its behavior is influenced by the choice of initial cluster centers, the order in which the samples are taken and the geometrical properties of the data, and its tendency to sink into local optima is obvious. Both SAC and SAKMC can obtain globally optimal solutions, but SAKMC converges faster than SAC. Both adopt the same stopping criterion; the main reason why SAKMC converges faster is that it uses the result of the K-means algorithm as the initial guess. As the K-means algorithm is very quick, one gets faster convergence with the SAKMC version.

4. CLASSIFICATION OF MATERIALS BY PROJECTION PURSUIT BASED ON GENERALIZED SIMULATED ANNEALING

4.1. Introduction

Classical principal component analysis (PCA) is the basis of many important methods for the classification of materials, since eigenvector plots are extremely useful for displaying n-dimensional data while preserving the maximal amount of information in a space of reduced dimension. The classical PCA method is, unfortunately, non-robust, as the variance is adopted as the optimality criterion. Sometimes a principal component might be created just by the presence of one or two outliers [15]. If outliers exist, the coordinate axes of the principal component space might therefore be misdetermined by classical PCA, and a reliable classification of materials cannot be obtained. A robust approach to PCA using simulated annealing has been proposed and discussed in detail in the chapter "Robust principal component analysis and constrained background bilinearization for quantitative analysis".
Projection pursuit (PP) is used here to carry out PCA with a criterion more robust than the variance [16], the generalized simulated annealing (GSA) algorithm being introduced as the optimization procedure in the PP calculation to guarantee the global optimum. The results for simulated data sets show that PCA via PP is resistant to deviations of the error distribution from normality, and the method is especially recommended for use in cases where possible outlier(s) exist in the data. The theory and algorithm of PP PCA together with GSA are described in reference [16]. Three practical examples are given below to demonstrate its advantages.
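As a rough illustration of the idea only (the actual projection index and the GSA schedule are those of reference [16] and are not reproduced here), one PP component can be found by maximizing a robust spread measure of the projected data over unit direction vectors; in the sketch below the median absolute deviation stands in for the robust index and a simple annealing-type random search for GSA:

    import math
    import numpy as np

    def pp_component(A, n_iter=5000, T0=1.0, mu=0.999, step=0.1, seed=0):
        # Maximize a robust spread measure of the projected data A @ w
        # over unit vectors w, using an annealing-type random search.
        rng = np.random.default_rng(seed)
        d = A.shape[1]

        def index(v):                      # robust projection index (MAD)
            t = A @ v
            return np.median(np.abs(t - np.median(t)))

        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)
        cur = best = index(w)
        best_w, T = w.copy(), T0
        for _ in range(n_iter):
            trial = w + step * rng.standard_normal(d)
            trial /= np.linalg.norm(trial)
            val = index(trial)
            # accept improvements always, deteriorations with Boltzmann probability
            if val > cur or rng.random() < math.exp((val - cur) / T):
                w, cur = trial, val
                if cur > best:
                    best, best_w = cur, w.copy()
            T *= mu                        # cool the search gradually
        return best_w

Further components would be obtained analogously after deflating the data with respect to the directions already found.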

4.2. The IRIS data


A set of IRIS data [17], which consists of three classes, setosa, versicolor and virginica, was used to determine the applicability of the PP PCA algorithm for analyzing multivariate chemical data. Figure 5 shows the classification results of PP PCA and SVD. It can be seen that the PP PCA solutions provide a more distinct separation between the different varieties.
Figure 5. The plot of PP and SVD for IRIS data. * Setosa samples, o Versicolor samples, - Virginica samples.

4.3. Classification of tea samples


The tea samples mentioned in section 2.3 are classified into three classes according to their origin by using PP PCA and SVD. As shown in Figure 6, the tea samples are clearly classified into three classes by the PP PCA algorithm, which uses SA. The method is more feasible than the classical SVD approach.

Figure 6. The comparison of results for tea samples by using the PP and SVD classification. * Green tea, o Black tea, Oolong tea.

4.4. Classification of beer samples


Xu and Miu determined the contents of alcohol and many other chemical parameters in beer samples and classified the samples by using PCA and a nonlinear mapping technique [18]. This data set was processed by PP PCA and SVD; the results are shown in Figure 7. One sample (No. 17), which had originally been classified as "plain" beer by the manufacturer, is classified as an "imperial" one. The beer experts' further examination confirmed this classification. PP PCA compared favourably with the traditional SVD algorithm and was at least as good as the nonlinear mapping technique.

Figure 7. The comparison of results of beer samples by using the PP and SVD classification. * "Imperial" beer, o "Plain" beer.

4.5. Classification of biological samples


The contents of Sr, Cu, Mg and Zn in the serum of patients with coronary heart disease and of normal persons were determined by using ICP-AES [19]. The data were evaluated by using ordinary principal component analysis, cluster analysis and stepwise discriminant analysis. It was found that ordinary principal component analysis and cluster analysis could not give satisfactory results, with four samples misclassified; two samples were misclassified by stepwise discriminant analysis. These data sets were treated by PP PCA and SVD. The PC1-PC2 plot of the PP classification shown in Figure 8 has only two samples misclassified. The results further demonstrate that PP PCA is preferable to the traditional SVD algorithm.

Figure 8. The comparison of results of biological samples by using PP and SVD classification. * Serum samples of patients with coronary heart disease, o Serum samples of normal persons.

ACKNOWLEDGEMENT

This work was supported by the National Natural Science Foundation of P.R.C. and partly by the Electroanalytical Laboratory of the Changchun Institute of Applied Chemistry, Chinese Academy of Sciences.

REFERENCES

1. N. Bratchell, Chemometrics and Intelligent Laboratory Systems, 6 (1989) 105.
2. N. E. Collins, R. W. Eglese and B. L. Golden, American Journal of Mathematical and Management Sciences, 8 (1988) 209.
3. J. H. Kalivas, N. Roberts and J. M. Sutter, Analytical Chemistry, 61 (1989) 2024.
4. J. H. Kalivas, Journal of Chemometrics, 5 (1991) 37.
5. S. Z. Selim and K. Alsultan, Pattern Recognition, 24 (1991) 1003.
6. V. Cerny, Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm, Journal of Optimization Theory and Applications, 45 (1985) 41.
7. S. Kirkpatrick, C. Gelatt and M. Vecchi, Science, 220 (1983) 671.
8. N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller and E. Teller, J. Chem. Phys., 21 (1953) 1087.
9. D. E. Brown and C. L. Huntley, Pattern Recognition, 25 (1992) 401.
10. J. H. Ward, J. Am. Stat. Assoc., 58 (1963) 236.
11. G. Li and G. Cai, Computer Pattern Recognition Technique, Chap. 3, Shanghai Jiaotong University Press (1986).
12. Q. Zhang, Q. R. Wang and R. Boyle, Pattern Recognition, 24 (1991) 331.
13. X. D. Liu, P. Van Espen, F. Adams and S. H. Yan, Anal. Chim. Acta, 200 (1987) 424.
14. Q. Zhang, K. Yan, S. Tian and L. Li, Chinese Medical Herbs (in Chinese), 14 (1991) 15.
15. H. Chen, R. Gnanadesikan and J. R. Kettenring, Sankhya, B36 (1974) 1.
16. Y. Xie, J. Wang, Y. Liang, L. Sun, X. Song and R. Yu, Journal of Chemometrics, 7 (1993) 527.
17. R. A. Fisher, Annals of Eugenics, 7 (1936) 179.
18. C. Xu and Q. Miu, Computers and Applied Chemistry (in Chinese), 3 (1986) 21.
19. L. Xu, Y. Sun, C. Lu, Y. Yao and X. Zeng, Analytical Chemistry (in Chinese), 19 (1991) 277.

APPENDIX

Principle of K-means algorithm


Consider n samples in d dimensions, a_1, a_2, ..., a_n, where the sample vector a_i = [a_{i1}, a_{i2}, ..., a_{id}], i = 1, 2, ..., n, and assume that these samples are to be classified into k groups. The algorithm [11-12] is stated as follows:

Step 1. Select k arbitrary samples from all n samples as the k initial cluster centers z_1^{(0)}, z_2^{(0)}, ..., z_k^{(0)}, where z_g^{(0)} = [z_{g1}^{(0)}, ..., z_{gd}^{(0)}], g = 1, ..., k; the superscript (0) refers to the initial center assignment.


Step 2. At the h-th iterative step, distribute the n samples among the k clusters using the following criterion: a_m should belong to s_g^{(h)} if ||a_m - z_g^{(h)}|| < ||a_m - z_i^{(h)}|| for all m = 1, 2, ..., n and i = 1, ..., k, i ≠ g, where s_g^{(h)} denotes the set of samples whose cluster center is z_g^{(h)} and || || represents the norm; otherwise the cluster assignment of a_m remains unchanged.
Step 3. According to the results of step 2, the new cluster centers z_g^{(h+1)}, g = 1, ..., k, are calculated such that the sum of the squared Euclidean distances from all samples in s_g^{(h)} to the new cluster center, i.e. the objective function

    \Phi = \sum_{g=1}^{k} \sum_{a_m \in s_g^{(h)}} \| a_m - z_g^{(h+1)} \|^2,    m = 1, 2, ..., n,

is minimized. z_g^{(h+1)} is the sample mean of s_g^{(h)}; therefore

    z_g^{(h+1)} = \frac{1}{n_g} \sum_{a_m \in s_g^{(h)}} a_m,    g = 1, 2, ..., k,

where n_g is the number of samples in s_g^{(h)}. The name "K-means" is derived from the manner in which the cluster centers are sequentially updated.

Step 4. If z_g^{(h+1)} = z_g^{(h)} for g = 1, ..., k, stop; the algorithm has converged. Otherwise, go to step 2.
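A compact sketch of these four steps (an illustration assuming a NumPy array A of shape (n, d); the initial centers are k arbitrary samples, as in step 1):

    import numpy as np

    def kmeans(A, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = A[rng.choice(len(A), size=k, replace=False)]   # step 1
        for _ in range(max_iter):
            # step 2: assign every sample to its nearest cluster center
            dists = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # step 3: each new center is the mean of the samples assigned to it
            new_centers = np.array([
                A[labels == g].mean(axis=0) if np.any(labels == g) else centers[g]
                for g in range(k)
            ])
            if np.allclose(new_centers, centers):                # step 4: converged
                break
            centers = new_centers
        return labels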
