Chapter 7
Classification of materials
1. INTRODUCTION
$$f_i = k\,\exp\!\left(-\frac{E_i}{k_B T}\right) \qquad (1)$$

The energy $E_i$ and the temperature $T$ in the annealing process correspond to the objective function Φ and the control parameter T in cluster analysis, respectively.
The cluster center z_gj of class g for variable j is calculated as

$$z_{gj} = \sum_{i=1}^{n} w_{ig}\, a_{ij} \Big/ \sum_{i=1}^{n} w_{ig} \qquad (2)$$

where w_ig = 1 if sample i belongs to class g and w_ig = 0 otherwise. The sum of the squared Euclidean distances is used as the objective function to be minimized:

$$\Phi(w_{11},\ldots,w_{1k};\,\ldots;\,w_{n1},\ldots,w_{nk}) = \sum_{g=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{m} w_{ig}\,(a_{ij} - z_{gj})^{2} \qquad (3)$$

where m is the number of variables.
Step 2. Randomly assign an initial class label among the k classes to each of the n samples, and calculate the corresponding value of the objective function Φ. Let both the optimal objective function value Φ_b and the current objective function value Φ_c be Φ; T_b and W_b are the temperature and cluster membership corresponding to the optimal objective function, and T_c and W_c are those of the current state. Let T_c = T_b = T_1 and W_c = W_b.
Step 3. While the counting number of Metropolis sampling steps is less than N, i.e. IGM < N, go to step 4; otherwise, go to step 7.
Step 4. Let flag = 1 and let p be the threshold probability: if IGM ≤ N/2, p = 0.80; otherwise, p = 0.95. A trial assignment matrix W_t is obtained from the current assignment W_c in the following way: if i > n, let i = i − n; otherwise, let i = i + 1. Take sample i from the sample set; its current class assignment (w_ig) is denoted f (where f is one of the k classes), i.e. f = w_ig. Draw a random number u (u = rand, where rand is a random number uniformly distributed in the interval [0, 1]). If u > p, generate a random integer r in the range [1, k] with r ≠ f, move sample i from class f to class r, let w_ig = r and flag = 2; otherwise, take another sample and repeat the above process until flag = 2.
Step 5. Let the trial assignment after the above perturbation be W_t, and calculate the corresponding objective function value Φ_t. If Φ_t < Φ_b, then Φ_b = Φ_t, W_b = W_t, T_b = T_c and IGM = 0.
Step 6. Produce a random number y (y = rand). If y < exp(−(Φ_t − Φ_c)/T_c), then W_c = W_t and Φ_c = Φ_t; otherwise, IGM = IGM + 1. Go to step 3.
Step 7. Let T_c = ρT_c, IGM = 0, Φ_c = Φ_b and W_c = W_b. If T_c < T_2 or T_c/T_b < 10^−10, stop; otherwise, go to step 3.
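Steps 2 to 7 can be sketched in Python as follows. This is a minimal illustration, not the authors' original program: the parameter names (T1, N, rho, and the 10^−10 ratio criterion) follow the text, but for brevity the perturbation always changes one randomly chosen sample instead of cycling through the sample index i with the threshold probability p.

```python
import math
import random

def sa_cluster(data, k, T1=2.0, rho=0.9, N=50, ratio_tol=1e-10, seed=0):
    """Simulated-annealing clustering following steps 2-7 (a sketch)."""
    rng = random.Random(seed)
    n = len(data)

    def objective(labels):
        # Equation (3): sum of squared Euclidean distances to class centers.
        total = 0.0
        for g in range(k):
            members = [data[i] for i in range(n) if labels[i] == g]
            if not members:
                continue
            center = [sum(p[j] for p in members) / len(members)
                      for j in range(len(members[0]))]
            total += sum(sum((p[j] - center[j]) ** 2
                             for j in range(len(center))) for p in members)
        return total

    # Step 2: random initial assignment.
    labels = [rng.randrange(k) for _ in range(n)]
    phi_c = phi_b = objective(labels)
    best = labels[:]
    Tc = Tb = T1
    igm = 0
    while True:
        while igm < N:  # step 3: Metropolis sampling at temperature Tc
            # Step 4: move one randomly chosen sample to a different class.
            i = rng.randrange(n)
            trial = labels[:]
            trial[i] = rng.choice([g for g in range(k) if g != trial[i]])
            phi_t = objective(trial)  # step 5: track the best state found
            if phi_t < phi_b:
                phi_b, best, Tb, igm = phi_t, trial[:], Tc, 0
            # Step 6: Metropolis acceptance of the trial assignment.
            if rng.random() < math.exp(min(0.0, -(phi_t - phi_c) / Tc)):
                labels, phi_c = trial, phi_t
            else:
                igm += 1
        # Step 7: cool down and restart from the best assignment found.
        Tc, igm = rho * Tc, 0
        labels, phi_c = best[:], phi_b
        if Tc / Tb < ratio_tol:
            return best, phi_b

# Two well-separated blobs (illustrative data, not the chapter's data set).
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, phi = sa_cluster(data, k=2, seed=1)
print(labels, round(phi, 4))
```

On such small, well-separated data the annealing run recovers the two groups and the minimal value of equation (3).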
Figure 1. Flow chart of cluster analysis based on simulated annealing.

The algorithm of cluster analysis based on SA was tested by using simulated data generated
on the computer and composed of 30 samples containing 2 variables ( x, y) for each. These
samples were supposed to be divided into 3 classes. The data were processed by using cluster
analysis based on SA, hierarchical cluster analysis [10] and the K-means algorithm [11,12]. The optimal objective function (Φ_b) values obtained are shown in Table 1. Comparing the results in column 4 with those in column 7 of Table 1, one notices that for SA there is only one disagreement of the class assignment, for sample No. 6, which is actually misclassified by all three methods. This shows that the SA method is preferable to hierarchical cluster analysis and the K-means algorithm.
The objective function values of the K-means algorithm over 50 iterative cycles are listed in Table 2; the lowest value is 168.7852. One notices that the behavior of the K-means algorithm is
influenced by the choice of initial cluster centers, the order in which the samples were taken,
and, of course, the geometrical properties of the data . The tendency of sinking into local optima
is obvious. Clustering by SA can provide more stable computational results.
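The sensitivity of K-means to its random initialization can be reproduced with a basic implementation; the blob data below are illustrative, not the chapter's 30-sample set.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(data, k, rng=None, init=None, max_iter=100):
    """Basic K-means; returns the final objective (sum of squared distances)."""
    centers = list(init) if init is not None else rng.sample(data, k)
    for _ in range(max_iter):
        # Assign each sample to its nearest cluster center.
        groups = [[] for _ in range(k)]
        for p in data:
            groups[min(range(k), key=lambda j: sq_dist(p, centers[j]))].append(p)
        # Recompute the centers (an empty group keeps its old center).
        new = [tuple(sum(x) / len(g) for x in zip(*g)) if g else tuple(centers[j])
               for j, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    return sum(min(sq_dist(p, c) for c in centers) for p in data)

# Three tight, well-separated blobs of four points each. Local optima appear
# whenever two initial centers land in the same blob.
data = [(x + dx, y + dy) for (x, y) in [(0, 0), (10, 0), (5, 9)]
        for dx in (0, 1) for dy in (0, 1)]
rng = random.Random(0)
runs = sorted(kmeans(data, 3, rng=rng) for _ in range(20))
print(runs[0], runs[-1])  # best vs worst final objective over 20 random starts
```

Initializing with one center per blob gives the global optimum of 6.0, while random starts scatter between that value and much worse local optima, mirroring the spread reported in Table 2.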
Table 1
Comparison of results obtained by using different methods in classification of simulated data
                                               Classification results
No.   Simulated data     Actual class of    Hierarchical    K-means*    Simulated
      x         y        simulated data     clustering                  annealing
1 1.0000 0 1 1 1 1
2 3.0000 3.0000 1 2 1 1
3 5.0000 2.0000 1 2 1 1
4 0.0535 4.0000 1 1 1 1
5 0.3834 0.4175 1 1 1 1
6 5.8462 3.0920 1 2 2 2
7 4.9103 4.2625 1 2 2 1
8 0 2.6326 1 1 1 1
9 4.0000 5.0000 1 2 1 1
10 2.0000 1.0000 1 1 1 1
11 11.000 5.6515 2 2 2 2
12 6.0000 5.2727 2 2 2 2
13 8.0000 3.2378 2 2 2 2
14 7.0000 2.4865 2 2 2 2
15 10.000 6.0000 2 2 2 2
16 9.5163 3.9866 2 2 2 2
17 6.9478 1.5007 2 2 2 2
18 11.5297 3.9410 2 2 2 2
19 10.8278 2.0159 2 2 2 2
20 9.0000 4.7362 2 2 2 2
21 7.2332 12.000 3 3 3 3
22 6.5911 9.0000 3 3 3 3
23 7.2693 9.5373 3 3 3 3
24 4.1537 8.0000 3 3 3 3
25 3.5344 11.000 3 3 3 3
26 7.5546 11.6248 3 3 3 3
27 4.7147 8.0910 3 3 3 3
28 5.0269 7.0000 3 3 3 3
29 4.1809 9.8870 3 3 3 3
30 6.3858 9.4997 3 3 3 3
Table 2
Φ_b values obtained by using the K-means algorithm with 50 different random initial clusterings
177.2062
are classified into ( tea samples denoted by the same number are classified into the same group ). The objective function Φ_b for the hierarchical clustering obtained by Liu et al. [13] was also calculated according to equation (3), so that the results of different methods could be compared with the same criterion. The objective function Φ_b obtained by using cluster analysis based on SA was the lowest among all methods listed in Table 3. It seems that cluster analysis based on SA can really provide a globally optimal result. Column 4 of Table 3 shows the classification results of the K-means algorithm with the lowest Φ_b among 50 independent iterative cycles.
Comparing the results in column 5 with those in column 2 of Table 3, one notices that
there is only one case of disagreement of the class assignment for the tea sample K2. K2 was
distributed to class 1 and class 2 by hierarchical clustering and SA, respectively. As mentioned
above, the assignment of quality by the tea experts is valid only for samples belonging to the
same category and variety. According to hierarchical clustering sample K2 is classified as 1,
Table 3
Comparison of results obtained by using different methods in classification of Chinese tea
samples
                              Classification results
Sample   Hierarchical clustering   K-means^a   K-means^b   Simulated annealing
Cl 1 1 1 1
C2 1 1 1 1
C3 1 1 1 1
C4 1 2 1 1
C5 2 2 1 2
C6 2 2 1 2
C7 2 2 1 2
H1 1 1 1 1
H2 1 1 1 1
H3 1 2 1 1
H4 2 2 1 2
H5 2 2 1 2
K1 1 2 1 1
K2 1 2 1 2
K3 2 2 1 2
K4 2 2 1 2
F1 1 1 1 1
F2 1 1 1 1
F3 1 2 1 1
F4 1 2 1 1
F5 2 2 1 2
F6 2 2 1 2
F7 2 2 1 2
T1 3 3 2 3
T2 3 3 2 3
T3 3 3 3 3
T4 3 3 3 3
S1 3 3 2 3
S2 3 3 2 3
S3 3 3 3 3
S4 3 3 3 3
Objective function
value, Φ_b       50.8600    140.6786    119.2311    50.6999
a. The result obtained in one iteration cycle with arbitrary initial clustering.
b. The result with the lowest value of Φ_b among 50 independent iteration cycles.
meaning K2 is at least as good as C4, H3, etc. SA gave a classification of class 2 for K2, qualifying it as of the same quality as C5, H4, etc. As the value of Φ_b for SA is slightly lower, the results of the proposed algorithm seem more appropriate in describing the real situation. The clustering results obtained by the K-means algorithm seem not sufficiently reliable.
until T_c < T_2, as theoretically T_2 itself should approach zero ( T_2 = 10^−99 is taken here ). In general, the exact value of T_2 is unknown in practical situations. The present authors propose a stopping criterion based on the ratio of the current temperature T_c to the temperature T_b which corresponds to the optimal objective function Φ_b: when T_c/T_b < 10^−10, one stops computation ( step 7, vide supra ). For example, during the data treatment, computation was stopped when T_c = 3.8340 × 10^−54 and T_b = 9.6598 × 10^−44. This is a convenient criterion; it saves computing time substantially compared with the traditional approach of using an extremely small T_2.
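In code, the combined stopping rule is a one-liner; the numeric example below uses the values quoted in the text.

```python
def should_stop(Tc, Tb, T2=1e-99, ratio=1e-10):
    """Stop when Tc < T2 (the traditional rule with a near-zero T2) or when
    the current temperature has become tiny relative to the temperature Tb
    at which the best objective value was found (the proposed ratio rule)."""
    return Tc < T2 or Tc / Tb < ratio

# The example values from the text: the ratio rule fires long before
# Tc would ever reach T2 = 1e-99.
print(should_stop(3.8340e-54, 9.6598e-44))
```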
The methods to carry out the perturbation of the trial states of sample class assignment in
cluster analysis as well as the introduction of perturbation in the SA process also deserve
consideration. The present authors propose a perturbation method based on changing the class assignment of only one sample at a time ( Figure 1 ). Such a method seems more effective, as usually only the class assignment of one sample is wrong, and only the class assignment of this sample has to be changed. Brown and Entail [9] took the sample to be perturbed on a random basis. It seems that in such a method equal opportunity of perturbation for each sample might not really be guaranteed. In step 4 ( vide supra ) every sample has an equal opportunity of perturbation, and comparable results are obtained in less computation time ( Table 4 ). On
the other hand, Selim and Alsultan [5] proposed the below perturbation method in which the
class assignments of several samples might be simultaneously changed in each perturbation :
flag = 1; i = 0;
while flag ≠ 2 or i ≤ n
    i = i + 1;
    f = w_ig; u = rand;
    if u > p; generate r ∈ [1, k], r ≠ f; w_ig = r; flag = 2; end
end
The comparison of different perturbation operations is shown in Table 4 . One notices that
the method proposed by Selim and Alsultan [5] takes the longest time, i.e. this method
converges to the global optimal solution rather slowly, and the present method converges quite
quickly.
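The two perturbation styles compared in Table 4 can be contrasted in a few lines; the function names and the probability parameter p are illustrative, not the original implementations.

```python
import random

def perturb_one(labels, k, rng):
    """Present method: change the class of exactly one sample per perturbation."""
    new = labels[:]
    i = rng.randrange(len(new))
    new[i] = rng.choice([g for g in range(k) if g != new[i]])
    return new

def perturb_sweep(labels, k, p, rng):
    """Selim-and-Alsultan-style sweep: every sample may be reassigned with
    probability 1 - p, so several class labels can change in one perturbation."""
    new = labels[:]
    for i in range(len(new)):
        if rng.random() > p:
            new[i] = rng.choice([g for g in range(k) if g != new[i]])
    if new == labels:  # the original loops until at least one label changes
        return perturb_one(labels, k, rng)
    return new
```

The single-sample move keeps successive states close together, which is one plausible reason for the faster convergence reported in Table 4; the sweep can jump much farther in one step.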
Cluster analysis based on SA is a very useful clustering algorithm, although it has some shortcomings. As mentioned above, the modified algorithm is more effective than the K-means algorithm and is also preferable to hierarchical cluster analysis. A global optimal solution may be obtained by using the algorithm. A feasible stopping criterion and perturbation method are important aspects of the computation algorithm. The present authors use minimization of the
Table 4
Comparison of different perturbation operations
1 1 1 1
1 1 1 1
1 2 2 1
1 1 1 1
1 1 1 1
1 2 2 2
1 2 1 1
1 1 1 1
1 1 1 1
1 1 1 1
2 2 2 2
2 1 2 2
2 2 2 2
2 1 2 2
2 2 2 2
2 2 2 2
2 2 2 2
2 2 2 2
2 2 2 2
2 2 2 2
3 3 3 3
3 3 3 3
3 3 3 3
3 2 3 3
3 3 3 3
3 3 3 3
3 3 3 3
3 2 3 3
3 3 3 3
3 3 3 3
Objective function ( Φ_b )    209.4124    173.7650    166.9288
Computation time ( hrs )       23.5069     11.7061      1.1627
sum of the squared Euclidean distances as the clustering criterion. As Euclidean distances are suitable only for spherically distributed data sets, the search for other clustering criteria suitable for different kinds of data sets for use in SA cluster analysis deserves further investigation.
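To make the notion of swapping the clustering criterion concrete, the sketch below places equation (3) next to one illustrative robust alternative; the city-block criterion is an example added here, not a criterion proposed in this chapter.

```python
def phi_euclidean(groups):
    """Equation (3): sum of squared Euclidean distances to each group's mean."""
    total = 0.0
    for grp in groups:
        c = tuple(sum(x) / len(grp) for x in zip(*grp))
        total += sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in grp)
    return total

def phi_cityblock(groups):
    """An illustrative alternative: absolute deviations from each group's
    coordinate-wise median, which is less sensitive to outlying samples."""
    total = 0.0
    for grp in groups:
        med = tuple(sorted(x)[len(x) // 2] for x in zip(*grp))
        total += sum(sum(abs(a - b) for a, b in zip(p, med)) for p in grp)
    return total

groups = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
print(phi_euclidean(groups), phi_cityblock(groups))
```

Any such criterion can be plugged into the SA search in place of equation (3), since the annealing loop only needs objective values for the current and trial assignments.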
3.1. Introduction
From the above research results, we notice that cluster analysis based on SA is a very useful
algorithm, but it is still time-consuming, especially for large data sets. Although we have made some improvements on the conventional algorithm and proposed a modified cluster analysis based on simulated annealing (SAC), further modification of this sort of algorithm is of
considerable interest. On the basis of the above works, a modified clustering algorithm based
on a combination of simulated annealing with the K-means algorithm (SAKMC) can be used. In
this procedure the initial class labels among k classes of all n samples are obtained by using
the K-means algorithm instead of random assignment. A flow chart of the SAKMC program
is shown in Figure 2. The algorithm is first tested on two simulated data sets, and then used for the classification of calculus bovis samples and Chinese tea samples. The results show that the algorithm, which obtains a global optimum with shorter computation time, compares favorably with the original SAC and K-means algorithms.
The computation times needed to obtain the corresponding optimal objective function value (Φ_b) for clustering based on the K-means algorithm, simulated annealing (SAC) and SAKMC are about 3 min, 70 min and 55 min for data set I, and 5 min, 660 min and 183 min for data set II, respectively. From the above results, one notices that the larger the data set, the longer the computation time.
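The SAKMC combination can be sketched compactly: a few K-means passes produce the initial labels (replacing the random assignment of step 2), and an abbreviated annealing loop then refines the partition. The names kmeans_labels, phi and sakmc and all parameter values are illustrative, not the authors' routines.

```python
import math
import random

def nearest(p, centers):
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))

def phi(data, k, labels):
    """Equation (3): sum of squared distances to class centers."""
    total = 0.0
    for g in range(k):
        grp = [data[i] for i in range(len(data)) if labels[i] == g]
        if grp:
            c = tuple(sum(x) / len(grp) for x in zip(*grp))
            total += sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in grp)
    return total

def kmeans_labels(data, k, rng, iters=50):
    """A few K-means passes, used only to produce the initial class labels."""
    centers = rng.sample(data, k)
    labels = [nearest(p, centers) for p in data]
    for _ in range(iters):
        for g in range(k):
            grp = [data[i] for i in range(len(data)) if labels[i] == g]
            if grp:
                centers[g] = tuple(sum(x) / len(grp) for x in zip(*grp))
        labels = [nearest(p, centers) for p in data]
    return labels

def sakmc(data, k, seed=0, T=1.0, rho=0.9, steps=100):
    """SAKMC sketch: K-means supplies the starting partition, then an
    abbreviated annealing loop refines it toward the global optimum."""
    rng = random.Random(seed)
    cur = kmeans_labels(data, k, rng)
    best, fc = cur[:], phi(data, k, cur)
    fb = fc
    while T > 1e-6:
        for _ in range(steps):
            trial = cur[:]
            i = rng.randrange(len(trial))
            trial[i] = rng.choice([g for g in range(k) if g != trial[i]])
            ft = phi(data, k, trial)
            if ft < fb:
                fb, best = ft, trial[:]
            if rng.random() < math.exp(min(0.0, -(ft - fc) / T)):
                cur, fc = trial, ft
        T *= rho
    return best, fb

# Illustrative data: two well-separated blobs.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, fb = sakmc(data, 2, seed=3)
print(labels, round(fb, 4))
```

Starting the annealing near a K-means partition rather than a random one is what shortens the computation times reported above, since fewer Metropolis steps are spent descending from a poor initial state.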
Figure 2. Flow chart of a cluster analysis by K-means algorithm and simulated annealing.
Table 5
The comparison of results of different approaches

                               Data set I                        Data set II
                       K-means    SAC       SAKMC       K-means     SAC        SAKMC
Objective function
value, Φ_b             63.7707    60.3346   60.3346     114.6960    114.5048   114.5048
Computation
time ( min )           3          70        55          5           660        183
Figure 4. A plot of simulated data II
Table 6
The normalized data of microelement contents in natural and cultivated calculus bovis samples
Sample No.* Cr Cu Mn Ti Zn Pb
Sample No.* Mo Ca K Na
* No. 1-3 are cultivated calculus bovis and No.4-13 are natural calculus bovis samples.
polyphenols, caffeine and amino acids for various tea samples were processed by using SAC,
SAKMC and K-means algorithm. The results obtained by using the three methods were
compared with those given by Liu et al. [13] using hierarchical cluster analysis. The results
obtained are summarized as follows:
Hierarchical clustering:
Class 1: C1-4, H1-3, K1-2, F1-4; Class 2: C5-7, H4-5, K3-4, F5-7; Class 3: T1-4, S1-4.
Objective function value Φ_b : 50.8600 ( The objective function Φ_b was calculated according to equation 3 )
Computation time: 10 min.
K-means:
Class 1: C1-7, H1-5, K1-4, F1-7; Class 2: T1-2, S1-2; Class 3: T3-4, S3-4.
Objective function value Φ_b : 119.2311
Computation time: 6 min.
One notices that there is only one case of disagreement of the class assignment for the tea
sample K2. K2 was classified into class 1 by hierarchical clustering and K-means, and it was
classified into class 2 by SAC and SAKMC. Both SAC and SAKMC give a relatively low
objective function value. The K-means algorithm is inferior as shown by the objective function.
It puts all the green and black teas into the same group and separates the oolong teas into a high
and a low quality group.
4.1. Introduction
The classical principal component analysis (PCA) is the basis of many important methods for classification of materials, since eigenvector plots are extremely useful for displaying n-dimensional data while preserving the maximal amount of information in a space of reduced dimension. The classical PCA method is, unfortunately, non-robust, as the variance is adopted as the optimality criterion. Sometimes a principal component might be created just by the presence of one or two outliers [15]. Thus, if outliers exist, the coordinate axes of the principal component space might be misdetermined by classical PCA, and reliable classification of materials could not be obtained. A robust approach to PCA using simulated annealing has been proposed and discussed in detail in the chapter "Robust principal component analysis and constrained background bilinearization for quantitative analysis".
Projection pursuit (PP) is used to carry out PCA with a criterion which is more robust than the variance [16], the generalized simulated annealing (GSA) algorithm being introduced as the optimization procedure in the PP calculation to guarantee the global optimum. The results for simulated data sets show that PCA via PP is resistant to deviations of the error distribution from the normal one, and the method is especially recommended for use in cases where outliers may exist in the data. The theory and algorithm of PP PCA together with GSA are described in [16]. Three practical examples are given to demonstrate its advantages.
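The PP idea can be illustrated with a toy sketch: anneal a unit projection vector so as to maximize a robust index of the projected data. The median absolute deviation serves here as a stand-in index and plain simulated annealing as the optimizer; the original work [16] uses its own projection index and the GSA algorithm.

```python
import math
import random

def mad(xs):
    """Median absolute deviation: a robust spread measure for projected data."""
    s = sorted(xs)
    m = s[len(s) // 2]
    d = sorted(abs(x - m) for x in xs)
    return d[len(d) // 2]

def pp_direction(data, seed=0, T=1.0, rho=0.95, steps=50):
    """Anneal a unit vector w to maximize the robust index of the projection."""
    rng = random.Random(seed)
    dim = len(data[0])

    def unit(w):
        n = math.sqrt(sum(x * x for x in w)) or 1.0
        return [x / n for x in w]

    def index(w):
        return mad([sum(a * b for a, b in zip(p, w)) for p in data])

    w = unit([rng.gauss(0, 1) for _ in range(dim)])
    f = index(w)
    best_w, best_f = w, f
    while T > 1e-4:
        for _ in range(steps):
            trial = unit([x + rng.gauss(0, T) for x in w])
            ft = index(trial)
            if ft > best_f:
                best_w, best_f = trial, ft
            # Metropolis rule, written for maximization of the index.
            if rng.random() < math.exp(min(0.0, (ft - f) / T)):
                w, f = trial, ft
        T *= rho
    return best_w, best_f

# Data spread purely along x: the best projection direction is (±1, 0).
data = [(float(x), 0.0) for x in range(-5, 6)]
w, f = pp_direction(data, seed=0)
print([round(x, 3) for x in w], round(f, 3))
```

Repeating this search on the data deflated by each found direction yields successive robust components, which is the sense in which PP carries out PCA.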
to determine the applicability of the PP PCA algorithm for analyzing multivariate chemical data.
Figure 5 shows the classification results of PP PCA and SVD. It can be seen that the PP
PCA solutions provide a more distinct separation between the different varieties.
Figure 6. The comparison of results for tea samples by using the PP and SVD classification
* Green tea o Black tea Oolong tea
One sample ( No. 17 ), which was first classified as a " plain " beer by the manufacturer, is classified as an " imperial " one. The beer expert's further examination confirmed this classification. PP PCA compared favourably with the traditional SVD algorithm and was at least as good as the nonlinear mapping technique.
Figure 7. The comparison of results of beer samples by using the PP and SVD classification
analysis and cluster analysis could not give satisfactory results with four samples. These data sets were treated by PP PCA and SVD. The PC1-PC2 plots of the results further demonstrate that PP PCA is preferable to the traditional SVD algorithm.
Figure 8. The comparison of results of biological samples by using PP and SVD classification
* Serum samples of patients with coronary heart diseases o Serum samples of normal persons
ACKNOWLEDGEMENT
This work was supported by National Natural Science Foundation of P.R.C. and partly by the
of Sciences.
REFERENCES
APPENDIX
following criterion: a_m should belong to s_g(h) if || a_m − z_g(h) || < || a_m − z_i(h) || for all m = 1, 2, ..., n and i = 1, ..., k, g ≠ i, where s_g(h) denotes the set of samples whose cluster center is z_g(h) and || · || represents the norm; otherwise the clustering of a_m remains unchanged.
Step 3. According to the results of step 2, the new cluster centers z_g(h+1), g = 1, ..., k, are calculated such that the sum of the squared Euclidean distances from all samples in s_g(h) to the new cluster center, i.e. the objective function, is minimized:

$$z_g^{(h+1)} = \frac{1}{n_g} \sum_{a_m \in s_g(h)} a_m, \qquad g = 1, 2, \ldots, k$$
where ng is the number of samples in sg(h). The name "K-means" is derived from the manner
in which the cluster centers are sequentially updated.
Step 4. If z_g(h+1) = z_g(h) for g = 1, ..., k, stop; the algorithm has converged. Otherwise, go to step 2.