
Chemometrics and Intelligent Laboratory Systems 33 (1996) 35-46

Artificial neural networks in classification of NIR spectral data:
Design of the training set

W. Wu a, B. Walczak a,1, D.L. Massart a,*, S. Heuerding b, F. Erni b, I.R. Last c, K.A. Prebble c

a ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussel, Belgium
b Sandoz Pharma AG, Analytical Research and Development, CH-4002 Basle, Switzerland
c Analytical Department Laboratories, The Wellcome Foundation Ltd, Dartford, Kent DA1 5AH, UK

Received 24 May 1995; accepted 18 September 1995

* Corresponding author.
1 On leave from Silesian University, Katowice, Poland.

Abstract
Artificial neural networks (NN) with back-error propagation were used for classification with NIR spectra and applied to the classification of different strengths of drugs. Four training set selection methods were compared by applying each of them to three different data sets. The NN architecture was selected through a pruning method, and batching operation, adaptive learning rate and momentum were used to train the NN. The presented results demonstrate that selection methods based on the Kennard-Stone and D-optimal designs are better than those based on the Kohonen self-organized mapping and on random selection, and allow 100% correct classification for both recognition and prediction. The Kennard-Stone design is more practical than the D-optimal design. The Kohonen self-organized mapping method is better than the random selection method.
Keywords: Drug analysis; Neural network; NIR; Pattern recognition

1. Introduction

One observes an increasing interest in the application of neural networks (NNs) in chemical calibration and pattern recognition problems [1-13]. Although NNs do not require any assumptions about
data distribution, they can be successfully applied


only to sufficiently large and representative data sets.


The term sufficiently large is relative. The important factor is the ratio of the number of samples to the
number of weights considered in the net architecture.
Widrow [14] suggests as a rule of thumb that the
training set size should be about 10 times the number
of weights in a network. According to other authors
[15], the maximum number of nodes in the hidden
layer should be of the order g(m + 1), where m and
g denote the number of input and output units, respectively. Although these suggestions differ to some
extent, all NN users agree that the higher the ratio of
the number of samples to the number of weights the



better the generalization ability of NN. For the given


number of samples this ratio can be maximized by
minimising the net architecture (reducing input data,
pruning redundant weights, etc.).
The second requirement, data representativity,
means that the samples in the data set should be
(evenly) spread over the expected range of data variability. In some cases it may be possible to generate
such samples as the training set using experimental
design techniques. However, in most cases such as in
analysis of food samples, one does not have this possibility [16,17]. Usually, one needs to select the training (model) samples from a large set of samples. The
other samples could be used to test the net. However,
using all samples to train the net may lead to overfitting and to large prediction errors for the test set. To
avoid this the net training must be monitored. This
means that apart from the training set, one needs two
other data sets, the monitoring and test sets. In industrial practice, the sample size is not very large. One
can use the same data set to monitor training and later
evaluate the NN. Hence, at least two data sets (the
training and test sets) are required. The principles for
the design of these two sets are the same as the principles of design of any model set. Our study aims to
evaluate different strategies of training set design,
namely random selection, the Kohonen self-organising map approach [18], and two new approaches proposed by us, namely the Kennard and Stone design [19] and the D-optimal design [20,21].

2. Theory
2.1. Notation
m       number of input variables of the NN (i.e. the number of variables in the training set)
g       number of classes (i.e. the number of output variables of the NN)
z       number of objects in the data
n       number of objects in the training set
X       matrix of the training set (n x m)
N       number of objects in the training set or in the test set
Y       target matrix (N x g)
out     output matrix of the NN (N x g)

2.2. Design of training set


2.2.1. Random selection
There are several ways of selecting the training set.
The simplest one is a random selection which means
that no clear selection criterion is applied. There is a
risk that objects of some class are not selected in the
training set. To avoid this risk, we select 3/4 of the objects separately from each class and put them together as the training set. If 3/4 of the objects in a class is not an integer, the number is rounded down to the nearest integer.
2.2.2. Kohonen self-organising maps [18,6]
Another possible procedure for selecting the training set is to apply clustering techniques. The Kohonen network can be applied as such [6]. Zupan et al.
compared three kinds of methods in an example of the
reactivity of chemical bonds, and found that the Kohonen self-organising map performed best [6]. The
main goal of the Kohonen neural network is to map
objects from m-dimensional into two-dimensional
space. When the objects have similar properties in the
original space, they will map to the same node. In this
study, a (3 x 3) Kohonen network containing 9 nodes is chosen. The learning rate is 0.1 at the beginning and is decreased linearly so that it reaches 0 at the last training cycle. The neighbourhood size is also decreased linearly, but reaches a minimum of 1 after one-quarter of the training cycles and remains 1 for the rest of the training. The network is stabilised after each pattern has been presented to the network about 500 times. 3/4 of the objects are randomly selected from the objects which map to the same node. If 3/4 of the objects is not an integer, it is rounded up to the nearest integer, otherwise some nodes would contribute no objects after rounding.
This procedure is applied to each class separately. All
selected objects are put together as the training set.
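The following sketch illustrates this node-wise selection using the minisom package; that choice is an assumption on our part (the authors used their own implementation, and minisom's decay schedules differ from those described above):

```python
import numpy as np
from minisom import MiniSom   # assumed third-party SOM implementation

def kohonen_select(X_class, fraction=0.75, seed=0):
    """Select objects of one class via a (3 x 3) Kohonen map: objects mapping
    to the same node are grouped and a fraction (rounded up) is drawn per node."""
    rng = np.random.default_rng(seed)
    som = MiniSom(3, 3, X_class.shape[1], sigma=1.5, learning_rate=0.1,
                  random_seed=seed)
    som.train_random(X_class, num_iteration=500 * len(X_class))
    groups = {}
    for i, x in enumerate(X_class):
        groups.setdefault(som.winner(x), []).append(i)     # group by winning node
    selected = []
    for members in groups.values():
        k = int(np.ceil(fraction * len(members)))          # round up
        selected.extend(rng.choice(members, size=k, replace=False))
    return np.sort(selected)
```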
2.2.3. Kennard-Stone design [19,25,29]
The Kennard-Stone algorithm was
originally used to produce a design when no standard
experimental design can be applied. With this technique, all objects are considered as candidates for the
training set. The design objects are chosen sequentially. At each stage, the aim is to select the objects

W. Wu et al. / Chemometrics and Intelligent Laboratory

so that they are uniformly spaced over the object


space. The first two objects are selected by choosing
the two objects that are farthest apart. The third object selected is the one farthest from the first two objects, etc. Let d,, denote the squared Euclidean distance from the ith object to the jth object. Suppose k
objects have already been selected, where k <
number of objects n in the data. Then define the
minimal distance

as the squared distance from candidate object u, not


yet in the training set, to the k objects already in the
training set. The (k + 11th object in the training set is
chosen from the remaining objects, using the criterion
max(D&


where u belongs to the remaining objects. 3/4 of the objects of each class are selected separately and put together as the training set; the remaining objects constitute the test set. If 3/4 of the objects is not an integer, it is rounded in the same way as described under random selection.
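A compact NumPy sketch of this sequential maximin selection (our illustration, not the authors' code) might read:

```python
import numpy as np

def kennard_stone(X, n_select):
    """Kennard-Stone selection: start from the two most distant objects, then
    repeatedly add the candidate whose minimal squared distance to the
    already selected objects is maximal."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)    # squared Euclidean distances
    selected = list(np.unravel_index(np.argmax(d), d.shape))  # two farthest objects
    remaining = [i for i in range(len(X)) if i not in selected]
    while len(selected) < n_select:
        # D_u(k): minimal squared distance of each candidate u to the selected set
        min_dist = d[np.ix_(remaining, selected)].min(axis=1)
        u = remaining.pop(int(np.argmax(min_dist)))           # criterion max[D_u(k)]
        selected.append(u)
    return np.array(selected)
```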

2.2.4. D-optimal design [20,21,28]
The D-optimality approach is used when the classical symmetrical designs cannot be applied, because the experimental region is not regular in shape, or because the number of experiments selected by a classical design is too large. The principle of this method is to select the experimental points that maximise the determinant of the information matrix |X'X|. This matrix is equal to the variance-covariance matrix when X is defined as a matrix with n objects and m variables after centering (where n is the number of objects to be selected). The determinant of this matrix is maximal when the selected objects span the space of the whole data. When applied to each class of data separately, the method selects influential objects (maximal spread), which is particularly useful in pattern recognition problems. There are several algorithms [20], among others those due to Mitchell [26] and Fedorov [27]. As stated in Ref. [20], the algorithm by Fedorov is in general faster and sometimes leads to larger values of |X'X| than those obtained by the algorithm of Mitchell. Both algorithms are iterative and require an initial solution. We apply Fedorov's algorithm with the initial points selected by the Kennard-Stone design [21].
For the NIR data of each class, m is much larger than n. The D-optimal design cannot be applied directly because of the singularity of the information matrix. Therefore, the data are pretreated by PCA after centering, and the number of variables is reduced to n - 1 latent variables, so that X becomes a score matrix for n objects and n - 1 latent variables. We then apply the D-optimality method to select objects for the linear model y = sum_i b_i x_i + e, with x_i the ith latent variable. It should be pointed out that this linear model is used only for the selection of objects, not for modelling. This procedure is carried out for each class separately. 3/4 of the objects are chosen from every class and put together as the training set. If 3/4 of the objects is not an integer, it is rounded in the same way as described under random selection.
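The Fedorov exchange algorithm itself is more involved than can be shown here; the following much-simplified exchange sketch (not the authors' implementation) only illustrates the D-optimality criterion. It starts from an initial subset (e.g. the Kennard-Stone points) and swaps selected and unselected rows of the candidate matrix whenever the swap increases the determinant of the information matrix; it assumes the number of selected objects is at least the number of model columns, so that |X'X| is nonsingular.

```python
import numpy as np

def d_optimal_exchange(F, initial_idx, max_passes=50):
    """Simplified exchange search for a D-optimal subset of the rows of the
    candidate matrix F (one row per object, one column per model term).
    The log-determinant of X'X is used for numerical stability."""
    selected = list(initial_idx)
    logdet = lambda idx: np.linalg.slogdet(F[idx].T @ F[idx])[1]
    best = logdet(selected)
    for _ in range(max_passes):
        best_swap = None
        candidates = [c for c in range(len(F)) if c not in selected]
        for i in range(len(selected)):
            for c in candidates:
                trial = selected[:i] + [c] + selected[i + 1:]
                value = logdet(trial)
                if value > best:
                    best, best_swap = value, (i, c)
        if best_swap is None:          # no swap improves |X'X|: stop
            break
        selected[best_swap[0]] = best_swap[1]
    return np.array(sorted(selected))
```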

3. Experimental data

Three NIR data sets are analyzed. The first two data sets are measured with an FT-NIR instrument (Bruker IFS28/NIR) with a connected optical fibre; the last data set is measured with an NIRSystems Model 6500 near-infrared reflectance spectrometer configured with the aperture slit facing upward. The spectra are presented as log(1/R) absorption values, where R is the reflectance of the sample versus that of a white ceramic reference. For convenience, the wavelength is expressed by its index in the resulting data matrix. For reasons of industrial confidentiality, no detailed information about the chemical composition of the data sets can be given.
Data set 1 contains 140 NIR spectra (10 001-4000 cm-1; 779 wavelengths) of tablets containing drugs of different dosages (0.025, 0.05, 0.075 and 0.15 mg) and three kinds of placebo. Twenty different tablets of each dosage form (active and placebo) are measured four times through a glass plate on which the tablets are positioned. The average spectra of the four measurements were collected.
Data set 2 contains 160 NIR spectra (10 001-4000 cm-1; 779 wavelengths) of capsules containing drugs


of different dosages (0.1, 0.25, 0.5, 1.0 and 2.5 mg)


and three kinds of placebo. Twenty different capsules of each dosage form (active and placebo) are
measured four times through the glass plate.
Data set 3 contains 135 NIR spectra (1100-2500 nm; 700 wavelengths) of tablets containing different dosages (20, 50, 100 and 200 mg) of the experimental active ingredient, a placebo and a clinical comparator. There are respectively 15, 17, 15 and 21
spectra in the classes of different dosages, 47 spectra
in the class of placebo and 20 spectra in the class of
comparator. Spectra are measured through the blister
package, which contributes to the spectrum at around
1700 nm.
3.1. Data pre-processing
The pre-processing step consists of trimming, data
transformation, and training set selection. The first
and last 15 wavelengths were trimmed from each of
the spectra to remove edge effects. In this study, the
standard normal variate (SNV) transformation [24]
was applied to reduce the effects of scatter, particle
size, etc. After transformation, each data set was divided into the training set and test set by the techniques described. The data of the training set were
subjected to a principal component analysis (PCA).
The first 10 principal components were taken into
consideration. The scores of the objects from the test
set in the PC space were calculated using the loadings obtained from the training set.

4. Neural network parameters


The multilayer feedforward network trained with
the backpropagation learning algorithm was applied
[22]. The goal of net training is to minimize the root mean square error (RMS):

RMS = sqrt[ sum_{i=1..N} sum_{j=1..g} (y_ij - out_ij)^2 / (N g) ]

where y_ij is the element of the target matrix Y (N x g) for the data considered (training set or test set), and out_ij is the element of the output matrix out (N x g) of the NN.
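In code, this criterion amounts to a small NumPy helper (our sketch):

```python
import numpy as np

def rms(Y, out):
    """Root mean square error between the target matrix Y (N x g) and the
    network output matrix out (N x g)."""
    return np.sqrt(((Y - out) ** 2).sum() / Y.size)
```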


To make backpropagation faster, the following three techniques were used: batching operation, adaptive learning rate and momentum [23]. In batching operation, multiple input vectors are applied simultaneously, instead of one input vector at a time, and the network's response to each of them is obtained. Adding an adaptive learning rate can also decrease training time. This procedure increases the training speed, but only to the extent that the net can learn without large error increases. At each iteration, new weights and biases are calculated using the current learning rate; the new output of the net and the error term are then calculated. If the new error exceeds the old error by more than a predefined ratio (typically 1.04), the new weights, biases, output and error are discarded, and the learning rate is decreased (typically multiplied by 0.7). If the new error is less than the old error, the learning rate is increased (typically multiplied by 1.05). Otherwise the new weights, etc., are kept [23]. Momentum decreases backpropagation's sensitivity to small details in the error surface and helps the net to avoid getting stuck in shallow minima.
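The adaptive learning-rate rule described above can be sketched as follows (the constants are the typical values quoted from the Neural Network Toolbox [23]; the function name and signature are ours):

```python
def adapt_learning_rate(new_error, old_error, lr,
                        max_ratio=1.04, lr_dec=0.7, lr_inc=1.05):
    """Return the updated learning rate and whether the new weights are kept."""
    if new_error > old_error * max_ratio:
        return lr * lr_dec, False     # discard the update and slow down
    if new_error < old_error:
        return lr * lr_inc, True      # keep the update and speed up
    return lr, True                   # keep the update, leave the rate unchanged
```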
To avoid overfitting, the performance of the network is tested every hundred or thousand epochs
during the training, and the weights for which the
minimal RMS for the test set is observed are
recorded.
The target vector describing the membership of an object to a class was set to binary values of 1 (for the corresponding class) and 0 (for the other classes). The final output of the net can be evaluated in two different ways: the object can be considered as correctly classified if the largest output, regardless of its absolute value, is observed on the node signalling the correct class, or the object can be considered as correctly classified if the largest output is observed on the node signalling the correct class and its value is higher than 0.5. The second criterion is stricter and was chosen in this study to evaluate NN performance. It allows soft modelling of the data, i.e. it can happen that the ith object is not classified into any of the predefined classes.
The performance of a classification system is additionally expressed as a percentage: the number of correctly classified objects of the training and test sets divided by the total number of objects present in these sets.
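A sketch of this strict decision rule and of the resulting correct classification rate (our helper, not the authors' code):

```python
import numpy as np

def correct_classification_rate(out, targets, threshold=0.5):
    """An object counts as correctly classified only if its largest output is on
    the node of the true class and that output exceeds the threshold (0.5)."""
    winners = out.argmax(axis=1)
    true_classes = targets.argmax(axis=1)
    correct = (winners == true_classes) & (out.max(axis=1) > threshold)
    return 100.0 * correct.sum() / len(out)
```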


5. Results and discussion


5.1. Selection of NN architecture
To compare different techniques of training set selection, we can compare them for the optimal structure obtained with each training set selection technique. As described by Zupan and Gasteiger [6], another way is to compare the performance using a
fixed structure of NN. We chose the latter procedure.
The architecture of the network was first optimised
for the data which were divided into the training and
test sets by the Kennard-Stone algorithm. Then we
used this structure as the fixed structure. The input
and output values were range-scaled between 0.1 and
0.9 variable by variable. The backpropagation learning rule with adaptive learning rate and momentum
was used. The initial values of learning rate and momentum were fixed at 0.1 and 0.3, respectively.
An effort was made to check the influence of the
random initialisation of the net weights upon the final classification results. For instance, with data set 1
rerunning a neural net (4 nodes in the hidden layer)
10 times with the randomised initial weights results
in correct classification rates (CCRs) for training and
test sets, each time equal to 100%, while the mean
values of RMSs for training and test sets are equal to
0.0565 and 0.0619, with standard deviations of
0.0061 and 0.0059, respectively. It demonstrates that
CCRs for training and test sets are stable with different seeds of random generator, although the RMSs
change. The adaptive learning rate makes the results
more independent of the initial values of weights. The
NN utilised in this study consisted of two active layers of nodes with a sigmoidal transfer function. The
number of nodes in the output layer is determined by
the number of classes. Normally the number of nodes
in the input layer is also determined by the structure
of the data. As already explained, for NIR data the
number of variables is much larger than the number
of objects and the variables are highly correlated. The
data can be orthogonalized and reduced by principal
component analysis, but then the number of input PCs
should be optimised. According to Widrow's suggestion (see Section 1), the number of objects ought to
be about 10 times the number of weights. However,
in practical use, the number of objects is limited and
there are seldom so many available. Therefore, we
relaxed this condition during the optimisation of the
NN architecture: the ratio of the number of objects to
the number of weights ought to be more than 1. If the
numbers of input and output nodes are fixed, the
maximum number of hidden nodes can be estimated
using this rule. For instance, if there are 60 objects in
the training set, we never train an NN having more
than 60 weights. If we want to use 10 input nodes and
6 output nodes, then the number of hidden nodes
cannot exceed 3. The NN with 3 hidden nodes, 10
input nodes and 6 output nodes has 57 weights in total (11 x 3 + 4 x 6). For the net with 4 hidden nodes, the number of weights (11 x 4 + 5 x 6 = 74) is already larger than 60.
There is no standard way to optimise the architecture of NN. The simplest way is to try systematically
all combinations of nodes to find the optimal number
of nodes in the input and the hidden layer.

Table 1
Data set 1: correctly classified rate of the training set (CCR) and test set (CCRt) for all combinations of 2-10 input nodes and 1-4 hidden nodes; maximum number of epochs 5000

Input    1 hidden node    2 hidden nodes   3 hidden nodes   4 hidden nodes
nodes    CCR    CCRt      CCR    CCRt      CCR    CCRt      CCR    CCRt
 2       28.6   28.6      59.1   65.7      69.5   77.1      80.0   77.1
 3       28.6   28.6      66.7   71.4      91.4   97.1      100    100
 4       28.6   28.6      71.4   71.4      96.2   97.1      100    100
 5       28.6   28.6      69.5   71.4      100    100       100    100
 6       28.6   28.6      71.4   71.4      85.7   85.7      100    100
 7       28.6   28.6      75.2   77.1      100    100       100    100
 8       28.6   28.6      79.1   80.0      98.1   97.1      100    100
 9       28.6   28.6      71.4   68.6      100    94.3      100    97.1
10       28.6   28.6      71.4   68.6      99.1   97.1      100    100


Data set 1 is used as an example. We consider all combinations


of the number of input nodes varying from 2 to 10
(i.e. the first two to the first 10 PCs) and the number
of hidden nodes varying from 1 to 4. The results of
all these nets are shown in Table 1. There are eight
NNs which give 100% classification for both recognition and prediction. However, this approach requires a lot of trials, and could be even more time
consuming if we would like to take into account all
possible combinations of two, three, etc., PCs.
A more efficient approach to optimise the number
of nodes in the input and the hidden layers is as follows: first the number of input nodes is fixed at the
maximal number of PC factors, and the number of
hidden nodes is increased from small to large until the
performance of the network does not improve any
more or both the recognition and prediction percentages are 100%.
The maximum number of PCs to be entered can be
decided by the variance explained (for instance, the
number of PCs needed to explain 99% variance) or
by the results of pilot experiments. In the latter case,

we train the net using the first 10 PCs as input and


the maximum number of hidden nodes which is estimated by the above rule. If the NN performs well, 10
can be used as the maximum number of PCs. If the
performance is not satisfying, more PCs will be taken
into account.
When the number of nodes in the hidden layer has
been fixed and the maximum number of input nodes
has been decided, the number of input nodes is pruned
according to the value of the weights. If the weights
connected to one input node are all large, this indicates that the variable corresponding to the input node
plays an important role in NN. If the weights connected to one input node are all small, this indicates
that the variable corresponding to the input node plays
a small role in the NN and can be pruned off. If the
weights are intermediate, one can try to prune the
variable and if the NN still performs well one can
decide to prune it definitively. A hidden node can also
be pruned if the weights connecting the hidden node
to the input nodes, and the weights between the hidden node and the output nodes, are small.

Fig. 1. (a) The root mean square error (RMS) as a function of the number of training epochs; (b) the percentage of correctly classified objects as a function of the number of training epochs; network architecture (10 x 4 x 7); data set 1.


Fig. 2. (a) Hinton diagram of the weights between the nodes of the input layer and the nodes of the hidden layer in the network (10 x 4 x 7); (b) sum of the absolute values of the weights of the node in the input layer; (c) sum of the absolute values of the weights of the node in the hidden layer; data set 1.
Fig. 3. (a) The root mean square error (RMS) as a function of the number of training epochs; (b) the percentage of correctly classified objects as a function of the number of training epochs; network architecture (3 x 4 x 7); data set 1.


The magnitude of the weights can be easily displayed in a Hinton diagram. This diagram displays the elements of the weight matrix as squares whose areas are proportional to their magnitude. The bias vector is separated from the other weights by a solid vertical line. The largest square corresponds to the weight with the largest magnitude, and all others are drawn with sizes relative to the largest square [23]. The sum of the absolute values of the weights connected to a node can be used to estimate the importance of the role played by that node. This pruning is repeated until the performance of the network degrades.
Table 2 demonstrates the results of classification
for the sequence of steps in the optimisation of the net
architecture for data set 1. As one can see, a 100%
correct classification is observed for the NN with the
first 10 PCs as input variables and 4 nodes in the
hidden layer.
Fig. 1 demonstrates the performance of the network with 10 input and 4 hidden nodes during the
training. Fig. 2 shows the Hinton diagram. The
weights of input nodes 4 to 10 are much smaller than
those of the first three nodes. This suggests that the


Table 2
Data set 1: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by the Kennard-Stone procedure; maximum number of epochs 5000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
10            3              99.1      97.1       568
10            4              100       100        534
 3            4              100       100        531
 2            4              80        77.1       627
This suggests that PCs 4 to 10 do not contribute significantly to the network performance and that the first three PC factors play an important role in classification. After pruning them, the network performance does not decrease (Fig. 3). However, recognition and prediction percentages become worse when the input nodes are reduced to 2 (PCs 3 to 10 are rejected). Therefore, the
optimal structure of the network for data set 1 is 4
nodes in the hidden layer and 3 input nodes. The final weights of the optimal network are shown in the
Hinton diagrams (Figs. 4 and 5).

Fig. 4. (a) Hinton diagram of the weights between the nodes of the input layer and the nodes of the hidden layer in the optimal network (3 x 4 x 7); (b) sum of the absolute values of the weights of the node in the input layer; (c) sum of the absolute values of the weights of the node in the hidden layer; data set 1.

Fig. 5. (a) Hinton diagram of the weights between the nodes of the hidden layer and the nodes of the output layer in the optimal network (3 x 4 x 7); (b) sum of the absolute values of the weights of the node in the hidden layer; (c) sum of the absolute values of the weights of the node in the output layer; data set 1.

Fig. 6. The design of the training set by random selection, Kohonen self-organising map, Kennard-Stone algorithm and D-optimal design with a simulated data set; (*) objects of the training set; (-) objects of the test set.


Table 3
Data set 2: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by the Kennard-Stone procedure; maximum number of epochs 15000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
10            4              99.2      97.5       1670
10            5              100       100        2182
 9            5              100       100        1781
 8            5              99.2      100        1763

This indicates that the Hinton diagram can be used as a visual tool to reduce the number of input scores and to select the input
variables. The optimal architecture is obtained after
we train the NN four times as described in Table 2;
while for the same data we need 36 trials with the
systematic trial method. This kind of pruning method
is much faster than the systematic trial method.
However, this pruning method can be effectively applied only when the performance of the NN is very
good. The idea of this method is to reduce the size of
the architecture without changing the performance of
NN. If the performance of the NN is bad, there is no
sense in trying to improve the performance by pruning.
Using the pruning method, the optimal architecture for data set 2 is 5 hidden nodes with 9 input
nodes (Table 3); for data set 3 it is 3 hidden nodes
with 6 input nodes (Table 4). This architecture and
parameters were used in the following experiments.
5.2. Comparison of the four techniques of training set
selection
In order to visually compare the four techniques,
two-dimensional data of 40 objects were first simulated. In this case, half of the objects was selected by
Table 4
Data set 3: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by the Kennard-Stone procedure; maximum number of epochs 20000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
 7            2              76.8      69.4       1699
 7            3              100       100        1521
 6            3              100       100        1867
 5            3              99.0      100        1845


the studied methods except for the Kohonen method.


For the Kohonen method, 21 objects were selected,
because sometimes half of the objects mapping in the
same node was not an integer, and this was rounded
as described earlier.
Fig. 6 shows that objects selected by the Kennard-Stone and D-optimal designs cover the whole data domain, while the objects selected by the random selection and Kohonen methods do not. The Kennard-Stone procedure selects the objects so that they are
distributed evenly, and the D-optimal selects the extreme objects. The number of objects which is out of
the range of the selected objects by the Kohonen
method is lower than that by random selection. Kennard-Stone and D-optimal methods seem to select
objects that are more appropriate in the sense that
they are more representative for building the class
borders than the other methods.
Further, we studied the effect of the four techniques on the performance of NN by keeping the architecture and parameters of the network constant. In
the methods of random selection, the training set objects are randomly selected and in the case of Kohonen self-organising maps, they are randomly selected
from each cluster. This random selection step leads to
the possibility that the selection is sometimes very
good and sometimes very bad. To overcome such a
drawback, the methods of random selection and Kohonen self-organising maps were repeated three
times. The results are shown in Tables 5-7. To train
the NN one time, it takes about 10 min for data set 1,
and half an hour for data sets 2 and 3.
Table 5
Data set 1: comparison of the four different techniques of training set selection; number of correctly classified objects divided by total number of objects expressed between parentheses

Method          CCR (%)          CCRt (%)        Time (s)
Random          100 (105/105)    97.1 (34/35)    624
Random          100 (105/105)    100 (35/35)     634
Random          100 (105/105)    94.3 (33/35)    636
Kohonen         100 (113/113)    96.3 (26/27)    566
Kohonen         100 (116/116)    100 (24/24)     577
Kohonen         100 (116/116)    100 (24/24)     684
Kennard-Stone   100 (105/105)    100 (35/35)     531
D-optimal       100 (105/105)    100 (35/35)     516


Table 6
Data set 2: comparison of the four different techniques of training set selection; number of correctly classified objects divided by total number of objects expressed between parentheses

Method          CCR (%)           CCRt (%)        Time (s)
Random          99.2 (119/120)    92.5 (37/40)    2218
Random          100 (120/120)     97.5 (39/40)    2213
Random          99.2 (119/120)    97.5 (39/40)    2212
Kohonen         100 (132/132)     96.4 (27/28)    1942
Kohonen         100 (130/130)     100 (30/30)     1956
Kohonen         99.2 (130/131)    96.6 (28/29)    2309
Kennard-Stone   100 (120/120)     100 (40/40)     1781
D-optimal       100 (120/120)     100 (40/40)     2246

Table 8
Data set 3: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by D-optimal design; maximum number of epochs 20000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
6             3              100       100        1881
5             3              100       100        1916
4             3              88.9      88.9       1915

For data set 1, there are no differences in the performance of recognition; the recognition percentages are 100% for all methods. There are differences


though in the performance of prediction. The prediction percentages with the Kennard-Stone and D-optimal training sets are 100%. For the random selection method, one of the three replicates gives 100%
prediction percentage. For the Kohonen method, two
of the three replicates perform 100% in prediction.
For data set 2, Kennard-Stone and D-optimal
training sets lead to perfect performance (100% correct classification) for both recognition and prediction. The results of the random selection are not satisfactory: none of the replicates allows 100% prediction, and only one of the three replicates gives 100%
of recognition. With the Kohonen method, the results
of one replicate are good (100% recognition and
100% prediction success), and the results of the other
two replicates are bad.
For data set 3, the Kennard-Stone and D-optimal
training sets give the same perfect performance. With
the random selection and the Kohonen methods, the
Table 7
Data set 3: comparison of the four different techniques of training set selection; number of correctly classified objects divided by total number of objects expressed between parentheses

Method          CCR (%)           CCRt (%)        Time (s)
Random          100 (99/99)       94.4 (34/36)    1863
Random          100 (99/99)       88.9 (32/36)    1868
Random          100 (99/99)       100 (36/36)     1884
Kohonen         97.2 (103/106)    96.6 (28/29)    1667
Kohonen         100 (107/107)     100 (28/28)     1685
Kohonen         100 (110/110)     100 (25/25)     1632
Kennard-Stone   100 (99/99)       100 (36/36)     1867
D-optimal       100 (99/99)       100 (36/36)     1881

With the random selection and the Kohonen methods, the performances of the three replicates are sometimes


good and sometimes bad. One of the three replicates
gives good results for the random selection method,
and so do two of the three replicates for the Kohonen
method.
In order to compare the Kennard-Stone design and the D-optimal design, we tried to optimize the architecture of the NN again for the D-optimal design. The architecture for data set 3 can be further improved using the pruning method (Table 8).
The architecture for the other data sets cannot be pruned further. This suggests that the D-optimal selection might sometimes be slightly better than the Kennard-Stone selection. The D-optimal method selects training set objects that describe the information in the whole data as well as possible. More extreme objects are selected by this method than by the other methods (Fig. 6). For classification, the aim is to derive the border of every class, and the extreme objects are therefore more useful than others during training.

6. Conclusion
Artificial NN are shown to be useful pattern
recognition tools for the classification of NIR spectral data of drugs when the training sets are correctly
selected. Comparing the four training set selection
methods, the Kennard-Stone and D-optimal procedures are better than the random selection and Kohonen methods. The results of the D-optimal design may
be slightly better than those of the Kennard-Stone
design. However, the computing time of the D-optimal design (using Kennard-Stone design as the initial points) is larger than that of the Kennard-Stone


procedure. The random selection and Kohonen methods did not give consistently good performance in our study.
The number of data sets studied is not sufficiently large to prove that these conclusions are always valid for any data set. However, they at least allow us to state that the Kennard-Stone procedure will be a useful approach in certain instances and, in our opinion, in most instances.

References
[1] X.H. Song and R.Q. Yu, Chemom. Intell. Lab. Syst., 19 (1993) 101-109.
[2] C. Borggaard and H.H. Thodberg, Anal. Chem., 64 (1992) 545-551.
[3] Y.W. Li and P.V. Espen, Chemom. Intell. Lab. Syst., 25 (1994) 241-248.
[4] D. Wienke and G. Kateman, Chemom. Intell. Lab. Syst., 23 (1994) 309-329.
[5] T.B. Blank and S.D. Brown, J. Chemom., 8 (1994) 391-407.
[6] J. Zupan and J. Gasteiger, Neural Networks for Chemists: An Introduction, VCH, Weinheim, 1993.
[7] T. Naes, K. Kvaal, T. Isaksson and C. Miller, J. Near Infrared Spectrosc., 1 (1993) 1-11.
[8] P. de B. Harrington, Chemom. Intell. Lab. Syst., 19 (1993) 143-154.
[9] B.J. Wythoff, Chemom. Intell. Lab. Syst., 20 (1993) 129-148.
[10] G. Kateman, Chemom. Intell. Lab. Syst., 19 (1993) 135-142.
[11] J.R.M. Smits, L.W. Breedveld, M.W.J. Derksen and G. Kateman, Anal. Chim. Acta, 258 (1992) 11-25.
[12] A.P. Weijer, L. Buydens and G. Kateman, Chemom. Intell. Lab. Syst., 16 (1992) 77-86.
[13] J. Zupan and J. Gasteiger, Anal. Chim. Acta, 248 (1991) 1-30.
[14] B. Widrow, Adaline and Madaline, in: Proceedings of the IEEE 1st International Conference on Neural Networks, 1987, pp. 143-158.
[15] A. Maren, C. Harston and R. Pap, Handbook of Neural Computing Applications, Academic Press, San Diego, 1990.
[16] T. Naes, J. Chemom., 1 (1987) 121-134.
[17] T. Naes and T. Isaksson, Appl. Spectrosc., 43 (1989) 328-335.
[18] T. Kohonen, Self-Organisation and Associative Memory, Springer, Heidelberg, 1984.
[19] R.W. Kennard and L.A. Stone, Technometrics, 11 (1969) 137-148.
[20] R. Carlson, Design and Optimization in Organic Synthesis, Elsevier, Amsterdam, 1992.
[21] P.F. de Aguiar, B. Bourguignon, M.S. Khots and D.L. Massart, Chemom. Intell. Lab. Syst. (in press).
[22] T.P. Vogl, J.K. Mangis, A.K. Rigler, W.T. Zink and D.L. Alkon, Biol. Cybernet., 59 (1988) 257-263.
[23] H. Demuth and M. Beale, Neural Network Toolbox User's Guide, The MathWorks, Inc., 1993.
[24] R.J. Barnes, M.S. Dhanoa and S.J. Lister, Appl. Spectrosc., 43 (1989) 772-777.
[25] B. Bourguignon, P.F. de Aguiar, K. Thorns and D.L. Massart, J. Chromatogr. Sci., 32 (1994) 144-152.
[26] T.J. Mitchell, Technometrics, 16 (1974) 203-210.
[27] V.V. Fedorov, Theory of Optimal Experiments, Moscow University; English translation by W.J. Studden and E.M. Klimko, Academic Press, New York, 1972.
[28] A.C. Atkinson, Chemom. Intell. Lab. Syst., 28 (1995) 35-47.
[29] B. Bourguignon, P.F. de Aguiar, M.S. Khots and D.L. Massart, Anal. Chem., 66 (1994) 893-904.
