
Signal Processing 83 (2003) 2499–2521

www.elsevier.com/locate/sigpro

Novelty detection: a review—part 2: neural network based approaches

Markos Markou∗, Sameer Singh

Department of Computer Science, PANN Research, University of Exeter, Exeter EX4 4PT, UK

Received 31 March 2003; received in revised form 4 July 2003

Abstract

Novelty detection is the identification of new or unknown data or signals that a machine learning system is not aware of during training. In this paper we focus on neural network-based approaches for novelty detection. Statistical approaches are covered in the Part 1 paper.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Novelty detection; Network-based approaches; MLP; ART; RBF; Neural networks

1. Introduction

Neural networks have been widely used for novelty detection. In this paper we detail a variety of neural network methods for novelty detection. Our emphasis is on the technique rather than the application itself. Compared to statistical methods, some issues for novelty detection are more critical to neural networks, such as their ability to generalise, their computational expense during training, and the further expense incurred when they need to be retrained. In this vein, some networks are better suited than others; however, with a lack of sufficient comparative studies and meta-analysis of their novelty detection performance and its relationship to the type and quality of the data used, it is hard to give a definitive view, and we instead present a broad review of studies in the area.

The main criterion for evaluating novelty detection is the maximisation of the detection of true novel samples while at the same time minimising false positives. A commonly used method for this is ROC analysis. Some other performance metrics have also been suggested. For example, Moya et al. [59] provide three generalisation criteria that can be used to assess the performance of a novelty detector. Most pattern classification algorithms and neural networks fail at automatic detection of novel classes because they are discriminators rather than detectors: they often use open decision boundaries, such as hyper-planes, to separate targets from each other, and fail to decide when a feature set does not represent any known class. The performance of such detectors or one-class classifiers can be measured using three generalisation criteria. First, within-class generalisation indicates the network's performance on non-trained known classes. Second, between-class generalisation indicates the performance on near-known class objects from other classes. Finally, out-of-class generalisation indicates the classifier's performance on unknown classes.

∗ Corresponding author. E-mail addresses: m.markou@ex.ac.uk (M. Markou), s.singh@ex.ac.uk (S. Singh).

0165-1684/$ - see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/j.sigpro.2003.07.019

The computational complexity of neural networks has always been an important consideration for practical applications. One important consideration with neural networks is that they cannot be as easily retrained as statistical models. Retraining need not be applied only when new class data has to be added to the training set; in some cases it is also needed when the training data no longer reflects the environmental conditions. In such cases the network selects the input for retraining from the new environment automatically. Such a technique is very useful in applications such as video processing, where the same object might gradually change during operation due to different lighting conditions, exposure times and other reasons [27]. Similarly, Zhang and Veenker [91] introduce a new active learning paradigm enabling a neural network to adapt actively to its environment by self-generating novel training examples during learning using genetic algorithms. New training examples are generated by genetic recombination of two parent examples of the existing training set.

Retraining of networks after novelty detection with an enlarged training data set deserves important consideration. Some networks, such as constructive neural networks, are capable of on-line adaptation of links and weights as new classes are added. Network types such as cascade correlation are better suited to adaptation than multi-layer perceptrons. In the case of multi-layer perceptrons, retraining implies the addition of new output and hidden nodes, and often it is not clear how best to train the new configuration.

An important consideration in retraining is that the experimenter often does not wish to retrain a system from scratch, so some form of incremental training is attractive. A limited number of approaches have been proposed in this context. Kwok and Yeung [41] address the very important issue of retraining a neural network using suitable objective functions after new hidden units have been added. A standard neural network is generally useful only if the user chooses its architecture correctly. A small network will have difficulties in learning, while a large network may result in over-generalisation and poor performance. This is particularly true in adaptive learning, when new classes might be added to the system and the size of the hidden layer might also need updating. In general, algorithms that automatically determine the correct network architecture are highly desirable in several applications. There are three major approaches to this problem: regularisation algorithms; pruning algorithms, which start with a large network and remove nodes that are not active during training; and constructive algorithms, which start with a small number of hidden units and add units until a satisfactory solution is found. The constructive algorithms have several advantages over pruning algorithms. First, it is easier to decide on the initial network with the constructive approach, whereas with pruning one does not know how big the initial network should be. Second, it is much faster to train smaller networks, so the constructive approach leads to smaller networks capable of learning the problem faster. The work of Kwok and Yeung is concerned with this approach. There are three major problems involved in the design of constructive algorithms: how to connect the new units to the existing network; how to train the new units without losing the knowledge captured by the rest of the network; and when to stop adding units to the network. Their paper is concerned with learning the new units with computational efficiency in both time and space. A common technique used to improve computational efficiency is to assume that the nodes that already exist in the network are useful in modelling part of the target function. These nodes can therefore be frozen, and only the new nodes updated in the iterative training procedure. Training can be performed using the backpropagation algorithm, but a further improvement in computational efficiency can be achieved by proceeding in a layer-by-layer manner: first the weights feeding into the new hidden units are trained, and then the weights feeding into the output nodes, while keeping the input weights constant. This way there is no need to backpropagate error signals, and training is therefore much faster. During input training, the weights feeding into the new hidden units are trained to optimise an objective function. The authors present four such objective functions and provide proofs of their convergence. Additionally, these objective functions can be computed in O(N) time, where N is the number of training patterns. The proposed objective functions include: (a) a projection index that finds 'interesting' projections deviating from the Gaussian form; (b) the same error criterion as is used in output training, such as the squared error criterion; (c) the covariance between the residual error and the new hidden unit activation, as in the cascade-correlation network and its variants; and (d) an objective function based on the use of projection matrices, although this has computational and storage requirements of O(N²). The objective functions were experimentally tested on a set of synthetic data and gave very satisfactory results.
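To make the freeze-and-extend idea concrete, the sketch below (our own illustration, not code from [41]) trains the input weights of one new tanh hidden unit to maximise the magnitude of the covariance between its activation and the current residual error (objective (c) above), and then re-solves the output weights by least squares while the existing hidden units remain frozen. The toy data, learning rate and unit counts are all assumptions.

```python
# A sketch of constructive training with frozen units: input training of one
# new hidden unit on the covariance criterion, then output training by least
# squares. Illustrative only; not the algorithm of [41] verbatim.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))            # toy training inputs
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2           # toy target function

W_frozen = rng.normal(size=(2, 3))               # existing (frozen) hidden units
H = np.tanh(X @ W_frozen)
beta, *_ = np.linalg.lstsq(H, y, rcond=None)     # output training (least squares)
residual = y - H @ beta                          # error the new unit should explain

w_new, lr = rng.normal(size=2), 0.05             # input weights of the new unit
for _ in range(500):                             # input training only
    a = np.tanh(X @ w_new)
    a_c, r_c = a - a.mean(), residual - residual.mean()
    cov = a_c @ r_c / len(X)
    grad = (r_c * (1 - a ** 2)) @ X / len(X)     # d cov / d w_new
    w_new += lr * np.sign(cov) * grad            # ascend on |covariance|

H2 = np.column_stack([H, np.tanh(X @ w_new)])    # frozen units plus new unit
beta2, *_ = np.linalg.lstsq(H2, y, rcond=None)   # re-solve output weights only
print("MSE before:", np.mean(residual ** 2),
      "after:", np.mean((y - H2 @ beta2) ** 2))
```

Note that no error is ever backpropagated through the frozen units: the only gradient computed is for the new unit's input weights, which is what makes the layer-by-layer scheme cheap.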
Singh and Markou [68] present a method of creating new feed-forward networks for new class samples while keeping the earlier network trained on previously known classes. In this manner, as new novel classes are discovered, new networks are created, which is computationally efficient. A new test sample is presented to all available networks and its output is thresholded to determine which class it belongs to. Low outputs on all networks signal a novel class sample.

Despite the difficulties with neural network retraining and the vast number of parameter settings, neural networks are important novelty detectors. The following sections detail some important types of neural networks that have been used as novelty detectors.

2. Neural network approaches

Neural networks have been used extensively in novelty detection. They have the advantage that only a very small number of parameters need to be optimised for training, and no a priori assumptions about the properties of the data are made. Here we review a number of different architectures and methods that have been used for novelty detection. These include multi-layer perceptrons, self-organising maps, radial basis function networks, support vector machines, Hopfield networks, oscillatory networks, etc. The number of studies that use a neural method of novelty detection other than the above is fairly limited. Examples of such work include a three-layer network trained with the Widrow–Hoff LMS associative learning rule [44], and a hardware-amenable restricted Boltzmann machine [60].

2.1. MLP approaches

Multi-layer perceptrons are the best known and most widely used class of neural networks. Since such networks do not generate closed class boundaries, devising methods of novelty detection is a fairly challenging task, especially in ensuring that the generalisation property of the network does not interfere with its novelty detection ability [59]. A variety of approaches have been proposed, as discussed below.

In some studies, parametric statistics have been used for novelty detection by post-processing ordinary neural network output data. Bishop [7] states that one of the most important sources of errors in neural networks arises from novel input data. A network which is trained to discriminate between a number of classes coming from a set of distributions will be completely confused when confronted with data coming from an entirely new distribution. In most applications it is necessary for the system to output, along with the classification of a data input, a measure of confidence in this decision, or to 'refuse' to make a decision if the data point is found to come from a completely new class. The novelty detection technique is implemented here by estimating the density of the training data, thus modelling its distribution, and checking whether an input data point comes from this distribution. The goal of network training is to find a good approximation to the regression by minimisation of a sum-of-squares error defined over a finite training set. It is expected that this approximation will be most accurate in regions of input space for which the density is high, since only then does the error function penalise the network mapping if it differs from the regression. This is why the unconditional density might give an appropriate quantitative measure of novelty with respect to the training data. If the data point falls in a region with high density then the network is likely to perform well. If the data point falls in a region with low density then it is likely that the data point comes from a class that is not represented by the training data, and it is likely that the network will perform poorly. This can be used to assign error bars to the network outputs or to place a threshold; patterns that fail the threshold may be rejected and classified by other means. The density estimation is done either by using a kernel-based estimator or by using a semi-parametric estimator constructed from a Gaussian mixture model. The author states that it is important that density estimation is done on the input data, before any pre-processing techniques take place.
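A minimal sketch of this density-thresholding idea, with our own illustrative choices (a three-component Gaussian mixture fitted with scikit-learn, and a 1% quantile of the training log-densities as the rejection threshold; neither choice comes from [7]):

```python
# Fit a Gaussian mixture to the training inputs and flag test points whose
# estimated log-density falls below a low quantile of the training
# log-densities. Data, mixture size and quantile are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, size=(500, 2))              # the 'known' data

gmm = GaussianMixture(n_components=3, random_state=1).fit(X_train)
threshold = np.quantile(gmm.score_samples(X_train), 0.01)  # low-density cut-off

X_test = np.array([[0.1, -0.2],   # close to the training distribution
                   [6.0, 6.0]])   # far away
print(gmm.score_samples(X_test) < threshold)               # [False  True]
```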

Desforges et al. [23] consider probability density estimation of neural network output as a measure of novelty, similar to Bishop [7]. Probability density functions describe the frequency of occurrence of an observation at any point within a domain of interest. The Epanechnikov kernel is used in this work, and the smoothing parameter h is calculated using the least-squares cross-validation method. This study pays special attention to the dimensionality of the data. The most pronounced difficulty in treating high-dimensional data is the importance that must be accorded to the tails of a distribution. As the dimensionality increases, the relative quantity of data associated with the relatively low-density tails grows. Hence, even very low-density regions must be regarded as important parts of the distribution. This makes the decision whether a test vector belongs to the distribution or is novel a lot more difficult. The number of data points required for an accurate estimate of the density of the data increases as a power of the number of dimensions. For a large number of dimensions, the amount of training data required might be prohibitive for novelty detection. In this work, dimensionality reduction was achieved through data compression using wavelets. A small number of wavelet coefficients were selected to represent the data using genetic algorithms; the relative suitability of a subset of coefficients was evaluated on its classification power using a radial basis function network. The model showed very good approximations of the underlying distributions. For novelty detection, the application of such a technique is very simple: given a new set of data, the probability of the data corresponding to a set of conditions for which a density function is available may be evaluated. The returned value represents the scaled probability of the new data corresponding to the original operating conditions.

One of the simplest approaches to novelty detection is based on thresholding the output of a neural network: low confidence indicates a novel sample. Ryan et al. [64] present a novelty detection method using neural networks, applied to the detection of illegal use of computer resources. The Neural Network Intrusion Detection (NNID) anomaly detection system is based on identifying a legitimate user from the distribution of commands she or he executes. After the data is collected, a backpropagation neural network is trained to identify each user based on the training data. An anomaly (novelty) is detected when the neural network places low confidence in its decision: when the maximum activation is below 0.5, a novelty is detected. A similar approach is adopted by Augusteijn and Folkert [5]. For novelty detection it is sufficient to place a threshold on the output values of the network, and either take the Euclidean distance between the output pattern and the target pattern and threshold that, or threshold the highest output value. This threshold is user set.

LeCun et al. [43] discuss a method of novelty detection on the handwritten character recognition problem using an MLP trained with backpropagation. The rejection criterion was based on three conditions: the activity level of the winning node should be larger than a given threshold T1, the activity level of the second winning node should be lower than a threshold T2, and the absolute difference between the two activity levels should be larger than a threshold Td. All three thresholds are user-defined and, in this study, were optimised using performance measures on the test set.
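The three-condition rule lends itself to a few lines of code. The following sketch is our paraphrase of the criterion, not code from [43], and the threshold values are invented for illustration:

```python
# Accept only if the top activation exceeds T1, the runner-up stays below T2
# and the gap between the two exceeds Td; otherwise reject as novel/ambiguous.
import numpy as np

def is_novel(outputs, T1=0.7, T2=0.3, Td=0.3):
    """Return True if the output vector should be rejected."""
    top, second = np.sort(outputs)[::-1][:2]
    return not (top > T1 and second < T2 and top - second > Td)

print(is_novel(np.array([0.9, 0.1, 0.05])))    # False: confident, accepted
print(is_novel(np.array([0.55, 0.50, 0.40])))  # True: ambiguous, rejected
```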
Vasconcelos [83] makes an important contribution to novelty detection in MLPs by suggesting how to construct closed class boundaries. Such networks tend to classify patterns that do not belong to any of the known classes into one of those classes with a high degree of confidence. The reason for this is that they tend to separate the training classes using hyper-planes, forming open boundaries between the classes instead of around the classes [7,59]. One of the first approaches followed to deal with the problem of spurious patterns is to train the classifier with 'negative' examples of random patterns. The objective is to create attractors in the pattern space representing the 'rejection' class, so that patterns which do not belong to one of the known classes will fall in this 'rejection' class. According to the author, this technique will fail because it is unrealistic to expect that randomly selected patterns will accurately represent the input space where novel patterns will fall. There is, however, a similar approach based on bootstrapping that, according to the authors, works better. When a pattern is rejected by the trained network with a high degree of confidence, based on thresholding its output values, it is assumed that this decision is in fact correct, and the pattern is used as a negative example to retrain the network to reinforce its decision. The rejection occurs if the responses of all output neurons are close to 0 or the response of more than one neuron is close to 1. The target output of a rejection pattern is assigned as all zeros.
Vasconcelos et al. [85] study three feedforward neural networks and compare their ability to deal with the rejection of novel data. These are the standard MLP network, an MLP that employs a Gaussian activation function (GMLP), and the radial basis function (RBF) network. It is also shown how the MLP can be modified to generate boundaries surrounding the training data to enhance reliability, using randomly generated reject-class data. The GMLP is an alternative to the MLP in which the sigmoid activation is replaced by a Gaussian. The motivation behind this is that with the Gaussian function the receptive field of each network unit corresponds to a hyper-hill in the pattern space, which 'prunes' the unit to respond only to part of the half space, causing more confined regions surrounding the training data. This situation can be considered more reliable for rejecting spurious patterns than that obtained with the standard MLP, especially when there is an increasing number of training classes present in the problem. In contrast with both the MLP and the GMLP, each hidden unit in the RBF network responds to a localised receptive field in the input space, formed by the combination of a Euclidean distance measure as the propagation rule and a Gaussian as the network's activation function. As a result, the network's output reaches its maximum when the input pattern is near a centroid and decreases monotonically as the pattern becomes more distant from the centroid. Since the centroids are randomly selected from the input data, and units respond positively only to a local area around the centroids, inputs very dissimilar from the training patterns tend to receive very low output. RBFs place closed decision boundaries around each class, making them ideal for novelty detection and much better suited than MLPs or GMLPs for real practical applications.

Cordella et al. [17] define a performance function P that takes into account the quality of a classifier in terms of its recognition, misclassification and reject rates. Under this assumption, the optimal reject threshold is the one for which P reaches its absolute maximum. After the training phase, the classifier is applied, without the reject option, to a set S of samples whose class is known and which is representative of the training set, and it is evaluated in terms of correctly classified and misclassified patterns. S is used to determine two reject thresholds that are optimal with respect to the assigned function P, by selecting the values that maximise it. This approach is independent of the network architecture and training method. The first threshold is applied to the winning node, rejecting patterns whose activation falls below this value, while the second threshold is applied to the difference between the activations of the winning and second winning nodes, rejecting patterns that fall below it. The approach was tested on a neural classifier made of a three-level feed-forward fully connected network with sigmoid activations, trained using the backpropagation algorithm. The objective was the recognition of unconstrained hand-printed and multi-font printed characters. By using the proposed technique, a considerable reduction of the misclassification rate was obtained at the expense of only a slight decrease of the recognition rate.

Cordella et al. [18] extend their previous work to other types of neural networks, including the MLP, the RBF, the LVQ, the SOM, the ART, the PNN and the Hopfield network. The approach to the reliability problem (rejection of novel patterns) presented in this paper aims to be more general. A neural classifier is considered to be a black box, accepting an input pattern and supplying a vector of numbers as the output; no knowledge of training procedures or network architectures is necessary. A pattern should be rejected if it is significantly different from the training data and/or it lies in the overlapping region of two or more classes. In the case of the Hopfield network, an approach similar to autoassociator-based novelty detection can be adopted [2,9,19,20]. The rest of the neural networks can be grouped into three categories: the MLP and the RBF together, because their output indicates a class; the LVQ, SOM and ART together, because their output is the distance of the pattern from its nearest prototype; and the PNN on its own, because its output is a probability. For all groups two criteria are defined, and different ways of combining these two criteria to decide whether a pattern should be rejected are explored.

De Stefano et al. [24] also extend the technique described by Cordella et al. [17] to other types of neural network classifiers. In this paper, the method for determining the optimal threshold is generalised and rendered independent of the architecture of the considered classifier, making it applicable to any type of classifier. The authors consider the MLP, Learning Vector Quantisation (LVQ) and the Probabilistic Neural Network (PNN). As in Cordella et al. [17], the rejection is performed on the basis of the output vector given a threshold. The threshold is optimised on the basis of a function P that considers the costs associated with the classifier's correct recognition rate, rejection rate and misclassification rate. The authors use a more complicated and generic way of using a set S, similar to the training set, to determine the rejection thresholds. The optimised thresholds are used in the case of the MLP in the same way as in Cordella et al. [17]. For the LVQ and the PNN a slightly different approach is followed. In the case of the LVQ, the output vector is composed of the values of the distances between each Kohonen neuron (prototype) and the input sample. The final prototypes defined by the net will be the centroids of the regions into which the feature space is partitioned. Samples significantly different from those present in the training data will have a distance from the winning neuron greater than that of the samples in the training set. Therefore, a threshold can be placed on the quotient of that distance and the maximum distance in the training set. On the other hand, samples belonging to an overlapping region have a comparable distance from at least two prototypes, so a second threshold can be placed on the quotient of the distances of the winning and second winning neurons. In a PNN, for each neuron k the output vector assumes a value proportional to the probability that the input sample belongs to the class associated with the kth neuron. The distances between the input sample x and all the samples belonging to the training set are computed and, on the basis of these values, the probability Pk that x belongs to each class k is evaluated. These probability density functions are generally computed using the Parzen method. The PNN then assigns the input samples to the class associated with the output neuron with the highest value. Patterns are rejected using the same equations as in the LVQ case.
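The two LVQ rejection tests can be sketched as follows (our own illustration: the prototypes, data and threshold values are invented, and the exact functional form in [24] may differ). The first quotient flags samples far from every prototype as novel; the second flags samples sitting between two prototypes as ambiguous.

```python
import numpy as np

rng = np.random.default_rng(2)
prototypes = np.array([[0.0, 0.0], [3.0, 3.0]])          # trained LVQ prototypes
X_train = np.vstack([rng.normal(p, 0.3, (100, 2)) for p in prototypes])

def sorted_dists(x):
    return np.sort(np.linalg.norm(prototypes - x, axis=1))

d_max = max(sorted_dists(x)[0] for x in X_train)         # max winning distance

def classify(x, t_novel=3.0, t_ambiguous=0.8):
    d1, d2 = sorted_dists(x)[:2]
    if d1 / d_max > t_novel:        # far from all prototypes: novel
        return "novel"
    if d1 / d2 > t_ambiguous:       # two comparable distances: overlap region
        return "ambiguous"
    return "nearest prototype's class"

print(classify(np.array([0.1, 0.1])))     # known class
print(classify(np.array([10.0, -10.0])))  # novel
print(classify(np.array([1.5, 1.5])))     # ambiguous
```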
Wilson et al. [87] demonstrate that an MLP can have as good or better novelty detection performance than competing techniques by making fundamental changes in the network's optimisation strategy. Three changes are necessary. First, regularisation can be used to decrease the volume of the weight space in the optimisation process; this is achieved by adding an error term that is proportional to the sum of the squares of the weights. Second, the usual sigmoidal activation function can be changed to a sinusoidal function. This creates a significant change in the dynamics of training, since the even and odd higher derivatives of the dynamical system are never both small; this improves network training and dynamics and results in better error-reject performance and smaller networks. Third, Boltzmann pruning is used to reduce the weight-space dimension, and class-based error weights are used during training.

Denoeux [22] presented a new adaptive pattern classifier based on the Dempster–Shafer theory of evidence. This method uses reference patterns as items of evidence regarding the class membership of each input pattern under consideration. This evidence is represented by basic belief assignments (BBAs) and pooled using Dempster's rule of combination. This procedure can be implemented in a multilayer neural network with a specific architecture consisting of one input layer, two hidden layers and one output layer. The weight vector, the receptive field and the class membership of each prototype are determined by minimising the mean squared differences between the classifier outputs and target values. After training, the classifier computes for each input vector a BBA that provides a description of the uncertainty pertaining to the class of the current pattern, given the available evidence. This information may be used to implement various decision rules allowing for ambiguous pattern rejection and novelty detection. The outputs of several classifiers may also be combined in a sensor fusion context, yielding decision procedures that are very robust to sensor failures or changes in the system environment.

Singh and Markou [68] present a new model for novelty detection using neural networks. They first use the concept developed by Vasconcelos et al. [84,85] of using random rejects to close known class boundaries. Their rejection filter is used to discriminate between known and novel samples, and only known samples are classified. The novel samples are accumulated and then clustered using fuzzy clustering. These clusters are then compared with known class distributions to check whether they could be outliers of known classes or whether they represent truly novel patterns. Truly novel samples are then manually labelled as a new class, and the neural network is incrementally updated to learn this information. Their results are shown on a natural scene analysis application, where they show how novel objects can be picked up in video analysis.

2.2. Support vector machines based approaches

Support vector machines are based on the concept of determining optimal hyperplanes for separating data from different classes [82]. Tax and Duin [77,78] seek to solve the problem of novelty detection by distinguishing the class of objects represented by the training set from all other possible objects in the object space. A sphere is found that encompasses almost all points in the data set with the minimum radius. Slack variables are also introduced to deal with the problem of outliers in the data set. The radius and the number of slack variables are minimised for a given constant C that gives the trade-off between the volume of the sphere and the number of target objects found. A given test point is rejected if its distance from the centre of the sphere is larger than the radius of the sphere. The usage of kernel functions, as opposed to inner products, solves the problem of non-spherically distributed data. They considered a polynomial and a Gaussian kernel and found that the Gaussian kernel works better for most data sets. A free parameter that defines the width of the kernel needs to be selected: the larger the width of the kernel, the fewer support vectors are selected and the more sphere-like the description becomes. The authors proposed a leave-one-out method on the training data for optimising the kernel width and for monitoring the generalisation of the system. The advantage of this technique over other techniques for novelty detection, such as Tarassenko [75], is that it does not have to make a probability density estimate of the training data. A drawback of these techniques is, according to the authors, that they often require a large dataset, especially when high-dimensional feature vectors are used. Also, problems may arise when large differences in density exist: objects in low-density areas will be rejected although they are legitimate objects.

Tax and Duin [79] suggest creating outliers uniformly in and around the target class. The fraction of the outliers accepted by the classifier is an estimate of the volume of the feature space covered by the classifier, and an optimisation of the parameters may be performed. The authors propose using a d-dimensional Gaussian distribution for creating the outlier data: the direction of the object vectors from the origin is not changed, but the norm of the object vectors is rescaled. The authors indicate that the method becomes infeasible for very high-dimensional data, especially when a hyper-box is defined to surround the target data. In this respect the method described here works better, but both methods fail for data with more than 30 features.

Schölkopf et al. [67] offer an alternative to the approach used by Tax and Duin. The difference is that instead of trying to find a hyper-sphere with minimal radius to fit the data, the authors try to separate the surface region containing data from the region containing no data. This is achieved by constructing a hyper-plane which is maximally distant from the origin, with all data points lying on the opposite side from the origin and such that the margin is positive. The paper proposes an algorithm that computes a binary function: the function returns +1 in 'small' regions that contain data and −1 elsewhere. The data is mapped into the feature space corresponding to the kernel and is separated from the origin with maximum margin. Different kernels may be utilised, corresponding to a variety of non-linear estimators. To separate the dataset from the origin, a quadratic program (QP) needs to be solved. A variable is introduced which takes values between 0 and 1 and controls the effect of outliers in the system, or rather how hard or soft the boundary around the data is. The drawback of this method, as mentioned by Campbell and Bennett [14], is that the origin plays a crucial role: the origin effectively acts as a prior for where the abnormal class instances are assumed to lie. The method was tested on both synthetic and real-world data. The experiment was performed on the USPS dataset of handwritten digits; a Gaussian kernel was used for training, and the results showed that a number of outliers were in fact identified. Further criticism of this method is available in Manevitz and Yousef [48].
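For reference, this one-class SVM is available in scikit-learn. In the short example below (with illustrative data and parameter values), nu bounds the fraction of training points treated as outliers, i.e. how hard or soft the boundary is, and gamma sets the RBF kernel width:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
X_train = rng.normal(0, 1, size=(300, 2))    # the single 'normal' class

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X_train)

X_test = np.array([[0.2, 0.1],               # typical point
                   [5.0, -5.0]])             # far from the training data
print(ocsvm.predict(X_test))                 # [ 1 -1]; -1 marks a novel point
```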

A simpler method for novelty detection, extending the work of Tax and Duin, and Schölkopf et al., is proposed by Campbell and Bennett [14]. This system is based on the statistical analysis of data. The data distribution is modelled using a binary-valued function which is positive in those regions of input space where most of the data lies and negative everywhere else. This is achieved by defining separating hyper-planes in feature space that are positive on one side and negative on the other. A number of kernels may be used to construct such boundaries. The objective is to find a surface in input space that wraps around the data clusters: anything outside this surface is considered novel. This, according to the authors, can be easily solved using linear programming. Also according to the authors, this approach overcomes the problem of the origin [67], because rather than repelling the hyper-sphere from an arbitrary point outside the data distribution, they attract the hyper-sphere towards the centre of the data distribution. A hyper-plane is pulled over the mapped data points with the restriction that the margin always remains positive or zero. They make the fit of this hyper-plane as tight as possible by minimising the mean value of the output of the function. Obviously, the tighter the hyper-plane, the more sensitive the system becomes to noise and outliers in the data. For this, a soft margin approach is followed, incorporating a user-set parameter that controls the size of the boundary. A drawback of this method is that the system performance is very much dependent on the choice of the kernel width parameter. The only way to set this parameter is through experimentation, and if not enough data is available for validation purposes this can become very difficult; using an ensemble of models with varying kernel widths can lessen this impact. Another drawback of this method is the fact that the system always tries to fit a hyper-sphere around the data points. This, according to Tax and Duin [76], limits how tight the boundary can be put around the class objects, especially when classes are not spherically distributed.

Manevitz and Yousef [48] investigate the usage of SVMs for information retrieval with the aid of novelty detection. The paper first explains the method proposed by Schölkopf et al. [67] and how this work improves upon it. The authors criticise Schölkopf's technique for being too sensitive to the parameters selected, such as the choice of kernel: the difference in performance is very dramatic based on these choices, meaning that the method is not robust without a deeper understanding of these representation issues. The basic idea in this research is to work first in the feature space, and to assume not only that the origin is in the second class, but also that all data points 'close enough' to the origin are to be considered as noise or outliers. If a vector has few non-zero entries, this indicates that the pattern shares very few items with the chosen feature subset of the database; intuitively, such an item will not serve well as a representative of the class, and it is reasonable to treat such a vector as an outlier. By thresholding the number of features with non-zero values, an outlier can be declared. A global threshold can be set for all classes, or alternatively each class can have its own threshold; a validation set can be used for setting the thresholds. After the threshold is set, one can continue with the standard two-class SVM. Linear, sigmoid, polynomial and radial basis kernels were used in this work.
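The sparsity filter is straightforward; a small sketch with an invented term-count matrix and threshold:

```python
# Rows of a term-count matrix with too few non-zero features are declared
# outliers before the two-class SVM stage.
import numpy as np

X = np.array([[3, 0, 1, 2, 0, 4],    # overlaps well with the feature subset
              [0, 0, 1, 0, 0, 0]])   # almost no overlap: poor representative
min_nonzero = 3                      # global or per-class threshold (user set)
is_outlier = (X != 0).sum(axis=1) < min_nonzero
print(is_outlier)                    # [False  True]
```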
Diehl and Hampshire [26] discuss novelty detection for image analysis. For image classification and rejection, a set of closed decision regions encompassing the training examples from the various object classes is estimated. First, a large-margin partition of the input image feature space is learnt by minimising an objective function. This function is a generalisation of the standard formulation of support vector learning that is ideally suited for learning partitions of high-dimensional spaces from sparse datasets. Once the initial partition is learned, a rejection region Rreject is defined by estimating C differential thresholds that yield a given class-conditional probability of detection on a validation set. All images that lie in the rejection region are rejected. The logistic linear form induces closed decision regions on the surface of the hyper-cube, as desired; a logistic linear classifier is used to partition the class label distribution space. As the sequence classifier processes the observed image sequences, the image sequences assigned to a given class are rank-ordered based on their likelihood. Given that the likelihood is generally monotonically increasing with increasing differential, they sort the image sequences based on the differential produced by the sequence classifier. This allows the user to quickly focus attention on the examples that cause the greatest degree of confusion for the classifier.

Rätsch et al. [62] show, via an equivalence of mathematical programs, that a support vector (SV) algorithm can be translated into an equivalent boosting-like algorithm and vice versa. They show this translation procedure for a new algorithm, one-class leveraging, starting from the one-class support vector machine (1-SVM). This is a first step toward unsupervised learning in a boosting framework. Building on so-called barrier methods known from the theory of constrained optimisation, it returns a function, written as a convex combination of base hypotheses, that characterises whether a given test point is likely to have been generated from the distribution underlying the training data. In this manner, novel patterns can be detected.

Davy and Godsill [21] present a hybrid time-frequency/support vector machine (TFR/SVM) abrupt change detector. The objective of novelty detection is to decide whether a given vector x belongs to the set of training vectors X or is novel. A solution to estimating the region R consists of fitting an SVM kernel on the support training vectors defining a hyper-surface; the most commonly used kernel is the Gaussian kernel. In many situations, the training set may contain a small number of abnormal vectors that may cause the optimal hyperplane to be wrongly placed. The authors suggest the usage of slack variables that allow for some abnormal vectors. The method was successfully applied to audio signal segmentation, but no comparison to competing methods was performed.

Diehl and Hampshire [26] show an interesting application of novelty detection for video sequences using generalised support vector learning. In the first step, all objects in an image are determined by training a classifier and testing it on new video sequences. For a collection of images in a sequence, each video frame is then analysed for the objects it contains and assigned to a category that best represents that frame. For example, a frame that contains mostly a car may be labelled as "car". Video sequences whose labels vary greatly from frame to frame show a degree of confusion, and these sequences are labelled as novel.

2.3. ART approaches

Adaptive Resonance Theory has been shown to generate effective classifiers for novelty detection that outperform other classifiers such as the SOM and LVQ. For example, Moya et al. [59] compared ART2-A, Kohonen's LVQ, and Reilly and Cooper's Restricted Coulomb Energy network (RCE). All these algorithms use hyper-spheres to surround the training classes and produce closed decision boundaries. The difference between these algorithms is the manner in which they determine the number, position and sizes of the hyper-spheres: during training, ART2-A fixes the size of the hyper-spheres, RCE fixes the position, and LVQ fixes the number. After training, if a test vector is outside the hyper-spheres it is deemed to be unknown. Synthetic aperture radar (SAR) imagery data was used to train and test the networks. The results showed that ART2-A and RCE depend on the value of vigilance, a user-set parameter that controls the size of the feature space surrounded by the networks and consequently affects the number of hyper-spheres. Large vigilance causes the network to enclose lots of small regions and makes it highly discriminatory about what it calls a target; this allows excellent between-class generalisation but poor within-class generalisation. Small values of vigilance have the opposite effect, so this value needs to be optimised. Overall performance is defined as the minimal performance over all three generalisation criteria. After optimisation, LVQ yielded 89% performance, RCE 94% and ART2-A 100%.
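The shared hyper-sphere test is easy to state in code. In the sketch below the centres and radii are hard-coded for illustration; in practice they would come out of ART2-A, RCE or LVQ training:

```python
# A test vector falling outside every (centre, radius) pair is declared
# unknown.
import numpy as np

centres = np.array([[0.0, 0.0], [4.0, 4.0], [4.0, 0.0]])
radii = np.array([1.0, 1.5, 1.0])

def is_unknown(x):
    return bool(np.all(np.linalg.norm(centres - x, axis=1) > radii))

print(is_unknown(np.array([0.5, 0.2])))   # False: inside the first sphere
print(is_unknown(np.array([2.0, 2.0])))   # True: outside all spheres
```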

A number of ART and fuzzy ART models have been proposed in the literature. The fuzzy ARTMAP is a self-organising neural network in which each input learns to predict an output class K. During training the network creates internal recognition categories, with the number of categories determined on-line by predictive success. With fast learning, the jth weight vector records the largest and smallest component values of the input vectors placed in the jth category; the weight vector is thus a hyper-box that encompasses all input vectors assigned to that category during training. With a winner-take-all strategy, the node that receives the highest activation is selected and remains activated if it satisfies a matching criterion; otherwise, the network resets the activation field and searches again for the next node that satisfies the matching criterion. If the node makes a wrong class prediction, a match tracking signal raises vigilance just enough to induce search, which continues until either a node becomes active for the first time, in which case the class is assigned to that node, or a node that has already been allocated that class becomes active. In this way, the network learns the number of classes on-line. During testing, a test vector is assigned to the class represented by the activated node. A test pattern is classified as novel if a familiarity function is less than a predetermined threshold. This function considers all of the training objects that belong to the hyper-box defined by the weight vector of the winning node; a test pattern has to lie within this hyper-box to be deemed to belong to that class.

Carpenter et al. [15,16] extended the fuzzy ARTMAP neural network to perform familiarity discrimination (ARTMAP-FD) and tested the technique on a simulated radar target recognition task evaluated using ROC curves. Ideally, after training on a set of known objects, the network should not only be able to classify these objects but also abstain from making a meaningless guess when presented with an object belonging to a different, unfamiliar class. The method shows that the familiarity threshold is influenced by noise in the system: the chosen threshold will fail with increasing noise levels, and new threshold values need to be calculated. This value might be set by first calculating the noise level of the data. As the authors point out, because of noise and varying target patterns encountered during operation, the robustness of the choice of the optimal familiarity threshold is an important factor in the success of applications. The technique presented here for novelty detection, the strategies for setting the familiarity threshold, and the results obtained, as well as a comparison with another technique, are presented in more detail in Granger et al. [30].

2.4. RBF approaches

Radial basis function networks represent an important class of neural networks in which the activation of a hidden unit is determined by the distance between an input vector and a prototype vector [8]. Fredrickson et al. [29] used an RBF neural network with Gaussian basis functions for novelty detection. The centre positions and covariance matrices are determined by unsupervised clustering using k-means, followed by a width heuristic or the EM algorithm. Output weights can be computed via supervised learning techniques, such as least mean-square (LMS) gradient descent or matrix pseudo-inversion. The network outputs are estimates of Bayesian a posteriori class probabilities. The system is applied to speaker identification. Three novelty assessment techniques are used to evaluate the networks. First, the minimum Mahalanobis distance (MMD) is used: after applying a test pattern, the network with the MMD between the pattern and the kernel with the maximum response is selected, indicating very low novelty. Second, a novelty detection method is based on projective geometry, using the RBF hidden layer and pseudo-inversion of the matrix of kernels. Finally, Parzen windows are used to assess the novelty of the test pattern (this serves as a baseline).
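As an illustration of a minimum-Mahalanobis-distance score of this kind (our own sketch, not code from [29]; the kernels here are simply Gaussians estimated from clustered toy data):

```python
# A test point that is far from every kernel in Mahalanobis terms is treated
# as novel. Data and any threshold are illustrative.
import numpy as np

rng = np.random.default_rng(6)
clusters = [rng.normal([0, 0], 0.5, (100, 2)), rng.normal([5, 5], 0.5, (100, 2))]
means = [c.mean(axis=0) for c in clusters]
inv_covs = [np.linalg.inv(np.cov(c.T)) for c in clusters]

def mmd(x):
    """Minimum Mahalanobis distance from x to any kernel."""
    return min(np.sqrt((x - m) @ ic @ (x - m))
               for m, ic in zip(means, inv_covs))

print(mmd(np.array([0.2, -0.1])))   # small: familiar pattern
print(mmd(np.array([2.5, 2.5])))    # large: novel under, say, a threshold of 3
```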
Roberts and Penny [63] present a method for calculating network errors and confidence values relying upon the use of committees of networks. The authors state that the use of a single weight vector is sub-optimal because most problems are complex, with several local minima: although the weight vector is optimised, it still represents the weight set corresponding to one of many minima in the network's energy function. The solution to this problem is to use a committee of networks, each initialised with a different weight vector. The output error can be calculated from the committee's error covariance matrix, or simply by taking an average of all the error values. The first term in the error equation penalises variant decisions between committee members, and the second penalises the committee as a whole if the error is erroneous. The experiments were performed using a committee of RBF networks, each utilising thin-plate spline functions in the hidden layer. The technique was tested on a regression problem and a real muscle-tremor classification task, where the aim was to distinguish between patient and normal groups. The results showed that this approach outperforms a single network without discarding too much of the data as novel.
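A minimal sketch of the committee idea, using the spread of predictions across differently initialised networks as a confidence signal (the models, data and reading of the variance are our own illustrative choices, not the error-covariance formulation of [63]):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(np.pi * X).ravel()

committee = [MLPRegressor(hidden_layer_sizes=(10,), random_state=s,
                          max_iter=2000).fit(X, y) for s in range(5)]

def committee_variance(x):
    """Disagreement between members; high values suggest a poorly supported input."""
    return np.array([m.predict(x) for m in committee]).var(axis=0)

print(committee_variance(np.array([[0.0]])))  # small: inside the training range
print(committee_variance(np.array([[4.0]])))  # typically larger: extrapolation
```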

By adding reverse connections from the output layer to the central layer, Albrecht et al. [4] show how a Generalized Radial Basis Function (GRBF) network can self-organise to form a Bayesian classifier that is also capable of novelty detection. An RBF is fed with D-dimensional vectors through weighted synaptic connections activating M neurons in the central layer. The induced activities are calculated from non-linear activation functions; the most widely used activation function, and the one used in this research, is the multivariate Gaussian. The authors use a globally normalised alternative to the usual normalisation factor of the Gaussian, introducing a small cut-off parameter in order to confine non-zero activity responses of the central layer to an input pattern from a bounded region within input space; any activation below the cut-off is set to zero. The normalised activation functions used do not exhibit a simple radial decay characteristic, and therefore the authors call these functions General RBF (GRBF). If each neuron in the central layer is uniquely associated with one of the training classes, then the activity of the jth output neuron is simply the summation of the normalised activities of all those neurons in the central layer that are associated with class j. This makes the GRBF identical to the Bayesian Maximum Likelihood Classifier (BMLC). By adding reverse connections from the output layer to the central layer, the authors enable the GRBF to self-organise and group together the neurons of the central layer that belong to the same class. The cut-off parameter can be used for novelty detection. A pattern that belongs to a class previously unseen by the network is likely to elicit a very small total activity within the central layer, because this total activity is the likelihood that the input pattern has been drawn from the model density of the training set. If the activation is smaller than the cut-off, then none of the neurons will acquire a non-vanishing activation and the output of the network will also vanish. Thus, a vanishing response of the classifier is an indication of the novelty of the input.

Brotherton et al. [12] have used a Class-Dependent Elliptical Basis Function (CD-EBF) neural network for classification and novelty detection. Unlike the MLP, the EBF and similar networks have nearest-neighbour properties that make them well suited for novelty detection. EBFs facilitate novelty detection by the way they are trained. Training is performed in two steps. The first is clustering the training data into hidden-layer elliptical basis units (EBUs): the number of basis units required to model a given class is determined, and an LVQ algorithm is used to delineate the basis units for the given class. This is performed for all training classes. The second step is a least mean-square (LMS) weighting of the EBU outputs to form the desired function approximation for classification of each class; alternatively, the class of the EBU with the highest activation is simply selected. CD-EBF can be used for novelty detection because of its nearest-neighbour property. Each of the C sets of EBUs constitutes a model for its associated class. When novel data is input to the system, it is compared with each of the models developed for the C classes, and the responses are gauged and measured against a threshold. Each class might have its own threshold, calculated by combining the histogram of the class in question and the joint histogram of the rest of the classes. One very interesting property of this system is that the addition of a new class is very easy: a new set of EBUs for the new class only needs to be developed, and the information of the rest of the classes is simply carried through to the new system. The LMS step might need to be performed for all classes, though.

Jakubek and Strasser [34] similarly use ellipsoidal functions. Their fault detection scheme works in three steps. First, principal component analysis of training data is used to determine non-sparse areas of the measurement space. Fault detection is accomplished by checking whether a new data record lies in a cluster of training data or not. Therefore, in a second step the distribution function of the available data is estimated using kernel regression techniques. In order to reduce the degrees of freedom and to determine clusters of data efficiently, in a third step the distribution function is approximated by a neural network. In order to use as few basis functions as possible, a new training algorithm for ellipsoidal basis function networks is presented: new neurons are placed such that they approximate data points in the vicinity of their centres up to the second order. This is accomplished by adapting the spread parameters using Taylor's theorem. Thus, the number of necessary parameters and the computational effort for online supervision can be reduced dramatically.

because of the nearest-neighbour properties of the non-linear components. A novel sample when input
RBF. When signal data is input to the system, it is to such a network will fail to recreate the same output
matched against the model developed. If the input and hence the error at the output can be thresholded to
signals do not fall in any of the basis units, then reject novel samples. The batch self-organising map
anomaly is detected. can also be used as an auto-associator because it can
In most applications of RBF networks, the output be used for .nding discrete approximation of prin-
strategy implemented is as with the backpropagation cipal curves (or surfaces) by means of a topologi-
network that of Winner Take All (WTA). However, according to Li et al. [45], this is not the most desirable strategy when one is dealing with unknown classes, such as in the case of fault diagnosis. An alternative to WTA is to apply a threshold at the output of the network: if the value of an output neuron exceeds this threshold, the test vector is assigned to the class represented by that neuron. This approach offers the advantage of classifying a sample as neither 'normal' nor an 'existing fault' but as something novel. This is particularly useful in fault diagnosis, where not all faults are known a priori and available for training, and where more than one fault may occur simultaneously. However, there is a serious problem: there is no theoretical basis for setting this threshold (it is usually set empirically to 0.5). Li et al. [45] attempt to give a mathematical and geometrical analysis of how such a threshold might be set.

2.5. Auto-associator approaches

The main idea behind the auto-associator approach is to train a system to recreate its input at its output. Song et al. [70] describe a number of different ways of building an auto-associator. The simplest method is to use Principal Component Analysis (PCA). PCA relies on an eigenvector decomposition of the covariance or correlation matrix of the process variables; however, it identifies only linear correlations between the variables. Principal curves and surfaces are non-linear generalisations of the first principal component, and the principal manifold is a generalisation of the first two principal components; conceptually, the principal curve is a curve that passes through the middle of the data. Auto-association can also be performed using a multi-layer perceptron architecture that implements the mapping functions through a bottleneck, and most approaches to novelty detection use feed-forward networks. The whole process can then be seen as non-linear PCA, where the hidden weights represent a smaller number of dimensions than the input. The SOM can also act as an auto-associator, mapping the data onto a topological map of units; the batch version of the SOM is closely related to the principal curves algorithm. Finally, principal curves can be combined with neural networks: principal curves map an n-dimensional space to k-dimensional non-linear principal scores (n > k), and the neural network is applied to the k-dimensional scores to map them back to the n-dimensional corrected data set. The aforementioned methods, i.e., PCA, principal curves and neural networks, principal curves and splines, and the self-supervised MLP, were tested on a simple synthetic three-dimensional mathematical problem. The results show that the principal curves and splines method outperforms the rest, with the self-supervised MLP second best.

The auto-encoder neural net has a number of uses, from non-linear principal component analysis and information compression to the recovery of missing sensor data [81] and motor fault detection [61]. The most striking property of the auto-encoder is its ability to implicitly learn the underlying characteristics of the input data without any a priori knowledge or assumptions. Most of the studies below use a neural network based auto-encoder system. Byungho and Sungzoon [13] discuss three critical properties of an auto-associator MLP novelty detector that is trained with normal patterns only. First, there exist infinitely many input vectors for which such a network produces the same output vector. Second, there exists an 'output-constrained hyperplane' onto which all output vectors are projected; as long as the MLP uses bounded activation functions, this hyperplane is bounded. Finally, minimising the error function leads to a hyperplane located in the vicinity of the training patterns. Similarly, a detailed analysis of the probabilistic behaviour of a novelty filter working on the autoassociative principle is available in Ko and Jacyna [36].
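The reconstruction-error principle shared by all of these auto-associators can be made concrete with the simplest variant that Song et al. [70] list, a linear PCA reconstruction. The following is a minimal sketch rather than any cited author's implementation; the function names, the 99th-percentile threshold and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_pca_autoassociator(X_train, n_components=3):
    # Principal directions from an eigendecomposition of the (centred) data
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    W = Vt[:n_components]                 # k x n projection: the "bottleneck"
    return mean, W

def novelty_index(x, mean, W):
    z = W @ (x - mean)                    # map to k-dimensional scores
    x_hat = mean + W.T @ z                # map back to the n-dimensional space
    return np.linalg.norm(x - x_hat)      # reconstruction error as novelty index

# Threshold chosen empirically from the training residuals
X_train = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
mean, W = fit_pca_autoassociator(X_train)
errors = [novelty_index(x, mean, W) for x in X_train]
threshold = np.percentile(errors, 99)
print(novelty_index(rng.normal(size=10) * 5, mean, W) > threshold)
```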
Diaz and Hollmen [25] also studied the properties of autoassociative nets and compared the least-squares mapping to kernel regression mapping. They found that kernel regression mapping is better suited, and they suggest how residuals can be correlated with prior knowledge to produce visualisations that can aid fault diagnosis.

Japkowicz et al. [35] deal with the problem of binary classification, that is, classifying a signal into one of two classes, normal or abnormal, using a novelty detection technique. The threshold that determines whether the reconstruction error is small or large is, according to the authors, relatively easy to set and can be selected during training of the system: the error on the training data defines a lower bound, which is then relaxed a little and used for testing. In cases where the separation between the two classes is more difficult, a few samples of the negative class may be used during training to set the threshold better.

Similarly, Streifel et al. [71] describe the use of an auto-associator network for detecting shorted windings (novel samples) in operational turbine-generators. They calculate the threshold at the output layer from the training data. The average vector, called the prototype, is found and subtracted from all the patterns in the training set; this translates the signature signals towards the origin of the signal space. The simplest detection surface, according to the authors, is the hyper-sphere: the largest Euclidean length of the translated healthy signature signals is used as the threshold, and any signature outside the hyper-sphere is considered a fault.

Worden [88] applied the auto-associator network to a simulated condition monitoring task. A novelty measure is calculated for each input pattern by taking the Euclidean distance between the input and the output pattern, and training stops when this measure is reduced to zero on the training data. Novelty is detected if a test pattern returns a non-zero novelty index. Unlike the previous study by Worden [88], the objective of the study by Surace and Worden [72] is to detect damage in structures that have at least two normal operating conditions. The three-degrees-of-freedom system with concentrated masses considered by Worden [88] was used with two normal conditions, and the system was successful in detecting the fault condition. The technique was compared with a naïve solution that considers the Euclidean distance between training and testing patterns after averaging over the training patterns; this solution, however, fails in the presence of two normal conditions.

Surace et al. [73] describe novelty detection for crack detection in beams using an auto-associator network. The neural network is trained using transmissibility functions of an uncracked beam and is then tested on patterns from the cracked beam. In this study a comparison was made between using the Euclidean distance to calculate the novelty index, as described in Surace and Worden [72], and the Mahalanobis distance with the covariance matrix derived from the training data; a novelty index efficiency formula was used to compare the two distances. The cracked cantilevered beam used in this study behaves in a non-linear manner, and the study shows that the technique is effective in both linear and non-linear structures since it requires no a priori knowledge of the model. The patterns of the cracked and uncracked beam were simulated using Euler-type finite elements with two degrees of freedom. Using the technique with the Euclidean distance it was possible to positively identify the presence of cracks in all cases in the simulation; the simulation was repeated with the Mahalanobis distance, which showed better performance in the presence of noise.

Surace and Worden [72] study damage detection in an offshore platform. The method presented is an extension of the authors' previous work, in which the proposed technique was not robust enough to handle more than one normal condition; this paper extends the method to handle a continuum of normal conditions, a situation faced, for example, in an offshore platform. The difference from the approach previously suggested by the authors is that the training patterns are contaminated with noise to increase their variability and that, instead of the Euclidean distance, the Mahalanobis distance (with the covariance matrix of the training data) is used to define the novelty index. The technique was tested on the same problem as in Surace et al. [73] and on a Finite Element model of an offshore platform.
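The two novelty indices compared in these structural studies differ only in the metric applied to the residual between the input and its reconstruction. A minimal sketch of both, with the covariance estimated from training residuals in the spirit of Surace and Worden [72]; the toy data and function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def euclidean_index(r):
    # r = x - x_hat, the residual between input and reconstruction
    return float(np.linalg.norm(r))

def mahalanobis_index(r, cov_inv):
    # Weights the residual by the inverse covariance of residuals
    # observed on (noise-contaminated) training data
    return float(np.sqrt(r @ cov_inv @ r))

train_residuals = rng.normal(size=(500, 6)) * np.array([1, 1, 1, 3, 3, 3])
cov_inv = np.linalg.inv(np.cov(train_residuals, rowvar=False))
r_test = np.array([0.5, -0.2, 0.1, 4.0, -3.5, 2.0])
print(euclidean_index(r_test), mahalanobis_index(r_test, cov_inv))
```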
Ko et al. [37] present an auto-associator network based hierarchical identification strategy for the successive detection of the occurrence, type, location and extent of structural damage in bridges. The first stage of the proposed strategy is to detect the occurrence of damage or anomaly in the bridges; just as in the previous studies, the difference between the input vector and the trained network's output vector serves as a novelty index, and in this study five auto-associator networks are used to monitor the condition of each bridge. The second stage of the hierarchical process uses a probabilistic neural network to detect the type and the location on the bridge of the damage found by the novelty detector. Finally, a backpropagation neural network is used to estimate the extent of the damage.

Sohn et al. [69] explore the effect of changing environmental conditions on an auto-associator novelty detection system. Novelty detection is, in one sense, the measurement of the deviation of some features of a system from the norm. However, changing environmental conditions such as varying temperature, moisture, lighting and so on affect the normal features and the normal conditions; moreover, these changes may hide the true novelty within the system [49,50]. The main difference from a conventional auto-associator network is that the output of the 'bottleneck' layer is fed to another hidden layer with the same number of units and activation functions as the mapping layer. The difference between the reconstructed output and the input to the network is called the residual error, which also acts as the novelty index. The system only indicates the presence of novelty and not its type or severity: when patterns of an unknown condition are presented, the novelty index is expected to increase, and if it rises above the predefined threshold the pattern is deemed novel.

Manevitz and Yousef [47] apply the auto-associator network to the document classification problem. To determine the acceptance threshold, the authors used a sophisticated method based on a combination of variance and calculation of the optimal performance. During training, they checked the performance values on the test set at different levels of error and ceased training at the point where the performance started a steep decline; they then performed a secondary analysis to determine an optimal real multiple of the standard deviation of the average error to serve as a threshold. The method was tested against a number of competing approaches and found to outperform them; the competing systems were prototype matching, Nearest Neighbour, Naïve Bayes and Distance-based Probability algorithms.

2.6. Hopfield networks

The human brain has more capacity for familiarity discrimination than for the recognition of various stimuli, and Hopfield networks have been suggested as good quality novelty detectors [33]. Bogacz et al. [9] demonstrate that a neural network has exactly the same characteristic. They implement familiarity discrimination using two models: first, using a single neuron with Hebbian learning, and second, using the energy of a Hopfield network. Their approach differs from existing ones in that it assumes that the patterns are not correlated. These algorithms compress information and perform discrimination either by discovering the underlying distribution of the familiar patterns and finding outliers, or by constructing prototypes of the various classes; the weights of the neural networks are used to store the information of the uncorrelated patterns. Both models have a higher capacity for familiarity discrimination than for retrieval. The first model assumes a single neuron with N inputs. The node takes values between −1 and +1, where −1 indicates an inactive state and +1 an active state. All weights are initialised to zero and the Hebbian rule is used to update them. The authors demonstrate that the average output h is 1 for stored patterns and 0 for novel patterns; therefore, by taking as threshold the middle value 0.5, such that y = sgn(h − 0.5), they can perform novelty detection. Assuming the noise is small enough, y = −1 for novel patterns and y = +1 for known patterns. The neuron works well if the noise is smaller than the absolute value of the threshold, and the larger the number of stored patterns, the higher the noise; the authors show that the model is successful if the number of stored patterns is less than 0.046N, where N is the number of input neurons. The second model used in this paper is based on a Hopfield network trained with the Hebbian learning rule. The value of the energy function is usually lower for stored patterns and higher for other patterns. The authors show that, for zero-mean noise, a known pattern yields an average energy ⟨E⟩ = −N whereas a novel pattern yields ⟨E⟩ = 0, where E is the network's energy and N is the number of input neurons; an appropriate novelty threshold is therefore −N/2, and a pattern with energy E < −N/2 is considered familiar. As before, errors occur when noise exceeds the novelty threshold. The authors calculate that the maximum number of stored patterns should be 0.023N².
The capacity of the second model is exactly half that of the first. This is because the Hopfield network has symmetrical weights and thus stores each piece of information twice; if the redundant connections are removed, the capacity of both models is the same. The paper offers no experimental results testing the two models in a novelty detection task, but it does present good theoretical answers on their capacity for familiarity discrimination.

A similar system using a Hopfield neural network for familiarity discrimination is discussed by Crook and Hayes [19]. The model stores information about familiar patterns in the weights of a Hopfield neural network. A Hopfield network can be used to reconstruct the patterns it has learnt at its output space through an iterative process in which each neuron is updated several times until the network relaxes to the recalled pattern. Novelty detection is implemented by calculating the energy of the Hopfield network after a pattern is shown and then thresholding this energy: patterns with low energy are deemed familiar, whereas large energies point towards novel patterns. Estimating the novelty threshold has a theoretical basis, and it is calculated as E < −N/8 for familiar patterns, where N is the number of input neurons in the network. The advantage of using the energy to determine novelty, rather than letting the neurons settle through iterations and reconstruct the pattern, is that the energy is computationally cheaper and its cost remains constant no matter how many patterns are stored in the network. One shortcoming of this technique is the novelty threshold itself: according to the authors, the more patterns the network learns, and hence the more noise is introduced, the less effective the threshold becomes. The authors compare their technique with that of Marsland et al. [52] and claim very similar performance but with significantly fewer learning and novelty detection runs.

A criticism of neural network architectures is their susceptibility to catastrophic interference, i.e., the tendency to forget previously learned data when presented with new patterns. Addison et al. [2] evaluate two architectures, namely Hopfield and Elman networks, and compare them with self-organising feature maps and time-delayed neural networks in a novelty detection task. The Hopfield network essentially attempts to store a specific set of equilibrium points such that, once an initial condition is provided, the network eventually comes to rest at that design point. Elman networks contain an internal feedback loop, which makes them capable of both detecting and generating temporal patterns; owing to their two-layer sigmoid/linear architecture, they can approximate any input/output function with a finite number of discontinuities. Time delay networks consist of a complete-memory temporal encoding stage followed by a feedforward neural network. The results showed that certain architectures are better at recognising novelty than others. The Hopfield networks were capable of discriminating between normal and extremely obvious novel patterns but had difficulties on other, more difficult abnormal patterns. The Elman networks showed excellent performance in recognising known patterns as well as discriminating the various novel patterns, and the Kohonen network also showed good classification performance. The time-delayed network was able to discriminate between error and normal patterns but, like the Hopfield network, it had difficulties recognising novelty in sets that consisted primarily of normal patterns.
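Both energy-based detectors above reduce to thresholding a single quadratic form, so no relaxation iterations are needed. A minimal sketch, assuming the common Hebbian construction with a 1/N scaling; under this (illustrative) convention the −N/8 threshold of Crook and Hayes [19] separates stored from novel patterns provided the number of stored patterns is small relative to N.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 500, 20
patterns = rng.choice([-1, 1], size=(M, N))

# Hebbian weight matrix with zero self-connections
W = (patterns.T @ patterns).astype(float) / N
np.fill_diagonal(W, 0.0)

def energy(x, W):
    # Hopfield energy of state x: one quadratic form, no settling required
    return -0.5 * x @ W @ x

threshold = -N / 8                    # familiarity threshold from [19]
print(energy(patterns[0], W) < threshold)                  # True: familiar
print(energy(rng.choice([-1, 1], size=N), W) < threshold)  # False: novel
```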
2.7. Oscillatory neural networks

Ho and Rouat [32] proposed a neural network model for studying neural information processing in the cortex; the system dynamics and the self-organising process exhibit robustness against highly noisy input patterns. Along with the neural network model, they present a new paradigm for pattern recognition based on oscillatory neural networks, in which the relaxation time of the oscillatory behaviour is used as the criterion for novelty detection. The neuron model is inspired by the integrate-and-fire neuronal model with a refractory period and post-synaptic potential decay. The model defines a single two-dimensional sheet of excitatory and inhibitory neurons with recurrent connections; the layer consists of two populations of neurons interspersed within the plane, and each neuron has a set of interconnections chosen according to a square neighbourhood centred on the neuron itself. If the network's action stimulated by an input signal is successful, all connections of firing neurons are reinforced, regardless of whether they participated in creating the successful action; if the action is unsuccessful, the connections of firing neurons are weakened. The Hebbian rule is applied to update the connection weights. For novelty detection, the network, with randomly initialised connection strengths, is first trained with learning patterns and reaches an equilibrium state after learning. In the novelty detection phase, patterns are introduced to the trained network: the network reaches an equilibrium state after a relatively small number of iterations if these patterns have been learnt before, but otherwise takes a long time to reach equilibrium. Novelty is defined by this time taken.

Kojima and Ito [40] proposed an autonomous dynamical pattern recognition and learning system which can learn new patterns without any external observer. For the novelty filter, the network is constructed from Lorenz systems. The learning rule updates the synaptic weights in a self-organising manner according to the discrete-time Hebbian learning rule, and Hamming distances are calculated to measure the output pattern of the network. Novelty detection occurs in a manner similar to Ho and Rouat [32]: when a known pattern is given to the network, the network oscillates periodically and its output oscillates between the relevant embedded pattern and its inverse; if a novel pattern is input, the network reaches a turbulent state. This turbulent state is interpreted as confusion, and the pattern is deemed novel. During this state, the Hebbian learning rule is applied to learn the new pattern.

Borisyuk et al. [10] describe a model consisting of a one-layer network of interacting oscillators. The activity of an oscillator represents the average activity of interacting neural populations (local field potential), with the oscillators grouped. In the initial stages, before the oscillatory network stores any information, each group contains oscillators whose natural frequencies are distributed across the whole range of input frequencies; during information storage, these natural frequencies may change. An oscillator reaches and keeps a high level of activity if the signals supplied to it through the first channel arrive in phase. This implies that the presentation of a stimulus results in high oscillatory activity at only a small number of randomly chosen locations (groups), while the activity in other parts of the network remains low; this occurs in both memorisation and recall. For novelty detection it is possible to choose the parameters of learning control in such a way that the number of resonant oscillators at the end of stimulus presentation is small for a new stimulus, but becomes large (and exceeds a certain threshold) only if the stimulus has been learnt before; if the stimulus fails to pass the threshold, a novelty is identified. A computer simulation was used to test the model, showing that this type of network can indeed be used for novelty detection; however, the simulation was limited and no clear results or comparisons with competing methods were presented.

2.8. SOM based approaches

Self-organising maps (SOMs), proposed by Kohonen [38,39], are an alternative to statistical clustering of data. The approach is unsupervised, and therefore no a priori information on the class labels of samples is necessary. In most SOM based approaches, similar to statistical clustering, some form of cluster membership value is thresholded to determine whether a sample belongs to a cluster or not.
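A minimal sketch of this thresholded-membership idea: train a small SOM on normal data and threshold the quantisation error, i.e. the distance to the best-matching unit. The training loop, grid size and percentile threshold are illustrative choices rather than details taken from any of the studies reviewed below.

```python
import numpy as np

rng = np.random.default_rng(4)

def train_som(X, grid=(5, 5), epochs=20, lr0=0.5, sigma0=2.0):
    units = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], float)
    W = X[rng.choice(len(X), size=len(units))].astype(float)  # codebook vectors
    T, t = epochs * len(X), 0
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            decay = 1.0 - t / T                 # shrink rate and neighbourhood
            bmu = int(np.argmin(((W - x) ** 2).sum(axis=1)))
            d2 = ((units - units[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2 * (sigma0 * decay + 0.5) ** 2))
            W += lr0 * decay * h[:, None] * (x - W)   # Kohonen update
            t += 1
    return W

def quantisation_error(x, W):
    return float(np.sqrt(((W - x) ** 2).sum(axis=1).min()))

X_normal = rng.normal(0.0, 1.0, size=(400, 3))
W = train_som(X_normal)
threshold = np.percentile([quantisation_error(x, W) for x in X_normal], 99)
print(quantisation_error(rng.normal(6.0, 1.0, size=3), W) > threshold)  # likely True
```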
Aeyels [3] provides proofs and clarifies some points regarding the convergence properties of the novelty detector and novelty filter described by Kohonen [38]. These adaptive systems are capable of storing a number of inputs and responding only to 'new' inputs, patterns that the system has not 'seen'. The author tries to elucidate some of the results presented by Kohonen [38] but also indicates some problems. Kohonen [38] described two types of novelty detectors: the novelty detector without forgetting and the novelty detector with forgetting. For the novelty detector without forgetting, convergence is easy to derive when the input to the system is constant. However, when the input is a regular bounded function of time, the author proves that, in order to have the system react to novelty with respect to the stored patterns, all the stimuli should keep coming back. In other words, the system can only memorise patterns to which it is frequently exposed: any new stimulus will provoke an initial reaction in the output and will then be added to the memory if it is frequently reiterated. The novelty filter without forgetting consists of a collection of novelty detectors connected by particular feedback laws, and the author shows that the convergence of this system is easy to prove. Finally, the novelty filter with forgetting contains a forgetting term that forces the system to habituate and reduce its response when similar patterns are frequently shown; this is similar to the habituation described by Marsland et al. [51–54].

Harris [31] presents one of the earliest approaches to using a Kohonen self-organising map (SOM) for novelty detection, in an engine health monitoring system. A SOM is trained with examples of normal operation; after training, the map contains the reference vectors of the input data, optimised to accurately represent both the density and the position of the input data. When testing, the distance between the test vector and these reference vectors is a measure of novelty. A variation of this approach can be used if some examples of the faulty conditions are available: they can then also be used during training to better represent the faulty space, although the task is thereby reduced to classification rather than novelty detection.

Ypma and Duin [89] employ a SOM to develop a novelty detection technique for the detection of faults in a fault monitoring application. The authors comment on the unavailability of samples that accurately describe the faults in the system and agree with other authors that the best solution is to accurately build a representation of the normal operation of the system and measure faults as deviations from this normality. The use of a SOM provides a domain description instead of a probability density estimate as used by Barnett and Lewis [6], Bishop [7], Tarassenko [74] and several others. Additionally, the topology of the input space is preserved, as opposed to using some other unsupervised clustering algorithm such as k-means, giving information about the mapping that could be exploited in defining a more confident 'compatibility measure'. The novelty detection technique itself is very simple: once the SOM is trained with samples of normal operation, patterns from normal operation generate small distances at test time while abnormal patterns generate large distances.

Emamian et al. [28] present a very simple novelty detection technique based on a SOM to discriminate the acoustic emissions of healthy machinery from those of machinery presenting a crack. The technique is similar to those proposed by Harris [31] and Ypma and Duin [89], but not as sophisticated as the habituation approach taken by Marsland et al. [52–54]. After the SOM is trained, it is expected that different types of transient input signals will activate different nodes on the Kohonen map. The authors do not threshold the Euclidean distance between the activated neuron and the input data, as most approaches do, but instead use the index of the activated node to discriminate between normal and fault conditions: faulty features are expected to excite different nodes than healthy ones.

Labib and Vemuri [42] recently described an implementation of a network-based Intrusion Detection System (IDS) using a SOM, called NSOM, which performs anomaly detection. The SOM implementation used was a Kohonen net in which the winning neuron is the one with the shortest distance from the input pattern. The NSOM is first trained with patterns describing normal network traffic, and the output response of the NSOM is noted, i.e. the neurons that are activated are stored. The network is then tested: if the winning neuron is not one of the noted neurons, a novelty is declared. The distance between an input pattern and the winning neuron is expected to be much larger for a novel pattern than for a known pattern. The technique described here is closely related to that of Emamian et al. [28].

Theofilou et al. [80] propose a new Long-Term Depression Kohonen Network (LTD-KN) that is well suited for novelty detection. The network behaves like a normal Kohonen network in every way except that the change of the weight vectors for the winning and neighbouring neurons is determined by an inverse of the classic Kohonen rule. After learning, all patterns used in the training set, and all patterns similar to them, give decreasing activation values, whereas patterns dissimilar to the training set (novel patterns) always result in a stable high activation both during and after learning. The differentiation between known and novel patterns thus comes as a natural consequence of learning, which makes the LTD-KN function as a novelty detector.
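The Emamian et al. [28] and Labib and Vemuri [42] variants replace the distance threshold by a set-membership test on the index of the winning node. A minimal, self-contained sketch of that test; the random codebook stands in for a trained map, and all names are illustrative.

```python
import numpy as np

def bmu_index(x, W):
    # Index of the winning (best-matching) unit for input x
    return int(np.argmin(((W - x) ** 2).sum(axis=1)))

def fit_normal_nodes(X_normal, W):
    # Record every unit that wins on normal training patterns
    return {bmu_index(x, W) for x in X_normal}

def is_novel(x, W, normal_nodes):
    # Novelty is declared when the winner was never activated in training
    return bmu_index(x, W) not in normal_nodes

rng = np.random.default_rng(5)
W = rng.normal(size=(25, 3))                    # stand-in for a trained 5x5 map
X_normal = rng.normal(0.0, 0.5, size=(200, 3))
nodes = fit_normal_nodes(X_normal, W)
print(is_novel(rng.normal(5.0, 0.5, size=3), W, nodes))
```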
2.9. Habituation based approaches

Habituation is the mechanism by which the brain learns to ignore repeated stimuli, and it is considered the most basic form of plasticity within the brain. Marsland et al. [51] attempted to exploit this phenomenon for novelty detection in a mobile robot application, and a number of subsequent studies by the same authors attempted to improve on the original idea. In Marsland et al. [51], they implement the original proposal made by Wang and Arbib [86] on how to construct a neural model with habituation that is capable of novelty detection. Wang and Arbib [86] modelled the tectal relay and the anterior thalamus (AT), the areas that process the images taken from the retina, and then extended the model with a neural mechanism for the medial pallium (MP), the region in which habituation is thought to take place. Their design consists of a large number of columns arranged vertically, with five layers of cells in each column and each layer consisting of n neurons. One neuron in the AT propagates its response strength to all neurons in the next layer simultaneously. The outputs of these neurons are controlled by a novelty threshold that increases monotonically with each activation and is designed to make cells with a higher number of activations harder to fire, so that stronger stimuli have a larger number of neurons firing. Marsland et al. describe the procedure in extensive detail; the only concern with the technique is that it is very computationally intensive and impractical in a real-world robot implementation. The authors considered other ways of implementing habituation in a series of later papers [52–54].

Marsland et al. [52] use a Kohonen self-organising map for classifying the inputs and habituating synapses for implementing the novelty filter; both the clustering network and the novelty filter are described in detail in Marsland et al. [54,55]. Marsland et al. [53] use the same technique, the one described in Marsland et al. [54,55]; however, in that paper two alternative clustering schemes, the Temporal Kohonen Map (TKM) and k-means clustering, are evaluated, as opposed to the SOM used in Marsland et al. [54] and the GWR network presented in Marsland et al. [55]. The TKM is based on Kohonen's SOM but uses 'leaky integrator' neurons whose activity decays exponentially over time. This acts like a short-term memory, allowing previous inputs to have some effect on the processing of the current input so that neurons that have won recently are more likely to win again; in other words, some temporal information is retained in the system, which can be very useful in many applications including video analysis. The novelty detection technique used is based on a clustering network that classifies the inputs, with the output modulated by habituable synapses, so that the more a neuron fires, the lower the efficacy of its synapse becomes; if a synapse is fed with zero instead of nothing, it forgets the inhibition over time. The three clustering schemes were tested on a relatively easy mobile robot application. The overall qualitative results were similar for all three networks, although the SOM took considerably longer to produce consistent output when a new pattern was introduced, while the TKM responded the quickest. When two additional light patterns were introduced, the TKM and the k-means clustering performed much better than the SOM, with the TKM again responding the fastest.
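A minimal sketch of the habituable synapse that these filters attach to each clustering node: its efficacy drops every time the node fires and slowly recovers otherwise. The first-order update and its constants are illustrative; the cited papers derive their dynamics from a differential equation of habituation.

```python
class HabituableSynapse:
    """Synaptic efficacy that decays with use and recovers with disuse."""

    def __init__(self, y0=1.0, tau=10.0, recovery=0.05):
        self.y = y0               # current efficacy; starts fully responsive
        self.y0 = y0              # resting efficacy
        self.tau = tau            # habituation time constant
        self.recovery = recovery  # rate of forgetting the inhibition

    def step(self, fired: bool) -> float:
        if fired:
            self.y -= self.y / self.tau                   # habituate
        else:
            self.y += self.recovery * (self.y0 - self.y)  # slow recovery
        return self.y

# The output (novelty value) is high for rarely seen winners, low for familiar ones
syn = HabituableSynapse()
for _ in range(30):
    novelty = syn.step(fired=True)
print(round(novelty, 3))          # small: the stimulus has become familiar
```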
Marsland et al. [54] describe an algorithm suitable for detecting novel stimuli based on habituation and apply it to an autonomous agent. The paper uses the habituating self-organising map (HSOM), a neural network that is capable of detecting novel objects. An input vector is presented to a clustering network, which finds a winning neuron using a WTA strategy. Each neuron in the map field is connected to the output neuron via a habituable synapse, so that the more frequently the neuron fires, the lower the efficacy of the synapse and hence the lower the strength of the output. The strength of the winning node is taken as the novelty value; the more familiar the object, the closer this value falls to zero. The clustering network is implemented using a Kohonen network implementing LVQ. The map self-organises according to the input vectors by moving the winning neuron, and to a lesser extent its immediate neighbours, towards the input; in the HSOM implementation, the synapses of these neighbours are also changed. The neighbourhood size and the learning rate are user-defined and are automatically reduced during training. The environment was then changed, or a new environment used, to test the HSOM. The main disadvantage of the method, as pointed out by the authors, is the self-organising map itself: the size of the SOM needs to be defined in advance, often without a priori knowledge of the number of objects, or of their complexity, that the system is likely to encounter. This can lead to the SOM becoming saturated, with previously learnt stimuli being lost and novel stimuli being misclassified as known. The work of Marsland and colleagues [52–54] has been used by Saunders and Gero [65,66] for novelty in design solutions: they use habituated SOMs to estimate the novelty of a doorway design, and a reinforcement signal proportional to the novelty of the design situation is produced to reward the controller for finding novel situations.

Marsland et al. [55] improve the techniques proposed in their earlier studies. As an alternative to the SOM, a new type of clustering map, called the grow-when-required (GWR) network, was developed that allows the insertion of new nodes when required. In this network, both the synapses and the nodes have counters that indicate how many times they have fired. Using these counters it is possible to determine whether a given node is still learning the inputs or is 'confused', in other words trying to map inputs from different classes. If this is the case, a new node is added to the network between the input and the winning node that caused the problem. The insertion of nodes depends on two user-defined thresholds: a minimum activity threshold, below which the current node is not considered a sufficiently good match, and a maximum habituation threshold, above which the current node is not considered to have learnt sufficiently well. These thresholds need to be set experimentally. The experiments were performed with a small SOM, a large SOM and a GWR network. The small network quickly saturated and was unable to learn all the objects, whereas the large SOM had problems learning the novel objects: it was very sensitive to noise in the sensors and kept misclassifying learnt objects as novel. The GWR, on the other hand, showed very promising results; the network learnt quickly and was successful in recognising the novel objects by the end of the third run.

Crook et al. [20] compare two models of novelty detection: the GWR network proposed by Marsland et al. [55] and the Hopfield energy model [19]. Two different robot experiments were used to compare the two novelty detection methods; both experiments were very simple, so that the comparison is strictly between the filters and not other experimental issues associated with computer vision pre-processing. The GWR novelty filter showed more robustness against noise; in general both filters show similar results, although the GWR is slightly better.

Marsland et al. [56] extend the GWR system. The new system presented in this paper is capable of autonomously selecting which novelty filter to use depending on the environment in which the system operates. The problem in most novelty detection systems is that objects that are quite normal in some environments should be considered novel in others: for example, a chair is normal in an office but should be found novel in a corridor. In this system, multiple novelty filters are trained in different environments and the correct filter for the current inputs is selected. A vector of familiarity indices keeps track of how novel each novelty filter finds the environment, and the indices are updated after each perception has been presented to all the novelty filters. The technique was experimentally tested in a similar way to the authors' previous studies; this time three environments were used to train three different novelty filters, and the correct filter was chosen at all times.
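A minimal sketch of the GWR insertion rule described above: a new node is added only when the winner is simultaneously a poor match (activity below the activity threshold) and already well trained (habituation below the habituation threshold). This is a simplification of the full algorithm, which also maintains edges between nodes and adapts neighbours; the constants and helper names are assumptions.

```python
import numpy as np

def gwr_step(x, W, habit, a_T=0.8, h_T=0.3, eps=0.1, tau=5.0):
    """One simplified GWR step: adapt the winning node or insert a new one.

    W: (n_nodes, dim) codebook; habit: per-node habituation values in (0, 1];
    a_T: minimum activity threshold; h_T: maximum habituation threshold.
    """
    dists = np.linalg.norm(W - x, axis=1)
    s = int(np.argmin(dists))
    activity = np.exp(-dists[s])             # close match -> activity near 1
    if activity < a_T and habit[s] < h_T:
        # Winner matches poorly yet has fired often: grow a new node
        W = np.vstack([W, (W[s] + x) / 2.0])
        habit = np.append(habit, 1.0)
    else:
        W[s] += eps * habit[s] * (x - W[s])  # adapt winner, gated by habituation
        habit[s] -= habit[s] / tau           # the winner habituates with use
    return W, habit

rng = np.random.default_rng(6)
W, habit = rng.normal(size=(2, 2)), np.ones(2)
for x in rng.normal(size=(100, 2)):
    W, habit = gwr_step(x, W, habit)
print(len(W), "nodes after training")
```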
2.10. Neural tree

Martinez [57] introduces a competitive learning tree as a computationally attractive scheme for adaptive density estimation and novelty detection. When the neural tree is continuously adapted, novelty detection can be performed on-line by comparing, at each time step, the current estimated model with an a priori model. The proposed approach combines the unsupervised learning property of competitive neural networks with a binary tree structure. The procedure performs a hierarchical partitioning of the d-dimensional feature space by means of hyper-planes perpendicular to the coordinate axes. This results in a binary tree structure in which each internal node stores two scalar quantities: an index representing the dimension orthogonal to the hyper-plane, and a weight representing the location of the hyper-plane on this axis. The initialisation can be performed with N input data sampled either randomly from the training data or sequentially as data become available. The neural tree is built up by splitting nodes one at a time so as to maintain a single count in each partition cell; the cell to be further partitioned is the one in which the input sample falls, and splitting occurs midway between the new data point and the one previously stored. For a given tree topology, the weights are optimised by maximising Shannon's entropy. A top-down learning scheme is employed, which consists of optimising the parameters level by level within the model, starting at the highest level (the root node); for on-line learning, all levels are trained in parallel as data arrive. For novelty detection, the tree was applied to two-dimensional data to adaptively partition the input space into 64 equi-probable cells, with the starting weight configuration obtained by random initialisation. Each time, the partition revealed smaller cells in high-density regions and larger cells in low-density regions relative to the underlying input distribution; the learning rule thus provides an adaptive focusing mechanism capable of tracking time-varying distributions. The tree constructed from the training data serves as a reference tree; another tree is built from the testing data, and a novelty is detected when it 'differs' too much from the reference tree. Possible distance measures are the Kullback–Leibler divergence or the log-likelihood ratio, with an appropriate threshold used to detect the novelty. Although the paper presents a very interesting approach to density estimation and novelty detection, the authors fail to state how the method copes with very high-dimensional data: the system was tested only on two- and 16-dimensional data. The performance obtained in the experiments reported is excellent.

2.11. Other neural network approaches

Linares et al. [46] describe a new neural architecture for the unsupervised learning of classifications of mixed transient signals. The method is based on neural techniques for blind source separation and on subspace methods. The feed-forward network dynamically builds and refreshes a classification of acoustic events by detecting novelties and by creating and deleting classes. Each output cell of the neural network is associated with an event class, and several simultaneous output cell activities are interpreted as the simultaneous presence of events of different classes. The unsupervised neural classifier self-organises in order to adapt itself to environmental evolutions; this self-organising process operates on-line, by detecting novelties and creating and deleting classes. A first data-space modelling stage reduces the input space to a smaller dimension. A second process computes a de-correlation matrix that applies prototype rotations in order to minimise the second-order moments of the network output. Applying the de-correlation operator to the network outputs is equivalent to a rotation of the class prototypes, so the computed operator is applied directly to the prototype matrix and the system stabilises itself in a state of uncorrelated output cell activities. The neural net has two fully inter-connected layers: the input layer receives the coefficients of the stimuli vectors, and the output layer has one cell per class, plus another for novelty. Cell activations are computed by projecting the stimuli vectors onto the prototype space, and it is assumed that the class prototypes are linearly independent. The system is initially empty, so there are no known classes; a new class is created when the novelty cell activation exceeds a fixed vigilance threshold, with the input vector as the new class prototype. The system also permanently scans the class inertias: if one of them falls below a deletion threshold, the class is deleted. This on-line class integration and deletion process induces a stabilisation of the sub-space dimension, which depends on the thresholds and on the input variability. In a first test, the first seven alphabet pattern letters were randomly mixed; all seven classes were effectively found, and the original patterns were recovered with a low noise level. A second test used a signal from a real sub-aquatic environment; recurrent events were detected well.

Martinelli and Perfetti [58] describe a cellular neural network (CNN) for novelty detection. Each cell is connected to its neighbouring inputs via an adaptive control operator and interacts with neighbouring cells via nonlinear feedback. In the learning mode, the control operator is modified in correspondence to a given set of patterns applied at the input. In the application mode, the CNN behaves like a memory-less system which detects novelty for those input patterns that cannot be explained as a linear combination of the learned patterns.
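The decision rule in Martinelli and Perfetti's application mode, like the subspace projection of Linares et al. [46], amounts to asking whether the input can be explained as a linear combination of the learned patterns, i.e. whether its residual after projection onto their span is small. A minimal sketch of that test; the threshold and toy data are illustrative, and linear independence of the prototypes is assumed, as in [46].

```python
import numpy as np

def residual_norm(x, P):
    # P: (n, k) matrix whose columns are the learned prototypes.
    coeffs, *_ = np.linalg.lstsq(P, x, rcond=None)  # least-squares coefficients
    return float(np.linalg.norm(x - P @ coeffs))    # unexplained part of x

rng = np.random.default_rng(7)
P = rng.normal(size=(20, 3))                   # three learned prototypes
known = P @ np.array([0.5, -1.0, 2.0])         # a linear combination: not novel
novel = rng.normal(size=20)                    # generic vector: large residual
threshold = 1e-6
print(residual_norm(known, P) > threshold)     # False
print(residual_norm(novel, P) > threshold)     # True
```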
3. Conclusions

There are a number of studies in the area of novelty detection, but comparative work has been much scarcer. Only a few papers have compared the different models on the same data set, e.g. Zhang et al. [90], Addison et al. [1] and Singh and Markou [68]. As a result of the few comparative studies, there are few guidelines on which techniques will work best on what types of data. We hope that this survey will provide researchers with a detailed account of the available approaches, so that more comparative work can be performed and some of the weaknesses of known approaches can be addressed.

References

[1] J.F.D. Addison, S. Wermter, J. MacIntyre, Effectiveness of feature extraction in neural network architectures for novelty detection, Proceedings of the Ninth ICANN, Vol. 2, 1999, pp. 976–981.
[2] J.F.D. Addison, S. Wermter, K. McGarry, J. Macintyre, Methods for integrating memory into neural networks in condition monitoring, Proceedings of the International Conference on Artificial Intelligence and Soft Computing, Banff, Alta., Canada, 2002, pp. 380–384.
[3] D. Aeyels, On the dynamic behaviour of the novelty detector and the novelty filter, in: B. Bonnard, B. Bride, J. Gauthier, I. Kupka (Eds.), Analysis of Controlled Dynamical Systems, Progress in Systems and Control Theory, Vol. 8, Springer, Berlin, 1991, pp. 1–10.
[4] S. Albrecht, J. Busch, M. Kloppenburg, F. Metze, P. Tavan, Generalised radial basis function networks for classification and novelty detection: self-organisation of optimal Bayesian decision, Neural Networks 13 (2000) 1075–1093.
[5] M.F. Augusteijn, B.A. Folkert, Neural network classification and novelty detection, Internat. J. Remote Sensing 23 (14) (2002) 2891–2902.
[6] V. Barnett, T. Lewis, Outliers in Statistical Data, Wiley, New York, 1994.
[7] C. Bishop, Novelty detection and neural network validation, Proceedings of IEE Conference on Vision and Image Signal Processing, 1994, pp. 217–222.
[8] C. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.
[9] R. Bogacz, M.W. Brown, C. Giraud-Carrier, High capacity neural networks for familiarity discrimination, Proceedings of ICANN'99, Edinburgh, 1999, pp. 773–778.
[10] R. Borisyuk, M. Denham, F. Hoppensteadt, Y. Kazanovich, O. Vinogradova, An oscillatory neural network model of sparse distributed memory and novelty detection, Biosystems 58 (2000) 265–272.
[11] T. Brotherton, T. Johnson, Anomaly detection for advanced military aircraft using neural networks, Proceedings of 2001 IEEE Aerospace Conference, Big Sky, Montana, March 2001.
[12] T. Brotherton, T. Johnson, G. Chadderdon, Classification and novelty detection using linear models and a class-dependent elliptical basis function neural network, Proceedings of IJCNN Conference, Anchorage, May 1998.
[13] H. Byungho, C. Sungzoon, Characteristics of auto-associative MLP as a novelty detector, Proceedings of IEEE IJCNN Conference, Vol. 5, 1999, pp. 3086–3091.
[14] C. Campbell, K.P. Bennett, A linear programming approach to novelty detection, Advances in NIPS, Vol. 14, MIT Press, Cambridge, MA, 2001.
[15] G.A. Carpenter, M.A. Rubin, W.W. Streilein, ARTMAP-FD: familiarity discrimination applied to radar target recognition, Proceedings of International Conference on Neural Networks, Vol. III, Houston, TX, 1997a, pp. 1459–1464.
[16] G.A. Carpenter, M.A. Rubin, W.W. Streilein, Threshold determination for ARTMAP-FD familiarity discrimination, in: C.H. Dagli, et al. (Eds.), Intelligent Engineering Systems Through Artificial Neural Networks 7, ASME Press, New York, NY, 1997b, pp. 23–28.
[17] L.P. Cordella, C. De Stefano, F. Tortorella, M. Vento, A method for improving classification reliability of multilayer perceptrons, IEEE Trans. Neural Networks 6 (5) (1995) 1140–1147.
[18] L.P. Cordella, C. Sansone, F. Tortorella, M. Vento, C. De Stefano, Neural network classification reliability: problems and applications, Image Processing and Pattern Recognition, Vol. 5, Neural Network Systems Techniques and Applications, Academic Press, San Diego, CA, 1998, pp. 161–200.
[19] P. Crook, G. Hayes, A robot implementation of a biologically inspired method for novelty detection, Proceedings of Towards Intelligent Mobile Robots Conference, Manchester, 2001.
[20] P.A. Crook, S. Marsland, G. Hayes, U. Nehmzow, A tale of two filters—online novelty detection, Proceedings of 2002 IEEE ICRA Conference, Washington, DC, May 2002.
[21] M. Davy, S. Godsill, Detection of abrupt spectral changes using support vector machines: an application to audio signal segmentation, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, Orlando, FL, 2002, pp. II-1313–II-1316.
[22] T. Denoeux, A neural network classifier based on Dempster–Shafer theory, IEEE Trans. Systems Man Cybernet., Part A 30 (2) (2000) 131–150.
[23] M.J. Desforges, P.J. Jacob, J.E. Cooper, Applications of probability density estimation to the detection of abnormal conditions in engineering, Proceedings of Institute of Mechanical Engineers, Vol. 212, 1998, pp. 687–703.
[24] C. De Stefano, C. Sansone, M. Vento, To reject or not to reject: that is the question—an answer in case of neural classifiers, IEEE Trans. Systems Man Cybernet., Part C 30 (1) (2000) 84–94.
[25] I. Diaz, J. Hollmen, Residual generation and visualization for understanding novel process conditions, Proceedings of IEEE IJCNN Conference, Honolulu, HI, 2002, pp. 2070–2075.
[26] C.P. Diehl, J.B. Hampshire II, Real-time object classification and novelty detection for collaborative video surveillance, Proceedings of IEEE IJCNN Conference, Honolulu, HI, 2002.
[27] A.D. Doulamis, N.D. Doulamis, S.D. Kollias, On-line retrainable neural networks: improving the performance of neural networks in image analysis problems, IEEE Trans. Neural Networks 11 (1) (2000) 137–155.
[28] V. Emamian, M. Kaveh, A.H. Tewfik, Robust clustering of acoustic emission signals using the Kohonen network, Proceedings of IEEE ICASSP Conference, Istanbul, 2000.
[29] S. Fredrickson, S. Roberts, N. Townsend, L. Tarassenko, Speaker identification using networks of radial basis functions, Proceedings of the VII European Signal Processing Conference, Edinburgh, 1994, pp. 812–815.
[30] E. Granger, S. Grossberg, M.A. Rubin, W.W. Streilein, Familiarity discrimination of radar pulses, in: M.S. Kearns, et al. (Eds.), Advances in NIPS 11, Elsevier, New York, 1999, pp. 875–881.
[31] T. Harris, Neural network in machine health monitoring, Professional Eng., July/August 1993.
[32] T. Ho, J. Rouat, Novelty detection based on relaxation time of a network of integrate-and-fire neurons, Proceedings of Second IEEE World Congress on Computational Intelligence, WCCI 98, Anchorage, AK, 1998, pp. 1524–1529.
[33] A. Jagota, Novelty detection on a very large number of memories stored in a Hopfield-style network, Proceedings of the International Joint Conference on Neural Networks IJCNN-91, Vol. 2, Seattle, WA, 1991, p. 905.
[34] S. Jakubek, T. Strasser, Fault-diagnosis using neural networks with ellipsoidal basis functions, Proceedings of the American Control Conference, Vol. 5, 2002, pp. 3846–3851.
[35] N. Japkowicz, C. Myers, M. Gluck, A novelty detection approach to classification, Proceedings of 14th IJCAI Conference, Montreal, Quebec, Canada, 1995, pp. 518–523.
[36] H. Ko, G. Jacyna, Dynamical behavior of autoassociative memory performing novelty filtering, IEEE Trans. Neural Networks 11 (5) (2000) 1152–1161.
[37] J.M. Ko, Y.Q. Ni, J.Y. Wang, Z.G. Sun, X.J. Zhou, Studies of vibration-based damage detection of three cable-supported bridges in Hong Kong, Proceedings of the International Conference on Engineering and Technological Sciences, China, 2000, pp. 105–112.
[38] T. Kohonen, Self-organisation and Associative Memory, Springer, Berlin, 1988.
[39] T. Kohonen, Self Organising Maps, Springer, Berlin, 2001.
[40] K. Kojima, K. Ito, Autonomous learning of novel patterns by utilizing chaotic dynamics, IEEE International Conference on Systems, Man, and Cybernetics, IEEE SMC '99, Vol. 1, Tokyo, Japan, 1999, pp. 284–289.
[41] T. Kwok, D. Yeung, Objective functions for training new hidden units in constructive neural networks, IEEE Trans. Neural Networks 8 (5) (1999) 1131–1148.
[42] K. Labib, R. Vemuri, NSOM: a real-time network-based intrusion detection system using self-organizing maps, Networks Security, 2002, submitted.
[43] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Handwritten digit recognition with a back-propagation network, in: Advances in Neural Information Processing Systems, Vol. 2, Morgan Kaufman, Los Altos, CA, 1990, pp. 396–404.
[44] M.A. Lewis, L.S. Simo, Certain principles of biomorphic robots, Autonomous Robots 11 (3) (2001) 221–226.
[45] Y. Li, M.J. Pont, N.B. Jones, Improving the performance of the radial basis function classifiers in condition monitoring and fault diagnosis applications where "unknown" faults may occur, Pattern Recognition Lett. 23 (5) (2002) 569–577.
[46] G. Linares, P. Nocéra, H. Méloni, Mixed acoustic events classification using ICA and subspace classifier, Proceedings of IEEE ICASSP'97, Munich, Germany, 1997.
[47] L.M. Manevitz, M. Yousef, Learning from positive data for document classification using neural networks, Proceedings of Second Bar-Ilan Workshop on Knowledge Discovery and Learning, Jerusalem, May 2000.
[48] L.M. Manevitz, M. Yousef, One-class SVMs for document classification, J. Machine Learning Res. 2 (2001) 139–154.
[49] G. Manson, G. Pierce, K. Worden, On the long-term stability of normal condition for damage detection in a composite panel, Proceedings of Fourth International Conference on Damage Assessment of Structures, Cardiff, UK, June 2001.
[50] G. Manson, G. Pierce, K. Worden, T. Monnier, P. Guy, K. Atherton, Long term stability of normal condition data for novelty detection, Proceedings of Seventh International Symposium on Smart Structures and Materials, California, 2000.
[51] S. Marsland, U. Nehmzow, J. Shapiro, A model of habituation applied to mobile robots, Proceedings of TIMR, Towards Intelligent Mobile Robots, Bristol, 1999.
[52] S. Marsland, U. Nehmzow, J. Shapiro, A real-time novelty detector for a mobile robot, Proceedings on European Advanced Robotics Systems Conference, Salford, 2000a.
[53] S. Marsland, U. Nehmzow, J. Shapiro, Novelty detection for robot neotaxis, Proceedings of Second International ICSC Symposium on Neural Computation, Berlin, 2000b, pp. 554–559.
[54] S. Marsland, U. Nehmzow, J. Shapiro, Detecting novel features of an environment using habituation, Proceedings on Simulation of Adaptive Behaviour, MIT Press, Cambridge, MA, 2000c.
[55] S. Marsland, U. Nehmzow, J. Shapiro, Novelty detection in large environments, Proceedings on Towards Intelligent Mobile Robots Conference, Manchester, 2001.
[56] S. Marsland, U. Nehmzow, J. Shapiro, Environment-specific novelty detection, From Animals to Animats, Proceedings of Seventh International Conference on Simulation of Adaptive Behaviour, Edinburgh, 2002.
[57] D. Martinez, Neural tree density estimation for novelty detection, IEEE Trans. Neural Networks 9 (2) (1998) 330–338.
[58] G. Martinelli, R. Perfetti, Generalized cellular neural network for novelty detection, IEEE Trans. Circuits Systems I: Fundam. Theory Appl. 41 (2) (1994) 187–190.
[59] M.R. Moya, M.W. Koch, L.D. Hostetler, One-class classifier networks for target recognition applications, in: Proceedings on World Congress on Neural Networks, International Neural Network Society (INNS), Portland, OR, 1993, pp. 797–801.
[60] A.F. Murray, Novelty detection using products of simple experts—a potential architecture for embedded systems, Neural Networks 14 (2001) 1257–1264.
[61] T. Petsche, A. Marcantonio, C. Darken, S.J. Hanson, G.M. Kuhn, I. Santoso, A neural network autoassociator for induction motor failure prediction, Adv. NIPS 8 (1996) 924–930.
[62] G. Rätsch, S. Mika, B. Schölkopf, K. Müller, Constructing boosting algorithms for SVMs: an application for one-class classification, IEEE Trans. Pattern Anal. Machine Intell. 24 (9) (2002) 1184–1199.
[63] S.J. Roberts, W. Penny, Novelty, confidence and errors in connectionist systems, Proceedings of IEE Colloquium on Intelligent Sensors and Fault Detection, 1996/261, Savoy Place, London, 1996.
[64] J. Ryan, M.J. Lin, R. Miikkulainen, Intrusion detection with neural networks, in: M. Jordan, et al. (Eds.), Advances in Neural Information Processing Systems 10, MIT Press, Cambridge, MA, 1998, pp. 943–949.
[65] R. Saunders, J.S. Gero, The importance of being emergent, Proceedings on Artificial Intelligence in Design, Worcester, MA, 2000.
[66] R. Saunders, J.S. Gero, Designing for interest and novelty, motivating design agents, Proceedings of Ninth International Conference on Computer Aided Architectural Design Futures, July 2001, pp. 725–738.
[67] B. Schölkopf, R. Williamson, A. Smola, J.S. Taylor, J. Platt, Support vector method for novelty detection, in: S.A. Solla, T.K. Leen, K.R. Müller (Eds.), Neural Information Processing Systems, Elsevier, New York, 2000, pp. 582–588.
[68] S. Singh, M. Markou, An approach to novelty detection applied to the classification of image regions, IEEE Trans. Knowledge Data Eng. (2003), in press.
[69] H. Sohn, K. Worden, C.R. Farrar, Novelty detection under changing environmental conditions, Proceedings of Eighth Annual SPIE International Symposium on Smart Structures and Materials, Newport Beach, CA, 2001.
[70] S.O. Song, D. Shin, E.S. Yoon, Analysis of novelty detection properties of auto-associators, Proceedings of COMADEM, 2001, pp. 577–584.
[71] R.J. Streifel, R.J. Marks II, M.A. El-Sharkawi, Detection of shorted-turns in the field of turbine-generator rotors using novelty detectors—development and field tests, IEEE Trans. Energy Conversion 11 (2) (1996) 312–317.
[72] C. Surace, K. Worden, A novelty detection method to diagnose damage in structures: an application to an offshore platform, Proceedings of Eighth International Conference of Off-shore and Polar Engineering, Vol. 4, Colorado, USA, 1998, pp. 64–70.
[73] C. Surace, K. Worden, G. Tomlinson, A novelty detection approach to diagnose damage in a cracked beam, Proceedings of SPIE, Vol. 3089, 1997, pp. 947–953.
[74] L. Tarassenko, Novelty detection for the identification of masses in mammograms, Proceedings of Fourth IEEE International Conference on Artificial Neural Networks, Vol. 4, Perth, Australia, 1995, pp. 442–447.
[75] L. Tarassenko, A. Nairac, N. Townsend, P. Cowley, Novelty detection in jet engines, IEEE Colloquium on Condition Monitoring, Imagery, External Structures and Health, 1999, pp. 41–45.
[76] D.M.J. Tax, R.P.W. Duin, Outlier detection using classifier instability, in: Advances in Pattern Recognition, the Joint IAPR International Workshops, Sydney, Australia, 1998, pp. 593–601.
[77] D.M.J. Tax, R.P.W. Duin, Data domain description using support vectors, Proceedings of ESAN99, Brussels, 1999a, pp. 251–256.
[78] D.M.J. Tax, R.P.W. Duin, Support vector domain description, Pattern Recognition Lett. 20 (1999b) 1191–1199.
[79] D.M.J. Tax, R.P.W. Duin, Uniform object generation for optimizing one-class classifiers, J. Machine Learning Res. 2 (2001) 155–173.
[80] D. Theofilou, V. Steuber, E.D. Schutter, Novelty detection in a Kohonen-like network with a long-term depression learning rule, Neurocomputing 52 (2003) 411–417.
[81] B.B. Thompson, R.J. Marks II, J.J. Choi, M.A. El-Sharkawi, M. Huang, C. Bunje, Implicit learning in auto-encoder novelty assessment, Proceedings of International Joint Conference on Neural Networks, Honolulu, May 2002, pp. 2878–2883.
[82] V.N. Vapnik, Statistical Learning Theory, Wiley/InterScience, New York, 1998.
[83] G.C. Vasconcelos, A bootstrap-like rejection mechanism for multilayer perceptron networks, II Simposio Brasileiro de Redes Neurais, São Carlos-SP, Brazil, 1995, pp. 167–172.
[84] G.C. Vasconcelos, M.C. Fairhurst, D.L. Bisset, Recognizing novelty in classification tasks, in: Neural Information Processing Systems Workshop (NIPS'94) on Novelty Detection and Adaptive Systems Monitoring, Denver, CO, USA, 1994.
[85] G.C. Vasconcelos, M.C. Fairhurst, D.L. Bisset, Investigating feedforward neural networks with respect to the rejection of spurious patterns, Pattern Recognition Lett. 16 (1995) 207–212.
[86] D.L. Wang, M.A. Arbib, Complex temporal sequence learning based on short-term memory, Proc. IEEE 78 (1990) 1536–1543.
[87] C.L. Wilson, J.L. Blue, O.M. Omidvar, Improving neural network performance for character and fingerprint classification by altering network dynamics, Proceedings of the World Congress on Neural Networks, Washington, DC, 1995.
[88] K. Worden, Structural fault detection using a novelty measure, J. Sound Vibration 201 (1) (1997) 85–101.
[89] A. Ypma, R.P.W. Duin, Novelty detection using self-organising maps, Progr. Connectionist Based Inform. Systems 2 (1998) 1322–1325.
[90] Z. Zhang, J. Li, C.N. Manikopoulos, J. Jorgenson, J. Ucles, HIDE: a hierarchical network intrusion detection system using statistical preprocessing and neural network classification, Proceedings of IEEE Workshop on Information Assurance and Security, West Point, 2001, pp. 85–90.
[91] B.T. Zhang, G. Veenker, Neural networks that teach themselves through genetic discovery of novel examples, Proceedings of IEEE International Joint Conference on Neural Networks (IJCNN'91), Vol. 1, Singapore, 1991, pp. 690–695.