
Data-Mining Applications for Aviation

Chris Thornton
Cognitive and Computing Sciences
University of Sussex
Email: Christopher.Thornton@
Tel: (44)1273 678856
May 21, 2003

Indirect sensing is a new sensing methodology which aims to eliminate
the need for special-purpose transducers in certain domains.
Rather than the sensor signal being derived from a directly connected
transducer, it is derived by data-mining representations of ambient
phenomena (sound, light, temperature etc.). As an information-fusion task
this involves the fusion of a knowledge source (i.e., the bias characteris-
tics of the utilised data-mining method) with a data source. The paper
explores an application involving sensing of engine temperature in a light
aircraft. A number of data-mining methods are compared on the task and
conclusions are drawn regarding the bias characteristics which are key for
this application.

1 Introduction
Conventional sensing technologies are typically implemented in a direct fashion
[1, 2]. That is to say, a transducer is brought into physical contact with the
target property with the aim of deriving a clear and unambiguous signal. For
example, a direct-sensing approach to the problem of establishing the current
position of the undercarriage on a light aircraft is to attach pressure sensors to
the bracings which support the undercarriage in the lowered position. When
these sensors generate a signal, the undercarriage is confirmed as lowered.
An alternative is indirect sensing [3]. In this approach, there is no attempt
to bring a transducer into direct contact with the target itself. Rather the
aim is to infer changes in it by taking account of its more widespread effects,
i.e., ambient effects involving sound, light, temperature etc. An indirect sensing
approach to the problem of sensing undercarriage-position might involve the use
of a vibration sensor attached to the frame of the aircraft. The undercarriage
would then be confirmed as lowered when the airframe registers a particular
pattern of vibrations.
The indirect approach to sensing offers several advantages and enables a
greater variety of engineering solutions. It also allows sensing to be carried out
using general-purpose and therefore inexpensive hardware. The drawback is
that it typically entails a much greater degree of downstream signal-processing.
In order to derive an accurate reading of target states, it will be necessary to
interpret the signatures that the target renders in the ambient array. Because of
the huge variety of sources that contribute to a typical pattern of ambient energy
release, there is considerable interference among these signatures. As a result,
they tend to form complex and ambiguous representations of the contributing
sources.
The attempt to hand-craft processing routines for this task is a daunting
challenge and one which offers no real promise of success. The present work
explores a different approach which involves fusing information from different
sources. The essential idea is to bring together a knowledge source, in the form
of certain data-mining technologies, with a data source. The latter is made up
from the raw data representing ambient phenomena but fashioned in the form
of training examples. The goal is to derive the desired processing automatically
without any need for hand-coding.
The approach places a heavy burden on the data-mining method used. How-
ever, there is every reason to believe that state-of-the-art methods are able to
bear this sort of load. In the last half century, there has been considerable
progress on the data-mining task with innovations accumulating rapidly in the
last decade. The tangible result is that we now have an extensive `tool box' of
data-mining technologies which may be applied to tasks such as this [4].
1.1 Domain choice
The indirect-sensing approach promises to pay dividends in areas where the
implementation of direct sensors is problematic or costly. One area which fits
the bill perfectly is that of general aviation (non-commercial, light aircraft).
The costs of implementing instrumentation in conventional light aircraft, using
direct-sensing technologies, may form more than half of the total manufacturing
cost [5]. So the potential for cost-reduction through use of indirect-sensing
methodologies in this area is extremely good.
In addition, there is the possibility that indirect-sensing might enable rela-
tively sophisticated sensing functionality to `trickle down' to the less sophisti-
cated end of the aviation spectrum. A case in point involves the phenomenon of
carburettor icing, a problem which plagues the current generation of light
aircraft. Direct sensing of carburettor icing is feasible but not generally implemented
on light aircraft for reasons of cost. Pilots of aircraft such as the common Cessna
152 and 172 models must learn to sense intuitively the moment when carburettor
heat should be applied so as to eliminate any build-up of carburettor ice.
Using an indirect approach, there is the possibility of implementing a reliable
but low-cost sensory mechanism without the need for any direct engineering
intervention, i.e., without the need for any modifications to the fundamental
carburettor design.
Taken to its logical extreme, the indirect sensing approach offers the possibility
of assembling an entire, `offline' instrumentation system for a light aircraft,
without introducing any engineering changes whatsoever. A portable computer
equipped with an array of general-purpose ambient sensors might provide vir-
tual back-up for all the main instrumentation systems. Since instrumentation
failure is implicated in a large proportion of general aviation accidents [5], the
net effect might be a significant improvement in safety and reliability.
The work described below attempts to probe the practicality of using indirect-
sensing in this context. In particular, it focuses on engine-temperature sensing
in a Cessna-152 light aircraft. An empirical study is described involving the ap-
plication of data-mining methods to a real-world dataset. The aim of the study
was to determine whether data-mining of ambient audio data could provide the
means of accurately measuring engine temperature (i.e., whether data-mining
methods are able to determine engine temperature from engine noise).

2 The data-mining background

Data mining is the task of identifying useful regularities and patterns in large
bodies of data [6, 7, 8]. The quintessential data-mining task involves identifying
patterns of consumer behaviour from purchasing logs. In a typical case, the
mining of a supermarket's sales logs might lead to the discovery that consumers
are more likely to buy a packet of washing powder if it is not placed close to
any display of pet food.
Data mining as a field has only been in existence for a decade or so. However,
the technologies that it embraces have a longer history. Techniques are
drawn from areas such as pattern recognition [9], statistics and mathematics
[10] and machine learning [11]. Though many data mining methods are recent
innovations, others have been in use for half a century or more.
The identification of any regularity or pattern in a body of data necessarily
facilitates prediction. Theoretical work has therefore treated data prediction
as the key goal to be addressed. Traditionally, there have been two main
paradigms: classification and regression. In classification, the task is to predict
an appropriate classification label for a datum. In regression, the task is to
predict the value of a target function when applied to a particular datum. (Of
course, classification may be regarded as a special case of regression.)
While many different approaches to these two tasks have been explored,
interest has focussed largely on supervised learning. In supervised learning a
set of prediction examples is made available and the data mining method (or
`learner' as it is termed) must discover what connects a particular prediction to
a particular datum. This involves identifying commonalities among data associated
with the same prediction. In general, these commonalities are statistical
properties of the relevant data.
A review of such methods might begin with the nearest-neighbours method
[9, 10], from the field of statistics. In this method, learning involves nothing
more than storing away the examples in memory. For any given datum d, a
prediction is then generated simply by averaging over the predictions associated
with those examples showing the highest level of similarity to d. Surprisingly,
this approach produces very respectable prediction results, although usage may
be limited in practice due to the costs involved in iterating over the entire
example set each time a new prediction is required [4].
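The nearest-neighbours procedure just described can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the experiment: Euclidean distance is assumed as the similarity measure, and the function name `knn_predict` and the toy data are invented for the example.

```python
import math

def knn_predict(examples, query, k=1):
    """Predict by averaging the targets of the k stored examples closest to `query`.

    `examples` is a list of (input_vector, target) pairs; "learning" is nothing
    more than keeping this list in memory.
    """
    # Rank every stored example by Euclidean distance to the query. This full
    # scan is the per-prediction cost mentioned in the text.
    ranked = sorted(examples, key=lambda ex: math.dist(ex[0], query))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)

# Toy data (made up): two-dimensional inputs with numeric targets.
examples = [((0.0, 0.0), 1.0), ((0.1, 0.2), 2.0), ((0.9, 1.0), 7.0), ((1.0, 0.8), 8.0)]
prediction = knn_predict(examples, (0.95, 0.9), k=2)
print(prediction)  # average of the two closest targets, 7.0 and 8.0
```

With k set to 1 the prediction is simply the target of the single closest example.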
Also originating in statistics is the method of predicting classification labels
using a linear-discriminant function (i.e., a linear hyperplane separating differently
classified examples in the data space) [10]. In this method, learning usually
involves some variation on the theme of least-mean-squared (LMS) regression.
Essentially, the method uses observations of misclassifications to incrementally
`nudge' a separating hyperplane into position. Over time, LMS regression has
come to be seen as a major foundation for classification methods applied to
numerical data [12].
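The incremental `nudging' of a hyperplane can be illustrated with the classic LMS (Widrow-Hoff) update. The sketch below is one common variation on the theme, not necessarily the one the cited sources have in mind: it corrects the weights after every example in proportion to the output error. The names `lms_train` and `classify`, the learning rate and the toy data are all assumptions made for illustration.

```python
def lms_train(examples, learning_rate=0.1, epochs=200):
    """Fit weights w and offset b so that sign(w.x + b) separates the classes.

    `examples` is a list of (input_vector, label) pairs with labels +1 or -1.
    """
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in examples:
            output = sum(wi * xi for wi, xi in zip(w, x)) + b
            error = y - output  # difference between desired and actual output
            # Nudge the hyperplane a small step in the direction that
            # reduces the squared error on this example.
            w = [wi + learning_rate * error * xi for wi, xi in zip(w, x)]
            b += learning_rate * error
    return w, b

def classify(w, b, x):
    """Label a datum by the side of the hyperplane on which it falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy data (made up): the label depends only on the first coordinate.
data = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]
w, b = lms_train(data)
print([classify(w, b, x) for x, _ in data])
```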
Data mining also has roots outside of statistics. The decision tree method,
for example, has a history which embraces Quinlan's ID3 method [13, 14], a program
developed within the machine learning (artificial intelligence) community.
Learning here involves producing a decision tree for predictions which effectively
minimises the number of tests which have to be made in order to generate a
prediction consistent with the training examples.
Lodged somewhere between the realm of machine learning and that of statistics
is the field of neural networks [15, 16]. This gained considerable territory
in the 1980s on the basis of the noted successes of the backpropagation method
of network training [17]. Backpropagation enables networks containing inter-
mediate (or hidden) units to be trained, paving the way for the derivation of
complex partitionings of the data via the introduction of superpositions of linear

3 Statistical learning theory

Recent years have seen the emergence of a theory of statistical learning [18]
which promises to provide an overarching theoretical structure in which a variety
of existing methods (including the more common neural network methods) may
be understood. The starting point for this work is the so-called bias-variance
tradeoff. Any learning method aims to find the means of producing predictions
and this `means' may be viewed as a hypothesis. So we may consider learning
as a search through a space of hypotheses for one which is consistent with the
examples.
With more hypotheses, there is a greater chance that the learner will come
across a hypothesis which is consistent with the examples. Unfortunately, there
is also a greater chance that the hypothesis will be inconsistent with unseen
cases. The likely result is that a hypothesis will be selected which produces
poor prediction. If the bias is low, there is a greater variance among consistent
hypotheses and therefore a greater chance of poor prediction and generalisation.
If the bias is high, there is lower variance but also the risk that no consistent
hypothesis will be discovered.
A useful construct for understanding this tradeoff is the VC dimension
(Vapnik-Chervonenkis dimension). This is an (inverse) measure of the bias
inherent in a particular hypothesis language. A dataset is said to be `shattered'
by a hypothesis language if the language enables every possible subset of data to
be partitioned. Thus, the VC dimension of a hypothesis language is effectively
a measure of how large a dataset can get before the efficacy of the language
starts to drop below the theoretical maximum (see footnote 1). Higher VC dimension implies
lower bias and vice versa.
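A small worked case may make the notion of shattering concrete. The sketch below assumes a deliberately simple hypothesis language, chosen for illustration and not taken from the text: one-dimensional threshold rules of the form `label 1 if x >= t, else 0'. It checks by brute force whether every possible labelling of a set of points can be produced. The function names are hypothetical.

```python
from itertools import product

def threshold_labellings(points):
    """Every distinct labelling that a rule 'label 1 if x >= t, else 0' can produce."""
    # One candidate threshold at each point, plus one above all of them,
    # is enough to generate every labelling this language can express.
    candidates = sorted(points) + [max(points) + 1.0]
    return {tuple(1 if x >= t else 0 for x in points) for t in candidates}

def is_shattered(points):
    """True if the threshold language realises all 2**n labellings of `points`."""
    all_labellings = set(product((0, 1), repeat=len(points)))
    return threshold_labellings(points) == all_labellings

print(is_shattered([0.5]))       # a single point can be labelled either way
print(is_shattered([0.2, 0.7]))  # the labelling (1, 0) is out of reach
```

Any single point is shattered but no pair of points is, so this language has a VC dimension of 1: a very strong bias.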
Using the concept of VC dimension, we can reformulate the bias-variance
issue in a clearer way. The aim in learning must be to find the lowest VC dimension
which will still yield successful performance. In other words, the bias
must be weak enough to admit at least one consistent hypothesis. But it must
be strong enough to constrain the number of poor-quality (i.e., poorly predicting)
hypotheses. In cases where a learner uses a fixed hypothesis language,
achieving this aim involves ensuring that the hypothesis language has the right
level of bias.
In some cases the strength of the bias is a modifiable property of the learning
method. A case in point is the feedforward neural network [19]. Here the bias
is partly a function of the number of hidden units in the network: with more
hidden units the network is able to define increasingly complex partitions. With
fewer hidden units, the network must capture the relevant regularities in terms
of simpler partitions. By varying the number of hidden units, the bias may be
set to any desired level.
But, again, there is the problem of calculating what the level should be. With
neural nets, a practical solution involves a procedure known as early-stopping
[20]. It is a natural consequence of the learning regime that true regularities
will be modelled before noise. So if learning is stopped at the moment when
cross-validation tests suggest that the network is starting to overfit the data,
the network's excess capacity (low bias) will never be utilised. The bias is then
effectively `strengthened' to an appropriate level.
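The early-stopping procedure can be sketched as a simple control loop. This is a generic illustration, assuming a `patience' criterion (stop once the validation error has failed to improve for a set number of epochs); the stopping criteria studied in [20] are more varied. The callables `train_step` and `validation_error` are hypothetical stand-ins for a real network's training and cross-validation routines, and the error curve is simulated.

```python
def train_with_early_stopping(train_step, validation_error, max_epochs=5000, patience=10):
    """Run `train_step()` once per epoch; stop when validation error stops improving.

    Returns (best_epoch, best_error).
    """
    best_error, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_step()
        error = validation_error()
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: assume overfitting has begun
    return best_epoch, best_error

# Simulated validation curve (made up): error falls, then rises as overfitting sets in.
curve = [0.9, 0.6, 0.4, 0.3, 0.35, 0.4, 0.5, 0.6]
state = {"epoch": 0}

def fake_train_step():
    state["epoch"] += 1

def fake_validation_error():
    return curve[min(state["epoch"], len(curve)) - 1]

result = train_with_early_stopping(fake_train_step, fake_validation_error, patience=3)
print(result)  # epoch with the lowest validation error, and that error
```

A practical version would also save the network weights at the best epoch and restore them once training stops.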
3.1 Support Vector Machines
The technique of early-stopping makes bias-setting a semi-automatic part of the
learning. But, as Vapnik has shown, bias-setting can be made a fully automatic,
i.e., a seamless part of the learning process. Vapnik's solution involves a learning
method called support vector machines [21, 22]. The novel feature of these is
the way in which they divide the learning problem into two subproblems: (a) the
problem of finding a way of re-representing the data in a higher-dimensional
space and (b) the problem of finding a simple partitioning of the data in that
space.

1 This is like rating a car's power in terms of the maximum speed at which the engine still
yields maximum energy efficiency.

SVMs are provided with a kernel function. This implicitly maps the examples
into a high-dimensional feature space. (In fact, kernel functions map
pairs of examples to their inner products in the feature space; but this
less costly operation is sufficient for application of the required optimisation
techniques.) They then use quadratic programming techniques to discover a
subset of the examples (the so-called `support vectors') which are separated
by a hyperplane in the feature space. In effect, the SVM uses QP methods to
discover a representation of the data which allows the desired partition to be
formed as a simple hyperplane.
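The remark that a kernel function delivers inner products in the feature space, without the mapping ever being carried out, can be checked numerically. The sketch below assumes a homogeneous polynomial kernel of degree 2, chosen for simplicity; the exact kernel parameters used in the experiment are not stated. Both function names are hypothetical.

```python
import math
from itertools import product

def poly_kernel(x, z, degree=2):
    """Homogeneous polynomial kernel: (x . z) ** degree."""
    return sum(a * b for a, b in zip(x, z)) ** degree

def explicit_features(x, degree=2):
    """The explicit feature map: all ordered degree-fold products of the coordinates of x."""
    return [math.prod(x[i] for i in idx) for idx in product(range(len(x)), repeat=degree)]

x, z = (1.0, 2.0, 3.0), (0.5, -1.0, 2.0)

# Cheap route: one inner product in the 3-dimensional input space, then squared.
via_kernel = poly_kernel(x, z)

# Costly route: map both examples into the 9-dimensional feature space first.
via_features = sum(a * b for a, b in zip(explicit_features(x), explicit_features(z)))

print(via_kernel, via_features)  # the two routes agree
```

The saving grows quickly with the input dimension and the degree of the kernel, since the explicit feature space has `len(x) ** degree` dimensions.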
An attractive aspect of the SVM is the way in which it automatically handles
the bias-variance tradeoff. In essence, the SVM seeks a solution embodying the
highest bias by maximising the support margin, i.e., the distance between the
optimal separating hyperplane and its closest support vectors.
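For reference, margin maximisation is usually written as the following optimisation problem. This is the standard hard-margin formulation found in tutorials such as [21], given here as background rather than as the configuration used in the experiment. The symbols are assumptions of this note: training examples $\mathbf{x}_i$ with labels $y_i \in \{-1, +1\}$, weight vector $\mathbf{w}$ and offset $b$.

```latex
\[
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,n .
\]
```

Under this scaling the distance from the hyperplane to its closest examples is $1/\lVert \mathbf{w} \rVert$, so minimising $\lVert \mathbf{w} \rVert$ maximises the margin. The examples for which the constraint holds with equality are the support vectors.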
In effect, then, the SVM carries out three tasks. First, there is the re-representational
task in which the data are mapped into a higher-dimensional
space. Second, there is the task of identifying a separating hyperplane in the
feature space. Finally, there is the task of maximising the size of the margin. The
first two tasks we might view as `learning' in the traditional sense. The third we
might view as bias maximisation (i.e., the discovery of a solution with optimal
predictive performance). SVMs effectively wrap all three tasks into a single
optimisation task and then solve it using quadratic programming techniques.
By making the number of free parameters used dependent on the size of the
margin associated with the separating hyperplane, SVMs effectively introduce
a feedback loop which ensures that decreases in bias are only introduced as
SVMs have a firm theoretical foundation in Vapnik's statistical learning
theory and, in addition, automatically handle the bias-selection problem. It is
sometimes argued that, in practice, other approaches such as neural networks
will be preferable on grounds of prediction performance or training cost. In the
experiment described below, however, the results show SVM to be one of the
strongest performers. In this experiment, then, there is a degree of support for
the thesis that the automatic bias selection facet of the SVM yields important,
practical advantages.

4 Temperature-sensing experiment: scenario

To investigate the potential of indirect sensing for aviation instrumentation
applications, an experiment was performed which involved sensing engine tem-
perature in a Cessna 152 light aircraft. Temperature sensing in this context is
normally implemented in the direct style. A temperature-sensitive transducer
is connected directly to the engine (typically in the vicinity of the oil-return
conduit) and the signal from this is used to drive an indicator via an electrical
The experiment aimed to discover whether it would be possible to implement
this sensing task in an indirect, inductive manner, using ambient sound energy
produced by the engine. (This is considerable in the case of the Cessna 152.) A
dataset was constructed by sampling the ambient sound energy while the engine
was subjected to a typical pattern of usage including engine warm-up, take-off
and circuit flying. Training examples were then constructed by associating
small sequences of sound data with observed temperatures. These were then
formulated as a dataset of training examples and presented to a range of data-
mining methods.

5 Data collection

Figure 1: Cessna 152 on take-off.

The data for the experiment were gathered during a flight of a Cessna 152
aircraft, similar to the one shown in Figure 1 (see footnote 2). A digital (MiniDV) video camera
was trained on the temperature gauge (highlighted area in Figure 2) and kept
running for 20 minutes from the moment of engine ignition until well into the
flight. Later the footage was downloaded to a video-processing package. The
footage was divided up into 240 clips, each of 5 seconds duration. The frames
associated with each clip were then analysed visually to discover the average
temperature reading in that period. The audio samples were processed so as
to derive five, one-second long, mean amplitude values for the same period.
Finally, a training example was generated by associating the sequence of five
mean amplitude values (normalised in the range 0.0-1.0) with the observed
temperature value, normalised in the range 0-8. In this way a dataset of 240
training examples was derived from the 20 minutes of recorded footage.
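The construction of a single training example might be sketched as follows. Several details are assumptions, since the text does not specify them: `mean amplitude' is taken to be the mean absolute sample value, normalisation is taken to be division by the largest possible amplitude (`peak`), and the sample rate and data are toy values. The function name `clip_to_example` is hypothetical.

```python
def clip_to_example(samples, sample_rate, temperature, peak=1.0, seconds=5):
    """Turn one clip of audio samples plus its gauge reading into (inputs, target).

    `samples` holds at least `seconds * sample_rate` amplitude values; `peak` is
    the largest possible absolute amplitude, used to scale inputs into 0.0-1.0.
    """
    inputs = []
    for s in range(seconds):
        window = samples[s * sample_rate:(s + 1) * sample_rate]
        mean_amplitude = sum(abs(v) for v in window) / len(window)
        inputs.append(mean_amplitude / peak)
    return inputs, temperature

# Toy clip (made up): 5 seconds at 4 samples per second, growing steadily louder.
levels = [0.2, 0.4, 0.6, 0.8, 1.0]
samples = [sign * level for level in levels for sign in (1, -1, 1, -1)]
inputs, target = clip_to_example(samples, sample_rate=4, temperature=3)
print(inputs, target)
```

Applying the same step to each of the 240 five-second clips would give the dataset described above.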

Figure 2: Cessna 152 instrument panel.

2 The aircraft was G-BNNR, registered to Sussex Flying Club at Shoreham airport, UK.

Average prediction accuracies were then computed for 100 applications of
each data mining method to the dataset. These were derived by dividing the
dataset into a training set of 160 examples and a testing set of 80 examples. The
examples were randomly selected in each case but with the restriction that there
should be no overlap between the two sets. Prediction accuracies derived when
the data-mining methods were tested on the testing data were then treated as
measures of the degree to which signal processing of ambient sound energy could
be used to measure engine temperature.
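The evaluation protocol (repeated random splits into 160 training and 80 testing examples, with accuracy averaged over 100 runs) can be sketched as below. The function `average_accuracy`, the `fit`/`predict` interface and the synthetic dataset are assumptions made for illustration; a single-nearest-neighbour learner stands in for the methods actually compared.

```python
import random

def average_accuracy(dataset, fit, predict, runs=100, n_train=160, seed=0):
    """Mean test accuracy over `runs` random, non-overlapping train/test splits."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        shuffled = dataset[:]
        rng.shuffle(shuffled)
        # Slicing one shuffled copy guarantees the two sets never overlap.
        train, test = shuffled[:n_train], shuffled[n_train:]
        model = fit(train)
        correct = sum(1 for inputs, target in test if predict(model, inputs) == target)
        total += correct / len(test)
    return total / runs

# Synthetic stand-in for the 240 examples (made up): one input, two classes.
dataset = [((float(i),), 0 if i < 120 else 1) for i in range(240)]

def fit_1nn(train):
    return train  # nearest-neighbour "learning" is just storing the examples

def predict_1nn(model, inputs):
    return min(model, key=lambda ex: abs(ex[0][0] - inputs[0]))[1]

accuracy = average_accuracy(dataset, fit_1nn, predict_1nn)
print(round(accuracy, 3))
```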

6 Data mining details

The data mining methods used in the experiment were the k-nearest-neighbours
method (kNN) [16], the decision tree method (C4.5) [14], the Naive Bayes Classifier
(NBC) [16], the multi-layer perceptron (MLP) [17] and the support vector
machine (SVM) [21]. These methods are generally regarded as among the more
robust of general-purpose methods.
The kNN method was run with the value of k set to 1 (i.e., a single nearest
neighbour was used for prediction). The C4.5 method was run using the
information-gain ratio heuristic but without tree-pruning. The SVM method
was run using Thorsten Joachims' `SVMlight' implementation using one-versus-the-rest
classification and the built-in, polynomial kernel function. All runs
used default values for all parameters except for the c parameter (controlling
training-error/margin tradeoff) which was set to 10.
The MLP method was run using a standard feedforward architecture with a
single layer of hidden units, the number of these being twice the number of input
units. A learning rate of 0.05 and a momentum value of 0.9 were used. Training
was continued for 5000 epochs. A visualisation of the MLP structure that
resulted from training is shown in Figure 3. Here positive values are represented
by red colours, negative values by blue. The intensity of the colour represents
the absolute magnitude of the value in all cases. Lines represent connections
and circles represent units in the usual way, while unit biases are represented
by small half-ovals protruding from the right edge of each unit's circle.

Figure 3: Typical MLP network following training on temperature data.
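The roles of the learning rate (0.05) and momentum (0.9) quoted above can be seen in the weight-update rule itself. The sketch below uses the classical momentum form, in which each weight change is the new gradient step plus a fraction of the previous change. Whether the MLP software used in the experiment applies exactly this form is an assumption. The function `momentum_step` and the one-weight test problem are invented for illustration.

```python
def momentum_step(weights, gradients, velocity, learning_rate=0.05, momentum=0.9):
    """One gradient-descent update with momentum; returns (new_weights, new_velocity)."""
    # Each weight change blends the current gradient step with the previous change.
    new_velocity = [momentum * v - learning_rate * g for v, g in zip(velocity, gradients)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Toy problem (made up): minimise f(w) = w**2, whose gradient is 2*w, starting from w = 1.
weights, velocity = [1.0], [0.0]
for _ in range(300):
    gradients = [2.0 * w for w in weights]
    weights, velocity = momentum_step(weights, gradients, velocity)
print(weights[0])  # close to the minimum at 0
```

In a real network the gradients would come from backpropagation [17] rather than from a closed-form expression.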

7 Results
Method       C4.5    kNN     MLP     SVM     NBC
Error rate   0.274   0.169   0.174   0.219   0.970
Accuracy     73%     83%     83%     82%     3%
The results of the experiment are shown in tabular form above and as a
histogram in Figure 4. The figures shown are all averages taken over 100 test
runs using randomly selected training/testing sets.
Clearly, the best results were produced by kNN, the MLP and by SVM,
which all achieved a prediction accuracy of at least 82%. (A typical pattern of
error reduction achieved by the MLP on these data is shown in Figure 5.) The
performance produced by the C4.5 method is slightly below this figure while the
performance of the NBC method is extremely poor, suggesting some extreme
form of mismatch to the data.
The Naive Bayes Classifier works by determining the frequencies with which
particular input values are associated with particular output values within the
training examples. Predictions for a specific datum are then generated by determining
the most probable output for that datum, treating the frequencies
associated with the datum's input values as probabilities. (This involves application
of Bayes' theorem for calculating a posteriori from a priori probabilities.)
In normal cases this procedure works well. However, the assumption is made
that all input values seen in a test datum will also be frequently represented
within the training data (this is, of course, a key element of the iid assumption).
If they are not, there will be no relevant frequencies from which
to derive class probabilities and therefore no basis upon which to generate predictions.
This is precisely the problem which occurs in the temperature-sensing
experiment. Input values appearing in the data are real numbers in the range 0.0-1.0
representing mean amplitude values. The vast majority of these are unique
within the data. Thus, the NBC is put in the position of having to generate
predictions on the basis of irrelevant frequencies.

Figure 4: Error rates for temperature generalisation.

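The failure mode described here is easy to reproduce with a frequency-counting classifier. The sketch below is a bare-bones version written for illustration, with no smoothing and no discretisation of real values; the NBC implementation actually used in the experiment may differ in its details. All names and data are invented.

```python
from collections import Counter, defaultdict

def train_frequency_nbc(examples):
    """Count classes and, per input position, how often each exact value occurs with each class."""
    class_counts = Counter(target for _, target in examples)
    value_counts = defaultdict(Counter)  # (position, value) -> Counter of classes
    for inputs, target in examples:
        for position, value in enumerate(inputs):
            value_counts[(position, value)][target] += 1
    return class_counts, value_counts

def class_scores(model, inputs):
    """Unnormalised P(class) * product of P(value | class), taken from raw frequencies."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total
        for position, value in enumerate(inputs):
            # A value never seen with this class contributes a frequency of zero.
            score *= value_counts.get((position, value), Counter())[cls] / n
        scores[cls] = score
    return scores

# Symbolic inputs (made up): values recur, so the frequencies are informative.
symbolic = [(("loud", "high"), 1), (("loud", "low"), 1), (("quiet", "low"), 0), (("quiet", "high"), 0)]
symbolic_scores = class_scores(train_frequency_nbc(symbolic), ("loud", "low"))

# Real-valued inputs (made up): the test values never occur exactly in training.
real_valued = [((0.412, 0.388), 1), ((0.407, 0.391), 1), ((0.102, 0.095), 0)]
real_scores = class_scores(train_frequency_nbc(real_valued), (0.409, 0.390))

print(symbolic_scores)  # class 1 is clearly preferred
print(real_scores)      # every class scores zero: no basis for a prediction
```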
The presence of real-valued input values is also the likely explanation for
the below-par performance of C4.5. This method relies on the recursive derivation
of a minimal decision tree for predictions. In the case of symbolic input
variables, decision tree branches are generated by `splitting' the relevant data
upon the possible values of the variable. However, with real-valued data, the
method's only recourse is to explore possible threshold values, i.e., possible ways
of splitting the range of observed values into two or more ranges. In the implementation
of C4.5 utilised for this experiment, binary splits were sought for
any real-valued variables. The resulting constraint is that effective performance
may only be achieved in the case where salient data groupings can be achieved
through the introduction of binary divisions in each variable (or dimension) of
the data space. A reasonable conclusion in the present case is that this requirement
is not satisfied by the training data.

Figure 5: Typical error curves for MLP training.
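The search for a binary threshold on a real-valued variable can be sketched as follows. For brevity the sketch scores candidate thresholds by plain information gain, whereas the experiment used the gain-ratio heuristic, so this is a simplification and not C4.5 itself. The function names and toy data are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels)) for c in counts.values())

def best_binary_split(values, labels):
    """Return (threshold, information_gain) for the best split 'value <= threshold'."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - remainder > best[1]:
            best = (threshold, base - remainder)
    return best

# Classes that fall into two contiguous ranges: one threshold separates them fully.
clean = best_binary_split([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])

# Classes that alternate along the variable: no single threshold helps much.
mixed = best_binary_split([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])

print(clean, mixed)
```

The second case illustrates the constraint noted above: when the salient groupings do not line up with a binary division of an individual variable, the best available split recovers only a fraction of the information.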
With respect to the three top-performing data-mining methods (MLP, kNN
and SVM), firm conclusions are less easily drawn. As a general rule, we expect
data-mining methods to produce similar levels of performance on the same
dataset. In the case where they do not, the explanation is generally related to the
inappropriateness of the bias introduced, i.e., the degree to which the particular
tendencies of the data-mining method are `out of tune' with the characteristics
of the data.
But though we may be able to explain the differences between the three
top-performing methods in terms of bias, there is still the problem of explaining
why none of the methods produces perfect performance. This is a key issue if
the work is to achieve any real applications potential. But here we run into the
fundamental problem in this area, namely the lack of any generic theory of data
mining. Data-mining methods attempt to produce a `solution' to a `problem'.
But the solution is not defined analytically. In fact, given the present state of
theoretical understanding, it is but vaguely characterised.
However, we can begin to make some guesses about where the problems
might lie. There appear to be several possible explanations for the sub-optimal
levels of prediction performance among these three methods. First, there is the
possibility that the methods were applied in an inappropriate way. Where a
method involves the setting of user parameters this is an ever-present risk. This
applies to kNN but particularly to SVM and MLP. In the case of the latter
method, the user must choose the number of hidden units to use, the activation
function, the learning rate and momentum as well as the number of epochs over
which training should be applied. Although the MLP method is surprisingly
robust, it is clear that the selection of an inappropriate value for one of these
variables might well affect the performance in an adverse manner.
Another possible explanation relates to the nature of the data-acquisition
process. The rather crude way in which audio data has been presented in the
form of training examples may have worked to conceal the regularities that
would enable accurate prediction. Although the original source for the training
data was a high-resolution audio file, for purposes of representing the data as a
training set, the data were reformulated in the form of mean-amplitude values
averaged over a one second interval. Thus, the information available to the data-
mining methods was in fact a low resolution representation of the underlying
sound source. It is possible that this reduction in resolution had an adverse
effect on the ultimate performance of all the data mining methods.
Another, more interesting possibility is that there is a fundamental mismatch
between the biases of the data mining methods used and the significant
characteristics of the data. Recall that in this experiment, the aim is to infer
changes in an underlying target (engine temperature) from the signature that
those changes produce in an ambient energy array (engine noise). In effect, the
aim is to use signal processing of the ambient data to obtain a virtual sensor
for the underlying target.
As noted, the fundamental problem with this idea is the fact that the way in
which the target impacts the ambient data will be very complex. Because many
different properties will contribute to changes in the same ambient variables,
there will be high levels of interference between different signatures and the
correspondences between variables of the target and variables of the ambient
data will be 1-to-many. We should therefore not expect to see any simple correspondences
between absolute values of ambient variables and absolute values of
the underlying variables. At best, the signatures to be processed will comprise
relational (i.e., non-absolute) patterns. If the indirect-sensing approach is to
succeed, then, the data-mining methods utilised must have the ability to detect
and exploit patterns of relationships.
This observation may point towards the beginnings of a more theoretical
explanation for the below-par performance of the data-mining methods used
on this problem. The degree to which these characteristically empirical data-mining
methods are able to exploit relationships is a matter of debate. However,
as has been argued elsewhere [23], it is rather clear that
simpler approaches such as nearest-neighbours and the decision-tree method are
not sensitive in any degree to relational properties of the data. They are solely
sensitive to absolute properties.
With respect to the neural network method (MLP), the judgement on this
issue is less certain. Early advocates of the method tended towards the view
that the MLP method is capable of exploiting relational properties of the data,
with its well-known capabilities with respect to the XOR problem being offered
as a `proof' [19]. However, this demonstration is open to re-interpretation [24]
and a conservative judgement on this issue would now rate the abilities of the
MLP with respect to relational problems as still under evaluation [25].
Similar remarks might be made with respect to the SVM, at least in the
configuration used for this experiment (i.e., using the standard polynomial kernel
function). However, the fact that the performance achieved by the SVM on
these data is close to that achieved by the ineluctably non-relational nearest-neighbours
method suggests that the performance deficit shown by both methods
may well be put down to the same cause.
More work is clearly needed to bring clarity to these issues. The long-term
aim must be to characterise more accurately the properties of the
data signatures which are key in this application. Theoretical development in
the area of computational learning theory may well start to fill in the gaps
with respect to our understanding of the interaction between bias and data.
With this information to hand, better judgements may be possible regarding
the bias-appropriateness of different data-mining methods for indirect-sensing

8 Concluding comments
The paper has explored the practicality of using indirect-sensing methodologies
for general-aviation instrumentation solutions. The approach described relies on
fusing information inherent in the bias characteristics of a data-mining method,
with exemplar-formatted data. Real-world data were collected and utilised in a
comparative study to identify which methods would produce best performance
on the data-mining aspect of the problem. The results showed that good results
were obtained using several different methods.
The lack of any firm theoretical framework for understanding bias/data interaction
was noted to be a significant problem for this investigation. It profoundly
limits the progress that can be made with respect to method selection
and/or customisation. Informal analysis suggests that the key aspect of bias in
this application is the ability to detect and exploit relational data effects, since
the 1-to-many character of the forces affecting ambient variables means that
absolute patterns are most unlikely to be of any significance.
While the performance levels obtained from the best-performers in this study
are inadequate for realistic applications work, they are nevertheless considerably
above the level of chance. Given the early stage of work, there is every reason
to hope that more research will yield practicable levels of performance in due
course.

References

[1] Gibson, P. and Power, C. (2000). Introductory Remote Sensing: Application
and Digital Image Processing. London: Routledge.
[2] Gibson, P. (2000). Introductory Remote Sensing: Principles and Concepts.
Routledge Publishers.
[3] Thornton, C. (2003). Indirect sensing through abstractive learning. Intelli-
gent Data Analysis, 7, No. 3/4.
[4] Michalski, R., Bratko, I. and Kubat, M. (1998). Machine Learning and
Data Mining: Methods and Applications. New York: Wiley.
[5] Platt, J. (2000). Human Factors and Flight Safety. Manchester: Airplan
Flight Equipment.
[6] Holsheimer, M. and Siebes, A. (1994). Data mining: the search for knowl-
edge in databases. Technical report CS-R9406, CWI.
[7] Weiss, S. and Indurkhya, N. (1998). Predictive Data Mining: A Practical
Guide. San Francisco: Morgan Kaufmann Publishers Inc.
[8] Edelstein, H. (1999). Introduction to data mining and knowledge discovery
(3rd ed). Potomac, MD: Two Crows Corp.
[9] Duda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis.
New York: Wiley.
[10] Nilsson, N. (1990). The Mathematical Foundations of Learning Machines.
San Mateo, California: Morgan Kaufmann.
[11] Mitchell, T. (1997). Machine Learning. McGraw-Hill.
[12] Hinton, G. (1989). Connectionist learning procedures. Artificial Intelligence,
40 (pp. 185-234).
[13] Quinlan, J. (1983). Learning efficient classification procedures and their
application to chess end games. In R. Michalski, J. Carbonell and T.
Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach.
Palo Alto: Tioga.
[14] Quinlan, J. (1993). C4.5: Programs for Machine Learning. San Mateo,
California: Morgan Kaufmann.

[15] Rumelhart, D., McClelland, J. and the PDP Research Group, (Eds.) (1986).
Parallel Distributed Processing: Explorations in the Microstructures of
Cognition. Vols I and II. Cambridge, Mass.: MIT Press.
[16] Michie, D., Spiegelhalter, D. and Taylor, C. (Eds.) (1994). Machine Learning,
Neural and Statistical Classification. Ellis Horwood.
[17] Rumelhart, D., Hinton, G. and Williams, R. (1986). Learning representa-
tions by back-propagating errors. Nature, 323 (pp. 533-6).
[18] Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York:
[19] Rumelhart, D., Hinton, G. and Williams, R. (1986). Learning internal rep-
resentations by error propagation. In D. Rumelhart, J. McClelland and
the PDP Research Group (Eds.), Parallel Distributed Processing: Explo-
rations in the Microstructures of Cognition. Vols I and II (pp. 318-362).
Cambridge, Mass.: MIT Press.
[20] Prechelt, L. (1996). Automatic early stopping using cross validation: quan-
tifying the criteria. Neural Networks, 9 (pp. 457-462). 3.
[21] Burges, C. (1998). A tutorial on support vector machines for pattern
recognition. Data Mining and Knowledge Discovery, 2, No. 2 (pp. 121-
[22] Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support
Vector Machines. Cambridge: Cambridge University Press.
[23] Thornton, C. (2000). Truth from Trash: How Learning Makes Sense. MIT
[24] Clark, A. and Thornton, C. (1997). Trading spaces: computation, representation
and the limits of uninformed learning. Behavioral and Brain
Sciences, 20 (pp. 57-90). Cambridge University Press.
[25] Thornton, C. (1996). Parity: the problem that won't go away. In G.
McCalla (Ed.), Proceedings of AI-96 (Toronto, Canada) (pp. 362-374).