
Experiments with Domain Knowledge in Knowledge Discovery
Javier Bejar bejar@lsi.upc.es
Ulises Cortes ia@lsi.upc.es
Ramon Sanguesa sanguesa@lsi.upc.es
Departament de Llenguatges i Sistemes Informatics
Universitat Politecnica de Catalunya
Jordi Girona Salgado 1-3. Barcelona 08034. Spain
ph: +34 3 4017016 Fax: +34 3 4017014
Manel Poch rqte3@onyar.udg.es
Laboratori d'Enginyeria Química i Ambiental
Universitat de Girona. Placa Hospital 6. Girona 17071. Spain

keywords: Conceptual Clustering, Domain Knowledge, Classification.


Abstract
Using domain knowledge in unsupervised learning has proven to be a useful strategy when
the set of examples of a given domain has no evident structure or presents some level
of noise. This background knowledge can be expressed as a set of classification rules and
introduced as a semantic bias during the learning process.
In this work we present some experiments on the use of partial domain knowledge with
the tool LINNEO+, a conceptual clustering algorithm. The domain knowledge (or domain
theory) is used to select a set of examples that will be used to start the learning process; this
knowledge need be neither complete nor consistent. This bias increases the quality of the
final groups and reduces the effect of the order of the examples. Some measures of the
stability of a classification are used.
This technique is applied to identify operational situations in the functioning of an urban
wastewater treatment plant.
1 Introduction
The use of unsupervised learning to discover useful concepts in sets of unclassified examples
simplifies the construction of rules for knowledge bases. Tools that aid this task and
increase the quality of the knowledge obtained are very desirable, all the more so
if the domain suffers from ambiguity, lack of a clear structure, a mix of qualitative and
quantitative knowledge, or lack of a broad consensus among experts.
In this work we present the methodology used by LINNEO+ [1, 2], which has been extended
to use domain knowledge in order to semantically bias a conceptual clustering algorithm [9,
5]. This domain knowledge helps to obtain more stable classifications and more meaningful
concepts from unclassified observations. The purpose of this technique is to extract concepts
or rules from a database of observations described by experts, as a way to ease the
building of knowledge bases for expert systems or case bases for case-based reasoning tools.
It is also shown that a little knowledge can produce a considerable gain, despite the
ambiguity or the partial incorrectness of that knowledge. This ambiguity can also be resolved
using the improved classifications, by performing specializations or generalizations that
correct the problems in the domain knowledge.
This paper is organized as follows: section 2 gives some basic notions about
LINNEO+. In §3 we give a detailed description of the concept of domain theory and its use
in the classification process; §4 presents a classical example to illustrate the effects of the bias
on the quality of the results. §5 and §6 present a practical application of the classification
procedure and the use of domain knowledge in a real domain (wastewater treatment plants).
In §7 some conclusions are given.
2 LINNEO+
LINNEO+ is a knowledge acquisition tool oriented to ill-structured domains, that is, domains with
a weak structure, imprecise information and a membership function of the observations to the
concepts. It uses an unsupervised learning strategy and incrementally accepts a stream of
observations, trying to discover a classification scheme from the data. As a control strategy it
retains only the best hypotheses that are consistent with the observations given a similarity
criterion. Part of LINNEO+ can be considered a conceptual clustering method with two
tasks that are critically important for its performance: clustering, which determines useful
subsets of a dataset; and characterization, which determines a concept for each extensionally
defined set discovered by clustering. The final characterization of the classes is built up by
GAR [14] as a rule set. This task needs an expert to accept (or reject) the resulting clusters.
Other modules try to better exploit observational knowledge from the dataset, or take
advantage of the expert's knowledge if available.
The expert has to define a set of observations that he considers sufficient to model the
domain, and also a set of attributes relevant to the intended classification goal. The
expert can represent attributes by means of two predefined types: quantitative and
qualitative. For each observation, a vector is defined whose length (n) is the number of
attributes. Classically, this vector holds the value of each property of the object, so the
usual representation of an object is:

$$O_i = \left( (\mathit{Attribute}_k, \mathit{Value}_k)^{+} \right)$$


In LINNEO+ this representation has evolved to a more expressive one:

$$O_i = \left( (\mathit{Attribute}_k, \mathit{Value}_k, \mathit{Status}_k)^{+} \right)$$


where Status_k can take one of the following values: [Missing, Nought, Illegal,
Acceptable]. The status denotes implicit information about the value of an attribute for a
given object; the idea is to exploit this additional information. We also keep the
representation (Attribute:Value) because it is more compact: when the status of an attribute is
Acceptable, the Value_k of the given Attribute_k belongs to its range; otherwise
the value is just its status.
Missing has the traditional interpretation. Two alternative strategies can be used to
substitute the missing value. The first assigns a value to that attribute using the
mean of the values, or the most frequent value, in its column; it is an a priori approach.
The second tries to induce the values after a classification that excludes the objects with
missing values; it is an a posteriori approach. The values of the classes that are most similar
to the objects with the missing values are used. LINNEO+ uses the second. When Nought appears
as status, it means that for this object the value of that attribute is irrelevant, and a special
treatment is given while the classification process is running.
When the status is Illegal, the interpretation is more complex: it means that this
object cannot have an Acceptable value (nor any other status) for this Attribute_k, because
there exists a structural or causal dependence on some Attribute_w, whose absence or presence
prevents Attribute_k from having a meaningful value. This information can be viewed as part of
the domain theory and it is treated after the classification step [1].
Once the expert has selected the attributes and the sample of observations, the classification
process starts. This process induces a tentative conceptual structure for the domain, assuming
that all the available information is present in the dataset. In general, any inductive
classification process will group objects into classes using some criterion of similarity. We
decided to use the classical concept of distance.
The numerical values are normalized to the interval [0, 1] in order to avoid the influence
of the different scales. The distance used in the example to determine the
similarity between two objects, O_i and O_j, is the generalized Hamming distance:
$$d(O_i, O_j) = \sum_{k=1}^{n} \mathit{diff}(O_{ik}, O_{jk}) \qquad (1)$$
where diff(O_ik, O_jk) is, for a qualitative attribute, 1 if O_ik and O_jk are different modalities
and 0 otherwise, and, for a quantitative attribute, the absolute value of their difference.
This similarity measure is very simple to compute and helps to find a preliminary
structure easily. Each attribute makes the same contribution to the similarity (in the interval
[0, 1]), so the two kinds of attributes can be mixed and their influence on the
description of the data can be studied.
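As a minimal sketch of equation (1), the distance can be written in a few lines of Python. The function names (`min_max_normalize`, `diff`, `distance`) and the convention that qualitative values are strings and quantitative values are already-normalized floats are assumptions made for this illustration, not part of LINNEO+; the status of an attribute is ignored here.

```python
from typing import Sequence, Union

Value = Union[float, str]  # assumption: str = qualitative, float = quantitative


def min_max_normalize(column: Sequence[float]) -> list:
    """Rescale a quantitative column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def diff(a: Value, b: Value) -> float:
    """Per-attribute term of equation (1)."""
    if isinstance(a, str) or isinstance(b, str):
        # Qualitative attribute: 1 for different modalities, 0 otherwise.
        return 0.0 if a == b else 1.0
    # Quantitative attribute: absolute value of the difference.
    return abs(a - b)


def distance(o_i: Sequence[Value], o_j: Sequence[Value]) -> float:
    """Generalized Hamming distance: the sum of the per-attribute terms."""
    if len(o_i) != len(o_j):
        raise ValueError("objects must have the same number of attributes")
    return sum(diff(a, b) for a, b in zip(o_i, o_j))
```

Because every term lies in [0, 1] once the quantitative columns are normalized, a qualitative mismatch weighs exactly as much as the largest possible quantitative difference.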
The center of a class is obtained by calculating, for each quantitative attribute, the mean
value over every object in the class. For qualitative attributes, the center includes each one
of its modalities with its corresponding occurrence frequency. Note that the center of a class
is considered as the prototype of the objects contained in the class. The distance between an
object and a class prototype can be taken as the inverse of the degree of membership of the
object to the class.
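The construction of a class center can be sketched as follows. The name `class_center` and the representation of a qualitative component as a dictionary from modality to relative frequency are assumptions of this example; the text does not specify the data structure LINNEO+ uses.

```python
from collections import Counter


def class_center(objects):
    """Prototype of a class.

    Quantitative attributes (floats) are averaged; qualitative attributes
    (strings) are summarized by the relative frequency of each modality.
    """
    n = len(objects)
    center = []
    for column in zip(*objects):  # iterate attribute by attribute
        if isinstance(column[0], str):
            counts = Counter(column)
            center.append({modality: c / n for modality, c in counts.items()})
        else:
            center.append(sum(column) / n)
    return center
```

For instance, three objects with values 0.0, 0.5 and 1.0 on a quantitative attribute and modalities a, a, b on a qualitative one yield a center whose first component is 0.5 and whose second component assigns frequency 2/3 to a and 1/3 to b.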
The aggregation algorithm builds clusters of similar objects given an initial parameter that
we call radius, which selects the level of generality of the induced concepts. A scheme of the
algorithm is:
• Use the first object of the dataset to generate the first class.
• For each one of the remaining objects in the dataset, the best class among the current
  ones is selected. The best class for an object is the one in the previous set of classes with
  minimum distance to the object. Two things can happen at this moment:
  - The distance to the best class is less than the current classification radius. In this case,
    the object is included in the class and the center of the class is recalculated. While
    recalculating the center, some objects may escape from the center and end up
    farther away than the radius. If this happens, these objects are eliminated from the class
    and marked as not classified in the first step.
  - The distance to the best class is greater than the current classification radius, or there
    is no best class. In this case a new class is created. The vector of the object
    currently under consideration becomes the initial center of the class.
• The objects marked as not classified in the first step are now reclassified, but this time
  without modifying the centers, to avoid endless recalculation.
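The scheme above can be sketched in Python under some simplifying assumptions that are not taken from the paper: all attributes are quantitative and already normalized, an object at exactly the radius counts as outside, and in the final step each pending object is placed in its nearest class. The names `dist`, `mean` and `aggregate` are hypothetical.

```python
def dist(a, b):
    """Sum of absolute differences (equation (1) for quantitative attributes)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def aggregate(dataset, radius):
    """Return (classes, centers); each class is a list of indices into dataset."""
    classes = [[0]]                 # the first object generates the first class
    centers = [list(dataset[0])]
    pending = []                    # objects marked as not classified
    for idx in range(1, len(dataset)):
        obj = dataset[idx]
        best = min(range(len(centers)), key=lambda c: dist(obj, centers[c]))
        if dist(obj, centers[best]) < radius:
            classes[best].append(idx)
            centers[best] = mean([dataset[i] for i in classes[best]])
            # Objects that escaped from the recalculated center leave the class.
            escaped = [i for i in classes[best]
                       if dist(dataset[i], centers[best]) >= radius]
            classes[best] = [i for i in classes[best] if i not in escaped]
            pending.extend(escaped)
        else:
            classes.append([idx])   # the object becomes the initial center
            centers.append(list(obj))
    # Second step: centers stay fixed while pending objects are reclassified
    # (assumption: each one goes to its nearest class).
    for idx in pending:
        best = min(range(len(centers)),
                   key=lambda c: dist(dataset[idx], centers[c]))
        classes[best].append(idx)
    return classes, centers
```

Since the first object always seeds the first class and every center moves as objects arrive, feeding the same data in a different order can produce different classes, which is the ordering effect discussed in the text.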
The result, once it has been confirmed by the expert, is a list of classes. Two aspects
are taken into account when representing a class: its extensional description and its
intensional description. The first is given by the enumeration of all elements contained in
the class. The second is a vector of n attributes representing the class center.
The incrementality of the algorithm means that the results depend on the order of
presentation of the observations [6]. Some syntactical heuristics, called not-yet heuristics,
have been developed to reduce this effect [16, 1].
This methodology has been successfully applied to several real domains, such as mental
illnesses [15], marine sponge classification [2] and fault diagnosis in wastewater treatment
plants [17].
3 Using a Domain Theory
In this section we introduce the concept of domain theory (DT), which expresses what the
expert can explain about the domain that he is defining. We describe the syntax used by
the expert to define a domain theory and the role that this knowledge plays in the
classification process.
3.1 Defining a Domain Theory
In unsupervised learning, and especially in ill-structured domains, the description of the
observations is usually not enough to build a set of concepts. Noise in the observations, the
existence of irrelevant descriptors or non-homogeneous sampling of the observations can
divert the learning process from a meaningful result. Guidance from a
higher level of knowledge is therefore desirable to ensure the success of the acquisition.
In our methodology, we allow the expert to define a Domain Theory (DT) as a group of
constraints guiding the inductive process. The DT therefore semantically biases the set of
possible classes. This DT acts just as a guide; it does not need to be complete. It can be
very interesting for the experts to play with several definitions of the DT, as they can model
several levels of expertise or obtain different classifications using different points of view or
biases. With no DT available, LINNEO+ acts just as an apprentice with a syntactic heuristic
that groups objects by their similarity. The expert can express his DT in terms of rules
that give a partial definition of classes he already knows to exist. A
rule is composed of a class name (an identifier) and some constraints, a set of conditions that
elements must fulfill in order to belong to the class. LINNEO+ accepts the following syntax
to express rules:
L_val      = Value | (Value+)
Op         = = | neq (≠) | > | < | >= | <= | range
Clause     = (Op L_val Attribute_i) | (rel lisp_exp)
CompClause = Clause | (or Clause+)
Rule       = (CompClause+ => C_j)
"(rel lisp_exp)" stands for a relational expression between attributes expressed in LISP syntax;
L_val can be a non-empty list of modalities in the case of qualitative attributes, or a single
value in the case of quantitative attributes; Attribute_i is the target attribute; and C_j is a
dummy identifier for the set of objects that satisfy this rule. The clauses of a rule are
combined as a conjunction, and the or operator allows disjunctions to be expressed. This
syntax can easily be adapted to a given domain, if needed.
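To make the semantics of this syntax concrete, the following sketch evaluates rules over objects stored as Python dictionaries. The encoding of a clause as a tuple `(op, l_val, attribute)`, the function names `holds` and `satisfies`, and the reading of `range` as an inclusive (low, high) pair are assumptions of this illustration; `rel` clauses with LISP expressions are not covered.

```python
import operator

# Comparison operators applied as: attribute_value <op> l_val
COMPARE = {">": operator.gt, "<": operator.lt,
           ">=": operator.ge, "<=": operator.le}


def holds(clause, obj):
    """True if a single clause, or an (or ...) of clauses, holds for obj."""
    if clause[0] == "or":
        return any(holds(c, obj) for c in clause[1:])
    op, l_val, attribute = clause
    value = obj[attribute]
    if op == "=":
        # A list of modalities means "the value is one of these".
        if isinstance(l_val, (list, tuple)):
            return value in l_val
        return value == l_val
    if op == "neq":
        if isinstance(l_val, (list, tuple)):
            return value not in l_val
        return value != l_val
    if op == "range":
        low, high = l_val  # assumption: inclusive bounds
        return low <= value <= high
    return COMPARE[op](value, l_val)


def satisfies(rule, obj):
    """A rule is (clauses, class_name); its clauses form a conjunction."""
    clauses, _class_name = rule
    return all(holds(c, obj) for c in clauses)
```

With this encoding, a rule such as `((= (brown dk-brown-blk) canker-lesion) => C)` would be written `([("=", ["brown", "dk-brown-blk"], "canker-lesion")], "C")`.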
3.2 Biasing with a Domain Theory
If the expert is able to build a DT, it is possible to use this knowledge to bias the
classification, using the constraints as a guide for preprocessing the dataset. Even in
ill-structured domains the expert knows that ignoring some attributes for certain classes can be
useful, because those attributes are not relevant for predicting class membership. In the same
way, the expert knows that there are other attributes, or conjunctions of them, that can be
used, with a certain degree of confidence, to try to predict class membership. The idea is to
use the rules defined by the expert to partition the dataset into meaningful parts: the objects
about whose relation there is some knowledge (those described by the rules). Objects that
fulfill none of the rules are treated as they would be without a Domain Theory.
The dataset is processed before the classification as follows. All the
objects that satisfy a rule (R_i) are grouped together (S_{R_i}). Sometimes the expert gives
more than one rule to constrain a set (or class). Sometimes, when rules are too general, two or
more rules select the same object; in this case a special set is created and the rules pointing
to that object are attached to it. All the objects that do not satisfy any rule are grouped in a
residual set. After this process LINNEO+ has generated at most r + 2 sets of objects, where
r is the number of sets that the expert has constrained.
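A possible reading of this preprocessing step is sketched below. Rules are given as (class name, predicate) pairs, where the predicate is any Python callable; this encoding, the name `partition`, and the decision to treat an object as conflicting only when rules for different class names select it are assumptions of the example.

```python
def partition(dataset, rules):
    """Split dataset into per-class sets, a special set and a residual set.

    rules: list of (class_name, predicate) pairs. Several rules may share
    a class name, in which case they constrain the same set.
    """
    sets = {name: [] for name, _ in rules}
    special = []    # objects selected by rules of two or more classes
    residual = []   # objects selected by no rule
    for obj in dataset:
        matched = sorted({name for name, pred in rules if pred(obj)})
        if not matched:
            residual.append(obj)
        elif len(matched) == 1:
            sets[matched[0]].append(obj)
        else:
            # Keep the names of the conflicting rules attached to the object.
            special.append((obj, matched))
    return sets, special, residual
```

The result holds at most r + 2 non-empty groups: one per constrained class, plus the special and residual sets, matching the count given in the text.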
In each one of these sets, except for the special and the residual ones, LINNEO+ starts a
classification process and creates at least one class for each. Then a new process
begins, with the centers of these classes as seeds of the new classification, over the rest of
the objects. In this process new classes can be formed, corresponding to classes not described
by the rules.
The bias is obtained by reordering and pre-grouping the observations in a
meaningful scheme, rather than using the random order of the unbiased process. This yields a
more meaningful set of classes, closer to the idea that the expert has of his domain structure.
It also avoids the instability induced by the ordering of the observations.
This technique allows the expert to make explicit and manage his knowledge about a
given domain easily and in an incremental fashion. This knowledge can be used, when a DT
(a semantic bias) is available, to refine the expert's knowledge, or to model several degrees of
experience while trying to generate a classification schema. The use of a DT improves the
results and helps the expert to explore the structure of his domain. A similar approach is taken
in CONFORT [20].
4 Experiments with a Domain Theory
In order to test the effect of a domain theory on the classification process, we have written
a small set of rules for the Soya bean domain (see table 1) to bias the resulting classes. These
rules were built by inspecting the prototypes of the classes of an unbiased classification and
extracting the most distinctive attributes. This set of rules is neither complete nor consistent,
because we just want to show that a small piece of domain knowledge is enough to improve
the stability, and therefore the quality, of a classification. These rules select 130 observations
from a total of 307.
The experiment was carried out by comparing two sets of 20 randomly ordered classifications,
obtained using LINNEO+, of the Soya bean dataset [10] from the UCI Repository of Machine
Learning Databases and Domain Theories [12]: the first without the use of a domain theory, the
second using our set of rules as domain theory.
In order to compare the resulting classifications we have developed an algorithm that
provides a measure of the differences between two classifications [1, 4]. This measure, which
we call structural coincidence, is used to provide a value for the stability of each set of
classifications as the mean of the difference of each pair of classifications in the set. Among
these differences, the coincidence of objects in the same group and the number of classes
of each classification are taken into account (see appendix A).

(1)  ((= (diseased) fruit-pods)
      (= (colored) fruit-spots)
      (= (norm) seed)
      -> frog-eye-leaf-spot)
(2)  ((= (lt-normal) plant-stand)
      (= (severe) severity)
      (= (brown dk-brown-blk) canker-lesion)
      -> phytophthora-and-rhizoctonia-root-rot)
(3)  ((= (norm) fruit-pods)
      (= (tan) canker-lesion)
      (= (lt-norm norm) precip)
      -> charcoal-and-brown-stem-rot)
(4)  ((= (dk-brown-blk) canker-lesion)
      (= (abnorm) seed)
      (= (gt-norm) precip)
      -> anthracnose)
(5)  ((= (abnorm) seed)
      (= (tan) canker-lesion)
      -> purple-seed-stain)
(6)  ((= (few-present) fruit-pods)
      -> cyst-nematode)
(7)  ((= (norm) leaves)
      (= (gt-norm) temp)
      -> diaporthe-pod-and-stem-blight)
(8)  ((= (above-sec-nde absent) stem-cankers)
      (= (brown) canker-lesion)
      (= (norm) fruit-pods)
      -> diaporthe-stem-canker-and-brown-spot)
(9)  ((= (lower-surf) leaf-mild)
      -> downy-mildew)
(10) ((= (upper-surf) leaf-mild)
      -> powdery-mildew)
(11) ((= (no) lodging)
      (= (w-s-marg no-w-s-marg) leafspots-marg)
      (= (90-100%) germination)
      -> brown-stem-rot-and-herbicide-injury)
Table 1: A Soya Bean Domain Theory
Another measure of stability used in the comparison is based on the coincidence of
the pairs of associations of observations between two partitions, described in [4]; this measure
decreases with the similarity.
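The way a stability value is derived from a set of classifications, as the mean over every pair in the set, can be sketched as follows. The structural coincidence measure itself is defined in appendix A and is not reproduced here; the Rand index is used below purely as a stand-in pairwise agreement, and the names `rand_index` and `stability` are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two partitions agree.

    Stand-in agreement measure, not the paper's structural coincidence.
    Each argument gives the class label of every object, in the same order.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def stability(classifications, agreement=rand_index):
    """Mean agreement over every pair of classifications in the set."""
    scores = [agreement(a, b) for a, b in combinations(classifications, 2)]
    return sum(scores) / len(scores)
```

Any other pairwise measure can be passed through the `agreement` parameter, so the averaging scheme stays the same whichever comparison is plugged in.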
The stability of the classifications of the Soya Bean dataset without the DT is 77.6% for
the first measure and -1013.4 for the second. Using the DT, the stability increases to 91%
for the first measure and -4285.6 for the second. A cross comparison between the two sets
of classifications yields a value of 79.9% for the structural coincidence. This value was
calculated by comparing each classification resulting from each method with all the others and
averaging. The interpretation of this value is that the classifications using the domain theory
are similar to those created without using a bias, but much more stable.
Another result is that in the set of classifications without domain theory the number of
classes ranges from 15 to 21, whereas using the rules it ranges from 15 to 19, in both cases
with a mean of 18 classes.
Applying this technique to other datasets yields similar results, as can be seen in table 2
[1].
In the light of these results, we can say that the use of domain knowledge in unsupervised
learning reduces the problem of obtaining meaningless groupings and also reduces the
instability induced by an improper input order.

Dataset           Without DT  With DT
Marine Sponges    73.3%       80.9%
Mental Illnesses  74.5%       88.5%
Wastewater        63.7%       69.5%
Table 2: Structural coincidence in other datasets

[Diagram not reproduced: water flows from the input (P1) through the pretreatment (P2), the
primary settler (P3), the biological reactor and the secondary settler to the output (P4);
primary sludge leaves the primary settler, and sludge from the secondary settler is either
recycled to the biological reactor or purged.]
Figure 1: Wastewater treatment process
5 Application to WWTP
LINNEO+ has been applied to classify operating situations of real urban wastewater
treatment plants that use a biological process known as the activated sludge process. Activated
sludge is undoubtedly the most widespread wastewater treatment. In this process, a mixture of
several microorganisms transforms the biodegradable pollutant (expressed in units of organic
matter as Biological Oxygen Demand (BOD) or Chemical Oxygen Demand (COD)) into new
biomass, with the addition of dissolved oxygen supplied by an aeration system. A primary
treatment is usually carried out before the water enters the biological reactor.
In figure 1, a scheme of a prototype plant is presented. As shown, after the primary settler,
water is first treated in the bioreactor where, by the action of the microorganisms, the level of
substrate is reduced. Next, the water flows to a secondary settler, where the biomass sludge
settles. Thus clean water remains at the top of the settler and is carried out of the plant.
A fraction of the sludge is returned to the input of the bioreactor in order to maintain an
appropriate level of biomass, allowing the oxidation of the organic matter. The rest of the
sludge is purged.
Real-time control of the process is a quite complex problem due to the lack of
reliable instrumentation and the simplicity of the models used to describe the microbiological
processes that take place in the bioreactor. In this context, although some advanced control
techniques, like predictive control, have obtained promising results, they are not able to handle
a number of situations that need to consider qualitative knowledge [11]. Consequently, the
personal expertise of the plant manager is necessary to attain an efficient management of the
process.
In addition to this problem, and given the social importance of this kind
of plant for preserving the ecological equilibrium of water bodies, many variables
related to the organic matter and microorganisms are measured in the plants, producing a
great deal of information that is difficult to manage.
The plant studied is located in Manresa, a town of 100,000 inhabitants near
Barcelona (Catalonia). The plant treats a flow of 35,000 m3/day, mainly domestic wastewater,
although wastewater from industries located inside the town is received in the plant too.
Initially, the experts provided a set of 38 variables, 8 of which are quality indicators, that are
measured daily (see table 3) in several places of the plant: at the input (P1), denoted with
the suffix -E, which characterizes the hydraulic flow; after the pretreatment (P2), denoted with
the suffix -P; at the input of the biological reactor (P3), denoted with the suffix -D; and at
the water output of the plant (P4), denoted with the suffix -S. Nine of the variables are
percentages of performance, denoted with the prefix Rd- (see figure 1). In this study the
behavior of the plant over 527 days during the period 1990-1991 has been considered. The
dataset was originally recorded to be used with the K-means algorithm [17]; its adaptation to
LINNEO+ caused no extra work for the expert, and no data engineering was done.
The original classification (see table 4) was obtained after setting the radius to 5 and without
the DT. The result was 13 meaningful classes (at the taxonomic level of "working situation"
[18][19]). A situation is an operating state of the plant, described by measures of the
relevant attributes of the process.
In a parallel study, we used the K-means algorithm implemented in the Systat package,
with the Euclidean distance, to classify the wastewater treatment data and compare
the results. The experts chose 12 as the most suitable number of classes. The use of the
Euclidean distance and the normalization of the data using the standard deviation seem to
improve the results of the K-means algorithm. Most of the resulting classes were almost the
same. Thus, both methodologies allow the data to be classified according to the nature of the
problem presented each day.
There is only one significant difference: the K-means algorithm gives three big classes that
group all the correct (normal) functioning situations, with very slight semantic differences
but a very similar number of days (175, 181, 123). LINNEO+ also finds the same three
classes (Class-1, Class-5 and Class-11 of table 4), but with a greater difference
between their numbers of days (275, 116, 53); the expert also agreed to consider all these
classes as representing the correct functioning situation, again with slight differences.
LINNEO+ also identified a fourth class of normal situations, Class-9, which represents a set
of 69 days where operational conditions were just outside the limits but the overall behavior
of the system was considered normal both by the system and by the expert. This separation of
the normal situations into four classes was one of the major differences from the expert's first
classification. In his opinion there was only one class, including Class-1, Class-5, Class-9 and
Class-11.
The classes that correspond to abnormal situations (e.g. storm, bulking) are almost
identical in both classifications. The normal situations, as explained, were clustered
differently, but all the classes show few differences. See [17] for more details.
As an example of the classification obtained, the center of Class-11 is shown in table 5.
According to the values of the prototype, this class has been identified as a normal situation of
the plant, with normal influent values and a performance slightly above the mean situation,
yielding a normal effluent. This interpretation was confirmed when it was compared with
the daily log of the plant.
The data used in this example are available in the UCI Repository of Machine Learning
Databases and Domain Theories, and can be obtained by anonymous ftp from ics.uci.edu.

Attrib.  Description                                            Units
Q        Flow                                                   m3/day
Zn       Concentration of Zn                                    mg/l
pH       pH                                                     mg/l
BOD      Measure of the biodegradable organic matter            mg/l
COD      Measure of the chemically oxidizable organic matter    mg/l
SS       Measure of the suspended solids                        mg/l
VSS      Measure of the volatile suspended solids               mg/l
Sed      Measure of the sedimentable solids                     mg/l
Cond     Measure of electric conductivity                       mg/l
Table 3: List of experimental attributes considered in the classification study
Class  Days                        Operation      Interpretation  Classification
1      275                         Right          -               Normal
2      1 (13/3/90)                 Out of limits  Operation       Problems in the secondary treatment
3      1 (14/3/90)                 Out of limits  Operation       Problems in the secondary treatment
4      4 (15/3/90, 17-18-19/7/91)  Out of limits  Operation       Problems in the secondary treatment
5      116                         Right          -               Normal
6      3 (5/6/90, 28-31/5/91)      Right          Input           Overloading
7      1 (29/4/90)                 Out of limits  Operation       Problems in the primary and secondary treatment
8      1 (14/9/90)                 Right          Input           Storm
9      69                          Out of limits  -               Normal
10     1 (12/8/90)                 Right          Input           Storm
11     53                          Right          -               Normal
12     1 (22/10/90)                Right          Input           Storm
13     1 (24/5/91)                 Right          Input           Overloading
Table 4: List of classes obtained by LINNEO+ and experts' interpretation
In order to measure the quality of the classification, the stability of the classification has
been chosen as the parameter, as in the previous example. The stability for this dataset
is 63.7%; this value was computed from 25 randomly ordered classifications. This means
that the structure of the domain is weak, but it exists. The main reason for this instability is
the four normal classes: they are very similar, and some objects change class from one
classification to another. The objects of the other classes usually stay in the same class.
6 Applying domain knowledge to WWTP
After the first classification (see table 4), which was performed without using a DT, we moved
on to the inclusion of rules to bias the classification process and improve the resulting classes.
The expert was asked to give some rules that would allow the system to use his knowledge to
bias the classification process. We show two experiments that bias the process towards
different objectives. In the first experiment, the expert gave only 1 rule pointing to 1
independent set of objects.
He took into consideration the stringent water quality standards imposed by the European
Union recommendations for the optimum performance of a wastewater treatment plant of this
kind, to be included as the biasing rule. Specifically, these recommendations provide for fining
the plant operators when the limits of biomass and suspended solids exceed the safety intervals.
At the time these European regulations were stricter than the Spanish regulations.
This led to the rule shown in figure 2, which allows the plant manager to decide
whether the plant is performing well (inside the EU safety limits) and, if not, to take
corrective measures.
The results obtained after biasing the classification process have been satisfactory, because
all the normal situations were set apart in a set containing 370 days out of 527. After
the classification process, all the remaining days were grouped in different classes, allowing
the experts to identify and interpret the different malfunctioning situations and to
take specific measures for each situation.

Attrib.  Value    Attrib.  Value   Attrib.    Value
Q-E      32935.3  Sed-P    5.82    VSS-S      83.75
Zn-E     2.82     Cond-P   1861.2  Sed-S      0.76
pH-E     8.11     pH-D     8.07    Cond-S     1812.6
BOD-E    244.58   BOD-D    159.25  Rd-BOD-P   41.26
COD-E    521.25   COD-D    353.18  Rd-SS-P    53.13
SS-E     216.37   SS-D     108.3   Rd-Sed-P   85.44
VSS-E    71.64    VSS-D    78.51   Rd-BOD-S   87.36
Sed-E    5.55     Sed-D    0.01    Rd-COD-S   74.68
Cond-E   1835.91  Cond-D   0.49    Rd-BOD-G   91.61
pH-P     8.1      pH-S     7.8     Rd-COD-G   82.32
BOD-P    279.94   BOD-S    19.62   Rd-SS-G    91.34
SS-P     241.72   COD-S    87.79   Rd-SedS-G  99.49
VSS-P    69.39    SS-S     17.85
Table 5: Example center: Class center number 11

((< 25 BOD-S) (< 35 SS-S) (< 125 COD-S) -> normal)

Figure 2: The biasing rule
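As a worked check, the biasing rule of figure 2 can be applied to the center of Class-11 from table 5. The sketch below reads each clause `(< limit attribute)` as "attribute is below limit", following the (Op L_val Attribute) syntax of section 3.1; the names `LIMITS`, `CLASS_11_CENTER` and `is_normal` are hypothetical, and only the three output values involved are copied from the table.

```python
# Upper limits from figure 2, one per output attribute.
LIMITS = {"BOD-S": 25, "SS-S": 35, "COD-S": 125}

# Output values of the Class-11 center, copied from table 5.
CLASS_11_CENTER = {"BOD-S": 19.62, "SS-S": 17.85, "COD-S": 87.79}


def is_normal(day):
    """True when every output value is strictly below its limit."""
    return all(day[attribute] < limit for attribute, limit in LIMITS.items())


# The prototype satisfies all three clauses, so the rule selects it.
assert is_normal(CLASS_11_CENTER)
```

This agrees with the expert's reading of Class-11 as a normal situation with a normal effluent.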
At the same time, the clustering process makes it possible to establish differences among the
normal situations: although there is a big class composed of situations (days) with a normal
influent and a normal performance, there are also sets of situations with a high or very high
influent or with a low influent, and specific situations like storms or the presence of toxic
substances, that are recognized as well, improving and easing the identification process.
Introducing a bias gives the plant manager information about the measures to
take to correct the functioning of the plant. This is an improvement over the
previous classification, in which the experts had only a set of classes that did not show
whether the European Union administrative recommendations were met or not. Also, the
stability of the classification increases from 63.7% to 65.2%.
Evidently, these results are conditioned by the initial objective of establishing two groups of
situations: those in which the plant complies with the recommendations and stays inside the
European Union safety limits, and those in which it does not.
If another kind of objective were pursued, another set of rules would be taken into
consideration, introducing a different kind of bias.
To show this, a second set of rules was used. In this case, the expert characterized three
kinds of normal situations that appear in the dataset: normal, normal over the mean
and normal below the mean. These roughly correspond to Class-1, Class-5 and Class-11
of the original classification. The set of rules can be seen in figure 3.
This set of rules increases the stability of the original classification from 63.7% to 69.5%.
The number of objects selected is 95. These rules take into consideration the experts'
experience of the functioning of the plant, instead of an EU recommendation, to obtain a better
definition of the three classes of normal situations.
A very interesting result was that, after using LINNEO+ to create classifications in the
wastewater treatment domain, with and without a DT, the experts found LINNEO+'s
class representation to be very expressive. After some iterations with it, they discovered
that the set of attributes they had first chosen to describe the domain contained a subset of
irrelevant elements (i.e., all those attributes prefixed with Rd-, see table 3), and that some
others have to be used in combination with others. After the identification of the irrelevant
attributes we removed those 9 (out of 38) that referred to the performance of the plant.

((rel (< (/ BOD-E COD-E) .65)) (rel (> (/ BOD-E COD-E) .35))
(rel (< (/ COD-S COD-E) .35)) (rel (> (/ COD-S COD-E) .32))
-> below)

((rel (< (/ BOD-E COD-E) .65)) (rel (> (/ BOD-E COD-E) .35))
(rel (< (/ COD-S COD-E) .3)) (rel (> (/ COD-S COD-E) .27))
-> normal)

((rel (< (/ BOD-E COD-E) .65)) (rel (> (/ BOD-E COD-E) .35))
(rel (< (/ COD-S COD-E) .25)) (rel (> (/ COD-S COD-E) .22))
-> over)

Figure 3: The second set of rules

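As a minimal sketch of how the rules of Figure 3 could be read, each rule is an interval
test on two ratios of plant measurements. The Python below is illustrative only: the
attribute names BOD-E, COD-E and COD-S and the thresholds come from the figure, while the
dictionary representation of an observation and the function names classify and
select_seed are assumptions, not part of LINNEO+.

```python
# Bounds on COD-S / COD-E for each label, taken from Figure 3.
# All three rules share the same bounds on BOD-E / COD-E.
BOD_COD_BOUNDS = (0.35, 0.65)
RULES = [
    ("below", 0.32, 0.35),
    ("normal", 0.27, 0.30),
    ("over", 0.22, 0.25),
]


def classify(obs):
    """Return the label of the first rule matched by `obs`, or None.

    `obs` is a hypothetical representation: a dict mapping attribute
    names to numeric measurements.
    """
    ratio_in = obs["BOD-E"] / obs["COD-E"]
    low, high = BOD_COD_BOUNDS
    if not (low < ratio_in < high):
        return None
    ratio_out = obs["COD-S"] / obs["COD-E"]
    for label, low, high in RULES:
        if low < ratio_out < high:
            return label
    return None


def select_seed(observations):
    """Keep only the observations matched by some rule, with their label."""
    labelled = ((obs, classify(obs)) for obs in observations)
    return [(obs, label) for obs, label in labelled if label is not None]
```

Observations that match no rule are simply left out of the selection, which is consistent
with a domain theory that does not need to be complete.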
The studied dataset is especially hard because of the small number of examples of
malfunctioning situations (only 17 out of a total of 527 observations), but the results are
promising because all of them have been clearly discovered.
The results of this process are used to construct a knowledge base for the automatic control
and supervision of the wastewater plant, as reported in [18] and [17].
7 Conclusions
LINNEO+, as shown, implements a methodological process of automatic knowledge acquisition
from observational datasets in ill-structured domains. It has some advantages, such as
allowing the use of quantitative and qualitative attributes in the same dataset. It allows an
expert to make explicit and manage his knowledge about a given domain easily and
incrementally. When a DT (a semantic bias) is available, this knowledge can be used to refine
the expert's knowledge (§3), or to model several degrees of experience while trying to
generate a classification schema. The use of a DT improves the results and helps the expert
to explore the structure of his domain (§6).
It has been shown that the use of domain knowledge as a semantic bias in an unsupervised
learning algorithm increases the quality of the result. The domain knowledge does not have to
be perfect: it can contain some ambiguities or inconsistencies and still yield an increase in
the stability of the results.
This knowledge also helps to cope with the ordering effect that all incremental algorithms
suffer from, biasing the process towards a meaningful result and yielding a better set of
concepts. The use of a domain theory also makes it possible to explore different
classification options based on the selection of various relevant aspects of the domain.
A Algorithm to compare classifications
The algorithm that calculates the similarity between two classifications is as follows:
1. Calculate the coincidence matrix between the two classifications: for each class in the
first classification, count the number of objects that appear in each class of the other
classification.
2. For each class in the first classification, find the most similar class in the other
classification, that is, the class with the most objects in common. No class is associated
more than once, which avoids the possibility of asymmetry in the measure.
3. The measure is the quotient of the sum of the coincidences between the paired classes
and the total number of objects.
The stability of a set of classifications is the mean of the similarity over all pairs of
classifications in the set.
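The three steps and the stability measure can be sketched in Python as follows. This is one
possible reading, not the authors' implementation: representing a classification as a
dictionary from object identifier to class label, the function names, and resolving step 2
greedily by decreasing overlap are all assumptions.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def coincidence_matrix(first, second):
    """Step 1: count the objects shared by each pair of classes.

    Both arguments map object id -> class label and are assumed to
    cover the same set of objects.
    """
    if first.keys() != second.keys():
        raise ValueError("classifications must cover the same objects")
    return Counter((first[obj], second[obj]) for obj in first)


def similarity(first, second):
    """Steps 2 and 3: pair the classes and normalise the shared objects."""
    counts = coincidence_matrix(first, second)
    # Assumption: pairs are taken greedily by decreasing overlap, with
    # labels as a deterministic tie-break; no class is paired twice.
    ordered = sorted(
        counts.items(),
        key=lambda item: (-item[1], str(item[0][0]), str(item[0][1])),
    )
    used_first, used_second = set(), set()
    matched = 0
    for (label_a, label_b), shared in ordered:
        if label_a in used_first or label_b in used_second:
            continue
        used_first.add(label_a)
        used_second.add(label_b)
        matched += shared
    return matched / len(first)


def stability(classifications):
    """Mean similarity over all pairs of classifications in the set."""
    return mean(similarity(a, b) for a, b in combinations(classifications, 2))
```

For example, with c1 = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "C"} and
c2 = {1: "x", 2: "x", 3: "y", 4: "y", 5: "y", 6: "z"}, the paired classes share 2, 2 and 1
objects, so similarity(c1, c2) is 5/6.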
References
1. J. Bejar. Adquisicion de conocimiento en dominios poco estructurados. PhD thesis,
Departament de Llenguatges i Sistemes Informatics. Facultat d'Informatica de Barcelona.
Universitat Politecnica de Catalunya, 1995.
2. J. Bejar, U. Cortes, and M. Domingo. Using domain theory to bias classification
processes in ill-domains. In Actas del IV congreso Iberoamericano de Inteligencia Artificial
(IBERAMIA '94), pages 187-197, 1994.
3. G. Biswas, J. Weinberg, Q. Yang, and G. R. Koller. Conceptual clustering and exploratory
data analysis. In Proceedings of the 8th International Workshop on Machine Learning,
pages 591-595, 1991.
4. D. Faith and L. Belbin. Comparison of classifications using measures intermediate between
metric dissimilarity and consensus similarity. Journal of Classification, 3:257-280, 1986.
5. D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine
Learning, 2:139-172, 1987.
6. D. Fisher, L. Xu, and N. Zard. Ordering effects in clustering. In Proceedings of the Ninth
International Workshop on Machine Learning, pages 163-168, 1992.
7. D. F. Gordon and M. desJardins. Evaluation and selection of biases in machine learning.
Machine Learning, 20:5-22, July 1995.
8. A. K. Jain and R. C. Dubes. Algorithms for clustering data. Prentice Hall, 1989.
9. R. Michalski and R. E. Stepp. Learning from observation: Conceptual Clustering, volume 2
of Machine Learning an A.I. Perspective, chapter 11, pages 331-363. Ed. Tioga, Palo Alto,
California, 1983.
10. R. S. Michalski and R. L. Chilausky. Learning by being told and learning from examples:
An experimental comparison of the two methods of knowledge acquisition in the context
of developing an expert system for soybean disease diagnosis. International Journal of
Policy Analysis and Information Systems, 4(2), 1980.
11. R. Moreno, C. de Prada, J. Lafuente, M. Poch, and G. Montagne. Non linear predictive
control of dissolved oxygen in the activated sludge process. In ICCAFT 5/IFAC-BIO 2
Conference. Keystone (USA), 1992.
12. P. M. Murphy and D. W. Aha. UCI repository of machine learning databases. Technical
report, Irvine, CA: University of California, Department of Information and Computer
Science, 1994.
13. D. Ourston and R. J. Mooney. Theory refinement combining analytical and empirical
methods. Artificial Intelligence, 1993.
14. D. Riaño. Automatic knowledge generation from data in classification domains. Master's
thesis, Facultat d'Informatica de Barcelona. Universitat Politecnica de Catalunya, 1994.
15. E. Rojo. Aplicacion del software LINNEO a la clasificacion de transtornos mentales. PhD
thesis, Divisio de Ciencies de la Salut. Facultat de Medicina. Universitat de Barcelona,
1993.
16. J. Roure. Study of methods and heuristics to improve the fuzzy classifications of
LINNEO+. Master's thesis, Facultat d'Informatica de Barcelona. Universitat Politecnica
de Catalunya, 1994.
17. M. Sanchez, U. Cortes, J. Bejar, J. de Gracia, J. Lafuente, and M. Poch. Concept
formation in WWTP by means of classification techniques: A compared study. Applied
Intelligence, 1996. (Accepted).
18. P. Serra, M. Sanchez, J. Lafuente, U. Cortes, and M. Poch. DEPUR: a knowledge-based
tool for waste water treatment plants. Engineering Applications of Artificial Intelligence,
7(1):23-30, 1994.
19. P. Serra, M. Sanchez, J. Lafuente, U. Cortes, and M. Poch. ISCWAP: A knowledge-based
system for supervising activated sludge processes. Computers and Chemical Engineering,
1996. (Accepted).
20. J. J. F. Vasco, C. Faucher, and E. Chouraqui. A knowledge acquisition tool for
multi-perspective concept formation. In 9th European Knowledge Acquisition Workshop (EKAW
'96), pages 227-244, 1996.
