in Knowledge Discovery
Javier Bejar bejar@lsi.upc.es
Ulises Cortes ia@lsi.upc.es
Ramon Sanguesa sanguesa@lsi.upc.es
Departament de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya ph: +34 3 4017016
Jordi Girona Salgado 1-3. Barcelona 08034. Spain Fax: +34 3 4017014
Manel Poch rqte3@onyar.udg.es
Laboratori d'Enginyeria Química i Ambiental
Universitat de Girona. Plaça Hospital 6. Girona 17071. Spain
or rules from a database of observations described by experts as an alternative to ease the
building of knowledge bases for expert systems or case bases for case-based reasoning tools.
It is also shown that a little knowledge can produce considerable gain, despite the ambiguity
or the partial incorrectness of that knowledge. This ambiguity can also be resolved using the
improved classifications, by performing specializations or generalizations that correct the
problems in the domain knowledge.
This paper is organized as follows: section 2 gives some basic notions about LINNEO+. §3
gives a detailed description of the concept of domain theory and its use in the classification
process, and §4 presents a classical example to illustrate the effects of the bias on the
quality of the results. §5 and §6 present a practical application of the classification
procedure and the use of domain knowledge in a real domain (wastewater treatment plants).
In §7 some conclusions are given.
2 LINNEO+
LINNEO+ is a knowledge acquisition tool oriented to ill-structured domains: domains with
a weak structure, imprecise information and a membership function of the observations to the
concepts. It uses an unsupervised learning strategy and incrementally accepts a stream of
observations, trying to discover a classification scheme from the data. As a control strategy it
retains only the best hypotheses that are consistent with the observations given a similarity
criterion. Part of LINNEO+ can be considered a conceptual clustering method with two
tasks critically important for its performance: clustering, which determines useful subsets of
a dataset; and characterization, which determines a concept for each extensionally defined
set discovered by clustering. The final characterization of the classes is built up by GAR [14]
as a rule set. This task needs an expert to accept (or reject) the resulting clusters. Other
modules try to better exploit observational knowledge from the dataset, or take advantage of
the expert's knowledge if available.
The expert has to define a set of observations that he considers sufficient to model the
domain, and also defines a set of attributes relevant to the intended classification goal. The
expert is allowed to represent attributes by means of two predefined types: quantitative and
qualitative. For each observation, a vector is defined whose length (n) is the number of
attributes. Classically, this vector holds the value of each property of the object, so the
usual representation of an object is:
than one rule to constrain a set (or class). Sometimes, when rules are too general, two or
more rules select the same object; in this case a special set is created and the rules pointing
to that object are attached. All the objects that do not satisfy any rule are grouped in a
residual set. After this process, LINNEO+ generates at most r + 2 sets of objects, where
r is the number of sets that the expert has constrained.
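The partition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LINNEO+ implementation: the function name `partition_by_rules`, the representation of a rule as a (class name, predicate) pair, and the four sample observations are all hypothetical. The sketch reads the text literally and sends any object selected by two or more rules to the special set. The three predicates mirror rules (6), (9) and (10) of table 1.

```python
from typing import Callable, Dict, List, Tuple

Observation = Dict[str, str]
# Assumption: a rule is a (class name, predicate) pair; several rules may share a class name.
Rule = Tuple[str, Callable[[Observation], bool]]


def partition_by_rules(observations: List[Observation], rules: List[Rule]):
    """Split observations into per-class sets, a special set and a residual set."""
    class_sets: Dict[str, List[Observation]] = {name: [] for name, _ in rules}
    special: List[Tuple[Observation, List[str]]] = []  # object plus the rules that selected it
    residual: List[Observation] = []
    for obs in observations:
        fired = [name for name, predicate in rules if predicate(obs)]
        if len(fired) == 1:
            class_sets[fired[0]].append(obs)
        elif len(fired) > 1:
            special.append((obs, fired))  # selected by two or more rules
        else:
            residual.append(obs)  # satisfies no rule
    return class_sets, special, residual


rules: List[Rule] = [
    ("cyst-nematode", lambda o: o.get("fruit-pods") == "few-present"),
    ("downy-mildew", lambda o: o.get("leaf-mild") == "lower-surf"),
    ("powdery-mildew", lambda o: o.get("leaf-mild") == "upper-surf"),
]

# Made-up observations, used only to exercise every branch.
a = {"leaf-mild": "lower-surf", "fruit-pods": "norm"}
b = {"leaf-mild": "upper-surf", "fruit-pods": "norm"}
c = {"leaf-mild": "lower-surf", "fruit-pods": "few-present"}
d = {"leaf-mild": "absent", "fruit-pods": "norm"}

class_sets, special, residual = partition_by_rules([a, b, c, d], rules)
```

With r = 3 constrained classes, the sketch produces at most r + 2 = 5 sets: three class sets, the special set (here holding `c`) and the residual set (here holding `d`).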
In each one of these sets, except for the special and the residual ones, LINNEO+ starts a
classification process and eventually creates at least one class for each. Then a new process
begins with the centers of these classes as seeds of the new classification, together with the
rest of the objects. In this process new classes can be formed, corresponding to classes not
described by the rules.
The bias is obtained by reordering and previously grouping the observations in a
meaningful scheme, rather than using the random order of the unbiased process. This yields a
more meaningful set of classes, closer to the idea that the expert has of his domain structure.
It also avoids the instability induced by the ordering of the observations.
This technique allows the expert to make explicit and manage his knowledge about a
given domain easily and in an incremental fashion. This knowledge could be used when a DT
(a semantic bias) is available to refine the expert's knowledge, or to model several degrees of
experience while trying to generate a classification schema. The use of a DT improves the
results and helps the expert to explore the structure of his domain. A similar approach is taken
in CONFORT [20].
4 Experiments with a Domain Theory
In order to test the effect of a domain theory on the classification process, we have written
a small set of rules for the Soya bean domain (see table 1) to bias the resulting classes. These
rules have been built by inspecting the prototypes of the classes of an unbiased classification
and extracting the most distinctive attributes. This set of rules is neither complete nor
consistent, because we just want to show that only a small piece of domain knowledge is enough
to improve the stability, and therefore the quality, of a classification. These rules select 130
observations from a total of 307.
The experiment was carried out by comparing two sets of 20 randomly ordered classifications
of the Soya bean dataset [10], produced with LINNEO+. The dataset was obtained from the UCI
Repository of Machine Learning Databases and Domain Theories [12]. The first set was obtained
without a domain theory, the second using our set of rules as domain theory.
In order to compare the resulting classifications we have developed an algorithm that
provides a measure of the differences between two classifications [1, 4]. This measure, which
we call structural coincidence, is used to provide a value for the stability of each set of
classifications as the mean of the difference of each pair of classifications in the set. Among
these differences, the coincidence of objects in the same group and the number of classes of
each classification are taken into account (see appendix A).

((= (diseased) fruit-pods)
 (= (colored) fruit-spots)
 (= (norm) seed)
 -> frog-eye-leaf-spot) (1)

((= (lt-normal) plant-stand)
 (= (severe) severity)
 (= (brown dk-brown-blk) canker-lesion)
 -> phytophthora-and-rhizoctonia-root-rot) (2)

((= (norm) fruit-pods)
 (= (tan) canker-lesion)
 (= (lt-norm norm) precip)
 -> charcoal-and-brown-stem-rot) (3)

((= (dk-brown-blk) canker-lesion)
 (= (abnorm) seed)
 (= (gt-norm) precip)
 -> anthracnose) (4)

((= (abnorm) seed)
 (= (tan) canker-lesion)
 -> purple-seed-stain) (5)

((= (few-present) fruit-pods)
 -> cyst-nematode) (6)

((= (norm) leaves)
 (= (gt-norm) temp)
 -> diaporthe-pod-and-stem-blight) (7)

((= (above-sec-nde absent) stem-cankers)
 (= (brown) canker-lesion)
 (= (norm) fruit-pods)
 -> diaporthe-stem-canker-and-brown-spot) (8)

((= (lower-surf) leaf-mild)
 -> downy-mildew) (9)

((= (upper-surf) leaf-mild)
 -> powdery-mildew) (10)

((= (no) lodging)
 (= (w-s-marg no-w-s-marg) leafspots-marg)
 (= (90-100%) germination)
 -> brown-stem-rot-and-herbicide-injury) (11)

Table 1: A Soya Bean Domain Theory
Another measure of stability used in the comparison is based on the coincidence of
the pairs of associations of observations between two partitions, as described in [4]; this
measure decreases as similarity increases.
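To make the idea of comparing partitions through pairs of observations concrete, the following sketch uses the Rand index, a standard pair-counting agreement score. It is a stand-in chosen for illustration and is not the measure of [4] nor the structural coincidence of appendix A: in particular, it increases with similarity, whereas the measure of [4] decreases. The function names and the three toy partitions are hypothetical. The `stability` function follows the text in averaging the score over every pair of classifications in a set.

```python
from itertools import combinations
from typing import Hashable, Sequence


def pair_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Rand index: fraction of observation pairs treated alike by both partitions.

    a[i] and b[i] are the class labels given to observation i by each partition.
    """
    if len(a) != len(b):
        raise ValueError("partitions must label the same observations")
    if len(a) < 2:
        raise ValueError("at least two observations are needed")
    agree = 0
    total = 0
    for i, j in combinations(range(len(a)), 2):
        together_in_a = a[i] == a[j]
        together_in_b = b[i] == b[j]
        agree += together_in_a == together_in_b
        total += 1
    return agree / total


def stability(classifications: Sequence[Sequence[Hashable]]) -> float:
    """Mean pairwise agreement over a set of classifications of the same data."""
    pairs = list(combinations(classifications, 2))
    if not pairs:
        raise ValueError("at least two classifications are needed")
    return sum(pair_agreement(p, q) for p, q in pairs) / len(pairs)


# Toy label vectors for four observations (made up for illustration).
run_1 = [0, 0, 1, 1]
run_2 = [1, 1, 0, 0]  # same grouping as run_1 with the class names swapped
run_3 = [0, 1, 0, 1]  # a different grouping

same_grouping = pair_agreement(run_1, run_2)
different_grouping = pair_agreement(run_1, run_3)
set_stability = stability([run_1, run_2, run_3])
```

Because only co-membership of pairs is compared, the score ignores how classes are named, which is what is needed when comparing runs of an unsupervised method.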
The stability of a classification of the Soya Bean dataset without the DT is 77.6% for
the first measure and -1013.4 for the second. Using the DT, the stability improves to 91%
for the first measure and -4285.6 for the second (recall that the second measure decreases as
similarity increases). A cross comparison between the two sets of classifications yields a
value of 79.9% for the structural coincidence. This value has been calculated by comparing
each class resulting from each method with all the others and averaging. The interpretation of
this value is that the classifications using the domain theory are similar to those created
without using a bias, but much more stable.
Another result is that in the set of classifications without domain theory the number of
classes lies in the interval 15 to 21, whereas using the rules it lies in the interval 15 to 19,
both with a mean of 18 classes.
Applying this technique to other datasets yields similar results, as can be seen in table 2
[1].
In the light of these results, we can say that the use of domain knowledge in unsupervised
learning reduces the problem of obtaining meaningless groupings and also reduces the
instability induced by an improper input order.

[Diagram not recoverable from the extracted text; surviving labels: Primary, Recycling sludge,
Secondary settler, Purge, P4.]
Figure 1: Wastewater treatment process
5 Application to WWTP
LINNEO+ has been applied to classify operating situations of real urban wastewater treatment
plants that use a biological process known as the activated sludge process. Activated sludge
is undoubtedly the most widespread wastewater treatment. In this process, a mixture of
several microorganisms transforms the biodegradable pollutant (expressed in units of organic
matter as Biological Oxygen Demand (BOD) or Chemical Oxygen Demand (COD)) into new
biomass, with the addition of dissolved oxygen supplied by an aeration system. A primary
treatment is usually established before the input to the biological reactor.
In figure 1, a scheme of a prototype plant is presented. As shown, after the primary settler,
water is first treated in the bioreactor where, by the action of the microorganisms, the level of
substrate is reduced. Next, the water flows to a secondary settler, where the biomass sludge
settles. Thus clean water remains at the top of the settler and is carried out of the plant.
A fraction of the sludge is returned to the input of the bioreactor in order to maintain an
appropriate level of biomass, allowing the oxidation of the organic matter. The rest of the
sludge is purged.
Real-time control of the process is a quite complex problem due to the lack of reliable
instrumentation and the simplicity of the models used to describe the microbiological processes
that take place in the bioreactor. In this context, although some advanced control techniques,
like predictive control, have obtained promising results, they are not able to handle a number
of situations that require qualitative knowledge [11]. Consequently, the personal
expertise of the plant manager is necessary to attain an efficient management of the process.
In addition to this problem, and given the social importance of this kind of plant for
preserving the ecological equilibrium of water bodies, many variables related to the organic
matter and microorganisms are measured in the plants, producing a large amount of
information that is difficult to manage.
The plant studied is located in Manresa, a town of 100,000 inhabitants near Barcelona
(Catalonia). The plant treats a flow of 35,000 m3/day of mainly domestic wastewater, although
wastewater from industries located inside the town is received in the plant too.
Initially, the experts provided a set of 38 variables, 8 of which are quality indicators, that are
measured daily (see table 3) in several places of the plant: at the input (P1), denoted with
the suffix -E, which characterizes the hydraulic flow; after the pretreatment (P2), denoted with
the suffix -P; at the input of the biological reactor (P3), denoted with the suffix -D; and at
the water output of the plant (P4), denoted with the suffix -S. Of these variables, 9 are
percentages of performance, denoted with the prefix -Rd (see figure 1). In this study the
behavior of the plant over 527 days during the period 1990-1991 has been considered. The
dataset was originally recorded to be used with the K-means algorithm [17]; its adaptation for
use with LINNEO+ did not cause extra work for the expert and no data engineering was done.
The original classification (see table 4) was obtained after setting the radius to 5 and without
the DT. The results were 13 meaningful classes (at the taxonomic level of "working situation"
[18][19]). A situation is an operating working state of the plant, described by measures of the
relevant attributes of the process.
In a parallel study, we used the K-means algorithm implemented in the Systat package,
with the Euclidean distance, in order to classify the wastewater-treatment data and compare
the results. The experts chose 12 as the most suitable number of classes. The use of the
Euclidean distance and the normalization of the data using the standard deviation seem to
improve the results of the K-means algorithm. Most of the resulting classes were almost the
same. Thus, both methodologies make it possible to classify the data according to the nature
of the problem that arises each day.
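The normalization step mentioned above matters because the Euclidean distance is dominated by whichever variable has the largest numeric range. The sketch below shows one common form of normalization by the standard deviation (z-scores) followed by a Euclidean distance. It is an assumption-laden illustration: the text does not say exactly how Systat normalizes, the choice of the population standard deviation is ours, and the two-column sample values are invented.

```python
import math
from statistics import mean, pstdev
from typing import List


def standardize(rows: List[List[float]]) -> List[List[float]]:
    """Rescale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        # A constant column carries no information, so it is mapped to 0.0.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


def euclidean(u: List[float], v: List[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Invented daily measurements: a small-range variable and a large-range one.
days = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize(days)

raw_distance = euclidean(days[0], days[2])         # dominated by the second column
scaled_distance = euclidean(scaled[0], scaled[2])  # both columns contribute equally
```

After scaling, both columns of the toy data take identical values, so neither variable outweighs the other in the distance.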
There is only one significant difference: the K-means algorithm gives three big classes that
group all the correct functioning (or normal) situations, with very slight semantic differences
but with a very similar number of days (175, 181, 123). LINNEO+ also finds the same three
classes (Class-1, Class-5 and Class-11 of table 4) but with a larger difference between their
numbers of days (275, 116, 53); the expert also agreed to consider all these classes as
representing the correct functioning situation, again with slight differences. LINNEO+
also identified a fourth class of normal situations, Class-9, which represents a set of 69
days where operational conditions were just outside the limits but the overall behavior of the
system was considered normal both by the system and by the expert. This separation of normal
situations into four classes was one of the major differences from the expert's first
classification. In his opinion there was only one class, including Class-1, Class-5, Class-9
and Class-11.
The classes that correspond to abnormal situations (i.e. storm, bulking, etc.) are almost
identical in both classifications. The normal situations, as explained, were clustered
differently, but all the classes show few differences. See [17] for more details.
As an example of the classification obtained, the center of Class-11 is shown in table 5.
According to the values of the prototype, this class has been identified as a normal situation
of the plant, with normal influent values and with a performance slightly over the mean
situation, obtaining a normal effluent. This interpretation was confirmed when it was compared
with the daily log of the plant.
The data used in this example are available in the UCI Repository of Machine Learning
Databases and Domain Theories, and can be obtained by anonymous ftp from ics.uci.edu
((rel (< (/ BOD-E COD-E) .65)) (rel (> (/ BOD-E COD-E) .35))
(rel (< (/ COD-S COD-E) .3)) (rel (> (/ COD-S COD-E) .27))
-> normal)
((rel (< (/ BOD-E COD-E) .65)) (rel (> (/ BOD-E COD-E) .35))
(rel (< (/ COD-S COD-E) .25)) (rel (> (/ COD-S COD-E) .22))
-> over)
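The two rules above can be read as interval tests on ratios of measured variables. The sketch below evaluates them on a single day of data. It rests on assumptions: `(rel (< (/ A B) c))` is interpreted as the relation A/B < c, the function name `classify_day` is hypothetical, and the sample measurements are invented. Days matching neither rule are returned as `None`, which corresponds to the residual set of the rule-based partition.

```python
from typing import Dict, Optional


def classify_day(day: Dict[str, float]) -> Optional[str]:
    """Apply the two domain-theory rules to one day of plant measurements."""
    if day["COD-E"] == 0:
        return None  # ratios undefined; treat as not covered by any rule
    bod_cod_ratio = day["BOD-E"] / day["COD-E"]  # input BOD relative to input COD
    cod_out_ratio = day["COD-S"] / day["COD-E"]  # output COD relative to input COD
    # Both rules share the same condition on the input BOD/COD ratio.
    if not (0.35 < bod_cod_ratio < 0.65):
        return None
    if 0.27 < cod_out_ratio < 0.30:
        return "normal"
    if 0.22 < cod_out_ratio < 0.25:
        return "over"
    return None  # no rule fires


# Invented measurements, chosen so that each outcome appears once.
normal_day = {"BOD-E": 200.0, "COD-E": 400.0, "COD-S": 112.0}  # ratios 0.50 and 0.28
over_day = {"BOD-E": 200.0, "COD-E": 400.0, "COD-S": 92.0}     # ratios 0.50 and 0.23
uncovered = {"BOD-E": 200.0, "COD-E": 400.0, "COD-S": 104.0}   # ratios 0.50 and 0.26

labels = [classify_day(d) for d in (normal_day, over_day, uncovered)]
```

The gap between the two intervals (0.25 to 0.27) illustrates that such a rule set need not cover every observation.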