² The optimistic concept can change upon system status.
2. Case study: clustering

This case study is based on a Spanish text taken from a Web page. The topic it dealt with was orchids (url: orquidea.blogia.com temas-que-es-una-orquidea-.php). The text is transformed into a set of words and symbols. After this, each one is converted into an HBE [9] and structured as an oriented graph (Ics), built upon morphosyntactic considerations [11]. Each HBE has associated values that describe the morphosyntactic characteristics of the word and of the sentence it belongs to. In a similar way, WIB detects words and sentences and marks them as relevant (they are called pointers). It also defines their relative weight in the meaning of the sentence (positive or negative). This is useful to determine whether a word is connected to the main topic [8]. The data fields are:

ID: word extracted from the original text;
poID: po weight for the ID word;
poECI: po derived from all the poIDs in the same Ics;
indicadora: whether it is a pointer word, i.e. whether the HBE belongs to a sentence that refers to the main topic.

3.1. Descriptive statistics

… from po trees are helpful to determine the fields that may be useful for a certain classification, because they place the best fields for that task near the root.

Table 1
LOG LIKELIHOOD VS. FIELDS

According to the J48⁴ tree trained with the EM clusters (Fig. 1), poECI is a good discriminator when the length value exceeds seven words. It was also shown in [10] that poECI is a reliable discriminator in the case of words that belong to the main topic. This is related to the high position of this field in the tree. By contrast, the main value of poID resides in providing a basis for the fuzzy clustering.
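The claim that induction trees place the best fields for a task near the root can be illustrated with the gain-ratio criterion used by C4.5 (the algorithm J48 extends). The sketch below uses made-up toy records; the field names are merely borrowed from this case study, and the values are not the paper's data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(records, field, target):
    """C4.5-style gain ratio of `field` with respect to `target`."""
    n = len(records)
    base = entropy([r[target] for r in records])
    # partition the records by the value of `field`
    parts = {}
    for r in records:
        parts.setdefault(r[field], []).append(r[target])
    cond = sum(len(p) / n * entropy(p) for p in parts.values())
    split_info = entropy([r[field] for r in records])
    gain = base - cond
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy records (not the case-study data).
records = [
    {"indicadora": True,  "long_ECI": True,  "topic": "main"},
    {"indicadora": True,  "long_ECI": False, "topic": "main"},
    {"indicadora": True,  "long_ECI": True,  "topic": "main"},
    {"indicadora": False, "long_ECI": True,  "topic": "other"},
    {"indicadora": False, "long_ECI": False, "topic": "other"},
    {"indicadora": False, "long_ECI": True,  "topic": "other"},
]

# indicadora separates the classes perfectly, so it scores higher
# and would be chosen as the root test.
print(gain_ratio(records, "indicadora", "topic"))  # → 1.0
print(gain_ratio(records, "long_ECI", "topic"))    # ~0.0 (uninformative)
```

A C4.5-style learner computes this score for every candidate field at every node and tests the highest-scoring one first, which is why the most useful fields end up near the root.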
³ Algorithm used to find maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables. EM alternates between the performance of an expectation (E) step (which computes an expectation of the likelihood by including the latent variables as if they were observed) and that of a maximization (M) step (which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found on the E step). The parameters found on the M step are then used to begin another E step, and the process is repeated.
⁴ Extension of the C4.5 induction tree algorithm [15].
⁵ In EM, a log likelihood is the ratio of the maximum probability of a result under two different hypotheses. In this algorithm it indicates the degree of optimization of the clusters.
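The alternating E and M steps described in the footnote above can be sketched for a toy one-dimensional mixture of two Gaussians. This is a minimal illustration with synthetic numbers, not the clustering actually run on the HBE fields:

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture.
    E step: posterior responsibility of each component for each point.
    M step: re-estimate weights, means and variances from those responsibilities."""
    # crude initialisation from the data range
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E step: soft assignment of every point to the two components
        resp = []
        for x in data:
            p = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M step: maximum likelihood parameters given the soft assignments
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return w, mu, var

random.seed(0)
# two well-separated synthetic groups
data = ([random.gauss(0.0, 0.5) for _ in range(200)]
        + [random.gauss(5.0, 0.5) for _ in range(200)])
w, mu, var = em_gmm(data)
print(sorted(round(m, 1) for m in mu))  # means close to 0.0 and 5.0
```

Each full E/M cycle cannot decrease the data log likelihood, which is why that quantity is used to track how well optimized the clusters are.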
The clusters have a special distribution (Table 3). Most members of the clusters are negative or positive modifiers of the sentence. There are also some subsets; in one subset all members are indicadora, whereas in another all the poID values are less than 0.

Table 3
CLUSTERS USING EM

Table 4
CLUSTER CHARACTERISTICS

The number of clusters is the same when they are calculated using Mean Squared Error⁶ (Fig. 2).

The clustering was repeated using the same fields, with the assistance of a Multi Layer Perceptron⁷ (MLP). The results achieved were similar to the ones mentioned above. The resultant classification was used to train the J48 tree shown in Fig. 3. The field indicadora became the most important, followed by poID.

There are three random sentences (Ics) taken from the text. Each dot represents a word (i.e. an HBE). The paragraphs (an Ics sequence) also tend to be distributed according to a base class (Fig. 5).
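The Mean Squared Error check mentioned above (Fig. 2) amounts to computing, for each candidate number of clusters, the average squared distance of the points to their nearest centroid, and noting where the error curve stops dropping. A minimal sketch with a basic 1-D k-means on synthetic toy values (not the case-study measurements):

```python
import random

def kmeans_mse(data, k, iters=30):
    """Basic 1-D k-means; returns the mean squared error
    (average squared distance from each point to its nearest centroid)."""
    s = sorted(data)
    # deterministic initialisation: one centre per quantile slice
    centers = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            groups[nearest].append(x)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sum(min((x - c) ** 2 for c in centers) for x in data) / len(data)

random.seed(2)
# three separated synthetic groups, so the error curve should flatten at k = 3
data = ([random.gauss(0, 0.3) for _ in range(50)]
        + [random.gauss(4, 0.3) for _ in range(50)]
        + [random.gauss(8, 0.3) for _ in range(50)])
for k in (1, 2, 3, 4):
    print(k, round(kmeans_mse(data, k), 3))
```

The error drops sharply up to the true number of groups and only marginally afterwards, so the elbow of this curve gives the same cluster count that EM selects when the two criteria agree.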