
Automatic Text Processing for Spanish Texts

M. Daniela López De Luise, Mariana Soffer


AIGroup. Universidad de Palermo. Buenos Aires. Argentina.
lopezdeluise@yahoo.com.ar

Abstract

This work focuses on some aspects of automatic text processing by using a metric named po, defined in a working prototype called WIB (Web Intelligent Browser). The word weighting generated by this metric is defined with morphosyntactic considerations and allows the categorization of text words in fuzzy clusters. This weighting could also be used as a model of the original text at a sentence and paragraph level. A case study is presented here to help analyze the interpretation of the categories.

1. Introduction

Language expressions are heterogeneous and depend on the culture, topic, geography, etc. [1]. All of them also influence any expression taken from the Web. In order to manipulate those expressions, it would be helpful to have tools to detect the author profile, to extract summaries and to categorize word representativeness automatically.

There are several proposals to extract summaries automatically [2] [12] [13]. Some models use a predefined document database D = {d1, d2, ..., dm} and manually set labels with a predefined representativeness to each di. The queries are solved by using fuzzy logic on such labels instead of on the original documents [5].

Fuzzy manipulation has proved to be as competitive and efficient as its classic algorithmic counterpart [12]. At first, only the searched keywords were fuzzified [13]. Then fuzzy logic evolved significantly, since it improved retrieval efficiency [6]. Nowadays, there are many proposals, such as semi-fuzzy operators [12] and extensions in Information Retrieval (IR) used to enhance Case-Based Reasoners (CBR) [3].

The WIB prototype performs a process at word and at sentence level, but not at document level, unlike previous models. For that purpose, it defines a weighting named po [7] that automatically categorizes each Spanish word in the actual text [8]. This weighting can also be used to characterize types of texts. Besides, it assists the system in its internal text processing and in the building of certain structures that represent the text and are part of the system's dynamic database.

The rest of this section will explain the main characteristics of this metric. Section two will define a case study using po analyses of hard and fuzzy clusters. Finally, in section three we will present our conclusions and future work.

1.1. The generation of po

The original Web text is downloaded by WIB and converted into a sequence of words and symbols. There are morphosyntactic descriptor fields named HBE (Homogenized Basic Element) associated with each word. All the HBEs that belong to a single sentence are organized into an Ics (Internal Composition Structure). Every Ics of a same text is structured into an Ecs (External Composition Structure). The function that promotes an Ics to an Ecs defines the po value according to the following principles:
-The po value must reflect simple morphosyntactic characteristics (e.g. word length, number of strong vowels, etc.).
-The domain value must be [-2.0; +2.0].
-Most values must be 0.0.
-The unsigned value is near 0.0 when the associated word does not have much influence on the meaning of the sentence. (To have much influence does not imply to have an important meaning. For instance, a negative like "no" changes the meaning to the opposite.)
-The unsigned value is far from 0.0 when the associated word has a significant influence on the meaning of the sentence.
-The poIcs results from a combination of po with the Ics. This combination will vary depending on the system status. According to [7], the usual formula is:

po = p(w0)/2^n + Σ_{i=1..n} p(wi)/2^(n+1-i)    (1)
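As a reading aid, formula (1) can be sketched in code. The per-word heuristic p(w) is left abstract here (the weights are simply given as a list), and the assumption that they arrive ordered as p(w0), p(w1), ..., p(wn) is ours, not the paper's:

```python
def po_sentence(p):
    """po for one sentence, following formula (1).

    p is assumed to be the list [p(w0), p(w1), ..., p(wn)] of per-word
    heuristic values; the heuristic p(w) itself is taken as given.
    """
    n = len(p) - 1
    # p(w0)/2^n plus the sum over i = 1..n of p(wi)/2^(n+1-i)
    return p[0] / 2 ** n + sum(p[i] / 2 ** (n + 1 - i) for i in range(1, n + 1))
```

Since the coefficients 1/2^n and 1/2^(n+1-i) sum exactly to 1, po stays inside the [-2.0; +2.0] domain whenever every p(w) does.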
where the number of words in the sentence is n, w is the HBE of a certain word, and p(w) refers to the weighting heuristic when it is applied to some w.

The paragraph weighting is reflected in the poEcs. It results from the most optimistic poIcs value. (The optimistic concept can change upon system status.)

1.2. The peculiarities of po

The po has been designed to support the automatic fuzzy clustering of words, but it has a number of characteristics that make its usage in WIB different from other fuzzy approaches:
-There is no predefined label set: it derives from the weight value of each HBE.
-The HBE weight influences the whole Ics. It is also combined with the rest of the HBEs from the same sentence.
-The weight is derived from the locus and structure of each word (morphosyntax).
-The weight has a [-2; +2] domain instead of a [0; 1] domain.

Although the detailed fuzzy model and its implementation exceed the scope of this paper, some basic outlines for a WIB will be provided here.

1.3. The statistical behaviour of po

This section summarizes the main statistical behaviour of this weighting procedure, as presented in [8]. There are two main criteria for text classification:
1. Type of text: Any written text should follow a formal structure. The types proposed are literary, technical and messages.
2. Writer profile: According to the topic, the writer adapts the text to suit his needs. The proposed profiles are forum, Web index, document and blog.

According to [8], po is a new metric, invariant with respect to document size and type of text. It could be used to discriminate the writer's profile and to assess the relevance of a sentence within a text.

1.4. The Fuzzy Model in WIB

According to [8], there are profiles and types of sentences with different degrees of quality and representativeness. A search (or query) could use fuzzy logic on the po values to determine whether a sentence deserves to be better scored within a list of possible answers.

It should be noted that in this work the fuzzy sets are used on words instead of on documents. Accordingly, the main processing unit is not the document but the sentence. This approach with fuzzy sets is somewhat different from previous ones. In the following section we will consider two main tasks.

1.4.1. Document database fuzzy manipulation. To perform this task, the prototype must follow these steps:
-Save documents.
-Derive the Ecs and Ics.
-Find out po values for words and sentences (i.e. for Ecs and Ics).

There are two problems to be solved here [13]:
1) A mathematical description of a proper quantifier.
2) The numeric manipulation with fuzzy operators upon a concrete query.

In this paper, po will be evaluated to determine whether it is good enough to define fuzzy clusters. This approach should be completed with the implementation of the proper po values in order to filter the results of any query or browsing activity. Regarding the second problem, it will be solved by means of a set of proposed quality metrics [7].

1.4.2. Query/Browse fuzzy manipulation. During a query, fuzzy logic makes it possible to manipulate the ambiguity that exists in any natural language. This specific topic exceeds the scope of this paper, but some basic outlines will be provided for the sake of completeness.

In fuzzy approaches for queries, the user must provide a numeric relevance, but most users are not able to define this appropriately [5]. Any weighted words in the query have to be processed with a special operator, LOWA (Linguistic Ordered Weighted Averaging), which can be successfully extended to other areas like decision making [4], where the problem becomes more complex since there are many sources of relevant quantification.

Consequently, the following steps should be performed by a WIB:
-Define a short HBE sequence (typically between seven and nine elements) that represents the document. In other proposals this task uses predefined labels, e.g. {null, very-low, low, medium, high, very-high}.
-Define a membership function for each label. This corresponds to the membership functions derived from the fuzzy clustering.
-Define fuzzy operators: negation, comparison, aggregation, etc.
-Reduce query terms to a small subset.
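The label and membership-function steps above can be illustrated with triangular membership functions over the po domain. The label names and breakpoints below are illustrative assumptions only; the prototype derives its actual membership functions from the fuzzy clustering:

```python
def triangular(x, a, b, c):
    """Triangular membership function with feet a, c and peak b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical labels spread over the po domain [-2.0, +2.0];
# "negative"/"positive" mirror the sign convention of the weighting.
LABELS = {
    "negative": (-2.5, -1.5, -0.5),
    "neutral":  (-1.0,  0.0,  1.0),
    "positive": ( 0.5,  1.5,  2.5),
}

def memberships(po_value):
    """Degree of membership of one po value in each linguistic label."""
    return {name: triangular(po_value, a, b, c)
            for name, (a, b, c) in LABELS.items()}
```

A word with po = 0.75, for example, belongs partly to "neutral" and partly to "positive", which is exactly the kind of graded assignment the query step needs.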
2. Case study: clustering from po

This case study is based on a Spanish text taken from a Web page. The topic it dealt with was orchids (url: orquidea.blogia.com temas-que-es-una-orquidea-.php). The text is transformed into a set of words and symbols. After this, each one is converted into an HBE [9] and structured as an oriented graph (Ics), built upon morphosyntactic considerations [11]. Each HBE has associated values that describe the morphosyntactic characteristics of the word and the sentence it belongs to. In a similar way, WIB detects words and sentences and considers them relevant (they are called pointers). It also defines the relative weight in the meaning of the sentence (positive or negative). This is useful to determine if any word is connected to the main topic [8]. The data fields are:
ID: word extracted from the original text;
poID: po weight for the ID word;
poECI: po derived from all the poIDs in the same Ics;
indicadora: whether it is a pointer word, i.e. whether the HBE belongs to a sentence that refers to the main topic.

2.1. Descriptive statistics

From the sixty sentences (i.e. Ics) of the original html, the following fields were calculated: long (number of words in the Ics), poECI (Ics weight), and cantIndic or indicadora (number of pointer words in the Ics). When a clustering with Expectation Maximization (EM)(3) is done, the best result is obtained when indicadora is not considered, and the worst when poECI is not considered (Table 1). The plot and the length of the Ics show that:
-The field long has a low value in cluster 4.
-For clusters 1 and 3 the same field has a higher value than for cluster 4. Cluster 2 has the highest value of all.

When we analyse the Ics in relation to the poECI we find that:
-Clusters 3 and 4 have similar poECI values.
-Cluster 1 has a big dispersion, with values lower than the mean.
-Cluster 2 presents values above the mean.

Consequently, it could be inferred that the number of indicadora may not be important, but the Ics length and poECI values could serve to classify words. Induction trees are helpful to determine the fields that may be useful for a certain classification, because they place the best fields for that task near the root.

Table 1
LOG LIKELIHOOD VS. FIELDS

According to the J48(4) tree −trained with the EM clusters− (Fig. 1), poECI is a good discriminator when the length value exceeds seven words. It was also shown in [10] that the poECI is a reliable discriminator in the case of words that belong to the main topic. This is related to the highest position of this field in the tree. By contrast, the main value of poID resides in providing a basis for the fuzzy clustering.

Fig. 1. J48 WITH EM CLUSTERS

2.2. Hard clustering

When a clustering with EM is done −reaching a maximum of 100 iterations− the algorithm finds five clusters. The best log likelihood(5) is obtained when all variables are considered (Table 2).

Table 2
LOG LIKELIHOOD IN EM

(3) Algorithm used to find maximum likelihood estimates of parameters in probabilistic models where the model depends on unobserved latent variables. EM alternates between an expectation (E) step, which computes an expectation of the likelihood by including the latent variables as if they were observed, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found in the E step. The parameters found in the M step are then used to begin another E step, and the process is repeated.
(4) Extension of the C4.5 induction tree algorithm [15].
(5) In EM, a log likelihood is the ratio of the maximum probability of a result under two different hypotheses. In this algorithm it indicates the degree of optimization of the clusters.
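The E-step/M-step alternation just described can be sketched for a one-dimensional Gaussian mixture. This is a deliberate simplification (fixed, equal variances; synthetic data), not the actual clustering run over the Ics fields:

```python
import math

def em_gaussian_mixture_1d(xs, mus, sigma=1.0, iters=50):
    """Toy EM for a 1-D mixture of k Gaussians with fixed, equal variance.

    E step: posterior responsibility of each component for each point;
    M step: re-estimate means and mixing weights from those responsibilities.
    """
    k = len(mus)
    mus = list(mus)
    pis = [1.0 / k] * k
    for _ in range(iters):
        # E step: resp[i][j] is proportional to pi_j * N(x_i | mu_j, sigma)
        resp = []
        for x in xs:
            dens = [pis[j] * math.exp(-(x - mus[j]) ** 2 / (2 * sigma ** 2))
                    for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M step: maximize the expected log likelihood
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            pis[j] = nj / len(xs)
    return mus, pis
```

On two well-separated groups of points, the means converge to the group centres and the mixing weights to the group proportions, regardless of a rough initialization.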
The clusters have a special distribution (Table 3). Most members of the clusters are negative or positive modifiers of the sentence. There are also some subsets; in one subset all members are indicadora, whereas in another all the poID are less than 0.

Table 3
CLUSTERS USING EM

The number of clusters is the same when they are calculated using the Mean Squared Error(6) (Fig. 2).

Fig. 2. J48 WITH ICS DESCRIPTORS

The clustering was repeated using the same fields, with the assistance of a Multi Layer Perceptron(7) (MLP). Results achieved were similar to the ones mentioned above. The resultant classification was used to train the J48 tree shown in Fig. 3. The field indicadora became the most important, followed by poID.

Fig. 3. J48 TRAINED WITH MLP CLUSTERS

After examining the clusters, we concluded that cluster 1 was the biggest in size (it had most of the text words). Cluster 2 had few words, but with negative meaning. The rest of the characteristics are displayed in Table 4.

Table 4
CLUSTER CHARACTERISTICS

If we analyse the class (i.e. cluster number) for words and sentences, we can see that there is a base class with few changes (Fig. 4). There are three random sentences (Ics) taken from the text. Each dot represents a word (i.e. an HBE). The paragraphs (an Ics sequence) also tend to be distributed according to a base class (Fig. 5).

Fig. 4. CLUSTER ASSIGN WITHIN THREE ICSS

Fig. 5. CLUSTER ASSIGN IN THREE PARAGRAPHS
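The way J48 (a C4.5 extension) places the "best" field near the root comes down to ranking candidate splits by information gain. A minimal sketch, with hypothetical Ics lengths and a threshold of seven echoing the split reported for the field long:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Gain of the binary split `value <= threshold`, as C4.5/J48 ranks it."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    gain = entropy(labels)
    for part in (left, right):
        if part:
            gain -= (len(part) / n) * entropy(part)
    return gain
```

The field and threshold with the highest gain are chosen for the root; the same criterion is applied recursively further down the tree.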
(6) Statistics used to assess the distance between an estimator and real values.
(7) An artificial neural network (ANN), often called a "neural network" (NN), is a mathematical or computational model based on biological neural networks. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation.

2.3. Fuzzy clustering

In this section, the number of clusters obtained in the previous section will be used to get fuzzy sets. This will be done because:
-The best statistic results were achieved with five clusters.
-A contrastive study of cluster meaning, using po and poID
incidence, will be carried out.
-The fuzzy set relation −with special characteristics within
the documents− will be studied.
For the sake of simplicity, and as a first step in the exploratory study of fuzzy set applicability, the membership percentage is taken as the biggest value for each instance. In this way, the cluster is assigned, ignoring any fuzzy situation between two or more clusters. The c-means approach was selected. This approach minimizes the Minkowski distance [16], based on a parameter to change the cluster shape. The fuzzy sets that resulted from the data obtained are shown in Table 5. In the case of EM and MLP, the data distribution differs from the distribution in the fuzzy clusters. Words in hard cluster 1 spread into three fuzzy clusters. Cluster 3 feeds two fuzzy sets, and the rest of the clusters concentrate on fuzzy cluster 5. This could suggest a different clustering criterion.

Table 5
HARD AND FUZZY CLUSTER DISTRIBUTION

Fig. 6 and Fig. 7 show the fuzzy class (i.e. fuzzy cluster number) for three sentences and paragraphs.

Fig. 6. FUZZY CLUSTER ASSIGN WITHIN 3 ICSS

Fig. 7. FUZZY CLUSTER ASSIGN IN 3 PARAGRAPHS

In general, the Ics show less cluster variation with fuzzy clusters than with hard clusters. Fig. 8 sketches a plot of the data clusters. This data was reviewed in order to understand the new distribution. The characteristics of the clusters are summarized in Table 6.

Fig. 8. FUZZY CLUSTERS

From the analysis of Table 4 and Table 6, the following can be concluded:
-There is a better discrimination of words. This contributes to their location in the sentence. There are words with strong meaning (cluster 1), words that contribute to the meaning partially (cluster 2), and words with almost no contribution at all (cluster 3). In hard clustering, cluster 3 represents all three cases (Table 6).
-Words that change the sentence meaning are clustered together in hard clustering (see cluster 5 in Table 6 and cluster 4 in Table 4).
-Words related to the main topic are categorized according to their degree of representativeness of the topic (clusters 1 and 2).
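The fuzzy assignment used above (c-means memberships, then keeping the largest membership per instance) can be sketched in one dimension. The Euclidean distance used here is one member of the Minkowski family mentioned earlier; real runs would use the Ics feature vectors instead of these toy points:

```python
def fuzzy_cmeans_1d(points, centers, m=2.0, iters=50):
    """Toy 1-D fuzzy c-means.

    Returns final centers, the membership matrix u (rows sum to 1) and
    the hard labels obtained by keeping the largest membership per instance.
    """
    c = len(centers)
    centers = list(centers)
    for _ in range(iters):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        u = []
        for x in points:
            d = [abs(x - ctr) or 1e-12 for ctr in centers]  # avoid /0
            u.append([1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0))
                                for k in range(c))
                      for j in range(c)])
        # Center update: mean of the points weighted by u^m
        centers = [
            sum(u[i][j] ** m * x for i, x in enumerate(points)) /
            sum(u[i][j] ** m for i in range(len(points)))
            for j in range(c)
        ]
    # Defuzzify: keep the largest membership per instance
    labels = [max(range(c), key=lambda j: row[j]) for row in u]
    return centers, u, labels
```

The fuzziness exponent m plays the role of the shape parameter: values near 1 approach hard clustering, while larger values blur the cluster boundaries.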
Table 6
FUZZY CLUSTER CHARACTERISTICS

-Words that are used to complete the sentences in order to conform to the rules of language grammar are in a separate cluster (fuzzy cluster 3).

Therefore, it seems reasonable to state that a certain meaning can be found in the fuzzy sets. These fuzzy sets can be processed with fuzzy logic.

3. Conclusions and future work

It was shown that clusters could be generated by using po. In our case study, there were five clusters. The clusters represented consistent word agglomerations. The fields indicadora and poID were useful at a word processing level because they provided a better discrimination in fuzzy sets. At paragraph level, poECI and long were better discriminators.

This approach must be completed with the use of a po weighting to filter the results of any query or browsing activity. The fuzzy operators to be used for this purpose must also be defined.

4. References

[1] Bargalló M., Forgas E., Garriga C., Rubio A. "Las lenguas de especialidad y su didáctica". J. Schnitzer Eds. Universitat Rovira i Virgili. Tarragona, cap. 1 (P. Schifko, Wirtschaftsuniversität Wien), pp. 21-29. 2001.
[2] Delgado M., Sánchez D., Serrano J.M., Vila M.A. "A survey of methods to evaluate quantified sentences". Mathware and Soft Computing. vol. 7 (2-3), pp. 149-158. 2000.
[3] Jackzynski M., Trousse B. "Fuzzy logic for the retrieval step of a Case-Based Reasoner". Proc. of the Second European Conference on Case-Based Reasoning, pp. 313-322. 1994.
[4] Herrera F., Herrera Viedma E., Verdegay J. L. "Aggregating linguistic preferences: properties of the LOWA operator". Proc. of VI IFSA World Congress, Sao Paulo, Brazil. Vol. II, pp. 153-157. 1995.
[5] Herrera Viedma E., López Herrera A. G., Luque M., Porcel C. "A Fuzzy Linguistic IRS Model Based on a 2-Tuple Fuzzy Linguistic Approach". V Congreso ISKO, pp. 148-157. España. 2001.
[6] Herrera Viedma E., Pasi G. "Approaches to access information on the Web: recent developments and research trends". Proc. International Conference on Fuzzy Logic and Technology (EUSFLAT 2003), pp. 25-31, Zittau, Germany. 2003.
[7] López De Luise M. D., Agüero M. J. "Aplicabilidad de métricas categóricas en sistemas difusos". IEEE Latin America Magazine. Vol. 5, Issue 1. 2007.
[8] López De Luise M. D. "A Metric for Automatical Word Categorization". In Advances in Systems, Computing Sciences and Software Engineering. Proc. of SCSS 2007. Springer. T. Sobh & K. Elleithy (Eds.). Accepted for publication. 2007.
[9] López De Luise M. D. "A Morphosyntactical Complementary Structure for Searching and Browsing". In Advances in Systems, Computing Sciences and Software Engineering. Proc. of SCSS 2007. Springer. T. Sobh & K. Elleithy (Eds.), pp. 283-290. 2005.
[10] López De Luise M. D. "Induction Trees for Automatic Word Classification". Anales XIII Congreso Argentino de Ciencias de la Computación (CACIC07). Corrientes, Argentina, p. 1702. 2007.
[11] López De Luise M. D. "Una representación alternativa para textos". Ciencia y Tecnología. Colección C&T. ISSN 1850-0870. 2007-4. Buenos Aires, Argentina, pp. 119-130. 2007.
[12] Losada D. E., Barro Ameneiro S., Bugarín Diz A. J., Díaz Hermida F. "Experiments on using fuzzy quantified sentences in adhoc retrieval". ACM Symposium on Applied Computing, pp. 1059-1066. 2004.
[13] Losada D. E., Díaz Hermida F., Bugarín A. "Semi-fuzzy quantifiers for information retrieval". International Journal of Approximate Reasoning. vol. 34, pp. 49-88. 2003.
[14] Morillas Raya A. "Introducción al análisis de datos difusos". Depto. de Estadística y Econometría, Univ. de Málaga, España. www.eumed.net/libros/2006b/amr/. ISBN: 84-689-9208-2. 2006.
[15] Witten I. H., Frank E. "Data Mining - Practical Machine Learning Tools and Techniques". 2nd ed. Morgan Kaufmann Publishers. 2005.
[16] http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/, Clustering introduction.