Abstract
Ontology learning refers to extracting conceptual knowledge from several sources and building an ontology from scratch, enriching,
or adapting an existing ontology. It uses methods from a diverse spectrum of fields such as natural language processing, artificial intelligence and machine learning. However, a crucial challenge is to quantitatively evaluate the usefulness and accuracy of both techniques and combinations of techniques when applied to ontology learning. The problem remains open because no comparative studies have been published.
We are developing a flexible framework for ontology learning from text which provides a cyclical process that involves the successive
application of various NLP techniques and learning algorithms for concept extraction and ontology modelling. To deal with the above
challenge, the framework provides support to evaluate the usefulness and accuracy of different techniques and possible combinations of
techniques into specific processes. We show our framework's efficacy as a workbench for testing and evaluating concept identification. Our
initial experiment supports our assumption about the usefulness of our approach.
Crown copyright 2007 Published by Elsevier B.V. All rights reserved.
Keywords: Semantic Web; Ontologies; Ontology learning; NLP methods; Machine learning methods
1. Introduction
The Semantic Web is an evolving extension of the
World-Wide Web, in which content is encoded in a formal
and explicit way, and can be read and used by software
agents [2]. It depends heavily on the proliferation of ontologies. An ontology constitutes a formal conceptualization
of a particular domain shared by a group of people. In
complex domains, identifying, defining, and conceptualizing a
domain manually can be a costly and error-prone task.
This problem can be eased by semi-automatically generating an ontology.
Most domain knowledge about domain entities and
their properties and relationships is embodied in text collections with varying degrees of explicitness and precision. Ontology learning from text has therefore been
among the most important strategies for building an ontol*
0950-7051/$ - see front matter Crown copyright 2007 Published by Elsevier B.V. All rights reserved.
doi:10.1016/j.knosys.2007.11.009
ontology learning is that most frameworks use a predefined combination of techniques. Thus, they do not
include any mechanism for carrying out experiments with
combinations or the ability to include new ones. Reinberger et al. [22] point out that: "To our knowledge no comparative study has been published yet on the efficiency
and effectiveness of the various techniques applied to ontology learning."
Our motivation is to help make the ontology learning
process controllable. To that end, it is important to
know the contribution of the available techniques and the
efficiency of a technique combination. We think that the
failure to evaluate the relative efficacy of different NLP
techniques is likely to hinder the development of effective
learning and knowledge acquisition support for ontology
engineering. To address this problem, we propose both a flexible
framework and an integrated tool suite to configure and
combine techniques applied to ontology learning. The general architecture of our solution integrates
an existing linguistic tool (WMatrix [20]), which provides
part-of-speech (POS) and semantic tagging, an ontology
workbench for information extraction, and an existing
open source ontology editor called Protege [16] (see footnote 1). This work
is part of a larger project to build ontologies semi-automatically by processing a collection of domain texts. It involves
dealing with four fundamental issues: extracting the relevant domain terminology, discovering concepts, deriving
a concept hierarchy, and identifying and labeling ontological relations. Our work involves the innovative adaptation, integration and application of existing NLP and
machine learning techniques in order to answer the following research question:
Can shallow analysis of the kind enabled by a range of linguistic and statistical NLP and corpus linguistic techniques
identify key domain concepts? Can it do so with sufficient confidence in the correctness and completeness of the result?
The main contributions of our project are:
- Providing ontology engineers with a coordinated and integrated tool for knowledge object extraction and ontology modelling.
- Evaluating the contribution of different NLP and machine learning techniques and their combinations for ontology learning.
- Proposing a guideline to configure and combine techniques applied to ontology learning.
In this paper we present the results achieved so far:
- The definition of a framework which provides support for testing different NLP and machine learning techniques to support the semi-automatic ontology learning process.
1 http://protege.stanford.edu/
ture so it can include new algorithms. In contrast, it can include techniques from existing linguistic
and ontology tools by using Java APIs (Application Programming Interfaces) directly where possible. In addition,
Text2Onto defines user interaction as a core aspect,
whereas our framework also supports running algorithms in an unsupervised mode. In the next section
we describe our Ontology Acquisition Framework before
explaining in the subsequent section how our framework
supports evaluation.
3. The ontology framework: OntoLancs
Our research project principally addresses the issue of
quantitatively evaluating the usefulness or accuracy of
techniques and combinations of techniques applied to
ontology learning. We have integrated a first set of natural
language processing, corpus linguistics and machine learning techniques for experimentation. They are: (a) POS
grouping, (b) stopword filtering, (c) frequency filtering,
(d) POS filtering, (e) lemmatization, (f) stemming, (g) frequency profiling, (h) concordance, (i) lexico-syntactic patterns, (j) co-occurrence by distance, and (k) collocation
analysis. Our framework facilitates experiments with different NLP and machine learning techniques in order to
assess their efficiency and effectiveness, including the performance of various combinations of techniques. All such
functions are being built into a prototype workbench to
evaluate and refine existing techniques using a range of
domain document corpora.
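The idea of configuring and combining techniques can be illustrated with a minimal sketch in which each technique is a function over a token list and a combination is an ordered list of such functions. This is not the OntoLancs implementation: the names `stopword_filter`, `naive_stem`, `frequency_filter` and `candidate_terms`, the tiny stopword list and the toy suffix stripper are all hypothetical stand-ins chosen for illustration.

```python
from collections import Counter
from typing import Callable, List

# A technique maps a list of tokens to a (possibly smaller) list of tokens.
Technique = Callable[[List[str]], List[str]]

# Illustrative stopword list only; a real experiment would use a full list.
STOPWORDS = {"the", "a", "of", "and", "in", "is", "to"}


def stopword_filter(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def naive_stem(tokens: List[str]) -> List[str]:
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed


def frequency_filter(min_count: int) -> Technique:
    """Build a technique that keeps tokens occurring at least min_count times."""
    def apply(tokens: List[str]) -> List[str]:
        counts = Counter(tokens)
        return [t for t in tokens if counts[t] >= min_count]
    return apply


def candidate_terms(tokens: List[str], techniques: List[Technique]) -> List[str]:
    """Apply the techniques in order and return the distinct surviving terms."""
    for technique in techniques:
        tokens = technique(tokens)
    return sorted(set(tokens))


tokens = "the players kick the ball and the player kicks the balls".split()
with_stemming = candidate_terms(tokens, [stopword_filter, naive_stem, frequency_filter(2)])
without_stemming = candidate_terms(tokens, [stopword_filter, frequency_filter(2)])
print(with_stemming, without_stemming)
```

On this toy input the combination that stems before frequency filtering keeps three terms, while the combination without stemming keeps none, which shows why the choice and order of techniques has to be evaluated rather than fixed in advance.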
In this paper several existing knowledge acquisition
techniques are selected for performing the concept acquisi-
http://www.comp.lancs.ac.uk/ucrel/usas/
http://www.daml.org/ontologies/
http://www.m-w.com/
http://dictionary.cambridge.org/
http://www.natcorp.ox.ac.uk/
http://www.lgi2p.ema.fr/~ranwezs/ontologies/soccerV2.0.daml
Precision: the number of classes of the reference ontology that were matched by a concept returned
by applying the selected NLP techniques to the document
corpus, divided by the number of candidate terms.
Recall: the number of classes of the reference
ontology that were matched by a concept returned by
applying the selected NLP techniques to the document corpus, divided by the number of ontology classes.
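The two measures can be sketched as follows. The function name `evaluate` is hypothetical, and matching a candidate term to an ontology class by case-insensitive string equality is an assumption made for this example; the matching criterion actually used may differ.

```python
from typing import Iterable, Tuple


def evaluate(candidate_terms: Iterable[str],
             reference_classes: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) as percentages.

    Assumption: a reference class counts as matched when a candidate term
    equals its name, ignoring case.
    """
    candidates = {t.lower() for t in candidate_terms}
    reference = {c.lower() for c in reference_classes}
    matched = reference & candidates

    # Precision: matched classes divided by the number of candidate terms.
    precision = 100.0 * len(matched) / len(candidates) if candidates else 0.0
    # Recall: matched classes divided by the number of ontology classes.
    recall = 100.0 * len(matched) / len(reference) if reference else 0.0
    return precision, recall


precision, recall = evaluate(
    ["player", "ball", "kick", "grass"],
    ["Player", "Ball", "Referee", "Goal", "Team"],
)
print(f"precision={precision:.2f}% recall={recall:.2f}%")
```

Because precision is divided by the number of candidate terms, an unsupervised run that returns many candidates can reach a moderate recall while precision stays very low.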
We defined a set of combinations of NLP and machine learning techniques, grouped by the use of a morphological technique, and then obtained precision and recall
values for each combination (see Fig. 3).
The results of the first evaluation, after applying grouping by POS, stemming or lemmatization, and frequency
profiling techniques, showed low values of recall and precision. This is a consequence of using an
unsupervised method and applying a limited number of
techniques for identifying domain concepts.
In the above experiments, although we applied only one NLP
and one machine learning technique to the set of candidate terms, we matched a reasonable number of
classes in the ontology: all the experiments in Table 1 had a recall
above 42%. Applying the stemming technique
before applying the frequency profiling technique to the set
of candidate terms produced the lowest values of recall.
All were above 32% and below 34% (see Table 2). In the
case of precision, the results were lower than those obtained independently of the morphosyntactic technique. In contrast, applying lemmatization before applying the frequency profiling
technique produced the best results. In particular, the sets
of candidate terms filtered using a 95% confidence level produced values of recall above 47% (see Table 3). In the case
of precision, the results were higher than in the other cases (3.43%
being the highest value).
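Filtering candidate terms by a confidence level can be sketched with the log-likelihood keyness test that is commonly used for frequency profiling in corpus linguistics. The text above does not state which statistic was used, so the choice of test here is an assumption; 3.84 is the chi-square critical value (one degree of freedom) for 95% confidence, and the function names and toy counts are hypothetical.

```python
import math
from typing import Dict, List

CRITICAL_95 = 3.84  # chi-square critical value, 1 degree of freedom, p < 0.05


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Log-likelihood for a term.

    a: frequency in the domain corpus, b: frequency in the reference corpus,
    c: size of the domain corpus, d: size of the reference corpus.
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the domain corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2.0 * ll


def frequency_profile(domain_counts: Dict[str, int],
                      reference_counts: Dict[str, int],
                      domain_size: int,
                      reference_size: int,
                      threshold: float = CRITICAL_95) -> List[str]:
    """Keep terms that are significantly overused in the domain corpus."""
    kept = []
    for term, a in domain_counts.items():
        b = reference_counts.get(term, 0)
        overused = a / domain_size > b / reference_size
        if overused and log_likelihood(a, b, domain_size, reference_size) >= threshold:
            kept.append(term)
    return sorted(kept)


# Toy counts, invented purely for illustration.
domain = {"goal": 50, "the": 600, "referee": 20}
reference = {"goal": 30, "the": 60000, "referee": 10}
print(frequency_profile(domain, reference, 10_000, 1_000_000))
```

A stricter confidence level corresponds to a higher threshold, which keeps fewer candidate terms and so trades recall for precision.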
From the experiments, we can conclude that the lemmatization technique produces better results of precision and
recall than the stemming technique for the domain concept
Table 1
Performance using different techniques: morphosyntactic technique independent

Combination   Recall (%)   Precision (%)
A1            45.45        2.36
A2            44.71        2.83
A3            44.98        2.39
A4            43.74        2.88
A5            44.02        2.42
A6            42.79        2.97
Table 2
Performance using different techniques: stemming

Combination   Recall (%)   Precision (%)
S1            33.33        2.25
S2            32.52        2.65
S3            33.33        2.25
S4            32.52        2.65
S5            33.33        2.34
S6            32.25        2.77
Table 3
Performance using different techniques: lemmatization

Combination   Recall (%)   Precision (%)
L1            47.62        2.98
L2            45.45        3.43
L3            47.62        2.98
L4            45.45        3.43
L5            47.14        3.16
L6            45.45        3.43