Mário Rodrigues
António Teixeira
Advanced Applications
of Natural Language
Processing for
Performing Information
Extraction
Mário Rodrigues
António Teixeira
Advanced Applications
of Natural Language
Processing for Performing
Information Extraction
Mário Rodrigues
ESTGA/IEETA
University of Aveiro
Portugal
António Teixeira
DETI/IEETA
University of Aveiro
Portugal
ISSN 2191-8112
ISSN 2191-8120 (electronic)
SpringerBriefs in Electrical and Computer Engineering
ISBN 978-3-319-15562-3
ISBN 978-3-319-15563-0 (eBook)
DOI 10.1007/978-3-319-15563-0
Library of Congress Control Number: 2015935192
Springer Cham Heidelberg New York Dordrecht London
© The Authors 2015
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, express or implied, with respect to the material contained herein or for any errors
or omissions that may have been made.
Printed on acid-free paper
Springer International Publishing AG Switzerland is part of Springer Science+Business Media
(www.springer.com)
Preface
Mário Rodrigues
António Teixeira
Contents
Introduction ............................................................................................... 1
1.1 Document Society .............................................................................. 1
1.2 Problems ............................................................................................ 2
1.3 Semantics and Knowledge Representation ........................................ 3
1.4 Natural Language Processing ............................................................ 4
1.5 Information Extraction ....................................................................... 5
1.5.1 Main Challenges in Information Extraction .......................... 5
1.5.2 Approaches to Information Extraction .................................. 6
1.5.3 Performance Measures .......................................................... 7
1.5.4 General Architecture for Information Extraction .................. 8
1.6 Book Structure ................................................................................... 8
References ................................................................................................... 10
Conclusion ................................................................................................. 71
Index .......................................................................................................... 73
Chapter 1
Introduction
The number of such documents and their rate of increase are overwhelming. Some
examples: governments produce large amounts of documents at several levels
(local, central) and of many types (laws, regulations, public minutes of meetings,
etc.); the information on companies' intranets keeps growing; and more and more
exams, reports, and other medical documents are stored on servers by health
institutions. Our personal documents also grow in number and size day by day.
Health research is one of the most active areas, resulting in a steady flow of
documents (e.g. medical journals and master's and doctoral theses) reporting new
findings and results. There are also many portals and websites with health
information, such as the example presented in Fig. 1.1.
Much of the information that would be of interest to citizens, researchers, and
professionals is found in unstructured documents. Despite the increasing use of
tables, images, graphs, and movies, a relevant part of these documents at least
partially adopts written natural language. The amount of content available in
natural language (English, Portuguese, Chinese, Spanish, etc.) increases every day.
This is particularly noticeable on the web.
1.2 Problems
Despite the many useful applications that the information in these documents could
enable, it is getting harder and harder to obtain the desired information. This huge
and increasing amount of documents available on the web, companies' intranets and
Regarding the technology used, earlier IE systems were essentially rule-based
approaches, also called knowledge-engineered approaches. This type of technology
is still used, at least partially, in modern approaches. It uses hard-coded rules created
by human experts that encode linguistic knowledge by matching patterns over a
variety of structures: text strings, part-of-speech tags, and dictionary entries. The rules
are usually targeted at specific languages and domains, and such systems are
generally very accurate and ready to use out of the box (Andersen et al. 1992;
Appelt et al. 1993; Lehnert et al. 1993). As manual coding of the rules can become
a time-consuming task, and also because rules rarely remain unchanged when
porting to other languages and/or domains, some implementations introduced
algorithms for automatically learning rules from examples (Soderland 1999;
Califf and Mooney 1999; Ciravegna 2001).
The success of IE motivated the broadening of its scope to include more unstructured
and noisy sources and, as a result, statistical learning algorithms were introduced.
Among the most successful approaches are those based on Hidden Markov Models,
conditional random fields, and maximum entropy models (Ratnaparkhi 1999;
Lafferty et al. 2001). Later, more holistic analyses of documents were developed,
including techniques for grammar construction and ontology-based IE (Viola and
Narasimhan 2005; Wimalasuriya and Dou 2010).
Hybrid approaches, which use a mix of the previous two, combine the best features
of each kind of approach: the accuracy of rule-based approaches with the
coverage and adaptability of machine learning approaches.
Some IE approaches use ontologies to store and guide the IE process. The
success of these approaches motivated the creation of the term Ontology-Based
Information Extraction (OBIE). These approaches will be described in Chap. 4 of
this book.
Despite the different approaches, there is no clear winner. The advent of the
Semantic Web and Open Data made ontology-based IE (OBIE) one of the most
popular trends in the field. However, OBIE includes other IE algorithms and is not
an alternative method but rather an approach that processes natural language text
through a mechanism guided by ontologies and presents the output using ontologies
(Wimalasuriya and Dou 2010).
Comprehensive overviews of IE approaches are provided in (Sarawagi 2008;
Piskorski and Yangarber 2013).
Recall = (number of correct items extracted) / (total number of relevant items)

F-measure = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)

F1 = 2 × (Precision × Recall) / (Precision + Recall)
A difficulty when computing these measures is that it is necessary to know all the
relevant findings in the documents, specifically when calculating recall and, thus,
F-measure. This implies having someone read all the documents and annotate
the relevant parts of the texts, which is a time-consuming task. Ideally, the annotation
should be performed by more than one person and followed by group consensus
about which annotations are correct. It is possible to find some sets of
documents already annotated, known as golden collections.
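As an illustration (not from the book), the measures above can be computed from the set of items a system extracted and the set of gold-standard items; a minimal sketch in Python:

```python
def precision_recall_f(extracted, reference, beta=1.0):
    """Compute precision, recall and F-measure given the items an IE
    system extracted and the reference (gold-standard) items."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)  # true positives
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure = (1 + beta^2) * P * R / (beta^2 * P + R)
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f
```

With `beta=1` this reduces to the F1 formula above; for example, extracting `{"a", "b", "c"}` against the reference `{"b", "c", "d"}` yields precision, recall, and F1 all equal to 2/3.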
[Figure: general architecture for information extraction. Documents are first
processed by domain-independent modules (sentence splitting, tokenization,
morphological analysis, POS tagging, syntactic parsing) and then by
domain-specific modules, producing information/knowledge]
systems that are able to learn how to extract relevant information from natural
language documents and assign semantic meaning to it. The chapter also
includes some background information on semantics, ontologies, knowledge
representation, and Natural Language Processing.
The two main groups of processing modules of the generic architecture are the
subject of the following two chapters. First, Chap. 2 presents the domain-independent
modules that, in general, split the text into relevant units (sentences and
tokens) and enrich the document by adding morphological and syntactic
information. The third chapter presents information on how to extract entities and
relations and create a semantic representation with the extracted information.
As OBIE is a very important trend, a complete chapter, the fourth, is dedicated
to presenting a proposal of a software architecture for performing OBIE using an
arbitrary ontology and to describing a system developed based on that architecture.
As this book aims at including real applications, Chap. 5 illustrates how to
implement working systems. The chapter presents two systems: the first is a tutorial
system (which we challenge all readers to build) developed by almost direct use
of freely available tools and documents; the second, more complex and for a
language other than English, illustrates a state-of-the-art system and how it can
deliver information to end users.
The book ends with some comments on what was selected as content for the book
and some considerations regarding the future.
References
Allen JF (2000) Natural language processing. In: Ralston A, Reilly ED, Hemmendinger D (eds)
Encyclopedia of computer science, 4th edn. Wiley, Chichester, pp 1218–1222
Andersen PM et al (1992) Automatic extraction of facts from press releases to generate news
stories. In: Proceedings of the third conference on applied natural language processing,
pp 170–177
Antoniou G, van Harmelen F (2009) Web Ontology Language: OWL. In: Staab S, Studer R (eds)
Handbook on ontologies, 2nd edn. International handbooks on information systems. Springer,
Berlin, pp 91–110
Appelt DE et al (1993) FASTUS: a finite-state processor for information extraction from
real-world text. In: IJCAI, pp 1172–1178
Buckland M (2013) The quality of information in the web. BiD: textos universitaris de
biblioteconomia i documentació (31)
Califf ME, Mooney RJ (1999) Relational learning of pattern-match rules for information
extraction. In: AAAI/IAAI, pp 328–334
Chang C-H, Hsu C-N, Lui S-C (2003) Automatic information extraction from semi-structured web
pages by pattern discovery. Decis Support Syst 35(1):129–147
Chiticariu L, Li Y, Reiss FR (2013) Rule-based information extraction is dead! Long live rule-
based information extraction systems! In: EMNLP, pp 827–832
Ciravegna F (2001) Adaptive information extraction from text by rule induction and
generalisation. In: International joint conference on artificial intelligence, pp 1251–1256
Gaizauskas R, Wilks Y (1998) Information extraction: beyond document retrieval. J Doc 54(1):
70–105
Gruber TR (1993) A translation approach to portable ontology specifications. Knowl Acquis
5(2):199–220
Guarino N (1998) Formal ontology and information systems. In: FOIS '98: proceedings of the
international conference on formal ontology in information systems. IOS Press, Amsterdam,
pp 3–15
Guha R, McCool R, Miller E (2003) Semantic search. In: The twelfth international World Wide
Web conference (WWW), Budapest, p 779
Hsu C-N, Dung M-T (1998) Generating finite-state transducers for semi-structured data extraction
from the web. Inf Syst 23(8):521–538
Jurafsky D, Martin JH (2008) Speech and language processing: an introduction to natural language
processing, computational linguistics, and speech recognition, 2nd edn. Prentice Hall,
New York
Kasneci G et al (2008) The YAGO-NAGA approach to knowledge discovery. ACM SIGMOD 37:7
Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for
segmenting and labeling sequence data. In: Proceedings of the international conference on
machine learning (ICML-2001)
Lee L (2004) "I'm sorry Dave, I'm afraid I can't do that": linguistics, statistics, and natural
language processing circa 2001. In: Committee on the Fundamentals of Computer Science:
Challenges and Opportunities, Computer Science and Telecommunications Board, National
Research Council (ed) Computer science: reflections on the field, reflections from the field.
The National Academies Press, Washington, pp 111–118
Lehnert W et al (1993) UMass/Hughes: description of the CIRCUS system used for Tipster text.
In: Proceedings of TIPSTER '93, 19–23 September 1993, pp 241–256
Makhoul J et al (1999) Performance measures for information extraction. In: Proceedings of
DARPA broadcast news workshop, pp 249–252
Màrquez L et al (2008) Semantic role labeling: an introduction to the special issue. Comput
Linguist 34(2):145–159
McNaught J, Black W (2006) Information extraction. In: Ananiadou S, McNaught J (eds) Text
mining for biology and biomedicine. Artech House, Boston
Muslea I (1999) Extraction patterns for information extraction tasks: a survey. In: Proceedings of
the AAAI 99 workshop on machine learning for information extraction, Orlando, July 1999,
pp 1–6
Neustein A et al (2014) Application of text mining to biomedical knowledge extraction: analyzing
clinical narratives and medical literature. In: Neustein A (ed) Text mining of web-based
medical content. De Gruyter, Berlin, p 50
Piskorski J, Yangarber R (2013) Information extraction: past, present and future. In: Multi-source,
multilingual information extraction and summarization. Springer, Berlin, pp 23–49
Ratnaparkhi A (1999) Learning to parse natural language with maximum entropy models. Mach
Learn 34(1–3):151–175
Santos D (1992) Natural language and knowledge representation. In: Proceedings of the ERCIM
workshop on theoretical and experimental aspects of knowledge representation, pp 195–197
Sarawagi S (2008) Information extraction. Found Trends Database 1(3):261–377
Soderland S (1999) Learning information extraction rules for semi-structured and free text. Mach
Learn 34(1–3):233–272
Sowa JF (2000) Knowledge representation: logical, philosophical, and computational foundations.
Brooks Cole, Pacific Grove
Suchanek F, Kasneci G, Weikum G (2007) Yago: a core of semantic knowledge. In: Proceedings
of the 16th international conference on World Wide Web. ACM Press, New York, p 697
Teixeira A, Ferreira L, Rodrigues M (2014) Online health information semantic search and
exploration: reporting on two prototypes for performing extraction on both a hospital intranet
and the world wide web. In: Neustein A (ed) Text mining of web-based medical content.
De Gruyter, Berlin, p 50
Viola P, Narasimhan M (2005) Learning to extract information from semi-structured text using a
discriminative context free grammar. In: Proceedings of the 28th annual international ACM
SIGIR conference on research and development in information retrieval, pp 330–337
Wimalasuriya DC, Dou D (2010) Ontology-based information extraction: an introduction and a
survey of current approaches. J Inf Sci 36(3):306–323
Chapter 2
Abstract This chapter presents the domain-independent part of the general
architecture of Information Extraction (IE) systems. This first part aims at preparing
documents through the application of several Natural Language Processing tasks
that enrich the documents with morphological and syntactic information. This is
done in successive processing steps, which start by making contents uniform and
end by identifying the roles of the words and how they are arranged.
The most common steps are described here: sentence boundary detection,
tokenization, part-of-speech tagging, and syntactic parsing. The description includes
information on a selection of relevant tools available to implement each step.
The chapter ends with the presentation of three very representative software
suites that ease the integration of the several steps described.
Keywords Information extraction · Tokenization · Sentence splitting ·
Morphological analysis · Part-of-speech · POS · Syntactic parsing · Tools
2.1 Process Overview
The IE process usually starts by identifying and associating morphosyntactic
features to natural language contents that would otherwise be quite
indistinguishable character strings. The process is composed of successive NLP
steps, starting by making contents uniform and ending with the identification of the
roles of the words and how they are arranged. The first steps are usually
tokenization and sentence boundary detection. Their purpose is to break contents
into sentences and to define the limits of each token: word, punctuation mark, or
other character clusters such as currencies. Afterwards, all processing is usually
conducted in a per-sentence fashion and tokens are considered atomic. Then,
morphological analysis makes tokens uniform by determining word lemmata (see
"win" and "won" in Fig. 2.1), and part-of-speech tagging assigns a part of speech
to each token, visible after the slashes. The final step is usually syntactic parsing,
which can be done using significantly different formalisms. These NLP steps
prepare the textual contents for the subsequent identification and extraction of
relevant information.
[Fig. 2.1 content. Task column: sentence boundary detection + tokenization, then
morphological analysis + part-of-speech tagging, then (dependency) syntactic
parsing. Data column: "John Bardeen is the only laureate to win the Nobel Prize
in Physics twice, in 1956 and 1972. Maria Curie also won two Nobel Prizes, for
physics in 1903 and chemistry in 1911." After splitting and tokenization:
<S>[John] [Bardeen] [is] [the] [only] [laureate] [to] [win] [the] [Nobel] [Prize]
[in] [Physics] [twice] [,] [in] [1956] [and] [1972] [.]</S> <S>[Maria] [Curie]
[also] [won] [two] [Nobel] [Prizes] [,] [for] [physics] [in] [1903] [and]
[chemistry] [in] [1911] [.]</S>]
Fig. 2.1 Representative example of the NLP steps for morphosyntactic data generation relative to
plain text natural language sentences
Figure 2.1 depicts the mentioned successive processing steps and their effect on
data. The processing steps are on the left-hand side and their effect on data is visible
on the right-hand side. The output of one step is the input of the next one, and the
effects are representative as they provide a real example of what can be done but are
not the unique possible formalism or solution. The syntactic parsing result in
Fig. 2.1 is relative to dependency parsing and is depicted as a graph for simplicity.
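As a minimal, self-contained illustration of the first two steps (sentence splitting and tokenization), here is a regex-based sketch; it is not any of the tools described in this chapter, and a fixed pattern like this mis-splits abbreviations ("Dr.") that trained tools handle:

```python
import re

def split_and_tokenize(text):
    """Naive sentence splitting and tokenization: a sentence ends at
    '.', '!' or '?' followed by whitespace; tokens are runs of word
    characters or single punctuation marks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

# The example text of Fig. 2.1:
text = ("John Bardeen is the only laureate to win the Nobel Prize in "
        "Physics twice, in 1956 and 1972. Maria Curie also won two Nobel "
        "Prizes, for physics in 1903 and chemistry in 1911.")
```

Calling `split_and_tokenize(text)` yields two token lists mirroring the `<S>[...]</S>` output shown in Fig. 2.1.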
In the following sections, each of these major steps will be described and
representative tools briefly presented. A bias towards alphabet languages is assumed,
but, whenever possible, some information is provided on other languages, such as
Arabic and Chinese. Representative tools, in general used later in the book, are
given some additional attention. They are described with some detail and relevant
information, such as the way of obtaining the tool and languages supported out of
the box, is presented in tabular form at the end of each section.
2.2
2.2.1 Tools
Tools for tokenizing texts are found in software suites such as Freeling (Padró and
Stanilovsky 2012), NLTK (Bird et al. 2009), OpenNLP (Apache 2014), or
StanfordNLP (Manning et al. 2014). There are no specialized tools exclusively
dedicated to this problem, since tokenization can be done reasonably well using
regular expressions (regex) when processing languages that use the Latin alphabet.
For languages not using the Latin alphabet there are fewer tools. The tokenizer Stanford Word
Segmenter¹ has models able to handle Arabic and Chinese (Chang et al. 2008;
Monroe et al. 2014).

Table 2.1 Punkt highlights
Name: Punkt
Task: Sentence boundary detection
URL: http://www.nltk.org/_modules/nltk/tokenize/punkt.html
Languages tested: Dutch, English, Estonian, French, German, Italian, Norwegian,
Portuguese, Spanish, Swedish, and Turkish
Performance: F1 above 0.95 for most of the 11 tested languages
Regarding the sentence boundary detection problem, several systems have been
proposed with good results. Here we focus on two proposals that achieved good
results when tested with distinct natural languages: Punkt (Kiss and Strunk 2006)
and iSentenizer (Wong et al. 2014).
2.2.2
Punkt is included in the Natural Language Toolkit (NLTK), a software suite in
Python that provides tools for handling natural languages (see Sect. 2.5.2). Punkt's
implementation follows the tokenizer interface defined by NLTK in order to be
seamlessly integrated programmatically into an NLP pipeline. It is provided with
source code and, alongside the execution method, the software also includes
methods for training new sentence boundary detection models from corpora (see
tested languages in Table 2.1).
Punkt's approach is based on unsupervised machine learning. The method
assumes that most end-of-sentence ambiguities can be solved if abbreviations are
identified, as the remaining periods would mark ends of sentences (Kiss and Strunk
2006). It operates in two steps. The first step detects abbreviations by assuming
that they are collocations of a truncated word and a final period, that they are short,
and that they often contain internal periods. These assumptions are used to estimate
the likelihood of a given period being part of an abbreviation. The second step
evaluates whether the decisions of the first step should be corrected. The evaluation
is based on the word immediately to the right of the period: it checks whether that
word is a frequent sentence starter, whether it is capitalized, and whether the two
tokens surrounding the period form a frequent collocation. Periods are considered
sentence boundary markers if they are not part of abbreviations.
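The core idea (a period ends a sentence unless it belongs to an abbreviation) can be sketched in a few lines; this is a greatly simplified illustration with a hand-made abbreviation list, whereas Punkt itself, available through NLTK, learns abbreviations from the corpus with unsupervised likelihood estimates:

```python
import re

def split_sentences(text, abbreviations=frozenset({"dr", "mr", "mrs", "etc", "vs"})):
    """Split text at periods followed by whitespace, unless the token
    before the period is a known abbreviation (toy version of Punkt)."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+", text):
        words = text[start:match.start()].split()
        before = words[-1].lower() if words else ""
        if before.rstrip(".") in abbreviations:
            continue  # the period belongs to an abbreviation; keep going
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences
```

For example, `split_sentences("Dr. Smith arrived. He spoke.")` keeps "Dr." inside the first sentence instead of splitting after it.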
iSentenizer is provided with a Visual C++ application programming interface
(API) and a standalone tool featuring a graphical user interface (GUI). Having these
two interfaces makes the tool easier to use. The GUI can be used to easily and
conveniently construct and verify a sentence boundary detection system for a
specific language, and the API allows later integration of the constructed model
into larger software systems using Visual C++.
¹ http://nlp.stanford.edu/software/segmenter.shtml
Table 2.2 iSentenizer highlights
Name: iSentenizer
Task: Sentence boundary detection
URL: http://nlp2ct.cis.umac.mo/views/utility.html
Languages tested: Danish, Dutch, English, Finnish, French, German, Greek,
Italian, Portuguese, Spanish, Swedish
Notes: Detects sentence boundaries of a mixture of different text genres and
languages with high accuracy
Performance: F1 above 0.95 for most of the 11 tested languages using the
Europarl corpus
iSentenizer is based on an algorithm, named i+Learning, that constructs a decision
tree in two steps (Wong et al. 2014). The first step constructs a decision tree in
a top-down approach based on the training corpus. The second step increments the
tree whenever a new instance or attribute is detected, revising the tree model by
incorporating new knowledge instead of retraining it from scratch. The features
used in tree construction are the words immediately preceding and following the
potential boundary punctuation marks: period, exclamation mark, colon, semicolon,
question mark, quotation marks, brackets, and dash. More punctuation marks are
included than the usual sentence boundary markers (period, exclamation mark, and
question mark) because those punctuation marks may also denote a sentence
boundary depending on the text genre. Features are encoded in a way that is
independent of corpus and alphabet in order to maximize the adaptability of the
system to different languages and text genres (see tested languages in Table 2.2).
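The feature scheme just described (the words immediately around each candidate punctuation mark) can be sketched as follows; this illustrates the idea only and is not iSentenizer's actual code:

```python
# Candidate boundary marks as listed above: period, exclamation mark,
# colon, semicolon, question mark, quotation marks, brackets, and dash.
BOUNDARY_CANDIDATES = set('.!?:;"\'()[]{}-')

def boundary_features(tokens):
    """For each candidate punctuation token, emit the (previous word,
    punctuation, next word) triple used as one classification instance."""
    features = []
    for i, tok in enumerate(tokens):
        if tok in BOUNDARY_CANDIDATES:
            prev_word = tokens[i - 1] if i > 0 else "<start>"
            next_word = tokens[i + 1] if i + 1 < len(tokens) else "<end>"
            features.append((prev_word, tok, next_word))
    return features
```

A classifier (in iSentenizer's case, the incrementally updated decision tree) then labels each triple as boundary or non-boundary.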
2.3
Having texts separated into tokens, the next step is usually morphosyntactic
analysis, in order to identify characteristics such as word lemma and part of speech
(Marantz 1997). It is important to distinguish two concepts: lexeme and word form.
The difference is well illustrated with two examples: (1) the words "book" and
"books" refer to the same concept, and thus have the same lexeme but different
word forms; (2) the words "book" and "bookshelf" have different word forms and
different lexemes, as they refer to two different concepts (Marantz 1997). The form
chosen to conventionally represent the canonical form of a lexeme is called the
lemma.
Finding word lemmata brings the advantage of having a single form for all words
that have similar meanings. For example, the words "connect", "connected",
"connecting", "connection", and "connections" roughly refer to the same concept
and have the same lemma. This process also reduces the total number of terms to
handle, which is advantageous from a computer processing point of view, as it
reduces the size and complexity of data in the system (Porter 1980). The complexity
of the task depends on the target natural language. For languages with simple
inflectional morphology, such as English, the task is more straightforward than for
languages with more complex inflectional morphology, such as German (Appelt 1999).
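As a toy illustration of lemmatization for the English examples above (real systems use dictionaries and morphological rules, e.g. NLTK's WordNet lemmatizer, rather than this naive scheme):

```python
# Hypothetical, greatly simplified lemmatizer: a small exceptions
# dictionary for irregular forms plus longest-first suffix stripping.
EXCEPTIONS = {"is": "be", "has": "have", "won": "win"}
SUFFIXES = ["ions", "ion", "ing", "ed", "s"]  # longest first

def toy_lemma(word):
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:
        # keep at least 3 characters of stem to avoid over-stripping
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

All five "connect" variants above map to the single form "connect", showing how lemmatization collapses related word forms into one term.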
2.3.1 Tools
Several approaches have been proposed over the years. Available implementations
are commonly developed for English and trained and evaluated using Penn
Treebank data. Nevertheless, most have the potential to be used for tagging other
languages. Here we privileged implementations that have proven to give good
results with several natural languages, that are provided with methods to train a
tagger model for other languages given POS-annotated training text for that
language, and that are not part of larger software suites. The only exception is the
Stanford POS tagger, from the StanfordNLP suite, because it is provided with
tagger models for six different languages, making it very relevant even if the rest
of the suite is not used.
2.3.2
Three tools were selected, as they represent POS tagging implementations using
models based on, respectively, maximum entropy, support vector machines (SVM),
and Markov models. All tools include the POS tagger and methods to create new
tagger models given training data.
The Stanford POS Tagger includes components for command-line invocation,
for running as a server, and for integration into software projects using a Java
API. The full download version contains tagger models for six different languages
(see the language list in Table 2.3). It is based on a bidirectional maximum entropy
model that decides the POS tag of a token taking into consideration the preceding
and following tags, and broad lexical features such as joint conditioning of multiple
consecutive words. The tagger achieved a precision value above 0.97 on the Penn
Treebank Wall Street Journal (WSJ) corpus (Toutanova et al. 2003).
SVMTool supports standard input and output pipelining, which eases its
integration into larger systems. It is also provided with a C++ API to support
embedded usage. The algorithm is based on support vector machine classifiers and
uses a rich set of features, including word and POS bigrams and trigrams, and
surface patterns such as prefixes, suffixes, letter capitalization, word length, and
sentence punctuation. Tagging decisions can be made using a reduced context or at
the sentence level. The tagger achieved accuracy above 0.97 with the English Wall
Street Journal corpus, and above 0.98 with the Spanish LEXEP corpus (Giménez
and Màrquez 2004). Table 2.4 presents the highlights of SVMTool.
TreeTagger can be run from the command line or using a GUI, and is provided
as a binary package for Intel Macs, Linux, or Windows operating systems. The
project website includes ready-to-use models for 16 languages (see the language
list in Table 2.5). TreeTagger's algorithm is based on n-gram Markov models with
transition probabilities estimated using a binary decision tree. Compared to other
algorithms using Markov models, this technique needs less data to obtain reliable
transition probabilities, as binary decision trees have relatively few parameters to
estimate. This feature mitigates the sparse data problem (Schmid 1994).

Table 2.4 SVMTool highlights
Name: SVMTool
Task: Part-of-speech tagging
URL: http://www.lsi.upc.edu/~nlp/SVMTool/
Languages tested: Catalan, English, and Spanish
Performance: Accuracy of 0.9739 for English and 0.9808 for Spanish

Table 2.5 TreeTagger highlights
Name: TreeTagger
Task: Part-of-speech tagging
URL: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
Languages tested: Bulgarian, Dutch, English, Estonian, Finnish, French, Galician,
German, Italian, Portuguese, Mongolian, Polish, Russian, Slovak, Spanish, and
Swahili
Performance: Accuracy above 0.95 for most languages
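All three taggers learn a model from POS-annotated text. As an illustration of that train-then-tag workflow (a deliberately naive baseline, far simpler than the maximum entropy, SVM, or Markov models above), one can learn each word's most frequent tag from a tagged corpus:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag from POS-annotated sentences,
    given as lists of (word, tag) pairs."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, tokens, default="NN"):
    """Tag tokens with the learned model, falling back to a default tag
    for unknown words."""
    return [(t, model.get(t.lower(), default)) for t in tokens]
```

Real taggers beat this baseline mainly on ambiguous and unknown words, by also conditioning on surrounding tags and surface features such as suffixes and capitalization.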
2.4 Syntactic Parsing
Syntactic parsing is usually a computationally intensive task that is not used in IE
systems as often as tokenization, sentence boundary detection, or POS tagging.
When information sources are (semi-)structured or machine generated, or the
output is coarse grained, less computationally intensive methods such as locating
textual patterns can provide similar results (Feldman and Sanger 2007; Huffman 1996).
The goal of syntactic parsing is to analyze sentences in order to produce
structures representing how words are arranged in sentences (Langacker 1997).
Structures are produced with respect to a given formal grammar, and over the years
different formalisms have been proposed reflecting both linguistic and
computational concerns. In a broad sense, grammars can have two structural
formalisms: constituency and dependency (Jurafsky and Martin 2008; Nugues 2006).
A constituent is a unit within a hierarchical structure that is composed of a word
or a group of words. Although in a strict formal sense constituent structures can
be observed in dependency grammars, constituency is usually associated with
phrase structure grammars, as these are based only on the constituency relation.
Phrase structure grammars are composed of sets of syntactic rules that fractionate a
phrase into sub-phrases and hence describe a sentence's composition in terms of
phrase structure (Chomsky 2002). Figure 2.2 presents a possible parse of the
sentence "This book has two authors." using a phrase structure grammar.
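To make the constituency idea concrete, here is a hypothetical miniature grammar, just large enough for the example sentence, with a naive recursive-descent parser; real parsers such as those in Sect. 2.4.1 instead use broad-coverage (often probabilistic) grammars learned from treebanks:

```python
# Toy lexicon and phrase-structure rules for "This book has two authors".
LEXICON = {"This": "DT", "book": "NN", "has": "AUX",
           "two": "CD", "authors": "NNS"}
RULES = {"NounPhrase": [["DT", "NN"], ["CD", "NNS"]],
         "VerbPhrase": [["AUX", "NounPhrase"]],
         "S": [["NounPhrase", "VerbPhrase"]]}

def parse(symbol, tokens, i=0):
    """Return (tree, next_index) if `symbol` derives tokens[i:...], else None."""
    if symbol in LEXICON.values():  # preterminal: must match one token's tag
        if i < len(tokens) and LEXICON.get(tokens[i]) == symbol:
            return (symbol, tokens[i]), i + 1
        return None
    for rule in RULES.get(symbol, []):
        children, j, ok = [], i, True
        for part in rule:
            result = parse(part, tokens, j)
            if result is None:
                ok = False
                break
            child, j = result
            children.append(child)
        if ok:
            return (symbol, children), j
    return None
```

Calling `parse("S", ["This", "book", "has", "two", "authors"])` yields a nested tree mirroring the constituency structure of Fig. 2.2.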
Dependency grammars describe sentence structures in terms of links between
words. Each link reflects a relation of dominance/dependence between a headword
[Fig. 2.2: phrase structure parse of "This book has two authors.": a NounPhrase (DT "This", NN "book") and a VerbPhrase (AUX "has", NounPhrase (CD "two", NNS "authors"))]
[Fig. 2.3: dependency parse of the same sentence, with POS tags DT, NN, VBZ, CD, NNS and dependency relations root, punct, det, nsubj, dobj, num]
and a dependent word. The original work of Tesnière (1959) later received formal
mathematical definitions, thus becoming suitable for automatic processing. As a result,
sentence dependencies form graphs that have a single head and usually have three
properties: acyclicity, connectivity, and projectivity (Nivre 2005). Dependency
grammars often prove more efficient for parsing texts. Figure 2.3 presents a possible
parse of the same example sentence using a dependency grammar.
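The three graph properties just mentioned can be checked directly on a head-index representation of a dependency parse. The sketch below is illustrative (function names and the head-map encoding are ours, not from any parser's API); token indices start at 1 and 0 denotes the artificial root.

```python
# A dependency parse as a map: token index -> head index (0 = root).

def is_acyclic(heads):
    # Follow head links from every token; a cycle revisits a node.
    for start in heads:
        seen, node = set(), start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True

def has_single_head(heads):
    # Connectivity with a single root: exactly one token attaches to 0.
    return sum(1 for h in heads.values() if h == 0) == 1

def is_projective(heads):
    # Arc (h, d) is projective when every token between h and d is a
    # descendant of h (its head chain passes through h). Assumes acyclicity.
    for d, h in heads.items():
        lo, hi = min(d, h), max(d, h)
        for k in range(lo + 1, hi):
            node = k
            while node not in (0, h):
                node = heads[node]
            if node != h:
                return False
    return True

# "This book has two authors ." with the Fig. 2.3 relations:
# det(book, This), nsubj(has, book), root(has), num(authors, two),
# dobj(has, authors), punct(has, .)
parse = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3, 6: 3}
```

A crossing-arc graph such as `{1: 2, 2: 0, 3: 1, 4: 2}` fails the projectivity test, which is exactly the non-projective case TurboParser (discussed below) is designed to handle.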
Nugues (2006) provides a comprehensive discussion of the syntax theories and
parsing techniques that have been proposed over the years. Here the focus will be
on tools that have proven adaptable to different languages without the need to
rewrite grammars, which is a difficult task requiring some expertise in language models. The first two parsers presented, Epic and StanfordParser, use phrase
structure grammars; the other two, MaltParser and TurboParser, use dependency
grammars.
2.4.1
Epic
Epic is a probabilistic context-free grammar (PCFG) parser that can be used from
the command line or programmatically through a Scala API. Its algorithm uses surface
patterns to reduce the propagation of information through the grammar structure,
thus avoiding having too many features in the grammar structure. Having a simpler
Table 2.6 Epic
Task: syntactic parsing (phrase structure grammar)
URL: http://www.scalanlp.org/
Ready-to-use models: Basque, English, French, German, Hungarian, Korean, Polish, and Swedish
Other languages tested with accuracy over 0.78: Arabic, Basque, and Hebrew

Table 2.7 StanfordParser
Task: syntactic parsing (phrase structure grammar)
URL: http://nlp.stanford.edu/software/lex-parser.shtml
Ready-to-use models: Arabic, Chinese, English, French, and German
Other languages tested with accuracy over 0.75: Bulgarian, Italian, and Portuguese
structural backbone improves the adaptation to new languages (Hall et al. 2014).
The Epic parser provides ready-to-use models for eight languages and was tested
with three more languages, achieving accuracy results over 0.78 (see Table 2.6).
StanfordParser is also a PCFG parser, provided with a command line interface as
well as a Java API for programmatic usage. It uses an unlexicalized grammar
at its core. An unlexicalized PCFG is a grammar that relies on word categories, such as
POS categories, that can be more or less broad, and does not systematically specify
rules down to the lexical level. However, some categories can represent a single word.
This brings the advantage of producing compact and robust grammar representations, as there is no need for large structures to store lexicalized probabilities
(Klein and Manning 2003). StanfordParser is provided with models for five languages and was also used with Bulgarian, Italian, and Portuguese (see Table 2.7).
MaltParser is provided as a JAR package for command line usage, and with the
Java source code for integration into larger software projects. MaltParser is a data-driven dependency parsing system able to induce parsing models from treebank
data. The parsing model builds dependency graphs in one left-to-right pass over the
input, using a stack to store partially processed tokens and a history-based feature
model to predict the next parser action (Hall et al. 2010; Nivre et al. 2007). There
are ready-to-use parsing models for 4 languages, and the parser was tested with 14 other languages, with results showing an accuracy around 0.75 or more (see Table 2.8).
TurboParser is provided as C++ source code ready to be compiled on systems
complying with the Portable Operating System Interface (POSIX) and also on
Windows. The approach followed formulates the problem of non-projective dependency parsing as an integer linear programming optimization problem of polynomial size. The model supports expert knowledge in the form of constraints, and training
data is used to automatically learn soft constraints. Having a model requiring a
polynomial number of constraints as a function of the sentence length, instead of
Table 2.8 MaltParser
Task: syntactic parsing (dependency grammar)
URL: http://www.maltparser.org/
Ready-to-use models: English, French, Spanish, and Swedish
Other languages tested with accuracy around 0.75 or above: Arabic, Basque, Catalan, Chinese, Czech, Danish, Dutch, German, Greek, Hungarian, Italian, Japanese, Portuguese, and Turkish

Table 2.9 TurboParser
Task: syntactic parsing (dependency grammar)
URL: http://www.ark.cs.cmu.edu/TurboParser/
Ready-to-use models: Arabic, English, Farsi, Kinyarwanda, and Malagasy
Other languages tested with accuracy above 0.75: Danish, Dutch, Portuguese, Slovene, Swedish, and Turkish
2.5
Software Suites
NLP software suites make it easier to integrate all tasks in a processing pipeline.
They integrate several tools using a coherent data representation, designed to allow
directly using the output of one step as the input of the following one. The list of
available suites includes Apache OpenNLP, Freeling, GATE, LingPipe, Natural Language
Tool Kit (NLTK), and StanfordNLP, among others. Here StanfordNLP will be described,
as it is used in a tutorial example in Chap. 5; NLTK, as it is very well documented and uses a programming language distinct from StanfordNLP's; and GATE, for
historical reasons, as it was one of the first mature suites available.
2.5.1
Stanford NLP
Stanford NLP (Manning et al. 2014) is a machine learning based toolkit for the
processing of natural language text. It includes software for realizing several NLP
tasks such as tokenization, sentence segmentation, part-of-speech tagging, named
entity recognition, parsing, coreference resolution, and relation extraction, that can
be incorporated into applications with human language technology needs.
2.5.2
NLTK
NLTK (Bird et al. 2009) provides a wide range of text processing libraries, including
text classification, tokenization, stemming, tagging, chunking, parsing, and semantic
reasoning. It also provides intuitive interfaces to more than 50 corpora and lexical
resources, including WordNet. It is well documented with tutorials, animated algorithms, and problem sets, and is thoroughly discussed in a comprehensive book by Bird
et al. (2009). The suite is developed in the Python programming language, and an
active community also creates Python wrappers for state-of-the-art tools that respect the
NLTK interfaces. For instance, there is a Python wrapper to use MaltParser in NLTK.
2.5.3
GATE
References
Aluísio S, Pelizzoni J, Marchi AR, de Oliveira L, Manenti R, Marquiafável V (2003) An account
of the challenge of tagging a reference corpus for Brazilian Portuguese. In: Computational
processing of the Portuguese language. Springer, Berlin, pp 110–117
Apache OpenNLP Development Community (2014) Apache OpenNLP developer documentation.
opennlp.apache.org
Appelt DE (1999) Introduction to information extraction. Artif Intell Commun 12:161–172
Bird S, Klein E, Loper E (2009) Natural language processing with Python. O'Reilly, Sebastopol
Brants T (1995) Tagset reduction without information loss. In: Proceedings of the 33rd annual
meeting on Association for Computational Linguistics. pp 287–289
Chang AX, Manning CD (2014) TokensRegex: defining cascaded regular expressions over
tokens. Technical report CSTR 2014-02. Department of Computer Science, Stanford University,
Stanford
Chang P, Galley M, Manning CD (2008) Optimizing Chinese word segmentation for machine
translation performance. In: Proceedings of the third workshop on statistical machine translation. pp 224–232
Chomsky N (2002) Syntactic structures. Walter de Gruyter, New York
Cunningham H, Maynard D, Bontcheva K (2011) Text processing with GATE. Gateway Press,
Murphys, CA
Feldman R, Sanger J (2007) The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, Cambridge
Giesbrecht E, Evert S (2009) Is part-of-speech tagging a solved task? An evaluation of POS taggers
for the German Web as Corpus. In: Proceedings of the fifth Web as Corpus workshop. pp 27–35
Giménez J, Màrquez L (2004) SVMTool: a general POS tagger generator based on support vector
machines. In: Proceedings of the 4th international conference on Language Resources and
Evaluation (LREC'04). Lisbon
Güngör T (2010) Part-of-speech tagging. In: Indurkhya N, Damerau FJ (eds) Handbook of natural
language processing, 2nd edn. CRC/Taylor and Francis Group, Boca Raton
Hall J, Nilsson J, Nivre J (2010) Single malt or blended? A study in multilingual parser optimization. In: Trends in parsing technology. Springer, Berlin, pp 19–33
Hall D, Durrett G, Klein D (2014) Less grammar, more features. In: Proceedings of ACL. Baltimore,
pp 228–237
Hotho A, Nürnberger A, Paaß G (2005) A brief survey of text mining. LDV Forum 20:19–62
Huang C-R, Šimon P, Hsieh S-K, Prévot L (2007) Rethinking Chinese word segmentation: tokenization, character classification, or wordbreak identification. In: Proceedings of the 45th annual
meeting of the ACL on interactive poster and demonstration sessions. pp 69–72
Huffman SB (1996) Learning information extraction patterns from examples. In: Wermter S,
Riloff E, Scheler G (eds) Connectionist, statistical and symbolic approaches to learning for
natural language processing. Springer, Berlin, pp 246–260
Jurafsky D, Martin JH (2008) Speech and language processing: an introduction to natural language
processing, computational linguistics, and speech recognition, 2nd edn. Prentice Hall, New York
Kiss T, Strunk J (2006) Unsupervised multilingual sentence boundary detection. Comput Linguist
32:485–525
Klein D, Manning CD (2003) Accurate unlexicalized parsing. In: Proceedings of the 41st annual
meeting on Association for Computational Linguistics, vol 1. pp 423–430
Langacker RW (1997) Constituency, dependency, and conceptual grouping. Cogn Linguist 8:1–32
Manning CD (2011) Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In:
Gelbukh A (ed) Computational linguistics and intelligent text processing, 12th international
conference CICLing. Lecture notes in computer science. Springer, Berlin, pp 171–189
Manning CD, Surdeanu M, Bauer J, Finkel J, Bethard SJ, McClosky D (2014) The Stanford
CoreNLP natural language processing toolkit. In: Proceedings of 52nd annual meeting of the
Association for Computational Linguistics: system demonstrations. pp 55–60
Marantz A (1997) No escape from syntax: don't try morphological analysis in the privacy of your
own lexicon. University of Pennsylvania working papers in linguistics 4, p 14
Martins AFT, Smith NA, Xing EP (2009) Concise integer linear programming formulations for
dependency parsing. In: Proceedings of the joint conference of the 47th annual meeting of the
ACL and the 4th international joint conference on natural language processing of the AFNLP,
vol 1. pp 342–350
Mcnamee P, Mayfield J (2004) Character n-gram tokenization for European language text retrieval.
Inf Retr 7:73–97
Monroe W, Green S, Manning CD (2014) Word segmentation of informal Arabic with domain
adaptation. In: Proceedings of the 52nd annual meeting of the Association for Computational
Linguistics, vol 2 (short papers). ACL, Baltimore, pp 206211
Nivre J (2005) Dependency grammar and dependency parsing. MSI report 5133. pp 1–32
Nivre J, Hall J, Nilsson J, Chanev A, Eryigit G, Kübler S, Marinov S, Marsi E (2007) MaltParser:
a language-independent system for data-driven dependency parsing. Nat Lang Eng 13:95–135
Nugues PM (2006) Syntactic formalisms. In: Nugues PM (ed) An introduction to language processing with Perl and Prolog. Springer, Berlin, pp 243–275
Padró L, Stanilovsky E (2012) FreeLing 3.0: towards wider multilinguality. In: Proceedings of the
Language Resources and Evaluation Conference (LREC 2012). Istanbul, pp 2473–2479
Palmer DD, Hearst MA (1997) Adaptive multilingual sentence boundary disambiguation. Comput
Linguist 23:241–267
Piskorski J, Yangarber R (2013) Information extraction: past, present and future. In: Poibeau T,
Saggion H, Piskorski J, Yangarber R (eds) Multi-source, multilingual information extraction
and summarization. Springer, Berlin, pp 23–49
Porter MF (1980) An algorithm for suffix stripping. Program Electron Libr Inf Syst 14:130–137
Reynar JC, Ratnaparkhi A (1997) A maximum entropy approach to identifying sentence boundaries.
In: Proceedings of the fifth conference on applied natural language processing, ANLC'97.
ACL, Stroudsburg, pp 16–19
Schmid H (1994) Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the
international conference on new methods in language processing. Manchester
Tesnière L (1959) Éléments de syntaxe structurale. Librairie C. Klincksieck, Paris
Toutanova K, Klein D, Manning CD, Singer Y (2003) Feature-rich part-of-speech tagging with a
cyclic dependency network. In: Proceedings of the 2003 conference of the North American
chapter of the Association for Computational Linguistics on human language technology, vol 1.
pp 173–180
Wong DF, Chao LS, Zeng X (2014) iSentenizer-μ: multilingual sentence boundary detection
model. ScientificWorldJournal 2014. doi:10.1155/2014/196574
Chapter 3
Abstract This chapter concludes the presentation of the generic pipelined architecture
of Information Extraction (IE) systems by presenting its domain-dependent part.
After preparation and enrichment, the documents' contents are characterized and ready to be processed to locate and extract information. This chapter
explains how this can be performed, addressing both the extraction of entities and of relations between entities.
Identifying entities mentioned in texts is a pervasive task in IE. It is called Named
Entity Recognition (NER) and seeks to locate and classify textual mentions that
refer to specific types of entities, such as, for example, persons, organizations,
addresses and dates.
The chapter also dedicates attention to how to store the extracted information and
how to take advantage of semantics to improve the information extraction process,
presenting the basis of Ontology-Based Information Extraction (OBIE) systems.
Keywords Information extraction · Entities · Relations · Named entity recognition ·
NER · Parse tree · Dependencies · Ontology-based information extraction · OBIE
3.1
After preparation and enrichment, the documents' contents are characterized
and ready to be processed by algorithms that will locate and extract information
(Ratinov and Roth 2009). The type of information to be extracted depends on the
purpose of the application and can range from the detection of a defined set of relevant entities to an attempt to extract arbitrary information at Web scale, or
something in between.
The goal is to identify entities in texts and the relations they participate in, which
informally translates to discovering "who did what to whom, when and why" (Màrquez
et al. 2008). Entities to locate include people, organizations, locations, and dates,
while relations can be physical (near, part), personal or social (son, friend, business),
or membership (staff, member-of-group) (Bontcheva et al. 2009).
An example for locating people, using part-of-speech tags and word capitalization, is setting boundaries in sequences of proper nouns. Considering the example
depicted in Fig. 2.1, this simple method would allow isolating candidate entities for
people's names in the sentence:
[Figure excerpt (cf. Fig. 2.1): Wikipedia categories for the John Bardeen article: People from Madison, Wisconsin | American people of Russian descent | 1908 births | 1991 deaths | American agnostics | American electrical engineers | American Nobel laureates | American physicists | Foreign Members of the Royal Society | Nobel laureates in Physics | Nobel laureates with multiple Nobel awards | Oliver E. Buckley Condensed Matter Prize winners | Princeton University alumni | Quantum physicists | University of Wisconsin-Madison alumni; and for the Nobel Prize article: Academic awards | Awards established in 1895 | International awards | Science and engineering awards | Organizations based in Sweden | Nobel Prize]
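The boundary-setting method just described can be sketched in a few lines: group maximal runs of capitalized proper nouns into candidate entities. POS tags are assumed to be already assigned (tag names follow the Penn Treebank convention); the helper below is illustrative, not from any of the suites discussed.

```python
# Candidate person/organization names as maximal runs of capitalized
# proper nouns (NNP/NNPS) in a POS-tagged sentence.

def candidate_entities(tagged):
    candidates, current = [], []
    for word, tag in tagged:
        if tag in ("NNP", "NNPS") and word[:1].isupper():
            current.append(word)
        else:
            if current:
                candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates

sentence = [("John", "NNP"), ("Bardeen", "NNP"), ("won", "VBD"),
            ("the", "DT"), ("Nobel", "NNP"), ("Prize", "NNP"),
            ("twice", "RB")]
```

On this sentence the method isolates "John Bardeen" and "Nobel Prize" as candidate entities, which is exactly the granularity needed by the relation extraction step discussed in the next section.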
The recognition of generic named entities, such as people, locations, and dates,
can be done using suites such as OpenNLP, NLTK, or StanfordNLP, presented in
Chap. 2. For named entities of more specialized domains it can be difficult
to find a ready-to-use software package. One exception is the biomedical domain, for
which it is possible to find named entity recognizers. Becas1 (Nunes et al. 2013) and
KLEIO2 (Nobata et al. 2008) are two relevant examples of such tools.
3.2
Relation Extraction
Named entity recognition identifies the entities referred to in the documents but, by
itself, does not indicate in what kind of events those entities were involved, which is the
reason why they were mentioned in the first place. For that, it is necessary to know
what actions they are involved in, that is, to know the relations they
establish with other entities (Banko and Etzioni 2008; Schutz and Buitelaar
2005). This is an important task for applications that need a formal structure
for parts of the content of the document. Considering, again, the example of Fig.
2.1, detecting and classifying the entities "John Bardeen" and "Nobel Prize" is not
enough to know if both entities are related and, if they are, how they are related.
Already knowing that "John Bardeen" is a person and that "Nobel Prize" is an
award, possible relations would have John Bardeen as the winner, the
sponsor, a jury member, or someone that attended the ceremony of the award
"Nobel Prize".
A relation is a predication about a pair of entities. Examples of common relations
include relations of the types: (1) physical: located, near, part, etc.; (2) personal or
social: business, family, friend, etc.; (3) employment or membership: member-of,
employee, staff, etc.; (4) agent to artifact: user, owner, inventor, etc.; and (5) affiliation: citizen, resident, ideology, ethnicity, etc. In the example of John Bardeen and
the Nobel Prize, a relation between the entities is "John Bardeen winner_of Nobel
Prize". Differently from named entity recognition, relation extraction is not a process of annotating a sequence of tokens of the original document: relationships
express associations between two entities represented by distinct text segments
(Sarawagi 2008). Relations involving two or more objects and subjects are known
as events.
Approaches to relation extraction tend to steer away from using annotated corpus data due to the cost of creating such resources, and because there are other
available sources that, while not having the quality of an annotated corpus, can provide
high quality results when algorithms take advantage of large volumes of data.
Wikipedia is a popular learning source for relation extraction because its pages
often have some structured information (the infoboxes) summarizing the content of
1. http://bioinformatics.ua.pt/becas/#!/
2. http://www.nactem.ac.uk/Kleio/
It is possible to complement the surface information with POS tags. The advantage of taking POS into account comes from verbs playing a central role in the type of
relation. This allows improving the surface-based methods by better identifying
the words to be targeted by a relation extraction pattern. In the example of Fig. 2.1,
"<PERSON> is the only laureate to win the <AWARD>", if the relation extraction
pattern takes POS tags into account, the pattern would value the word "win" more
than all others in the decision process.
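A verb-anchored surface pattern of this kind can be sketched with a regular expression over text in which NER has already replaced entity mentions by placeholders. The pattern, the placeholder names, and the `winner_of` label are illustrative choices, not from a specific system.

```python
import re

# Surface pattern privileging forms of the verb "win" between a person
# placeholder and an award placeholder.
pattern = re.compile(
    r"<PERSON>\s+(?:\w+\s+)*?(?:win|wins|won|winning)\s+(?:\w+\s+)*?<AWARD>")

def winner_relation(sentence):
    """Return a (subject, relation, object) triple if the pattern matches."""
    if pattern.search(sentence):
        return ("<PERSON>", "winner_of", "<AWARD>")
    return None
```

The pattern fires on "<PERSON> is the only laureate to win the <AWARD> twice" but not on "<PERSON> attended the ceremony of the <AWARD>", which is the discrimination the POS-informed weighting of "win" is meant to capture.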
Some approaches use deep syntactic information for detecting relations. Such
approaches use a parse tree, whether a constituency or a dependency tree, as the basis
for the relation pattern to be matched (Miller et al. 2000; Rodrigues et al. 2011;
Suchanek et al. 2006). This type of approach usually allows extracting relations
from more complex and longer sentences. Its main disadvantage is the time necessary to compute the tree. To illustrate how this approach works, let us consider the
sentences "John Bardeen won the Nobel Prize twice" and "John Bardeen, an
American physicist and electrical engineer, won the Nobel Prize twice". Although
the sentences are distinct at the surface level, the dependency structures that strictly relate the
person with the award are the same (see Fig. 3.1).
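The invariance illustrated in Fig. 3.1 can be made concrete by extracting the path of dependency labels linking the two entities through their lowest common ancestor. The toy graphs below are hand-built (label names such as nn and appos follow the Stanford typed dependencies style), not parser output.

```python
# deps: dependent -> (head, label). The relation pattern is the label
# path between the two entity head words.

def entity_path(deps, a, b):
    def chain(n):                       # node and its ancestors up to the root
        nodes = [n]
        while n in deps:
            n = deps[n][0]
            nodes.append(n)
        return nodes
    up_a, up_b = chain(a), chain(b)
    lca = next(n for n in up_a if n in up_b)
    labels_a = [deps[n][1] for n in up_a[:up_a.index(lca)]]
    labels_b = [deps[n][1] for n in up_b[:up_b.index(lca)]]
    return labels_a + labels_b[::-1]

# "John Bardeen won the Nobel Prize twice"
s1 = {"John": ("Bardeen", "nn"), "Bardeen": ("won", "nsubj"),
      "the": ("Prize", "det"), "Nobel": ("Prize", "nn"),
      "Prize": ("won", "dobj"), "twice": ("won", "advmod")}

# "John Bardeen, an American physicist and electrical engineer, won ..."
s2 = dict(s1, **{"physicist": ("Bardeen", "appos"),
                 "an": ("physicist", "det"),
                 "American": ("physicist", "amod"),
                 "engineer": ("physicist", "conj")})
```

Both sentences yield the same path, nsubj followed by dobj, between "Bardeen" and "Prize": the apposition adds nodes to the graph but never sits on the path that relates the person to the award.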
Relation extraction can be readily done with StanfordNLP. This software suite
recognizes the following relations out of the box: Live_In, Located_In, OrgBased_In,
and Work_For (Angeli et al. 2014). As for relations of more specialized domains,
again, it can be difficult to find a ready-to-use software package, and again one
exception is the biomedical domain, where PIE the search3 (Kim et al. 2012), MEDIE4
(Miyao et al. 2006), and MedInx (Ferreira et al. 2012; Teixeira et al. 2014) are relevant examples of such tools.
Fig. 3.1 Two sentences with the same dependencies relating John Bardeen and the Nobel Prizes won
3.3
Having extracted the entities of the text and their respective relations, it is then necessary to store this information for later use in the context of the application (Cowie
and Lehnert 1996). For applications targeting fixed types of entities and relations, it
is suitable to store contents in a relational database. However, for applications targeting dynamic sets of relations, a more flexible framework is desirable: a knowledge base conforming to an ontology (Wimalasuriya and Dou 2010). Moreover,
several approaches have shown that ontology classes, properties, and restrictions can
be used to significantly improve the performance of the information extraction process. The success of this type of approach motivated the creation of the term
Ontology-Based Information Extraction (OBIE). For this reason, and because relational
databases are a well-known technology, only the storage of information using a knowledge base will be discussed here.
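The flexibility argued for above comes from storing facts as (subject, predicate, object) triples rather than as rows of a fixed schema. The minimal in-memory store below is a sketch of the idea only; a real system would use an RDF triple store whose vocabulary conforms to the ontology.

```python
# Facts as triples; None acts as a wildcard in queries, so new relation
# types can be added without changing any schema.

class TripleStore:
    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def query(self, s=None, p=None, o=None):
        return [(ts, tp, to) for ts, tp, to in self.triples
                if s in (None, ts) and p in (None, tp) and o in (None, to)]

kb = TripleStore()
kb.add("John_Bardeen", "rdf:type", "Person")
kb.add("Nobel_Prize", "rdf:type", "Award")
kb.add("John_Bardeen", "winner_of", "Nobel_Prize")
```

Adding a previously unseen relation such as `winner_of` needs no schema migration, which is exactly what a relational design with fixed tables cannot offer for dynamic relation sets.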
3.3.1
Ontology
3. http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/PIE/
4. http://www.nactem.ac.uk/tsujii/medie/
3.3.2
Ontology-Based Information Extraction
A system is said to implement OBIE when its IE process takes advantage of ontologies to improve the performance of the extraction. Typical OBIE systems use the
ontological properties to guide the IE process, whether by restricting the possible
arguments of a relation (e.g., the ontology can define that a person can receive a
Nobel Prize but a location cannot) or by inferring and confirming concealed information from the extracted facts. Still using the previous example, even if
the entity John Bardeen were not identified as a person, the ontology could force the
entity to be a person as it received a Nobel Prize. Another characteristic of OBIE systems is that the information extracted is represented using ontologies (Wimalasuriya
and Dou 2010).
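The restricting and inferring behaviour just described amounts to checking relation arguments against the domain and range declared in the ontology. The class names, the relations, and the dictionary encoding below are illustrative assumptions, not a real ontology format.

```python
# Hypothetical domain/range signatures for two relations.
ontology = {
    "winner_of": {"domain": "Person", "range": "Award"},
    "located_in": {"domain": "Organization", "range": "Location"},
}

def apply_relation(types, subject, relation, obj):
    """Infer missing entity types from the relation signature;
    reject arguments whose known type is incompatible."""
    sig = ontology[relation]
    for entity, expected in ((subject, sig["domain"]), (obj, sig["range"])):
        known = types.get(entity)
        if known is None:
            types[entity] = expected        # infer concealed information
        elif known != expected:
            raise ValueError(f"{entity} is a {known}, not a {expected}")
    return types

types = {"Nobel Prize": "Award"}            # John Bardeen's type is unknown
types = apply_relation(types, "John Bardeen", "winner_of", "Nobel Prize")
```

After the call, "John Bardeen" is typed as a Person, mirroring the example in the text; conversely, asserting that a Location won the prize would be rejected.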
OBIE approaches can differ in the IE process itself and in the
role played by the ontology in the extraction process. Regarding the IE process, the
different approaches were discussed earlier and can be based on surface patterns
(shallow approaches) or on morphosyntactic information, ranging from POS tags to
syntactic trees (deeper approaches). The role played by the ontology includes
defining the semantics of the extracted information and controlling the IE process. When
5. http://protege.stanford.edu/
the ontology is pre-built, the task is usually to fill the knowledge base with instance
and property values just as defined by the ontology (Cimiano et al. 2004; Saggion
et al. 2007; Wu et al. 2008). Nevertheless, the ontology plays an active role, as it
helps restrict relation arguments or discover concealed information by means
of its properties and using semantic reasoning. Relevant examples of this type of
approach are KIM (Popov et al. 2004), SOBA (Buitelaar et al. 2006), and OntoX
(Yildiz and Miksch 2007).
In addition to the role of controlling the IE process, the ontology can itself encode,
in its structure, the relations found in the documents. In this type of approach, the
ontology is created at runtime and may or may not be updated in later IE sessions.
Such an approach implies that the reasoning process is not defined a priori but is instead
conditioned by what was found in the text sources. This approach is called Open
IE, as it allows detecting instance candidates of arbitrary unknown relations (Banko
et al. 2007). A major challenge of Open IE is the higher level of error when compared with other IE approaches. Despite the common usage of shallow linguistic
analysis, heuristics based on lexical features, and frequency analysis, it is not easy to
filter out noisy or irrelevant information, due to difficulties in estimating the confidence of the learned rules (Moro et al. 2013). Most confidence estimation
approaches rely on redundant data, and some also use negative examples to filter out
wrong assumptions. Relevant examples of this type of approach are TextRunner
(Yates et al. 2007), Kylin (Wu et al. 2008), and NELL (Carlson et al. 2010).
References
Angeli G, Tibshirani J, Wu JY, Manning CD (2014) Combining distant and partial supervision for
relation extraction. In: Proceedings of the 2014 conference on empirical methods in natural
language processing (EMNLP)
Antoniou G, van Harmelen F (2009) Web Ontology Language: OWL. In: Staab S, Studer R (eds)
Handbook on ontologies, 2nd edn. International handbooks on information systems. Springer,
Berlin, pp 91110
Babych B, Hartley A (2003) Improving machine translation quality with automatic named entity
recognition. In: Proceedings of the 7th international EAMT workshop on MT and other language technology tools, improving MT through other language technology tools: resources and
tools for building MT. pp 1–8
Bach N, Badaskar S (2007) A review of relation extraction. In: Literature review for language and
statistics II
Banko M, Etzioni O (2008) The tradeoffs between open and traditional relation extraction. In:
Proceedings of ACL-08: HLT. pp 28–36
Banko M, Cafarella MJ, Soderland S, Broadhead M, Etzioni O (2007) Open information extraction
for the web. In: IJCAI. pp 2670–2676
Bizer C, Lehmann J, Kobilarov G, Auer S, Becker C, Cyganiak R, Hellmann S (2009) DBpedia: a
crystallization point for the web of data. Web Semant 7:154–165
Bontcheva K, Davis B, Funk A, Li Y, Wang T (2009) Human language technologies. In: Davies J,
Grobelnik M, Mladenic D (eds) Semantic knowledge management: integrating ontology management, knowledge discovery and human language technology. Springer, Berlin/Heidelberg,
pp 37–49
Chapter 4
4.1
Introduction
this topic is also to show and discuss the preparation of tools for languages not
provided out of the box.
The implementation described here has already been used to build applications that
extract and display information from local government public documents and, later,
health information in a specific domain related to Alzheimer's, Huntington's, and
Parkinson's diseases (Rodrigues 2013; Teixeira et al. 2014).
4.2
4.3
Architecture
[Fig. 4.1 diagram: documents feed a natural language processing stage (sentence split + POS tagging + NER + syntactic parsing); at training time, an ontology editor and example annotation over the domain representation support extraction model training; at runtime, the same NLP stage feeds semantic extraction & integration, which, together with external structured sources, populates the knowledge base]
Fig. 4.1 Architecture of the system. The top left-to-right arrow represents the training flow and the
bottom left-to-right arrow the runtime flow
4.4
performance as well as the ease of training and use. The selection presented here
is not intended to represent the single best solution. It is a good solution considering
the target natural language.
4.4.1
http://www.linguateca.pt/floresta/CoNLL-X/
The CoNLL-X format uses ten tab-separated columns per token (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The sentences "Um revivalismo refrescante" and "O 7_e_Meio é um ex-libris de a noite algarvia." from Bosque are represented as:

ID  FORM         LEMMA        CPOSTAG  POSTAG  FEATS            HEAD  DEPREL  PHEAD  PDEPREL
1   Um           um           art      art     <arti>|M|S       2     >N      _      _
2   revivalismo  revivalismo  n        n       M|S              0     UTT     _      _
3   refrescante  refrescante  adj      adj     M|S              2     N<      _      _

1   O            o            art      art     <artd>|M|S       2     >N      _      _
2   7_e_Meio     7_e_Meio     prop     prop    M|S              3     SUBJ    _      _
3   é            ser          v        v-fin   PR|3S|IND        0     STA     _      _
4   um           um           art      art     <arti>|M|S       5     >N      _      _
5   ex-libris    ex-libris    n        n       M|P              3     SC      _      _
6   de           de           prp      prp     <sam->           5     N<      _      _
7   a            o            art      art     <-sam>|<artd>|S  8     >N      _      _
8   noite        noite        n        n       F|S              6     P<      _      _
9   algarvia     algarvio     adj      adj     F|S              8     N<      _      _
10  .            .            punc     punc    _                3     PUNC    _      _
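Reading the CoNLL-X format illustrated above takes only a few lines; columns 7 (HEAD) and 8 (DEPREL) carry the dependency structure. The reader below is a minimal sketch for well-formed input, not the loader used by MaltParser itself.

```python
# Minimal CoNLL-X reader: blank lines separate sentences; each token
# line has ten tab-separated columns.

def read_conllx(block):
    sentences, current = [], []
    for line in block.strip().splitlines():
        if not line.strip():
            sentences.append(current)
            current = []
            continue
        cols = line.split("\t")
        current.append({"id": int(cols[0]), "form": cols[1],
                        "lemma": cols[2], "cpostag": cols[3],
                        "postag": cols[4], "feats": cols[5],
                        "head": int(cols[6]), "deprel": cols[7]})
    if current:
        sentences.append(current)
    return sentences

sample = ("1\tUm\tum\tart\tart\t<arti>|M|S\t2\t>N\t_\t_\n"
          "2\trevivalismo\trevivalismo\tn\tn\tM|S\t0\tUTT\t_\t_\n"
          "3\trefrescante\trefrescante\tadj\tadj\tM|S\t2\tN<\t_\t_\n")
```

On the three-token sentence above, the head of "refrescante" is token 2 ("revivalismo"), whose own head is 0, i.e., the sentence root.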
4 Extracting Relevant Information Using a Given Semantic
Sentence Splitting
Sentences are separated using the sentence boundary detector Punkt (Kiss and
Strunk 2006). Although Punkt had already been tested with Portuguese, it was not possible to obtain a model for splitting Portuguese sentences. Thus, a model was trained
using the Punkt tools and around 6,500 sentences randomly selected from Floresta
Sintá(c)tica. Training Punkt is straightforward as it does not have training parameters. The trained model was briefly tested with sentences from the same corpus
not included in the training data. The sentence splitting model was considered
ready, as the result obtained was F1 = 0.90, which is in line with the values reported
in the literature.
POS Tagging
After splitting, sentences are enriched with POS tags assigned by TreeTagger (Schmid
1994). There is a publicly available model for Portuguese (Garcia et al. 2014), but
an encoding problem with accented words motivated us to train a new model
for Portuguese. Training TreeTagger requires the creation of three files: (1) a lexicon file containing a list of words and their POS tags; (2) tagged training data
containing sentences with words and their POS tags, which can vary depending on the word's context in the sentence; and (3) an open class file containing the
POS tags the tagger can assign when guessing tags of unknown words. This file was kept
the same as for English: N ADJ V-FIN ADV.
All parameters controlling the training process were kept at their default values:
the number of preceding words forming the tagging context (context length, default
2); the threshold of information gain below which a leaf node of the decision tree is
deleted (minimum decision tree gain, default 0.7); and the weight of the class of words
with the same tag probabilities in the computation of the probability estimates
(equivalence class weight, default 0.15). The trained model's performance was measured against sentences of Floresta Sintá(c)tica, and the precision obtained was 0.92.
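A training invocation with the three files and the default parameters described above might look as follows. This is a sketch: the flag names follow the TreeTagger documentation and should be verified against the installed version, and the file names are placeholders.

```shell
# Train a Portuguese TreeTagger model from the lexicon, open class file,
# and tagged training data, with the default parameter values mentioned
# in the text (-cl context length, -dtg decision tree gain,
# -ecw equivalence class weight).
train-tree-tagger -cl 2 -dtg 0.7 -ecw 0.15 \
    lexicon.txt open-class.txt tagged-training-data.txt portuguese.par
```

The resulting parameter file (here `portuguese.par`) is then passed to `tree-tagger` at tagging time.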
Category                    Types
Abstraccao (abstraction)    Disciplina (discipline); estado (state); ideia (idea); nome (name); outro (other)
Acontecimento (happening)   Efemeride (unique occurrence, news); evento (event); organizado (organized); outro (other)
Coisa (thing)               Classe (class); membroclasse (member of class); objecto (object); substancia (substance); outro (other)
Local (place)               Fisico (physical); humano (human, political); virtual (virtual); outro (other)
Numero (number)             Numeral (numeral); ordinal (ordinal); textual (textual)
Obra (work)                 Arte (art); plano (plan); reproduzida (reproduced); outro (other)
Organizacao (organization)  Administracao (administration); empresa (enterprise); instituicao (institution); outro (other)
Pessoa (person)             Cargo (job, position); grupocargo (position category); grupoind (undefined group); grupomembro (group); individual (individual); membro (member of group); povo (people); outro (other)
Tempo (time)                Duracao (duration); frequencia (frequency); generico (generic); tempo_calend (calendar time); outro (other)
Valor (value)               Classificacao (classification, ranking); moeda (currency); quantidade (amount); outro (other)
Outro (other)
Syntactic Parsing
The syntactic parsing is done with MaltParser, a dependency parser (Hall et al.
2007). The parsing algorithm used was the same of the Single Malt system, a
pseudo-projective dependency parsing with support vector machines (Hall et al.
2007; Nivre et al. 2006). The parsing model for Portuguese was induced with
Bosque v7.3 used in the CoNLL-X shared task: multi-lingual dependency parsing.
The outputs of POS tagging and NER are used to generate the input for the syntactic parser. Named entities will have their own word forms as lemma and, as POS
tag, the tag relative to proper nouns when their word forms are character strings, or
relative to numbers if their word form is a numeric sequence. After merging the
outputs of POS tagging and NER, sentences are analyzed to determine its grammatical structure.
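The merging rule for named entities can be sketched as follows. The tag names `prop` and `num` follow the Bosque tagset used in the text; the tuple-based input format is an assumption made for the example, not the prototype's actual interface.

```python
# Merge POS tagging and NER outputs into parser input rows:
# named entities keep their word form as lemma and get the proper-noun
# tag ("prop"), or the number tag ("num") for numeric word forms.

def merge_pos_ner(tokens):
    """tokens: list of (form, postag, is_entity) -> (form, lemma, postag)."""
    rows = []
    for form, postag, is_entity in tokens:
        if is_entity:
            lemma = form
            postag = "num" if form.isdigit() else "prop"
        else:
            lemma = None                # left to the morphological lexicon
        rows.append((form, lemma, postag))
    return rows

tokens = [("John_Bardeen", "n", True), ("ganhou", "v-fin", False),
          ("2", "adj", True)]
```

Here the entity "John_Bardeen" is relabeled `prop` with itself as lemma, while "2" becomes `num`, overriding whatever tag the POS tagger guessed.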
4.4.2
Domain Representation
The creation of an application domain starts with the design or adaptation of an ontology. Ontologies can be difficult to build because they are formal models of human domain knowledge, which is often tacit, and there is often more than one possible mapping of that knowledge into formal, discrete structures. Although some rules of thumb
exist to help in ontology design, it is more productive to have tools that, at least, identify simple conflicts and allow rapid redesign of ontology parts. In this work the selected ontology editor was Protégé (Knublauch et al. 2004).
Ontology editors are tools that assist in the creation, manipulation, and maintenance of ontologies. They can work with various representation formats and, among other things, provide ways to merge, visualize, and check the semantic consistency of ontologies (Noy et al. 2000). Protégé is an open-source tool developed at Stanford. Relevant features are its ability to assist users in ontology construction, including importing and merging ontologies, and the existence of several plugins providing alternative visualization mechanisms and alternative inference engines.
After the application ontology is defined, it is necessary to provide examples of its classes and relations in representative texts. The prototype uses a version of the AKTiveMedia2 ontology-based annotation system that was customized to generate outputs in the same format as the inputs used by the relation learning algorithm developed. Considering a relation triple <subject, relation, object>, the format defined is:
4.4.3
The model training process generates one semantic extraction model for each ontology class and one for each ontology relation found in the seed examples. Thus each model represents a specific ontology class or relation. A model is a set of syntactic structure examples and counterexamples that were found to encode and restrict the meaning represented by the model. It also contains a statistical classifier that measures the similarity between a given structure to be evaluated and the model's internal examples. The model is said to have positively evaluated a sentence fragment if the similarity is higher than a given threshold.
2 http://ftp.jaist.ac.jp/pub/sourceforge/a/ak/aktivemedia/
Fig. 4.2 Screenshot of AKTiveMedia annotation interface. The top left pane shows the ontology
classes and the pane below shows the possible properties for the selected class. The larger pane
shows the document and the annotations highlighted according to the class and property selected
At runtime, each sentence is evaluated by all models, and the fragments positively evaluated by a model are assigned the ontology class or relation represented by that model (Rodrigues et al. 2011a, b).
Unlike the previous two parts of the prototype, which have single usage sequences, this part has two different and interrelated ways of being used. The way it is used depends on the task to be performed: (1) creation of semantic extraction models; (2) usage of semantic extraction models to feed a knowledge base.
The second work is about improving entity and relation extraction when the process is learned from a small number of labelled examples, using linguistic information and ontological properties (Carlson et al. 2009). Improvements are obtained using the class and relation hierarchy, information about disjunctions, and fact confidence scores. This information is used to bootstrap more examples, generating more data to train statistical classifiers. For instance, when the system is confident about a fact, as when it was annotated by a person, this fact is used as an instance of the annotated class and/or relation. The fact can also be used as a counterexample of all classes/relations disjoint with the annotated class/relation, and as an instance of the super-class/super-relation. Moreover, facts discovered by the system with a high confidence score can be promoted to examples and included in a new round of training. This creation of more examples is not active by default, as it can lead to overfitting, and should be used carefully.
A semantic extraction model contains a collection of partial syntactic structures relative to either examples or counterexamples of the ontology class or property encoded by the model. To obtain these structures, the sentences that originated the examples are located and processed by the NLP part of the prototype. Each annotated example has the format <subject-class: subject-text> <relation-name> <object-class: object-text> and originates three facts:
subject-text is an individual of class subject-class;
object-text is an individual of class object-class;
subject-text has relation relation-name with object-text.
The partial syntactic structures associated with the first two facts associate subjects and/or objects to their ontological classes based on the syntactic dependencies between the subject/object token and the other tokens of the sentence (Rodrigues et al. 2011b). These models store a collection of pairs for each token that represents the subject/object. Two entities are regarded as equivalent if they connect to the same lemmata using the same dependencies (graph edges), although lemmata of nouns and adjectives are allowed to differ. Using the previous example of John Bardeen, Fig. 4.3 depicts the data stored by the model that characterizes John Bardeen as a person. In this case, every entity that is the subject of the verb "win" is a candidate to be a person.
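The equivalence test (same dependencies to the same lemmata, with lemmata of nouns and adjectives allowed to differ) can be sketched as follows. The Link class, the tag names, and the positional comparison are simplifying assumptions, not the prototype's actual data structures.

```java
import java.util.List;

// Sketch of the equivalence test: an entity token is characterised by the
// (dependency label, lemma, coarse POS) triples linking it to the rest of
// the sentence.  Two links match when the dependency labels agree and the
// lemmata agree, except for noun and adjective lemmata.
public class EntityEquivalence {

    static final class Link {
        final String dep, lemma, pos;
        Link(String dep, String lemma, String pos) {
            this.dep = dep; this.lemma = lemma; this.pos = pos;
        }
    }

    static boolean linksMatch(Link a, Link b) {
        if (!a.dep.equals(b.dep)) return false;
        boolean relaxed = a.pos.equals("n") || a.pos.equals("adj");
        return relaxed ? a.pos.equals(b.pos) : a.lemma.equals(b.lemma);
    }

    // simplification: links are compared positionally
    static boolean equivalent(List<Link> a, List<Link> b) {
        if (a.size() != b.size()) return false;
        for (int i = 0; i < a.size(); i++)
            if (!linksMatch(a.get(i), b.get(i))) return false;
        return true;
    }

    public static void main(String[] args) {
        // the model stores "subject of the verb win"; any entity carrying
        // the same link is a candidate to be a person
        List<Link> model = List.of(new Link("subj", "win", "v"));
        List<Link> candidate = List.of(new Link("subj", "win", "v"));
        System.out.println(equivalent(model, candidate)); // prints true
    }
}
```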
The third fact, the relation, generates subject/object pairs based on the shortest graph path between the elements of the pair. Two paths are regarded as equivalent if they have the same sequence of nodes and edges, although nodes holding nouns and adjectives are allowed to differ. Figure 4.4 depicts the path used by the relation models to associate John Bardeen with the Nobel Prize win.
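A minimal sketch of extracting such a path, treating the dependency tree as an undirected graph and using breadth-first search (a standard way to obtain shortest paths in unweighted graphs; the prototype's actual implementation may differ):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// BFS over token indices; the path between the subject and object tokens
// is read back from the predecessor map.
public class ShortestDepPath {

    static List<Integer> shortestPath(Map<Integer, List<Integer>> adj, int from, int to) {
        Map<Integer, Integer> pred = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        pred.put(from, from);
        queue.add(from);
        while (!queue.isEmpty()) {
            int node = queue.poll();
            if (node == to) break;
            for (int next : adj.getOrDefault(node, List.of()))
                if (!pred.containsKey(next)) {
                    pred.put(next, node);
                    queue.add(next);
                }
        }
        if (!pred.containsKey(to)) return List.of();   // no path found
        LinkedList<Integer> path = new LinkedList<>();
        for (int node = to; node != from; node = pred.get(node)) path.addFirst(node);
        path.addFirst(from);
        return path;
    }

    public static void main(String[] args) {
        // toy dependencies for "John Bardeen won the Nobel Prize":
        // 1 John_Bardeen -- 2 won -- 4 Prize -- 3 Nobel
        Map<Integer, List<Integer>> adj = Map.of(
            1, List.of(2),
            2, List.of(1, 4),
            3, List.of(4),
            4, List.of(2, 3));
        System.out.println(shortestPath(adj, 1, 4)); // prints [1, 2, 4]
    }
}
```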
Fig. 4.3 Dependency links that are used by the model to characterize John Bardeen as a person
The semantic extraction models also contain a statistical classifier that decides whether previously unseen syntactic structures are similar to the ones stored by the model. Structures considered similar enough are assigned the meaning of the model; otherwise they are ignored. The statistical classifiers implemented in the prototype are based on the k-Nearest Neighbor algorithm, but others could be used (Rodrigues et al. 2011a, b).
The classifier training process starts by removing duplicate entries. Then counterexamples are searched for. As it is assumed that all relations are marked in the sample documents, these documents are searched for relation counterexamples by having the relation classifiers evaluate all sentences of the sample documents. The counterexamples are all positively evaluated sentences that are not part of the example set. This process is repeated until the number of counterexamples found falls below a certain threshold. Rodrigues (2013) provides a detailed description of this process.
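The counterexample-mining loop can be sketched as follows. The feature-set representation, the Jaccard similarity, and the 0.3 threshold are simplifying assumptions standing in for the prototype's k-Nearest Neighbor classifier over syntactic structures.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of mining counterexamples: structures are reduced to feature
// sets, similarity is Jaccard overlap against the closest stored example
// (a crude 1-nearest-neighbour stand-in), and accepted sentences that are
// not annotated examples become counterexamples until none are found.
public class CounterexampleMining {

    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // accept when close enough to a positive example and not closer
    // to a stored counterexample
    static boolean accepts(List<Set<String>> examples, List<Set<String>> counter,
                           Set<String> s, double threshold) {
        double best = examples.stream().mapToDouble(e -> jaccard(e, s)).max().orElse(0);
        double worst = counter.stream().mapToDouble(c -> jaccard(c, s)).max().orElse(0);
        return best >= threshold && best > worst;
    }

    public static void main(String[] args) {
        List<Set<String>> examples = List.of(Set.of("subj:win", "obj:prize"));
        List<Set<String>> counter = new ArrayList<>();

        // all sentences of the sample documents; only the first is annotated
        List<Set<String>> sentences = List.of(
            Set.of("subj:win", "obj:prize"),
            Set.of("subj:win", "obj:match"));

        int found;
        do {
            found = 0;
            for (Set<String> s : sentences)
                if (!examples.contains(s) && !counter.contains(s)
                        && accepts(examples, counter, s, 0.3)) {
                    counter.add(s);
                    found++;
                }
        } while (found > 0);
        System.out.println(counter.size()); // prints 1
    }
}
```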
References
Afonso S, Bick E, Haber R, Santos D (2002) Floresta sintá(c)tica: a treebank for Portuguese. In: Proceedings of the third international conference on Language Resources and Evaluation (LREC), pp 1698–1703
Bizer C, Lehmann J, Kobilarov G, Auer S, Becker C, Cyganiak R, Hellmann S (2009) DBpedia: a crystallization point for the Web of Data. Web Semant 7:154–165
Cardoso N (2008) REMBRANDT: Reconhecimento de Entidades Mencionadas Baseado em Relações e ANálise Detalhada do Texto. In: Mota C, Santos D (eds) Desafios na Avaliação Conjunta do Reconhecimento de Entidades Mencionadas: O Segundo HAREM. Linguateca, pp 195–211
Carlson A, Betteridge J, Hruschka ER, Mitchell TM (2009) Coupling semi-supervised learning of categories and relations. In: SemiSupLearn '09: Proceedings of the NAACL HLT 2009 workshop on semi-supervised learning for natural language processing. Association for Computational Linguistics, Stroudsburg, pp 1–9
Chakravarthy A, Ciravegna F, Lanfranchi V (2006) Cross-media document annotation and enrichment. In: SAAW2006: Proceedings of the 1st Semantic Authoring and Annotation Workshop
Freitas C, Rocha P, Bick E (2008) Floresta Sintá(c)tica: bigger, thicker and easier. In: Teixeira A, de Lima V, de Oliveira L, Quaresma P (eds) PROPOR 2008: Proceedings of the international conference on computational processing of the Portuguese language. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 216–219
Garcia M, Gamallo P, Gayo I, Cruz MAP (2014) PoS-tagging the Web in Portuguese. National varieties, text typologies and spelling systems. Nat Lang Process 53:95–101
Hall J, Nilsson J, Nivre J, Eryiğit G, Megyesi B, Nilsson M, Saers M (2007) Single malt or blended? A study in multilingual parser optimization. In: Proceedings of the CoNLL shared task session of EMNLP-CoNLL 2007. Association for Computational Linguistics, Prague, pp 933–939
Kiss T, Strunk J (2006) Unsupervised multilingual sentence boundary detection. Comput Linguist 32:485–525
Knublauch H, Fergerson R, Noy N, Musen M (2004) The Protégé OWL plugin: an open development environment for semantic web applications. In: McIlraith S, Plexousakis D, van Harmelen F (eds) The Semantic Web, ISWC 2004: Proceedings of the 3rd international Semantic Web conference. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 229–243
Kübler S, McDonald R, Nivre J (2009) Dependency parsing. In: Synthesis lectures on human language technologies, vol 2. Morgan & Claypool, San Rafael
Màrquez L, Klein D (eds) (2006) CoNLL-X: Proceedings of the tenth conference on computational natural language learning. Omnipress, New York
Mota C, Santos D (eds) (2008) Desafios na avaliação conjunta do reconhecimento de entidades mencionadas: O Segundo HAREM. Linguateca
Nivre J, Hall J, Nilsson J, Chanev A, Eryiğit G, Kübler S, Marinov S, Marsi E (2006) Labeled pseudo-projective dependency parsing with support vector machines. In: CoNLL-X: Proceedings of the 10th conference on computational natural language learning. Association for Computational Linguistics, Stroudsburg, pp 221–225
Noy N, Fergerson R, Musen M (2000) The knowledge model of Protégé-2000: combining interoperability and flexibility. In: Dieng R, Corby O (eds) EKAW 2000: Proceedings of the 12th international conference on knowledge engineering and knowledge management. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 69–82
Rodrigues M (2013) Model of access to natural language sources in electronic government. Ph.D. Thesis, University of Aveiro
Rodrigues M, Dias GP, Teixeira A (2011a) Criação e acesso a informação semântica aplicada ao governo eletrónico. Linguamática 3:55–68
Rodrigues M, Dias GP, Teixeira A (2011b) Ontology driven knowledge extraction system with application in e-government. In: Proceedings of the 15th Portuguese conference on artificial intelligence, Lisboa, pp 760–774
Schmid H (1994) Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the international conference on new methods in language processing, Manchester
Sirin E, Parsia B (2004) Pellet: an OWL DL reasoner. In: Haarslev V, Möller R (eds) DL 2004: Proceedings of the 2004 international workshop on description logics, CEUR workshop proceedings, pp 212–213
Suchanek FM, Ifrim G, Weikum G (2006) LEILA: learning to extract information by linguistic analysis. In: Proceedings of the 2nd workshop on ontology learning and population: bridging the gap between text and knowledge. Association for Computational Linguistics, Sydney, pp 18–25
Teixeira A, Ferreira L, Rodrigues M (2014) Online health information semantic search and exploration: reporting on two prototypes for performing extraction on both a hospital intranet and the world wide web. In: Neustein A (ed) Text mining of web-based medical content. De Gruyter, Berlin, pp 49–73
Weibel S, Kunze J, Lagoze C, Wolf M (1998) Dublin core metadata for resource discovery. Internet Engineering Task Force RFC 2413
Chapter 5
Application Examples
5.1
A Tutorial Example
Fig. 5.1 Screenshot of Wikipedia page about Robert Andrews Millikan. The content used in the
example is inside the dashed area (http://en.wikipedia.org/wiki/Robert_Andrews_Millikan)
5.1.1
The example uses Stanford CoreNLP since it is an NLP pipeline featuring a command line interface. This is a desirable feature since it allows rapidly assessing the performance of a prototype, and it is important for this tutorial example because not having to implement custom code makes the steps clearer and easier to understand. Stanford CoreNLP can be obtained at the download area of its web page.2 For reference, in December 2014 the downloaded filename was stanford-corenlp-full-2014-10-31.zip and the file size was around 251 MB.
Apache OpenNLP is another NLP pipeline featuring a command line interface. OpenNLP was not preferred because its process of identifying named entities does not take advantage of features such as part-of-speech tags, although OpenNLP also implements a POS tagger, which prevents it from performing as well as Stanford CoreNLP. For instance, when tested with the example document, OpenNLP fails to detect the single token "Millikan" as a person in the sentence "Millikan graduated from …". As syntactic parsing is done based on POS tags, such a design option implies writing custom software in order to have named entities alongside POS tags and included in the syntactic parses.
Other software suites such as NLTK and Freeling were not selected for this
example as they require writing some programming code, and thus the complexity
of the example would increase without a clear benefit. For instance, NLTK needs to
be invoked in Python and Freeling requires C++.
5.1.2
Tools Setup
The latest version of Stanford CoreNLP requires a Java Virtual Machine (JVM) able to run Java 8. Recent operating systems should already have this version, or a more recent one, installed. If not, a recent JVM can be obtained from the Java download page at Oracle's website.3
After downloading Stanford CoreNLP, it is necessary to unzip it to a desired location. It will be assumed that the unzipped folder is the working directory. Before starting intensive processing on large documents, which will take some time to complete, it is possible and recommended to check that everything is working as it should by doing some processing on a small text file. Let us use the sample file input.txt provided with Stanford CoreNLP. As there is already a corresponding output file, input.txt.xml, let us preserve it, copy the file input.txt to a file named testinput.txt, and run the following command (see command explanation in Table 5.1):

java -Xmx3g -cp * edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -file testinput.txt
2 http://nlp.stanford.edu/software/corenlp.shtml
3 http://www.oracle.com/technetwork/java/javase/downloads/index-jsp-138363.html
Table 5.1 Explanation of the command

java -Xmx3g -cp *
Invocation of the Java Virtual Machine with two parameters: (1) -Xmx3g, to limit the maximum heap size to 3 GB, which is enough but can be increased if necessary and if there is more RAM in the system; (2) -cp *, to say that the classpath of java should be expanded to all JAR files in the current folder. The classpath is a parameter that indicates where the user-defined classes and packages can be found

edu.stanford.nlp.pipeline.StanfordCoreNLP
The Java class implementing the CoreNLP controller

-annotators tokenize,ssplit,pos,lemma,ner
Parameter to specify the annotators to be used. In the same order as in the command: tokenizer (tokenize), sentence splitter (ssplit), part-of-speech tagger (pos), lemmatizer (lemma), and named entity recognizer (ner)

-file testinput.txt
Parameter specifying the file to be processed. The output file will have the same name plus the suffix .xml
A computer with a Core2 processor and 4 GB of RAM takes around 50 s to complete the command. The result is a file named testinput.txt.xml containing some parts that are similar to the file input.txt.xml. The files are not completely equal because the command did not use the full NLP pipeline, in order to avoid the most time-consuming tasks.
5.1.3
This example will consider that the textual content of the Wikipedia page about Robert Andrews Millikan is saved in a text file named r-millikan.txt. Repeating the previous command with the target document takes around 5 min on the same machine:

java -Xmx3g -cp * edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -file r-millikan.txt
The result is a file named r-millikan.txt.xml. Figure 5.2 shows the result corresponding to the beginning of the file when opened in a web browser. The leftmost column contains the IDs of the tokens, and the second column presents the respective token
Fig. 5.2 Beginning of the file r-millikan.txt.xml when opened in a web browser corresponding
to the beginning of the first paragraph of the document
4 http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
For example, with some XSLT writing skills it is possible to obtain the named entity list presented in Fig. 5.3 by changing parts of the file CoreNLP-to-HTML.xsl included in CoreNLP (see Fig. 5.4 for details). It is strongly recommended to make a backup of the file CoreNLP-to-HTML.xsl before doing any change to it. After the backup, open CoreNLP-to-HTML.xsl with a text editor and make the following changes:
In line 48 replace "Sentences" with "Named Entities Found". This is the title of the results table;
Comment out lines 57–60, using <!-- to begin the comment block and --> to end it;
Replace lines 67–189 with the code explained in Fig. 5.4.
These named entities can be further isolated by removing all HTML formatting from the file CoreNLP-to-HTML.xsl. Also, instead of viewing the file in a web browser, it is possible to generate a file with that content by using an XSLT transformation tool such as SAXON.5 The home edition is free, and the Java version, after downloading and unzipping the folder, can be used as follows (command executed inside the unzipped folder; see command explanation in Table 5.2):
java -cp * net.sf.saxon.Transform -s:r-millikan.txt.xml -xsl:CoreNLP-to-HTML.xsl -o:result.txt

5 http://saxon.sourceforge.net
Fig. 5.4 XSLT code inserted in CoreNLP-to-HTML.xsl to isolate the named entities. Lines 67–71 define that the XSLT template "tokens" is applied to each sentence. Lines 73–105 define the template "tokens"
Table 5.2 Explanation of the command

java -cp *
Invocation of the Java Virtual Machine with a parameter (-cp *) to say that the classpath of java should be expanded to all JAR files in the current folder

net.sf.saxon.Transform
The Java class implementing the XSLT processor

-s:r-millikan.txt.xml
Parameter to specify the input file

-xsl:CoreNLP-to-HTML.xsl
Parameter to specify the XSL file

-o:result.txt
Parameter to specify the output file
The list of named entities obtained can then be used for several tasks, including the automatic creation of webpage meta tags to improve the page's visibility.
5.1.4
At the time of writing, Stanford CoreNLP is provided with models for processing English, and it is also possible to download and use models for Chinese. To process other languages it is necessary to download the individual components (Stanford POS tagger, Stanford Parser, etc.) and assemble the pipeline. To extend this example to produce syntactic structures, it is just necessary to add the corresponding annotator to the pipeline. For syntactic structures based on a phrase structure grammar or on a dependency grammar, the commands are respectively:
java -Xmx3g -cp * edu.stanford.nlp.pipeline.StanfordCoreNLP
-annotators tokenize,ssplit,pos,lemma,ner,parse -file r-millikan.txt
5.2
This example illustrates how it is possible to build applications that extract structured information, with semantic annotations about entities and their relations, by processing batches of documents. The importance of extracting, structuring, and associating semantic meaning to information lies in the possibility of developing computer algorithms that are able to automatically manipulate such data. This makes it possible to create applications that meet end users' needs by presenting information in several distinct formats, in intuitive and appealing ways.
5.2.1
Goals
The application described here presents the information extracted from natural language sources in distinct formats such as maps and tables, and is able to perform accurate data searches, all by benefitting from the semantic annotations. This example has two objectives. The first is to illustrate how the architecture presented in Chap. 4 can be instantiated. Besides an NLP pipeline, the core of the first example, it is necessary to add components that allow defining the application semantics and others that learn to make the correspondence between the morphosyntactic
data and the semantic concepts. With these components in place, it is then possible to use out-of-the-box semantic reasoners that are able to infer new information from the information extracted from texts.
The second objective is related to using software tools from different sources. Quite often, the most appropriate software for one task, e.g. part-of-speech tagging, was developed by one team, whereas tools for other tasks, e.g. syntactic parsing, were developed by other teams. If one wants to use the best components then, on top of the software engineering challenges, which are out of scope here, it is often necessary to train the software to process other natural languages. Targeting contents written in Portuguese, or in any other language for which models are not usually provided with natural language processing tools, involves obtaining adequate corpora and preparing and conducting model training.
5.2.2
Documents
The natural language documents to be processed in this example are the minutes of
municipal meetings of the municipalities belonging to Aveiro district, in Portugal.
This set of documents was selected for the following reasons:
NLP models for processing Portuguese are not usually included in the software as provided. Portuguese is the sixth most spoken language in the world, and is therefore a relevant choice as an example of the creation of NLP models. Moreover, Portuguese is the native language of the authors, which is important for assessing the quality of the NLP results, and Aveiro is the area of their affiliation.
In Portugal, municipalities have responsibilities regarding land management, the granting of subsidies, and the establishment of protocols with local organizations. It is important to have this kind of information readily available to the public to foster local government transparency (Rodrigues et al. 2010, 2013).
Minutes of municipal meetings are usually made available in PDF format and often contain long and complex sentences. These characteristics make them ambitious targets to process and, in principle, a system able to process such data should also be able to handle data sources with shorter and less complex sentences.
5.2.3
Documents can be obtained by entering the uniform resource locator (URL) of the page in a web browser and saving it to a local file, or by using software such as cURL or Wget to download and save them as local files. cURL is standard in Linux and MacOS, and for Windows it can be obtained freely.6 However, cURL is designed to download a single document, not batches of documents as necessary in this application.
6 http://curl.haxx.se/download.html
Table 5.3 Explanation of the command

-r --level 2
Use recursion up to two hops from this point. The level can be increased, which should cause the download of files less probably related with this page

--accept pdf
Download only files ending with pdf. More endings can be added by separating them with commas. For example, --accept pdf,docx would download files ending with pdf or with docx

--limit-rate=20k -D cm-sjm.pt
Limit the download rate to 20 kB per second and only download files from the web domain cm-sjm.pt. The speed limitation is used so as not to overload the host server, and limiting the domain avoids getting files from other websites

http://www.cm-sjm.pt/34
The starting point for downloading
The best option is to use Wget, which can be obtained freely from Linux repositories or from the project website.7 For Windows there is a port at SourceForge.8
To retrieve a batch of documents it is better to use Wget in recursive mode. In this mode, Wget retrieves a web page and follows its links up to a desired level of recursion. It is necessary to take special care to avoid downloading unrelated documents and to avoid overloading website hosts. To retrieve all PDF files from the page http://www.cm-sjm.pt/34, one possible command is (see explanation in Table 5.3):
wget -r --level 2 --accept pdf --limit-rate=20k -D cm-sjm.pt http://www.cm-sjm.pt/34
7 https://www.gnu.org/software/wget/
8 http://gnuwin32.sourceforge.net/packages/wget.htm
9 https://pdfbox.apache.org/
10 http://poi.apache.org/
5.2.4
Application Setup
Ontology Creation
Without pretending to discuss in detail the state of the art and the challenges of knowledge engineering, this application uses an ontology with a reduced level of expressiveness, meaning that it uses a reduced set of ontological classes and properties. The reasons are twofold: (1) a higher expressiveness of the ontology requires a bigger set of seed examples to cover all hypotheses. Rules and relations not covered by the examples are as if they did not exist, and example annotation can be a burdensome task if too many examples are required. Also, increasing the detail of the ontology can lead to a performance decrease in the learning process, because the learning process bases its decisions on the NLP data, which is bound to the NLP pipeline and thus does not increase in granularity by means of the ontology; (2) considering the current state of the art in ontological reasoning and data access, it is well known that performance in terms of speed decreases as the volume of data stored in a knowledge base conforming to an ontology increases. Moreover, that performance decrease is more pronounced when ontologies are more expressive (Möller et al. 2013; Peters et al. 2013). It is possible to mitigate this effect, but that will not be addressed here.
The information to be handled by the application implies that the ontology includes concepts about people, locations, and specific concepts relative to municipalities. It is good practice to use, as much as possible, standardized concepts, in the spirit of open data and application interoperability. Thus the application ontology is composed of three classes added specifically to handle the municipal concepts, plus four publicly available ontologies to handle information about people and places and to refer to the documents from which the information was extracted. These four ontologies are described in Table 5.4, and the three classes for municipal subjects, all defined as subclasses of the top-level class Thing, are:
Build permit: represents construction contracts in execution, whether public or private.
Protocol: represents protocols signed with local institutions such as schools or sports clubs.
Subsidy: represents subsidies requested by any entity, whether granted or not.
Object type properties and data type properties were also defined. Object type properties are references to other objects and thus represent relations between ontology objects. Data type properties are properties that will have a
Table 5.4 Description of the four ontologies used in the composition of the application ontology

Dublin Core: Allows describing resources such as documents and video (Weibel et al. 1998). For this work the classes used were Title and Source. It can be obtained at http://dublincore.org/documents/dc-rdf/

Friend of a Friend: Defines terms such as people, groups, and documents (Brickley and Miller 2010). For this application the relevant classes are Person and Organization, and the relevant property is name. It can be obtained at http://www.foaf-project.org/

Geo-Net-PT: A geographic ontology of Portugal which encodes the organization of spaces, for instance which streets belong to a neighborhood, which in turn belongs to a city, which belongs to a municipality, etc. (Lopez-Pellicer et al. 2009). For this application the relevant class is Municipality. It can be obtained at http://www.linguateca.pt/GeoNetPt/

WGS84: The geodetic reference system used in the global positioning system (GPS). For this application the relevant class is SpatialThing and the relevant properties are lat, long, and alt, representing latitude, longitude, and altitude, respectively. It can be obtained at http://www.w3.org/2003/01/geo/
value and thus work as values specific to a given object. These classes have two object type properties and four data type properties. The object type properties are: (1) place, referring to the address of the entity who signed the protocol or requested a subsidy; (2) requester, the reference to the entity or entities, excluding the municipality, involved in the request. The four data type properties are: (1) deliberation, representing the outcome of the request; (2) identifier, the unique identifier given by the municipal services; (3) money amount, the amount of money involved in the protocol or subsidy; (4) motivation, the motive of the protocol or for requesting the subsidy.
Merging these four ontologies plus the three specific classes is straightforward when using an ontology editing tool such as Protégé.11 As Friend of a Friend defines its alignment with WGS84 and Dublin Core using the standard Simple Knowledge Organization System (SKOS), Protégé automatically aligns them correctly. As for Geo-Net-PT, since it does not share concepts with the other ontologies, it is placed in an independent subclass of Thing, and the same applies to the classes Build Permit, Protocol, and Subsidy. Figure 5.5 presents a perspective of the ontology.
11 http://protege.stanford.edu/
This excerpt mentions a subsidy and a protocol. The subsidy is for an institution named GIP, is intended to pay a technician, amounts to 2,499, and was approved. The subsidy is justified by a protocol established with an association named Associação de Jovens Ecos. The annotations corresponding to this seed example are presented and explained in Table 5.5.
It is possible to use a graphical user interface to assist the annotation process. When all annotations are done, an automated process finds the syntactic graphs corresponding to the relevant sentences and selects the parts of each graph that link the object to the subject of the relation. This ends the application setup, and it is now possible to generate the semantic extraction models based on the examples. Afterwards those models are used on arbitrary documents to extract information of the same kind as the annotated examples.
Table 5.5 Annotations relative to the example excerpt of a document and respective explanation

Annotation: Subsidy:subsídio requester Organization:GIP
Explanation: The word subsídio is an instance of the ontology class Subsidy. This subsidy has an object type property requester with GIP, which is an instance of the ontology class Organization

Annotation: Subsidy:subsídio motivation motivation:pagamento do técnico
Explanation: This subsidy has a data type property motivation with value pagamento do técnico (technician's payment)

Annotation: Subsidy:subsídio moneyAmount moneyAmount:2,499
Explanation: This subsidy has a data type property moneyAmount with value 2,499

Annotation: Subsidy:subsídio deliberation deliberation:aprovar
Explanation: This subsidy has a data type property deliberation with value aprovar (approve)

Annotation: Protocol:protocolo requester Organization:Associação de Jovens Ecos Urbanos
Explanation: The word protocolo is an instance of the ontology class Protocol. This protocol has an object type property requester with Associação de Jovens Ecos Urbanos, which is an instance of the ontology class Organization
Fig. 5.6 Example of Java code to query Google Maps about the address of an entity
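In the spirit of Fig. 5.6 (whose exact code is not reproduced here), the query amounts to building a URL-encoded request to the Google Maps Geocoding API endpoint and reading the latitude/longitude pair out of the JSON answer. The class and method names below are illustrative assumptions, and the regular-expression extraction stands in for a proper JSON parser.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of querying the Google Maps Geocoding API for an entity's address.
public class GeocodeSketch {

    // The geocoding endpoint takes the address as a URL-encoded parameter.
    public static String buildQueryUrl(String entity) {
        String encoded = URLEncoder.encode(entity, StandardCharsets.UTF_8);
        return "https://maps.googleapis.com/maps/api/geocode/json?address=" + encoded;
    }

    // Naive extraction of the first "lat"/"lng" pair from the JSON response;
    // a real system would use a JSON library instead of regular expressions.
    public static double[] firstLatLng(String json) {
        Pattern p = Pattern.compile(
            "\"lat\"\\s*:\\s*(-?[0-9.]+)\\s*,\\s*\"lng\"\\s*:\\s*(-?[0-9.]+)");
        Matcher m = p.matcher(json);
        if (!m.find()) return null;
        return new double[] { Double.parseDouble(m.group(1)),
                              Double.parseDouble(m.group(2)) };
    }

    public static void main(String[] args) {
        System.out.println(buildQueryUrl("GrETUA, Aveiro"));
        // Fragment mimicking the Fig. 5.7 response.
        String json = "{\"location\":{\"lat\":40.6369157,\"lng\":-8.6580388}}";
        double[] ll = firstLatLng(json);
        System.out.println(ll[0] + ", " + ll[1]);
    }
}
```

Fetching the URL (e.g., with java.net.http.HttpClient) yields a JSON document of the shape shown in Fig. 5.7.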
5.2.5 Maps
],
"formatted_address" : "GRETUA - Grupo Experimental de Teatro da UA, Universidade de Aveiro, 3810-193 Aveiro, Portugal",
"geometry" : {
  "bounds" : {
    "northeast" : {
      "lat" : 40.6370947,
      "lng" : -8.6577734
    },
    "southwest" : {
      "lat" : 40.6366973,
      "lng" : -8.658345599999999
    }
  },
  "location" : {
    "lat" : 40.6369157,
    "lng" : -8.658038899999999
  },
  "location_type" : "ROOFTOP",
  "viewport" : {
Fig. 5.7 Google Maps API response when querying for GrETUA. The response format is in JSON
and the relevant data for this example is encircled
Fig. 5.8 The top pane shows a map with several marks for which information exists. The bottom
pane shows the document context relative to the point indicated by the arrow
Fig. 5.9 The top pane shows a map with marks for which information exists and the bottom pane
shows the information extracted relative to the point indicated by the arrow
The first example shows, below the map, the text snippet where the location was
found; the second shows the information extracted for that location.
In both figures the arrow indicates which mark was clicked.
5.2.6 Semantic Information Queries
Let us consider a person named Maria who wants to check whether there is information
about the build permits she requested. Maria is a common name in Portugal, and a
keyword-based query returned 165 results for "Maria" using the same document set
from which information was extracted. Having the information extracted to a
knowledge base conforming to an ontology makes possible queries that take
Document                    Id        Outcome
cm-arouca.pt_ACTA_12_2009   12/09     Solicitar (Request)
cm-arouca.pt_ACTA_22_2009   153/2008  Solicitar (Request)
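The contrast with keyword search can be made concrete with the kind of ontology-aware query the knowledge base supports. The index mentions SPARQL as the query language used; the sketch below builds such a query as a Java string, asking only for build permits whose requester is named Maria rather than every occurrence of the word. The namespace prefix and property names (ex:, BuildPermit, requester, deliberation, name) are illustrative assumptions, not the book's actual ontology vocabulary.

```java
// Hedged sketch of an ontology-aware query: retrieve build permits
// requested by a person named Maria, together with their outcomes,
// instead of keyword-matching every mention of "Maria".
public class PermitQuery {

    public static String buildPermitQuery(String requesterName) {
        return String.join("\n",
            "PREFIX ex: <http://example.org/ontology#>",
            "SELECT ?permit ?outcome WHERE {",
            "  ?permit a ex:BuildPermit ;",
            "          ex:requester ?person ;",
            "          ex:deliberation ?outcome .",
            "  ?person ex:name \"" + requesterName + "\" .",
            "}");
    }

    public static void main(String[] args) {
        System.out.println(buildPermitQuery("Maria"));
    }
}
```

Run against the extracted knowledge base, a query of this shape returns only the handful of permits actually linked to the right person, which is what narrows the 165 keyword hits down to the two rows in the table above.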
Chapter 6
Conclusion
This book discussed the need to provide formal structure for contents originally
created in unstructured formats using natural language. The volume of relevant
information in such formats increases every day as people use the Internet to
communicate and as organizations create and publish documentation. These contents
often lack formalized markup because marking up contents manually is a
time-consuming and error-prone task that requires specialized knowledge. The
objective of information extraction is to analyze such contents and produce
fixed-format, unambiguous, and formal representations of them, including the
identification of the entities involved and the relations established among them.
The key concepts of information extraction were explained and illustrated using
representative examples with distinct degrees of complexity. Alongside the
examples, several high-performing, readily available state-of-the-art tools were
presented for implementing information extraction systems and the natural
language processing tasks relevant to information extraction. The tool selection
took into consideration performance and the ability to provide good results for
a wide range of natural languages.
After presenting the generic pipeline for information extraction and the
available tools, an information extraction architecture was described for
developing systems able to detect and organize relevant information according to
an arbitrary ontology. The architecture was instantiated for the Portuguese
language. The software tools used and their respective setup were described, and
it was explained how the tools work together to form a complete and coherent
system. The implemented system was used to extract information from local
government documents and from documents relative to healthcare. Key features of
the developed system are the ability to process natural language texts, to
accept a knowledge domain defined by an ontology, and to learn from examples how
to extract information. The system can be extended to other natural languages by
changing the NLP component, without changing the other modules.
Index
A
AKTiveMedia, 45
Application programming interface (API), 16
B
Bosque v7.3 corpus, 41–42
Boundary detection. See Sentence boundary detection
C
CoreNLP, 53
cURL/Wget, 51
D
Data generation, 13–15
Dependencies, 31–32
Document object model (DOM), 55
Document society, 13
Domain representation, 44–46
E
e-Government. See Information extraction (IE), electronic government
Epic parser, 21–22
G
GATE, 24
Generic relation extraction, 31–32
I
Identifying entities. See also Named entity recognition (NER)
  generic named entities, 30
  goals, 27
  named entity recognition, 28
  relation, 30–32
  website pages, 28
  Wikipedia categories, 29
Information extraction (IE)
  approaches, 6–7
  architecture, 8
  challenges, 5–6
  documents and information retrieval, 5
  electronic government
    documents, 59–60
    goals, 58–59
    maps, 65–67
    natural language documents, 59
    ontology creation, 61–65
    semantic information queries, 67–68
  extraction tasks, 5
  generic pipeline, 71
  identifying entities, 27
  information extraction systems, 8–9
  key concepts, 71
  natural language texts, 5
  NLP, 72
  OBIE, 7, 39
  performance measures, 7–8
  process overview, 13–15
iSentenizer, 16–17
K
Knowledge representation (KR), 3–4
L
Lemma/lemmatization, 17–18
M
MaltParser, 23–24, 44
Markov models. See TreeTagger algorithm
Millikan, Robert Andrews, 51–52
Morphological analysis
  lemma/lemmatization, 17–18
  POS tagging, 17–18
  Stanford POS Tagger, 19
  SVMTool, 19–20
  tools, 18–19
  TreeTagger algorithm, 19–20
  word stemming, 17–18
N
Named entity recognition (NER)
  identifying entities, 27–28, 30
  natural language processing, 4, 43–44
Natural language processing (NLP)
  architecture, 39–40
  Bosque sentences, 41–42
  coherent data representation, 23
  definition, 4
  documents, 59
  GATE, 24
  HAREM categories, 43–44
  named entity recognition, 43–44
  NLP steps, 13–14
  NLTK interfaces, 24
  POS tags, 43
  sentence splitting, 43
  Stanford NLP, 23–24
  syntactic parsing, 44
  tasks in, 4–5
Natural Language Toolkit (NLTK)
  Punkt, 16
  text processing libraries, 24
NER. See Named entity recognition (NER)
O
OBIE. See Ontology-Based Information Extraction (OBIE)
Ontologies
  annotation, 64–65
  classes, 62
  composition, 61–62
  definition, 4
  perspective, 61, 63
  reasons, 61
  relational database, 32–33
  relevant information, 38–39
Ontology-Based Information Extraction (OBIE), 7
  approaches, 33–34
  arbitrary ontology, 39
  architecture, 39–40
  identifying entities, 27
  IE process, 33
OpenNLP, 53
OWL. See Web Ontology Language (OWL)
P
Parse tree, 31
Part of speech (POS)
  natural language processing, 43
  relation extraction, 31
  Stanford POS Tagger, 19
  tags, 17–18, 43
Portuguese, 40–41
Probabilistic context-free grammar (PCFG) parser. See Epic parser
Prototype implementation. See State-of-the-art tools
Punkt approach, 15–16
R
r-millikan.txt.xml, 54–55
Relation extraction
  approaches, 30–31
  dependencies, 31–32
  entities, 30
  methods, 31
  named entity recognition, 30
  patterns, 31
  POS tags, 31
  types, 30
Resource Description Framework Schema (RDFS), 4, 33
S
SAXON, 56–58
Semantic extraction and integration
  creation, 46–48
  ontology relation, 45–46
  partial syntactic structures, 47
  use of, 48
Semantics
  advantages of semantic search, 3
  architecture, 39
  knowledge representation, 3–4
  linguistic expressions, 3–4
  queries, 68
Sentence boundary detection. See Tokenization
Sentence splitting, 43
Simple Knowledge Organization System (SKOS), 63
Software architecture, OBIE
  domain representation, 39–40
  NLP component, 39
  semantic extraction and integration, 39–40
SPARQL query, 68
Stanford CoreNLP, 53–54
Stanford NLP, 23–24
Stanford POS Tagger, 19
StanfordParser, 22–23
State-of-the-art tools
  architecture instantiation, 40–41
  domain representation, 44–46
  natural language processing, 41–44
  semantic extraction and integration, 45–48
Support vector machine (SVM) tool, 19–20
Syntactic parsing
  computationally intensive, 20
T
Tokenization
  boundary detection, 15
  definition, 15
  representative tools, 16–17
  tools, 15–16
TreeTagger algorithm, 19–20
TurboParser, 23–24
Tutorials
  applications, 72
  scenarios, 51
  software tools, 53
  syntactic parsing, 58
  target document, 54–58
  tools setup, 53–54
  Wikipedia page, 51–52
U
Uniform resource locator (URL), 51–52
Unstructured documents, 12
W
Web Ontology Language (OWL), 4, 33
Wget command, 59–60
Wikipedia page, 51–52
Word stemming, 17–18
X
XML stylesheet language for transformations (XSLT), 55–57