
SPRINGER BRIEFS IN ELECTRICAL AND COMPUTER ENGINEERING · SPEECH TECHNOLOGY

Mário Rodrigues
António Teixeira

Advanced Applications
of Natural Language
Processing for
Performing Information
Extraction


Mário Rodrigues
António Teixeira

Advanced Applications
of Natural Language
Processing for Performing
Information Extraction

Mário Rodrigues
ESTGA/IEETA
University of Aveiro
Portugal

António Teixeira
DETI/IEETA
University of Aveiro
Portugal

ISSN 2191-8112
ISSN 2191-8120 (electronic)
SpringerBriefs in Electrical and Computer Engineering
ISBN 978-3-319-15562-3
ISBN 978-3-319-15563-0 (eBook)
DOI 10.1007/978-3-319-15563-0
Library of Congress Control Number: 2015935192
Springer Cham Heidelberg New York Dordrecht London
© The Authors 2015
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, express or implied, with respect to the material contained herein or for any errors
or omissions that may have been made.
Printed on acid-free paper
Springer International Publishing AG Switzerland is part of Springer Science+Business Media
(www.springer.com)


Preface

The amount of content available in natural language (English, Italian, Portuguese, etc.) increases every day. This book provides a timely contribution on how to create information extraction (IE) applications that are able to tap the vast amount of relevant information available in natural language sources: web pages, official documents (such as laws and regulations, books and newspapers), and the social web.
Trends such as Open Data and Big Data show that there is value to be added by effectively processing large amounts of available data. Natural language sources are usually stored in digital format, searched using keyword-based methods, displayed as they were stored, and interpreted by the end users. However, it is not common to have software that manipulates these sources in order to present information in a manner adequate to fit users' context and needs. If such sources had structured and formal representations (relational and/or with some markup language, etc.), computer systems would be able to effectively manipulate that data to meet end users' expectations: summarize data, present graphics, etc.
The research community has been very active in producing software tools to support the development of information extraction systems for several natural languages. These tools are now mature enough to be tested in production systems. To stimulate the adoption of those technologies by the broad community of software developers, it is necessary to show their potential and how they can be used. Readers are introduced to the problem of IE and its current challenges and limitations, all supported with examples. The book discusses the need to fill the gap between data/documents and people, and provides a broad overview of the state-of-the-art technology in IE.
This book presents a description of a generic architecture for developing systems that are able to learn how to extract relevant information from natural language documents and assign semantic meaning to it. We also illustrate how to implement a working system using, in most parts, state-of-the-art and freely available software for several languages. Some concrete examples of systems/applications are provided to illustrate how applications can deliver information to end users.
Aveiro, Portugal
December 2014

Mário Rodrigues
António Teixeira

Contents

1 Introduction ............................................................... 1
1.1 Document Society ......................................................... 1
1.2 Problems ................................................................. 2
1.3 Semantics and Knowledge Representation ................................... 3
1.4 Natural Language Processing .............................................. 4
1.5 Information Extraction ................................................... 5
1.5.1 Main Challenges in Information Extraction .............................. 5
1.5.2 Approaches to Information Extraction ................................... 6
1.5.3 Performance Measures ................................................... 7
1.5.4 General Architecture for Information Extraction ........................ 8
1.6 Book Structure ........................................................... 8
References ................................................................... 10

2 Data Gathering, Preparation and Enrichment ................................. 13
2.1 Process Overview ......................................................... 13
2.2 Tokenization and Sentence Boundary Detection ............................. 15
2.2.1 Tools .................................................................. 15
2.2.2 Representative Tools: Punkt and iSentenizer ............................ 16
2.3 Morphological Analysis and Part-of-Speech Tagging ........................ 17
2.3.1 Tools .................................................................. 18
2.3.2 Representative Tools: Stanford POS Tagger, SVMTool, and TreeTagger ..... 19
2.4 Syntactic Parsing ........................................................ 20
2.4.1 Representative Tools: Epic, StanfordParser, MaltParser, TurboParser .... 21
2.5 Representative Software Suites ........................................... 23
2.5.1 Stanford NLP ........................................................... 23
2.5.2 Natural Language Toolkit (NLTK) ........................................ 24
2.5.3 GATE ................................................................... 24
References ................................................................... 24

3 Identifying Things, Relations, and Semantizing Data ........................ 27
3.1 Identifying the Who, the Where, and the When .............................. 27
3.2 Relating Who, What, When, and Where ...................................... 30
3.3 Getting Everything Together .............................................. 32
3.3.1 Ontology ............................................................... 32
3.3.2 Ontology-Based Information Extraction (OBIE) ........................... 33
References ................................................................... 34

4 Extracting Relevant Information Using a Given Semantic ..................... 37
4.1 Introduction ............................................................. 37
4.2 Defining How and What Information Will Be Extracted ...................... 38
4.3 Architecture ............................................................. 39
4.4 Implementation of a Prototype Using State-of-the-Art Tools ............... 40
4.4.1 Natural Language Processing ............................................ 41
4.4.2 Domain Representation .................................................. 44
4.4.3 Semantic Extraction and Integration .................................... 45
References ................................................................... 49

5 Application Examples ....................................................... 51
5.1 A Tutorial Example ....................................................... 51
5.1.1 Selecting and Obtaining Software Tools ................................. 53
5.1.2 Tools Setup ............................................................ 53
5.1.3 Processing the Target Document ......................................... 54
5.1.4 Using for Other Languages and for Syntactic Parsing .................... 58
5.2 Application Example 2: IE Applied to Electronic Government ............... 58
5.2.1 Goals .................................................................. 58
5.2.2 Documents .............................................................. 59
5.2.3 Obtaining the Documents ................................................ 59
5.2.4 Application Setup ...................................................... 61
5.2.5 Making Available Extracted Information Using a Map ..................... 65
5.2.6 Conducting Semantic Information Queries ................................ 67
References ................................................................... 68

6 Conclusion ................................................................. 71

Index ........................................................................ 73

Chapter 1

Introduction

Abstract Chapter 1 introduces the problem of extracting information from natural language unstructured documents, which is becoming more and more relevant in our document society. Despite the many useful applications that the information in these documents can potentiate, it is harder and harder to obtain the wanted information. Major problems result from the fact that many of the documents are in a format not usable by humans or machines. There is the need to create ways to extract relevant information from the vast amount of natural language sources. After this, the chapter briefly presents background information on semantics, knowledge representation and Natural Language Processing, to support the presentation of the area of Information Extraction [IE, "the analysis of unstructured text in order to extract information about pre-specified types of events, entities or relationships, such as the relationship between disease and genes or disease and food items; in so doing value and insight are added to the data" (Text mining of web-based medical content, Berlin, p 50)], its challenges, different approaches and general architecture, which is organized as a processing pipeline including domain independent components (tokenization, morphological analysis, part-of-speech tagging, syntactic parsing) and domain specific IE components (named entity recognition and co-reference resolution, relation identification, information fusion, among others).

Keywords Document society · Unstructured documents · Natural language · Semantics · Ontologies · Information extraction · Natural language processing · NLP · Knowledge representation

1.1 Document Society


Our society is a document society (Buckland 2013). Documents have become "the glue that enables societies to cohere"; they "have increasingly become the means for monitoring, influencing, and negotiating relationships with others" (Buckland 2013). With the advent of the web and other technologies, the concept of document evolved to include everything from classical books and reports to complex online multimedia information incorporating hyperlinks.


Fig. 1.1 An example of website providing health information (www.womenshealth.gov)

The number of such documents and their rate of increase are overwhelming. Some examples: governments produce large amounts of documents at several levels (local, central) and of many types (laws, regulations, minutes of (public) meetings, etc.); information in companies' intranets is increasing; more and more exams, reports and other medical documents are stored in servers by health institutions; and our personal documents grow in number and size day by day. Health research is one of the most active areas, resulting in a steady flow of documents (e.g. medical journals and master's and doctoral theses) reporting on new findings and results. There are also many portals and web sites with health information, such as the example presented in Fig. 1.1.
Much of the information that would be of interest to citizens, researchers, and professionals is found in unstructured documents. Despite the increasing use of tables, images, graphs and movies, a relevant part of these documents adopts, at least partially, written natural language. The amount of content available in natural language (English, Portuguese, Chinese, Spanish, etc.) increases every day. This is particularly noticeable on the web.

1.2 Problems
Despite the many useful applications that the information in these documents can potentiate, it is harder and harder to obtain the wanted information. The huge and increasing amount of documents available on the web, in companies' intranets and accumulated by most of us in our computers and online services potentiates many applications but also poses several challenges to making those documents really useful.
A major problem results from the fact that much of the documents/data is in a format not usable by machines. Hence, there is the need to create ways to extract relevant information from the vast amount of natural language sources. Natural language is the most comprehensive tool for humans to encode knowledge (Santos 1992), but creating tools to decode this knowledge is far from simple.
The second problem that needs to be solved is how to represent and store the extracted information. One must also make this information usable by machines. Regarding the discovery of information, general search engines do not allow the end user to obtain a clear and organized presentation of the available information. Instead, any given search produces a more or less hit-or-miss, random return of information. Efficient access to this information implies the development of semantic search systems (Guha et al. 2003) capable of taking into consideration the concepts involved and not just the words.
Semantic search has some advantages over search that directly indexes text words (Teixeira et al. 2014): (1) it produces smaller sets of results, by being capable of identifying and removing duplicated or irrelevant results; (2) it can integrate related information scattered across documents; (3) it can produce relevant results even when the question and answer have no words in common; and (4) it makes complex and more natural queries possible.
To make semantic search and other applications based on semantic information possible, we need to add semantics to the documents or create semantic descriptions representing or summarizing the original documents. This semantic information must be derived from the documents, and this can be done using techniques from the Information Extraction (IE) and Natural Language Processing (NLP) fields, as will be described and exemplified in this book. In general, to make IE possible, texts are first pre-processed (e.g. separated into sentences and words) and enriched (e.g. words marked as nouns or verbs) by applying several NLP methods.

1.3 Semantics and Knowledge Representation


As argued in the previous section, there is the need to extract semantic information from natural language documents to enable new semantics-based applications, and semantic search, over the information nowadays hidden in natural language documents. In this section, some background is given on the foundational concepts of semantics, ontologies and knowledge representation.
Semantics is the study of the meaning of linguistic expressions, including the relations between signifiers, such as words, phrases, signs and symbols, and their meaning. The language can be an artificial language (e.g. a computer programming language) or a natural language, such as English or Portuguese. The second kind is directly related to the topic of this book. Computational semantics addresses the automation of the processes of constructing representations of meaning and reasoning with them.


Knowledge representation (KR) addresses how to represent information about the world in forms that are usable by computer systems to solve complex tasks. Research in KR includes studying how to use symbols to represent a set of facts within a knowledge domain. As defined by Sowa (2000), knowledge representation is "the application of logic and ontology to the task of constructing computable models for some domain". In general, KR implies creating surrogates that represent real world entities, and endowing them with properties and interactions that represent real world properties and interactions. Examples of knowledge representation formalisms are logic representations, semantic networks, rules, frames, and ontologies.
Ontology is a central concept in KR. It is formally defined as "an explicit specification of a shared conceptualization" (Gruber 1993). It describes a hierarchy of concepts related by subsumption relationships, and can include axioms to express other relationships between concepts and to constrain their intended interpretation. From the computer science point of view, the usage of an ontology to explicitly define the application domain brings large benefits regarding information accessibility, maintainability, and interoperability. The ontology formalizes and allows making public the application's view of the world (Guarino 1998).
Ontologies allow specifying knowledge in machine processable formats since they can be specified using languages with well-defined syntax, such as the Resource Description Framework Schema (RDFS) and the Web Ontology Language (OWL). As ontology specification languages have well-defined semantics, specifying knowledge using ontologies prevents the meaning of the knowledge from being open to subjective intuitions and different interpretations (Antoniou and van Harmelen 2009).

1.4 Natural Language Processing


Allen (2000) defines NLP as "computer systems that analyze, attempt to understand, or produce one or more human languages, such as English, Japanese, Italian, or Russian. The input might be text, spoken language, or keyboard input. The task might be to translate to another language, to comprehend and represent the content of text, to build a database or generate summaries, or to maintain a dialogue with a user as part of an interface for database/information retrieval".
The area of NLP can be divided into several subareas, such as Computational Linguistics, Information Extraction, Information Retrieval, Language Understanding and Language Generation (Jurafsky and Martin 2008).
From the many tasks integrated in NLP, here is a list of those that are particularly relevant for this book:
Sentence breaking: find the boundaries of sentences;
Part-of-speech tagging: given a sentence, determine the part of speech (morphosyntactic role) of each word;
Named Entity Recognition (NER): determine which items in the text map to entities such as people, places or dates;


Parsing: grammatical analysis of a sentence;
Information Extraction: to be described in the next section.

1.5 Information Extraction


Information extraction is a sub-area of Natural Language Processing dedicated to the general problem of detecting entities referred to in natural language texts, the relations between them and the events they participate in. Informally, the goal is to detect elements such as "who did what to whom, when and where" (Màrquez et al. 2008). Natural language texts can be unstructured, plain texts, and/or semi-structured machine-readable documents, with some kind of markup. As Gaizauskas and Wilks (1998) observed, IE may be seen as populating structured information sources from unstructured, free text information sources.
IE differs from information retrieval (IR), the task of locating relevant documents in large document sets usually performed by current search engines such as Google or Bing, as its purpose is to acquire relevant information that can be later manipulated as needed. IE aims to extract relevant information from documents, while information retrieval aims to retrieve relevant documents from collections. In IR, after querying search engines, users must read each document of the result set to learn the facts reported. Systems featuring IE would be capable of merging related information scattered across different documents, producing summaries of facts reported in large amounts of documents, presenting facts in tables, etc.
Early extraction tasks concentrated on the identification of named entities, like people and company names, and relationships among them, from natural language text (Piskorski and Yangarber 2013). With the developments of recent years, which made online access to both structured and unstructured data easier, new applications of IE appeared and, to address the needs of these new applications, the techniques of structure extraction have evolved considerably over the last decades (Piskorski and Yangarber 2013).

1.5.1 Main Challenges in Information Extraction


Two important challenges exist in IE. One derives from the variety of ways of expressing the same fact. As illustrated by McNaught and Black (2006), the next statements all inform that a woman named Torretta is the new chairperson of a company named BNC Holdings:
"BNC Holdings Inc. named Ms. G. Torretta to succeed Mr. N. Andrews as its new chairperson."
"Nicholas Andrews was succeeded by Gina Torretta as chairperson of BNC Holdings Inc."
"Ms. Gina Torretta took the helm at BNC Holdings Inc. She succeeds Nick Andrews."


To extract the relevant information from each of these alternative formulations, linguistic analysis is required to cope with grammatical variation (active/passive), lexical variation ("named to"/"took the helm"), and anaphora resolution for cross-sentence references ("Ms. Gina Torretta"/"She").
The other challenge, shared by almost all NLP tasks, derives from the high expressiveness of natural languages, which can have ambiguous structure and meaning. Lee (2004) exemplifies this phenomenon with a McDonnell-Douglas ad from 1985: "At last, a computer that understands you like your mother." This sentence can be interpreted in at least three different ways: (1) the computer understands you as well as your mother understands you; (2) the computer understands that you like your mother; (3) the computer understands you as well as it understands your mother.

1.5.2 Approaches to Information Extraction


Over the years several different approaches have been proposed to solve the challenges of IE. They have been classified according to different dimensions. Some classifications are relative to the type of input documents (Muslea 1999), others to the type of technology used (Piskorski and Yangarber 2013; Chiticariu et al. 2013), and others to the degree of automation of the system (Hsu and Dung 1998; Chang et al. 2003). The distinct classification schemes reflect the variety of concerns of the proposing authors and also the evolution of IE over time.
Regarding the type of input documents, the methods developed to extract information from unstructured texts differ from the approaches employed when documents have some kind of markup, such as XML. The methods to extract information from unstructured sources tend to rely more on deep NLP. The lack of structure in the data implies that one of the most suitable ways to discriminate the different concepts involved in texts is to analyze them as thoroughly as possible. However, it is also possible to use superficial patterns targeted at information that is expressed in a reduced set of sentences, such as "X was born in Y" or "X is a Y-born", or targeted at information with well-defined formats such as email addresses, dates, and money amounts.
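A minimal sketch of such superficial patterns, using Python regular expressions (the example text, pattern names and address are illustrative, not from the book):

import re

text = ("Gina Torretta was born in Turin. Contact: press@bnc.example "
        "(14th March 1959).")

# Superficial pattern "X was born in Y" over capitalized names
born_in = re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in ([A-Z][a-z]+)")
# Well-defined formats: email addresses and simple day-month-year dates
email = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
date = re.compile(r"\d{1,2}(?:st|nd|rd|th)? [A-Z][a-z]+ \d{4}")

print(born_in.findall(text))  # [('Gina Torretta', 'Turin')]
print(email.findall(text))    # ['press@bnc.example']
print(date.findall(text))     # ['14th March 1959']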
When information sources have markups such as XML and/or are machine generated content based on templates, IE methods can take advantage of the markups and the structure of the document since they provide clues about the type of content. Markups can occur embedded in the text, e.g. "John was born on <date>14th March 1959</date>", or in special places such as Wikipedia's page infoboxes. Methods that extract information from such contents tend to rely on the markups and the document structure since these were produced by the information publisher and thus should be accurate. It is also common to use such information as seed examples for training and for improving the accuracy of methods looking for information that originally is in unstructured data (Suchanek et al. 2007; Kasneci et al. 2008).


Relative to the technology used, earlier IE systems were essentially rule based approaches, also called knowledge engineering approaches. This type of technology is still used, at least partially, in modern approaches. It uses hard coded rules created by human experts that encode linguistic knowledge by matching patterns over a variety of structures: text strings, part-of-speech tags, dictionary entries. The rules are usually targeted at specific languages and domains, and this type of system is generally very accurate and ready to use out of the box (Andersen et al. 1992; Appelt et al. 1993; Lehnert et al. 1993). As manual coding of the rules can become a time-consuming task, and also because rules rarely remain unchanged when porting to other languages and/or domains, some implementations introduced algorithms for automatically learning rules from examples (Soderland 1999; Califf and Mooney 1999; Ciravegna 2001).
The success of IE motivated the broadening of its scope to include more unstructured and noisy sources and, as a result, statistical learning algorithms were introduced. Among the most successful approaches are those based on Hidden Markov Models, conditional random fields, and maximum entropy models (Ratnaparkhi 1999; Lafferty et al. 2001). Later, more holistic analyses of the document were developed, including techniques for grammar construction and ontology-based IE (Viola and Narasimhan 2005; Wimalasuriya and Dou 2010).
Hybrid approaches, which use a mix of the previous two, combine the best features of each kind of approach: the accuracy of rule based approaches with the coverage and adaptability of machine learning approaches.
Some IE approaches use ontologies to store and guide the IE process. The success of these approaches motivated the creation of the term Ontology-Based
Information Extraction (OBIE). These approaches will be described in Chap. 4 of
this book.
Despite the different approaches, there is no clear winner. The advent of the Semantic Web and Open Data made ontology-based IE (OBIE) one of the most popular trends in the field. However, OBIE includes other IE algorithms and is not an alternative method but rather an approach that processes natural language text through a mechanism guided by ontologies and presents the output using ontologies (Wimalasuriya and Dou 2010).
Comprehensive overviews of IE approaches are provided in Sarawagi (2008) and Piskorski and Yangarber (2013).

1.5.3 Performance Measures


The metrics commonly used in the evaluation of IE systems are precision, recall and F-measure (Makhoul et al. 1999). Precision is the ratio between the number of correct or relevant findings and the number of all findings of the system; recall is the ratio between the number of correct or relevant findings and the number of expected findings, which is the total amount of relevant facts that exist in the documents.


F-measure is the weighted harmonic mean of precision and recall, commonly calculated as F1, which is the F-measure when β is equal to 1. These definitions can be expressed as formulas as follows:

\[ \mathrm{Precision} = \frac{\text{number of correct findings}}{\text{number of findings}} \]

\[ \mathrm{Recall} = \frac{\text{number of correct findings}}{\text{number of expected findings}} \]

\[ F\text{-}\mathrm{measure} = (1+\beta^{2})\,\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\beta^{2}\,\mathrm{Precision}+\mathrm{Recall}} \]

\[ F_{1} = 2\,\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} \]

A difficulty when computing these measures is that it is necessary to know all the relevant findings of the documents, specifically when calculating recall, and thus F-measure. This implies having someone read all documents and annotate the relevant parts of the texts, which is a time-consuming task. Ideally, the annotation should be performed by more than one person and followed by group consensus about which annotations are the correct ones. It is possible to find some sets of documents already annotated, named "golden collections".
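As a small worked example, the Python sketch below computes these measures for a system's findings against a hypothetical golden collection (names and data are illustrative):

def evaluate(findings, gold):
    """Precision, recall and F1 of system findings against a golden collection."""
    correct = len(set(findings) & set(gold))
    precision = correct / len(findings) if findings else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 2 of the 3 extracted facts appear among the 4 gold annotations
print(evaluate({"a", "b", "x"}, {"a", "b", "c", "d"}))
# approximately (0.667, 0.5, 0.571)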

1.5.4 General Architecture for Information Extraction


Although IE approaches differ significantly, the core process is usually organized as a processing pipeline that includes domain independent components (tokenization, morphological analysis, part-of-speech tagging, syntactic parsing) and domain specific IE components (named entity recognition and co-reference resolution, relation identification, information fusion, among others). This general pipeline is illustrated in Fig. 1.2. Taking documents as input, the sequence of domain independent and domain specific processing modules extracts information (or knowledge) that is made available to applications, humans or further processing.

1.6 Book Structure


In this first chapter, readers are introduced to the area of IE and its current challenges. The chapter starts by introducing the need to fill the gap between documents and people and ends with the presentation of a generic architecture for developing systems that are able to learn how to extract relevant information from natural language documents and assign semantic meaning to it. The chapter also includes some background information on semantics, ontologies, knowledge representation and Natural Language Processing.

[Figure: documents flow into the domain independent modules (sentence split, tokenization, morphological analysis, POS tagging, syntactic parsing; see Chapter 2) and then into the domain specific modules (named entity recognition, co-reference resolution, relation identification, information fusion; see Chapters 3 and 4), producing information/knowledge.]

Fig. 1.2 The general processing pipeline of information extraction systems
The two main groups of processing modules of the generic architecture are the subject of the following two chapters. First, Chap. 2 presents the domain independent modules that, in general, split the text into relevant units (sentences and tokens) and enrich the document by adding morphological and syntactic information. The third chapter presents information on how to extract entities and relations and create a semantic representation with the extracted information.
As OBIE is a very important trend, a complete chapter, the fourth, is dedicated to presenting a proposed software architecture for performing OBIE using an arbitrary ontology and to describing a system developed based on that architecture.
As this book aims at including real applications, Chap. 5 illustrates how to implement working systems. The chapter presents two systems: the first is a tutorial system (which we challenge all readers to build) developed by almost direct use of freely available tools and documents; the second, more complex and for a language other than English, illustrates a state-of-the-art system and how it can deliver information to end users.
The book ends with some comments on what was selected as content for the book and some considerations regarding the future.


References
Allen JF (2000) Natural language processing. In: Ralston A, Reilly ED, Hemmendinger D (eds) Encyclopedia of computer science, 4th edn. Wiley, Chichester, pp 1218–1222
Andersen PM et al (1992) Automatic extraction of facts from press releases to generate news stories. In: Proceedings of the third conference on applied natural language processing, pp 170–177
Antoniou G, van Harmelen F (2009) Web Ontology Language: OWL. In: Staab S, Studer R (eds) Handbook on ontologies, 2nd edn. International handbooks on information systems. Springer, Berlin, pp 91–110
Appelt DE et al (1993) FASTUS: a finite-state processor for information extraction from real-world text. In: IJCAI, pp 1172–1178
Buckland M (2013) The quality of information in the web. BiD: textos universitaris de biblioteconomia i documentació (31)
Califf ME, Mooney RJ (1999) Relational learning of pattern-match rules for information extraction. In: AAAI/IAAI, pp 328–334
Chang C-H, Hsu C-N, Lui S-C (2003) Automatic information extraction from semi-structured web pages by pattern discovery. Decis Support Syst 35(1):129–147
Chiticariu L, Li Y, Reiss FR (2013) Rule-based information extraction is dead! Long live rule-based information extraction systems! In: EMNLP, pp 827–832
Ciravegna F (2001) Adaptive information extraction from text by rule induction and generalisation. In: International joint conference on artificial intelligence, pp 1251–1256
Gaizauskas R, Wilks Y (1998) Information extraction: beyond document retrieval. J Doc 54(1):70–105
Gruber TR (1993) A translation approach to portable ontology specifications. Knowl Acquis 5(2):199–220
Guarino N (1998) Formal ontology and information systems. In: FOIS '98: proceedings of the international conference on formal ontology in information systems. IOS Press, Amsterdam, pp 3–15
Guha R, McCool R, Miller E (2003) Semantic search. In: The twelfth international World Wide Web conference (WWW), Budapest, p 779
Hsu C-N, Dung M-T (1998) Generating finite-state transducers for semi-structured data extraction from the web. Inf Syst 23(8):521–538
Jurafsky D, Martin JH (2008) Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd edn. Prentice Hall, New York
Kasneci G et al (2008) The YAGO-NAGA approach to knowledge discovery. ACM SIGMOD 37:7
Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the international conference on machine learning (ICML-2001)
Lee L (2004) "I'm sorry Dave, I'm afraid I can't do that": linguistics, statistics, and natural language processing circa 2001. In: Committee on the Fundamentals of Computer Science and National Research Council Telecommunications Board (ed) Computer science: reflections on the field, reflections from the field. The National Academies Press, Washington, pp 111–118
Lehnert W et al (1993) UMass/Hughes: description of the CIRCUS system used for Tipster text. In: Proceedings of TIPSTER '93, 19–23 September 1993, pp 241–256
Makhoul J et al (1999) Performance measures for information extraction. In: Proceedings of DARPA broadcast news workshop, pp 249–252
Màrquez L et al (2008) Semantic role labeling: an introduction to the special issue. Comput Linguist 34(2):145–159
McNaught J, Black W (2006) Information extraction. In: Ananiadou S, McNaught J (eds) Text mining for biology and biomedicine. Artech House, Boston
Muslea I (1999) Extraction patterns for information extraction tasks: a survey. In: Proceedings of the AAAI 99 workshop on machine learning for information extraction, Orlando, July 1999, pp 1–6
Neustein A et al (2014) Application of text mining to biomedical knowledge extraction: analyzing clinical narratives and medical literature. In: Neustein A (ed) Text mining of web-based medical content. De Gruyter, Berlin, p 50
Piskorski J, Yangarber R (2013) Information extraction: past, present and future. In: Multi-source, multilingual information extraction and summarization. Springer, Berlin, pp 23–49
Ratnaparkhi A (1999) Learning to parse natural language with maximum entropy models. Mach Learn 34(1–3):151–175
Santos D (1992) Natural language and knowledge representation. In: Proceedings of the ERCIM workshop on theoretical and experimental aspects of knowledge representation, pp 195–197
Sarawagi S (2008) Information extraction. Found Trends Database 1(3):261–377
Soderland S (1999) Learning information extraction rules for semi-structured and free text. Mach Learn 34(1–3):233–272
Sowa JF (2000) Knowledge representation: logical, philosophical, and computational foundations. Brooks Cole, Pacific Grove
Suchanek F, Kasneci G, Weikum G (2007) Yago: a core of semantic knowledge. In: Proceedings of the 16th international conference on World Wide Web. ACM Press, New York, p 697
Teixeira A, Ferreira L, Rodrigues M (2014) Online health information semantic search and exploration: reporting on two prototypes for performing extraction on both a hospital intranet and the world wide web. In: Neustein A (ed) Text mining of web-based medical content. De Gruyter, Berlin, p 50
Viola P, Narasimhan M (2005) Learning to extract information from semi-structured text using a discriminative context free grammar. In: Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval, pp 330–337
Wimalasuriya DC, Dou D (2010) Ontology-based information extraction: an introduction and a survey of current approaches. J Inf Sci 36(3):306–323

Chapter 2

Data Gathering, Preparation and Enrichment

Abstract This chapter presents the domain independent part of the general architecture of Information Extraction (IE) systems. This first part aims at preparing documents by the application of several Natural Language Processing tasks that enrich the documents with morphological and syntactic information. This is done in successive processing steps which start by making contents uniform and end by identifying the roles of the words and how they are arranged.
The most common steps are described here: sentence boundary detection, tokenization, part-of-speech tagging, and syntactic parsing. The description includes information on a selection of relevant tools available to implement each step.
The chapter ends with the presentation of three very representative software suites that make the integration of the several steps described easier.

Keywords Information extraction · Tokenization · Sentence splitting · Morphological analysis · Part-of-speech · POS · Syntactic parsing · Tools

2.1 Process Overview

The IE process usually starts by identifying and associating morphosyntactic features to natural language contents that, otherwise, would be quite undistinguishable character strings. The process is composed of successive NLP steps, starting with making contents uniform and ending with the identification of the roles of the words and how they are arranged. The first steps are usually tokenization and sentence boundary detection. Their purpose is to break contents into sentences and define the limits of each token: word, punctuation mark, or other character cluster such as currencies. Afterwards, all processing is usually conducted in a per-sentence fashion and tokens are considered atomic. Then, morphological analysis makes tokens uniform by determining word lemmata (see "win" and "won" in Fig. 2.1), and part-of-speech tagging assigns a part of speech to each token, visible after the slashes. The final step is usually syntactic parsing, which can be done using significantly different formalisms. These NLP steps prepare the textual contents for the subsequent identification and extraction of relevant information.


[Fig. 2.1 shows the successive NLP steps applied to the text: "John Bardeen is the only laureate to win the Nobel Prize in Physics twice – in 1956 and 1972. Maria Curie also won two Nobel Prizes, for physics in 1903 and chemistry in 1911."

After sentence boundary detection + tokenization:
<S>[John] [Bardeen] [is] [the] [only] [laureate] [to] [win] [the] [Nobel] [Prize] [in] [Physics] [twice] [–] [in] [1956] [and] [1972] [.]</S>
<S>[Maria] [Curie] [also] [won] [two] [Nobel] [Prizes] [,] [for] [physics] [in] [1903] [and] [chemistry] [in] [1911] [.]</S>

After morphological analysis + part-of-speech tagging:
<S>[John/NNP] [Bardeen/NNP] [be/VBZ] [the/DT] [only/JJ] [laureate/NN] [to/TO] [win/VB] [the/DT] [Nobel/NNP] [Prize/NNP] [in/IN] [Physics/NNP] [twice/RB] [–/:] [in/IN] [1956/CD] [and/CC] [1972/CD] [./.]</S>
<S>[Maria/NNP] [Curie/NNP] [also/RB] [win/VBD] [two/CD] [Nobel/NNP] [Prizes/NNS] [,/,] [for/IN] [physics/NN] [in/IN] [1903/CD] [and/CC] [chemistry/NN] [in/IN] [1911/CD] [./.]</S>

followed by (dependency) syntactic parsing, depicted in the figure as a graph.]

Fig. 2.1 Representative example of the NLP steps for morphosyntactic data generation relative to plain text natural language sentences

Figure 2.1 depicts the mentioned successive processing steps and their effect on data. The processing steps are on the left-hand side and their effect on data is visible on the right-hand side. The output of one step is the input of the next one, and the effects are representative: they provide a real example of what can be done but are not the only possible formalism or solution. The syntactic parsing result in Fig. 2.1 is relative to dependency parsing and is depicted as a graph for simplicity.
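These steps can be sketched with NLTK as follows (assuming the 'punkt', 'averaged_perceptron_tagger' and 'wordnet' data packages are installed; the exact output depends on the models used):

from nltk import pos_tag, sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer

text = ("John Bardeen is the only laureate to win the Nobel Prize in "
        "Physics twice. Maria Curie also won two Nobel Prizes.")

lemmatizer = WordNetLemmatizer()
for sentence in sent_tokenize(text):           # sentence boundary detection
    tagged = pos_tag(word_tokenize(sentence))  # tokenization + POS tagging
    # Simple morphological normalization: lemmatize verbs ("won" -> "win")
    for word, tag in tagged:
        lemma = (lemmatizer.lemmatize(word.lower(), "v")
                 if tag.startswith("VB") else word)
        print(f"[{lemma}/{tag}]", end=" ")
    print()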


In the following sections, each of these major steps is described and representative tools are briefly presented. A bias towards Latin alphabet languages is assumed but, whenever possible, some information is provided on other languages, such as Arabic and Chinese. Representative tools, in general used later in the book, are given some additional attention. They are described in some detail, and relevant information, such as the way of obtaining the tool and the languages supported out of the box, is presented in tabular form at the end of each section.

2.2 Tokenization and Sentence Boundary Detection

Document processing usually starts by separating documents' texts into their atomic units. Breaking a stream of text into tokens (words, numbers, and symbols) is known as tokenization (Mcnamee and Mayfield 2004). It is a quite straightforward process for languages that use spaces between words, such as most languages using the Latin alphabet. Tokenizers often rely on simple heuristics such as: (1) all contiguous strings of alphabetic characters are part of one token, and the same applies to numbers; and (2) tokens are separated by whitespace characters (space and line break) or by punctuation characters that are not included in abbreviations. For languages that do not use whitespace between tokens, such as Chinese, this process can be particularly challenging (Chang and Manning 2014; Huang et al. 2007).
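These heuristics map naturally onto regular expressions; a minimal Python sketch (illustrative only, ignoring abbreviation handling):

import re

# Heuristic tokenizer: runs of letters, runs of digits, or single
# non-whitespace symbols (punctuation, currency signs, etc.)
TOKEN = re.compile(r"[A-Za-z]+|\d+|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Maria Curie also won two Nobel Prizes, for physics in 1903."))
# ['Maria', 'Curie', 'also', 'won', 'two', 'Nobel', 'Prizes', ',', 'for',
#  'physics', 'in', '1903', '.']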
Sentence boundary detection, as its name suggests, addresses the problem of finding sentence boundaries. The concept of sentence is central in several natural language processing tasks, since sentences are standard textual units which confine a variety of linguistic phenomena such as collocations and variable binding. However, finding these boundaries is not a trivial task since end-of-sentence punctuation marks are ambiguous in many languages. The period is often used as a sentence boundary marker but also in ordinal numbers, initials, abbreviations, and even abbreviations at the end of sentences. Like the period, other punctuation marks such as exclamation points and question marks can mark the end of sentences but can also occur within quotations or parentheses in the middle of sentences (Kiss and Strunk 2006; Palmer and Hearst 1997; Reynar and Ratnaparkhi 1997).

2.2.1 Tools

Tools for tokenizing texts are found in software suites such as Freeling (Padró and Stanilovsky 2012), NLTK (Bird et al. 2009), OpenNLP (Apache 2014), or StanfordNLP (Manning et al. 2014). There are no specialized tools exclusively dedicated to this problem since tokenization can be done reasonably well using regular expressions (regex) when processing languages that use the Latin alphabet. For languages not using the Latin alphabet there are fewer tools. The tokenizer Stanford Word Segmenter¹ has models able to handle Arabic and Chinese (Chang et al. 2008; Monroe et al. 2014).

¹ http://nlp.stanford.edu/software/segmenter.shtml

Regarding the sentence boundary detection problem, several systems addressing it have been proposed with good results. Here we focus on two proposals that achieved good results when tested with distinct natural languages: Punkt (Kiss and Strunk 2006) and iSentenizer (Wong et al. 2014).

Table 2.1 Main features of Punkt
Name: Punkt
Task: Sentence boundary detection
URL: http://www.nltk.org/_modules/nltk/tokenize/punkt.html
Languages tested: Dutch, English, Estonian, French, German, Italian, Norwegian, Portuguese, Spanish, Swedish, and Turkish
Performance: F1 above 0.95 for most of the 11 tested languages

2.2.2 Representative Tools: Punkt and iSentenizer

Punkt is included in the Natural Language Toolkit (NLTK), a software suite in Python that provides tools for handling natural languages (see Sect. 2.5.2). The Punkt implementation follows the tokenizer interface defined by NLTK in order to be seamlessly integrated programmatically into an NLP pipeline. It is provided with source code and, alongside the execution method, the software also includes methods for training new sentence boundary detection models from corpora (see tested languages in Table 2.1).
Punkt's approach is based on unsupervised machine learning. The method assumes that most end-of-sentence ambiguities can be solved if abbreviations are identified, as the remaining periods would mark ends of sentences (Kiss and Strunk 2006). It operates in two steps. The first step detects abbreviations by assuming that they are collocations of a truncated word and a final period, that they are short, and that they often contain internal periods. These assumptions are used to estimate the likelihood of a given period being part of an abbreviation. The second step evaluates whether the decisions of the first step should be corrected. The evaluation is based on the word immediately to the right of the period: it is checked whether that word is a frequent sentence starter, whether it is capitalized, or whether the two tokens surrounding the period do not form a frequent collocation. Periods are considered sentence boundary markers if they are not part of abbreviations.
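A short usage sketch with NLTK (the training corpus path is hypothetical; recent NLTK releases may name the pre-trained data package differently):

import nltk
nltk.download("punkt", quiet=True)  # pre-trained Punkt models

from nltk.tokenize import sent_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

print(sent_tokenize("Mr. Smith arrived at 5 p.m. He was late."))
# expected: ['Mr. Smith arrived at 5 p.m.', 'He was late.']

# Unsupervised training of a new model from raw text (hypothetical file)
trainer = PunktTrainer()
trainer.train(open("my_corpus.txt", encoding="utf-8").read(), finalize=True)
tokenizer = PunktSentenceTokenizer(trainer.get_params())
sentences = tokenizer.tokenize("Some raw text. With sentences.")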
iSentenizer is provided with a Visual C++ application programming interface (API) and a standalone tool featuring a graphical user interface (GUI). Having these two interfaces makes the tool easier to use. The GUI can be used to easily and conveniently construct and verify a sentence boundary detection system for a specific language, and the API allows later integration of the constructed model into larger software systems using Visual C++.


Table 2.2 Main features of iSentenizer
Name: iSentenizer
Task: Sentence boundary detection
URL: http://nlp2ct.cis.umac.mo/views/utility.html
Languages tested: Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish
Performance: detects sentence boundaries of a mixture of different text genres and languages with high accuracy; F1 above 0.95 for most of the 11 tested languages using the Europarl corpus

iSentenizer is based on an algorithm, named i+Learning, that constructs a decision tree in two steps (Wong et al. 2014). The first step constructs a decision tree in a top-down approach based on the training corpus. The second step increments the tree whenever a new instance or attribute is detected, revising the tree model by incorporating new knowledge instead of retraining it from scratch. The features used in tree construction are the words immediately preceding and following the potential boundary punctuation marks: period, exclamation mark, colon, semicolon, question mark, quotation marks, brackets, and dash. More punctuation marks are included than the usual sentence boundaries (period, exclamation mark, and question mark) because those punctuation marks may also denote a sentence boundary depending on the text genre. Features are encoded in a way independent of corpus and alphabet to maximize the adaptability of the system to different languages and text genres (see tested languages in Table 2.2).

2.3 Morphological Analysis and Part-of-Speech Tagging

Having texts separated into tokens, the next step is usually morphosyntactic analysis, in order to identify characteristics such as word lemma and part of speech (Marantz 1997). It is important to distinguish two concepts: lexeme and word form. The difference is well illustrated with two examples: (1) the words "book" and "books" refer to the same concept, and thus have the same lexeme but different word forms; (2) the words "book" and "bookshelf" have different word forms and different lexemes, as they refer to two different concepts (Marantz 1997). The form conventionally chosen to represent the canonical form of a lexeme is called the lemma.
Finding word lemmata brings the advantage of having a single form for all words that have similar meanings. For example, the words "connect", "connected", "connecting", "connection", and "connections" roughly refer to the same concept and have the same lemma. Also, this process reduces the total number of terms to handle, which is advantageous from a computer processing point of view, as it reduces the size and complexity of the data in the system (Porter 1980). The complexity of the task depends on the target natural language. For languages with simple inflectional morphology, such as English, the task is more straightforward than for languages with more complex inflectional morphology, such as German (Appelt 1999).


The process of determining the word lemma is called lemmatization. Another method, called word stemming, is common due to its simplicity. Word stemming reduces words to their base form by removing suffixes. The remaining form is not necessarily a valid root, but it is usually sufficient that related words map to the same stem, or to a reduced set of stems if the words are irregular. For example, the words "mice" and "mouse" have the lemma "mouse", but some stemmers produce "mic" and "mous", respectively (Hotho et al. 2005).
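The contrast is easy to see with NLTK's Porter stemmer and WordNet lemmatizer (a sketch; the lemmatizer needs the 'wordnet' data package):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["connect", "connected", "connecting", "connection", "connections"]
print({w: stemmer.stem(w) for w in words})
# all five map to the same stem: 'connect'

print(lemmatizer.lemmatize("mice"))  # 'mouse' -- a valid lemma
print(stemmer.stem("mouse"))         # 'mous'  -- a truncated, non-word stem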
Other important features for characterizing a word are its morphosyntactic category, or part of speech (POS), such as noun, adjective, verb, preposition, etc., along with other properties that depend on the POS. For example, verbs have features such as tense and person that are not applicable to nouns (Piskorski and Yangarber 2013). Finding the part of speech is known as POS tagging, and the systems developed for this task usually include algorithms for word lemmatization or stemming before determining the POS tag.
POS tagging has two main challenges. One challenge is dealing with part-of-speech ambiguity, as words can often have distinct parts of speech depending on their context in sentences. The other challenge is the assignment of POS to words about which the system has no knowledge (Aluísio et al. 2003). Both problems are typically solved by taking into account the context around the target word, within a sentence, and selecting the most probable tag using information provided by the word and its context (Güngör 2010). POS tag information is commonly taken into consideration in syntactic parsing, a subsequent processing stage at the sentence level. POS information is relevant in syntactic parsing since morphosyntactic categories group words that occur with the same syntactic distribution (Brants 1995). This implies that replacing a token by another of the same category does not affect the sentence's grammaticality. Considering the next example, it is possible to form 24 (2 × 4 × 3) sentences by picking one word from each of the three groups between brackets; more sentences are possible if more words are added to the groups.
[the | a][fast | slow | red | pretty][car | bicycle | plane] passed by.
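A quick way to verify the count is to enumerate the combinations (a Python sketch):

from itertools import product

determiners = ["the", "a"]
adjectives = ["fast", "slow", "red", "pretty"]
nouns = ["car", "bicycle", "plane"]

sentences = [f"{d} {adj} {n} passed by."
             for d, adj, n in product(determiners, adjectives, nouns)]
print(len(sentences))  # 24 = 2 * 4 * 3
print(sentences[0])    # 'the fast car passed by.'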
POS tagging is a step common to most natural language processing (NLP) tasks and an extensively researched subject. As a result, it is often considered a solved task, with baseline precision around 90 % and state-of-the-art systems achieving values around 97 %. However, these values are being disputed, as precision is measured on uniform text genres and on a per-word basis. If results are measured in terms of full sentences, i.e. considering the proportion of sentences without a single tag error, the precision values drop to around 55–57 % (Giesbrecht and Evert 2009; Manning 2011).

2.3.1 Tools

Several approaches have been proposed over the years. Available implementations are commonly developed for English and trained and evaluated using Penn Treebank data. Nevertheless, most have the potential to be used for tagging other languages. Here we privileged implementations that have proven good results with several natural languages, that are provided with methods to train a tagger model for other languages given POS-annotated training text for that language, and that are not part of larger software suites. The only exception will be the Stanford POS tagger, from the StanfordNLP suite, because it is provided with tagger models for six different languages, making it very relevant even if the rest of the suite is not used.

2.3.2 Representative Tools: Stanford POS Tagger, SVMTool, and TreeTagger

Three tools were selected as they represent, respectively, POS tagging implementations using models based on maximum entropy, support vector machines (SVM),
and Markov models. All tools include the POS tagger and methods to create new
tagger models given training data.
The Stanford POS Tagger includes components for command-line invocation, for running as a server, and for integration into software projects through a Java API. The full download version contains tagger models for six different languages (see the languages list in Table 2.3). It is based on a bidirectional maximum entropy model that decides the POS tag of a token taking into consideration the preceding and following tags, and broad lexical features such as joint conditioning of multiple consecutive words. The tagger achieved a precision value above 0.97 with the Penn Treebank Wall Street Journal (WSJ) corpus (Toutanova et al. 2003).
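For illustration, the tagger can also be driven from Python through NLTK's wrapper, as in the minimal sketch below; the jar and model paths are placeholders for a local download of the Stanford POS Tagger, and a Java runtime is required.

from nltk.tag import StanfordPOSTagger

# Placeholder paths pointing to a local Stanford POS Tagger installation
tagger = StanfordPOSTagger(
    model_filename='models/english-bidirectional-distsim.tagger',
    path_to_jar='stanford-postagger.jar')
print(tagger.tag('John Bardeen won the Nobel Prize twice .'.split()))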
SVMTool supports standard input and output pipelining, making its integration into larger systems easier. It is also provided with a C++ API to support embedded usage. The algorithm is based on support vector machine classifiers and uses a rich set of features, including word and POS bigrams and trigrams, and surface patterns such as prefixes, suffixes, letter capitalization, word length, and sentence punctuation. The tagging decisions can be made using a reduced context or at the sentence level. The tagger achieved accuracy above 0.97 with the English Wall Street Journal corpus, and above 0.98 with the Spanish LEXEP corpus (Giménez and Màrquez 2004). Table 2.4 presents the highlights of SVMTool.
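Because the tagger reads tokens from standard input and writes tagged tokens to standard output, it can be driven from any language via pipes. The sketch below assumes that SVMTagger (the Perl tagger shipped with SVMTool) is on the PATH and that a trained model named eng is installed; both names are illustrative.

import subprocess

tokens = "John Bardeen won the Nobel Prize twice .".split()
# SVMTagger is assumed to expect one token per line on stdin and emit "token tag" lines
result = subprocess.run(["SVMTagger", "eng"],
                        input="\n".join(tokens),
                        capture_output=True, text=True, check=True)
print(result.stdout)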
TreeTagger can be run from the command line or using a GUI, and is provided as a binary package for Intel Macs, Linux, and Windows operating systems. The project website includes ready-to-use models for 16 languages (see the language list in Table 2.5). TreeTagger's algorithm is based on n-gram Markov models with transition probabilities estimated using a binary decision tree. Compared to other algorithms

Table 2.3 Main features of Stanford tagger

Name: Stanford POS tagger
Task: Part of speech tagging
URL: http://nlp.stanford.edu/software/tagger.shtml
Languages tested: Arabic, Chinese, English, French, German, and Spanish
Performance: Accuracy of 0.9724 for English


Table 2.4 Main features of SVMTool

Name: SVMTool
Task: Part of speech tagging
URL: http://www.lsi.upc.edu/~nlp/SVMTool/
Languages tested: Catalan, English, and Spanish
Performance: Accuracy of 0.9739 for English and 0.9808 for Spanish

Table 2.5 Main features of TreeTagger

Name: TreeTagger
Task: Part of speech tagging
URL: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
Languages tested: Bulgarian, Dutch, English, Estonian, Finnish, French, Galician, German, Italian, Portuguese, Mongolian, Polish, Russian, Slovak, Spanish, and Swahili
Performance: Accuracy above 0.95 for most languages

using Markov models, this technique needs less data to obtain reliable transition probabilities, as binary decision trees have relatively few parameters to estimate. This feature mitigates the sparse-data problem (Schmid 1994).

2.4 Syntactic Parsing

Syntactic parsing is usually a computationally intensive task that is not used as often in IE systems as tokenization, sentence boundary detection, or POS tagging. When information sources are (semi-)structured, or are machine generated, or the output is coarse grained, other, less computationally intensive methods, such as locating textual patterns, can provide similar results (Feldman and Sanger 2007; Huffman 1996).
The goal of syntactic parsing is to analyze sentences in order to produce structures representing how words are arranged in sentences (Langacker 1997). Structures are produced with respect to a given formal grammar, and over the years different formalisms have been proposed, reflecting both linguistic and computational concerns. In a broad sense, grammars can follow two structural formalisms: constituency and dependency (Jurafsky and Martin 2008; Nugues 2006).
A constituent is a unit within a hierarchical structure that is composed of a word or a group of words. Although, in a strict formal sense, constituent structures can be observed in dependency grammars, constituency is usually associated with phrase structure grammars, as these are based only on the constituency relation. Phrase structure grammars are composed of sets of syntactic rules that fractionate a phrase into sub-phrases and hence describe a sentence's composition in terms of phrase structure (Chomsky 2002). Figure 2.2 presents a possible parse of the sentence "This book has two authors." using a phrase structure grammar.
Fig. 2.2 Possible constituency grammar tree for the sentence "This book has two authors"

Dependency grammars describe sentence structures in terms of links between words. Each link reflects a relation of dominance/dependence between a headword and a dependent word. The original work of Tesnière (1959) received formal mathematical definitions, thus becoming suitable for automatic processing. As a result, sentence dependencies form graphs that have a single head and usually have three properties: acyclicity, connectivity, and projectivity (Nivre 2005). Dependency grammars often prove more efficient for parsing texts. Figure 2.3 presents a possible parse of the same example sentence using a dependency grammar.

Fig. 2.3 Possible dependency grammar graph for the sentence "This book has two authors"
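Both kinds of structure can be manipulated programmatically. The minimal sketch below encodes structures consistent with Figs. 2.2 and 2.3 using NLTK's Tree and DependencyGraph classes (sentence punctuation omitted for brevity); it is illustrative and not the output of any particular parser.

from nltk import Tree
from nltk.parse import DependencyGraph

# Constituency tree, as in Fig. 2.2
constituency = Tree.fromstring(
    "(S (NounPhrase (DT This) (NN book))"
    " (VerbPhrase (AUX has) (NounPhrase (CD two) (NNS authors))))")
constituency.pretty_print()

# Dependency graph, as in Fig. 2.3 (Malt-TAB format: word, tag, head index, relation)
dependency = DependencyGraph(
    "This\tDT\t2\tdet\n"
    "book\tNN\t3\tnsubj\n"
    "has\tVBZ\t0\troot\n"
    "two\tCD\t5\tnum\n"
    "authors\tNNS\t3\tdobj\n")
for head, relation, dependent in dependency.triples():
    print(head, relation, dependent)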
Nugues (2006) provides a comprehensive discussion of the syntax theories and parsing techniques that have been proposed over the years. Here the focus will be on tools that have proven adaptable to different languages without the need to rewrite grammars, which is a difficult task and requires some expertise in language models. The first two parsers presented, Epic and StanfordParser, use phrase structure grammars; the other two, MaltParser and TurboParser, use dependency grammars.

2.4.1 Representative Tools: Epic, StanfordParser, MaltParser, TurboParser

Epic is a probabilistic context-free grammar (PCFG) parser that can be used from the command line or programmatically using a Scala API. Its algorithm uses surface patterns to reduce the propagation of information through the grammar structure, thus avoiding having too many features in the grammar structure. Having a simpler


Table 2.6 Main features of Epic parser

Name: Epic
Task: Syntactic parsing (phrase structure grammar)
URL: http://www.scalanlp.org/
Languages tested: Ready-to-use models for Basque, English, French, German, Hungarian, Korean, Polish, and Swedish; other languages tested with accuracy over 0.78: Arabic, Basque, and Hebrew

Table 2.7 Main features of StanfordParser

Name: StanfordParser
Task: Syntactic parsing (phrase structure grammar)
URL: http://nlp.stanford.edu/software/lex-parser.shtml
Languages tested: Ready-to-use models for Arabic, Chinese, English, French, and German; other languages tested with accuracy over 0.75: Bulgarian, Italian, and Portuguese

structural backbone improves the adaptation to new languages (Hall et al. 2014). The Epic parser provides ready-to-use parser models for eight languages and was tested with three more languages, achieving accuracy results over 0.78 (see Table 2.6).
StanfordParser is also a PCFG parser, provided with a command-line interface as well as a Java API for programmatic usage. It uses an unlexicalized grammar at its core. An unlexicalized PCFG is a grammar that relies on word categories, such as POS categories, that can be more or less broad, and does not systematically specify rules down to the lexical level, although some categories can represent a single word. This brings the advantage of producing compact and robust grammar representations, as there is no need for large structures to store the lexicalized probabilities (Klein and Manning 2003). StanfordParser is provided with models for five languages and was also used with Bulgarian, Italian, and Portuguese (see Table 2.7).
MaltParser is provided as a JAR package for command-line usage, and with the Java source code for integration into larger software projects. MaltParser is a data-driven dependency parsing system able to induce parsing models from treebank data. The parsing model builds dependency graphs in one left-to-right pass over the input, using a stack to store partially processed tokens and a history-based feature model to predict the next parser action (Hall et al. 2010; Nivre et al. 2007). There are ready-to-use parsing models for 4 languages, and the parser was tested with 14 other languages with accuracies around 0.75 or above (see Table 2.8).
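As a sketch of embedded usage, NLTK ships a wrapper around MaltParser; the distribution directory and the pre-trained .mco model named below are placeholders for a local installation.

from nltk.parse.malt import MaltParser

# Placeholder names: a local MaltParser distribution and a pre-trained model
parser = MaltParser('maltparser-1.9.2', 'engmalt.linear-1.7.mco')
graph = parser.parse_one('This book has two authors .'.split())
print(graph.tree())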
TurboParser is provided with C++ source code ready to be compiled on systems complying with the Portable Operating System Interface (POSIX) and also on Windows. The approach followed formulates the problem of non-projective dependency parsing as an integer linear programming optimization problem of polynomial size. The model supports expert knowledge in the form of constraints, and training data is used to automatically learn soft constraints. Having a model requiring a polynomial number of constraints as a function of the sentence length, instead of

Table 2.8 Main features of MaltParser

Name: MaltParser
Task: Syntactic parsing (dependency grammar)
URL: http://www.maltparser.org/
Languages tested: Ready-to-use models for English, French, Spanish, and Swedish; other languages tested with accuracy around 0.75 or above: Arabic, Basque, Catalan, Chinese, Czech, Danish, Dutch, German, Greek, Hungarian, Italian, Japanese, Portuguese, and Turkish

Table 2.9 Main features of TurboParser

Name: TurboParser
Task: Syntactic parsing (dependency grammar)
URL: http://www.ark.cs.cmu.edu/TurboParser/
Languages tested: Ready-to-use models for Arabic, English, Farsi, Kinyarwanda, and Malagasy; other languages tested with accuracy above 0.75: Danish, Dutch, Portuguese, Slovene, Swedish, and Turkish

the exponential number of constraints of previous linear programming approaches, eliminates the need for incremental procedures and impacts accuracy and processing speed (Martins et al. 2009). The parser is provided with models for five languages and was tested with six more languages (see Table 2.9).

2.5 Representative Software Suites

NLP software suites make it easier to integrate all tasks in a processing pipeline. They integrate several tools using a coherent data representation, designed to allow the output of one step to be used directly as input of the following one. The list of available suites includes Apache OpenNLP, Freeling, GATE, LingPipe, Natural Language Tool Kit (NLTK), and StanfordNLP, among others. Here we describe StanfordNLP, as it is used in a tutorial example in Chap. 5; NLTK, as it is very well documented and uses a programming language distinct from StanfordNLP's; and GATE, for historical reasons, as it was (one of) the first mature suites available.

2.5.1 Stanford NLP

Stanford NLP (Manning et al. 2014) is a machine learning based toolkit for the processing of natural language text. It includes software for realizing several NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity recognition, parsing, coreference resolution, and relation extraction, which can be incorporated into applications with human language technology needs.


The suite is developed using the Java programming language, although it is possible to find bindings or translations for other programming languages such as .NET languages, Perl, Python, and Ruby. All tools include methods for training new models from corpora.
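As an illustration of programmatic access from another language, recent versions of the suite can run as an HTTP server, which NLTK's CoreNLP client can query; the minimal sketch below assumes such a server has been started locally on port 9000.

from nltk.parse.corenlp import CoreNLPParser

# Assumes a CoreNLP server was started locally, e.g.:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
parser = CoreNLPParser(url='http://localhost:9000')
tree = next(parser.raw_parse('John Bardeen won the Nobel Prize twice.'))
tree.pretty_print()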

2.5.2 Natural Language Toolkit (NLTK)

NLTK (Bird et al. 2009) supports a wide range of text processing libraries, covering text classification, tokenization, stemming, tagging, chunking, parsing, and semantic reasoning. It also provides intuitive interfaces to more than 50 corpora and lexical resources, including WordNet. It is well documented with tutorials, animated algorithms, and problem sets, and is thoroughly discussed in a comprehensive book by Bird et al. (2009). The suite is developed using the Python programming language, and an active community also creates Python wrappers for state-of-the-art tools, respecting the NLTK interfaces. For instance, there is a Python wrapper to use MaltParser in NLTK.
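A minimal sketch of an NLTK pipeline covering tokenization, POS tagging, and chunk-based named entity recognition (assuming the relevant NLTK data packages, such as punkt and the NE chunker models, have been downloaded):

import nltk

text = "John Bardeen is the only laureate to win the Nobel Prize in physics twice."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)
# ne_chunk marks named entities as subtrees, e.g. (PERSON John/NNP Bardeen/NNP)
print(nltk.ne_chunk(tagged))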

2.5.3 GATE

GATE (Cunningham et al. 2011) is a development environment for the creation of software components designed to process natural languages. More than providing the end algorithm, it provides specialized data structures and a set of intuitive tools to assist the development of the algorithm. The tools include document annotation mechanisms, a collocation viewer, finite state machines, support vector machines, and text extractors for documents in PDF, RTF, and XML. GATE is over 15 years old and is in active use.

References
Aluísio S, Pelizzoni J, Marchi AR, de Oliveira L, Manenti R, Marquiafável V (2003) An account of the challenge of tagging a reference corpus for Brazilian Portuguese. In: Computational processing of the Portuguese language. Springer, Berlin, pp 110–117
Apache OpenNLP Development Community (2014) Apache OpenNLP developer documentation. opennlp.apache.org
Appelt DE (1999) Introduction to information extraction. Artif Intell Commun 12:161–172
Bird S, Klein E, Loper E (2009) Natural language processing with Python. O'Reilly, Sebastopol
Brants T (1995) Tagset reduction without information loss. In: Proceedings of the 33rd annual meeting on Association for Computational Linguistics, pp 287–289
Chang AX, Manning CD (2014) TokensRegex: defining cascaded regular expressions over tokens. Technical report CSTR 2014-02. Department of Computer Science, Stanford University, Stanford
Chang P, Galley M, Manning CD (2008) Optimizing Chinese word segmentation for machine translation performance. In: Proceedings of the third workshop on statistical machine translation, pp 224–232
Chomsky N (2002) Syntactic structures. Walter de Gruyter, New York
Cunningham H, Maynard D, Bontcheva K (2011) Text processing with GATE. Gateway Press, Murphys, CA
Feldman R, Sanger J (2007) The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, Cambridge
Giesbrecht E, Evert S (2009) Is part-of-speech tagging a solved task? An evaluation of POS taggers for the German Web as Corpus. In: Proceedings of the fifth Web as Corpus workshop, pp 27–35
Giménez J, Màrquez L (2004) SVMTool: a general POS tagger generator based on support vector machines. In: Proceedings of the 4th international conference on Language Resources and Evaluation (LREC'04), Lisbon
Güngör T (2010) Part-of-speech tagging. In: Indurkhya N, Damerau FJ (eds) Handbook of natural language processing, 2nd edn. CRC/Taylor and Francis Group, Boca Raton
Hall J, Nilsson J, Nivre J (2010) Single malt or blended? A study in multilingual parser optimization. In: Trends in parsing technology. Springer, Berlin, pp 19–33
Hall D, Durrett G, Klein D (2014) Less grammar, more features. In: Proceedings of ACL, Baltimore, pp 228–237
Hotho A, Nürnberger A, Paaß G (2005) A brief survey of text mining. LDV Forum 20:19–62
Huang C-R, Šimon P, Hsieh S-K, Prévot L (2007) Rethinking Chinese word segmentation: tokenization, character classification, or wordbreak identification. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pp 69–72
Huffman SB (1996) Learning information extraction patterns from examples. In: Wermter S, Riloff E, Scheler G (eds) Connectionist, statistical and symbolic approaches to learning for natural language processing. Springer, Berlin, pp 246–260
Jurafsky D, Martin JH (2008) Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd edn. Prentice Hall, New York
Kiss T, Strunk J (2006) Unsupervised multilingual sentence boundary detection. Comput Linguist 32:485–525
Klein D, Manning CD (2003) Accurate unlexicalized parsing. In: Proceedings of the 41st annual meeting on Association for Computational Linguistics, vol 1, pp 423–430
Langacker RW (1997) Constituency, dependency, and conceptual grouping. Cogn Linguist 8:1–32
Manning CD (2011) Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In: Gelbukh A (ed) Computational linguistics and intelligent text processing – 12th international conference CICLing. Lecture notes in computer science. Springer, Berlin, pp 171–189
Manning CD, Surdeanu M, Bauer J, Finkel J, Bethard SJ, McClosky D (2014) The Stanford CoreNLP natural language processing toolkit. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics: system demonstrations, pp 55–60
Marantz A (1997) No escape from syntax: don't try morphological analysis in the privacy of your own lexicon. University of Pennsylvania working papers in linguistics 4, p 14
Martins AFT, Smith NA, Xing EP (2009) Concise integer linear programming formulations for dependency parsing. In: Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP, vol 1, pp 342–350
McNamee P, Mayfield J (2004) Character n-gram tokenization for European language text retrieval. Inf Retr 7:73–97
Monroe W, Green S, Manning CD (2014) Word segmentation of informal Arabic with domain adaptation. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics, vol 2 (short papers). ACL, Baltimore, pp 206–211
Nivre J (2005) Dependency grammar and dependency parsing. MSI report 5133, pp 1–32
Nivre J, Hall J, Nilsson J, Chanev A, Eryigit G, Kübler S, Marinov S, Marsi E (2007) MaltParser: a language-independent system for data-driven dependency parsing. Nat Lang Eng 13:95–135
Nugues PM (2006) Syntactic formalisms. In: Nugues PM (ed) An introduction to language processing with Perl and Prolog. Springer, Berlin, pp 243–275
Padró L, Stanilovsky E (2012) FreeLing 3.0: towards wider multilinguality. In: Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, pp 2473–2479
Palmer DD, Hearst MA (1997) Adaptive multilingual sentence boundary disambiguation. Comput Linguist 23:241–267
Piskorski J, Yangarber R (2013) Information extraction: past, present and future. In: Poibeau T, Saggion H, Piskorski J, Yangarber R (eds) Multi-source, multilingual information extraction and summarization. Springer, Berlin, pp 23–49
Porter MF (1980) An algorithm for suffix stripping. Program Electron Libr Inf Syst 14:130–137
Reynar JC, Ratnaparkhi A (1997) A maximum entropy approach to identifying sentence boundaries. In: Proceedings of the fifth conference on applied natural language processing, ANLC'97. ACL, Stroudsburg, pp 16–19
Schmid H (1994) Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the international conference on new methods in language processing, Manchester
Tesnière L (1959) Éléments de syntaxe structurale. Librairie C. Klincksieck, Paris
Toutanova K, Klein D, Manning CD, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on human language technology, vol 1, pp 173–180
Wong DF, Chao LS, Zeng X (2014) iSentenizer-μ: multilingual sentence boundary detection model. ScientificWorldJournal 2014. doi:10.1155/2014/196574

Chapter 3

Identifying Things, Relations, and Semantizing Data

Abstract This chapter concludes the presentation of the generic pipelined architecture of Information Extraction (IE) systems by presenting its domain-dependent part.
After preparation and enrichment, the documents' contents are now characterized and suitable to be processed to locate and extract information. This chapter explains how this can be performed, addressing both the extraction of entities and of relations between entities.
Identifying entities mentioned in texts is a pervasive task in IE. It is called Named Entity Recognition (NER) and seeks to locate and classify textual mentions that refer to specific types of entities, such as, for example, persons, organizations, addresses, and dates.
The chapter also dedicates attention to how to store the extracted information and how to take advantage of semantics to improve the information extraction process, presenting the basis of Ontology-Based Information Extraction (OBIE) systems.

Keywords Information extraction · Entities · Relations · Named entity recognition · NER · Parse tree · Dependencies · Ontology-based information extraction · OBIE

3.1 Identifying the Who, the Where, and the When

After preparation and enrichment, the documents' contents are now characterized
and suitable to be processed by algorithms that will locate and extract information
(Ratinov and Roth 2009). The type of information to be extracted depends on the
purpose of the application and can range from the detection of a defined set of relevant entities to an attempt to extract arbitrary information at the Web scale, or
something in between.
The goal is to identify entities in texts and the relations they participate in, which informally translates to discovering who did what to whom, when, and why (Màrquez et al. 2008). Entities to locate include people, organizations, locations, and dates, while relations can be physical (near, part), personal or social (son, friend, business), and membership (staff, member-of-group) (Bontcheva et al. 2009).


Identifying entities mentioned in texts is a pervasive task in IE, known as named entity recognition (NER). Named entity recognition seeks to locate and classify
textual mentions that refer to specific types of individuals such as persons and
organizations, and can also be references to addresses and dates (Nadeau and
Sekine 2007; Tjong Kim Sang and De Meulder 2003). Named entities are often composed of a sequence of nouns referring to a single entity, e.g. "Ban Ki-moon" or "The Secretary General of the United Nations". Named entity recognition is
usually an early step that prepares further processing, and is also a relevant task by itself, as there are many applications that just need to detect the entities referred to in the documents.
To illustrate the utility of recognizing named entities, consider a website gathering contributions from several authors (think of Wikipedia or a news website) that wants to link each author's name to a page with a short biography, or with information about professional interests. If done manually, this task is error prone and time consuming. Having a method to automatically detect authors can be quite straightforward and advantageous.
Another possibility would be having each person referred to in the articles, not just the authors, tracked across the website pages, providing a way to navigate through related topics and pointing readers to historical data about that person: a politician and how he has performed recently in the polls, an athlete and his latest scores and achievements, or the latest gossip concerning a public figure. Other benefits would be using such data, and also data about locations or products mentioned in articles, to improve website visibility, by introducing those entities automatically as page metadata or by having advertisements associated with specific types of entities.
Named entity recognition is also a relevant preprocessing step for language analyses other than IE. For instance, in machine translation it is known that names translate differently than regular text, and thus it is important to detect them so that distinct procedures can be applied (Babych and Hartley 2003; Koehn et al. 2007). The same applies to question answering systems: as questions are usually about specific domains, names help to discover the domain, since it is possible to detect whether names represent a person, a government organization, a sports organization, a location, etc. (Grishman 1997).
Named entity recognition is often considered a two-step procedure: first the boundaries of entities are detected, and then each entity is assigned a predefined category such as person, organization, location, or date. Boundary detection methods, whether using hand-crafted rules or some probabilistic approach, usually rely on features such as part-of-speech tags, word capitalization, and lexical features such as the values of the preceding, current, and following words (Nadeau and Sekine 2007). For instance, if a word has the value "Mr." the following word(s) likely denote a person's name. Adding to these methods, it is also common to have gazetteers of common entities, including people's names and well-known companies. In the case of entities with well-defined shapes, like dates, email addresses, and phone numbers, a widespread technique is to match their patterns using regular expressions.


An example for locating people, using part-of-speech tags and word capitalization, is setting boundaries on sequences of proper nouns. Considering the example depicted in Fig. 2.1, this simple method would allow isolating candidate entities for people's names in the sentence:

<NE>John Bardeen</NE> is the only laureate to win the <NE>Nobel Prize</NE> in physics twice, in 1956 and 1972.
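A minimal sketch of this boundary detection heuristic, grouping maximal runs of proper-noun tags (NNP/NNPS) in a POS-tagged sentence into candidate entities:

def candidate_entities(tagged):
    """Group consecutive proper nouns (NNP/NNPS) into candidate named entities."""
    candidates, current = [], []
    for word, tag in tagged:
        if tag in ('NNP', 'NNPS'):
            current.append(word)
        elif current:
            candidates.append(' '.join(current))
            current = []
    if current:
        candidates.append(' '.join(current))
    return candidates

tagged = [('John', 'NNP'), ('Bardeen', 'NNP'), ('is', 'VBZ'), ('the', 'DT'),
          ('only', 'JJ'), ('laureate', 'NN'), ('to', 'TO'), ('win', 'VB'),
          ('the', 'DT'), ('Nobel', 'NNP'), ('Prize', 'NNP'), ('in', 'IN'),
          ('physics', 'NN'), ('twice', 'RB')]
print(candidate_entities(tagged))  # ['John Bardeen', 'Nobel Prize']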
After having the candidates, the next step is to assign a category to each candidate. Considering the example, the goal of this classification is to discriminate the type of "John Bardeen" as person and "Nobel Prize" as an award. NE classification methods include textual patterns for detecting elements such as addresses and dates, the use of gazetteers, and algorithms exploring information sources such as Wikipedia or Google (Whitelaw et al. 2008).
Although gazetteers can be used to detect boundaries and classify entities, modern approaches avoid relying too much on them, as compiling such lists is a time-consuming process that often needs to be redone when changing language and/or application domain, and the lists rapidly prove incomplete. Some recent approaches replace gazetteers by information sources such as Wikipedia (Bizer et al. 2009; Suchanek et al. 2007; Wu et al. 2008). Wikipedia brings the advantage of being updated daily, with the possibility of querying it online or downloading and using freely available snapshots offline.
Considering the example, a possible classification algorithm using Wikipedia can be based on querying the page of each named entity candidate and, if found, evaluating whether its Wikipedia categories include one of the predefined categories of the application. If the application includes categories for people and awards, it would be possible to classify <NE>John Bardeen</NE> as people, given that people is included in its Wikipedia categories, and to classify <NE>Nobel Prize</NE> as award for the same reason. Table 3.1 presents the Wikipedia categories found for our example.
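A minimal sketch of such a classifier, querying the public MediaWiki web API with the requests library; the category keywords are illustrative, and a production system would need disambiguation and error handling.

import requests

def wikipedia_categories(title):
    """Return the Wikipedia category names of a page, via the MediaWiki API."""
    params = {"action": "query", "prop": "categories", "titles": title,
              "cllimit": "max", "format": "json"}
    data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return [c["title"] for c in page.get("categories", [])]

def classify(candidate, targets):
    """Assign the first application category whose keyword occurs in the page categories."""
    categories = " ".join(wikipedia_categories(candidate)).lower()
    for label, keywords in targets.items():
        if any(keyword in categories for keyword in keywords):
            return label
    return None

targets = {"person": ["people", "births"], "award": ["awards", "prizes"]}
print(classify("John Bardeen", targets))  # person
print(classify("Nobel Prize", targets))   # award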
Nadeau and Sekine (2007) and Mohit (2014) provide comprehensive surveys of
the methods proposed for NER.
Table 3.1 Wikipedia categories found for each candidate entity in the example presented in Fig. 2.1

John Bardeen: People from Madison, Wisconsin | American people of Russian descent | 1908 births | 1991 deaths | American agnostics | American electrical engineers | American Nobel laureates | American physicists | Foreign Members of the Royal Society | Nobel laureates in Physics | Nobel laureates with multiple Nobel awards | Oliver E. Buckley Condensed Matter Prize winners | Princeton University alumni | Quantum physicists | University of Wisconsin–Madison alumni

Nobel Prize: Academic awards | Awards established in 1895 | International awards | Science and engineering awards | Organizations based in Sweden | Nobel Prize

The categories relevant for the example are presented first


The recognition of generic named entities such as people, locations, and dates can be done using suites such as OpenNLP, NLTK, or StanfordNLP, presented in Chap. 2. For named entities relative to more specialized domains it can be difficult to find a ready-to-use software package. One exception is the biomedical domain, for which it is possible to find named entity recognizers. Becas¹ (Nunes et al. 2013) and KLEIO² (Nobata et al. 2008) are two relevant examples of such tools.

3.2 Relating Who, What, When, and Where

Named entity recognition identifies the entities referred to in the documents but, by itself, does not reveal what kind of events those entities were involved in, which is the reason they were mentioned in the first place. For that, it is necessary to know what actions they are involved in, which is to say, the relations they establish with other entities (Banko and Etzioni 2008; Schutz and Buitelaar 2005). This is an important task for applications wishing to have a formal structure for parts of the content of the document. Considering, again, the example of Fig. 2.1, detecting and classifying the entities "John Bardeen" and "Nobel Prize" is not enough to know if both entities are related and, if they are, how they are related. Already knowing that "John Bardeen" is a person and that "Nobel Prize" is an award, possible relations would be having "John Bardeen" as the winner, the sponsor, a jury member, or someone that attended the ceremony of the award "Nobel Prize".
A relation is a predication about a pair of entities. Examples of common relations include relations of the types: (1) physical: located, near, part, etc.; (2) personal or social: business, family, friend, etc.; (3) employment or membership: member of, employee, staff, etc.; (4) agent to artifact: user, owner, inventor, etc.; and (5) affiliation: citizen, resident, ideology, ethnicity, etc. In the example of John Bardeen and the Nobel Prize, a relation between the entities is "John Bardeen" winner_of "Nobel Prize". Differently from named entity recognition, relation extraction is not a process of annotating a sequence of tokens of the original document: relationships express associations between two entities represented by distinct text segments (Sarawagi 2008). Relations involving two or more objects and subjects are known as events.
Approaches to relation extraction tend to steer away from using annotated corpus data, due to the cost of creating such resources and because other sources are available that, while not having the quality of an annotated corpus, can provide high quality results when algorithms take advantage of large volumes of data. Wikipedia is a popular learning source for relation extraction because its pages often have some structured information (the infoboxes) summarizing the content of
¹ http://bioinformatics.ua.pt/becas/#!/
² http://www.nactem.ac.uk/Kleio/


the unstructured information (the page content). This makes it possible to relate both in order to infer how the relations of the infoboxes can be expressed in natural language (Carlson et al. 2010; Suchanek et al. 2007).
Generically, relation extraction methods assume that the entities involved in a relation, the arguments of the relation, are relatively close to each other and are both explicitly present in the sentence where the relation was detected. As with other information extraction tasks, relation extraction can be done using surface patterns or with methods that use POS tags and/or syntactic structure data (Bach and Badaskar 2007). Surface methods assume that the tokens around and between the entities of the relation contain clues for the relation extraction. Based on this idea, patterns that reflect the relation are generated or trained. The patterns can be more or less sophisticated and can include wildcards (Giuliano et al. 2006). Also, depending on whether the original text is stemmed and whether stop words are removed, a pattern can cover more or fewer sentences. For instance, to detect a product-company relation the following patterns can be used:

<PRODUCT> is made by <COMPANY>
<PRODUCT> was created by <COMPANY>
<PRODUCT>, a <COMPANY> creation
The <COMPANY> manufactured product <PRODUCT>
<COMPANY>, the maker of <PRODUCT>
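A minimal sketch of surface-pattern matching, translating two of the patterns above into regular expressions over raw text; the capitalized-phrase approximation of the entity slots is illustrative only.

import re

ENTITY = r"([A-Z][\w-]*(?: [A-Z][\w-]*)*)"  # crude stand-in for a recognized entity
PATTERNS = [
    re.compile(ENTITY + r" is made by " + ENTITY),
    re.compile(ENTITY + r" was created by " + ENTITY),
]

text = "PlayStation is made by Sony. Walkman was created by Sony."
for pattern in PATTERNS:
    for product, company in pattern.findall(text):
        print(product, "->", company)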

It is possible to complement the surface information with POS tags. The advantage of taking POS into account is that verbs play a central role in the type of the relation. This allows improving the surface-based methods by better identifying the words to be targeted by a relation extraction pattern. In the example of Fig. 2.1, "<PERSON> is the only laureate to win the <AWARD>", if the relation extraction pattern takes POS tags into account, the pattern would value the word "win" more than all others in the decision process.
Some approaches use deep syntactic information for detecting relations. Such approaches use a parse tree, whether a constituency or a dependency tree, as the basis for the relation pattern to be matched (Miller et al. 2000; Rodrigues et al. 2011; Suchanek et al. 2006). This type of approach usually allows extracting relations from more complex and longer sentences. Its main disadvantage is the time necessary to compute the tree. To illustrate how this approach works, let us consider the sentences "John Bardeen won the Nobel Prize twice" and "John Bardeen, an American physicist and electrical engineer, won the Nobel Prize twice". Although the sentences are distinct at the surface, the dependency structures that strictly relate the person with the award are the same (see Fig. 3.1).
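A minimal sketch of this idea over dependency triples (in the format produced, for example, by NLTK's DependencyGraph shown in Chap. 2): in both sentences, the nsubj and dobj dependents of the verb "won" yield the same pair, whatever extra material the sentence carries.

def winner_relations(triples):
    """Pair the nsubj and dobj dependents of the verb 'won' into a relation."""
    subject = obj = None
    for (head, _), relation, (dependent, _) in triples:
        if head == 'won' and relation == 'nsubj':
            subject = dependent
        if head == 'won' and relation == 'dobj':
            obj = dependent
    return [(subject, 'winner_of', obj)] if subject and obj else []

# Dependencies shared by both example sentences (simplified to head words)
triples = [(('won', 'VBD'), 'nsubj', ('Bardeen', 'NNP')),
           (('won', 'VBD'), 'dobj', ('Prize', 'NNP')),
           (('won', 'VBD'), 'advmod', ('twice', 'RB'))]
print(winner_relations(triples))  # [('Bardeen', 'winner_of', 'Prize')]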
Relation extraction can be readily done with StanfordNLP. This software suite recognizes the following relations out of the box: Live_In, Located_In, OrgBased_In, and Work_For (Angeli et al. 2014). As for relations of more specialized domains, again, it can be difficult to find a ready-to-use software package, and again one exception is the biomedical domain, where PIE the search³ (Kim et al. 2012), MEDIE⁴ (Miyao et al. 2006), and MedInx (Ferreira et al. 2012; Teixeira et al. 2014) are relevant examples of such tools.

Fig. 3.1 Two sentences with the same dependencies relating John Bardeen and the Nobel Prizes won

3.3 Getting Everything Together

Having extracted the entities of the text and their respective relations, it is then necessary to store this information for later use in the context of the application (Cowie and Lehnert 1996). For applications targeting fixed types of entities and relations it is suitable to store contents in a relational database. However, for applications targeting dynamic sets of relations, a more flexible framework is desirable: a knowledge base conforming to an ontology (Wimalasuriya and Dou 2010). Moreover, several approaches showed that ontology classes, properties, and restrictions could be used to significantly improve the performance of the information extraction process. The success of this type of approach motivated the creation of the term Ontology-Based Information Extraction (OBIE). For this reason, and because relational databases are a well-known technology, only the storage of information using a knowledge base will be discussed here.

3.3.1 Ontology

An ontology is defined as "a formal specification of a shared conceptualization" (Gruber 1993). Its purpose is to specify knowledge about a field of action by describing a hierarchy of concepts related by subsumption relationships, and it can include axioms to express other relationships between concepts and to constrain their
³ http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/PIE/
⁴ http://www.nactem.ac.uk/tsujii/medie/


intended interpretation (Guarino 1998). Ontologies can have different levels of generality: (1) top-level ontologies are domain independent, as they describe general concepts such as space, time, action, etc.; (2) domain ontologies describe a domain such as government or medicine, while task ontologies have the same level of generality as domain ontologies but describe tasks, like selling, instead of a domain; (3) application ontologies describe concepts specific to applications, often related to the internal state of application items, such as "pending approval". Regarding the level of generality, most ontologies used in IE applications are domain ontologies.
Ontologies are usually specified using Resource Description Framework Schema (RDFS) and the Web Ontology Language (OWL). These languages have a well-defined syntax, allowing ontologies to become machine processable, and well-defined semantics that prevent the meaning of the knowledge they define from becoming open to subjective and divergent interpretations (Antoniou and van Harmelen 2009).
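As a small illustration of these languages, the following sketch uses the rdflib Python library to declare, in RDFS, the classes and the property of the running example; the namespace is hypothetical.

from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/awards#")  # hypothetical namespace
g = Graph()
g.add((EX.Person, RDF.type, RDFS.Class))
g.add((EX.Award, RDF.type, RDFS.Class))
g.add((EX.winner_of, RDF.type, RDF.Property))
g.add((EX.winner_of, RDFS.domain, EX.Person))
g.add((EX.winner_of, RDFS.range, EX.Award))
g.add((EX.JohnBardeen, EX.winner_of, EX.NobelPrize))
print(g.serialize(format="turtle"))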
There are free and high quality tools to assist the creation, manipulation, and maintenance of ontologies; Protégé OWL⁵ is one of the best examples (Knublauch et al. 2004; Noy et al. 2000). Other lines of research related to ontologies developed tools that, using and respecting the axioms that define the knowledge of the ontology, are able to infer logical consequences from the existing facts and thus discover relationships that would otherwise remain concealed. Such tools are known as semantic reasoners. Many reasoners use first-order predicate logic to perform reasoning, but there are also examples of probabilistic reasoners (Klinov 2008; Sirin and Parsia 2004).

3.3.2 Ontology-Based Information Extraction (OBIE)

A system is said to implement OBIE when its IE process takes advantage of ontologies to improve the performance of the extraction. Typical OBIE systems use the ontological properties to guide the IE process, whether by restricting the possible arguments of a relation (e.g. the ontology can define that a person can receive a Nobel Prize but a location cannot) or by inferring and confirming concealed information from the extracted facts. Still using the previous example, even if the entity "John Bardeen" was not identified as a person, the ontology could force the entity to be a person as it received a Nobel Prize. Another characteristic of OBIE systems is that the information extracted is represented using ontologies (Wimalasuriya and Dou 2010).
The approaches of OBIE can be distinct regarding the IE process itself and the
role played by the ontology in the extraction process. Regarding the IE process, the
different approaches were discussed earlier and can be based on surface patterns,
shallow approaches, or on morphosyntactic information, ranging from POS tags to
syntactic trees (deeper approaches). The role played by the ontology includes
defining the extracted information semantics and controlling the IE process. When
⁵ http://protege.stanford.edu/


the ontology is pre-built, the task is usually to fill the knowledge base with instance and property values just as defined by the ontology (Cimiano et al. 2004; Saggion et al. 2007; Wu et al. 2008). Nevertheless, the ontology plays an active role, as it helps restrict relation arguments or discover concealed information by means of its properties and using semantic reasoning. Relevant examples of this type of approach are KIM (Popov et al. 2004), SOBA (Buitelaar et al. 2006), and OntoX (Yildiz and Miksch 2007).
Adding to the role of controlling the IE process, the ontology can itself encode, in its structure, the relations found in the documents. In this type of approach, the ontology is created at runtime and may or may not be updated in subsequent IE sessions. Such an approach implies that the reasoning process is not defined a priori but is instead conditioned by what was found in the text sources. This approach is called Open IE, as it allows detecting instance candidates of arbitrary unknown relations (Banko et al. 2007). A major challenge of Open IE is the higher level of error when compared with other IE approaches. Despite the common usage of shallow linguistic analysis, heuristics based on lexical features, and frequency analysis, it is not easy to filter out noisy or irrelevant information, due to difficulties in estimating the confidence of the learned rules (Moro et al. 2013). Most of the confidence estimation approaches rely on redundant data, and some also use negative examples to filter out wrong assumptions. Relevant examples of this type of approach are TextRunner (Yates et al. 2007), Kylin (Wu et al. 2008), and NELL (Carlson et al. 2010).

References
Angeli G, Tibshirani J, Wu JY, Manning CD (2014) Combining distant and partial supervision for relation extraction. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)
Antoniou G, van Harmelen F (2009) Web Ontology Language: OWL. In: Staab S, Studer R (eds) Handbook on ontologies, 2nd edn. International handbooks on information systems. Springer, Berlin, pp 91–110
Babych B, Hartley A (2003) Improving machine translation quality with automatic named entity recognition. In: Proceedings of the 7th international EAMT workshop on MT and other language technology tools, improving MT through other language technology tools: resources and tools for building MT, pp 1–8
Bach N, Badaskar S (2007) A review of relation extraction. In: Literature review for language and statistics II
Banko M, Etzioni O (2008) The tradeoffs between open and traditional relation extraction. In: Proceedings of ACL-08: HLT, pp 28–36
Banko M, Cafarella MJ, Soderland S, Broadhead M, Etzioni O (2007) Open information extraction from the web. In: IJCAI, pp 2670–2676
Bizer C, Lehmann J, Kobilarov G, Auer S, Becker C, Cyganiak R, Hellmann S (2009) DBpedia – a crystallization point for the web of data. Web Semant 7:154–165
Bontcheva K, Davis B, Funk A, Li Y, Wang T (2009) Human language technologies. In: Davies J, Grobelnik M, Mladenic D (eds) Semantic knowledge management: integrating ontology management, knowledge discovery and human language technology. Springer, Berlin/Heidelberg, pp 37–49
Buitelaar P, Cimiano P, Racioppa S, Siegel M (2006) Ontology-based information extraction with SOBA. In: Proceedings of the international conference on Language Resources and Evaluation, pp 2321–2324
Carlson A, Betteridge J, Kisiel B, Settles B, Hruschka ER Jr, Mitchell TM (2010) Toward an architecture for never-ending language learning. In: Proceedings of the conference on artificial intelligence (AAAI), pp 1306–1313
Cimiano P, Handschuh S, Staab S (2004) Towards the self-annotating web. In: Proceedings of the 13th international conference on World Wide Web, pp 462–471
Cowie J, Lehnert W (1996) Information extraction. Commun ACM 39:80–91
Ferreira L, Teixeira A, Cunha JP (2012) Medical information extraction – information extraction from Portuguese hospital discharge letters. Lambert Academic, Saarbrücken
Giuliano C, Lavelli A, Romano L (2006) Exploiting shallow linguistic information for relation extraction from biomedical literature. In: Proceedings of the eleventh conference of the European chapter of the Association for Computational Linguistics. EACL, pp 401–408
Grishman R (1997) Information extraction: capabilities and challenges. In: Information extraction: a multidisciplinary approach to an emerging information technology. Springer, Berlin, pp 10–27
Gruber TR (1993) A translation approach to portable ontology specifications. Knowl Acquis 5:199–220
Guarino N (1998) Formal ontology and information systems. In: FOIS'98 – Proceedings of the international conference on formal ontology in information systems. IOS Press, Amsterdam, pp 3–15
Kim S, Kwon D, Shin S-Y, Wilbur WJ (2012) PIE the search: searching PubMed literature for protein interaction information. Bioinformatics 28:597–598. doi:10.1093/bioinformatics/btr702
Klinov P (2008) Pronto: a non-monotonic probabilistic description logic reasoner. In: Bechhofer S, Hauswirth M, Hoffmann J, Koubarakis M (eds) The Semantic Web: research and applications – Proceedings of the 5th European Semantic Web conference. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 822–826
Knublauch H, Fergerson R, Noy N, Musen M (2004) The Protégé OWL plugin: an open development environment for semantic web applications. In: McIlraith S, Plexousakis D, van Harmelen F (eds) The Semantic Web – ISWC 2004 – Proceedings of the 3rd international Semantic Web conference. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 229–243
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, et al (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pp 177–180
Màrquez L, Carreras X, Litkowski KC, Stevenson S (2008) Semantic role labeling: an introduction to the special issue. Comput Linguist 34:145–159
Miller S, Fox H, Ramshaw L, Weischedel R (2000) A novel use of statistical parsing to extract information from text. In: Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference, pp 226–233
Miyao Y, Ohta T, Masuda K, Tsuruoka Y, Yoshida K, Ninomiya T, Tsujii J (2006) Semantic retrieval for the accurate identification of relational concepts in massive textbases. In: Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the ACL – ACL'06, pp 1017–1024. doi:10.3115/1220175.1220303
Mohit B (2014) Named entity recognition. In: Zitouni I (ed) Natural language processing of Semitic languages. Springer, Berlin, pp 221–245
Moro A, Li H, Krause S, Xu F, Navigli R, Uszkoreit H (2013) Semantic rule filtering for web-scale relation extraction. In: The Semantic Web – ISWC 2013, LNCS, vol 8218. Springer, Berlin, pp 347–362. doi:10.1007/978-3-642-41335-3_22
Nadeau D, Sekine S (2007) A survey of named entity recognition and classification. Linguist Investig 30:3–26
Nobata C, Cotter P, Okazaki N, Rea B, Sasaki Y, Tsuruoka Y, Tsujii J, Ananiadou S (2008) Kleio: a knowledge-enriched information retrieval system for biology. In: SIGIR'08: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval, pp 787–788. doi:10.1145/1390334.1390504
Noy N, Fergerson R, Musen M (2000) The knowledge model of Protégé-2000: combining interoperability and flexibility. In: Dieng R, Corby O (eds) EKAW 2000 – Proceedings of the 12th international conference on knowledge engineering and knowledge management. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 69–82
Nunes T, Campos D, Matos S, Oliveira JL (2013) BeCAS: biomedical concept recognition services and visualization. Bioinformatics 29:1915–1916. doi:10.1093/bioinformatics/btt317
Popov B, Kiryakov A, Ognyanoff D, Manov D, Kirilov A (2004) KIM – a semantic platform for information extraction and retrieval. Nat Lang Eng 10:375–392
Ratinov L, Roth D (2009) Design challenges and misconceptions in named entity recognition. In: Proceedings of the thirteenth conference on computational natural language learning (CoNLL), pp 147–155
Rodrigues M, Dias GP, Teixeira A (2011) Ontology driven knowledge extraction system with application in e-government. In: Proceedings of the 15th Portuguese conference on artificial intelligence, Lisboa, pp 760–774
Saggion H, Funk A, Maynard D, Bontcheva K (2007) Ontology-based information extraction for business intelligence. In: The Semantic Web. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 843–856
Sarawagi S (2008) Information extraction. Found Trends Database 1:261–377
Schutz A, Buitelaar P (2005) RelExt: a tool for relation extraction from text in ontology extension. In: The Semantic Web – ISWC 2005. Springer, Berlin, pp 593–606
Sirin E, Parsia B (2004) Pellet: an OWL DL reasoner. In: Haarslev V, Möller R (eds) DL 2004 – Proceedings of the 2004 international workshop on description logics, CEUR workshop proceedings, pp 212–213
Suchanek FM, Ifrim G, Weikum G (2006) Combining linguistic and statistical analysis to extract relations from web documents. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 712–717
Suchanek F, Kasneci G, Weikum G (2007) Yago: a core of semantic knowledge. In: Proceedings of the 16th international conference on World Wide Web (WWW'07). ACM, pp 697–706. doi:10.1145/1242572.1242667
Teixeira A, Ferreira L, Rodrigues M (2014) Online health information semantic search and exploration: reporting on two prototypes for performing extraction on both a hospital intranet and the World Wide Web. In: Neustein A (ed) Text mining of web-based medical content. De Gruyter, Berlin, pp 49–73
Tjong Kim Sang EF, De Meulder F (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the seventh conference on natural language learning at HLT-NAACL 2003. ACL, pp 142–147
Whitelaw C, Kehlenbeck A, Petrovic N, Ungar LH (2008) Web-scale named entity recognition. In: CIKM 2008 – Proceedings of the 17th ACM conference on information and knowledge management. ACM, New York, pp 123–132
Wimalasuriya DC, Dou D (2010) Ontology-based information extraction: an introduction and a survey of current approaches. J Inf Sci 36:306–323
Wu F, Hoffmann R, Weld DS (2008) Information extraction from Wikipedia: moving down the long tail. In: Li Y, Liu B, Sarawagi S (eds) KDD'08 – Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 731–739
Yates A, Banko M, Broadhead M, Cafarella MJ, Etzioni O, Soderland S (2007) TextRunner: open information extraction on the web. In: Sidner CL, Schultz T, Stone M, Zhai C (eds) NAACL-HLT (demonstrations) – Proceedings of human language technologies: the annual conference of the North American chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Morristown, pp 25–26
Yildiz B, Miksch S (2007) ontoX – a method for ontology-driven information extraction. In: Computational science and its applications – ICCSA 2007. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 660–673

Chapter 4

Extracting Relevant Information Using a Given Semantic

Abstract This chapter presents an example of a software architecture, developed by the authors, for performing Ontology-Based Information Extraction (OBIE) using an arbitrary ontology. The goal of the architecture is to allow the deployment of applications for arbitrary domains without the need for system reprogramming. For that, human operator(s) define the semantics of the application and provide some examples of ontology concepts in target texts; then the system learns how to extract information according to the defined ontology.
An instantiation of the proposed architecture using freely available and high performance software tools is also presented. This instantiation is made for processing texts in a natural language, Portuguese, that was not the original target for most of the tools, showing and discussing the preparation of tools for languages other than the ones provided out of the box.

Keywords Ontology-based information extraction · OBIE · Ontologies · NLP · Semantic extraction · Portuguese · MaltParser

4.1 Introduction

This chapter presents an example of a software architecture developed by the authors for performing OBIE using an arbitrary ontology. The architecture is not for conducting Open IE, since it does not seek to identify arbitrary unknown relations and the ontology is neither built nor updated at runtime by the application. The goal of the architecture is to allow the deployment of applications for arbitrary domains without the need for system reprogramming. For that, a human operator defines the semantics of the application and the system learns how to extract information accordingly. The learning process is based on seed examples, provided by the human operator, of ontology concepts in target texts.
The proposed architecture will be instantiated using freely available and high performance software tools. Moreover, the architecture was instantiated for a natural language that was not the original target language of most tools. The language selected is Portuguese, since it is the sixth most spoken language in the world and is the native language of the authors. The rationale for having a chapter on

this topic is also to show and discuss the preparation of tools for languages not
provided out of the box.
The implementation described here was already used to build applications that extract and display information from local government public documents, and later health information in a specific domain relative to Alzheimer's, Huntington's, and Parkinson's diseases (Rodrigues 2013; Teixeira et al. 2014).

4.2 Defining How and What Information Will Be Extracted

Let us start by defining which information will be extracted, in order to maximize the amount of relevant information extracted and avoid capturing information that is not needed. The idea is to define what relevant information is, given that some design decisions depend on this definition. Information relevancy is not an absolute value. Intuitively, it is possible to say that topics outside the scope of the application are not relevant information. For instance, if an application is about places to go on holidays, it is not relevant to acquire information about cars. Also, information relevancy depends on the expertise of the target audience. Information should be more detailed if the audience has a good grasp of the topic, and less detailed otherwise. Taking this into consideration implies that the proposed architecture should only extract information about topics explicitly required by the application, and with the same level of granularity specified by the ontology and seed examples.
Also, due to possible changes in the end users' expectations and in the information sources available, the relevance of information can change over time. As an example, let us consider the electronic consumer market. Just a few years back, in the cellular phone market, no one knew the number of cores of the central processing unit. Today it is almost mandatory to know it, even for customers that are not fully aware of the meaning of this number. This is because this information has become a key selling point and customers understand that more is better. Knowing the number of cores helps to assess if the price is right or not. The implication is having a solution with the ability to adapt the information domain as needed. Adapting the information domain means accepting changes in a timely manner, in order to reduce costs and speed up processes. It also means acquiring relevant information for the new domain without significant system reconfiguration or setup. It is important to address this feature from the beginning because it is usually cost intensive to keep knowledge bases up-to-date as the domain changes (Bizer et al. 2009).
The final consideration about information relevancy is perhaps more subtle. In some cases information can be irrelevant even if it is about a sought topic. Still considering the example of cellular phones, the application's purpose can be to obtain as much information as possible about the available models and manufacturers to compare prices and specifications. In such a scenario, when faced with documents from regulating authorities, for example the United States Federal Communications Commission (FCC), which approves whether a model is suitable to go to market in the United States, the information in those documents can be considered irrelevant for this application, since all devices on the market already have FCC approval.


Three conditions need to be met in order to later discriminate which information is relevant: (1) the seed examples can only include information considered relevant, and all relevant information needs to be an example; (2) the training set containing the seed examples must also contain information about the topic that is considered irrelevant; (3) the machine learning algorithm must support negative examples and assume that all information in the training set that is not a seed example is a negative example.

4.3 Architecture

The architecture proposed has three components:

1. Natural language processing, for handling information sources;
2. Domain representation, which allows human operators to define the application domain;
3. Semantic extraction and integration, which extracts information from the NLP component output in the way defined by the domain representation component.
The NLP component handles natural language texts, obtaining structured, unambiguous, and fixed-format data from these sources, as described in Chap. 2. It includes NLP tools to analyze texts and enrich them with morphosyntactic features. The features should allow extracting information from simple and complex sentences and thus should include POS tags, named entities, and syntactic parse trees. Moreover, the syntactic trees will be generated by a dependency grammar. Dependency grammars explicitly encode predicate-argument structures, a useful feature for extracting relations among entities (Kübler et al. 2009). This should be the only component to change when instantiating the architecture for distinct natural languages.
The purpose of the domain representation component is to help human operators define the application domain, that is, the kind of information to be extracted from the outputs of the NLP component. It contains tools to define the data semantics using an ontology, and tools to mark seed examples of ontology classes and relations in natural language texts. It is advisable that the ontology supports references to the original information sources to ensure traceability: being able to trace information back to the original source makes it possible to verify whether the extracted information is correct. A good option for referencing the information sources is to use the DCMI Terms created and maintained by the Dublin Core Metadata Initiative (DCMI), since it is a de facto standard (Weibel et al. 1998).
The task of the third component, named semantic extraction and integration, is to extract, organize, and store the output of the NLP component according to the semantics defined by the domain representation. It needs machine learning algorithms to support domain adaptation: the idea that a change in the information domain should not require software reprogramming or re-engineering implies that the software adapts to, i.e. learns, the domain specification. The machine learning algorithms use the seed examples, the ontology, and the NLP outputs to learn how to associate morphosyntactic data to ontology classes and relations.


[Fig. 4.1 Architecture of the system, with its three blocks: natural language processing (sentence splitting, POS tagging, NER, and syntactic parsing of the documents), domain representation (ontology editor and example annotation), and semantic extraction and integration (extraction model training and semantic extraction, complemented by external structured sources and feeding the knowledge base). The top left to right arrow represents the training flow and the bottom left to right arrow the runtime flow.]

The learnt associations are deployed as semantic extraction models. This component has two modes of operation: (1) training, to produce the semantic extraction models, and (2) runtime, to apply the models to new, previously unseen texts, already enriched by the NLP component, in order to extract information from them. Information that is missing according to the ontology can be searched in external structured sources. This search is conducted via specific connectors. Since the external data is structured, it is possible to develop connectors that directly assign the appropriate semantics to the data. A change in the ontology, however, can imply changes in these connectors. After the IE process, all information is stored in a knowledge base that conforms to the defined ontology.
Figure 4.1 depicts the architecture with the three components: (1) natural language processing, (2) domain representation, and (3) semantic extraction and integration. The top left to right arrow indicates the flow of the training procedure, starting from the documents (Docs) and ending in the extraction model training. The lower left to right arrow indicates the flow at runtime, after the extraction models have been trained. The NLP processing is the same, but the runtime flow does not include the domain representation, since it is now encoded in the extraction models. The flow starts at the documents and ends at the knowledge base.

4.4 Implementation of a Prototype Using State-of-the-Art Tools

This section describes a possible instantiation of the architecture to build a prototype for extracting information from Portuguese texts using state-of-the-art tools. The tools used in this implementation were selected taking into consideration their performance as well as their ease of training and use. The selection presented here is not intended to represent the single best solution; it is, however, a good solution considering the target natural language.

4.4.1 Natural Language Processing

The natural language processing component, as in many systems and as described in Chap. 2, is organized in four sequential steps: sentence boundary detection, POS tagging, NER, and syntactic parsing.
Before getting into the details of the processing pipeline, we describe the corpus used to prepare most of the tools for Portuguese. The annotated corpus used is Bosque, a subset of a publicly available treebank for Portuguese named Floresta Sintá(c)tica, built using two daily newspaper corpora: the newspaper Público from Portugal and the newspaper Folha de S. Paulo from Brazil (Afonso et al. 2002). The Bosque subset was fully revised by linguists and contains 9,368 sentences and about 186,000 words (Freitas et al. 2008). The version used was Bosque v7.3 because it is the only version with syntactic trees using dependency structures.1
Table 4.1 presents the first two sentences of Bosque. The sentences are in the format defined for the Tenth Conference on Computational Natural Language Learning (CoNLL-X) (Màrquez and Klein 2006). Each token has ten fields:

1. ID, a token counter starting at 1 for each new sentence;
2. FORM, the word form or punctuation symbol;
3. LEMMA, the lemma or stem (depending on the particular data set) of the word form, or an underscore if not available;
4. CPOSTAG, a coarse-grained part-of-speech tag, where the tag set depends on the language;
5. POSTAG, a fine-grained part-of-speech tag, where the tag set depends on the language, or identical to the coarse-grained part-of-speech tag if not available;
6. FEATS, an unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (|), or an underscore if not available;
7. HEAD, the head of the current token, which is either a value of ID or zero; depending on the original treebank annotation, there may be multiple tokens with a HEAD of zero;
8. DEPREL, the dependency relation to the HEAD; the set of dependency relations depends on the particular language, and, depending on the original treebank annotation, the dependency relation may be meaningful or simply ROOT;
9. PHEAD, the projective head of the current token, which is either a value of ID or zero (0), or an underscore if not available;
10. PDEPREL, the dependency relation to the PHEAD, or an underscore if not available; the set of dependency relations depends on the particular language.

The last two fields, PHEAD and PDEPREL, are to be filled by the syntactic parser.

1 http://www.linguateca.pt/floresta/CoNLL-X/

Table 4.1 First two sentences of Bosque v7.3

ID  FORM         LEMMA        CPOSTAG  POSTAG  FEATS            HEAD  DEPREL  PHEAD  PDEPREL
1   Um           um           art      art     <arti>|M|S       2     >N
2   revivalismo  revivalismo  n        n       M|S              0     UTT
3   refrescante  refrescante  adj      adj     M|S              2     N<

1   O            o            art      art     <artd>|M|S       2     >N
2   7_e_Meio     7_e_Meio     prop     prop    M|S              3     SUBJ
3   é            ser          v        v-fin   PR|3S|IND        0     STA
4   um           um           art      art     <arti>|M|S       5     >N
5   ex-libris    ex-libris    n        n       M|P              3     SC
6   de           de           prp      prp     <sam->           5     N<
7   a            o            art      art     <-sam>|<artd>|S  8     >N
8   noite        noite        n        n       F|S              6     P<
9   algarvia     algarvio     adj      adj     F|S              8     N<
10  .            .            punc     punc    _                3     PUNC

(The PHEAD and PDEPREL columns are empty; they are filled by the syntactic parser.)
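To make the format concrete, the following minimal Java sketch parses one CoNLL-X token line. The class name and the helper are illustrative assumptions of ours, not part of any of the tools used in the prototype:

// Illustrative sketch: one CoNLL-X token line is tab separated into ten fields.
public class ConllToken {
    public final int id;         // (1) token counter within the sentence
    public final String form;    // (2) word form or punctuation symbol
    public final String lemma;   // (3) lemma or stem, "_" if unavailable
    public final String cpostag; // (4) coarse-grained POS tag
    public final String postag;  // (5) fine-grained POS tag
    public final String feats;   // (6) features separated by "|", "_" if unavailable
    public final int head;       // (7) ID of the head token, 0 for the root
    public final String deprel;  // (8) dependency relation to the head
    public final String phead;   // (9) projective head, filled by the parser
    public final String pdeprel; // (10) projective dependency relation

    public ConllToken(String line) {
        String[] f = line.split("\t");
        if (f.length != 10) {
            throw new IllegalArgumentException("expected 10 fields, got " + f.length);
        }
        id = Integer.parseInt(f[0]);
        form = f[1]; lemma = f[2]; cpostag = f[3]; postag = f[4]; feats = f[5];
        head = Integer.parseInt(f[6]);
        deprel = f[7]; phead = f[8]; pdeprel = f[9];
    }
}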


Sentence Splitting
Sentences are separated using the sentence boundary detector Punkt (Kiss and Strunk 2006). Although Punkt had already been tested with Portuguese, it was not possible to obtain a model for splitting Portuguese sentences. Thus, a model was trained using the Punkt tools and around 6,500 sentences randomly selected from Floresta Sintá(c)tica. Training Punkt is straightforward as it has no training parameters. The trained model was briefly tested with sentences from the same corpus that were not included in the training data. The sentence splitting model was considered ready as the result obtained was F1 = 0.90, which is in line with the values reported in the literature.

POS Tagging
After splitting, sentences are enriched with POS tags assigned by TreeTagger (Schmid 1994). There is a publicly available model for Portuguese (Garcia et al. 2014), but an encoding problem with accented words motivated us to train a new model for Portuguese. Training TreeTagger requires the creation of three files: (1) a lexicon file containing a list of words and their POS tags; (2) tagged training data containing sentences with words and their POS tags, which can vary depending on the word context in the sentence; and (3) an open class file containing the POS tags the tagger can assign when guessing the tags of unknown words. This file was kept the same as for English: N ADJ V-FIN ADV.
All parameters controlling the training process were kept at their default values: the number of preceding words forming the tagging context (context length, default 2); the threshold of information gain below which a leaf node of the decision tree is deleted (minimum decision tree gain, default 0.7); and the weight of the class of words with the same tag probabilities in the computation of the probability estimates (equivalence class weight, default 0.15). The trained model's performance was measured against sentences of Floresta Sintá(c)tica and the precision obtained was 0.92.
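For reference, the training itself is performed with the train-tree-tagger program distributed with TreeTagger. A command along the following lines reproduces the default parameters mentioned above; the file names are placeholders for the three files just described and for the resulting model:

train-tree-tagger -cl 2 -dtg 0.7 -ecw 0.15 lexicon.txt open-class.txt tagged-training.txt portuguese-model.par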

Named Entity Recognition
Named entities are discovered and classified by the publicly available Portuguese NER system REMBRANDT (Cardoso 2008). REMBRANDT identifies and classifies named entities according to the second HAREM directives (see Table 4.2) (Mota and Santos 2008). Alongside categories and types it also assigns subtypes, but these are not used here.
Words belonging to a named entity are grouped using underscores. For instance, the name of the person John Stewart Smith becomes the single token John_Stewart_Smith. The advantage of having the whole named entity as a single token is that it will be seamlessly processed by the parser, and later it is possible to revert to the original tokens by removing the underscores.
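As a minimal illustration of this grouping convention, the two Java helpers below join the tokens of a recognized entity and later revert the operation; they are ours, not part of REMBRANDT:

// Illustrative helpers for the underscore convention described above.
public static String groupEntity(java.util.List<String> tokens) {
    // ["John", "Stewart", "Smith"] becomes "John_Stewart_Smith"
    return String.join("_", tokens);
}

public static String[] revertEntity(String groupedToken) {
    // "John_Stewart_Smith" becomes the original tokens again
    return groupedToken.split("_");
}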


Table 4.2 HAREM categories and types to classify named entities

Abstraccao (abstraction): disciplina (discipline); estado (state); ideia (idea); nome (name); outro (other)
Acontecimento (occurrence): efemeride (unique occurrence, news); evento (event); organizado (organized); outro (other)
Coisa (thing): classe (class); membroclasse (member of class); objecto (object); substancia (substance); outro (other)
Local (place): fisico (physical); humano (human, political); virtual (virtual); outro (other)
Numero (number): numeral (numeral); ordinal (ordinal); textual (textual)
Obra (work): arte (art); plano (plan); reproduzida (reproduced); outro (other)
Organizacao (organization): administracao (administration); empresa (enterprise); instituicao (institution); outro (other)
Pessoa (person): cargo (job, position); grupocargo (position category); grupoind (undefined group); grupomembro (group); individual (individual); membro (member of group); povo (people); outro (other)
Tempo (time): duracao (duration); frequencia (frequency); generico (generic); tempo_calend (calendar time); outro (other)
Valor (value): classificacao (classification, ranking); moeda (currency); quantidade (amount); outro (other)
Outro (other)

Syntactic Parsing
The syntactic parsing is done with MaltParser, a dependency parser (Hall et al. 2007). The parsing algorithm used was the same as in the Single Malt system: pseudo-projective dependency parsing with support vector machines (Hall et al. 2007; Nivre et al. 2006). The parsing model for Portuguese was induced from Bosque v7.3, as used in the CoNLL-X shared task on multilingual dependency parsing.
The outputs of POS tagging and NER are used to generate the input for the syntactic parser. Named entities have their own word forms as lemma and, as POS tag, the tag for proper nouns when their word forms are character strings, or the tag for numbers when their word form is a numeric sequence. After merging the outputs of POS tagging and NER, sentences are analyzed to determine their grammatical structure.

4.4.2 Domain Representation

The creation of an application domain starts with the design or adaptation of an ontology. Ontologies can be difficult to build because they are formal models of human domain knowledge that is often tacit, and often there is more than one possible mapping of that knowledge into formal, discrete structures. Although some rules of thumb exist to help in ontology design, it is more productive to have tools that, at least, identify simple conflicts and allow rapid re-design of ontology parts. In this work the selected ontology editor was Protégé (Knublauch et al. 2004).
Ontology editors are tools that provide assistance in the process of creating, manipulating, and maintaining ontologies. They can work with various representation formats and, among other things, provide ways to merge, visualize, and check the semantic consistency of ontologies (Noy et al. 2000). Protégé is an open-source tool developed at Stanford. Relevant features are its ability to assist users in ontology construction, including importing and merging ontologies, and the existence of several plugins that provide alternative visualization mechanisms and alternative inference engines.
After the application ontology is defined, it is necessary to provide examples of its classes and relations in representative texts. The prototype uses a version of the AKTiveMedia2 ontology-based annotation system that was customized to generate outputs in the same format as the inputs used by the relation learning algorithm developed. Considering a relation triple (subject, relation, object), the format defined is:

subjectClass: subjectText relation objectClass: objectText

AKTiveMedia is an open-source tool which supports annotation of text, images, and HTML documents (Chakravarthy et al. 2006). It supports different types of annotations, such as ontology-based annotations as well as free comments. The human annotator starts by highlighting parts of the text and assigning ontology classes to the highlighted parts. Each part can become the subject of an ontology relation whose domain includes the class corresponding to the selected text. For that, it is necessary to select the highlighted text, select a relation in the relation panel, and select the text corresponding to the relation object (Fig. 4.2).

4.4.3 Semantic Extraction and Integration

The model training process generates one semantic extraction model for each ontology class and one for each ontology relation found in the seed examples. This way, a model represents a specific ontology class or relation. A model is a set of syntactic structure examples and counterexamples that were found to encode and restrict the meaning represented by the model. It also contains a statistical classifier that measures the similarity between a given structure to be evaluated and the model's internal examples. The model is said to positively evaluate a sentence fragment if the similarity is higher than a given threshold.
2 http://ftp.jaist.ac.jp/pub/sourceforge/a/ak/aktivemedia/


Fig. 4.2 Screenshot of AKTiveMedia annotation interface. The top left pane shows the ontology
classes and the pane below shows the possible properties for the selected class. The larger pane
shows the document and the annotations highlighted according to the class and property selected

At runtime, each sentence is evaluated by all models, and the fragments positively evaluated by a model are assigned the ontology class or relation represented by that model (Rodrigues et al. 2011a, b).
Unlike the previous two parts of the prototype, which have single usage sequences, this part has two different and interrelated ways of being used. The way it is used depends on the task to be performed: (1) creation of semantic extraction models; (2) usage of semantic extraction models to feed a knowledge base.

Creation of Semantic Extraction Models

The algorithm for creating semantic extraction models was inspired by two works. The first addresses the extraction of instances of binary relations using deep syntactic analysis (Suchanek et al. 2006). In their study, Suchanek et al. (2006) extracted one-to-one and many-to-one relations, such as the birthplace of a person. They used custom-built decision functions to detect facts for each relation, and a set of statistical classifiers to decide whether new patterns are similar to the learned facts. In the developed system, this work was extended to include the extraction of one-to-many and many-to-many relations.


The second work is about improving entity and relation extraction when the process is learned from a small number of labelled examples, using linguistic information and ontological properties (Carlson et al. 2009). Improvements are achieved using the class and relation hierarchy, information about disjunctions, and fact confidence scores. This information is used to bootstrap more examples, generating more data to train the statistical classifiers. For instance, when the system is confident about a fact, as when it was annotated by a person, this fact is used as an instance of the annotated class and/or relation. The fact can also be used as a counterexample of all classes/relations disjoint with the annotated class/relation, and as an instance of the super-class/super-relation. Moreover, facts discovered by the system with a high confidence score can be promoted to examples and included in a new round of training. This creation of more examples is not active by default, as it can lead to overfitting and should be used carefully.
A semantic extraction model contains a collection of partial syntactic structures relative to either examples or counterexamples of the ontology class or property encoded by the model. To obtain these structures, the sentences that originated the examples are located and processed by the NLP part of the prototype. Each annotated example has the format <subject-class: subject-text> <relation-name> <object-class: object-text> and originates three facts:
- subject-text is an individual of class subject-class;
- object-text is an individual of class object-class;
- subject-text has relation relation-name with object-text.
The partial syntactic structures associated with the first two facts associate subjects and/or objects with their ontological classes, based on the syntactic dependencies between the subject/object token and the other tokens of the sentence (Rodrigues et al. 2011b). These models store a collection of pairs for each token that represents the subject/object. Two entities are regarded as equivalent if they connect to the same lemmata using the same dependencies (graph edges), although the lemmata of nouns and adjectives are allowed to differ. Using the previous example of John Bardeen, Fig. 4.3 depicts the data stored by the model that characterizes John Bardeen as a person. In this case, every entity that is the subject of the verb win is a candidate to be a person.
The third fact, the relation, generates subject/object pairs based on the shortest graph path between the elements of the pair. Two paths are regarded as equivalent if they have the same sequence of nodes and edges, although nodes with nouns and adjectives are allowed to differ. Figure 4.4 depicts the path used by the relation models to associate John Bardeen with the Nobel Prize win.

Fig. 4.3 Dependency links that are used by the model to characterize John Bardeen as a person


Fig. 4.4 Dependency links that are used by the model to relate John Bardeen with the Nobel Prize win

The semantic extraction models also contain a statistical classifier that decides whether previously unseen syntactic structures are similar to the ones stored in the model. Structures considered similar enough are assigned the meaning of the model; otherwise they are ignored. The statistical classifiers implemented in the prototype are based on the k-Nearest Neighbors algorithm, but others could be used (Rodrigues et al. 2011a, b).
The statistical classifier training process starts by removing duplicate entries. Then counterexamples are searched for. As it is assumed that all relations are marked in the sample documents, these documents are searched for relation counterexamples. Relation counterexamples are found by having the relation classifiers evaluate all sentences of the sample documents: the counterexamples are all positively evaluated sentences that are not part of the example set. This process is repeated until the number of counterexamples found is below a certain threshold. Rodrigues (2013) provides a detailed description of this process.
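A minimal Java sketch of the decision step is shown below. It is a simplification under our own assumptions: the similarity function is only a placeholder for the real dependency-graph comparison, and a single nearest neighbor is used instead of the full k-Nearest Neighbors machinery:

import java.util.List;

// Illustrative sketch of a model deciding on a new syntactic structure by
// comparing it with stored examples and counterexamples.
public class ExtractionModelSketch {
    private final List<String> examples;
    private final List<String> counterexamples;
    private final double threshold;

    public ExtractionModelSketch(List<String> examples,
                                 List<String> counterexamples,
                                 double threshold) {
        this.examples = examples;
        this.counterexamples = counterexamples;
        this.threshold = threshold;
    }

    // Positive evaluation: similar enough to an example and closer to the
    // examples than to any counterexample.
    public boolean evaluate(String structure) {
        double bestExample = bestSimilarity(examples, structure);
        double bestCounter = bestSimilarity(counterexamples, structure);
        return bestExample >= threshold && bestExample > bestCounter;
    }

    private double bestSimilarity(List<String> pool, String structure) {
        double best = 0.0;
        for (String candidate : pool) {
            best = Math.max(best, similarity(candidate, structure));
        }
        return best;
    }

    // Placeholder: the prototype compares partial dependency structures;
    // exact string match is used here only to keep the sketch self-contained.
    private double similarity(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }
}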

Usage of Semantic Extraction Models to Feed the Knowledge Base

The procedure starts by loading all ontology triples. Triples have the format <subject> <relation> <object>, meaning that a given relation exists between the subject and the object. Then, all sentence graphs are evaluated by the classifiers of all semantic models and are collected when they form a triple. For a sentence fragment to be considered to form a triple, it must be positively evaluated by two models, one for the subject and another for the object, plus one relation model binding the subject and the object.
Information that is missing according to the ontology is searched for in external structured information sources. For instance, unknown locations of entities with a fixed place (such as streets, organization headquarters, and some events) are queried using the Google Maps API. The information acquired from external structured sources is not obtained via the semantic extraction models. This implies that forming triples from it involves writing specific code to transform that information into valid triples for the ontology. It also implies that a change in the ontology probably requires a change in the custom-built code. This prevents this way of acquiring information from having the same level of adaptability as the semantic extraction models, and thus it should be used only when strictly needed.
All collected triples are added to the knowledge base, and their coherence is verified by a semantic reasoner. In the system developed, reasoning is performed by an open-source reasoner for OWL-DL named Pellet (Sirin and Parsia 2004). All triples not coherent with the rest of the knowledge base are discarded, and a warning is issued. The remaining triples become part of the knowledge base and can be queried via a SPARQL endpoint.
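To give an idea of what feeding and querying such a knowledge base can look like in code, the sketch below uses Apache Jena (version 3 or later), one possible RDF toolkit; the namespace and resource names are hypothetical, and the prototype's actual storage code may differ:

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;

public class KnowledgeBaseSketch {
    public static void main(String[] args) {
        String ns = "http://example.org/municipal#"; // hypothetical namespace
        Model model = ModelFactory.createDefaultModel();

        // Store one <subject> <relation> <object> triple.
        Resource subsidy = model.createResource(ns + "subsidy42");
        Property requester = model.createProperty(ns + "requester");
        Resource organization = model.createResource(ns + "GIP");
        model.add(subsidy, requester, organization);

        // Query the model with SPARQL, as a SPARQL endpoint would.
        String queryString =
                "SELECT ?s ?o WHERE { ?s <" + ns + "requester> ?o }";
        try (QueryExecution qe =
                QueryExecutionFactory.create(QueryFactory.create(queryString), model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution solution = results.next();
                System.out.println(solution.get("s") + " requester " + solution.get("o"));
            }
        }
    }
}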

References

Afonso S, Bick E, Haber R, Santos D (2002) Floresta sintá(c)tica: a treebank for Portuguese. In: Proceedings of the third international conference on Language Resources and Evaluation (LREC). pp 1698–1703
Bizer C, Lehmann J, Kobilarov G, Auer S, Becker C, Cyganiak R, Hellmann S (2009) DBpedia: a crystallization point for the Web of Data. Web Semant 7:154–165
Cardoso N (2008) REMBRANDT: Reconhecimento de Entidades Mencionadas Baseado em Relações e ANálise Detalhada do Texto. In: Mota C, Santos D (eds) Desafios na Avaliação Conjunta do Reconhecimento de Entidades Mencionadas: O Segundo HAREM. Linguateca, pp 195–211
Carlson A, Betteridge J, Hruschka ER, Mitchell TM (2009) Coupling semi-supervised learning of categories and relations. In: SemiSupLearn '09: Proceedings of the NAACL HLT 2009 workshop on semi-supervised learning for natural language processing. Association for Computational Linguistics, Stroudsburg, pp 1–9
Chakravarthy A, Ciravegna F, Lanfranchi V (2006) Cross-media document annotation and enrichment. In: SAAW2006: Proceedings of the 1st Semantic Authoring and Annotation Workshop
Freitas C, Rocha P, Bick E (2008) Floresta Sintá(c)tica: bigger, thicker and easier. In: Teixeira A, de Lima V, de Oliveira L, Quaresma P (eds) PROPOR 2008: Proceedings of the international conference on computational processing of the Portuguese language. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 216–219
Garcia M, Gamallo P, Gayo I, Cruz MAP (2014) PoS-tagging the Web in Portuguese. National varieties, text typologies and spelling systems. Nat Lang Process 53:95–101
Hall J, Nilsson J, Nivre J, Eryiğit G, Megyesi B, Nilsson M, Saers M (2007) Single malt or blended? A study in multilingual parser optimization. In: Proceedings of the CoNLL shared task session of EMNLP-CoNLL 2007. Association for Computational Linguistics, Prague, pp 933–939
Kiss T, Strunk J (2006) Unsupervised multilingual sentence boundary detection. Comput Linguist 32:485–525
Knublauch H, Fergerson R, Noy N, Musen M (2004) The Protégé OWL plugin: an open development environment for semantic web applications. In: McIlraith S, Plexousakis D, van Harmelen F (eds) The Semantic Web, ISWC 2004: Proceedings of the 3rd international Semantic Web conference. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 229–243
Kübler S, McDonald R, Nivre J (2009) Dependency parsing. In: Synthesis lectures on human language technologies, vol 2. Morgan & Claypool, San Rafael
Màrquez L, Klein D (eds) (2006) CoNLL-X: Proceedings of the tenth conference on computational natural language learning. Omnipress, New York
Mota C, Santos D (eds) (2008) Desafios na avaliação conjunta do reconhecimento de entidades mencionadas: O Segundo HAREM. Linguateca
Nivre J, Hall J, Nilsson J, Chanev A, Eryiğit G, Kübler S, Marinov S, Marsi E (2006) Labeled pseudo-projective dependency parsing with support vector machines. In: CoNLL-X: Proceedings of the 10th conference on computational natural language learning. Association for Computational Linguistics, Stroudsburg, pp 221–225
Noy N, Fergerson R, Musen M (2000) The knowledge model of Protégé-2000: combining interoperability and flexibility. In: Dieng R, Corby O (eds) EKAW 2000: Proceedings of the 12th international conference on knowledge engineering and knowledge management. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 69–82
Rodrigues M (2013) Model of access to natural language sources in electronic government. Ph.D. thesis, University of Aveiro
Rodrigues M, Dias GP, Teixeira A (2011a) Criação e acesso a informação semântica aplicada ao governo eletrónico. Linguamática 3:55–68
Rodrigues M, Dias GP, Teixeira A (2011b) Ontology driven knowledge extraction system with application in e-government. In: Proceedings of the 15th Portuguese conference on artificial intelligence, Lisboa. pp 760–774
Schmid H (1994) Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the international conference on new methods in language processing, Manchester
Sirin E, Parsia B (2004) Pellet: an OWL DL reasoner. In: Haarslev V, Möller R (eds) DL 2004: Proceedings of the 2004 international workshop on description logics, CEUR workshop proceedings. pp 212–213
Suchanek FM, Ifrim G, Weikum G (2006) LEILA: learning to extract information by linguistic analysis. In: Proceedings of the 2nd workshop on ontology learning and population: bridging the gap between text and knowledge. Association for Computational Linguistics, Sydney, pp 18–25
Teixeira A, Ferreira L, Rodrigues M (2014) Online health information semantic search and exploration: reporting on two prototypes for performing extraction on both a hospital intranet and the world wide web. In: Neustein A (ed) Text mining of web-based medical content. De Gruyter, Berlin, pp 49–73
Weibel S, Kunze J, Lagoze C, Wolf M (1998) Dublin core metadata for resource discovery. Internet Engineering Task Force RFC 2413

Chapter 5
Application Examples

Abstract In this chapter two concrete examples of applications are presented. The first example is a tutorial that is easy to replicate (almost) without requiring computer programming skills. This example elaborates on extracting information useful in a wide range of scenarios: detection of people, organizations, and dates. It shows how to extract information from a Wikipedia page. Most of the system is implemented using the Stanford CoreNLP suite. The second example is more complex and instantiates the OBIE architecture presented in the previous chapter using software tools from different sources that need to be adapted to work together. The application is related to electronic government and processes publicly available documents of municipalities. This second example targets contents written in a natural language whose models are not often available out of the box: Portuguese.

Keywords Applications · Information extraction · Tutorial · Wikipedia · Stanford CoreNLP · XSLT · e-Government · Ontologies · Semantic queries · SPARQL

5.1 A Tutorial Example

This example elaborates on extracting information useful in a wide range of scenarios: names of people and organizations, and dates. Its objective is to serve as an introductory example, illustrating key concepts and issues that usually deserve special attention. A single web page will be used as the information source; however, the work can easily be extended to multiple pages. The information source is the Wikipedia page about Robert Andrews Millikan, a physicist who won the Nobel Prize in Physics. The document can be obtained by entering the uniform resource locator (URL) of the page1 in a web browser and saving it to a local file, or by using a program such as cURL or Wget to download it. This example does not elaborate on how cURL or Wget can be used, as that will be done in the second example.
1 http://en.wikipedia.org/wiki/Robert_Andrews_Millikan


Fig. 5.1 Screenshot of Wikipedia page about Robert Andrews Millikan. The content used in the
example is inside the dashed area (http://en.wikipedia.org/wiki/Robert_Andrews_Millikan)

The intent is to provide a generic example that is not bound to Wikipedia. As such, structures specific to Wikipedia, such as infoboxes, classes, or any document markup, will not be used. The content will be treated as if it were plain text. Although the full document is processed, this example focuses the discussion on the beginning of the first and second paragraphs, to make possible a detailed explanation of what happens in each step of the processing. These particular parts of the document were selected because they contain sentences with several entities, such as people and organization names, and dates. Figure 5.1 shows the Wikipedia page with a dashed box surrounding the parts of the document that will be considered in this example. The text is:

Robert A. Millikan (March 22, 1868 – December 19, 1953) was an American experimental physicist …
Millikan graduated from Oberlin College in 1891 and obtained his doctorate at Columbia University in 1895. In 1896 he became an assistant at the University of Chicago, where he became a full professor in 1910. In 1909 Millikan began …

5.1.1 Selecting and Obtaining Software Tools

The example uses Stanford CoreNLP since it is an NLP pipeline featuring a command line interface. This is a desirable feature since it allows rapidly assessing the performance of a prototype, and it is important for this tutorial example: not having to implement custom code makes the steps clearer and easier to understand. Stanford CoreNLP can be obtained from the download area of its web page.2 For reference, in December 2014 the downloaded filename was stanford-corenlp-full-2014-10-31.zip and the file size was around 251 MB.
Apache OpenNLP is another NLP pipeline featuring a command line interface. OpenNLP was not preferred because its process of identifying named entities does not take advantage of features such as part-of-speech tags, although OpenNLP also implements a POS tagger, which prevents it from performing as well as Stanford CoreNLP. For instance, when tested with the example document, OpenNLP fails to detect the single token Millikan as a person in the sentence Millikan graduated from …. As syntactic parsing is done based on POS tags, such a design option implies writing custom software in order to have named entities alongside POS tags and included in the syntactic parses.
Other software suites such as NLTK and Freeling were not selected for this example as they require writing some programming code, and thus the complexity of the example would increase without a clear benefit. For instance, NLTK needs to be invoked from Python, and Freeling requires C++.

5.1.2 Tools Setup

The latest version of Stanford CoreNLP requires a Java Virtual Machine (JVM) able to run Java 8. Recent operating systems should already have this version installed, or a more recent one. If not, a recent JVM can be obtained from the Java download page on the Oracle website.3
After downloading Stanford CoreNLP it is necessary to unzip it to a desired location. It will be assumed that the unzipped folder is the working directory. Before starting to do intensive processing on large documents, which will take some time to complete, it is possible and recommended to check that everything is working as it should by doing some processing on a small text file. Let us use the sample file input.txt provided with Stanford CoreNLP. As there is already a corresponding output file, input.txt.xml, let us preserve it and copy the file input.txt to a file named testinput.txt, and run the following command (see the command explanation in Table 5.1):

2 http://nlp.stanford.edu/software/corenlp.shtml
3 http://www.oracle.com/technetwork/java/javase/downloads/index-jsp-138363.html


Table 5.1 Explanation of the command line for running StanfordCoreNLP

java -Xmx3g -cp *
  Invocation of the Java Virtual Machine with two parameters: (1) -Xmx3g, to limit the maximum heap size to 3 GB, which is enough but can be increased if necessary and if there is more RAM in the system; (2) -cp *, to say that the classpath of java should be expanded to all JAR files in the current folder. The classpath is a parameter that indicates where the user-defined classes and packages can be found.
edu.stanford.nlp.pipeline.StanfordCoreNLP
  The Java class implementing the CoreNLP controller.
-annotators tokenize,ssplit,pos,lemma,ner
  Parameter to specify the annotators to be used. In the order of the command: tokenizer (tokenize), sentence splitter (ssplit), part-of-speech tagger (pos), lemmatizer (lemma), and named entity recognizer (ner).
-file testinput.txt
  Parameter specifying the file to be processed. The output file will have the same name plus the suffix .xml.

java -Xmx3g -cp * edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -file testinput.txt

A computer with a Core2 processor and 4 GB of RAM takes around 50 s to complete the command. The result is a file named testinput.txt.xml containing some parts that are similar to the file input.txt.xml. The files are not completely equal, as the command did not use the full NLP pipeline in order to avoid the most time-consuming tasks.

5.1.3 Processing the Target Document

This example will consider that the textual content of the Wikipedia page about Robert Andrews Millikan is saved in a text file named r-millikan.txt. Repeating the previous command with the target document takes around 5 min on the same machine:

java -Xmx3g -cp * edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -file r-millikan.txt

The result is a file named r-millikan.txt.xml. Figure 5.2 shows the result corresponding to the beginning of the file when opened in a web browser.

Fig. 5.2 Beginning of the file r-millikan.txt.xml when opened in a web browser, corresponding to the beginning of the first paragraph of the document

In the leftmost column are the IDs of the tokens, and the second column presents the respective token,

which can be a word or a punctuation mark. It is visible that token 7 is an isolated comma, and that in token 2 the period (correctly) remains attached to the letter, as it is part of an abbreviation. Also, token 4 is -LRB-, which stands for left round bracket. The fourth column holds the token lemma. The lemma of a proper noun is the noun itself, and the same applies to punctuation, but token 15 is the verb form was, whose lemma is the verb infinitive form be. The sixth column holds the part-of-speech tag, where NNP stands for singular proper noun, CD stands for cardinal number, VBD means verb in past tense, and DT means determiner. The tag set used is the Penn Treebank set, and all tags and their respective meanings can be consulted at the project website.4 The seventh column indicates whether a token is part of a named entity and, if so, of which one; the eighth column presents the normalized form of that named entity, if such a form exists. The results show that the tokens Robert A. Millikan form a named entity of type PERSON, and that the tokens March 22, 1868 and December 19, 1953 form named entities of type DATE, which are normalized to 1868-03-22 and 1953-12-19, respectively. The value O in the NER column indicates that the corresponding token is outside a named entity and thus is not part of any named entity.
After having the document information in a structured format, it is possible to rewrite it in other formats by using a document object model (DOM) library for manipulating XML content, by writing an XML stylesheet language for transformations (XSLT) stylesheet, or with some custom code.

4 http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
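As an illustration of the custom-code route, the short Java program below uses the standard javax.xml APIs to list every token labelled as part of a named entity. The element names word and NER match the CoreNLP XML output as described above, but should be checked against the actual file:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;

public class NamedEntityLister {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("r-millikan.txt.xml"));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Select every token whose NER label differs from "O" (outside).
        NodeList tokens = (NodeList) xpath.evaluate(
                "//token[NER != 'O']", doc, XPathConstants.NODESET);

        for (int i = 0; i < tokens.getLength(); i++) {
            Element token = (Element) tokens.item(i);
            String word = token.getElementsByTagName("word")
                    .item(0).getTextContent();
            String ner = token.getElementsByTagName("NER")
                    .item(0).getTextContent();
            System.out.println(word + "\t" + ner);
        }
    }
}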


Fig. 5.3 List of named entities found in the document, obtained by opening the file r-millikan.txt.xml in a web browser using the edited CoreNLP-to-HTML.xsl file

For example, with some XSLT writing skills it is possible to obtain the named entity list presented in Fig. 5.3 by changing parts of the file CoreNLP-to-HTML.xsl included in CoreNLP (see Fig. 5.4 for details). It is strongly recommended to back up the file CoreNLP-to-HTML.xsl before making any changes to it. After the backup, open CoreNLP-to-HTML.xsl in a text editor and make the following changes:
- In line 48 replace Sentences with Named Entities Found. This is the title of the results table;
- Comment out lines 57–60, using <!-- for the beginning of the comment block and --> for its end;
- Replace lines 67–189 with the code explained in Fig. 5.4.
These named entities can be further isolated by removing all HTML formatting from the file CoreNLP-to-HTML.xsl. Also, instead of viewing the file in a web browser, it is possible to generate a file with that content by using XSLT transformation software such as SAXON.5 The home edition is free, and its Java version, after downloading and unzipping the folder, can be used as follows (command executed inside the unzipped folder; see the command explanation in Table 5.2):

java -cp * net.sf.saxon.Transform -s:r-millikan.txt.xml -xsl:CoreNLP-to-HTML.xsl -o:result.txt

5 http://saxon.sourceforge.net


Fig. 5.4 XSLT code inserted in CoreNLP-to-HTML.xsl to isolate the named entities. Lines 67–71 define that the XSLT template tokens is applied for each sentence. Lines 73–105 define the template tokens

Table 5.2 Explanation of the command line for running SAXON

java -cp *
  Invocation of the Java Virtual Machine with a parameter (-cp *) to say that the classpath of java should be expanded to all JAR files in the current folder.
net.sf.saxon.Transform
  The Java class implementing the XSLT processor.
-s:r-millikan.txt.xml
  Parameter to specify the input file.
-xsl:CoreNLP-to-HTML.xsl
  Parameter to specify the XSL file.
-o:result.txt
  Parameter to specify the output file.


The list of named entities obtained can then be used for several tasks, including the automatic creation of webpage meta tags to improve the page's visibility.

5.1.4 Using for Other Languages and for Syntactic Parsing

At the time of writing, Stanford CoreNLP is provided with models for processing English, and it is also possible to download and use models for Chinese. To process other languages it is necessary to download the individual components (Stanford POS tagger, Stanford Parser, etc.) and assemble the pipeline. To extend this example with syntactic structures, it is only necessary to add the corresponding annotator to the pipeline. For syntactic structures based on a phrase structure grammar or on a dependency grammar, the commands are, respectively:

java -Xmx3g -cp * edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse -file r-millikan.txt

java -Xmx3g -cp * edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,depparse -file r-millikan.txt
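When the pipeline has to be assembled programmatically, for instance to plug in models for another language, the Java API is used along the lines sketched below; the snippet mirrors the command line invocations above, and for a non-English language the properties would additionally point to that language's model files:

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class PipelineSketch {
    public static void main(String[] args) {
        // Same annotators as in the commands above.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document =
                new Annotation("Millikan graduated from Oberlin College in 1891.");
        pipeline.annotate(document);

        // Print each token with its named entity label.
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                System.out.println(token.word() + "\t"
                        + token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
            }
        }
    }
}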

5.2 Application Example 2: IE Applied to Electronic Government

This example illustrates how it is possible to build applications that extract structured information, with semantic annotations about entities and their relations, by processing batches of documents. The importance of extracting, structuring, and associating semantic meaning to information is related to the possibility of developing computer algorithms able to automatically manipulate such data. This makes it possible to create applications that meet end users' needs by presenting information in several distinct formats, in intuitive and appealing ways.

5.2.1 Goals

The application described here presents the information extracted from natural language sources in distinct formats, such as maps and tables, and is able to perform accurate data searches, all benefitting from the semantic annotations. This example has two objectives. The first is to illustrate how the architecture presented in Chap. 4 can be instantiated. Besides the NLP pipeline, the core of the first example, it is necessary to add components that allow defining the application semantics and others that learn to make the correspondence between the morphosyntactic data and the semantic concepts. With these components in place, it is then possible to use out-of-the-box semantic reasoners that are able to infer new information from the information extracted from texts.
The second objective is related to using software tools from different sources. Quite often, the most appropriate software for one task, e.g. part-of-speech tagging, was developed by one team, whereas tools for other tasks, e.g. syntactic parsing, were developed by other teams. If one wants to use the best components, on top of the software engineering challenges, which are out of scope here, it is often necessary to train the software to process other natural languages. Targeting contents written in Portuguese, or in any other language for which models are not usually provided with natural language processing tools, involves obtaining adequate corpora and preparing and conducting model training sessions.

5.2.2 Documents

The natural language documents processed in this example are the minutes of municipal meetings of the municipalities belonging to the Aveiro district, in Portugal. This set of documents was selected for the following reasons:
- NLP models for processing Portuguese are not usually included with the available software. Portuguese is the sixth most spoken language in the world, and is therefore a relevant choice for an example of the creation of NLP models. Moreover, Portuguese is the native language of the authors, which is important to assess the quality of the NLP results, and Aveiro is the area of their affiliation.
- In Portugal, municipalities have responsibilities regarding land management and the granting of subsidies, and establish protocols with local organizations. It is important to have this kind of information readily available to the public to foster local government transparency (Rodrigues et al. 2010, 2013).
- Minutes of municipal meetings are usually made available in PDF format and often contain long and complex sentences. These characteristics make them ambitious targets to process and, in principle, a system able to process such data should also be able to handle data sources with shorter and less complex sentences.

5.2.3 Obtaining the Documents

Documents can be obtained by entering the uniform resource locator (URL) of the page in a web browser and saving it to a local file, or by using software such as cURL or Wget to download and save them to local files. cURL is standard in Linux and MacOS, and for Windows it can be obtained freely.6 However, cURL is designed to download a single document, not batches of documents as necessary in this application.

6 http://curl.haxx.se/download.html


Table 5.3 Wget command options explanation

wget -r --level 2
  Use recursion up to two hops from the starting point. The level can be increased, but that would cause the download of files less probably related with this page.
--accept pdf
  Download only files ending with pdf. More endings can be added by separating them with commas. For example, --accept pdf,docx would download files ending with pdf or with docx.
--limit-rate=20k -D cm-sjm.pt
  Limit the download rate to 20 kB per second and only download files from the web domain cm-sjm.pt. The speed limitation is used to avoid overloading the host server, and limiting the domain avoids downloading files from other websites.
http://www.cm-sjm.pt/34
  The starting point for the download.

The best option is to use Wget, which can be obtained freely from Linux repositories or from the project website.7 For Windows there is a port at SourceForge.8
To retrieve a batch of documents it is better to use Wget in recursive mode. In this mode, Wget retrieves a web page and follows its links up to a desired level of recursion. It is necessary to take special care to avoid downloading unrelated documents and to avoid overloading website hosts. To retrieve all PDF files from the page http://www.cm-sjm.pt/34, one possible command is (see explanation in Table 5.3):

wget -r --level 2 --accept pdf --limit-rate=20k -D cm-sjm.pt http://www.cm-sjm.pt/34

Extracting Document Content to a Plain Text File

After the download it is necessary to obtain the text of the documents. For that it is possible to use Apache PDFBox.9 If the files come from office suites, it is possible to obtain their content using Apache POI.10
PDFBox can be integrated in a Java application using its API, or it can be used from the command line by downloading pdfbox-app-X.X.X.jar, where X.X.X is the version number. In the latter case, the command for extracting the content from a file is:

java -jar pdfbox-app-1.8.7.jar ExtractText <name-of-pdf-file-to-read> <name-of-text-file-to-write>
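The same extraction can also be done from Java through the PDFBox API. The sketch below targets the 1.8 series used above (in the 2.x series PDFTextStripper moved to the org.apache.pdfbox.text package); the file names are placeholders:

import java.io.File;
import java.io.FileWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfToText {
    public static void main(String[] args) throws Exception {
        PDDocument document = PDDocument.load(new File("minutes.pdf"));
        try {
            // Extract the plain text of all pages.
            String text = new PDFTextStripper().getText(document);
            try (FileWriter out = new FileWriter("minutes.txt")) {
                out.write(text);
            }
        } finally {
            document.close();
        }
    }
}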

7 https://www.gnu.org/software/wget/
8 http://gnuwin32.sourceforge.net/packages/wget.htm
9 https://pdfbox.apache.org/
10 http://poi.apache.org/

5.2.4 Application Setup

As explained in Chap. 4, using the described pipeline requires preparing the application by defining an ontology that will be the semantic framework, and by annotating some examples of associations between the natural language data and the ontological concepts. The ontology creation is described first, followed by the annotation process.

Ontology Creation
Without pretending to discuss in detail the state of the art and the challenges of knowledge engineering, this application uses an ontology with a reduced level of expressiveness, which means that it uses a reduced set of ontological classes and properties. The reasons are twofold: (1) a more expressive ontology requires a bigger set of seed examples to cover all hypotheses; rules and relations not covered by the examples are as if they did not exist, and example annotation can become a burdensome task if too many examples are required. Also, increasing the detail of the ontology can lead to a performance decrease in the learning process, because the learning process bases its decisions on the NLP data, which is bound to the NLP pipeline, and thus its granularity does not increase by means of the ontology; (2) considering the current state of the art in ontological reasoning and data access, it is well known that performance in terms of speed decreases as the volume of data stored in a knowledge base conforming to an ontology increases. Moreover, that performance decrease is more pronounced when ontologies are more expressive (Möller et al. 2013; Peters et al. 2013). It is possible to mitigate this effect, but that will not be addressed here.
The information to be handled by the application implies that the ontology includes concepts about people, locations, and specific concepts relative to municipalities. It is good practice to use, as much as possible, standardized concepts, in the spirit of open data and application interoperability. Thus the application ontology is composed of three classes added specifically to handle the municipal concepts, plus four publicly available ontologies to handle information about people and places and to refer to the documents from which the information was extracted. These four ontologies are described in Table 5.4, and the three classes for municipal subjects, all defined as subclasses of the top-level class Thing, are:
- Build permit: represents construction contracts in execution, whether public or private.
- Protocol: represents protocols signed with local institutions such as schools or sports clubs.
- Subsidy: represents subsidies requested by any entity, whether granted or not.
Object type properties and data type properties were also defined. Object type properties are references to other objects and thus represent relations between ontology objects; data type properties hold literal values and thus work as values specific to a given object.
Table 5.4 Description of the four ontologies used in the composition of the application ontology

Dublin core
  Allows describing resources such as documents and video (Weibel et al. 1998). For this work the classes used were Title and Source. It can be obtained at http://dublincore.org/documents/dc-rdf/
Friend of a friend
  Defines terms such as people, groups, and documents (Brickley and Miller 2010). For this application the relevant classes are Person and Organization, and the relevant property is name. It can be obtained at http://www.foaf-project.org/
Geo-Net-PT
  A geographic ontology of Portugal which encodes the organization of spaces, for instance which streets belong to a neighborhood, which in turn belongs to a city, which belongs to a municipality, etc. (Lopez-Pellicer et al. 2009). For this application the relevant class is Municipality. It can be obtained at http://www.linguateca.pt/GeoNetPt/
WGS84
  The geodetic reference system used in the global positioning system (GPS). For this application the relevant class is SpatialThing and the relevant properties are lat, long, and alt, representing latitude, longitude, and altitude, respectively. It can be obtained at http://www.w3.org/2003/01/geo/


Fig. 5.5 A perspective of the ontology

These classes have two object type properties and four data type properties. The object type properties are: (1) place, referring to the address of the entity that signed the protocol or requested a subsidy; (2) requester, the reference to the entity or entities, excluding the municipality, involved in the request. The four data type properties are: (1) deliberation, representing the outcome of the request; (2) identifier, the unique identifier given by the municipal services; (3) money amount, the amount of money involved in the protocol or subsidy; (4) motivation, the motive of the protocol or for requesting the subsidy. A programmatic sketch of this setup is shown below.
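To make the setup concrete, the sketch below declares the municipal classes and properties programmatically with Apache Jena's ontology API, as an alternative to creating them in an editor; the namespace is hypothetical and only a subset of the properties is shown:

import org.apache.jena.ontology.*;
import org.apache.jena.rdf.model.ModelFactory;

public class MunicipalOntologySketch {
    public static void main(String[] args) {
        String ns = "http://example.org/municipal#"; // hypothetical namespace
        OntModel model = ModelFactory.createOntologyModel();

        // The three municipal classes described above.
        OntClass subsidy = model.createClass(ns + "Subsidy");
        OntClass protocol = model.createClass(ns + "Protocol");
        OntClass buildPermit = model.createClass(ns + "BuildPermit");

        // Object type property: relates a subsidy or protocol to its requester.
        ObjectProperty requester = model.createObjectProperty(ns + "requester");
        requester.addDomain(subsidy);
        requester.addDomain(protocol);

        // Data type properties: literal values attached to a subsidy.
        DatatypeProperty deliberation = model.createDatatypeProperty(ns + "deliberation");
        DatatypeProperty moneyAmount = model.createDatatypeProperty(ns + "moneyAmount");
        deliberation.addDomain(subsidy);
        moneyAmount.addDomain(subsidy);

        model.write(System.out, "RDF/XML"); // inspect the resulting ontology
    }
}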
Merging these four ontologies plus the three specific classes is straightforward when using an ontology editing tool such as Protégé.11 As Friend of a Friend defines its alignment with WGS84 and Dublin Core using the standard Simple Knowledge Organization System (SKOS), Protégé automatically aligns them correctly. As for Geo-Net-PT, since it does not share concepts with the other ontologies, it is placed in an independent subclass of Thing, and the same applies to the classes Build Permit, Protocol, and Subsidy. Figure 5.5 presents a perspective of the ontology.

11 http://protege.stanford.edu/


Annotation of Seed Examples

Seed examples provide correspondences between sentence fragments and ontological concepts. Using these correspondences, a machine learning algorithm generates semantic extraction models that associate the syntactic dependency graphs with classes and properties of the ontology (see Chap. 4 for details).
To have these correspondences it is required to annotate examples in representative documents. The documents to be annotated can, and should, be a subset of the target documents. The objective is to provide examples that are similar to the target information. As a rule of thumb, we annotate a random subset of 10 % of the document set. This percentage is reduced if the document set is large (>200 documents) and increased if the set is small (<50 documents). Also, if at the end of the annotation process it is observed that a particular relation was (almost) never found, the document set is searched for evidence of that relation and some more representative examples are included.
For instance, let us consider that the subset of documents to annotate includes the file obtained at the URL http://www.cm-sjm.pt/files/20/20336.pdf. The following example is an excerpt from page 46 of the document, with the original text presented first and an informal translation after it:

Na sequência da prorrogação de funcionamento do GIP – Gabinete de Inserção Profissional, (…), que tem tradução no protocolo de cooperação com a Associação de Jovens Ecos Urbanos, solicita-se a atribuição de subsídio no montante de 2,499 €, para assegurar o pagamento do técnico afeto a este projeto, nos termos do referido protocolo (…). A Câmara deliberou, por unanimidade, aprovar.

As a consequence of extending the GIP – Gabinete de Inserção Profissional, (…), which is supported by a cooperation protocol with the Associação de Jovens Ecos Urbanos, the allocation of a subsidy amounting to 2,499 € is requested, to ensure payment of the technician working on this project, as defined by the mentioned protocol (…). The Chamber decided, unanimously, to approve it.

This excerpt mentions a subsidy and a protocol. The subsidy is for an institution named GIP, is for paying a technician, amounts to 2,499 €, and was approved. The subsidy is justified by a protocol established with an association named Associação de Jovens Ecos Urbanos. The annotations corresponding to this seed example are presented and explained in Table 5.5.
It is possible to use a graphical user interface to assist the annotation process. When all annotations are done, an automated process finds the syntactic graphs corresponding to the relevant sentences and selects the parts of each graph that link the object to the subject of the relation. This ends the application setup, and it is now possible to generate the semantic extraction models based on the examples. Afterwards those models are used on arbitrary documents to extract information of the same kind as the annotated examples.


Table 5.5 Annotations relative to the example excerpt of a document and respective explanation

Annotation: Subsidy:subsídio requester Organization:GIP
Explanation: The word "subsídio" is an instance of the ontology class Subsidy. This subsidy has an object type property requester with GIP, which is an instance of the ontology class Organization.

Annotation: Subsidy:subsídio motivation motivation:pagamento do técnico
Explanation: This subsidy has a data type property motivation with value "pagamento do técnico" (technician's payment).

Annotation: Subsidy:subsídio moneyAmount moneyAmount:2,499 €
Explanation: This subsidy has a data type property moneyAmount with value 2,499 €.

Annotation: Subsidy:subsídio deliberation deliberation:aprovar
Explanation: This subsidy has a data type property deliberation with value "aprovar" (approve).

Annotation: Protocol:protocolo requester Organization:Associação de Jovens Ecos Urbanos
Explanation: The word "protocolo" is an instance of the ontology class Protocol. This protocol has an object type property requester with Associação de Jovens Ecos Urbanos, which is an instance of the ontology class Organization.
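To give an idea of what the knowledge base gains from these annotations, the sketch below asserts the triples implied by Table 5.5. It assumes Apache Jena; the namespace (the same placeholder URI used in Fig. 5.10), the instance identifier, and the method name are ours.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

// Sketch: build the knowledge-base triples corresponding to Table 5.5.
public static Model tableToTriples() {
    Model m = ModelFactory.createDefaultModel();
    String ns = "http://.../municipality.owl#"; // placeholder namespace
    Resource gip = m.createResource(ns + "GIP")
            .addProperty(RDF.type, m.createResource(ns + "Organization"));
    m.createResource(ns + "subsidio_1") // hypothetical instance identifier
            .addProperty(RDF.type, m.createResource(ns + "Subsidy"))
            .addProperty(m.createProperty(ns, "requester"), gip)
            .addProperty(m.createProperty(ns, "motivation"), "pagamento do técnico")
            .addProperty(m.createProperty(ns, "moneyAmount"), "2,499 €")
            .addProperty(m.createProperty(ns, "deliberation"), "aprovar");
    return m;
}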

// Requires: java.io.BufferedReader, java.io.IOException, java.io.InputStreamReader,
// java.net.MalformedURLException, java.net.URL
public GoogMapsWebServResponse(String entity) throws MalformedURLException {
    // Build the geocoding request URL for the entity (restricted to Portugal);
    // the API key is truncated here.
    String address = "https://maps.googleapis.com/maps/api/geocode/json?address="
            + entity + ",PT&key=A..";
    URL u = new URL(address);
    StringBuilder sb = new StringBuilder();
    try {
        // Read the JSON response line by line into the string builder.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(u.openStream(), "UTF8"));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            sb.append(inputLine);
            sb.append("\n");
        }
        in.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println(sb.toString());
}

Fig. 5.6 Example of Java code to query Google Maps about the address of an entity
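As a usage illustration, the constructor of Fig. 5.6 could be exercised as sketched below; this is a minimal driver of ours, assuming the constructor belongs to a class with the same name.

// Hypothetical driver: geocode the entity "GrETUA" and print the raw JSON
// response produced by the constructor of Fig. 5.6.
// Requires: java.net.MalformedURLException
public static void main(String[] args) throws MalformedURLException {
    new GoogMapsWebServResponse("GrETUA");
}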

5.2.5 Making Available Extracted Information Using a Map

After having the information organized using a semantic framework, it is possible to discriminate which information is related to a location, e.g. organization headquarters and municipalities. One of the advantages is that, for such cases, the application can try to obtain the respective global positioning system (GPS) coordinates in order to render those points of interest on a map.
However, as the GPS information is not obtained via the semantic extraction models, it is necessary to write specific code to acquire it. Figure 5.6 presents a proposal for such code.
Testing the code to obtain information about, for instance, an experimental theatre group named GrETUA retrieves a result that includes the postal address and also the GPS coordinates of where the group is based. Figure 5.7 presents part of the result.
The application can then render a map using the GPS coordinates and, when a given point on the map is selected, show all information related to that point. Figures 5.8 and 5.9 are screenshots of application examples using a map.
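The response itself still has to be parsed. The following is a minimal sketch, assuming the org.json library and that the first geocoding result is the relevant one, of how the coordinates could be pulled out of a response such as the one shown in Fig. 5.7; the method name is ours.

import org.json.JSONObject;

// Sketch: navigate results[0].geometry.location of the geocoding response
// and return the latitude/longitude pair.
public static double[] extractCoordinates(String responseJson) {
    JSONObject location = new JSONObject(responseJson)
            .getJSONArray("results").getJSONObject(0)
            .getJSONObject("geometry").getJSONObject("location");
    return new double[] { location.getDouble("lat"), location.getDouble("lng") };
}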


],
"formatted_address" : "GRETUA - Grupo Experimental de Teatro da UA, Universidade de Aveiro, 3810-193 Aveiro, Portugal",
"geometry" : {
    "bounds" : {
        "northeast" : {
            "lat" : 40.6370947,
            "lng" : -8.6577734
        },
        "southwest" : {
            "lat" : 40.6366973,
            "lng" : -8.658345599999999
        }
    },
    "location" : {
        "lat" : 40.6369157,
        "lng" : -8.658038899999999
    },
    "location_type" : "ROOFTOP",
    "viewport" : {

Fig. 5.7 Google Maps API response when querying for GrETUA. The response is in JSON format and the relevant data for this example is encircled

Fig. 5.8 The top pane shows a map with several marks for which information exists. The bottom
pane shows the document context relative to the point indicated by the arrow


Fig. 5.9 The top pane shows a map with marks for which information exists and the bottom pane
shows the information extracted relative to the point indicated by the arrow

The first example shows, below the map, the text snippet where the location was found, while the second example shows the information extracted for that location. In both figures, the arrow indicates which mark was pressed.

5.2.6 Conducting Semantic Information Queries

Let us consider a person named Maria who wants to check whether there is information about the build permits she requested. Maria is a frequent name in Portugal: a keyword-based query returned 165 results for "Maria" using the same document set from which information was extracted. Having the information extracted to a knowledge base conforming to an ontology makes possible queries that take advantage of that semantic framework, just as queries to relational databases take advantage of their structure. Using SPARQL, it is possible to create a semantic query about build permits requested by someone named Maria. Figure 5.10 presents such a query.


Fig. 5.10 SPARQL query for retrieving information about the outcome of a build permit requested by a person named Maria


PREFIX municip: <http://.../municipality.owl#>

SELECT ?person ?document ?id ?outcome
WHERE {
    ?proc rdf:type municip:BuildPermit; municip:requester ?pret.
    ?page foaf:topic ?proc; terms:title ?document.
    ?pret foaf:name ?person.
    OPTIONAL {?proc terms:identifier ?id}.
    OPTIONAL {?proc municip:deliberation ?outcome}.
    FILTER(REGEX(?person, "maria")).
}

Table 5.6 Result set relative to the SPARQL query

Person       Document                     Id        Outcome
maria_adel   cm-arouca.pt_ACTA_12_2009    12/09     Solicitar (Request)
maria_hele   cm-arouca.pt_ACTA_22_2009    153/2008  Solicitar (Request)

The query produces just two results, thus restricting the result set, and indicates where the information was found, plus the identifier assigned by the municipality and the outcome so far (Table 5.6): "solicitar" (request) means that the municipality is requesting Maria to present more documentation.
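As an illustration of how such a query could be executed programmatically, the sketch below uses Apache Jena (an assumption on our side; any SPARQL engine over the knowledge base would do). The model and the query text of Fig. 5.10, including all needed PREFIX declarations, are assumed to be already available.

import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;

// Sketch: run the SPARQL query of Fig. 5.10 over the knowledge base and
// print one line per build permit found.
public static void listBuildPermits(Model model, String sparql) {
    Query query = QueryFactory.create(sparql);
    try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
        ResultSet results = qe.execSelect();
        while (results.hasNext()) {
            QuerySolution row = results.next();
            // OPTIONAL variables (?id, ?outcome) may be unbound, i.e. null.
            System.out.println(row.get("person") + " | " + row.get("document")
                    + " | " + row.get("id") + " | " + row.get("outcome"));
        }
    }
}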
The ability of this approach to perform semantic queries was also tested in the medical domain. Teixeira et al. (2014) describe two information extraction systems; the one named SPHInX has the same architecture as discussed here.

References
Brickley D, Miller L (2010) FOAF vocabulary specification 0.98. Namespace document 9
Lopez-Pellicer FJ, Chaves M, Rodrigues C, Silva MJ (2009) Geographic ontologies production in Grease-II
Möller R, Neuenstadt C, Özçep ÖL, Wandelt S (2013) Advances in accessing big data with expressive ontologies. In: Timm I, Thimm M (eds) KI 2013: advances in artificial intelligence. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 118–129. doi:10.1007/978-3-642-40942-4_11
Peters M, Brink C, Sachweh S, Zündorf A (2013) Performance considerations in ontology based ambient intelligence architectures. In: van Berlo A, Hallenborg K, Rodríguez JMC, Tapia DI, Novais P (eds) Ambient intelligence – software and applications. Advances in intelligent systems and computing. Springer, Berlin, pp 121–128. doi:10.1007/978-3-319-00566-9_16
Rodrigues M, Dias GP, Teixeira A (2010) Knowledge extraction from minutes of Portuguese municipalities meetings. In: FALA
Rodrigues M, Dias GP, Teixeira A (2013) Towards e-government information platforms for enterprise 2.0. In: Handbook of research on enterprise 2.0: technological, social, and organizational dimension. IGI, Hershey, pp 676–696


Teixeira A, Ferreira L, Rodrigues M (2014) Online health information semantic search and
exploration: reporting on two prototypes for performing extraction on both a hospital intranet
and the world wide web. In: Neustein A (ed) Text mining of web-based medical content.
De Gruyter, Berlin
Weibel S, Kunze J, Lagoze C, Wolf M (1998) Dublin core metadata for resource discovery. Internet
Engineering Task Force RFC 2413

Chapter 6

Conclusion

This book discussed the need to provide formal structures to contents originally created in unstructured formats using natural language. The volume of relevant information in such formats increases every day as people use the Internet to communicate and as organizations create and publish documentation. The contents often lack formalized markup because marking contents manually is a time consuming and error prone task that requires specialized knowledge. The objective of information extraction is to analyze these contents and produce fixed-format, unambiguous and formal representations of them, including the identification of the entities involved and the relations established among them.
The key concepts of information extraction were explained and illustrated using representative examples with distinct degrees of complexity. Alongside the examples, several high performing and readily available state-of-the-art tools were presented, both for implementing information extraction systems and for the natural language processing tasks relevant to information extraction. The tool selection took into consideration performance and the ability to provide good results for a wide range of natural languages.
After presenting the generic pipeline for information extraction and the available tools, an information extraction architecture was described for developing systems able to detect and organize relevant information according to an arbitrary ontology. The architecture was instantiated for the Portuguese language. The software tools used and their respective setup were described, and it was explained how the tools work together to form a complete and coherent system. The implemented system was used for extracting information from local government documents and from documents relative to healthcare. Key features of the developed system are the ability to process natural language texts, to accept a knowledge domain defined by an ontology, and to learn from examples how to extract information. The system supports extension to other natural languages by changing the NLP component, without changing the other modules.


Two different examples of applications were discussed: a tutorial example and an application using the architecture presented before. The tutorial example serves as an introduction to the subject and also shows how information extraction features can be enabled rapidly in any application. The second example intends to illustrate how the architecture can be deployed in concrete applications, and the advantages of assigning a structure to the information: it becomes possible to display the information spatially, or to make semantic queries that return a reduced set of useful results. These capabilities are possible when dealing with structured data, hence the importance of using this type of system to assign structure to existing unstructured textual information.
Information extraction and natural language processing technologies are becoming more widespread as the technology matures and people start understanding their potential. The first systems were almost exclusively developed for English, whereas nowadays state-of-the-art proposals are tested with several distinct natural languages. Also, search engines that a few years ago exclusively performed keyword-based searches (Google, Bing, etc.) are using more and more semantic concepts that are automatically extracted from web pages. For example, when searching for "pepper", these search engines show, alongside the traditional results, a box containing the usage of the word "pepper" in different contexts, that is, with distinct semantic meanings. This does not happen for all searches, but we foresee that the future will bring much more evidence of information extraction technologies in use.

Index

A
AKTiveMedia, 45
Application programming interface (API), 16

B
Bosque v7.3 corpus, 41–42
Boundary detection. See Sentence boundary detection

C
CoreNLP, 53
cURL/Wget, 51

D
Data generation, 13–15
Dependencies, 31–32
Document object model (DOM), 55
Document society, 1–3
Domain representation, 44–46

E
e-Government. See Information extraction (IE), electronic government
Epic parser, 21–22

G
GATE, 24
Generic relation extraction, 31–32
Global positioning system (GPS), 65–67
Google Maps API, 65–66
Graphical user interface (GUI), 16
GrETUA, 65–66

I
Identifying entities. See also Named entity recognition (NER)
  generic named entities, 30
  goals, 27
  named entity recognition, 28
  relation, 30–32
  website pages, 28
  wikipedia categories, 29
Information extraction (IE)
  approaches, 6–7
  architecture, 8
  challenges, 5–6
  documents and information retrieval, 5
  electronic government
    documents, 59–60
    goals, 58–59
    maps, 65–67
    natural language documents, 59
    ontology creation, 61–65
    semantic information queries, 67–68
  extraction tasks, 5
  generic pipeline, 71
  identifying entities, 27
  information extraction systems, 8–9
  key concepts, 71
  natural language texts, 5

  NLP, 72
  OBIE, 7, 39
  performance measures, 7–8
  process overview, 13–15
iSentenizer, 16–17

K
Knowledge representation (KR), 3–4

L
Lemma/lemmatization, 17–18

M
MaltParser, 23–24, 44
Markov models. See TreeTagger algorithm
Millikan, Robert Andrews, 51–52
Morphological analysis
  lemma/lemmatization, 17–18
  POS tagging, 17–18
  Stanford POS Tagger, 19
  SVMTool, 19–20
  tools, 18–19
  TreeTagger algorithm, 19–20
  word stemming, 17–18

N
Named entity recognition (NER)
  identifying entities, 27–28, 30
  natural language processing, 4, 43–44
Natural language processing (NLP)
  architecture, 39–40
  Bosque sentences, 41–42
  coherent data representation, 2–3
  definition, 4
  documents, 59
  GATE, 24
  HAREM categories, 43–44
  named entity recognition, 43–44
  NLP steps, 13–14
  NLTK interfaces, 24
  POS tags, 43
  sentence splitting, 43
  Stanford NLP, 23–24
  syntactic parsing, 44
  tasks in, 4–5

Natural Language Toolkit (NLTK)
  Punkt, 16
  text processing libraries, 24
NER. See Named entity recognition

O
OBIE. See Ontology-Based Information Extraction (OBIE)
Ontologies
  annotation, 64–65
  classes, 62
  composition, 61–62
  definition, 4
  perspective, 61, 63
  reasons, 61
  relational database, 32–33
  relevant information, 38–39
Ontology-Based Information Extraction (OBIE), 7
  approaches, 33–34
  arbitrary ontology, 39
  architecture, 39–40
  identifying entities, 27
  IE process, 33
OpenNLP, 53
OWL. See Web Ontology Language (OWL)

P
Parse tree, 31
Part of speech (POS)
  natural language processing, 43
  relation extraction, 31
  Stanford POS Tagger, 19
  tags, 17–18, 43
Portuguese, 40–41
Probabilistic context-free grammar (PCFG) parser. See Epic parser
Prototype implementation. See State-of-the-art tools
Punkt approach, 15–16

R
r-millikan.txt.xml, 54–55
Relation extraction
  approaches, 30–31
  dependencies, 31–32
  entities, 30
  methods, 31

  named entity recognition, 30
  patterns, 31
  POS tags, 31
  types, 30
Resource Description Framework Schema (RDFS), 4, 33

S
SAXON, 56–58
Semantic extraction and integration
  creation, 46–48
  ontology relation, 45–46
  partial syntactic structures, 47
  use of, 48
Semantics
  advantages of semantic search, 3
  architecture, 39
  knowledge representation, 3–4
  linguistic expressions, 3–4
  queries, 68
Sentence boundary detection. See Tokenization
Sentence splitting, 43
Simple Knowledge Organization System (SKOS), 63
Software architecture, OBIE
  domain representation, 39–40
  NLP component, 39
  semantic extraction and integration, 39–40
SPARQL query, 68
Stanford CoreNLP, 53–54
Stanford NLP, 23–24
Stanford POS Tagger, 19
StanfordParser, 22–23
State-of-the-art tools
  architecture instantiation, 40–41
  domain representation, 44–46
  natural language processing, 41–44
  semantic extraction and integration, 45–48
Support vector machine (SVM) tool, 19–20
Syntactic parsing
  computational intensive, 20
  Epic parser, 21–22
  goals, 20
  MaltParser, 23–24
  natural language processing, 44
  phrase structure grammar, 20–21
  StanfordParser, 22–23
  syntax theories and techniques, 21
  TurboParser, 23–24

T
Tokenization
  boundary detection, 15
  definition, 15
  representative tools, 16–17
  tools, 15–16
TreeTagger algorithm, 19–20
TurboParser, 23–24
Tutorials
  applications, 72
  scenarios, 51
  software tools, 53
  syntactic parsing, 58
  target document, 54–58
  tools setup, 53–54
  wikipedia page, 51–52

U
Uniform resource locator (URL), 51–52
Unstructured documents, 1–2

W
Web Ontology Language (OWL), 4, 33
Wget command, 59–60
Wikipedia page, 51–52
Word stemming, 17–18

X
XML stylesheet language for transformations (XSLT), 55–57
