1(18)

GATE: A Unicode-based
Infrastructure Supporting
Multilingual Information Extraction

Kalina Bontcheva, Diana Maynard,
Valentin Tablan, Hamish Cunningham

Department of Computer Science, University of Sheffield

http://gate.ac.uk/

Structure of the talk:
A brief introduction to GATE
Multilingual infrastructure in GATE
Simple multilingual IE components

2(18)
GATE is...
An architecture: a macro-level organisational picture for LE
software systems.
A framework: for programmers, GATE is an object-oriented
class library that implements the architecture.
A development environment: for language engineers,
computational linguists et al., a graphical development
environment.

GATE comes with...
Some free components, and wrappers for other people's
components
Tools for: evaluation; visualisation/editing; persistence; IR; IE;
dialogue; ontologies; etc.
Free software (LGPL). Download at
http://gate.ac.uk/download/
3(18)

Architectural principles

Non-prescriptive, theory neutral (strength and weakness)
Re-use, interoperation, not reimplementation (e.g. diverse
XML support, integration of Protégé, Jena, Weka...)
(Almost) everything is a component, and component sets
are user-extendable
(Almost) all operations are available both from API and GUI


4(18)
Component-based development

CREOLE
Collection of REusable Objects for Language Engineering:
Java Beans: an OO way of chunking software
GATE components: modified Java Beans with XML
configuration
The minimal component = 10 lines of Java, 10 lines of
XML, 1 URL
Three types: Language Resources, Processing
Resources, Visual Resources

Why bother?
Allows the system to load arbitrary language processing
components
5(18)
Language Resources (LRs)
LRs are documents, ontologies, corpora, lexicons, etc.
LRs can be associated with DataStores (Oracle,
PostgreSQL, XML, Java Serialisation)
Documents / corpora:
Diverse document formats: text, html, XML, email,
RTF, SGML
Optional analysis / saving of format-preserving markup
Standoff annotation model (start, end, type, features),
derivative of TIPSTER, compatible with ATLAS and
XCES
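
As a rough illustration of the standoff model, the sketch below leaves the document text untouched and stores each annotation separately as (start, end, type, features). It is written in Python for brevity; the `Annotation` class and the `covered_text` helper are hypothetical names for this example and are not GATE's API.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One standoff annotation: character offsets, a type, and a feature map."""
    start: int
    end: int
    type: str
    features: dict = field(default_factory=dict)


def covered_text(text, ann):
    """Return the span of the unmodified text that the annotation points at."""
    return text[ann.start:ann.end]


text = "Kalina works in Sheffield."

# The annotations live outside the text, so the text itself never changes.
annotations = [
    Annotation(0, 6, "Person", {"kind": "firstName"}),
    Annotation(16, 25, "Location", {"locType": "city"}),
]

for ann in annotations:
    print(ann.type, repr(covered_text(text, ann)), ann.features)
```

Because the offsets refer to the original text, several overlapping annotation layers can coexist without any inline markup.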

Coping with diverse character encodings:
New internationalised versions of JVM support >100
different encodings.
Other encodings: developing system for user-entry of
mapping tables (remove programming from the process)
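
The idea of decoding every input into Unicode, with a user-entered mapping table as a fallback for unsupported encodings, can be sketched as follows. This is an analogy in Python rather than the JVM mechanism the slide refers to; `decode_with_table` and `custom_table` are hypothetical names, and the table format is an assumption.

```python
def decode_with_table(data, table):
    """Decode a single-byte encoding using a user-supplied byte -> character table.

    Bytes missing from the table are treated as ASCII when possible,
    otherwise replaced by U+FFFD (the Unicode replacement character).
    """
    out = []
    for b in data:
        if b in table:
            out.append(table[b])
        elif b < 0x80:
            out.append(chr(b))
        else:
            out.append("\ufffd")
    return "".join(out)


# Built-in codecs: the same Cyrillic word stored in two legacy encodings
# has different bytes but decodes to the same Unicode string.
word = "мир"
koi8_bytes = word.encode("koi8_r")
cp1251_bytes = word.encode("cp1251")
same_after_decoding = koi8_bytes.decode("koi8_r") == cp1251_bytes.decode("cp1251")

# Hypothetical mapping table for an encoding the runtime does not know.
custom_table = {0xE0: "à", 0xE9: "é"}
decoded = decode_with_table(b"caf\xe9", custom_table)
```

Keeping the table as plain data means a new encoding can be supported by editing a file rather than writing code.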
6(18)
Processing Resources (PRs)
Algorithmic components known as PRs: beans
with execute methods.
All PRs can handle Unicode data by default.
Clear distinction between code and data (simple
repurposing).
20-30 freebies with GATE
Controllers: execute a set of PRs
SerialController: sequential run of arbitrary PR set
SerialAnalyserController: analyser PRs over corpus
Conditional controllers: execution depends on features
Parallel controller?
PRs + Controller = Applications
Application parameterisation state can be saved
and restored, and used for embedding / batching
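
The relationship "PRs + Controller = Application" can be sketched as below. This is a minimal Python illustration, not GATE's Java API: `Document`, `LengthRecorder`, `UppercaseCounter`, and both controller classes are hypothetical stand-ins, and the condition mechanism (a predicate over document features) is an assumption.

```python
class Document:
    """Toy document: text plus a feature map."""
    def __init__(self, text, **features):
        self.text = text
        self.features = dict(features)


class LengthRecorder:
    """Toy PR: stores the text length as a document feature."""
    def execute(self, doc):
        doc.features["length"] = len(doc.text)


class UppercaseCounter:
    """Toy PR: stores the number of uppercase characters."""
    def execute(self, doc):
        doc.features["upper"] = sum(1 for ch in doc.text if ch.isupper())


class SerialController:
    """Runs every PR in order on each document of a corpus."""
    def __init__(self, prs):
        self.prs = list(prs)

    def execute(self, corpus):
        for doc in corpus:
            for pr in self.prs:
                pr.execute(doc)


class ConditionalController:
    """Runs each PR only when its predicate on the document features holds."""
    def __init__(self, steps):
        self.steps = list(steps)  # list of (pr, predicate) pairs

    def execute(self, corpus):
        for doc in corpus:
            for pr, predicate in self.steps:
                if predicate(doc.features):
                    pr.execute(doc)


corpus = [Document("GATE at Sheffield", lang="en"), Document("नमस्ते", lang="hi")]
app = ConditionalController([
    (LengthRecorder(), lambda f: True),
    (UppercaseCounter(), lambda f: f.get("lang") == "en"),
])
app.execute(corpus)
```

Here the case-counting PR is skipped for the Hindi document because its `lang` feature does not satisfy the condition.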

7(18)
Visual Resources (VRs)

8(18)
VRs (2): Coreference
9(18)
VRs (3): Syntax
10(18)

Displaying Multilingual Data

GATE uses the standard (and imperfect) Java rendering engine for displaying text.



11(18)


Editing Multilingual Data

GATE Unicode Kit (GUK)
Complements Java's facilities
Support for defining Input Methods (IMs)
Currently 30 IMs for 17 languages
Pluggable in other applications (e.g. JEdit, EUDICO)
Can use virtual keyboard or standard layouts over QWERTY
IMs defined in plain text files
GUK comes with a standalone Unicode editor
12(18)
Processing Multilingual Data
All processing, visualisation and editing tools use GUK
13(18)
Multilingual IE Components
The ANNIE system: a reusable and easily extendable set of
components
14(18)
The Unicode Tokeniser
A very portable component for multiple languages:
splits text into typed tokens using an FSM
dynamically constructed from rules based on
character categories defined by the Unicode standard, e.g.:
UPPERCASE_LETTER
(LOWERCASE_LETTER|DASH_PUNCTUATION)*
> Token;orth=upperInitial;kind=word;
output generally localised by a later module (e.g.
splitting "don't" into "do" and "n't")
23 rules seem able to handle Indo-European languages
without changes.
the English tokeniser: Unicode tokeniser + pattern
grammar FST
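
A tokeniser driven purely by Unicode character categories can be sketched as follows. This is a simplified Python illustration, not GATE's implementation: it uses regular expressions over a string of category codes instead of a dynamically built FSM, and the six rules in `RULES` are hypothetical, not the 23 rules mentioned above.

```python
import re
import unicodedata


def char_class(ch):
    """Map a character to a one-letter code for its Unicode general category."""
    cat = unicodedata.category(ch)
    if cat == "Lu":
        return "U"   # UPPERCASE_LETTER
    if cat == "Ll":
        return "L"   # LOWERCASE_LETTER
    if cat == "Pd":
        return "D"   # DASH_PUNCTUATION
    if cat == "Nd":
        return "N"   # DECIMAL_DIGIT_NUMBER
    if cat.startswith("L"):
        return "O"   # other letters, e.g. caseless scripts
    if cat.startswith("M"):
        return "M"   # combining marks
    if cat.startswith("Z") or ch.isspace():
        return "S"   # separators and whitespace
    return "P"       # everything else is treated as punctuation/symbol


# Rules are tried in order; each pattern runs over the category codes, not the text.
RULES = [
    (re.compile(r"UU+"), {"kind": "word", "orth": "allCaps"}),
    (re.compile(r"U[LD]*"), {"kind": "word", "orth": "upperInitial"}),
    (re.compile(r"L[LD]*"), {"kind": "word", "orth": "lowercase"}),
    (re.compile(r"[OM]+"), {"kind": "word"}),
    (re.compile(r"N+"), {"kind": "number"}),
    (re.compile(r"S+"), {"kind": "space"}),
    (re.compile(r"."), {"kind": "punctuation"}),  # fallback, always matches
]


def tokenise(text):
    """Split text into (string, features) tokens using the category rules."""
    classes = "".join(char_class(ch) for ch in text)
    tokens, pos = [], 0
    while pos < len(text):
        for pattern, features in RULES:
            match = pattern.match(classes, pos)
            if match:
                tokens.append((text[pos:match.end()], features))
                pos = match.end()
                break
    return tokens


for token, features in tokenise("Москва is world-wide known"):
    if features["kind"] != "space":
        print(repr(token), features)
```

Since the rules never mention concrete characters, the same rule set applies unchanged to Latin, Cyrillic, or Devanagari text; language-specific behaviour is left to a later module.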
15(18)
POS tagging in new languages
TIDES Surprise Language: Hepple tagger but
substituted Cebuano/Hindi lexicon for English
Used empty ruleset since no training data
available
Used default heuristics (e.g. return NNP for
capitalised words)
Very experimental, but reasonable results
67% correctness for Hindi and 75% for
Cebuano
Adaptation time per language - 2 days
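
The setup described above (substituted lexicon, empty ruleset, default heuristics) can be sketched like this. It is a simplified Python illustration and not the Hepple tagger itself: the function name `tag`, the rule interface, the toy English lexicon, and the `NN` default for unknown lowercase words are all assumptions; only the `NNP` heuristic for capitalised words comes from the slide.

```python
def tag(tokens, lexicon, rules=()):
    """Assign an initial tag to each token, then apply any correction rules.

    lexicon maps a word to its candidate tags, most likely first.
    rules is a sequence of functions (tokens, tags) -> tags; it is empty
    when no training data is available to learn rules from.
    """
    tags = []
    for token in tokens:
        if token in lexicon:
            tags.append(lexicon[token][0])   # most likely tag from the lexicon
        elif token[:1].isupper():
            tags.append("NNP")               # default heuristic: capitalised word
        else:
            tags.append("NN")                # assumed default for other unknown words
    for rule in rules:
        tags = rule(tokens, tags)
    return tags


# Toy placeholder lexicon for illustration; a real run would plug in the
# lexicon of the target language here.
toy_lexicon = {"the": ["DT"], "runs": ["VBZ", "NNS"]}

print(tag(["the", "Manila", "runs", "xyz"], toy_lexicon))
```

Swapping languages then amounts to passing a different lexicon, which is consistent with the short adaptation time reported above.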
16(18)
Porting NE grammars
Most English JAPE rules based on POS tags
and gazetteer lookup
Grammars can be reused for languages with
similar word order, orthography etc.
No time to make detailed study of Cebuano,
but very similar in structure to English
Most of the rules left as for English, with some
adjustments, especially to handle dates
Used both English and Cebuano grammars
and gazetteers, because NEs appear in both
languages
17(18)
TIDES Evaluation Results
            Cebuano               English Baseline
Entity      P     R     F         P     R     F
Person      71    65    68        36    36    36
Org         75    71    73        31    47    38
Location    73    78    76        65    7     12
Date        83    100   92        42    58    49
Total       76    79    77.5      45    41.7  43
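
For readers unfamiliar with the P, R and F columns, the sketch below computes an F-measure from precision and recall. It assumes the balanced F-measure (harmonic mean, beta = 1); the slides do not state which formula was used, and values recomputed from the rounded P and R above do not always match the reported F exactly, so this is an illustration rather than a reconstruction of the evaluation. The function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (same scale as the inputs)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Person row, Cebuano column: P = 71, R = 65
person_f = f_measure(71, 65)
print(round(person_f, 1))
```

The harmonic mean penalises an imbalance between precision and recall, which is why a row with high precision but very low recall ends up with a low F.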
18(18)
Conclusion
GATE: a Unicode-based NLP infrastructure,
particularly suitable for multilingual adaptation of
IE systems
Requires little involvement of native speakers
and very little annotated data for a basic job
Future work
Improving multilingual support, e.g.,
morphology support, automatic language and
encoding identification
Learning gazetteer lists from annotated
corpora