2 KNOWLEDGE ACQUISITION

We have used two schemes to extract knowledge from corpora. Both produce readable patterns that can be verified by a linguist. In the first scheme, sentences are handled as units and information about the structure of the sentence is extracted. Only the main constituents (like subject, objects) of the sentence are treated at this stage. The second scheme works with local context and looks only a few words to the right and to the left. It is used to resolve the modifier-head dependencies in the phrases.

First, we form an axis of the sentence using some given set of syntactic tags. We collect several layers of patterns that may be partly redundant with each other. For instance, simplifying a little, we can say that a sentence can be of the form subject -- main verb and there may be other words before and after the subject and main verb. We may also say that a sentence can be of the form subject -- main verb -- object. The latter is totally covered by the former because the former statement does not prohibit the appearance of an object but does not require it either.

The redundant patterns are collected on purpose. During parsing we try to find the strictest frame for the sentence. If we cannot apply some pattern because it conflicts with the sentence, we may use another, possibly more general, pattern. For instance, an axis that describes all accepted combinations of subjects, objects and main verbs in the sentence is stricter than an axis that describes all accepted combinations of subjects and main verbs.

After applying the axes, the parser's output is usually still ambiguous because not all syntactic tags are taken into account yet (we do not handle, for instance, determiners and adjective premodifiers here). The remaining ambiguity is resolved using local information derived from a corpus. The second phase has a more probabilistic flavour, although no actual probabilities are computed. We represent information in a readable form, where all possible contexts that are common enough are listed for each syntactic tag. The length of the contexts may vary. The common contexts are longer than the rare ones. In parsing, we try to find a match for each word in a maximally long context.

Briefly, the relation between the axes and the joints is the following. The axes force sentences to comply with the established frames. If more than one possibility is found, the joints are used to rank them.

2.1 The sentence axis

In this section we present a new method to collect information from a tagged corpus. We define a new concept, a sentence axis. The sentence axis is a pattern that describes the sentence structure at an appropriate level. We use it to select a group of possible analyses for the sentence. In our implementation, we form a group of sentence axes and the parser selects, using the axes, those analyses of the sentence that match all or as many as possible of the sentence axes.

We define the sentence axis in the following way. Let S be a set of sentences and T a set of syntactic tags. The sentence axis of S according to tags T shows the order of appearance of any tag in T for every sentence in S.

Here, we will demonstrate the usage of a sentence axis with one sentence. In our real application we, of course, use more text to build up a database of sentence axes. Consider the following sentence:³

    I_SUBJ would_+FAUXV also_ADVL
    increase_-FMAINV child_NN> benefit_OBJ ,
    give_-FMAINV some_QN> help_OBJ
    to_ADVL the_DN> car_NN> industry_<P
    and_CC relax_-FMAINV rules_OBJ
    governing_<NOM-FMAINV local_AN>
    authority_NN> capital_AN> receipts_OBJ ,
    allowing_-FMAINV councils_SUBJ
    to_INFMARK> spend_-FMAINV more_ADVL .

³The tag set is adapted from the Constraint Grammar of English as it is. It is more extensive than commonly used in tagged corpora projects (see Appendix A).

The axis according to the manually defined set T = { SUBJ +FAUXV +FMAINV } is

    ... SUBJ +FAUXV ... SUBJ ...

which shows in what order the elements of set T appear in the sentence above, and where three dots mean that there may be something between the words, e.g. +FAUXV is not followed (in this case) immediately by SUBJ. When we have more than one sentence, the axis contains more than one possible order for the elements of set T.

The axis we have extracted is quite general. It defines the order in which the finite verbs and subjects in the sentence may occur, but it does not say anything about nonfinite verbs in the sentence. Notice that the second subject is not actually the subject of the finite clause, but the subject of the nonfinite construction councils to spend more. This is inconvenient, and a question arises whether there should be a specific tag to mark subjects of the nonfinite clauses. Voutilainen and Tapanainen [1993] argued that a richer set of tags could make parsing more accurate in a rule-based system. It may be true here as well.

We can also specify an axis for the verbs of the sentence. Thus the axis according to the set

    { +FAUXV +FMAINV -FMAINV INFMARK> }

is

    ... +FAUXV ... -FMAINV ... -FMAINV
    ... -FMAINV ... -FMAINV ... INFMARK>
    -FMAINV ...

The nonfinite verbs occur in this axis four times one after another. We do not want just to list how many times a nonfinite verb may occur (or occurs in a corpus) in this kind of position, so we clearly need some generalisations.

The fundamental rule of generalisation that we used is the following: Anything that is repeated may be repeated any number of times. We mark this using brackets and a plus sign. The generalised axis for the above axis is

    ... +FAUXV [ ... -FMAINV ]+
    ... INFMARK> -FMAINV ...
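The axis definition and the repetition rule above can be operationalised directly. The following Python sketch is our illustration of the idea, not the authors' implementation; the function names and the list representation of a tagged sentence are assumptions:

```python
# Illustrative sketch of sentence-axis extraction and generalisation
# (a reconstruction of the idea, not the authors' actual code).

def axis(tags, T):
    """Axis of one tagged sentence according to tag set T: the tags of T
    in their order of appearance, with '...' marking possible gaps."""
    out = ["..."]
    gap = False
    for t in tags:
        if t in T:
            if gap and len(out) > 1:
                out.append("...")   # material intervened since the last T-tag
            out.append(t)
            gap = False
        else:
            gap = True              # a word outside T was seen
    out.append("...")
    return out

def generalise(tokens):
    """Anything that is repeated may be repeated any number of times:
    collapse consecutive repeats of the shortest repeating unit into [ ... ]+."""
    out, i, n = [], 0, len(tokens)
    while i < n:
        for k in range(1, (n - i) // 2 + 1):          # shortest unit first
            unit = tokens[i:i + k]
            reps = 1
            while tokens[i + reps * k:i + (reps + 1) * k] == unit:
                reps += 1
            if reps >= 2:
                out += ["["] + unit + ["]+"]
                i += reps * k
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

# The example sentence of the text, reduced to its syntactic tags:
sent = ["SUBJ", "+FAUXV", "ADVL", "-FMAINV", "NN>", "OBJ", "-FMAINV",
        "QN>", "OBJ", "ADVL", "DN>", "NN>", "<P", "CC", "-FMAINV", "OBJ",
        "<NOM-FMAINV", "AN>", "NN>", "AN>", "OBJ", "-FMAINV", "SUBJ",
        "INFMARK>", "-FMAINV", "ADVL"]

print(" ".join(axis(sent, {"SUBJ", "+FAUXV", "+FMAINV"})))
# -> ... SUBJ +FAUXV ... SUBJ ...
verb_axis = axis(sent, {"+FAUXV", "+FMAINV", "-FMAINV", "INFMARK>"})
print(" ".join(generalise(verb_axis)))
# -> ... +FAUXV [ ... -FMAINV ]+ ... INFMARK> -FMAINV ...
```

With a database of such axes built from many sentences, a candidate analysis can be checked simply by testing whether its axis occurs among the stored ones.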
We can also repeat longer sequences. For instance, the set

    { -FMAINV <NOM-FMAINV +FAUXV
      SUBJ OBJ }

provides the axis

    SUBJ +FAUXV ... -FMAINV ... OBJ
    ... -FMAINV ... OBJ ... -FMAINV OBJ
    ... <NOM-FMAINV ... OBJ ...
    -FMAINV SUBJ ... -FMAINV ...

and we form a generalisation

    SUBJ +FAUXV [ ... -FMAINV ... OBJ ]+
    ... <NOM-FMAINV ... OBJ ...
    -FMAINV SUBJ ... -FMAINV ...

Note that we silently added an extra dot between one -FMAINV and OBJ in order not to make distinctions between -FMAINV OBJ and -FMAINV ... OBJ here.

Another generalisation can be made using equivalence classes. We can assign several syntactic tags to the same equivalence class (for instance -FMAINV, <NOM-FMAINV and <P-FMAINV), and then generate axes as above. The result would be

    SUBJ +FAUXV [ ... nonfinv ... OBJ ]+
    ... nonfinv SUBJ ... nonfinv ...

where nonfinv denotes both -FMAINV and <NOM-FMAINV (and also <P-FMAINV).

The equivalence classes are essential in the present tag set because the syntactic arguments of finite verbs are not distinguished from the arguments of nonfinite verbs. Using equivalence classes for the finite and nonfinite verbs, we may build a generalisation that applies to both types of clauses. Another way to solve the problem is to add new tags for the arguments of the nonfinite clauses, and make several axes for them.

2.2 Local patterns

In the second phase of the pattern parsing scheme we apply local patterns, the joints. They contain information about what kinds of modifiers have what kinds of heads, and vice versa.

For instance, in the following sentence⁴ the words fair and crack are both three ways ambiguous before the axes are applied.

    He_SUBJ gives_+FMAINV us_I-OBJ
    a_DN> fair_AN>/SUBJ/NN>
    crack_OBJ/+FMAINV/SUBJ then_ADVL
    we_SUBJ will_+FAUXV
    be_-FMAINV/-FAUXV in_ADVL with_ADVL
    a_DN> chance_<P of_<NOM-OF
    carrying_<P-FMAINV off_<NOM/ADVL
    the_DN> World_<P/NN> Cup_<P/OBJ .

⁴This analysis is comparable to the output of ENGCG. The ambiguity is marked here using the slash. The morphological information is not printed.

After the axes have been applied, the noun phrase a fair crack has the analyses

    a_DN> fair_AN>/NN> crack_OBJ.

The word fair is still left partly ambiguous. We resolve this ambiguity using the joints. In an ideal case we have only one head in each phrase, although it may not be in its exact location yet. The following sentence fragment demonstrates this:

    They_SUBJ have_+FAUXV been_-FMAINV
    much_AD-A> less_PCOMPL-S/AD-A>
    attentive_<NOM/PCOMPL-S
    to_<NOM/ADVL the_DN> ...

In the analysis, the head of the phrase much less attentive may be less or attentive. If it is less, the word attentive is a postmodifier, and if the head is attentive, then less is a premodifier. The sentence is represented internally in the parser in such a way that if the axes make this distinction, i.e. force there to be exactly one subject complement, there are only two possible paths which the joints can select from: less_AD-A> attentive_PCOMPL-S and less_PCOMPL-S attentive_<NOM.

Generating the joints is quite straightforward. We produce different alternative variants for each syntactic tag and select some of them. We use a couple of parameters to validate possible joint candidates.

* The error margin provides the probability for checking whether the context is relevant, i.e., whether there is enough evidence for it among the existing contexts of the tag. This probability may be used in two ways:

  - For a syntactic tag, generate all contexts (of length n) that appear in the corpora. Select all those contexts that are frequent enough. Do this with all values of n: 1, 2, ...

  - First generate all contexts of length 1. Select those contexts that are frequent enough among the generated contexts. Next, lengthen all contexts selected in the previous step by one word. Select those contexts that are frequent enough among the newly generated contexts. Repeat this sufficiently many times.

  Both algorithms produce a set of contexts of different lengths. Characteristic for both algorithms is that if they have generated a context of length n that matches a syntactic function in a sentence, there is also a context of length n-1 that matches.

* The absolute margin: the number of cases that is needed as evidence for the generated context. If there is less evidence, it is not taken into account and a shorter context is generated. This is used to prevent strange behaviour with syntactic tags that are not very common or with a corpus that is not big enough.

* The maximum length of the context to be generated.

During the parsing, longer contexts are preferred to shorter ones. The parsing problem is thus a kind of pattern matching problem: we have to match a pattern (context) around each tag and find a sequence of syntactic tags (analysis of the sentence) that has the best score. The scoring function depends on the lengths of the matched patterns.
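The preference for longer contexts can be sketched as a scoring function over candidate tag sequences. The joint table below is invented purely for illustration; the parser's actual joints and scoring function are not specified at this level of detail.

```python
# Sketch of joint-based ranking: each word contributes the length of the
# longest stored context that matches, so longer matches are preferred.
# The joint table here is hypothetical, not taken from the corpus.

def longest_match(stored, right):
    """Length of the longest stored context matching the tags to the right."""
    for n in range(len(right), 0, -1):
        if right[:n] in stored:
            return n
    return 0

def score(analysis, joints):
    """Score of one tag sequence: sum of longest context matches per word."""
    return sum(longest_match(joints.get(tag, set()), tuple(analysis[i + 1:]))
               for i, tag in enumerate(analysis))

# Hypothetical joints relevant to the phrase "a fair crack":
joints = {"DN>": {("AN>",), ("AN>", "OBJ"), ("NN>",)},
          "AN>": {("OBJ",)},
          "NN>": {("NN>",)}}
candidates = [["DN>", "AN>", "OBJ"],   # fair as premodifying adjective
              ["DN>", "NN>", "OBJ"]]   # fair as premodifying noun
best = max(candidates, key=lambda a: score(a, joints))
# With these invented joints the AN> reading scores higher (3 vs 1).
```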
    text     words   ambiguity rate   error rate
    bb1      1734    12.4 %           2.4 %
    bb2      1674    14.2 %           2.8 %
    today    1599    18.6 %           1.6 %
    wsj      2309    16.2 %           2.9 %
    total    7316    15.3 %           2.2 %

Figure 1: Test corpora after syntactical analysis of ENGCG.

Figure 2: Overall parsing success rate in syntactically analysed samples (roughly 92 % for each sample).
4 CONCLUSION

We discussed combining a linguistic rule-based parser and a corpus-based empirical parser. We divide the parsing process into two parts: applying linguistic information and applying corpus-based patterns. The linguistic rules are regarded as more reliable than the corpus-based generalisations. They are therefore applied first.

The idea is to use reliable linguistic information as long as possible. After a certain phase it becomes harder and harder to make new linguistic constraints to eliminate the remaining ambiguity. Therefore we use corpus-based patterns to do the remaining disambiguation. The overall success rate of the combination of the linguistic rule-based parser and the corpus-based pattern parser is good. If some unresolvable ambiguity is left pending (like prepositional attachment), the total success rate of our morphological and surface-syntactic analysis is only slightly worse than that of many probabilistic part-of-speech taggers. It is a good result because we do more than just label each word with a morphological tag (i.e. noun, verb, etc.); we also label them with syntactic function tags (i.e. subject, object, subject complement, etc.).

Some improvements might be achieved by modifying the syntactic tag set of ENGCG. As discussed above, the (syntactic) tag set of ENGCG is probably not optimal. Some ambiguity is not resolvable (like prepositional attachment) and some distinctions are not made (like subjects of the finite and the nonfinite clauses). A better tag set for surface-syntactic parsing is presented in [Voutilainen and Tapanainen, 1993]. But we have not modified the present tag set because it is not clear whether small changes would improve the result significantly when compared to the effort needed.

Although it is not possible to fully disambiguate the syntax in ENGCG, the rate of disambiguation can be improved using a more powerful linguistic rule formalism (see [Koskenniemi et al., 1992; Koskenniemi, 1990; Tapanainen, 1991]). The results reported in this study can most likely be improved by writing a syntactic grammar in the finite-state framework. The same kind of pattern parser could then be used for disambiguating the resulting analyses.

5 ACKNOWLEDGEMENTS

The Constraint Grammar framework was originally proposed by Fred Karlsson [1990]. The extensive work on the description of English was done by Atro Voutilainen, Juha Heikkilä and Arto Anttila [1992]. Timo Järvinen [1994] has developed the syntactic constraint system further. ENGCG uses Kimmo Koskenniemi's [1983] two-level morphological analyser and Pasi Tapanainen's implementation of the Constraint Grammar parser.

We want to thank Fred Karlsson, Lauri Karttunen, Annie Zaenen, Atro Voutilainen and Gregory Grefenstette for commenting on this paper.

References

[Järvinen, 1994] Timo Järvinen. Annotating 200 million words: The Bank of English project. In Proceedings of COLING-94. Kyoto, 1994.

[Karlsson, 1990] Fred Karlsson. Constraint Grammar as a framework for parsing running text. In Hans Karlgren (editor), COLING-90. Papers presented to the 13th International Conference on Computational Linguistics. Vol. 3, pp. 168-173, Helsinki, 1990.

[Karlsson, 1994] Fred Karlsson. Robust parsing of unconstrained text. In Nelleke Oostdijk and Pieter de Haan (eds.), Corpus-based Research Into Language, pp. 121-142, Rodopi, Amsterdam-Atlanta, 1994.

[Karlsson et al., 1994] Fred Karlsson, Atro Voutilainen, Juha Heikkilä and Arto Anttila (eds.). Constraint Grammar: a Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter, Berlin, 1994.

[Koskenniemi, 1983] Kimmo Koskenniemi. Two-level morphology: a general computational model for word-form recognition and production. Publications no. 11. Dept. of General Linguistics, University of Helsinki, 1983.

[Koskenniemi, 1990] Kimmo Koskenniemi. Finite-state parsing and disambiguation. In Hans Karlgren (editor), COLING-90. Papers presented to the 13th International Conference on Computational Linguistics. Vol. 2, pp. 229-232, Helsinki, 1990.

[Koskenniemi et al., 1992] Kimmo Koskenniemi, Pasi Tapanainen and Atro Voutilainen. Compiling and using finite-state syntactic rules. In Proceedings of the fifteenth International Conference on Computational Linguistics. COLING-92. Vol. I, pp. 156-162, Nantes, France, 1992.

[Tapanainen, 1991] Pasi Tapanainen. Äärellisinä automaatteina esitettyjen kielioppisääntöjen soveltaminen luonnollisen kielen jäsentäjässä (Natural language parsing with finite-state syntactic rules). Master's thesis. Dept. of Computer Science, University of Helsinki, 1991.

[Voutilainen, 1994] Atro Voutilainen. Three studies of grammar-based surface parsing of unrestricted English text. Publications no. 24. Dept. of General Linguistics, University of Helsinki, 1994.

[Voutilainen et al., 1992] Atro Voutilainen, Juha Heikkilä and Arto Anttila. Constraint Grammar of English -- A Performance-Oriented Introduction. Publications no. 21. Dept. of General Linguistics, University of Helsinki, 1992.

[Voutilainen and Tapanainen, 1993] Atro Voutilainen and Pasi Tapanainen. Ambiguity resolution in a reductionistic parser. In Proceedings of the Sixth Conference of the European Chapter of the Association for Computational Linguistics. EACL-93. pp. 394-403, Utrecht, Netherlands, 1993.

A THE TAG SET

This appendix contains the syntactic tags we have used. The list is adopted from [Voutilainen et al., 1992]. To obtain also the morphological "part-of-speech" tags you can send an empty e-mail message to engcg-info@ling.helsinki.fi.
+FAUXV = Finite Auxiliary Predicator: He can read.,
-FAUXV = Nonfinite Auxiliary Predicator: He may have read.,
+FMAINV = Finite Main Predicator: He reads.,
-FMAINV = Nonfinite Main Predicator: He has read.,
NPHR = Stray NP: Volume I: Syntax,
SUBJ = Subject: He reads.,
F-SUBJ = Formal Subject: There was some argument about that. It is raining.,
OBJ = Object: He read a book.,
I-OBJ = Indirect Object: He gave Mary a book.,
PCOMPL-S = Subject Complement: He is a fool.,
PCOMPL-O = Object Complement: I consider him a fool.,
ADVL = Adverbial: He came home late. He is in the car.,
O-ADVL = Object Adverbial: He ran two miles.,
APP = Apposition: Helsinki, the capital of Finland,
N = Title: King George and Mr.,
DN> = Determiner: He read the book.,
NN> = Premodifying Noun: The car park was full.,
AN> = Premodifying Adjective: The blue car is mine.,
QN> = Premodifying Quantifier: He had two sandwiches and some coffee.,
GN> = Premodifying Genitive: My car and Bill's bike are blue.,
AD-A> = Premodifying Ad-Adjective: She is very intelligent.,
<NOM-OF = Postmodifying Of: Five of you will pass.,
<NOM-FMAINV = Postmodifying Nonfinite Verb: He has the licence to kill. John is easy to please. The man drinking coffee is my uncle.,
<AD-A = Postmodifying Ad-Adjective: This is good enough.,
<NOM = Other Postmodifier: The man with glasses is my uncle. He is the president elect. The man in the moon fell down too soon.,
INFMARK> = Infinitive Marker: John wants to read.,
<P-FMAINV = Nonfinite Verb as Complement of Preposition: This is a brush for cleaning.,
<P = Other Complement of Preposition: He is in the car.,
CC = Coordinator: John and Bill are friends.,
CS = Subordinator: If John is there, we shall go, too.,