Vee Satayamas and Asanee Kawtrakul NAiST Research Laboratory Computer Engineering Department Kasetsart University, Thailand { g4665300@ku.ac.th and ak@ku.ac.th }
Abstract
Parsing is an important step in natural language understanding, including phrase alignment for supporting statistical machine translation. The ability of a parser to analyse real text depends strongly on its grammar. A treebank can be one of the sources for grammar extraction. However, treebank construction relies largely on the intuitions of human annotators, and different intuitions among multiple annotators bring inconsistency to treebank construction. In this paper, we propose a method for constructing a treebank with semi-automatic correction. Furthermore, we use the grammar extracted from our corrected treebank to improve the accuracy of semi-automatic phrase structure annotation in the next increment of treebank construction. This reduces the labour wasted on phrase structure annotation. From the corrected treebanks, we can extract a better wide-coverage grammar for supporting a parser.

1 Introduction
A parser is a program that analyses the structure of sentences. It plays an important role in natural language understanding (NLU), including phrase alignment for supporting statistical machine translation [1]. A parser requires a model, called a grammar, to recognise well-formed sentences. Our aim is to enable a parser, which needs a grammar as its core component, to analyse real text. Analysing real text accurately requires a wide-coverage grammar. A treebank can be a source for grammar extraction [2][3]. For instance, we can extract a PCFG from a treebank by recognising every fragment in the treebank: the parent constituent is treated as the left-hand side of a rule and its immediate children are put on the right-hand side.

A treebank [4][5] is a collection of phrase structure trees constructed by linguists. Its objective is the observation of language phenomena. A treebank can be one of the sources for grammar extraction, and it is also used as a test suite for evaluating parser accuracy. To cover language phenomena, a treebank is supposed to be large, but treebank construction is a labour-intensive task. Therefore, most treebanks are annotated by multiple annotators, who often have different intuitions. These different intuitions bring inconsistency, or even plain errors, into the treebank. Moreover, annotating a sophisticated treebank is time-intensive, so most treebanks are a compromise between linguistic correctness and ease of annotation.

In this paper, treebank inconsistency is divided into four categories:

1. Inconsistency from fine/coarse annotation
Fine-grained annotation is a time-intensive task, so some constituents are left unannotated. However, constituents of the same kind should not be annotated in some places and left unannotated in others; otherwise the treebank becomes inconsistent, as illustrated in figure 1.

2. Inconsistency from different boundary annotation
Annotators can bracket a constituent with different boundaries, either because the annotation guide does not cover that kind of constituent or because of human error, as illustrated in figure 2.

3. Inconsistency from different attachment
Some constituents can grammatically attach to different constituents, so annotators have to make a decision, and the decisions of different annotators on the same text can differ, as illustrated in figure 3.

4. Inconsistency from different constituent category
Human annotators have to assign a category to each constituent. They might assign a different or incorrect category to a constituent, which brings inconsistency or even errors.

[Figure 1: Example of inconsistency from fine/coarse annotation (trees over "eat rice in house"; node labels VP, V, NP, N, PP, P)]

[Figure 2: Example tree of inconsistency from different boundary annotation (node labels NP; leaves: body, medicine, important)]

[Figure 3: Example tree of inconsistency from different attachment; PP can attach to NP or VP (trees over "eat rice in house"; node labels VP, V, NP, N, PP, P)]

In the Penn Treebank [6], some inconsistency problems were solved by annotation criteria. Krotov [7] proposes a method for compacting the context-free phrase structure grammar (CF-PSG) extracted from the Penn Treebank by eliminating unnecessary rules, as shown in figure 4, and recalculating the probabilities. The compaction method proposed by Krotov [7] is not structure-preserving, because the eliminated rules are replaced by derivations through other rules, which are not guaranteed to respect linguistically motivated structure.

(1) PP → P N
(2) PP → P NP
(3) NP → N
Rule (1) can be eliminated because rule (2) and rule (3) can be used instead.
Figure 4: Unnecessary rules elimination

Genabith [8][9] proposed structure-preserving CF-PSG compaction by replacing tags with super-tags and relabelling the treebank. Super-tags are more general than the previous tags. The treebank keeps the linguistically motivated structure it had before; only the labels become more general. Thus, the size of the extracted grammar is reduced.

Kaljurand [10] made a tool for checking treebank consistency by searching for groups suspected of being inconsistent. A group is a sequence of lexical units in which the parts of speech of the items are the same. Suspicious groups are those in which a small number of constituents are annotated differently from most constituents in the same group. This is formalised by a skewvalue, and groups with a high skewvalue are regarded as suspicious. The skewvalue is calculated by

$$
\mathrm{skewvalue}(g) =
\begin{cases}
\sum_{c \in C} \left(f_c - \mathrm{mean}(f_C)\right)^2 & \text{if } |C| > 1 \\
\ldots & \text{if } |C| = 1
\end{cases}
$$

In this paper, we propose treebank correction that focuses on inconsistency from fine/coarse annotation, which is the kind most often found in our Thai treebank. We find fine/coarse annotation with a method similar to Krotov's [7], but we apply it to the trees in the treebank, whereas Krotov's method works on the extracted grammar. We also apply Kaljurand's method [10] to check the other types of inconsistency.

2 Our approach

2.1 Treebank Construction
Treebanks are constructed by linguists. Automatic programs are applied to minimise human labour, but linguists need to correct the results of automatic annotation [4][5]. In this paper, we propose an extended process for treebank construction, that is, treebank
correction. We also propose a program that automatically detects treebank errors and inconsistencies. Hence our treebank construction process consists of:
1. Sentence boundary annotation
Unlike English, which uses the full stop as a sentence boundary marker, Thai has no explicit sentence boundary marker. However, [11] proposes an automatic process that extracts Thai sentence boundaries with 79.82% accuracy. Thus, we can use it as a pre-processing program for sentence boundary annotation to reduce the labour required. In our treebank, we use line breaks to separate sentences, as shown in figure 5.
[Figure: annotation example with the tags (ncn), (vt), (ncn); words: chicken, fish, eat, salty, egg, expensive]
4. Phrase structure annotation
We use the existing free implementation of a PCFG parser in NLTK to annotate phrase structure automatically, and linguists fix its output afterwards. Trees are written in a Lisp-like bracket notation, as shown in figure 8, which is equivalent to the tree in figure 9. We also reuse the grammar extracted from the treebank in this step.
5. Treebank correction
The details of treebank correction are explained in the next section.
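As a rough illustration of how a grammar can be read off bracketed trees, the sketch below parses a Lisp-like bracket string and counts one rule per tree fragment, with the parent on the left-hand side and its immediate children on the right-hand side. The bracket layout of the sample string, the function names `parse_brackets`, `collect_rules` and `estimate_pcfg`, and the relative-frequency estimate of rule probabilities are assumptions made for this example; the paper does not specify them, and the sketch does not use NLTK.

```python
from collections import Counter


def parse_brackets(text):
    """Parse one Lisp-like bracketed tree into nested (label, children) tuples.

    Assumed layout (hypothetical): (LABEL child child ...), where a child is
    either another bracketed node or a bare word.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def collect_rules(tree, counts):
    """Count one rule per fragment: parent label -> labels/words of its children."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            collect_rules(child, counts)


def estimate_pcfg(trees):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter()
    for tree in trees:
        collect_rules(tree, counts)
    lhs_totals = Counter()
    for (lhs, _rhs), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Hypothetical sample in the assumed bracket layout.
sample = "(S (NP (ncn chicken)) (VP (vt eat) (NP (ncn egg))))"
pcfg = estimate_pcfg([parse_brackets(sample)])
for (lhs, rhs), p in sorted(pcfg.items()):
    print(lhs, "->", " ".join(rhs), p)
```

For this hypothetical input, the rule NP → ncn is seen twice and is the only NP expansion, so its estimated probability is 1.0, while each of the two lexical rules for ncn gets 0.5.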
ncn - common noun, vt - transitive verb, vi - intransitive verb, adj - adjective
[Figure: example trees; node labels include S, VP, NP, PP, p, ncn, adj; leaves include chicken, eat, egg and "cat eat rice in house new"]
Figure 11: Tree in which the noun phrase in the preposition phrase was annotated

Some constituents may be annotated incorrectly. For instance, in figure 12 one of the VPs should be annotated as SUBC, as in figure 13. Such cases can also be handled with the rules extracted from the treebank. Accordingly, we apply the skewvalue [10] to check them. Suspicious trees are expected to have a high skewvalue.
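The skewvalue check can be sketched as follows. This is only an illustration under stated assumptions: C is taken to be the set of distinct annotations observed for a group, f_c the frequency of annotation c, and the skewvalue the sum of squared deviations of those frequencies from their mean when |C| > 1. The value for |C| = 1 is not recoverable from the text, so the sketch returns `None` in that case. The names `skewvalue` and `rank_groups` and the sample data are hypothetical.

```python
from collections import Counter
from statistics import mean


def skewvalue(annotation_counts):
    """Sum of squared deviations of annotation frequencies from their mean.

    annotation_counts maps each distinct annotation c in C to its frequency f_c.
    Only the |C| > 1 case is implemented; the |C| = 1 value is unknown here.
    """
    freqs = list(annotation_counts.values())
    if len(freqs) < 2:
        return None  # assumption: the single-annotation case is left undefined
    m = mean(freqs)
    return sum((f - m) ** 2 for f in freqs)


def rank_groups(groups):
    """Return (skewvalue, group key) pairs, most suspicious first.

    groups maps a group key (here a part-of-speech sequence) to the list of
    annotations that occurrences of that sequence received in the treebank.
    """
    scored = []
    for key, annotations in groups.items():
        score = skewvalue(Counter(annotations))
        if score is not None:
            scored.append((score, key))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical data: one group is annotated almost always the same way,
# the other is split evenly between two annotations.
groups = {
    ("p", "ncn"): ["PP"] * 9 + ["NP"],
    ("vt", "ncn"): ["VP"] * 5 + ["NP"] * 5,
}
for score, key in rank_groups(groups):
    print(key, score)
```

Under these assumptions, a group in which one annotation dominates and a few occurrences deviate gets a high score, while an evenly split group scores zero, which matches the idea that a small number of differently annotated constituents makes a group suspicious.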
[Figure: tree with node labels S, NP, ncn, adj, VP, vt, conjncl, VP, *VP*, VP, vi, NP, pref1, vi; leaves include smell, good, cause, -ing, feel]

3 Conclusion
We contribute a framework for constructing a Thai treebank and a semi-automatic treebank correction method that alleviates the treebank inconsistency problem. Hence, we can construct a treebank faster and with better quality. From our corrected treebank, we can extract a better wide-coverage grammar for supporting a parser. However, our treebank is based on a pure context-free phrase structure grammar. Annotating with a more sophisticated grammar would let our work support a more sophisticated parser, which could parse Thai text more accurately.
Acknowledgement
Thanks to the Junior Science Talent Project (JSTP), supported by the National Science and Technology Development Agency (NSTDA), and to the Thailand Research Fund (TRF) and the Kasetsart University Research and Development Institute (KURDI) for funding. Thanks to Mukda Suktarachan and Pacharee Varasrai for annotating the corpora, and to Warat Yingsaeree for proofreading.
References
[1] Kenji Imamura, Hierarchical phrase alignment harmonized with parsing, in NLPRS, 2001.
[2] Eugene Charniak, Tree-bank grammars, in National Conference on Artificial Intelligence (AAAI), 1996.
[3] Michael Burke, Aoife Cahill, Ruth O'Donovan, Josef Van Genabith, and Andy Way, Treebank-based acquisition of wide-coverage, probabilistic LFG resources: Project overview, result and evaluation, in IJCNLP-04 Workshop, 2004.
[4] Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger, Building a large annotated corpus of English: The Penn tree bank, in Computational Linguistics, 1993, pp. 313–330.
[5] Nianwen Xue and Fei Xia, The bracketing guidelines for Penn Chinese treebank 3.0, Tech. Rep. 00-08, University of Pennsylvania, 2000, IRCS Report.
[6] Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger, The Penn treebank: Annotating predicate argument structure, in ARPA Human Language Technology Workshop, 1994.
[7] Alexander Krotov, Robert Gaizauskas, Mark Hepple, and Yorick Wilks, Compacting the Penn treebank grammar, in Proceedings of COLING/ACL98, Montreal, Canada, 1998, pp. 699–703.
[8] Josef van Genabith, Anette Frank, and Andy Way, Treebank vs. xbar-based automatic f-structure annotation, in The LFG01 Conference, 2001.
[9] Josef van Genabith, Louisa Sadler, and Andy Way, Structure preserving CF-PSG compacting, LFG and treebanks, in Proceeding of ATALA Workshop - Treebanks, 1999.
[10] Kaarel Kaljurand, Checking treebank consistency to find annotation errors, Report, 2004.
[11] Pradit Mittrapiyannuruk and Virach Sornlertlamvanich, The automatic Thai sentence extraction, in The Proceeding of the SNLP 2000, 2000.
[12] Virach Sornlertlamvanich, Word segmentation for Thai in machine translation system, in Machine Translation, National Electronics and Computer Technology Center, 1993.
[13] Asanee Kawtrakul, Chalatip Thumkanon, Yuen Poovorawan, Patcharee Varasrai, and Mukda Suktarachan, Automatic Thai unknown word recognition, 1997.
[14] Sutee Sudprasert and Asanee Kawtrakul, Thai word segmentation based on global and local unsupervised machine learning, in NCSEC 2003, 2003.
[15] Ingo Schroder, Case study in part-of-speech tagging using the ICOPOST toolkit, Tech. Rep., Department of Computer Science, University of Hamburg, 2002.