
Wide-Coverage Grammar Extraction from Thai Treebank

Vee Satayamas and Asanee Kawtrakul NAiST Research Laboratory Computer Engineering Department Kasetsart University, Thailand { g4665300@ku.ac.th and ak@ku.ac.th }
Abstract

Parsing is an important step in natural language understanding, including phrase alignment for statistical machine translation. A parser's ability to analyse real text depends strongly on its grammar, and a treebank is one source from which a grammar can be extracted. However, treebank construction relies largely on the intuitions of human annotators, and differing intuitions among multiple annotators bring inconsistency into the treebank. In this paper, we propose a method for constructing a treebank with semi-automatic correction. Furthermore, we use the grammar extracted from the corrected treebank to improve the accuracy of semi-automatic phrase structure annotation in the next increment of treebank construction, which reduces the labour spent on phrase structure annotation. From the corrected treebanks, we can extract a better wide-coverage grammar to support the parser.

1 Introduction

A parser is a program that analyses the structure of sentences. It plays an important role in natural language understanding (NLU), including phrase alignment for statistical machine translation [1]. A parser requires a model, called a grammar, to recognise well-formed sentences. Our aim is to enable a parser, which needs a grammar as its core component, to analyse real text. Analysing real text accurately requires a wide-coverage grammar. A treebank can serve as a source for grammar extraction [2][3]. For instance, we can extract a PCFG from a treebank by reading off every fragment in the treebank: each parent constituent becomes the left-hand side of a rule and its immediate children become the right-hand side.

A treebank [4][5] is a collection of phrase structure trees constructed by linguists. Its purpose is the observation of language phenomena. Besides being a source for grammar extraction, a treebank is also used as a test suite for evaluating parser accuracy. To cover language phenomena, a treebank is supposed to be large, but treebank construction is a labour-intensive task. Therefore, most treebanks are annotated by multiple annotators, who often have different intuitions. These different intuitions bring inconsistency, or even plain errors, into the treebank. Moreover, annotating a sophisticated treebank is time-intensive, so most treebanks are a compromise between linguistic correctness and ease of annotation.

In this paper, treebank inconsistency is divided into four categories:

1. Inconsistency from fine/coarse annotation
Fine-grained annotation is a time-intensive task, so some constituents are left unannotated. However, constituents of the same kind should not be annotated in some places and left unannotated in others; otherwise the treebank becomes inconsistent, as illustrated in figure 1.

Figure 1: Example of inconsistency from fine/coarse annotation (VP trees over the glosses "eat rice in house", with node labels VP, V, NP, N, PP, P)

2. Inconsistency from different boundary annotation
A constituent can be bracketed with different boundaries, because the annotation guide does not cover that kind of constituent or because of human error, as illustrated in figure 2.

Figure 2: Example tree of inconsistency from different boundary annotation (NP bracketings over the glosses "body medicine important")

3. Inconsistency from different attachment
Some constituents can grammatically attach to different constituents, so annotators have to make a decision, and different annotators can decide differently on the same text, as illustrated in figure 3.

Figure 3: Example tree of inconsistency from different attachment (VP trees over the glosses "eat rice in house"); the PP can attach to NP or VP.

4. Inconsistency from different constituent category
Human annotators have to assign a category to each constituent. They might assign different or incorrect categories to a constituent, which brings inconsistency or even errors.

In the Penn Treebank [6], some inconsistency problems were solved by annotation criteria. Krotov [7] proposes a method to compact the context-free phrase structure grammar (CF-PSG) extracted from the Penn Treebank by eliminating unnecessary rules, as shown in figure 4, and recalculating the probabilities.

(1) PP → P N
(2) PP → P NP
(3) NP → N
Rule (1) can be eliminated because rule (2) and rule (3) can be used instead.

Figure 4: Unnecessary rules elimination

The compaction method proposed by Krotov [7] is not structure-preserving, because an eliminated rule is replaced by a parse with other rules, which is not guaranteed to respect linguistically motivated structure. Genabith [8][9] proposed structure-preserving CF-PSG compaction by replacing tags with super-tags and relabelling the treebank. Super-tags are more general than the previous tags. The relabelled treebank keeps the linguistically motivated structure of the previous treebank, except that its labels are more general. Thus, the size of the extracted grammar is reduced.

Kaljurand [10] made a tool for checking treebank consistency by searching for groups that are suspected of being inconsistent. A group is formed from sequences of lexical units whose items carry the same parts of speech. Suspicious groups are found by selecting the groups in which a small number of constituents are annotated differently from most constituents in the same group. This is formalised by the skewvalue: groups with a high skewvalue are regarded as suspicious. The skewvalue is calculated by

\[
\mathrm{skewvalue}(g) =
\begin{cases}
\sum_{c \in C} \left( f_c - \mathrm{mean}(f_C) \right)^2 & \text{if } |C| > 1 \\
\ldots & \text{if } |C| = 1
\end{cases}
\]

where C is the set of categories used for g in the treebank and g is a group.

In this paper, we propose treebank correction, especially for inconsistency from fine/coarse annotation, which is the type mostly found in our Thai treebank. We find fine/coarse annotation inconsistency with a method similar to Krotov's [7], but we apply it to the trees in the treebank, whereas Krotov's method works on the extracted grammar. We also apply Kaljurand's method [10] to check the other types of inconsistency.

2 Our approach

2.1 Treebank Construction

Treebanks are constructed by linguists. Automatic programs are applied to minimise human labour, but linguists need to correct the results of automatic annotation [4][5]. In this paper, we propose an extension to the treebank construction process, namely treebank correction. We also propose a program that automatically detects treebank errors and inconsistency. Our treebank construction process therefore consists of:

1. Sentence boundary annotation
Unlike English, which uses the full stop as a sentence boundary marker, Thai has no explicit sentence boundary marker. However, [11] proposes an automatic process that extracts Thai sentence boundaries with 79.82% accuracy, so we can use it as a pre-processing program for sentence boundary annotation to reduce labour. In our treebank, we use line breaks to separate sentences, as shown in figure 5.

( Chicken eat egg. )
( Salted fish is expensive )

Figure 5: Sentences that are separated by line breaks (English glosses)

2. Word boundary annotation
Unlike English, Thai has no explicit word boundary marker. There have been several attempts to find Thai word boundaries automatically (also known as word segmentation or word breaking), based on a dictionary [12], an annotated corpus [13] or an unannotated corpus [14]. In this paper, we use Thai Word Segmentation based on Global and Local Unsupervised Machine Learning [14], because we do not have a large annotated Thai corpus.

(chicken) (eat) (egg)
(fish) (salty) (expensive)

Figure 6: Words that are separated by spaces (English glosses)

3. Part-of-speech (POS) annotation
We use acopost [15], a free implementation, to annotate POS automatically, and linguists fix its results later. In our treebank, we use Lisp-like brackets as in the Penn Treebank, as illustrated in figure 7 (the tag names are explained in footnote 1).

(ncn chicken) (vt eat) (ncn egg)
(ncn fish) (adj salty) (vi expensive)

Figure 7: POS Annotation (English glosses)

4. Phrase structure annotation
We use an existing free implementation of a PCFG parser in NLTK to annotate phrase structure automatically and let linguists fix its results later. We again use Lisp-like brackets, as shown in figure 8, which is equivalent to the tree in figure 9. We also reuse the grammar extracted from the treebank in this step.

(NP (ncn chicken)) (VP (vt eat) (NP (ncn egg)))

Figure 8: Phrase Bracket Annotation (English glosses)

5. Treebank correction
The details of treebank correction are explained in the next section.

1 ncn - common noun, vt - transitive verb, vi - intransitive verb, adj - adjective

(S (NP (ncn chicken)) (VP (vt eat) (NP (ncn egg))))

Figure 9: Phrase Structure Tree (shown in bracket form, English glosses)
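Grammar extraction from bracketed trees such as those in figures 8 and 9 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`parse_brackets`, `rules_of`, `extract_pcfg`) are hypothetical, the second tree in the toy treebank is made up for the example, lexical rules (tag to word) are skipped, and rule probabilities are estimated by relative frequency, which is an assumption since the paper does not state how probabilities are computed.

```python
from collections import Counter


def parse_brackets(text):
    """Parse one Lisp-like bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "a tree must start with an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def rules_of(tree):
    """Yield (lhs, rhs) for every fragment: the parent is the left-hand side,
    the labels of its immediate children form the right-hand side."""
    label, children = tree
    kids = [c for c in children if isinstance(c, tuple)]
    if kids:  # nodes that only dominate a word are skipped (assumption)
        yield (label, tuple(k[0] for k in kids))
        for k in kids:
            yield from rules_of(k)


def extract_pcfg(trees):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter(rule for tree in trees for rule in rules_of(tree))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Toy treebank with English glosses; the second tree is hypothetical.
TREEBANK = [
    "(S (NP (ncn chicken)) (VP (vt eat) (NP (ncn egg))))",
    "(S (NP (ncn fish) (adj salty)) (VP (vi expensive)))",
]

if __name__ == "__main__":
    pcfg = extract_pcfg(parse_brackets(t) for t in TREEBANK)
    for (lhs, rhs), p in sorted(pcfg.items()):
        print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

Keeping trees as plain nested tuples avoids any library dependency; with NLTK installed, its own tree and grammar classes would be the more natural choice.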

2.2 Treebank Correction


Our approach focuses on correcting the treebank instead of trying to reduce the effect of treebank errors. Treebank correction is a semi-automatic process that works much like a spelling checker: the treebank correction system guides the human corrector by extracting suspicious trees. The system uses the following heuristic methods:

1. Extract fine/coarse grain inconsistency trees
The phrase (house) (new) in figure 10 was left unannotated, but it should be annotated as in figure 11. Such trees can be found with rules extracted from the treebank. In this example, the rule NP → ncn adj must have been extracted from the treebank; it can then be applied to the tree in figure 10, where (house) (new) was left unannotated.

Figure 10: Tree in which a noun phrase was left unannotated (glosses "cat eat rice in house new"; the PP directly dominates p ncn adj)

Figure 11: Tree in which the noun phrase in the preposition phrase was annotated (same sentence; the PP dominates p and NP, and the NP dominates ncn adj)

2. Extract other inconsistency trees
Some constituents may be annotated incorrectly; for instance, in figure 12 one of the VPs should be annotated as SUBC, as in figure 13. These can also be found with rules extracted from the treebank. Accordingly, we apply the skewvalue [10] to check them. Suspicious trees are supposed to have a high skewvalue.
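The two heuristics above can be sketched as follows, under several assumptions that are not in the paper. Trees are nested `(label, children)` tuples; `find_unannotated` and `skewvalue` are hypothetical names; only rules with at least two right-hand-side symbols are matched, so that a unary rule does not flag every bare tag; f_c is taken to be the number of times category c is assigned to the group; and a group with a single category is given the value 0.0, which is a placeholder choice rather than the value used in [10]. The attachment of the PP in the example tree is also assumed.

```python
def child_labels(node):
    """Labels of a node's constituent children; bare strings are words."""
    return [c[0] for c in node[1] if isinstance(c, tuple)]


def find_unannotated(tree, rules):
    """Return (parent_label, start_index, lhs) wherever the right-hand side of
    an extracted rule occurs as a proper contiguous part of a node's children,
    i.e. where a constituent of category lhs may have been left unannotated."""
    hits = []
    labels = child_labels(tree)
    for lhs, rhs in rules:
        n = len(rhs)
        if n < 2 or n >= len(labels):  # assumption: skip unary and full matches
            continue
        for i in range(len(labels) - n + 1):
            if tuple(labels[i:i + n]) == tuple(rhs):
                hits.append((tree[0], i, lhs))
    for child in tree[1]:
        if isinstance(child, tuple):
            hits.extend(find_unannotated(child, rules))
    return hits


def skewvalue(freqs):
    """Sum of squared deviations of the category counts from their mean."""
    counts = list(freqs.values())
    if len(counts) <= 1:
        return 0.0  # assumption: a single-category group is not suspicious
    mean = sum(counts) / len(counts)
    return sum((f - mean) ** 2 for f in counts)


# Tree in the spirit of figure 10 (English glosses, PP attachment assumed).
FLAT_TREE = ("S", [
    ("NP", [("ncn", ["cat"])]),
    ("VP", [
        ("vt", ["eat"]),
        ("NP", [("ncn", ["rice"])]),
        ("PP", [("p", ["in"]), ("ncn", ["house"]), ("adj", ["new"])]),
    ]),
])

EXTRACTED_RULES = [("NP", ("ncn", "adj"))]

if __name__ == "__main__":
    print(find_unannotated(FLAT_TREE, EXTRACTED_RULES))
    # A group labelled VP nine times and SUBC once is far more skewed
    # than a group split evenly between the two categories.
    print(skewvalue({"VP": 9, "SUBC": 1}), skewvalue({"VP": 5, "SUBC": 5}))
```

Both functions only rank or flag candidates; the decision whether a flagged tree is actually wrong stays with the human corrector, in line with the semi-automatic process described above.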

Figure 12: Tree in which a VP should be annotated as SUBC (glosses "smell good cause -ing feel"; the node in question is marked *VP*)

Figure 13: Tree with sub clause (same sentence; the node is relabelled *SUBC*)

3 Conclusion

We contribute a framework for constructing a Thai treebank and a semi-automatic treebank correction method that alleviates the treebank inconsistency problem. Hence, we can construct a treebank faster and with better quality. From our corrected treebank, we can extract a better wide-coverage grammar to support the parser. However, our treebank is based on a pure context-free phrase structure grammar. Annotating a more sophisticated grammar would let our work support a more sophisticated parser, which could parse Thai text more accurately.

Acknowledgement

We thank the Junior Science Talent Project (JSTP) supported by the National Science and Technology Development Agency (NSTDA), the Thailand Research Fund (TRF) and the Kasetsart University Research and Development Institute (KURDI) for funding. We thank Mukda Suktarachan and Pacharee Varasrai for annotating the corpora, and Warat Yingsaeree for proofreading.

References

[1] Kenji Imamura, "Hierarchical phrase alignment harmonized with parsing," in NLPRS, 2001.
[2] Eugene Charniak, "Tree-bank grammars," in National Conference on Artificial Intelligence (AAAI), 1996.
[3] Michael Burke, Aoife Cahill, Ruth O'Donovan, Josef van Genabith, and Andy Way, "Treebank-based acquisition of wide-coverage, probabilistic LFG resources: Project overview, result and evaluation," in IJCNLP-04 Workshop, 2004.
[4] Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger, "Building a large annotated corpus of English: The Penn tree bank," in Computational Linguistics, 1993, pp. 313-330.
[5] Nianwen Xue and Fei Xia, "The bracketing guidelines for Penn Chinese treebank 3.0," Tech. Rep. 00-08, University of Pennsylvania, 2000, IRCS Report.
[6] Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger, "The Penn treebank: Annotating predicate argument structure," in ARPA Human Language Technology Workshop, 1994.
[7] Alexander Krotov, Robert Gaizauskas, Mark Hepple, and Yorick Wilks, "Compacting the Penn treebank grammar," in Proceedings of COLING/ACL98, Montreal, Canada, 1998, pp. 699-703.
[8] Josef van Genabith, Anette Frank, and Andy Way, "Treebank vs. xbar-based automatic f-structure annotation," in The LFG01 Conference, 2001.
[9] Josef van Genabith, Louisa Sadler, and Andy Way, "Structure preserving CF-PSG compacting, LFG and treebanks," in Proceeding of ATALA Workshop - Treebanks, 1999.
[10] Kaarel Kaljurand, "Checking treebank consistency check to find annotation errors," Report, 2004.
[11] Pradit Mittrapiyannuruk and Virach Sornlertlamvanich, "The automatic Thai sentence extraction," in The Proceeding of the SNLP 2000, 2000.
[12] Virach Sornlertlamvanich, "Word segmentation for Thai in machine translation system," in Machine Translation, National Electronics and Computer Technology Center, 1993.
[13] Asanee Kawtrakul, Chalatip Thumkanon, Yuen Poovorawan, Patcharee Varasrai, and Mukda Suktarachan, "Automatic Thai unknown word recognition," 1997.
[14] Sutee Sudprasert and Asanee Kawtrakul, "Thai word segmentation based on global and local unsupervised machine learning," in NCSEC 2003, 2003.
[15] Ingo Schröder, "Case study in part-of-speech tagging using the icopost toolkit," Tech. Rep., Department of Computer Science, University of Hamburg, 2002.
