
Automatic English-to-Korean Text Translation of Telegraphic Messages in a Limited Domain

Clifford Weinstein, Dinesh Tummala, Young-Suk Lee
Lincoln Laboratory, MIT, Lexington, MA 02173, USA
cjw@sst.ll.mit.edu, tummala@sst.ll.mit.edu, ysl@sst.ll.mit.edu

Stephanie Seneff
MIT LCS, SLS, Cambridge, MA 02139, USA
seneff@lcs.mit.edu

Abstract

This paper describes our work-in-progress in automatic English-to-Korean text translation. This work is an initial step toward the ultimate goal of text and speech translation for enhanced multilingual and multinational operations. For this purpose, we have adopted an interlingua approach with natural language understanding (TINA) and generation (GENESIS) modules at the core. We tackle the ambiguity problem by incorporating syntactic and semantic categories in the analysis grammar. Our system is capable of producing accurate translation of complex sentences (38 words) and sentence fragments as well as average length (12 words) grammatical sentences. Two types of system evaluation have been carried out: one for grammar coverage and the other for overall performance. For system robustness, integration of two subsystems is under way: (i) a rule-based part-of-speech tagger to handle unknown words/constructions, and (ii) a word-for-word translator to handle other system failures.

1 Introduction

The overall goal of our translation work is automatic text and speech translation for limited-domain multilingual applications. The primary target application is enhanced communication among military forces in a multilingual coalition environment where translation utilizes a Common Coalition Language as a military interlingua. Our development effort was initiated with a speech-to-speech translation system, called CCLINC (Tummala et al., 1995), which consists of a modular, multilingual structure including speech recognition, language understanding, language generation, and speech synthesis in each language. The system architecture of CCLINC is given in Figure 1. Note that the system design provides for verification of the system's understanding of each utterance to the originator, in a paraphrase in the originator's language, before transmission on the coalition network.

Figure 1: System structure for multilingual speech-to-speech translation
(English/CCL, Korean/CCL, and French/CCL translation systems, each with understanding, semantic frame, and confirmation stages, connected to command and control through the Common Coalition Language network)

This paper describes our current work in automatic English-to-Korean text translation of telegraphic military messages,[2] which is an initial step toward the ultimate goal of producing high quality text/speech translation output.[3]

The core of our text translation system consists of an analysis module and a generation module. The analysis module produces a semantic frame, which is an interlingua representation of the input sentence. The intractable ambiguities of natural language are overcome by restricting the domain and the grammar rules which specify the semantic co-occurrence restrictions of head categories. The structural difference between the source (English) and the target (Korean) language is easily captured by the flexible interlingua representation and the strictly modularized target language grammar template, external to the core generation system. The simplicity of the system enables us to detect problems and provide solutions easily. Currently the system has a vocabulary of 1427 words. The system runs on a SPARC 10 workstation. The Korean translation outputs are displayed on a hangul window running on UNIX. In addition, we are in the process of porting the system to a Pentium laptop running on Linux.

This paper is organized as follows: In Section 2 we describe our system architecture, along with the grammar rules which drive the core system. In Section 3 we summarize the characteristics of our source language text, comprised of naval operational messages. In Section 4 we give our system evaluation. In Section 5 we discuss the integration of two subsystems for system robustness: a rule-based part-of-speech tagger to handle unknown words/constructions, and a word-for-word translator to produce partial translations in the event of system failure. Finally we summarize the paper in Section 6.

[1] This work was sponsored by the Defense Advanced Research Projects Agency. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Air Force.
[2] We are also working on Korean-to-English text translation on the same domain, which we do not include in this paper.
[3] Refer to (Kim, 1994) for other ongoing efforts in English/Korean text translation including (Choi, 1994). See (Lee, 1995) for speech translation work with Korean as the source language.

2 System Description

The core of our translation system consists of two modules: the understanding/analysis module, TINA, and the generation module, GENESIS.[4] These modules are driven by a set of files which specify the source and target language grammars. The process flow of our text translation system is given in Figure 2.

Figure 2: Process Flow of Text Translation
(Language Understanding (TINA, MIT/LCS), driven by the MUC-II analysis grammar for English, feeds Language Generation (GENESIS, MIT/LCS), driven by the MUC-II generation grammar for Korean)

[4] Both modules are developed under ARPA sponsorship by the Spoken Language Systems Group at the MIT Laboratory for Computer Science.

2.1 Language Understanding

The language understanding system, TINA, described at length in (Seneff, 1992), integrates key ideas from context free grammar, augmented transition network and unification concepts. With the context free grammar rules of English as input, the system produces the parse tree of an input sentence. The parse tree is then mapped onto a semantic frame, which plays the role of an interlingua. The parse tree and the semantic frame of the input sentence "0819 z uss sterett taken under fire by kirov with ssn-12's" are given in Figure 3 and Figure 4, respectively.

As is reflected in the parse tree, both syntactic and semantic categories are utilized in our grammar specification. Top level categories such as 'sentence,' 'subject,' etc. are syntax-based, whereas lower-level categories such as 'ship_name,' 'time_expression,' etc. are semantics-based. The main advantage of adopting semantic categories is that we can easily specify the co-occurrence restrictions of head categories (e.g., the parse tree specifies that the category ships occurs with a small subset of nominal modifiers including uss, which we call 'ship_mod'), and thereby reduce the ambiguity of the input sentence. In addition, it provides for easy access to the meaning of domain-specific expressions. The parse tree directly encodes the knowledge that sterett and kirov are ship names, ssn-12 a submarine name, and z stands for Greenwich Mean Time.

As for mapping from parse tree to semantic frame, we reduce all the major parse tree constituents into one of the three syntactic roles, i.e. clause, topic, and predicate. All clause-level categories including statements, infinitives, etc., are mapped onto clause. All noun phrases are mapped onto topic, and all modifiers as well as verb phrases are mapped onto predicate. However, there is no limit to the number of semantic frame categories, and we can easily create new categories for a more elaborate representation. In Figure 4, we have additional categories like 'time_expression.' Whether or not we add more categories to the semantic frame depends on how elaborate a translation output is desired. If elaborate translations are required, we increase the number of semantic frame categories. The flexibility of the semantic frame representation makes the TINA system an ideal tool for machine translation for various (i) purposes (i.e., whether a detailed or rough translation is required), and (ii) languages (e.g., some languages require a more elaborate tense representation or honorification than others, and the appropriate categories can be easily added).

2.2 Language Generation

The language generation system, GENESIS (Glass, Polifroni and Seneff, 1994), produces target language output on the basis of the semantic frame representation. It is driven by three submodules: a lexicon, a set of message templates, and a set of rewrite rules. These modules are language-specific and external to the core generation system. Consequently, porting the generation system to a new language is confined to developing these submodules.[5]

[5] A pilot study of applying GENESIS to Korean language generation can be found in (Yang, 1995).

2.2.1 Lexicon

Since the semantic frame uses English as its specification language, and is the basis for constructing the target language grammar and lexicon, entries in the lexicon contain words and concepts found in the semantic frame, expressed in English, with corresponding surface realization forms in Korean. A sample fragment of a bilingual lexicon is given in Table 1.

Table 1: Sample English/Korean Bilingual Lexicon

be             V    i      ROOT i    PRES i    PAST iess
V2             V    [...]  ING goiss    PP ess    PRES n    PAST ess
cause_en       V2   cholaytoy
visually       AV   sikakulo
cap_aircraft   N    centhwu cengchalki

2.2.2 Message Templates

Message templates are target language grammar rules corresponding to the input sentence expressions represented in the semantic frame. For instance, the word order constraint of the target language is specified in this module. A set of message templates used to produce the Korean translation from the semantic frame in Figure 4 is given in Table 2.
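The way a lexicon, message templates, and rewrite rules can cooperate may be easier to see in a small sketch. The Python below is only an illustration under stated assumptions, not the GENESIS implementation or its file formats: the dictionary layout, the function names (`realize`, `rewrite`, `generate`), the `NOM`/`ACC` placeholder tokens, and the crude "ends in a vowel letter" test on the romanized form are all hypothetical. The lexicon entries come from Table 1, the statement template mirrors the topic/predicate order and the `ta` ending of Table 2, and the marker alternation is the one described in Section 2.2.3.

```python
# Illustrative sketch only: data layout and names are assumptions,
# not the actual GENESIS formats or API.

# Lexicon: English concept -> romanized Korean form (entries from Table 1).
LEXICON = {
    "visually": "sikakulo",
    "cap_aircraft": "centhwu cengchalki",
    "cause_en": "cholaytoy",
}

# Message templates: frame type -> ordered (slot, particle) pairs.
# "NOM" is a placeholder that the rewrite step turns into i or ka.
TEMPLATES = {
    "statement": [("topic", "NOM"), ("predicate", "ta")],
    "predicate": [("topic", None), ("predicate", None)],
}

# Rewrite rules: marker -> (form after a consonant, form after a vowel).
MARKERS = {"NOM": ("i", "ka"), "ACC": ("ul", "lul")}
VOWELS = "aeiouy"  # assumption: naive vowel test on the romanized spelling


def realize(frame):
    """Expand a frame (dict with a 'type' and slots) or a plain word into tokens."""
    if isinstance(frame, str):
        return [LEXICON.get(frame, frame)]
    tokens = []
    for slot, particle in TEMPLATES[frame["type"]]:
        filler = frame.get(slot)
        if filler is None:  # every template entry is optional
            continue
        tokens.extend(realize(filler))
        if particle:
            tokens.append(particle)
    return tokens


def rewrite(tokens):
    """Attach each case marker to the preceding token in its proper allomorph."""
    out = []
    for tok in tokens:
        if tok in MARKERS and out:
            after_consonant, after_vowel = MARKERS[tok]
            prev = out.pop()
            suffix = after_vowel if prev[-1].lower() in VOWELS else after_consonant
            out.append(f"{prev}-{suffix}")
        else:
            out.append(tok)
    return out


def generate(frame):
    """Lexicon lookup and template expansion, followed by the rewrite rules."""
    return " ".join(rewrite(realize(frame)))


print(rewrite(["John", "NOM"]))  # ['John-i']
print(rewrite(["Mary", "NOM"]))  # ['Mary-ka']
print(generate({"type": "statement", "topic": "cap_aircraft", "predicate": "cause_en"}))
```

The output of `generate` is a token string meant to show the mechanism, not a checked Korean sentence. Keeping the three tables as plain data separate from the small driver functions reflects the point made above: porting to another target language would mean replacing the tables, not the driver.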

Figure 3: Parse Tree
(full parse of the example sentence, rooted at the category 'statement')

{c statement
   :time_expression {p numeric_time
                       :topic {q gmt
                                 :pred {p numeric
                                          :topic 819 } } }
   :topic {q ship_name
             :name "sterett"
             :pred {p ship_mod
                      :topic "uss" } }
   :subject 1
   :pred {p taken_under_fire
            :mode "pp"
            :pred {p v_by
                     :topic {q ship_name
                               :name "kirov" } }
            :pred {p v_with
                     :topic {q submarine_name
                               :name "ssn-12s" } } } }

0819 z uss sterett taken under fire by kirov with ssn-12s
PARAPHRASE: 819 z USS Sterett taken under fire by Kirov with SSN-12s
TRANSLATION: [...]

Figure 4: Semantic Frame and Translation
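A semantic frame of this kind is a nested structure, and one way to hold it in a program is sketched below. This is a minimal sketch, not TINA's internal representation: the `Frame` class, the helper names, and the reading of the prefixes c, q, and p as clause, topic, and predicate are assumptions made for the example. Slots are kept as a list of pairs rather than a dictionary because a key such as :pred can occur more than once in the same frame.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    role: str   # "c", "q" or "p" (assumed to mean clause, topic, predicate)
    name: str   # category name, e.g. "ship_name"
    slots: list = field(default_factory=list)  # (key, value) pairs; keys may repeat


def names(node):
    """Collect every :name value in the frame, in order of appearance."""
    found = []
    if isinstance(node, Frame):
        for key, value in node.slots:
            if key == "name":
                found.append(value)
            else:
                found.extend(names(value))
    return found


def render(node, depth=0):
    """Write a frame back out in a bracketed notation like that of the figure."""
    if not isinstance(node, Frame):
        return f'"{node}"' if isinstance(node, str) else str(node)
    pad = "   " * (depth + 1)
    body = "".join(f"\n{pad}:{key} {render(value, depth + 1)}"
                   for key, value in node.slots)
    return "{" + f"{node.role} {node.name}{body}" + " }"


# The frame for the example sentence, transcribed from the figure.
frame = Frame("c", "statement", [
    ("time_expression", Frame("p", "numeric_time", [
        ("topic", Frame("q", "gmt", [
            ("pred", Frame("p", "numeric", [("topic", 819)]))]))])),
    ("topic", Frame("q", "ship_name", [
        ("name", "sterett"),
        ("pred", Frame("p", "ship_mod", [("topic", "uss")]))])),
    ("subject", 1),
    ("pred", Frame("p", "taken_under_fire", [
        ("mode", "pp"),
        ("pred", Frame("p", "v_by", [
            ("topic", Frame("q", "ship_name", [("name", "kirov")]))])),
        ("pred", Frame("p", "v_with", [
            ("topic", Frame("q", "submarine_name", [("name", "ssn-12s")]))]))])),
])

print(names(frame))   # ['sterett', 'kirov', 'ssn-12s']
print(render(frame))
```

Because the structure is language-neutral, a generation component only needs to walk it; nothing in the frame itself refers to English word order.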

Table 2: Sample Message Templates

(a) statement         :topic i :predicate ta
(b) topic             :quantifier :noun_phrase
(c) predicate         :topic :predicate
    np-ship_mod       :topic :noun_phrase
    ship_mod          :topic
    np-missile_name   :topic :noun_phrase
    missile_name      :topic
    np-by             :topic ey [...] :noun_phrase
    np-with           :topic lo :noun_phrase

Template (a) says that a statement consists of a subject followed by a verb phrase. Note that all of the entries in a message template are optional so that a statement need not contain a subject or a verb phrase. Template (c) says that a verb phrase consists of an object followed by a verb. Other templates can be interpreted in a similar manner.

2.2.3 Rewrite Rules

The rewrite rules are intended to capture surface phonological constraints and contractions, in particular, the conditions under which a single morpheme has different phonological realizations. In English, the rewrite rules are used to generate the proper form of the indefinite article a or an. This module plays an important role in Korean due to numerous instances of phonologically conditioned allomorphs in the language. For instance, the so-called nominative case marker is realized as i when the preceding morpheme ends with a consonant, and as ka when the preceding morpheme ends with a vowel, as illustrated below. Similarly, the so-called accusative case marker is realized as ul after a consonant, and as lul after a vowel.

                        Nominative Case    Accusative Case
Following a consonant   John-i             John-ul
Following a vowel       Mary-ka            Mary-lul

3 Data Summary

Our source language text is called MUC-II data, and consists of naval operational report messages.[6] There are 105 messages for system development and 40 messages set aside for system evaluation. These messages feature incidents involving different platforms such as aircraft, surface ships, submarines and land targets. There are 12 words/sentence (average) and 3 sentences/message (average) (Sundheim, 1989). The original messages are highly telegraphic with many instances of sentence fragments, as illustrated in (1).

(1) At 1609 hostile forces launched massive recon effort from captured airfield against ctf 177 units transiting toward a neutral nation. Humint sources indicated 12/3 strike acft have launched (1935z) enroute battle force. Have positive confirmation that battle force is targeted (2035z). Considered hostile act.

[6] MUC-II stands for the Second Message Understanding Conference.

For each original message, there is a corresponding modified message [...]; (2) is the modified counterpart of the original message (1). Note that the underlined parts in the modified message are missing in the original message.

(2) At 1609 hostile forces launched [...] massive recon effort from a captured airfield against ctf 177 units transiting toward a neutral nation. Humint sources indicated 12 strike aircraft have launched (1935z) enroute to the battle force. CTF 177 has positive confirmation that the battle force is targeted (2035z). This is considered a hostile act.

MUC-II data also had other typical features of raw text. There are quite a few instances of complex sentences [...]

(3) Two uss america based strike escort f-14s were engaged by unknown number of hostile su-7 aircraft near land9 bay (island2 target facility) while conducting strike against xxx guerilla camp.

(4) Kirov locked on with fire control radar and fired torpedo [...]

(5) [...] deliberate harassment of uscgc spencer by hostile kirov endangers an already fragile political/military balance between [...]

[...] original and the modified MUC-II data. We discuss the details of [...] training in the following section.

4 Training and Evaluation

4.1 Training

We first partitioned the MUC-II data corpus into three data sets [...]. In addition, we have a set of 154 MUC-II-like sentences which were collected in an in-house experiment. We will discuss these sentences (which constitute data set A*) in Section 4.2.1. A summary of our training data is shown in Table 3, where the numbers are the number of sentences.

Table 3: Training Data (Data for System Development)

Data Set   Original Data   Modified Data   Total
A          101             105             206
A*         [...]           [...]           154
B          100             115             215
C          [...]           [...]           [...]
Total      303             337             794

We have trained the system on all the training data. The analysis rules are developed by hand, based on observed patterns in the data. These rules are then converted into a network structure. Probability assignments on all arcs in the network are obtained automatically by parsing each training sentence and updating appropriate counts (see (Seneff, 1992) for details). Table 4 gives the statistics of the current system in terms of the size of the analysis lexicon/grammar and generation lexicon/grammar.[7] The translation statistics for the training data are shown in Table 5.

Table 4: Current System Specifications

             Size of lexicon   Size of grammar (Number of categories)
Analysis     1427              1297
Generation   997               763

Table 5: Translation Statistics for Training Data

# of Sentences              [...]
# of Translated Sentences   [...]

[7] Table 4 gives the number of [...] categories in the analysis grammar. The actual number of rules is much greater because TINA allows cross-pollination (the sharing of common elements on the right-hand side of rules). See (Seneff, 1992) for more details about cross-pollination.

4.2 Evaluation

We have carried out two types of system evaluations. The first evaluation specifically tests grammar coverage and the second evaluation tests overall system performance.

4.2.1 Evaluation of Grammar Coverage

[...] a MUC-II-like database, data set A*, which contains no unknown words, in an in-house experiment.[8] In the experiment, we asked the subjects to study a list of data set A MUC-II sentences and then create about 10 MUC-II-like sentences on their own. Subjects were told to create sentences which illustrate the general style of the MUC-II sentences and which only use the vocabulary items occurring in the example sentences. We collected 154 MUC-II-like sentences in this experiment. We then evaluated the system's performance on these sentences [...] (Table 6).

[8] As we will discuss in the following section, it is difficult to use a MUC-II database to evaluate grammar coverage because many MUC-II sentences fail based strictly on the fact that they contain unknown words, i.e., words which are not in the system's lexicon.

[...]

The results of system evaluation on the test data (consisting of 40 messages, 111 sentences) are shown in Table 7. The first point to notice is that the system parsed 34.8% of the test sentences which have no unknown words. This figure is somewhat lower than the corresponding figure for data set A* (50.6%) because the test set is harder than data set A*. (The test sentences are completely new, whereas A* sentences were fabricated by studying A sentences.) The second point to note is that system failure is due in large part to the presence of unknown words. In fact, as Table 7 indicates, about [...] cannot handle unknown words at this time. We discuss our ongoing efforts to tackle this problem in Section 5.

4.3 Efficiency

Largely due to our effort to reduce the ambiguity of the input sentences, the system runs efficiently. For an average-length sentence containing 12 words, it takes about [...]

translate, it takes an average of 2.28 seconds to translate a sentence containing 16 words. For a fairly complex sentence containing 38 words, it takes about 4 seconds to translate. Some examples of Korean translation output are given in Figure 5. The system runs on a SPARC 10 workstation, and the source code is written in C. Korean translation outputs are displayed on a hangul window running on UNIX.

Table 6: Data Set A* Evaluation Results

# of Correctly Translated Sentences   71/78 (91.0%)
[...]

Table 7: Test Evaluation Results

# of Sentences with No Unknown Words   [...]
# of Parsed Sentences                  [...]
# of Correctly Translated Sentences    [...]

at 1609 z hostile forces launched massive recon effort from captured airfield against ctf 177 units transiting toward a neutral nation
PARAPHRASE: 1609 z hostile forces launched massive reconnaissance effort from captured airfield against ctf 177 units transiting toward a neutral nation
TRANSLATION: [...]

lost hostile acft in vicinity
PARAPHRASE: lost hostile aircraft in vicinity
TRANSLATION: [...]

two uss america based strike escort f dash 14 s were engaged by unknown number of hostile su dash 7 aircraft near land 9 bay lparen island 2 target facility rparen while conducting strike against xxx guerilla camp
PARAPHRASE: 2 USS America-based strike escort F-14s were engaged while conducting strike against xxx guerilla camp by hostile Su-7 aircraft unknown number of near Land9 Bay ( Island2 target facility )
TRANSLATION: [...]

Figure 5: Sample Korean Translation Output

5 Toward Robust Translation

At the moment, our system is not capable of dealing with a sentence containing (i) unknown words, cf. Section 4.2, and (ii) unknown constructions, cf. Section 4.1. In this section we discuss our ongoing efforts to overcome these deficiencies: integration of a part-of-speech tagger to handle unknown words/constructions, and a word-for-word translator to cope with other system failures, cf. (Frederking and Nirenburg, 1995).

5.1 Integration of Part-of-Speech Tagger

Regarding the unknown word problem, an obvious solution is to expand the lexicon. Concerning the problem involving unknown constructions, we could easily generalize the grammar to extend its coverage. However, both of these solutions are problematic. Handling the unknown word problem by increasing the size of the lexicon is not that straightforward given that most unknown words are open class items such as nouns, verbs, adjectives and adverbs. In addition, one cannot generalize the grammar without side effects. Due to the highly telegraphic nature of the MUC-II data, generalizing the grammar will increase the ambiguity of an input sentence greatly, cf. (Grishman, 1989).[9] Hence, we need alternative solutions to deal with unknown words and unknown constructions. The most desirable solution is to (i) leave the current grammar intact since it efficiently parses even highly telegraphic messages, and (ii) tackle unknown words and unknown constructions by the same mechanism.

[9] Recall that we resolve the ambiguity problem by constraining the grammar with semantic categories.

A potential solution to the unknown word problem is to do part-of-speech tagging, replace unknown words with their parts-of-speech, and bootstrap the parts-of-speech (instead of the actual words) to the analysis grammar. The unknown words would be replaced in the sentence string with their corresponding part-of-speech tag, and the semantic grammar would be augmented to handle generic adjectives, nouns, verbs, etc., intermixed in the rules at appropriate positions. The idea would be to include just enough semantic information to solve the ambiguity problem, effectively anchoring on words such as ship-name that have high semantic relevance within the domain.

This approach might also be effective as a backoff mechanism when the system fails to parse a sentence containing only known words. A set of semantically significant vocabulary items could be tagged as "immutable", and all the words in the sentence except these anchor words would be converted to part-of-speech prior to a second attempt to parse. The same grammar would be used in all cases.

For the solution sketched above, we have evaluated the Rule-Based Part-of-Speech Tagger (Brill, 1992) on the test data both before and after training on the MUC-II database. These results are given in Table 8. Tagging statistics 'before training' are based on the lexicon and rules acquired from the BROWN CORPUS and the WALL STREET JOURNAL CORPUS. Tagging statistics 'after training' are divided into two categories, both of which are based on the rules acquired from training on data sets A, B, and C of the MUC-II database. The only difference between the two is that in one case (After Training I) we use a lexicon acquired from the MUC-II database, and in the other case (After Training II) we use a lexicon acquired from a combination of the BROWN CORPUS, the WALL STREET JOURNAL CORPUS, and the MUC-II database. Since the tagging result is quite promising, despite the fact that the training data is of modest size, we are planning to integrate the tagger into the analysis module.

Table 8: Tagger Evaluation on TEST Data

                    Overall Accuracy     Unknown Word Accuracy
Before Training     [...]/1287 (87.4%)   [...]
After Training I    1249/1287 (97%)      70/82
After Training II   [...]/1287 (98%)     [...]

5.2 Integration of Word-for-Word Translator

Even though implementing the part-of-speech tagger and extending the analysis grammar to accept parts-of-speech as terminal strings will increase the grammar coverage, it is an almost impossible task to write a grammar which covers all freely occurring natural language texts, let alone have a robust parser to deal with this inadequacy.[10] Despite this difficulty in designing a complete translation system, an ideal translation system ought to be able to produce translations which are useful under any circumstances. Therefore, we are integrating a word-for-word translator,[11] which provides tools to aid a human translator, as a fallback system.

Figure 6 shows the planned robust system architecture, with the part-of-speech tagger and the word-for-word translator integrated into the core understanding/generation system. Note that the system will provide an indication or flag to the user showing whether the translation is produced by TINA/GENESIS or by the word-for-word fallback system.

Figure 6: Robust Translation System

[10] See (Sleator, 1991) for a design of a robust parser which handles unknown constructions.
[11] The word-for-word translator is being developed by GARJAK under a subcontract.

6 Summary

In this paper we have described our ongoing work in automatic English-to-Korean text translation of telegraphic messages. This is a part of our overall effort in text and speech translation for limited-domain multilingual applications. We have described the system architecture (Section 2), the source language text (Section 3), and the system evaluation results (Section 4). We have also discussed ideas on how to make the system robust, and proposed two specific solutions: integration of a part-of-speech tagger and a word-for-word translator (Section 5).

7 Acknowledgements

We would like to acknowledge the following people for their contributions to this project: Jack Lynch, Beth Carlson, Victor Zue, and SungSim Park. We also would like to thank Beth Sundheim of NRaD, Professor Ralph Grishman of NYU, Professor Key-Sun Choi of KAIST, Korea, and Willis Kim of MITRE. Beth Sundheim provided us with the MUC-II data as well as technical reports documenting the data. Dr. Grishman provided us with his grammar, dictionary, and semantic models for the MUC-II data. These materials helped us understand the linguistic properties of the MUC-II corpus. Dr. Choi provided us with documentation and software for KAIST's MATES/EK English-to-Korean machine translation system. Kim provided us with electronic English/Korean dictionaries as well as a report on his work.

References

Eric Brill. 1992. A Simple Rule-Based Part of Speech Tagger. Proceedings of the Third Conference on Applied Natural Language Processing. Trento, Italy: ACL.

Key-Sun Choi, Seungmi Lee, Hiongun Kim, Deok-Bong Kim, Cheoljung Kweon, and Gilchang Kim. 1994. An English-to-Korean Machine Translator: MATES/EK. Proceedings of the 15th International Conference on Computational Linguistics. Kyoto, Japan.

Robert Frederking and Sergei Nirenburg. 1995. Three Heads are Better than One. C-STAR II Meeting, Pittsburgh.

James Glass, Joseph Polifroni and Stephanie Seneff. 1994. Multilingual Language Generation Across Multiple Domains. Proceedings of the 1994 International Conference on Spoken Language Processing. Yokohama, Japan.

Ralph Grishman. 1989. Analyzing Telegraphic Messages. Proceedings of the 1989 DARPA Speech and Natural Language Workshop. Cape Cod, Massachusetts.

W. Kim and W. Rhee. 1994. Machine Translation Evaluation. Seoul, Korea: MITRE.

Youngjik Lee, Young-Sum Kim, Jung-Chul Lee, Joon-Hyung Ryoo and Joe-Woo Yang. 1995. Korean-Japanese Speech Translation System for Hotel Reservation - Korean front desk side. European Conference on Speech Communication and Technology. Madrid.

Stephanie Seneff. 1992. TINA: A Natural Language System for Spoken Language Applications. Computational Linguistics, 18(1): 61-88.

Daniel D.K. Sleator and Davy Temperley. 1991. Parsing English with a Link Grammar. CMU-CS-91-196.

Beth M. Sundheim. 1989. Navy Tactical Incident Reporting in a Highly Constrained Sublanguage: Examples and Analysis. Technical Document 1477. Naval Ocean Systems Center, San Diego.

Dinesh Tummala, Stephanie Seneff, Douglas Paul, Clifford Weinstein, and Dennis Yang. 1995. CCLINC: System Architecture and Concept Demonstration of Speech-to-Speech Translation for Limited-Domain Multilingual Applications. Proceedings of the 1995 ARPA Spoken Language Technology Workshop. Austin, Texas.

Dennis W. Yang. 1996. Korean Language Generation in an Interlingua-based Speech Translation System. Technical Report 1026. MIT Lincoln Laboratory.

