
2014 Fourth International Conference on Advances in Computing and Communications

Multilingual Machine Translation with Semantic and Disambiguation

Meera M
College of Engineering, Cherthala, Managed by IHRD
Established by the Government of Kerala
Email: mmeera18@gmail.com

Sony P
College of Engineering, Cherthala, Managed by IHRD
Established by the Government of Kerala
Email: spsony@gmail.com

Abstract: Machine translation is the process of translating one natural language into another using automated, computerized means. The proposed system is a rule based, multilingual, unidirectional translation system which translates English sentences into the corresponding Malayalam and Hindi sentences. During translation it incorporates the morphological and syntactic information present in the working pair of languages. It also performs word sense disambiguation, semantic analysis and morphological processing of the generated target sentence. Machine learning based word sense disambiguation and First Order Predicate Logic (FOPL) based semantic checking are the two main features of this paper. The arrangement of words in the dictionary with POS tags and word senses is another important factor that contributes to the efficiency of the translation system. The system achieves 74% accuracy for both translations, covering all the tenses and different types of sentences in English.

Keywords: Machine Translation, Morphology Generation, Word Sense Disambiguation

I. INTRODUCTION

Language is an effective medium of communication. It effectively represents the ideas and feelings of the human mind. More than 5000 languages are used around the world, which shows the extent of linguistic diversity. It is difficult for an individual to know and understand all the languages of the world. Hence, translation is adopted to communicate messages from one language to another. Machine translation is one of the research areas under computational linguistics. In machine translation systems, a computerized system is used for the translation of one natural language into another.

Machine translation systems [1] can be designed for most languages. A system designed specifically for two particular languages is called a bilingual system; a system designed for more than a single pair of languages is called a multilingual system. The language input for translation is called the source language (SL) and the language output by the translation system is called the target language (TL). Based on the direction of translation, machine translation systems can be categorized into two types, unidirectional and bidirectional. In a unidirectional system the source language is converted to the target language during translation. In a bidirectional system translation in either direction is possible, that is, from SL to TL and from TL to SL.

II. RELATED WORKS

The history of machine translation dates back to the seventeenth century. Since then, much research has been done in the area of machine translation, but most of it is based on the statistical and interlingual approaches. AnglaBharti, a multilingual translation system [1], is one of the systems that uses a rule based transfer approach for translation. The system analyses English sentences and creates an intermediate structure called PLIL (Pseudo Lingua for Indian Languages). For translation it uses a pattern directed approach with context free grammar like structures, and its translation strategy lies between the transfer and the interlingual approach. IIT Kanpur developed an English-Hindi rule based translation system named AnglaHindi [1], which combines the example based and AnglaBharti approaches. In 2005, a new statistical translation system for English-Hindi was developed by adding an additional layer over AnglaBharti. In 2009, Amrita Institute of Technology developed an English to Malayalam translation system which incorporates morphological and syntactic information. Combining the rule based and statistical approaches, IISc Bangalore created a translation system named Shakti [1], which translates English to Marathi and Telugu using a transfer based strategy. The English to Hindi MT system Mantra, developed by the Applied Artificial Intelligence (AAI) group of CDAC, Bangalore, in 1999, uses a transfer based approach. The system translates domain specific documents in the field of personal administration, specifically gazette notifications, office orders, office memorandums and circulars.

III. CLASSIFICATION OF MACHINE TRANSLATION APPROACHES

The machine translation process can be broadly classified into the following approaches: Direct Machine Translation, Rule Based Machine Translation and Corpus Based Machine Translation [1]. The classification hierarchy is given in Fig. 1.

Fig. 1: Machine Translation Approaches Classification

In Direct Translation [1], with the help of a bilingual dictionary a direct word by word translation of the input is

978-1-4799-4363-0/14 $31.00 2014 IEEE 223


DOI 10.1109/ICACC.2014.60
carried out, after which some syntactical rearrangements are made. The Rule Based Translation System [1] considers semantic, morphological and syntactic information, and based on these rules the source language is transformed to the target language through an intermediate representation. The interlingual approach converts words into an intermediate language (IL), a universal language created by the system, which is used as an intermediate for translation into more than one target language.

The transfer based approach [1] uses translation rules to translate the input language to the output language in three phases. First, the source language is converted into an intermediate representation, which is subsequently converted into a target language representation in the second phase. The third phase involves generation of the final target language.

In the corpus based approach [1], the translation system is trained with a bilingual text corpus to get the desired output. In Statistical Machine Translation, the system is trained on a bilingual corpus and statistical parameters are derived in order to reach the most likely translation, while an Example Based Machine Translation system usually uses previously translated examples to translate from source to target language.

IV. PROBLEM DEFINITION

Analysis of the language features shows that, for every English sentence, there exists at least one sentence in the Malayalam and Hindi languages. So our problem is to find that correct sentence in those languages for the given English sentence with the help of a machine. The block diagram for the proposed system is given in Fig. 2. The diagram shows a black box view of the system F, with input X and output Y. For the transformation of X into Y, the system F depends on the semantic checker S, the morphology generator M and the disambiguator D. So the problem of machine translation can be modeled as the mathematical formula $\forall X\ \exists Y, Z : F(X) \rightarrow Y, Z$, where $F(X)$ is defined in terms of $D$, $S$ and $M$.

Fig. 2: Block Diagram

V. PROPOSED SYSTEM

The proposed system is a rule based direct machine translation system which performs unidirectional machine translation from English to Malayalam and Hindi with the help of a bilingual dictionary. The source language is English and the target languages are Malayalam and Hindi. Since it works for more than two languages, it is a multilingual translation system. Unlike the previous related works in the area of machine translation, the proposed system focuses on word sense disambiguation [11] and first order predicate logic (FOPL) based semantic checking.

The proposed system introduces a verb classification based FOPL rule for semantic checking, a parse tree node numbering method for removing ambiguity in the CFG, Unicode combination rules for morphology generation and a machine learning approach for word sense disambiguation. The rule based method in the proposed system makes translation possible without the cost of corpus creation. The unexpected results generated by the statistical approach due to the misalignment of words in the corpus are overcome in the proposed system, so the generated results are not misleading.

The verb classification and FOPL based semantic checking used in this system can separate a semantically wrong sentence from a correct one in very simple steps. In the structure rearrangement phase, the newly introduced parse tree node numbering method can avoid ambiguity in the CFG rules while generating the target sentences. The Unicode combination rules blend effectively with the rules used for translation and can generate new morphemes with morphological variations suited to the target language syntax. For word sense disambiguation, the machine learning method can identify the sense of words based on the context used in the sentences, so the system can disambiguate the word meaning with the help of these identified senses. Since the proposed system is a combination of all these methods, it has an advantage over most of the existing systems and can produce highly efficient translations.

The architecture of the translation system is shown in Fig. 3. In this system the translation process is divided into three phases: i) sentence analysis phase, ii) structure rearrangement phase and iii) target sentence generation phase. A detailed flow chart is shown in Fig. 4.

Fig. 3: Architecture

A. Sentence Analysis Phase

English has twelve tense forms. The data input to the system may be a sentence, a Wh question or a yes/no question, and belongs to one of the twelve tense forms. An accurate translation can be generated by analyzing the nature of the input. For that, in this phase the English data undergoes tokenization and parts of speech tagging [7]. Using the obtained tags, the category of the sentence is analyzed for further processing.
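The category analysis in the sentence analysis phase can be illustrated with a small sketch. The paper does not give its tagging or classification rules, so everything below is an assumption made for illustration: the input is taken to be already POS-tagged with Penn Treebank style tags, and the names `tokenize`, `sentence_category`, `WH_TAGS` and `AUXILIARIES` are hypothetical.

```python
# Hypothetical sketch: decide whether a POS-tagged English input is a
# statement, a Wh question or a yes/no question. Penn Treebank tags are
# assumed; the rules are simple heuristics, not the paper's actual rules.

WH_TAGS = {"WDT", "WP", "WP$", "WRB"}  # which, who, whose, where, ...
AUXILIARIES = {
    "am", "is", "are", "was", "were",
    "do", "does", "did",
    "have", "has", "had",
    "will", "would", "shall", "should",
    "can", "could", "may", "might", "must",
}


def tokenize(sentence):
    """Split on whitespace and detach a trailing '?' or '.' from a word."""
    tokens = []
    for raw in sentence.split():
        if len(raw) > 1 and raw[-1] in "?.":
            tokens.extend([raw[:-1], raw[-1]])
        else:
            tokens.append(raw)
    return tokens


def sentence_category(tagged):
    """Return 'wh-question', 'yes/no-question' or 'statement'.

    `tagged` is a list of (word, tag) pairs from any POS tagger.
    """
    if not tagged:
        raise ValueError("empty input")
    first_word, first_tag = tagged[0]
    if first_tag in WH_TAGS:
        return "wh-question"
    # A sentence-initial modal or auxiliary is treated as a yes/no question.
    if first_tag == "MD" or first_word.lower() in AUXILIARIES:
        return "yes/no-question"
    return "statement"
```

With this heuristic, a tagged input starting with `("Where", "WRB")` is labelled a Wh question, one starting with `("Did", "VBD")` a yes/no question, and one starting with `("He", "PRP")` a statement. A real system would also need the tense form, which this sketch does not attempt to detect.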

Fig. 4: Flow Diagram

B. Structure Rearrangement Phase

This phase is the heart of the translation system; it performs semantic checking and structure rearrangement. The input is parsed using the Stanford Parser [9] developed by Stanford University. The output from the parser consists of a parse tree and a dependency tree. The obtained results are converted into a context free grammar (CFG) and predicates for subsequent processing.

1) Parsing: The input sentence is parsed using the Stanford English parser. The parse tree generated by the parser is converted into a numbered format to avoid ambiguity during target sentence generation. The generated parse tree and numbered parse tree for the sentence "He went to school" are shown in Fig. 5.

Fig. 5: Generated Parse Tree and Numbered Parse Tree

2) Semantic Checking: For semantic checking, all English verbs are classified based on the nature of the subjects [5] that can perform them, as shown in Fig. 6. The classification shows that class H is a subset of class O, which is a subset of class E. That means a human subject can be combined with all verbs, living subjects can be combined with all verbs except class H verbs, and non-living subjects can be combined only with verbs outside classes H and O. Any mismatch in the combination leads to a semantic error.

Fig. 6: Verb Classification

The dependency tree obtained from the parser consists of Stanford dependencies; from these dependencies the subject and verb of the sentence are extracted. Subject-verb combination based first order predicate logic (FOPL) rules are used for checking semantic correctness. The dependency obtained from the parser is converted into FOPL format as $\forall x, y:\ \mathrm{subject}(x) \wedge \mathrm{verb}(y) \rightarrow \mathrm{cando}(x, y)$. The predicate evaluates to true only if x and y belong to a valid combination of subject and verb classes. A true value of the predicate indicates the semantic correctness of the sentence.

For example, if the input sentences are i) "Monkey ate a banana" and ii) "Banana ate a Monkey", the dependency trees obtained are shown in Fig. 7.

i) Monkey ate a Banana    ii) Banana ate a Monkey

Fig. 7: Stanford Dependency Tree

The Stanford dependencies for the sentences are given above. From these dependencies, the subject of the first sentence is "Monkey" and the verb is "ate", and those of the second sentence are "Banana" and "ate". In the verb classification "ate" belongs to class O, so it cannot be combined with the non-living subject "Banana" but it can be combined with the living subject "Monkey". So the predicate evaluates to true for the first sentence and to false for the second. This shows the semantic correctness of example 1 and the semantic error of example 2.

3) CFG Creation and Rearrangement: The created parse tree is processed to get the context free grammar (CFG) corresponding to the sentence. A grammar rule is generated for each branch of the parse tree. Then the grammar rules are modified according to the target language syntax. The grammar generated for the parse tree in Fig. 5 and the modified grammar are given below.

Generated grammar        Modified grammar
1S → 2NP 4VP             1S → 2NP 4VP
2NP → 3PRP               2NP → 3PRP
3PRP → He                3PRP → He
4VP → 5VBD 6PP           4VP → 6PP 5VBD
5VBD → went              5VBD → went
6PP → 7TO 8NP            6PP → 8NP 7TO
7TO → to                 7TO → to
8NP → school             8NP → school

4) Sentence Creation: The structure of the input sentence is altered according to the target language format using the newly formed grammar, which gives a new sentence with a word order different from the original one. The rearranged sentence generated with the modified grammar is shown in Fig. 8.

Fig. 8: Rearranged Sentence Parse Tree

C. Target Sentence Generation Phase

All words in the rearranged sentence undergo stemming. After that, word by word translation is performed using the dictionary. The system uses a Unicode based morphology processing method [4]. In the morphology generation morphemes are combined to form the new word based on the

previously identified sentence structure and tenses. Since the Hindi language follows the gender matching rule, for English to Hindi translation an extra gender processing step is used, which adds a postposition based on the gender of the next word.

1) Disambiguation: Disambiguation [11] is an open problem in natural language processing; it is the process of identifying which sense of a word or preposition is used in a sentence when the word or preposition has multiple meanings.

In this work the Weka tool is used for the disambiguation of prepositions and words [11]. Each sentence used for training is converted into a feature vector containing four fields: previous tag, word, next tag and sense. The NNge (Non-Nested Generalized Exemplars) classifier in Weka is used to classify these vectors based on the sense field. The result from the classifier is converted into rule format for further processing in the subsequent steps.

2) Stemming: In this step each word is converted into its root form by deleting affixes such as "ed", "s", "es", "ing" and "'s". Spelling rules of English are applied in reverse to perform stemming on each word. This step helps morphology generation and dictionary lookup. The split affixes are stored separately and used at the time of morpheme generation.

3) Dictionary Lookup: This step finds the correct translation of a single word in the source sentence into its corresponding target word. To decrease the dictionary search time, the contents are organized in 26 files based on their starting letter.

4) Morphology Generation: The tense forms and prepositions in the sentence are analyzed to make the morphological variations. Morphology generation is performed using Unicode processing [4]. Adjacent words are compared to find the Unicode characters at the beginning and the end of the morphemes. By using these Unicode combination rules new morphemes are generated.

VI. OBSERVATION AND RESULT

Since this is a rule based translation approach, the accuracy of translation mainly depends on the correctness of the defined rules. For disambiguation the Weka tool is used for creating the rules. Disambiguation can be done more accurately with a sufficient amount of training sentences. Sometimes the structure rearrangement phase creates erroneous results because of the free word order nature of the Malayalam and Hindi languages. Unlike the statistical approach, this rule based method can give a correct translation for most of the input sentences. The dictionary content is another factor that affects the correctness of translations. With more words in the dictionary, the number of translated sentences can be increased. For the evaluation, precision and recall values are used. The formulas used for calculating precision and recall are given below, where the candidate solution is generated by the proposed machine translation system and the reference solution is generated by human translation. The obtained results for both translations with different corpora are plotted in the graph shown in Fig. 9.

Precision = (Number of words in the candidate solution correctly aligned with the reference solution) / (Number of words in the candidate solution)

Recall = (Number of words in the candidate solution correctly aligned with the reference solution) / (Number of words in the reference solution)

Fig. 9: Observation Result

VII. CONCLUSION

This work introduces an effective methodology for English to Malayalam and Hindi translation based on the rule based approach. The proposed translation system works for almost all simple sentences in their twelve tense forms, including their negative and question forms. Unlike other translation systems, it considers semantics and disambiguation, and handles them successfully. As a result of combining the newly introduced methods with machine translation, the evaluation shows an accuracy of 74% with a harmonic mean (F-mean) of 0.74. Since languages like Malayalam and Hindi are morphologically rich and agglutinative, the performance can be further improved by adding more morphological inflections to the system.

REFERENCES

[1] Antony P. J., "Machine Translation Approaches and Survey for Indian Languages," Computational Linguistics and Chinese Language Processing, Vol. 18, No. 1, March 2013, pp. 47-78.
[2] Mary Priya Sebastian, K. Sheena Kurian, G. Santhosh Kumar, "Alignment Model and Training Technique in SMT from English to Malayalam," Springer-Verlag Berlin Heidelberg 2010, IC3 2010, Part I, CCIS 94, pp. 305-315.
[3] Mallamma V. Reddy, M. Hanumanthappa, "Indic Language Machine Translation Tool: English to Kannada/Telugu," Proceedings of Multimedia Processing, Communication and Computing Applications, Springer India 2013, pp. 200-213.
[4] Karin Kipper, Anna Korhonen, "Morphological Analyzer for Malayalam Using Machine Learning," Language Resources and Evaluation, Volume 42, Issue 1, pp. 21-40.
[5] Karin Kipper, Anna Korhonen, "A large-scale classification of English verbs," Language Resources and Evaluation, Volume 42, Issue 1, pp. 21-40.
[6] Nisheeth Joshi, Hemant Darbari, "Human and Automatic Evaluation of English to Hindi Machine Translation Systems," Proceedings of the Second International Conference on Computer Science Engineering and Applications, Springer-Verlag Berlin Heidelberg 2010, Volume 166, 2012, AISC 106, pp. 423-432.
[7] Remya Rajan, Remya Sivan, Remya Ravindran, K. P. Soman, "Rule based machine translation from English to Malayalam," in Conference Proceedings on International Conference on Advances in Computing, Control, and Telecommunication Technologies, pp. 439-441, 2009.
[8] Mary Priya Sebastian, G. Santhosh Kumar, "English to Malayalam translation: a statistical approach," in Proceedings of the 1st Amrita ACM-W Celebration on Women in Computing in India, p. 64, ACM, 2010.
[9] Nishtha Jaiswal, Renu Balyan, Anuradha Sharma, "A step towards human-machine unification using translation memory and machine translation system," in International Conference on Languages, Literature and Linguistics, pp. 64-68, 2011.
[10] Raghavendra Udupa U., Tanveer A. Faruquie, "An English-Hindi Statistical Machine Translation System," First International Joint Conference 2004, Hainan Island, China, pp. 315-325.
[11] Roberto Navigli, "Word Sense Disambiguation: A Survey," ACM Computing Surveys (CSUR), 2009, Volume 41, Issue 2, Article No. 10.
[12] Jignashu Parikh, Pushpak Bhattacharyya, "Interlingua-based English-Hindi Machine Translation and Language Divergence," Machine Translation, Volume 16, Issue 4, pp. 251-304.

