A Review On Transliteration For Indian Languages

International Journal of Creative Mathematical Sciences & Technology (IJCMST) 2(1): 54-59, 2012
ISSN (P): 2319 7811, ISSN (O): 2319 782X

A. K. Ratha, S. S. Dash and R. C. Barik
Vikash College of Engineering for Women, Bargarh, Odisha, India
Abstract: Transliteration is the process of transcribing words from a source script to a target script. These words can be content words or proper nouns, and they may be of local or foreign origin. With increasing globalization, information access across language barriers has become important. Given a source term, machine transliteration refers to generating its phonetic equivalent in the target language.[1] This is important in many cross-language applications. In this paper we explore English to Indian language transliteration. The paper starts with existing methods of transliteration, both rule-based and statistical. It then gives a brief overview of transliteration involving English and Hindi, and the motivation behind the syllabification approach. The definition of transliteration and its structure are discussed in detail, after which the paper highlights various concepts related to transliteration.
Keywords: IL (Indian Language), OOV (Out Of Vocabulary), CLIR (Cross Lingual Information Retrieval), NLP (Natural Language Processing).
INTRODUCTION
Transliteration is the conversion of a word from one script to another without losing its phonological characteristics. It is the practice of transcribing a word or text written in one writing system into another writing system. Transliteration contrasts with transcription, which specifically maps the sounds of one language to the best-matching script of another language. Still, most systems of transliteration map the letters of the source script to letters pronounced similarly in the target script, for some specific pair of source and target languages. If the relations between letters and sounds are similar in both languages, a transliteration may be (almost) the same as a transcription. In practice, there are also some mixed transliteration/transcription systems that transliterate a part of the original script and transcribe the rest.

Transliteration is a crucial factor in Cross Lingual Information Retrieval (CLIR)[1,2]. It is also important for Machine Translation (MT), especially when the languages do not use the same scripts. It is the process of transforming a word written in a source language into a word in a target language without the aid of a resource like a bilingual dictionary. Word pronunciation is usually preserved, or is modified according to the way the word should be pronounced in the target language. In simple terms, it means finding out how a source word should be written in the script of the target language such that it is acceptable to the readers of the target language. One of the main reasons for the importance of transliteration from the point of view of Natural Language Processing (NLP) is that Out Of Vocabulary (OOV)[3] words are quite common, since every lexical resource is limited in practical terms. Such words include named entities, technical terms, rarely used or difficult words, and other borrowed words. OOV words present a challenge to NLP applications like CLIR and MT. In fact, for very close languages which use different scripts (like Hindi and Urdu), the problem of MT is almost an extension of transliteration. A substantial percentage of these OOV words are named entities (Abdul Jaleel and Larkey, 2003; Davis and Ogden, 1998). It has also been shown that cross-language retrieval performance (average precision) dropped by more than 50% when named entities in the queries were not transliterated (Larkey et al., 2003).

Another emerging application of transliteration (especially in the Indian context) is building input methods which use a QWERTY keyboard for people who are more comfortable typing in English. The idea is that the user types Roman letters but the input method transforms them into letters of Indian language (IL) scripts. This is not as simple as it seems, because there is no clear mapping between Roman letters and IL letters. Moreover, the output word should be a valid word. Several commercial efforts have been started in this direction due to the lack of a good (and familiar) input mechanism for ILs. These efforts include the Google Transliteration mechanism and Quillpad. Rathod and Joshi (2002) have also developed more intuitive input mechanisms for phonetic scripts like Devanagari.[1]

Much of the work on transliteration in ILs has been done from one Indian script to another. One major work addresses Punjabi machine transliteration (Malik, 2006). It tackles the problem of transliterating Punjabi from Shahmukhi (Arabic script) to Gurmukhi using a set of transliteration rules (character mappings and dependency rules). The Om transliteration scheme (Ganapathiraju et al., 2005) also provides a script representation which is common to all Indian languages. The display and input are in human-readable Roman script, and the transliteration is partly phonetic. Sinha (2001) used Hindi transliteration to handle unknown words in MT. Aswani and Gaizauskas (2005) used a transliteration similarity mechanism to align English-Hindi parallel texts. They used character-based direct correspondences between Hindi and English to produce possible transliterations, and then applied edit-distance-based similarity to select the most probable transliteration in the English text. However, such a method is appropriate only for aligning parallel texts, where the number of possible candidates is quite small.

The paper is structured as follows. In Section II, we explain the challenges of transliteration. In Section III, we describe the mapping used for transliteration. Section IV presents the problems in English to Indian language transliteration. Finally, in Section V we present the conclusions.

Corresponding Author: A. K. Ratha, Vikash College of Engineering for Women, Bargarh, Odisha, India
CHALLENGES IN TRANSLITERATION
A source language word can have more than one valid transliteration in the target language. For example, a single Odia name can be romanized in four different ways: gautam, gautham, gowtam, gowtham. Therefore, in a CLIR context, it becomes important to generate all possible transliterations in order to retrieve documents containing any of the given forms. Transliteration is not trivial to automate, but we are also concerned with an even more challenging problem: going from English back to Odia, i.e., back-transliteration. Transforming target language approximations back into their original source script is called back-transliteration. The information-losing aspect of transliteration makes it hard to invert, so back-transliteration is less forgiving than transliteration. There are many equally valid ways to write a name in the Latin script (meenakshi, meenaxi, minakshi, minaakshi), but there is much less flexibility in the reverse direction.
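The two ideas above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the slot table for the name and the folding rules are hypothetical, chosen only so that the spellings quoted in the text come out. The first function expands one name into all of its Roman variants (as a CLIR query would need), and the second folds variant spellings onto a single key, which is one simple way to approach the reverse direction.

```python
from itertools import product

# Hypothetical segmentation of one name: each slot lists Roman spellings
# that are treated as interchangeable for that sound.
GAUTAM_SLOTS = [["g"], ["au", "ow"], ["t", "th"], ["a"], ["m"]]

def expand_variants(slots):
    """Return every spelling obtained by picking one option per slot."""
    return ["".join(choice) for choice in product(*slots)]

# Hypothetical folding rules, applied in order, that collapse Roman
# spelling variants onto one normalized key.
FOLDING_RULES = [("x", "ksh"), ("ee", "i"), ("aa", "a")]

def fold(spelling):
    """Map a Roman spelling to a normalized key using FOLDING_RULES."""
    key = spelling.lower()
    for old, new in FOLDING_RULES:
        key = key.replace(old, new)
    return key

variants = expand_variants(GAUTAM_SLOTS)
print(variants)

keys = {fold(s) for s in ["meenakshi", "meenaxi", "minakshi", "minaakshi"]}
print(keys)
```

With these assumed tables, the four spellings of the first name are enumerated from one slot description, while the four spellings of the second name all fold to the same key. A real system would need far richer rules, and the folding step would still have to choose one target-script word for each key.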
TRANSLITERATION OF INDIAN WORDS

These words include (mainly Indian) named entities (e.g. Taj Mahal, Manmohan Singh) and common vocabulary words (common nouns, verbs) that need to be transliterated. They also include words that are spelled in a way similar to how Indian words are spelled when written in the Latin script (e.g. Baghdad, Husain). As stated earlier, these classes of words are much more relevant for an input method using a QWERTY keyboard. Since words of Indian origin usually have phonetic spellings when they are written in English (Latin), the issue of pronunciation estimation or lookup is not important. However, a single word can be split into vowel and consonant segments in many ways. For example, ai can be interpreted as a single vowel with the sound AE, or as two vowels AA IH.[2,4] To perform segmentation, we use a simple program that produces candidates for all possible segmentations. This program uses a few rules defining the possible consonant and vowel combinations. We then map these segments to their nearest IL letters (or letter combinations), as shown in Figure 1. The authors chose Odia because it is their native language. The mapping also uses a simple set of correspondences, which do not contain any probabilities or contexts. This step generates transliteration candidates, which are then filtered and ranked using fuzzy string matching.[1-4]
Figure 1: Mapping for Odia Alphabets
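The segmentation and ranking steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the unit inventory, the Devanagari candidates and the three-word lexicon are hypothetical, and the similarity ratio from Python's standard `difflib` module stands in for the fuzzy string matching, whose exact form the paper does not specify.

```python
from difflib import SequenceMatcher

# Hypothetical inventory of Roman units. "ai" is listed alongside "a" and "i",
# so a word containing "ai" can be segmented in more than one way.
UNITS = {"t", "j", "m", "h", "l", "a", "i", "ai"}

def segmentations(word, units=UNITS):
    """Return every way to split `word` into a sequence of known units."""
    if not word:
        return [[]]
    results = []
    for size in range(1, len(word) + 1):
        head = word[:size]
        if head in units:
            for rest in segmentations(word[size:], units):
                results.append([head] + rest)
    return results

def rank_candidates(candidates, lexicon):
    """Score each candidate by its best similarity to any lexicon word, best first."""
    scored = []
    for candidate in candidates:
        best = max(SequenceMatcher(None, candidate, word).ratio() for word in lexicon)
        scored.append((candidate, best))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(segmentations("jai"))

# Hypothetical target-script candidates for "taj" and a hypothetical lexicon.
ranked = rank_candidates(["टाज", "तज", "ताज"], ["ताज", "महल", "राज"])
print(ranked)
```

Here "jai" yields two segmentations, one treating "ai" as a single vowel and one treating it as two. In the ranking step, a candidate that matches a lexicon word exactly gets the highest score, so candidates that are not valid words fall down the list.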
ENGLISH TO INDIAN LANGUAGE TRANSLITERATION

Transliteration from English to Indian languages is essentially similar to the problem of phonetic transcription, because the scripts used for Indian languages are highly phonetic in nature, as shown in Figure 2, and there is a close correspondence between letters and phonemes. There are several reasons why the problem of English to Indian language transliteration has been given more urgency than Indian language to English or Indian language to Indian language transliteration. One reason is that most Indians using computers also know English to some degree. Another reason is that, until recently, computers had hardly any support for Indian languages, and most Indians using computers were not (in fact, still are not) able to type text directly in Indian languages.
Figure 2: Mapping for Hindi alphabets (Roman labels: A, AA, BH, F, CH, D, E, OO, R, N, Z, M, S, T, L)

Therefore, if they want to search for some text in Indian languages, they would still find it easier to type their queries in the Latin script, rather than some Indian script, even if input for that script is enabled on their computer. The reason more relevant here is that we need a good transliteration system for searching text in a language different from that in which the query is entered, the usual CLIR scenario. Even if the query is in an Indian language, Indian users are more likely to type the query using the Latin alphabet, but they might want to search for documents in Indian languages, thus effectively turning (in such a case) the problem of IR into CLIR.

Given this background, the two major problems for English to Indian language transliteration are:
1) Source side ambiguity: the relatively irregular spelling of English (in phonetic terms) and the lack of a commonly accepted Roman notation for typing Indian language text;
2) Target side ambiguity: a high degree of variation and non-standardization of spellings in Indian languages (in their respective scripts). [1-5]

These two problems make the task much harder at both the source and the target side. The relative irregularity of spellings at the source side means that the number of possible transliteration candidates that can be generated is very large even for a medium-sized word. The high variation at the target side means that it is very difficult to decide (often even for humans) whether a generated candidate is an acceptable word (or name), even if we ignore the role of the context. It needs to be emphasized that the two problems will effectively still be present when we try to transliterate romanized Indian language text (say, IR queries) to Indian language scripts. This is because there is no popular, commonly accepted notation (although there are innumerable ones for academic purposes) for typing Indian language text in the Latin script. Since the mappings from Latin letters to Indian script letters are not only many-to-many but also highly ambiguous, the irregular spelling aspect of the problem actually becomes more important in this case.
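A small Python sketch shows why the number of candidates grows so quickly. The table below is a hypothetical and deliberately tiny excerpt, not the mapping used in the paper: it lists a few Latin units that are commonly used for more than one Devanagari letter. If each unit of a word is chosen independently, the candidate count is the product of the number of options per unit.

```python
from math import prod

# Hypothetical excerpt of an ambiguous Latin-to-Devanagari table.
# Units not listed here are assumed to have exactly one target letter.
OPTIONS = {
    "t": ["त", "ट"],
    "d": ["द", "ड"],
    "n": ["न", "ण"],
    "sh": ["श", "ष"],
}

def candidate_count(units):
    """Number of candidates when each unit is mapped independently."""
    return prod(len(OPTIONS.get(unit, [unit])) for unit in units)

# A six-unit word with four ambiguous units already yields 2 * 2 * 2 * 2 candidates.
print(candidate_count(["t", "a", "n", "d", "a", "n"]))
```

Under these assumptions, every additional ambiguous unit multiplies the count, so the growth is exponential in the number of ambiguous units. Alternative segmentations of the same word, which are ignored here, would enlarge the candidate set further.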
CONCLUSION
We discussed the importance of transliteration for text processing in general and CLIR in particular, with special focus on the case of English to Indian languages. We argued that the high level of ambiguity (on the source as well as the target side) in this case makes the task of transliteration (and hence CLIR) quite a hard one. We also discussed the problems in transliteration for Indian languages. The transliteration approach for Indian languages described in this paper can be used to develop transliteration tools for other Indian languages.
REFERENCES
[1]. Singh A. K., Subramaniam S., Rama T., Transliteration as Alignment vs. Transliteration as Generation for the Purpose of Crosslingual Information Retrieval, Traitement Automatique des Langues, Special Issue on Multilingualism and NLP, vol. 51, no. 2, 2010.
[2]. Surana H., Singh A. K., A More Discerning and Adaptable Multilingual Transliteration Mechanism for Indian Languages, in Digitizing the Legacy of Indian Languages (Ed. Salonee Priya), ICFAI Books, 2009.
[3]. Singh A. K., An Outline of a Multilingual Natural Language Text and Speech Interface for Computing Devices in the South Asian Context, Proceedings of the IUI Workshop on Intelligent User Interfaces for Developing Regions (IUI4DR), Canary Islands, Spain, 2008.
[4]. Aswani N., Gaizauskas R., A Hybrid Approach to Align Sentences and Words in English-Hindi Parallel Corpora, Proceedings of the ACL Workshop on Building and Using Parallel Texts, Association for Computational Linguistics, Ann Arbor, Michigan, p. 57-64, June, 2005.
[5]. Virga P., Khudanpur S., Transliteration of Proper Names in Cross-Lingual Information Retrieval, Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-Language Named Entity Recognition, Association for Computational Linguistics, Morristown, NJ, USA, p. 57-64, 2003.
[6]. Knight K., Graehl J., Machine Transliteration, Proceedings of the Eighth Conference on European Chapter of the Association.
[7]. Singh A. K., A Computational Phonetic Model for Indian Language Scripts, Proceedings of Constraints on Spelling Changes: Fifth International Workshop on Writing Systems, Nijmegen, The Netherlands, October, 2006.
[8]. Subramaniam S., Singh A. K., Dasigi P., Experiments in CLIR Using Fuzzy String Search Based on Surface Similarity, Proceedings of the 32nd Annual ACM SIGIR Conference, Boston, Massachusetts, 2009.
[9]. May J., Brunstein A., Natarajan P., Weischedel R., Surprise! What's in a Cebuano or Hindi Name?, ACM Transactions on Asian Language Information Processing (TALIP), vol. 2, no. 3, p. 169-180, 2004.
[10]. Oh J., Choi K., An English-Korean Transliteration Model Using Pronunciation and Contextual Rules, Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, p. 1-7, 2002.
