
ISSN: 2349-2783

Normalizing the Hindi Text: A Survey

Lovely Sharma
Department of Computer Science, Doaba College

Abhishek Goel
Department of Computer Science, B.D. Arya Girls College

Abstract

All areas of language and speech technology, directly or indirectly, require handling of real (unrestricted) text. For example, Text-to-Speech systems directly need to work on real text, whereas Automatic Speech Recognition systems depend on language models that are trained on text. This survey paper describes how to normalize Hindi text using standard rules. We studied papers on normalization for different languages, such as Bangla, Chinese, Polish and French, as well as for Hindi itself. Different papers used different types of algorithms for different fields, such as NLP and DIP. Based on the related work, we will develop an algorithm using Hindi standard rules for standardizing the spelling variants in Hindi text.

1. INTRODUCTION

Hindi Text Normalization (HTN) works on spelling standardization issues, thereby resulting in multiple spelling variants for the same word. The major reasons for this phenomenon can be attributed to the phonetic nature of Indian languages and multiple dialects, transliteration of proper names, words borrowed from foreign languages, and the phonetic variety in the Indian language alphabet. The variety in the alphabet, different dialects and the influence of foreign languages have resulted in spelling variations of the same word. Such variations can sometimes be treated as errors in writing, while some are so widely used that they can hardly be called errors. We consider all types of spelling variations of a word in the language.

We will develop a solution in the form of a set of language-specific rules which can handle such variations, which should result in more precise performance. Apart from Indian language words, we should also be able to handle proper names and English words transliterated into Indian languages, since they form a substantial percentage of words. For example, we found widely used spelling variations for the Hindi word 'angrezi', as shown below.

In real text, many non-standard representations of words appear, e.g. numbers (year, time, ordinal, cardinal, floating point), abbreviations, acronyms, currency, dates and URLs. All these non-standard representations must typically be normalized, in other words converted to standard words, which would then be processed in various applications.

Some examples of the problems faced by natural language understanding systems:


• Text Segmentation: Some written languages, like Chinese, Japanese and Thai, do not mark word boundaries, so any significant text parsing usually requires the identification of word boundaries, which is often a non-trivial task.

• Word Sense Disambiguation: Many words have more than one meaning; we have to select the meaning which makes the most sense in context.

• Hindi Text Normalization: Many words in the Hindi language are written in different ways. We work on the normalization of those Hindi words.

• Syntactic Ambiguity: The grammar for natural languages is ambiguous, i.e. there are often multiple possible parse trees for a given sentence. Choosing the most appropriate one usually requires semantic and contextual information. Specific problem components of syntactic ambiguity include sentence boundary disambiguation.

1.1 Main cause behind Hindi Text Normalization:-

Normalization is required for Hindi text mainly because of the Devanāgarī (देवनागरी) script. Devanāgarī is an abugida script used to write several Indo-Aryan languages, including Sanskrit, Hindi, Gujarati, Marathi, Sindhi, Bihari, Bhili, Marwari, Konkani, Bhojpuri, Pahari (Garhwali and Kumaoni), Santhali, Nepali, Newari, Tharu and sometimes Kashmiri and Romani. The Devanāgarī writing system is called an abugida because each consonant has an inherent vowel (a), which can be changed with different vowel signs. Devanāgarī is written from left to right. A top line linking the characters is thought to represent the line of the page, with characters historically written under it. In Sanskrit, words were originally written together without spaces, so that the top line was unbroken, although there were some exceptions to this rule; the break of the top line primarily marks breath groups. In modern languages, word breaks are used. When reading Sanskrit written in Devanāgarī, the pronunciation is completely unambiguous. Similarly, any word in Sanskrit is considered to be written in only one manner (discounting modern typesetting variations in depicting conjunct forms). However, for modern languages, certain conventions have arisen (e.g. truncating the vowel form of the last consonant while speaking, even as it continues to be written in full form). There are also some modern conventions for writing English words in Devanāgarī.

2. NEED OF TEXT NORMALIZATION

2.1 Normalization of Non-standard Words

Real text contains a variety of "non-standard" token types, such as digit sequences; words, acronyms and letter sequences in all capitals; mixed-case words (WinNT, SunOS); abbreviations; Roman numerals; URLs and e-mail addresses. Many of these kinds of elements are pronounced according to principles that are quite different from the pronunciation of ordinary words. Furthermore, many items have more than one plausible pronunciation, and the correct one must be disambiguated from context: "IV" could be "four", "fourth", "the fourth", or "I.V.". Normalizing or rewriting such text using ordinary words is an important issue for several applications; more sophisticated text normalization will be an important tool for utilizing the vast amounts of on-line text resources. Normalized text is likely to be of specific benefit in information extraction applications.

2.2 Text normalization challenges

The process of normalizing text is rarely straightforward. Texts are full of heteronyms, numbers, and abbreviations that all require expansion into a phonetic representation. There are many spellings in English and many other languages which are pronounced differently based on context. For example, "IV" could be "four", "fourth", "the fourth", or "I.V.".

Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well understood, or computationally effective.


As a result, various heuristic techniques are used to guess the proper way to disambiguate homographs, such as examining neighboring words and using statistics about frequency of occurrence.

Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for "inches" must be differentiated from the word "in", and the address "12 St John St." uses the same abbreviation for both "Saint" and "Street". TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others produce the same result in all cases, resulting in nonsensical (and sometimes comical) outputs.
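This kind of context-driven guessing can be sketched in a few lines of Python. The heuristic below for the "St." example is our own illustrative assumption, not the rule set of any surveyed system:

    def expand_st(tokens):
        """Expand the ambiguous abbreviation 'St.' using neighbouring words.

        Illustrative heuristic: 'St.' before a capitalized word is 'Saint';
        'St.' after a capitalized word (or a number) is 'Street'.
        """
        out = []
        for i, tok in enumerate(tokens):
            if tok == "St.":
                nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
                prev = tokens[i - 1] if i > 0 else ""
                if nxt[:1].isupper():
                    out.append("Saint")
                elif prev[:1].isupper() or prev.isdigit():
                    out.append("Street")
                else:
                    out.append("Street")   # default guess
            else:
                out.append(tok)
        return out

    print(" ".join(expand_st("12 St. John St.".split())))
    # -> "12 Saint John Street"

A system without such a front end would expand both occurrences identically, which is exactly the failure mode described above.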
2.3 Application of Text Normalization

Text normalization is one of the most important tasks in text processing and text-to-speech conversion. There are two major components in a text-to-speech system: (1) text normalization and (2) speech generation. Before discussing the issues related to text processing, let us briefly discuss the nature of the Indian language scripts for which the synthesis systems are built. The basic units of the writing system in Indian languages are Aksharas, which are orthographic representations of speech sounds. An Akshara in Indian language scripts is a syllable and can typically be of the following forms: V, CV, CCV and CCCV, where C is a consonant and V is a vowel. All Indian language scripts have a common phonetic base, and a universal phone set consisting of about 35 consonants and about 18 vowels. The pronunciation of these scripts is almost straightforward: there is a more or less one-to-one correspondence between what is written and what is spoken. However, in languages such as Hindi and Bengali, the inherent vowel (short /a/) associated with a consonant is not pronounced, depending on the context. This is referred to as inherent vowel suppression or schwa deletion.
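The Akshara shapes just described can be checked mechanically. The sketch below encodes the V/CV/CCV/CCCV pattern as a regular expression; operating on romanized syllables rather than Devanāgarī codepoints is a simplifying assumption made purely for readability:

    import re

    # V, CV, CCV or CCCV: up to three consonants followed by one vowel.
    VOWELS = "aeiou"
    AKSHARA = re.compile(rf"^[^{VOWELS}]{{0,3}}[{VOWELS}]$")

    for syllable in ["a", "ka", "kra", "stra", "kr"]:
        print(syllable, bool(AKSHARA.match(syllable)))
    # "kr" fails: a bare consonant cluster is not a valid Akshara.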


3. LITERATURE SURVEY

All areas of language and speech technology, directly or indirectly, require handling of real (unrestricted) text [1]. For example, Text-to-Speech (TTS) systems directly need to work on real text, whereas Automatic Speech Recognition (ASR) systems depend on language models that are trained on text. In real text, many non-standard representations of words appear, e.g. numbers (year, time, ordinal, cardinal, floating point), abbreviations, acronyms, currency, dates and URLs. All these non-standard representations must typically be normalized, in other words converted to standard words, which would then be processed in various applications.

3.1 Related Work

Firoj Alam, S. M. Murtoza Habib and Mumit Khan (2009) describe a text normalization system for the Bangla language built by identifying the semiotic classes found in a Bangla text corpus. After identifying the semiotic classes, a set of rules was written for tokenization and verbalization. This work is important both for Text-To-Speech (TTS) systems and for language models for speech recognition: text must first be pre-processed to remove ambiguities and convert Non-Standard Words (NSWs) into their standard word pronunciations. The study normalizes Bangla text using a rule-based system rather than applying a decision tree or a decision list to ambiguous tokens. Little work has been done on under-resourced languages such as Bangla. The basic similarity among the existing work is that each approach involves repeated sequences of tokenization, token classification, token sense disambiguation and standard word generation to get the normalized text. According to the semiotic classes, a lexical analyser was designed to tokenize each NSW by regular expression using the tool JFlex, and a tag was assigned to each token according to its semiotic class. The outputs of the tokenization are then used in the next step, the token expander: according to the assigned tag, token verbalization and disambiguation are performed by the token expander. For semiotic class identification, the authors identified a set of semiotic classes belonging to the Bangla language. To do this, they selected a news corpus, a forum and a blog, and proceeded in two steps: (i) a Python [4] script was used to identify the semiotic classes in the news corpus, which were manually checked against the forum and blog; (ii) a set of rules was defined according to the context of homographs or ambiguous tokens. The result is a set of semiotic classes in Bangla text, as shown in Figure 3.1.

Figure 3.1 Possible token types in Bangla text

Each semiotic class was mapped to a specific tag, and this tag was assigned to each class of token. The tokenization proceeds in three levels: i. Tokenizer, ii. Splitter and iii. Classifier. Like English and the other South Asian scripts, Bangla uses whitespace to tokenize a string of characters into separate tokens. The output of this work is the list of words in normalized form. The performance of this rule-based system is 90% for ambiguous tokens such as float, time and currency; the accuracy for these three token types is floating point 100%, currency 100% and time 62% [5].
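The JFlex rules themselves are not reproduced in the survey, but the flavour of regex-driven semiotic-class tagging can be sketched in Python. The class names and patterns below are illustrative assumptions, not Alam et al.'s published rule set:

    import re

    # Illustrative semiotic classes and patterns (not the published rules).
    SEMIOTIC_CLASSES = [
        ("TIME",     re.compile(r"^\d{1,2}:\d{2}$")),
        ("FLOAT",    re.compile(r"^\d+\.\d+$")),
        ("CURRENCY", re.compile(r"^(Tk\.?|\$)\s?\d+(\.\d+)?$")),
        ("YEAR",     re.compile(r"^(19|20)\d{2}$")),
        ("NUMBER",   re.compile(r"^\d+$")),
    ]

    def tag_token(token):
        """Return the first matching semiotic class, else WORD."""
        for name, pattern in SEMIOTIC_CLASSES:
            if pattern.match(token):
                return name
        return "WORD"

    for tok in ["9:30", "3.14", "$20", "1971", "Dhaka"]:
        print(tok, "->", tag_token(tok))

Ordering the patterns from most to least specific plays the same role as rule priority in a lexical analyser: TIME must be tried before NUMBER, or "9" and "30" style matches would shadow it.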


Yuxiang Jia, Dezhi Huang, Wu Liu, Yuan Dong, Shiwen Yu and Halia Wang (2008) develop a taxonomy of NSWs on the basis of a large-scale Chinese corpus, and propose a two-stage NSW disambiguation strategy: Finite State Automata (FSA) for initial classification and Maximum Entropy (ME) classifiers for subclass disambiguation. Typical methods for text normalization are based on handcrafted rules, but such rules are difficult to write, maintain and adapt to new domains. On the other hand, viewing the task as homograph disambiguation, many machine learning methods have been employed and have shown their advantages; decision trees and decision lists have been used in English and Hindi text normalization [1]. The approach proposed in this paper needs no word segmentation process: finite state automata detect NSWs in the real text and make an initial classification, and maximum entropy classifiers are then used for further classification. The process flow is outlined in Figure 3.2.

Figure 3.2 Flow chart of text normalization

The Finite State Automata (FSA) are designed to detect NSWs and give an initial classification based on NSW formats. Next is the subclass disambiguation process, which determines the true pronunciation in context with the help of a Maximum Entropy classifier. The last module is the standard word generation step of text normalization; it is a generation step, while the former steps are analysis steps. Its input is the NSW itself and its class tag, and its output is the corresponding Chinese words. The conversion is a one-to-one correspondence, so finite state transducers are applicable here. This paper makes an extensive investigation of Chinese text normalization: an NSW taxonomy is developed from a large-scale corpus, and after a systematic analysis of the taxonomy, a two-stage classification strategy is proposed, with finite state automata for initial classification and maximum entropy classifiers for further classification. Experimental results show that this approach achieves good performance and generalizes well to new domains. In addition, the approach is character-based, with no need for a word segmentation pre-process. However, some errors occur in the experiments, such as number sequence errors [7].
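A toy rendering of the two-stage idea, under stated assumptions: a regular expression stands in for the FSA that spots and coarsely classifies an NSW, and a single hand-written context feature stands in for the trained Maximum Entropy subclass classifier:

    import re

    # Stage 1: an FSA-like pattern spots digit-sequence NSWs.
    DIGITS = re.compile(r"\d{3,4}")

    def subclass(nsw, left_context):
        """Stage 2 stand-in: pick a subclass from one context feature.
        A real system would use a trained Maximum Entropy classifier."""
        if left_context in ("in", "year", "since"):
            return "YEAR"
        return "CARDINAL"

    def classify(tokens):
        for i, tok in enumerate(tokens):
            if DIGITS.fullmatch(tok):
                left = tokens[i - 1] if i > 0 else ""
                yield tok, subclass(tok, left)

    print(list(classify("built in 1960 with 250 rooms".split())))
    # -> [('1960', 'YEAR'), ('250', 'CARDINAL')]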
the second level of token sense disambiguation,
Conghui Zhu, Jie Tang, Hang Li, Hwee Tou Ng and Tie-Jun Zhao (2007) address the issue of text normalization, an important yet often overlooked problem in natural language processing. By text normalization they mean converting 'informally inputted' text into a canonical form, by eliminating 'noise' in the text and detecting paragraph and sentence boundaries. Previously, text normalization issues were often handled in an ad-hoc fashion or studied separately. This paper first gives a formalization of the entire problem, and then proposes a unified tagging approach that performs the task using Conditional Random Fields (CRF). The paper shows that with the introduction of a small set of tags, most text normalization tasks can be performed within this approach. The accuracy of the proposed method is high, because the subtasks of normalization are interdependent and should be performed together [9].

Graliński Filip, Jassem Krzysztof, Wagner Agnieszka and Wypych Mikołaj (2006) describe some problems of text normalization for an inflected language like Polish. They show that text normalization, usually one of the first steps of text-to-speech synthesis, is a complex process involving several levels of analysis: morphological, syntactic and semantic. They claim that text normalization may be treated in a similar way to Machine Translation: the tools and algorithms developed for Machine Translation may be used for text normalization, with the 'spoken language' treated as the target language [10].

K. Panchapagesan, Partha Pratim Talukdar, N. Sridhar Krishna, Kalika Bali and A. G. Ramakrishnan (2004) proposed a novel approach to text normalization, wherein tokenization and initial token classification are combined into one stage, followed by a second level of token sense disambiguation [3]. The architecture of the proposed approach is shown in Figure 3.3. Tokenization and initial token classification are performed using a lexical analyzer derived from various token definitions in the form of regular expressions; traditionally, text normalization, including the second level of token sense disambiguation, has been viewed as an engineering issue and conducted in a more or less ad-hoc manner.

Figure 3.3 Stages involved in text normalization

In this approach, tokenization and initial token classification/identification are achieved in a single step using Flex, an automatic generator of high-performance scanners. Flex takes a set of regular expressions as input and generates a scanner as output that scans an input stream for the tokens represented by the regular expressions. The scanner works as a lexical analyzer: it recognizes lexical patterns in the input text and thereby groups input characters into tokens, where tokens are specified using patterns (regular expressions).


In token sense disambiguation, once the tokens are extracted from the input text, the category of each token needs to be identified. This is accomplished by the lexical analyser itself when there is no ambiguity arising from the format(s) of the token. In case of ambiguity, the token is output together with its possible token types to facilitate further disambiguation. Identification of the token category involves a high degree of ambiguity. For example, '1960' could be of the type 'Year' or of the type 'Cardinal Number', and '1.25' could be of the type 'Float' or of the type 'Time'. Disambiguation is generally handled by hand-crafted context-dependent rules; the authors instead used decision-tree-based data-driven techniques. When a token is input to the tree for disambiguation, a decision is made by traversing the tree starting from the root node, taking the paths whose conditions are satisfied at intermediate nodes, until a leaf node is reached. Decision lists are a special class of decision trees and can be used to represent a wide range of classifiers. A decision list can be viewed as a hierarchy of rules: when a classification is needed, the first rule in the hierarchy is tried; if that rule fails to classify, the next rule is tried, and so on. They are basically if-then-else statements [3]. For example:
    if condition1(x):
        output = output1(x)
    elif condition2(x):
        output = output2(x)
    ...
    else:
        output = default_output(x)
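Concretely, a decision list for the '1960'-style Year-versus-Cardinal ambiguity mentioned above might read as follows in Python; the conditions are our own illustrative guesses at useful context tests, not the rules learned in [3]:

    def classify_number(token, prev_word):
        """Decision list for a 4-digit token: Year or Cardinal Number.
        Rules are tried in order; the first one that fires wins."""
        if prev_word in ("in", "since", "year"):      # rule 1
            return "Year"
        if 1000 <= int(token) <= 2099:                # rule 2 (illustrative range)
            return "Year"
        return "Cardinal Number"                      # default

    print(classify_number("1960", "in"))     # -> Year
    print(classify_number("5000", "paid"))   # -> Cardinal Number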
coverage and perplexity measures. Some
Steve Atwell, Hahn Koo, Liam Moran, and Tae-Jin normalizations were found to be necessary to
Yoon (2004) describes the implementation of a achieve good lexical coverage, while others were
system in the programming language Python that more or less equivalent in this regard. The choice of
normalizes texts into a form that resembles how a normalization to create language models for use in
human might read it out loud. Classified ads are the the recognition experiments with read newspaper
target domain of the normalization system. For a texts was based on these findings. Best system
given input text, a sequence of letters can either be configuration obtained an 11.2% word error rate in
treated as a single token or a group of subtokens the AUPELF ‘French-speaking’ speech recognizer
that can be split further. For example, character
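The splitting step, dividing a token at commas, hyphens and slashes while keeping every piece, can be sketched directly with Python's re module. This is a guess at the behaviour described, not the original Perl code; the classified-ad examples are our own:

    import re

    def split_token(token):
        """Split a token into subtokens at commas, hyphens and slashes,
        keeping the delimiters so nothing is lost."""
        parts = re.split(r"([,/-])", token)
        return [p for p in parts if p]   # drop empty strings

    print(split_token("3BR/2BA"))   # -> ['3BR', '/', '2BA']
    print(split_token("555-1234"))  # -> ['555', '-', '1234']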


Gilles Adda, Martine Adda-Decker, Jean-Luc Gauvain and Lori Lamel (1997) describe a quantitative investigation into the impact of text normalization on lexica and language models for speech recognition in French. The text normalization process defines what is considered to be a word by the recognition system. Depending on this definition, different lexical coverages and language model perplexities can be measured, both of which are closely related to the speech recognition accuracies obtained on read newspaper texts. Different text normalizations of up to 185M words of newspaper texts are presented along with the corresponding lexical coverage and perplexity measures. Some normalizations were found to be necessary to achieve good lexical coverage, while others were more or less equivalent in this regard. The choice of normalization used to create language models for the recognition experiments with read newspaper texts was based on these findings. The best system configuration obtained an 11.2% word error rate in the AUPELF 'French-speaking' speech recognizer evaluation test held in February 1997. Two large French dictionaries are used: BDLEX and DELAF [6].
4. RULES USED FOR NORMALIZING TEXT

Indian language words face spelling standardization issues, resulting in multiple spelling variants for the same word. The major reasons for this phenomenon can be attributed to the phonetic nature of Indian languages and multiple dialects, transliteration of proper names, words borrowed from foreign languages, and the phonetic variety in the Indian language alphabet. Given such variations in spelling, it becomes difficult to build web Information Retrieval applications for Indian languages. India is a multi-language, multi-script country with 22 official languages and 11 written script forms; about a billion people in India use these languages as their first language. Some of the rules used are:

Mapping chandrabindu to bindu:-

People often use chandrabindu (a half-moon with a dot) and bindu (a dot on top of a letter) interchangeably, and much confusion exists in common usage about which to use when. In order to equate all such words, we convert all occurrences of chandrabindu to bindu, equating word pairs that differ only in this respect.
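In Unicode terms this rule is a single substitution: chandrabindu is U+0901 and bindu (anusvara) is U+0902. A minimal Python sketch, with a word pair chosen by us for illustration:

    CHANDRABINDU = "\u0901"   # ँ
    BINDU        = "\u0902"   # ं

    def map_chandrabindu(word):
        return word.replace(CHANDRABINDU, BINDU)

    # Two common spellings of 'hansna' (to laugh) now compare equal:
    print(map_chandrabindu("हँसना") == "हंसना")   # -> True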

Nukta deletion:-

Unicode contains ten consonant characters with a nukta (a dot under the consonant) as well as the combining nukta character itself. We delete all occurrences of the nukta character and replace all consonants with nuktas by their corresponding plain consonant character, equating word pairs that differ only in the nukta.
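One way to sketch this in Python is to lean on Unicode normalization: NFD decomposes the precomposed nukta consonants (e.g. ज़ into ज plus the combining nukta U+093C), so a single filter handles both the combining and the precomposed cases. The word pair is our own illustrative example:

    import unicodedata

    NUKTA = "\u093C"   # ़

    def delete_nukta(word):
        # NFD splits precomposed nukta consonants (e.g. ज़ -> ज + ़),
        # so one replace removes combining and precomposed nuktas alike.
        decomposed = unicodedata.normalize("NFD", word)
        return unicodedata.normalize("NFC", decomposed.replace(NUKTA, ""))

    print(delete_nukta("ज़्यादा") == "ज्यादा")   # -> True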
Halanth deletion:-

Hindi and many other Indian languages exhibit 'schwa' deletion (the schwa being the default vowel 'a' that occurs with every consonant), and many spelling variations arise from it. In order to normalize such words, we delete all halanth characters in the given word before making a string match.
Mapping character 'न्' to Bindu:-

In Hindi this variation occurs because many people write bindu in place of 'न्'. In our application, we convert 'न्' to bindu (ं), which normalizes such spelling variants.
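Putting the four rules together, here is a minimal sketch of a normalizer, with one ordering caveat we infer from the rules themselves: the न् to bindu mapping must run before halanth deletion, since it keys on the very halanth (U+094D) that the third rule removes. The function name and rule order are our own assumptions:

    import unicodedata

    CHANDRABINDU, BINDU = "\u0901", "\u0902"
    NUKTA, HALANTH, NA  = "\u093C", "\u094D", "\u0928"

    def normalize_hindi(word):
        word = word.replace(NA + HALANTH, BINDU)   # rule 4 first (needs the halanth)
        word = word.replace(CHANDRABINDU, BINDU)   # rule 1: chandrabindu -> bindu
        word = unicodedata.normalize("NFD", word)  # expose precomposed nuktas
        word = word.replace(NUKTA, "")             # rule 2: nukta deletion
        word = word.replace(HALANTH, "")           # rule 3: halanth deletion
        return unicodedata.normalize("NFC", word)

    # Spelling variants of 'angrezi' from the introduction now match:
    print(normalize_hindi("अंग्रेज़ी") == normalize_hindi("अँगरेजी"))   # -> True
    # And rule 4 equates the two common spellings of 'Hindi':
    print(normalize_hindi("हिन्दी") == normalize_hindi("हिंदी"))       # -> True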


So, these are some of the rules that we will use for standardizing the spelling variants in Hindi text.

5. CONCLUSION AND FUTURE WORK

We have studied papers on various languages, such as Bangla, Chinese, Polish and French, and conclude that normalization of Hindi text is an important part of Natural Language Processing (NLP). Hindi standard rules should be used for standardizing the spelling variants in Hindi text. For future work, we will, first, develop more rules and, second, automatically build a dictionary of such words from a Hindi corpus, choosing one variant as the standard depending on the frequency of the words in the text. Using the above rules, the given Hindi text can then be standardized automatically.

The implementation will be done with Microsoft .NET as the front-end and MS Access as the back-end.

REFERENCES

[1]. Richard Sproat, Alan Black, Stanley Chen, Shankar Kumar, Mari Ostendorf and Christopher Richards, "Normalization of Non-Standard Words", Computer Speech and Language, 15(3):287-333, 2001.

[2]. Manish Sinha, Mahesh Kumar Reddy R., Pushpak Bhattacharyya, Prabhakar Pandey and Laxmi Kashyap, "Hindi Word Sense Disambiguation", International Symposium on Machine Translation, Natural Language Processing and Translation Support Systems, Delhi, India, November 2004.

[3]. K. Panchapagesan, Partha Pratim Talukdar, N. Sridhar Krishna, Kalika Bali and A. G. Ramakrishnan, "Hindi Text Normalization", Fifth International Conference on Knowledge Based Computer Systems (KBCS), Hyderabad, India, 19-22 December 2004.

[4]. Steve Atwell, Hahn Koo, Liam Moran and Tae-Jin Yoon, "Text Normalization in Python", www.linguistics.uiuc.edu/grads/moran/papers/TextNorm.pdf.

[5]. Firoj Alam, S. M. Murtoza Habib and Mumit Khan, "Text Normalization System for Bangla", Conference on Language and Technology 2009 (CLT09), 22-24 January 2009.

[6]. Gilles Adda, Martine Adda-Decker, Jean-Luc Gauvain and Lori Lamel, "Text Normalization and Speech Recognition in French", Proc. ESCA Eurospeech '97, ftp://tlp.limsi.fr/public/euro97bref.ps.Z.

[7]. Yuxiang Jia, Dezhi Huang, Wu Liu, Yuan Dong, Shiwen Yu and Halia Wang, "Text Normalization in Mandarin Text-To-Speech System", Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008), March 31-April 4, 2008, pp. 4693-4696.

[8]. Pingali, P., Jagarlamudi, J. and Varma, V., "WebKhoj: Indian Language IR from Multiple Character Encodings", in Proceedings of the 15th International Conference on World Wide Web (WWW '06), Edinburgh, Scotland, May 23-26, 2006, ACM Press, New York, NY, pp. 801-809.

[9]. Conghui Zhu, Jie Tang, Hang Li, Hwee Tou Ng and Tie-Jun Zhao, "A Unified Tagging Approach to Text Normalization", in Proceedings of ACL 2007.

[10]. Graliński Filip, Jassem Krzysztof, Wagner Agnieszka and Wypych Mikołaj, "Text Normalization as a Special Case of Machine Translation", Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 51-56.