Analysis of Nastaleeq For Pakistani Languages

Analysis of Nasta’leeq for Pakistani Languages
M. G. Abbas Malik, Christian Boitet, Pushpak Bhattacharyya

GETALP – LIG, University Joseph Fourier; CSE, IIT Bombay, India
abbas.malik@imag.fr, Christian.Boitet@imag.fr, pb@cse.iitb.ac.in

Abstract

Nasta’leeq is a bidirectional, diagonal, non-monotonic, cursive, highly context-sensitive and very complex writing system for languages like Urdu, Punjabi, Pashto, Sindhi, Baluchi, Kashmiri, etc. written in a derivation of the Perso-Arabic script. The style is defined by well-formed rules, passed down mainly through generations of calligraphers and old manuscripts, and then by the printed materials of the modern era, but these rules have not been quantitatively analyzed in enough detail for the languages mentioned above. This paper first gives a brief introduction to the different writing styles of the Perso-Arabic script and then discusses its salient features. It also briefly discusses the alphabets of Pakistani languages. It then gives the quantitative analysis of Nasta’leeq and explains its context-sensitive behavior with respect to Pakistani languages, which is equally true for Arabic, Persian and other languages written in a derivation of the Perso-Arabic script. Finally, it discusses the Context-Sensitive Substitution Grammar of Nasta’leeq, a computational model of Nasta’leeq.

1. Introduction

Pakistan is a country with at least six major languages and 58 minor ones (Rahman, 2004). Urdu, the national language, has over 11 million (7.57%) mother-tongue speakers, while those who use it as a second language number more than 105 million (Grimes, 2000). Punjabi, the mother tongue of 44.15% of the population, is the biggest language of Pakistan. The other major languages are Pashto, Sindhi, Baluchi and Kashmiri, according to their number of speakers and geographical divisions. The size of these languages and Urdu is shown in Table 1.

The benefits of the Information Technology (IT) revolution cannot be reaped unless the masses use it, which is not possible unless computing is possible in a language that is understood by the masses (Afzal and Hussain, 2001). Information has become such an integral part of our global society that access to it is considered a basic human right. The Internet is believed to be the dominant carrier of information across the globe. Currently, English is the lingua franca of the Internet and most information is available in it, but that makes information practically inaccessible to the vast majority of the world. This applies especially to countries like Pakistan, where those who may be considered barely literate in Urdu, according to the 43.92% literacy rate claimed in the census of 1998, are nearly 66 million. That is rather a large number compared to the nearly 26 million (17.29%) who, having passed the ten-year school system (matriculation), can presumably read and understand a little English. And yet the Internet and computer programs function in English in Pakistan and not even in Urdu, let alone the other languages. This means that most Pakistanis are either excluded from the digital world or function in it as handicapped aliens. Indeed, most matriculates from Urdu and Sindhi medium schools have such rudimentary knowledge of English that they cannot carry out any meaningful interaction with the digital world, especially one that would increase their knowledge or analytical skills. Perhaps only the 4.38% who are graduates (about 6.5 million) could do so (Rahman, 2004).

Language    No. of Speakers
Urdu*       164,290,000
Punjabi     66,225,000
Pashto      23,130,000
Sindhi      21,150,000
Baluchi     5,355,000
Kashmiri    4,496,000
* Urdu includes native and 2nd language speakers
Source: Rahman, 2004
Table 1: Speakers of Pakistani Languages

2. Arabic Script and its Writing Styles

Arabic script is a cursive writing system. It has many writing styles including Naskh, Kufi, Sulus, Riqah, Deevani, etc. Some of them are shown in Figure 1. The Nasta’leeq writing style was developed in Iran during the 14th and 15th centuries by combining Naskh
and Taleeq (an old, obsolete style)1. It is one of the main genres of Islamic calligraphy and is rich in calligraphic content. Owing to the complexities of rendering, the basic shapes identified in this section are unable to render languages in an acceptable form in Nasta’leeq. A detailed quantitative analysis of Nasta’leeq with respect to Pakistani languages is given in Section 4.

1 http://en.wikipedia.org/wiki/Nasta%27liq_script

Figure 1: Different Writing Styles for Arabic

The distinguishing characteristics of the Perso-Arabic script are discussed for the benefit of the unacquainted reader. It is read from right to left. Figure 2 shows some sample characters from Pakistani languages. Unlike English, characters do not have upper and lower case.

Figure 2: Sample Characters of Pakistani Languages

The shape assumed by a character in a word is context-sensitive, i.e. the shape differs depending on whether the character is at the beginning, in the middle or at the end of the constituent word. This generates three shapes, the fourth being the independent shape of the character (Khaver, 1999; Malik, 2005). Figure 3 shows these four shapes of the character Beh in the Naskh writing style.

Figure 3: Context sensitive shapes of BEH (Khaver, 1999)

To be precise, the above is true for all except a certain number of characters that only have the independent shape and the terminating shape, used when they come at the beginning and in the middle or at the end of a word respectively (Khaver, 1999; Malik, 2005). Some of these characters are shown in Figure 4.

Figure 4: Sample Characters having only Two Shapes

Hamza never comes at the beginning of a word (Khaver, 1999), but it can come at the beginning of a ligature. It also takes the independent shape instead of the final shape when it comes at the end of a word. Thus, it has initial, middle and independent shapes (Khaver, 1999; Malik, 2005), as illustrated in Figure 5.

Figure 5: Shapes of Hamza (circled)

Arabic, Persian and Pakistani languages have a large set of diacritical marks that are necessary for the correct articulation of a word. The diacritical marks appear above or below a character to define a vowel or to geminate a character (Khaver, 1999; Malik, 2005). They are the foundation of the vowel system in these languages. The most common diacritical marks with the character Beh are shown in Figure 6.

Figure 6: BEH with Diacritical Marks

Diacritics, though part of the language, are sparingly used. They are essential for removing ambiguities, for natural language processing and for speech synthesis (Khaver, 1999; Malik, 2005; Malik, 2006; Malik et al., 2008).

3. Pakistani Languages

Pakistani languages are written in an alphabet that is derived from the Perso-Arabic alphabet. It is not possible to discuss all Pakistani languages here. This paper only discusses the six languages given in Table 1, because the last five represent the major geographical divisions of Pakistan and Urdu is the national language of Pakistan. All of these languages belong to the Indo-European language family. Their family tree is given in Figure 7.

Figure 7: Language Tree of Pakistani Languages

The alphabets of each of these languages are discussed separately here with their Unicode values. In Unicode, Arabic and its associated languages like Urdu, Punjabi, Pashto, Sindhi, etc. have been allocated the code points 0600h – 06FFh, 0750h – 077Fh and FB50h – FEFFh.
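The Unicode ranges quoted in Section 3 (0600h – 06FFh, 0750h – 077Fh and FB50h – FEFFh) can be turned into a simple membership test. The following is only a minimal sketch; the names `ARABIC_SCRIPT_RANGES`, `in_arabic_script_ranges` and `non_arabic_characters` are hypothetical helpers introduced here for illustration and are not part of the paper.

```python
# Code point ranges quoted in the text for Arabic and its associated languages.
ARABIC_SCRIPT_RANGES = (
    (0x0600, 0x06FF),
    (0x0750, 0x077F),
    (0xFB50, 0xFEFF),
)


def in_arabic_script_ranges(ch: str) -> bool:
    """Return True if the single character `ch` lies in one of the quoted ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ARABIC_SCRIPT_RANGES)


def non_arabic_characters(text: str) -> list:
    """List the non-space characters of `text` that fall outside the quoted ranges."""
    return [ch for ch in text
            if not ch.isspace() and not in_arabic_script_ranges(ch)]
```

For instance, `in_arabic_script_ranges("\u0628")` holds for Beh (0628), while a Latin letter such as "A" is reported by `non_arabic_characters`.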
3.1. Urdu

Urdu is the national language of Pakistan and one of the state languages of India, with more than 60 million native speakers. It is one of the biggest languages of the world, if one considers Hindi and Urdu as dialects of the same language, called Hindustani by Platts (Platts, 1909). Table 2 gives the size of Hindi/Urdu.

Speakers   Native         2nd Language   Total
Hindi      366,000,000    487,000,000    853,000,000
Urdu       60,290,000     104,000,000    164,290,000
Total      426,290,000    591,000,000    1,017,000,000
Table 2: Hindi and Urdu speakers (Malik et al., 2008)

Urdu is written in the Nasta’leeq style. It has 35 consonant characters representing 27 consonant sounds, as some consonant sounds are represented by two or more consonant characters, e.g. the sound ‘s’ is represented by three different characters: Seh (ث), Seen (س) and Sad (ص) (Malik et al., 2008). Out of the 35 consonant characters, 32 are adopted from Persian. Three retroflex consonants are added to accommodate the indigenous sounds of the Indian sub-continent. These characters are Tteh (ٹ) [ʈ], Ddal (ڈ) [ɖ] and Rreh (ڑ) [ɽ]. Non-aspirated consonants of Urdu are given in Table 3.

Sr.  Symbol  IPA   Unicode
1    ب       [b]   0628
2    پ       [p]   067E
3    ت       [ṱ]   062A
4    ٹ       [ʈ]   0679
5    ث       [s]   062B
6    ج       [ʤ]   062C
7    چ       [ʧ]   0686
8    ح       [h]   062D
9    خ       [x]   062E
10   د       [ḓ]   062F
11   ڈ       [ɖ]   0688
12   ذ       [z]   0630
13   ر       [r]   0631
14   ڑ       [ɽ]   0691
15   ز       [z]   0632
16   ژ       [ʒ]   0698
17   س       [s]   0633
18   ش       [ʃ]   0634
19   ص       [s]   0635
20   ض       [z]   0636
21   ط       [ṱ]   0637
22   ظ       [z]   0638
23   ع       [ʔ]   0639
24   غ       [ɣ]   063A
25   ف       [f]   0641
26   ق       [q]   0642
27   ک       [k]   06A9
28   گ       [g]   06AF
29   ل       [l]   0644
30   م       [m]   0645
31   ن       [n]   0646
32   و       [v]   0648
33   ہ       [h]   06C1
34   ی       [j]   06CC
35   ة       [ṱ]   0629
Table 3: Non-Aspirated Urdu Consonants

The phenomenon of aspiration does not exist in Persian or Arabic, but it exists in languages of the region, e.g. Hindi, Urdu, Punjabi, etc. In Urdu, the special character Heh Doachashmee (ھ) is used to mark aspiration. Thus aspirated consonants are represented by the combination of the consonant to be aspirated and Heh Doachashmee (ھ), e.g. ب [b] + ھ [h] = بھ [bʰ], ج [ʤ] + ھ [h] = جھ [ʤʰ], etc. Urdu has 15 aspirated consonants (Malik et al., 2008). Aspirated Urdu consonants are given in Table 4.

Sr.  Symbol  IPA
1    بھ      [bʰ]
2    پھ      [pʰ]
3    تھ      [ṱʰ]
4    ٹھ      [ʈʰ]
5    جھ      [ʤʰ]
6    چھ      [ʧʰ]
7    دھ      [ḓʰ]
8    ڈھ      [ɖʰ]
9    رھ      [rʰ]
10   ڑھ      [ɽʰ]
11   کھ      [kʰ]
12   گھ      [gʰ]
13   لھ      [lʰ]
14   مھ      [mʰ]
15   نھ      [nʰ]
Table 4: Aspirated Urdu Consonants

In addition to consonants, Urdu has 10 vowels, and 7 of them also have nasalized forms (Malik et al., 2008; Hussain, 2004). They are represented with the help of four long vowels (Alef Madda (آ), Alef (ا), Waw (و) and Yeh (ی)) and three short vowels (Arabic Fatha ◌َ, Damma ◌ُ and Kasra ◌ِ). The representation of a vowel is context-sensitive, i.e. a vowel may be written in two or more ways according to its context in a word, e.g. the vowel sound [ə] is represented by Alef (ا) + Zabar (◌َ) at the start of a word and by Zabar (◌َ) in the middle of a word. The vowel sound [ə] never comes at the end of a word. Nasalization of a vowel is marked with Noon-ghunna (ں) at the end of a word and with Noon (ن) in the middle of a word (Malik et al., 2008). For more details, see (Malik et al., 2008).

Urdu contains 15 diacritical marks. They represent vowel sounds, except Hamza-e-Izafat (◌ٔ) and Kasr-e-Izafat (◌ِ), which are used to build compound words, e.g. ادارۂ سائنس [ɪḓɑrəhɪsɑɪns] (Institute of Science), تاریخِ پیدائش [tɑrixɪpedɑɪʃ] (date of birth), etc. Shadda (◌ّ) is used to geminate a consonant, e.g. ربّ [rəbb] (God), اچّھا [əʧʧʰɑ] (good), etc. Sukun (◌ْ) is used to mark the absence of a vowel after the base consonant (Platts, 1909; Malik et al., 2008).

The languages of Table 1 share Perso-Arabic punctuation and special symbols. These punctuation marks and symbols are given in Table 5.

Sr.  Symbol  Unicode
1    ،       060C
2    ؛       061B
3    ؟       061F
4    ۔       06D4
5    ؀       0600
6    ؁       0601
7    ؂       0602
8    ؃       0603
9    ؎       060E
10   ؏       060F
11   ◌ؐ      0610
12   ◌ؑ      0611
13   ◌ؒ      0612
14   ◌ؓ      0613
15   ◌ؔ      0614
16   ◌ؕ      0615
17   ٪       066A
Table 5: Punctuation Marks and other Symbols

Urdu has a numeral system that is derived from Persian. It is assigned the same Unicode values as Persian, ranging over 06F0 – 06F9, but employs different shapes for the numbers 4, 5 and 7. They are shown in Table 6.

Sr.  Symbol  Unicode
1    ۰       06F0
2    ۱       06F1
3    ۲       06F2
4    ۳       06F3
5    ۴       06F4
6    ۵       06F5
7    ۶       06F6
8    ۷       06F7
9    ۸       06F8
10   ۹       06F9
Table 6: Urdu Numerals

3.2. Punjabi

Punjabi is written in two mutually incomprehensible scripts: one is a derivation of the Perso-Arabic script (called Shahmukhi), used in Pakistan, and the other is Gurmukhi, used in India. The Punjabi (Shahmukhi) alphabet is a superset of the Urdu alphabet and has one additional non-aspirated consonant, Rnoon [ɳ] (Malik, 2005; Malik, 2006). The rest is the same as Urdu. Punjabi is also traditionally written in the Nasta’leeq style. For more details on the Punjabi (Shahmukhi) alphabet see (Malik, 2005; Malik, 2006).

3.3. Pashto

Like Persian, Pashto does not have aspiration. Heh Gol (ہ) takes the shape of Heh Doachashmee (ھ) when it comes at the start or in the middle of a ligature. Although the Urdu/Punjabi retroflex sounds exist in Pashto, Pashto employs different characters for them. Table 7 gives a shape comparison of retroflex consonants in Pakistani languages.

IPA   Urdu, Baluchi, Kashmiri   Punjabi   Pashto   Sindhi
ʈ     ٹ                         ٹ         ټ        ٽ
ɖ     ڈ                         ڈ         ډ        ڊ
ɽ     ڑ                         ڑ         ړ        ڙ
ɳ     -                         Rnoon     ڼ        ڻ
Table 7: Comparison of Retroflex Consonants

In Pashto, there exist five different kinds of Yeh. One is employed as a consonant and the others represent different vowel sounds. They are shown in Figure 8.

Figure 8: Five Yehs of Pashto ([j], [i], [e], [əy] and [ə])

Pashto has 39 consonants and uses the same Persian number system without any change. The vowel system of Pashto is also context-sensitive and is represented with the help of long vowels and diacritical marks. Pashto is traditionally written in the Naskh style. Table 8 shows the remaining Pashto characters that are not present in Urdu or have a different shape than in Urdu.

Sr.  Symbol  IPA    Unicode
1    ځ       [dz]   0681
2    څ       [ts]   0685
3    ږ       [ȥ]    0696
4    ښ       [ȿ]    069A
5    ګ       [g]    06AB
Table 8: Pashto Characters

3.4. Sindhi

Sindhi has 40 non-aspirated consonants and 11 aspirated consonants. In Sindhi, aspiration is marked in different ways: for the aspiration of Jeem (ج) it uses Heh Doachashmee (ھ) like Urdu and Punjabi; for the aspiration of Beh (ب) it introduces a new character with four dots below, ڀ; for the aspiration of Dal (د) it introduces a new character with two horizontal dots, ڌ; for the aspiration of Sindhi Tteh (ٽ) it introduces a new character with two vertical dots, ٺ; etc. Sindhi aspirated and non-aspirated consonants that are not present in Urdu or have different shapes than in Urdu are given in Table 9.

Sr.  Symbol  IPA    Unicode
1    ٻ       [ɓ]    067B
2    ڀ       [bʰ]   0680
3    ٿ       [ṱʰ]   067F
4    ٽ       [ʈ]    067D
5    ٺ       [ʈʰ]   067A
6    ڄ       []     0684
7    ڃ       [ɲ]    0683
8    ڇ       [ʧʰ]   0687
9    ڌ       [ḓʰ]   068C
10   ڊ       [ɖ]    068A
11   ڏ       [ɗ]    068F
12   ڍ       [ɖʰ]   068D
13   ڙ       [ɽ]    0699
14   ڙھ      [ɽʰ]   -
15   ڦ       [pʰ]   06A6
16   ڪ       [k]    06AA
17   ک       [kʰ]   06A9
18   ڳ       [ɠ]    06B3
19   ڱ       [ŋ]    06B1
20   ڻ       [ɳ]    06BB
21   ي       [j]    064A
Table 9: Aspirated and Non-aspirated Sindhi Consonants

Sindhi has 51 consonants and 16 vowels that are also context-sensitive.

Pashto and Sindhi are both traditionally written in Naskh, and nobody has analyzed them for the Nasta’leeq style. We do this analysis because they could also be written in Nasta’leeq, just like Arabic, which is also traditionally written in Naskh, but we can find very fine and beautiful old manuscripts of
Arabic and the Qur’an in the Indian sub-continent that are written in the Nasta’leeq style. Thus it is worthwhile to provide an analysis of Nasta’leeq for Pashto and Sindhi and to give the Pashto and Sindhi speaking communities an opportunity to write their languages in Nasta’leeq.

3.5. Baluchi

Baluchi uses a modified Urdu alphabet and is written in the Nasta’leeq style. Baluchi has removed the redundant characters for the same sound, e.g. for the sound [s], it keeps the character Seen (س) and discards the others (ص، ث). Thus Baluchi has 22 consonants and, like Persian and Pashto, it does not have aspiration. It has two additional diacritics: one is the Hamza mark (◌ٔ) above, and the other is similar to an inverted Damma that is horizontally reversed and much flatter. Some native speakers also write Baluchi using the unmodified Urdu alphabet.

3.6. Kashmiri

Kashmiri employs the Urdu alphabet with a few additions to represent its specific vowels. Kashmiri has two additional Yehs, one with an oval below and the other with a ‘v’ mark above (ێ). It also has two additional Waws, one with a circle at the ending tail (ۄ) and the other with a ‘v’ mark above (ۆ). It adds two diacritical marks (a slightly modified Hamza (ء) mark), coming above and below the character. The extra characters of Kashmiri are shown in Table 10. It is also traditionally written in the Nasta’leeq style.

Sr.  Symbol                           IPA    Unicode
1    Yeh with an oval below           []     -
2    ۄ                                [ɔ]    06C4
3    ۆ                                [o:]   06C6
4    ێ                                [e]    06CE
5    modified Hamza diacritic*        [ə]    -
* This diacritical mark comes above and below the characters. Thus it represents two diacritical marks.
Table 10: Kashmiri Characters

4. Analysis of Nasta’leeq

The rendering of Pakistani languages in Nasta’leeq is very complex because the shape and position of a character depend not only on its position in the word (at the start, in the middle or at the end) but also on the surrounding characters in the word. The 4-shape analysis given in Section 2 is not sufficient to handle Nasta’leeq rendering of Pakistani languages. Nasta’leeq is inherently context-sensitive. Figure 9 shows different context-sensitive shapes of the character Beh.

Figure 9: Context Sensitive Shapes of Beh

Only Wali and Hussain (2004) have given a quantitative analysis of Nasta’leeq (Nafees style), and only for the Urdu language. In this study, we give the quantitative analysis of the Noori style of Nasta’leeq for the major Pakistani languages given in Table 1.

For analysis purposes, we divide our discussion into different parts: independent shapes, and two, three and four characters joining. Once the analysis is done for ligatures that are four characters long, the joining is recursive for ligatures longer than four, so no further analysis and no new shape is required to represent the writing of a language in the Nasta’leeq style. This is shown in Figure 10.

Figure 10: Recursive Nature of Nasta’leeq

To ease the analysis, we divide characters into different groups on the basis of similarity in shapes, e.g. the set of characters shown in Figure 11 can be grouped under the name Beh_Family.

Figure 11: Beh_Family Members

The basic shape of each character of Figure 11 is exactly the same except for the Noktas (dots or marks) above or below. Similarly, we can divide all other characters into different groups. All the groups of characters are given in Table 11.

Sr.  Name
1    Alef
2    Beh
3    Jeem
4    Dal
5    Reh
6    Seen
7    Sad
8    Toain
9    Ain
10   Feh
11   Qaf
12   Kaf
13   Lam
14   Meem
15   Noon
16   Waw
17   Heh
18   Heh-Doachashmee
19   Hamza
20   Choti-Yeh
21   Bari-Yeh
Table 11: Characters Families

In addition to all the characters of Table 11, there exist certain ligatures that are treated like independent characters in the Nasta’leeq writing style. They are given in Figure 12. They act like independent characters that do not join with the following character in the ligature and have only two (independent and final) shapes.

Figure 12: Ligatures (Lam and the Kaf family followed by Alef, e.g. لا = ل + ا)

4.1. Independent Shapes

All characters of Table 11, the ligatures of Figure 12, the punctuation marks and special symbols of Table 5, the Urdu numerals of Table 6 and the Arabic numerals are independent characters. In addition to the punctuation marks of Table 5, other English punctuation marks like single quote, double quote, colon, etc. are also included in Nasta’leeq.

There are certain special ligatures that are included in Nasta’leeq, e.g. the Allah ligature (ﷲ), the Muhammad ligature (ﷴ), etc. A further 23 two-character ligatures are also included. In addition to all the above characters, Nasta’leeq also has a large set of diacritical marks that contains the diacritical marks of Arabic, Persian, Urdu, Punjabi, Pashto, Sindhi, Baluchi and Kashmiri. All these ligatures and diacritical marks are given in Table 12.

Table 12: Ligatures and Diacritical Marks

4.2. Two Characters Joining

We do the analysis of two characters joining in reverse order, i.e. first we identify the final shapes and then the initial shapes for these final shapes. There are two types of characters, one of which has only two (independent and final) shapes. This group consists of Alef_Family, Dal_Family, Reh_Family, two characters from Choti-Yeh_Family, i.e. ى (Alef Maksura) and ۍ (Pashto Yeh with tail), and Bari-Yeh (ے). Some of these characters have two final shapes depending on their joining behavior with different families, e.g. Reh_Family has two final shapes, one for the Beh, Jeem, Kaf, Lam, Noon, Hamza and Choti-Yeh families and the other for the rest of the families. Final shapes of these families are given in Table 13.

Table 13: Final Shapes of Alef, Dal, Reh and two Yehs

Final shapes of the rest of the families are given in Table 14.
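The choice between the two final shapes of the Reh family described in Section 4.2 can be expressed as a lookup on the family of the preceding character. This is only a sketch under stated assumptions: the shape names `rehfina1` and `rehfina2` and the function `reh_final_shape` are hypothetical labels introduced for illustration; the paper gives the two shapes only as glyphs in Table 13.

```python
# Families after which the Reh family takes the first of its two final shapes,
# as listed in Section 4.2. All other families select the second final shape.
REH_FIRST_SHAPE_CONTEXT = frozenset(
    {"Beh", "Jeem", "Kaf", "Lam", "Noon", "Hamza", "Choti-Yeh"}
)


def reh_final_shape(preceding_family: str) -> str:
    """Pick a (hypothetically named) final shape of Reh for a preceding family."""
    if preceding_family in REH_FIRST_SHAPE_CONTEXT:
        return "rehfina1"  # assumed name for the first final shape
    return "rehfina2"      # assumed name for the second final shape
```

A call such as `reh_final_shape("Beh")` selects the first shape, while `reh_final_shape("Seen")` falls through to the second.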
Table 14: Final Shapes

Hamza (ء) does not have a final shape. Thus there are 22 final families depending upon their final shapes, given in Tables 13 and 14.

The above two tables not only give us the final shapes of all the families of Table 11 and of the ligatures of Figure 12 (La ﻻ and the Ka family), they also give us the analysis of the initial shapes of the Beh, Jeem, Seen, Sad, Noon and Choti-Yeh families. The analysis of the initial shapes of the Beh, Noon, Hamza and Choti-Yeh families shows that they have the same base form for the initial shape, with variations in Noktas, as is clear from the examples in the tables. It is also clear from these examples that the initial forms for the final shapes of the Sad and Ain families are the same. Thus the Behinit family (including the initial forms of the Beh, Noon, Hamza and Choti-Yeh families) has 21 initial shapes, given in Table 15.

Sr.  Final Families
1    Alef_Fina
2    Beh_Fina
3    Jeem_Fina
4    Dal_Fina
5    Reh_Fina
6    Seen_Fina
7    Sad_Ain_Fina
8    Tah_Fina
9    Feh_Fina
10   Qaf_Fina
11   Kaf_Fina
12   Lam_Fina
13   Meem_Fina
14   Noon_Fina
15   Waw_Fina
16   Hehgol_Fina
17   Heh-doachashmee_Fina
18   Choti-Yeh_Fina
19   Bari-yeh_Fina*
20   La_Fina
21   Ka_Fina
* Behinit family with Bari-yeh is stored as ligatures
Table 15: Initial Shapes of Beh and Jeem Families

With the 21 initial shapes of all families, all possible two-character ligatures can be represented in Nasta’leeq. The Kaf and Lam families do not have an initial shape for Alef because these pairs of characters are stored as ligatures, shown in Figure 12.

4.3. Three Characters Joining

Final shapes have already been identified in the previous section. Similar to the initial shapes, 21 medial shapes are identified for the final shape families. Medial shapes of the Behmedi and Jeemmedi families for the final families are given in Table 16.

Sr.  Final Families
1    Alef_Fina
2    Beh_Fina
3    Jeem_Fina
4    Dal_Fina
5    Reh_Fina
6    Seen_Fina
7    Sad_Ain_Fina
8    Tah_Fina
9    Feh_Fina
10   Qaf_Fina
11   Kaf_Fina, Gaf_Fina
12   Lam_Fina
13   Meem_Fina
14   Noon_Fina
15   Waw_Fina
16   Hehgol_Fina
17   Heh-doachashmee_Fina
18   Choti-Yeh_Fina
19   Bari-yeh_Fina
20   La_Fina
21   Ka_Fina
Table 16: Medial Shapes of Beh and Jeem Families

Behmedi shapes can be grouped into four different families according to their joining behavior with the previous character. This is shown in Table 17.

Name of Family   Shape Members
Behmedi1         1, 2, 4, 7, 8, 9, 10, 11, 12, 15, 16, 19, 20, 21, 24, 25, 28, 29
Behmedi2         3, 13, 17, 18, 26, 30
Behmedi3         6, 14, 22, 23, 27
Behmedi4         5
Table 17: Behmedi Families

For the families of Table 17, we need four initial shapes for each character that has an initial shape. Thus the Behinit family has four new shapes for the Behmedi family and one shape for the Jeemmedi family. All additional initial shapes of the Behinit and Jeeminit families, identified for medial shapes, are given in Table 18.

Thus we now have 30 initial shapes and 21 medial shapes that can represent all possible ligatures of length three of Pakistani languages in the Nasta’leeq style. It is not possible to list all shapes of all characters due to lack of space.

Sr.  Medial Families
22   Behmedi1
23   Behmedi2
24   Behmedi3
25   Behmedi4
26   Jeemmedi
27   Seenmedi
28   Sadmedi, Tahmedi, Ainmedi, Fehmedi
29   Kafmedi, Gafmedi, Lammedi
30   Meemmedi, Hehgolmedi, Heh-doachashmeemedi
Table 18: More Initial Shapes of Beh and Jeem Families

4.4. Four Characters Joining

We do our analysis in the reverse direction, i.e. from left to right. In the analysis of three characters joining, we have already identified the shapes of the last two characters of our four-character ligatures, namely the final shapes and the medial shapes for these final shapes. Now we first need to identify medial shapes that will join with the already identified medial shapes. Secondly, we need to identify initial shapes that will join with the medial shapes newly identified in the first step, and this will complete our joining analysis.

The process of identifying the new medial shapes is the same as the one we used to identify the initial shapes for the first 21 medial shapes. Similar to the Behinit family, the Behmedi family also has four new shapes for its first 21 members and one shape for the Jeemmedi family. All additional medial shapes of the Behmedi and Jeemmedi families, identified for medial shapes, are given in Table 19.

Table 17 shows that the Behmedi2 family includes the medial shapes 26 and 30. Thus, fortunately, we do not need new initial shapes for the newly identified medial shapes of Table 19, and this completes our analysis.

Ligatures longer than four characters can be built by recursively using the shapes already identified, as shown in Figure 10. Hence, we have 1 or 2 final shapes, 30 initial shapes and 30 medial shapes for the characters of Pakistani languages, and we need 996 character shapes to represent Pakistani languages in the Nasta’leeq style.

Sr.  Medial Families
22   Behmedi1
23   Behmedi2
24   Behmedi3
25   Behmedi4
26   Jeemmedi
27   Seenmedi
28   Sadmedi, Tahmedi, Ainmedi, Fehmedi
29   Kafmedi, Gafmedi, Lammedi
30   Meemmedi, Hehgolmedi, Heh-doachashmeemedi
Table 19: More Medial Shapes of Beh and Jeem Families

5. Context Sensitive Substitution Grammar

The analysis given in Section 4 can be represented as a Context-Sensitive Substitution Grammar. Figure 13 shows some rules of the contextual substitution grammar of Nasta’leeq.

Initial Rule
    beh → behinit1 aiknoktabelow
    jeem → jeeminit1 aiknoktabelow
    No Context (Before | After)
Medial Rule
    beh → behmedi1 aiknoktabelow
    jeem → jeemmedi1 aiknoktabelow
    No Context (Before | After)
Final Rule
    beh → behfina1
    jeem → jeemfina
    No Context (Before | After)
Contextual Substitution Rule for Behfina1
    behinit1 → behinit2
    jeeminit1 → jeeminit2
    behmedi1 → behmedi2
    jeemmedi1 → jeemmedi2
    Context ( | behfina1)
Contextual Substitution Rule for Jeemfina1
    behinit1 → behinit3
    jeeminit1 → jeeminit3
    behmedi1 → behmedi3
    jeemmedi1 → jeemmedi3
    Context ( | jeemfina)
Contextual Substitution Rule for Behmedi1 Family
    behinit1 → behinit22
    jeeminit1 → jeeminit22
    behmedi1 → behmedi22
    jeemmedi1 → jeemmedi22
    Context ( | <behmedi1 Family>)
Figure 13: Context-Sensitive Substitution Grammar

The Initial Rule states that Beh (ب) and Jeem (ج) are substituted by behinit1 and jeeminit1 respectively, together with the appropriate Nokta, whenever they come at the initial position of a ligature. The Medial and Final rules have the same kind of interpretation for the medial and final positions respectively. The Contextual Substitution Rule for Behfina1 states that the default initial shapes behinit1 and jeeminit1 at the initial position are substituted by behinit2 and jeeminit2 when they are followed by a Behfina1. It also states that the default medial shapes behmedi1 and jeemmedi1 at the medial position are substituted by behmedi2 and jeemmedi2 when they are followed by a Behfina1.

The other rules have the same kind of interpretation. Figure 13 shows a very small part of the Context-Sensitive Substitution Grammar of Nasta’leeq. This shows the contextual nature and contextual complexity of Nasta’leeq. Theoretically, the Context-Sensitive Substitution Grammar is a computational model of Nasta’leeq’s contextual substitution complexity.

6. Conclusion

Nasta’leeq is a bidirectional, diagonal, non-monotonic, cursive, highly context-sensitive and very complex writing system for languages written in the Perso-Arabic script like Urdu, Punjabi, Pashto, Sindhi, Baluchi, Kashmiri, etc. The analysis of Nasta’leeq for Pakistani languages is equally true for Arabic and Persian when they are written in the Nasta’leeq style. The analysis of Nasta’leeq and the Context-Sensitive Substitution Grammar discussed in this paper can be used to build a good font for writing Arabic, Persian, Urdu, Punjabi, Pashto, Sindhi, Baluchi and Kashmiri in the Nasta’leeq style.

The practical implementation of a character-based Nasta’leeq font for Arabic, Persian and Pakistani languages is a much more complex process than its theoretical analysis. The practical development of a character-based Nasta’leeq font for these languages needs not only the Context-Sensitive Substitution Grammar, but also positioning information to correctly position each character in its context. To give an idea of the practical complexity, the Initial Rule of Figure 13 substitutes Beh (ب) with its initial shape behinit1 and aiknoktabelow (a dot below the initial shape), but it says nothing about the position of the Nokta. In the other rules given in Figure 13, we substitute shapes but say nothing about their positions, i.e. whether or not they join properly with the shapes before and after them. Positioning of Noktas and diacritical marks with respect to all shapes is another complex problem for a practical Nasta’leeq font for languages written in the Perso-Arabic script.

7. References

T. Rahman, “Language Policy and Localization in Pakistan: Proposal for a Paradigmatic Shift”, in Proc. Crossing the Digital Divide, SCALLA Conference on Computational Linguistics, 5 – 7 January, 2004.
B. F. Grimes, “Pakistan”, Ethnologue: Languages of the World, 14th Edition, Dallas, Texas: Summer Institute of Linguistics, 2000.
M. Afzal, S. Hussain, “Urdu Computing Standards: Development of Urdu Zabta Takhti (UZT) 1.01”, in Proc. INMIC-2001, Lahore, 2001.
Z. Khaver, “Standard Code Table for Urdu”, in Proc. 4th Symposium on Multilingual Information Processing (MLIT-4), Yangon, Myanmar, CICC, Japan, 1999.
M. G. Abbas Malik, “Towards a Unicode Compatible Punjabi Character Set”, in Proc. 27th Internationalization and Unicode Conference, Berlin, Germany, 2005.
M. G. Abbas Malik, “Punjabi Machine Transliteration”, in Proc. 21st International Conference on Computational Linguistics COLING-06 and 44th Annual Meeting of ACL, Sydney, Australia, 2006.
J. T. Platts, “A Grammar of the Hindustani or Urdu Language”, Crosby Lockwood and Son, 7 Stationers Hall Court, Ludgate Hill, London, E.C., 1909.
M. G. Abbas Malik, Christian Boitet, Pushpak Bhattacharyya, “Hindi Urdu Machine Transliteration using Finite-state Transducers”, in Proc. 22nd International Conference on Computational Linguistics COLING-08, Manchester, UK, 2008.
S. Hussain, “Letter to Sound Rules for Urdu Text to Speech System”, in Proc. Workshop on “Computational Approaches to Arabic Script-based Languages”, COLING-04, Geneva, Switzerland, 2004.
A. Wali, S. Hussain, “Context Sensitive Shape-Substitution in Nastaliq Writing System: Analysis and Formulation”, in Proc. “International Joint Conferences on Computer, Information and Systems Sciences and Engineering”, 2006.
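As a closing illustration of Section 5, the rules of Figure 13 can be sketched as a small substitution procedure. This is a minimal sketch and not the authors' implementation. It assumes that a ligature is given as a list of family names ("beh" or "jeem") in reading order, it omits Noktas such as aiknoktabelow, and it applies only the three contextual rules shown in Figure 13 (every other context keeps the default shape). The function names are hypothetical; the shape names and the Behmedi1 membership come from Figure 13 and Table 17.

```python
import re

# Shape numbers of the Behmedi1 family, taken from Table 17.
BEHMEDI1_FAMILY = {1, 2, 4, 7, 8, 9, 10, 11, 12, 15, 16, 19, 20, 21, 24, 25, 28, 29}

# Default final shape names exactly as written in Figure 13.
DEFAULT_FINAL = {"beh": "behfina1", "jeem": "jeemfina"}


def default_shapes(letters):
    """Apply the Initial, Medial and Final rules (no context) to a ligature."""
    if len(letters) < 2 or any(l not in DEFAULT_FINAL for l in letters):
        raise ValueError("expected at least two letters, each 'beh' or 'jeem'")
    last = len(letters) - 1
    shapes = []
    for i, letter in enumerate(letters):
        if i == 0:
            shapes.append(letter + "init1")
        elif i == last:
            shapes.append(DEFAULT_FINAL[letter])
        else:
            shapes.append(letter + "medi1")
    return shapes


def contextual_variant(next_shape):
    """Return the shape number required before `next_shape`, or None."""
    if next_shape == "behfina1":
        return 2
    if next_shape == "jeemfina":
        return 3
    match = re.fullmatch(r"behmedi(\d+)", next_shape)
    if match and int(match.group(1)) in BEHMEDI1_FAMILY:
        return 22
    return None  # context not covered by the rules of Figure 13


def shape_ligature(letters):
    """Resolve shapes from the final character back to the initial one."""
    shapes = default_shapes(letters)
    for i in range(len(shapes) - 2, -1, -1):
        variant = contextual_variant(shapes[i + 1])
        if variant is not None:
            shapes[i] = re.sub(r"\d+$", str(variant), shapes[i])
    return shapes
```

Resolving from the final character backwards mirrors the reverse-order analysis of Section 4: each shape depends only on the already resolved shape that follows it, which is also why longer ligatures need no new rules.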
