Figure 5: Shapes of Hamza (circled)
Arabic, Persian and Pakistani languages have a
large set of diacritical marks that are necessary for the
correct articulation of a word. The diacritical marks
appear above or below a character to define a vowel or
to geminate a character (Khaver, 1999; Malik, 2005).
They are the foundation of the vowel system in these
languages. The most common diacritical marks with
the character Beh are shown in Figure 6.
Figure 6: BEH with Diacritical Marks
Diacritics, though part of the language, are
sparingly used. They are essential for removing
ambiguities, for natural language processing and for
speech synthesis (Khaver, 1999; Malik, 2005; Malik,
2006; Malik et al., 2008).
Figure 1: Different Writing Styles for Arabic
The distinguishing characteristics of Perso-Arabic
script are discussed for the benefit of the unacquainted
reader. It is read from right-to-left. Figure 2 shows
some sample characters from Pakistani languages.
Unlike English, characters do not have upper and lower
case.
Figure 2: Sample Characters of Pakistani Languages
The shape assumed by a character in a word is
context-sensitive, i.e. the shape is different depending
on whether the character is at the beginning, in the
middle or at the end of the constituent word. This
generates three shapes, the fourth being the
independent shape of the character (Khaver, 1999;
Malik, 2005). Figure 3 shows these four shapes of the
character Beh in Naskh writing style.

3. Pakistani Languages

Pakistani languages are written in an alphabet that is
derived from the Perso-Arabic alphabet. It is not
possible to discuss all Pakistani languages here. This
paper only discusses the six languages given in Table 1
because the last five represent the major geographical
divisions of Pakistan and Urdu is the national language
of Pakistan. All of these languages belong to the Indo-
European language family. Their family tree is given in
Figure 7.
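The four context-dependent shapes of a character (independent, initial,
medial and final) discussed in Section 2 can be inspected with a short
Python sketch. It relies on the Unicode Arabic Presentation Forms-B code
points for Beh, which encode Naskh-like positional forms. The helper
positional_shape is a hypothetical name introduced here for illustration,
and it assumes that every letter joins on both sides, which does not hold
for letters such as Alef, Dal or Reh.

```python
import unicodedata

# Unicode "Arabic Presentation Forms-B" code points for the letter Beh (U+0628).
BEH_FORMS = {
    "independent": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def positional_shape(index: int, length: int) -> str:
    """Name the shape of the letter at `index` in a ligature of `length` letters.

    Hypothetical helper: assumes every letter joins on both sides.
    """
    if length == 1:
        return "independent"
    if index == 0:
        return "initial"
    if index == length - 1:
        return "final"
    return "medial"


if __name__ == "__main__":
    for shape, glyph in BEH_FORMS.items():
        # NFKC normalization folds each presentation form back to U+0628.
        base = unicodedata.normalize("NFKC", glyph)
        print(f"{shape:12} {unicodedata.name(glyph)} -> U+{ord(base):04X}")
    print([positional_shape(i, 3) for i in range(3)])
```

This four-way classification is only the starting point: Section 4 argues
that Nasta’leeq needs many more shapes per character than these four.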
1
http://en.wikipedia.org/wiki/Nasta%27liq_script
3.1. Urdu

Urdu is the national language of Pakistan and one
of the state languages of India, with more than 60
million native speakers. It is one of the biggest
languages of the world, if one considers Hindi/Urdu as
dialects of the same language, called Hindustani by
Platts (Platts, 1909). Table 2 gives the size of
Hindi/Urdu.

Speakers  Native       2nd Language  Total
Hindi     366,000,000  487,000,000   853,000,000
Urdu      60,290,000   104,000,000   164,290,000
Total     426,290,000  591,000,000   1,017,290,000
Table 2: Hindi and Urdu speakers (Malik et al., 2008)

Urdu is written in Nasta’leeq style. It has 35
consonant characters representing 27 consonant sounds,
as some consonant sounds are represented by two or
more consonant characters, e.g. the sound ‘s’ is
represented by three different characters Seh (ث), Seen
(س) and Sad (ص) (Malik et al., 2008). Out of 35
consonant characters, 32 are adopted from Persian.
Three retroflex consonants are added to accommodate
the indigenous sounds of the Indian sub-continent.
These characters are Tteh (ٹ) [ʈ], Ddal (ڈ) [ɖ] and
Rreh (ڑ) [ɽ]. Non-aspirated consonants of Urdu are
given in Table 3.

Sr. Symbol Unicode Sr. Symbol Unicode
1 ب [b] 0628 19 ص [s] 0635
2 پ [p] 067E 20 ض [z] 0636
3 ت [ṱ] 062A 21 ط [ṱ] 0637
4 ٹ [ʈ] 0679 22 ظ [z] 0638
5 ث [s] 062B 23 ع [ʔ] 0639
6 ج [ʤ] 062C 24 غ [ɣ] 063A
7 چ [ʧ] 0686 25 ف [f] 0641
8 ح [h] 062D 26 ق [q] 0642
9 خ [x] 062E 27 ک [k] 06A9
10 د [ḓ] 062F 28 گ [g] 06AF
11 ڈ [ɖ] 0688 29 ل [l] 0644
12 ذ [z] 0630 30 م [m] 0645
13 ر [r] 0631 31 ن [n] 0646
14 ڑ [ɽ] 0691 32 و [v] 0648
15 ز [z] 0632 33 ہ [h] 06C1
16 ژ [ʒ] 0698 34 ی [j] 06CC
17 س [s] 0633 35 ة [ṱ] 0629
18 ش [ʃ] 0634
Table 3: Non-Aspirated Urdu Consonants

The phenomenon of aspiration does not exist in
Persian or Arabic but it exists in languages of the
region, e.g. Hindi, Urdu, Punjabi, etc. In Urdu, the
special character Heh Doachashmee (ھ) is used to mark
the aspiration. Thus aspirated consonants are
represented by the combination of the consonant to be
aspirated and Heh Doachashmee (ھ), e.g. ب [b] + ھ [h]
= بھ [bʰ], ج [ʤ] + ھ [h] = جھ [ʤʰ], etc. Urdu has 15
aspirated consonants (Malik et al., 2008). Aspirated
Urdu consonants are given in Table 4.

Sr. Symbol Sr. Symbol Sr. Symbol
1 بھ [bʰ] 6 چھ [ʧʰ] 11 کھ [kʰ]
2 پھ [pʰ] 7 دھ [ḓʰ] 12 گھ [gʰ]
3 تھ [ṱʰ] 8 ڈھ [ɖʰ] 13 لھ [lʰ]
4 ٹھ [ʈʰ] 9 رھ [rʰ] 14 مھ [mʰ]
5 جھ [ʤʰ] 10 ڑھ [ɽʰ] 15 نھ [nʰ]
Table 4: Aspirated Urdu Consonants

In addition to consonants, Urdu has 10 vowels and 7
of them also have their nasalized forms (Malik et al.,
2008; Hussain, 2004). They are represented with the
help of four long vowels (Alef Madda (آ), Alef (ا),
Waw (و) and Yeh (ی)) and three short vowels (Arabic
Fatha ◌َ, Damma ◌ُ and Kasra ◌ِ). The representation of
a vowel is context-sensitive, i.e. a vowel may be
written in two or more ways according to the context in
a word, e.g. the vowel sound [ə] is represented by Alef
(ا) + Zabar (◌َ) at the start of a word and by Zabar (◌َ)
in the middle of a word. The vowel sound [ə] never
comes at the end of a word. Nasalization of a vowel is
marked with Noon-ghunna (ں) and Noon (ن) at the end
and in the middle of a word respectively (Malik et al.,
2008). For more details, see (Malik et al., 2008).

Urdu contains 15 diacritical marks. They represent
vowel sounds, except Hamza-e-Izafat (◌ٔ) and Kasr-e-
Izafat (◌ِ), which are used to build compound words,
e.g. اِدارۂ سائنس [ɪḓɑrəhɪsɑɪns] (Institute of Science),
تاریخِ پیدائش [tɑrixɪpedɑɪʃ] (date of birth), etc.
Shadda (◌ّ) is used to geminate a consonant, e.g.
ربّ [rəbb] (God), اچّھا [əʧʧʰɑ] (good), etc. Sukun (◌ْ)
is used to mark the absence of a vowel after the base
consonant (Platts, 1909; Malik et al., 2008).

Languages of Table 1 share Perso-Arabic
punctuation and special symbols. These punctuation
marks and symbols are given in Table 5.

Sr. Symbol Unicode Sr. Symbol Unicode
1 ، 060C 10 ؏ 060F
2 ؛ 061B 11 ◌ؐ 0610
3 ؟ 061F 12 ◌ؑ 0611
4 ۔ 06D4 13 ◌ؒ 0612
5 ؀ 0600 14 ◌ؓ 0613
6 ؁ 0601 15 ◌ؔ 0614
7 ؂ 0602 16 ◌ؕ 0615
8 ؃ 0603 17 ٪ 066A
9 ؎ 060E
Table 5: Punctuation Marks and other Symbols

Urdu has a numeral system that is derived from
Persian. It assigns the same Unicode values as Persian,
ranging over 06F0 – 06F9, but employs different shapes
for the numbers 4, 5 and 7. They are shown in Table 6.

Sr. Symbol Unicode Sr. Symbol Unicode
1 ۰ 06F0 6 ۵ 06F5
2 ۱ 06F1 7 ۶ 06F6
3 ۲ 06F2 8 ۷ 06F7
4 ۳ 06F3 9 ۸ 06F8
5 ۴ 06F4 10 ۹ 06F9
Table 6: Urdu Numerals

3.2. Punjabi

Punjabi is written in two mutually incomprehensible
scripts: one is a derivation of the Perso-Arabic script
(called Shahmukhi) used in Pakistan and the other is
Gurmukhi, used in India. The Punjabi (Shahmukhi)
alphabet is a superset of the Urdu alphabet and has one
additional non-aspirated consonant, Rnoon [ɳ]
(Malik, 2005; Malik, 2006). The rest is the same as
Urdu. Punjabi is also traditionally written in Nasta’leeq
style. For more details on the Punjabi (Shahmukhi)
alphabet see (Malik, 2005; Malik, 2006).

3.3. Pashto

Like Persian, Pashto does not have aspiration.
Heh Gol (ہ) takes the shape of Heh Doachashmee (ھ)
when it comes at the start or middle of a ligature.
Although the Urdu/Punjabi retroflex sounds exist in
Pashto, Pashto employs different characters for
them. Table 7 gives a shape comparison of retroflex
consonants in Pakistani languages.

IPA | Urdu, Baluchi, Kashmiri | Punjabi | Pashto | Sindhi
ʈ | ٹ | ٹ | ټ | ٽ
ɖ | ڈ | ڈ | ډ | ڊ
ɽ | ڑ | ڑ | ړ | ڙ
ɳ | - | Rnoon | ڼ | ڻ
Table 7: Comparison of Retroflex Consonants

In Pashto, there exist five different kinds of Yeh.
One is employed as a consonant and the others
represent different vowel sounds. They are shown in
Figure 8.
~ [j], ~ [i], ې [e], ~ [əy], ٸ [ə]
Figure 8: Five Yehs of Pashto

Pashto has 39 consonants and uses the same Persian
number system without any change. The vowel system
of Pashto is also context-sensitive and is represented
with the help of long vowels and diacritical marks.
Pashto is traditionally written in Naskh style. Table 8
shows the remaining Pashto characters that are not
present in Urdu or have a different shape than in Urdu.

Sr. Symbol Unicode Sr. Symbol Unicode
1 ځ [dz] 0681 4 ښ [ȿ] 069A
2 څ [ts] 0685 5 ګ [g] 06AB
3 ږ [ȥ] 0696
Table 8: Pashto Characters

3.4. Sindhi

Sindhi has 40 non-aspirated consonants and 11
aspirated consonants. In Sindhi, aspiration is done in
different ways, e.g. for the aspiration of Jeem (ج) it
uses Heh Doachashmee (ھ) like Urdu and Punjabi, for
the aspiration of Beh (ب) it introduces a new character
with four dots below, that is ڀ, for the aspiration of
Dal (د) it also introduces a new character with two
horizontal dots, that is ڌ, for the aspiration of Sindhi
Tteh (ٽ) it introduces a new character with two vertical
dots, that is ٺ, etc. Sindhi aspirated and non-aspirated
consonants that are not present in Urdu or have
different shapes than in Urdu are given in Table 9.

Sr. Symbol Unicode Sr. Symbol Unicode
1 ٻ [ɓ] 067B 12 ڍ [ɖʰ] 068D
2 ڀ [bʰ] 0680 13 ڙ [ɽ] 0699
3 ٿ [ṱʰ] 067F 14 ڙھ [ɽʰ] -
4 ٽ [ʈ] 067D 15 ڦ [pʰ] 06A6
5 ٺ [ʈʰ] 067A 16 ڪ [k] 06AA
6 ڄ [ʄ] 0684 17 ک [kʰ] 06A9
7 ڃ [ɲ] 0683 18 ڳ [ɠ] 06B3
8 ڇ [ʧʰ] 0687 19 ڱ [ŋ] 06B1
9 ڌ [ḓʰ] 068C 20 ڻ [ɳ] 06BB
10 ڊ [ɖ] 068A 21 ي [j] 064A
11 ڏ [ɗ] 068F
Table 9: Aspirated and Non-aspirated Sindhi Consonants

Sindhi has 51 consonants and 16 vowels that are also
context-sensitive.
Pashto and Sindhi are both traditionally written in
Naskh and nobody has done their analysis for
Nasta’leeq style. We are doing the analysis because
they could also be written in Nasta’leeq, just like
Arabic, which is also traditionally written in Naskh, but
we can find very fine and beautiful old manuscripts of
Arabic and Qur’an in the Indian sub-continent that are
written in Nasta’leeq style. Thus it is worthwhile to
provide an analysis of Nasta’leeq for Pashto and Sindhi
and to give the Pashto and Sindhi speaking
communities an opportunity to write their languages in
Nasta’leeq.

3.6. Kashmiri

Kashmiri employs the Urdu alphabet with a few
additions to represent its specific vowels. Kashmiri has
two additional Yehs, one with an oval below and the
other with a ‘v’ mark above (ێ). It also has two
additional Waws, one with a circle at the ending tail
(ۄ) and the other with a ‘v’ mark above (ۆ). In
diacritical marks, it adds two diacritical marks (slightly
modified Hamza (ء) mark), coming above and below
the character. The extra characters of Kashmiri are
shown in Table 10. It is also traditionally written in
Nasta’leeq style.

Sr. Symbol Unicode Sr. Symbol Unicode
1 ~ [] - 4 ێ [e] 06CE
2 ۄ [ɔ] 06C4 5 6◌* [ə] -
3 ۆ [o:] 06C6
* This diacritical mark comes above and below the characters.
Thus it represents two diacritical marks.
Table 10: Kashmiri Characters

4. Analysis of Nasta’leeq

The rendering of Pakistani languages in Nasta’leeq
is very complex because the shape and position of a
character depend not only upon its position (at the
start, in the middle or at the end) in the word but also
upon the surrounding characters in the word.
The 4-shape analysis given in Section 2 is not
sufficient to handle Nasta’leeq rendering of Pakistani
languages. Nasta’leeq is inherently context-sensitive.
Figure 9 shows different context-sensitive shapes of
the character Beh.
Figure 10: Recursive Nature of Nasta’leeq
To ease the analysis, we can divide characters into
different groups on the basis of similarity in shapes,
e.g. the set of characters shown in Figure 11 can be
grouped under the name Beh_Family.
Figure 11: Beh_Family Members
The basic shape of each character of Figure 11 is
exactly the same except for their Noktas (dots or
marks) above or below. Similarly, we can divide all
other characters into different groups. All the different
groups of characters are given in Table 11.

Sr. Name Members
1 Alef Z Z ٲ ٳZ W Z
2 Beh ^ \ [ \ ] ^ _ ] [ \ ٿ
3 Jeem ڃ ڇb څb c b b `
4 Dal e ڊ ڏe p f e e
5 Reh ړ ږ ڙg i g g
6 Seen k l k
7 Sad n m
8 Toain p o
9 Ain r q
10 Feh s s
11 Qaf t
12 Kaf v ڪ ڳu v ڪu
13 Lam w
14 Meem x
15 Noon ڼ ڻy y y 11 " 28 F◌ 45 ٜ◌
16 Waw ۆz z 12 " 29 E ◌ 46 E ◌
17 Heh { 13 " 30 F ◌ 47 E ◌
18 Heh-
| 14 31 F ◌ 48 G ◌
Doachashmee
15 " 32 E ◌ 49 E ◌
19 Hamza ء
16 33 G ◌ 50 6 ◌
20 Choti-Yeh ~ ~ ~ ~ ~ ~ ې
17 " 34 51 H ◌ ◌
21 Bari-Yeh }
Table 12: Ligatures and Diacritical Marks
Table 11: Character Families
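The grouping of characters into shape families can be sketched as a
simple lookup from a character to the name of its family. The sketch
below is illustrative only: the dictionary FAMILIES and the helpers
family_of and family_sequence are hypothetical names, and the member
lists are deliberately partial. They contain only Urdu letters whose
code points appear in Table 3, while Table 11 also lists Sindhi, Pashto
and Kashmiri members.

```python
# Partial, illustrative family table (assumption: Urdu members only).
FAMILIES = {
    "Beh": "\u0628\u067E\u062A\u0679\u062B",   # Beh, Peh, Teh, Tteh, Seh
    "Jeem": "\u062C\u0686\u062D\u062E",        # Jeem, Cheh, Hah, Khah
    "Dal": "\u062F\u0688\u0630",               # Dal, Ddal, Thal
    "Reh": "\u0631\u0691\u0632\u0698",         # Reh, Rreh, Zain, Zheh
    "Seen": "\u0633\u0634",                    # Seen, Sheen
    "Sad": "\u0635\u0636",                     # Sad, Dad
}

# Reverse index: character -> family name.
CHAR_TO_FAMILY = {
    ch: name for name, members in FAMILIES.items() for ch in members
}


def family_of(ch):
    """Return the family name of a character, or None if it is not listed."""
    return CHAR_TO_FAMILY.get(ch)


def family_sequence(word):
    """Replace every character of a word by the name of its family."""
    return [family_of(ch) for ch in word]


if __name__ == "__main__":
    # Beh followed by Seen: two different families.
    print(family_sequence("\u0628\u0633"))
```

Working with family names instead of individual characters is what keeps
the shape tables of the following sections small: characters of one
family share their base shape and differ only in their Noktas.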
In addition to all characters of Table 11, there exist
certain ligatures that are treated like independent
4.2. Two Characters Joining
characters in the Nasta’leeq writing style. They are
We will do the analysis of two characters joining in
given in Figure 12. They act like independent
reverse order, i.e. first we will identify the final shapes
characters that do not join with the following character
and then initial shapes for these final shapes. There are
in the ligature and have only two (independent and
two types of characters, one which has only two
final) shapes.
(independent and final) shapes. This group consists of
Ç = Z + v ،» = Z + u Ô = ﻻZ + w
Alef_Family, Dal_Family, Reh_Family, two characters
= Z + ڳ،Ç = Z + v » = Z + u from Choti-Yeh_Family, i.e. ~ (Alef Maskura) and ~
Figure 12: Ligatures 1
(Pashto yeh with tail) and Bari-Yeh (}). Some of these
characters have two final shapes depending on their
4.1. Independent Shapes
joining behavior with different families, e.g.
Reh_Family has two final shapes, one for Beh, Jeem,
All characters of Table 11, ligatures of Figure 12,
Kaf, Lam, Noon, Hamza and choti-yeh families and the
punctuation marks and special symbols of Table 5,
other for the rest of the families. Final shapes of these
Urdu Numerals of Table 6 and Arabic numerals are
families are given in Table 13.
independent characters. In addition to punctuation
Sr. Shape Example
marks of Table 5, other English punctuation marks like
[
6 6 î 6 [ [6
single quote, double quote, colon, etc. are also included 1
into Nasta’leeq. 2 ä ää ää[ ä¡ ä‡ äu[ ä[ä
There are certain special ligatures that are included
3 w u w w ww[ u¢ uu ww[ w[ w
in the Nasta’leeq, e.g. Allah ligature ()ﷲ, Muhammad
ligature ()ﷴ, etc. Other 23 two character ligatures are 4 Ì Ì ÌÌ Ì[ Ì Ìè ÌÏ Ì[ ÌÌ[
also included into the Nasta’leeq. In addition to all the Table 13: Final Shapes of Alef, Dal, Reh and two Yehs
above characters, Nasta’leeq also has a large set of Final Shapes of rest of families are given in Table 14.
Sr. Shape Example
diacritical marks that contains diacritical marks of
Arabic, Persian, Urdu, Punjabi, Pashto, Sindhi, 1 › Ý Ý Ý[ › ÝÝ[ Ý[
Baluchi, and Kashmiri. All these ligatures and [
2 à à Û r [[ à[
diacritical marks are given in Table 12. [
Sr. Symbol Sr. Symbol Sr. Symbol 3 õ Ò õ õ õõ õm ÒÒ õf[ õ[ õ
ﷸ H◌ [
1 18 35 4 ù ù ú ùú ùä ùÓ ùU[ ùú[
2 ﷲ 19 " 36 W ◌
5 o o o o[o o o‰ oy[ oo[
3 ﷴ 20 37 Y ◌ [
4 ﷺ 21 " 38 Y ◌
6 ú ú ú úú úä úÓ úU[ úú[
5 " 22 39 G ◌ 7 p p p p p[p p pŠ pz[ p[p
6 9 23 " 40 E ◌ 8 Þ Þ Þ Þ[ Þ Þm ÞÉ Þh[ Þ[Þ
7 D 24 ۓ 41 H ◌ 9 H HH HH[ H‹ H‹ H[ HH
8 L 25 { 42 D ◌ [ [
10 I I I II IÝ IË I^[ I[ I
9 P 26 ؤ 43 D ◌ [
D Ì D ◌
11 ß ßà ßà ßÛ ßr ß [o ßà[
10 27 44
12 á á á áá[ á áÍ áa[ á[ á 18 Ì Choti-Yeh_Fina
* °
13 â â â â â[â âß âÎ â<[ â[â 19 Bari-yeh_Fina
20 D La_Fina
14 E EE EE[ E£ E Eb EE
[ [ 21 Ï [ Ka_Fina
15 J JJ JJ[ Jm J5 JY[ JJ[ * Behinit family with Bari-yeh is stored as
16 Z Zð Zó Zá ZÐ Z° ligatures
[ Table 15: Initial Shapes of Beh and Jeem Families
17 ﻼ ﻼ [ﻼ ﻼ &ﻼ6 ﻼD[ [ﻼ With 21 initial shapes of all families, all possible
18 ‘ ‘ ‘ Ï ‘[Ï ‘7 ‘f ‘[[ ‘Ï[ two character ligatures can be represented in
Nasta’leeq. Kaf and Lam families do not have an initial
Table 14: Final Shapes
shape for Alef because these pairs of characters are
Hamza ( )ءdoes not have a final shape. Thus there stored as ligatures, shown in Figure 12.
are 22 final families depending upon their final shapes,
given in Table 13 and 14. 4.3. Three Characters Joining
The above two tables not only give us the final
shapes of all the families of Table 11 and of ligatures Final shapes have already been identified in the
of Figure 12 (La ﻻand Ka family ÔÇ Ô» ÔÇ ،»), they previous section. Similar to the initial shapes, 21
also give us the analysis of initial shapes of Beh, Jeem, medial shapes are identified for the final shape
Seen, Sad, Noon and Choti-yeh families. The analysis families. Medial shapes of Behmedi and Jeemmedi
of initial shapes of Beh, Noon, Hamza and Choti-yeh families for final families are given in Table 16.
family shows that they have the same base form for the Behmedi Jeemedi
Sr. Final Families
initial shape with variations in Noktas which is clear Shape Shapes
from the above examples. It is also clear from the 1 3 = Alef_Fina
above examples that the initial form for final shapes of
Sad and Ain families are the same. Thus the Behinit 2 » 1 Beh_Fina
family (including initial forms of Beh, Noon, Hamza 3 â r Jeem_Fina
and Choti-yeh families) has 21 initial shapes, given in
Table 15. 4 s = Dal_Fina
Sr.
Behinit Jeeminit
Final Families 5 3 p Reh_Fina
Shape Shapes
1 6 Alef_Fina 6 1 Seen_Fina
2 Ý Beh_Fina 7 1 Sad_Ain_Fina
3 à [ Jeem_Fina
8 Tah_Fina
4 ä u Dal_Fina
9 Feh_Fina
5 w w Reh_Fina
6 õ f Seen_Fina 10 Qaf_Fina
Kaf_Fina,
7 ú U Sad_Ain_Fina 11 A =
Gaf_Fina
8 o y Tah_Fina
12 ‚ = Lam_Fina
9 p z Feh_Fina
10 Þ h Qaf_Fina
13 r Meem_Fina
11 H Kaf_Fina 14 $ Noon_Fina
12 I ^ Lam_Fina 15 â Å Waw_Fina
13 à o Meem_Fina 16 ø q Hehgol_Fina
14 á a Noon_Fina Heh-
15 â < Waw_Fina 17 P ˆ doachashmee_Fi
16 E b Hehgol_Fina na
Heh- 18 ˜ Choti-Yeh_Fina
17 J Y doachashmee_Fin ,
19 Bari-yeh_Fina
a
20 3 = La_Fina Now first we need to identify medial shapes that will
join with the already identified medial shapes.
21 6 Ka_Fina Secondly, we need to identify initial shapes that will
Table 16: Medial Shapes of Beh and Jeem Families join with newly identified medial shapes in the first
Behmedi shapes can be grouped into four different step and this will complete our joining analysis.
families according to the joining behavior with the The process of identifying the new medial shapes is
previous character. This is clearly shown in Table 17. the same as that we have used to identify the initial
shapes for the first 21 medial shapes. Similar to the
Name of Family Shape Members Behinit family, Behmedi family also have four new
1, 2, 4, 7, 8, 9, 10, 11, 12, 15, shapes for its first 21 members and one shape for the
Behmedi1
16, 19, 20, 21, 24, 25, 28, 29 Jeemmedi family. All additional medial shapes of the
Behmedi2 3, 13, 17, 18, 26, 30 Behmedi and Jeemmedi families, identified for medial
Behmedi3 6, 14, 22, 23, 27 shapes, are given in Table 19.
If we look at Table 17, then we will come to know
Behmedi4 5
that Behmedi2 family includes the medial shapes # 26
Table 17: Behmedi Families and 30. Thus, fortunately, we do not have new initial
For the families of Table 17, we need four initial shapes for these newly identified medial shapes of
shapes of each character that has an initial shape. Thus Table 19 and this completes our analysis.
the Behinit family has four new shapes for the Ligatures longer than 4 can be built using
Behmedi family and one shape for the Jeemmedi recursively the shapes already identified. It is shown in
family. All additional initial shapes of the Behinit and Figure 10. Hence, we have 1 or 2 final shapes, 30
Jeeminit families, identified for medial shapes, are initial shapes and 30 medial shapes of the characters in
given in Table 18. Pakistani languages. Hence we need 996 shapes of
Thus now, we have 30 initial shapes and 21 medial characters to represent Pakistani languages in
shapes that may represent all possible ligatures of Nasta’leeq style.
length three of Pakistani languages in Nasta’leeq style. Behinit Jeeminit
It is not possible to list all shapes of all characters due Sr.
Shape Shapes
Medial Families
to space shortage.
Behinit Jeeminit
22 6 Behmedi1
Sr. Medial Families
Shape Shapes 23 Behmedi2
22 3 " Behmedi1
24 6 Behmedi3
23 P / Behmedi2
25 ø Behmedi4
24 $ Behmedi3
26 Jeemmedi
25 3 3 Behmedi4
27 6 Seenmedi
26 = D Jeemmedi Sadmedi, Tahmedi,
28
27 Seenmedi Ainmedi, Fehmedi
Kafmedi, Gafmedi,
Sadmedi, Tahmedi, 29 : 6
28 Ï Lammedi
Ainmedi, Fehmedi
Meemmedi, Hehgolmedi,
Kafmedi, Gafmedi, 30
29 o Heh-doachashmeemedi
Lammedi Table 19: More Medial Shapes of Beh and Jeem Families
Meemmedi,
30 Hehgolmedi, Heh-
doachashmeemedi
5. Context Sensitive Substitution Grammar
Table 18: More Initial Shapes of Beh and Jeem Families
The analysis given in the Section 4 can be
represented in the Context-Sensitive Substitution
4.4. Four Characters Joining
Grammar. Figure 13 shows some rules of the
contextual substitution grammar of Nasta’leeq.
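Rules of the kind listed in Figure 13 can be read as a two-step
procedure: every letter first receives its default initial, medial or
final shape, and a shape is then replaced when the shape that follows it
matches the context of a contextual rule. The Python sketch below
illustrates this reading under several assumptions that are not stated
in the paper: rules are applied from the final shape backwards, the
final shape of Jeem is written jeemfina1, Noktas such as aiknoktabelow
are ignored, and only the three contextual rules of Figure 13 are
modelled. The names contextual_index and shape_ligature are
hypothetical. Membership of the Behmedi1 family is taken from Table 17.

```python
from typing import List, Optional

# Medial shapes that Table 17 lists as members of the Behmedi1 family.
BEHMEDI1_FAMILY = {
    f"behmedi{n}"
    for n in (1, 2, 4, 7, 8, 9, 10, 11, 12, 15, 16, 19, 20, 21, 24, 25, 28, 29)
}


def contextual_index(next_shape: str) -> Optional[str]:
    """Return the shape index selected by the shape that follows, if any."""
    if next_shape == "behfina1":
        return "2"
    if next_shape == "jeemfina1":
        return "3"
    if next_shape in BEHMEDI1_FAMILY:
        return "22"
    return None  # no contextual rule in this sketch: keep the default shape


def shape_ligature(letters: List[str]) -> List[str]:
    """Map the letter names ("beh", "jeem") of one ligature to shape names."""
    n = len(letters)
    if n == 1:
        return list(letters)  # independent shape, no substitution
    # Step 1: default positional rules (Initial, Medial, Final).
    positions = ["init"] + ["medi"] * (n - 2) + ["fina"]
    shapes = [f"{letter}{pos}1" for letter, pos in zip(letters, positions)]
    # Step 2: contextual rules, applied from the final shape backwards so
    # that each shape sees the already substituted shape that follows it.
    for i in range(n - 2, -1, -1):
        index = contextual_index(shapes[i + 1])
        if index is not None:
            shapes[i] = f"{letters[i]}{positions[i]}{index}"
    return shapes


if __name__ == "__main__":
    print(shape_ligature(["beh", "beh"]))
    print(shape_ligature(["beh", "beh", "beh"]))
```

A real font would express the same idea with lookup tables rather than
code, and would additionally need positioning information for every
shape and Nokta.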
We are doing our analysis in the reverse direction, Initial Rule
i.e. from left-to-right. In the analysis of three characters beh → behinit1 aiknoktabelow
joining, we have already identified the shapes of the jeem → jeeminit1 aiknoktabelow
last two characters of our four characters ligatures that No Context (Before | After)
are final shapes and medial shapes for our final shapes. Medial Rule
beh → behmedi1 aiknoktabelow
jeem → jeemmedi1 aiknoktabelow
No Context (Before | After)
Final Rule
beh → behfina1
jeem → jeemfina1
No Context (Before | After)
Contextual Substitution Rule for Behfina1
behinit1 → behinit2
jeeminit1 → jeeminit2
behmedi1 → behmedi2
jeemmedi1 → jeemmedi2
Context ( | behfina1)
Contextual Substitution Rule for Jeemfina1
behinit1 → behinit3
jeeminit1 → jeeminit3
behmedi1 → behmedi3
jeemmedi1 → jeemmedi3
Context ( | jeemfina1)
Contextual Substitution Rule for Behmedi1 Family
behinit1 → behinit22
jeeminit1 → jeeminit22
behmedi1 → behmedi22
jeemmedi1 → jeemmedi22
Context ( | <behmedi1 Family>)
Figure 13: Context-Sensitive Substitution Grammar

The Initial Rule states that Beh (ب) and Jeem (ج) will
be substituted by behinit1 and jeeminit1 respectively,
with the appropriate Nokta, whenever they come at the
initial position of a ligature. The Medial and Final rules
have the same kind of interpretation for the medial and
final positions respectively. The Contextual
Substitution Rule for Behfina1 states that the default
initial shapes behinit1 and jeeminit1 at the initial
position will be substituted by behinit2 and jeeminit2
when they are followed by a Behfina1. It also states
that the default medial shapes behmedi1 and jeemmedi1
at the medial position will be substituted with behmedi2
and jeemmedi2 when they are followed by a Behfina1.
The other rules have the same kind of interpretation.
Figure 13 shows a very small part of the Context-
Sensitive Substitution Grammar of Nasta’leeq. This
clearly shows the contextual nature and contextual
complexity of Nasta’leeq. Theoretically, the Context-
Sensitive Substitution Grammar is a computational
model of Nasta’leeq’s contextual substitution
complexity.

6. Conclusion

Nasta’leeq is a bidirectional, diagonal, non-
monotonic, cursive, highly context-sensitive and very
complex writing system for languages written in the
Perso-Arabic script like Urdu, Punjabi, Pashto, Sindhi,
Baluchi, Kashmiri, etc. The analysis of Nasta’leeq for
Pakistani languages is equally true for Arabic and
Persian for writing them in the Nasta’leeq style. The
analysis of Nasta’leeq and the Context-Sensitive
Substitution Grammar, discussed in this paper, can be
used to build a good font for Arabic, Persian, Urdu,
Punjabi, Pashto, Sindhi, Baluchi and Kashmiri
languages to write them in the Nasta’leeq style.
The practical implementation of a character-based
Nasta’leeq font for Arabic, Persian and Pakistani
languages is a much more complex process than its
theoretical analysis. A practical development of a
character-based Nasta’leeq font for the said languages
not only needs the Context-Sensitive Substitution
Grammar, but also requires other important
information about positioning, to correctly position
characters considering their contexts. Just to give an
idea of the practical complexity, the Initial Rule of
Figure 13 substitutes Beh (ب) with its initial shape
behinit1 and aiknoktabelow (a dot below the initial
shape) but it does not give any idea about the position
of the Nokta. In the other rules given in Figure 13, we
are substituting shapes but we do not give any idea
about their positions, i.e. whether they join properly
with the shapes in the context before and after.
Positioning of Noktas and diacritical marks with
respect to all shapes is another complex problem for a
practical Nasta’leeq font for languages written in the
Perso-Arabic script.

7. References

T. Rahman, “Language Policy and Localization in
Pakistan: Proposal for a Paradigmatic Shift”, in Proc.
Crossing the Digital Divide, SCALLA Conference on
Computational Linguistics, 5 – 7 January, 2004.
B. F. Grimes, “Pakistan”, Ethnologue: Languages of the
World, 14th Edition, Dallas, Texas: Summer Institute of
Linguistics, 2000.
M. Afzal, S. Hussain, “Urdu Computing Standards:
Development of Urdu Zabta Takhti (UZT) 1.01”, in Proc.
INMIC-2001, Lahore, 2001.
Z. Khaver, “Standard Code Table for Urdu”, in Proc. 4th
Symposium on Multilingual Information Processing (MLIT-
4), Yangon, Myanmar, CICC, Japan, 1999.
M. G. Abbas Malik, “Towards a Unicode Compatible
Punjabi Character Set”, in Proc. 27th Internationalization and
Unicode Conference, Berlin, Germany, 2005.
M. G. Abbas Malik, “Punjabi Machine Transliteration”,
in Proc. 21st International Conference on Computational
Linguistics COLING-06 and 44th Annual Meeting of ACL,
Sydney, Australia, 2006.
J. T. Platts, “A Grammar of the Hindustani or Urdu
Language”, Crosby Lockwood and Son, 7 Stationers Hall
Court, Ludgate Hill, London, E.C., 1909.
M. G. Abbas Malik, Christian Boitet, Pushpak
Bhattacharyya, “Hindi Urdu Machine Transliteration using
Finite-state Transducers”, in Proc. 22nd International
Conference on Computational Linguistics COLING-08,
Manchester, UK, 2008.
S. Hussain, “Letter to Sound Rules for Urdu Text to
Speech System”, in Proc. of Workshop on “Computational
Approaches to Arabic Script-based Languages”, COLING-
04, Geneva, Switzerland, 2004.
A. Wali, S. Hussain, “Context Sensitive Shape-
Substitution in Nastaliq Writing System: an analysis and
formulation”, in Proc. of “International Joint Conferences on
Computer, Information and Systems Sciences and
Engineering”, 2006.