Beruflich Dokumente
Kultur Dokumente
1 / 60
Preview
Urdu Transliterator
Syntax
Semantics
2 / 60
Urdu
Urdu is
3 / 60
Urdu
Urdu is
a South Asian language spoken primarily in Pakistan and India
3 / 60
Urdu
Urdu is
a South Asian language spoken primarily in Pakistan and India
descended from (a version of) Sanskrit (sister language of Latin)
3 / 60
Urdu
Urdu is
a South Asian language spoken primarily in Pakistan and India
descended from (a version of) Sanskrit (sister language of Latin)
structurally identical to Hindi (spoken mainly in India)
3 / 60
Urdu
Urdu is
a South Asian language spoken primarily in Pakistan and India
descended from (a version of) Sanskrit (sister language of Latin)
structurally identical to Hindi (spoken mainly in India)
together with Hindi the fourth most spoken language in the world
( 250 million native speakers)
3 / 60
4 / 60
4 / 60
4 / 60
4 / 60
4 / 60
Context of Work
5 / 60
Context of Work
5 / 60
Context of Work
5 / 60
Context of Work
5 / 60
Context of Work
5 / 60
Context of Work
5 / 60
Context of Work
5 / 60
Context of Work
5 / 60
6 / 60
7 / 60
'see<[1:Nadya], [113:book]>'
PRED 'Nadya'
CHECK _LEX-SOURCE morphology, _PROPER known-name
SUBJ
OBJ
SPEC
DET
PRED
'the'
DET-TYPE def
_SUBCAT-FRAME V-SUBJ-OBJ
TNS-ASP MOOD indicative, PERF -_, PROG -_, TENSE past
57 CLAUSE-TYPE decl, PASSIVE -, VTYPE main
7 / 60
8 / 60
'dEkH<[1:nAdiyah], [20:kitAb]>'
PRED
'nAdiyah'
_NMORPH obl
CHECK
SUBJ
NTYPE
SEM-PROP SPECIFIC +
1 CASE erg, GEND fem, NUM sg, PERS 3
PRED
OBJ
CHECK
'kitAb'
LEX-SEM AGENTIVE +
TNS-ASP ASPECT perf, MOOD indicative
42 CLAUSE-TYPE decl, PASSIVE -, VTYPE main
8 / 60
'dEkH<[1:nAdiyah], [20:kitAb]>'
PRED
'nAdiyah'
_NMORPH obl
CHECK
SUBJ
NTYPE
SEM-PROP SPECIFIC +
1 CASE erg, GEND fem, NUM sg, PERS 3
PRED
OBJ
CHECK
'kitAb'
LEX-SEM AGENTIVE +
TNS-ASP ASPECT perf, MOOD indicative
42 CLAUSE-TYPE decl, PASSIVE -, VTYPE main
'dEkH<[1:nAdiyah], [20:kitAb]>'
PRED
'nAdiyah'
_NMORPH obl
CHECK
SUBJ
NTYPE
SEM-PROP SPECIFIC +
1 CASE erg, GEND fem, NUM sg, PERS 3
PRED
OBJ
CHECK
'kitAb'
LEX-SEM AGENTIVE +
TNS-ASP ASPECT perf, MOOD indicative
42 CLAUSE-TYPE decl, PASSIVE -, VTYPE main
9 / 60
10 / 60
10 / 60
10 / 60
10 / 60
10 / 60
10 / 60
Possible Applications
11 / 60
Possible Applications
Large-Coverage, Deep Computational Grammars can be useful for:
11 / 60
Possible Applications
Large-Coverage, Deep Computational Grammars can be useful for:
Meaning-Sensitive Applications
11 / 60
Possible Applications
Large-Coverage, Deep Computational Grammars can be useful for:
Meaning-Sensitive Applications
Web-Search
11 / 60
Possible Applications
Large-Coverage, Deep Computational Grammars can be useful for:
Meaning-Sensitive Applications
Web-Search
Question-Answering
11 / 60
Possible Applications
Large-Coverage, Deep Computational Grammars can be useful for:
Meaning-Sensitive Applications
Web-Search
Question-Answering
Knowledge Representation
11 / 60
Possible Applications
Large-Coverage, Deep Computational Grammars can be useful for:
Meaning-Sensitive Applications
Web-Search
Question-Answering
Knowledge Representation
Text Summarization
11 / 60
Possible Applications
Large-Coverage, Deep Computational Grammars can be useful for:
Meaning-Sensitive Applications
Web-Search
Question-Answering
Knowledge Representation
Text Summarization
Machine Translation
11 / 60
Possible Applications
Large-Coverage, Deep Computational Grammars can be useful for:
Meaning-Sensitive Applications
Web-Search
Question-Answering
Knowledge Representation
Text Summarization
Machine Translation
Computer-Assisted Language Learning
11 / 60
12 / 60
12 / 60
12 / 60
12 / 60
12 / 60
12 / 60
13 / 60
13 / 60
13 / 60
morphology (fst)
13 / 60
morphology (fst)
13 / 60
morphology (fst)
13 / 60
morphology (fst)
Overview
Overall Architecture
tokenizer
morphology (fst)
14 / 60
Urdu Transliterator
Urdu
Hindi
Romanized Script
(the XLE grammar)
Right now we are working on the Urdu-Roman transliterator.
15 / 60
Urdu Transliterator
Transliteration scheme
An excerpt from our scheme table:
Unicode Urdu character
Latin letter
in transliteration scheme
b
Phoneme
/p/
/t/
//
/j/
>
//
/b/
16 / 60
Urdu Transliterator
17 / 60
Urdu Transliterator
parsing
Urdu script:
ASCII:
bA
generating
17 / 60
Urdu Transliterator
parsing
Urdu script:
ASCII:
bA
generating
17 / 60
Urdu Transliterator
parsing
Urdu script:
ASCII:
bA
generating
17 / 60
Urdu Transliterator
18 / 60
Urdu Transliterator
18 / 60
Urdu Transliterator
Transliterator
Morphology
XLE
Input
Output
Input
Output
...
kitAb
kitAb
kitAb+Noun+Fem+Sg+Count
...
18 / 60
Urdu Transliterator
Example
The transliterator at this position works quite well:
Urdu Transliterator
Roman script
ba
bi
bu
20 / 60
Urdu Transliterator
Roman script
ba
bi
bu
20 / 60
Urdu Transliterator
Roman script
ba
bi
bu
20 / 60
Urdu Transliterator
Consequences
letter combination
ktAb
representation
kitAb
...
translation
book
21 / 60
Urdu Transliterator
Consequences
letter combination
ktAb
representation
kitAb
...
translation
book
21 / 60
Urdu Transliterator
Consequences
letter combination
ktAb
representation
kitAb
...
translation
book
(demo)
21 / 60
Urdu Transliterator
Solution
In order to restrict this overgeneration the possible letter combinations
need to be constrained:
22 / 60
Urdu Transliterator
Solution
In order to restrict this overgeneration the possible letter combinations
need to be constrained:
which vowels are actually allowed to cooccur?
ai, but not ia?
22 / 60
Urdu Transliterator
Solution
In order to restrict this overgeneration the possible letter combinations
need to be constrained:
which vowels are actually allowed to cooccur?
ai, but not ia?
which consonants are actually allowed to cooccur?
initial kr, but not gr?
22 / 60
Urdu Transliterator
Solution
In order to restrict this overgeneration the possible letter combinations
need to be constrained:
which vowels are actually allowed to cooccur?
ai, but not ia?
which consonants are actually allowed to cooccur?
initial kr, but not gr?
certain combinations with semi-vowels or consonants are not allowed:
a short vowel followed by v may not be followed by u or i
22 / 60
Urdu Transliterator
Solution
In order to restrict this overgeneration the possible letter combinations
need to be constrained:
which vowels are actually allowed to cooccur?
ai, but not ia?
which consonants are actually allowed to cooccur?
initial kr, but not gr?
certain combinations with semi-vowels or consonants are not allowed:
a short vowel followed by v may not be followed by u or i
certain positions are prohibited:
a word can never end in a short vowel or begin with a short vowel
that is only represented with a diacritic
22 / 60
Urdu Transliterator
Solution
write rules and filters out of these constraints and apply them to the
transliterator
(demo)
23 / 60
Urdu Transliterator
Solution
write rules and filters out of these constraints and apply them to the
transliterator
(demo)
Problem: these rules cannot be found in the literature - they are a
product of extensive manual labor
23 / 60
Urdu Transliterator
Solution
write rules and filters out of these constraints and apply them to the
transliterator
(demo)
Problem: these rules cannot be found in the literature - they are a
product of extensive manual labor
However, the transliterator works quite well now
Some sentences are still a little slow (but I keep looking for possible
restrictions)
continue with generation of Urdu and the Hindi transliterator
23 / 60
Urdu Transliterator
Overview
Overall Architecture
tokenizer
morphology (fst)
24 / 60
Syntax
Syntax
25 / 60
Syntax
Syntax
25 / 60
Syntax
Syntax
25 / 60
Syntax
Syntax
25 / 60
Syntax
Syntax
25 / 60
Syntax
Syntax
25 / 60
Syntax
Syntax
"nAdiyah hansI"
ROOT
CS 1:
PRED
'hans<[1:nAdiyah]>'
PRED
'nAdiyah'
NTYPE
KP VCmain
NP
hansI
SUBJ
NSYN proper
SEM-PROP SPECIFIC +
1 CASE nom, GEND fem, NUM sg, PERS 3
CHECK
nAdiyah
26 / 60
Syntax
Syntax
"nAdiyah hansI"
ROOT
CS 1:
PRED
'hans<[1:nAdiyah]>'
PRED
'nAdiyah'
NTYPE
KP VCmain
NP
hansI
SUBJ
NSYN proper
SEM-PROP SPECIFIC +
1 CASE nom, GEND fem, NUM sg, PERS 3
CHECK
nAdiyah
26 / 60
Syntax
Syntax
"nAdiyah hansI"
ROOT
CS 1:
PRED
'hans<[1:nAdiyah]>'
PRED
'nAdiyah'
NTYPE
KP VCmain
NP
hansI
SUBJ
NSYN proper
SEM-PROP SPECIFIC +
1 CASE nom, GEND fem, NUM sg, PERS 3
CHECK
nAdiyah
Syntax
LFG implementation
Conclusion
27 / 60
Syntax
Extraction from DP
(2) a.
Er hat viele B
ucher u
ber Logik gekauft.
He has many books on logic bought
He has bought many books about logic.
b. B
ucher u
ber Logik hat er viele gekauft.
c. Uber
Logik hat er viele B
ucher gekauft. (German)
(3) mantiq=par nidA=nE Ek kitAb
logic=Loc.on Nida=Erg one book.F.3Sg
xarId-I
he.
buy-Perf be.Pres
Nida has purchased a book on logic. (Urdu)
28 / 60
Syntax
Quantifier Float
29 / 60
Syntax
NP-internal discontinuity
Discontinuous NP
Discontinuous AP
30 / 60
Syntax
31 / 60
Syntax
Type of Argument
Dative Marked
(ii)
Ablative Marked
(iii)
Locative Marked
(iv)
Adpositional
Syntax
(6) a. istisnA
immunity
b.
muqaddamAt=sE istisnA
court-case.Pl=Abl immunity
immunity from court-cases
c.
muqaddamAt=sE AInI
istisnA
court-case.Pl=Abl constitutional immunity
constitutional immunity from court-cases
33 / 60
Syntax
(7) a. barIfiNg
briefing
b.
salAmtI=par barIfiNg
security=Loc briefing
briefing on security
c.
salAmtI=par tafsIlI barIfiNg
security=Loc detailed briefing
detailed briefing on security
34 / 60
Syntax
(8) a. mutAlbA
demand
b.
ArmI-cIf=sE
mutAlbA
army-chief=Abl demand
demand to the army-chief
c.
ArmI-cIf=sE
qAnUnI mutAlbA
army-chief=Abl legal
demand
legal demand to the army-chief
35 / 60
Syntax
36 / 60
Syntax
Hierarchical structure of AP in NP
NP
CS 1:
AP
KP
NP K
N sE
KP
AP
A istis2nA
NP K h2As3il AInI
muqaddamAt N kO
s3adr
Syntax
(10) a1.
ArmI-cIf=sE2 salAmtI=par1 barIfiNg1 =kA mutAlbA2
army-chief=Abl security=Loc.on briefing=Gen demand
The demand to the army chief for briefing on security
a2. [NP [KP ArmI-cIf=sE][KP [NP [KP salAmtI=par] barIfiNg]=kA]
mutAlbA]
b. salAmtI=par1 ArmI-cIf=sE2 barIfiNg1 =kA mutAlbA2
38 / 60
Syntax
39 / 60
Syntax
NP
KP/PP
A+
Spec(N)/Arg(N/A)
Arg-taking-adj
Arg-less-adj Head-noun
40 / 60
Syntax
Implementation Issues
41 / 60
Syntax
42 / 60
Syntax
43 / 60
Syntax
NP
CS 1:
KP
KP
AP
NP K
NP K
N kO
AP
A istis2nA
N sE h2As3il AInI
s3adr muqaddamAt
Figure: C-structure
44 / 60
Syntax
'istis2nA<[34:muqaddamah]>'
PRED 'muqaddamah'
CHECK _NMORPH obl
34 CASE inst, GEND masc, NUM pl
PRED
OBJ-GO
'h2As2il<[1:s3adr]>'
PRED 's3adr'
CHECK _NMORPH obl
NSEM COMMON count
NSYN common
1 CASE dat, GEND masc, NUM sg, PERS 3
NTYPE
ADJUNCT
CHECK
_RESTRICTED -
LEX-SEM GOAL +
attributive
39 ATYPE
49
PRED 'AInI'
ATYPE attributive
[39:h2As2il]
<s
44
Figure: F-structure
45 / 60
Syntax
Summary
Urdu is a typical language in which discontinuous NPs are found both at:
Clause-level
Constituent-level
Constituent-level discontinuity in Urdu can be implemented in LFG
framework by making use of:
Shuffle operator (,)
Non-deterministic operator ($)
Head-precedence operator (<h)
46 / 60
Syntax
Overview
Overall Architecture
tokenizer
morphology (fst)
47 / 60
Semantics
Intro
Aim: a large-coverage computational semantic analyzer on the basis of a
deep syntactic analysis
use f-structures as starting point
apply xfr semantic rules from f-structure facts to a semantic
representation (Crouch and King, 2006)
judgment on the semantic well-formedness of a sentence
The girl laughs. semantically well-formed
#The tree laughs. semantically ill-formed
Semantics
Intro
F-structure for nAdiyah hansI (Nadya laughed).
"nAdiyah hansI"
PRED
'hans<[1:nAdiyah]>'
PRED
'nAdiyah'
NTYPE
SUBJ
SEM-PROP SPECIFIC +
1 CASE nom, GEND fem, NUM sg, PERS 3
CHECK
Semantics
50 / 60
Semantics
51 / 60
Semantics
Hindi WordNet
Facts:
inspired in methodology and architecture by the English WordNet
(Fellbaum 1998)
52 / 60
Semantics
Hindi WordNet
about 3.900 verbs, 57.000 nouns, 13.700 adjectives and 1.300 adverbs
words are grouped according to their meaning similarity (synsets)
53 / 60
Semantics
Hindi WordNet
Issues
far less specific concepts than in the English WordNet
Hindi WordNet:
TOP Noun Inanimate
TOP Noun Inanimate
Object
Object
Artifact
Artifact
English WordNet:
entity physical entity object whole unit
piece of work publication book
entity physical entity object whole unit
furnishing piece of furniture table
kitAb
mez
artifact
creation
product
artifact
instrumentatlity
54 / 60
Semantics
out of 3.900 Hindi verbs, we have found 534 verbs in an Urdu verb
list (Humayoun, 2006)
complex predicates are included in Hindi WordNet, but not in the
Urdu wordlist
total of around 700 Urdu verbs more than 2/3 of Urdu verbs are
found
all found verbs seem to be valid
extract verb information from Hindi WordNet for the Urdu VerbNet
55 / 60
Semantics
Polysemy:
An extreme case - eat expressions in Hindi/Urdu (Hook and Pardeshi,
2009):
employing eat in idiomatic expressions
about 160 eat expressions for Hindi/Urdu
variety of uses due to loan translations from Persian
56 / 60
Semantics
Semantics
58 / 60
Semantics
Wrap up
challenges ahead
Demo
59 / 60
Semantics
Thank you!
60 / 60