Sanskrit Lexical Analyser

(A project module of the Universal Digital Library, IIIT-Allahabad)

Report by:
Rajeev Kumar
20031032,
IIIT-Allahabad, India.
Email: rkumar_03@iiita.ac.in
Guidance:
Dr. Ratna Sanyal
Dr. Sudip Sanyal
Dr. U.S. Tiwary
Indian Institute of Information Technology, Allahabad
Devghat, Jhalwa, Allahabad, U.P., India
Abstract:
Sanskrit is one of the earliest attested members of the Indo-European
language family. It is not only a classical language but also one of the
languages listed in the Eighth Schedule of the Constitution of India.
By classical we mean: ancient, of independent origin and rich in
literature.
It holds a position in India similar to that of Latin and Greek in Europe,
and is a central part of the Hindu/Vedic traditions. Many Indian languages
derive much of their vocabulary and structure from Sanskrit, and Paninian
grammar is still often regarded as the mother of all grammars.
Sanskrit has thus retained its position and charm in its original form.
The present project concerns the morphological and lexical analysis of
Sanskrit words. It attempts sandhi wichched (सन्धि विग्रह, splitting a joined
word into its parts) and the detection of vibhakti (विभक्ति) from a
shabdroop (declined word form).
For this I designed my own rule format by extending the basic rules.

Motivation & Objective
Motivation
People both in India and abroad are slowly but steadily realizing the
importance of ancient scripts in the diverse fields of science,
commerce and the arts. Also, as spiritual awareness spreads across the
world, great efforts are being made to present the Vedic scriptures in
different languages to the common people. Swami Dayanand Saraswati once
rightly said, “Back to the Vedas”.
Researchers in Germany have reportedly already developed a machine
translator from Sanskrit to German, and people at MIT are also working on
Sanskrit on a large scale. In India, work on Sanskrit is being carried out
at IIIT-Allahabad, the IITs, IIIT-Hyderabad and elsewhere.
Objective
The present project concerns the morphological and lexical analysis of
Sanskrit words. The lexical analyser may be applied in the Machine
Translation domain of Natural Language Processing (by mapping the parse
tree of a Sanskrit sentence onto that of another language). It may also be
used for information retrieval.
Lexicon analysis & NLP
[Figure: stages of analysing a sentence]
  Sentences
    -> morphological and lexical analysis
         sandhi:   root words
         vibhakti: gender, form, marker
    -> syntactic analysis (grammar and parse tree)
    -> semantic analysis (understanding the meaning)
    -> context analysis
How to proceed with morphological analysis?
Here we basically deal with finding the root word.
A given word may be composed of two words, so one may try to separate
them, e.g. रवीन्द्र = रवि + इन्द्र (raviindra = ravi + indra).
A marker may also be attached to the word, e.g.
  मुनिना (muninaa)   मुनिभ्याम् (munibhyaam)   मुनिभिः (munibhiH)
In the above case the markers are:
  muni -- naa
  muni -- bhyaam
  muni -- bhiH
So if we can somehow remove the marker, we get the root word.
Sandhi (सन्धि विग्रह)
• What is sandhi?
Sandhi refers to the combination of words when they are spoken one after
the other without a gap: the modification that arises between two sounds
because of their close proximity is called sandhi.
• Altogether there are 5 + 14 + 7 = 26 crisp rules on which sandhi
vigrah is carried out:
  5 for swar sandhi (स्वर सन्धि)
  14 for wyanjan sandhi (व्यंजन सन्धि)
  7 for wisarg sandhi (विसर्ग सन्धि)
• When sandhi words are formed, there are 3 cases to deal with:
  1) effect on the first word only (the last letter of the first word
     changes):
       सत् + जनः = सज्जनः
  2) effect on the second word only (the first letter of the second
     word changes):
       वृक्ष + छाया = वृक्षच्छाया
  3) effect on both words (the last letter of the first word and the
     first letter of the second word change):
       रवि + इन्द्र = रवीन्द्र
• I expanded those sandhi rules (from 26 to about 100).
e.g. अकः सवर्णे दीर्घः means: if an ak vowel is followed by a savarna
(homogeneous) vowel, the two vowels are together replaced by the long vowel.
  ak:             अ, आ, इ, उ, ऋ, ऌ
  savarna vowels: अ, आ, इ, ई, उ, ऊ, ऋ, ॠ, ऌ
I expanded this rule into the form:
  अ + अ = आ    (a + a = aa)
  इ + ई = ई    (i + ii = ii)
  ... and so on, i.e. each letter is dealt with independently.
• Rule format, i.e. the format in which the rules are stored in the
machine:
  effectOn--result--left--right
effectOn takes the values f, s, b:
  f: first
  s: second
  b: both
e.g. इ + इ = ई (i + i = ii), as in रवि + इन्द्र = रवीन्द्र.
Here both words (left and right) underwent changes, so the rule for this
case is:
  effectOn--result--left--right
  b--ii--i--i
i.e. for rav-ii-ndra: on finding "ii", append "i" to the left part
(rav + "i") and put "i" in front of the right part ("i" + ndra).
So you get ravi + indra.
Rule format in the rule file:
Consider again अकः सवर्णे दीर्घः, with
  ak:             अ, आ, इ, उ, ऋ, ऌ
  savarna vowels: अ, आ, इ, ई, उ, ऊ, ऋ, ॠ, ऌ
Each rule is stored in the text rule file as one line:
  effectOn--result--left--right
effectOn: a character
  f : the result is obtained by making changes in the first word
  s : the result is obtained by making changes in the second word
  b : the result is obtained by making changes in both words
result: a string
left:  OR-separated strings, string|string|......|string|
       (e.g. अ|आ|इ|उ|ऋ|ऌ)
right: OR-separated strings, in the same form
       (e.g. अ|आ|इ|ई|उ|ऊ|ऋ|ॠ|ऌ)
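As an illustration only, a rule line in this format could be parsed as follows. This is a minimal Python sketch, not the project's Java code; the names SandhiRule, split_or and parse_sandhi_rule are hypothetical, and the sketch assumes the four fields are separated by exactly "--" with no spaces.

```python
from typing import NamedTuple, Tuple


class SandhiRule(NamedTuple):
    effect_on: str           # 'f', 's' or 'b'
    result: str              # string seen in the joined word
    left: Tuple[str, ...]    # alternatives for the end of the first word
    right: Tuple[str, ...]   # alternatives for the start of the second word


def split_or(field: str) -> Tuple[str, ...]:
    """Split 'a|aa|A|' into ('a', 'aa', 'A'), dropping the empty tail."""
    return tuple(part for part in field.split("|") if part)


def parse_sandhi_rule(line: str) -> SandhiRule:
    effect_on, result, left, right = line.strip().split("--")
    if effect_on not in ("f", "s", "b"):
        raise ValueError("effectOn must be f, s or b: " + line)
    return SandhiRule(effect_on, result, split_or(left), split_or(right))


rule = parse_sandhi_rule("b--e--a|aa|A|--i|ii|I|")
print(rule.effect_on, rule.result, rule.left, rule.right)
```

Dropping the empty piece after the final "|" matters: an empty alternative would match every word ending or beginning.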
Storage used:
I have stored the rules in files indexed on the starting letter of the
rule's result part, i.e. the rule b--ii--i--i above is in the file "i.txt".
Similarly, words/root words are stored in files indexed on their starting
letter, i.e. raviindra is stored in "r.txt". Only one word per line
is stored in the corresponding file.
Sandhi algorithm (my approach)
Step 1: Receive a word for sandhi wichched.
Step 2: Iteratively try breaking the word into two parts (left and right),
e.g. if the word is kastu then:
  kast u
  kas tu
  ka stu
  k astu
and try sandhi wichched on each pair.
Note: sandhi vigrah means word = leftWord + rightWord, i.e. the ending of
leftWord and the start of rightWord are joined to form the sandhi. To detect
the effect of the joining, I assume that the affected letters are contained
in the right part. E.g. kastu = kaH + tu, so when "kastu" is received for
sandhi wichched, I take left as "ka" and right as "stu", and pass "ka" (left)
and "stu" (right) on for sandhi wichched.
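The break positions of Step 2 can be enumerated in a line of code. The following is a small Python sketch (the function name candidate_splits is hypothetical, and the project itself is written in Java); it lists the pairs in the same order as the kastu example, longest left part first.

```python
def candidate_splits(word):
    """Return every (left, right) break of word, longest left part first."""
    return [(word[:k], word[k:]) for k in range(len(word) - 1, 0, -1)]


for left, right in candidate_splits("kastu"):
    print(left, right)
```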
Step 3: Now we have two words: left and right (e.g. ka, stu).
1) Set a variable flag to false.
2) Read the rules from the rule file named after the starting letter of
   right (here s.txt, since "stu" starts with the letter s).
3) Try applying each of the sandhi rules in that rule file (here the
   rules in s.txt) to left and right (i.e. to ka and stu).
4) If one or more of the rules listed in the rule file is applicable,
   set flag to true.
Step 4: How to apply a rule?
a) Extract effectOn, result, left and right from the rule.
b) If the received secondString starts with result, try to apply the rule.
   Since result = left + right and the result is contained in secondString,
   extract the right part by removing result from the front of secondString.
   E.g. here secondString (stu) starts with result (s), so the rule
   f--s--H|--t|th| has a chance of performing sandhi wichched.
c) Separate left-result-right,
   e.g. ka-s-tu for kastu:
     leftGuess : ka
     result    : s
     rightGuess: tu
d) Check the certainty of the sandhi formation by applying the rule.
e) Check the other chances (probability) of sandhi formation:
   - if nothing is present in rightGuess, there is no sandhi, so return
     false;
   - if rightGuess, leftGuess and result are all present in the database,
     there is a chance of sandhi vigrah;
   - if only leftWord is present, there is still some chance of sandhi
     vigrah, since markers may not be present in the database.
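Steps (b), (c) and the first check of (e) amount to a prefix test followed by a three-way separation. A minimal Python sketch, with a hypothetical function name and no claim to match the project's Java applyRule:

```python
def separate(left_part, right_part, result):
    """Split right_part into result + rightGuess, or return None if the rule cannot apply."""
    if not right_part.startswith(result):
        return None                      # step (b): the rule's result is not found
    right_guess = right_part[len(result):]
    if not right_guess:
        return None                      # step (e): nothing left on the right, no sandhi
    return left_part, result, right_guess


print(separate("ka", "stu", "s"))
```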
How to check the certainty of sandhi (step 4.d)?
As stated earlier, when a sandhi is formed the effect may be on the first
word, on the second word, or on both words.
a) Effect on the first word
Rule example:
f--y--i|ii|I|--a|aa|A|u|uu|U|e|ai|o|au|aM|aH|RRi|R^i|R^I|L^i|L^I|
Here only the first word is affected and the second word remains as it is.
So append something (taken from the rule's left part) to leftGuess and make
no changes to rightGuess.
Since we are in the case effectOn == 'f':
- check whether rightGuess exists in the database, and check that the rule
  f--result--left--right could really have applied, i.e. that rightGuess
  starts with one of the strings in the right part of the rule, since both
  sandhi vigrah and sandhi formation (left + right = result) should be
  checked;
- then, depending on the presence of the left-hand side, declare the result
  (left + right = sandhi):
  - if rightGuess is already present in the database and leftGuess+append
    is also present, one can be sure of the sandhi vigrah;
  - even if leftGuess+append is NOT present in the database, if sandhi
    wichched is possible on leftGuess+append, one can still be sure of
    the sandhi vigrah;
  - even if the left word is not present at all, since rightGuess is
    present in the database there is still a chance of sandhi vigrah, so
    it is recorded in tryResult.txt.
e.g. gaNitakaavyayormadhuramelanam + avalokyate:
  gaNitakaavyayormadhuramelanam is not present in the database,
  but gaNitakaavyayoH + madhuramelanam is possible.
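The decision logic for the 'f' case can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the project's code: the function name and the labels "sure" and "probable" are hypothetical, in_db stands for the database lookup, and can_split stands for a recursive call to sandhi wichched on the left candidate.

```python
def check_effect_on_first(left_guess, right_guess, left_options, right_options,
                          in_db, can_split):
    """Classify each candidate split as 'sure' or 'probable'."""
    # The second word is unchanged, so it must be a known word that begins
    # with one of the strings in the rule's right part.
    if not in_db(right_guess):
        return []
    if not any(right_guess.startswith(r) for r in right_options):
        return []
    outcomes = []
    for append in left_options:
        candidate = left_guess + append
        if in_db(candidate) or can_split(candidate):
            outcomes.append((candidate, right_guess, "sure"))      # would go to result.txt
        else:
            outcomes.append((candidate, right_guess, "probable"))  # would go to tryResult.txt
    return outcomes


words = {"kaH", "tu"}
print(check_effect_on_first("ka", "tu", ["H"], ["t", "th"],
                            in_db=words.__contains__,
                            can_split=lambda w: False))
```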
b) Effect on the second word
Rule example: s--Dh--Sh|shh|T|Th|D|Dh|N|--dh|
Here only the second word is affected and the first word remains as it is.
So prefix something (taken from the rule's right part) to rightGuess and
make no changes to leftGuess.
Since we are in the case effectOn == 's':
- check whether prefix+rightGuess exists in the database;
- check that the rule s--result--left--right could really have applied,
  i.e. that leftGuess ends with one of the strings in the left part of
  the rule, since both sandhi vigrah and sandhi formation
  (left + right = result) should be checked;
- then, depending on the presence of the left-hand side in the database,
  declare the result. Even if the leftGuess word is not present, since
  prefix+rightGuess is present in the database there is still a chance
  of sandhi vigrah.
The left and right parts of a rule are orSeparatedStrings
(string|string|......|string|). E.g. for the rule
s--Dh--Sh|shh|T|Th|D|Dh|N|--dh|
  left: Sh|shh|T|Th|D|Dh|N|
and the leftGuess sheshh ends with shh, which is present in
Sh|shh|T|Th|D|Dh|N|.
c) Effect on both words
Rule example: b--e--a|aa|A|--i|ii|I|
Here both the first and the second word are affected. So, depending on the
rule, append something to leftGuess and prefix something to rightGuess,
and check the presence of (leftGuess+append) and (add+rightGuess) in the
database.
The potential prefixes for rightGuess are taken from the rule's right part:
  rule : b--e--a|aa|A|--i|ii|I|
  right: i|ii|I|
  potential prefixes (add): i, ii, I
For each such prefix, the potential postfixes for leftGuess are taken from
the rule's left part:
  rule : b--e--a|aa|A|--i|ii|I|
  left : a|aa|A|
  potential postfixes (append): a, aa, A
If (add+rightGuess) and (leftGuess+append) are both present in the
database, one can be sure of the sandhi vigrah and declare the result:
  left + right = sandhi
Even if the left word is not present, since add+rightGuess is present in
the database there is still a chance of sandhi vigrah.
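The nested loop over prefixes and postfixes in the 'b' case is a Cartesian product. A minimal Python sketch, assuming a set of known words as the database; the function name is hypothetical, and the word surendra = sura + indra is an illustration supplied here (a standard a + i = e sandhi), not an example taken from the project's data.

```python
from itertools import product


def check_effect_on_both(left_guess, right_guess, left_options, right_options, in_db):
    """Return (left, right) pairs whose two words are both in the database."""
    pairs = []
    for prefix, append in product(right_options, left_options):
        left_word = left_guess + append     # restore the end of the first word
        right_word = prefix + right_guess   # restore the start of the second word
        if in_db(right_word) and in_db(left_word):
            pairs.append((left_word, right_word))
    return pairs


words = {"sura", "indra"}
# "surendra" broken as sur-e-ndra under the rule b--e--a|aa|A|--i|ii|I|
print(check_effect_on_both("sur", "ndra", ["a", "aa", "A"], ["i", "ii", "I"],
                           words.__contains__))
```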
Sandhi algorithm in a nutshell:
1. Traverse the string in reverse order.
2. For k = n down to 1,
   break the string at the k-th position into two: leftBreakStr, rightBreakStr
   i) from the starting character of rightBreakStr, get the rule file
   ii) try to apply each rule contained in this rule file:
       -- extract the result part of the rule from rightBreakStr
       -- find the affected substrings:
          leftStr  = leftBreakStr with something (left) appended
          rightStr = rightBreakStr with something (right) added in front
       -- if Rootdatabase(rightStr):
              if Rootdatabase(leftStr) then set sandhiPossible = true
              if sandhiwichched(leftStr) then set sandhiPossible = true
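The outline above can be put together into one short routine. The following Python sketch is an illustration under stated assumptions, not the project's Java implementation: rules are held in a dictionary keyed by the first letter of the result (standing in for the rule files), roots is a plain set (standing in for the database), all function names are hypothetical, and the depth argument is a safeguard added here to bound the recursion.

```python
def parse_rule(line):
    effect_on, result, left, right = line.split("--")
    return effect_on, result, options(left), options(right)


def options(field):
    return [s for s in field.split("|") if s]


def split_sandhi(word, rules_by_letter, roots, depth=3):
    """Return the (left, right) splits of word that the rules and word list support."""
    found = []
    if depth == 0:
        return found
    for k in range(len(word) - 1, 0, -1):            # break positions, from the end
        left_break, right_break = word[:k], word[k:]
        for line in rules_by_letter.get(right_break[0], []):
            effect_on, result, left_opts, right_opts = parse_rule(line)
            if not right_break.startswith(result):
                continue
            rest = right_break[len(result):]
            if not rest:
                continue
            if effect_on == "f":      # only the first word changed
                lefts = [left_break + a for a in left_opts]
                rights = [rest] if any(rest.startswith(r) for r in right_opts) else []
            elif effect_on == "s":    # only the second word changed
                lefts = [left_break] if any(left_break.endswith(e) for e in left_opts) else []
                rights = [p + rest for p in right_opts]
            else:                     # 'b': both words changed
                lefts = [left_break + a for a in left_opts]
                rights = [p + rest for p in right_opts]
            for right_word in rights:
                if right_word not in roots:
                    continue
                for left_word in lefts:
                    if left_word in roots or split_sandhi(
                            left_word, rules_by_letter, roots, depth - 1):
                        found.append((left_word, right_word))
    return found


rules = {"s": ["f--s--H|--t|th|"], "i": ["b--ii--i--i"]}
roots = {"kaH", "tu", "ravi", "indra"}
print(split_sandhi("kastu", rules, roots))
print(split_sandhi("raviindra", rules, roots))
```

The sketch only reports splits it considers certain; recording the merely probable ones (the tryResult.txt case) is left out for brevity.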
Sample Output :
Vibhakti (विभक्ति)
Vibhaktis are helpful in finding gender, form and markers.
Vibhakti
Although Sanskrit has only six karakas, compared with 8 in Hindi, we
consider all 8 vibhaktis here (the seven case endings plus the vocative),
numbered 1 to 8.
vachan (वचन)
In Sanskrit there are 3 numbers/vachans (वचन): एकवचन (singular),
द्विवचन (dual) and बहुवचन (plural), compared with 2 in Hindi.
Genders (लिंग)
Again, in Sanskrit there are 3 genders (लिंग): स्त्रीलिंग (feminine),
पुल्लिंग (masculine) and नपुंसकलिंग (neuter), compared with two in Hindi.
Approach for Vibhakti (विभक्ति)
The first thing is to detect the marker. Once the marker is detected, the
work is done.
By marker I mean the following. Consider “goes”: here the root word is “go”
and the marker is “es”. Similarly, in the word “going” the root word is “go”
and the marker is “ing”.
In the same way, markers are attached to the root word in Sanskrit,
e.g. in the word “munibhyaam” (मुनिभ्याम्) the root word is “muni” (मुनि)
and the marker is “bhyaam” (भ्याम्).
e.g. consider the following:
            एकवचन             द्विवचन                  बहुवचन
  तृतीया    मुनिना (muninaa)   मुनिभ्याम् (munibhyaam)   मुनिभिः (munibhiH)
In the above case the markers are:
  muni -- naa
  muni -- bhyaam
  muni -- bhiH
So if I encounter “bhyaam” in the given word, I can say that it is तृतीया
विभक्ति, द्विवचन.
Note: मुनिभ्याम् (munibhyaam) may occur in three places. For muni (मुनि),
मुनिभ्याम् is the form for तृतीया, चतुर्थी and पंचमी, so these cases cannot be
told apart at word level (they can be sorted out only at sentence level).
Vibhakti Algorithm
The algorithm for vibhakti is very similar to that for sandhi, but here we
use the following representation:
  vachan   : e, d, b (एकवचन, द्विवचन, बहुवचन)
  vibhakti : 1, 2, 3, 4, 5, 6, 7, 8
  ling     : m, f, n (पुल्लिंग, स्त्रीलिंग, नपुंसकलिंग)
  rule     : extract--add--vibhakti--vachan--ling--ending swar
e.g. for munibhyaam: ibhyaam--i--3|4|5--d--m--i
i.e. on encountering the word munibhyaam, extract “ibhyaam” and then
add “i” to get the root word muni.
Information gained:
  karak  : तृतीया, चतुर्थी, पंचमी (3rd, 4th, 5th)
  vachan : द्विवचन (dual)
  gender : masculine, with a root ending in i,
           i.e. इकारान्त पुल्लिंग शब्द
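To make the six fields concrete, here is a minimal Python sketch that parses one vibhakti rule and applies it to munibhyaam. It is an illustration only: the function and dictionary names are hypothetical, the English glosses are added here, and the treatment of "$" as an empty add part follows the convention described in the steps of this report.

```python
VACHAN = {"e": "ekavachan", "d": "dvivachan", "b": "bahuvachan"}
LING = {"m": "masculine", "f": "feminine", "n": "neuter"}


def parse_vibhakti_rule(line):
    """Turn 'extract--add--vibhakti--vachan--ling--ending' into a dict."""
    extract, add, vibhakti, vachan, ling, ending = line.strip().split("--")
    return {
        "extract": extract,
        "add": "" if add == "$" else add,      # '$' stands for an empty add part
        "vibhakti": [int(v) for v in vibhakti.split("|") if v],
        "vachan": VACHAN[vachan],
        "ling": LING[ling],
        "ending": ending,
    }


info = parse_vibhakti_rule("ibhyaam--i--3|4|5--d--m--i")
root = "munibhyaam"[: -len(info["extract"])] + info["add"]
print(root, info["vibhakti"], info["vachan"], info["ling"])
```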
Storage used:
I have stored the rules in files indexed on their starting letter, i.e.
the rule ibhyaam--i--3|4|5--d--m--i above is in the file “i.txt”.
Similarly, words/root words are stored in files indexed on their starting
letter, i.e. raviindra is stored in “r.txt”. Only one word per line
is stored in the corresponding file.
Steps:
Step 1: Receive a word for detecting karak, ling and vachan.
Step 2: Try to extract the marker from the word. Since markers are appended
at the end, process the letters of the word in reverse order and try to
separate the root from the marker.
a) Split the word into two parts at the j-th index,
   e.g. for ramAH:
     ramA H
     ram AH
     ra mAH
     r amAH
b) Obtain all the rules that could apply to the given split. The rule file
   is chosen by the first letter of the second part, e.g. for ramA H, open
   the file H.txt, read all its rules and store them in the array rule[].
c) Iteratively apply all the rules stored in the array rule[].
   rule format: result--add--vibhakti--vachan--ling--ending word
   e.g. AH--a--1|8--b--m--a
     vachan   : e, d, b
     vibhakti : 1 2 3 4 5 6 7 8
     ling     : m, f, n
d) Find the starting and ending index of the add part in the rule and
   hence extract the add part from the rule.
     rule: AH--a--1|8--b--m--a
     add : a
e) If the add part is $ (empty), use an empty string instead of $.
f) Append the add part to the left part of the query word.
     first   : ram
     second  : AH
     add     : a
     firstRes: rama
g) Check whether firstRes is in the database. If the word is found,
   vibhakti is possible, so break the corresponding rule into the result,
   i.e. ling, vachan and karak:
     root  : rama
     karak : 1|8
     vachan: bahuvachan
     ling  : male
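Steps (a) to (g) can be sketched in a few lines. This is an illustrative Python sketch, not the project's Java getRoop: the function name find_roots is hypothetical, a dictionary keyed by first letter stands in for the rule files, and a set stands in for the word database.

```python
def find_roots(word, rules_by_letter, roots):
    """Try every split of word, from the end, against the vibhakti rules."""
    results = []
    for j in range(len(word) - 1, 0, -1):
        first, second = word[:j], word[j:]
        # the "rule file" is chosen by the first letter of the second part
        for line in rules_by_letter.get(second[0], []):
            extract, add, vibhakti, vachan, ling, ending = line.split("--")
            if extract != second:
                continue
            first_res = first + ("" if add == "$" else add)
            if first_res in roots:
                results.append({"root": first_res, "karak": vibhakti,
                                "vachan": vachan, "ling": ling})
    return results


rules = {"A": ["AH--a--1|8--b--m--a"]}
print(find_roots("ramAH", rules, {"rama"}))
```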
Algorithmic Steps:
Step 1:
  Fetch the word.
Step 2:
  Extract the vibhakti. For raamaabhyaam the possible markers are
  am, aam, abhyaam, ... They are extracted starting from the last letter
  (here m), so it is better to store the rules indexed on the last letter.
    if you extract am and add a, the left part raamaabhyaa is not found in the db
    if you extract aam and add a, the left part raamaabhya is not found in the db
    if you extract abhyaam and add a, the left part raamaa is found in the db
  Then declare its ling, vachan and karak based on its ending, i.e. based
  on the rup:
    akaaraant pulling shabda
    aakaaraant pulling shabda
    ikaaraant pulling shabda, etc.
Requirement Analysis:
Hardware:
  CPU of at least 2.40 GHz
  256 MB of RAM
Software:
  JDK version 1.4.2 or above
  Itranslator.exe (freely available for download)
  Link: http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm
How to run:
Package: sandhi.sandhi
Src/sandhi/sandhi contains all the Java files:
  1) frame.java
  2) Sanskrit.java
  3) Sandhi.java
  4) Vibhakti.java
  5) Utils.java
Data: the data directory contains
  1) Database (root + others)
  2) sandhiRules
  3) vibhaktiRules
Main function: in sandhi.sandhi.frame.java
Load and run frame.java, i.e. sandhi.sandhi.frame.java
Input / Output:
Input:
Format: roman representation of Devanagari, e.g. raama for राम.
Step 1:
  Single word input:
    Either enter text into the text box, or select a word from the list
    (on the left-hand side).
  File input:
    Browse to a file in which Sanskrit words are kept.
    There should not be any blank line in the file.
Step 2: Press sandhi or vibhakti as desired.
NOTE: Processing a word may take a second or two, so please wait.
If the processing is done on a file, please wait for around 2-3 minutes.
Output:
The output is displayed in the output window (right side of the frame).
The result of the query is recorded in result.txt or in tryResult.txt:
  result.txt: if the desired operation could be performed with 100%
    certainty.
  tryResult.txt: if there is only a chance/probability of the operation
    being performed.
To view the result in Devanagari form, follow these steps:
Step 1: Open Itranslator.exe.
Step 2: Browse to the file result.txt / tryResult.txt (both are in the
current directory of the project).
Use-Case diagram:
USE-CASE : 1
Use-case description:
getHelp:
  action performed: guides the user in using the software.
browseFile:
  action performed: browses for a file for sandhi vigrah/vibhakti.
Sandhi:
  pre condition: a word is selected or entered for getting the desired
    work done, OR a file is selected
  action performed: calls sandhiOrVibhakti(1) //triggerSandhi
  post condition: the result is shown in the output window
vibhakti:
  pre condition: a word is selected or entered for getting the desired
    work done, OR a file is selected
  action performed: calls sandhiOrVibhakti(2) //triggerVibhakti
  post condition: the result is shown in the output window
uploadWordDB:
  pre condition: a word has been entered in the corresponding box
  action performed: uploads a word into the list database
  post condition: the word is appended to words.txt
uploadRootWord:
  pre condition: a word has been entered in the corresponding box
  action performed: uploads a word into dictfile
  post condition: the word is appended to dictfile.txt
JList1:
  action performed: shows the list of words in the list (large window on
    the left side)
output:
  action performed: shows the result in the output window (large window
    on the right side)
USE-CASE : 2
Use-case 2 description:
Note (linguistics):
  sandhi is possible: the word can be broken up into two or more distinct
    words
  vibhakti is possible: given the word, one can get its karak, vachan and
    ling
  vibhakti is probable: given the word, there is some chance that one may
    detect its karak, vachan and ling
trigger:
  pre conditions:
    1) A word (NOT NULL) has been received.
    2) Either sandhi or vibhakti has been selected (through the GUI).
  action performed: triggers either "triggerSandhi" or "triggerVibhakti"
  post condition: either sandhi wichched or vibhakti detection is
    performed.
triggerSandhi:
  pre condition:
    A word has been received from the function "trigger".
  action performed:
    1) creates a log file ("vigrahNotPossible.txt") for keeping track of
       those subwords whose sandhi could not be performed (to increase the
       efficiency of sandhi vigrah) [see details of increasing efficiency]
    2) calls sandhiWichched(word) to perform sandhi vigrah on the given
       word.
    3) records the result of the action in result.txt (if sandhi vigrah
       has been performed successfully, retrieves the result from
       result.txt; otherwise writes word + " couldn't be performed" to
       result.txt)
  post condition: the result of the action has been recorded.
triggerVibhakti:
  pre condition:
    A word has been received from the function "trigger".
  action performed:
    1) calls getRoop(word) to get the karak, vachan and ling
    2) records the result of the action in result.txt (whether vibhakti
       is possible, OR there is only some chance (probability) of
       vibhakti)
  post condition: the result of the action has been recorded.
getResult:
  pre condition:
    The name of the file from which the result is to be extracted has
    been received.
  action performed:
    1) checks whether "couldn't be formed" is written in "result.txt".
    2) if result.txt does not contain "couldn't be formed", the desired
       operation has been performed, so the result is retrieved. Since
       sandhi wichched is a recursive process, the result is retrieved in
       reverse order.
       e.g. kastvam = kastu + am.
            kastu = kaH + tu.
       While writing the result to the file result.txt,
         kaH + tu would be written first,
         then kastu + am would be written.
       But in order to present the result to the user in an easily
       readable manner, kastu + am should be shown first, then kaH + tu.
    3) if result.txt contains "couldn't be formed", the desired operation
       could not be performed, so the probable results recorded in
       "tryResult.txt" are shown.
  post condition:
    The user (GUI/frame) gets the desired result.
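The reverse-order retrieval in getResult can be illustrated with a short Python sketch. The function name and the failure_marker parameter are hypothetical, and the recorded lines are passed in as a list rather than read from result.txt.

```python
def results_for_display(recorded_lines, failure_marker="couldn't be"):
    """Return recorded splits outermost-first, or None if the run failed."""
    if any(failure_marker in line for line in recorded_lines):
        return None          # the caller would fall back to tryResult.txt
    return list(reversed(recorded_lines))


print(results_for_display(["kaH + tu", "kastu + am"]))
```

Reversing is needed because the recursion records the innermost split first, while a reader expects the split of the whole word first.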
USE-CASE : 3
Use-case 3 description:
sandhiWichhed
  pre condition: receives a word for sandhi wichched
  action performed:
    1) tries iteratively to break the word into two parts (left and
       right), e.g. if the word is kastu then:
         kast u
         kas tu
         ka stu
         k astu
    2) tries sandhi wichched on each pair. (Note: sandhi vigrah means
       word = leftWord + rightWord, i.e. the ending of leftWord and the
       start of rightWord are joined to form the sandhi. To detect the
       effect of the joining, I assume that the affected letters are in
       the right part. E.g. kastu = kaH + tu, so when "kastu" is received
       for sandhi wichched, I take left as "ka" and right as "stu", and
       pass "ka" (left) and "stu" (right) on for sandhi wichched.)
  post condition: returns true if sandhi vigrah is possible, else returns
    false
trySandhiwichhed
  /*
  sandhi rule format: effectOn--result--left--right
  e.g. f--s--H|--t|th|
  */
  pre condition: receives two words: left, right (e.g. ka, stu)
  action performed:
    1) sets a variable flag to false.
    2) reads the rules from the rule file named after the starting letter
       of right (here it opens s.txt, since "stu" starts with the
       letter s)
    3) calls the function applyRule() to apply each of the sandhi rules
       in the rule file (here the rules in s.txt) to left and right
       (i.e. to ka and stu).
    4) if one or more of the rules listed in the rule file is applicable,
       sets flag to true.
  post condition: returns flag (true if sandhi vigrah is possible by
    applying one of the rules in the rule file, else false)
applyRule
  pre condition:
    receives one rule, firstString and secondString
    e.g. rule  : f--s--H|--t|th|
         first : ka
         second: stu
  action performed: tries to apply the rule to firstString and
    secondString.
  post condition: returns true if the rule is applicable, else returns
    false
checkWordInDB
  pre condition: receives a word
  action performed: checks whether the word is present in the database
  post condition: returns true if the word is present in the database,
    else returns false.
USE-CASE : 4
getRoop
  pre condition:
    the boolean variables probable and possible are set to false
    receives a word for detecting karak, ling and vachan
  action performed:
    applies rules to extract the root word and its karak, ling and vachan
  post condition: sets the status by turning on either or both of the
    boolean variables (possible, probable)
breakRuleIntoResult
  pre condition:
    receives the rule, first word, second word and status
    (probableOrPossible)
  action performed:
    breaks (i.e. processes) the rule and builds up the result.
    /* rule  : AH--a--1|8--b--m--a
       first : ram
       second: AH
       e.g. root  : rama
            karak : 1|8
            vachan: bahuvachan
            ling  : male
    */
  post condition:
    the result is recorded in result.txt (if vibhakti is possible),
    otherwise in tryResult.txt (if vibhakti is probable).
getProbable
  action performed: returns the status of the probable flag, i.e. whether
    there is a chance of a vibhakti root
getPossible
  action performed: returns the status of the possible flag, i.e. whether
    a vibhakti root is possible
USE-CASE : 5
attchPackagePath:
  pre condition:
    receives a filename to which the pathname of the package is to be
    attached.
  action performed: attaches the relative path of the package to the
    filename.
  post condition: the relative path of the filename is returned.
createFile:
  pre conditions:
    1) receives a file name for creating the file
    2) receives a string which is to be written to the file (may be "",
       i.e. blank)
  action performed: creates a file with the name provided and appends
    the string to the file.
  post condition:
    the file has been created and the string has been appended.
breakIntoFile (for administrator)
  pre conditions:
    1) receives a file which is to be broken into subfiles based on a
       parameter.
       (parameter: read each line and, based on the starting letter of
       the line, put that line in the file named startingLetter.txt)
    2) receives the name of the folder in which the subfiles are to be
       stored.
  action performed:
    reads the file fileName line by line and appends each line to the
    subfile chosen by the starting letter of the line (used mainly for
    organising the database/rules based on starting letters).
  post condition:
    the file has been broken into subfiles.
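The bucketing done by breakIntoFile can be sketched as follows. This is a minimal Python sketch with a hypothetical function name, not the project's Java utility; the demo writes to a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def break_into_files(source, folder):
    """Append each non-empty line of source to folder/<first letter>.txt."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    with open(source, encoding="utf-8") as src:
        for line in src:
            entry = line.strip()
            if entry:
                with open(folder / (entry[0] + ".txt"), "a", encoding="utf-8") as out:
                    out.write(entry + "\n")


with tempfile.TemporaryDirectory() as tmp:
    words = Path(tmp) / "words.txt"
    words.write_text("raviindra\nrama\nkaH\n", encoding="utf-8")
    break_into_files(words, Path(tmp) / "db")
    print(sorted(p.name for p in (Path(tmp) / "db").iterdir()))
```

One point to watch with this layout: the roman input scheme is case-sensitive (a and A are different letters), so on a case-insensitive file system a.txt and A.txt would name the same file. How the project deals with this is not stated in the report.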
getContentWholeFile
  pre condition:
    receives the fileName from which the content is to be retrieved.
  action performed:
    reads the file line by line and concatenates the lines into one
    string.
  post condition:
    a string containing the content of the whole file is returned.
appendToFile
  pre condition:
    receives the filename and the string to be appended to the file.
  action performed:
    appends the string to the file.
  post condition: the string has been appended to the file.
checkWordInFile
  pre condition: receives the filename and the word to be checked in the
    file (the file contains only one word per line).
  action performed:
    reads each word from the file and compares it with the word received.
  post condition: returns true if the word is present in the file, else
    returns false.
checkAndFetchWordInRuleFile (used for vibhakti only)
  Note: rule format for vibhakti:
    result--add--vibhakti--vachan--ling--endingWord
  pre condition: receives the filename (rule file) and the word (the
    rule's result part) to be checked in the file (the file contains only
    one rule per line).
  action performed:
    reads each rule from the file and compares its result part with the
    word received.
  post condition:
    returns true if a rule with that result part is present in the rule
    file.
checkWordInRightOrSeparatedWord (used for sandhi only)
  Note: rule format for sandhi:
    effectOn: a character
      f : result obtained by making changes in the first word
      s : result obtained by making changes in the second word
      b : result obtained by making changes in both words
    result: string
    left: OR-separated strings
    right: OR-separated strings
  pre condition:
    receives a word and an orSeparatedString (string|string|......|string|)
  action performed: checks whether the word starts with one of the
    strings present in the orSeparatedString.
  post condition:
    returns true (if it starts with one of them) or false (if it does
    not).
checkWordInRightOrSeparatedWord (used for sandhi only; this variant
checks the ending of the word)
  Note: the rule format for sandhi is the same as above.
  pre condition:
    receives a word and an orSeparatedString (string|string|......|string|)
  action performed:
    checks whether the word ends with one of the strings present in the
    orSeparatedString.
  post condition:
    returns true (if it ends with one of them) or false (if it does not).
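The two OR-separated-string checks are easy to get subtly wrong because of the trailing "|". A minimal Python sketch with hypothetical function names:

```python
def or_options(or_separated):
    """'t|th|' -> ['t', 'th']; the empty piece after the last '|' is dropped."""
    return [s for s in or_separated.split("|") if s]


def starts_with_any(word, or_separated):
    return any(word.startswith(s) for s in or_options(or_separated))


def ends_with_any(word, or_separated):
    return any(word.endswith(s) for s in or_options(or_separated))


print(starts_with_any("tu", "t|th|"))
print(ends_with_any("sheshh", "Sh|shh|T|Th|D|Dh|N|"))
```

If the empty piece were kept, every word would "start with" and "end with" it, and both checks would always succeed.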
CLASS-DIAGRAM :
Description
Architecture Overview
[Figure: architecture overview]
  Query -> Sanskrit Server -> Result
  Sanskrit Server -> Sandhi   (Sandhi Utils; Sandhi Rules, based on
                               starting letter)
  Sanskrit Server -> Vibhakti (Vibhakti Utils; Vibhakti Rules, based on
                               starting letter)
  Database: root words, non-root words
User queries (through the GUI) are handled by the Sanskrit Server.
Depending on whether sandhi or vibhakti is to be performed, it passes the
query to the respective server.
The respective server, with the help of the utilities, interacts with the
rules database and the word database, does some processing and gives the
result back to the Sanskrit Server, which passes it on to the result
server.
NOTE:
These servers (right now a Java program) may later be implemented in a
distributed manner.
Risk Involved:
Unable to handle Sanskrit words such as अत एवास्मि,
i.e. Sanskrit words with blanks.
If the input is through a file, the file should not contain any blank line.
References
Books referred:
• Sanskrit Path Pradarshak [Shanti Swaroop Dixit & Dr. Milaap Singh]
• Sugham Sanskrit Vyakaran [Chandrakant Jhaa]
• Natural Language Processing [Akshar Bharati, Vineet Chaitanya, Rajeev Sangal]
Web references:
• http://sanskritdeepika.org
• http://www.omkarananda-ashram.org
• http://www.gitasupersite.iitk.ac.in
