A Thesis by Philip Makedonski /nero_pdm@yahoo.com/ Submitted to Seminar für Sprachwissenschaft, Eberhard Karls Universität Tübingen, 72074 Tübingen, Germany, in fulfillment of the requirements for the degree Bachelor of Arts in Computational Linguistics
July 2005
ABSTRACT
Finite State Morphology: The Turkish Nominal Paradigm
Makedonski, Philip. Seminar für Sprachwissenschaft, Eberhard Karls Universität Tübingen. Supervisor: Dr. Dale Gerdemann. July 2005, 24 Pages
In this thesis I present a finite state approach to the inflectional morphology of Turkish nouns, with the ultimate goal of building a morphological analyzer for Turkish nouns. We'll be dealing primarily with the principles of vowel harmony across the different inflectional noun suffixes in Turkish, as the most interesting phenomenon, and with my implementation of these principles in the Xerox Finite State Toolbox (XFST). We will also pay attention to the other morphophonological alternations occurring both in the stem and in the suffixes attached to it as a result of the inflectional processes.
Keywords: Natural Language Processing, Finite State Networks, Morphology, Computational Linguistics, Turkish
To my family, to my love
ACKNOWLEDGEMENTS
First, I'd like to thank my supervisor Dr. Dale Gerdemann for his support and advice on this project. I appreciate the freedom and independence I had in the choice of topic and approach. I would also like to thank Dr. Sandra Kübler for her support and understanding throughout this course of studies, which in many cases turned out to be crucial for my progress. Many, many thanks to my family for their support all the time, no matter what was happening. Thanks to my friends for their understanding. And most of all, special thanks to Nevin Recep for sparking my interest in the Turkish language and supporting me all the time.
TABLE OF CONTENTS
ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
1. INTRODUCTION
   1.1 MOTIVATION
   1.2 MORPHOLOGY
   1.3 RELATED WORK
   1.4 OVERVIEW
2. BACKGROUND
   2.1 TURKISH
   2.2 FINITE STATE TECHNOLOGY
      2.2.1 Finite State Automata (FSA)
      2.2.2 Finite State Transducers (FSTs)
   2.3 XFST
3. THE MODEL
   3.1 THE NOMINAL PARADIGM OF TURKISH. MORPHOTACTICS
      3.1.1 Inflection for Number
      3.1.2 Case Inflection
      3.1.3 Inflection for Possession
      3.1.4 Lexical Exceptions: the su case
   3.2 PHONOLOGICAL ALTERNATION RULES
      3.2.1 Resolving Vowel Harmony
      3.2.2 Consonant Alternation Rules
         3.2.2.1 Final Consonant (De)Voicing
         3.2.2.2 (De)Gemination
      3.2.3 Other Alternations
   3.3 IMPLEMENTATION
      3.3.1 The Lexicon
      3.3.2 The Rules Component
         3.3.2.1 Vowel Harmony Rules
         3.3.2.2 Consonant Alternation Rules
         3.3.2.3 Fixing the Morphotactics
         3.3.2.4 Rule Order
4.
5.
APPENDIX A: LIST OF ABBREVIATIONS
APPENDIX B: LEXC CODE SAMPLES
APPENDIX C: ON REPLACEMENT RULES
1. Introduction
1.1 Motivation
In morphologically rich languages like Bulgarian, Turkish, Russian, Spanish and many others, grammatical features and functions that morphologically poor languages like English typically assign to the syntactic structure are often represented in the morphological structure. As a consequence, any adequate Natural Language Processing (NLP) application for these languages requires a good morphological component. This in turn requires a rich lexicon, and a lexicon that explicitly lists all possible forms as separate entries would quickly grow to an unmanageable size, due to the rich inflectional and derivational possibilities for a single base (dictionary) form (stem). In Turkish, for example, the nominal inflectional paradigm has three basic types of suffixes, for number, possession and case (the number varies across sources), and the verbal inflectional paradigm is even more complicated with its eight affixes (again, the number may differ depending on the source). There are approximately 20,000 stems and 300-400 roots actively used in Turkish, which effectively amount to millions of inflected and derived forms. This further increases the demand for automated morphological analysis. As it turns out, morphological structures are much more regular than syntactic ones. They can be handled very efficiently and accurately using sets of rules and compact lexicons of base forms (stems). Furthermore, important semantic and grammatical information can be encoded in such lexicons as well.
1.2 Morphology
The central concepts of morphology are morphotactics and (morpho)phonological alternations. Morphotactics (also morphosyntax or word formation) defines the constraints on possible morpheme combinations. Phonological (also orthographical) alternations define the changes in morphemes occurring in particular environments. To illustrate the issue, an example from Karttunen (Karttunen, 2003) comes in handy:
(1)
The morphotactic definition accounts for the acceptability of a word like piti-less-ness and the unacceptability of a word like *piti-ness-less. Phonological alternations, on the other hand, describe why pity is realized as piti in the context of a following less. These are simple examples that could be captured easily with a few basic rules. But for full-scale NLP, one needs a much more sophisticated system. This is especially true for agglutinative languages like Turkish, where the concept of a word is much wider. Different relations between the words in a sentence are mostly expressed by affixes. Furthermore, many affixes and roots in Turkish change their shape depending on the environment and have to obey various constraints like vowel harmony.
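The two notions can be separated in a few lines of Python. This is only a minimal sketch under stated assumptions: the toy lexicon, the continuation table `NEXT` and the function names are hypothetical and are not taken from Karttunen's example or from the implementation described later.

```python
# Toy lexicon for the "pity" example. The continuation table is an
# illustrative assumption, not part of the original analysis.
STEMS = {"pity"}

# Morphotactics: which suffix may follow which morpheme.
NEXT = {
    "STEM": {"less"},
    "less": {"ness"},
    "ness": set(),
}

def well_formed(morphemes):
    """Check a morpheme sequence against the continuation table."""
    if not morphemes or morphemes[0] not in STEMS:
        return False
    state = "STEM"
    for suffix in morphemes[1:]:
        if suffix not in NEXT[state]:
            return False
        state = suffix
    return True

def surface(morphemes):
    """Alternation: stem-final y is realized as i before a suffix."""
    stem, suffixes = morphemes[0], morphemes[1:]
    if suffixes and stem.endswith("y"):
        stem = stem[:-1] + "i"
    return stem + "".join(suffixes)
```

The first function only decides which morpheme orders are admissible; the second only decides how an admissible sequence is spelled.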
Figure 1.1: Cascade-based and two-level (parallel) models in finite state morphology.

In the cascade model of composed rule transducers, each transducer operates on its own input and output, producing an intermediate output to feed the next transducer in the cascade. With the key concept here being feed, the major drawback of the two-level models has been that bleeding or feeding relations between rules (which are common in generative phonology) can hardly be expressed within this approach (apart from having to design the rules very carefully in order to get the necessary result). But the convenience of the cascade-based model in this respect comes at a price. In the process of composition, the network can easily explode to an unmanageable size, as many parts of it may need to be copied. Luckily, there are some techniques to restrict such growth. My project combines both models, as we shall see later. The advantage is that whenever parallel operation of rules is needed, the parallel model will be used, and whenever sequential (linear) operation of rules is needed, the cascade will be used.

1 More on transducers and automata follows in the technical background on finite state technology in Section 2.2. For now think of rule transducers simply as a way to implement rules.
1.4 Overview
In the following sections I will present a finite state approach to a part of Turkish morphology. I will focus on the nominal morphology only, in particular the different inflectional paradigms, as the complete nominal morphology of Turkish is a subject too broad to cover here (let alone the complete Turkish morphology). Once a solution for the nominal morphology is designed, however, it could easily be extended to cover the other major word classes of the language. I will try to approach the task as modularly as possible, so that if changes or extensions are required, all that is needed is to plug in the extension component and occasionally tune the system a little. The key concept here is modularity. My work is based primarily on Geoffrey Lewis's Turkish (Lewis, 1989) and Turkish Grammar (Lewis, 1967), referred to as the official language guides for Turkish in most papers. For the purpose of this project I will be using the Xerox Finite State Toolbox (XFST) and the manual to it by Lauri Karttunen (Karttunen, 2003). In Section 2 I will briefly present the background information needed to follow the paper: Section 2.1 gives the linguistic background on Turkish; Sections 2.2 and 2.3 provide some technical background on the technology employed and the particular toolbox I have chosen to use. The actual model and its implementation are presented in full in Section 3. We conclude in Section 4, and in Section 5 I present an outlook on possible future elaborations.
2. Background
In the following sections I will present the basic technical properties of the language and the technology used to model it.
2.1 Turkish
In this subsection I will present the most important features of Turkish that we'll be dealing with in the subsequent sections. Turkish is an agglutinative language from the family of Turkic languages. A Turkish word consists of a root (base form) and a number of suffixes attached to it, each extending its meaning or changing its word class:
(2)
bilgi 'knowledge'
bilgisiz 'without knowledge'
bilgisizlik 'lack of knowledge'
bilgisizlikleri 'their lack of knowledge'
bilgisizliklerinden 'from their lack of knowledge'
bilgisizliklerindenmiş 'I gather that it was from their lack of knowledge'
(Lewis, 1989, p. 3)
As one might infer, many ideas typically expressed by prepositions or pronouns across languages are expressed by suffixes in Turkish. Another important feature of the Turkish language is vowel harmony. Vowel harmony is basically described as a progressive sound assimilation phenomenon. In simple words, the features of a vowel depend on the features of the preceding vowel. We'll be dealing exclusively with the vowel harmony of suffixes in Turkish and, as mentioned before, the scope of this project will be restricted to inflectional noun suffixes only. Geoffrey Lewis (Lewis, 1989) describes the vowel harmony in Turkish with a general law of vowel harmony in terms of the feature +/-back of vowels. The Turkish vowel system is shown in Table 2.1 below:

            Unrounded       Rounded
            Low    High     Low    High
Front       e      i        ö      ü
Back        a      ı        o      u
As stated in (Lewis, 1989), all the vowels in a word agree with the backness value of the first vowel of that word:
(3)
-Back                              +Back
sekiz 'eight'                      dokuz 'nine'
seksen 'eighty'                    doksan 'ninety'
sinir 'nerve'                      sınır 'frontier'
sinirler 'nerves'                  sınırlar 'frontiers'
sinirlerimiz 'our nerves'          sınırlarımız 'our frontiers'
(Lewis, 1989, p. 11)
In cases of disharmony1 in the root, or if an invariable suffix is attached, the harmonic suffixes harmonize with the vowel of the last preceding syllable. So attaching the plural suffix -ler/-lar, which harmonizes for backness, to anne (mother) results in anneler (mothers) and not in *annelar, which would harmonize with the vowel of the first syllable.
1 Exceptions to this principle are: a small number of native Turkish words: elma (apple), anne (mother), kardeş (brother or sister); eight invariable suffixes; compound words: bilgisayar (computer), from bilgi (information) and sayar (counter, lister); loanwords. Clements and Sezer account for them in (Clements, 1982).
There is also, as Lewis (Lewis, 1989) refers to it, a special law of vowel harmony, that constrains the occurrence of vowels in terms of roundedness1. Unrounded vowels are typically followed by unrounded vowels and rounded vowels are typically followed by low unrounded or high rounded vowels. Combining the two principles we end up with the following:
(4)
Turkish suffixes, except the eight invariable ones, harmonize with, for the sake of simplicity, the vowel of the last syllable of the word they are attached to. They can be divided into two groups: the vowels of the first group alternate between the low unrounded vowels a and e (the so-called e-type2 suffixes (Pollard, 1996)), and the vowels of the second group alternate between the high vowels ı, i, u and ü (the so-called i-type1 suffixes (Pollard, 1996)). With one exception, the present tense verbal suffix -iyor/-ıyor/-uyor/-üyor, no suffixes contain o or ö. (4) above gives a basic notion of this classification. The plural suffix -ler/-lar falls in the first class, whereas a suffix like the definite objective case suffix is an i-type suffix.
(5)
evi (the house)
kolu (the arm)
kitabı (the book)
köprüyü (the bridge)
One might notice a few additional things from (5). First of all, no vowel sequences are possible in Turkish; exceptions are some loanwords like saat (hour). Typically a buffer y is inserted if a suffix beginning with a vowel is attached to a word ending in a vowel. In some cases the buffer is an n or an s. Second, words in Turkish typically end in voiceless consonants, which change to voiced ones intervocalically. This topic, along with the other alternations occurring in the process of suffixation, will be further elaborated in Section 3.2.2. These are the general morphological and phonological features of Turkish that we will pay attention to. In Sections 3.1 and 3.2 I will present the actual morphotactics of the Turkish nominal inflectional paradigm and the phonological alternation rules, respectively.
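The two harmony types can be illustrated by treating the vowel classes of Table 2.1 as sets and intersecting them. The following is a minimal Python sketch, not the XFST implementation of this thesis; the function names are hypothetical, and buffer consonants and consonant voicing are deliberately left out, so forms like kitabı and köprüyü are not derived here.

```python
# Vowel classes of Table 2.1 as Python sets.
FRONT = set("eiöü")
BACK = set("aıou")
ROUNDED = set("oöuü")
UNROUNDED = set("aeıi")
HIGH = set("ıiuü")
LOW = set("aeoö")
VOWELS = FRONT | BACK

def last_vowel(word):
    """Return the last vowel of a word (the harmony trigger)."""
    for ch in reversed(word):
        if ch in VOWELS:
            return ch
    raise ValueError("no vowel in " + word)

def e_type(word):
    """Low unrounded vowel agreeing in backness with the last vowel."""
    v = last_vowel(word)
    backness = BACK if v in BACK else FRONT
    (result,) = LOW & UNROUNDED & backness
    return result

def i_type(word):
    """High vowel agreeing in backness and rounding with the last vowel."""
    v = last_vowel(word)
    backness = BACK if v in BACK else FRONT
    rounding = ROUNDED if v in ROUNDED else UNROUNDED
    (result,) = HIGH & backness & rounding
    return result

def plural(word):
    """Attach the e-type plural suffix -ler/-lar."""
    return word + "l" + e_type(word) + "r"
```

Because each intersection contains exactly one vowel, the single-element unpacking doubles as a sanity check on the feature table.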
1 Exceptions to this principle are: tapu (title-deed), avuç (hollow of the hand), abuk sabuk (nonsensical), çamur (mud); in general a can be followed by u if a p, v, b or m intervenes. These exceptions apparently occur only root-internally and do not seem to affect suffixation: kitap (book) → kitabı (book, definite objective case, 'the book').
2 The e-/i-type distinction is really a distinction between harmonizing vowels and not between suffixes, as Pollard (Pollard, 1996) proposes. Some suffixes, like the 3pPl Poss. -leri/-ları, feature both types of harmonizing vowels.
Figure 2.1: A simple three-state network. The state marked with an arrow (1) is the start state, the state marked with a double circle (3) is the final state.
Figure 2.2: A slightly more complicated three-state model. The arc with input c takes us back to the start state, creating a loop.
We will be talking about networks here as a general term abstracting over transducers and automata. Automata are finite state machines that only accept a set of given strings (a language), whereas transducers provide a set of outputs for an accepted input, which might as well be identical to the input. Automata describe languages, whereas transducers express relations between languages.
is an essential concept in Finite State Technology. Regular expressions describe the languages accepted by Finite State Automata: the regular languages. In their current form, the regular expressions of finite state toolboxes are only loosely related to regular expressions in the strict sense: every toolbox defines newer operations that extend its capabilities and expressive power. The precise syntax varies among applications and toolboxes. I will describe the necessary syntax basics in further detail, in terms of the toolbox I am using, in section 2.3. A model solution for the above networks using the lexc language is provided in the appendix.
Figure 2.3: A Finite State Transducer (arc labels a:A, b:B, c). It accepts the same strings as the FSA in Figure 2.2, but transforms the lowercase a's and b's into upper-case A's and B's respectively. The c's remain unchanged.
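A transducer of this kind can be sketched as a table of arcs in Python. The arc table below is an assumption reconstructed from the description of Figure 2.3 and the sample strings in the text (ab and abcb), not a copy of the original figure; the names `ARCS` and `transduce` are hypothetical.

```python
# Arcs as (state, input symbol) -> (output symbol, next state).
# ASSUMPTION: this topology is inferred from the prose, not from the figure.
ARCS = {
    (1, "a"): ("A", 2),
    (2, "b"): ("B", 3),
    (1, "b"): ("B", 3),
    (3, "c"): ("c", 1),  # identity arc leading back to the start state
}
START = 1
FINAL = {3}

def transduce(text):
    """Return the output string, or None if the input is not accepted."""
    state, out = START, []
    for symbol in text:
        if (state, symbol) not in ARCS:
            return None
        output, state = ARCS[(state, symbol)]
        out.append(output)
    return "".join(out) if state in FINAL else None
```

Dropping the output symbols from the table leaves a plain acceptor, which is the sense in which an automaton describes a language while a transducer relates two languages.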
For an input string like ab the output will be AB, for abcb it will be ABcB, and so on. It looks like a simple replacement operation, but there is no such operation involved here. In this case we have strings from one language (later on referred to as the UPPER language1) related to strings from another language (which will be called the LOWER language1). The c, which remains unchanged, is mapped by the identity relation. These are the basics. Once we have designed a network describing a language or a relation, we can apply different operations to it: intersection (&)2, union (|), concatenation ( ), negation (~), subtraction (-), composition (.o.), etc. The essential terms will be explained as needed as we proceed. Most important to note here is the composition operation (.o.). A general feature of Finite State Networks is that they can be composed together, yielding a sequence of transducers/automata: a modular structure that is essential to our purpose in this paper. Composition is an operation on two relations. Say we have the transducer above (Figure 2.3) that turns lowercase a's and b's into upper-case A's and B's respectively. This could be further described as <a,A> and <b,B> in terms of relations. Say we then have another transducer that turns capital A's and B's into numbers, <A,1> and <B,2>. Composing the two of them provides us with a new transducer that takes the upper side of the first and the lower side of the second transducer, wherever the inner symbols match:
(6)
1 The terms will be explained in more detail in section 2.3.
2 The operators and their syntax vary among toolboxes. I will be using the ones described in (Karttunen, 2003).
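The effect of composition on the symbol pairs mentioned above can be sketched in Python by treating each relation as a set of (upper, lower) pairs. This is only an illustration of the relational idea: real transducer composition operates on networks rather than on finite sets of pairs, and the names `compose`, `to_upper` and `to_digit` are hypothetical.

```python
def compose(first, second):
    """Compose two relations given as sets of (upper, lower) pairs.

    A pair (u, l) is in the result whenever the lower symbol of a pair
    in `first` matches the upper symbol of a pair in `second`.
    """
    return {
        (upper1, lower2)
        for (upper1, lower1) in first
        for (upper2, lower2) in second
        if lower1 == upper2
    }

to_upper = {("a", "A"), ("b", "B")}   # <a,A>, <b,B>
to_digit = {("A", "1"), ("B", "2")}   # <A,1>, <B,2>

composed = compose(to_upper, to_digit)
```

With these definitions `composed` pairs a with 1 and b with 2, while composing in the opposite order yields the empty relation, since no inner symbols match. This is one way to see that the order of composition matters.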
All the operations can be applied multiple times to different networks. For some of them the order matters, for others it does not. Composition allows us to build a cascade of multiple transducers into a single transducer; in terms of the task at hand, to compose multiple rule transducers into a single lexical transducer that relates strings from the language of surface forms to strings from the language of lexical (underlying) forms. It was C.D. Johnson (Johnson, 1972) who first realized that morphophonological knowledge could be modeled using FSNs. The most fascinating part is that once we have constructed a transducer for morphological generation, we can easily apply it in the other direction for the task of morphological analysis. This natural feature of finite state networks is what makes them so suitable for morphological processing. I will spare the reader the mathematical model behind Finite State Networks, as it won't be necessary for understanding the current paper. For further information on finite state technology and automata theory, refer to (Hopcroft, 1979).
2.3 XFST
The Xerox Finite State Toolbox (XFST) was developed at the Xerox Research Centre Europe (XRCE) by Kenneth R. Beesley and Lauri Karttunen. It implements the standard finite state operations such as composition and union, as well as several innovative operations like replacement rules1 and local sequentialization. XFST includes: lexc, a compiler for lexicons in the lexc language, which is specifically designed for handling morphotactics in natural languages; and xfst, the core tool, which provides an interface to the finite state calculus for building, accessing and manipulating Finite State Networks, and a compiler for regular expressions and replacement rules, which will be essential to my work. Additionally, there is a compiler for two-level morphology rules (twolc) as described by Koskenniemi (Koskenniemi, 1983), but its application is beyond the scope of my work, so I will leave it aside. XFST also provides two tools, lookup and tokenize, designed for testing and application of larger projects, but they won't be discussed any further in this paper. In the process of implementing a morphological analyzer, the morphotactics will be defined in lexc, as intended, whereas the phonological/orthographical alternation rules will be defined as separate transducers (mostly using replacement rules), composed together into a single transducer, which itself will be composed with the network derived from the lexc definition of the lexicon. The result is a lexical transducer, which will be used for our final purpose. Additional transducers can be composed with the network at hand to impose restrictions, define alternations or add more content. XFST defines transducers as relations between two languages. When we apply an input to a transducer downwards, the upper language can be thought of as the input and the lower language as the output. If we apply input to the transducer upwards, the roles switch: the input is applied on the lower side and the output comes from the upper side. Although this may seem a bit confusing, the terms upper and lower remain constant. In the definition of a lexical transducer, the upper-side language will describe the lexical (underlying) forms of the language to be analyzed, and the lower-side language will contain the actual surface forms in the standard orthography.
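The two directions of application can be sketched in Python by viewing a small lexical transducer extensionally, as a finite set of (upper, lower) string pairs. This is a simplified sketch and not how XFST stores networks; the tag names on the upper side and the helper functions `apply_down` and `apply_up` are illustrative assumptions.

```python
# A lexical transducer viewed as a finite relation of
# (upper = lexical form, lower = surface form) pairs.
# ASSUMPTION: the tags (+Nom, +Pl, ...) are made up for this sketch.
RELATION = {
    ("ev+Nom", "ev"),
    ("ev+Pl", "evler"),
    ("ev+Acc", "evi"),
    ("ev+Poss3Sg", "evi"),
}

def apply_down(upper):
    """Generation: map a lexical form to its surface form(s)."""
    return sorted(low for (up, low) in RELATION if up == upper)

def apply_up(lower):
    """Analysis: map a surface form to its lexical form(s)."""
    return sorted(up for (up, low) in RELATION if low == lower)
```

The same relation serves both tasks: `apply_down("ev+Pl")` generates evler, while `apply_up("evi")` returns two analyses because two upper-side strings share one lower-side string.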
3. The Model
In this section I will present the nominal paradigm of Turkish and my implementation of it. There are two modules in the model: the lexicon, defining the morphotactics of Turkish nouns, and the morphophonological rules component, describing the alternations occurring on the surface. In Sections 3.1 and 3.2 I will present the theoretical background behind my model. An important notion in the following sections will be that of archiphonemic descriptions. As I was implementing the vowel harmony principles using variables for the alternating vowel segments, I realized that the idea of using variables could be further employed to describe other phenomena, such as the consonant alternations. My initial approach, using consonant alternation rules on the surface forms, failed to describe the exceptional cases, so I had to redesign it, using underspecified abstract definitions on the lexical side for entries that do undergo the alternations and fully specified ones for entries that do not. The general idea: I will be using, both in theory and in practice, the so-called archiphonemes to describe classes of similar phonemes that alternate depending on the environment. For example, to describe vowel harmony I will be using I to generalize over the class of high vowels that alternate according to the principle of i-type vowel harmony, and E to generalize over the class of low unrounded vowels that alternate in accordance with the principle of e-type vowel harmony. The symbols denoting the particular classes of alternating phonemes will be defined as needed as we proceed.
Figure 3.1: A simplified FSA model for the nominal morphotactics in Turkish.
So let's have a closer look at the core of the Turkish noun paradigm. The definition will be further extended in the subsequent sections.
a or ı      -(y)ı     -(n)ın
o or u      -(y)u     -(n)un
The bracketed y and n are realized on the surface only if the word the suffix is attached to ends in a vowel. The locative and ablative suffixes are generally realized as -de/-da and -den/-dan, but when attached to a word ending in a voiceless consonant (ç, f, h, k, p, s, ş and t), they are realized as -te/-ta and -ten/-tan respectively. So, using archiphonemic descriptions and the principles of vowel harmony, the case inflection summary looks like this:

Case                               Lexical Form of the Suffix
Absolute (Nominative)              (none)
Definite Objective (Accusative)    -(y)I
Genitive (of)                      -(n)In
Dative (to, for)                   -(y)E
Locative (in, on, at)              -DE
Ablative (from, out of)            -DEn
(7)
ev (house, Nom.)
arabayı (car, Acc. / SF 'the car')
evde (house, Loc. / SF 'in the house')
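How the lexical forms of the case suffixes map to surface forms can be sketched in Python as a left-to-right pass over the suffix symbols. This is a minimal sketch under stated assumptions, not the replacement rules of the actual implementation: the function name `realize`, the `HARMONY` lookup table and the case labels are hypothetical, and stem alternations such as kitap/kitab- are not covered.

```python
VOWELS = set("aeıioöuü")
VOICELESS = set("çfhkpsşt")

# Last preceding vowel -> (realization of E, realization of I).
HARMONY = {
    "a": ("a", "ı"), "ı": ("a", "ı"),
    "o": ("a", "u"), "u": ("a", "u"),
    "e": ("e", "i"), "i": ("e", "i"),
    "ö": ("e", "ü"), "ü": ("e", "ü"),
}

CASE_SUFFIXES = {
    "Nom": "", "Acc": "(y)I", "Gen": "(n)In",
    "Dat": "(y)E", "Loc": "DE", "Abl": "DEn",
}

def realize(stem, lexical_suffix):
    """Resolve buffer consonants and the archiphonemes D, E, I."""
    word = stem
    symbols = list(lexical_suffix)
    i = 0
    while i < len(symbols):
        s = symbols[i]
        if s == "(":
            # Bracketed buffer consonant: kept only after a vowel.
            if word[-1] in VOWELS:
                word += symbols[i + 1]
            i += 3  # skip "(", the buffer, and ")"
            continue
        if s in ("E", "I"):
            last = next(ch for ch in reversed(word) if ch in VOWELS)
            low, high = HARMONY[last]
            word += low if s == "E" else high
        elif s == "D":
            word += "t" if word[-1] in VOICELESS else "d"
        else:
            word += s
        i += 1
    return word
```

Resolving the symbols strictly left to right means each archiphoneme sees the already realized vowel before it, which matters for suffixes with more than one variable segment.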
As mentioned above, some more recent works treat what used to be (and I believe still is) a postposition (ilE) following absolute or genitive forms as an additional instrumental/comitative case suffix ((y)lE). As far as I know, however, it is still used both as a postposition and as a cliticized suffix. I will stick to the classic works for now and treat it as a separate (non-case) suffix1.
Again, the bracketed segments surface only under particular conditions. In contrast to the case suffixes, where the bracketed segments surface only if the word they are attached to ends in a vowel, here the optional segments surface either if the word the possessive suffix is attached to ends in a consonant (for the first and second person singular and plural) or if the word ends in a vowel (for the third person singular). So we have vowel deletion in one case and consonant insertion in the other, to avoid vowel sequences2.
(8)
ev-(I)m (house, 1pSg Poss. / LF) → evim (house, 1pSg Poss. / SF 'my house')
araba-(I)mIz (car, 1pPl Poss. / LF) → arabamız (car, 1pPl Poss. / SF 'our car')
araba-(s)I (car, 3pSg Poss. / LF) → arabası (car, 3pSg Poss. / SF 'his/her car')
1 Lewis (Lewis, 1967, 1989) states that it is attached to nominative nouns and genitive pronouns; in this sense it could be considered an additional case suffix. I will leave it aside until I get a clearer view on the issue.
2 More on vowel sequences to come in the description of the rules in the following sections.
Possessive suffixes precede case suffixes. A second look at the two inflectional paradigms shows that some of the suffixed forms can overlap on the surface. For example, the underlyingly different ev-(y)I (house, Definite Objective (Accusative) case, 'the house') and ev-(s)I (house, 3pSg possessive, 'his house') end up exactly the same on the surface, as evi:
(9)
ev (house) → evi (house, Acc. / SF 'the house')
ev (house) → evi (house, 3pSg Poss. / SF 'his/her house')
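The behaviour of the bracketed segments, for case and possessive suffixes alike, can be summarized in one condition: the optional segment surfaces only when it is of the opposite type (vowel versus consonant) to the stem-final segment. The Python sketch below applies just this condition and leaves the archiphonemes unresolved; the function name `attach` is hypothetical and the sketch is not the rule set used in the implementation.

```python
import re

# Surface vowels plus the vowel archiphonemes E and I.
VOWELS = set("aeıioöuüEI")

def attach(stem, lexical_suffix):
    """Attach a suffix, keeping or dropping its bracketed first segment."""
    match = re.match(r"\((.)\)(.*)", lexical_suffix)
    if match is None:
        return stem + lexical_suffix
    optional, rest = match.groups()
    stem_ends_in_vowel = stem[-1] in VOWELS
    optional_is_vowel = optional in VOWELS
    # Keep the optional segment only if it differs in type from the
    # stem-final segment (vowel after consonant, consonant after vowel).
    if stem_ends_in_vowel != optional_is_vowel:
        return stem + optional + rest
    return stem + rest
```

For the consonant-final stem ev, both `attach("ev", "(y)I")` and `attach("ev", "(s)I")` reduce to the same intermediate form evI, which is where the accusative and the 3pSg possessive collapse.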
Things get further complicated if there are multiple instances of the plural suffix -lEr. In the case of the 3pPl possessive, for example, if the possessed noun is already plural, evler (houses) → *evlerleri → evleri (their houses), one -lEr gets deleted. So we end up with the single form evleri for both 'their house' and 'their houses'. A closer look, however, reveals even further complications: evleri can also denote the accusative case of the plural ('the houses') and the 3pSg possessive of the plural ('his/her houses'). Even though Turkish is morphologically highly specified, we often have 2-, 3- or, as in this case, 4-fold ambiguities. The derivations from the underlying lexical representations of the four interpretations of evleri are given in (10) below:
(10)
Pl.Acc.           (the houses)
Pl.3pPl.Poss.     (their houses)
Sg.3pPl.Poss.     (their house)
Pl.3pSg.Poss.     (his/her houses)
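The four-way ambiguity can be checked with a small Python sketch that builds each surface form from its morphs and then groups the analyses by surface string. The morph segmentations below are an assumption made for illustration: harmony and bracketed segments are taken as already resolved for the stem ev, and the names `ANALYSES` and `surface` are hypothetical.

```python
from collections import defaultdict

# ASSUMPTION: morphs are given in already harmonized form for "ev".
ANALYSES = {
    "Pl.Acc": ["ev", "ler", "i"],
    "Pl.3pPl.Poss": ["ev", "ler", "leri"],
    "Sg.3pPl.Poss": ["ev", "leri"],
    "Pl.3pSg.Poss": ["ev", "ler", "i"],
}

def surface(morphs):
    """Concatenate morphs; a doubled plural marker is reduced to one."""
    return "".join(morphs).replace("lerler", "ler")

# Group the analyses by the surface form they produce.
by_surface = defaultdict(list)
for tag, morphs in ANALYSES.items():
    by_surface[surface(morphs)].append(tag)
```

All four tags end up under the single key evleri, so an analyzer applied to that form has to return four readings and leave the choice to a later disambiguation step.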
It is worth noting, just to make things even more confusing, that after the third person possessive suffixes a so-called pronominal n is added when a case suffix follows.
(12) evinde ('in his/her house' in our case, but also identical with 'in your house')
Confusing? Typically, ambiguities are resolved by looking at the context in which the ambiguous word occurs: ambiguous forms are usually used with the genitive of the personal pronouns to avoid confusion. In this case the noun itself reverts to accusative case.
onların evi (their house, the house of theirs) (they, Gen.; house, Acc.)
onların evleri (their houses, the houses of theirs) (they, Gen.; houses, Acc.)
onun evleri (his houses, the houses of his) (he, Gen.; houses, Acc.)
For the purpose of this project, however, I won't be concerned with morphological disambiguation, as this task should be performed at a later stage, after examining the already analyzed context. There are further distinctions in the uses of the possessives in Turkish, but again, this topic is beyond the scope of my work. As one might imagine, for a single entry in the lexicon, that is, for a single noun stem, there are plenty of possible inflections: 2 values for number, times 7 for possession (the six possessive suffixes plus the possession-free form), times 6 for case (or even 7 if the instrumental case is included), which results in 84 basic forms from inflection alone (even though some of them might be identical), and things get further complicated.
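The count of 84 follows directly from enumerating the combinations, as the short Python sketch below shows. The category labels are placeholders chosen for this illustration and are not the tags used in the lexicon.

```python
from itertools import product

NUMBER = ["Sg", "Pl"]
POSSESSION = ["NoPoss", "1pSg", "2pSg", "3pSg", "1pPl", "2pPl", "3pPl"]
CASE = ["Nom", "Acc", "Gen", "Dat", "Loc", "Abl"]

# Every combination of number, possession and case for one noun stem.
combinations = list(product(NUMBER, POSSESSION, CASE))

# The same enumeration with the instrumental counted as a seventh case.
with_instrumental = list(product(NUMBER, POSSESSION, CASE + ["Ins"]))
```

Here `len(combinations)` is 2 × 7 × 6 = 84, and counting the instrumental as a case raises it to 2 × 7 × 7 = 98.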
(16) [ a | ı ]
This is essential for defining the i-type harmony, as it is based on two features rather than one, namely backness and roundedness. So, if the last preceding vowel is back and unrounded, the underlying I is realized as ı (the high back unrounded vowel, obtained, so to say, by intersecting the set of high vowels with the sets describing the features of the last preceding vowel). The same holds for the other realizations of the underlying I:
(17) I -> [HighV & BackV & RoundedV] || [BackV & RoundedV] Consonant _ (footnote 1)
This should be read as: I is realized as the high back rounded vowel (u) in the context of a back rounded last preceding vowel (o or u). The other rules are analogous:
(18) I -> [HighV & BackV & UnroundedV] || [BackV & UnroundedV] Consonant _
I -> [HighV & FrontV & RoundedV] || [FrontV & RoundedV] Consonant _
I -> [HighV & FrontV & UnroundedV] || [FrontV & UnroundedV] Consonant _
This is only necessary to state clearly the principles governing vowel harmony. One might as well simply write the rules as I -> i || [ i | e ] Consonant _ , but that won't have much descriptive linguistic value. In my solution the rules operate in parallel locally, that is, the e-type rules apply together among themselves and so do the i-type rules, but the e-type harmony still has precedence over the i-type. The reason, apart from backness harmony being the more general principle with broader coverage, is that the abstract symbols have to be resolved in a left-to-right fashion, and e-type suffixes at the current stage precede i-type suffixes. We need the exact properties of the last preceding vowel in order to resolve the next variable vowel in the following suffix (or even in the same suffix). In this sense, I might need to combine the e-type and i-type rules into one single rule operating in parallel as the system gets more sophisticated2.
1 Consonant is also a defined class featuring all the consonants.
2 A small issue occurred when I accidentally switched the order of the rules: words having a rounded vowel in their last syllable (like katalog (catalogue)) were resolved in an unusual way, as *kataloglarunuz, whereas the correct form would be kataloglarınız (your catalogues). This was due to resolving the InIz (2pPl Possessive suffix) as unuz in concordance with the last (resolved) preceding vowel o (the E in the plural suffix lEr was still pending resolution). This is important, because if an e-type suffix is added, all the following suffixes feature unrounded vowels (unless a suffix with an invariable rounded vowel is added).
A few words about the exceptions to vowel harmony: We will be concerned with roots whose last vowel does not have predictive power over the harmonic features of the suffixes attached to it. Schaaik (Schaaik, 1996) refers to words which induce such exceptions as disharmonic roots. The same term, however, is used in some sources for roots that do not conform to the principles of vowel harmony internally: the exceptional cases already mentioned in section 2.1, like anne (mother), çamur (mud), etc. Although they often do overlap, it can't be stated that this is always the case. The exceptions we will be dealing with are mostly of foreign origin: alkol (alcohol), rol (role), saat (clock), etc. are realized as alkolü (alcohol, Acc.), rolü (role, Acc.), saati (clock, Acc.) and alkoller (alcohol, Pl.), roller (role, Pl.), saatler (clock, Pl.) instead of *alkolu, *rolu, *saatı and *alkollar, *rollar, *saatlar respectively in their accusative and plural forms.
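The interplay of the e-type and i-type rules can be tried out with a small simulation. The sketch below is written in Python rather than as xfst replace rules, so it only approximates the behaviour described in the text: the vowel classes mirror the HighV, BackV and RoundedV sets, each archiphoneme is resolved in its own left-to-right pass, and the function names resolve and surface are hypothetical. Disharmonic roots such as alkol are not handled.

```python
# Vowel classes, mirroring the feature sets used in the harmony rules.
VOWELS = set("aeıioöuü")
HIGH = set("ıiuü")
LOW = VOWELS - HIGH
BACK = set("aıou")
FRONT = VOWELS - BACK
ROUNDED = set("oöuü")
UNROUNDED = VOWELS - ROUNDED


def resolve(form, symbol):
    """Resolve one archiphoneme ("E" or "I") in a single left-to-right pass."""
    out, last = [], None
    for ch in form:
        if ch == symbol and last is not None:
            backness = BACK if last in BACK else FRONT
            if symbol == "E":
                # e-type harmony: low unrounded vowel agreeing in backness
                (ch,) = LOW & UNROUNDED & backness
            else:
                # i-type harmony: high vowel agreeing in backness and rounding
                rounding = ROUNDED if last in ROUNDED else UNROUNDED
                (ch,) = HIGH & backness & rounding
        if ch in VOWELS:
            last = ch  # a still unresolved archiphoneme is skipped here
        out.append(ch)
    return "".join(out)


def surface(lexical, order=("E", "I")):
    """Apply the harmony passes in the given order and drop the boundaries."""
    for symbol in order:
        lexical = resolve(lexical, symbol)
    return lexical.replace("-", "")


print(surface("ev-lEr-InIz"))                         # evleriniz
print(surface("katalog-lEr-InIz"))                    # kataloglarınız
print(surface("katalog-lEr-InIz", order=("I", "E")))  # kataloglarunuz
```

The last call swaps the two passes and reproduces the misordering effect: the I symbols are resolved against the stem vowel o because the E of the plural suffix is still pending.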
An example of both phenomena, where several rules apply, is the inflection of kitap (book) in Table 3.4:
Surface Form   Lexical Form     Alternation Rules          Gloss
kitap          kitaB            B->p                       book, Sg, Nom.
kitaplar       kitaB-lar        B->p                       book, Pl, Nom.
kitabım        kitaB-(I)m       B->b, I->ı                 book, Sg, 1pSg Poss, Nom.
kitapta        kitaB-DE         B->p, D->t, E->a           book, Sg, Loc.
kitabımda      kitaB-(I)m-DE    B->b, I->ı, D->d, E->a     book, Sg, 1pSg Poss, Loc.
The rules in (19) and (20) are, of course, oversimplified. In the actual implementation they feature a wider context, including morpheme boundaries, to make the distinctions clearer. In linguistic terms we have regressive assimilation in stems and progressive assimilation in suffixes. The exceptions to these rules are primarily monosyllabic words that preserve the quality of their final consonant. There are, however, monosyllabic words that do undergo the alternation rules, just as there are polysyllabic words that do not. Such exceptions will be specified in the lexicon with their unchanging consonant.
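As a rough illustration of the two simplified alternations for kitap, here is a Python sketch (again a simulation, not the xfst rules themselves). It assumes that vowel harmony and the optional segments have already been resolved, so the inputs are intermediate forms such as kitaB-Da rather than the lexical forms of the table. The function names and the VOICELESS class are assumptions introduced for the example.

```python
VOWELS = set("aeıioöuü")
VOICELESS = set("çfhkpsşt")  # assumed class of voiceless consonants


def next_segment(form, i):
    """Return the first segment after position i, skipping morpheme boundaries."""
    for ch in form[i + 1:]:
        if ch != "-":
            return ch
    return ""


def previous_segment(out):
    """Return the last segment already emitted, skipping morpheme boundaries."""
    for ch in reversed(out):
        if ch != "-":
            return ch
    return ""


def resolve_stops(form):
    """Resolve stem-final B and suffix-initial D in one left-to-right pass."""
    out = []
    for i, ch in enumerate(form):
        if ch == "B":
            # regressive: voiced before a vowel, voiceless elsewhere
            ch = "b" if next_segment(form, i) in VOWELS else "p"
        elif ch == "D":
            # progressive: voiceless after a voiceless consonant
            ch = "t" if previous_segment(out) in VOICELESS else "d"
        out.append(ch)
    return "".join(out).replace("-", "")


for form in ["kitaB", "kitaB-lar", "kitaB-ım", "kitaB-Da", "kitaB-ım-Da"]:
    print(form, "->", resolve_stops(form))
```

Because the pass runs left to right, the stem-final B is already resolved by the time the suffix-initial D is examined, which is why kitaB-Da comes out as kitapta.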
3.2.2.2 (De)Gemination
Apart from the final stop voicing/devoicing, which is the most productive type of consonant alternation, a few other types of alternations are worth mentioning. The final consonant (de)gemination occurs only in a small number of Arabic loan words. The nature of this phenomenon is similar to that of the final consonant (de)voicing: a word-final segment gets doubled if a suffix starting with a vowel (or a dropping consonant) is attached to the word:
hissi (feeling, Acc., the feeling)
hattı (line, Acc., the line)
Again, we will have to employ special symbols that will be realized differently on the surface depending on the context, as proposed by Schaaik (Schaaik, 1996)1. He proceeds even further, investigating the dependence of these alternations on the re-syllabification processes occurring with the different suffixes. I will not go into detail, however, as my project is not intended to feature a syllabification module at its current stage of development.
1 This issue could be approached differently, by specifying the geminating stems with their double consonants in the lexicon and then removing the additional segment if necessary.
burnu (nose, Acc., the nose)
fikri (idea, Acc., the idea)
şehri (city, Acc., the city)1
ömrü (life, Acc., the life)
alnı (forehead, Acc., the forehead)
This phenomenon occurs again whenever a suffix starting with a vowel is attached to the stem (it seems that all the stem-internal alternations in Turkish are conditioned by the same context). The epenthesized vowel is always a high vowel, but its other features cannot be determined automatically, so it has to be hard-coded. Such stems will be indicated in the lexicon with a meta character preceding the vowel which is to be deleted. As for the quality of the consonant clusters that are formed after the epenthesis occurs, there have been several attempts to define the possible consonant sequences in such cases, but this is far beyond the scope of this paper.
1 In modern Turkish, the tendency is to retain the i in şehir (city): şehiri (the city).
2 The Type 1 glottal marker ^ does not manifest itself orthographically.
Type 2: -> i / (i if a consonant follows and if a vowel follows) nev (sort) -> -> neviler (sorts) nevi (the sort / his/her sort) (Schaaik, 1996, p. 114)
Both are supposed to act as consonants if a vowel follows. In modern Turkish, however, the glottal stop is mostly omitted both in speech and in writing. It is preserved only when ambiguities occur: telin (of the wire / your wire) and tel'in (denunciation). Apparently, the glottal stop is not featured in TLDP either. Both cases are accepted there: camii and camisi both denote the 3pSg Possessive form (his/her mosque); likewise, camii and camiyi both denote the accusative case (the mosque). For the second type, though, only yeisi (the despair / his/her despair) and neviyi (the sort) / nevisi (his/her sort) are recognized. So the first type allows for both realizations, whereas the second type behaves more or less as if it wasn't there at all. In my solution, I tried to approach the issue as in the TLDP. There are some mismatches, though, and even though it is more likely that the mistake is overgeneration on my side, it is also possible that the TLDP analyzer has some flaws. The examples I am concerned with are:
3.3 Implementation
The model comprises two components: the lexicon, defined in lexc, which describes the morphotactics of Turkish (technically it is implemented as an FSA, but it does include some transductions for the tags), and a set of rules that describe the morphophonological alternations occurring on the surface (implemented, naturally, as a set of FSTs in xfst, using the formalism of replacement rules).
(26) Multichar_Symbols +Noun +Poss +Case +1p +2p +3p +Sg +Pl +DefObj +Gen +Dat +Loc +Abl +Abs
These are primarily used to define the tags to be used (case marking, possession, number, etc.). Further on, the lexicon contains a sub-lexicon of the noun stems. It is the simplest but most important part: it contains the noun stems in their lexical (underlying) form, which could be automatically extracted from a dictionary. This form includes all the special symbols that denote alternating segments and trigger the alternation rules. Then, at the next stage (the standard continuation class for all nouns), a tag +Noun is attached on the upper side, that is, it is visible only if morphological analysis (or lookup) is performed (the same holds for all the other tags). On the lower (surface) side it is realized as an epsilon. The continuation class from there is the number lexicon: number suffixes are attached on the lower (surface) side and the tags +Sg and +Pl are attached on the upper (lexical) side (the dash stands for morpheme boundary):
A possessive sub-lexicon follows which defines the inflection for possession as described in Section 3.1.3 with the appropriate tags. There is an intermediate lexicon however, that specifies the optionality of the possessive suffixes:
That is, either take a possessive tag +Poss and go to the lexicon of possessive suffixes, or take a +Case tag and go to the lexicon of case suffixes. So the actual sub-lexicon for the possessive suffixes is called PSuff:
After taking a possessive suffix there is again an intermediate stage that has to be passed: the possessive forms still have to take a +Case tag. In the morphological analysis module of the Turkish WordNet the possessive markup is obligatory. It is referred to as possessive agreement there, and if there is none, the tag is +Pnon. I don't find it necessary for now, but of course it won't be any problem to tune my system so that it features the same type of mark-up. Two more points to make clear: the optional segments, which were marked with brackets in the theoretical part, are prefixed with an optionality marker (*); the pronominal n is denoted by the capital N. Oflazer (Oflazer, 1995) defines it as a part of the case suffixes, so in his description there are two copies of each case suffix: one that follows the third person possessive forms and one for all the other possessive and non-possessive forms. In my case it is an optional segment that surfaces only if there is a suffix following the third person singular and plural possessive forms. To me it seems more intuitive to have it as a part of the possessive, as it is indeed a pronominal n, and I don't see much sense in having two instances of every case inflection.
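How the optionality marker and the pronominal N might be resolved can be sketched as follows. This is a Python simulation under stated assumptions, not the rules of the actual implementation: the suffix shapes -*Im and -*sIN are hypothetical encodings of (I)m and (s)I plus N, an optional segment is assumed to surface only when it separates a vowel from a consonant, and the function name resolve_optional is invented for the example. The archiphonemes I and E are left for the harmony rules.

```python
# Vowels plus the vowel archiphonemes, which still await harmony resolution.
VOWEL_LIKE = set("aeıioöuüIE")


def resolve_optional(form):
    """Resolve '*'-marked optional segments and the pronominal N."""
    out = []
    i = 0
    while i < len(form):
        ch = form[i]
        if ch == "*":
            segment = form[i + 1]
            previous = next((c for c in reversed(out) if c != "-"), "")
            # keep the segment only if it differs in type from its left neighbour
            if (segment in VOWEL_LIKE) != (previous in VOWEL_LIKE):
                out.append(segment)
            i += 2
            continue
        if ch == "N":
            # the pronominal n surfaces only when more suffix material follows
            if any(c != "-" for c in form[i + 1:]):
                out.append("n")
            i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)


print(resolve_optional("ev-*sIN-DE"))     # ev-In-DE  (later: evinde)
print(resolve_optional("araba-*sIN"))     # araba-sI  (later: arabası)
print(resolve_optional("araba-*Im"))      # araba-m
```

Keeping N inside the possessive suffix, as argued above, means the case suffixes need only one copy each; the price is one extra rule of this kind.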
The last component of our lexicon is the case inflection sub-lexicon. It is obligatory, as all uninflected nouns are in their absolute form (Nominative case). The hash symbol (#) is an anchor symbol denoting word boundary (in replacement rules it is enclosed in dots (.#.)). To summarize, a visual map of the lexicon network is presented in Figure 3.2 below:
[Figure 3.2: lexicon network. Recoverable labels: states 0.Root, 1.Noun, 2.NN, 4.Possessive; arcs /Noun Stems/, +Noun:0, +Sg:0, +Pl:lEr, +Poss:0]
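The continuation-class design of the lexicon can be mimicked outside lexc. The Python sketch below walks a small table of (upper, lower, continuation) entries and collects the resulting upper:lower pairs. The names Root, Noun, NN, Possessive and PSuff follow the text; Number, PCase and CSuff, the reduced suffix inventory and the suffix shapes are assumptions made to keep the example short.

```python
# Each sub-lexicon maps to entries of (upper side, lower side, continuation).
# "#" marks the end of the word, as in lexc.
LEXICON = {
    "Root": [("", "", "Noun")],
    "Noun": [("ev", "ev", "NN")],
    "NN": [("+Noun", "", "Number")],
    "Number": [("+Sg", "", "Possessive"), ("+Pl", "-lEr", "Possessive")],
    "Possessive": [("+Poss", "", "PSuff"), ("+Case", "", "CSuff")],
    "PSuff": [("+1p+Sg", "-*Im", "PCase"), ("+3p+Sg", "-*sIN", "PCase")],
    "PCase": [("+Case", "", "CSuff")],
    "CSuff": [("+Abs", "", "#"), ("+Loc", "-DE", "#")],
}


def expand(name="Root", upper="", lower=""):
    """Yield every (upper, lower) pair reachable from the given sub-lexicon."""
    if name == "#":
        yield upper, lower
        return
    for up, low, continuation in LEXICON[name]:
        yield from expand(continuation, upper + up, lower + low)


pairs = list(expand())
analyses = {low: up for up, low in pairs}  # lookup direction: lower -> upper

print(len(pairs))                  # 12
print(analyses["ev-lEr-*sIN-DE"])  # ev+Noun+Pl+Poss+3p+Sg+Case+Loc
```

The lower side produced here is still the lexical form with archiphonemes and markers; the alternation rules are what turn it into the surface form.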
sequence, as some of them do depend on each other. Full independence is hardly achievable. In the case of a dropping vowel in the stem for instance, the vowel harmony rules have to apply before the vowel is deleted, since the suffixes have to harmonize with this vowel. This is especially true for monosyllabic roots that lose their one and only vowel. The rules are split (for now) in several groups addressing the different phenomena types that they describe. A few classes needed to be defined in order to make the rules operational. I defined a class for the vowels and consonants initially, where the consonant class had to be extended to feature all the archiphonemic descriptions used. As already mentioned, the vowels are further divided into subclasses according to their features for the vowel harmony resolution. Further on, for the rule of progressive assimilation in suffixes, I had to define a class of voiceless consonants.
and the even more complicated case of serhat (border), which is an exception to vowel harmony besides undergoing gemination and voicing: serhaddi (border, Acc., the border). This issue could be fixed using a few minor tricks, and the current system is ready to handle it, but I will leave it for a later stage of development.
The remaining rules clean up the marker leftovers. The clean up procedures can be incorporated in the rules themselves, but during the development stage, I prefer to keep them separated for debugging purposes.
First things first: getting rid of the multiple plural morpheme is a good thing to start with. There are some local dependencies among the rules: as already mentioned, the e-type harmony rule has to precede the i-type harmony rule (or they will probably have to be merged into a single rule and apply simultaneously, as two-level rules do). Also, the vowel harmony resolution has to precede the stem vowel deletion. If we proceed from left to right (with parallel rules), the stem vowels will be deleted before the vowels in the suffixes, which should harmonize with the deleted vowel, are resolved. There should also be some tendency to go from simpler and more general to more sophisticated and specific rules (either in upward or downward direction). Such a tendency, however, is not present at the current stage of development. The final stop devoicing, the velar alternations, the gemination, the stem vowel deletion and the rule for the su exception could all operate at a single stage, as they occur in identical contexts and their purpose is more or less the same. The suffix onset devoicing rule is partially dependent on the outcome of the final stop devoicing rule, but if the input is processed left to right, this will be determined before the application of the suffix onset devoicing rule. The pronominal n rule is also on its own, so, looking at the bigger picture, it seems in the end that the rules are mostly independent. All that matters is to process the input sequentially, from left to right. Therefore, if we have the wrong rule ordering, rules that apply to segments occurring after unresolved segments might cause major trouble. This is the reason why most finite state approaches to Turkish morphology are based on two-level morphological descriptions.
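The dependency between vowel harmony and stem vowel deletion can be made concrete with a small Python experiment. Several things here are assumptions rather than the thesis' own material: the meta character ~ standing before the deletable vowel is a hypothetical choice, the noun vakit (time, Acc. vakti) is an illustrative example of my own, and only the i-type harmony is modelled.

```python
import re

VOWELS = "aeıioöuü"
BACK = set("aıou")
ROUNDED = set("oöuü")
# High vowel selected by (is_back, is_rounded) of the last preceding vowel.
HIGH_BY_FEATURES = {
    (True, True): "u", (True, False): "ı",
    (False, True): "ü", (False, False): "i",
}


def harmonize_i(form):
    """Resolve every I against the last preceding vowel, left to right."""
    out, last = [], None
    for ch in form:
        if ch == "I" and last is not None:
            ch = HIGH_BY_FEATURES[(last in BACK, last in ROUNDED)]
        if ch in VOWELS:
            last = ch
        out.append(ch)
    return "".join(out)


def delete_stem_vowel(form):
    """Delete the ~-marked stem vowel before a vowel-initial suffix."""
    form = re.sub(rf"~[{VOWELS}]([^{VOWELS}-]*)-(?=[{VOWELS}I])", r"\1-", form)
    return form.replace("~", "")


def harmony_then_deletion(form):
    return delete_stem_vowel(harmonize_i(form)).replace("-", "")


def deletion_then_harmony(form):
    return harmonize_i(delete_stem_vowel(form)).replace("-", "")


print(harmony_then_deletion("vak~it-I"))  # vakti
print(deletion_then_harmony("vak~it-I"))  # vaktı (wrong)
print(harmony_then_deletion("bur~un-I"))  # burnu
```

For burun both orders happen to give burnu, since the remaining stem vowel has the same harmonic features as the deleted one; the ordering only becomes visible with stems like vakit.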
4. Conclusions
In order to analyze the complex and often symbiotic relations between words, one needs first to determine the exact properties of each and every individual token. Some of the properties however, could only be determined after examining the environment. The common approach to this issue is inside-out (or bottom-up) starting from the basic entities and building up increasingly complex structures out of them. In this paper I presented an approach to part of the basic entities in the Turkish language.
5. Future Work
Where do we go from here? One could come up with various ideas, and I myself am not sure which way this project will take. First, before everything else, the model has to be completed to cover the other major word category in Turkish, as well as the minor word categories, to result in a full-featured morphological processor. Then, perhaps, to extend the functionality, a lexicon extraction routine has to be implemented that automatically extracts entries from a dictionary into the morphological processor. This could be combined with a morphological guesser, and the two could form a symbiotic relation in which the former is used to train the latter, while the guessing algorithm occasionally provides material for the extension of the lexicon. Further on, I am thinking of implementing a syllabification module, as it seems quite necessary, as well as perhaps stress markup. Having a fully functional morphological processor at hand, there are various ways one could take: integrate it into a larger NLP system (speech synthesis/recognition applications, automatic machine translation applications, language tutoring applications, artificial intelligence components, OCR applications, supplemental linguistic applications); extend its functionality for different tasks (a major advantage of the modular approach: simply add a new module for the task at hand and occasionally tune the existing modules); add a context component for disambiguation (this perhaps falls in the previous category); try approaching a different language; and numerous other options in the field. As a first step, however, complete coverage of the language of choice has to be accomplished.
Bibliography:
Clements, George N. and Engin Sezer. 1982. Vowel and Consonant Disharmony in Turkish. Linguistic Models: The Structure of Phonological Representations (Part II), ed. by H. van der Hulst and N. Smith. Foris Publishing, Dordrecht, Holland.
Dik, Simon C. 1981. Functional Grammar. 3rd ed. Foris, Dordrecht, The Netherlands.
Hankamer, Jorge. 1986. Finite State Morphology and Left to Right Parsing. Paper, 3rd International Conference on Turkish Linguistics, August 1986, Tilburg, The Netherlands.
Hopcroft, J.E. and J.D. Ullman. 1979. Introduction to Automata Theory, Languages and Computation. Addison Wesley.
Inkelas, Sharon and C. Orhan Orgun. 1997. The Implications of Lexical Exceptions for the Nature of Grammar. Derivations and Constraints in Phonology, ed. by Iggy Roca. Clarendon Press, Oxford.
Johnson, C. Douglas. 1972. Formal Aspects of Phonological Description. Mouton, The Hague, Paris.
Karttunen, Lauri, with Kenneth R. Beesley. 2003. Finite State Morphology. CSLI Publications, Stanford.
Ketrez, F. Nihan. 2003. Multiple Readings of the Plural Morpheme in Turkish. USC, USA. (online at: http://www-scf.usc.edu/~ketrez/papers/ADL2003ketrez.pdf - 25.06.2005)
Koskenniemi, Kimmo. 1983. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Department of General Linguistics, University of Helsinki.
Köksal, A. 1975. A First Approach to a Computerized Model for the Automatic Morphological Analysis of Turkish. Doctoral Dissertation, Hacettepe Universitesi, Ankara.
Lewis, Geoffrey. 1967. Turkish Grammar. Oxford University Press, Oxford.
Lewis, Geoffrey. 1989. Turkish. 2nd ed. (Teach Yourself Books). Hodder and Stoughton, London.
Oflazer, Kemal. 1994. Two-level Description of Turkish Morphology. Literary and Linguistic Computing. (online at: http://acl.ldc.upenn.edu/E/E93/E93-1066.pdf - 25.06.2005)
Oflazer, Kemal, Elvan Göçmen and Cem Bozşahin. 1995. An Outline of Turkish Morphology. Technical Report. Middle East Technical University. (online at: http://www.lcsl.metu.edu.tr/ftp/papers/morphspecs.ps.gz - 18.07.2005)
Pollard, Asuman Çelen and David Pollard. 1996. Turkish: A complete course for beginners. (Teach Yourself Books). Hodder and Stoughton, London.
Schaaik, Gerjan van. 1996. Studies in Turkish Grammar. Harrassowitz Verlag, Wiesbaden, Germany.
Sebüktekin, Hikmet I. 1971. Turkish-English Contrastive Analysis: Turkish Morphology and Corresponding English Structures. Mouton, The Hague, Paris.
Useful links:
http://www.hlst.sabanciuniv.edu/TL/ - The Turkish Lexical Database Project; provides morphological analysis to verify the results
http://www.turkishdictionary.net/ - Turkish online dictionary; additional glossary
http://www.google.com/ - Everything is there! Using the web as a corpus
NUMBER/POSSESSIVE:
Sg./+Sg         Singular
Pl./+Pl         Plural
(+)1p/2p/3p     1/2/3 Person
Poss./+Poss     Possessive
GENERAL:
FST     Finite State Transducer
FSA     Finite State Automaton (-ta)
FSN     Finite State Network
LF      Lexical Form (lexicon entry form)
SF      Surface Form (standard orthographical representation)
!#State 2
!#State 3
!#The hash symbol denotes end of input, or a final state
!#The loop back to State 1
!#################A model lexc solution for Figure 2.3###########################
!#Same as above for the most part
LEXICON Root
One ;
LEXICON One
a:A Two ;
b:B Three ;
!#The colon operator denotes a transduction here
!#Basically the expressions could be regular expressions
!#with varying complexity, combining various operations,
!#but as my key concept is modularity, I will try to keep
!#them as simple as possible.