Christian Wartena
Morphology Morphemes Language types Analysis
Morphology
Morpheme
Phonemes distinguish meanings, but do not carry meaning themselves
Morphemes: smallest units that bear meaning
In many languages morphemes can be combined to form words.
Morphology deals with the rules for building words from
morphemes.
Words
What is a word
Usually clear.
Many borderline cases
In some exotic languages
In German as well: some changes in spelling reform of
1996.
E.g. Gewinn bringend / gewinnbringend (profitable)
Compounds:
In English usually written as two words (but e.g. football)
In German always as one word
German compounds contain additional elements that make
separation difficult:
Betrachtung + s + weise (Way of looking at things)
Practical problems: 5-fold, spin-off, don't, etc.
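The linking elements in German compounds can be handled by trying each candidate linker between two known words. The following is a minimal sketch, assuming a tiny hypothetical lexicon and a hand-picked list of linking elements; the names `LEXICON`, `LINKERS` and `split_compound` are illustrative only.

```python
# Hypothetical mini-lexicon; a real system would use a full word list.
LEXICON = {"betrachtung", "weise", "arbeit", "zimmer", "kind", "garten"}
# A few common German linking elements; "" means no linking element.
LINKERS = ["", "s", "es", "n", "en", "er"]

def split_compound(word, lexicon=LEXICON):
    """Return all (head, linker, tail) splits whose head and tail are in the lexicon."""
    word = word.lower()
    results = []
    for i in range(1, len(word)):
        head = word[:i]
        if head not in lexicon:
            continue
        rest = word[i:]
        for linker in LINKERS:
            tail = rest[len(linker):]
            if rest.startswith(linker) and tail in lexicon:
                results.append((head, linker, tail))
    return results

print(split_compound("Betrachtungsweise"))
print(split_compound("Kindergarten"))
```

The sketch only handles two-part compounds and returns every split it finds, so ambiguous words would need an additional ranking step.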
Types of Morphology
Inflection
Systematic building of grammatical variants.
No change of meaning / clearly defined semantics (e.g. plural)
Productive
Naming convention:
Conjugation (used for verbs)
Declension (used for all other words)
Derivation
Derivation of new words
Often change of part of speech
Meaning often unpredictable
Not always productive
Often lexicalized
Morpheme types
Morpheme types
Affix types
Position
Prefix
Suffix
Infix
Circumfix
Function
Derivation affix
Inflection affix
Productivity
Productive
New words can be formed
verb + -bar: besprechbar, denkbar, programmierbar,
fotografierbar, zusammenheftbar
adjective + -ly: usually, frequently
verb + -able: imaginable, programmable
Non-productive
Maybe productive in the past.
Process clear, but cannot be applied to new words.
Ziegelei, Tischlerei, Molkerei, Metzgerei, Kantorei,
Bücherei, *Computerei, *Elektrikerei, *Plakaterei
Prüfling, Lehrling, Säugling, Flüchtling, *Telefonierling,
*Helfling, *Entwerfling
government, settlement, placement, development,
management, employment, treatment, *teachment,
*drivement, *flyment
Language Typology
Morphology types
Fusional (or inflecting)
Isolating (or analytical)
Agglutinating
Isolating languages
Each word consists of exactly one morpheme!
No inflection, no derivation!
Many languages from eastern Asia (Mandarin,
Vietnamese)
(1) Khi tôi đến nhà bạn tôi, chúng tôi bắt đầu làm bài.
    As I come house friend I, PLURAL I start do lesson
    As I came to the house of my friend, we started the lesson.
Agglutinating languages
Each affix has one function
Affixes can be concatenated
Examples: Turkish (and all central Asian relatives),
Finnish, Hungarian, Japanese, Korean, Georgian,
Mongolic, Yupik, . . .
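The two properties above (one function per affix, affixes concatenated) can be illustrated with the Turkish word evlerimde ("in my houses": ev + ler + im + de). The sketch below is a toy segmenter under strong assumptions: the suffix table lists only one vowel-harmony variant per suffix and the stem list contains a single entry, so it works for this example but not for Turkish in general. All names are illustrative.

```python
# Toy suffix table: each suffix carries exactly one function.
SUFFIXES = {"ler": "PLURAL", "im": "POSS.1SG", "de": "LOCATIVE"}
STEMS = {"ev": "house"}

def segment(word):
    """Strip known suffixes from the right until a known stem remains."""
    morphs, glosses = [], []
    rest = word
    while rest not in STEMS:
        for suffix, gloss in SUFFIXES.items():
            if rest.endswith(suffix) and len(rest) > len(suffix):
                morphs.insert(0, suffix)
                glosses.insert(0, gloss)
                rest = rest[: -len(suffix)]
                break
        else:
            return None  # no suffix matched: no analysis found
    return [rest] + morphs, [STEMS[rest]] + glosses

print(segment("evlerimde"))
```

Because every suffix maps to exactly one gloss, the analysis is a simple chain of lookups; a fusional language would need entries that bundle several functions into one affix.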
Fusional languages
One affix has several functions
Most Indo-European Languages, Semitic Languages, most
African languages
Mixed languages
Most languages are somewhere in between!
Japanese has many fusional (irregular) suffixes.
Fusional languages can be quite regular:
Other phenomena
Polysynthesis
Siberian Yupik
Angya-ghlla-ng-yug-tuq
Boat-Augmentative-Acquire-Desiderative-3Sing
He wants to acquire a big boat
Incorporation
Chukchi
Te-meyne-levte-pegt-erken
1SingSubj-large-head-pain-ImperfAspect
I had a large headache
Morphology Morphemes Language types Analysis Approaches Two Level Morphology Morphology in NLTK
Morphological Analysis
Traditional Analysis
Morphology is binary and hierarchical
First we remove inflectional morphemes
Subsequently derivational morphemes
Challenges
Many very small classes of words
Tens of thousands of words in each language
Phonological processes at morpheme boundaries
Approaches
Practical systems
Lexicon based (full-form lexicon).
Popular for English
Have all forms of each word in the dictionary
Heuristics / Lexicon free approaches.
Rule based.
Some Terminology
Stem
Stem is what remains when all inflectional affixes are
removed
A stemmer should find the stem of a word
Lemma
Canonical form of a word.
Form that is found in a dictionary
Algorithm that finds the lemma for a word form is called a
Lemmatizer
Root
Root is what remains when all affixes are removed
Completely different words have the same root!
Many stemmers also find roots!
Common errors
Overstemming
Too many letters are removed
Understemming
Not all affixes are identified and removed
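Both error types show up quickly with a deliberately naive suffix stripper. The sketch below is only an illustration: the suffix list, the example words and their expected stems are chosen by hand, and comparing string lengths is a crude stand-in for a proper evaluation.

```python
def naive_stem(word, suffixes=("ing", "s")):
    """Strip the first matching suffix, with no further checks."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

def classify(word, expected):
    """Compare the naive stem with a hand-picked expected stem."""
    stem = naive_stem(word)
    if stem == expected:
        return "ok"
    # shorter than expected: too much removed; otherwise an affix was missed
    return "overstemming" if len(stem) < len(expected) else "understemming"

# (word, expected stem) pairs chosen by hand for illustration
examples = [("walking", "walk"), ("thing", "thing"), ("walked", "walk"), ("cats", "cat")]
for word, expected in examples:
    print(word, naive_stem(word), classify(word, expected))
```

Here "thing" loses its ending although -ing is not a suffix in that word, while "walked" keeps -ed because the rule list does not know that suffix.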
Porter Stemmer
Very popular stemmer by Martin Porter
List of affixes and simple rules (e.g. -ational → -ate:
relational → relate)
Easy to implement, small, fast
Many errors: e.g. -ing is not always a gerund ending (thing, wing);
unrelated words are conflated (policy/police, doing/doe, organization/organ).
No clear ideas about derivational and inflectional
morphology.
Nevertheless useful. E.g. for IR: no problems as long as
unique stems for each word are found
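A single Porter-style rule can be sketched as follows. In Porter's algorithm the rule -ational → -ate only fires if the remaining stem has a measure m > 0, where m counts vowel-consonant sequences. This sketch simplifies by treating only a, e, i, o, u as vowels (Porter also treats some occurrences of y as vowels) and implements just this one rule, not the full stemmer.

```python
VOWELS = set("aeiou")  # simplification: 'y' is always treated as a consonant

def measure(stem):
    """Count vowel-to-consonant transitions (the VC sequences in Porter's m)."""
    m = 0
    prev_vowel = False
    for ch in stem:
        is_vowel = ch in VOWELS
        if prev_vowel and not is_vowel:
            m += 1
        prev_vowel = is_vowel
    return m

def apply_rule(word, suffix="ational", replacement="ate"):
    """Apply 'suffix -> replacement' only if the remaining stem has m > 0."""
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if measure(stem) > 0:
            return stem + replacement
    return word

print(apply_rule("relational"))  # stem "rel" has m = 1, so the rule fires
print(apply_rule("rational"))    # stem "r" has m = 0, so the word is unchanged
```

The measure condition is what keeps short words such as "rational" from being reduced to a meaningless fragment.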
Lexical:      f o x +N +PL
Intermediate: f o x ^ s #
Surface:      f o x e s
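The step from the intermediate to the surface level can be imitated with a single rewrite rule. The sketch below assumes the usual textbook notation in which `^` marks a morpheme boundary and `#` the word boundary, and it uses a regular expression rather than a compiled transducer; `to_surface` is an illustrative name.

```python
import re

def to_surface(intermediate):
    """Map an intermediate form such as 'fox^s#' to a surface form."""
    # e-insertion: after x, s, z, ch or sh, insert 'e' before the suffix 's'
    form = re.sub(r"(x|s|z|ch|sh)\^s#", r"\1es#", intermediate)
    # remove the remaining boundary symbols
    return form.replace("^", "").replace("#", "")

print(to_surface("fox^s#"))
print(to_surface("cat^s#"))
```

A two-level system would state the same restriction declaratively and compile it into a finite-state transducer that can be run in both directions.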
Compilers
XFST: Xerox Finite State Tool
HFST: Helsinki Finite-State Transducer Technology
SFST: Stuttgart Finite-State Transducer Tools
OpenFST: Google/Apache
Components
Lexicon: a right-linear CFG describes the lexical and
intermediate levels!
Rules: restrictions on the realization of the surface level
Compiler computes the intersection of all restrictions (note:
the intersection of two FSAs is again an FSA!)
No problem of rule ordering!
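That the intersection of two finite-state automata is again a finite-state automaton follows from the product construction, which runs both automata in parallel. The sketch below assumes deterministic automata given as `(start, finals, delta)` triples with `delta` mapping `(state, symbol)` to a state; this representation and the two example automata are illustrative and not taken from any of the tools listed above.

```python
def intersect(dfa1, dfa2):
    """Product construction: states are pairs, both automata move together."""
    start1, finals1, delta1 = dfa1
    start2, finals2, delta2 = dfa2
    start = (start1, start2)
    delta, finals = {}, set()
    todo, seen = [start], {start}
    while todo:
        state = todo.pop()
        q1, q2 = state
        if q1 in finals1 and q2 in finals2:
            finals.add(state)
        # follow only symbols on which both automata can move
        for (p, sym), target1 in delta1.items():
            if p != q1 or (q2, sym) not in delta2:
                continue
            target = (target1, delta2[(q2, sym)])
            delta[(state, sym)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return start, finals, delta

def accepts(dfa, word):
    """Run a deterministic automaton on a word."""
    start, finals, delta = dfa
    state = start
    for sym in word:
        if (state, sym) not in delta:
            return False
        state = delta[(state, sym)]
    return state in finals

# even_a: even number of 'a'; ends_b: word ends with 'b' (alphabet {a, b})
even_a = (0, {0}, {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1})
ends_b = ("n", {"y"}, {("n", "a"): "n", ("y", "a"): "n", ("n", "b"): "y", ("y", "b"): "y"})
both = intersect(even_a, ends_b)
print(accepts(both, "aab"), accepts(both, "ab"))
```

Since the result accepts exactly the words accepted by both inputs, the order in which restrictions are intersected does not matter, which is why rule ordering is not an issue.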
Multichar_Symbols
+V +Praes +Sg3 +Sg2
LEXICON Root
Verb;
LEXICON Verb
atm Vschw;
wend Vschw;
fürcht Vschw;
denk Vst;
fahr Vst;
sauf Vst;
LEXICON Vst
+V+Praes+Sg3:%^%"t #;
+V+Praes+Sg2:%^%"st #;
END
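How the lexicon pairs lexical and intermediate strings can be imitated in a few lines. The sketch below covers only the entries of continuation class Vst and reads `^` as a morpheme boundary and `"` as an umlaut trigger. The umlaut rule itself is an assumption made for illustration (the last a of the stem becomes ä before the trigger), since the rule file is not reproduced here.

```python
import re

STEMS = ["denk", "fahr", "sauf"]  # entries with continuation class Vst
# lexical tags : intermediate ending (boundary ^ plus umlaut trigger ")
VST = {"+V+Praes+Sg3": '^"t', "+V+Praes+Sg2": '^"st'}

def intermediate_forms(stem):
    """Pair each lexical string with its intermediate string."""
    return {stem + tags: stem + ending for tags, ending in VST.items()}

def surface(intermediate):
    """Assumed umlaut rule: the last 'a' of the stem becomes 'ä' before ^"."""
    form = re.sub(r'a([^a^]*)\^"', r'ä\1', intermediate)
    # stems without 'a' are unchanged; drop any remaining diacritics
    return form.replace('^"', "")

for stem in STEMS:
    for lexical, inter in intermediate_forms(stem).items():
        print(lexical, inter, surface(inter))
```

Under this assumed rule the stem denk passes through unchanged because it contains no a, while fahr and sauf receive the umlaut.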
Alphabet
a a: b c d e f g h i j k l m n o p q r s t u v x y w z
Diacritics
%^ %";
Sets
Cons = b c d f g h k l m n p q r s t v x y w z;
Rules
Tokenization
NLTK Tokenizer
The package nltk.tokenize contains a variety of
different tokenizers.
If you have no special requirements, it is best to
simply use word_tokenize:
import nltk

# Requires the NLTK tokenizer models, e.g. nltk.download("punkt")
sentence = "At eight o'clock on Thursday morning ... Arthur didn't feel very good."
tokens = nltk.word_tokenize(sentence, language="english")
print(tokens)
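For comparison, a rough tokenizer can be written with one regular expression from the standard library. This is a minimal sketch, not a replacement for word_tokenize: it keeps words with internal hyphens or apostrophes (5-fold, spin-off, don't) together and treats every other non-space character as its own token, whereas NLTK makes different choices for contractions.

```python
import re

# a word, optionally continued by hyphen/apostrophe plus more word characters,
# or any single character that is neither a word character nor whitespace
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Return the list of tokens matched by the TOKEN pattern."""
    return TOKEN.findall(text)

print(simple_tokenize("Arthur didn't feel 5-fold better."))
```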
Sentence Splitter
Sentences
Grammatical constructions are confined to
sentences.
POS tagging is done sentence by sentence.
Named entities do not span sentence boundaries.
End of sentence
A period that does not belong to an abbreviation or a number
Blank lines
Sometimes a period that does belong to an abbreviation!
Sometimes a period that does belong to an ordinal number!
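The first rule for sentence ends can be sketched with a whitespace split and a small exception list. The abbreviation list below is hypothetical and deliberately tiny, and the function name `split_sentences` is illustrative. Blank lines are not handled, and a sentence that really ends with an abbreviation or an ordinal number is missed, which is exactly the difficult case named above.

```python
import re

# Hypothetical, deliberately tiny abbreviation list
ABBREVIATIONS = {"z.B.", "Dr.", "usw.", "etc."}

def split_sentences(text):
    """Split at '.', '!' or '?' unless the period ends an abbreviation or a number."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token in ABBREVIATIONS:
            continue  # period belongs to an abbreviation
        if re.fullmatch(r"\d+\.", token):
            continue  # period belongs to a number, e.g. "3."
        if token.endswith((".", "!", "?")):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Meier kam am 3. Mai. Er blieb lange."))
```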
Tokenization
NLTK Tokenizer
The package nltk.tokenize also contains sentence
splitters:
import nltk

# Read the example text file (UTF-8 encoded)
with open("Syrien.txt", "r", encoding="utf-8") as textfile:
    text = textfile.read()

# Split into sentences, then tokenize one of them
sentences = nltk.sent_tokenize(text, language="german")
tokenized_sent = nltk.tokenize.word_tokenize(sentences[23], language="german")
print(tokenized_sent)
Stemming
Porter Stemmer
NLTK includes a Porter stemmer
Alternatively, we can use:
nltk.LancasterStemmer()
import nltk

porter = nltk.PorterStemmer()

sentence = "We were one of the first digital marketing agencies to capitalize on it."
tokens = nltk.word_tokenize(sentence)
# Stem every token individually
stems = [porter.stem(t) for t in tokens]
print(stems)
Lemmatization
import nltk

# Requires the WordNet data, e.g. nltk.download("wordnet")
lemmatizer = nltk.WordNetLemmatizer()

sentence = "We were one of the first digital marketing agencies to capitalize on it."
tokens = nltk.word_tokenize(sentence)

lemmata = [lemmatizer.lemmatize(t) for t in tokens]
print(lemmata)