NLP 3 Morphology

Morphology Morphemes Language types Analysis
Natural Language Processing

Morphology
Christian Wartena
Hochschule Hannover, Abteilung Information und Kommunikation
April 19, 2017
Morphology
Morpheme
Phonemes distinguish meanings, but do not carry meaning themselves
Morphemes: smallest units that bear meaning
In many languages, morphemes can be combined into words.
Morphology deals with the rules of building words from
morphemes.
Words
What is a word?
Usually clear.
Many borderline cases
In some exotic languages
In German as well: some changes in the spelling reform of 1996.
E.g. Gewinn bringend / gewinnbringend (profitable)
Compounds:
In English usually written as two words (but e.g. football)
In German always as one word
German compounds contain additional elements that make separation difficult:
Betrachtung + s + weise (way of looking at things)
Practical problems: 5-fold, spin-off, don't, etc.
Types of Morphology
Inflection
Systematic building of grammatical variants.
No change of meaning / clearly defined semantics (e.g. plural)
Productive
Naming convention:
Conjugation (used for verbs)
Declension (used for all other words)
Derivation
Derivation of new words
Often change of part of speech
Meaning often unpredictable
Not always productive
Often lexicalized
Morpheme types
Base morpheme: Meaningful kernel of a word.
Free morpheme: Morpheme that can serve as a word.
Bound morpheme: Morpheme that has to be bound to another morpheme.
Lexical morpheme: Morpheme with its own, independent meaning.
Own meaning, or
Changes the meaning of another word without changing its grammatical properties
Grammatical morpheme: Morpheme with a grammatical function and no autonomous meaning.
Grammatical function
Changes the part of speech of another word
Adds a feature (like plural, past, etc.)
Morpheme types
Free/Bound and Lexical/Grammatical

         lexical                                                grammatical
free     Words that don't need inflection                       preposition, conjunction, pronoun, article
bound    Words that need inflection; some affixes, confixes     Affixes
Examples are, of course, language dependent!
Affix types
Position
Prefix
Suffix
Infix
Circumfix
Function
Derivation affix
Inflection affix
Productivity
Productive
New words can be formed
verb + -bar: besprechbar, denkbar, programmierbar, fotografierbar, zusammenheftbar
adjective + -ly: usually, frequently
verb + -able: imaginable, programmable
Non-productive
May have been productive in the past.
The process is clear, but cannot be applied to new words.
Ziegelei, Tischlerei, Molkerei, Metzgerei, Kantorei, Bücherei, *Computerei, *Elektrikerei, *Plakaterei
Prüfling, Lehrling, Säugling, Flüchtling, *Telefonierling, *Helfling, *Entwerfling
government, settlement, placement, development, management, employment, treatment, *teachment, *drivement, *flyment
Language Typology
Traditional main criteria for classification of languages

Basic word order (Subject-Verb-Object,
Subject-Object-Verb, Noun-Adjective vs. Adjective-Noun
etc.)
Morphology type
Morphology types
Fusional (or inflecting)
Isolating (or analytical)
Agglutinating
Language Typology
Isolating languages
Each word consists of exactly one morpheme!
No inflection, no derivation!
Many languages of East Asia (Mandarin, Vietnamese)
(1)
Khi tôi đến nhà bạn tôi, chúng tôi bắt đầu làm bài
As I come house friend I, PLURAL I start do teaching
As I came to the house of my friend, we started the lesson.
(Example from Comrie 1989)
Language Typology
Agglutinating languages
Each affix has one function
Affixes can be concatenated
Examples: Turkish (and all its Central Asian relatives), Finnish, Hungarian, Japanese, Korean, Georgian, Mongolic, Yupik, ...
Language Typology
Agglutinating languages: Turkish case system
Table: Declension of the noun adam (man) in Turkish

              Singular    Plural
Nominative    adam        adam-lar
Accusative    adam-ı      adam-lar-ı
Genitive      adam-ın     adam-lar-ın
Dative        adam-a      adam-lar-a
Locative      adam-da     adam-lar-da
Ablative      adam-dan    adam-lar-dan
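The one-affix-one-function principle can be illustrated with a minimal Python sketch that rebuilds the adam paradigm by plain concatenation. The suffixes follow the table and only fit back-vowel stems such as adam; Turkish vowel harmony and consonant alternations are deliberately ignored, and the names `PLURAL`, `CASE_SUFFIX` and `decline` are made up for this illustration.

```python
# Suffixes for the adam paradigm (back-vowel variants only).
PLURAL = "lar"
CASE_SUFFIX = {
    "nominative": "",
    "accusative": "ı",
    "genitive": "ın",
    "dative": "a",
    "locative": "da",
    "ablative": "dan",
}

def decline(stem, case, plural=False):
    """Concatenate stem, optional plural suffix, and case suffix."""
    morphemes = [stem]
    if plural:
        morphemes.append(PLURAL)             # this affix only marks number
    if CASE_SUFFIX[case]:
        morphemes.append(CASE_SUFFIX[case])  # this affix only marks case
    return "-".join(morphemes)

for case in CASE_SUFFIX:
    print(case, decline("adam", case), decline("adam", case, plural=True))
```

Because number and case are marked by separate affixes, the twelve forms need only one plural suffix and five case suffixes.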
Language Typology
Fusional languages
One affix has several functions
Most Indo-European languages, Semitic languages, most African languages
Table: Declension of the noun стол (stol, table) in Russian

                 Singular    Plural
Nominative       stol-∅      stol-y
Accusative       stol-∅      stol-y
Genitive         stol-a      stol-ov
Dative           stol-u      stol-am
Instrumental     stol-om     stol-ami
Prepositional    stol-e      stol-ax
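By contrast, a fusional ending cannot be split into a number part and a case part. A minimal sketch, assuming a hand-written lookup table keyed by case and number, with transliterated endings for stol (the dative plural is entered as -am, the standard Russian ending). `ENDINGS`, `decline_stol` and `analyses` are hypothetical names.

```python
# Each ending encodes case AND number at once; it cannot be decomposed.
ENDINGS = {
    ("nominative", "sg"): "",     ("nominative", "pl"): "y",
    ("accusative", "sg"): "",     ("accusative", "pl"): "y",
    ("genitive", "sg"): "a",      ("genitive", "pl"): "ov",
    ("dative", "sg"): "u",        ("dative", "pl"): "am",
    ("instrumental", "sg"): "om", ("instrumental", "pl"): "ami",
    ("prepositional", "sg"): "e", ("prepositional", "pl"): "ax",
}

def decline_stol(case, number):
    """Return the transliterated form of stol for one case/number pair."""
    return "stol" + ENDINGS[(case, number)]

def analyses(form):
    """All case/number readings of a surface form (endings can be ambiguous)."""
    return [key for key, ending in ENDINGS.items() if "stol" + ending == form]

print(decline_stol("genitive", "pl"))
print(analyses("stoly"))
```

The lookup needs one entry per cell of the paradigm, and the reverse direction shows that a single ending may stand for several case/number combinations.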
Language Typology
Mixed languages
Most languages are somewhere in between!
Japanese has many fusional (irregular) suffixes.
Fusional languages can be quite regular:
Table: Conjugation of the imperfect, aorist, and aorist passive of the verb λύειν (luein, to loosen) in Ancient Greek

                   Imperfect     Aorist         Aorist passive
1st pers. sing.    e-lu-o-n      e-lu-sa        e-lu-thē-n
2nd pers. sing.    e-lu-e-s      e-lu-sa-s      e-lu-thē-s
3rd pers. sing.    e-lu-e        e-lu-se        e-lu-thē
1st pers. plur.    e-lu-o-men    e-lu-sa-men    e-lu-thē-men
2nd pers. plur.    e-lu-e-te     e-lu-sa-te     e-lu-thē-te
3rd pers. plur.    e-lu-o-n      e-lu-sa-n      e-lu-thē-san
Other phenomena
Polysynthesis
Siberian Yupik
Angya-ghlla-ng-yug-tuq
Boat-Augmentative-Acquire-Desiderative-3Sing
He wants to buy a big boat
Incorporation
Chukchi
Te-meyne-levte-pegt-erken
1SingSubj-large-head-pain-ImperfAspect
I had a large headache
Morphological Analysis
Traditional Analysis
Morphology is binary and hierarchical
First we remove inflectional morphemes
Subsequently derivational morphemes
Goals for automatic analysis

Get the base form of the word, which we can find in a dictionary
Get grammatical features
Challenges
Many very small classes of words
Tens of thousands of words in each language
Phonological processes at morpheme boundaries
Approaches
Practical systems
Lexicon based (full-form lexicon).
Popular for English
Have all forms of each word in the dictionary
Heuristics / Lexicon free approaches.
Rule based.
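A full-form lexicon amounts to a plain lookup table. The sketch below uses a tiny hand-made dictionary; the entries, the feature labels and the function name `analyze` are assumptions for illustration and are not taken from any real resource.

```python
# Toy full-form lexicon: every inflected form is listed explicitly.
FULL_FORM_LEXICON = {
    "go":    [("go", "V+Inf")],
    "goes":  [("go", "V+Pres+Sg3")],
    "went":  [("go", "V+Past")],
    "fox":   [("fox", "N+Sg")],
    "foxes": [("fox", "N+Pl")],
}

def analyze(word):
    """Return all (lemma, features) pairs stored for a word form."""
    return FULL_FORM_LEXICON.get(word.lower(), [])

print(analyze("went"))
print(analyze("googled"))  # not listed, so there is no analysis
```

Irregular forms such as went cost nothing extra, but any form missing from the table cannot be analyzed at all, which is why this approach suits languages with few forms per word.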
Some Terminology
Stem
The stem is what remains when all inflectional affixes are removed
A stemmer should find the stem of a word
Lemma
Canonical form of a word.
Form that is found in a dictionary
An algorithm that finds the lemma of a word form is called a lemmatizer
Root
The root is what remains when all affixes are removed
Completely different words can have the same root!
Many stemmers also find roots!
Common errors
Overstemming
Too many letters are removed
Understemming
Not all affixes are identified and removed
Lexicon free morphology
Porter Stemmer
Very popular stemmer by Martin Porter
List of affixes and simple rules (e.g. -ational → -ate: relational → relate)
Easy to implement, small, fast
Many errors: e.g. -ing is not always a gerund ending (thing, wing); policy → police, doing → doe, organization → organ.
No clear notion of derivational versus inflectional morphology.
Nevertheless useful, e.g. for information retrieval (IR): no problems as long as a unique stem is found for each word
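The flavour of such suffix rules can be shown with a toy suffix stripper. This is not the Porter algorithm, which has several ordered steps and conditions on the remaining stem; the two rules and the function name `toy_stem` are assumptions chosen to mirror the examples above.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
RULES = [
    ("ational", "ate"),  # relational -> relate
    ("ing", ""),         # walking -> walk, but also thing -> th
]

def toy_stem(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["relational", "walking", "thing", "wing"]:
    print(w, "->", toy_stem(w))
```

The last two words are overstemmed: without a condition on what remains, -ing is stripped even where it is not a suffix.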
Two Level Morphology
Most advanced rule based approach

Koskenniemi 1983. Technical Report, University of Helsinki
Two representations, plus an optional intermediate one:
Lexical level: stem + features
Intermediate level: stem + abstract affixes (strictly speaking not necessary)
Surface level: word
A finite state transducer (FST) translates between the levels
The FST is compiled from rules
Lexical f o x +N +pl
Intermediate f o x s #
Surface f o x e s
Figure: Schematic representation of two-level morphology
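The three levels in the figure can be mimicked with two small Python functions. This is only a sketch: real systems compile such mappings into a finite state transducer, whereas the code below uses a dictionary and a regular expression. It also marks the morpheme boundary with ^, a common convention that is an assumption here, since the figure does not show it. The names `AFFIXES`, `to_intermediate` and `to_surface` are hypothetical.

```python
import re

# Lexical level -> intermediate level: features become abstract affixes.
AFFIXES = {"+N+pl": "^s#", "+N+sg": "#"}

def to_intermediate(stem, features):
    """Attach the abstract affix for a feature bundle to the stem."""
    return stem + AFFIXES[features]

def to_surface(intermediate):
    """Realise the boundary symbols of an intermediate form."""
    # e-insertion: the boundary ^ surfaces as e between x/s/z and a final s.
    s = re.sub(r"(?<=[xsz])\^(?=s#)", "e", intermediate)
    # All remaining boundary symbols surface as nothing.
    return s.replace("^", "").replace("#", "")

print(to_surface(to_intermediate("fox", "+N+pl")))  # foxes
print(to_surface(to_intermediate("cat", "+N+pl")))  # cats
```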
Compilers
XFST: Xerox Finite State Tool
HFST: Helsinki Finite-State Transducer Technology
SFST: Stuttgart Finite-State Transducer Tools
OpenFst: Google, Apache license
Components
Lexicon: a right-linear CFG describes the lexical and intermediate levels!
Rules: restrictions on the realization of the surface level
The compiler computes the intersection of all restrictions (note: the intersection of two FSAs is an FSA!)
No problem of rule ordering!
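That the intersection of two finite state automata is again a finite state automaton follows from the product construction, sketched below for deterministic automata. The tuple layout (states, start, finals, transitions) and the two example automata are assumptions made for this illustration; two-level compilers apply the same idea to automata over symbol pairs.

```python
from itertools import product

def intersect(dfa1, dfa2):
    """Product construction: run both DFAs in lockstep on the same input."""
    states1, start1, finals1, delta1 = dfa1
    states2, start2, finals2, delta2 = dfa2
    alphabet = {sym for (_, sym) in delta1} & {sym for (_, sym) in delta2}
    states = set(product(states1, states2))
    delta = {}
    for (p, q) in states:
        for a in alphabet:
            if (p, a) in delta1 and (q, a) in delta2:
                delta[((p, q), a)] = (delta1[(p, a)], delta2[(q, a)])
    # A pair state is final only if both components are final.
    finals = {(p, q) for (p, q) in states if p in finals1 and q in finals2}
    return states, (start1, start2), finals, delta

def accepts(dfa, word):
    """Check whether the DFA accepts the given string."""
    _, state, finals, delta = dfa
    for a in word:
        if (state, a) not in delta:
            return False
        state = delta[(state, a)]
    return state in finals

# A1 accepts strings with an even number of a's; A2 accepts strings ending in b.
A1 = ({0, 1}, 0, {0}, {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1})
A2 = ({0, 1}, 0, {1}, {(0, "a"): 0, (1, "a"): 0, (0, "b"): 1, (1, "b"): 1})
BOTH = intersect(A1, A2)

print(accepts(BOTH, "aab"))  # True: two a's and a final b
print(accepts(BOTH, "ab"))   # False: odd number of a's
```

Since every rule is just one more automaton in the intersection, the order in which the rules are written down does not matter.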
Two Level Morphology I
!Example German Lexicon
Multichar_Symbols
+V +Praes +Sg3 +Sg2
LEXICON Root
Verb;
LEXICON Verb
atm Vschw;
wend Vschw;
fürcht Vschw;
denk Vst;
fahr Vst;
sauf Vst;
Two Level Morphology II

LEXICON Vschw
+V+Praes+Sg3:%^t #;
+V+Praes+Sg2:%^st #;
LEXICON Vst
+V+Praes+Sg3:%^%"t #;
+V+Praes+Sg2:%^%"st #;
END
Two Level Morphology: Example I
! German two-level morphology.
Alphabet
a a:ä b c d e f g h i j k l m n o p q r s t u v x y w z ;
Diacritics
%^ %" ;
Sets
Cons = b c d f g h k l m n p q r s t v x y w z ;
Rules
"e epenthese" ! Insert an e between stem and suffix
%^:e <=> d|t(m) _ Cons ;
Two Level Morphology: Example II
"umlaut-realisierung" ! Realize the umlaut in the stem
a:ä <=> _ (u) Cons* %^: %": ;
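The effect of the two rules can be imitated in Python with regular expressions. This is a sketch under simplifying assumptions: a two-level compiler treats the rules as parallel constraints and intersects them, whereas the code applies them one after the other, which gives the same result for the verbs in the example lexicon. The function name `realize` is hypothetical.

```python
import re

CONS = "bcdfghklmnpqrstvxywz"  # the Cons set from the rule file

def realize(intermediate):
    """Map an intermediate form such as 'fahr^"t' to its surface form."""
    # Umlaut: a becomes ä before (u) Cons* followed by the diacritics ^ and ".
    s = re.sub(rf'a(?=u?[{CONS}]*\^")', "ä", intermediate)
    s = s.replace('"', "")  # the umlaut trigger has no surface realisation
    # e-epenthesis: ^ becomes e after d, t or tm and before a consonant.
    s = re.sub(rf"(?:(?<=[dt])|(?<=tm))\^(?=[{CONS}])", "e", s)
    return s.replace("^", "")  # elsewhere ^ is realised as nothing

for form in ["atm^t", "wend^st", 'fahr^"t', 'sauf^"st', 'denk^"t']:
    print(form, "->", realize(form))
```

The input strings are what the lexicon produces for the third and second person singular: the stem, the boundary diacritic ^, the umlaut trigger " for the Vst class, and the suffix.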
Tokenization
NLTK Tokenizer
The package nltk.tokenize contains a variety of different tokenizers.
If you have no special requirements, it is best to simply use word_tokenize:
    import nltk

    # Assumes the Punkt tokenizer data has been installed with nltk.download().
    sentence = "At eight o'clock on Thursday morning ... Arthur didn't feel very good."
    tokens = nltk.word_tokenize(sentence, language='english')
    print(tokens)
Sentence Splitter
Sentences
Grammatical constructions are confined to sentences.
POS tagging is done sentence by sentence.
Named entities do not extend across sentence boundaries.
End of sentence
A period that does not belong to an abbreviation or a number
Blank lines
Sometimes a sentence nevertheless ends with an abbreviation!
Sometimes it ends with an ordinal number!
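The end-of-sentence heuristic can be written down as a naive splitter. This is a sketch only: the abbreviation list is a tiny hypothetical sample, and the function `split_sentences` is not part of NLTK, whose own sentence splitter relies on a trained Punkt model instead. The example text is German because German writes ordinal numbers with a period.

```python
import re

# Hypothetical, deliberately tiny list of German abbreviations.
ABBREVIATIONS = {"z.B.", "Dr.", "usw.", "Nr."}

def split_sentences(text):
    """Split at tokens ending in . ! ? unless they are abbreviations or numbers."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_mark = token.endswith((".", "!", "?"))
        is_abbreviation = token in ABBREVIATIONS
        is_number = re.fullmatch(r"\d+\.", token) is not None
        if ends_with_mark and not is_abbreviation and not is_number:
            sentences.append(" ".join(current))
            current = []
    if current:  # text that does not end with a sentence mark
        sentences.append(" ".join(current))
    return sentences

text = "Dr. Meier kam am 3. Mai. Er blieb z.B. zwei Tage."
for sentence in split_sentences(text):
    print(sentence)
```

The heuristic fails in exactly the cases named above: a sentence that really ends with an abbreviation or an ordinal number is merged with the following one.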
Tokenization
NLTK Tokenizer
The package nltk.tokenize also contains sentence splitters.
    import nltk

    # Read the whole file as UTF-8 text.
    with open("Syrien.txt", "r", encoding="utf-8") as textfile:
        text = textfile.read()

    # Split into sentences, then tokenize one of them (index 23) as an example.
    sentences = nltk.sent_tokenize(text, language='german')
    tokenized_sent = nltk.tokenize.word_tokenize(sentences[23], language='german')

    print(tokenized_sent)
Stemming
Porter Stemmer
NLTK contains a Porter stemmer.
Alternatively, we can use nltk.LancasterStemmer().
    import nltk

    porter = nltk.PorterStemmer()

    sentence = "We were one of the first digital marketing agencies to capitalize on it."
    tokens = nltk.word_tokenize(sentence)
    # Stem every token of the sentence.
    stems = [porter.stem(t) for t in tokens]
    print(stems)
Lemmatization
Lemmatization with a full-form lexicon

NLTK contains a lemmatizer that looks up words in the WordNet dictionary.
    import nltk

    # Assumes the WordNet data has been installed with nltk.download().
    lemmatizer = nltk.WordNetLemmatizer()

    sentence = "We were one of the first digital marketing agencies to capitalize on it."
    tokens = nltk.word_tokenize(sentence)

    # Look up the lemma of every token.
    lemmata = [lemmatizer.lemmatize(t) for t in tokens]
    print(lemmata)