
Morphology Morphemes Language types Analysis

Natural Language Processing


Morphology

Christian Wartena

Hochschule Hannover, Abteilung Information und Kommunikation

April 19, 2017


Morphology

Morpheme
Phonemes distinguish meanings, but carry no meaning themselves
Morphemes: smallest units that bear meaning
In many languages morphemes can be combined into words.
Morphology deals with the rules for building words from morphemes.


Words

What is a word?
Usually clear.
Many borderline cases
In some exotic languages
In German as well: some changes in the spelling reform of 1996.
E.g. Gewinn bringend vs. gewinnbringend (profitable)
Compounds:
In English usually written as two words (but e.g. football)
In German always written as one word
German compounds contain additional linking elements that make separation difficult:
Betrachtung + s + weise (way of looking at things)
Practical problems: 5-fold, spin-off, don't, etc.
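
The difficulty with linking elements such as the s in Betrachtung + s + weise can be sketched in a few lines of Python. The toy lexicon, the list of linking elements, and the function name split_compound are assumptions made for this illustration; a real splitter would need a large lexicon and some way to rank competing splits.

```python
# Toy lexicon and linking elements; both are illustrative assumptions.
LEXICON = {"betrachtung", "weise", "gewinn", "bringend"}
LINKING_ELEMENTS = ["", "s", "es", "n", "en"]


def split_compound(word):
    """Return all (head, linker, tail) splits licensed by the toy lexicon."""
    word = word.lower()
    splits = []
    for i in range(1, len(word)):
        head = word[:i]
        if head not in LEXICON:
            continue
        rest = word[i:]
        for linker in LINKING_ELEMENTS:
            tail = rest[len(linker):]
            # The remainder must start with the linker and end in a known word.
            if rest.startswith(linker) and tail in LEXICON:
                splits.append((head, linker, tail))
    return splits


print(split_compound("Betrachtungsweise"))
```

Without the linking element "s" in the list, the word could not be split at all, because "sweise" is not a lexicon entry.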


Types of Morphology

Inflection
Systematic building of grammatical variants.
No change of meaning / clearly defined semantics (e.g. plural)
Productive
Naming convention:
Conjugation (used for verbs)
Declension (used for all other words)

Derivation
Derivation of new words
Often change of part of speech
Meaning often unpredictable
Not always productive
Often lexicalized

Morpheme types

Base morpheme: Meaningful kernel of a word.

Free morpheme: Morpheme that can serve as a word.
Bound morpheme: Morpheme that has to be bound to another word.
Lexical morpheme: Morpheme with its own, independent meaning.
Own meaning, or
Changes the meaning of another word without changing its grammatical properties
Grammatical morpheme: Morpheme with a grammatical function and no autonomous meaning
Grammatical function
Changes the part of speech of another word
Adds a feature (like plural, past, etc.)

Morpheme types

Free/Bound and Lexical/Grammatical


        lexical                             grammatical
free    Words that don't need inflection    preposition, conjunction, pronoun, article
bound   Words that need inflection;         Affixes
        some affixes, confixes

Examples are, of course, language dependent!


Affix types

Position
Prefix
Suffix
Infix
Circumfix

Function
Derivation affix
Inflection affix


Productivity

Productive
New words can be formed
verb + -bar: besprechbar, denkbar, programmierbar, fotografierbar, zusammenheftbar
adjective + -ly: usually, frequently
verb + -able: imaginable, programmable

Non-productive
Maybe productive in the past.
Process clear, but cannot be applied to new words.
Ziegelei, Tischlerei, Molkerei, Metzgerei, Kantorei, Bücherei, *Computerei, *Elektrikerei, *Plakaterei
Prüfling, Lehrling, Säugling, Flüchtling, *Telefonierling, *Helfling, *Entwerfling
government, settlement, placement, development, management, employment, treatment, *teachment, *drivement, *flyment

Language Typology

Traditional main criteria for classification of languages


Basic word order (Subject-Verb-Object,
Subject-Object-Verb, Noun-Adjective vs. Adjective-Noun
etc.)
Morphology type

Morphology types
Fusional (or inflecting)
Isolating (or analytical)
Agglutinating


Language Typology

Isolating languages
Each word consists of exactly one morpheme!
No inflection, no derivation!
Many languages from eastern Asia (Mandarin,
Vietnamese)

(1)
Khi  tôi  đến   nhà    bạn     tôi,  chúng   tôi  bắt đầu  làm  bài
As   I    come  house  friend  I,    PLURAL  I    start    do   teaching
'As I came to the house of my friend, we started the lesson.'

(Example from Comrie 1989)


Language Typology

Agglutinating languages
Each affix has one function
Affixes can be concatenated
Examples: Turkish (and all its Central Asian relatives), Finnish, Hungarian, Japanese, Korean, Georgian, Mongolic, Yupik, ...


Language Typology

Agglutinating languages: Turkish case system

Table: Declension of the noun adam ('man') in Turkish

             Singular   Plural
Nominative   adam       adam-lar
Accusative   adam-ı     adam-lar-ı
Genitive     adam-ın    adam-lar-ın
Dative       adam-a     adam-lar-a
Locative     adam-da    adam-lar-da
Ablative     adam-dan   adam-lar-dan
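
The table can be generated by plain concatenation, which is the defining property of agglutination: one suffix for number, one for case. The following is a minimal sketch under the assumption that only the back-vowel stem adam is declined; vowel harmony, which selects other suffix variants for other stems, is deliberately left out. The names PLURAL, CASE_SUFFIX and decline are hypothetical.

```python
# Suffix variants that fit the back-vowel stem "adam".
# Vowel harmony for other stems is not modelled in this sketch.
PLURAL = "lar"
CASE_SUFFIX = {
    "nominative": "",
    "accusative": "ı",
    "genitive": "ın",
    "dative": "a",
    "locative": "da",
    "ablative": "dan",
}


def decline(stem, case, plural=False):
    """Concatenate stem, optional plural suffix, and optional case suffix."""
    morphemes = [stem]
    if plural:
        morphemes.append(PLURAL)
    if CASE_SUFFIX[case]:
        morphemes.append(CASE_SUFFIX[case])
    return "-".join(morphemes)


print(decline("adam", "ablative", plural=True))
```

Each suffix carries exactly one feature, so six case suffixes and one plural suffix are enough to produce all twelve forms.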


Language Typology

Fusional languages
One affix has several functions
Most Indo-European Languages, Semitic Languages, most
African languages

Table: Declension of the noun стол (stol, 'table') in Russian

                Singular   Plural
Nominative      stol-∅     stol-y
Accusative      stol-∅     stol-y
Genitive        stol-a     stol-ov
Dative          stol-u     stol-am
Instrumental    stol-om    stol-ami
Prepositional   stol-e     stol-ax
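
In contrast to the agglutinating case, a fusional paradigm cannot be built by concatenating a number suffix and a case suffix. A minimal sketch, assuming the standard transliterated endings of this declension class: each ending has to be looked up by the combination of case and number. The names ENDINGS and decline are hypothetical.

```python
# One unanalysable ending per (case, number) pair, in transliteration.
ENDINGS = {
    ("nominative", "singular"): "",
    ("accusative", "singular"): "",
    ("genitive", "singular"): "a",
    ("dative", "singular"): "u",
    ("instrumental", "singular"): "om",
    ("prepositional", "singular"): "e",
    ("nominative", "plural"): "y",
    ("accusative", "plural"): "y",
    ("genitive", "plural"): "ov",
    ("dative", "plural"): "am",
    ("instrumental", "plural"): "ami",
    ("prepositional", "plural"): "ax",
}


def decline(stem, case, number):
    """Attach the single ending that encodes case and number together."""
    ending = ENDINGS[(case, number)]
    return stem + "-" + (ending if ending else "∅")


print(decline("stol", "genitive", "plural"))
```

Twelve cells need twelve table entries here, whereas the agglutinating paradigm needs only seven suffixes.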

Language Typology

Mixed languages
Most languages are somewhere in between!
Japanese has many fusional (irregular) suffixes.
Fusional languages can be quite regular:

Table: Conjugation of the imperfect, aorist, and aorist passive of the verb λύειν (luein, 'to loosen') in Ancient Greek

                Imperfect     Aorist        Aorist passive
1st pers. sg.   e-lu-o-n      e-lu-sa       e-lu-thē-n
2nd pers. sg.   e-lu-e-s      e-lu-sa-s     e-lu-thē-s
3rd pers. sg.   e-lu-e        e-lu-se       e-lu-thē
1st pers. pl.   e-lu-o-men    e-lu-sa-men   e-lu-thē-men
2nd pers. pl.   e-lu-e-te     e-lu-sa-te    e-lu-thē-te
3rd pers. pl.   e-lu-o-n      e-lu-sa-n     e-lu-thē-san

Other phenomena

Polysynthesis
Siberian Yupik
Angya-ghlla-ng-yug-tuq
boat-Augmentative-Acquire-Desiderative-3Sing
He wants to buy a big boat

Incorporation
Chukchi
Te-meyne-levte-pegt-erken
1SingSubj-large-head-pain-ImperfAspect
I had a large headache

Morphology Morphemes Language types Analysis Approaches Two Level Morphology Morphology in NLTK

Morphological Analysis

Traditional Analysis
Morphology is binary and hierarchical
First we remove inflectional morphemes
Subsequently derivational morphemes

Goals for automatic analysis


Get the base form that can be looked up in a dictionary
Get grammatical features

Challenges
Many very small classes of words
Tens of thousands of words in each language
Phonological processes at morpheme boundaries

Approaches

Practical systems
Lexicon based (full-form lexicon).
Popular for English
Have all forms of each word in the dictionary
Heuristics / Lexicon free approaches.
Rule based.
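
The lexicon-based approach can be sketched as a plain dictionary lookup. The entries, the feature names, and the function name analyze are made up for this illustration; a real full-form lexicon lists every inflected form of every word.

```python
# Hypothetical full-form lexicon: every surface form is listed explicitly
# together with its lemma and grammatical features.
FULL_FORM_LEXICON = {
    "go": ("go", {"pos": "V", "form": "base"}),
    "goes": ("go", {"pos": "V", "person": 3, "number": "sg"}),
    "went": ("go", {"pos": "V", "tense": "past"}),
    "mouse": ("mouse", {"pos": "N", "number": "sg"}),
    "mice": ("mouse", {"pos": "N", "number": "pl"}),
}


def analyze(word):
    """Return (lemma, features) for a known form, or None for unknown forms."""
    return FULL_FORM_LEXICON.get(word.lower())


print(analyze("went"))
print(analyze("googled"))
```

Irregular forms such as went and mice cost nothing extra, but any form that is missing from the lexicon cannot be analysed at all.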


Some Terminology

Stem
Stem is what remains when all inflectional affixes are removed
A stemmer should find the stem of a word
Lemma
Canonical form of a word.
Form that is found in a dictionary
Algorithm that finds the lemma for a word form is called a
Lemmatizer
Root
Root is what remains when all affixes are removed
Completely different words have the same root!
Many stemmers also find roots!

Common errors

Overstemming
Too many letters are removed

Understemming
Not all affixes are identified and removed
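
Both error types already show up with a very naive suffix stripper. The suffix list and the function name naive_stem are assumptions made for this sketch; it is not how any particular stemmer works.

```python
# Deliberately naive: strip the first matching suffix, nothing else.
SUFFIXES = ("ing", "ed", "ly")


def naive_stem(word):
    """Remove one suffix from the list if the word ends in it."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


print(naive_stem("walking"))    # plausible result
print(naive_stem("thing"))      # overstemming: "ing" is not a suffix here
print(naive_stem("hopefully"))  # understemming: "-ful" is left in place
```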


Lexicon free morphology

Porter Stemmer
Very popular stemmer by Martin Porter
List of affixes and simple rules (e.g. -ational → -ate: relational → relate)
Easy to implement, small, fast
Many errors: e.g. -ing is not always a gerund ending (thing, wing); policy → police, doing → doe, organization → organ.
Not based on a clear distinction between derivational and inflectional morphology.
Nevertheless useful, e.g. for IR: no problems as long as a unique stem is found for each word
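
A rule such as -ational → -ate can be sketched as a suffix rewrite. The three rules below are of the kind Porter uses; the minimum stem length is a crude stand-in, assumed here, for the condition the real algorithm places on the remaining stem. The function name apply_rules is hypothetical.

```python
# Suffix rewrite rules, longest suffix first.
RULES = [("ational", "ate"), ("tional", "tion"), ("ization", "ize")]


def apply_rules(word):
    """Apply the first rule whose suffix matches; at most one rule fires."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Crude stand-in for the stem condition: keep very short stems intact.
            if len(stem) > 1:
                return stem + replacement
            return word
    return word


print(apply_rules("relational"))
print(apply_rules("rational"))
```

The stem condition is what keeps rational from being rewritten to rate.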


Two Level Morphology

Most advanced rule based approach


Koskenniemi 1983. Technical Report, University of Helsinki
Two representations, with an optional intermediate level between them:
Lexical level: Stem + features
Intermediate level: Stem + abstract affixes (strictly speaking not necessary)
Surface level: Word
A finite state transducer (FST) translates between the levels
The FST is compiled from rules


Two Level Morphology

Lexical        f o x +N +pl

Intermediate   f o x ^ s #

Surface        f o x e s

Figure: Schematic representation of two-level morphology
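
The three levels of the fox example can be traced with two small functions. This is only a sketch: a real system compiles rules into a finite state transducer, whereas plain string functions are used here. The morpheme boundary symbol ^, the word boundary symbol #, and both function names are assumptions made for this illustration.

```python
import re


def lexical_to_intermediate(lexical):
    """Map 'fox+N+pl' to 'fox^s#': feature tags become abstract affixes."""
    stem, *tags = lexical.split("+")
    suffix = "^s" if "pl" in tags else ""
    return stem + suffix + "#"


def intermediate_to_surface(intermediate):
    """Insert e after x, s or z before the plural s, then drop the markers."""
    form = re.sub(r"([xsz])\^(?=s#)", r"\1e", intermediate)
    return form.replace("^", "").replace("#", "")


intermediate = lexical_to_intermediate("fox+N+pl")
print(intermediate)
print(intermediate_to_surface(intermediate))
```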


Two Level Morphology

Compilers
XFST: Xerox Finite State Tool
HFST: Helsinki Finite-State Transducer Technology
SFST: Stuttgart Finite-State Transducer Tools
OpenFST: Google/Apache

Components
Lexicon: A right-linear CFG describes the lexical and intermediate levels!
Rules: Restrictions on the realization of the surface level
The compiler computes the intersection of all restrictions (Note: the intersection of two FSAs is an FSA!)
No problem of rule ordering!
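
The note on intersection can be made concrete with the product construction for deterministic finite state acceptors. This is a sketch for plain acceptors only, under the assumption that each automaton is given as a dictionary; the dictionary layout, the two example automata, and the function names are hypothetical and not tied to any of the compilers listed above.

```python
from itertools import product


def intersect(dfa_a, dfa_b):
    """Product construction: run both automata in lockstep on paired states."""
    alphabet = dfa_a["alphabet"] & dfa_b["alphabet"]
    states = set(product(dfa_a["states"], dfa_b["states"]))
    delta = {}
    for (p, q), symbol in product(states, alphabet):
        next_p = dfa_a["delta"].get((p, symbol))
        next_q = dfa_b["delta"].get((q, symbol))
        if next_p is not None and next_q is not None:
            delta[((p, q), symbol)] = (next_p, next_q)
    return {
        "states": states,
        "alphabet": alphabet,
        "delta": delta,
        "start": (dfa_a["start"], dfa_b["start"]),
        "final": set(product(dfa_a["final"], dfa_b["final"])),
    }


def accepts(dfa, word):
    """Follow the transitions; reject on a missing transition."""
    state = dfa["start"]
    for symbol in word:
        state = dfa["delta"].get((state, symbol))
        if state is None:
            return False
    return state in dfa["final"]


# Example restriction 1: the word contains an even number of a's.
EVEN_A = {
    "states": {0, 1}, "alphabet": {"a", "b"}, "start": 0, "final": {0},
    "delta": {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1},
}
# Example restriction 2: the word ends in b.
ENDS_B = {
    "states": {"n", "y"}, "alphabet": {"a", "b"}, "start": "n", "final": {"y"},
    "delta": {("n", "a"): "n", ("n", "b"): "y", ("y", "a"): "n", ("y", "b"): "y"},
}

BOTH = intersect(EVEN_A, ENDS_B)
print(accepts(BOTH, "aab"))
```

Because the result is again a single automaton, the order in which the restrictions are combined does not matter.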


Two Level Morphology I

!Example German Lexicon

Multichar_Symbols
+V +Praes +Sg3 +Sg2

LEXICON Root
Verb;

LEXICON Verb
atm Vschw;
wend Vschw;
fürcht Vschw;
denk Vst;
fahr Vst;
sauf Vst;


Two Level Morphology II


LEXICON Vschw
+V+Praes+Sg3:%^t #;
+V+Praes+Sg2:%^st #;

LEXICON Vst
+V+Praes+Sg3:%^%"t #;
+V+Praes+Sg2:%^%"st #;

END


Two Level Morphology: Example I

! German Twolevel Morphology.

Alphabet
a a:ä b c d e f g h i j k l m n o p q r s t u v x y w z

Diacritics
%^ %";

Sets

Cons = b c d f g h k l m n p q r s t v x y w z;

Rules

"e epenthese" ! Insert an e between stem and suffix

%^:e <=> d|t(m) _ Cons ;
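
The effect of the epenthesis rule can be imitated with a regular expression. This is a sketch, not the two-level compiler: it reads the left context as "d, or t optionally followed by m", which is an interpretation of the rule as printed, and it handles only the boundary symbol ^. The function name realize_boundary and the stem sag are assumptions.

```python
import re

# Same consonant set as in the Sets section of the rule file.
CONS = "bcdfghklmnpqrstvxywz"


def realize_boundary(intermediate):
    """Realize ^ as e after d, t or tm before a consonant, else as nothing."""
    form = re.sub(r"(d|tm?)\^(?=[%s])" % CONS, r"\1e", intermediate)
    return form.replace("^", "")


for form in ["atm^t", "wend^st", "sag^t"]:
    print(form, "->", realize_boundary(form))
```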


Two Level Morphology: Example II

"umlaut-realisierung" ! Realize the umlaut in the stem

a:ä <=> _ (u) Cons* %^: %": ;
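
The umlaut rule can be imitated in the same way. The sketch assumes that the rule realizes lexical a as surface ä, and that the strong-verb endings mark the stem with the boundary symbol ^ followed by the umlaut diacritic ". Only this one rule is modelled; the function name realize_umlaut is hypothetical.

```python
import re

# Same consonant set as in the Sets section of the rule file.
CONS = "bcdfghklmnpqrstvxywz"


def realize_umlaut(intermediate):
    """Turn a into ä when an optional u, consonants, and ^" follow."""
    form = re.sub(r'a(?=u?[%s]*\^")' % CONS, "ä", intermediate)
    # Both diacritics are realized as nothing on the surface.
    return form.replace("^", "").replace('"', "")


for form in ['fahr^"t', 'sauf^"st', 'denk^"t']:
    print(form, "->", realize_umlaut(form))
```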


Tokenization

NLTK Tokenizer
The nltk.tokenize package contains a variety of different tokenizers.
If you have no special requirements, it is best to simply use word_tokenize:

import nltk

# word_tokenize needs NLTK's Punkt tokenizer data (installable via nltk.download)
sentence = "At eight o'clock on Thursday morning ... Arthur didn't feel very good."
tokens = nltk.word_tokenize(sentence, language='english')
print(tokens)


Sentence Splitter

Sentences
Grammatical constructions are confined to sentences.
POS tagging is done sentence by sentence.
Named entities do not extend across sentence boundaries.

End of sentence
A period that does not belong to an abbreviation or a number
Blank lines
Sometimes the period of an abbreviation!
Sometimes the period of an ordinal number!
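
These heuristics can be written down as a naive splitter. The abbreviation list is a tiny, made-up sample, and the function name split_sentences is hypothetical. The sketch also shows the limits named above: a sentence that really ends in an abbreviation or an ordinal number is not split.

```python
import re

# Illustrative sample only; a usable list would be far longer.
ABBREVIATIONS = {"z.B.", "Dr.", "usw.", "Nr."}


def split_sentences(text):
    """Split at blank lines and at periods that are not part of an
    abbreviation or a number."""
    sentences = []
    for paragraph in re.split(r"\n\s*\n", text):  # blank lines end a sentence
        current = []
        for token in paragraph.split():
            current.append(token)
            if not token.endswith("."):
                continue
            is_abbreviation = token in ABBREVIATIONS
            is_number = re.fullmatch(r"\d+\.", token) is not None
            if not (is_abbreviation or is_number):
                sentences.append(" ".join(current))
                current = []
        if current:
            sentences.append(" ".join(current))
    return sentences


print(split_sentences("Titel\n\nDr. Meier kam am 3. Mai. Er blieb lange."))
```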


Tokenization

NLTK Tokenizer
The nltk.tokenize package also contains sentence splitters

import nltk
import codecs

# Read the whole file as UTF-8 text
textfile = codecs.open("Syrien.txt", "r", "utf8")
text = textfile.read()
textfile.close()

# Split the text into sentences, then tokenize the sentence at index 23
sentences = nltk.sent_tokenize(text, language='german')
tokenized_sent = nltk.tokenize.word_tokenize(sentences[23], language='german')

print(tokenized_sent)


Stemming

Porter Stemmer
NLTK includes a Porter stemmer
Alternatively, we can use:
nltk.LancasterStemmer()

import nltk
porter = nltk.PorterStemmer()

sentence = "We were one of the first digital marketing agencies to capitalize on it."
tokens = nltk.word_tokenize(sentence)
# Stem every token separately
stems = [porter.stem(t) for t in tokens]
print(stems)


Lemmatization

Lemmatization with a full-form lexicon

NLTK includes a lemmatizer that looks up words in the WordNet dictionary

import nltk
# The lemmatizer needs the WordNet corpus: nltk.download('wordnet')
lemmatizer = nltk.WordNetLemmatizer()

sentence = "We were one of the first digital marketing agencies to capitalize on it."
tokens = nltk.word_tokenize(sentence)

# Without a part-of-speech argument, lemmatize() treats every token as a noun
lemmata = [lemmatizer.lemmatize(t) for t in tokens]
print(lemmata)
