
Introduction to

Computational Linguistics
The Lexicon
Introduction
• An inventory of words is an essential component
of programs for a wide variety of language
sensitive applications, such as:
– Spellchecking, stylechecking
– IR, IE, message understanding
– parsing, generation, MT
– TTS and STT
• Such an inventory is usually called a dictionary or
lexicon.
Dictionaries
• The purpose of a dictionary is to provide a
wide range of information about words
• Some of this is linguistic information, e.g.
syntactic category, pronunciation,
distribution.
• But dictionaries also contain definitions of
word senses thus providing knowledge
about not just language but about the
world itself.
What is "dog"?
dog (ANIMAL)
noun [C]
a common four-legged animal, especially kept
by people as a pet or to hunt or guard things:
my pet dog
wild dogs
dog food
We could hear dogs barking in the distance.
(from Cambridge Advanced Learner's
Dictionary)
"Dictionary" versus "Lexicon"
• A dictionary is a collection of words
• A lexicon is a collection of lexemes.
• A lexeme roughly corresponds to a set of
words that are different forms of "the same
word".
• For example, English run, runs, ran and
running are forms of the same lexeme.
• A lexeme can also be regarded as a single
word sense of a word.
Senses of Dog
• dog was found in the Cambridge
Advanced Learner's Dictionary at the
entries listed below.
– dog (ANIMAL)
– dog (PERSON)
– dog (FOLLOW)
– dog (PROBLEM)
• These are different senses, or lexemes, for dog.
Two Views of the Lexicon
give rise to different issues
• Lexicon as word database
– How to represent the word collection
– Access: given an arbitrary word, how to access the
relevant entries
– What information to provide and how to express it.
• Lexicon as database about word senses
– What are the relations between word senses?
– How do word senses hook up with conceptual
knowledge?
Representing the Word Collection
• Some possible representations:
– Text file, 1 entry per line
– Finite state automaton.
– Other specialised data structure which allows
for common prefixes, e.g. letter tree
• Full form vs. lexeme + morphological
analysis
FSA for Sublexicon Fragment
[Figure: finite state automaton for a sublexicon fragment; the arc labels that survive are t, h, e, s, o, a and i.]
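A finite state automaton over a small word list can be sketched as a transition table. The following is a minimal illustration only: the word set (the, this, that, these, those), the state numbering and the names `TRANSITIONS`, `FINAL_STATES` and `accepts` are assumptions, not a reconstruction of the figure.

```python
# Hypothetical FSA: states are integers, arcs are labelled with letters.
# The word set is an assumption chosen for illustration.
TRANSITIONS = {
    (0, "t"): 1,
    (1, "h"): 2,
    (2, "e"): 3,  # "the" ends in state 3
    (2, "i"): 4,
    (2, "o"): 5,
    (2, "a"): 6,
    (3, "s"): 7,  # the-s...
    (4, "s"): 8,  # "this"
    (5, "s"): 7,  # tho-s...
    (6, "t"): 8,  # "that"
    (7, "e"): 8,  # "these" and "those" share this final arc
}
FINAL_STATES = {3, 8}


def accepts(word):
    """Return True if the automaton ends in a final state after reading word."""
    state = 0
    for letter in word:
        state = TRANSITIONS.get((state, letter))
        if state is None:  # no arc for this letter: reject
            return False
    return state in FINAL_STATES
```

Words that share a prefix (or, as with "these" and "those", a suffix) share states, which is what makes the representation compact.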
Letter Tree
ltree([ [b, [a, [r, [k, bark]]]],
        [c, [a, [r, [r, [y, carry]]],
                [t, cat,
                    [e, [g, [o, [r, [y, category]]]]]]]],
        [d, [e, [l, [a, [y, delay]]]]],
        [h, [e, [l, [p, help]]],
            [o, [p, hop,
                    [e, hope]]]],
        [q, [u, [a, [r, [r, [y, quarry]]]],
                [i, [z, quiz]],
                [o, [t, [e, quote]]]]]
      ]).
Informal Definition of a Letter Tree
• Tree is a list of branches
• Each branch is a list
– whose first element is a letter
– whose remaining elements are either
• another branch, or
• a lexical entry for a word
– These elements are in a specific order.
Lexical entry (if any) comes first, and
branches are in alphabetical order by their
first letters.
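The definition above translates directly into a lookup procedure. The sketch below mirrors part of the letter tree as nested Python lists; the Python encoding and the names `LTREE` and `lookup` are assumptions made for illustration. Sub-branches are lists, and anything that is not a list is taken to be a lexical entry.

```python
# A fragment of the letter tree, written as nested Python lists.
LTREE = [
    ["b", ["a", ["r", ["k", "bark"]]]],
    ["c", ["a", ["r", ["r", ["y", "carry"]]],
                ["t", "cat",
                      ["e", ["g", ["o", ["r", ["y", "category"]]]]]]]],
    ["h", ["e", ["l", ["p", "help"]]],
          ["o", ["p", "hop",
                      ["e", "hope"]]]],
]


def lookup(tree, word):
    """Follow word letter by letter; return its lexical entry or None."""
    branches = tree
    entry = None
    for letter in word:
        # Find the branch whose first element is the current letter.
        branch = next((b for b in branches if b[0] == letter), None)
        if branch is None:
            return None
        rest = branch[1:]
        # The entry (if any) is the non-list element; sub-branches are lists.
        entry = next((x for x in rest if not isinstance(x, list)), None)
        branches = [x for x in rest if isinstance(x, list)]
    return entry
```

For instance, `lookup(LTREE, "cat")` walks c, a, t and finds the entry stored on the t branch, while `lookup(LTREE, "ca")` returns `None` because no entry is stored at that node.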
Branch representing
cat, category and cook

[c, [a, [t, cat,
            [e, [g, [o, [r, [y, category]]]]]]],
    [o, [o, [k, cook]]]]
Full Form Dictionary
• There is an entry for every possible word.
• No need for morphological processing
• Exceptions are handled automatically
• OK when number of entries is not too
large.
• Repeated information.
• Because languages have different
morphological properties, full form is better
for some languages than for others.
Morphological Analysis + Lexicon
[Diagram: the input word "cats" is passed to Morphological Analysis, which consults the LEXICON (cat: N; s: PL; s: 3SG).]
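The lexeme-plus-analysis arrangement can be sketched as a stem lexicon and a suffix lexicon consulted by a small analyser. This is a minimal illustration: the entry for "kick", the restriction of PL to nouns and 3SG to verbs, and the names `STEMS`, `SUFFIXES` and `analyse` are assumptions.

```python
# Stem lexicon: "cat" is from the slide, "kick" is a hypothetical extra entry.
STEMS = {"cat": "N", "kick": "V"}

# Suffix lexicon: each suffix maps to (feature, category it attaches to).
# The category restrictions are an assumption for illustration.
SUFFIXES = {"s": [("PL", "N"), ("3SG", "V")]}


def analyse(word):
    """Return all (stem, category, feature) analyses licensed by the lexicon."""
    analyses = []
    if word in STEMS:  # bare stem, no suffix
        analyses.append((word, STEMS[word], None))
    for suffix, readings in SUFFIXES.items():
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            for feature, category in readings:
                if STEMS.get(stem) == category:
                    analyses.append((stem, category, feature))
    return analyses
```

Here `analyse("cats")` yields one analysis, the noun stem "cat" with the feature PL, without "cats" itself being listed anywhere.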
Morphological Analysis
• Very roughly, morphological analysis of a
word involves 2 subproblems:
• A segmentation problem: how to get from
the written text to the sequence of
morphemes that make it up.
• A morphotactic problem: how to combine
the individual morphemes together in a
legitimate way.
Segmentation/Morphotactic
Subproblems
• Segmentation problem:
– enlargement => en + large + ment
• Morphotactic problem: given what we
know about en, large and ment, how can
they be legitimately combined
– enlargement => (en + large) + ment
– enlargement =/> en + (large + ment)
– en + ADJ => V
– V + ment => N
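The two subproblems can be sketched separately. In this minimal illustration the function names, the single-root lexicon and the one-prefix, one-suffix inventory are all assumptions; only the two rules (en + ADJ => V, V + ment => N) come from the slide.

```python
ROOTS = {"large": "ADJ"}
PREFIXES = ["en"]
SUFFIXES = ["ment"]


def segment(word):
    """Segmentation: split word into [prefix, root, suffix], or return None."""
    for prefix in PREFIXES:
        for suffix in SUFFIXES:
            if word.startswith(prefix) and word.endswith(suffix):
                root = word[len(prefix):-len(suffix)]
                if root in ROOTS:
                    return [prefix, root, suffix]
    return None


def prefix_en(category):
    """en + ADJ => V"""
    return "V" if category == "ADJ" else None


def suffix_ment(category):
    """V + ment => N"""
    return "N" if category == "V" else None


def left_first(root):
    """Morphotactics for the bracketing (en + root) + ment."""
    return suffix_ment(prefix_en(ROOTS.get(root)))


def right_first(root):
    """Morphotactics for the bracketing en + (root + ment)."""
    return prefix_en(suffix_ment(ROOTS.get(root)))
```

With these rules `left_first("large")` gives `"N"`, while `right_first("large")` gives `None`: ment cannot attach to an adjective, so only the first bracketing is legitimate.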
2-Level Morphology
• In 1981 the four Ks (Kimmo Koskenniemi, Lauri
Karttunen, Ronald M. Kaplan and Martin Kay)
were working on morphological analysis (MA).
• The basic idea was that MA computes a
relation between sets of strings at two levels:
– Surface Level (strings of written word forms, made from
the surface alphabet)
– Lexical Level (strings of morphemes, made from the
lexical alphabet).
• The relation can be computed using finite state
transducers.
• The finite-state model is reversible: the same
relation serves for both analysis and generation.
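Reversibility can be illustrated with a toy transducer whose arcs pair a lexical symbol with a surface symbol. This is a plain finite state transducer sketch, not the two-level rule formalism itself; the tags `+N`, `+SG` and `+PL`, the state numbering and the name `transduce` are assumptions.

```python
# Each arc is (lexical symbol, surface symbol, target state).
# An empty string means nothing is read or written on that side.
ARCS = {
    0: [("c", "c", 1)],
    1: [("a", "a", 2)],
    2: [("t", "t", 3)],
    3: [("+N", "", 4)],
    4: [("+SG", "", 5), ("+PL", "s", 5)],
}
FINAL = {5}


def transduce(text, direction="generate", state=0):
    """Yield every string the transducer pairs with text.

    direction="generate" maps lexical to surface strings;
    direction="analyse" runs the same arcs the other way round.
    """
    if not text and state in FINAL:
        yield ""
    for lexical, surface, target in ARCS.get(state, []):
        if direction == "generate":
            source, output = lexical, surface
        else:
            source, output = surface, lexical
        if text.startswith(source):
            for rest in transduce(text[len(source):], direction, target):
                yield output + rest
```

The same `ARCS` table answers both questions: `list(transduce("cat+N+PL"))` produces the surface form, and `list(transduce("cats", "analyse"))` recovers the lexical string.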
What Information to Provide
• Specific Information – e.g. "kicks"
• Syntactic Information
– POS = verb
– Tense = pres
– Number = singular
– Person = 3
– Type = Transitive
• Semantic Information
– event-type = Physical Action
– type-of subject = animate
– type-of object = physical
What Information to Provide
• General Information
• Class Attributes
– Agreement has (Number, Gender)
• Enumeration of possible values
– Gender = [masc, fem]
– Number = [sing, plur]
• Class Relationships
– Transitive isa Verb
– Common isa Noun
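Specific and general information can be kept in separate structures, with the general part used to check and interpret individual entries. The sketch below is one possible encoding under stated assumptions: the dictionary layout and the names `ISA`, `VALUES`, `KICKS`, `is_a` and `valid` are hypothetical, and the entry uses the abbreviated value "sing" from the enumeration.

```python
# General information: class relationships and permitted feature values.
ISA = {"Transitive": "Verb", "Common": "Noun"}
VALUES = {"Gender": ["masc", "fem"], "Number": ["sing", "plur"]}

# Specific information for one word form.
KICKS = {
    "form": "kicks",
    "Type": "Transitive",
    "Tense": "pres",
    "Number": "sing",
    "Person": 3,
}


def is_a(cls, ancestor):
    """Follow isa links upwards from cls; True if ancestor is reached."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = ISA.get(cls)
    return False


def valid(feature, value):
    """True if value is permitted; features with no enumeration are unchecked."""
    return feature not in VALUES or value in VALUES[feature]
```

Because Transitive isa Verb, the entry does not need to state POS = verb separately: `is_a(KICKS["Type"], "Verb")` derives it from the class relationships.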
Two Views of the Lexicon
give rise to different issues
• Lexicon as word database
– How to represent the word collection
– Access: given an arbitrary word, how to access the
relevant entries
– What information to provide and how to express it.
• Lexicon as database about word senses
– What are the relations between word senses?
– How do word senses hook up with conceptual
knowledge?
WordNet
• In 1985 a group of psychologists and linguists at
Princeton had the idea of searching dictionaries
conceptually rather than alphabetically.
• Attempt to organise a dictionary in terms of word
meanings rather than word forms.
• What is the nature and organisation of the
lexicalised concepts that words can express?
• Distinction between word forms, word meanings,
and entries.
Lexical Matrix
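Word        Word Forms
Meanings    F1      F2      ..      Fn

M1          E1,1    E1,2
M2          E2,1
..
Mm                                  Em,n

The cells E are entries. Synonymy: two entries in the same row (E1,1 and E1,2). Polysemy: two entries in the same column (E1,1 and E2,1).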
WordNet
• A key aspect of WordNet is that a given
meaning or word sense is represented as
the set of words that can be used to
express it.
• These meanings are called synsets – sets
of words with synonymous readings.
• Synsets are established empirically
according to a principle of substitutability
that is relativised to context.
The Principle of Substitutability
• Two expressions are synonymous if the
substitution of one for another never alters
the truth value of a sentence in which the
substitution is made.
• Two expressions are synonymous in
linguistic context C if the substitution of
one for the other in C does not alter the
truth value.
• e.g. plank/board in carpentry contexts
Lexical Matrix
Word Meanings         Word Forms
                      board   committee   plank   ..   Fn

{board, committee}    E1,1    E1,2
{board, plank}        E2,1                E2,3
..
Mm                                                     Em,n

The cells E are entries.
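The lexical matrix can be sketched as a mapping from meanings to the sets of word forms that express them. The meaning labels `M1` and `M2` and the function names are assumptions for illustration; the word forms are those of the board example.

```python
# Rows of the matrix: each meaning maps to the word forms that express it.
MATRIX = {
    "M1": {"board", "committee"},
    "M2": {"board", "plank"},
}


def synonyms(meaning):
    """Read along a row: the forms that share one meaning (a synset)."""
    return MATRIX[meaning]


def meanings(form):
    """Read down a column: the meanings one form can express."""
    return sorted(m for m, forms in MATRIX.items() if form in forms)


def is_polysemous(form):
    """A form is polysemous if it appears in more than one row."""
    return len(meanings(form)) > 1
```

Reading a row gives synonymy (`synonyms("M1")` contains both "board" and "committee"), and reading a column gives polysemy (`meanings("board")` returns both meanings).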
WordNet
• In WordNet, the synonymy relation
between words is fundamental.
• Synsets can be thought of as representing
concepts which stand in various semantic
relations to each other.
– X Antonym Y: meaning (synset) X is opposite
to meaning (synset) Y (e.g. big, small)
– X Hyponym Y: X is a kind of Y, like isa (e.g. dog, mammal)
– X Meronym Y: X is a part of Y (e.g. leg, man)
Lexicon as a Concept Graph
• We can thus imagine the WordNet Lexicon
as a gigantic graph whose nodes are
synsets and whose arcs are semantic
relations between synsets.
• Such a structure can be regarded as a
semantic map of the concepts used in a
given language.
• Many applications can be created using
the WordNet graph as a resource
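One simple use of such a graph is to measure how far apart two concepts are. The sketch below builds a toy graph by hand rather than loading WordNet; single words stand in for synsets, the arcs marked as assumed are not from the slides, and the names `ARCS`, `GRAPH` and `distance` are hypothetical.

```python
from collections import deque

# (source, relation, target) arcs between concepts.
ARCS = [
    ("dog", "hyponym", "mammal"),
    ("leg", "meronym", "man"),
    ("cat", "hyponym", "mammal"),     # assumed for illustration
    ("man", "hyponym", "mammal"),     # assumed for illustration
    ("mammal", "hyponym", "animal"),  # assumed for illustration
]

# Undirected adjacency: each arc can be followed in either direction.
GRAPH = {}
for source, _relation, target in ARCS:
    GRAPH.setdefault(source, set()).add(target)
    GRAPH.setdefault(target, set()).add(source)


def distance(start, goal):
    """Breadth-first search: number of arcs on a shortest path, or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, steps = queue.popleft()
        if node == goal:
            return steps
        for neighbour in GRAPH.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, steps + 1))
    return None
```

In this toy graph `distance("dog", "cat")` is 2, via the shared node "mammal", and a concept that is not in the graph at all is reported as unreachable.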
Using WordNet to Measure Semantic Orientations of Adjectives
Jaap Kamps, Maarten Marx, Robert J. Mokken, Maarten de Rijke
Conclusion
• Lexicon is a central building block of language-
sensitive systems
• Dual status of lexical information:
linguistic versus world knowledge.
• As a word list, the lexicon has to solve the problems of
representation and access. Morphological
analysis can help to keep the number of entries at a
manageable level.
• As a collection of definitions, the lexicon has to deal
with relationships between word meanings.
