LEXICON

Introduction to
Computational Linguistics
The Lexicon
Introduction
• An inventory of words is an essential component
of programs for a wide variety of language-sensitive
applications, such as:
– Spellchecking, stylechecking
– information retrieval (IR), information extraction (IE), message understanding
– parsing, generation, machine translation (MT)
– text-to-speech (TTS) and speech-to-text (STT)
• Such an inventory is usually called a dictionary or
lexicon.
Dictionaries
• The purpose of a dictionary is to provide a
wide range of information about words
• Some of this is linguistic information, e.g.
syntactic category, pronunciation,
distribution.
• But dictionaries also contain definitions of
word senses, thus providing knowledge
not just about language but about the
world itself.
What is "dog"?
dog (ANIMAL)
noun [C]
a common four-legged animal, especially kept
by people as a pet or to hunt or guard things:
my pet dog
wild dogs
dog food
We could hear dogs barking in the distance.
(from Cambridge Advanced Learner's
Dictionary)
"Dictionary" versus "Lexicon"
• A dictionary is a collection of words
• A lexicon is a collection of lexemes.
• A lexeme roughly corresponds to a set of
words that are different forms of "the same
word".
• For example, English run, runs, ran and
running are forms of the same lexeme.
• A lexeme can also be regarded as a single
word sense of a word.
Senses of Dog
• dog was found in the Cambridge
Advanced Learner's Dictionary at the
entries listed below, which are different
senses, or lexemes, for dog.
– dog (ANIMAL)
– dog (PERSON)
– dog (FOLLOW)
– dog (PROBLEM)
Two Views of the Lexicon
give rise to different issues
• Lexicon as word database
– How to represent the word collection
– Access: given an arbitrary word, how to access the
relevant entries
– What information to provide and how to express it.
• Lexicon as database about word senses
– What are the relations between word senses?
– How do word senses hook up with conceptual
knowledge?
Representing the Word Collection
• Some possible representations:
– Text file, 1 entry per line
– Finite state automaton.
– Other specialised data structure which allows
for common prefixes, e.g. letter tree
• Full form vs. lexeme + morphological
analysis
FSA for Sublexicon Fragment
[Figure: finite state automaton for a sublexicon fragment; the surviving arc labels are t, h, e, s, o, a, i.]
Letter Tree
ltree([ [b, [a, [r, [k, bark]]]],
        [c, [a, [r, [r, [y, carry]]],
                [t, cat,
                    [e, [g, [o, [r, [y, category]]]]]]]],
        [d, [e, [l, [a, [y, delay]]]]],
        [h, [e, [l, [p, help]]],
            [o, [p, hop,
                    [e, hope]]]],
        [q, [u, [a, [r, [r, [y, quarry]]]],
                [i, [z, quiz]],
                [o, [t, [e, quote]]]]]
      ]).
Informal Definition of a Letter Tree
• Tree is a list of branches
• Each branch is a list
– whose first element is a letter
– whose remaining elements are either
• another branch, or
• a lexical entry for a word
– These elements are in a specific order.
Lexical entry (if any) comes first, and
branches are in alphabetical order by their
first letters.
Branch representing
cat, category and cook
[c,[a,[t,cat,
[e,[g,[o,[r,[y,category]]]]]]],
[o,[o,[k,cook]]]]
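The informal definition of a letter tree can be tried out with a small lookup routine. The following is only a sketch in Python, assuming the same shape as the ltree term (a branch is a list whose first element is a letter, followed by an optional entry and then sub-branches). The names LTREE and lookup are illustrative, and the tree holds just a fragment of the words used on these slides.

```python
# A letter tree in the same shape as the ltree term: each branch is
# [letter, optional_entry, sub_branch, ...]. Entries are strings,
# branches are lists. Only a fragment of the slide vocabulary is included.
LTREE = [
    ["b", ["a", ["r", ["k", "bark"]]]],
    ["c", ["a", ["r", ["r", ["y", "carry"]]],
                ["t", "cat",
                      ["e", ["g", ["o", ["r", ["y", "category"]]]]]]],
          ["o", ["o", ["k", "cook"]]]],
    ["h", ["o", ["p", "hop", ["e", "hope"]]]],
]


def lookup(tree, word):
    """Return the lexical entry stored for word, or None if there is none."""
    branches = tree
    entry = None
    for letter in word:
        # Find the branch whose first element is the current letter.
        branch = next((b for b in branches if b[0] == letter), None)
        if branch is None:
            return None
        rest = branch[1:]
        # A non-list element is the entry for the prefix read so far.
        entry = next((e for e in rest if not isinstance(e, list)), None)
        # The list elements are the sub-branches to search next.
        branches = [e for e in rest if isinstance(e, list)]
    return entry


print(lookup(LTREE, "cat"))       # entry stored at c-a-t
print(lookup(LTREE, "category"))  # shares the prefix c-a-t with cat
print(lookup(LTREE, "ca"))        # a prefix without an entry
```

Because cat and category share the path c-a-t, the common prefix is stored once, which is the point of this representation.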
Full Form Dictionary
• There is an entry for every possible word.
• No need for morphological processing
• Exceptions are handled automatically
• OK when number of entries is not too
large.
• Drawback: information shared by related forms is repeated in every entry.
• Because languages have different
morphological properties, full form is better
for some languages than for others.
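A full form dictionary can be pictured as a plain table keyed by word form. The sketch below is in Python and uses the forms of run mentioned earlier; the feature names (lexeme, pos, form) and their values are assumptions made for illustration only.

```python
# Full form dictionary: one entry per word form.
# Feature names and values are illustrative assumptions.
FULL_FORM = {
    "run":     {"lexeme": "run", "pos": "V", "form": "base"},
    "runs":    {"lexeme": "run", "pos": "V", "form": "3sg"},
    "ran":     {"lexeme": "run", "pos": "V", "form": "past"},  # irregular form, listed like any other
    "running": {"lexeme": "run", "pos": "V", "form": "ing"},
}


def lookup_form(word):
    """Direct access by word form; no morphological processing is involved."""
    return FULL_FORM.get(word)


print(lookup_form("ran"))
print(lookup_form("runned"))  # not listed, so nothing is found
```

The irregular form ran needs no special treatment, but the lexeme and part of speech are repeated in all four entries.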
Morphological Analysis + Lexicon
[Figure: the input word cats is passed to morphological analysis, which consults a lexicon containing the entries cat N, s PL and s 3SG.]
Morphological Analysis
• Very roughly, morphological analysis of a
word involves 2 subproblems:
• A segmentation problem: how to get from
the written text to the sequence of
morphemes that make it up.
• A morphotactic problem: how to combine
the individual morphemes together in a
legitimate way.
Segmentation/Morphotactic
Subproblems
• Segmentation problem:
– enlargement => en + large + ment
• Morphotactic problem: given what we
know about en, large and ment, how can
they be legitimately combined?
– enlargement => (en + large) + ment
– enlargement =/> en + (large + ment)
– en + ADJ => V
– V + ment => N
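The two subproblems can be sketched for the enlargement example. This is a minimal Python illustration, assuming a tiny morpheme inventory with one stem, one prefix and one suffix; the table and function names are hypothetical, and real analysers handle many more affixes and spelling changes.

```python
# Tiny morpheme inventory taken from the example; everything else is assumed.
STEMS = {"large": "ADJ"}
PREFIXES = {"en": ("ADJ", "V")}    # en + ADJ => V
SUFFIXES = {"ment": ("V", "N")}    # V + ment => N


def segment(word):
    """Segmentation: yield (prefix, stem, suffix) splits made of known morphemes."""
    for p in [""] + list(PREFIXES):
        for s in [""] + list(SUFFIXES):
            if word.startswith(p) and word.endswith(s) and len(word) >= len(p) + len(s):
                stem = word[len(p):len(word) - len(s)]
                if stem in STEMS:
                    yield p, stem, s


def attach(table, affix, cat):
    """Category after attaching affix to something of category cat, or None."""
    if affix == "":
        return cat
    needs, gives = table[affix]
    return gives if cat == needs else None


def bracketings(prefix, stem, suffix):
    """Morphotactics: try (prefix + stem) + suffix and prefix + (stem + suffix)."""
    cat = STEMS[stem]
    prefix_first = attach(PREFIXES, prefix, cat)
    if prefix_first is not None:
        prefix_first = attach(SUFFIXES, suffix, prefix_first)
    suffix_first = attach(SUFFIXES, suffix, cat)
    if suffix_first is not None:
        suffix_first = attach(PREFIXES, prefix, suffix_first)
    return prefix_first, suffix_first


for split in segment("enlargement"):
    print(split, bracketings(*split))
```

For enlargement the first bracketing yields N, while the second fails because ment cannot attach to an adjective.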
2-Level Morphology
• In 1981 the four Ks (Kimmo Koskenniemi, Lauri
Karttunen, Ronald M. Kaplan and Martin Kay)
were working on morphological analysis (MA).
• Basic idea was that MA is about computing a
relation between sets of strings at two levels:
– Surface Level (the word as written, a string made
from the surface alphabet)
– Lexical Level (string of morphemes made from the
lexical alphabet).
• The relation can be computed using finite state
transducers.
• The finite-state model is reversible: the same
transducer can be used for analysis and for generation.
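The idea of a relation between two levels, and its reversibility, can be illustrated with a toy transducer. The Python sketch below is not the two-level formalism itself: it assumes a hand-written arc table for the single stem cat, and the tag symbols +N, +SG and +PL are invented for the example.

```python
EPS = ""  # empty symbol: nothing is read or written on that level

# Arcs are (source state, lexical symbol, surface symbol, target state).
# The machine covers only "cat"; tags +N, +SG, +PL are assumed for illustration.
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", EPS, 4),
    (4, "+SG", EPS, 5),
    (4, "+PL", "s", 5),
]
FINAL = {5}


def transduce(symbols, side):
    """side=0 reads lexical symbols and writes surface ones (generation);
    side=1 reads surface symbols and writes lexical ones (analysis)."""
    results = []
    agenda = [(0, 0, [])]  # (state, input position, output so far)
    while agenda:
        state, pos, out = agenda.pop()
        if state in FINAL and pos == len(symbols):
            results.append(out)
        for src, lex, surf, dst in ARCS:
            if src != state:
                continue
            read, write = (lex, surf) if side == 0 else (surf, lex)
            new_out = out + [write] if write != EPS else out
            if read == EPS:
                agenda.append((dst, pos, new_out))      # move without consuming input
            elif pos < len(symbols) and symbols[pos] == read:
                agenda.append((dst, pos + 1, new_out))  # consume one input symbol
    return results


print(transduce(list("cats"), 1))                    # analysis
print(transduce(["c", "a", "t", "+N", "+PL"], 0))    # generation
```

The same arc table is used in both directions; only the choice of which symbol is read and which is written changes.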
What Information to Provide
• Specific Information – e.g. "kicks"
• Syntactic Information
– POS = verb
– Tense = pres
– Number = singular
– Person = 3
– Type = transitive
• Semantic Information
– event-type = Physical Action
– type-of subject = animate
– type-of object = physical
What Information to Provide
• General Information
• Class Attributes
– Agreement has (Number, Gender)
• Enumeration of possible values
– Gender = [masc, fem]
– Number = [sing, plur]
• Class Relationships
– Transitive isa Verb
– Common isa Noun
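Specific and general information can be kept apart in the data, with entries checked against the general declarations. The following Python sketch assumes dictionary-based entries; the key names and the helper functions is_a and valid are hypothetical, and the entry uses the abbreviated value sing so that it matches the enumeration.

```python
# General information: isa links between classes and enumerated values.
ISA = {"Transitive": "Verb", "Common": "Noun"}
VALUES = {"Gender": ["masc", "fem"], "Number": ["sing", "plur"]}

# Specific information for the form "kicks" (key names are assumptions).
KICKS = {
    "pos": "Verb", "tense": "pres", "number": "sing", "person": 3,
    "type": "Transitive",
    "event_type": "physical action",
    "subject_type": "animate", "object_type": "physical",
}


def is_a(cls, ancestor):
    """Follow isa links upward from cls until ancestor is found or the chain ends."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = ISA.get(cls)
    return False


def valid(entry):
    """Number must be an enumerated value, and the type must be a kind of the POS."""
    return entry["number"] in VALUES["Number"] and is_a(entry["type"], entry["pos"])


print(is_a("Transitive", "Verb"))
print(valid(KICKS))
```

Stating class relationships once, rather than in every entry, is what lets individual entries stay short.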
Two Views of the Lexicon
give rise to different issues
• Lexicon as word database
– How to represent the word collection
– Access: given an arbitrary word, how to access the
relevant entries
– What information to provide and how to express it.
• Lexicon as database about word senses
– What are the relations between word senses?
– How do word senses hook up with conceptual
knowledge?
WordNet
• In 1985 a group of psychologists and linguists at
Princeton had the idea of searching dictionaries
conceptually rather than alphabetically.
• Attempt to organise a dictionary in terms of word
meanings rather than word forms.
• What is the nature and organisation of the
lexicalised concepts that words can express?
• Distinction between word forms, word meanings,
and entries.
Lexical Matrix
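Word        Word Forms
Meanings    F1      F2      ..      Fn
M1          E1,1    E1,2
M2          E2,1
..
Mm                                  Em,n

The cells Ei,j are the entries.
synonymy: entries in the same row (e.g. E1,1 and E1,2)
polysemy: entries in the same column (e.g. E1,1 and E2,1)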
WordNet
• A key aspect of WordNet is that a given
meaning or word sense is represented as
the set of words that can be used to
express it.
• These meanings are called synsets – sets
of words with synonymous readings.
• Synsets are established empirically
according to a principle of substitutability
that is relativised to context.
The Principle of Substitutability
• Two expressions are synonymous if the
substitution of one for another never alters
the truth value of a sentence in which the
substitution is made.
• Two expressions are synonymous in
linguistic context C if the substitution of
one for the other in C does not alter the
truth value.
• e.g. plank/board in carpentry contexts
Lexical Matrix
Word                  Word Forms
Meanings              board   committee   plank   ..   Fn
{board, committee}    E1,1    E1,2
{board, plank}        E2,1                E2,3
..
Mm                                                     Em,n

The cells Ei,j are the entries.
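The matrix can be read off mechanically: synonyms share a row, and a polysemous form appears in more than one row. The Python sketch below assumes the two rows of the board example; the meaning labels M1 and M2 and the function names are placeholders.

```python
# Rows are word meanings, each listing the word forms that can express it.
# M1 and M2 are placeholder labels for the two senses of "board".
MATRIX = {
    "M1": {"board", "committee"},
    "M2": {"board", "plank"},
}


def synonyms(form):
    """Forms that share at least one row (meaning) with the given form."""
    return {f for forms in MATRIX.values() if form in forms for f in forms} - {form}


def meanings(form):
    """Rows containing the form; more than one row means the form is polysemous."""
    return sorted(m for m, forms in MATRIX.items() if form in forms)


print(sorted(synonyms("board")))
print(meanings("board"))
print(meanings("plank"))
```

Each row of this table is what WordNet calls a synset.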
WordNet
• In WordNet, the synonymy relation
between words is fundamental.
• Synsets can be thought of as representing
concepts which stand in various semantic
relations to each other.
– X Antonym Y: meaning (synset) X is opposite
to meaning (synset) Y (e.g. big, small)
– X Hyponym Y: like isa (e.g. dog, mammal)
– X Meronym Y: X is a part of Y (e.g. leg, man)
Lexicon as a Concept Graph
• We can thus imagine the WordNet Lexicon
as a gigantic graph whose nodes are
synsets and whose arcs are semantic
relations between synsets.
• Such a structure can be regarded as a
semantic map of the concepts used in a
given language.
• Many applications can be created using
the WordNet graph as a resource
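A graph of this kind can be sketched with labelled arcs between synsets. The Python example below is hand-made and does not use WordNet data: it assumes one-word stand-ins for synsets, takes the relation examples from the previous slide, and adds one extra hyponym arc (mammal, animal) so that there is a chain to follow. The names GRAPH, related and ancestors are illustrative.

```python
# Toy concept graph: keys are (synset, relation), values are target synsets.
# Synsets are reduced to single words; this is not actual WordNet content.
GRAPH = {
    ("dog", "hyponym"): ["mammal"],
    ("mammal", "hyponym"): ["animal"],   # extra arc added for the example
    ("leg", "meronym"): ["man"],
    ("big", "antonym"): ["small"],
    ("small", "antonym"): ["big"],
}


def related(synset, relation):
    """Synsets reachable from synset by one arc with the given label."""
    return GRAPH.get((synset, relation), [])


def ancestors(synset):
    """Follow hyponym arcs transitively, breadth first."""
    seen = []
    queue = list(related(synset, "hyponym"))
    while queue:
        node = queue.pop(0)
        if node not in seen:
            seen.append(node)
            queue.extend(related(node, "hyponym"))
    return seen


print(ancestors("dog"))
print(related("big", "antonym"))
```

Applications built on such a graph typically work by following arcs like this, for example by measuring path lengths between nodes.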
Using WordNet to Measure Semantic Orientations of Adjectives
Jaap Kamps, Maarten Marx, Robert J. Mokken, Maarten de Rijke
Conclusion
• The lexicon is a central building block of language-sensitive
systems.
• Dual status of lexical information:
linguistic versus world knowledge.
• As a wordlist, the lexicon has to solve the problems of
representation and access. Morphological
analysis can help to keep the number of entries at a
manageable level.
• As a collection of definitions, the lexicon has to deal
with relationships between word meanings.
