Professional Documents
Cultural Documents
Philipp Koehn
27 April 2017
● Language as data
● Language models
● Part of speech
● Morphology
● Semantics
● Question
When was Barack Obama born?
● This is easy.
– just phrase a Google query properly:
"Barack Obama was born on *"
– syntactic rules that convert questions into statements are straightforward
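As a sketch of such a syntactic rule, one pattern can move the auxiliary after the subject and replace the question word with a wildcard (the helper name and the single pattern are illustrative assumptions, not a general system):

```python
import re

def question_to_query(question):
    """Turn a 'When was X born?' question into a wildcard search query.

    Hypothetical helper: a single syntactic rule that moves the
    auxiliary ('was') after the subject and replaces the question
    word with a wildcard.
    """
    m = re.match(r"When was (.+) born\?", question)
    if m:
        return f'"{m.group(1)} was born on *"'
    return None  # no rule matched

print(question_to_query("When was Barack Obama born?"))
# '"Barack Obama was born on *"'
```

A real system would need one such rule per question type, which is tedious but, as the slide says, straightforward.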
● Question
What kind of plants grow in Maryland?
● What is hard?
– words may have different meanings
– we need to be able to disambiguate between them
● Question
Do the police use dogs to sniff for drugs?
● What is hard?
– words may have the same meaning (synonyms)
– we need to be able to match them
● Question
What is the name of George Bush’s poodle?
● What is hard?
– we need to know that poodle and terrier are related, so we can give a proper
response
– words need to be grouped together into semantically related classes
● Question
Which animals love to swim?
● What is hard?
– some words belong to groups which are referred to by other words
– we need to have a database of such A is-a B relationships, so-called ontologies
● Question
Has Poland reduced its carbon emissions since 1989?
● What is hard?
– we need a more complex semantic database
– we need to do inference
language as data
● But also: Zipf's law
f × r = k
– f = frequency of a word
– r = rank of a word (if sorted by frequency)
– k = a constant
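The relationship can be checked on any word-frequency list; a minimal sketch (the toy corpus is an assumption, and on such small data the law holds only very roughly):

```python
from collections import Counter

# Toy corpus; Zipf's law only emerges clearly on large corpora.
text = ("the cat sat on the mat the dog sat on the log "
        "the cat saw the dog and the dog saw the cat").split()

counts = Counter(text)
# Sort words by frequency; rank r starts at 1. For Zipf's law,
# f * r should stay roughly constant down the list.
for r, (word, f) in enumerate(counts.most_common(), start=1):
    print(r, word, f, f * r)
```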
language models
● Sparse data: Many good English sentences will not have been seen before
p(w1, w2, w3, ..., wn) = p(w1) p(w2∣w1) p(w3∣w1, w2)...p(wn∣w1, w2, ...wn−1)
● Markov assumption:
– only previous history matters
– limited memory: only last k words are included in history
(older words less relevant)
→ kth order Markov model
p(w2∣w1) = count(w1, w2) / count(w1)
the green (total: 1748)      the red (total: 225)       the blue (total: 54)
word    count  prob.         word    count  prob.       word    count  prob.
paper     801  0.458         cross     123  0.547       box        16  0.296
group     640  0.367         tape       31  0.138       .           6  0.111
light     110  0.063         army        9  0.040       flag        6  0.111
party      27  0.015         card        7  0.031       ,           3  0.056
ecu        21  0.012         ,           5  0.022       angel       3  0.056
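The maximum likelihood estimate behind these numbers can be sketched in a few lines (the toy corpus here is an assumption; the counts in the table come from a much larger corpus):

```python
from collections import Counter

# Toy corpus, for illustration only.
corpus = ("the green paper the green group the red cross "
          "the blue box the green paper").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(w2, w1):
    # Maximum likelihood estimate: p(w2|w1) = count(w1, w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

print(p("green", "the"))
```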
● Cross-entropy
H(W) = −(1/n) log2 p(w1, w2, ..., wn)
● Or, perplexity
perplexity(W) = 2^H(W)
● Smoothing
– adjust counts for seen n-grams
– use probability mass for unseen n-grams
– many discount schemes developed
● Backoff
– if 5-gram unseen → use 4-gram instead
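The entropy and perplexity computations above can be sketched directly (the bigram probabilities below are made-up illustrative values, not estimates from data):

```python
import math

# Hypothetical bigram probabilities for a three-word sentence:
# p(the | <s>), p(green | the), p(paper | green)
probs = [0.1, 0.6, 0.458]

n = len(probs)
log_p = sum(math.log2(p) for p in probs)

# Cross-entropy: H(W) = -(1/n) log2 p(W); perplexity = 2^H(W)
H = -log_p / n
perplexity = 2 ** H
print(round(H, 3), round(perplexity, 3))
```

Note that 2^H(W) equals p(W)^(−1/n), i.e. perplexity is the inverse geometric mean of the word probabilities.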
parts of speech
● Most of the time, the local context disambiguates the part of speech
● Task: Given a text of English, identify the parts of speech of each word
● Example
– Input: Word sequence
Time flies like an arrow
– Output: Tag sequence
Time/NN flies/VB like/P an/DET arrow/NN
● Local context
– two determiners rarely follow each other
– two base form verbs rarely follow each other
– a determiner is almost always followed by an adjective or noun
p(S∣T) = ∏i p(wi∣ti)
● p(T) could be called a part-of-speech language model, for which we can use an
n-gram model (bigram):
p(T) = ∏i p(ti∣ti−1)
● We can estimate p(S∣T) and p(T) with maximum likelihood estimation (and
maybe some smoothing)
[HMM state diagram: states START, VB, NN, IN, DET, END, with emitted words such as "flies" and "like"]
● Problem: if we have on average c choices for each of the n words, there are c^n
possible tag sequences — too many to evaluate efficiently
[Viterbi trellis diagrams: the states VB, NN, DET, IN are expanded step by step for the input words "time", "flies", ...]
● Intuition: since the state transitions out of a state depend only on the current state
(and not on previous states), we can record for each state the optimal path
● We record:
– cheapest cost to state j at step s in δ_j(s)
– backtrace from that state to the best predecessor in ψ_j(s)
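The δ/ψ recurrence above can be sketched as follows (the tag set, transition, and emission probabilities are illustrative assumptions, not estimates from real data):

```python
# Viterbi decoding sketch for an HMM POS tagger.
tags = ["NN", "VB"]
trans = {  # p(tag | previous tag), with START as the initial state
    ("START", "NN"): 0.7, ("START", "VB"): 0.3,
    ("NN", "NN"): 0.3, ("NN", "VB"): 0.7,
    ("VB", "NN"): 0.6, ("VB", "VB"): 0.4,
}
emit = {  # p(word | tag)
    ("NN", "time"): 0.1, ("VB", "time"): 0.01,
    ("NN", "flies"): 0.05, ("VB", "flies"): 0.1,
}

def viterbi(words):
    # delta[s][t]: best probability of any tag sequence ending in tag t at step s
    # psi[s][t]: best predecessor tag of t at step s (the backtrace)
    delta = [{t: trans[("START", t)] * emit[(t, words[0])] for t in tags}]
    psi = [{}]
    for s in range(1, len(words)):
        delta.append({})
        psi.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[s - 1][p] * trans[(p, t)])
            delta[s][t] = (delta[s - 1][best_prev] * trans[(best_prev, t)]
                           * emit[(t, words[s])])
            psi[s][t] = best_prev
    # Follow the backtrace from the best final tag.
    best = max(tags, key=lambda t: delta[-1][t])
    seq = [best]
    for s in range(len(words) - 1, 0, -1):
        best = psi[s][best]
        seq.append(best)
    return list(reversed(seq))

print(viterbi(["time", "flies"]))
# ['NN', 'VB']
```

Each state records only its best predecessor, so the work is linear in the sentence length rather than exponential (c^n) in it.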
morphology
● Plural of nouns
cat+s
● Comparative of adjectives
small+er
● Formation of adverbs
great+ly
● Verb tenses
walk+ed
● Adjectives
un+friendly
dis+interested
● Verbs
re+consider
abso+bloody+lutely
unbe+bloody+lievable
● Why not:
ab+bloody+solutely
● Circumfixes: no example in English
ge+sag+t (German)
● A function of morphology:
morphology reduces the need to create completely new words
● Alternatives
– Some languages have no verb tenses
→ use explicit time references (yesterday)
– Cased noun phrases often play the same role as prepositional phrases
[Finite-state automaton S → 1 → E: stems laugh, walk, report on the arc S → 1; suffixes +s, +ed, +ing on the arc 1 → E]
● Multiple stems: implements regular verb morphology
→ laughs, laughed, laughing
walks, walked, walking
reports, reported, reporting
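A minimal sketch of this automaton: every path from the start state through a stem and a suffix yields one word form (the function name is an assumption for illustration):

```python
# Finite-state sketch of regular verb morphology: a single suffix
# automaton shared by multiple stems, as in the diagram above.
stems = ["laugh", "walk", "report"]
suffixes = ["s", "ed", "ing"]

def generate():
    # Every path S -> stem -> 1 -> suffix -> E yields one word form.
    return [stem + suffix for stem in stems for suffix in suffixes]

print(generate())
```

Sharing the suffix arcs means adding a new regular verb costs one arc, not three new word forms.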
[Letter-level automaton sharing the suffix arcs +s, +ed, +ing across stems such as walk and want]
syntax
● The adjective interesting gives more information about the noun lecture
● The noun lecture is the object of the verb like, specifying what is being liked
● The pronoun I is the subject of the verb like, specifying who is doing the liking
[Dependency tree: like/VB is the root, with dependents I/PRO and lecture/NN; lecture/NN has dependents the/DET and interesting/JJ]
● Internal nodes combine leaf nodes into phrases, such as noun phrases (NP)
[Syntax tree: (S (NP (PRO I)) (VP (VB like) (NP (DET the) (JJ interesting) (NN lecture))))]
● Task: parsing
– given: an input sentence with part-of-speech tags
– wanted: the right syntax tree for it
● Moving up the hierarchy, languages are more expressive and parsing becomes
computationally more expensive
[Two parses of "I see the woman with the telescope", showing PP-attachment ambiguity:
(S (NP (PRO I)) (VP (VB see) (NP (DET the) (NN woman)) (PP (IN with) (NP (DET the) (NN telescope)))))
(S (NP (PRO I)) (VP (VB see) (NP (NP (DET the) (NN woman)) (PP (IN with) (NP (DET the) (NN telescope))))))]
[Two parses of "Mary likes Jim and John from Hoboken", showing attachment ambiguity in the coordination:
(S (NP (NNP Mary)) (VP (VB likes) (NP (NP (NNP Jim)) (CC and) (NP (NNP John))) (PP (IN from) (NP (NNP Hoboken)))))
(S (NP (NNP Mary)) (VP (VB likes) (NP (NP (NNP Jim)) (CC and) (NP (NP (NNP John)) (PP (IN from) (NP (NNP Hoboken)))))))]
[Bottom-up parsing, step by step, over the tag sequence PRO VB DET JJ NN: first NP over PRO, then NP over DET JJ NN, then VP over VB NP, finally S over NP VP]
p(tree) = ∏i p(rulei)
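This rule-product can be computed by a simple recursion over the tree (the rule probabilities below are illustrative assumptions, and emission probabilities for the words are omitted):

```python
# Probability of a parse tree under a PCFG: the product of the
# probabilities of the rules used. Rule probabilities are made up.
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("PRO",)): 0.3,
    ("NP", ("DET", "JJ", "NN")): 0.2,
    ("VP", ("VB", "NP")): 0.5,
}

# The tree for "I like the interesting lecture", as (label, children) pairs;
# pre-terminals carry empty child lists.
tree = ("S", [("NP", [("PRO", [])]),
              ("VP", [("VB", []),
                      ("NP", [("DET", []), ("JJ", []), ("NN", [])])])])

def p_tree(node):
    label, children = node
    if not children:          # pre-terminal: word emissions omitted here
        return 1.0
    rhs = tuple(child[0] for child in children)
    prob = rule_prob[(label, rhs)]
    for child in children:
        prob *= p_tree(child)
    return prob

print(p_tree(tree))
# 1.0 * 0.3 * 0.5 * 0.2 = 0.03
```

Given several candidate trees for an ambiguous sentence, a probabilistic parser picks the one with the highest such product.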
semantics
● Example: bank
– financial institution: I put my money in the bank.
– river shore: He rested at the bank of the river.
● More features
– any content words in a 50 word window (animal, equipment, employee, ...)
– syntactically related words, syntactic role in the sentence
– topic of the text
– part-of-speech tag, surrounding part-of-speech tags
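A minimal word-sense disambiguation sketch in the spirit of the context-window feature above (the clue-word sets are illustrative assumptions, not a trained model):

```python
# Pick the sense of "bank" whose clue words overlap most with the context.
sense_clues = {
    "financial institution": {"money", "deposit", "account", "loan"},
    "river shore": {"river", "water", "fishing", "rested"},
}

def disambiguate(context_words):
    # Score each sense by overlap between its clue set and the context window.
    def overlap(sense):
        return len(sense_clues[sense] & set(context_words))
    return max(sense_clues, key=overlap)

print(disambiguate("he rested at the bank of the river".split()))
# river shore
```

A real system would learn these associations from sense-annotated data and combine many feature types, rather than hand-picking clue words.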
● Specific verbs typically require arguments with specific thematic roles and allow
adjuncts with specific thematic roles.
questions?