in Neural Networks
Chris Dyer
What is linguistics?
• How do human languages represent meaning?
[Figure 1: two tree diagrams. (A) "The talk I gave did not appeal to anybody." (B) "*The talk that I did not give appealed to anybody."]
Figure 1. Negative Polarity. (A) Negative polarity licensed: negative element c-commands negative polarity item.
(B) Negative polarity not licensed. Negative element does not c-command negative polarity item.
In sentence (3), by contrast, not sequentially precedes anybody, but the triangle dominating not
in Figure 1B fails to also dominate the structure containing anybody. Consequently, the sentence
is not well-formed.
Examples adapted from Everaert et al. (TICS 2015)
Language is hierarchical
The triangle dominating not in Figure 1A also dominates the structure containing anybody. (This structural configuration is called c(onstituent)-command in the linguistics literature [31].) When the relationship between not and anybody adheres to this structural configuration, the sentence is well-formed; in sentence (3), where it does not, the sentence is ill-formed.
• Not empirically controversial
• Many different theories of the details of structure
• Hypothesis: kids learn language easily because they don't consider many "obvious" structurally insensitive hypotheses
Examples adapted from Everaert et al. (TICS 2015)
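As a concrete companion to the c-command relation described above, here is a minimal sketch (not from the talk; the Node class and the toy bracketing are assumptions for illustration) of checking whether one node c-commands another in a constituency tree:

```python
# Hypothetical illustration: A c-commands B iff neither dominates the other and
# A's lowest branching ancestor also dominates B.

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for c in self.children:
            c.parent = self

def dominates(a, b):
    """True if a is an ancestor of b (or a is b itself)."""
    while b is not None:
        if b is a:
            return True
        b = b.parent
    return False

def c_commands(a, b):
    """True if a c-commands b."""
    if dominates(a, b) or dominates(b, a):
        return False
    anc = a.parent
    while anc is not None and len(anc.children) < 2:
        anc = anc.parent          # skip non-branching ancestors
    return anc is not None and dominates(anc, b)

# Toy structure for "did not appeal to anybody": 'not' c-commands 'anybody'.
anybody = Node("anybody")
not_ = Node("not")
vp = Node("VP", [Node("appeal"), Node("to"), anybody])
neg = Node("NEG'", [not_, vp])
print(c_commands(not_, anybody))   # True: the negative polarity item is licensed
```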
Recurrent neural networks
A good model of language?
• Recurrent neural networks are incredibly powerful models of sequences (e.g., of words)
[Diagram: the words of "this film is hardly a treat" as input vectors x1 … x6]
Representing a sentence
Recurrent neural network
[Diagram: an RNN reads x1 … x6 ("this film is hardly a treat"), producing hidden states h1 … h6; the sentence representation is c = h6]
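As a sketch of what the diagram above computes, the following toy code (assuming NumPy, random parameters, and a tiny vocabulary, none of which come from the talk) reads the sentence left to right and takes the final hidden state as the sentence vector c:

```python
import numpy as np

rng = np.random.default_rng(0)
d_word, d_hidden = 8, 16

# Toy vocabulary and randomly initialized parameters (assumptions for this sketch).
words = "this film is hardly a treat".split()
E = {w: rng.normal(scale=0.1, size=d_word) for w in words}   # word vectors x_t
W_xh = rng.normal(scale=0.1, size=(d_hidden, d_word))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)

def encode(sentence):
    """Elman-style RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b); return c = h_T."""
    h = np.zeros(d_hidden)
    for w in sentence:
        h = np.tanh(W_xh @ E[w] + W_hh @ h + b)
    return h

c = encode(words)   # fixed-size vector representing the whole sentence
print(c.shape)      # (16,)
```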
How do languages express meaning?
• Principle of compositionality: the meaning of a complex expression is determined by the meanings of its constituent expressions and the rules that combine them.
Syntax as Trees
[Parse tree: (S (NP (DT this) (NN film)) (VP (VBZ is) (RB hardly) (NP (DT a) (NN treat))))]
Representing a sentence
Recursive Neural Network
[Diagram: a binary tree over x1 … x6 ("this film is hardly a treat"); each internal node combines its children's vectors ℓ and r as h = tanh(W[ℓ; r] + b), and the root vector c represents the sentence]
Representing a sentence
Recursive Neural Network: "syntactic untying"
[Diagram: the same tree, but with composition parameters indexed by the node's nonterminal label — h = tanh(W_NP[ℓ; r] + b) at NP nodes, h = tanh(W_VP[ℓ; r] + b) at VP nodes, h = tanh(W_S[ℓ; r] + b) at the S root]
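Syntactic untying amounts to indexing the composition parameters by the nonterminal label of the node being built. A sketch under the same toy setup (the labels and tree below are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
labels = ["NP", "VP", "S"]
# One composition matrix per nonterminal label ("untied" parameters).
W = {lab: rng.normal(scale=0.1, size=(d, 2 * d)) for lab in labels}
b = {lab: np.zeros(d) for lab in labels}
E = {w: rng.normal(scale=0.1, size=d)
     for w in "this film is hardly a treat".split()}

def encode(tree):
    """tree is a word, or (label, left_subtree, right_subtree)."""
    if isinstance(tree, str):
        return E[tree]
    lab, left, right = tree
    h = np.concatenate([encode(left), encode(right)])
    return np.tanh(W[lab] @ h + b[lab])      # h = tanh(W_label [l; r] + b_label)

tree = ("S",
        ("NP", "this", "film"),
        ("VP", ("VP", "is", "hardly"), ("NP", "a", "treat")))
print(encode(tree).shape)   # (8,)
```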
Representing Sentences
• Bag of words/n-grams
Some Results
[Diagram: the RNTN composition combines the child vectors ℓ and r with an outer-product (tensor) term]

Accuracy (%)          all     "not good"    "not terrible"
Bigram Naive Bayes    83.1    19.0          27.3
RecNN (RNTN form)     85.4    71.4          81.8
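"Outer product" here refers to the tensor term in the RNTN composition, roughly h = tanh(ℓᵀTr + W[ℓ; r] + b) with one d×d slice of T per output unit. A minimal sketch with random parameters and assumed shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
T = rng.normal(scale=0.1, size=(d, d, d))      # tensor: one d x d slice per output unit
W = rng.normal(scale=0.1, size=(d, 2 * d))
b = np.zeros(d)

def rntn_compose(l, r):
    """RNTN-style composition: bilinear (outer-product) term plus the usual affine term."""
    bilinear = np.einsum("i,kij,j->k", l, T, r)  # l^T T[k] r for each output unit k
    affine = W @ np.concatenate([l, r]) + b
    return np.tanh(bilinear + affine)

l, r = rng.normal(size=d), rng.normal(size=d)
print(rntn_compose(l, r).shape)   # (8,)
```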
Some Predictions
• n-ary children
• Disadvantages
[Animation: generation as a sequence of actions — columns: stack, action, probability. The stack grows as (S, then (S (NP, then (S (NP The, and each action is scored from the top of the stack, e.g. p(nt(S) | top)]
Syntactic composition
What head type?
[Animation: the elements of a completed constituent, e.g. (NP The hungry cat), are composed into a single vector whose head type is the nonterminal NP]
Modeling the next action
[Diagram: RNN summaries h1 … h4 of the parser state feed the prediction of the next action]

Stack RNNs
[Animation: starting from the empty state y0 (∅), PUSH operations add x1, x2, x3 and extend the states to y1, y2, y3; POP reverts to the previous state]
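A minimal sketch of the stack RNN idea: PUSH extends the RNN by one state, and POP restores the previous state instead of reading a special "pop" symbol. Dimensions and parameters are assumptions for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
b = np.zeros(d_h)

class StackRNN:
    """Keeps the full history of hidden states so POP can revert to an earlier summary."""
    def __init__(self):
        self.states = [np.zeros(d_h)]       # y0: summary of the empty stack
        self.contents = []

    def push(self, x):
        y = np.tanh(W_xh @ x + W_hh @ self.states[-1] + b)
        self.states.append(y)
        self.contents.append(x)

    def pop(self):
        self.states.pop()
        return self.contents.pop()

    def summary(self):
        return self.states[-1]              # y_t: used to predict the next action

s = StackRNN()
x1, x2 = rng.normal(size=d_in), rng.normal(size=d_in)
s.push(x1)
after_x1 = s.summary()
s.push(x2)                 # summary now reflects [x1, x2]
s.pop()                    # POP reverts the summary to the earlier state
print(np.allclose(s.summary(), after_x1))   # True
```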
RNNGs
Inductive bias?
• What inductive biases do RNNGs exhibit?
• Then we can say that they have a bias for syntactic recency rather than sequential recency
(S (NP The talk (SBAR I did not give)) (VP appealed (PP to …
[In the figure, a brace marks "talk" as the syntactically recent context for the material after "appealed", even though it is sequentially distant]
Parameter estimation
• Generative
• Trained using gold-standard trees (here: from a treebank) to minimize cross-entropy
• To parse (find a tree for x), we need to compute
  y* = argmax_{y ∈ Y_x} p(y | x)
     = argmax_{y ∈ Y_x} p(x, y) / p(x)   (def. of conditional probability)
     = argmax_{y ∈ Y_x} p(x, y)          (the denominator is constant)
• Discriminative
• Given a sentence x, predict the sequence of actions y necessary to build its parse tree
• Instead of GEN, use SHIFT
• We call this conditional distribution q(y | x)
English PTB (Parsing)

Model                                        Type           F1
Shindo et al. (2012), single model           Gen            91.1
Vinyals et al. (2015), PTB only              Disc           90.5
Shindo et al. (2012), ensemble               Gen            92.4
Vinyals et al. (2015), semisupervised        Disc+SemiSup   92.8
Discriminative (this work), PTB only         Disc           91.7
Generative (this work), PTB only             Gen            93.6
Choe and Charniak (2016), semisupervised     Gen+SemiSup    93.8
English Language Modeling
Perplexity
p(x) = Σ_{y ∈ Y_x} p(x, y)
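A toy sketch of this marginalization and the resulting per-word perplexity (the joint probabilities below are made up purely to show the arithmetic; in practice the sum over trees is intractable and must be approximated):

```python
import math

# Toy joint probabilities p(x, y) for one sentence x and its candidate trees y.
p_xy = {"tree_1": 0.010, "tree_2": 0.004, "tree_3": 0.001}

p_x = sum(p_xy.values())                  # p(x) = sum over y in Y_x of p(x, y)
n_words = 3                               # length of the toy sentence
perplexity = math.exp(-math.log(p_x) / n_words)
print(p_x, perplexity)
```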
Transition-based parsing
• Widely used
• Good accuracy
[Animation: transition-based parsing of "I saw her duck" — the stack and buffer evolve as SHIFT and REDUCE-R actions are applied, ending with all words attached under ROOT]
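A minimal sketch of a shift-reduce transition system in the spirit of the example above: SHIFT moves a word from the buffer to the stack, REDUCE-R attaches the stack's top word to the word below it, and REDUCE-L does the reverse. The action sequence is hand-written for one reading of "I saw her duck" (the helper and its exact action inventory are assumptions, not the talk's parser):

```python
def parse(words, actions):
    """Run a fixed action sequence; return dependency arcs (head, dependent)."""
    stack, buffer, arcs = ["ROOT"], list(words), []
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        elif act == "REDUCE-R":            # right-arc: top depends on the item below it
            dep = stack.pop()
            arcs.append((stack[-1], dep))
        elif act == "REDUCE-L":            # left-arc: item below the top depends on the top
            top = stack.pop()
            dep = stack.pop()
            arcs.append((top, dep))
            stack.append(top)
    return arcs

# One hand-written derivation for "I saw her duck", reading 'duck' as a noun.
actions = ["SHIFT", "SHIFT", "REDUCE-L",   # I <- saw
           "SHIFT", "SHIFT", "REDUCE-L",   # her <- duck
           "REDUCE-R",                     # saw -> duck
           "REDUCE-R"]                     # ROOT -> saw
print(parse("I saw her duck".split(), actions))
# [('saw', 'I'), ('duck', 'her'), ('saw', 'duck'), ('ROOT', 'saw')]
```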
[Diagram: parser state for "an overhasty decision was made" — the stack (S), buffer (B), and action history (e.g. SHIFT, REDUCE-LEFT(amod)) are each summarized by a stack RNN reading up to its TOP pointer, and the summaries are combined to predict the next action p_t]
Some Results
Model             Accuracy
Feed-forward NN   93.1
Word representation
ARBITRARINESS (de Saussure, 1916)
car − c + b = bar
OPPORTUNITY
cool | coooool | coooooooool
cat + s = cats        bat + s = bats
Words as structured objects
[Figure: character sharing among related word forms — e.g. cat, dog, dogs, aardvark, dish, wash, shower, loud, louder, loudest, eat, eats, eating, eaten]
[Animation: the characters e, a, t, s are composed into a representation of the word "eats"]
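A minimal sketch of building a word vector from its characters with a forward and a backward character RNN, in the spirit of the compositional character models in these slides (character embeddings, dimensions, and parameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_c, d_h = 4, 8
chars = "abcdefghijklmnopqrstuvwxyz"
C = {ch: rng.normal(scale=0.1, size=d_c) for ch in chars}   # character embeddings
Wf_x, Wf_h = rng.normal(scale=0.1, size=(d_h, d_c)), rng.normal(scale=0.1, size=(d_h, d_h))
Wb_x, Wb_h = rng.normal(scale=0.1, size=(d_h, d_c)), rng.normal(scale=0.1, size=(d_h, d_h))

def run(seq, W_x, W_h):
    """Simple character RNN; return the final hidden state."""
    h = np.zeros(d_h)
    for ch in seq:
        h = np.tanh(W_x @ C[ch] + W_h @ h)
    return h

def word_vector(word):
    """Concatenate the final states of a forward and a backward character RNN."""
    return np.concatenate([run(word, Wf_x, Wf_h), run(reversed(word), Wb_x, Wb_h)])

# Related forms share most of their characters, so their vectors are built from shared structure.
print(word_vector("eats").shape)                 # (16,)
print(np.linalg.norm(word_vector("eat") - word_vector("eats")))
```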
Generating new word forms
Putting it all together
cats eat loudly
Language modeling results
[Bar chart: results for Turkish and Finnish across the five models below; lower is better]
Columns:
• PureC: RNN predicts language as a sequence of characters
• C: compositional character model only
• CM: character + morpheme model
• CW: character + word embedding model
• CMW: character + morpheme + word embedding model
Topics
• Recursive neural networks for sentence representation
Experimental Variants

Example                             Training signal                 Evaluation Task
The keys to the cabinet             {SINGULAR, PLURAL}              P(PL) > P(SG)?
The keys to the cabinet is/are      {SINGULAR, PLURAL}              P(PL) > P(SG)?
The keys to the cabinet are here    {GRAMMATICAL, UNGRAMMATICAL}    P(GRAMMATICAL) > P(UNGRAMMATICAL)?
The keys to the cabinet             {are, is, cat, dog, the, …}     P(are) > P(is)?
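The last variant reduces to comparing the language model's probabilities of the two verb forms after the prefix. A sketch of that check, assuming a hypothetical next_word_logprob(prefix, word) helper wrapping whichever LM is being probed:

```python
def agreement_correct(prefix, correct_verb, wrong_verb, next_word_logprob):
    """Return True if the LM prefers the correctly inflected verb after the prefix."""
    return next_word_logprob(prefix, correct_verb) > next_word_logprob(prefix, wrong_verb)

# Hypothetical usage with a made-up log-probability function.
def toy_logprob(prefix, word):
    scores = {"are": -2.1, "is": -1.7}        # made-up numbers; a real LM would score the prefix
    return scores[word]

prefix = "The keys to the cabinet".split()
print(agreement_correct(prefix, "are", "is", toy_logprob))   # False: this toy "LM" is fooled by "cabinet"
```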
More Results
Summary
• RNN Language Models are not learning the correct generalizations about syntax
• Open questions
• Any questions?