Vector Semantics: Dense Vectors

Vector Semantics: Dense Vectors via SVD
Intuition

• Approximate an N-dimensional dataset using fewer dimensions
• By first rotating the axes into a new space
• In which the highest-order dimension captures the most variance in the original dataset
• And the next dimension captures the next-most variance, etc.
• Many such (related) methods:
  • PCA – principal components analysis
  • Factor Analysis
  • SVD
[Figure: a small 2-D example dataset (axes 1–6) with rotated axes overlaid; PCA dimension 1 lies along the direction of greatest variance, PCA dimension 2 perpendicular to it]
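To make the rotation intuition concrete, here is a minimal sketch using scikit-learn's PCA on a toy 2-D dataset (the points are invented for illustration):

```python
# A minimal sketch of the PCA intuition above, using scikit-learn.
# The toy 2-D points are invented for illustration.
import numpy as np
from sklearn.decomposition import PCA

# Six points that roughly lie along a diagonal line.
X = np.array([[1, 1], [2, 2], [3, 2], [4, 4], [5, 4], [6, 6]], dtype=float)

pca = PCA(n_components=2)
X_rotated = pca.fit_transform(X)   # coordinates in the new (rotated) axes

# PCA dimension 1 should capture most of the variance,
# PCA dimension 2 the small remainder.
print(pca.explained_variance_ratio_)

# Keeping only the first dimension approximates the data with fewer dimensions.
X_1d = PCA(n_components=1).fit_transform(X)
```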
[Figure A1 from Landauer and Dumais (1997): schematic diagram of the singular value decomposition of a matrix A]
SVD applied to term-document matrix: Latent Semantic Analysis (Deerwester et al., 1988)

• If instead of keeping all m dimensions, we just keep the top k singular values. Let's say 300.

From Landauer and Dumais (1997): "reduce the number of dimensions systematically by, for example, removing those with the smallest effect on the sum-squared error of the approximation simply by deleting those with the smallest singular values. The actual algorithms used to compute SVDs for large sparse matrices of the sort involved in LSA are rather sophisticated and are not described here. Suffice it to say that cookbook versions of SVD adequate for small (e.g., 100 × 100) matrices are available in several places (e.g., Mathematica, 1991), and a free software version (Berry, 1992)."
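A minimal sketch of this pipeline, with a tiny dense count matrix standing in for a real sparse term-document matrix; scipy's svds computes only the top k singular triplets, which is what makes LSA practical at scale:

```python
# A minimal LSA-style sketch: truncated SVD of a term-document matrix.
# The tiny count matrix is invented for illustration; real term-document
# matrices are large and sparse, so we compute only the top k factors.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# rows = terms, columns = documents
A = csr_matrix(np.array([
    [3, 0, 1, 0],
    [0, 2, 0, 1],
    [1, 0, 2, 0],
    [0, 1, 0, 3],
], dtype=float))

k = 2                          # in practice LSA often keeps ~300 dimensions
U, s, Vt = svds(A, k=k)        # top-k singular triplets
term_vectors = U * s           # k-dimensional representation of each term
doc_vectors = Vt.T * s         # k-dimensional representation of each document
```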
Truncated SVD on term-term matrix

Taking only the top k dimensions after SVD is applied to the co-occurrence matrix X (I'm simplifying here by assuming the matrix has rank |V|):

$$X = W \begin{bmatrix} s_1 & 0 & 0 & \cdots & 0 \\ 0 & s_2 & 0 & \cdots & 0 \\ 0 & 0 & s_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & s_{|V|} \end{bmatrix} C$$

where X, W, S, and C are each |V| × |V|. Keeping only the top k singular values gives the truncated SVD:

$$X \approx W_k \begin{bmatrix} s_1 & 0 & 0 & \cdots & 0 \\ 0 & s_2 & 0 & \cdots & 0 \\ 0 & 0 & s_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & s_k \end{bmatrix} C_k$$

where W_k is |V| × k, the singular-value matrix is k × k, and C_k is k × |V|. Since the first dimensions encode the most variance, the truncated product can be viewed as the best rank-k approximation of the original matrix.

Figure 19.11: SVD factors a matrix X into a product of three matrices, W, S, and C. Taking the first k dimensions gives a |V| × k matrix W_k that has one k-dimensional row per word, which can be used as an embedding.
Truncated SVD produces embeddings

• Each row of the W_k matrix is a k-dimensional embedding representation of one word in the vocabulary
• k might range from 50 to 1000
• Generally we keep the top k dimensions, but some experiments suggest that getting rid of the top 1 dimension or even the top 50 dimensions is helpful (Lapesa and Evert 2014).

[Figure 19.11 repeated, highlighting the |V| × k matrix W_k whose rows are the word embeddings]
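A sketch of this step, with a random |V| × |V| matrix standing in for a real PPMI-weighted co-occurrence matrix; it also shows the Lapesa-and-Evert-style option of discarding the very top dimensions:

```python
# Sketch: word embeddings from truncated SVD of a term-term matrix X.
# X here is a random stand-in for a real |V| x |V| PPMI matrix.
import numpy as np

rng = np.random.default_rng(0)
V = 50                        # toy vocabulary size
X = rng.random((V, V))        # stand-in for a PPMI co-occurrence matrix

W, s, C = np.linalg.svd(X)    # full SVD: X = W @ diag(s) @ C

k = 10                        # keep the top k dimensions (50-1000 in practice)
embeddings = W[:, :k]         # one k-dimensional row per word

# Some experiments (Lapesa and Evert 2014) suggest dropping the very top
# dimension(s) can help:
drop = 1
embeddings_dropped = W[:, drop:k + drop]
```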
Two embeddings for each word w

• Input embedding v, in the input matrix W (d × |V|): column i of W is the d-dimensional input embedding v_i for word i in the vocabulary.
• Output embedding v′, in the output matrix W′ (|V| × d): row i of W′ is the d-dimensional vector embedding v′_i for word i in the vocabulary.
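A small sketch of the two tables with the layouts from this slide (toy sizes, random values; the word index 42 is arbitrary):

```python
# Sketch of the two embedding tables (toy sizes, random values).
import numpy as np

rng = np.random.default_rng(1)
V, d = 1000, 100
W = rng.normal(size=(d, V))       # input embeddings, one per column
W_out = rng.normal(size=(V, d))   # output embeddings, one per row

i = 42                            # vocabulary index of some word
v_i = W[:, i]                     # input embedding v_i (d-dimensional)
v_out_i = W_out[i, :]             # output embedding v'_i (d-dimensional)
```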
Setup

• Walking through a corpus pointing at word w(t), whose index in the vocabulary is j, so we'll call it wj (1 ≤ j ≤ |V|).
• Let's predict w(t+1), whose index in the vocabulary is k (1 ≤ k ≤ |V|).
• Hence our task is to compute P(wk|wj).
[Figures: skip-gram network diagrams, highlighting the context embedding for word k]
Learning

• Start with some initial embeddings (e.g., random)
• Iteratively make the embeddings for a word
  • more like the embeddings of its neighbors
  • less like the embeddings of other words.
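A toy sketch of that intuition only (the learning rate, vectors, and update rule here are invented for illustration; the real update falls out of the skip-gram loss function later in this deck):

```python
# Sketch of the learning intuition: nudge a word's embedding toward a
# neighbor and away from some other word.  Toy vectors, made-up update.
import numpy as np

rng = np.random.default_rng(0)
d, lr = 50, 0.1
w = rng.normal(scale=0.1, size=d)          # embedding being learned
neighbor = rng.normal(scale=0.1, size=d)   # embedding of a context neighbor
other = rng.normal(scale=0.1, size=d)      # embedding of a random other word

for _ in range(20):
    w += lr * (neighbor - w)               # more like its neighbors
    w -= 0.1 * lr * (other - w)            # less like other words

# after repeated updates, w's similarity to the neighbor grows
print(float(w @ neighbor), float(w @ other))
```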
Dan
Jurafsky
wt xj W C d ⨉ |V| yk wt+1
|V|⨉d
x|V|
y|V|
1⨉d
1⨉|V| 1⨉|V|
27
One-hot vectors

• A vector of length |V|
• 1 for the target word and 0 for other words
• So if "popsicle" is vocabulary word 5
• The one-hot vector is [0,0,0,0,1,0,0,0,0…….0]

w0 w1 … wj … w|V|
0 0 0 0 0 … 0 0 0 0 1 0 0 0 0 0 … 0 0 0 0
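The same vector in a two-line sketch (numpy; indices are 0-based, so vocabulary word 5 sits at index 4):

```python
# One-hot vector for "popsicle" as vocabulary word 5 (index 4 if 0-based).
import numpy as np

V = 10              # toy vocabulary size
x = np.zeros(V)
x[4] = 1.0          # 1 for the target word, 0 for all others
print(x)            # [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
```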
Skip-gram

• The projection layer is just the input embedding of the current word: h = v_j
• The output layer takes dot products with the context embeddings: o = Ch, so o_k = c_k · v_j

[Figure: input layer (1-hot input vector x_j, 1 × |V|) → input matrix W (|V| × d) → projection layer (embedding for w_t, 1 × d) → output matrix C (d × |V|) → output layer (probabilities of context words, 1 × |V|)]

But the dot product c_k · v_j is not a probability, it's just a number.
Turning the dot product into a probability

The skip-gram computes the probability p(wk|wj) from the dot product between the word vector for j (v_j) and the context vector for k (c_k), turning the dot product into a probability by passing it through a softmax:

$$p(w_k \mid w_j) = \frac{\exp(c_k \cdot v_j)}{\sum_{i \in |V|} \exp(c_i \cdot v_j)}$$

But the denominator requires a dot product with every word in the vocabulary, which is expensive.

• Instead: just sample a few of those negative words
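A sketch of this forward pass with toy shapes (random matrices, using the |V| × d and d × |V| layouts from the network figure above):

```python
# Sketch of the skip-gram forward pass: h = v_j, o = C h, softmax over o.
import numpy as np

rng = np.random.default_rng(2)
V, d = 1000, 100
W = rng.normal(size=(V, d))    # input embeddings, one row per word
C = rng.normal(size=(d, V))    # context/output embeddings, one column per word

j = 7                          # index of the observed word w_j
h = W[j]                       # projection layer: just the embedding v_j
o = C.T @ h                    # o_k = c_k . v_j for every k in the vocabulary

# softmax turns the scores into probabilities p(w_k | w_j)
p = np.exp(o - o.max()) / np.exp(o - o.max()).sum()
k = 123
print(p[k])                    # note the |V|-term denominator just computed
```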
Skip-gram with negative sampling: Goal in learning

This section offers a brief sketch of how this works. In the training phase, the algorithm walks through the corpus, at each target word choosing the surrounding context words as positive examples, and for each positive example also choosing k samples or negative samples: non-neighbor words. The goal will be to move the embeddings toward the neighbor words and away from the noise words.

For example, in walking through the example text below we come to the word apricot, and let L = 2 so we have 4 context words c1 through c4:

lemon, a [tablespoon of apricot preserves or] jam
              c1     c2    w       c3     c4

• Make the word like the context words. The goal is to learn an embedding whose dot product with each context word is high. In practice skip-gram uses a sigmoid function σ of the dot product, where σ(x) = 1/(1 + e⁻ˣ). So for the above example we want σ(c1 · w) + σ(c2 · w) + σ(c3 · w) + σ(c4 · w) to be high.

• And not like k randomly selected "noise words". In addition, for each context word the algorithm chooses k noise words according to their unigram frequency. If we let k = 2, for each target/context pair we'll have 2 noise words for each of the 4 context words:

[cement metaphysical dear coaxial attendant whence forever puddle]
   n1       n2        n3     n4       n5      n6      n7     n8

• We want this to be low: we'd like these noise words n to have a low dot product with our target embedding w; in other words we want σ(n1 · w) + σ(n2 · w) + … + σ(n8 · w) to be low.
Skip-gram with negative sampling: Loss function

More formally, the learning objective for one word/context pair (w, c) is

$$\log \sigma(c \cdot w) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim p(w)}\left[\log \sigma(-w_i \cdot w)\right]$$
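A sketch putting the two pieces together: sample noise words by unigram frequency, then score one (w, c) pair with the objective above, approximating the expectation with the k sampled noise embeddings. The counts and all vectors are toy stand-ins:

```python
# Sketch: negative sampling plus the SGNS objective for one (w, c) pair.
# Vocabulary counts and embeddings are toy stand-ins.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
vocab = ["cement", "metaphysical", "dear", "coaxial",
         "attendant", "whence", "forever", "puddle"]
counts = np.array([40, 5, 30, 3, 10, 2, 25, 8], dtype=float)
p_unigram = counts / counts.sum()              # noise distribution p(w)

d, k = 50, 2
target_emb = {w: rng.normal(scale=0.1, size=d) for w in vocab + ["apricot"]}
context_emb = {w: rng.normal(scale=0.1, size=d) for w in vocab + ["tablespoon"]}

w = target_emb["apricot"]                      # target word embedding
c = context_emb["tablespoon"]                  # one positive context embedding

# k noise words per (target, context) pair, drawn by unigram frequency
noise_words = rng.choice(vocab, size=k, p=p_unigram)

# log s(c.w) + sum_i log s(-n_i.w): high when w is like c, unlike the noise
objective = np.log(sigmoid(c @ w))
for n in noise_words:
    objective += np.log(sigmoid(-(context_emb[n] @ w)))
print(noise_words, objective)                  # training raises this objective
```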
Properties of embeddings

We'll discuss in Section 17.8 how to evaluate the quality of different embeddings. But it is also sometimes helpful to visualize them. Fig. 17.14 shows the words/phrases that are most similar to some sample words using the phrase-based version of the skip-gram algorithm (Mikolov et al., 2013a).

• Nearest words to some embeddings (Mikolov et al. 2013)

One semantic property of various kinds of embeddings that may play a role in their usefulness is their ability to capture relational meanings. Mikolov et al. (2013b) demonstrates that the offsets between vector embeddings can capture some relations between words, for example that the result of the expression vector('king') − vector('man') + vector('woman') is a vector close to vector('queen').
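A sketch of the offset idea; the embeddings below are random placeholders, so the printed "answer" is meaningless here, but with real trained vectors king − man + woman lands near queen:

```python
# Sketch of the vector-offset ("analogy") idea from Mikolov et al. (2013b).
# Embeddings are random placeholders standing in for trained vectors.
import numpy as np

rng = np.random.default_rng(5)
vocab = ["king", "man", "woman", "queen", "apple", "walked"]
emb = {w: rng.normal(size=50) for w in vocab}
for w in emb:                                  # cosine works best normalized
    emb[w] /= np.linalg.norm(emb[w])

target = emb["king"] - emb["man"] + emb["woman"]
target /= np.linalg.norm(target)

# nearest word by cosine similarity, excluding the input words
candidates = [w for w in vocab if w not in ("king", "man", "woman")]
print(max(candidates, key=lambda w: emb[w] @ target))
```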
Vector Semantics: Brown clustering
Brown clustering

• An agglomerative clustering algorithm that clusters words based on which words precede or follow them
• These word clusters can be turned into a kind of vector
• We'll give a very brief sketch here.
Brown Algorithm

• Each intermediate node is a cluster
• "chairman" is 0010, "months" = 01, and verbs = 1

[Figure: binary merge tree, reading left branches as 0 and right branches as 1. Leaves: CEO = 000, chairman = 0010, president = 0011, November = 010, October = 011, run = 100, sprint = 101, walk = 11]
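A sketch of how these bit strings become usable features: prefixes of a word's path give clusters at several granularities (a common way Brown clusters are fed to parsers), using the codes from the tree above; the feature names are invented for illustration:

```python
# Sketch: turning Brown-cluster bit strings into prefix features, using the
# example codes from the tree above.
brown_code = {"CEO": "000", "chairman": "0010", "president": "0011",
              "November": "010", "October": "011",
              "run": "100", "sprint": "101", "walk": "11"}

def prefix_features(word, lengths=(2, 4)):
    """Cluster-prefix features at a few granularities."""
    code = brown_code[word]
    return {f"brown_{n}": code[:n] for n in lengths if len(code) >= n}

print(prefix_features("chairman"))   # {'brown_2': '00', 'brown_4': '0010'}
print(prefix_features("president"))  # shares the '00' prefix with chairman
```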
Brown cluster examples

Because they are based on immediately neighboring words, Brown clusters are most commonly used for representing the syntactic properties of words, and hence are commonly used as a feature in parsers. Nonetheless, the clusters do represent some semantic properties as well. Fig. 19.17 shows some examples from a large clustering from Brown et al. (1992).

Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
June March July April January December October November September August
pressure temperature permeability density porosity stress velocity viscosity gravity tension
anyone someone anybody somebody
had hadn't hath would've could've should've must've might've
asking telling wondering instructing informing kidding reminding bothering thanking deposing
mother wife father son husband brother daughter sister boss uncle
great big vast sudden mere sheer gigantic lifelong scant colossal
down backwards ashore sideways southward northward overboard aloft downwards adrift

Figure 19.17: Some sample Brown clusters from a 260,741-word vocabulary trained on 366 million words of running text (Brown et al., 1992). Note the mixed syntactic-semantic nature of the clusters.
Class-based language model

The Brown algorithm makes use of the class-based language model (Brown et al., 1992), a model in which each word w ∈ V belongs to a class c ∈ C with a probability P(w|c). A class-based LM assigns a probability to a pair of words w_{i−1} and w_i by modeling the transition between classes rather than between words:

$$P(w_i \mid w_{i-1}) = P(c_i \mid c_{i-1})\, P(w_i \mid c_i)$$

The class-based LM can be used to assign a probability to an entire corpus given a particular clustering C as follows:

$$P(\text{corpus} \mid C) = \prod_{i=1}^{n} P(c_i \mid c_{i-1})\, P(w_i \mid c_i)$$

Class-based language models are generally not used as a language model for applications like machine translation or speech recognition because they don't work as well.
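A sketch of the corpus probability above, assuming each word belongs to exactly one class and that the class-transition and emission tables are given (all numbers are toy values):

```python
# Sketch: scoring a corpus with a class-based LM,
#   P(corpus | C) = prod_i P(c_i | c_{i-1}) * P(w_i | c_i)
# Classes, transitions, and emissions are toy values.
import math

word_class = {"run": "VERB", "sprint": "VERB", "October": "MONTH"}
p_trans = {("<s>", "MONTH"): 0.3, ("MONTH", "VERB"): 0.4, ("VERB", "VERB"): 0.2}
p_emit = {("MONTH", "October"): 0.5, ("VERB", "run"): 0.3, ("VERB", "sprint"): 0.1}

def corpus_logprob(words):
    prev = "<s>"
    logp = 0.0
    for w in words:
        c = word_class[w]                  # each word has exactly one class here
        logp += math.log(p_trans[(prev, c)]) + math.log(p_emit[(c, w)])
        prev = c
    return logp

print(math.exp(corpus_logprob(["October", "run", "sprint"])))
```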