Vector Semantics: Dense Vectors

Vector Semantics: Dense Vectors via SVD
Intuition

• Approximate an N-dimensional dataset using fewer dimensions
• By first rotating the axes into a new space
• In which the highest-order dimension captures the most variance in the original dataset
• And the next dimension captures the next-most variance, etc.
• Many such (related) methods:
  • PCA – principal components analysis
  • Factor Analysis
  • SVD
[Figure: a small 2-D example dataset (axes 1–6) with rotated axes overlaid; PCA dimension 1 lies along the direction of greatest variance, PCA dimension 2 perpendicular to it]
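To make the rotation intuition concrete, here is a minimal sketch using scikit-learn's PCA on a toy 2-D dataset (the points are invented for illustration):

```python
# A minimal sketch of the PCA intuition above, using scikit-learn.
# The toy 2-D points are invented for illustration.
import numpy as np
from sklearn.decomposition import PCA

# Six points that roughly lie along a diagonal line.
X = np.array([[1, 1], [2, 2], [3, 2], [4, 4], [5, 4], [6, 6]], dtype=float)

pca = PCA(n_components=2)
X_rotated = pca.fit_transform(X)   # coordinates in the new (rotated) axes

# PCA dimension 1 should capture most of the variance,
# PCA dimension 2 the small remainder.
print(pca.explained_variance_ratio_)

# Keeping only the first dimension approximates the data with fewer dimensions.
X_1d = PCA(n_components=1).fit_transform(X)
```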
[Figure A1 from Landauer and Dumais (1997): schematic diagram of the singular value decomposition of a matrix A]
SVD applied to term-document matrix: Latent Semantic Analysis (Deerwester et al., 1988)

• If instead of keeping all m dimensions, we just keep the top k singular values. Let's say 300.

From Landauer and Dumais (1997): "reduce the number of dimensions systematically by, for example, removing those with the smallest effect on the sum-squared error of the approximation simply by deleting those with the smallest singular values. The actual algorithms used to compute SVDs for large sparse matrices of the sort involved in LSA are rather sophisticated and are not described here. Suffice it to say that cookbook versions of SVD adequate for small (e.g., 100 × 100) matrices are available in several places (e.g., Mathematica, 1991), and a free software version (Berry, 1992)."
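A minimal sketch of this pipeline, with a tiny dense count matrix standing in for a real sparse term-document matrix; scipy's svds computes only the top k singular triplets, which is what makes LSA practical at scale:

```python
# A minimal LSA-style sketch: truncated SVD of a term-document matrix.
# The tiny count matrix is invented for illustration; real term-document
# matrices are large and sparse, so we compute only the top k factors.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# rows = terms, columns = documents
A = csr_matrix(np.array([
    [3, 0, 1, 0],
    [0, 2, 0, 1],
    [1, 0, 2, 0],
    [0, 1, 0, 3],
], dtype=float))

k = 2                          # in practice LSA often keeps ~300 dimensions
U, s, Vt = svds(A, k=k)        # top-k singular triplets
term_vectors = U * s           # k-dimensional representation of each term
doc_vectors = Vt.T * s         # k-dimensional representation of each document
```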
Truncated SVD on term-term matrix

Taking only the top k dimensions after SVD is applied to the co-occurrence matrix X (I'm simplifying here by assuming the matrix has rank |V|):

$$X = W \begin{bmatrix} s_1 & 0 & 0 & \cdots & 0 \\ 0 & s_2 & 0 & \cdots & 0 \\ 0 & 0 & s_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & s_{|V|} \end{bmatrix} C$$

where X, W, S, and C are each |V| × |V|. Keeping only the top k singular values gives the truncated SVD:

$$X \approx W_k \begin{bmatrix} s_1 & 0 & 0 & \cdots & 0 \\ 0 & s_2 & 0 & \cdots & 0 \\ 0 & 0 & s_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & s_k \end{bmatrix} C_k$$

where W_k is |V| × k, the singular-value matrix is k × k, and C_k is k × |V|. Since the first dimensions encode the most variance, the truncated product can be viewed as the best rank-k approximation of the original matrix.

Figure 19.11: SVD factors a matrix X into a product of three matrices, W, S, and C. Taking the first k dimensions gives a |V| × k matrix W_k that has one k-dimensional row per word, which can be used as an embedding.
Truncated SVD produces embeddings

• Each row of the W_k matrix is a k-dimensional embedding representation of one word in the vocabulary
• k might range from 50 to 1000
• Generally we keep the top k dimensions, but some experiments suggest that getting rid of the top 1 dimension or even the top 50 dimensions is helpful (Lapesa and Evert 2014).

[Figure 19.11 repeated, highlighting the |V| × k matrix W_k whose rows are the word embeddings]
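A sketch of this step, with a random |V| × |V| matrix standing in for a real PPMI-weighted co-occurrence matrix; it also shows the Lapesa-and-Evert-style option of discarding the very top dimensions:

```python
# Sketch: word embeddings from truncated SVD of a term-term matrix X.
# X here is a random stand-in for a real |V| x |V| PPMI matrix.
import numpy as np

rng = np.random.default_rng(0)
V = 50                        # toy vocabulary size
X = rng.random((V, V))        # stand-in for a PPMI co-occurrence matrix

W, s, C = np.linalg.svd(X)    # full SVD: X = W @ diag(s) @ C

k = 10                        # keep the top k dimensions (50-1000 in practice)
embeddings = W[:, :k]         # one k-dimensional row per word

# Some experiments (Lapesa and Evert 2014) suggest dropping the very top
# dimension(s) can help:
drop = 1
embeddings_dropped = W[:, drop:k + drop]
```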
Two embeddings for each word w

• Input embedding v, in the input matrix W (d × |V|): column i of W is the d-dimensional input embedding v_i for word i in the vocabulary.
• Output embedding v′, in the output matrix W′ (|V| × d): row i of W′ is the d-dimensional vector embedding v′_i for word i in the vocabulary.
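A small sketch of the two tables with the layouts from this slide (toy sizes, random values; the word index 42 is arbitrary):

```python
# Sketch of the two embedding tables (toy sizes, random values).
import numpy as np

rng = np.random.default_rng(1)
V, d = 1000, 100
W = rng.normal(size=(d, V))       # input embeddings, one per column
W_out = rng.normal(size=(V, d))   # output embeddings, one per row

i = 42                            # vocabulary index of some word
v_i = W[:, i]                     # input embedding v_i (d-dimensional)
v_out_i = W_out[i, :]             # output embedding v'_i (d-dimensional)
```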
Setup

• Walking through a corpus pointing at word w(t), whose index in the vocabulary is j, so we'll call it wj (1 ≤ j ≤ |V|).
• Let's predict w(t+1), whose index in the vocabulary is k (1 ≤ k ≤ |V|).
• Hence our task is to compute P(wk|wj).
[Figures: skip-gram network diagrams, highlighting the context embedding for word k]
Learning

• Start with some initial embeddings (e.g., random)
• Iteratively make the embeddings for a word
  • more like the embeddings of its neighbors
  • less like the embeddings of other words.
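A toy sketch of that intuition only (the learning rate, vectors, and update rule here are invented for illustration; the real update falls out of the skip-gram loss function later in this deck):

```python
# Sketch of the learning intuition: nudge a word's embedding toward a
# neighbor and away from some other word.  Toy vectors, made-up update.
import numpy as np

rng = np.random.default_rng(0)
d, lr = 50, 0.1
w = rng.normal(scale=0.1, size=d)          # embedding being learned
neighbor = rng.normal(scale=0.1, size=d)   # embedding of a context neighbor
other = rng.normal(scale=0.1, size=d)      # embedding of a random other word

for _ in range(20):
    w += lr * (neighbor - w)               # more like its neighbors
    w -= 0.1 * lr * (other - w)            # less like other words

# after repeated updates, w's similarity to the neighbor grows
print(float(w @ neighbor), float(w @ other))
```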
Dan
Jurafsky
wt xj W C d ⨉ |V| yk wt+1
|V|⨉d
x|V|
y|V|
1⨉d
1⨉|V| 1⨉|V|
27
One-hot vectors

• A vector of length |V|
• 1 for the target word and 0 for other words
• So if "popsicle" is vocabulary word 5
• The one-hot vector is [0,0,0,0,1,0,0,0,0…….0]

w0 w1 … wj … w|V|
0 0 0 0 0 … 0 0 0 0 1 0 0 0 0 0 … 0 0 0 0
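The same vector in a two-line sketch (numpy; indices are 0-based, so vocabulary word 5 sits at index 4):

```python
# One-hot vector for "popsicle" as vocabulary word 5 (index 4 if 0-based).
import numpy as np

V = 10              # toy vocabulary size
x = np.zeros(V)
x[4] = 1.0          # 1 for the target word, 0 for all others
print(x)            # [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
```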
Skip-gram

• The projection layer is just the input embedding of the current word: h = v_j
• The output layer takes dot products with the context embeddings: o = Ch, so o_k = c_k · v_j

[Figure: input layer (1-hot input vector x_j, 1 × |V|) → input matrix W (|V| × d) → projection layer (embedding for w_t, 1 × d) → output matrix C (d × |V|) → output layer (probabilities of context words, 1 × |V|)]

But the dot product c_k · v_j is not a probability, it's just a number.
Turning the dot product into a probability

The skip-gram computes the probability p(wk|wj) from the dot product between the word vector for j (v_j) and the context vector for k (c_k), turning the dot product into a probability by passing it through a softmax:

$$p(w_k \mid w_j) = \frac{\exp(c_k \cdot v_j)}{\sum_{i \in |V|} \exp(c_i \cdot v_j)}$$

But the denominator requires a dot product with every word in the vocabulary, which is expensive.

• Instead: just sample a few of those negative words
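A sketch of this forward pass with toy shapes (random matrices, using the |V| × d and d × |V| layouts from the network figure above):

```python
# Sketch of the skip-gram forward pass: h = v_j, o = C h, softmax over o.
import numpy as np

rng = np.random.default_rng(2)
V, d = 1000, 100
W = rng.normal(size=(V, d))    # input embeddings, one row per word
C = rng.normal(size=(d, V))    # context/output embeddings, one column per word

j = 7                          # index of the observed word w_j
h = W[j]                       # projection layer: just the embedding v_j
o = C.T @ h                    # o_k = c_k . v_j for every k in the vocabulary

# softmax turns the scores into probabilities p(w_k | w_j)
p = np.exp(o - o.max()) / np.exp(o - o.max()).sum()
k = 123
print(p[k])                    # note the |V|-term denominator just computed
```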
Skip-gram with negative sampling: Goal in learning

This section offers a brief sketch of how this works. In the training phase, the algorithm walks through the corpus, at each target word choosing the surrounding context words as positive examples, and for each positive example also choosing k samples or negative samples: non-neighbor words. The goal will be to move the embeddings toward the neighbor words and away from the noise words.

For example, in walking through the example text below we come to the word apricot, and let L = 2 so we have 4 context words c1 through c4:

lemon, a [tablespoon of apricot preserves or] jam
              c1     c2    w       c3     c4

• Make the word like the context words. The goal is to learn an embedding whose dot product with each context word is high. In practice skip-gram uses a sigmoid function σ of the dot product, where σ(x) = 1/(1 + e⁻ˣ). So for the above example we want σ(c1 · w) + σ(c2 · w) + σ(c3 · w) + σ(c4 · w) to be high.

• And not like k randomly selected "noise words". In addition, for each context word the algorithm chooses k noise words according to their unigram frequency. If we let k = 2, for each target/context pair we'll have 2 noise words for each of the 4 context words:

[cement metaphysical dear coaxial attendant whence forever puddle]
   n1       n2        n3     n4       n5      n6      n7     n8

• We want this to be low: we'd like these noise words n to have a low dot product with our target embedding w; in other words we want σ(n1 · w) + σ(n2 · w) + … + σ(n8 · w) to be low.
Skip-gram with negative sampling: Loss function

More formally, the learning objective for one word/context pair (w, c) is

$$\log \sigma(c \cdot w) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim p(w)}\left[\log \sigma(-w_i \cdot w)\right]$$
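A sketch putting the two pieces together: sample noise words by unigram frequency, then score one (w, c) pair with the objective above, approximating the expectation with the k sampled noise embeddings. The counts and all vectors are toy stand-ins:

```python
# Sketch: negative sampling plus the SGNS objective for one (w, c) pair.
# Vocabulary counts and embeddings are toy stand-ins.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
vocab = ["cement", "metaphysical", "dear", "coaxial",
         "attendant", "whence", "forever", "puddle"]
counts = np.array([40, 5, 30, 3, 10, 2, 25, 8], dtype=float)
p_unigram = counts / counts.sum()              # noise distribution p(w)

d, k = 50, 2
target_emb = {w: rng.normal(scale=0.1, size=d) for w in vocab + ["apricot"]}
context_emb = {w: rng.normal(scale=0.1, size=d) for w in vocab + ["tablespoon"]}

w = target_emb["apricot"]                      # target word embedding
c = context_emb["tablespoon"]                  # one positive context embedding

# k noise words per (target, context) pair, drawn by unigram frequency
noise_words = rng.choice(vocab, size=k, p=p_unigram)

# log s(c.w) + sum_i log s(-n_i.w): high when w is like c, unlike the noise
objective = np.log(sigmoid(c @ w))
for n in noise_words:
    objective += np.log(sigmoid(-(context_emb[n] @ w)))
print(noise_words, objective)                  # training raises this objective
```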
Properties of embeddings

We'll discuss in Section 17.8 how to evaluate the quality of different embeddings. But it is also sometimes helpful to visualize them. Fig. 17.14 shows the words/phrases that are most similar to some sample words using the phrase-based version of the skip-gram algorithm (Mikolov et al., 2013a).

• Nearest words to some embeddings (Mikolov et al. 2013)

One semantic property of various kinds of embeddings that may play a role in their usefulness is their ability to capture relational meanings. Mikolov et al. (2013b) demonstrates that the offsets between vector embeddings can capture some relations between words, for example that the result of the expression vector('king') − vector('man') + vector('woman') is a vector close to vector('queen').
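A sketch of the offset idea; the embeddings below are random placeholders, so the printed "answer" is meaningless here, but with real trained vectors king − man + woman lands near queen:

```python
# Sketch of the vector-offset ("analogy") idea from Mikolov et al. (2013b).
# Embeddings are random placeholders standing in for trained vectors.
import numpy as np

rng = np.random.default_rng(5)
vocab = ["king", "man", "woman", "queen", "apple", "walked"]
emb = {w: rng.normal(size=50) for w in vocab}
for w in emb:                                  # cosine works best normalized
    emb[w] /= np.linalg.norm(emb[w])

target = emb["king"] - emb["man"] + emb["woman"]
target /= np.linalg.norm(target)

# nearest word by cosine similarity, excluding the input words
candidates = [w for w in vocab if w not in ("king", "man", "woman")]
print(max(candidates, key=lambda w: emb[w] @ target))
```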
Vector Semantics: Brown clustering
Brown clustering

• An agglomerative clustering algorithm that clusters words based on which words precede or follow them
• These word clusters can be turned into a kind of vector
• We'll give a very brief sketch here.
Brown Algorithm

• Each intermediate node is a cluster
• "chairman" is 0010, "months" = 01, and verbs = 1

[Figure: binary merge tree, reading left branches as 0 and right branches as 1. Leaves: CEO = 000, chairman = 0010, president = 0011, November = 010, October = 011, run = 100, sprint = 101, walk = 11]
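A sketch of how these bit strings become usable features: prefixes of a word's path give clusters at several granularities (a common way Brown clusters are fed to parsers), using the codes from the tree above; the feature names are invented for illustration:

```python
# Sketch: turning Brown-cluster bit strings into prefix features, using the
# example codes from the tree above.
brown_code = {"CEO": "000", "chairman": "0010", "president": "0011",
              "November": "010", "October": "011",
              "run": "100", "sprint": "101", "walk": "11"}

def prefix_features(word, lengths=(2, 4)):
    """Cluster-prefix features at a few granularities."""
    code = brown_code[word]
    return {f"brown_{n}": code[:n] for n in lengths if len(code) >= n}

print(prefix_features("chairman"))   # {'brown_2': '00', 'brown_4': '0010'}
print(prefix_features("president"))  # shares the '00' prefix with chairman
```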
Brown cluster examples

Because they are based on immediately neighboring words, Brown clusters are most commonly used for representing the syntactic properties of words, and hence are commonly used as a feature in parsers. Nonetheless, the clusters do represent some semantic properties as well. Fig. 19.17 shows some examples from a large clustering from Brown et al. (1992).

Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
June March July April January December October November September August
pressure temperature permeability density porosity stress velocity viscosity gravity tension
anyone someone anybody somebody
had hadn't hath would've could've should've must've might've
asking telling wondering instructing informing kidding reminding bothering thanking deposing
mother wife father son husband brother daughter sister boss uncle
great big vast sudden mere sheer gigantic lifelong scant colossal
down backwards ashore sideways southward northward overboard aloft downwards adrift

Figure 19.17: Some sample Brown clusters from a 260,741-word vocabulary trained on 366 million words of running text (Brown et al., 1992). Note the mixed syntactic-semantic nature of the clusters.
Class-based language model

The Brown algorithm makes use of the class-based language model (Brown et al., 1992), a model in which each word w ∈ V belongs to a class c ∈ C with a probability P(w|c). A class-based LM assigns a probability to a pair of words w_{i−1} and w_i by modeling the transition between classes rather than between words:

$$P(w_i \mid w_{i-1}) = P(c_i \mid c_{i-1})\, P(w_i \mid c_i)$$

The class-based LM can be used to assign a probability to an entire corpus given a particular clustering C as follows:

$$P(\text{corpus} \mid C) = \prod_{i=1}^{n} P(c_i \mid c_{i-1})\, P(w_i \mid c_i)$$

Class-based language models are generally not used as a language model for applications like machine translation or speech recognition because they don't work as well.
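A sketch of the corpus probability above, assuming each word belongs to exactly one class and that the class-transition and emission tables are given (all numbers are toy values):

```python
# Sketch: scoring a corpus with a class-based LM,
#   P(corpus | C) = prod_i P(c_i | c_{i-1}) * P(w_i | c_i)
# Classes, transitions, and emissions are toy values.
import math

word_class = {"run": "VERB", "sprint": "VERB", "October": "MONTH"}
p_trans = {("<s>", "MONTH"): 0.3, ("MONTH", "VERB"): 0.4, ("VERB", "VERB"): 0.2}
p_emit = {("MONTH", "October"): 0.5, ("VERB", "run"): 0.3, ("VERB", "sprint"): 0.1}

def corpus_logprob(words):
    prev = "<s>"
    logp = 0.0
    for w in words:
        c = word_class[w]                  # each word has exactly one class here
        logp += math.log(p_trans[(prev, c)]) + math.log(p_emit[(c, w)])
        prev = c
    return logp

print(math.exp(corpus_logprob(["October", "run", "sprint"])))
```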