
Notes on word2vec

word2vec is not one algorithm.

word2vec is not deep learning.

First and foremost, it is important to understand that word2vec is not a single monolithic algorithm. In fact, word2vec contains two distinct models (CBOW and skip-gram), each with two different training methods (with/without negative sampling) and other variations (e.g. hierarchical softmax), which amount to a small "space" of algorithms. On top of that, it also contains a honed pre-processing pipeline, whose effects on the overall performance have yet to be studied properly.

Another thing that is important to understand is that word2vec is not deep learning; both CBOW and skip-gram are "shallow" neural models. Tomas Mikolov even said that the whole idea behind word2vec was to demonstrate that you can get better word representations if you trade the model's complexity for efficiency, i.e. the ability to learn from much bigger datasets. In his papers, Mikolov recommends using the skip-gram model with negative sampling, as it outperformed the other variants on analogy tasks.
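
For practical experiments, this recommended variant is available in off-the-shelf libraries. A minimal sketch using gensim (assuming the gensim >= 4.0 API; the toy corpus and hyperparameters are purely illustrative, not the settings Mikolov used):

```python
# Sketch: training skip-gram with negative sampling via gensim (assumed API, gensim >= 4.0).
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "dog", "barks", "at", "the", "fox"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimension N of the hidden layer / word vectors
    window=5,          # context window size
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # number of negative samples per positive pair
    min_count=1,       # keep every word in this toy corpus
)

print(model.wv["fox"][:5])                    # learned vector for "fox"
print(model.wv.most_similar("fox", topn=3))   # nearest neighbours by cosine similarity
```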

Model with one-word context

We start from the simplest version. We assume that only one word is considered per context. The CBOW model will predict one target word given one context word. The skip-gram model will predict one context word given one target word. In the one-word context, the CBOW model and the skip-gram model are actually exactly the same. Figure 1 shows the model under the simplified one-word context.

Model definition

The vocabulary size is V, meaning there are V unique words in the vocabulary.

Each input word can be mapped by a function onehot(v_k) to a V-dimensional one-hot encoded vector x. For example, suppose the input word is v_k (1 ≤ k ≤ V), the k-th word in the vocabulary. Then

$$
x = \begin{bmatrix} x_1 \\ \vdots \\ x_k \\ \vdots \\ x_V \end{bmatrix} = \mathrm{onehot}(v_k) = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix}
$$

where only the k-th element of x is x_k = 1, and all other elements x_{k'} = 0 when k' ≠ k.

The hidden layer can be represented as an N-dimensional vector


$$
h = \begin{bmatrix} h_1 \\ \vdots \\ h_i \\ \vdots \\ h_N \end{bmatrix}
$$

For an element hi in h, we have

$$
h_i = \sum_{k=1}^{V} w_{k,i} \times x_k
$$

Note: we use the subscript w_{k,i} instead of w_{i,k} because it is more convenient for the vectorized representations shown later.

We can show the calculation of h_i in a vectorized representation.

$$
h_i = \begin{bmatrix} w_{1,i} & \cdots & w_{k,i} & \cdots & w_{V,i} \end{bmatrix} \times \begin{bmatrix} x_1 \\ \vdots \\ x_k \\ \vdots \\ x_V \end{bmatrix} = w_i^T \times x
$$

where w_i is also a V-dimensional vector.

$$
w_i = \begin{bmatrix} w_{1,i} \\ \vdots \\ w_{k,i} \\ \vdots \\ w_{V,i} \end{bmatrix}
$$

For 1 ≤ i ≤ N, we can assemble all w_i into a V × N matrix

$$
W = \begin{bmatrix} w_1 & \cdots & w_i & \cdots & w_N \end{bmatrix} = \begin{bmatrix}
w_{1,1} & \cdots & w_{1,i} & \cdots & w_{1,N} \\
\vdots & & \ddots & & \vdots \\
w_{k,1} & \cdots & w_{k,i} & \cdots & w_{k,N} \\
\vdots & & \ddots & & \vdots \\
w_{V,1} & \cdots & w_{V,i} & \cdots & w_{V,N}
\end{bmatrix}
$$

We have
$$
h = W^T \times x = \begin{bmatrix}
w_{1,1} & \cdots & w_{k,1} & \cdots & w_{V,1} \\
\vdots & & \ddots & & \vdots \\
w_{1,i} & \cdots & w_{k,i} & \cdots & w_{V,i} \\
\vdots & & \ddots & & \vdots \\
w_{1,N} & \cdots & w_{k,N} & \cdots & w_{V,N}
\end{bmatrix} \times \begin{bmatrix} 0 \\ \vdots \\ x_k \\ \vdots \\ 0 \end{bmatrix} = \begin{bmatrix} w_{k,1} \\ \vdots \\ w_{k,i} \\ \vdots \\ w_{k,N} \end{bmatrix}
$$

Note: As mentioned before, the input x is a V-dimensional one-hot vector with only x_k = 1.

From the above equation, we can see that h is actually the same as the k-th column of W^T, or equivalently the k-th row of the matrix W.

We define the k-th row of the matrix W as W_{k,*}, and we get

$$
h = \begin{bmatrix} w_{k,1} \\ \vdots \\ w_{k,i} \\ \vdots \\ w_{k,N} \end{bmatrix} = W_{k,*}
$$
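
This "embedding lookup" view is easy to verify numerically. A tiny sketch (the sizes V, N and the index k are arbitrary):

```python
import numpy as np

V, N, k = 6, 3, 2                    # vocabulary size, hidden size, index of the input word
W = np.random.rand(V, N)             # input weight matrix W (V x N)

x = np.zeros(V)                      # one-hot vector for the k-th word
x[k] = 1.0

h = W.T @ x                          # hidden layer h = W^T x
assert np.allclose(h, W[k])          # h equals the k-th row of W
```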

The output y is also a V-dimensional vector

$$
y = \begin{bmatrix} y_1 \\ \vdots \\ y_j \\ \vdots \\ y_V \end{bmatrix}
$$

Each element y_j in y is the predicted probability that the output word is v_j, the j-th word in the vocabulary, given that the input word's one-hot vector representation is x = onehot(v_k).

$$
y_j = p(v_j \mid x = \mathrm{onehot}(v_k)) = p(v_j \mid v_k)
$$


We have $\sum_{j=1}^{V} y_j = 1$, so we can use the softmax activation function to calculate

$$
u_j = \sum_{i=1}^{N} w'_{i,j} \times h_i
$$

$$
y_j = \frac{e^{u_j}}{\sum_{l=1}^{V} e^{u_l}}
$$

We can show the calculation of u_j in a vectorized representation.

$$
u_j = \begin{bmatrix} w'_{1,j} & \cdots & w'_{i,j} & \cdots & w'_{N,j} \end{bmatrix} \times \begin{bmatrix} h_1 \\ \vdots \\ h_i \\ \vdots \\ h_N \end{bmatrix} = (w'_j)^T \times h
$$

where w'_j is also an N-dimensional vector.

$$
w'_j = \begin{bmatrix} w'_{1,j} \\ \vdots \\ w'_{i,j} \\ \vdots \\ w'_{N,j} \end{bmatrix}
$$

For 1 ≤ j ≤ V, we can assemble all w'_j into an N × V matrix

$$
W' = \begin{bmatrix} w'_1 & \cdots & w'_j & \cdots & w'_V \end{bmatrix} = \begin{bmatrix}
w'_{1,1} & \cdots & w'_{1,j} & \cdots & w'_{1,V} \\
\vdots & & \ddots & & \vdots \\
w'_{i,1} & \cdots & w'_{i,j} & \cdots & w'_{i,V} \\
\vdots & & \ddots & & \vdots \\
w'_{N,1} & \cdots & w'_{N,j} & \cdots & w'_{N,V}
\end{bmatrix}
$$

We define a V-dimensional vector

$$
u = \begin{bmatrix} u_1 \\ \vdots \\ u_j \\ \vdots \\ u_V \end{bmatrix}
$$

We have
$$
u = (W')^T \times h = \begin{bmatrix}
w'_{1,1} & \cdots & w'_{i,1} & \cdots & w'_{N,1} \\
\vdots & & \ddots & & \vdots \\
w'_{1,j} & \cdots & w'_{i,j} & \cdots & w'_{N,j} \\
\vdots & & \ddots & & \vdots \\
w'_{1,V} & \cdots & w'_{i,V} & \cdots & w'_{N,V}
\end{bmatrix} \times \begin{bmatrix} h_1 \\ \vdots \\ h_i \\ \vdots \\ h_N \end{bmatrix}
$$

We define w'_j as W'_{*,j}, which is the j-th column of the matrix W'.

$$
u_j = \sum_{i=1}^{N} w'_{i,j} \times h_i = (w'_j)^T \times h = (W'_{*,j})^T \times W_{k,*}
$$

We can get

$$
y_j = p(v_j \mid v_k) = \frac{e^{(W'_{*,j})^T W_{k,*}}}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}
$$
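
Putting the pieces together, the whole forward pass under the one-word context fits in a few lines. A sketch (W, W' are random and the indices k, j are arbitrary, for illustration only):

```python
import numpy as np

V, N = 6, 3
k, j = 2, 4                              # input word v_k, output word v_j
W = np.random.rand(V, N)                 # input-to-hidden weights
W_prime = np.random.rand(N, V)           # hidden-to-output weights

h = W[k]                                 # hidden layer: the k-th row of W
u = W_prime.T @ h                        # scores u_j = (W'_{*,j})^T h
y = np.exp(u) / np.exp(u).sum()          # softmax over the vocabulary

print(y[j])                              # p(v_j | v_k)
assert np.isclose(y.sum(), 1.0)
```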

Parameter fitting

Calculation of Partial Derivatives

{(x^{(m)}, y^{(m)}); m = 1, ..., M} is a training set containing M training examples. (x^{(m)}, y^{(m)}) is the m-th training example in the training set. In each pair (x^{(m)}, y^{(m)}), x^{(m)} is an input and y^{(m)} is the corresponding output.

In the CBOW model, each input x^{(m)} is the context word, and each output y^{(m)} is the target word. In the skip-gram model, each input x^{(m)} is the target word, and each output y^{(m)} is the context word. As shown before, x^{(m)} is a one-hot V-dimensional vector, and y^{(m)} is also a one-hot V-dimensional vector.

Having M training examples means we extract M context/target word pairs from the whole corpus.

When x^{(m)}_k = 1, or equivalently x^{(m)} = onehot(v_k), and y^{(m)}_j = 1, or equivalently y^{(m)} = onehot(v_j), it means that the current input word is v_k, the k-th word in the vocabulary, and the corresponding output word is v_j, the j-th word in the vocabulary.

Note: y^{(m)}_j and y_j are different.

Assuming that the M training examples were generated independently, we can then
write down the likelihood of the parameters as

$$
L(\theta) = P(Y \mid X; \theta) = \prod_{m=1}^{M} p(y^{(m)} \mid x^{(m)}; \theta)
$$

$$
\ell(\theta) = \log L(\theta) = \sum_{m=1}^{M} \log p(y^{(m)} \mid x^{(m)}; \theta)
$$


where θ = {W'_{*,j}, W_{k,*}}.

We know that p(y^{(m)} | x^{(m)}; θ) is defined by the softmax function. When x^{(m)} = onehot(v_k), we get

$$
\log p(y^{(m)} \mid x^{(m)}; \theta) = \log \prod_{l=1}^{V} p(v_l \mid v_k)^{1\{y^{(m)}_l = 1\}} = \sum_{l=1}^{V} 1\{y^{(m)}_l = 1\} \log p(v_l \mid v_k)
$$

Here, we introduce one more very useful piece of notation. An indicator function 1{⋅} takes on a value of 1 if its argument is true, and 0 otherwise (1{True} = 1, 1{False} = 0).

If the current output y^{(m)} corresponds to the j-th word v_j in the vocabulary, then y^{(m)}_j = 1 and 1{y^{(m)}_j = 1} = 1, and for any l ≠ j we get y^{(m)}_l = 0 and 1{y^{(m)}_l = 1} = 0.

$$
\begin{aligned}
\log p(y^{(m)} \mid x^{(m)}; \theta) &= 1\{y^{(m)}_j = 1\} \log p(v_j \mid v_k) + \sum_{l \neq j} 1\{y^{(m)}_l = 1\} \log p(v_l \mid v_k) \\
&= 1 \times \log p(v_j \mid v_k) + \sum_{l \neq j} 0 \times \log p(v_l \mid v_k) \\
&= \log p(v_j \mid v_k) \\
&= \log \frac{e^{(W'_{*,j})^T W_{k,*}}}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \\
&= (W'_{*,j})^T W_{k,*} - \log \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}
\end{aligned}
$$

Note: If a and b are both vectors, then

$$
\frac{\partial a^T b}{\partial a} = \frac{\partial b^T a}{\partial a} = b
$$

So we can get

$$
\begin{aligned}
\frac{\partial \log p(y^{(m)} \mid x^{(m)}; \theta)}{\partial W'_{*,j}}
&= \frac{\partial (W'_{*,j})^T \times W_{k,*}}{\partial W'_{*,j}} - \frac{\partial \log \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W'_{*,j}} \\
&= W_{k,*} - \frac{\partial \log \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W'_{*,j}}
\end{aligned}
$$

We first calculate

$$
\begin{aligned}
\frac{\partial \log \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W'_{*,j}}
&= \frac{1}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times \frac{\partial \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial e^{(W'_{*,j})^T W_{k,*}}} \times \frac{\partial e^{(W'_{*,j})^T W_{k,*}}}{\partial \big((W'_{*,j})^T W_{k,*}\big)} \times \frac{\partial \big((W'_{*,j})^T W_{k,*}\big)}{\partial W'_{*,j}} \\
&= \frac{1}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times 1 \times e^{(W'_{*,j})^T W_{k,*}} \times W_{k,*} \\
&= \frac{e^{(W'_{*,j})^T W_{k,*}}}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times W_{k,*} \\
&= p(v_j \mid v_k) \times W_{k,*}
\end{aligned}
$$

We can get

$$
\begin{aligned}
\frac{\partial \log p(y^{(m)} \mid x^{(m)}; \theta)}{\partial W'_{*,j}}
&= W_{k,*} - p(v_j \mid v_k) \times W_{k,*} \\
&= \big(1 - p(v_j \mid v_k)\big) \times W_{k,*}
\end{aligned}
$$

We can also have

$$
\begin{aligned}
\frac{\partial \log p(y^{(m)} \mid x^{(m)}; \theta)}{\partial W_{k,*}}
&= \frac{\partial (W'_{*,j})^T \times W_{k,*}}{\partial W_{k,*}} - \frac{\partial \log \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W_{k,*}} \\
&= W'_{*,j} - \frac{\partial \log \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W_{k,*}}
\end{aligned}
$$

We calculate

$$
\begin{aligned}
\frac{\partial \log \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W_{k,*}}
&= \frac{1}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times \frac{\partial \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W_{k,*}} \\
&= \frac{1}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times \sum_{l=1}^{V} \frac{\partial e^{(W'_{*,l})^T W_{k,*}}}{\partial W_{k,*}} \\
&= \frac{1}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}} \times W'_{*,l} \\
&= \sum_{l=1}^{V} \Big( \frac{e^{(W'_{*,l})^T W_{k,*}}}{\sum_{l'=1}^{V} e^{(W'_{*,l'})^T W_{k,*}}} \times W'_{*,l} \Big) \\
&= \sum_{l=1}^{V} \big( p(v_l \mid v_k) \times W'_{*,l} \big)
\end{aligned}
$$

We can get

$$
\frac{\partial \log p(y^{(m)} \mid x^{(m)}; \theta)}{\partial W_{k,*}} = W'_{*,j} - \sum_{l=1}^{V} \big( p(v_l \mid v_k) \times W'_{*,l} \big)
$$
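
These two expressions can be sanity-checked numerically against finite differences. A sketch (the dimensions, random seed, and indices k, j are arbitrary):

```python
import numpy as np

V, N, k, j = 5, 3, 1, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))             # input vectors; row k is W_{k,*}
Wp = rng.normal(size=(N, V))            # output vectors; column j is W'_{*,j}

def log_p(W, Wp):
    """log p(v_j | v_k) under the softmax model."""
    u = Wp.T @ W[k]                     # scores u_l for every output word
    return u[j] - np.log(np.exp(u).sum())

p = np.exp(Wp.T @ W[k])
p /= p.sum()                            # softmax probabilities p(v_l | v_k)

grad_Wp_j = (1 - p[j]) * W[k]           # derived gradient w.r.t. W'_{*,j}
grad_Wk = Wp[:, j] - Wp @ p             # derived gradient w.r.t. W_{k,*}

eps = 1e-6
W_pert = W.copy(); W_pert[k, 0] += eps            # perturb one coordinate of W_{k,*}
numeric = (log_p(W_pert, Wp) - log_p(W, Wp)) / eps
assert np.isclose(numeric, grad_Wk[0], atol=1e-4)

Wp_pert = Wp.copy(); Wp_pert[0, j] += eps         # perturb one coordinate of W'_{*,j}
numeric = (log_p(W, Wp_pert) - log_p(W, Wp)) / eps
assert np.isclose(numeric, grad_Wp_j[0], atol=1e-4)
```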

Note:
Because only a linear function is used from the input x to the hidden layer h, and h is just the k-th row of the weight matrix W, we can treat the model as a "dynamic" softmax and use the above method to calculate the partial derivatives, as illustrated in the lectures of Richard Socher and Ali Ghodsi. The original papers of Tomas Mikolov also used this method. In this literature, the word embedding vector of the input word is generally denoted w_in, and the word embedding vector of the output word is denoted w_out. In our notation, w_in is the same as W_{k,*} and w_out is the same as W'_{*,l}.

But we can also treat the above model as a conventional neural network and use the back-propagation method to calculate the partial derivatives. The partial derivative with respect to W'_{*,j} concerns the weights between the hidden layer h and the output layer y. Because W'_{*,j} is the j-th column of W', going through all j from 1 to V for each column vector W'_{*,j} covers every element w'_{i,j} in the matrix W'. For W'_{*,j}, the back-propagation method gives the same result as the "dynamic" softmax calculation shown above. The partial derivative with respect to W_{k,*} concerns the weights between the input layer x and the hidden layer h: h is a linear combination of x based on the weight matrix W, and due to the one-hot nature of x, h is the same as W_{k,*} when x_k = 1, so the partial derivative with respect to W_{k,*} is the same as that with respect to h. For any other W_{k',*} with k' ≠ k, we have ∂h/∂W_{k',*} = 0, so by the chain rule ∂ log p(y^{(m)} | x^{(m)}; θ)/∂W_{k',*} = 0 as well. Even with the back-propagation method, we only need to consider the partial derivative with respect to W_{k,*} and can ignore the other W_{k',*}, which again matches the "dynamic" softmax calculation shown above.

But in practice, the above method for parameter fitting is not practical, because y_j = p(v_j | v_k) is very expensive to compute due to the summation $\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}$ over all V words in the vocabulary (there can be hundreds of thousands of them). One way of making the computation more tractable is to replace the softmax with a hierarchical softmax, but we will not elaborate on this direction. Instead, we will focus on the recommended skip-gram model with negative sampling. Other methods are very similar in terms of mathematical derivation.

Skip-gram model with negative sampling


Original skip-gram model

The target words in the corpus are represented as v_in, and each target word has c context words v_out. In the skip-gram model, the input is x = onehot(v_in) and the output is y = onehot(v_out). We want to maximize the corpus probability

$$
L(\theta) = \prod_{v_{in} \in Corpus} \Big( \prod_{v_{out} \in Context(v_{in})} p(v_{out} \mid v_{in}; \theta) \Big)
$$

Alternatively, we can maximize

$$
L(\theta) = \prod_{(v_{in}, v_{out}) \in D} p(v_{out} \mid v_{in}; \theta)
$$

where D is the set of all target (input) and context (output) word pairs we extract from the corpus.
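
A minimal sketch of how such a pair set D might be extracted from a tokenized corpus (the window size c and the toy corpus are placeholders):

```python
def extract_pairs(corpus, c=2):
    """Collect all (v_in, v_out) target/context pairs D from a tokenized corpus."""
    D = []
    for sentence in corpus:
        for i, v_in in enumerate(sentence):
            lo, hi = max(0, i - c), min(len(sentence), i + c + 1)
            for t in range(lo, hi):
                if t != i:
                    D.append((v_in, sentence[t]))
    return D

corpus = [["the", "dog", "barks", "at", "the", "fox"]]
print(extract_pairs(corpus, c=1))    # [('the', 'dog'), ('dog', 'the'), ('dog', 'barks'), ...]
```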

Negative sampling

As we discussed before, using the softmax to calculate p(v_out | v_in; θ) is impractical. So Mikolov et al. present the negative-sampling approach as a more efficient way of deriving word vectors. The justification of the negative-sampling approach is based on Noise Contrastive Estimation (NCE); please refer to the paper of Gutmann et al.

Consider a pair (v_in, v_out) of target and context. Did this pair come from the training data? Let's denote by p(D = 1 | v_in, v_out) the probability that (v_in, v_out) came from the corpus data. Correspondingly, p(D = 0 | v_in, v_out) = 1 − p(D = 1 | v_in, v_out) will be the probability that (v_in, v_out) did not come from the corpus data. As before, assume there are parameters θ controlling the distribution: p(D = 1 | v_in, v_out; θ).

Our goal is now to find parameters that maximize the probabilities that all of the observations indeed came from the data:

$$
L(\theta) = \prod_{(v_{in}, v_{out}) \in D} p(D = 1 \mid v_{in}, v_{out}; \theta)
$$

This is equivalent to maximizing the log likelihood

$$
\ell(\theta) = \log \prod_{(v_{in}, v_{out}) \in D} p(D = 1 \mid v_{in}, v_{out}; \theta) = \sum_{(v_{in}, v_{out}) \in D} \log p(D = 1 \mid v_{in}, v_{out}; \theta)
$$


Note: For simplicity, we will use w_in and w_out instead of W_{k,*} and W'_{*,l} to represent the word vectors of v_in and v_out respectively.

The sigmoid function can be used to define

$$
p(D = 1 \mid v_{in}, v_{out}; \theta) = \frac{1}{1 + e^{-w_{in}^T w_{out}}}
$$

So

$$
\ell(\theta) = \sum_{(v_{in}, v_{out}) \in D} \log \frac{1}{1 + e^{-w_{in}^T w_{out}}}
$$

This objective has a trivial solution if we set θ such that p(D = 1 | v_in, v_out; θ) = 1 for every pair (v_in, v_out). This can be easily achieved by setting θ such that w_in = w_out and w_in^T w_out = K for all w_in, w_out, where K is a large enough number (practically, we get a probability of 1 as soon as K ≈ 40).

We need a mechanism that prevents all the vectors from having the same value, by disallowing some (v_in, v_out) combinations. One way to do so is to present the model with some (v_in, v_out) pairs for which p(D = 1 | v_in, v_out; θ) must be low, i.e. pairs which are not in the data. This is achieved by generating a set D' of random (v_in, v_out) pairs, assuming they are all incorrect (the name "negative sampling" stems from the set D' of randomly sampled negative examples). The optimization objective now becomes:

$$
\begin{aligned}
L(\theta) &= \prod_{(v_{in}, v_{out}) \in D} p(D = 1 \mid v_{in}, v_{out}; \theta) \prod_{(v_{in}, v_{out}) \in D'} p(D = 0 \mid v_{in}, v_{out}; \theta) \\
&= \prod_{(v_{in}, v_{out}) \in D} p(D = 1 \mid v_{in}, v_{out}; \theta) \prod_{(v_{in}, v_{out}) \in D'} \big(1 - p(D = 1 \mid v_{in}, v_{out}; \theta)\big)
\end{aligned}
$$

We can get

$$
\begin{aligned}
\ell(\theta) &= \sum_{(v_{in}, v_{out}) \in D} \log p(D = 1 \mid v_{in}, v_{out}; \theta) + \sum_{(v_{in}, v_{out}) \in D'} \log \big(1 - p(D = 1 \mid v_{in}, v_{out}; \theta)\big) \\
&= \sum_{(v_{in}, v_{out}) \in D} \log \frac{1}{1 + e^{-w_{in}^T w_{out}}} + \sum_{(v_{in}, v_{out}) \in D'} \log \Big(1 - \frac{1}{1 + e^{-w_{in}^T w_{out}}}\Big) \\
&= \sum_{(v_{in}, v_{out}) \in D} \log \frac{1}{1 + e^{-w_{in}^T w_{out}}} + \sum_{(v_{in}, v_{out}) \in D'} \log \frac{1}{1 + e^{w_{in}^T w_{out}}} \\
&= \sum_{(v_{in}, v_{out}) \in D} \log \sigma(w_{in}^T w_{out}) + \sum_{(v_{in}, v_{out}) \in D'} \log \sigma(-w_{in}^T w_{out})
\end{aligned}
$$

where σ is the sigmoid function.

The above equation is almost the same as the one in the paper of Mikolov et al.

The difference from Mikolov et al. is that here we present the objective for the entire corpus D ∪ D', while they present it for one example (v_in, v_out) ∈ D and n examples (v_in, v_out,j) ∈ D', following a particular way of constructing D'.
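
Before turning to how D' is constructed, here is a sketch of evaluating this corpus-level objective with word vectors stored in two lookup tables (the array names and toy pair sets are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ns_log_likelihood(W_in, W_out, D_pos, D_neg):
    """l(theta) = sum_D log sigma(w_in . w_out) + sum_D' log sigma(-w_in . w_out)."""
    ll = 0.0
    for i, o in D_pos:                               # positive pairs (v_in, v_out) in D
        ll += np.log(sigmoid(W_in[i] @ W_out[o]))
    for i, o in D_neg:                               # negative pairs (v_in, v_out) in D'
        ll += np.log(sigmoid(-(W_in[i] @ W_out[o])))
    return ll

rng = np.random.default_rng(0)
W_in, W_out = rng.normal(size=(10, 4)), rng.normal(size=(10, 4))   # rows are word vectors
print(ns_log_likelihood(W_in, W_out, D_pos=[(0, 1), (0, 2)], D_neg=[(0, 7), (0, 8)]))
```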

Specifically, with negative sampling of n, Mikolov et al.'s constructed D' is n times larger than D, and for each (v_in, v_out) ∈ D we construct n samples (v_in, v_out,1), ..., (v_in, v_out,j), ..., (v_in, v_out,n), where each v_out,j is drawn according to its unigram distribution raised to the 3/4 power.

This is equivalent to drawing the samples (v_in, v_out,j) ∈ D' from the distribution

$$
(v_{in}, v_{out,j}) \sim \frac{p_{uni}(v_{in}) \times \big(p_{uni}(v_{out})\big)^{3/4}}{Z}
$$

where p_uni(v_in) and p_uni(v_out) are the unigram distributions of targets and contexts respectively, and Z is a normalization constant.

$$
p_{uni}(v_{in}) = \frac{count(v_{in})}{|Corpus|}, \qquad p_{uni}(v_{out}) = \frac{count(v_{out})}{|Corpus|}
$$
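
Since v_in is fixed when negatives are generated, only the context side needs to be sampled in practice. A sketch of drawing negative output words from the unigram distribution raised to the 3/4 power (the counts are illustrative):

```python
import numpy as np

counts = np.array([50, 20, 10, 5, 1], dtype=float)   # count(v) for a toy vocabulary
p_uni = counts / counts.sum()                        # unigram distribution
p_noise = p_uni ** 0.75                              # raise to the 3/4 power
p_noise /= p_noise.sum()                             # normalize by Z

rng = np.random.default_rng(0)
neg_samples = rng.choice(len(counts), size=5, p=p_noise)   # indices of sampled v_out,j
print(neg_samples)
```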

Note:
Unlike the original skip-gram model described at the beginning, the formulation in this section on negative sampling does not model p(v_out | v_in), but instead models a quantity related to the joint distribution of v_out and v_in.

If we fix the target representations and learn only the context representations, or fix the context representations and learn only the target representations, the model reduces to logistic regression and is convex. However, in this model the target and context representations are learned jointly, making the model non-convex.

Parameter fitting

We use an indicator function 1{(v_in, v_out) ∈ D} to indicate whether a pair (v_in, v_out) ∈ D.

We can rewrite the above likelihood

$$
\begin{aligned}
L(\theta) &= \prod_{(v_{in}, v_{out}) \in D} p(D = 1 \mid v_{in}, v_{out}; \theta) \times \prod_{(v_{in}, v_{out}) \in D'} \big(1 - p(D = 1 \mid v_{in}, v_{out}; \theta)\big) \\
&= \prod_{(v_{in}, v_{out}) \in D \cup D'} \Big( p(D = 1 \mid v_{in}, v_{out}; \theta)^{1\{(v_{in}, v_{out}) \in D\}} \times \big(1 - p(D = 1 \mid v_{in}, v_{out}; \theta)\big)^{1 - 1\{(v_{in}, v_{out}) \in D\}} \Big)
\end{aligned}
$$

And we can also get

$$
\begin{aligned}
\ell(\theta) &= \sum_{(v_{in}, v_{out}) \in D \cup D'} \log \Big( p(D = 1 \mid v_{in}, v_{out}; \theta)^{1\{(v_{in}, v_{out}) \in D\}} \times \big(1 - p(D = 1 \mid v_{in}, v_{out}; \theta)\big)^{1 - 1\{(v_{in}, v_{out}) \in D\}} \Big) \\
&= \sum_{(v_{in}, v_{out}) \in D \cup D'} \Big( 1\{(v_{in}, v_{out}) \in D\} \log p(D = 1 \mid v_{in}, v_{out}; \theta) + \big(1 - 1\{(v_{in}, v_{out}) \in D\}\big) \log \big(1 - p(D = 1 \mid v_{in}, v_{out}; \theta)\big) \Big) \\
&= \sum_{(v_{in}, v_{out}) \in D \cup D'} \Big( 1\{(v_{in}, v_{out}) \in D\} \log \sigma(w_{in}^T w_{out}) + \big(1 - 1\{(v_{in}, v_{out}) \in D\}\big) \log \big(1 - \sigma(w_{in}^T w_{out})\big) \Big)
\end{aligned}
$$

We calculate the partial derivatives with respect to w_in and w_out.

$$
\begin{aligned}
\frac{\partial \ell(\theta)}{\partial w_{out}} &= \sum_{(v_{in}, v_{out}) \in D \cup D'} \Big( 1\{(v_{in}, v_{out}) \in D\} \times \frac{1}{\sigma(w_{in}^T w_{out})} \times \sigma(w_{in}^T w_{out})\big(1 - \sigma(w_{in}^T w_{out})\big) \times w_{in} \\
&\qquad + \big(1 - 1\{(v_{in}, v_{out}) \in D\}\big) \times \frac{1}{1 - \sigma(w_{in}^T w_{out})} \times (-1) \times \sigma(w_{in}^T w_{out})\big(1 - \sigma(w_{in}^T w_{out})\big) \times w_{in} \Big) \\
&= \sum_{(v_{in}, v_{out}) \in D \cup D'} \big( 1\{(v_{in}, v_{out}) \in D\} - \sigma(w_{in}^T w_{out}) \big) \times w_{in}
\end{aligned}
$$

Because w_in and w_out are completely symmetrical, we can obtain ∂ℓ(θ)/∂w_in by the same calculation as for ∂ℓ(θ)/∂w_out.

$$
\frac{\partial \ell(\theta)}{\partial w_{in}} = \sum_{(v_{in}, v_{out}) \in D \cup D'} \big( 1\{(v_{in}, v_{out}) \in D\} - \sigma(w_{in}^T w_{out}) \big) \times w_{out}
$$
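
These expressions can be checked numerically. A sketch with a toy D ∪ D' (the word indices and labels are illustrative), comparing the derived gradient for one w_out with a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ell(W_in, W_out, labeled_pairs):
    """l(theta) over labeled pairs ((i, o), t), where t = 1{(v_in, v_out) in D}."""
    return sum(t * np.log(sigmoid(W_in[i] @ W_out[o]))
               + (1 - t) * np.log(1 - sigmoid(W_in[i] @ W_out[o]))
               for (i, o), t in labeled_pairs)

rng = np.random.default_rng(0)
W_in, W_out = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
labeled_pairs = [((0, 1), 1), ((2, 1), 1), ((0, 3), 0), ((4, 1), 0)]   # toy D union D'

o = 1                                                # gradient w.r.t. w_out for word 1
grad = sum((t - sigmoid(W_in[i] @ W_out[o])) * W_in[i]
           for (i, oo), t in labeled_pairs if oo == o)

eps = 1e-6
W_pert = W_out.copy(); W_pert[o, 0] += eps           # perturb one coordinate of w_out
numeric = (ell(W_in, W_pert, labeled_pairs) - ell(W_in, W_out, labeled_pairs)) / eps
assert np.isclose(numeric, grad[0], atol=1e-4)
```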

In our calculation above, we first finish sampling and then mix all generated (positive and negative) pairs (v_in, v_out) into one unified training set D ∪ D'.

But in some papers or notes, the calculation of the partial derivatives looks a little different, because they use a different indexing scheme. While scanning the whole corpus, there are c positive output (context) words v_pos for each input (target) word v_in, which means there are c positive pairs (v_in, v_pos) for each scanned v_in. Further, for each positive pair (v_in, v_pos), n negative pairs (v_in, v_neg) are generated. As mentioned before, the number of negative pairs is n times the number of positive pairs.

$$
L(\theta) = \prod_{v_{in} \in Corpus} \Big( \prod_{v_{pos} \in PositiveContext(v_{in})} \Big( p(v_{pos} \mid v_{in}; \theta) \prod_{v_{neg} \in NegativeContext(v_{in}, v_{pos})} p(v_{neg} \mid v_{in}; \theta) \Big) \Big)
$$

Here we only consider one positive pair (v_in, v_pos) for the current input v_in, and the n negative pairs (v_in, v_neg,s) with 1 ≤ s ≤ n corresponding to (v_in, v_pos).

The log-likelihood

$$
\ell_{v_{in}}(\theta) = \log \sigma(w_{in}^T w_{pos}) + \sum_{s=1}^{n} \log \big(1 - \sigma(w_{in}^T w_{neg,s})\big)
$$
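
This per-example objective is what one stochastic training step optimizes in practice. A sketch of such an update, assuming word vectors stored as rows of two arrays and a precomputed noise distribution; the gradients follow the expressions derived above, and all names and hyperparameters are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_step(W_in, W_out, i, pos, p_noise, n=5, lr=0.025, rng=None):
    """One skip-gram negative-sampling update for input word i and positive context pos."""
    rng = rng or np.random.default_rng()
    negs = rng.choice(len(p_noise), size=n, p=p_noise)   # draw n negative context words
    grad_in = np.zeros_like(W_in[i])
    for o, t in [(pos, 1)] + [(o, 0) for o in negs]:
        g = t - sigmoid(W_in[i] @ W_out[o])              # 1{.} - sigma(w_in^T w_out)
        grad_in += g * W_out[o]                          # accumulate the gradient for w_in
        W_out[o] += lr * g * W_in[i]                     # gradient-ascent step on w_out
    W_in[i] += lr * grad_in                              # gradient-ascent step on w_in

rng = np.random.default_rng(0)
W_in, W_out = rng.normal(size=(10, 4)), rng.normal(size=(10, 4))
p_noise = np.full(10, 0.1)                               # placeholder noise distribution
sgns_step(W_in, W_out, i=0, pos=3, p_noise=p_noise, n=5, rng=rng)
```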

Further Reading

Omer Levy and Yoav Goldberg later found out that the skip-gram model with negative sampling is implicitly factorizing a word-context matrix, whose cells are the pointwise mutual information (PMI) of the respective target and context pairs, shifted by a global constant. PMI matrices are commonly used by the traditional approach to representing words (often dubbed "distributional semantics"). What is really striking about this discovery is that word2vec (specifically, the skip-gram model with negative sampling) is doing something very similar to what the NLP community has been doing for about 20 years; it is just doing it really well.
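
The matrix in question can be written down directly from co-occurrence counts. A sketch of the shifted PMI cells that SGNS with n negative samples implicitly factorizes (the counts and n are illustrative; cells with zero counts are simply left at -inf here):

```python
import numpy as np

n = 5                                          # number of negative samples
counts = np.array([[10., 2., 0.],              # co-occurrence counts #(w, c)
                   [ 3., 8., 1.],
                   [ 0., 1., 6.]])

p_wc = counts / counts.sum()                   # joint distribution p(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)          # marginal p(w)
p_c = p_wc.sum(axis=0, keepdims=True)          # marginal p(c)

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))           # PMI(w, c); -inf where #(w, c) = 0
shifted_pmi = pmi - np.log(n)                  # the matrix SGNS implicitly factorizes

print(np.round(shifted_pmi, 2))
```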

References

Omer Levy, https://www.quora.com/How-does-word2vec-work


Yoav Goldberg and Omer Levy, word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method

Richard Socher, CS224D Lecture 2 - 31st Mar 2016, https://www.youtube.com/watch?v=xhHOL3TNyJs

Ali Ghodsi, Lec [3,2]: Deep Learning, Word2vec, https://www.youtube.com/watch?v=nuirUEmbaJU

Xin Rong, word2vec Parameter Learning Explained, arXiv:1411.2738 (2014)

Omer Levy and Yoav Goldberg, Neural Word Embedding as Implicit Matrix Factorization, NIPS 2014

http://web.stanford.edu/class/cs224n/lecture_notes/cs224n-2017-notes1.pdf

Gutmann, Michael, and Aapo Hyvärinen. "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models." International Conference on Artificial Intelligence and Statistics. 2010.
