Deep Learning
Lecture 08
Word Vector Representation using Deep Learning
Sec. 9.2.2
But:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
These two vectors are orthogonal.
There is no natural notion of similarity for one-hot vectors!
Solution:
• learn to encode similarity in the vectors themselves
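A minimal NumPy sketch (not from the slides) of why one-hot vectors carry no similarity signal, and how dense vectors can; the dense values below are made up purely for illustration:

```python
import numpy as np

# One-hot vectors for "motel" and "hotel" (toy vocabulary of size 15).
motel = np.zeros(15); motel[10] = 1.0
hotel = np.zeros(15); hotel[7] = 1.0

# Their dot product (and hence cosine similarity) is exactly 0:
# every pair of distinct one-hot vectors is orthogonal.
print(np.dot(motel, hotel))   # 0.0

# Dense, learned vectors can encode similarity: related words end up
# with a high cosine similarity. (Values are invented for illustration.)
motel_dense = np.array([0.29, 0.79, -0.18, -0.11])
hotel_dense = np.array([0.31, 0.77, -0.20, -0.09])
cos = motel_dense @ hotel_dense / (np.linalg.norm(motel_dense) * np.linalg.norm(hotel_dense))
print(round(cos, 3))          # close to 1.0
```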
Representing words by their context
Example dense word vectors:
banking = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]
expect = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271, 0.487]
Word2vec: Overview
Word2vec (Mikolov et al. 2013) is a framework for learning
word vectors
Idea:
• We have a large corpus of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word
c and context (“outside”) words o
• Use the similarity of the word vectors for c and o to calculate
the probability of o given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability
Word2Vec Overview
• Example windows and process for computing $P(w_{t+j} \mid w_t)$:
$P(w_{t-2} \mid w_t)$, $P(w_{t-1} \mid w_t)$, $P(w_{t+1} \mid w_t)$, $P(w_{t+2} \mid w_t)$
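As a rough sketch of the "go through each position $t$" step above, the hypothetical helper below enumerates (center, outside) training pairs for a window of size 2; the function name and toy corpus are illustrative, not from the lecture code:

```python
def make_training_pairs(tokens, window=2):
    """Yield (center, outside) word pairs, skip-gram style.

    For each position t, the center word is paired with every word
    within `window` positions to its left and right.
    """
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            pairs.append((center, tokens[t + j]))
    return pairs

corpus = "problems turning into banking crises as".split()
print(make_training_pairs(corpus, window=2)[:4])
# [('problems', 'turning'), ('problems', 'into'), ('turning', 'problems'), ('turning', 'into')]
```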
Word2vec: objective function
For each position $t = 1, \ldots, T$, predict context words within a window of fixed size $m$, given the center word $w_t$:

$$\text{Likelihood} = L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t;\, \theta)$$

Here $\theta$ is all the variables to be optimized. The objective function $J(\theta)$, sometimes called the cost or loss function, is the (average) negative log-likelihood: $J(\theta) = -\frac{1}{T} \log L(\theta)$.

For a center word $c$ and an outside word $o$, the probability is given by the softmax:

$$P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}$$
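A small NumPy sketch of this probability, assuming a matrix U of "outside" vectors and a matrix W of "center" vectors with one row per vocabulary word; the names and sizes are assumptions, not taken from the lecture code:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 100                 # vocabulary size, embedding dimension
U = rng.normal(size=(V, d))      # "outside" vectors u_w, one row per word
W = rng.normal(size=(V, d))      # "center" vectors v_w, one row per word

def p_outside_given_center(o, c):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ W[c]                        # dot product of v_c with every u_w
    scores -= scores.max()                   # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

# Negative log-likelihood of a single (center, outside) pair:
nll = -np.log(p_outside_given_center(o=42, c=7))
```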
Word2Vec Overview with Vectors
• Example windows and process for computing $P(w_{t+j} \mid w_t)$
• $P(u_{\text{problems}} \mid v_{\text{into}})$ is short for $P(\text{problems} \mid \text{into};\; u_{\text{problems}}, v_{\text{into}}, \theta)$
Word2vec: prediction function
$$P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}$$

① The dot product compares the similarity of $o$ and $c$: $u^{\top} v = u \cdot v = \sum_{i=1}^{n} u_i v_i$. A larger dot product means a larger probability.
② Exponentiation makes anything positive.
③ Normalize over the entire vocabulary to give a probability distribution.
Notation for the network diagram:
• $V$: size of the vocabulary
• $N$: embedding dimension
• $x$: one-hot vector encoding of the input word
• $h = W_{N \times V}\, x$: embedding (hidden) vector of the input word
• $y = \mathrm{softmax}(W'_{V \times N}\, h)$: output probability distribution over the vocabulary
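A minimal forward pass following this notation, as a sketch only (the matrix shapes match the diagram; the initialization and variable names are assumptions):

```python
import numpy as np

V, N = 10_000, 300                       # vocabulary size, embedding dimension
W_in = np.random.randn(N, V) * 0.01      # W_{NxV}: input (embedding) matrix
W_out = np.random.randn(V, N) * 0.01     # W'_{VxN}: output matrix

def forward(word_index):
    x = np.zeros(V); x[word_index] = 1.0  # one-hot input
    h = W_in @ x                          # h = W_{NxV} x  (selects one column of W_in)
    scores = W_out @ h                    # unnormalized scores over the vocabulary
    scores -= scores.max()                # numerical stability
    y = np.exp(scores) / np.exp(scores).sum()   # y = softmax(W'_{VxN} h)
    return y

probs = forward(123)                      # distribution over context words
```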
Word2vec: Implementation using a neural network
(Continuous Bag-of-Words model, CBOW)
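A rough sketch of the CBOW forward pass, in which the context word embeddings are averaged and used to predict the center word; the matrix names and sizes are assumptions:

```python
import numpy as np

V, N = 10_000, 300
W_in = np.random.randn(N, V) * 0.01       # embedding matrix
W_out = np.random.randn(V, N) * 0.01      # output matrix

def cbow_forward(context_indices):
    """CBOW: average the context word embeddings, then predict the center word."""
    h = W_in[:, context_indices].mean(axis=1)     # average of the context embeddings
    scores = W_out @ h
    scores -= scores.max()
    return np.exp(scores) / np.exp(scores).sum()  # distribution over center words

probs = cbow_forward([5, 17, 250, 4071])          # four context words around the center
```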
*See word2vec.ipynb and Gensim_tutorial.ipynb on the LMS
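A minimal Gensim usage example along these lines; the toy corpus, parameter values, and gensim ≥ 4.0 argument names are assumptions, not taken from the notebooks:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [
    ["problems", "turning", "into", "banking", "crises"],
    ["we", "expect", "further", "banking", "regulation"],
]

# sg=1 selects the skip-gram model; sg=0 would be CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["banking"][:5])            # learned 50-dim vector (first 5 entries)
print(model.wv.most_similar("banking"))   # nearest words by cosine similarity
```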
Improving training time – Negative Sampling
• Computing the probability of a context word requires a summation over the entire vocabulary $V$, which is computationally expensive.
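Negative sampling replaces the full softmax with a binary objective: score the true (center, outside) pair high and $k$ randomly sampled "negative" words low. A rough NumPy sketch with illustrative names, using a uniform negative sampler (word2vec itself samples from a smoothed unigram distribution):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(U, W, o, c, neg_indices):
    """Loss for one (center, outside) pair with k sampled negative words.

    Instead of normalizing over the whole vocabulary, the model is asked to
    score the true pair high and the k negatives low:
        -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)
    """
    v_c = W[c]
    pos = np.log(sigmoid(U[o] @ v_c))
    neg = np.log(sigmoid(-U[neg_indices] @ v_c)).sum()
    return -(pos + neg)

rng = np.random.default_rng(0)
V, d = 1000, 100
U, W = rng.normal(size=(V, d)), rng.normal(size=(V, d))
k_negatives = rng.integers(0, V, size=5)   # uniform here; unigram^(3/4) in practice
loss = negative_sampling_loss(U, W, o=42, c=7, neg_indices=k_negatives)
```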