
CS490 Advanced Topics in Computing -
Deep Learning

Lecture 08
Word Vector Representation using Deep Learning

Slides taken from http://web.stanford.edu/class/cs224n/
Also go through the tutorial http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-g
Representing words as discrete symbols

The first task in any NLP system is the representation of words.

Words can be represented by one-hot vectors (a single 1, the rest 0s):

house = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
motel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

Vector dimension = number of words in vocabulary (e.g., 500,000)
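As a small illustration (not from the slides; the toy vocabulary is made up), a sketch of one-hot vectors and why they carry no similarity information:

import numpy as np

# Toy vocabulary for illustration; a real vocabulary has ~500,000 entries.
vocab = ["house", "motel", "hotel", "banking", "crises"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

print(one_hot("house"))                      # [1. 0. 0. 0. 0.]
print(one_hot("motel"))                      # [0. 1. 0. 0. 0.]
print(one_hot("motel") @ one_hot("hotel"))   # 0.0 - orthogonal, no notion of similarity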


Problem with words as discrete symbols

Example: in web search, if a user searches for “Islamabad motel”, we
would like to match documents containing “Islamabad hotel”.

But:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
These two vectors are orthogonal.
There is no natural notion of similarity for one-hot vectors!

Solution:
• Learn to encode similarity in the vectors themselves
Representing words by their context

• Distributional semantics: A word’s meaning is given
  by the words that frequently appear close-by
• “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
• One of the most successful ideas of modern statistical NLP!

• When a word w appears in a text, its context is the set of words
  that appear nearby (within a fixed-size window).
• Use the many contexts of w to build up a representation of w
  (see the sketch after the examples below)

…government debt problems turning into banking crises as happened in 2009…
…saying that Europe needs unified banking regulation to replace the hodgepodge…
…India has just given its banking system a shot in the arm…

These context words will represent banking
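A minimal sketch of collecting fixed-size-window contexts, assuming a toy three-sentence corpus modelled on the examples above (not code from the course):

from collections import Counter

# Toy corpus standing in for a large text collection.
corpus = [
    "government debt problems turning into banking crises as happened in 2009",
    "saying that europe needs unified banking regulation to replace the hodgepodge",
    "india has just given its banking system a shot in the arm",
]

def context_counts(target, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for line in corpus:
        tokens = line.split()
        for t, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
                counts.update(tokens[lo:t] + tokens[t + 1:hi])
    return counts

print(context_counts("banking").most_common(5))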


Word vectors

We will build a dense vector for each word, chosen so that it is
similar to vectors of words that appear in similar contexts.

banking = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]

Note: word vectors are sometimes called word embeddings or
word representations. They are a distributed representation.
Word meaning as a neural word vector – visualization

expect = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271, 0.487]
Word2vec: Overview
Word2vec (Mikolov et al. 2013) is a framework for learning
word vectors.

Idea:
• We have a large corpus of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word
  c and context (“outside”) words o
• Use the similarity of the word vectors for c and o to calculate
  the probability of o given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability
  (a sketch of generating the training pairs follows below)
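A small sketch (an illustration, not the course's code) of enumerating the (center, outside) pairs that word2vec trains on, using a window of size 2 as in the example on the next slide:

def skipgram_pairs(tokens, window=2):
    """Yield (center, outside) pairs for every position t in the text."""
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield center, tokens[t + j]

tokens = "problems turning into banking crises as".split()
for center, outside in skipgram_pairs(tokens):
    if center == "into":
        print(center, "->", outside)
# into -> problems, into -> turning, into -> banking, into -> crises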
Word2Vec Overview
• Example windows and process for computing P(w_{t+j} | w_t)

  P(w_{t−2} | w_t)   P(w_{t−1} | w_t)   P(w_{t+1} | w_t)   P(w_{t+2} | w_t)

  … problems turning into banking crises as …

  outside context words     center word     outside context words
  in window of size 2       at position t   in window of size 2
Word2vec: objective function
For each position t = 1, …, T, predict context words within a
window of fixed size m, given center word w_t.

Likelihood = L(θ) = ∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t ; θ)

θ is all the variables to be optimized.

The objective function J(θ) (sometimes called the cost or loss function)
is the (average) negative log likelihood:

J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t ; θ)

Minimizing the objective function ⟺ Maximizing predictive accuracy
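A small numerical sketch of this objective (not from the slides): the function name and the placeholder uniform probability model are made up for illustration, and the model of P(o | c) is passed in as a callable.

import math

def neg_log_likelihood(tokens, prob, window=2):
    """Average negative log likelihood J(theta) over all (center, outside) pairs.

    prob(outside, center) is any model of P(o | c); it is passed in as a
    callable so the sketch stays independent of the parameterization.
    """
    total, T = 0.0, len(tokens)
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                total += math.log(prob(tokens[t + j], center))
    return -total / T

tokens = "problems turning into banking crises as".split()
print(neg_log_likelihood(tokens, prob=lambda o, c: 1.0 / 6))  # uniform baseline model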
Word2vec: objective function

• We want to minimize the objective function:

  J(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t ; θ)

• Question: How to calculate P(w_{t+j} | w_t ; θ)?

• Answer: We will use two vectors per word w:
  • v_w when w is a center word
  • u_w when w is a context word

• Then for a center word c and a context word o:

  P(o | c) = exp(u_o^T v_c) / ∑_{w ∈ V} exp(u_w^T v_c)
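A minimal sketch of this probability with separate center (v) and context (u) vectors; the toy vocabulary, embedding dimension, and random initialization are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
vocab = ["problems", "turning", "into", "banking", "crises", "as"]
idx = {w: i for i, w in enumerate(vocab)}
N = 8                                           # embedding dimension (toy value)

V_center = rng.normal(scale=0.1, size=(len(vocab), N))    # v_w: center-word vectors
U_context = rng.normal(scale=0.1, size=(len(vocab), N))   # u_w: context-word vectors

def prob_o_given_c(o, c):
    """P(o | c) = exp(u_o . v_c) / sum over w in V of exp(u_w . v_c)."""
    scores = U_context @ V_center[idx[c]]       # dot products u_w . v_c for every w
    exp_scores = np.exp(scores - scores.max())  # exponentiate (shifted for stability)
    return exp_scores[idx[o]] / exp_scores.sum()

print(prob_o_given_c("banking", "into"))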
Word2Vec Overview with Vectors
• Example windows and process for computing P(w_{t+j} | w_t)
• P(u_problems | v_into) is short for P(problems | into ; u_problems, v_into, θ)

  P(u_problems | v_into)   P(u_turning | v_into)   P(u_banking | v_into)   P(u_crises | v_into)

  … problems turning into banking crises as …

  outside context words     center word     outside context words
  in window of size 2       at position t   in window of size 2
Word2vec: prediction function

  P(o | c) = exp(u_o^T v_c) / ∑_{w ∈ V} exp(u_w^T v_c)

① The dot product compares the similarity of o and c:
   u^T v = u · v = ∑_{i=1}^{n} u_i v_i. Larger dot product = larger probability.
② Exponentiation makes anything positive.
③ Normalizing over the entire vocabulary gives a probability distribution.

• This is an example of the softmax function ℝ^n → (0,1)^n
  (see the sketch below):

  softmax(x_i) = exp(x_i) / ∑_{j=1}^{n} exp(x_j) = p_i

• The softmax function maps arbitrary values x_i to a probability
  distribution p_i
• “max” because it amplifies the probability of the largest x_i
• “soft” because it still assigns some probability to smaller x_i
• Frequently used in Deep Learning
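A short sketch of the softmax itself (generic, not specific to word2vec); subtracting the maximum is a standard numerical-stability trick:

import numpy as np

def softmax(x):
    """Map arbitrary real values in R^n to a probability distribution in (0,1)^n."""
    e = np.exp(x - np.max(x))   # subtracting the max changes nothing but avoids overflow
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())               # the largest input gets the largest probability; sums to 1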
Word2vec: Implementation using a neural network
(Skip-Gram model)

V: size of vocabulary
N: embedding dimension
x: one-hot vector encoding of the input word
h: embedding word vector, h = W_{N×V} x
y = softmax(W′_{V×N} h)

Columns of W_{N×V} are the desired embeddings
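A minimal numpy sketch of this forward pass; the matrix shapes follow the slide, while the vocabulary size, embedding dimension, and random initialization are toy assumptions:

import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                                   # vocabulary size and embedding dim (toy values)
W = rng.normal(scale=0.1, size=(N, V))         # W_{N×V}: input-to-hidden weights
W_prime = rng.normal(scale=0.1, size=(V, N))   # W′_{V×N}: hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.zeros(V)
x[3] = 1.0                    # one-hot encoding of the input (center) word
h = W @ x                     # hidden layer = that word's embedding (a column of W)
y = softmax(W_prime @ h)      # predicted distribution over context words
print(h.shape, y.shape, round(y.sum(), 6))   # (4,) (10,) 1.0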
Word2vec: Implementation using a neural network
(Continuous Bag of Words (CBOW) model)

CBOW aims to predict a center word from the
surrounding context words, while skip-gram
does the opposite and predicts the
probability of context words from a center
word.
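A corresponding sketch of a CBOW forward pass, assuming the common choice of averaging the context words' embeddings for the hidden layer (an assumption; the slide does not spell this out):

import numpy as np

rng = np.random.default_rng(1)
V, N = 10, 4
W = rng.normal(scale=0.1, size=(N, V))         # context-word embeddings (columns)
W_prime = rng.normal(scale=0.1, size=(V, N))   # output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

context_ids = [1, 2, 4, 5]              # indices of the surrounding context words
h = W[:, context_ids].mean(axis=1)      # hidden layer = average of the context embeddings
y = softmax(W_prime @ h)                # predicted distribution over the center word
print(y.argmax())                       # index of the most likely center word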
Word2vec: Implementation using a neural network

*See word2vec.ipynb and Gensim_tutorial.ipynb on the LMS
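For reference, a minimal Gensim example (assuming gensim >= 4.0; this is a sketch, not the content of the notebooks referenced above, and the toy sentences are made up):

from gensim.models import Word2Vec

sentences = [
    ["government", "debt", "problems", "turning", "into", "banking", "crises"],
    ["india", "has", "given", "its", "banking", "system", "a", "shot", "in", "the", "arm"],
]

# sg=1 selects the skip-gram model; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["banking"][:5])           # the learned 50-dimensional vector (first 5 entries)
print(model.wv.most_similar("banking"))  # nearest words by cosine similarity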
Improving training time – Negative Sampling
• Computing the probability of a context word requires a
  summation over the whole vocabulary V, which is computationally expensive.

• Any update we make, or any evaluation of the objective function,
  takes O(|V|) time.

• Negative sampling addresses this by having each training
  sample modify only a small percentage of the weights, rather
  than all of them.
• It does that by randomly selecting a small number of
  negative examples for each training word pair (see the sketch below).
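A rough sketch of the idea (a simplified illustration with uniform negative sampling, not the exact word2vec formulation, which samples from a smoothed unigram distribution): each (center, context) update touches only k + 1 rows of the context matrix instead of all |V|.

import numpy as np

rng = np.random.default_rng(0)
V, N, k, lr = 10_000, 100, 5, 0.025          # vocab size, embedding dim, negatives, learning rate
U = rng.normal(scale=0.01, size=(V, N))      # context-word vectors
W = rng.normal(scale=0.01, size=(V, N))      # center-word vectors

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_update(center_id, context_id):
    """One stochastic update for a (center, context) pair with k negative samples."""
    neg_ids = rng.integers(0, V, size=k)     # simplification: uniform negative sampling
    v_c = W[center_id].copy()
    grad_c = np.zeros(N)
    for o, label in [(context_id, 1.0)] + [(n, 0.0) for n in neg_ids]:
        g = sigmoid(U[o] @ v_c) - label      # logistic-loss gradient w.r.t. the score
        grad_c += g * U[o]
        U[o] -= lr * g * v_c                 # only k + 1 rows of U are touched...
    W[center_id] -= lr * grad_c              # ...and a single row of W

sgns_update(center_id=42, context_id=7)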
Thank you
