
CS490 Advanced Topics in Computing -
Deep Learning

Lecture 08
Word Vector Representation using Deep Learning

Slides taken from http://web.stanford.edu/class/cs224n/
Also go through the tutorial http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-g
Representing words as discrete symbols

The first task in any NLP system is the representation of words.

Words can be represented by one-hot vectors (a single 1, the rest 0s):

house = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
motel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

Vector dimension = number of words in vocabulary (e.g., 500,000)
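As a small illustration (not from the slides; the toy vocabulary is made up), a sketch of one-hot vectors and why they carry no similarity information:

import numpy as np

# Toy vocabulary for illustration; a real vocabulary has ~500,000 entries.
vocab = ["house", "motel", "hotel", "banking", "crises"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

print(one_hot("house"))                      # [1. 0. 0. 0. 0.]
print(one_hot("motel"))                      # [0. 1. 0. 0. 0.]
print(one_hot("motel") @ one_hot("hotel"))   # 0.0 - orthogonal, no notion of similarity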


Problem with words as discrete symbols

Example: in web search, if a user searches for “Islamabad motel”, we
would like to match documents containing “Islamabad hotel”.

But:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
These two vectors are orthogonal.
There is no natural notion of similarity for one-hot vectors!

Solution:
• Learn to encode similarity in the vectors themselves
Representing words by their context

• Distributional semantics: A word’s meaning is given
  by the words that frequently appear close-by
• “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
• One of the most successful ideas of modern statistical NLP!

• When a word w appears in a text, its context is the set of words
  that appear nearby (within a fixed-size window).
• Use the many contexts of w to build up a representation of w
  (see the sketch after the examples below)

…government debt problems turning into banking crises as happened in 2009…
…saying that Europe needs unified banking regulation to replace the hodgepodge…
…India has just given its banking system a shot in the arm…

These context words will represent banking
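A minimal sketch of collecting fixed-size-window contexts, assuming a toy three-sentence corpus modelled on the examples above (not code from the course):

from collections import Counter

# Toy corpus standing in for a large text collection.
corpus = [
    "government debt problems turning into banking crises as happened in 2009",
    "saying that europe needs unified banking regulation to replace the hodgepodge",
    "india has just given its banking system a shot in the arm",
]

def context_counts(target, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for line in corpus:
        tokens = line.split()
        for t, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
                counts.update(tokens[lo:t] + tokens[t + 1:hi])
    return counts

print(context_counts("banking").most_common(5))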


Word vectors

We will build a dense vector for each word, chosen so that it is
similar to vectors of words that appear in similar contexts.

banking = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]

Note: word vectors are sometimes called word embeddings or
word representations. They are a distributed representation.
Word meaning as a neural word vector – visualization

expect = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271, 0.487]
Word2vec: Overview
Word2vec (Mikolov et al. 2013) is a framework for learning
word vectors.

Idea:
• We have a large corpus of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word
  c and context (“outside”) words o
• Use the similarity of the word vectors for c and o to calculate
  the probability of o given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability
  (a sketch of generating the training pairs follows below)
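A small sketch (an illustration, not the course's code) of enumerating the (center, outside) pairs that word2vec trains on, using a window of size 2 as in the example on the next slide:

def skipgram_pairs(tokens, window=2):
    """Yield (center, outside) pairs for every position t in the text."""
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield center, tokens[t + j]

tokens = "problems turning into banking crises as".split()
for center, outside in skipgram_pairs(tokens):
    if center == "into":
        print(center, "->", outside)
# into -> problems, into -> turning, into -> banking, into -> crises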
Word2Vec Overview
• Example windows and process for computing P(w_{t+j} | w_t)

  P(w_{t−2} | w_t)   P(w_{t−1} | w_t)   P(w_{t+1} | w_t)   P(w_{t+2} | w_t)

  … problems turning into banking crises as …

  outside context words     center word     outside context words
  in window of size 2       at position t   in window of size 2
Word2vec: objective function
For each position t = 1, …, T, predict context words within a
window of fixed size m, given center word w_t.

Likelihood = L(θ) = ∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t ; θ)

θ is all the variables to be optimized.

The objective function J(θ) (sometimes called the cost or loss function)
is the (average) negative log likelihood:

J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t ; θ)

Minimizing the objective function ⟺ Maximizing predictive accuracy
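A small numerical sketch of this objective (not from the slides): the function name and the placeholder uniform probability model are made up for illustration, and the model of P(o | c) is passed in as a callable.

import math

def neg_log_likelihood(tokens, prob, window=2):
    """Average negative log likelihood J(theta) over all (center, outside) pairs.

    prob(outside, center) is any model of P(o | c); it is passed in as a
    callable so the sketch stays independent of the parameterization.
    """
    total, T = 0.0, len(tokens)
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                total += math.log(prob(tokens[t + j], center))
    return -total / T

tokens = "problems turning into banking crises as".split()
print(neg_log_likelihood(tokens, prob=lambda o, c: 1.0 / 6))  # uniform baseline model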
Word2vec: objective function

• We want to minimize the objective function:

  J(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t ; θ)

• Question: How to calculate P(w_{t+j} | w_t ; θ)?

• Answer: We will use two vectors per word w:
  • v_w when w is a center word
  • u_w when w is a context word

• Then for a center word c and a context word o:

  P(o | c) = exp(u_o^T v_c) / ∑_{w ∈ V} exp(u_w^T v_c)
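A minimal sketch of this probability with separate center (v) and context (u) vectors; the toy vocabulary, embedding dimension, and random initialization are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
vocab = ["problems", "turning", "into", "banking", "crises", "as"]
idx = {w: i for i, w in enumerate(vocab)}
N = 8                                           # embedding dimension (toy value)

V_center = rng.normal(scale=0.1, size=(len(vocab), N))    # v_w: center-word vectors
U_context = rng.normal(scale=0.1, size=(len(vocab), N))   # u_w: context-word vectors

def prob_o_given_c(o, c):
    """P(o | c) = exp(u_o . v_c) / sum over w in V of exp(u_w . v_c)."""
    scores = U_context @ V_center[idx[c]]       # dot products u_w . v_c for every w
    exp_scores = np.exp(scores - scores.max())  # exponentiate (shifted for stability)
    return exp_scores[idx[o]] / exp_scores.sum()

print(prob_o_given_c("banking", "into"))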
Word2Vec Overview with Vectors
• Example windows and process for computing P(w_{t+j} | w_t)
• P(u_problems | v_into) is short for P(problems | into ; u_problems, v_into, θ)

  P(u_problems | v_into)   P(u_turning | v_into)   P(u_banking | v_into)   P(u_crises | v_into)

  … problems turning into banking crises as …

  outside context words     center word     outside context words
  in window of size 2       at position t   in window of size 2
Word2vec: prediction function

  P(o | c) = exp(u_o^T v_c) / ∑_{w ∈ V} exp(u_w^T v_c)

① The dot product compares the similarity of o and c:
   u^T v = u · v = ∑_{i=1}^{n} u_i v_i. Larger dot product = larger probability.
② Exponentiation makes anything positive.
③ Normalizing over the entire vocabulary gives a probability distribution.

• This is an example of the softmax function ℝ^n → (0,1)^n
  (see the sketch below):

  softmax(x_i) = exp(x_i) / ∑_{j=1}^{n} exp(x_j) = p_i

• The softmax function maps arbitrary values x_i to a probability
  distribution p_i
• “max” because it amplifies the probability of the largest x_i
• “soft” because it still assigns some probability to smaller x_i
• Frequently used in Deep Learning
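A short sketch of the softmax itself (generic, not specific to word2vec); subtracting the maximum is a standard numerical-stability trick:

import numpy as np

def softmax(x):
    """Map arbitrary real values in R^n to a probability distribution in (0,1)^n."""
    e = np.exp(x - np.max(x))   # subtracting the max changes nothing but avoids overflow
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())               # the largest input gets the largest probability; sums to 1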
Word2vec: Implementation using a neural network
(Skip-Gram model)

V: size of vocabulary
N: embedding dimension
x: one-hot vector encoding of the input word
h: embedding word vector, h = W_{N×V} x
y = softmax(W′_{V×N} h)

Columns of W_{N×V} are the desired embeddings
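A minimal numpy sketch of this forward pass; the matrix shapes follow the slide, while the vocabulary size, embedding dimension, and random initialization are toy assumptions:

import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                                   # vocabulary size and embedding dim (toy values)
W = rng.normal(scale=0.1, size=(N, V))         # W_{N×V}: input-to-hidden weights
W_prime = rng.normal(scale=0.1, size=(V, N))   # W′_{V×N}: hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.zeros(V)
x[3] = 1.0                    # one-hot encoding of the input (center) word
h = W @ x                     # hidden layer = that word's embedding (a column of W)
y = softmax(W_prime @ h)      # predicted distribution over context words
print(h.shape, y.shape, round(y.sum(), 6))   # (4,) (10,) 1.0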
Word2vec: Implementation using a neural network
(Continuous Bag of Words (CBOW) model)

CBOW aims to predict a center word from the
surrounding context words, while skip-gram
does the opposite and predicts the
probability of context words from a center
word.
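A corresponding sketch of a CBOW forward pass, assuming the common choice of averaging the context words' embeddings for the hidden layer (an assumption; the slide does not spell this out):

import numpy as np

rng = np.random.default_rng(1)
V, N = 10, 4
W = rng.normal(scale=0.1, size=(N, V))         # context-word embeddings (columns)
W_prime = rng.normal(scale=0.1, size=(V, N))   # output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

context_ids = [1, 2, 4, 5]              # indices of the surrounding context words
h = W[:, context_ids].mean(axis=1)      # hidden layer = average of the context embeddings
y = softmax(W_prime @ h)                # predicted distribution over the center word
print(y.argmax())                       # index of the most likely center word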
Word2vec: Implementation using a neural network

*See word2vec.ipynb and Gensim_tutorial.ipynb on the LMS
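For reference, a minimal Gensim example (assuming gensim >= 4.0; this is a sketch, not the content of the notebooks referenced above, and the toy sentences are made up):

from gensim.models import Word2Vec

sentences = [
    ["government", "debt", "problems", "turning", "into", "banking", "crises"],
    ["india", "has", "given", "its", "banking", "system", "a", "shot", "in", "the", "arm"],
]

# sg=1 selects the skip-gram model; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["banking"][:5])           # the learned 50-dimensional vector (first 5 entries)
print(model.wv.most_similar("banking"))  # nearest words by cosine similarity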
Improving training time – Negative Sampling
• Computing the probability of a context word requires a
  summation over the whole vocabulary V, which is computationally expensive.

• Any update we make, or any evaluation of the objective function,
  takes O(|V|) time.

• Negative sampling addresses this by having each training
  sample modify only a small percentage of the weights, rather
  than all of them.
• It does that by randomly selecting a small number of
  negative examples for each training word pair (see the sketch below).
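A rough sketch of the idea (a simplified illustration with uniform negative sampling, not the exact word2vec formulation, which samples from a smoothed unigram distribution): each (center, context) update touches only k + 1 rows of the context matrix instead of all |V|.

import numpy as np

rng = np.random.default_rng(0)
V, N, k, lr = 10_000, 100, 5, 0.025          # vocab size, embedding dim, negatives, learning rate
U = rng.normal(scale=0.01, size=(V, N))      # context-word vectors
W = rng.normal(scale=0.01, size=(V, N))      # center-word vectors

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_update(center_id, context_id):
    """One stochastic update for a (center, context) pair with k negative samples."""
    neg_ids = rng.integers(0, V, size=k)     # simplification: uniform negative sampling
    v_c = W[center_id].copy()
    grad_c = np.zeros(N)
    for o, label in [(context_id, 1.0)] + [(n, 0.0) for n in neg_ids]:
        g = sigmoid(U[o] @ v_c) - label      # logistic-loss gradient w.r.t. the score
        grad_c += g * U[o]
        U[o] -= lr * g * v_c                 # only k + 1 rows of U are touched...
    W[center_id] -= lr * grad_c              # ...and a single row of W

sgns_update(center_id=42, context_id=7)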
Thank you
