
Vector Semantics
Dense Vectors
Dan  Jurafsky

Sparse  versus  dense  vectors

• PPMI vectors are
  • long (length |V| = 20,000 to 50,000)
  • sparse (most elements are zero)
• Alternative: learn vectors which are
  • short (length 200–1000)
  • dense (most elements are non-zero)

2
Dan  Jurafsky

Sparse  versus  dense  vectors


• Why dense vectors?
  • Short vectors may be easier to use as features in machine learning (fewer weights to tune)
  • Dense vectors may generalize better than storing explicit counts
  • They may do better at capturing synonymy:
    • car and automobile are synonyms, but are represented as distinct dimensions; this fails to capture the similarity between a word with car as a neighbor and a word with automobile as a neighbor
3
Dan  Jurafsky

Three methods for getting short dense vectors

• Singular Value Decomposition (SVD)
  • A special case of this is called LSA – Latent Semantic Analysis
• "Neural Language Model"-inspired predictive models
  • skip-grams and CBOW
• Brown clustering

4
Vector  Semantics
Dense  Vectors  via  SVD
Dan  Jurafsky

Intuition
• Approximate  an  N-­‐dimensional  dataset  using  fewer  dimensions
• By  first  rotating  the  axes  into  a  new  space
• In  which   the  highest  order   dimension  captures   the  most  
variance   in  the  original  dataset
• And  the  next  dimension   captures  the  next  most  variance,  etc.
• Many   such  (related)  methods:
• PCA – principal components analysis
• Factor  Analysis
• SVD
6
Dan Jurafsky

Dimensionality reduction

[Figure: a 2-D scatter of data points with two rotated axes overlaid: PCA dimension 1, the direction of greatest variance, and PCA dimension 2, orthogonal to it.]

7
Dan  Jurafsky

Singular  Value  Decomposition

Any rectangular w × c matrix X equals the product of 3 matrices:

W: rows corresponding to the original rows, but its m columns each represent a dimension in a new latent space, such that
  • the m column vectors are orthogonal to each other
  • the columns are ordered by the amount of variance in the dataset each new dimension accounts for
S: diagonal m × m matrix of singular values expressing the importance of each dimension
C: columns corresponding to the original columns, but its m rows correspond to the singular values

8
Dan Jurafsky

Singular Value Decomposition

[Figure A1 from Landauer and Dumais 1997: schematic diagram of the singular value decomposition of a w × c matrix X into W (w × m), S (m × m, diagonal), and C (m × c).]

9
Dan Jurafsky

SVD applied to term-document matrix:
Latent Semantic Analysis

Deerwester et al. (1988)

• If instead of keeping all m dimensions, we just keep the top k singular values. Let's say 300.
• The result is a least-squares approximation to the original X
• But instead of multiplying out the three matrices, we'll just make use of W.
• Each row of W:
  • A k-dimensional vector
  • Representing word w

[Figure: schematic of the truncated SVD, X (w × c) ≈ W (w × k) S (k × k) C (k × c).]

10
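To make the truncation step concrete, here is a minimal NumPy sketch (not from the slides; the toy matrix values and the choice of k are made up) that factors a small term-document count matrix and keeps only the top k singular values:

```python
# Minimal sketch: truncated SVD on a toy term-document count matrix X.
import numpy as np

# Toy term-document matrix: 5 terms (rows) x 4 documents (columns).
X = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 3., 0., 1.],
    [0., 0., 2., 2.],
    [1., 0., 0., 3.],
])

# Full SVD: X = W @ diag(s) @ C, singular values sorted in decreasing order.
W, s, C = np.linalg.svd(X, full_matrices=False)

# Keep only the top k singular values (the slides suggest k = 300 for real data).
k = 2
W_k, s_k, C_k = W[:, :k], s[:k], C[:k, :]

# Least-squares rank-k approximation of X (what LSA implicitly computes).
X_k = W_k @ np.diag(s_k) @ C_k
print("approximation error:", np.linalg.norm(X - X_k))

# For word representations we just use the rows of W_k (one k-dim vector per term).
word_vectors = W_k
print(word_vectors.shape)   # (5, 2)
```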
Dan  Jurafsky

LSA  more  details


• 300 dimensions are commonly used
• The cells are commonly weighted by a product of two weights
  • Local weight: log term frequency
  • Global weight: either idf or an entropy measure

11
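As a rough illustration of this weighting idea, here is a sketch of one common variant ("log-entropy" weighting); this is an assumption about the specific formula, and the exact weights used in any particular LSA system may differ:

```python
# Minimal sketch of log-entropy weighting: cell(i,j) = local(i,j) * global(i).
import numpy as np

tf = np.array([            # toy term-document raw counts
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 3., 0., 1.],
])
n_docs = tf.shape[1]

local = np.log(1.0 + tf)   # local weight: log term frequency

p = tf / np.maximum(tf.sum(axis=1, keepdims=True), 1e-12)   # P(doc | term)
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
entropy_term = plogp.sum(axis=1) / np.log(n_docs)            # in [-1, 0]
global_w = 1.0 + entropy_term   # near 0 for terms spread evenly over documents

weighted = local * global_w[:, np.newaxis]
print(weighted.round(3))
```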
Dan  Jurafsky

Let's return to PPMI word-word matrices

• Can we apply SVD to them?

12
Dan Jurafsky

SVD applied to term-term matrix

SVD applied to co-occurrence matrix X:

$$X_{|V| \times |V|} \;=\; W_{|V| \times |V|} \; S_{|V| \times |V|} \; C_{|V| \times |V|}, \qquad S = \mathrm{diag}(s_1, s_2, s_3, \ldots, s_{|V|})$$

(I'm simplifying here by assuming the matrix has rank |V|.)

Since the first dimensions encode the most variance, one way to view the reconstruction is as modeling the most important information in the original dataset.

13
Dan Jurafsky

Truncated SVD on term-term matrix

Taking only the top k dimensions after SVD applied to co-occurrence matrix X:

$$X_{|V| \times |V|} \;\approx\; W_{|V| \times k} \; S_{k \times k} \; C_{k \times |V|}, \qquad S_k = \mathrm{diag}(s_1, \ldots, s_k)$$

Figure 19.11: SVD factors a matrix X into a product of three matrices, W, S, and C. Taking the first k dimensions gives a |V| × k matrix W_k that has one k-dimensioned row per word.

14
Dan Jurafsky

Truncated SVD produces embeddings

• Each row of the W matrix is a k-dimensional embedding representing one word w_i
• k might range from 50 to 1000
• Generally we keep the top k dimensions, but some experiments suggest that getting rid of the top 1 dimension or even the top 50 dimensions is helpful (Lapesa and Evert 2014).

[Figure 19.11 again: the first k dimensions give a |V| × k matrix W_k whose rows can be used as embeddings.]

15
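A minimal sketch of the whole pipeline on a toy example (the PPMI values, vocabulary, and k below are invented for illustration): truncate the SVD of a word-word PPMI matrix and use the rows of W_k as embeddings.

```python
# Minimal sketch: dense embeddings from truncated SVD of a toy PPMI matrix.
import numpy as np

vocab = ["apricot", "pineapple", "digital", "information"]
ppmi = np.array([            # hypothetical PPMI co-occurrence values
    [0.0, 2.2, 0.0, 0.1],
    [2.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.7],
    [0.1, 0.0, 1.7, 0.0],
])

W, s, _ = np.linalg.svd(ppmi)
k = 2
embeddings = W[:, :k]        # one k-dimensional row per word
# (Some setups weight by the singular values instead: W[:, :k] * s[:k].)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

i, j = vocab.index("apricot"), vocab.index("pineapple")
print(cosine(embeddings[i], embeddings[j]))
```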
Dan  Jurafsky

Embeddings versus  sparse  vectors


• Dense SVD embeddings sometimes work better than sparse PPMI matrices at tasks like word similarity
  • Denoising: low-order dimensions may represent unimportant information
  • Truncation may help the models generalize better to unseen data.
  • Having a smaller number of dimensions may make it easier for classifiers to properly weight the dimensions for the task.
  • Dense models may do better at capturing higher-order co-occurrence.

16
Vector  Semantics
Embeddings inspired by neural language models:
skip-grams and CBOW
Dan  Jurafsky
Prediction-based models:
An alternative way to get dense vectors

• Skip-gram (Mikolov et al. 2013a), CBOW (Mikolov et al. 2013b)
• Learn embeddings as part of the process of word prediction.
• Train a neural network to predict neighboring words
  • Inspired by neural net language models.
  • In so doing, learn dense embeddings for the words in the training corpus.
• Advantages:
  • Fast, easy to train (much faster than SVD)
  • Available online in the word2vec package
  • Including sets of pretrained embeddings!

18
Dan Jurafsky

Skip-grams

• Predict each neighboring word
  • in a context window of 2C words
  • from the current word.
• So for C = 2, we are given word w_t and predict these 4 words: [w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}]

19
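For concreteness, here is a small sketch (toy sentence, simple whitespace tokenization assumed) that enumerates the (target, context) pairs a skip-gram model with C = 2 would be trained on:

```python
# Minimal sketch: enumerate skip-gram (target, context) training pairs.
def skipgram_pairs(tokens, C=2):
    pairs = []
    for t, target in enumerate(tokens):
        for offset in range(-C, C + 1):
            if offset == 0:
                continue
            pos = t + offset
            if 0 <= pos < len(tokens):
                pairs.append((target, tokens[pos]))
    return pairs

tokens = "a tablespoon of apricot preserves or jam".split()
for target, context in skipgram_pairs(tokens, C=2):
    if target == "apricot":
        print(target, "->", context)
# apricot -> tablespoon, apricot -> of, apricot -> preserves, apricot -> or
```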
Dan  Jurafsky
Skip-grams learn 2 embeddings for each word w

• input embedding v, in the input matrix W
  • Column i of the input matrix W (d × |V|) is the d-dimensional embedding v_i for word i in the vocabulary.
• output embedding v′, in the output matrix W′
  • Row i of the output matrix W′ (|V| × d) is the d-dimensional embedding v′_i for word i in the vocabulary.

[Figure: the d × |V| input matrix W and the |V| × d output matrix W′.]

20
Dan  Jurafsky

Setup
• Walking through the corpus pointing at word w(t), whose index in the vocabulary is j, so we'll call it w_j (1 < j < |V|).
• Let's predict w(t+1), whose index in the vocabulary is k (1 < k < |V|). Hence our task is to compute P(w_k|w_j).

21
Dan  Jurafsky

Intuition: similarity as dot-product between a target vector and context vector

[Figure: the target embedding for word j is a row of the target embedding matrix W; the context embedding for word k is a column of the context embedding matrix C; Similarity(j, k) is computed from these two d-dimensional vectors.]

22
Dan  Jurafsky

Similarity is computed from dot product

• Remember: two vectors are similar if they have a high dot product
  • Cosine is just a normalized dot product
• So:
  • Similarity(j, k) ∝ c_k · v_j
• We'll need to normalize to get a probability

23
Dan Jurafsky

Turning dot products into probabilities

• Similarity(j, k) = c_k · v_j
• The dot product c_k · v_j is not a probability, it's just a number
• We use the softmax function to turn it into probabilities:

$$p(w_k \mid w_j) = \frac{\exp(c_k \cdot v_j)}{\sum_{i \in |V|} \exp(c_i \cdot v_j)}$$

Computing the denominator requires computing the dot product of the target word with every other word in the vocabulary.

24
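A minimal NumPy sketch of this normalization, with made-up dimensions and random vectors standing in for trained embeddings:

```python
# Minimal sketch: turning dot products into p(w_k | w_j) with a softmax
# over every context vector in the vocabulary.
import numpy as np

d, V = 4, 6
rng = np.random.default_rng(0)
v_j = rng.normal(size=d)          # input (target) embedding of word j
C = rng.normal(size=(V, d))       # one context embedding c_i per vocabulary word

scores = C @ v_j                  # dot product c_i . v_j for every word i
scores -= scores.max()            # for numerical stability
probs = np.exp(scores) / np.exp(scores).sum()

k = 2
print(probs[k])                   # p(w_k | w_j)
print(probs.sum())                # 1.0 -- a proper probability distribution
```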
Dan  Jurafsky

Embeddings from  W  and  W’


• Since we have two embeddings, v_j and c_j, for each word w_j
• We can either:
  • Just use v_j
  • Sum them
  • Concatenate them to make a double-length embedding

25
Dan  Jurafsky

Learning
• Start  with  some   initial  embeddings (e.g.,  random)
• iteratively  make   the  embeddings for  a  word  
• more  like  the  embeddings of  its  neighbors  
• less  like  the  embeddings of  other  words.  

26
Dan  Jurafsky

Visualizing W and C as a network for doing error backprop

[Figure: a 1-hot input vector x (1 × |V|) for w_t feeds into the input matrix W (|V| × d), producing the projection-layer embedding for w_t (1 × d); the output matrix C (d × |V|) then produces the output layer (1 × |V|): probabilities of the context words, e.g. w_{t+1}.]

27
Dan  Jurafsky

One-hot vectors
• A vector of length |V|
• 1 for the target word and 0 for other words
• So if "popsicle" is vocabulary word 5
• The one-hot vector is
• [0,0,0,0,1,0,0,0,0,…,0]

w0 w1 wj w|V|

0 0 0 0 0 … 0 0 0 0 1 0 0 0 0 0 … 0 0 0 0

28
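One way to see why the one-hot representation is convenient: multiplying a one-hot vector by the embedding matrix simply selects that word's row. A tiny sketch with toy dimensions and made-up values:

```python
# Minimal sketch: a one-hot input vector times the embedding matrix
# just selects the embedding of the input word.
import numpy as np

V, d = 10, 3
W = np.arange(V * d, dtype=float).reshape(V, d)   # |V| x d input embedding matrix

j = 4                         # "popsicle" is vocabulary word 5 (0-based index 4)
x = np.zeros(V)
x[j] = 1.0                    # one-hot input vector

h = x @ W                     # projection layer: equals row j of W
assert np.allclose(h, W[j])
print(h)
```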
Dan  Jurafsky

Skip-gram

h = v_j (the projection layer is just the input embedding of the target word)
o = Ch (the output layer multiplies the context matrix by h)
o_k = c_k · h = c_k · v_j

[Figure: the same network as before: 1-hot input vector (1 × |V|), W (|V| × d), projection-layer embedding for w_t (1 × d), C (d × |V|), output-layer probabilities of context words (1 × |V|).]

29
Dan Jurafsky

Problem with the softmax

• The denominator: we have to compute it over every word in the vocabulary:

$$p(w_k \mid w_j) = \frac{\exp(c_k \cdot v_j)}{\sum_{i \in |V|} \exp(c_i \cdot v_j)}$$

• Instead: just sample a few of those negative words

30
Dan Jurafsky

Goal in learning

• In the training phase, the algorithm walks through the corpus, at each target word choosing the surrounding context words as positive examples, and for each positive example also choosing k samples of noise words or negative samples: non-neighbor words. The goal will be to move the embeddings toward the neighbor words and away from the noise words.
• For example, in walking through the example text below we come to the word apricot, and let L = 2 so we have 4 context words c1 through c4:

lemon, a [tablespoon of apricot preserves or] jam
              c1      c2    w      c3      c4

• Make the word like the context words
  • The goal is to learn an embedding whose dot product with each context word is high. In practice skip-gram uses a sigmoid function σ of the dot product, where σ(x) = 1/(1 + e^(−x)).
  • So for the above example we want this to be high:

    σ(c1 · w) + σ(c2 · w) + σ(c3 · w) + σ(c4 · w)

• And not like k randomly selected "noise words"
  • For each context word the algorithm chooses k noise words according to their unigram frequency. If we let k = 2, for each target/context pair we'll have 2 noise words for each of the 4 context words:

    [cement metaphysical dear coaxial apricot attendant whence forever puddle]
      n1 … n8

  • We'd like these noise words n to have a low dot-product with our target embedding w; in other words we want this to be low:

    σ(n1 · w) + σ(n2 · w) + ... + σ(n8 · w)

31
Dan Jurafsky

Skipgram with negative sampling: Loss function

More formally, the learning objective for one word/context pair (w, c) is:

$$\log \sigma(c \cdot w) \;+\; \sum_{i=1}^{k} \mathbb{E}_{w_i \sim p(w)} \left[ \log \sigma(-w_i \cdot w) \right]$$

That is, we want to maximize the dot product of the word with the actual context word, and minimize the dot products of the word with the k negative sampled non-neighbor words. The noise words w_i are sampled from the vocabulary according to their unigram frequency.

32
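Below is a minimal sketch of one stochastic-gradient step for this objective (not the word2vec implementation; the dimensions, learning rate, and uniform noise sampling are simplifications):

```python
# Minimal sketch: one SGD step of skip-gram with negative sampling for a single
# (target, context) pair plus k sampled noise words.
import numpy as np

rng = np.random.default_rng(0)
V, d, lr = 50, 8, 0.05
W = 0.01 * rng.normal(size=(V, d))   # target embeddings (one row per word)
C = 0.01 * rng.normal(size=(V, d))   # context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(target, context, noise_ids):
    w = W[target]
    # positive pair: push sigma(c . w) toward 1
    c_pos = C[context]
    g = sigmoid(c_pos @ w) - 1.0              # gradient of -log sigma(c . w)
    grad_w = g * c_pos
    C[context] -= lr * g * w
    # negative pairs: push sigma(n . w) toward 0
    for n in noise_ids:
        c_neg = C[n]
        g = sigmoid(c_neg @ w)                # gradient of -log sigma(-n . w)
        grad_w += g * c_neg
        C[n] -= lr * g * w
    W[target] -= lr * grad_w

noise = rng.integers(0, V, size=2)            # k = 2 noise words (uniform here;
sgns_step(target=7, context=3, noise_ids=noise)   # word2vec weights by unigram freq.)
```

A real implementation would loop this step over every (target, context) pair produced by walking the corpus.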
Dan  Jurafsky

Relation between skip-grams and PMI!

• If we multiply W by W′ᵀ
  • we get a |V| × |V| matrix M, each entry m_ij corresponding to some association between input word i and output word j
• Levy and Goldberg (2014b) show that skip-gram reaches its optimum just when this matrix is a shifted version of PMI:

W W′ᵀ = M^PMI − log k

• So skip-gram is implicitly factoring a shifted version of the PMI matrix into the two embedding matrices.

33
Dan Jurafsky

Properties of embeddings

• Nearest words to some embeddings (Mikolov et al. 2013):

target:  Redmond             Havel                    ninjutsu        graffiti      capitulate
         Redmond Wash.       Vaclav Havel             ninja           spray paint   capitulation
         Redmond Washington  president Vaclav Havel   martial arts    grafitti      capitulated
         Microsoft           Velvet Revolution        swordsmanship   taggers       capitulating

(Examples of the closest tokens to some target words, using a phrase-based extension of the skip-gram algorithm; Mikolov et al., 2013a.)

• One semantic property of embeddings that may play into their usefulness is their ability to capture relational meanings: Mikolov et al. (2013b) demonstrate that the offsets between vector embeddings can capture some relations between words.

34
Dan  Jurafsky

Embeddings capture relational meaning!

vector('king') − vector('man') + vector('woman') ≈ vector('queen')
vector('Paris') − vector('France') + vector('Italy') ≈ vector('Rome')

35
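A hedged sketch of how one might query these analogies with gensim, assuming gensim is installed and a pretrained word2vec file is available locally (the file path below is hypothetical):

```python
# Minimal sketch: analogy queries via vector offsets over pretrained word2vec vectors.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                        binary=True)

# vector('king') - vector('man') + vector('woman') ~= vector('queen')
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# vector('Paris') - vector('France') + vector('Italy') ~= vector('Rome')
print(kv.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1))
```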
Vector  Semantics
Brown  clustering
Dan  Jurafsky

Brown  clustering
• An  agglomerative   clustering  algorithm   that  clusters  words   based  
on  which  words   precede   or  follow  them
• These  word  clusters  can   be  turned  into  a  kind   of  vector
• We’ll  give  a  very   brief  sketch  here.

37
Dan  Jurafsky

Brown  clustering  algorithm


• Each  word   is  initially  assigned  to  its  own  cluster.  
• We then consider merging each pair of clusters. The highest-quality merge is chosen.
  • Quality = merges two words that have similar probabilities of preceding and following words
  • (More technically, quality = smallest decrease in the likelihood of the corpus according to a class-based language model)
• Clustering  proceeds   until  all  words  are   in  one   big  cluster.  

38
Dan  Jurafsky

Brown Clusters as vectors

• By tracing the order in which clusters are merged, the model builds a binary tree from bottom to top.
• Each word is represented by a binary string = the path from root to leaf
• Each intermediate node is a cluster
• "chairman" is 0010, "months" = 01, and verbs = 1

[Figure: Brown algorithm binary tree. Under the 0 branch: 000 = CEO, 0010 = chairman, 0011 = president, 010 = November, 011 = October. Under the 1 branch (verbs): 100 = run, 101 = sprint, 11 = walk.]

39
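A small sketch of how such bit strings are typically turned into features (the cluster assignments below are hypothetical, taken from the toy tree sketched above; prefixes of different lengths act like clusters of different granularity):

```python
# Minimal sketch: Brown-cluster bit-string prefixes as categorical features.
brown_paths = {                 # hypothetical word -> path assignments
    "chairman": "0010",
    "president": "0011",
    "November": "010",
    "October": "011",
    "run": "100",
    "sprint": "101",
    "walk": "11",
}

def prefix_features(word, lengths=(2, 4)):
    path = brown_paths[word]
    return {f"brown_prefix_{n}": path[:n] for n in lengths}

print(prefix_features("chairman"))   # {'brown_prefix_2': '00', 'brown_prefix_4': '0010'}
print(prefix_features("president"))  # shares the '00' prefix with 'chairman'
```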
Dan Jurafsky

Brown cluster examples

Because they are based on immediately neighboring words, Brown clusters are most commonly used for representing the syntactic properties of words, and hence are commonly used as a feature in parsers. Nonetheless, the clusters do represent some semantic properties as well:

Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
June March July April January December October November September August
pressure temperature permeability density porosity stress velocity viscosity gravity tension
anyone someone anybody somebody
had hadn't hath would've could've should've must've might've
asking telling wondering instructing informing kidding reminding bothering thanking deposing
mother wife father son husband brother daughter sister boss uncle
great big vast sudden mere sheer gigantic lifelong scant colossal
down backwards ashore sideways southward northward overboard aloft downwards adrift

(Figure 19.17: Some sample Brown clusters from a 260,741-word vocabulary trained on 366 million words of running text (Brown et al., 1992). Note the mixed syntactic-semantic nature of the clusters.)

40
Dan Jurafsky

Class-based language models

• Brown clustering is based on a class-based language model (Brown et al., 1992), in which each word w ∈ V belongs to a class c ∈ C with a probability P(w|c).
• The class-based LM assigns a probability to a pair of words w_{i−1} and w_i by modeling the transition between classes rather than between words:

$$P(w_i \mid w_{i-1}) = P(c_i \mid c_{i-1}) \, P(w_i \mid c_i)$$

• The class-based LM can be used to assign a probability to an entire corpus given a particular clustering C:

$$P(\text{corpus} \mid C) = \prod_{i=1}^{n} P(c_i \mid c_{i-1}) \, P(w_i \mid c_i)$$

• Class-based language models are generally not used as a language model for applications like machine translation or speech recognition because they don't work as well.

41
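To make the formula concrete, here is a minimal sketch that scores a tiny corpus under a class-based bigram model; the word classes and probability values below are made up for illustration:

```python
# Minimal sketch: log P(corpus | C) under a class-based bigram language model,
# P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i).
import math

word2class = {"the": "DET", "dog": "NOUN", "cat": "NOUN", "runs": "VERB"}
p_class = {("<s>", "DET"): 0.8, ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.5}
p_word_given_class = {"the": 1.0, "dog": 0.4, "cat": 0.4, "runs": 0.3}

def log_prob(corpus):
    prev_c, total = "<s>", 0.0
    for w in corpus:
        c = word2class[w]
        total += math.log(p_class[(prev_c, c)]) + math.log(p_word_given_class[w])
        prev_c = c
    return total

print(log_prob(["the", "dog", "runs"]))
```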
