
A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning

Ronan Collobert (collober@nec-labs.com)
Jason Weston (jasonw@nec-labs.com)
NEC Labs America, 4 Independence Way, Princeton, NJ 08540 USA
Abstract

We describe a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words and the likelihood that the sentence makes sense (grammatically and semantically) using a language model. The entire network is trained jointly on all these tasks using weight-sharing, an instance of multitask learning. All the tasks use labeled data except the language model, which is learnt from unlabeled text and represents a novel form of semi-supervised learning for the shared tasks. We show how both multitask learning and semi-supervised learning improve the generalization of the shared tasks, resulting in state-of-the-art performance.

1. Introduction

The field of Natural Language Processing (NLP) aims to convert human language into a formal representation that is easy for computers to manipulate. Current end applications include information extraction, machine translation, summarization, search and human-computer interfaces.

While complete semantic understanding is still a far-distant goal, researchers have taken a divide and conquer approach and identified several sub-tasks useful for application development and analysis. These range from the syntactic, such as part-of-speech tagging, chunking and parsing, to the semantic, such as word-sense disambiguation, semantic-role labeling, named entity extraction and anaphora resolution.

Currently, most research analyzes those tasks separately. Many systems possess few characteristics that would help develop a unified architecture, which would presumably be necessary for deeper semantic tasks. In particular, many systems possess three failings in this regard: (i) they are shallow in the sense that the classifier is often linear; (ii) for good performance with a linear classifier they must incorporate many hand-engineered features specific to the task; and (iii) they cascade features learnt separately from other tasks, thus propagating errors.

In this work we attempt to define a unified architecture for Natural Language Processing that learns features that are relevant to the tasks at hand given very limited prior knowledge. This is achieved by training a deep neural network, building upon work by (Bengio & Ducharme, 2001) and (Collobert & Weston, 2007). We define a rather general convolutional network architecture and describe its application to many well known NLP tasks including part-of-speech tagging, chunking, named-entity recognition, learning a language model and the task of semantic role-labeling.

All of these tasks are integrated into a single system which is trained jointly. All the tasks except the language model are supervised tasks with labeled training data. The language model is trained in an unsupervised fashion on the entire Wikipedia website. Training this task jointly with the other tasks comprises a novel form of semi-supervised learning.

We focus on, in our opinion, the most difficult of these tasks: the semantic role-labeling problem. We show that both (i) multitask learning and (ii) semi-supervised learning significantly improve performance on this task in the absence of hand-engineered features. We also show how the combined tasks, and in particular the unsupervised task, learn powerful features with clear semantic information given no human supervision other than the (labeled) data from the tasks (see Table 1).

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).


The article is structured as follows. In Section 2 we describe each of the NLP tasks we consider, and in Section 3 we define the general architecture that we use to solve all the tasks. Section 4 describes how this architecture is employed for multitask learning on all the labeled tasks we consider, and Section 5 describes the unlabeled task of building a language model in some detail. Section 6 gives experimental results of our system, and Section 7 concludes with a discussion of our results and the possible directions for future research.

2. NLP Tasks

We consider six standard NLP tasks in this paper.

Part-Of-Speech Tagging (POS) aims at labeling each word with a unique tag that indicates its syntactic role, e.g. plural noun, adverb, ...

Chunking, also called shallow parsing, aims at labeling segments of a sentence with syntactic constituents such as noun or verb phrase (NP or VP). Each word is assigned only one unique tag, often encoded as a begin-chunk (e.g. B-NP) or inside-chunk tag (e.g. I-NP).

Named Entity Recognition (NER) labels atomic elements in the sentence into categories such as "PERSON", "COMPANY", or "LOCATION".

Semantic Role Labeling (SRL) aims at giving a semantic role to a syntactic constituent of a sentence. In the PropBank (Palmer et al., 2005) formalism one assigns roles ARG0-5 to words that are arguments of a predicate in the sentence, e.g. the following sentence might be tagged "[John]ARG0 [ate]REL [the apple]ARG1", where "ate" is the predicate. The precise arguments depend on a verb's frame, and if there are multiple verbs in a sentence some words might have multiple tags. In addition to the ARG0-5 tags, there are 13 modifier tags such as ARGM-LOC (locational) and ARGM-TMP (temporal) that operate in a similar way for all verbs.

Language Models A language model traditionally estimates the probability of the next word being w in a sequence. We consider a different setting: predict whether the given sequence exists in nature, or not, following the methodology of (Okanohara & Tsujii, 2007). This is achieved by labeling real texts as positive examples, and generating "fake" negative text.

Semantically Related Words ("Synonyms") This is the task of predicting whether two words are semantically related (synonyms, holonyms, hypernyms, ...), which is measured using the WordNet database (http://wordnet.princeton.edu) as ground truth.
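For concreteness, the labeled examples used by these tasks can be pictured as per-word tags. The snippet below is purely illustrative, reusing the SRL example above; the specific POS, chunk and NER tags shown are our own choice of conventional tags, not values taken from the paper.

```python
# One sentence under four of the labeling tasks (tags are illustrative).
srl   = [("John", "ARG0"),   ("ate", "REL"),  ("the", "ARG1"), ("apple", "ARG1")]
pos   = [("John", "NNP"),    ("ate", "VBD"),  ("the", "DT"),   ("apple", "NN")]
chunk = [("John", "B-NP"),   ("ate", "B-VP"), ("the", "B-NP"), ("apple", "I-NP")]
ner   = [("John", "PERSON"), ("ate", "O"),    ("the", "O"),    ("apple", "O")]  # "O" = no entity
```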

Our main interest is SRL, as it is, in our opinion, the most complex of these tasks. We use all these tasks to: (i) show the generality of our proposed architecture; and (ii) improve SRL through multitask learning.

3. General Deep Architecture for NLP

All the NLP tasks above can be seen as tasks assigning labels to words. The traditional NLP approach is: extract from the sentence a rich set of hand-designed features which are then fed to a classical shallow classification algorithm, e.g. a Support Vector Machine (SVM), often with a linear kernel. The choice of features is a completely empirical process, mainly based on trial and error, and the feature selection is task dependent, implying additional research for each new NLP task. Complex tasks like SRL then require a large number of possibly complex features (e.g., extracted from a parse tree) which makes such systems slow and intractable for large-scale applications.

Instead we advocate a deep neural network (NN) architecture, trained in an end-to-end fashion. The input sentence is processed by several layers of feature extraction. The features in deep layers of the network are automatically trained by backpropagation to be relevant to the task. We describe in this section a general deep architecture suitable for all our NLP tasks, and easily generalizable to other NLP tasks.

Our architecture is summarized in Figure 1. The first layer extracts features for each word. The second layer extracts features from the sentence, treating it as a sequence with local and global structure (i.e., it is not treated like a bag of words). The following layers are classical NN layers.

[Figure 1: an input sentence of n words with K features per word (e.g. "the cat sat on the mat") is passed through lookup tables LT_{W^1}, ..., LT_{W^K}, a convolution layer, a max-over-time layer, optional classical NN layer(s) and a softmax.]
Figure 1. A general deep NN architecture for NLP. Given an input sentence, the NN outputs class probabilities for one chosen word. A classical window approach is a special case where the input has a fixed size ksz, and the TDNN kernel size is ksz; in that case the TDNN layer outputs only one vector and the Max layer performs an identity.

3.1. Transforming Indices into Vectors

As our architecture deals with raw words and not engineered features, the first layer has to map words into real-valued vectors for processing by subsequent layers of the NN. For simplicity (and efficiency) we consider words as indices in a finite dictionary of words D ⊂ N.

Lookup-Table Layer Each word i ∈ D is embedded into a d-dimensional space using a lookup table LT_W(·):

LT_W(i) = W_i,

where W ∈ R^{d×|D|} is a matrix of parameters to be learnt, W_i ∈ R^d is the i-th column of W and d is the word vector size (wsz) to be chosen by the user. In the first layer of our architecture an input sentence {s_1, s_2, ..., s_n} of n words in D is thus transformed into a series of vectors {W_{s_1}, W_{s_2}, ..., W_{s_n}} by applying the lookup-table to each of its words.

It is important to note that the parameters W of the layer are automatically trained during the learning process using backpropagation.
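As a concrete illustration of this layer, here is a minimal sketch of ours (not the authors' code), using a toy dictionary; the lookup simply selects columns of W.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dictionary D: word -> index.
D = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
wsz = 50                                        # word vector size d, chosen by the user
W = 0.01 * rng.standard_normal((wsz, len(D)))   # parameters W in R^{d x |D|}, learnt by backprop

def lookup_table(word_indices, W):
    """LT_W(i) = W_i: return the d x n matrix of word vectors for a sentence."""
    return W[:, word_indices]

sentence = ["the", "cat", "sat", "on", "the", "mat"]
idx = np.array([D[w] for w in sentence])
x = lookup_table(idx, W)                        # columns are W_{s_1}, ..., W_{s_n}
print(x.shape)                                  # (50, 6)
```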

Variations on Word Representations In practice, one may want to introduce some basic pre-processing, such as word-stemming or dealing with upper and lower case. In our experiments, we limited ourselves to converting all words to lower case, and represent the capitalization as a separate feature (yes or no).

When a word is decomposed into K elements (features), it can be represented as a tuple i = {i^1, i^2, ..., i^K} ∈ D^1 × ... × D^K, where D^k is the dictionary for the k-th element. We associate to each element a lookup-table LT_{W^k}(·), with parameters W^k ∈ R^{d_k×|D^k|}, where d_k ∈ N is a user-specified vector size. A word i is then embedded in a d = Σ_k d_k dimensional space by concatenating all lookup-table outputs:

LT_{W^1,...,W^K}(i)^T = (LT_{W^1}(i^1)^T, ..., LT_{W^K}(i^K)^T).

Classifying with Respect to a Predicate In a complex task like SRL, the class label of each word in a sentence depends on a given predicate. It is thus necessary to encode in the NN architecture which predicate we are considering in the sentence. We propose to add a feature for each word that encodes its relative distance to the chosen predicate. For the i-th word in the sentence, if the predicate is at position pos_p we use an additional lookup table LT^{dist_p}(i − pos_p).
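For illustration, the word-level features described so far (lower-cased word, capitalization flag and relative distance to the predicate) can be embedded and concatenated as in the equation above. This is our sketch under stated assumptions: a toy dictionary and a clipping range of ±10 for distances, which the paper does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

# One lookup table per discrete feature; sizes d_k are user-specified.
word_dict = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
W_word = 0.01 * rng.standard_normal((50, len(word_dict)))   # d_1 = 50 (wsz)
W_caps = 0.01 * rng.standard_normal((2, 2))                 # d_2 = 2, capitalized yes/no
W_dist = 0.01 * rng.standard_normal((5, 21))                # d_3 = 5, distances clipped to -10..10

def embed_word(word, position, pos_p):
    """Concatenate lookup-table outputs: (LT_{W^1}(i^1)^T, ..., LT_{W^K}(i^K)^T)."""
    caps = int(word[0].isupper())
    dist = int(np.clip(position - pos_p, -10, 10)) + 10      # shift to a valid table index
    i_word = word_dict[word.lower()]
    return np.concatenate([W_word[:, i_word], W_caps[:, caps], W_dist[:, dist]])

sentence = ["The", "cat", "sat", "on", "the", "mat"]
pos_p = 2  # suppose "sat" is the predicate
x = np.stack([embed_word(w, t, pos_p) for t, w in enumerate(sentence)], axis=1)
print(x.shape)  # (57, 6): d = 50 + 2 + 5, one column per word
```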
3.2. Variable Sentence Length

The lookup table layer maps the original sentence into a sequence x(·) of n identically sized vectors:

(x_1, x_2, ..., x_n),  ∀t: x_t ∈ R^d.  (1)

Obviously the size n of the sequence varies depending on the sentence. Unfortunately normal NNs are not able to handle sequences of variable length.

The simplest solution is to use a window approach: consider a window of fixed size ksz around each word we want to label. While this approach works with great success on simple tasks like POS, it fails on more complex tasks like SRL. In the latter case it is common for the role of a word to depend on words far away in the sentence, and hence outside of the considered window.

When modeling long-distance dependencies is important, Time-Delay Neural Networks (TDNNs) (Waibel et al., 1989) are a better choice. Here, time refers to the idea that a sequence has a notion of order. A TDNN "reads" the sequence in an online fashion: at time t, it sees x_t, the t-th word in the sentence.

A classical TDNN layer performs a convolution on a given sequence x(·), outputting another sequence o(·) whose value at time t is:

o(t) = Σ_{j=1−t}^{n−t} L_j · x_{t+j},  (2)

where L_j ∈ R^{n_hu×d} (−n ≤ j ≤ n) are the parameters of the layer (with n_hu hidden units) trained by backpropagation. One usually constrains this convolution by defining a kernel width, ksz, which enforces

∀ |j| > (ksz − 1)/2:  L_j = 0.  (3)

A classical window approach only considers words in a window of size ksz around the word to be labeled. Instead, if we use (2) and (3), a TDNN considers at the same time all windows of ksz words in the sentence.

TDNN layers can also be stacked so that one can extract local features in lower layers, and more global features in subsequent ones. This is an approach typically used in convolutional networks for vision tasks, such as the LeNet architecture (LeCun et al., 1998).

We then add to our architecture a layer which captures the most relevant features over the sentence by feeding the TDNN layer(s) into a "Max" Layer, which takes the maximum over time (over the sentence) in (2) for each of the n_hu output features.

As the layer's output is of fixed dimension (independent of sentence size), subsequent layers can be classical NN layers. Provided we have a way to indicate to our architecture the word to be labeled, it is then able to use features extracted from all windows of ksz words in the sentence to compute the label of one word of interest.

We indicate the word to be labeled to the NN with an additional lookup-table, as suggested in Section 3.1. Considering the word at position pos_w, we encode the relative distance between the i-th word in the sentence and this word using a lookup-table LT^{dist_w}(i − pos_w).

3.3. Deep Architecture

A TDNN (or window) layer performs a linear operation over the input words. While linear approaches work fairly well for POS or NER, more complex tasks like SRL require nonlinear models. One can add to the NN one or more classical NN layers. The output of the l-th layer containing n_{hu_l} hidden units is computed as o^l = tanh(L^l o^{l−1}), where the matrix of parameters L^l ∈ R^{n_{hu_l}×n_{hu_{l−1}}} is trained by backpropagation.

The size of the last (parametric) layer's output is the number of classes considered in the NLP task. This layer is followed by a softmax layer (Bridle, 1990) which makes sure the outputs are positive and sum to 1, allowing us to interpret the outputs of the NN as probabilities for each class. The i-th output is given by exp(o^last_i) / Σ_j exp(o^last_j). The whole network is trained with the cross-entropy criterion (Bridle, 1990).
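To make the pipeline concrete, here is a compact numerical sketch of the forward pass just described: the TDNN convolution of equation (2) under constraint (3), the max over time, one tanh layer, and a softmax trained with the cross-entropy criterion. It is our illustration with placeholder sizes, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_hu, n_classes, ksz = 57, 100, 20, 3              # placeholder sizes, not the paper's exact setup
half = (ksz - 1) // 2
L  = 0.01 * rng.standard_normal((ksz, n_hu, d))       # L_j for |j| <= (ksz-1)/2; others are 0 (eq. 3)
L2 = 0.01 * rng.standard_normal((n_hu, n_hu))         # classical layer: o^l = tanh(L^l o^{l-1})
L3 = 0.01 * rng.standard_normal((n_classes, n_hu))    # last parametric layer: one output per class

def tdnn(x):
    """Equation (2): o(t) = sum_j L_j x_{t+j}, with out-of-sentence terms dropped."""
    n = x.shape[1]
    o = np.zeros((n_hu, n))
    for t in range(n):
        for j in range(-half, half + 1):
            if 0 <= t + j < n:
                o[:, t] += L[j + half] @ x[:, t + j]
    return o

def forward(x):
    """TDNN -> max over time -> tanh layer -> linear layer -> softmax."""
    o = tdnn(x)
    m = o.max(axis=1)                  # Max layer: fixed size, independent of sentence length
    h = np.tanh(L2 @ m)
    scores = L3 @ h
    e = np.exp(scores - scores.max())  # softmax (numerically stabilised)
    return e / e.sum()

def cross_entropy(p, y):
    """Training criterion: negative log-probability of the correct class y."""
    return -np.log(p[y])

x = rng.standard_normal((d, 6))        # 6 word vectors coming from the lookup-table layer
p = forward(x)
print(p.sum(), cross_entropy(p, y=3))
```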

3.4. Related Architectures

In (Collobert & Weston, 2007) we described a NN suited for SRL. This work also used a lookup-table to generate word features (see also (Bengio & Ducharme, 2001)). The issue of labeling with respect to a predicate was handled with a special hidden layer: its output, given input sequence (1), predicate position pos_p, and the word of interest pos_w, was defined as:

o(t) = C(t − pos_w, t − pos_p) · x_t.

The function C(·) is shared through time t: one could say that this is a variant of a TDNN layer with a kernel width ksz = 1, but where the parameters are conditioned on other variables (distances with respect to the verb and word of interest).

The fact that C(·) does not combine several words in the same neighborhood as in our TDNN approach limits the dependencies between words it can model. Also, C(·) is itself a NN inside a NN. Not only does one have to carefully design this additional architecture, but it also makes the approach more complicated to train and implement. Integrating all the desired features in x(·) (including the predicate position) via lookup-tables makes our approach simpler, more general and easier to tune.

4. Multitasking with Deep NN

Multitask learning (MTL) is the procedure of learning several tasks at the same time with the aim of mutual benefit. This is an old idea in machine learning; a good overview, especially focusing on NNs, can be found in (Caruana, 1997).

4.1. Deep Joint Training

If one considers related tasks, it makes sense that features useful for one task might be useful for other ones. In NLP for example, POS predictions are often used as features for SRL and NER. Improving generalization on the POS task might therefore improve both SRL and NER.

A NN automatically learns features for the desired tasks in the deep layers of its architecture. In the case of our general architecture for NLP presented in Section 3, the deepest layer (consisting of lookup-tables) implicitly learns relevant features for each word in the dictionary. It is thus reasonable to expect that when training NNs on related tasks, sharing deep layers in these NNs would improve features produced by these deep layers, and thus improve generalization performance. The last layers of the network can then be task specific.

In this paper we show this procedure performs very well for NLP tasks when sharing the lookup-tables of each considered task, as depicted in Figure 2. Training is achieved in a stochastic manner by looping over the tasks:

1. Select the next task.
2. Select a random training example for this task.
3. Update the NN for this task by taking a gradient step with respect to this example.
4. Go to 1.

It is worth noticing that labeled data for training each task can come from completely different datasets.
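A minimal sketch of the loop above (ours; the task list, datasets, parameters and gradient step are placeholders, not the original implementation) makes the weight-sharing explicit: every task updates the shared lookup-table while keeping its own upper layers.

```python
import random

# Hypothetical handles: shared lookup-table parameters plus task-specific layers.
shared_lookup = {"W_word": None}                       # shared across all tasks
tasks = {
    "pos": {"data": [("example", "label")], "layers": None},
    "srl": {"data": [("example", "label")], "layers": None},
    "lm":  {"data": [("window", None)],     "layers": None},  # unsupervised task, see Section 5
}

def gradient_step(example, shared, task_layers):
    """Placeholder for one backpropagation step on a single example."""
    pass

def train(n_steps):
    task_names = list(tasks)
    for step in range(n_steps):
        name = task_names[step % len(task_names)]      # 1. select the next task
        example = random.choice(tasks[name]["data"])   # 2. pick a random training example for it
        gradient_step(example, shared_lookup,          # 3. gradient step for this task only
                      tasks[name]["layers"])
        # 4. loop

train(n_steps=10)
```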

4.2. Previous Work in MTL for NLP

The NLP field contains many related tasks. This makes it a natural field for applying MTL, and several techniques have already been explored.

[Figure 2: two task networks (Task 1 and Task 2), each with its own convolution, max, classical NN layer(s) and softmax, sharing one word lookup-table.]
Figure 2. Example of deep multitasking with NN. Task 1 and Task 2 are two tasks trained with the architecture presented in Figure 1. One lookup-table (in black) is shared (the other lookup-tables and layers are task specific). The principle is the same with more than two tasks.

Cascading Features The most obvious way to achieve MTL is to train one task, and then use this task as a feature for another task. This is a very common approach in NLP. For example, in the case of SRL, several methods (e.g., (Pradhan et al., 2004)) train a POS classifier and use the output as features for training a parser, which is then used for building the features for SRL itself. Unfortunately, tasks (features) are learnt separately in such a cascade, thus propagating errors from one classifier to the next.

Shallow Joint Training If one possesses a dataset labeled for several tasks, it is then possible to train these tasks jointly in a shallow manner: one unique model can predict all task labels at the same time. Using this scheme, the authors of (Sutton et al., 2007) proposed a conditional random field approach where they showed improvements from joint training on POS tagging and noun-phrase chunking tasks. However, the requirement of jointly annotated data is a limitation, as this is often not the case. Similarly, in (Miller et al., 2000) NER, parsing and relation extraction were jointly trained in a statistical parsing model, achieving improved performance on all tasks. This work has the same joint labeling requirement problem, which the authors avoided by using a predictor to fill in the missing annotations.

In (Sutton & McCallum, 2005a) the authors showed that one could learn the tasks independently, hence using different training sets, by only leveraging predictions jointly in a test-time decoding step, and still obtain improved results. The problem is, however, that this will not make use of the shared tasks at training time. The NN approach used here seems more flexible in these regards.

Finally, the authors of (Musillo & Merlo, 2006) made an attempt at improving the semantic role labeling task by joint inference with syntactic parsing, but their results are not state-of-the-art. The authors of (Sutton & McCallum, 2005b) also describe a negative result at the same joint task.

5. Leveraging Unlabeled Data

Labeling a dataset can be an expensive task, especially in NLP where labeling often requires skilled linguists. On the other hand, unlabeled data is abundant and freely available on the web. Leveraging unlabeled data in NLP tasks seems to be a very attractive, and challenging, goal.

In our MTL framework presented in Figure 2, there is nothing stopping us from jointly training supervised tasks on labeled data and unsupervised tasks on unlabeled data. We now present an unsupervised task suitable for NLP.

Language Model We consider a language model based on a simple fixed window of text of size ksz using our NN architecture, given in Figure 2. We trained our language model to discriminate a two-class classification task: whether the word in the middle of the input window is related to its context or not. We construct a dataset for this task by considering all possible ksz windows of text from the entire English Wikipedia (http://en.wikipedia.org). Positive examples are windows from Wikipedia; negative examples are the same windows where the middle word has been replaced by a random word.

We train this problem with a ranking-type cost:

Σ_{s∈S} Σ_{w∈D} max(0, 1 − f(s) + f(s^w)),  (4)

where S is the set of sentence windows of text, D is the dictionary of words, f(·) represents our NN architecture without the softmax layer, and s^w is a sentence window where the middle word has been replaced by the word w. We sample this cost online w.r.t. (s, w).
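For illustration, one sampled term of this cost can be computed as follows. This is our sketch; the scoring network f below is a stand-in with arbitrary sizes, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

ksz, wsz, vocab = 11, 50, 30000
W = 0.01 * rng.standard_normal((wsz, vocab))           # shared word lookup-table
A = 0.01 * rng.standard_normal((100, ksz * wsz))       # hidden layer
b = 0.01 * rng.standard_normal(100)
v = 0.01 * rng.standard_normal(100)

def f(window_indices):
    """Stand-in for the NN without its softmax layer: a scalar score for a ksz-word window."""
    x = W[:, window_indices].reshape(-1, order="F")    # concatenate the ksz word vectors
    return float(v @ np.tanh(A @ x + b))

def ranking_loss(s, w):
    """max(0, 1 - f(s) + f(s^w)): the true window s should outscore the corrupted one s^w."""
    s_corrupt = list(s)
    s_corrupt[ksz // 2] = w                            # replace the middle word by w
    return max(0.0, 1.0 - f(s) + f(s_corrupt))

s = list(rng.integers(0, vocab, size=ksz))             # a "real" window sampled from the corpus
w = int(rng.integers(0, vocab))                        # a random replacement word
print(ranking_loss(s, w))
```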

We will see in our experiments that the features (embedding) learnt by the lookup-table layer of this NN cluster semantically similar words. These discovered features will prove very useful for our shared tasks.

Previous Work on Language Models (Bengio & Ducharme, 2001) and (Schwenk & Gauvain, 2002) already presented very similar language models. However, their goal was to give a probability of a word given the previous ones in a sentence. Here, we only want to have a good representation of words: we take advantage of the complete context of a word (before and after) to predict its relevance.

Perhaps this is the reason the authors were never able to obtain a good embedding of their words. Also, using probabilities imposes using a cross-entropy type criterion and can require many tricks to speed up the training, due to normalization issues. Our criterion (4) is much simpler in that respect.

The authors of (Okanohara & Tsujii, 2007), like us, also take a two-class approach (true/fake sentences). They use a shallow (kernel) classifier.

Previous Work in Semi-Supervised Learning For an overview of semi-supervised learning, see (Chapelle et al., 2006). There have been several uses of semi-supervised learning in NLP before, for example in NER (Rosenfeld & Feldman, 2007), machine translation (Ueffing et al., 2007), parsing (McClosky et al., 2006) and text classification (Joachims, 1999). The first work is a highly problem-specific approach, whereas the last three all use a self-training type approach (Transductive SVMs in the case of text classification, which is a kind of self-training method). These methods augment the training set with labeled examples from the unlabeled set which are predicted by the model itself. This can give large improvements in a model, but care must be taken as the predictions are of course prone to noise.

The authors of (Ando & Zhang, 2005) propose a setup more similar to ours: they learn from unlabeled data as an auxiliary task in a MTL framework. The main difference is that they use shallow classifiers; however, they report positive results on POS and NER tasks.

Semantically Related Words Task We found it interesting to compare the embedding obtained with a language model on unlabeled data with an embedding obtained with labeled data. WordNet is a database which contains semantic relations (synonyms, holonyms, hypernyms, ...) between around 150,000 words. We used it to train a NN similar to the language model one. We considered the problem as a two-class classification task: positive examples are pairs with a relation in WordNet, and negative examples are random pairs.
6. Experiments

We used Sections 02-21 of the PropBank dataset version 1 (about 1 million words) for training and Section 23 for testing, as standard in all SRL experiments. POS and chunking tasks use the same data split via the Penn TreeBank. NER labeled data was obtained by running the Stanford Named Entity Recognizer (a CRF based classifier) over the same data. Language models were trained on Wikipedia. In all cases, any numeric number was converted as "NUMBER". Accentuated characters were transformed to their non-accentuated versions. All paragraphs containing other non-ASCII characters were discarded. For Wikipedia, we obtain a database of 631M words. We used WordNet to train the "synonyms" (semantically related words) task.

Table 1. Language model performance for learning an embedding in wsz = 50 dimensions (dictionary size: 30,000). For each column the queried word is followed by its index in the dictionary (higher means more rare) and its 10 nearest neighbors (using the Euclidean metric).

france 454: spain, italy, russia, poland, england, denmark, germany, portugal, sweden, austria
jesus 1973: christ, god, resurrection, prayer, yahweh, josephus, moses, sin, heaven, salvation

All tasks use the same dictionary of the 30,000 most common words from Wikipedia, converted to lower case. Other words were considered as unknown and mapped to a special word.
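As an illustration of this preprocessing and dictionary mapping, a rough sketch could look like the following; it is ours, and any rule beyond what the text states (e.g. the exact number pattern) is an assumption.

```python
import re
import unicodedata

def strip_accents(text):
    """Map accentuated characters to their non-accentuated versions."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def preprocess_paragraph(paragraph):
    """Lower-case, replace numbers by NUMBER, strip accents; drop paragraphs with other non-ASCII."""
    text = strip_accents(paragraph.lower())
    text = re.sub(r"\d+(?:[.,]\d+)*", "NUMBER", text)   # assumed number pattern
    if not text.isascii():
        return None                                     # paragraph discarded
    return text.split()

def to_indices(tokens, dictionary, unknown_index):
    """Map words outside the 30,000-word dictionary to a special UNKNOWN index."""
    return [dictionary.get(t, unknown_index) for t in tokens]

print(preprocess_paragraph("Müller scored 2 goals in 1974."))
```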

Architectures All tasks were trained using the NN shown in Figure 1. POS, NER and chunking tasks were trained with the window version with ksz = 5. We chose linear models for POS and NER. For chunking we chose a hidden layer of 200 units. The language model task had a window size ksz = 11, and a hidden layer of 100 units. All these tasks used two lookup-tables: one of dimension wsz for the word in lower case, and one of dimension 2 specifying if the first letter of the word is a capital letter or not.

For SRL, the network had a convolution layer with ksz = 3 and 100 hidden units, followed by another hidden layer of 100 hidden units. It had three lookup-tables in the first layer: one for the word (in lower case), and two that encode relative distances (to the word of interest and the verb). The last two lookup-tables embed in 5-dimensional spaces. Verb positions are obtained with our POS classifier.

The language model network had only one lookup-table (the word in lower case) and 100 hidden units. It used a window of size ksz = 11.

We show results for different encoding sizes of the word in lower case: wsz = 15, 50 and 100.
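For reference, the per-task settings just listed can be gathered in one place. The dictionary below is simply an illustrative restatement of the text, not a configuration file from the original system; wsz is left as a variable since three values are tried.

```python
# 0 hidden units stands for a linear model; the SRL distances are embedded in 5 dimensions each.
def task_settings(wsz):
    return {
        "pos":      {"window": 5,  "hidden_units": 0,   "lookups": {"word": wsz, "caps": 2}},
        "ner":      {"window": 5,  "hidden_units": 0,   "lookups": {"word": wsz, "caps": 2}},
        "chunking": {"window": 5,  "hidden_units": 200, "lookups": {"word": wsz, "caps": 2}},
        "srl":      {"conv_ksz": 3, "conv_units": 100, "hidden_units": 100,
                     "lookups": {"word": wsz, "dist_to_word": 5, "dist_to_verb": 5}},
        "language_model": {"window": 11, "hidden_units": 100, "lookups": {"word": wsz}},
    }

print(task_settings(wsz=50)["srl"])
```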



Table 2. A Deep Architecture for SRL improves by learning auxiliary tasks that share the first layer that represents words as wsz-dimensional vectors. We give word error rates for wsz = 15, 50 and 100 and various shared tasks.

Task                                           wsz=15   wsz=50   wsz=100
SRL                                             16.54    18.40    17.33
SRL + POS                                       15.99    16.53    16.57
SRL + Chunking                                  16.42    16.48    16.39
SRL + NER                                       16.67    17.21    17.29
SRL + Synonyms                                  15.46    15.17    15.17
SRL + Language model                            14.42    14.46    14.30
SRL + POS + Chunking                            16.46    16.41    15.95
SRL + POS + NER                                 16.45    16.29    16.89
SRL + POS + Chunking + NER                      16.33    16.27    16.36
SRL + POS + Chunking + NER + Synonyms           15.71    15.48    14.76
SRL + POS + Chunking + NER + Language model     14.63    14.50    14.44

[Figure 3: three panels (wsz = 15, 50 and 100) plotting test error (roughly 14-22%) against training epochs for SRL alone and for SRL trained jointly with POS, CHUNK, NER, SYNONYMS and the language model.]
Figure 3. Test error versus number of training epochs over PropBank, for the SRL task alone and SRL jointly trained with various other NLP tasks, using deep NNs.

Results: Language Model Because the language model was trained on a huge database, we first trained it alone. It takes about a week to train on one computer. The embedding obtained in the word lookup-table was extremely good, even for uncommon words, as shown in Table 1. The embedding obtained by training on labeled data from WordNet "synonyms" is also good (results not shown); however, the coverage is not as good as using unlabeled data, e.g. "Dreamcast" is not in the database. The resulting word lookup-table from the language model was used as an initializer of the lookup-table used in MTL experiments with a language model.

Results: SRL Our main interest was improving SRL performance, the most complex of our tasks. In Table 2, we show results comparing the SRL task alone with the SRL task jointly trained with different combinations of the other tasks. For all our experiments, training was achieved in a few epochs (about a day) over the PropBank dataset, as shown in Figure 3. Testing takes 0.015s to label a complete sentence (given one verb).

All MTL experiments performed better than SRL alone. With larger wsz (and thus larger capacity) the relative improvement from using MTL over the task alone becomes larger, which shows MTL is a good way of regularizing: in fact, with MTL, results are fairly stable under capacity changes.

The semi-supervised training of SRL using the language model performs better than other combinations. Our best model performed as low as 14.30% in per-word error rate, which is to be compared to previously published results of 16.36% with an NN architecture (Collobert & Weston, 2007) and 16.54% for a state-of-the-art method based on parse trees (Pradhan et al., 2004)[1]. Further, our system is the only one not to use POS tags or parse tree features.

[1] Our loss function optimized per-word error rate. We note that many SRL results, e.g. the CoNLL 2005 evaluation, use F1 as a standard measure.

Results: POS and Chunking Training takes about 30 min for these tasks alone. Testing time for labeling a complete sentence is about 0.003s. We obtained modest improvements to POS and chunking results using MTL.

Without MTL (for wsz = 50) we obtain 2.95% test error for POS and 4.5% (91.1 F-measure) for chunking. With MTL we obtain 2.91% for POS and 3.8% (92.71 F-measure) for chunking. POS error rates in the 3% range are state-of-the-art. For chunking, although we use a different train/test setup to the CoNLL-2000 shared task (http://www.cnts.ua.ac.be/conll2000/chunking), our system seems competitive with existing systems (better than 9 of the 11 submitted systems). However, our system is the only one that does not use POS tags as input features.

Note, we did not evaluate NER error rates because we used non-gold standard annotations in our setup. Future work will more thoroughly evaluate these tasks.

7. Conclusion

We proposed a general deep NN architecture for NLP. Our architecture is extremely fast, enabling us to take advantage of huge databases (e.g. 631 million words from Wikipedia). We showed our deep NN could be applied to various tasks such as SRL, NER, POS, chunking and language modeling. We demonstrated that learning tasks simultaneously can improve generalization performance. In particular, when training the SRL task jointly with our language model, our architecture achieved state-of-the-art performance in SRL without any explicit syntactic features. This is an important result, given that the NLP community considers syntax as a mandatory feature for semantic extraction (Gildea & Palmer, 2001).

References

Ando, R., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR, 6, 1817-1853.

Bengio, Y., & Ducharme, R. (2001). A neural probabilistic language model. NIPS 13.

Bridle, J. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. F. Soulie and J. Herault (Eds.), Neurocomputing: Algorithms, architectures and applications, 227-236. NATO ASI Series.

Caruana, R. (1997). Multitask learning. Machine Learning, 28, 41-75.

Chapelle, O., Schölkopf, B., & Zien, A. (2006). Semi-supervised learning. Adaptive computation and machine learning. Cambridge, Mass., USA: MIT Press.

Collobert, R., & Weston, J. (2007). Fast semantic extraction using a novel neural network architecture. Proceedings of the 45th Annual Meeting of the ACL (pp. 560-567).

Gildea, D., & Palmer, M. (2001). The necessity of parsing for predicate argument recognition. Proceedings of the 40th Annual Meeting of the ACL, 239-246.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. ICML.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86.

McClosky, D., Charniak, E., & Johnson, M. (2006). Effective self-training for parsing. Proceedings of HLT-NAACL 2006.

Miller, S., Fox, H., Ramshaw, L., & Weischedel, R. (2000). A novel use of statistical parsing to extract information from text. 6th Applied Natural Language Processing Conference.

Musillo, G., & Merlo, P. (2006). Robust parsing of the Proposition Bank. ROMAND 2006: Robust Methods in Analysis of Natural language Data.

Okanohara, D., & Tsujii, J. (2007). A discriminative language model with pseudo-negative samples. Proceedings of the 45th Annual Meeting of the ACL, 73-80.

Palmer, M., Gildea, D., & Kingsbury, P. (2005). The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31, 71-106.

Pradhan, S., Ward, W., Hacioglu, K., Martin, J., & Jurafsky, D. (2004). Shallow semantic parsing using support vector machines. Proceedings of HLT/NAACL-2004.

Rosenfeld, B., & Feldman, R. (2007). Using corpus statistics on entities to improve semi-supervised relation extraction from the web. Proceedings of the 45th Annual Meeting of the ACL, 600-607.

Schwenk, H., & Gauvain, J. (2002). Connectionist language modeling for large vocabulary continuous speech recognition. IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 765-768).

Sutton, C., & McCallum, A. (2005a). Composition of conditional random fields for transfer learning. Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, 748-754.

Sutton, C., & McCallum, A. (2005b). Joint parsing and semantic role labeling. Proceedings of CoNLL-2005 (pp. 225-228).

Sutton, C., McCallum, A., & Rohanimanesh, K. (2007). Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. JMLR, 8, 693-723.

Ueffing, N., Haffari, G., & Sarkar, A. (2007). Transductive learning for statistical machine translation. Proceedings of the 45th Annual Meeting of the ACL, 25-32.

Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37, 328-339.