The article is structured as follows. In Section 2 we describe each of the NLP tasks we consider, and in Section 3 we define the general architecture that we use to solve all the tasks. Section 4 describes how this architecture is employed for multitask learning on all the labeled tasks we consider, and Section 5 describes the unlabeled task of building a language model in some detail. Section 6 gives experimental results of our system, and Section 7 concludes with a discussion of our results and possible directions for future research.

2. NLP Tasks

We consider six standard NLP tasks in this paper.

Part-Of-Speech Tagging (POS) aims at labeling each word with a unique tag that indicates its syntactic role, e.g. plural noun, adverb, ...

Chunking, also called shallow parsing, aims at labeling segments of a sentence with syntactic constituents such as noun or verb phrase (NP or VP). Each word is assigned only one unique tag, often encoded as a begin-chunk (e.g. B-NP) or inside-chunk tag (e.g. I-NP).

Named Entity Recognition (NER) labels atomic elements in the sentence into categories such as PERSON, COMPANY, or LOCATION.

Semantic Role Labeling (SRL) aims at giving a semantic role to a syntactic constituent of a sentence. In the PropBank (Palmer et al., 2005) formalism one assigns roles ARG0-5 to words that are arguments of a predicate in the sentence, e.g. the following sentence might be tagged [John]ARG0 [ate]REL [the apple]ARG1, where "ate" is the predicate. The precise arguments depend on a verb's frame, and if there are multiple verbs in a sentence some words might have multiple tags. In addition to the ARG0-5 tags, there are 13 modifier tags such as ARGM-LOC (locational) and ARGM-TMP (temporal) that operate in a similar way for all verbs.

Language Models A language model traditionally estimates the probability of the next word being w in a sequence. We consider a different setting: predict whether the given sequence exists in nature, or not, following the methodology of (Okanohara & Tsujii, 2007). This is achieved by labeling real texts as positive examples, and generating fake negative text.

Semantically Related Words (Synonyms) This is the task of predicting whether two words are semantically related (synonyms, holonyms, hypernyms...), which is measured using the WordNet database (http://wordnet.princeton.edu) as ground truth.

Our main interest is SRL, as it is, in our opinion, the most complex of these tasks. We use all these tasks to: (i) show the generality of our proposed architecture; and (ii) improve SRL through multitask learning.

3. General Deep Architecture for NLP

All the NLP tasks above can be seen as tasks assigning labels to words. The traditional NLP approach is: extract from the sentence a rich set of hand-designed features which are then fed to a classical shallow classification algorithm, e.g. a Support Vector Machine (SVM), often with a linear kernel. The choice of features is a completely empirical process, mainly based on trial and error, and the feature selection is task dependent, implying additional research for each new NLP task. Complex tasks like SRL then require a large number of possibly complex features (e.g., extracted from a parse tree), which makes such systems slow and intractable for large-scale applications.

Instead we advocate a deep neural network (NN) architecture, trained in an end-to-end fashion. The input sentence is processed by several layers of feature extraction. The features in deep layers of the network are automatically trained by backpropagation to be relevant to the task. We describe in this section a general deep architecture suitable for all our NLP tasks, and easily generalizable to other NLP tasks.

Our architecture is summarized in Figure 1. The first layer extracts features for each word. The second layer extracts features from the sentence treating it as a sequence with local and global structure (i.e., it is not treated like a bag of words). The following layers are classical NN layers.

3.1. Transforming Indices into Vectors

As our architecture deals with raw words and not engineered features, the first layer has to map words into real-valued vectors for processing by subsequent layers of the NN. For simplicity (and efficiency) we consider words as indices in a finite dictionary of words D ⊂ N.

Lookup-Table Layer Each word i ∈ D is embedded into a d-dimensional space using a lookup table LT_W(·):

LT_W(i) = W_i,

where W ∈ R^{d×|D|} is a matrix of parameters to be learnt, W_i ∈ R^d is the ith column of W, and d is the word vector size (wsz) to be chosen by the user. In the first layer of our architecture an input sentence {s_1, s_2, ..., s_n} of n words in D is thus transformed into a series of vectors {W_{s_1}, W_{s_2}, ..., W_{s_n}} by applying the lookup table to each of its words.
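To make the lookup-table layer concrete, here is a minimal sketch (our illustration, not the authors' code; the toy dictionary, the vector size, and the use of Python/PyTorch are assumptions). nn.Embedding stores W as a |D| × d matrix, so its row i plays the role of the column W_i above:

```python
import torch
import torch.nn as nn

# Hypothetical toy dictionary D; in the paper |D| is the full word vocabulary.
word_to_index = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
d = 50  # word vector size wsz, chosen by the user

# W in R^{d x |D|}: one trainable d-dimensional vector per word index.
lookup = nn.Embedding(num_embeddings=len(word_to_index), embedding_dim=d)

sentence = ["the", "cat", "sat", "on", "the", "mat"]
indices = torch.tensor([word_to_index[w] for w in sentence])  # s_1 ... s_n

x = lookup(indices)   # n x d matrix: (W_{s_1}, ..., W_{s_n})
print(x.shape)        # torch.Size([6, 50])
# W is an ordinary parameter, so it is trained by backpropagation
# together with the rest of the network.
```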
Variations on Word Representations In practice, one may want to introduce some basic pre-processing, such as word-stemming or dealing with upper and lower case. In our experiments, we limited ourselves to converting all words to lower case, and representing the capitalization as a separate feature (yes or no).

When a word is decomposed into K elements (features), it can be represented as a tuple i = {i^1, i^2, ..., i^K}, where each element i^k is an index in its own dictionary D^k and has its own lookup table LT_{W^k}(·), with parameters W^k ∈ R^{d^k × |D^k|}, where d^k ∈ N is a user-specified vector size. A word i is then embedded in a d = Σ_k d^k dimensional space by concatenating all lookup-table outputs:

LT_{W^1,...,W^K}(i)^T = (LT_{W^1}(i^1)^T, ..., LT_{W^K}(i^K)^T).

[Figure 1. A general deep NN architecture for NLP: lookup tables LT_{w^1}, ..., LT_{w^K} feed a convolution layer (#hidden units × (n−2) outputs) followed by a max-over-time layer (#hidden units outputs). Given an input sentence, the NN outputs class probabilities for one chosen word. A classical window approach is a special case where the input has a fixed size ksz, and the TDNN kernel size is ksz; in that case the TDNN layer outputs only one vector and the Max layer performs an identity.]

Classifying with Respect to a Predicate In a complex task like SRL, the class label of each word in a sentence depends on a given predicate. It is thus necessary to encode in the NN architecture which predicate we are considering in the sentence. We propose to add a feature for each word that encodes its relative distance to the chosen predicate. For the ith word in the sentence, if the predicate is at position pos_p we use an additional lookup table LT_{dist_p}(i − pos_p).
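As an illustrative sketch (not the paper's implementation; the feature set, dictionary sizes, and the clipping of distances to a maximum range are our assumptions), a word can be embedded by concatenating one lookup-table output per element, here the lower-cased word, a capitalization flag, and the relative distance to the predicate:

```python
import torch
import torch.nn as nn

class MultiFeatureLookup(nn.Module):
    """Embed a word as the concatenation of K lookup-table outputs."""
    def __init__(self, n_words, d_word=50, d_caps=5, d_dist=5, max_dist=20):
        super().__init__()
        self.max_dist = max_dist
        self.word = nn.Embedding(n_words, d_word)           # LT_{W^1}: lower-cased word
        self.caps = nn.Embedding(2, d_caps)                 # LT_{W^2}: capitalized or not
        self.dist = nn.Embedding(2 * max_dist + 1, d_dist)  # LT_{dist_p}: distance to predicate

    def forward(self, word_idx, caps_flag, rel_dist):
        # Clip the signed distance i - pos_p into the table's index range (an assumption).
        rel_dist = rel_dist.clamp(-self.max_dist, self.max_dist) + self.max_dist
        parts = [self.word(word_idx), self.caps(caps_flag), self.dist(rel_dist)]
        return torch.cat(parts, dim=-1)   # d = d_word + d_caps + d_dist

emb = MultiFeatureLookup(n_words=100000)
# "John ate the apple", predicate "ate" at position 1
word_idx  = torch.tensor([10, 42, 7, 99])
caps_flag = torch.tensor([1, 0, 0, 0])
rel_dist  = torch.arange(4) - 1            # i - pos_p for each word i
print(emb(word_idx, caps_flag, rel_dist).shape)  # torch.Size([4, 60])
```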
3.2. Variable Sentence Length

The lookup table layer maps the original sentence into a sequence x(·) of n identically sized vectors:

(x_1, x_2, ..., x_n),  ∀t  x_t ∈ R^d.    (1)

Obviously the size n of the sequence varies depending on the sentence. Unfortunately normal NNs are not able to handle sequences of variable length.

The simplest solution is to use a window approach: consider a window of fixed size ksz around each word we want to label. While this approach works with great success on simple tasks like POS, it fails on more complex tasks like SRL. In the latter case it is common for the role of a word to depend on words far away in the sentence, and hence outside of the considered window.

When modeling long-distance dependencies is important, Time-Delay Neural Networks (TDNNs) (Waibel et al., 1989) are a better choice. Here, time refers to the idea that a sequence has a notion of order. A TDNN reads the sequence in an online fashion: at time t ≥ 1, one sees x_t, the tth word in the sentence. A classical TDNN layer performs a convolution on a given sequence x(·), outputting another sequence o(·) whose value at time t is:

o(t) = Σ_{j=1−t}^{n−t} L_j · x_{t+j},    (2)

where L_j ∈ R^{n_hu × d} (−n ≤ j ≤ n) are the parameters of the layer (with n_hu hidden units) trained by backpropagation. One usually constrains this convolution by defining a kernel width, ksz, which enforces

∀ |j| > (ksz − 1)/2,  L_j = 0.    (3)

A classical window approach only considers words in a window of size ksz around the word to be labeled. Instead, if we use (2) and (3), a TDNN considers at the same time all windows of ksz words in the sentence.

TDNN layers can also be stacked so that one can extract local features in lower layers, and more global features in subsequent ones. This is an approach typically used in convolutional networks for vision tasks, such as the LeNet architecture (LeCun et al., 1998).
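As a rough correspondence (our sketch, not the authors' code), the convolution of equation (2) under the kernel-width constraint (3) behaves like a standard 1-D convolution of width ksz slid along the sequence of word vectors; the sizes below are illustrative:

```python
import torch
import torch.nn as nn

d, n_hu, ksz = 60, 100, 5   # word-feature size, hidden units, kernel width
n = 12                      # sentence length (varies per sentence)

# With L_j = 0 for |j| > (ksz-1)/2, equation (2) reduces to a 1-D convolution
# over time with a kernel spanning ksz word vectors.
tdnn = nn.Conv1d(in_channels=d, out_channels=n_hu, kernel_size=ksz,
                 padding=(ksz - 1) // 2)

x = torch.randn(1, n, d)            # x(1), ..., x(n) from the lookup tables
o = tdnn(x.transpose(1, 2))         # Conv1d expects (batch, channels, time)
print(o.shape)                      # torch.Size([1, 100, 12]): o(t) for every t

# A window approach would compute only one such o(t), for the word to be labeled;
# the TDNN produces all n of them, one per window in the sentence.
```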
We then add to our architecture a layer which captures the most relevant features over the sentence by feeding the TDNN layer(s) into a Max Layer, which takes the maximum over time (over the sentence) in (2) for each of the n_hu output features.

As the layer's output is of fixed dimension (independent of sentence size), subsequent layers can be classical NN layers. Provided we have a way to indicate to our architecture the word to be labeled, it is then able to use features extracted from all windows of ksz words in the sentence to compute the label of one word of interest.

We indicate the word to be labeled to the NN with an additional lookup-table, as suggested in Section 3.1. Considering the word at position pos_w we encode the relative distance between the ith word in the sentence and this word using a lookup-table LT_{dist_w}(i − pos_w).

An alternative would be to combine the word features with the positions of the predicate and of the word to be labeled through a dedicated function C(·) applied to the word representations; however, such a function limits the dependencies between words it can model. Also, C(·) is itself a NN inside a NN. Not only does one have to carefully design this additional architecture, but it also makes the approach more complicated to train and implement. Integrating all the desired features in x(·) (including the predicate position) via lookup-tables makes our approach simpler, more general and easier to tune.
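Putting the layers of this section together, a minimal end-to-end sketch might look as follows (the layer sizes, the tanh nonlinearity, the single hidden layer, and the use of PyTorch are our assumptions for illustration, not the authors' exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceTagger(nn.Module):
    """Lookup tables -> TDNN -> max over time -> classical NN layer -> softmax, for one chosen word."""
    def __init__(self, n_words, n_classes, d=50, d_pos=5, n_hu=100, ksz=5, max_dist=20):
        super().__init__()
        self.max_dist = max_dist
        self.words = nn.Embedding(n_words, d)                  # LT_W
        self.dist_w = nn.Embedding(2 * max_dist + 1, d_pos)    # LT_dist_w: distance to word to label
        self.dist_p = nn.Embedding(2 * max_dist + 1, d_pos)    # LT_dist_p: distance to predicate
        feat = d + 2 * d_pos
        self.tdnn = nn.Conv1d(feat, n_hu, ksz, padding=(ksz - 1) // 2)
        self.hidden = nn.Linear(n_hu, n_hu)                    # classical NN layer
        self.out = nn.Linear(n_hu, n_classes)

    def forward(self, word_idx, pos_w, pos_p):
        i = torch.arange(word_idx.size(0))
        dw = (i - pos_w).clamp(-self.max_dist, self.max_dist) + self.max_dist
        dp = (i - pos_p).clamp(-self.max_dist, self.max_dist) + self.max_dist
        x = torch.cat([self.words(word_idx), self.dist_w(dw), self.dist_p(dp)], dim=-1)
        o = self.tdnn(x.t().unsqueeze(0))            # (1, n_hu, n): one feature vector per window
        m = o.max(dim=2).values.squeeze(0)           # max over time: fixed size whatever n is
        h = torch.tanh(self.hidden(m))
        return F.log_softmax(self.out(h), dim=-1)    # log class probabilities for the chosen word

tagger = SentenceTagger(n_words=100000, n_classes=20)
words = torch.randint(0, 100000, (9,))               # a 9-word sentence
print(tagger(words, pos_w=3, pos_p=1).shape)         # torch.Size([20])
```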
4. Multitasking with Deep NN

Multitask learning (MTL) is the procedure of learning several tasks at the same time with the aim of mutual benefit. This is an old idea in machine learning; a good overview, especially focusing on NNs, can be found in (Caruana, 1997). In NLP, where many tasks are closely related, several techniques have already been explored.

Cascading Features The most obvious way to achieve MTL is to train one task, and then use this task as a feature for another task. This is a very common approach in NLP. For example, in the case of SRL, several methods (e.g., (Pradhan et al., 2004)) train a POS classifier and use the output as features for training a parser, which is then used for building features for SRL itself. Unfortunately, tasks (features) are learnt separately in such a cascade, thus propagating errors from one classifier to the next.

Shallow Joint Training If one possesses a dataset labeled for several tasks, it is then possible to train these tasks jointly in a shallow manner: one unique model can predict all task labels at the same time. Using this scheme, the authors of (Sutton et al., 2007) proposed a conditional random field approach where they showed improvements from joint training on POS tagging and noun-phrase chunking tasks. However the requirement of jointly annotated data is a limitation, as this is often not the case. Similarly, in (Miller et al., 2000) NER, parsing and relation extraction were jointly trained in a statistical parsing model achieving improved performance on all tasks. This work has the same joint labeling requirement problem, which the authors avoided by using a predictor to fill in the missing annotations.

In (Sutton & McCallum, 2005a) the authors showed that one could learn the tasks independently, hence using different training sets, by only leveraging predictions jointly in a test-time decoding step, and still obtain improved results. The problem is, however, that this will not make use of the shared tasks at training time. The NN approach used here seems more flexible in these regards.

Finally, the authors of (Musillo & Merlo, 2006) made an attempt at improving the semantic role labeling task by joint inference with syntactic parsing, but their results are not state-of-the-art. The authors of (Sutton & McCallum, 2005b) also describe a negative result at the same joint task.

In our approach, tasks are instead trained jointly with a deep NN that shares its word lookup-table across tasks, as illustrated in Figure 2.

[Figure 2. Example of deep multitasking with NN. Task 1 and Task 2 are two tasks trained with the architecture presented in Figure 1: each has its own lookup tables, convolution, max, classical NN layer(s) and softmax. One lookup-table (in black) is shared (the other lookup-tables and layers are task specific). The principle is the same with more than two tasks.]
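A minimal sketch of this sharing scheme (task names, label counts and layer sizes are placeholders, and the per-word distance features of Section 3.1 are omitted for brevity): two task networks reuse a single embedding object, so a gradient step on either task updates the same word representations.

```python
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    """Two task networks as in Figure 2: the word lookup-table is shared, the rest is task specific."""
    def __init__(self, n_words, task_classes, d=50, n_hu=100, ksz=5):
        super().__init__()
        self.lookup = nn.Embedding(n_words, d)   # shared lookup-table (black in Figure 2)
        self.conv = nn.ModuleDict({t: nn.Conv1d(d, n_hu, ksz, padding=(ksz - 1) // 2)
                                   for t in task_classes})
        self.clf = nn.ModuleDict({t: nn.Linear(n_hu, c) for t, c in task_classes.items()})

    def forward(self, task, word_idx):
        x = self.lookup(word_idx)                 # gradients from every task update this layer
        o = self.conv[task](x.t().unsqueeze(0))   # task-specific convolution
        m = o.max(dim=2).values.squeeze(0)        # task-specific max over time
        return self.clf[task](torch.tanh(m))      # task-specific classical layer / classifier

model = MultiTaskTagger(n_words=100000, task_classes={"srl": 20, "pos": 45})
sentence = torch.randint(0, 100000, (7,))
# Training alternates between tasks: sample a task, sample one of its examples,
# take one gradient step; every step also moves the shared word representations.
print(model("srl", sentence).shape, model("pos", sentence).shape)
```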
5. Leveraging Unlabeled Data

Labeling a dataset can be an expensive task, especially in NLP where labeling often requires skilled linguists. On the other hand, unlabeled data is abundant and freely available on the web. Leveraging unlabeled data in NLP tasks seems to be a very attractive, and challenging, goal.

In our MTL framework presented in Figure 2, there is nothing stopping us from jointly training supervised tasks on labeled data and unsupervised tasks on unlabeled data. We now present an unsupervised task suitable for NLP.

Language Model We consider a language model based on a simple fixed window of text of size ksz using our NN architecture, given in Figure 2. We trained our language model to discriminate a two-class classification task: whether the word in the middle of the input window is related to its context or not. We construct a dataset for this task by considering all possible ksz windows of text from the whole of English Wikipedia (http://en.wikipedia.org). Positive examples are windows from Wikipedia; negative examples are the same windows but where the middle word has been replaced by a random word.

We train this problem with a ranking-type cost:

Σ_{s∈S} Σ_{w∈D} max(0, 1 − f(s) + f(s^w)),    (4)

where S is the set of sentence windows of text, D is the dictionary of words, f(·) represents our NN architecture without the softmax layer, and s^w is a sentence window where the middle word has been replaced by the word w. We sample this cost online w.r.t. (s, w).
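A small sketch of one stochastic step on the cost (4) (the window size, the scoring network, and sampling a single negative word per window are illustrative assumptions on our part):

```python
import torch
import torch.nn as nn

ksz, d, n_words = 11, 50, 100000
lookup = nn.Embedding(n_words, d)    # word lookup-table, trained by this unsupervised task
scorer = nn.Sequential(nn.Linear(ksz * d, 100), nn.Tanh(), nn.Linear(100, 1))

def f(window_idx):
    """f(s): score of a ksz-word window (the NN without its softmax output layer)."""
    return scorer(lookup(window_idx).view(-1)).squeeze()

def ranking_cost(window_idx, w):
    s_w = window_idx.clone()
    s_w[ksz // 2] = w                                       # replace the middle word by w
    return torch.clamp(1 - f(window_idx) + f(s_w), min=0)   # max(0, 1 - f(s) + f(s^w))

# One online step: sample a genuine text window s and a random word w from D.
s = torch.randint(0, n_words, (ksz,))                       # stands in for a real Wikipedia window
w = torch.randint(0, n_words, (1,)).item()
loss = ranking_cost(s, w)
loss.backward()   # gradients flow into both the scorer and the lookup-table
```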
We will see in our experiments that the features (embedding) learnt by the lookup-table layer of this NN cluster semantically similar words. These discovered features will prove very useful for our shared tasks.

Previous Work on Language Models (Bengio & Ducharme, 2001) and (Schwenk & Gauvain, 2002) already presented very similar language models. However, their goal was to give a probability of a word given the previous ones in a sentence. Here, we only want to have a good representation of words: we take advantage of the complete context of a word (before and after it).
Table 2. A deep architecture for SRL improves by learning auxiliary tasks that share the first layer that represents words as wsz-dimensional vectors. We give word error rates (%) for wsz = 15, 50 and 100 and various shared tasks.

                                               wsz=15   wsz=50   wsz=100
SRL                                             16.54    17.33    18.40
SRL + POS                                       15.99    16.57    16.53
SRL + Chunking                                  16.42    16.39    16.48
SRL + NER                                       16.67    17.29    17.21
SRL + Synonyms                                  15.46    15.17    15.17
SRL + Language model                            14.42    14.30    14.46
SRL + POS + Chunking                            16.46    15.95    16.41
SRL + POS + NER                                 16.45    16.89    16.29
SRL + POS + Chunking + NER                      16.33    16.36    16.27
SRL + POS + Chunking + NER + Synonyms           15.71    14.76    15.48
SRL + POS + Chunking + NER + Language model     14.63    14.44    14.50
[Figure 3. Test error (%), ranging from about 18% down to 14%, versus number of training epochs over PropBank, for the SRL task alone and SRL jointly trained with various other NLP tasks, using deep NNs.]
6. Experiments

Results: Language Model Because the language model was trained on a huge database we first trained it alone. It takes about a week to train on one computer. The embedding obtained in the word lookup-table was extremely good, even for uncommon words, as shown in Table 1. The embedding obtained by training on labeled data from WordNet synonyms is also good (results not shown); however the coverage is not as good as when using unlabeled data, e.g. "Dreamcast" is not in the database.

The resulting word lookup-table from the language model was used as an initializer of the lookup-table used in MTL experiments with a language model.

Results: SRL Our main interest was improving SRL performance, the most complex of our tasks. In Table 2, we show results comparing the SRL task alone with the SRL task jointly trained with different combinations of the other tasks. For all our experiments, training was achieved in a few epochs (about a day) over the PropBank dataset as shown in Figure 3. Testing takes 0.015s to label a complete sentence (given one verb).

All MTL experiments performed better than SRL alone. With larger wsz (and thus larger capacity) the relative improvement from using MTL becomes larger compared to the task alone, which shows MTL is a good way of regularizing: in fact with MTL results are fairly stable with capacity changes.

The semi-supervised training of SRL using the language model performs better than other combinations. Our best model performed as low as 14.30% in per-word error rate, which is to be compared to previously published results of 16.36% with an NN architecture (Collobert & Weston, 2007) and 16.54% for a state-of-the-art method based on parse trees (Pradhan et al., 2004)¹. Further, our system is the only one not to use POS tags or parse tree features.

¹ Our loss function optimized per-word error rate. We note that many SRL results, e.g. the CoNLL 2005 evaluation, use F1 as a standard measure.

Results: POS and Chunking Training takes about 30 min for these tasks alone. Testing time for labeling a complete sentence is about 0.003s. We obtained modest improvements to POS and chunking results using MTL. Without MTL (for wsz = 50) we obtain 2.95% test error for POS and 4.5% (91.1 F-measure) for chunking. With MTL we obtain 2.91% for POS and 3.8% (92.71 F-measure) for chunking. POS error rates in the 3% range are state-of-the-art. For chunking, although we use a different train/test setup to the CoNLL-2000 shared task (http://www.cnts.ua.ac.be/conll2000/chunking), our system seems competitive with existing systems (better than 9 of the 11 submitted systems). However, our system is the only one that does not use POS tags as input features.
Note, we did not evaluate NER error rates because we used non-gold-standard annotations in our setup. Future work will more thoroughly evaluate these tasks.

7. Conclusion

We proposed a general deep NN architecture for NLP. Our architecture is extremely fast, enabling us to take advantage of huge databases (e.g. 631 million words from Wikipedia). We showed our deep NN could be applied to various tasks such as SRL, NER, POS, chunking and language modeling. We demonstrated that learning tasks simultaneously can improve generalization performance. In particular, when training the SRL task jointly with our language model our architecture achieved state-of-the-art performance in SRL without any explicit syntactic features. This is an important result, given that the NLP community considers syntax as a mandatory feature for semantic extraction (Gildea & Palmer, 2001).

References

Ando, R., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR, 6, 1817–1853.

Bengio, Y., & Ducharme, R. (2001). A neural probabilistic language model. NIPS 13.

Bridle, J. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. F. Soulie and J. Herault (Eds.), Neurocomputing: Algorithms, architectures and applications, 227–236. NATO ASI Series.

Caruana, R. (1997). Multitask learning. Machine Learning, 28, 41–75.

Chapelle, O., Schölkopf, B., & Zien, A. (2006). Semi-supervised learning. Adaptive Computation and Machine Learning. Cambridge, MA, USA: MIT Press.

Collobert, R., & Weston, J. (2007). Fast semantic extraction using a novel neural network architecture. Proceedings of the 45th Annual Meeting of the ACL (pp. 560–567).

Gildea, D., & Palmer, M. (2001). The necessity of parsing for predicate argument recognition. Proceedings of the 40th Annual Meeting of the ACL, 239–246.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. ICML.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86.

McClosky, D., Charniak, E., & Johnson, M. (2006). Effective self-training for parsing. Proceedings of HLT-NAACL 2006.

Miller, S., Fox, H., Ramshaw, L., & Weischedel, R. (2000). A novel use of statistical parsing to extract information from text. 6th Applied Natural Language Processing Conference.

Musillo, G., & Merlo, P. (2006). Robust parsing of the Proposition Bank. ROMAND 2006: Robust Methods in Analysis of Natural Language Data.

Okanohara, D., & Tsujii, J. (2007). A discriminative language model with pseudo-negative samples. Proceedings of the 45th Annual Meeting of the ACL, 73–80.

Palmer, M., Gildea, D., & Kingsbury, P. (2005). The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31, 71–106.

Pradhan, S., Ward, W., Hacioglu, K., Martin, J., & Jurafsky, D. (2004). Shallow semantic parsing using support vector machines. Proceedings of HLT/NAACL-2004.

Rosenfeld, B., & Feldman, R. (2007). Using corpus statistics on entities to improve semi-supervised relation extraction from the web. Proceedings of the 45th Annual Meeting of the ACL, 600–607.

Schwenk, H., & Gauvain, J. (2002). Connectionist language modeling for large vocabulary continuous speech recognition. IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 765–768).

Sutton, C., & McCallum, A. (2005a). Composition of conditional random fields for transfer learning. Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 748–754.

Sutton, C., & McCallum, A. (2005b). Joint parsing and semantic role labeling. Proceedings of CoNLL-2005 (pp. 225–228).

Sutton, C., McCallum, A., & Rohanimanesh, K. (2007). Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. JMLR, 8, 693–723.

Ueffing, N., Haffari, G., & Sarkar, A. (2007). Transductive learning for statistical machine translation. Proceedings of the 45th Annual Meeting of the ACL, 25–32.

Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37, 328–339.