
Paper Report

Attention is all you need

Naman Jain
160050025
Under
Prof. Pushpak Bhattacharyya
Department of Computer Science & Engineering
Indian Institute of Technology, Bombay
Contents

● Background
● Introduction
● Related Work
● Contributions
● Results
Introduction - 1
● RNNs, LSTMs and GRUs have been at the frontier of state-of-the-art results in
sequence modelling due to their ability to process variable-length data.
● They process the input sequentially, generating the current hidden representation
from the previous hidden state and the current input word.
● This makes them inherently sequential and hard to parallelize, which becomes
critical at longer sequence lengths, as memory constraints limit batching across
examples.
● The authors propose a parallelizable architecture built on a self-attention
mechanism, which is not only faster but also improves performance in Machine
Translation.
Previous Approaches - 1
● There has been previous work on building parallelizable architectures for
sequence tasks such as machine translation, e.g. the Extended Neural GPU,
ConvS2S (Gehring et al.) and ByteNet (Kalchbrenner et al.).
● They use convolutions as the basic building block for parallelizing computation.
● However, the number of operations required to relate signals from two arbitrary
input or output positions grows with the distance between those positions:
linearly for ConvS2S and logarithmically for ByteNet (see the sketch after this
list).
● This makes it more difficult to learn dependencies between distant positions.
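A rough back-of-the-envelope sketch of this scaling (the kernel size and the
layer-count formulas below are illustrative simplifications, not the exact
published architectures):

import math

def layers_to_relate(distance, kernel=3):
    # Illustrative number of stacked layers needed to connect two positions
    # `distance` apart; constants are assumptions for the sketch, not paper values.
    convs2s_like = math.ceil((distance - 1) / (kernel - 1))       # plain convolutions: linear growth
    bytenet_like = math.ceil(math.log(max(distance, 1), kernel))  # dilated convolutions: logarithmic growth
    self_attention = 1                                            # one attention layer relates any two positions
    return convs2s_like, bytenet_like, self_attention

print(layers_to_relate(128))   # -> (64, 5, 1)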
Transformers - 1
● Transformers reduce this computation to a constant number of operations.
● They do so via a self-attention mechanism, which computes intermediate
representations of the input without using sequence-aligned RNNs or convolutions.
● Most sequence-to-sequence models consist of an encoder and a decoder.
● The encoder maps an input sequence of symbol representations (x1, ..., xn) to a
sequence of continuous representations z = (z1, ..., zn); given z, the decoder
then generates an output sequence (y1, ..., ym) of symbols one element at a time.
● At each step the model is auto-regressive, consuming the previously generated
symbols as additional input when generating the next (see the decoding sketch
below).
● The Transformer follows this overall architecture using stacked self-attention
and point-wise, fully connected layers for both the encoder and decoder.
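A minimal sketch of this auto-regressive loop (greedy decoding for simplicity;
`encode` and `decode_step` are hypothetical stand-ins for the encoder and decoder
stacks, and the paper actually uses beam search at inference time):

import numpy as np

def greedy_decode(encode, decode_step, src_tokens, bos_id, eos_id, max_len=50):
    # Encoder: map (x1, ..., xn) to continuous representations z = (z1, ..., zn).
    z = encode(src_tokens)
    # Decoder: generate (y1, ..., ym) one symbol at a time, feeding back its own outputs.
    ys = [bos_id]
    for _ in range(max_len):
        logits = decode_step(z, ys)       # scores over the vocabulary for the next symbol
        next_id = int(np.argmax(logits))
        ys.append(next_id)
        if next_id == eos_id:
            break
    return ys[1:]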
Transformers - 2
Transformers - 3
● Transformers are advantageous because they reduce the path length between any
two positions to a constant.
● The self-attention mechanism proceeds as follows: for every position in the
current sequence of feature vectors (r2 in our example), attention over every
other input (r1, r3, r4) is computed using "Scaled Dot-Product Attention". A
softmax is applied to this attention vector, and the convex combination of the
values weighted by these attention weights is propagated to the next step, along
with a residual connection from r2 to help backpropagation.
● Instead of performing a single attention over the full feature vector, each
feature vector is projected h times into smaller, different latent spaces, and h
separate attention outputs are computed, one per head. These are finally
concatenated together. Multi-head attention allows the model to jointly attend to
information from different representation subspaces at different positions (see
the sketch below).
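A minimal NumPy sketch of the two operations described above, under assumed
shapes (the paper uses d_model = 512, h = 8 and d_k = d_v = 64; the weight
matrices here are illustrative, untrained parameters):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # pairwise compatibility scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # convex combination of the values

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    # Project X into h smaller subspaces, attend in each head, then concatenate.
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):                # one projection triple per head
        heads.append(scaled_dot_product_attention(X @ Wq_i, X @ Wk_i, X @ Wv_i))
    return np.concatenate(heads, axis=-1) @ W_o                # back to d_model dimensions

# Example shapes: n = 4 positions, d_model = 512, h = 8 heads, d_k = 64.
n, d_model, h = 4, 512, 8
d_k = d_model // h
X = np.random.randn(n, d_model)
W_q = [np.random.randn(d_model, d_k) for _ in range(h)]
W_k = [np.random.randn(d_model, d_k) for _ in range(h)]
W_v = [np.random.randn(d_model, d_k) for _ in range(h)]
W_o = np.random.randn(h * d_k, d_model)
out = multi_head_attention(X, W_q, W_k, W_v, W_o)              # shape (4, 512)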
Transformers - 4
Transformers - 5
Transformers - 6 - Encoder/Decoder
Encoder

● The encoder is composed of a stack of N = 6 identical layers.


● Each layer has two sub-layers. The first is a multi-head self-attention
mechanism, and the second is a simple, position-wise fully connected feed-forward
network.

Decoder

● The decoder is also composed of a stack of N = 6 identical layers.


● In addition to the two sub-layers in each encoder layer, the decoder inserts a
third sub-layer, which performs multi-head attention over the output of the
encoder stack.
● The self-attention layer in the decoder stack is modified to prevent positions
from attending to subsequent positions, ensuring that the predictions for
position i can depend only on the known outputs at positions less than i (see the
masking sketch below).
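A small sketch of how that masking is typically realised (an additive mask
applied to the attention scores before the softmax; the exact implementation in
the paper's codebase may differ):

import numpy as np

def causal_mask(n):
    # Disallow attention from position i to any position j > i by adding -inf
    # to those scores before the softmax (their attention weights then become zero).
    upper = np.triu(np.ones((n, n)), k=1)          # 1s strictly above the diagonal
    return np.where(upper == 1, -np.inf, 0.0)

# causal_mask(4):
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]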
Transformers - 7
Transformers - 8
Transformers - 9
Transformers - 10
Position-wise FFN

● The FFN layers in the encoder and decoder contain a fully connected feed-forward
network, which is applied to each position separately and identically. This
consists of two linear transformations with a ReLU activation in between (see the
sketch below).
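A minimal sketch of that position-wise network (shapes follow the paper's
d_model = 512 and d_ff = 2048; the weights here are illustrative, untrained
parameters):

import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently.
    hidden = np.maximum(0.0, x @ W1 + b1)   # first linear transformation + ReLU
    return hidden @ W2 + b2                 # second linear transformation

# Example shapes: n = 4 positions, d_model = 512, d_ff = 2048.
d_model, d_ff, n = 512, 2048, 4
x = np.random.randn(n, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
y = position_wise_ffn(x, W1, b1, W2, b2)    # shape (4, 512)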

Positional Encoding

● To compensate for the absence of positional information (the order of the
sequence), the authors inject information about the relative or absolute position
of the tokens in the sequence by adding positional encodings at the bottom of
both the encoder and decoder stacks (see the sketch below).
● The encodings are sinusoids of varying frequency:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
pos => position
i => dimension of the vector
d_model => size of the embedding
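A small NumPy sketch of these sinusoidal encodings (assuming an even d_model):

import numpy as np

def positional_encoding(max_pos, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_pos)[:, None]                  # (max_pos, 1)
    two_i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)  # (max_pos, d_model / 2)
    pe = np.zeros((max_pos, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = positional_encoding(max_pos=50, d_model=512)      # added to the input embeddings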
Transformers - 11
Transformers - 12
● Learning long-range dependencies is a key challenge in many sequence
transduction tasks. One key factor affecting the ability to learn them is the
length of the paths that forward and backward signals have to traverse in the
network.
● The shorter these paths between any combination of positions in the input and
output sequences, the easier it is to learn long-range dependencies.
● As the paper's comparison table shows, the maximum path length for
self-attention is constant. For very long sentences, one may also use a
restricted self-attention that considers only a neighbourhood of size r around
each position instead of all n positions (see the sketch below).
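A sketch of such a restricted (local) attention mask, where each position attends
only to a window of radius r around itself (the additive-mask formulation is an
assumption for illustration):

import numpy as np

def local_attention_mask(n, r):
    # Position i may attend only to positions j with |i - j| <= r,
    # reducing per-layer cost from O(n^2 * d) towards O(r * n * d).
    idx = np.arange(n)
    allowed = np.abs(idx[:, None] - idx[None, :]) <= r
    return np.where(allowed, 0.0, -np.inf)   # added to the attention scores before the softmax

mask = local_attention_mask(n=10, r=2)       # each row allows at most 5 positions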
Results
References
● Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you
need. In Advances in Neural Information Processing Systems, pages 6000–6010.
● Ashish Vaswani. CS-224N lecture slides,
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture14-transformers.pdf
