Naman Jain
160050025
Under the guidance of
Prof. Pushpak Bhattacharyya
Department of Computer Science & Engineering
Indian Institute of Technology, Bombay
Contents
● Background
● Introduction
● Related Work
● Contributions
● Results
Introduction - 1
● RNNs, LSTMs, and GRUs have pushed the state-of-the-art in sequence
modelling due to their ability to process variable-length data
● They process the input sequentially, generating the current hidden
representation from the previous hidden state and the current input token
● This makes them inherently sequential and non-parallelizable, which
becomes critical at longer sequence lengths, as memory constraints limit
batching across examples (see the sketch after this list)
● The authors propose a parallelizable architecture based on a self-attention
mechanism, which is not only faster but also improves performance in
Machine Translation
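A minimal sketch of this sequential dependency, assuming a plain tanh RNN cell (illustrative only, not the exact models above):

    import numpy as np

    def rnn_forward(x, W_h, W_x, b):
        # Plain RNN over a sequence x of shape (seq_len, d_in).
        # Each hidden state depends on the previous one, so the
        # time loop cannot be parallelized across positions.
        h = np.zeros(W_h.shape[0])
        states = []
        for x_t in x:  # inherently sequential loop
            h = np.tanh(W_h @ h + W_x @ x_t + b)
            states.append(h)
        return np.stack(states)  # (seq_len, d_h)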
Previous Approaches - 1
● There has been previous work on parallelizable architectures for sequence
tasks such as machine translation, e.g., the Extended Neural GPU,
ConvS2S (Gehring et al.), and ByteNet (Kalchbrenner et al.)
● They use convolutions as basic building blocks for parallelizing computations
● However, the number of operations required to relate signals from two
arbitrary input or output positions grows with the distance between
positions: linearly for ConvS2S and logarithmically for ByteNet
● This makes it more difficult to learn dependencies between distant
positions (see the sketch below)
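A rough illustration of this growth in the depth needed to connect two positions; the kernel width and the receptive-field arithmetic are simplifying assumptions:

    import math

    def conv_layers_needed(distance, kernel=3):
        # Stacked standard convolutions (ConvS2S-style): the receptive
        # field grows by (kernel - 1) per layer, so relating positions
        # `distance` apart needs a number of layers linear in distance.
        return math.ceil(distance / (kernel - 1))

    def dilated_conv_layers_needed(distance, kernel=3):
        # Dilated convolutions (ByteNet-style): the receptive field
        # grows exponentially with depth, so depth is logarithmic.
        return max(1, math.ceil(math.log(distance, kernel)))

    print(conv_layers_needed(1024))          # 512
    print(dilated_conv_layers_needed(1024))  # 7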
Transformers - 1
● Transformers reduce this to a constant number of operations between any
two positions
● They do so via a self-attention mechanism, which computes intermediate
representations of the input without using sequence-aligned RNNs or
convolutions
● Most sequence-to-sequence models comprise an encoder and a decoder
● The encoder maps an input sequence of symbol representations (x1, ..., xn)
to a sequence of continuous representations z; given z, the decoder then
generates an output sequence (y1, ..., ym) of symbols one element at a time
● At each step the model is auto-regressive, consuming the previously
generated symbols as additional input when generating the next (see the
sketch after this list)
● The Transformer follows this overall architecture, using stacked
self-attention and point-wise, fully connected layers for both the encoder
and decoder
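A minimal sketch of this auto-regressive loop; encode, decode_step, and the token ids are hypothetical placeholders, not the paper's API:

    BOS, EOS = 1, 2  # assumed start/end-of-sequence token ids

    def greedy_translate(src_tokens, encode, decode_step, max_len=50):
        # Auto-regressive generation: each output symbol is produced
        # from the encoder output z and all previously generated symbols.
        z = encode(src_tokens)              # continuous representations z
        out = [BOS]
        for _ in range(max_len):
            next_tok = decode_step(z, out)  # consumes previous outputs
            out.append(next_tok)
            if next_tok == EOS:
                break
        return out[1:]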
Transformers - 2
Transformers - 3
● Transformers are advantageous because they reduce the ‘path length’
between any two positions to a constant
● The self-attention mechanism proceeds as follows: for every feature vector
in the input (r2 in our example), attention is computed using
“Scaled Dot-Product Attention” against every other input in the
sequence (r1, r3, r4). A softmax is applied to this attention vector, and a
convex combination weighted by these attention scores is propagated to
the next step, along with a residual connection carrying the r2 features to
improve backpropagation
● Instead of performing a single attention over a feature vector, each feature
vector is projected “h” times into smaller, different latent spaces, and “h”
different attention branches are computed, one per head. Finally, these are
concatenated. Multi-head attention allows the model to jointly attend to
information from different representation subspaces at different positions
(see the sketch below)
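A minimal numpy sketch of scaled dot-product and multi-head attention, under simplifying assumptions (a single sequence, random projection weights, no masking):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)       # pairwise scores, (n, n)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)    # row-wise softmax
        return w @ V                          # convex combination of values

    def multi_head_attention(X, h=8):
        # Project X into h smaller subspaces, attend in each head,
        # then concatenate the heads and project back.
        n, d_model = X.shape
        d_k = d_model // h
        rng = np.random.default_rng(0)        # illustrative random weights
        heads = []
        for _ in range(h):
            W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
            heads.append(scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v))
        W_o = rng.normal(size=(d_model, d_model))
        return np.concatenate(heads, axis=-1) @ W_o  # (n, d_model)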
Transformers - 4
Transformers - 5
Transformers - 6 - Encoder/Decoder
Encoder
Decoder
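A minimal sketch of one encoder layer, reusing multi_head_attention from the sketch above; the residual-plus-layer-norm structure and ReLU feed-forward network follow the paper, but the weights and dimensions are illustrative assumptions:

    import numpy as np

    def layer_norm(x, eps=1e-6):
        # Normalize each position's features to zero mean, unit variance.
        mu = x.mean(axis=-1, keepdims=True)
        sigma = x.std(axis=-1, keepdims=True)
        return (x - mu) / (sigma + eps)

    def encoder_layer(X, W1, b1, W2, b2):
        # Sublayer 1: multi-head self-attention + residual + layer norm.
        X = layer_norm(X + multi_head_attention(X))
        # Sublayer 2: position-wise ReLU feed-forward + residual + layer norm.
        ffn = np.maximum(0, X @ W1 + b1) @ W2 + b2
        return layer_norm(X + ffn)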
Positional Encoding
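Since the model contains no recurrence or convolution, the paper injects position information by adding fixed sinusoidal encodings to the input embeddings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch (assumes even d_model):

    import numpy as np

    def positional_encoding(max_len, d_model):
        # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
        # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
        pos = np.arange(max_len)[:, None]      # (max_len, 1)
        i = np.arange(0, d_model, 2)[None, :]  # (1, d_model // 2)
        angles = pos / np.power(10000, i / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe                              # added to input embeddings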