
Machine Translation: From Statistical to Modern Deep-Learning Practices

Siddhant Srivastava                  Anupam Shukla                 Ritu Tiwari
Soft Computing Laboratory            IIIT Pune                     Soft Computing Laboratory
ABV-IIITM Gwalior                    dranupamshukla@gmail.com      ABV-IIITM Gwalior
siddhant.srivastava11@gmail.com                                    tiwariritu2@gmail.com

Abstract

Machine translation (MT) is an area of study in natural language processing which deals with the automatic translation of human language from one language to another by computer. With a rich research history spanning nearly three decades, machine translation is one of the most sought-after areas of research in the linguistics and computational community. In this paper, we investigate the models based on deep learning that have achieved substantial progress in recent years and are becoming the prominent method in MT. We discuss the two main deep-learning-based machine translation approaches: component- or domain-level methods, which leverage deep learning models to enhance the efficacy of Statistical Machine Translation (SMT), and end-to-end deep learning models, which use neural networks to find correspondences between the source and target languages through the encoder-decoder architecture. We conclude the paper by providing a timeline of the major research problems solved by researchers and a comprehensive overview of present areas of research in Neural Machine Translation.

1 Introduction

Machine translation, a field of study under natural language processing, targets translating natural language automatically using machines. Data-driven machine translation has become the dominant line of research due to the accessibility of large parallel corpora. Its main objective is to translate unseen source-language sentences, given that the systems learn translation knowledge from sentence-aligned bilingual training data.

Statistical Machine Translation (SMT) is a data-driven approach which uses probabilistic models to capture the translation process. Early models in SMT were generative models taking the word as the basic entity (Brown 1993), followed by maximum-entropy-based discriminative models using features learned from sentences (Och 2002) and by simple and hierarchical phrase models (Koehn 2003, Chiang 2007). These methods have been widely used since 2002, even though discriminative models face the challenge of data sparsity: discrete word-based representations make SMT susceptible to learning poor estimates on account of low-count events. Moreover, designing features for SMT manually is a difficult task and requires domain knowledge, which is hard to obtain given the variety and complexity of natural languages.

Later years have witnessed the exceptional success of deep learning applications in machine translation. Deep learning approaches have outstripped statistical methods in almost all sub-fields of MT and have become the de facto method in both academia and industry. In this paper, we discuss the two domains where deep learning has been employed in MT. We briefly discuss component- or domain-wise deep learning methods for machine translation (Devlin 2014), which use deep learning models to improve the effectiveness of the different components of SMT, including the language model, the translation model, and the reordering model. Our main focus is on end-to-end deep learning models for machine translation (Sutskever 2014, Bahdanau 2014), which use neural networks to extract the correspondence between a source and a target language directly, in a holistic manner, without any hand-crafted features. These models are now recognised as Neural Machine Translation (NMT).
The paper is arranged as follows. We first introduce the basic definitions and objectives of machine translation. The next section gives a brief discussion of component-wise deep learning models and how they improve SMT-based systems. Next, we focus on end-to-end Neural Machine Translation, mention the challenges faced by these models, and discuss the currently employed encoder-decoder model, concentrating mainly on the network architectures used in this paradigm, which are the current focus of research. We conclude by giving a timeline of the major breakthroughs in NMT and proposing future research directions in neural architecture development.

2 Machine Translation: Essentials

Let x denote a source-language sentence and y a target-language sentence. Given a set of model parameters θ, the aim of any machine translation algorithm is to find the translation ŷ with maximum probability:

ŷ = argmax_y P(y|x; θ).   (1)

The decision rule can be rewritten using Bayes' rule as (Brown 1993):

ŷ = argmax_y [P(y; θ_lm) P(x|y; θ_tm)] / P(x),   (2)

ŷ = argmax_y P(y; θ_lm) P(x|y; θ_tm),   (3)

where P(y; θ_lm) is called the language model and P(x|y; θ_tm) is called the translation model. The translation model, in turn, is defined as a generative model that is decomposed via latent structures:

P(x|y; θ_tm) = Σ_z P(x, z|y; θ_tm),   (4)

where z denotes latent structures such as the word alignment between the source and the target sentence. The key problem with this approach is that it is hard to generalize because of the dependencies among sub-models. To introduce additional knowledge sources into SMT, (Och 2002) uses log-linear models:

P(y, x|θ) = Σ_z exp(θ · ψ(x, y, z)) / Σ_{y'} Σ_{z'} exp(θ · ψ(x, y', z')),   (5)

where ψ(x, y, z) is a set of features describing the translation process and θ denotes the corresponding weight of each feature.
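As a concrete illustration of the log-linear formulation in Eq. (5), the short Python sketch below scores two hypothetical candidate translations, each with two invented latent derivations and made-up feature values, and normalizes the scores into probabilities. The feature names, values, and weights are assumptions chosen only for illustration, not values from any real system.

```python
import numpy as np

# Toy feature vectors psi(x, y, z) for two candidate translations of one
# source sentence; each candidate has two latent derivations z.
# Assumed features: [log LM score, log TM score, phrase penalty, word penalty]
candidates = {
    "that is good": np.array([[-2.1, -1.3, -2.0, -3.0],
                              [-2.1, -1.6, -3.0, -3.0]]),
    "this is well": np.array([[-3.4, -1.9, -2.0, -3.0],
                              [-3.4, -2.2, -3.0, -3.0]]),
}
theta = np.array([1.0, 1.0, 0.3, 0.1])   # feature weights, normally tuned on held-out data

# Unnormalized score of a candidate: sum over latent derivations z of exp(theta . psi)
unnorm = {y: np.exp(feats @ theta).sum() for y, feats in candidates.items()}
Z = sum(unnorm.values())                 # normalizer over all candidates and derivations

for y, score in unnorm.items():
    print(f"P(y|x) = {score / Z:.3f}   for  '{y}'")

# The decision rule of Eq. (1) then picks the highest-scoring candidate.
print("chosen translation:", max(unnorm, key=unnorm.get))
```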
(Koehn 2003) introduces the phrase-based translation model, which is widely used in academia and industry; the basic idea is to utilize phrases to capture word selection and the reordering of local context. The translation model in phrase-based Statistical Machine Translation is divided into three main steps or sub-models: (1) segmenting the source sentence into phrases, (2) translating each source phrase into a target phrase, and (3) reordering the target phrases to match the target-language word order. The concatenation of the target phrases yields the target sentence.

Statistical Machine Translation suffers from two problems: data sparsity and feature engineering. Due to its discrete symbolic representation, the statistical model is disposed to learning weak estimates of the model parameters on account of low counts. As a result, conventional SMT resorts to simple features instead of complex ones, which sets an unavoidable bar on the model's effectiveness. The second challenge faced by SMT is feature engineering: the usual practice in SMT feature design is to annotate hand-crafted features that capture local syntactic and semantic phenomena. Since there can be millions of such features, and mapping them from one language to another can be a cumbersome task, designing general features for SMT remains a challenge.

3 Component- or Domain-wise Deep-Learning Methods in Statistical Machine Translation

In recent years, deep-learning-based approaches have been studied extensively to mitigate the issues faced by SMT: data sparsity and feature engineering. In this section, we briefly discuss how deep learning models have been used successfully in the key components of SMT: word alignment, translation rule probability estimation, phrase reordering, and language modelling.

3.1 Word Alignment

The role of word alignment is to find correspondences between the words of a parallel corpus (Brown 1993; Vogel 1996). In SMT, word alignment is treated as a hidden variable in generative models and is modelled as P(x, z|y; θ). The Hidden Markov model is the most widely used SMT alignment model (Vogel 1996), and the conventional training objective is to maximize the log-likelihood of the training data. The drawback of such a conventional generative model is that it fails to capture complex relationships in natural language, owing to data sparsity under discrete symbolic representations.

(Yang 2013) were the first to propose a context-dependent deep neural network for word alignment; their idea was to capture contextual features by feeding continuous word representations into a feed-forward neural network. (Tamura 2014) used a recurrent neural network (RNN) to compute alignment scores directly, taking the previous alignment scores as input, and reported better performance than Yang's model.
3.2 Translation rule probability estimation

In phrase-based SMT we may extract multiple translation rules from the word-aligned training data, so the objective becomes selecting the most apt rules during the decoding phase. Usually, rule selection is done using translation probabilities computed by maximum likelihood estimation (Koehn 2003). The problem with this approach is that it suffers from data sparsity and fails to capture deep semantics and context. Deep learning techniques aim to alleviate these issues: (Gao 2014) compute translation scores between source and target phrases in a low-dimensional vector space using a feed-forward network, and (Devlin 2014) proposed a joint neural model that conditions on both source and target context to predict translation scores with a feed-forward network.
3.3 Reordering phrases

After scoring the phrase pairs of the source and target sentence with translation scores, the next step is to order the target phrases so as to produce a well-formed sentence. Earlier SMT systems computed phrase order over discrete symbolic representations (Xiong 2006). (Li 2013; Li 2014) proposed a neural phrase-reordering model that employs recursive autoencoders to learn continuous distributed representations of both source and target phrases, making the final ordering prediction with a feed-forward network.
3.4 Language Modelling

Target phrases are combined to create the target sentence or, in some cases, a larger partial translation; the role of the language model is to determine whether the larger translation is better than the phrases it is composed of. Conventional SMT used n-gram language models to compute this conditional probability. Since the n-gram model is count based, it suffered severely from data sparsity. Deep learning models helped alleviate this issue by computing the conditional probability using continuous representations of words. (Bengio 2003) designed a feed-forward network that embeds the n-gram context in a continuous vector space, and (Vaswani 2013) integrated this neural n-gram language model into the phrase-based decoding phase of SMT, using n-gram representations of fixed size. Such models assume that the generation of the current word depends only on the preceding n − 1 words; this assumption was relaxed by LSTM (Hochreiter 1997) and GRU (Cho 2014) based networks, which take all previous words into account when predicting the current word.
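To make the continuous-space language model concrete, the following sketch implements the forward pass of a Bengio-style feed-forward n-gram language model in plain NumPy. The vocabulary, dimensions, and randomly initialised (untrained) parameters are assumptions for illustration; a real model would be trained by maximising the corpus log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and a trigram setting (n = 3, so n - 1 = 2 history words).
vocab = ["<s>", "the", "cat", "sat", "on", "mat", "</s>"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, d, h, n = len(vocab), 8, 16, 3

C = rng.normal(0, 0.1, (V, d))            # word embedding table
W = rng.normal(0, 0.1, ((n - 1) * d, h))  # hidden layer weights
U = rng.normal(0, 0.1, (h, V))            # output projection to the vocabulary

def ngram_lm_prob(history):
    """P(w | history) for every word w, computed from continuous representations."""
    x = np.concatenate([C[word_to_id[w]] for w in history])  # concatenate n-1 embeddings
    hidden = np.tanh(x @ W)
    logits = hidden @ U
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                               # softmax over the vocabulary

p = ngram_lm_prob(["the", "cat"])
print("P(sat | the cat) =", round(float(p[word_to_id["sat"]]), 4))
```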
4 End-to-End Deep Learning for Machine Translation

End-to-end machine translation models, also termed Neural Machine Translation (NMT), aim to find the correspondence between source and target natural languages with the help of deep neural networks. The main difference between NMT and conventional Statistical Machine Translation (SMT) approaches is that neural models are capable of learning complex relationships between natural languages directly from the data, without resorting to manually designed features, which are hard to engineer.

The standard problem in machine translation remains the same: given a source-language sentence X = x_1, ..., x_j, ..., x_J and a target-language sentence Y = y_1, ..., y_i, ..., y_I, NMT factors the sentence-level translation probability into context-dependent word-level translation probabilities:

P(y|x; θ) = Π_{i=1}^{I} P(y_i | x, y_{<i}; θ),   (6)

where y_{<i} is referred to as the partial translation. The context shared between the source and target sentences can become sparse when the sentences grow too long; to address this issue, (Sutskever 2014) proposed an encoder-decoder network that represents a variable-length sentence as a fixed-length vector and uses this distributed vector to translate sentences.
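The factorization in Eq. (6) is the quantity every NMT system trains and decodes with. The sketch below shows how the sentence-level log-probability is accumulated from per-step conditional distributions; the `step_distribution` function is a hypothetical random stand-in for whatever decoder (RNN, CNN, or Transformer) actually produces P(y_i | x, y_<i).

```python
import numpy as np

rng = np.random.default_rng(1)
target_vocab = ["<s>", "je", "t'", "aime", "</s>"]
V = len(target_vocab)

def step_distribution(source_ids, prefix_ids):
    """Hypothetical stand-in for P(y_i | x, y_<i; theta)."""
    logits = rng.normal(size=V)           # a real model computes these from x and y_<i
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sentence_log_prob(source_ids, target_ids):
    """Eq. (6) in log space: log P(y|x) = sum_i log P(y_i | x, y_<i)."""
    logp = 0.0
    for i, y_i in enumerate(target_ids):
        p = step_distribution(source_ids, target_ids[:i])
        logp += np.log(p[y_i])
    return logp

src = [3, 1, 4]                           # token ids of a toy source sentence
tgt = [1, 2, 3, 4]                        # "je t' aime </s>"
print("log P(y|x) =", round(sentence_log_prob(src, tgt), 3))
```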
4.1 Encoder-Decoder Framework for Machine Translation

Neural Machine Translation models adhere to an encoder-decoder architecture. The role of the encoder is to map a sentence of arbitrary length to a fixed-length real vector, termed the context vector, which contains all the necessary features that can be inferred from the source sentence itself. The decoder network takes this vector as input and emits the target sentence word by word. The ideal decoder is expected to output a sentence that preserves the full context of the source-language sentence.

Since source and target sentences are usually of different lengths, (Sutskever 2014) initially proposed recurrent neural networks for both the encoder and the decoder. To address the vanishing and exploding gradients caused by long-range dependencies among word pairs, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells were proposed instead of the vanilla RNN cell. Fig. 1 shows the architectural flow of a basic encoder-decoder network.

Figure 1: Encoder-decoder model for machine translation. Crimson boxes depict the hidden states of the encoder, blue boxes show the "end of sentence" (EOS) token, and green boxes show the hidden states of the decoder. Credits: Neural Machine Translation - Tutorial ACL 2016.

Training in NMT is done by maximizing the log-likelihood as the objective function:

θ̂ = argmax_θ L(θ),   (7)

where L(θ) is defined over the training pairs (x^(i), y^(i)) as

L(θ) = Σ_{i=1}^{I} log P(y^(i) | x^(i); θ).   (8)

After training, the learned parameters θ̂ are used for translation as:

ŷ = argmax_y P(y|x; θ̂).   (9)
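As an illustration of the framework described above, here is a minimal NumPy sketch of a GRU-based encoder-decoder forward pass with randomly initialised, untrained parameters: the encoder compresses the source into a fixed-length context vector and the decoder greedily emits target tokens from it. All sizes and token ids are toy assumptions; a real system would add batching, training, and beam search.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

d_emb, d_hid, V_src, V_tgt = 8, 16, 20, 20   # toy sizes

def gru_params():
    return {k: rng.normal(0, 0.1, shape)
            for k, shape in [("Wz", (d_emb, d_hid)), ("Uz", (d_hid, d_hid)),
                             ("Wr", (d_emb, d_hid)), ("Ur", (d_hid, d_hid)),
                             ("Wh", (d_emb, d_hid)), ("Uh", (d_hid, d_hid))]}

def gru_step(p, x, h):
    # Standard GRU update: reset gate r, update gate z, candidate state h_tilde.
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"])
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"])
    h_tilde = np.tanh(x @ p["Wh"] + (r * h) @ p["Uh"])
    return (1 - z) * h + z * h_tilde

E_src = rng.normal(0, 0.1, (V_src, d_emb))   # source embeddings
E_tgt = rng.normal(0, 0.1, (V_tgt, d_emb))   # target embeddings
W_out = rng.normal(0, 0.1, (d_hid, V_tgt))   # decoder output projection
enc, dec = gru_params(), gru_params()

def encode(src_ids):
    h = np.zeros(d_hid)
    for t in src_ids:
        h = gru_step(enc, E_src[t], h)
    return h                                  # fixed-length context vector

def greedy_decode(context, bos=0, eos=1, max_len=10):
    h, y, out = context, bos, []              # decoder starts from the context vector
    for _ in range(max_len):
        h = gru_step(dec, E_tgt[y], h)
        y = int(np.argmax(h @ W_out))         # pick the most probable next word
        if y == eos:
            break
        out.append(y)
    return out

print("decoded token ids:", greedy_decode(encode([5, 7, 3, 1])))
```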
is to find relevant portions in source text in order
4.2 Attention Mechanism in Neural Machine Translation

The encoder network proposed by (Sutskever 2014) represents the source-language sentence as a fixed-length vector which is subsequently consumed by the decoder network. Through empirical testing it was observed that the quality of translation depends heavily on the length of the source sentence and degrades significantly as the sentence grows longer.

To address this issue, (Bahdanau 2014) proposed integrating an attention mechanism into the encoder-decoder network and showed that it can dynamically select the relevant portions of the source sentence while producing the target sentence. They used a bi-directional RNN (BRNN) to capture global context:

→h_s = f(x_s, →h_{s−1}, θ),   (10)

←h_s = f(x_s, ←h_{s+1}, θ).   (11)

The forward hidden state →h_s and the backward hidden state ←h_s are concatenated to capture sentence-level context:

h_s = [→h_s ; ←h_s].   (12)

The basic idea behind attention is to find the portions of the source text that are relevant for generating each target word; this is done by first computing attention weights:

α_{j,i} = exp(a(t_{j−1}, h_i, θ)) / Σ_{i'=1}^{I+1} exp(a(t_{j−1}, h_{i'}, θ)),   (13)

where a(t_{j−1}, h_i, θ) is the alignment function, which evaluates how well the input at position i matches the output at position j. The context vector c_j is computed as a weighted sum of the hidden states of the source:

c_j = Σ_{i=1}^{I+1} α_{j,i} h_i,   (14)

and the target hidden state is computed as

t_j = f(y_{j−1}, t_{j−1}, c_j, θ).   (15)
The difference between attention-based NMT and the original encoder-decoder architecture lies in the way the source context is computed. In the original encoder-decoder, the source's final hidden state is used to initialize the target's initial hidden state, whereas with the attention mechanism a weighted sum of the hidden states is used, which ensures that the relevance of each source word in the sentence is preserved in the context. This greatly improves translation quality, and attention-based models have become the state of the art in neural machine translation. Fig. 2 shows the architectural flow of an attention-based encoder and how this information is carried forward and utilized by the decoder.

Figure 2: Attention-based encoder-decoder architecture for machine translation. Most of the architecture is similar to the basic encoder-decoder, with the addition of a context vector computed from attention weights over the word tokens; the attention vector is calculated using the context vector and the hidden state of the encoder. Credits: Attention-based Neural Machine Translation with Keras, blog by Sigrid Keydana.
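The following NumPy sketch computes one step of the additive attention of Eqs. (13)-(14): an alignment score for every source annotation, a softmax over those scores, and the resulting context vector. The encoder annotations, previous decoder state, and parameter matrices are random placeholders standing in for trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # toy hidden size

# Annotations h_1..h_I from a (bi-directional) encoder and the previous
# decoder state t_{j-1}; random here, produced by real networks in practice.
H = rng.normal(size=(6, d))              # I = 6 source positions
t_prev = rng.normal(size=d)

# Additive alignment model a(t_{j-1}, h_i) = v . tanh(Wa t_{j-1} + Ua h_i).
Wa = rng.normal(0, 0.1, (d, d))
Ua = rng.normal(0, 0.1, (d, d))
v = rng.normal(0, 0.1, d)

scores = np.tanh(t_prev @ Wa + H @ Ua) @ v        # one score per source position
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                              # Eq. (13): attention weights
c_j = alpha @ H                                   # Eq. (14): context vector

print("attention weights:", np.round(alpha, 3))
print("context vector shape:", c_j.shape)
```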
5 Challenges in Neural Machine Translation

In this section we discuss some of the issues researchers have faced in NMT and the solutions proposed for them. The section covers issues related to (1) inefficient performance due to large vocabularies, (2) evaluation metrics for end-to-end training in NMT, (3) NMT training on low-quality data, and (4) network architectures for NMT. Our main focus is the discussion of the network architectures that have been proposed in recent years, and our main objective is to pinpoint the current open research problems in Neural Machine Translation and to propose some possible research areas of our own.

5.1 Inefficient performance due to large vocabulary

Because NMT uses word-level tokens as input, the translation probability is calculated by normalizing over the entire target vocabulary, and the log-likelihood of the training data in turn depends on the computation of these translation probabilities. Calculating the gradient of this log-likelihood becomes an onerous task when training NMT models, since we need to enumerate all the target words.

For this reason, (Sutskever 2014; Bahdanau 2014) trained their models on a subset of the full vocabulary, keeping the most frequent words and treating the rest as out-of-vocabulary (OOV) tokens, but this deteriorated the overall performance of the models significantly.

(Luong 2014) proposed a method that finds the correspondence between source and target OOV words and handles their translation in a separate processing step. One interesting approach to this problem is to use character-level (Chung 2016; Luong 2016) or subword-level (Sennrich 2015) tokens as input to the neural architecture. The intuition is that characters and subwords greatly reduce the vocabulary, making the computation of the translation probabilities significantly faster.
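The subword approach of (Sennrich 2015) builds its vocabulary with byte-pair encoding: repeatedly merge the most frequent pair of adjacent symbols in the training words. Below is a minimal sketch of that merge loop over a tiny invented word-frequency dictionary.

```python
import re
from collections import Counter

# A tiny word-frequency dictionary; words are sequences of characters plus
# an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

def get_pair_stats(vocab):
    """Count how often each pair of adjacent symbols occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

num_merges = 8
for _ in range(num_merges):
    pairs = get_pair_stats(vocab)
    best = max(pairs, key=pairs.get)     # most frequent adjacent symbol pair
    vocab = merge_pair(best, vocab)
    print("merged", best)

print(vocab)   # frequent fragments such as 'est</w>' and 'low' become single subword units
```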
5.2 Evaluation metric for end-to-end training in NMT

The standard training criterion used by machine translation systems is maximum likelihood estimation (MLE), which finds the optimum model parameters by maximizing the log-likelihood of the training data. (Ranzato 2015) identified a likely drawback of this approach: the MLE loss is defined at the word level, whereas the evaluation metrics used in machine translation, BLEU (Papineni 2002) and TER (Snover 2006), are defined at the sentence or corpus level. This inconsistency between model training and evaluation poses a problem for Neural Machine Translation.

To solve this issue, (Shen 2015) proposed Minimum Risk Training (MRT) as a loss function for training neural models, the idea being that the loss should measure the difference between the model predictions and the ground-truth translation, and that the optimal parameters should be computed by minimizing this loss. The training objective is defined as

θ̂ = argmin_θ R(θ),   (16)

R(θ) = Σ_{s=1}^{S} Σ_{y ∈ Y(x^(s))} P(y | x^(s); θ) Δ(y, y^(s)),   (17)

where Y(x^(s)) is the set of all translations of x^(s), y and y^(s) are the model prediction and the ground truth respectively, and Δ(y, y^(s)) is the loss function measuring the difference between the prediction and the ground truth. The advantage of MRT over MLE is that MRT can directly optimize model parameters with respect to the evaluation metric. Moreover, MRT uses a sentence-level loss, unlike the word-level loss of MLE, and MRT is transparent to the underlying neural model, so it can be applied to a number of artificial intelligence tasks.
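The sketch below evaluates the expected risk of Eq. (17) for a single sentence over a small sampled subset of candidate translations, which is how MRT is approximated in practice. The candidate set, their log-probabilities, the sharpness factor, and the unigram-overlap loss used here in place of 1 − BLEU are all simplifying assumptions for illustration.

```python
import numpy as np

def delta(candidate, reference):
    """Toy loss: 1 minus unigram precision, a stand-in for 1 - sentence-BLEU."""
    cand, ref = candidate.split(), set(reference.split())
    return 1.0 - sum(w in ref for w in cand) / len(cand)

reference = "the cat sat on the mat"
# A sampled subset of candidate translations with (invented) model log-probabilities.
candidates = {"the cat sat on the mat": -2.0,
              "the cat is on the mat": -2.3,
              "a dog sat on a rug": -4.0}

alpha = 0.5                                        # sharpness of the renormalized distribution
logp = np.array(list(candidates.values())) * alpha
q = np.exp(logp - logp.max())
q /= q.sum()                                       # Q(y|x) over the sampled subset

risk = sum(qi * delta(y, reference) for qi, y in zip(q, candidates))
print("expected risk R(theta) on this sentence:", round(risk, 3))
```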
5.3 NMT training on low-quality data

NMT models owe their success to parallel corpora, as parallel text is their main source of translation knowledge. As a result, the translation quality of NMT systems depends heavily on the quality and quantity of the available parallel corpora. NMT models have been effective for resource-rich languages, but for low-resource languages the unavailability of large-scale, high-quality corpora poses a challenge, since neural models learn poorly from low counts. Results show that NMT performs worse than SMT when little data is available.

(Gulcehre 2015) proposed a solution that incorporates knowledge learned from monolingual data, which is relatively abundant compared to parallel text, into the NMT model. They proposed two types of fusion, shallow fusion and deep fusion, which integrate a language model trained on monolingual data into the decoding scores or the decoder hidden state of the NMT system. (Cheng 2016) proposed a semi-supervised learning approach to NMT that uses neural autoencoders for source-to-target and target-to-source translation and can be trained on both parallel and monolingual data. (Cheng 2017) proposed a pivot-language-based approach, the idea being that one can obtain an NMT system by training source-to-pivot and pivot-to-target models separately.
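As an illustration of the shallow-fusion idea of combining a translation model with a monolingual language model at decoding time, the sketch below mixes two next-word distributions through a weighted sum of log-probabilities. The distributions and the weight β are invented values, and (Gulcehre 2015) should be consulted for the exact formulation they propose.

```python
import numpy as np

vocab = ["the", "a", "cat", "dog", "sat"]

# Next-word distributions for one decoding step: one from the translation
# model, one from a language model trained on monolingual data (invented values).
p_nmt = np.array([0.30, 0.10, 0.35, 0.05, 0.20])
p_lm  = np.array([0.25, 0.20, 0.30, 0.10, 0.15])

beta = 0.3                                    # weight given to the monolingual language model
log_fused = np.log(p_nmt) + beta * np.log(p_lm)
fused = np.exp(log_fused - log_fused.max())
fused /= fused.sum()                          # renormalize the fused scores

for w, p in zip(vocab, fused):
    print(f"{w:>4}: {p:.3f}")
print("next word:", vocab[int(np.argmax(fused))])
```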
5.4 Neural Architectures for NMT

Most encoder-decoder-based NMT models have used RNNs and their variants, LSTM and GRU. Recently, convolutional networks (CNNs) and self-attention networks have been studied and have produced promising results.

The issue with using recurrent networks in NMT is that they work by serial computation and must maintain a hidden state at each step of training, which makes training inefficient and time consuming. (Gehring 2017) showed that convolutional networks can, in contrast, learn fixed-length hidden states using the convolution operation. The main advantage of this approach is that the convolution operation does not depend on previously computed values and can be parallelized for multi-core training. Convolutional layers can also be stacked to capture deeper context, making them a suitable choice for both the encoder and the decoder.

Recurrent networks relate the words of a sentence in O(n) sequential operations, whereas a stack of convolutions can achieve the same in O(log_k n), where k is the size of the convolution kernel.

(Vaswani 2017) proposed a model that computes the dependency between every pair of words in a sentence using only attention layers stacked one after the other in both the encoder and the decoder, a mechanism they termed self-attention. In their model the hidden states are computed with self-attention and feed-forward layers, positional encodings introduce features based on the position of each word in the sentence, and their self-attention layer, named multi-head attention, is highly parallelizable. For this reason the model significantly speeds up NMT training, and it also achieves better results than the baseline recurrent-network-based models. Fig. 3 shows the internal architecture of the Transformer network as proposed by (Vaswani 2017), with the encoder and decoder modules shown separately.

Figure 3: Self-attention encoder-decoder (Transformer) model. The encoder and decoder both consist of positional encodings and stacked layers of multi-head attention and feed-forward networks, with the decoder containing an additional masked multi-head attention layer. Translation probabilities are calculated by a linear layer followed by a softmax. Credits: (Vaswani et al. 2017).

Currently there is no consensus on which neural architecture is best, and different architectures give different results depending on the problem at hand. Neural architecture design remains the most active research area in Neural Machine Translation.
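The core of the self-attention layer described above is scaled dot-product attention, in which every position attends to every other position in a single matrix operation. The NumPy sketch below computes it for one toy sentence with random, untrained projection matrices; multi-head attention, masking, and the feed-forward sublayers are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8        # toy sentence length and model dimensions

X = rng.normal(size=(n, d_model))            # one embedding (+ positional encoding) per word
Wq, Wk, Wv = (rng.normal(0, 0.1, (d_model, d_k)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv             # queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)              # every word scored against every other word
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
output = weights @ V                         # contextualised representations, all in parallel

print("attention matrix shape:", weights.shape)   # (n, n): pairwise dependencies in one shot
print("output shape:", output.shape)
```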
6 Research gaps and open problems

Deep learning methods have revolutionized the field of machine translation, with early efforts focusing on improving the key components of Statistical Machine Translation, such as word alignment (Yang 2013), the translation model, phrase reordering, and the language model. Since 2010, most research has shifted towards developing end-to-end neural models that remove the need for extensive feature engineering, and neural models have successfully replaced statistical models in academic and industrial applications since their inception.

Although deep learning has accelerated research in the machine translation community, current NMT models are not free from imperfections and have certain limitations. In this section we delineate some existing research problems in NMT; our aim is to help researchers and scholars working in this field get acquainted with these issues and work towards them for even faster development of the field. Table 1 shows the major contributions made in machine translation over the years; neural network architectures for machine translation appear from 2014 onwards, and most subsequent breakthroughs are observed using the same neural approach.

6.1 Neural models inspired by linguistic approaches

End-to-end models have become the de facto approach in machine translation, but it is hard to interpret the internal computations of neural networks, which are often simply described as a "black box". One possible area of research is to develop linguistically motivated neural models for better interpretability. It is hard to discern knowledge from the hidden states of current neural networks, and as a result it is equally difficult to incorporate prior knowledge, which is symbolic in nature, into the continuous representations of these states (Ding 2017).

6.2 Lightweight neural models for learning from sparse data

Another major drawback of NMT is data scarcity. It is well understood that NMT models are data hungry and require millions of training instances for best results. The problem arises because there are not enough parallel corpora for most of the language pairs in the world, so building models that can learn decent representations from relatively small data sets is an actively researched problem today. A related issue is developing one-to-many and many-to-many translation models instead of one-to-one models. Researchers are not yet sure how to share common linguistic knowledge across languages within a neural network; such knowledge would help develop multilingual translation models instead of the one-to-one models used today.

6.3 Multi-modal neural architectures for present data

A further problem is to develop multi-modal language translation models. Almost all work done so far has been based on textual data. Research on developing continuous representations that merge text, speech, and visual data into multi-modal systems is in full swing. Also, since there are limited or no multi-modal parallel corpora available, the development of such databases is an interesting field to explore and would also benefit multi-modal neural architectures.
6.4 Parallel and distributed algorithms for training neural models

Finally, current neural architectures rely heavily on extensive computation power to give competent results. Although there is no shortage of compute and storage at present, it would be more efficient to come up with lightweight neural models for language translation. Moreover, recurrent models cannot be parallelized, which makes it hard to develop distributed systems for their training. Fortunately, recent architectures such as convolutional networks and self-attention networks can be parallelized and thus distributed across different systems; but because they contain millions of interdependent parameters, it remains hard to distribute them across loosely coupled systems. Developing lightweight neural architectures designed to be distributed could therefore be a new frontier of NMT.
7 Conclusion

In this paper we discussed machine translation. We started with a brief overview of the basic machine translation objective and terminology, along with early statistical approaches (SMT), pointing out the different components of an SMT system. We then discussed the role of deep learning models in improving the different components of SMT, and shifted our discussion to end-to-end neural machine translation (NMT). Our discussion was largely based on the basic encoder-decoder NMT model and the attention-based model. We finally listed the challenges facing neural translation models and mentioned future fields of study and open problems. Through this brief but comprehensive survey, we aim to guide new researchers and scholars to the latest topics in machine translation and to suggest possible areas of development.
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137-1155.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Semi-supervised learning for neural machine translation. arXiv preprint arXiv:1606.04596.

Yong Cheng, Qian Yang, Yang Liu, Maosong Sun, and Wei Xu. 2017. Joint training for pivot-based neural machine translation. In Proceedings of IJCAI.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201-228.

Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1370-1380.

Yanzhuo Ding, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1150-1159.

Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. 2014. Learning continuous phrase representations for translation modeling. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 699-709.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48-54. Association for Computational Linguistics.

Peng Li, Yang Liu, and Maosong Sun. 2013. Recursive autoencoders for ITG-based translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 567-577.

Peng Li, Yang Liu, Maosong Sun, Tatsuya Izuha, and Dakun Zhang. 2014. A neural reordering model for phrase-based translation. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1897-1907.

Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. arXiv preprint arXiv:1604.00788.

Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206.

Franz Josef Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 295-302. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2015. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, volume 200.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.

Akihiro Tamura, Taro Watanabe, and Eiichiro Sumita. 2014. Recurrent neural networks for word alignment model. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1470-1480.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1387-1392.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2, pages 836-841. Association for Computational Linguistics.

Deyi Xiong, Qun Liu, and Shouxun Lin. 2006. Maximum entropy based phrase reordering model for statistical machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 521-528. Association for Computational Linguistics.

Nan Yang, Shujie Liu, Mu Li, Ming Zhou, and Nenghai Yu. 2013. Word alignment modeling with context dependent deep neural network. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 166-175.
