
Accepted for publication in IEEE/ACM Transactions on Audio, Speech, and Language Processing. DOI: 10.1109/TASLP.2019.2913087.

Tailoring an Interpretable Neural Language Model


Yike Zhang, Pengyuan Zhang, and Yonghong Yan

Y. Zhang, P. Zhang, and Y. Yan are with the Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China (e-mail: zhangyike@hccl.ioa.ac.cn; zhangpengyuan@hccl.ioa.ac.cn; yanyonghong@hccl.ioa.ac.cn), and also with the University of Chinese Academy of Sciences, Beijing 100049, China. Y. Yan is also with the Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumchi 830011, China. Manuscript received April 19, 2005; revised August 26, 2015. (Corresponding author: Pengyuan Zhang)

Abstract—Neural networks have shown great potential in language modeling. Currently, the dominant approaches to language modeling are based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Nonetheless, it is not clear why RNNs and CNNs are suitable for the language modeling task, because these neural models lack interpretability. The goal of this paper is to tailor an interpretable neural model as an alternative to RNNs and CNNs for the language modeling task. This paper proposes a unified framework for language modeling, which can partly interpret the rationales behind existing LMs. Based on the proposed framework, an interpretable neural language model (INLM) is proposed, including a tailored architectural structure and a tailored learning method for the language modeling task. The proposed INLM can be approximated as a parameterized auto-regressive moving average (ARMA) model and provides interpretability in two aspects: component interpretability and prediction interpretability. Experiments demonstrate that the proposed INLM outperforms some typical neural LMs on several language modeling datasets and on the Switchboard speech recognition task. Further experiments also show that the proposed INLM is competitive with the state-of-the-art long short-term memory (LSTM) LMs on the Penn Treebank and WikiText-2 datasets.

Index Terms—Neural language models, interpretability, auto-regressive moving average, speech recognition

I. INTRODUCTION

THE language model (LM) is an essential component in many natural language-related applications such as speech recognition [1] and machine translation [2]. The language modeling task aims to estimate the probability of natural sentences. In other words, the LM can indicate how likely a sequence of words is to form a natural sentence.

With the pioneering work in [3], neural networks found their way into the domain of language modeling. Although neural network LMs outperform n-gram models [4] in a wide range of applications, many studies demonstrate that the performance of neural network LMs strongly depends on their architectural structures [5]. Currently, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) dominate the domain of language modeling.

RNNs were first introduced to the language modeling task by Mikolov et al. [6]. With the help of the recurrent architecture, RNN models can efficiently capture long-term dependencies. However, the recurrent architecture also brings two main problems. Firstly, the sequential dependency makes RNN models hard to parallelize across timesteps. Secondly, although the recurrent operation provides an effective way to learn a representation of the entire history, it makes the feature extraction mechanism of RNN models difficult to understand. In other words, it is not clear why iteratively incorporating history words into a model is better than incorporating all the history words into the model at one time. Moreover, gates do help RNN models learn long-term dependencies; long short-term memory (LSTM) [7] and the gated recurrent unit (GRU) [8] are two typical gated RNN models. Nevertheless, the gates also make the internal working mechanism of RNN models more confusing.

Recently, CNNs have also been applied to the language modeling task [9]–[11]. CNN models first stack the word embeddings of the input words to form a matrix. Then convolution operations are applied between the input matrix and multiple kernels. Each kernel learns a specific pattern (or feature map) from the input matrix, but the differences between the patterns learned by different kernels are not clear. Therefore, it is hard to evaluate which kernels are essential and which ones are dispensable. Additionally, CNN models usually have high computational complexity due to the kernels.

Although neural-network-based techniques have significantly improved the language modeling task, existing LMs offer little transparency regarding how their mathematical and parametric structure influences the overall performance. We should rethink whether the prevalent architectures, such as recurrent components or convolutional components, are essential or irreplaceable for a specific task. If we have a deeper understanding of the existing LMs, we can design a better neural model more easily for the language modeling task. Therefore, this paper first breaks down the language modeling task into several subtasks. Then some typical LMs are reformulated under a unified framework. Based on that, this paper proposes an interpretable neural model for language modeling, including a tailored architectural structure and a tailored learning method. The proposed model, which can be approximated as a parameterized auto-regressive moving average (ARMA) model, provides interpretability in two aspects: component interpretability and prediction interpretability. Concretely, component interpretability refers to understanding the fundamentals of certain neural models, including the optimal architecture and mathematical parametric design. Prediction interpretability means providing human-readable justifications that support the model's predictions. These two kinds of interpretability are complementary. Component interpretability can serve as a guideline for neural model design, while prediction interpretability helps us analyse how the model makes decisions and further improve the model's performance.


In experiments, the proposed model was first compared to several typical neural LMs. All the models had approximately the same amount of parameters and adopted the same hyperparameter configuration. Results show that the proposed model outperforms the LSTM model and some other typical neural LMs. We further conducted experiments to compare the proposed model with state-of-the-art language modeling techniques. Results demonstrate that the proposed model is competitive with the state-of-the-art LSTM LMs on the Penn Treebank and WikiText-2 datasets. In addition, we also evaluated the proposed model on the Switchboard speech recognition task. Results show that the proposed model slightly outperforms the LSTM model.

The remainder of this paper is organised as follows: We first present related works in Section II. A unified framework for language modeling is defined in Section III. Then several typical LMs are reformulated under the unified framework in Section IV. In Section V, we propose an interpretable neural model for the language modeling task. Experimental settings and results are described in Section VI and Section VII respectively. Finally, Section VIII concludes this paper.

II. RELATED WORKS

A. Neural language models

Generally, the architectural structure largely determines the performance of neural LMs. The RNN is the most popular architecture in language modeling thanks to its powerful capability for modeling sequential data. A variety of RNN architectures were explored in [12]. The authors concluded that it is hard to find a custom architecture that consistently outperforms LSTM in all experimental conditions. Nonetheless, with the help of reinforcement learning, better neural architectures can be efficiently found for specific tasks [13].

Recurrent models usually suffer from gradient decay over time and layers. Besides the well-known gate mechanisms, such as LSTM and GRU, many new techniques have been proposed to address this problem. Zilly et al. [14] proposed the recurrent highway network (RHN), which extends LSTMs to allow step-to-step transition depths larger than one. Kurata et al. [15] extended LSTMs by adding highway networks inside an LSTM. Grave et al. [16] proposed a neural cache model to augment neural LMs with a longer-term memory that dynamically updates the word probabilities based on the long-term context. Merity et al. [17] introduced the pointer sentinel mixture architecture for neural sequence modeling, which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Li et al. [18] proposed the independently recurrent neural network (IndRNN). In the IndRNN, neurons in the same layer are independent of each other and are connected across layers.

Regularization also strongly affects the performance of recurrent models. Since Zaremba et al. [19] introduced the dropout technique into RNN models, various advanced dropout techniques have been proposed, such as embedding dropout [20], variational dropout [20], and fraternal dropout [21]. Instead of performing dropout on hidden states or memory cells, the weight-dropped LSTM [22] proposed by Merity et al. uses DropConnect [23] on hidden-to-hidden weights as a form of recurrent regularization. Besides dropout, L2 regularization on an RNN's activations and successive hidden states was revisited in [24]. Parameter sharing is another way to regularise neural networks. Press et al. [25] and Inan et al. [26] tied the input embedding and the output embedding, which facilitates better learning in language modeling. Moreover, Inan et al. [26] also augmented the cross-entropy training loss with an additional term, which minimizes the KL-divergence between the prediction distribution and a more accurate estimate of the true data distribution.

In addition, Yang et al. [27] proposed a high-rank LM, which has multiple softmax output layers. Merity et al. [22] proposed the non-monotonically triggered averaged stochastic gradient descent (NT-ASGD) optimization method for language modeling. In the NT-ASGD method, the models from the last K training iterations are averaged as the final model, where K is determined by a trigger.

Although canonical RNN models are powerful at modeling sequential data, the dependence of each timestep's computation on the previous timestep's output limits parallelism. Many novel neural models have been proposed to alleviate this problem. Zhang et al. [28] applied the fixed-size ordinally-forgetting encoding (FOFE), which can encode a variable-length sequence of words into a fixed-size representation, to feedforward neural LMs. As a more advanced model, the feedforward sequential memory network (FSMN) proposed by Zhang et al. [29] can model long-term dependencies in time series without using recurrent feedback. The Quasi-RNN (QRNN) proposed by Bradbury et al. [30] alternates convolutional layers, which can be applied in parallel across timesteps, with a minimalist recurrent pooling function, which can be applied in parallel across channels. Dauphin et al. [10] first developed a fully convolutional LM. Bai et al. [11] evaluated the generic temporal convolutional network (TCN) architecture across a diverse range of sequence modeling tasks, including language modeling. They reported that the TCN model outperforms LSTM models on some tasks.

Overall, most of the above works improved neural LMs by addressing general problems of neural networks, such as learning long-term dependencies and regularization, rather than problems specific to language modeling. This paper introduces a novel neural model whose architectural structure and optimization method are specially designed for the language modeling task. Furthermore, the proposed neural model also elegantly bypasses the problems mentioned above, including gradient decay, complicated regularization and limited parallelism.

B. Interpretability of neural models

Currently, researchers mainly study the interpretability of neural models from two different perspectives. Some researchers try to interpret what is learned by a neural network. For instance, visualization is a typical method to explore the patterns hidden inside a neural unit [31], [32]. In addition, diagnosing the network representations can also help us obtain an insight into the features encoded in a neural model [33], [34]. Other researchers focus on directly building an interpretable model. Zhang et al. [35] developed an approach to modifying a traditional CNN into an interpretable CNN to clarify knowledge representations in high convolution layers.


Wu et al. [36] proposed a method of learning qualitatively interpretable models for object detection. Li et al. [37] created a novel network architecture for deep learning that naturally interprets its own reasoning for each prediction. Lei [38] designed an interpretable neural model for natural language processing.

Actually, the above two perspectives complement each other. If we have a better understanding of the rationale behind the architecture and the mathematical parametric design of neural models, we can efficiently design a high-performance neural model for specific tasks.

III. A UNIFIED FRAMEWORK FOR LANGUAGE MODELING

The goal of the language modeling task is to determine the joint probability p(s) of a sentence s, where s = w_1, w_2, ..., w_l consists of l words. Currently, the predominant approach to language modeling decomposes the joint probability p(s) into a product of conditional probabilities based on the chain rule:

p(s) = ∏_{i=1}^{l} p(w_i | w_1, ..., w_{i−1})    (1)

In other words, LMs sequentially estimate the probability of each word in a sentence conditioned on the previous words.

Generally, the language modeling task involves the following problems:
a. How to select appropriate features to represent the context (history words).
b. How to aggregate different features.
c. How to derive the next-token probability distribution from the aggregated features.
d. How to do parameter estimation for specific models.

Taking the above problems into account, an LM can be abstracted into the following formula:

p(w_i | h_i) = ψ(φ(f_1(h_i), ..., f_K(h_i)))    (2)

where h_i = w_1, w_2, ..., w_{i−1} = w_1^{i−1} is the history of word w_i; f_1, f_2, ..., f_K are a set of functions that extract different features from the original history h_i; φ determines the way different features are aggregated; and ψ represents a mapping from the aggregated features to the next-token probability distribution. Although different LMs deal with the above problems with different methods, there are many similarities among these methods. The next section shows in more detail how several typical LMs cope with these problems.

IV. REFORMULATING TYPICAL LMS UNDER THE UNIFIED FRAMEWORK

This section provides an insight into existing LMs by reformulating several typical LMs under the unified framework introduced in Section III.

A. n-gram models

The term "n-gram" means n consecutive words. n-gram models predict the next word using the previous n−1 words. Formally, the feature extraction function for n-gram models is f^n_ngram(h_i) = w_{i−n+1}^{i−1}. In essence, all histories with the same last n−1 words are reduced to an equivalence class through the mapping f^n_ngram. Many smoothing techniques [39], [40] were proposed to alleviate the data sparsity problem in n-gram models. Back-off and interpolation are two typical smoothing techniques. Both of them incorporate lower order features into n-gram models. Hence, these techniques correspond to the aggregation function φ in Eq.(2). Finally, the next-token probability distribution is estimated based on the word co-occurrence frequencies.

B. Maximum entropy models

The maximum entropy (ME) model [41] is an exponential model of the form

p(w_i | h_i) = exp( Σ_j λ_j f_j(w_i, h_i) ) / Z(h_i)    (3)

where λ_j is the weight for the feature extracted by f_j, and Z(h_i) = Σ_{w_i ∈ V} e^{Σ_j λ_j f_j(w_i, h_i)} normalizes the probability distribution. In Eq.(3), different features are aggregated in a linear way. Then the next-token probability distribution is directly derived from the aggregated feature with an exponential function. Theoretically, solving the ME model is equivalent to maximizing the likelihood of Eq.(3) on the training data [42].

C. Neural models

This subsection demonstrates how four typical neural models correspond to Eq.(2): feedforward neural network (FNN) models, RNN models, CNN models, and recurrent convolutional neural network (RCNN) models [38].

1) FNN models: Similar to n-gram models, the prediction of FNN models depends on the most recent n preceding words, where n is a hyperparameter. The feature extraction function for FNN models is as follows:

f^n_FNN(h_i) = C(x_{i−n}) ⊕ ... ⊕ C(x_{i−1}) ∈ R^{ne}    (4)

where the one-hot encoding x_i ∈ R^{|V|} of word w_i is a sparse vector with |V|−1 zeros and a single one, and |V| is the vocabulary size. C(x_i) = E x_i ∈ R^e maps the one-hot encoding x_i to a dense vector, which is commonly referred to as a word embedding. E ∈ R^{e×|V|} is a learnable matrix, and ⊕ stands for the concatenation operation.

Besides the feature f^n_FNN, it is easy to incorporate other features into FNN models, such as part-of-speech features [43] or topic features [44]. These additional features are usually appended to each word embedding in the following form:

φ_FNN(h_i) = φ_FNN(f^n_FNN(h_i), f_1(h_i), ..., f_K(h_i)) = g(φ̂_FNN(x_{i−n}) ⊕ ... ⊕ φ̂_FNN(x_{i−1}))    (5)

where

φ̂_FNN(x_i) = C(x_i) ⊕ f_1(x_i) ⊕ ... ⊕ f_K(x_i)    (6)

is a word-level feature aggregation function, and g is usually a fully-connected layer that produces a latent representation of the history h_i from the aggregated word-level features φ̂_FNN.
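To make the decomposition of Eq.(2) concrete, the following NumPy sketch instantiates Eqs.(4)-(6) for a plain FNN LM, together with the softmax output layer of Eqs.(7)-(8) described next. All sizes and the randomly initialised parameters are illustrative placeholders rather than the paper's configuration.

```python
import numpy as np

# A minimal sketch of an FNN LM under the unified framework
# p(w_i | h_i) = psi(phi(f(h_i))). Sizes and parameters are illustrative.
rng = np.random.default_rng(0)
V, e, n, h = 1000, 32, 4, 64          # |V|, embedding dim, history length, hidden dim

E  = rng.normal(0, 0.1, (e, V))       # embedding matrix, C(x) = E x
Wg = rng.normal(0, 0.1, (h, n * e))   # fully-connected layer g in Eq.(5)
W  = rng.normal(0, 0.1, (V, h))       # output layer in Eq.(8)
b  = np.zeros(V)

def f_fnn(history_ids):
    # Eq.(4): concatenate the embeddings of the n most recent words.
    return np.concatenate([E[:, w] for w in history_ids])

def phi_fnn(features):
    # Eq.(5): g maps word-level features to a latent history representation
    # (the additional features f_1, ..., f_K of Eq.(6) are omitted here).
    return np.tanh(Wg @ features)

def psi(latent):
    # Eqs.(7)-(8): logits followed by a softmax over the vocabulary.
    y = W @ latent + b
    y -= y.max()                      # numerical stability
    p = np.exp(y)
    return p / p.sum()

history = [12, 7, 256, 3]             # indices of the last n words (hypothetical)
p_next = psi(phi_fnn(f_fnn(history)))
print(p_next.shape, p_next.sum())     # (1000,) 1.0
```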


Finally, the next-token probability distribution is estimated via a softmax output layer, which guarantees positive probabilities summing to one:

p(h_i) = ψ(h_i) = e^{y_i} / Σ_j e^{y_{ij}}    (7)

where the unnormalised probability distribution (or logit) y_i is computed as follows with parameters W and b:

y_i = W φ_FNN(h_i) + b    (8)

and y_{ij} is the j-th element of y_i.

Since neural models are non-convex, they are usually trained via the stochastic gradient descent (SGD) algorithm. The cross-entropy between the target x_i and the model prediction ψ(h_i) is widely used as the cost function, which reflects the prediction error, i.e.

L(x_i, ψ(h_i)) = − Σ_{p=1}^{|V|} x_{ip} log ψ(w_p | h_i) = − Σ_{p=1}^{|V|} x_{ip} log( e^{y_{ip}} / Σ_j e^{y_{ij}} )    (9)

where x_{ip} is the p-th element of x_i, and w_p is the p-th word in the vocabulary. Since x_i is a one-hot vector, Eq.(9) degenerates to the predicted log-likelihood of the target word. Therefore, minimizing the cross-entropy across the training data is equivalent to maximum likelihood estimation.

In fact, all neural LMs share the process from Eq.(7) to Eq.(9). Namely, all neural LMs derive the probability distribution and do parameter estimation in the same way.

2) RNN models: Taking advantage of the recurrent architecture, RNN models can efficiently deal with a longer history compared to FNN models. RNN models encode the entire history as a latent representation in an iterative way. At timestep i, only the current word w_{i−1} is fed into the RNN model. Similar to FNN models, the feature extraction function for RNN models is

f_RNN(i) = C(x_{i−1})    (10)

Then the current input f_RNN(i) and the previous hidden state φ_RNN(i−1) are combined in the following way:

φ_RNN(i) = g(φ_RNN(i−1), φ̂_RNN(i))    (11)

where

φ̂_RNN(i) = f_RNN(i) ⊕ f_1(i) ⊕ ... ⊕ f_K(i)    (12)

aggregates the word-level features like Eq.(6). Essentially, φ_RNN(i) is a latent representation of the history h_i.

3) CNN models: The prediction of CNN models relies on the most recent n preceding words, like FNN models. Nevertheless, CNN models stack the input words as a matrix (or feature map):

M_n(h_i) = [C(x_{i−n}); ...; C(x_{i−1})] ∈ R^{e×n}    (13)

Then narrow convolutions are applied between the input feature map M_n(h_i) and a kernel X_k ∈ R^{e×k} of width k (1 ≤ k ≤ n):

f^k_CNN(h_i) = σ(X_k ∗ M_n(h_i))    (14)

where ∗ represents the convolution operation, and σ is a non-linear function. A convolution layer is equivalent to a group of fully-connected layers, so f^k_CNN(h_i) is essentially a k-gram feature. In practice, we can obtain a number of features by using multiple kernels of varying widths in Eq.(14). These features are aggregated by

φ_CNN(h_i) = φ_CNN(f^1_CNN(h_i), ..., f^n_CNN(h_i)) = q(f^1_CNN(h_i) ⊕ ... ⊕ f^n_CNN(h_i))    (15)

where q can be either a pooling layer or a fully-connected layer that reduces the dimension of the aggregated feature.

4) RCNN models: RCNNs are specially designed for natural language processing tasks, including language modeling [38]. RCNN models can use both consecutive and non-consecutive n-gram features. At timestep i, RCNN models compute a consecutive/non-consecutive k-gram feature by:

f^{(j_1,...,j_k)}_RCNN(h_i) = λ^{j_k − j_1 − k + 1} · C(x_{j_1}) ⊙ ... ⊙ C(x_{j_k})    (16)

where j_1 < ... < j_k < i and ⊙ is the element-wise multiplication. The aggregated k-th order feature is obtained by summing all the consecutive/non-consecutive k-gram features up to timestep i−1:

φ^k_RCNN(i) = Σ_{j_1,...,j_k} f^{(j_1,...,j_k)}_RCNN(h_i)    (17)

Eq.(17) can be efficiently computed via dynamic programming:

φ^k_RCNN(i) = λ φ^k_RCNN(i−1) + μ φ^{k−1}_RCNN(i−1) ⊙ C(x_{i−1})    (18)

where μ = 1 − λ.
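The recurrence in Eq.(18) can be computed with a few lines of code. The NumPy sketch below assumes a fixed decay λ and random embeddings, and takes the order-zero feature to be a vector of ones as the base case; the adaptive weights used by the actual RCNN [38] are omitted.

```python
import numpy as np

# A small sketch of the dynamic-programming recurrence in Eq.(18).
# The decay lam, the base case, and the embeddings are illustrative.
rng = np.random.default_rng(0)
e, lam = 16, 0.7
mu = 1.0 - lam
embeddings = rng.normal(size=(10, e))     # C(x_0), ..., C(x_9), hypothetical inputs

def rcnn_features(embs, k_max=3):
    # phi[k] holds the aggregated k-gram feature phi^k_RCNN(i) after each step.
    phi = [np.zeros(e) for _ in range(k_max + 1)]
    phi[0] = np.ones(e)                   # assumed base case for the order-0 feature
    for emb in embs:                      # emb plays the role of C(x_{i-1})
        for k in range(k_max, 0, -1):     # update higher orders first so phi[k-1]
            phi[k] = lam * phi[k] + mu * phi[k - 1] * emb   # is still the old value
    return phi[1:]

feats = rcnn_features(embeddings)
print([f.shape for f in feats])           # [(16,), (16,), (16,)]
```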


V. AN INTERPRETABLE NEURAL LANGUAGE MODEL

This section proposes an interpretable neural language model (INLM) under the unified framework. We first describe how the INLM copes with the four problems mentioned in Section III, and then elaborate on its interpretability.

A. Features

The n-gram feature is an efficient representation of context, as shown in Section IV. Therefore, we adopt the n-gram feature as the basic feature of the proposed model:

f^n_INLM(h_i) = C(x_{i−n}) ⊕ ... ⊕ C(x_{i−1})    (19)

where the superscript n of f^n_INLM is a hyperparameter indicating the length of the history.

Many studies show that a longer history can help improve the performance of RNN models. However, this conclusion does not hold true for models with feedforward architectures. For instance, the context of FNN models is usually less than 10 words, and CNN models often limit the kernel width to between 3 and 6. This is partly because the feedforward architecture cannot efficiently model the temporal relationship among history words. In order to address this problem, we propose two word-level position features (PF). One describes the absolute position of a word in the history:

f_abs(x_{i−k}) = [δ_{i−k}(i−n), ..., δ_{i−k}(i−1)]    (20)

where 1 ≤ k ≤ n and

δ_m(k) = 1 if m = k, and 0 if m ≠ k    (21)

Since f_abs is essentially an n-dimensional one-hot encoding, it fails to model the relationships among different input words. Hence, another feature is proposed to describe the relative position of history words:

f_rel(x_{i−k}) = (n − k) / (n − 1)    (22)

where f_rel ranges from 0 to 1 and assigns larger values to more recent words.

B. Feature aggregation

Inspired by the analysis in Section IV, a hierarchical approach to feature aggregation is proposed in this subsection. In addition, this subsection not only provides a theoretical interpretation of the proposed approach, but also reveals the relationship between the proposed approach and the standard RNN model. In the proposed method, different features are first aggregated at the word level by

φ̂_INLM(i) = C(x_{i−1}) ⊕ f_abs(x_{i−1}) ⊕ f_rel(x_{i−1})    (23)

For standard RNN models, the aggregation function Eq.(11), with parameters U and V, can be rewritten as

φ_RNN(i) = σ(U φ_RNN(i−1) + V φ̂_RNN(i))    (24)

where σ is a non-linear activation function. Generally, σ and U lead to low parallel efficiency. Concretely, φ_RNN(i) depends on the previous timestep's output φ_RNN(i−1). If σ and U are removed from Eq.(24), φ_RNN(i) and φ_RNN(i−1) can be computed simultaneously by φ_RNN(i) = Σ_{t=0}^{i} V φ̂_RNN(t) and φ_RNN(i−1) = Σ_{t=0}^{i−1} V φ̂_RNN(t). The following two equations are an alternative to Eq.(24):

φ(i) = φ(i−1) + V φ̂(i)    (25)
φ'_RNN(i) = σ(U φ(i))    (26)

In fact, Eq.(25) is the Euler method for the following ordinary differential equation (ODE):

Δφ(i) = V φ̂(i)    (27)

We can easily find φ(i) by integrating Eq.(27). The latent representation of the truncated history w_{i−n}^{i−1} can be obtained by integrating Δφ(i) from i−n to i−1:

φ^n_INLM(i) = Σ_{t=i−n}^{i−1} V φ̂_INLM(t)    (28)

φ^n_INLM can also be regarded as an n-gram feature.

From the above analysis, we can see that the RNN is essentially a variant of the ODE. This partly reveals why RNNs are so powerful at sequential modeling. Since Eq.(24) is inefficient to parallelize, Eq.(28) is incorporated into the proposed model. If φ^n_INLM is used as the latent representation of h_i, the logit can be computed by replacing φ_FNN(h_i) with φ^n_INLM(i) in Eq.(8). In fact, Eq.(8) can be regarded as an inverse mapping of C, which maps word embeddings (or aggregated features) to the unnormalised probability distribution (or logit). Thus, φ^n_INLM(i) can be approximated as C(x_i). Using Eq.(23), Eq.(28) can be rewritten as

C(x_i) = Σ_{t=i−n}^{i−1} V_1 C(x_t) + Σ_{t=i−n}^{i−1} V_2 f_PE(x_t)    (29)

where V = [V_1; V_2] and f_PE(x_t) = f_abs(x_t) ⊕ f_rel(x_t). We can see that φ^n_INLM is essentially an ARMA model [45], which is a typical model for random signals. If we only use the basic feature f^n_INLM, then φ^n_INLM degenerates to an autoregressive (AR) model [45].

Taking advantage of Eq.(28), we can get the latent representations of n-grams of any order. Like ME and CNN models, the proposed model incorporates n-gram features of different orders. This is similar to the interpolation in n-gram models. In this paper, we simply implement the interpolation operation (IO) by concatenating the latent representations of different n-gram features. Finally, the aggregated feature is obtained by

φ_INLM(i) = η(φ^1_INLM(i) ⊕ ... ⊕ φ^n_INLM(i))    (30)

where the gate mechanism (GM) η is an alternative to Eq.(26).

Notably, vanilla RNNs suffer from vanishing gradients [46] due to the recurrent structure. As a result, RNN models usually fail to learn long-term patterns. Several gated RNN models were proposed to alleviate this problem, such as LSTM and GRU. Instead of directly combining different inputs, the GM first controls each input with a learned gate, and then combines the transformed inputs in a certain way. Therefore, the GM can also be regarded as an advanced information aggregation mechanism, although it was originally proposed to solve the vanishing gradient problem. In this paper, the highway architecture [47] is used as the GM:

η(x) = σ(T(x)) · σ(A(x)) + (1 − σ(T(x))) · x    (31)

where T and A are two affine transforms with different parameters.

In summary, the proposed feature aggregation function φ_INLM can be interpreted as a variant of the ARMA model, and the aggregated feature can be computed hierarchically using Eq.(23), Eq.(28), Eq.(30), and Eq.(31).
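As a summary of Section V-B, the NumPy sketch below computes the aggregated feature hierarchically: word-level aggregation with the position features (Eqs.(20)-(23)), the parallel n-gram representations of Eq.(28), concatenation of all orders as the IO (Eq.(30)), and a highway gate as the GM (Eq.(31)). The dimensions, the random parameters, and the use of all n orders are illustrative assumptions, not the configuration used in the experiments.

```python
import numpy as np

# A hierarchical-aggregation sketch for the INLM (Eqs.(20)-(23), (28), (30), (31)).
rng = np.random.default_rng(0)
n, e, h = 5, 16, 32
d = e + n + 1                                  # embedding + absolute PF + relative PF

V_mat = rng.normal(0, 0.1, (h, d))             # shared weight V in Eq.(28)
W_T = rng.normal(0, 0.1, (n * h, n * h))       # affine transform T in Eq.(31)
W_A = rng.normal(0, 0.1, (n * h, n * h))       # affine transform A in Eq.(31)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_level(emb, k):
    # Eq.(23): embedding + one-hot absolute position (Eqs.(20)-(21))
    # + scalar relative position (n - k)/(n - 1) for the k-th most recent word.
    f_abs = np.eye(n)[n - k]
    f_rel = np.array([(n - k) / (n - 1)])
    return np.concatenate([emb, f_abs, f_rel])

def aggregate(history_embs):
    # history_embs[0] is C(x_{i-n}), ..., history_embs[-1] is C(x_{i-1}).
    word_feats = [word_level(emb, k) for emb, k in zip(history_embs, range(n, 0, -1))]
    projected = [V_mat @ wf for wf in word_feats]            # V * phi_hat in Eq.(28)
    # Eq.(28): the k-gram representation sums the k most recent projected features.
    ngram = [sum(projected[n - k:]) for k in range(1, n + 1)]
    x = np.concatenate(ngram)                                # IO, Eq.(30)
    t = sigmoid(W_T @ x)                                     # highway gate, Eq.(31)
    return t * sigmoid(W_A @ x) + (1.0 - t) * x

phi = aggregate([rng.normal(size=e) for _ in range(n)])
print(phi.shape)                                             # (160,) = n * h
```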


C. Probability estimation

Once we get the aggregated feature φ_INLM, the unnormalised probability for the target word w_i can be directly computed using Eq.(8). According to the analysis in Section V-B, Eq.(8) can be regarded as an inverse mapping of C. Like [25], [26], we limit the W in Eq.(8) to the transposition of the matrix E in the linear transformation C, namely W = E^T ∈ R^{|V|×e}, in the proposed model. Thus, the unnormalised probability distribution can be computed by

y_i = C^{−1}(φ_INLM(i)) + b    (32)

where C^{−1}(x) = E^T x. This constraint can efficiently reduce the number of parameters, especially when the vocabulary size is very large. The normalised probability distribution ψ_INLM(h_i) can be obtained by substituting Eq.(32) into Eq.(7).

D. Learning methods

Recent studies show that learning methods strongly affect the performance of neural models. For instance, LSTM models always outperform feedforward models when cross-entropy is used as the loss function. Surprisingly, a feedforward model can outperform LSTM models by means of knowledge distillation [48]. Nevertheless, knowledge distillation involves two models: an auxiliary model is used to provide soft labels for the target model in the training process. This subsection proposes a novel learning method, which can efficiently train the INLM without auxiliary models.

As shown in Eq.(9), the cross-entropy loss implicitly builds a relationship between φ(h_i) and x_i. In other words, the cross-entropy loss encourages the model to predict x_i using φ(h_i). Neural LMs build such a relationship at different granularities depending on their architectural structures. For instance, given n history words, recurrent models learn such a relationship for all history words, (φ(h_1), x_1), (φ(h_2), x_2), ..., (φ(h_n), x_n), whereas feedforward models only learn the relationship for the last word, (φ(h_n), x_n). Thus, recurrent models embed temporal information into φ(h_i) and achieve a temporal-aware representation (TR) of the history. In contrast, feedforward models leave φ(h_i) unstructured.

According to the above analysis, we enforce the proposed INLM to learn a TR from history words. Concretely, in the proposed learning method, the INLM not only predicts the target x_i with φ(h_i), but also predicts each history word x_{i−t} with φ(h_{i−t}), where 1 ≤ t < n. These additional prediction losses can help the proposed INLM learn a TR of the history. Formally, a temporal-aware loss function can be derived from the proposed learning method:

L_tw(θ) = μ L(x_i, ψ_INLM(h_i)) + γ/(n−1) Σ_{t=i−n+1}^{i−1} L(x_t, ψ'(h_t))    (33)

where ψ'(h_t) is the probability distribution derived by replacing φ_INLM(i) with φ^{t−i+n}_INLM(t) = Σ_{j=i−n}^{t−1} V φ̂_INLM(j) in Eq.(32). μ and γ are two scalars ranging from 0 to 1, and we limit γ = 1 − μ. The first term in L_tw(θ) is the major objective to be optimised, and the second term in L_tw(θ) acts as a regularization term. Please note that the proposed learning method does not introduce any new parameters, since the auxiliary prediction losses are computed by reusing existing parameters.

E. Deep transition with self-attention

Network depth is of central importance in the resurgence of neural networks as a powerful machine learning paradigm [49]. Theoretical evidence indicates that deeper networks can be exponentially more efficient at representing certain function classes [50]. However, in our preliminary experiments, we found that there was almost no improvement in performance when we increased the number of hidden layers (we refer to Eq.(28) as one hidden layer). This indicates that Eq.(28) is a rough way to aggregate features. Therefore, we replace Eq.(28) with the transformer architecture proposed in [51].

In our implementation, the transformer layer contains a self-attention sublayer followed by a feed-forward network of two fully connected sublayers. Formally, the self-attention sublayer in the l-th transformer layer with parameters Q_l, K_l, T_l ∈ R^{h×h} is as follows:

c^l_att = (T_l c^l) υ( Tril((Q_l c^l)^T (K_l c^l))^T / √h )    (34)

where c^l is the input of the l-th transformer layer. The input of the first transformer layer is

c^1 = [φ̂_INLM(i−n); ...; φ̂_INLM(i−1)] ∈ R^{h×n}    (35)

and the input of the l-th transformer layer c^l is the output of the (l−1)-th transformer layer. υ is the softmax function. The Tril(X) operation keeps only the lower triangular part of the matrix X and replaces the elements in the upper triangular part with zeros. The Tril operation ensures that the model's predictions are only conditioned on past words. The fully connected sublayers in the l-th transformer layer with parameters W_1, b_1, W_2, b_2 are as follows:

c^l_tra = LN(W_2 max(0, W_1 LN(c^l_att) + b_1) + b_2) = [c^l_tra(i−n); ...; c^l_tra(i−1)]    (36)

where LN is the layer normalization operation [52], and c^l_tra(i−t) is the (n−t+1)-th column of the matrix c^l_tra.

According to Eq.(35) and Eq.(36), c^l in Eq.(34) can be represented as [c^l(i−n); ...; c^l(i−1)]. Let q^l_j = Q_l c^l(i−j), k^l_j = K_l c^l(i−j) and t^l_j = T_l c^l(i−j); then Eq.(34) can be reformulated as

c^l_att = [t^l_n; ...; t^l_1] υ( [ α_nn  ···  α_2n  α_1n
                                    ···   ···   ···
                                    0    ···  α_22  α_12
                                    0    ···   0    α_11 ] )
        = [ Σ_{j=n}^{n} α'_nj t^l_j   ···   Σ_{j=2}^{n} α'_2j t^l_j   Σ_{j=1}^{n} α'_1j t^l_j ]    (37)

where α_ij = q^l_i · k^l_j / √h and α'_ij is α_ij normalised by the softmax function υ. Eq.(37) shows that the transformer architecture aggregates inputs at different timesteps in a weighted way. The j-th column of c^l_att or c^l_tra is essentially a latent representation of the history w_{i−n}^{i−j}. In contrast, Eq.(28) treats inputs at different timesteps equally. Therefore, the transformer architecture can be regarded as an advanced version of Eq.(28).

Our implementation of the transformer is much simpler than the original formulation in [51]. Specifically, we perform a single attention instead of the multi-head attention, and remove the residual connections in both the self-attention and fully-connected sublayers. Although multi-head attention and residual connections can partly improve the performance of the proposed model, these components are not necessary for the language modeling task. Hence, in order to keep the proposed INLM in a simple architecture with better interpretability, we omit these components in our implementation.
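The simplified transformer layer of Eqs.(34)-(36) can be sketched as follows in NumPy. The causal masking is implemented with large negative scores before the softmax, which is the usual numerical realisation of the Tril operation; dimensions and parameters are illustrative, and the residual connections are omitted as described above.

```python
import numpy as np

# A sketch of the single-head, causally masked transformer layer
# of Eqs.(34)-(36). All parameters are random placeholders.
rng = np.random.default_rng(0)
h, n = 32, 6
Q, K, T = (rng.normal(0, 0.1, (h, h)) for _ in range(3))
W1, b1 = rng.normal(0, 0.1, (2 * h, h)), np.zeros((2 * h, 1))
W2, b2 = rng.normal(0, 0.1, (h, 2 * h)), np.zeros((h, 1))

def layer_norm(x, eps=1e-5):
    # Normalize each column (each timestep) of an h-by-n matrix.
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def softmax_cols(x):
    x = x - x.max(axis=0, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=0, keepdims=True)

def transformer_layer(c):
    # c is h-by-n: one column per history position, oldest first (Eq.(35)).
    raw = ((Q @ c).T @ (K @ c)).T / np.sqrt(h)              # scores in Eq.(34)
    allowed = np.tril(np.ones((n, n))).T                    # Tril(.)^T keeps past positions
    scores = np.where(allowed > 0, raw, -1e9)               # masked entries -> ~zero weight
    c_att = (T @ c) @ softmax_cols(scores)                  # Eq.(34)
    hidden = np.maximum(0.0, W1 @ layer_norm(c_att) + b1)   # first FC sublayer, Eq.(36)
    return layer_norm(W2 @ hidden + b2)                     # second FC sublayer, Eq.(36)

c1 = rng.normal(size=(h, n))
print(transformer_layer(c1).shape)                          # (32, 6)
```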


After replacing Eq.(28) with the transformer architecture, we cannot directly get φ^k_INLM(i) in Eq.(30). Therefore, we reimplement Eq.(30) as

φ_INLM(i) = η(C^1_acc(i) ⊕ ... ⊕ C^n_acc(i) ⊕ c^L_tra(i−1))    (38)

where C^k_acc(i) = Σ_{j=1}^{k} C(x_{i−j}) and L is the number of transformer layers. In addition, the auxiliary prediction ψ'(h_t) in Eq.(33) can be obtained by replacing φ_INLM(i) with φ^{t−i+n}_INLM(t) = c^L_tra(t−1) in Eq.(32).

F. Interpretability

Neural models are usually treated as black boxes in most applications. This leads to two main problems. One is that it is difficult to evaluate whether a neural model is suitable for a specific task. The other is that it is not clear how a neural model makes decisions. Previous studies usually addressed the first problem by evaluating the proposed neural models on specific tasks. This is an indirect but efficient approach to verifying whether the proposed model suits a specific task. However, researchers usually ignore the second problem.

This paper simultaneously addresses the above two problems by proposing an interpretable neural model, which provides interpretability in two aspects: component interpretability and prediction interpretability. Component interpretability refers to understanding how each fundamental design in every neural network component contributes to and impacts the performance in a given task. Since the fundamental design of every neural network component is clear, we can directly evaluate whether the proposed model is suitable for the language modeling task. Prediction interpretability refers to providing human-readable justifications that support the model's prediction. Since the proposed model adopts a feedforward structure, information at different timesteps is not entangled when propagating through layers. Therefore, we can figure out which inputs a certain prediction mainly relies on. In contrast, information at different timesteps is entangled in RNN models due to the recurrent operation, which makes it difficult to find out how the model makes decisions. Although RCNN models separate the current input C(x_{i−1}) from the entire context, as shown in Eq.(18), the residual context, namely φ^k_RCNN(i−1), is still entangled.

G. Computational complexity and degree of parallelism

This subsection investigates the computational complexity and degree of parallelism of different neural models, including the proposed INLM. For simplicity, we assume that the network depth of all neural models is one and ignore the bias terms. In practice, some models usually have more than one layer, such as CNN and RCNN models.

The model configurations are as follows. There are B sentences of T words. The vocabulary size is V, the embedding size is E, and the hidden size is H. All the feedforward models rely on N history words. For the CNN model, we adopt the depth-wise convolution with E input channels and one channel multiplier, and the kernel width is K.

Table I provides the computational complexity and degree of parallelism of the different models. Since the proposed learning method requires additional computation, the proposed INLM has a much higher computational complexity at training. The penultimate line in Table I is the training complexity of the INLM, and the last line is its inference complexity. As for the other neural models in Table I, their inference complexities are the same as their training complexities.

TABLE I
THE COMPUTATIONAL COMPLEXITY AND DEGREE OF PARALLELISM (DoP) COMPARISON OF DIFFERENT NEURAL LMS

Model | Complexity | DoP
FNN | (N−1)E×H + H×V | B×T
RNN | E×H + H² + H×V | B
GRU | 4(E×H) + 3H² + H×V | B
LSTM | 6(E×H) + 5H² + H×V | B
CNN | (N−1)E²×K + E×H + H×V | B×T
RCNN | E×H + H×V | B
INLM (training) | (N+2)E×H + 4H² + N×H×V | B×T
INLM (inference) | (N+2)E×H + 4H² + H×V | B×T

Table I shows that the computational complexities of the FNN, CNN, and INLM models are positively correlated with the history length N. This is innocuous since N is usually less than 30. Generally, the vocabulary size V is the bottleneck of the complexity, because the vocabulary may include millions of words. This problem is particularly severe for the INLM due to the auxiliary regularization term in Eq.(33). We bypass this problem by using the noise contrastive estimation (NCE) algorithm [53], which is a fast way to train neural LMs. The NCE method reduces the probability distribution estimation problem in Eq.(7) to the problem of estimating the parameters of a binary classifier that distinguishes samples from the empirical distribution from samples generated by the noise distribution. We use S (S ≪ V) noise samples for each target word. Then the term N×H×V in the training complexity of the INLM becomes N×H×S, which is still less than H×V. Hence, the proposed INLM can be efficiently trained via the NCE method.

As for the degree of parallelism, RNN, GRU, LSTM, and RCNN models can only be parallelized across sentences, while FNN, CNN, and INLM models can be parallelized across sentences as well as timesteps. Therefore, although the proposed INLM has a similar complexity to LSTM models, it has a higher parallel efficiency.
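For reference, the NumPy sketch below shows the NCE objective in its common binary-classification form: each target word is discriminated against S noise words drawn from the unigram distribution. The scores, the noise distribution, and the value of S are placeholders; the paper's exact NCE implementation may differ.

```python
import numpy as np

# A minimal sketch of the NCE objective mentioned in Section V-G: instead of
# normalizing over the full vocabulary as in Eq.(7), the target word is
# classified against S noise samples from a unigram noise distribution.
rng = np.random.default_rng(0)
V, S = 1000, 20
unigram = rng.random(V)
unigram /= unigram.sum()                             # hypothetical noise distribution

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(logits, target):
    # logits: unnormalised scores y_i over the vocabulary (as in Eq.(32)).
    noise = rng.choice(V, size=S, p=unigram)
    # "Data vs. noise" classification: positive term for the target word,
    # negative terms for the S sampled noise words.
    pos = np.log(sigmoid(logits[target] - np.log(S * unigram[target])))
    neg = np.log(sigmoid(-(logits[noise] - np.log(S * unigram[noise])))).sum()
    return -(pos + neg)

logits = rng.normal(size=V)                          # stand-in for y_i
print(nce_loss(logits, target=42))
```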


VI. EXPERIMENTAL SETUP

A. Datasets

We evaluated the proposed INLM on the Penn Treebank (PTB) [54], WikiText-2 [17], and Switchboard [55] datasets.

The PTB dataset is one of the most widely used datasets for evaluating the performance of LMs; it consists of about 929k training words, 73k validation words, and 82k test words. We adopted the same preprocessing as in [56]. Words were lower-cased, numbers were replaced with N, newlines were replaced with </s>, and all other punctuation was removed. The vocabulary was the top 10k most frequent words, with the rest of the tokens replaced by an <unk> token.

The WikiText-2 dataset introduced in [17] is another frequently used dataset for evaluating LMs; it consists of about 2M training words, 218k validation words, and 246k test words. The WikiText-2 dataset is more sophisticated than the PTB dataset: it retains numbers, case, and punctuation. The vocabulary size was 33,278. We carried out no extra processing for the WikiText-2 dataset other than replacing newlines with </s> tokens.

Both the PTB and the WikiText-2 datasets were collected from articles in written language. We would also like to evaluate the proposed INLM on a spoken language dataset, since spoken language usually has more complex patterns. Thus, we also conducted experiments on the Switchboard speech dataset, which consists of approximately 300 hours of conversational speech. We chose the first 4k utterances in the transcriptions as the validation set, and used the rest of the transcriptions to train LMs. Additionally, we adopted the HUB5 dataset [57] as the test set. In summary, there were about 3M training words, 43k validation words, and 49k test words. The vocabulary was limited to the top 25k most frequent training words; the rest of the words were replaced by the <unk> token. In addition, newlines were replaced with </s> tokens.

B. Neural LMs

In the experiments, the proposed INLM was compared to several typical neural LMs, including FNN, GRU, LSTM, CNN, and RCNN models. To be fair, all the models were limited to about the same amount of parameters.

Training sequences were clipped to a maximum length of 20 words for the LSTM, GRU and RCNN models. The LSTM and GRU models had 2 recurrent layers. The RCNN model had 3 recurrent layers and adopted adaptive weights λ [38]. The CNN model had 6 convolution layers with gated linear units (GLUs) [10]; residual connections [58] were adopted every two convolutional layers, and the kernel width was 4. As for the FNN model and the proposed INLM, the influence of context length on the overall performance was also investigated. In the experiments, the INLM/FNN model had context lengths of 5, 10, and 20 respectively. In order to reduce complexity, the INLM only adopted the (n−4m)-th, (n−3m)-th, (n−2m)-th, (n−m)-th, and n-th order n-gram features in Eq.(30), where m = n/5. Additionally, the weight V in Eq.(28) is shared across timesteps. However, our preliminary experiments show that sharing V in the first hidden layer limits the performance. Therefore, in the experiments, V is not shared in the first hidden layer.

Except for the CNN model, all models were optimized via the SGD algorithm with an initial learning rate of 1.0. The CNN model was optimized via the Adam algorithm [59] with an initial learning rate of 0.001, since it cannot be efficiently optimized by the SGD algorithm. The learning rate decayed by a factor of 0.8 when the validation loss plateaued. Training was stopped when the learning rate fell below 0.00001 (0.000001 for the CNN model). Dropout was used to avoid overfitting for all models. L2 regularization with a value of 0.0001 was applied to the INLM50 and INLM100.

Table II shows the hyperparameter configurations for all models in the experiments. The order for the LSTM, GRU and RCNN models means the length of the training sequences. Under the configuration in Table II, the CNN model actually depends on 8 history words. The depth means the number of layers; a layer can be either a fully-connected layer, a recurrent layer, a convolutional layer, or a transformer layer. The norm of the gradients was clipped to 1.0 for all models to avoid gradient explosion. In order to improve the training efficiency, the NCE loss was used. On the PTB/WikiText-2/Switchboard dataset, 300/800/900 noise samples were used for each target word respectively. Each noise word was sampled independently from the unigram distribution of the training data.

TABLE II
HYPERPARAMETER CONFIGURATIONS FOR DIFFERENT NEURAL LMS. "INLM W/O AT" MEANS EQ.(28) IS ADOPTED AS THE HIDDEN LAYER, AND "INLM W/ AT" MEANS THE TRANSFORMER ARCHITECTURE OF EQS.(34)-(36) IS ADOPTED AS THE HIDDEN LAYER.

Model | Order | Depth | Embed | Hidden | Dropout
FNN | 5 | 2 | 300 | 300 | 0.15
FNN | 10 | 2 | 300 | 300 | 0.20
FNN | 20 | 2 | 300 | 300 | 0.25
LSTM | 20 | 2 | 300 | 250 | 0.25
GRU | 20 | 2 | 300 | 250 | 0.25
CNN | 8 | 6 | 300 | 300 | 0.25
RCNN | 20 | 3 | 300 | 200 | 0.25
INLM w/o AT | 5 | 2 | 300 | 300 | 0.15
INLM w/o AT | 10 | 2 | 300 | 300 | 0.20
INLM w/o AT | 20 | 2 | 300 | 300 | 0.25
INLM w/ AT | 20 | 2 | 300 | 300 | 0.25
INLM w/ AT | 50 | 4 | 300 | 600 | 0.25
INLM w/ AT | 100 | 4 | 300 | 600 | 0.25

C. N-best rescoring

The proposed INLM was also evaluated on the Switchboard speech recognition task by rescoring the n-best lists of the speech recognition outputs. The baseline speech recognition system was built with the Kaldi toolkit [60]. The acoustic model is a hybrid 3-layer LSTM trained with the cross-entropy criterion. The baseline LM was a Kneser-Ney smoothed trigram model (KN3) estimated by the SRILM toolkit [61]. The baseline speech recognition system was used to generate n-best lists for rescoring. In the experiments, we selected the top 100 hypotheses for each test utterance. All neural models were interpolated with KN3 with a weight of 0.5.

We report the results on both the validation set and the test set (HUB5). In fact, HUB5 consists of two subsets: one has 20 unreleased telephone conversations from the Switchboard studies (referred to as SW); the other includes 20 telephone conversations from CALLHOME American English Speech (referred to as CH). Therefore, the results on these two subsets are also reported.
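The rescoring procedure can be summarised by the following schematic Python sketch: for every hypothesis in the n-best list, the KN3 and neural LM probabilities are linearly interpolated with a weight of 0.5 and combined with the acoustic score, and the best-scoring hypothesis is kept. The LM_SCALE value and the toy scoring functions are hypothetical placeholders, not the actual Kaldi setup.

```python
import math

# A schematic sketch of n-best rescoring with an interpolated LM.
LM_WEIGHT = 0.5    # interpolation weight between the neural LM and KN3
LM_SCALE = 12.0    # acoustic/LM scale, a typical but hypothetical value

def rescore(nbest, neural_lm_prob, kn3_prob):
    # nbest: list of (acoustic_log_score, [words]) pairs for one utterance.
    best, best_score = None, -math.inf
    for acoustic, words in nbest:
        lm_logprob, history = 0.0, []
        for w in words:
            # Linear interpolation of the two LM probabilities (weight 0.5).
            p = LM_WEIGHT * neural_lm_prob(w, history) + (1 - LM_WEIGHT) * kn3_prob(w, history)
            lm_logprob += math.log(max(p, 1e-12))
            history.append(w)
        score = acoustic + LM_SCALE * lm_logprob
        if score > best_score:
            best, best_score = words, score
    return best

# Toy usage with uniform dummy LMs over a 10-word vocabulary.
dummy = lambda w, h: 0.1
print(rescore([(-50.0, ["hello", "world"]), (-49.0, ["hello", "word"])], dummy, dummy))
```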


D. Interpretability analysis

In order to find out which patterns the proposed INLM learned from the training data, we visualised the averaged attention alignments, υ(Tril((Q_l c^l)^T (K_l c^l))^T / √h) in Eq.(34), of the INLM50 on the PTB test set. Some examples are also provided to show how the proposed INLM works on a specific utterance. In addition, we also visualised the adaptive weight μ (in Eq.(18)) of the RCNN model, which can partly interpret how the RCNN model utilises context.

VII. EXPERIMENTAL RESULTS

A. Perplexity

We first investigated the contribution of each component in the proposed INLM to the overall performance, and then compared the proposed INLM to several typical neural LMs. Results are reported in Table III. To make this clearer, we briefly recall the components proposed in Section V as follows. The PF (Eqs.(20)-(22)) and TR (Eq.(33)) components are designed to incorporate temporal information into the INLM. The IO (Eq.(30) without η) and GM (Eq.(31)) components aim to extract useful information from the context. We refer to the constraint in Eq.(32), which ties the matrix E in the mapping C and the matrix W in the output layer of Eq.(8), as the weight-tying (WT) component. The self-attention architecture (Eqs.(34)-(36)) is referred to as the AT component.

Results in lines 5-7, 9-11, and 14-16 of Table III demonstrate that both the PF component and the TR component can reduce the perplexity of the INLM regardless of the context length. This indicates that the proposed PF and TR components can efficiently capture temporal information, which is important for language modeling. Results in lines 5, 9, and 14 show that a longer context hurts the performance of the FNN model on the PTB dataset. This is probably because the FNN model cannot effectively learn temporal information from the context. In contrast, the proposed INLM can efficiently deal with a longer context with the help of the PF and TR components.

Results in lines 8, 12, and 17 show that the GM component can further improve the INLM's performance. The element-wise multiplication operation in the GM component can be considered a logical AND operator between two latent representations derived from different affine transforms. Therefore, the GM component can emphasise the information that is useful for both latent representations. Additionally, results in lines 8, 12, and 17 show that a longer context can improve the INLM's performance. Results in lines 13 and 18 show that the IO component can further improve the INLM's performance. This hints that additional features are useful. But in practice, a trade-off between performance and complexity is often necessary. In order to reduce complexity, both the INLM10 and INLM20 only adopted 5 lower order n-gram features in the IO component. The RCNN model has high complexity because the prediction of the next token depends on all consecutive/non-consecutive n-gram features up to the current timestep. Since the prediction of LMs mostly depends on the most recent words, some distant n-grams might have little or even a negative contribution to language modeling.

Results in lines 18 and 19 show that the WT component not only reduces the amount of parameters, but also improves the performance. This is consistent with the results in other studies [25]. Results in lines 18 and 20 also show that the AT component can significantly improve the performance of the proposed INLM. In Table III, the LSTM model outperforms the GRU, CNN, and RCNN models by a large margin. However, with the AT component, the proposed INLM20 outperforms the LSTM model on all datasets.

In addition to the performance, Table IV reports the training/testing speed of some of the models in Table III on the WikiText-2 dataset. For the LSTM, CNN, and RCNN models, each batch contained 400 words according to the configurations in Section VI-B. Therefore, we enlarged the batch size of the INLM20 to 400 words when evaluating the training/testing speed. Training(xEnt) means that the neural LMs were trained with cross-entropy, and Training(NCE) means that the NCE method was used to train the neural LMs. All experiments in Table IV were conducted on a Tesla P100 GPU. We trained/tested each model for 500 batches and report the average speed.

Table IV shows that the LSTM and RCNN models are less efficient than the CNN model due to their recurrent architectures. With the cross-entropy loss, the INLM20 has an extremely low training speed due to the auxiliary prediction losses in Eq.(33). With the NCE method, all models achieve a higher training speed. In particular, the INLM20 reaches a speed of about 12k words per second. Since the auxiliary predictions are no longer needed at test time, the testing speed of the INLM20 is second only to that of the CNN model.

TABLE IV
TRAINING AND TESTING SPEED (WORDS PER SECOND) OF DIFFERENT NEURAL LMS ON THE WIKITEXT-2 DATASET

Model | Training(xEnt) | Training(NCE) | Testing
LSTM | 8.8k | 9.2k | 22.3k
CNN | 12.6k | 13.3k | 33.4k
RCNN | 6.4k | 6.8k | 18.5k
INLM20 | 2.5k | 12.0k | 31.3k

In order to provide a more sound assessment, we also compared the proposed INLM with state-of-the-art language modeling techniques on the PTB and WikiText-2 datasets. Recently, many studies have claimed state-of-the-art performance on specific datasets. However, the LMs in recent studies were usually larger and deeper than their baseline models. In order to make a fair performance comparison between the proposed INLM and recent state-of-the-art language modeling techniques, all neural LMs should have about the same amount of parameters. Therefore, we reimplemented some models with the source code provided by the authors. Results are shown in Table V.

Table V shows that LSTM is still the most popular architecture in language modeling and that some CNN LMs (such as QRNN and TCN) are competitive with LSTM LMs. Results in Table V also demonstrate that the proposed INLM is competitive with recent state-of-the-art language models, although it performs slightly worse than the neural cache model [16], the pointer sentinel-LSTM model [17] and the AWD-LSTM model [22]. The cache and pointer techniques can effectively reduce perplexities by dynamically adapting the model's prediction to the recent history.


TABLE III
EVALUATING DIFFERENT NEURAL LMS ON THREE DATASETS.

No | Model | PF | TR | GM | IO | WT | AT | PTB Size | PTB Perplexity | WikiText-2 Size | WikiText-2 Perplexity | HUB5 Size | HUB5 Perplexity
1 | LSTM | - | - | - | - | ✓ | - | 6.7M | 110.65 | 13.8M | 132.65 | 10.5M | 79.87
2 | GRU | - | - | - | - | ✓ | - | 6.3M | 121.28 | 13.8M | 151.03 | 10.5M | 83.91
3 | CNN | - | - | - | - | ✓ | - | 6.3M | 126.40 | 13.5M | 171.39 | 10.0M | 89.40
4 | RCNN | - | - | - | - | ✓ | - | 6.7M | 126.79 | 13.7M | 168.74 | 10.3M | 86.67
5 | FNN5 | - | - | - | - | ✓ | - | 3.7M | 140.25 | 10.7M | 178.05 | 8.2M | 93.65
6 | INLM5 | ✓ | - | - | - | ✓ | - | 3.7M | 134.01 | 10.7M | 174.49 | 8.2M | 87.47
7 | INLM5 | ✓ | ✓ | - | - | ✓ | - | 3.7M | 129.57 | 10.7M | 170.24 | 8.2M | 86.81
8 | INLM5 | ✓ | ✓ | ✓ | - | ✓ | - | 3.9M | 120.42 | 10.9M | 158.84 | 8.4M | 84.11
9 | FNN10 | - | - | - | - | ✓ | - | 4.2M | 148.71 | 11.2M | 173.78 | 8.7M | 89.31
10 | INLM10 | ✓ | - | - | - | ✓ | - | 4.2M | 134.89 | 11.2M | 162.48 | 8.7M | 86.46
11 | INLM10 | ✓ | ✓ | - | - | ✓ | - | 4.2M | 128.88 | 11.2M | 161.81 | 8.7M | 85.59
12 | INLM10 | ✓ | ✓ | ✓ | - | ✓ | - | 4.4M | 119.82 | 11.4M | 147.39 | 8.9M | 83.98
13 | INLM10 | ✓ | ✓ | ✓ | ✓ | ✓ | - | 4.7M | 114.68 | 11.7M | 142.22 | 9.3M | 83.88
14 | FNN20 | - | - | - | - | ✓ | - | 5.1M | 154.95 | 12.1M | 172.62 | 9.6M | 88.37
15 | INLM20 | ✓ | - | - | - | ✓ | - | 5.2M | 140.69 | 12.2M | 160.65 | 9.7M | 86.16
16 | INLM20 | ✓ | ✓ | - | - | ✓ | - | 5.2M | 131.95 | 12.2M | 160.19 | 9.7M | 85.51
17 | INLM20 | ✓ | ✓ | ✓ | - | ✓ | - | 5.4M | 121.90 | 12.2M | 146.21 | 9.9M | 83.80
18 | INLM20 | ✓ | ✓ | ✓ | ✓ | ✓ | - | 6.2M | 116.33 | 12.4M | 139.99 | 10.0M | 83.52
19 | INLM20 | ✓ | ✓ | ✓ | ✓ | - | - | 8.7M | 117.55 | 22.7M | 142.41 | 17.7M | 84.83
20 | INLM20 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 6.5M | 92.61 | 12.6M | 109.29 | 10.3M | 79.76

TABLE V
COMPARING THE PROPOSED INLM WITH THE STATE-OF-THE-ART LANGUAGE MODELING TECHNIQUES ON THE PTB AND WIKITEXT-2 DATASETS.

                                                                          PTB                WikiText-2
Model                                                                Size   Perplexity    Size   Perplexity
FOFE (Zhang et al., 2015) [28]                                        6M     108            -      -
FSMN (Zhang et al., 2015) [29]                                        6M     101            -      -
LSTM (medium) (Zaremba et al., 2015) [19]                            20M     82.7           -      -
LSTM-b (Jozefowicz et al., 2015) [12]                                20M     79.8           -      -
Variational LSTM (medium) (Gal et al., 2016) [20]                    20M     79.7           -      -
LSTM-Char-CNN (large) (Kim et al., 2016) [9]                         19M     78.9           -      -
Variational RHN-WT (2 hidden layers) (Zilly et al., 2016) [14]       17M     75.1           -      -
NAS (our implementation) (Zoph et al., 2017) [13]                    18M     90.7          26M    121.6
RCNN (adaptive λ, bigram) (Lei, 2017) [38]                           16M     89.6           -      -
HW-LSTM (our implementation) (Kurata et al., 2017) [15]              20M     82.3          27M    106.8
QRNN (Bradbury et al., 2017) [30]                                    18M     79.9           -      -
LSTM-WT (large) (Press et al., 2017) [25]                            51M     74.3           -      -
VD-LSTM-ALRE (medium) (Inan et al., 2017) [26]                       10M     73.2          25M     87.0
Zoneout + Variational LSTM (Merity et al., 2017) [17]                 -      -             20M    100.9
Neural cache model (Grave et al., 2017) [16]                          -      72.1           -      81.6
Pointer Sentinel-LSTM (Merity et al., 2017) [17]                     21M     70.9          21M     80.8
TCN (Bai et al., 2018) [11]                                          13M     89.2           -      -
Res-IndRNN (our implementation) (Li et al., 2018) [18]               21M     86.1          30M    112.5
AWD-LSTM (w/ NT-ASGD, our implementation) (Merity et al., 2018) [22] 16M     66.2          21M     83.6
AWD-LSTM (w/ SGD, our implementation) (Merity et al., 2018) [22]     16M     74.4          21M     97.5
INLM50 + L2 regularization                                           15M     73.1          22M     86.4
INLM100 + L2 regularization                                          15M     72.9          22M     83.6

The NT-ASGD method can significantly improve the performance of LSTM models by model averaging. Results in Table V show that the AWD-LSTM only achieves a perplexity of 74.4 on the PTB dataset and a perplexity of 97.5 on the WikiText-2 dataset when optimized by SGD. However, the NT-ASGD method reduces the perplexities of the AWD-LSTM to 66.2 on the PTB dataset and 83.6 on the WikiText-2 dataset, respectively. In addition, results in Table V show that the proposed INLM outperforms the LMs with highway architectures, such as the variational RHN [14] and the HW-LSTM [15].

In Table VI, we reevaluated the effect of each proposed component in the INLM50, since the models in Table III are too small. Results show that each proposed component can still reduce the perplexity even when the model size becomes larger. In addition, the INLM can achieve better performance with a longer context.

In our experiments, we found that the RCNN and CNN models needed well-chosen hyperparameters and were sensitive to the optimization algorithm. Hence, they performed less than satisfactorily in our configurations. We further conducted a series of experiments to evaluate how sensitive the proposed INLM is to the hyperparameters, including the decay rate of the learning rate (LrDcy), the maximum of the gradient norm (GrdNrm), the dropout rate, the batch size, and the number of lower order n-gram features in the IO component (nNgrm). Results are presented in Table VII. Specifically, an nNgrm of 10 means that the (n − 9m)-th, ..., (n − m)-th, and n-th order n-gram features are used in Eq.(30), where m = n/10. An nNgrm of 25 means that the (n − 24m)-th, ..., (n − m)-th, and n-th order n-gram features are used in Eq.(30), where m = n/25.
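As a concrete illustration of this selection rule, the small helper below enumerates the selected orders; the function name ngram_orders is ours, and extending the rule to other nNgrm values (such as the default of 5) is an assumption by analogy.

def ngram_orders(n, n_ngrm):
    # n-gram orders used by the IO component when n_ngrm lower order
    # features are selected, following the rule above:
    # the (n - (n_ngrm - 1) * m)-th, ..., (n - m)-th and n-th orders,
    # with m = n / n_ngrm (assumed to divide n evenly). Sketch only.
    m = n // n_ngrm
    return [n - k * m for k in range(n_ngrm - 1, -1, -1)]

# For a 50-th order INLM:
# ngram_orders(50, 5)  -> [10, 20, 30, 40, 50]
# ngram_orders(50, 10) -> [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
# ngram_orders(50, 25) -> [2, 4, 6, ..., 48, 50]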


TABLE VI
EVALUATING THE EFFECT OF THE PROPOSED COMPONENTS IN INLM50 ON THE PTB AND WIKITEXT-2 DATASETS.

                                               PTB                 WikiText-2
No.  Model     PF  TR  GM  IO  WT  AT    Size    Perplexity    Size    Perplexity
 1   INLM50    -   -   -   -   √   -     13M     159.94        20M     164.60
 2   INLM50    √   -   -   -   √   -     14M     121.06        21M     133.19
 3   INLM50    √   √   -   -   √   -     14M     113.99        21M     128.07
 4   INLM50    √   √   √   -   √   -     15M     102.75        22M     118.52
 5   INLM50    √   √   √   √   √   -     16M     97.06         23M     113.13
 6   INLM50    √   √   √   √   -   -     19M     101.49        33M     122.23
 7   INLM50    √   √   √   √   √   √     15M     75.11         22M     88.43
 8   INLM100   √   √   √   √   √   √     15M     73.89         22M     85.39

Overall, the proposed INLM50 can still achieve satisfactory performance when the hyperparameters fluctuate within a certain range. Results show that the proposed INLM can achieve better performance with a larger batch size. The dropout rate has the largest effect among all the hyperparameters. With a small dropout rate of 0.15, the perplexity of the INLM50 increases to 78.19. This indicates that dropout can effectively avoid overfitting on small datasets. Results in the last two lines of Table VII also indicate that incorporating more lower order n-gram features into Eq.(30) provides no significant improvement. Hence, it is reasonable to adopt 5 lower order n-gram features in the IO component in our experiments.

TABLE VII
PERFORMANCE OF INLM50 ON THE PTB DATASET UNDER DIFFERENT HYPERPARAMETER CONFIGURATIONS

LrDcy   GrdNrm   Dropout   Batch   nNgrm   Perplexity
0.8     1.0      0.25      200     5       75.11
0.7     1.0      0.25      200     5       75.54
0.6     1.0      0.25      200     5       76.11
0.8     1.5      0.25      200     5       75.25
0.8     2.0      0.25      200     5       76.02
0.8     1.0      0.35      200     5       74.79
0.8     1.0      0.15      200     5       78.19
0.8     1.0      0.25      300     5       74.59
0.8     1.0      0.25      400     5       74.14
0.8     1.0      0.25      200     10      74.41
0.8     1.0      0.25      200     25      74.20

B. N-best rescoring

This subsection evaluates the proposed INLM on the Switchboard speech recognition task. Since the proposed INLM requires a fixed-length context at each timestep, we padded extra tokens to contexts that are shorter than the specified length. For example, the raw context at timestep t = 5 only consists of 4 words, but the INLM5 requires a context of length 5. Therefore, an extra token should be added in front of the raw context. This paper investigated two different padding methods. Given a context of length m and an n-th order INLM (n > m), the first method pads n − m sentence end markers </s> to the raw context, whereas the second one pads the last n − m words of the current utterance to the raw context. If the current utterance is shorter than n − m, we first produce a new utterance that is longer than n − m by duplicating the raw utterance several times. We refer to the first method as eos padding and the second one as tail padding. Results are presented in Table VIII.
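A minimal sketch of the two padding strategies is given below; the function names and the details of the duplication loop are our reading of the description above, and "</s>" stands for the sentence end marker.

def eos_padding(context, n, eos="</s>"):
    # Left-pad the raw context with sentence end markers up to length n.
    return [eos] * max(n - len(context), 0) + context

def tail_padding(context, utterance, n):
    # Left-pad the raw context with the last (n - len(context)) words of the
    # current utterance; if the utterance is too short, duplicate it first.
    need = n - len(context)
    if need <= 0:
        return context
    words = list(utterance)
    while len(words) < need:          # duplicate the raw utterance several times
        words += list(utterance)
    return words[-need:] + context

# Example with a 5-th order INLM at timestep t = 5 (4 history words):
# eos_padding(["i", "like", "this", "movie"], 5)
#   -> ["</s>", "i", "like", "this", "movie"]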
TABLE VIII
AN EVALUATION OF DIFFERENT NEURAL LMS ON THE SWITCHBOARD SPEECH RECOGNITION TASK

Model             Padding   DEV     SW      CH      HUB5
baseline          -         15.43   14.88   24.92   20.01
LSTM              -         14.27   13.85   23.84   18.95
GRU               -         14.34   13.99   23.97   19.08
CNN               -         14.89   14.54   24.38   19.56
RCNN              -         14.60   14.25   24.03   19.23
INLM5 (w/o AT)    eos       14.34   13.97   24.14   19.15
                  tail      14.34   14.04   24.10   19.17
INLM10 (w/o AT)   eos       14.33   13.95   24.07   19.10
                  tail      14.31   13.94   24.03   19.08
INLM20 (w/o AT)   eos       14.32   13.83   23.92   18.97
                  tail      14.40   13.83   23.75   18.89
INLM20 (w/ AT)    eos       14.16   13.57   23.57   18.67
                  tail      14.15   13.55   23.56   18.65

The results in Table VIII are consistent with those in Table III. The proposed INLM20 reaches the optimal performance and outperforms the LSTM model by a narrow margin. Table VIII also demonstrates that the two padding methods achieve almost the same results. Padding meaningful words (tail padding) sometimes achieves slightly better results, whereas padding </s> (eos padding) is more robust. More specifically, the eos padding method introduces the trivial token </s> into the raw context, which cannot provide useful information. Similar to the idea of the cache model, the original intention of the tail padding strategy is as follows. Tokens that appeared in the history are more likely to appear again in the future. Conversely, tokens at the end of the current utterance are also likely to appear in the missing context of the tokens at the beginning of the current utterance. Conversations in HUB5 are split into short utterances. Hence, tokens at the beginning of most utterances do have a ground-truth context, namely the ground-truth transcription of the preceding utterance. However, the ground-truth contexts are unavailable at testing time. According to the above analysis, the tail padding method might provide better results than the eos padding method. However, if there are obvious differences between the preceding utterance and the current utterance, the tail padding method will perform worse than the eos padding method.

Empirically, a longer context usually leads to better performance in language modeling. However, results in Table VIII show that a longer context has negligible contributions to the performance of the proposed INLM. This may be owing to the following two reasons. Firstly, the Switchboard dataset is a collection of conversational utterances, which usually lack long-term dependencies. Secondly, the test set mainly consists of short utterances.
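The word error rates in Table VIII (and in Tables X and XI below) are obtained by rescoring n-best hypotheses produced by the recognizer. The following is only a generic sketch of that step; the log-linear combination and the names lm_weight and acoustic_score are assumptions, not the paper's exact rescoring setup.

def rescore_nbest(hypotheses, lm_score, lm_weight=0.7):
    # Pick the hypothesis with the best combined score. `hypotheses` is a
    # list of (words, acoustic_score) pairs, and lm_score(words) returns a
    # log-probability from the language model. Sketch only.
    best, best_score = None, float("-inf")
    for words, acoustic_score in hypotheses:
        score = acoustic_score + lm_weight * lm_score(words)
        if score > best_score:
            best, best_score = words, score
    return best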


Table IX shows the statistics of the utterance length in the validation set and the test set. The utterances are divided into four groups by length, as shown in the first column of Table IX. For each group of utterances, we computed their proportion in the entire validation/test set (Ratio) and their mean length (Mean). The statistics demonstrate that, in the validation/test set, almost half of the utterances are shorter than 5 words and the mean length of the utterances is about 10 words. Therefore, in this case, a very long context is not necessary.

TABLE IX
LENGTH OF THE UTTERANCES IN THE SWITCHBOARD VALIDATION SET AND THE HUB5 DATASET

             DEV                 SW                  CH
Length       Ratio (%)  Mean     Ratio (%)  Mean     Ratio (%)  Mean
[1,5)        45.67      1.81     43.53      1.82     47.96      2.25
[5,10)       12.20      7.88     14.36      7.96     22.12      7.76
[10,20)      18.20      15.30    24.41      15.01    21.47      14.67
[20,∞)       23.93      32.30    20.70      31.16    8.45       26.40
Total        100.0      12.30    100.0      11.60    100.0      8.18

In order to investigate whether a longer span model is useful when the utterances are longer, we further conducted the following experiments. We divided the utterances in HUB5 into four subsets according to their lengths, as in Table IX. The perplexity and WER results of the INLM5/INLM10/INLM20 on each subset are presented in Table X and Table XI, respectively.

TABLE X
PERPLEXITY RESULTS OF THE INLM5/INLM10/INLM20 ON EACH SUBSET OF HUB5

             INLM5             INLM10            INLM20
Length       eos      tail     eos      tail     eos      tail
[1,5)        42.49    34.31    42.81    31.44    43.18    31.06
[5,10)       93.70    92.84    94.89    77.30    95.24    75.53
[10,20)      95.39    95.83    94.78    94.74    95.03    88.24
[20,∞)       92.15    92.42    91.74    92.51    90.44    91.24

TABLE XI
WER RESULTS OF THE INLM5/INLM10/INLM20 ON EACH SUBSET OF HUB5

             INLM5             INLM10            INLM20
Length       eos      tail     eos      tail     eos      tail
[1,5)        25.93    25.74    26.17    25.94    26.05    25.59
[5,10)       21.12    21.49    21.35    21.05    21.15    21.19
[10,20)      19.62    19.53    19.45    19.36    19.26    19.30
[20,∞)       16.11    16.11    16.00    16.18    15.72    15.72
With the tail padding method, the INLM20 achieves the lowest perplexities on each subset. With the eos padding method, the INLM5 performs best on the first two subsets. This is probably because more meaningless </s> tokens were added into the contexts of the INLM10/INLM20. However, the INLM10 outperforms the INLM5 on the third subset, and the INLM20 performs best on the fourth subset. This indicates that a longer span model usually performs better if effective contexts are available. In summary, results in Table X show that a longer context is useful for the proposed INLM. Moreover, the eos padding method limits the performance of long span INLMs on short utterances because of the meaningless padding tokens, whereas the tail padding method successfully bypasses this problem. Similar conclusions can be drawn from Table XI. Since the WER is also affected by other factors, such as the acoustic likelihood, the trends in Table XI are not as obvious as those in Table X.

C. Interpretability analysis

This subsection provides some elementary analyses of the prediction interpretability of the proposed INLM. For each transformer layer in the INLM50, we computed the averaged attention alignments (υ(Tril((Q^l c^l)^T (K^l c^l))^T / √h) in Eq.(34)) on the PTB test set. The averaged attention alignments are plotted in Fig. 1. The vertical axis of each subfigure is the order of the attention. Concretely, the k-th order attention in the l-th layer determines the weights for the vectors c^{l−1}_tra(i − n), ..., c^{l−1}_tra(i − n + k − 1) in Eq.(36). The horizontal axis of each subfigure represents the history words. At timestep t, word w_{t−1} has index 49 and word w_{t−50} has index 0.

Fig. 1. Averaged attention alignments in each transformer layer of the INLM50 on the PTB test set. The averaged attention weights for each order are normalised to between 0 and 1 independently.

As shown in Fig. 1, the attention alignments in the first layer focus on the full context, while the attention alignments in higher layers mainly focus on the recent context. Since the weight γ in Eq.(33) decays exponentially during training, low-order attentions might not be fully optimised. Therefore, some low-order attentions in the 4-th layer fail to learn useful patterns. This does not affect the final performance of the INLM, since only the highest-order attention is used at testing.
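The sketch below illustrates how causally masked attention alignments of this kind, and their accumulation across layers, can be computed; Eq.(33), Eq.(34) and Eq.(36) define the exact quantities used in the paper, so the tensor shapes and the function names here are illustrative assumptions only.

import torch

def causal_attention_weights(Q, K, h):
    # Scaled dot-product scores with a lower-triangular (causal) mask,
    # normalised by a softmax; Q and K have shape (context_len, h).
    # Generic sketch, not the paper's exact Eq.(34).
    scores = (Q @ K.t()) / h ** 0.5
    mask = torch.tril(torch.ones_like(scores)).bool()
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1)     # row i attends only to positions <= i

def accumulated_weights(per_layer_weights):
    # Accumulated weight of each history word, obtained by multiplying the
    # corresponding attention weights across layers, as done for Fig. 3.
    acc = per_layer_weights[0]
    for w in per_layer_weights[1:]:
        acc = acc * w                        # element-wise product across layers
    return acc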
To provide a further analysis of the interpretability, we selected an example utterance from the PTB test set. The highest-order attention weights in the first transformer layer of the INLM20 for the most recent 5 words in the context at each timestep are plotted in Fig. 2. The highest-order accumulated attention weights for the most recent 5 words in the context at each timestep are plotted in Fig. 3. At timestep t, the highest-order accumulated attention weight for the history word w_{t−j} is obtained by multiplying the corresponding attention weights in all transformer layers.

Fig. 2. The highest-order attention weights in the first transformer layer of the INLM20 for the most recent 5 words in the context at each timestep on an example utterance. At each timestep, attention t − i means the highest-order attention weight for the history word w_{t−i}.

Fig. 3. The highest-order accumulated attention weights of the INLM20 for the most recent 5 words in the context at each timestep on an example utterance. At each timestep, attention t − i means the highest-order accumulated attention weight for the history word w_{t−i}.

In Fig. 2, almost all the attention weights are between 0.145 and 0.165.


At each timestep, the attention weights for different words in the context have similar values. Fig. 3 shows that more recent history words have larger attention weights. These results are also consistent with those in Fig. 1.

Since the RCNN model can provide some interpretability, as we mentioned in Section V-F, we also plot the µ in Eq.(18) on the same utterance for comparison. In fact, µ is a high-dimensional vector and each of its elements ranges from 0 to 1. Therefore, we plot the mean of the elements of µ in Fig. 4.

Fig. 4. The weight µ for the current input w_{t−1} in the RCNN model at each timestep on an example utterance.
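Eq.(18) is not reproduced in this section, so the following is only a generic sketch of this kind of gate and of the scalar plotted in Fig. 4 (the mean of the elements of µ); the function name and the convex-combination form are assumptions rather than the RCNN's exact formulation.

import torch

def gated_mix(x_t, history, W_mu, U_mu):
    # A generic gate in the spirit of the weight µ discussed above: µ weighs
    # the current input against the aggregated history, element-wise.
    # Sketch only; the RCNN's Eq.(18) may differ.
    mu = torch.sigmoid(W_mu @ x_t + U_mu @ history)   # each element in (0, 1)
    mixed = mu * x_t + (1.0 - mu) * history
    return mixed, mu.mean().item()                    # the mean is what Fig. 4 plots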
As shown in Eq.(18), µ indicates the importance of the current input. Fig. 4 shows that µ is larger than 0.48 at every timestep, which means the current input plays the most important role among all history words. Although the weight µ can provide some prediction interpretability, the RCNN model still has some defects. One is that, apart from the current input, the residual context is still entangled together, which brings some difficulties for analysing the contribution of each history word. Another is that µ is shared among all hidden layers, which might limit the performance of the RCNN model and prevents us from further analyzing the patterns learned by different layers.

VIII. CONCLUSIONS

This paper proposes a unified framework for language modeling, which can partly interpret the rationales behind existing LMs. Based on the proposed framework, an interpretable neural language model is proposed, including a tailored architectural structure and a tailored learning method. The proposed model, which can be approximated as a parameterized ARMA model, provides interpretability in two aspects: component interpretability and prediction interpretability. Since the proposed model adopts a feedforward architecture, it does not suffer from the gradient decay problem and can be parallelised across timesteps. Complicated regularization techniques are no longer necessary for the proposed model, since conventional regularization techniques, such as dropout and L2 regularization, can efficiently regularise the proposed model.

Experiments demonstrate that the proposed model outperforms some typical neural LMs on several language modeling datasets and on the Switchboard speech recognition task. Further experiments also show that the proposed model is competitive with the state-of-the-art LSTM LMs on the PTB and WikiText-2 datasets. Experiments on interpretability demonstrate that the proposed model can efficiently learn temporal information from the data, but it fails to learn high-level knowledge. Hence, in the future, we will study how to incorporate high-level knowledge, such as semantic information, into the proposed model.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their helpful and constructive comments that greatly improved this manuscript.


This work is partially supported by the National Natural Science Foundation of China (Nos. 11590771, 11590770), the National Key Research and Development Program (Nos. 2016YFB0801203, 2016YFB0801200), the Key Science and Technology Project of the Xinjiang Uygur Autonomous Region (No. 2016A03007-1), the Pre-research Project for Equipment of General Information System (No. JZX2017-0994/Y306), and the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDC02050400).

REFERENCES

[1] X. Liu, X. Chen, Y. Wang, M. J. Gales, and P. C. Woodland, "Two efficient lattice rescoring methods using recurrent neural network language models," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1438–1449, 2016.
[2] F. J. Och and H. Ney, "Discriminative training and maximum entropy models for statistical machine translation," in Proc. of the 40th Annual Meeting on Association for Computational Linguistics, 2002, pp. 295–302.
[3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, no. Feb, pp. 1137–1155, 2003.
[4] R. Kneser and H. Ney, "Improved backing-off for m-gram language modeling," in Proc. of ICASSP, 1995, pp. 181–184.
[5] M. Sundermeyer, H. Ney, and R. Schlüter, "From feedforward to recurrent LSTM neural networks for language modeling," IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 517–529, 2015.
[6] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Proc. INTERSPEECH, 2010.
[7] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[8] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[9] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," in Proc. of the 30th AAAI Conference on Artificial Intelligence, 2016, pp. 2741–2749.
[10] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," arXiv preprint arXiv:1612.08083, 2016.
[11] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," arXiv preprint arXiv:1803.01271, 2018.
[12] R. Jozefowicz, W. Zaremba, and I. Sutskever, "An empirical exploration of recurrent network architectures," in Proc. of International Conference on Machine Learning, 2015, pp. 2342–2350.
[13] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.
[14] J. G. Zilly, R. K. Srivastava, J. Koutník, and J. Schmidhuber, "Recurrent highway networks," arXiv preprint arXiv:1607.03474, 2016.
[15] G. Kurata, B. Ramabhadran, G. Saon, and A. Sethy, "Language modeling with highway LSTM," in Proc. of 2017 IEEE Automatic Speech Recognition and Understanding Workshop, 2017, pp. 244–251.
[16] E. Grave, A. Joulin, and N. Usunier, "Improving neural language models with a continuous cache," arXiv preprint arXiv:1612.04426, 2016.
[17] S. Merity, C. Xiong, J. Bradbury, and R. Socher, "Pointer sentinel mixture models," arXiv preprint arXiv:1609.07843, 2016.
[18] S. Li, W. Li, C. Cook, C. Zhu, and Y. Gao, "Independently recurrent neural network (IndRNN): Building a longer and deeper RNN," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5457–5466.
[19] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.
[20] Y. Gal and Z. Ghahramani, "A theoretically grounded application of dropout in recurrent neural networks," in Proc. of the 30th International Conference on Neural Information Processing Systems, 2016, pp. 1019–1027.
[21] K. Zolna, D. Arpit, D. Suhubdy, and Y. Bengio, "Fraternal dropout," arXiv preprint arXiv:1711.00066, 2017.
[22] S. Merity, N. S. Keskar, and R. Socher, "Regularizing and optimizing LSTM language models," arXiv preprint arXiv:1708.02182, 2017.
[23] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus, "Regularization of neural networks using dropconnect," in Proc. of International Conference on Machine Learning, 2013, pp. 1058–1066.
[24] S. Merity, B. McCann, and R. Socher, "Revisiting activation regularization for language RNNs," arXiv preprint arXiv:1708.01009, 2017.
[25] O. Press and L. Wolf, "Using the output embedding to improve language models," arXiv preprint arXiv:1608.05859, 2016.
[26] H. Inan, K. Khosravi, and R. Socher, "Tying word vectors and word classifiers: A loss framework for language modeling," arXiv preprint arXiv:1611.01462, 2016.
[27] Z. Yang, Z. Dai, R. Salakhutdinov, and W. W. Cohen, "Breaking the softmax bottleneck: A high-rank RNN language model," arXiv preprint arXiv:1711.03953, 2017.
[28] S. Zhang, H. Jiang, M. Xu, J. Hou, and L. Dai, "A fixed-size encoding method for variable-length sequences with its application to neural network language models," arXiv preprint arXiv:1505.01504, 2015.
[29] S. Zhang, C. Liu, H. Jiang, S. Wei, L. Dai, and Y. Hu, "Feedforward sequential memory networks: A new structure to learn long-term dependency," arXiv preprint arXiv:1512.08301, 2015.
[30] J. Bradbury, S. Merity, C. Xiong, and R. Socher, "Quasi-recurrent neural networks," arXiv preprint arXiv:1611.01576, 2016.
[31] A. Mahendran and A. Vedaldi, "Understanding deep image representations by inverting them," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5188–5196.
[32] A. Dosovitskiy and T. Brox, "Inverting visual representations with convolutional networks," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4829–4837.
[33] M. Aubry and B. C. Russell, "Understanding deep features with computer-generated imagery," in Proc. of the IEEE International Conference on Computer Vision, 2015, pp. 2875–2883.
[34] P. W. Koh and P. Liang, "Understanding black-box predictions via influence functions," arXiv preprint arXiv:1703.04730, 2017.
[35] Q. Zhang, Y. N. Wu, and S.-C. Zhu, "Interpretable convolutional neural networks," arXiv preprint arXiv:1710.00935, 2017.
[36] T. Wu, X. Li, X. Song, W. Sun, L. Dong, and B. Li, "Interpretable R-CNN," arXiv preprint arXiv:1711.05226, 2017.
[37] O. Li, H. Liu, C. Chen, and C. Rudin, "Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions," arXiv preprint arXiv:1710.04806, 2017.
[38] T. Lei, "Interpretable neural models for natural language processing," Ph.D. dissertation, Massachusetts Institute of Technology, 2017.
[39] I. J. Good, "The population frequencies of species and the estimation of population parameters," Biometrika, vol. 40, no. 3-4, pp. 237–264, 1953.
[40] S. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 3, pp. 400–401, 1987.
[41] S. Khudanpur and J. Wu, "A maximum entropy language model integrating n-grams and topic dependencies for conversational speech recognition," in Proc. of ICASSP, 1999, pp. 553–556.
[42] R. Rosenfeld, "Adaptive statistical language modeling: a maximum entropy approach," Carnegie Mellon University, Pittsburgh, PA, Dept. of Computer Science, Tech. Rep., 1994.
[43] D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun, "A practical part-of-speech tagger," in Proc. of the 3rd Conference on Applied Natural Language Processing, 1992, pp. 133–140.
[44] T. Mikolov and G. Zweig, "Context dependent recurrent neural network language model," in Proc. of IEEE Spoken Language Technology Workshop, 2012, pp. 234–239.
[45] S. L. Marple, Digital spectral analysis: with applications. Prentice-Hall, Englewood Cliffs, NJ, 1987, vol. 5.
[46] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proc. of 30th International Conference on Machine Learning, 2013, pp. 1310–1318.
[47] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Highway networks," arXiv preprint arXiv:1505.00387, 2015.
[48] K. Irie, Z. Lei, R. Schlüter, and H. Ney, "Prediction of LSTM-RNN full context states as a subtask for n-gram feedforward language models," in Proc. of ICASSP, 2018, pp. 6104–6108.
[49] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
[50] M. Bianchini and F. Scarselli, "On the complexity of neural network classifiers: A comparison between shallow and deep architectures," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 8, pp. 1553–1565, 2014.
[51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 5998–6008.
[52] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.


[53] M. Gutmann and A. Hyvärinen, "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models," in Proc. of the 13th International Conference on Artificial Intelligence and Statistics, 2010, pp. 297–304.
[54] M. Marcus, B. Santorini, M. A. Marcinkiewicz, and A. Taylor, "Treebank-3 LDC99T42," CD-ROM, Philadelphia, PA: Linguistic Data Consortium, 1999.
[55] G. John and E. Holliman, "Switchboard-1 release 2 LDC97S62," Web download, Philadelphia: Linguistic Data Consortium, 1993.
[56] T. Mikolov, "Statistical language models based on neural networks," Ph.D. dissertation, Brno University of Technology, 2012.
[57] Linguistic Data Consortium, "2000 HUB5 English evaluation speech LDC2002S09," Web download, Philadelphia: Linguistic Data Consortium, 2002.
[58] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[59] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[60] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, no. EPFL-CONF-192584, IEEE Signal Processing Society, 2011.
[61] A. Stolcke, "SRILM: An extensible language modeling toolkit," in Proc. of 7th International Conference on Spoken Language Processing, 2002.

Yike Zhang received the B.E. degree in Information and Signal Processing from Northwestern Polytechnical University, China, in 2014. He is a Ph.D. candidate at the Institute of Acoustics, Chinese Academy of Sciences. His research interests include automatic speech recognition and natural language processing.

Pengyuan Zhang received the Ph.D. degree in Information and Signal Processing from the Institute of Acoustics, Chinese Academy of Sciences, China, in 2007. From 2013 to 2014, he was a research scholar at the University of Sheffield. He is currently a professor at the Speech Acoustics and Content Understanding Laboratory, Chinese Academy of Sciences. His research interests include spontaneous speech recognition, speech synthesis and acoustic signal detection.

Yonghong Yan received the B.E. degree in Electronic Engineering from Tsinghua University, China, in 1990, and the Ph.D. degree in Computer Science and Engineering from Oregon Graduate Institute of Science and Technology, USA, in 1995. Currently he is a professor at the Speech Acoustics and Content Understanding Laboratory, Chinese Academy of Sciences. His research interests include speech processing and recognition, language/speaker recognition, and human-computer interfaces.

