Vivek Tyagi
2. HYBRID CD-DNN-HMM KWS

Our proposed Hybrid CD-DNN-HMM KWS system is illustrated in Figure 1 and consists of two major modules: (1) a four-hidden-layer DNN trained with context dependent (CD) tied-state triphones as the output classes (1147 in number), and (2) an HMM based filler KWS that consists of K keyword HMMs (with CD triphone states), a setup which is very common in contact center analytics.

Fig. 1. Schematic diagram of Hybrid Context Dependent DNN-HMM Keyword Spotting (KWS) system.

The DNN output layer is a softmax over the C CD tied-triphone states,

$$P_{DNN}(s = k|x) = \frac{\exp(W^L_k h^{L-1} + b^L_k)}{\sum_{j=1}^{C} \exp(W^L_j h^{L-1} + b^L_j)} \quad (1)$$

where W^L_k is the kth row of the last-layer weight matrix, b^L_k is the kth term in the last-layer bias vector, h^{L-1} is the activation vector of the last hidden layer, and C is the number of CD tied-triphone states (C = 1147 in this paper). The Theano library [12] is used to train our DNN through Stochastic Gradient Descent (SGD) with a progressively decaying learning rate, which starts at 0.001 and is cross-validated with a small dev-set to reduce the learning rate. All parameters in the weight and bias matrices are initialized randomly. We run SGD for a fixed number of 10 epochs, since the cross-validation gain started to converge by the 9th-10th epoch. Our test-set
consists of two hours of speech (WSJ corpus), and the trained DNN is used to estimate the CD-triphone posterior probabilities, which we directly use as scaled likelihoods p(x|s = k), assuming equal priors P(s = k) = 1/C, k ∈ (1, C), for all the CD triphones:

$$p(x|s = k) = \frac{P_{DNN}(s = k|x)\, p(x)}{P(s = k)} \propto P_{DNN}(s = k|x) \quad (2)$$

These scaled likelihoods p(x|s = k) (direct posteriors) are used as the emission likelihood values for the CD triphone states in the parallel HMM network of the U keywords in our experiments. The resulting Hybrid CD-DNN-HMM KWS parallel network is shown in Figure 2.

The kth keyword's HMM model is formed by concatenating the tied-triphone states in its phonetic pronunciation, as in the conventional filler GMM-HMM KWS [6]. However, we replace the log-likelihoods of the tied-triphone GMM-HMM states by their DNN log-posteriors as in (2). State transition probabilities in the CD-DNN-HMM system are the same as those in the CD-GMM-HMM system. Finally, we run the token-passing Viterbi algorithm to detect the possible keywords in the most likely Viterbi path [13] over the parallel network in Figure 2. We represent the score (total log-likelihood) of the partial Viterbi path ending at state i of keyword k at time t by S_{i,k,t}. Similarly, S_{1,Filler,t} is defined as the score (total log-likelihood) of the partial Viterbi path ending at the first state of the filler model at time t. (We model each triphone by the standard 3-state HMM; the filler, however, is modeled by a single-state HMM with a self-loop arc and one incoming and one outgoing arc.) Since the posterior probabilities are normalized (they sum to one), we use a constant posterior probability for the Filler HMM: P(s = Filler|x) ≈ 0.01. This value was empirically chosen such that P(s = Filler|x) > 1/C (where C is the number of CD tied-triphone states) and typically lies in the range of values (0.01, 0.02).

Viterbi decoding on the parallel HMM networks proceeds through the usual dynamic programming technique [13]. At the last speech frame, the state sequence with the maximum score (log-likelihood) S is backtraced to output the keywords present, with their begin and end time-stamps. Suppose the keyword w is present in the most likely backtraced path with begin and end time frames t_b, t_e respectively. We compute its detection confidence κ(w) as follows,

$$\kappa(w) = \frac{\exp\left[(S(L_w, w, t_e) - S(1, w, t_b))/(t_e - t_b + 1)\right]}{\exp\left[(S(1, Filler, t_e) - S(1, Filler, t_b))/(t_e - t_b + 1)\right]} \quad (3)$$

where S(L_w, w, t_e) is the log-likelihood of the most likely path being at the last HMM state L_w of keyword w at time t_e, and S(1, w, t_b) is the log-likelihood of this path at the first state of keyword w at time t_b. Similarly, S(1, Filler, t_e) and S(1, Filler, t_b) are the log-likelihoods of the filler state at times t_e and t_b respectively. The quantity S(L_w, w, t_e) − S(1, w, t_b) is the log-likelihood of the partial Viterbi path that accounts for the keyword w over the time (t_e − t_b + 1), and S(1, Filler, t_e) − S(1, Filler, t_b) is the log-likelihood of the Filler/background over the same time period. Therefore, the keyword confidence score in (3) is effectively the geometric mean of the likelihood ratio between the keyword w and the filler HMM. The higher the value of κ(w), the higher is the confidence of the CD-DNN-HMM KWS in detecting the keyword w. If the confidence value exceeds a user-set threshold ∆, the corresponding keyword w is marked as a hit/detection. As the threshold ∆ is varied over a range, we obtain a receiver operating curve (ROC), which shows the relationship between the correct detection accuracy and the false alarm rate of a KWS system. A contact center analytics supervisor can easily configure the threshold ∆ to operate at a desirable point on the ROC curve, as will be described in Section 4.

We next describe our baseline CD-GMM-HMM KWS system.
Fig. 2. Schematic diagram of Parallel Network of Keyword HMM’s and Filler as used in the CD DNN-HMM Keyword Spotting (KWS) system (each keyword is a chain of CD triphone states, e.g. SIL-P+OY, P-OY+N, OY-N+T, N-T+SIL for the keyword POINT).
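The emission scores that drive this parallel network follow directly from (2): under equal priors, the DNN log-posterior of a triphone state is used directly as its (scaled) log-likelihood, while the single filler state receives the constant posterior P(s = Filler|x) ≈ 0.01 from Section 2. A sketch of this per-frame conversion (pure-Python softmax; names are illustrative):

```python
import math

FILLER_POSTERIOR = 0.01  # constant P(s = Filler | x), chosen > 1/C (Section 2)

def emission_log_scores(logits):
    """Map one frame's last-layer DNN logits to the log emission scores
    used by the parallel keyword/filler HMM network.

    Under equal priors P(s = k) = 1/C, Eq. (2) makes p(x|s = k) proportional
    to P_DNN(s = k|x), so the log-posterior of triphone state k serves as its
    scaled log-likelihood. Returns (triphone_log_scores, filler_log_score).
    """
    m = max(logits)                               # max-shift for a stable softmax
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    log_post = [math.log(e / total) for e in exps]
    return log_post, math.log(FILLER_POSTERIOR)
```

In a full decoder, these scores would be computed for every frame and fed to the token-passing Viterbi search over the network above.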
3. BASELINE CD-GMM-HMM KWS

Our baseline CD-GMM-HMM KWS system also comprises the keyword HMMs, created by concatenating their corresponding triphone states and placed in a parallel network along with a single-state Filler HMM. Let p^tri_j(t) be the log-likelihood of tied-triphone state j ∈ {1, C} emitting the observation (MFCC) vector x(t), and assume we have U keywords in our KWS detection task. The authors in [6] used all the tied-state triphone states in a loop to represent the filler/garbage model. In this paper, however, we use a novel technique to estimate the filler likelihood. A good filler HMM state should have an emission distribution that assigns high likelihood to all the non-keyword speech frames. In order to fulfill this requirement, we have used a novel parameterized percentile likelihood over all the tied-state triphone states to represent the filler state, instead of a loop over all the tied-triphone states as in [6]. Let p^trp_max(t), p^trp_min(t) be the maximum and minimum likelihood respectively over all the tied-triphone states (C states) at time t. To keep the filler model likelihood competitive with the keyword HMM network likelihoods, we use a percentile value over the minimum and maximum likelihoods at a given time t. Further, assuming a uniform distribution over the likelihood values, we obtain an α-percentile value of the filler likelihood by setting

$$p_{filler}(t) = \alpha\, p^{trp}_{max}(t) + (1 - \alpha)\, p^{trp}_{min}(t), \quad \alpha \in (0, 1) \quad (4)$$

We note that this formula leads to a dynamic estimate of the filler likelihood which changes for each time frame, while also following a user-set percentile value over all the triphone likelihoods (the uniform-distribution assumption turns out to be a good one in practice). In our experiments, we have set α = 0.92 and found it to perform well in the range (0.90, 0.95) on our dev-set, without much change in the KWS accuracy. This approach leads to high detection accuracy and low false alarms, and forms a very competitive baseline KWS system (please refer to Section 4 for details). We have trained the CD-GMM-HMM system with up to 20 diagonal-covariance Gaussian mixtures/components per HMM state. Since the input feature x is 39-dimensional, we have 39 × 2 = 78 mean and variance parameters per Gaussian, leading to up to 78 × 20 × 1147 ≈ 1.8M parameters in our CD-GMM-HMM acoustic model.

The confidence score of a detected word w is also computed through Eq. (3), where the log-likelihood values S(L_w, w, t_e), S(1, w, t_b) are obtained from the CD-GMM-HMM KWS Viterbi decoding system.

4. EXPERIMENTS

Datasets: We have used the speaker-independent si_tr_s dataset of the WSJ corpus, which consists of 66 hours of speech and 30278 utterances. For meaningful KWS experiments, it is preferable to have a high number of occurrences (tokens) of the keywords in the test-set, to reliably estimate the detection accuracy of those keywords. However, the official WSJ test-sets H1_P0 and H1_C1 consist of only 200 utterances and have very few re-occurrences of the words; thus, they are not suitable for our KWS experiments. Therefore, we partitioned the official 30278-utterance si_tr_s set into three disjoint sets as follows. The first 1000 utterances in the directory 13-11.1 were retained as the independent test-set. From the remaining 29278 utterances, the last 1000 utterances (directory 13-6.1) were retained as the independent development set, leading to an independent train set with 28278 utterances. The 1000-utterance dev-set is used for cross-validation in the DNN training, as well as for tuning the LM factor, the insertion penalty and the factor α in our phoneme recognition and KWS experiments.

Phoneme Recognition: To ascertain the relative discrimination strength of the CD-GMM-HMM and Hybrid CD-DNN-HMM acoustic models, we have first performed phoneme recognition experiments on a 2hr independent test-set drawn from the si_tr_s set of the WSJ corpus. This test-set was chosen to ensure good frequency of the 45−50 keywords, which finally came out to be in the range 10−40 occurrences per keyword over the 2hr test-set. We have used the optimal
Viterbi decoding with both acoustic models, where, in the case of the CD-DNN acoustic model, the scaled likelihoods in (2) are used for the triphone state likelihoods. A bigram phoneme language model is learned on the train-set phoneme transcripts. The LM factor and phone insertion penalty are empirically tuned on a small dev-set. The phone recognition accuracies are reported in Table 1. We progressively increased the number of Gaussian mixture components in the CD-GMM-HMM acoustic model, whose accuracy started to saturate around 20 Gaussian components; further increasing the GMM components did not improve the results. We have therefore chosen the 20-component CD-GMM-HMM, with a phoneme recognition accuracy of 72.0%, as our baseline system. We have also trained three DNNs, with adjoining 17, 25 and 35 MFCC frames as the input. These three CD-DNN-HMM systems have phoneme recognition accuracies of 79.0%, 80.5% and 81.3%, leading to relative improvements of 25.0%, 30.4% and 33.2% compared to the CD-GMM-HMM system with 20 Gaussian components. We also note that while the CD-GMM-HMM system has 1.8M parameters, the CD-DNN-HMM systems have about 39 × 35 × 3000 + 3000 × 3000 + 3000 × 3000 + 3000 × 3000 + 3000 × 1147 ≈ 34.5M parameters. As can be seen in Table 1, while the CD-GMM-HMM system saturated at 20 components, the CD-DNN-HMM system improved as the adjoining-MFCC-frame input is increased from 17 to 25 to 35 frames. Furthermore, as is well known, most of the weights and activation values (about 80.0%) in a DNN are zero, leading to a sparse, high-dimensional, non-linear representation which provides better discrimination between the classes [14, 10]. We also conjecture that, due to the co-articulation effect, CD triphones span longer speech windows. Therefore, when we use CD tied-triphone states as the DNN output classes, performance improves as the input MFCC context is increased from 17 frames to 35 frames (effectively covering a speech spectrogram over a 170ms to 350ms duration, since each frame corresponds to 10ms of the speech signal's spectrum).

Table 1. Phoneme recognition accuracies with the CD-GMM-HMM acoustic model (AM) with increasing Gaussian mixture components, and the CD-DNN-HMM AM with increasing MFCC frame context as input.

Acoustic Model                      Phoneme Accuracy %
CD-GMM-HMM, GMM 14 comp.            71.1
CD-GMM-HMM, GMM 17 comp.            71.8
CD-GMM-HMM, GMM 20 comp.            72.0
CD-DNN-HMM, MFCC 17 frame context   79.0 (+25.0%)
CD-DNN-HMM, MFCC 25 frame context   80.5 (+30.4%)
CD-DNN-HMM, MFCC 35 frame context   81.3 (+33.2%)

Table 2. Keywords (45): POINT, HUNDRED, COMPANY, DOLLARS, NINETEEN, STOCK, ABOUT, MILLION, PERCENT, MARKET, SHARES, TWENTY, AFTER, PRICES, THOUSAND, BANK, HIGHER, PEOPLE, BILLION, DOLLAR, SPOKESMAN, TRADING, GOVERNMENT, PRESIDENT, CLOSED, MONEY, YESTERDAY, AMERICAN, CAPITAL, SEVERAL, ANALYSTS, INVESTORS, LITTLE, ANOTHER, BETWEEN, CHAIRMAN, FEDERAL, AGAINST, ASSETS, BECAUSE, CORPORATION, MONTH, EXCHANGE, EXECUTIVES, FINANCIAL.
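The relative improvements quoted in Table 1 appear to be relative reductions in the phoneme error rate (100 − accuracy) over the 72.0% baseline, rather than relative accuracy gains; a quick arithmetic check:

```python
def rel_error_reduction(acc_percent, baseline_acc=72.0):
    """Relative reduction (%) in phoneme error rate (100 - accuracy)
    versus the 20-component CD-GMM-HMM baseline at 72.0%."""
    base_err = 100.0 - baseline_acc
    err = 100.0 - acc_percent
    return 100.0 * (base_err - err) / base_err

# The three CD-DNN-HMM accuracies from Table 1:
gains = [round(rel_error_reduction(a), 1) for a in (79.0, 80.5, 81.3)]
# gains == [25.0, 30.4, 33.2], matching the bracketed entries in Table 1.
```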
KWS Results: As discussed earlier, we have chosen 45 keywords which have a relatively high frequency of occurrence (in the range 10−40 for each keyword) in our 2Hr test-set. These keywords are shown in Table 2. If the mid-point of a detected keyword w, i.e. (t_b + t_e)/2, lies between the ground-truth time stamps of that keyword w, it is counted as a correct detection; otherwise, it is counted as a false alarm. By varying the user-set threshold ∆, we trace out the KWS receiver operating curve (ROC).

Fig. 3. Receiver Operating Curve (ROC) of proposed CD DNN-HMM Keyword Spotting (KWS) system and baseline CD GMM-HMM KWS system: detection accuracy (%) versus false alarms per hour (FA/Hr), for CD-DNN-HMM with 35- and 25-frame MFCC inputs and CD-GMM-HMM with 20 and 17 mixtures.

Figure 3 shows the ROCs corresponding to the two CD-DNN-HMM KWS systems (blue curves, with 35 and 25 frames of MFCC input respectively) and the two CD-GMM-HMM KWS systems (red curves, with 20 and 17 GMM components/mixtures per CD-triphone state respectively). For the sake of clarity, we have not plotted the ROCs of the CD-DNN-HMM KWS with 17 MFCC frames and of the 14-mixture CD-GMM-HMM KWS, which were similar to those of their corresponding systems. As can be noticed from these plots, the DNN KWS system significantly outperforms a very competitive baseline CD-GMM-HMM KWS. In fact, the Figures of Merit (FOMs, defined as the average correct detection accuracy at 5.0 false alarms per hour (FA/Hr)) of the CD-DNN-HMM KWS are 88.5 and 87.3 (for 35 and 25 frames of MFCC input respectively), whereas the FOMs of the CD-GMM-HMM KWS are 79.0 and 78.5 (for 20 and 17 GMM components respectively), leading to a 43.0% relative improvement by the proposed Hybrid CD-DNN-HMM KWS system. In [6] the authors have reported FOMs in the range 58.0−64.0% for
the various LVCSR-KWS, Phonetic-KWS and single hidden layer neural network systems on 17 keywords. While the datasets are different, both our baseline and the proposed CD-DNN-HMM KWS compare favorably to these results.

We have performed the DNN training, as well as the DNN posterior computation for the 2Hr test-set, on an Nvidia Tesla K40c GPU, which tremendously accelerated the DNN feed-forward and SGD computations. For the 2Hr test-set's posterior probability computation, the Nvidia Tesla K40c GPU took 168s of GPU compute time. Once the posteriors (scaled likelihoods) in (2) are computed, we run the Viterbi search as described in Section 2, along with the confidence κ(w) computation in (3), which took 238s of compute time on a single CPU core. Therefore, our proposed Hybrid CD-DNN-KWS system takes 168 + 238 = 406s to perform KWS on 2Hr = 7200s of speech, leading to a sub-real-time computation factor of 0.056. It is worth noting that GPUs are generally highly efficient at performing dense matrix-matrix and matrix-vector multiplications, since these can run on 1000s of GPU cores in parallel. In training a DNN through Stochastic Gradient Descent (SGD), the feed-forward operations and the back-propagation benefit immensely from this parallelization. In contrast, the traditional tied-state triphone GMM-HMM LVCSR lattice-based KWS systems, which incur their biggest computational cost on the log and exp operations in computing the likelihoods of tens of thousands of Gaussian densities, and on the large-vocabulary graph search, do not enjoy a similar benefit from offloading these computations to the GPU, since they do not constitute matrix-matrix or matrix-vector multiplications.

5. CONCLUSION

We have presented a novel hybrid CD-DNN-KWS system which provides high detection accuracy and a low false alarm rate (FOM of 88.0%) as compared to a very competitive CD-GMM-HMM KWS system (FOM of 79.0%), providing about 43.0% relative FOM improvement. Our results compare favorably to the LVCSR-KWS and Phonetic-KWS FOMs (results in the range of 58.0−64.0% FOM) as reported in [6]. This improvement is attributed to the better discrimination ability of a 4-layer DNN, which is able to discriminate much better between the CD tied-triphone states than a state-of-the-art generative CD-GMM-HMM acoustic model, leading to a 33.2% relative improvement in phoneme recognition accuracy. Further, the proposed hybrid CD-DNN-KWS system can run at about a 0.056 real-time factor (as measured on a computer augmented with an Nvidia Tesla K40c GPU) with a workload of about 45 keywords, offering great performance potential in contact center speech analytics.

6. ACKNOWLEDGMENTS

The author would like to thank Florent Perronnin of Facebook AI Research, France and Adrien Gaidon of Xerox Research Center Europe for several insightful discussions. This research was primarily done while the author was at Xerox Research Center India.

7. REFERENCES

[2] G. Hinton, L. Deng, D. Yu, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82–97, 2012.
[3] Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 437–478.
[4] G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in Proc. of Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4087–4091.
[5] G. Chen, C. Parada, and T. N. Sainath, “Query-by-example keyword spotting using long short-term memory networks,” in Proc. of Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5236–5240.
[6] I. Szöke, P. Schwarz, P. Matejka, L. Burget, M. Karafiát, M. Fapso, and J. Cernockỳ, “Comparison of keyword spotting approaches for informal continuous speech,” in Proc. of Interspeech, 2005, pp. 633–636.
[7] K. Knill and S. J. Young, “Speaker dependent keyword spotting for accessing stored speech,” 1994.
[8] S. F. Chen, D. Beeferman, and R. Rosenfeld, “Evaluation metrics for language models,” 1998.
[9] M. Clements, P. Cardillo, and M. Miller, “Phonetic searching of digital audio,” in Proceedings, 2001 Conference of the National Association of Broadcasters, 2001.
[10] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proc. of International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
[11] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proc. of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[12] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio, “Theano: new features and speed improvements,” arXiv preprint arXiv:1211.5590, 2012.
[13] H. J. Ney and S. Ortmanns, “Dynamic programming search for continuous speech recognition,” Signal Processing Magazine, IEEE, vol. 16, no. 5, pp. 64–83, 1999.
[14] M. D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean et al., “On rectified linear units for speech processing,” in Proc. of Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 3517–3521.