Vivek Tyagi
2. HYBRID CD-DNN-HMM KWS

Our proposed Hybrid CD-DNN-HMM KWS system is illustrated in Figure 1 and consists of two major modules: (1) a four-hidden-layer DNN trained with context dependent (CD) tied-state triphones as the output classes (1147 in number), and (2) an HMM based filler KWS that consists of K keyword HMMs (with CD triphone states), a setup which is very common in contact center analytics.

Fig. 1. Schematic diagram of Hybrid Context Dependent DNN-HMM Keyword Spotting (KWS) system.

The DNN output layer is a softmax over the C CD tied-triphone states,

$$P_{DNN}(s = k|x) = \frac{\exp(W^L_k h^{L-1} + b^L_k)}{\sum_{j=1}^{C} \exp(W^L_j h^{L-1} + b^L_j)} \quad (1)$$

where W^L_k is the kth row of the last-layer weight matrix, b^L_k is the kth term in the last-layer bias vector, h^{L-1} is the activation vector of the last hidden layer, and C is the number of CD tied-triphone states (C = 1147 in this paper). The Theano library [12] is used to train our DNN through Stochastic Gradient Descent (SGD) with a progressively decaying learning rate, which starts at 0.001 and is cross-validated with a small dev-set to reduce the learning rate. All parameters in the weight and bias matrices are initialized randomly. We run SGD for a fixed number of 10 epochs, since the cross-validation gain started to converge by the 9th-10th epoch. Our test-set
consists of two hours of speech (WSJ corpus), and the trained DNN is used to estimate the CD-triphone posterior probabilities, which we directly use as scaled likelihoods p(x|s = k), assuming equal priors P(s = k) = 1/C, k ∈ (1, C), for all the CD triphones:

$$p(x|s = k) = \frac{P_{DNN}(s = k|x)\, p(x)}{P(s = k)} \propto P_{DNN}(s = k|x) \quad (2)$$

These scaled likelihoods p(x|s = k) (direct posteriors) are used as the emission likelihood values for the CD triphone states in the parallel HMM network of the U keywords in our experiments. The resulting Hybrid CD-DNN-HMM KWS parallel network is shown in Figure 2.

The kth keyword's HMM model is formed by concatenating the tied-triphone states in its phonetic pronunciation, as in the conventional filler GMM-HMM KWS [6]. However, we replace the log-likelihoods of the tied-triphone GMM-HMM states by their DNN log-posteriors as in (2). State transition probabilities in the CD-DNN-HMM system are the same as those in the CD-GMM-HMM system. Finally, we run the token-passing Viterbi algorithm to detect the possible keywords in the most likely Viterbi path [13] over the parallel network in Figure 2. We represent the score (total log-likelihood) of the partial Viterbi path ending at state i of keyword k at time t by S_{i,k,t}. Similarly, S_{1,Filler,t} is defined as the score (total log-likelihood) of the partial Viterbi path ending at the first state of the filler model at time t. (We model each triphone by the standard 3-state HMM; the filler, however, is modeled by a single-state HMM with a self-loop arc and one incoming and one outgoing arc.) Since the posterior probabilities are normalized (they sum to one), we use a constant posterior probability for the Filler HMM: P(s = Filler|x) ≈ 0.01. This value was empirically chosen such that P(s = Filler|x) > 1/C (where C is the number of CD tied-triphone states) and typically lies in the range of values (0.01, 0.02).

Viterbi decoding on the parallel HMM networks proceeds through the usual dynamic programming technique [13]. At the last speech frame, the state sequence with the maximum score (log-likelihood) S is backtraced to output the keywords present, with their begin and end time-stamps. Suppose the keyword w is present in the most likely backtraced path with begin and end time frames t_b, t_e respectively. We compute its detection confidence κ(w) as follows,

$$\kappa(w) = \frac{\exp\left[(S(L_w, w, t_e) - S(1, w, t_b))/(t_e - t_b + 1)\right]}{\exp\left[(S(1, Filler, t_e) - S(1, Filler, t_b))/(t_e - t_b + 1)\right]} \quad (3)$$

where S(L_w, w, t_e) is the log-likelihood of the most likely path being at the last HMM state L_w of keyword w at time t_e, and S(1, w, t_b) is the log-likelihood of this path at the first state of keyword w at time t_b. Similarly, S(1, Filler, t_e) and S(1, Filler, t_b) are the log-likelihoods of the filler state at times t_e and t_b respectively. The quantity S(L_w, w, t_e) − S(1, w, t_b) is the log-likelihood of the partial Viterbi path that accounts for the keyword w over the time (t_e − t_b + 1), and S(1, Filler, t_e) − S(1, Filler, t_b) is the log-likelihood of the Filler/background over the same time period. Therefore, the keyword confidence score in (3) is effectively the geometric mean of the likelihood ratio between the keyword w and the filler HMM. The higher the value of κ(w), the higher is the confidence of the CD-DNN-HMM KWS in detecting the keyword w. If the confidence value exceeds a user-set threshold ∆, the corresponding keyword w is marked as a hit/detection. As the threshold ∆ is varied over a range, we obtain a receiver operating curve (ROC), which shows the relationship between the correct detection accuracy and the false alarm rate of a KWS system. A contact center analytics supervisor can easily configure the threshold ∆ to operate at a desirable point on the ROC curve, as will be described in Section 4.

We next describe our baseline CD-GMM-HMM KWS system.
Fig. 2. Schematic diagram of Parallel Network of Keyword HMM’s and Filler as used in the CD DNN-HMM Keyword Spotting (KWS) system (each keyword is a chain of CD triphone states, e.g. SIL-P+OY, P-OY+N, OY-N+T, N-T+SIL for the keyword POINT).
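The emission scores that drive this parallel network follow directly from (2): under equal priors, the DNN log-posterior of a triphone state is used directly as its (scaled) log-likelihood, while the single filler state receives the constant posterior P(s = Filler|x) ≈ 0.01 from Section 2. A sketch of this per-frame conversion (pure-Python softmax; names are illustrative):

```python
import math

FILLER_POSTERIOR = 0.01  # constant P(s = Filler | x), chosen > 1/C (Section 2)

def emission_log_scores(logits):
    """Map one frame's last-layer DNN logits to the log emission scores
    used by the parallel keyword/filler HMM network.

    Under equal priors P(s = k) = 1/C, Eq. (2) makes p(x|s = k) proportional
    to P_DNN(s = k|x), so the log-posterior of triphone state k serves as its
    scaled log-likelihood. Returns (triphone_log_scores, filler_log_score).
    """
    m = max(logits)                               # max-shift for a stable softmax
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    log_post = [math.log(e / total) for e in exps]
    return log_post, math.log(FILLER_POSTERIOR)
```

In a full decoder, these scores would be computed for every frame and fed to the token-passing Viterbi search over the network above.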
3. BASELINE CD-GMM-HMM KWS

Our baseline CD-GMM-HMM KWS system also comprises the keyword HMMs, created by concatenating their corresponding triphone states and placed in a parallel network along with a single-state Filler HMM. Let p^tri_j(t) be the log-likelihood of tied-triphone state j ∈ {1, C} emitting the observation (MFCC) vector x(t), and assume we have U keywords in our KWS detection task. The authors in [6] used all the tied-state triphone states in a loop to represent the filler/garbage model. In this paper, however, we use a novel technique to estimate the filler likelihood. A good filler HMM state should have an emission distribution that assigns high likelihood to all the non-keyword speech frames. In order to fulfill this requirement, we have used a novel parameterized percentile likelihood over all the tied-state triphone states to represent the filler state, instead of a loop over all the tied-triphone states as in [6]. Let p^trp_max(t), p^trp_min(t) be the maximum and minimum likelihood respectively over all the tied-triphone states (C states) at time t. To keep the filler model likelihood competitive with the keyword HMM network likelihoods, we use a percentile value over the minimum and maximum likelihoods at a given time t. Further, assuming a uniform distribution over the likelihood values, we obtain an α-percentile value of the filler likelihood by setting

$$p_{filler}(t) = \alpha\, p^{trp}_{max}(t) + (1 - \alpha)\, p^{trp}_{min}(t), \quad \alpha \in (0, 1) \quad (4)$$

We note that this formula leads to a dynamic estimate of the filler likelihood which changes for each time frame, while also following a user-set percentile value over all the triphone likelihoods (the uniform-distribution assumption turns out to be a good one in practice). In our experiments, we have set α = 0.92 and found it to perform well in the range (0.90, 0.95) on our dev-set, without much change in the KWS accuracy. This approach leads to high detection accuracy and low false alarms, and forms a very competitive baseline KWS system (please refer to Section 4 for details). We have trained the CD-GMM-HMM system with up to 20 diagonal-covariance Gaussian mixtures/components per HMM state. Since the input feature x is 39-dimensional, we have 39 × 2 = 78 mean and variance parameters per Gaussian, leading to up to 78 × 20 × 1147 ≈ 1.8M parameters in our CD-GMM-HMM acoustic model.

The confidence score of a detected word w is also computed through Eq. (3), where the log-likelihood values S(L_w, w, t_e), S(1, w, t_b) are obtained from the CD-GMM-HMM KWS Viterbi decoding system.

4. EXPERIMENTS

Datasets: We have used the speaker-independent si_tr_s dataset of the WSJ corpus, which consists of 66 hours of speech and 30278 utterances. For meaningful KWS experiments, it is preferable to have a high number of occurrences (tokens) of the keywords in the test-set, to reliably estimate the detection accuracy of those keywords. However, the official WSJ test-sets H1_P0 and H1_C1 consist of only 200 utterances and have very few re-occurrences of the words; thus, they are not suitable for our KWS experiments. Therefore, we partitioned the official 30278-utterance si_tr_s set into three disjoint sets as follows. The first 1000 utterances in the directory 13-11.1 were retained as the independent test-set. From the remaining 29278 utterances, the last 1000 utterances (directory 13-6.1) were retained as the independent development set, leading to an independent train set with 28278 utterances. The 1000-utterance dev-set is used for cross-validation in the DNN training, as well as for tuning the LM factor, the insertion penalty and the factor α in our phoneme recognition and KWS experiments.

Phoneme Recognition: To ascertain the relative discrimination strength of the CD-GMM-HMM and Hybrid CD-DNN-HMM acoustic models, we have first performed phoneme recognition experiments on a 2hr independent test-set drawn from the si_tr_s set of the WSJ corpus. This test-set was chosen to ensure good frequency of the 45−50 keywords, which finally came out to be in the range 10−40 occurrences per keyword over the 2hr test-set. We have used the optimal
Viterbi decoding with both acoustic models, where, in the case of the CD-DNN acoustic model, the scaled likelihoods in (2) are used for the triphone state likelihoods. A bigram phoneme language model is learned on the train-set phoneme transcripts. The LM factor and phone insertion penalty are empirically tuned on a small dev-set. The phone recognition accuracies are reported in Table 1. We progressively increased the number of Gaussian mixture components in the CD-GMM-HMM acoustic model, whose accuracy started to saturate around 20 Gaussian components; further increasing the GMM components did not improve the results. We have therefore chosen the 20-component CD-GMM-HMM, with a phoneme recognition accuracy of 72.0%, as our baseline system. We have also trained three DNNs, with adjoining 17, 25 and 35 MFCC frames as the input. These three CD-DNN-HMM systems have phoneme recognition accuracies of 79.0%, 80.5% and 81.3%, leading to relative improvements of 25.0%, 30.4% and 33.2% compared to the CD-GMM-HMM system with 20 Gaussian components. We also note that while the CD-GMM-HMM system has 1.8M parameters, the CD-DNN-HMM systems have about 39 × 35 × 3000 + 3000 × 3000 + 3000 × 3000 + 3000 × 3000 + 3000 × 1147 ≈ 34.5M parameters. As can be seen in Table 1, while the CD-GMM-HMM system saturated at 20 components, the CD-DNN-HMM system improved as the adjoining-MFCC-frame input is increased from 17 to 25 to 35 frames. Furthermore, as is well known, most of the weights and activation values (about 80.0%) in a DNN are zero, leading to a sparse, high-dimensional, non-linear representation which provides better discrimination between the classes [14, 10]. We also conjecture that, due to the co-articulation effect, CD triphones span longer speech windows. Therefore, when we use CD tied-triphone states as the DNN output classes, performance improves as the input MFCC context is increased from 17 frames to 35 frames (effectively covering a speech spectrogram over a 170ms to 350ms duration, since each frame corresponds to 10ms of the speech signal's spectrum).

Table 1. Phoneme recognition accuracies with the CD-GMM-HMM acoustic model (AM) with increasing Gaussian mixture components, and the CD-DNN-HMM AM with increasing MFCC frame context as input.

Acoustic Model                      Phoneme Accuracy %
CD-GMM-HMM, GMM 14 comp.            71.1
CD-GMM-HMM, GMM 17 comp.            71.8
CD-GMM-HMM, GMM 20 comp.            72.0
CD-DNN-HMM, MFCC 17 frame context   79.0 (+25.0%)
CD-DNN-HMM, MFCC 25 frame context   80.5 (+30.4%)
CD-DNN-HMM, MFCC 35 frame context   81.3 (+33.2%)

Table 2. Keywords (45): POINT, HUNDRED, COMPANY, DOLLARS, NINETEEN, STOCK, ABOUT, MILLION, PERCENT, MARKET, SHARES, TWENTY, AFTER, PRICES, THOUSAND, BANK, HIGHER, PEOPLE, BILLION, DOLLAR, SPOKESMAN, TRADING, GOVERNMENT, PRESIDENT, CLOSED, MONEY, YESTERDAY, AMERICAN, CAPITAL, SEVERAL, ANALYSTS, INVESTORS, LITTLE, ANOTHER, BETWEEN, CHAIRMAN, FEDERAL, AGAINST, ASSETS, BECAUSE, CORPORATION, MONTH, EXCHANGE, EXECUTIVES, FINANCIAL.
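The relative improvements quoted in Table 1 appear to be relative reductions in the phoneme error rate (100 − accuracy) over the 72.0% baseline, rather than relative accuracy gains; a quick arithmetic check:

```python
def rel_error_reduction(acc_percent, baseline_acc=72.0):
    """Relative reduction (%) in phoneme error rate (100 - accuracy)
    versus the 20-component CD-GMM-HMM baseline at 72.0%."""
    base_err = 100.0 - baseline_acc
    err = 100.0 - acc_percent
    return 100.0 * (base_err - err) / base_err

# The three CD-DNN-HMM accuracies from Table 1:
gains = [round(rel_error_reduction(a), 1) for a in (79.0, 80.5, 81.3)]
# gains == [25.0, 30.4, 33.2], matching the bracketed entries in Table 1.
```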
KWS Results: As discussed earlier, we have chosen 45 keywords which have a relatively high frequency of occurrence (in the range 10−40 for each keyword) in our 2Hr test-set. These keywords are shown in Table 2. If the mid-point of a detected keyword w, i.e. (t_b + t_e)/2, lies between the ground-truth time stamps of that keyword w, it is counted as a correct detection; otherwise, it is counted as a false alarm. By varying the user-set threshold ∆, we trace out the KWS receiver operating curve (ROC).

Fig. 3. Receiver Operating Curve (ROC) of proposed CD DNN-HMM Keyword Spotting (KWS) system and baseline CD GMM-HMM KWS system: detection accuracy (%) versus false alarms per hour (FA/Hr), for CD-DNN-HMM with 35- and 25-frame MFCC inputs and CD-GMM-HMM with 20 and 17 mixtures.

Figure 3 shows the ROCs corresponding to the two CD-DNN-HMM KWS systems (blue curves, with 35 and 25 frames of MFCC input respectively) and the two CD-GMM-HMM KWS systems (red curves, with 20 and 17 GMM components/mixtures per CD-triphone state respectively). For the sake of clarity, we have not plotted the ROCs of the CD-DNN-HMM KWS with 17 MFCC frames and of the 14-mixture CD-GMM-HMM KWS, which were similar to those of their corresponding systems. As can be noticed from these plots, the DNN KWS system significantly outperforms a very competitive baseline CD-GMM-HMM KWS. In fact, the Figures of Merit (FOMs, defined as the average correct detection accuracy at 5.0 false alarms per hour (FA/Hr)) of the CD-DNN-HMM KWS are 88.5 and 87.3 (for 35 and 25 frames of MFCC input respectively), whereas the FOMs of the CD-GMM-HMM KWS are 79.0 and 78.5 (for 20 and 17 GMM components respectively), leading to a 43.0% relative improvement by the proposed Hybrid CD-DNN-HMM KWS system. In [6] the authors have reported FOMs in the range 58.0−64.0% for
the various LVCSR-KWS, Phonetic-KWS and single hidden layer neural network systems on 17 keywords. While the datasets are different, both our baseline and the proposed CD-DNN-HMM KWS compare favorably to these results.

We have performed the DNN training, as well as the DNN posterior computation for the 2Hr test-set, on an Nvidia Tesla K40c GPU, which tremendously accelerated the DNN feed-forward and SGD computations. For the 2Hr test-set's posterior probability computation, the Nvidia Tesla K40c GPU took 168s of GPU compute time. Once the posteriors (scaled likelihoods) in (2) are computed, we run the Viterbi search as described in Section 2, along with the confidence κ(w) computation in (3), which took 238s of compute time on a single CPU core. Therefore, our proposed Hybrid CD-DNN-KWS system takes 168 + 238 = 406s to perform KWS on 2Hr = 7200s of speech, leading to a sub-real-time computation factor of 0.056. It is worth noting that GPUs are generally highly efficient at performing dense matrix-matrix and matrix-vector multiplications, since these can run on 1000s of GPU cores in parallel. In training a DNN through Stochastic Gradient Descent (SGD), the feed-forward operations and the back-propagation benefit immensely from this parallelization. In contrast, the traditional tied-state triphone GMM-HMM LVCSR lattice-based KWS systems, which incur their biggest computational cost on the log and exp operations in computing the likelihoods of tens of thousands of Gaussian densities, and on the large-vocabulary graph search, do not enjoy a similar benefit from offloading these computations to the GPU, since they do not constitute matrix-matrix or matrix-vector multiplications.

5. CONCLUSION

We have presented a novel hybrid CD-DNN-KWS system which provides high detection accuracy and a low false alarm rate (FOM of 88.0%) as compared to a very competitive CD-GMM-HMM KWS system (FOM of 79.0%), providing about 43.0% relative FOM improvement. Our results compare favorably to the LVCSR-KWS and Phonetic-KWS FOMs (results in the range of 58.0−64.0% FOM) as reported in [6]. This improvement is attributed to the better discrimination ability of a 4-layer DNN, which is able to discriminate much better between the CD tied-triphone states than a state-of-the-art generative CD-GMM-HMM acoustic model, leading to a 33.2% relative improvement in phoneme recognition accuracy. Further, the proposed hybrid CD-DNN-KWS system can run at about a 0.056 real-time factor (as measured on a computer augmented with an Nvidia Tesla K40c GPU) with a workload of about 45 keywords, offering great performance potential in contact center speech analytics.

6. ACKNOWLEDGMENTS

The author would like to thank Florent Perronnin of Facebook AI Research, France and Adrien Gaidon of Xerox Research Center Europe for several insightful discussions. This research was primarily done while the author was at Xerox Research Center India.

7. REFERENCES

[2] G. Hinton, L. Deng, D. Yu, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82–97, 2012.
[3] Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 437–478.
[4] G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in Proc. of Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4087–4091.
[5] G. Chen, C. Parada, and T. N. Sainath, “Query-by-example keyword spotting using long short-term memory networks,” in Proc. of Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5236–5240.
[6] I. Szöke, P. Schwarz, P. Matejka, L. Burget, M. Karafiát, M. Fapso, and J. Cernockỳ, “Comparison of keyword spotting approaches for informal continuous speech,” in Proc. of Interspeech, 2005, pp. 633–636.
[7] K. Knill and S. J. Young, “Speaker dependent keyword spotting for accessing stored speech,” 1994.
[8] S. F. Chen, D. Beeferman, and R. Rosenfeld, “Evaluation metrics for language models,” 1998.
[9] M. Clements, P. Cardillo, and M. Miller, “Phonetic searching of digital audio,” in Proceedings, 2001 Conference of the National Association of Broadcasters, 2001.
[10] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proc. of International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
[11] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proc. of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[12] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio, “Theano: new features and speed improvements,” arXiv preprint arXiv:1211.5590, 2012.
[13] H. J. Ney and S. Ortmanns, “Dynamic programming search for continuous speech recognition,” Signal Processing Magazine, IEEE, vol. 16, no. 5, pp. 64–83, 1999.
[14] M. D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean et al., “On rectified linear units for speech processing,” in Proc. of Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 3517–3521.