
A DEEP NEURAL NETWORK WITH TOP-DOWN FEEDBACK

Winston Mann1, Tasha Nagamine1, Michael L. Seltzer2, Nima Mesgarani1

1 Department of Electrical Engineering, Columbia University, New York, USA
2 Microsoft Research, Redmond, USA
wem2115@columbia.edu, tasha.nagamine@columbia.edu, mseltzer@microsoft.com, nima@ee.columbia.edu

ABSTRACT

Biological neural networks possess a unique ability to dynamically reorganize themselves in order to perform new computations, such as attending to one speaker in a crowd. Artificial neural network models, however, are fixed after training; generalization to the changing demands and conditions of real-world speech recognition tasks is only achievable by training on all likely conditions. Here we propose an alternative framework in which the network uses prior knowledge of the expected output to change its synaptic weights online in order to deal with unexpected inputs. The prior model proposed here is the expected pattern of node co-activations, which can be computed for any hidden or output layer. Based on this model, the network can produce an unsupervised error signal during the test phase, which can readily be back-propagated to update the network parameters online after the supervised training period is over. Finally, we show that dynamic adaptation in a deep neural network model trained for phoneme classification can significantly improve accuracy across various noise conditions over baseline recognition performance.

Index Terms— Neural Networks, Speech, Deep Learning
1. INTRODUCTION

A biological neural network has the ability to reorganize itself in real time to implement new, task-related computations [1], [2]–[4]. In several auditory experiments, we and others have shown that this top-down, knowledge-driven, global plasticity can facilitate the extraction of acoustic parameters that are relevant to the task [5]–[7], [3]. For example, we have shown that the same neural population can rapidly change to selectively represent the spectral and temporal features of a speaker of interest when switching attention in a multi-talker speech perception task [1], [8], [9]. Current neural network models lack such high-level feedback signals; generalization to changing situations is achieved by attempting to train on every possible signal condition, which can pose a problem in low-resource tasks or when the network is faced with unpredictable conditions. Here, we propose a different approach based on the notion that while a system may face unpredictable inputs, the expected output of the model is always known. Consequently, the network can use prior knowledge of its expected output to update the transformation of an input signal by modifying the network weights. A successful implementation of this idea requires solving three key problems. First, we must determine an unsupervised method of modeling the performance of the neural network at each point in time without having the target outputs. Second, it is necessary to generate an error signal that accurately measures the deviation of the network from its desired behavior. Lastly, we need a method that updates the network weights in order to reduce the overall error. In this study, we propose one possible solution to each of these three problems, demonstrate the feasibility of this technique by incorporating it into a deep neural network model trained for phoneme recognition, and show its superior performance over the same network without feedback.

2. METHODS

2.1. An unsupervised statistical model of the expected network operation

To create a statistical model of the activation of nodes (either in hidden or output layers) under normal working conditions [10], [11], [12], we model the structure of node co-activation patterns, which can be captured using the cross-correlation of responses over time. This particular statistic, also known as temporal coherence, has been proposed as a biologically plausible computation in the auditory cortex [13]. We estimated the co-activation pattern as an autocorrelation matrix, $C_{ij} = \frac{1}{T}\sum_{\tau} Y_{i\tau} Y_{j\tau}$, or in matrix form, $C = \frac{1}{T} Y Y^T$, where $Y$ is the $N \times T$ matrix of node activations over time. This correlation matrix may have a particular interpretation depending on the task the network is designed for. For example, in an acoustic-to-phoneme neural network model, the correlation of the output layer reflects the phoneme confusion pattern of the network, since co-activated nodes (phoneme posteriors) correspond to phonemes that are more likely to be confused.
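As a concrete illustration, the co-activation statistic defined above reduces to a single matrix product. The following is a minimal sketch in NumPy; the function name, array layout, and example dimensions are our own illustrative choices, not taken from the paper:

```python
import numpy as np

def coactivation(Y):
    """Co-activation (autocorrelation) matrix C = (1/T) Y Y^T.

    Y : (N, T) array of node activations, one row per node,
        one column per time frame.
    Returns an (N, N) matrix whose (i, j) entry estimates how
    strongly nodes i and j are active together over time.
    """
    T = Y.shape[1]
    return (Y @ Y.T) / T

# Example: 41 phoneme-posterior nodes observed over 300 frames.
rng = np.random.default_rng(0)
Y = rng.random((41, 300))
C = coactivation(Y)
print(C.shape)  # (41, 41), symmetric
```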
Because of the critical role the autocorrelation matrix plays in generating the feedback signals, it is important to evaluate the dependence of this statistic on the properties of the particular sample of speech used to compute it. In particular, we must ascertain what duration of signal is required to compute a consistent estimate of the actual correlation matrix, one that captures all the statistical interdependencies needed to adequately restore it when the input is distorted. We determined this duration by computing the correlation matrix for different segments of the speech signal and comparing their similarity, repeated for signals of various lengths. Figure 1 shows the similarity of autocorrelation values estimated from non-overlapping segments of speech as a function of T, the duration of the segments. A robust estimate of this measure requires only a short time interval of a few seconds and remains largely unchanged across different segments of speech [12]. Therefore, the correlation is a suitable choice for our purpose and can be computed online during the test phase without needing the actual labels.

Fig. 1. Similarity of the autocorrelation statistics obtained from different segments of speech, as a function of segment duration. Dashed line indicates the standard deviation.
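The duration analysis of Figure 1 can be outlined in code as well: estimate C on consecutive non-overlapping segments of a given length and measure how similar the estimates are. The paper reports an r-value, which we read here as the Pearson correlation between the entries of the two matrices; that reading, and all names below, are our assumptions.

```python
import numpy as np

def segment_similarity(Y, seg_len):
    """Mean Pearson r between co-activation matrices estimated
    from consecutive non-overlapping segments of length seg_len."""
    N, T = Y.shape
    segs = [Y[:, s:s + seg_len] for s in range(0, T - seg_len + 1, seg_len)]
    Cs = [(S @ S.T) / seg_len for S in segs]
    rs = [np.corrcoef(a.ravel(), b.ravel())[0, 1]
          for a, b in zip(Cs[:-1], Cs[1:])]
    return float(np.mean(rs))

# Similarity should rise toward 1 as the segment duration grows,
# mirroring the trend in Figure 1.
rng = np.random.default_rng(1)
Y = rng.random((41, 3000))
for seg_len in (10, 100, 1000):
    print(seg_len, round(segment_similarity(Y, seg_len), 3))
```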
2.2. Creating an unsupervised error signal to measure the network performance

We define the deviation of the network from its desired performance as the distance between the current autocorrelation pattern, $C$, and the target autocorrelation, $C^O$:

$$E = \left\| C - C^O \right\|_F^2 \qquad (1)$$

The error as stated here is the square of the Frobenius norm of the difference between the two autocorrelation matrices. To justify this particular error metric, we first need to show that it accurately reflects the performance of the network. In a network designed to map acoustic signals to phoneme posteriors, this means a monotonic and reliable relationship between the autocorrelation distance, $E$, and phoneme accuracy. The simulation in Figure 2 confirms this notion, showing a consistent relationship between this unsupervised measure and the supervised phoneme classification error in various types of noise at different SNRs.

Fig. 2. Relationship between the autocorrelation distance (clean vs. noisy) and phoneme classification accuracy for various noise types (white, babble, destroyer, F16) and signal-to-noise ratios.
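Equation 1 translates directly into a few lines of code. A minimal sketch, assuming the same (N, T) activation layout as above; no labels are required at test time, only the target statistics measured in clean conditions:

```python
import numpy as np

def coactivation(Y):
    """C = (1/T) Y Y^T for an (N, T) activation matrix."""
    return (Y @ Y.T) / Y.shape[1]

def unsupervised_error(Y, C_target):
    """Squared Frobenius distance E = ||C - C^O||_F^2 (Eq. 1).

    C_target is the co-activation pattern measured in clean
    conditions; the current activations Y need no labels."""
    D = coactivation(Y) - C_target
    return float(np.sum(D * D))
```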
2.3. Adapting the network weights to reduce the error

An important property of the proposed error metric (the Frobenius norm of the difference between the correlations) is that it is differentiable with respect to each of the output nodes and time points. Therefore, we can use the chain rule to compute the error derivative with respect to the network connection weights and use gradient descent to back-propagate this unsupervised error to optimize the network parameters. In order to compute the gradient of the error with respect to any arbitrary weight in the network, $\partial E / \partial w$, we need to find the derivative of $E$ with respect to each output. The partial derivative of the error, with respect to a particular output, can be computed using the chain rule:

$$\frac{\partial E}{\partial Y_{kt}} = 2 \sum_{ij}^{N} \left( \frac{1}{T} \sum_{\tau}^{T} Y_{i\tau} Y_{j\tau} - C^{O}_{ij} \right) \frac{\partial}{\partial Y_{kt}} \left( \frac{1}{T} \sum_{\tau}^{T} Y_{i\tau} Y_{j\tau} \right) \qquad (2)$$

Because $C^{O}_{ij}$ is invariant with respect to $Y_{kt}$, and because the derivative of the last summation is zero for $t \neq \tau$:

$$\frac{\partial E}{\partial Y_{kt}} = 2 \sum_{ij}^{N} \left( \frac{1}{T} \sum_{\tau}^{T} Y_{i\tau} Y_{j\tau} - C^{O}_{ij} \right) \frac{\partial}{\partial Y_{kt}} \left( \frac{1}{T} Y_{it} Y_{jt} \right) \qquad (3)$$

There are now two cases in which the equation above results in a nonzero derivative ($k = i$ and $k = j$); therefore, the result of the partial derivative can be expressed in terms of the delta function as

$$\frac{\partial E}{\partial Y_{kt}} = \frac{2}{T} \sum_{ij}^{N} \left( \frac{1}{T} \sum_{\tau}^{T} Y_{i\tau} Y_{j\tau} - C^{O}_{ij} \right) \left( Y_{jt}\, \delta_{ik} + Y_{it}\, \delta_{jk} \right) \qquad (4)$$

Defining $\Delta_{ij} = \frac{1}{T} \sum_{\tau}^{T} Y_{i\tau} Y_{j\tau} - C^{O}_{ij} = C_{ij} - C^{O}_{ij}$, this becomes $\frac{\partial E}{\partial Y_{kt}} = \frac{2}{T} \sum_{ij}^{N} \Delta_{ij} \left( Y_{jt}\, \delta_{ik} + Y_{it}\, \delta_{jk} \right)$, where $\delta$ represents the Kronecker delta function. Since the autocorrelation matrix is symmetric and the iterators match, the two terms can be combined into one simplified expression: $\frac{\partial E}{\partial Y_{kt}} = \frac{4}{T} \sum_{j}^{N} \Delta_{kj} Y_{jt}$. Averaging this value over time can be interpreted as the expected change in the error with respect to a single output over time.
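In matrix form, the simplified gradient is $(4/T)\,\Delta Y$ with $\Delta = C - C^O$. The sketch below implements it and, as an added sanity check that is ours rather than the paper's, verifies one entry against a finite-difference approximation:

```python
import numpy as np

def error_and_grad(Y, C_target):
    """E = ||C - C^O||_F^2 and dE/dY = (4/T) * Delta @ Y,
    using Delta = C - C^O (the combined form derived above)."""
    T = Y.shape[1]
    Delta = (Y @ Y.T) / T - C_target
    E = float(np.sum(Delta * Delta))
    dY = (4.0 / T) * (Delta @ Y)
    return E, dY

# Finite-difference check on a single entry (k, t).
rng = np.random.default_rng(2)
Y = rng.random((5, 20))
C_target = rng.random((5, 5))
C_target = (C_target + C_target.T) / 2   # symmetric, like a measured autocorrelation
E, dY = error_and_grad(Y, C_target)
eps, k, t = 1e-6, 3, 7
Yp = Y.copy(); Yp[k, t] += eps
Ep, _ = error_and_grad(Yp, C_target)
print(dY[k, t], (Ep - E) / eps)          # the two values should agree closely
```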
Using traditional back-propagation algorithms, we can then adjust every weight in the network to reduce the error. At the output layer, this update can be accomplished as:

$$w_i = w_i - \alpha\, \frac{\partial E}{\partial Y}\, Y (1 - Y)\, Y_i \qquad (5)$$

where $Y_i$ denotes the input at that layer, $Y(1-Y)$ is the derivative of the sigmoid nonlinearity, $\alpha$ is the learning rate, and $w_i$ is any weight in that layer. The weights can also be updated at any hidden layer by back-propagating the error vector further:

$$w_i = w_i - \alpha \left( \sum_{n \in I_j} \delta_n w_n \right) Y_j (1 - Y_j)\, Y_i \qquad (6)$$

where $\delta_n$ is defined as the backward-pointing error vector from the next layer along any weight $w_n$, and $n \in I_j$ denotes iteration over all nodes in the next layer, $j$.
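At the output layer, Equation 5 then amounts to one gradient step on the hidden-to-output weights. The following sketch assumes a sigmoid layer $Y = \sigma(WH)$ with activations arranged as columns over T frames; bias terms are omitted, and all names and shapes are our own:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adapt_output_weights(W, H, C_target, alpha):
    """One unsupervised update of output-layer weights (Eq. 5).

    W        : (N, M) hidden-to-output weights, Y = sigmoid(W @ H)
    H        : (M, T) hidden-layer activations over T frames
    C_target : (N, N) co-activation measured in clean speech
    """
    T = H.shape[1]
    Y = sigmoid(W @ H)                    # (N, T) outputs
    Delta = (Y @ Y.T) / T - C_target      # C - C^O
    dY = (4.0 / T) * (Delta @ Y)          # dE/dY, combined form
    dpre = dY * Y * (1.0 - Y)             # through the sigmoid derivative
    W_new = W - alpha * (dpre @ H.T)      # gradient step on the weights
    E = float(np.sum(Delta * Delta))
    return W_new, E
```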
3. RESULTS

3.1. An autoencoder neural network model with high-level feedback

To provide an intuitive account of how this process works in a neural network model, we first incorporate the proposed feedback into an autoencoder network trained to map the speech spectrogram onto itself. Since the network has not been trained in any condition other than clean speech, any distortion in the input will also be mapped to the output. However, if the feedback described by Equation 6 is integrated into this network, with $C^O$ being the expected statistics of clean speech, the network will automatically change the synaptic weights from the hidden to the output layer to minimize the distance between the autocorrelation of the output nodes in noise and the pattern measured in clean conditions (Figure 3, a and b). As shown in Figure 3b, the autocorrelation of the speech output in jet noise at 5 dB SNR [15] differs significantly from the clean condition (Figure 3a). However, the network is able to recover the autocorrelation pattern, as shown in Figure 3c. The reduced error for the autocorrelation statistics is quantified in Figure 3d and correlates well with the reduction of the spectrogram error, shown in red, even though the objective function only includes the autocorrelation measure. The adapted hidden-to-output weight matrix shown in Figure 3e exhibits a selective suppression of the frequencies where the noise is larger than the speech (blue), for which the network has learned to restore the speech signal from its correlated occurrence with lower frequencies (red areas in Figure 3e). It is worth emphasizing that the network has no knowledge of the noise; it only uses the expected statistics of the output in clean conditions to re-wire itself, which results in suppression of unwanted variability and a more robust output.

Fig. 3. An autoencoder network with feedback. (a-c) Spectrograms and autocorrelation statistics in the clean, noisy, and recovered conditions. (d) Decreased autocorrelation and spectrogram errors with respect to clean as a function of iteration. (e) Quantifying the change in the hidden-to-output weight matrix.
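The closed-loop behavior summarized in Figure 3d can be emulated end-to-end with the pieces above: measure the clean-condition co-activation of the outputs, then iteratively adapt the hidden-to-output weights on noisy input. The toy autoencoder below uses random stand-in data, and its sizes, learning rate, and iteration count are illustrative choices of ours, not the paper's configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
F, M, T = 24, 64, 500                    # spectrogram bins, hidden size, frames
W1 = 0.1 * rng.standard_normal((M, F))   # input-to-hidden (kept fixed here)
W2 = 0.1 * rng.standard_normal((F, M))   # hidden-to-output (adapted)

X_clean = rng.random((F, T))             # stand-ins for spectrogram frames
X_noisy = np.clip(X_clean + 0.3 * rng.standard_normal((F, T)), 0, 1)

# Target statistics: output co-activation measured on clean input.
Y = sigmoid(W2 @ sigmoid(W1 @ X_clean))
C_target = (Y @ Y.T) / T

alpha = 0.05
for it in range(15):                     # cf. the ~15 iterations in Fig. 3d
    H = sigmoid(W1 @ X_noisy)
    Y = sigmoid(W2 @ H)
    Delta = (Y @ Y.T) / T - C_target
    dY = (4.0 / T) * (Delta @ Y)
    W2 -= alpha * ((dY * Y * (1 - Y)) @ H.T)
    # E should decrease over iterations, mirroring Fig. 3d
    print(it, round(float(np.sum(Delta * Delta)), 5))
```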
3.2. Phoneme classification in a DNN with feedback

While the simulation results of the autoencoder network qualitatively demonstrate the computations that can be achieved with the proposed closed-loop framework, here we test this mechanism in a deep neural network model trained for phoneme recognition.

The network used in this study was trained for context-independent phoneme recognition on the clean training set of the WSJ Aurora 4 corpus. The input to the network consists of 11 shifted frames of 24-dimensional log Mel filter bank coefficients and their two temporal derivatives. There were five hidden layers utilizing a sigmoid nonlinearity (256 nodes each). The softmax output of the network has 41 nodes corresponding to the HMM emission probabilities of 40 English phonemes and silence. The model weights were initialized using unsupervised RBM layer-wise pretraining; parameters were then optimized using 25 epochs of back-propagation with a cross-entropy objective function.
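For reference, the forward pass of a network with this shape can be written compactly. This is our own schematic of the architecture described above, with 11 x 24 x 3 = 792 inputs, five 256-unit sigmoid hidden layers, and a 41-way softmax output; the RBM pretraining, bias terms, and the cross-entropy training loop are not reproduced here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

# 11 frames x (24 log Mel + 2 temporal derivatives) = 792 inputs,
# five sigmoid hidden layers of 256 nodes, 41 softmax outputs.
sizes = [792, 256, 256, 256, 256, 256, 41]
rng = np.random.default_rng(4)
weights = [0.05 * rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(X, weights):
    """Return the activations of every layer for input frames X (792, T)."""
    acts = [X]
    for W in weights[:-1]:
        acts.append(sigmoid(W @ acts[-1]))
    acts.append(softmax(weights[-1] @ acts[-1]))  # phoneme posteriors
    return acts

X = rng.random((792, 100))         # 100 random frames as a stand-in
acts = forward(X, weights)
print([a.shape[0] for a in acts])  # [792, 256, 256, 256, 256, 256, 41]
```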
This mechanism is performed on each layer of the network separately, meaning that each layer attempts to regulate itself by changing the weights connecting it to the previous layer. The autocorrelation values for each layer are estimated in clean conditions, and the network adapts to unseen noisy conditions not included in any of the training phases. During the test phase, the network monitors the autocorrelation statistics between the nodes and adapts the weights between layers to reduce the Frobenius distance between the co-activation of nodes in each layer and the expected co-activation measured in the clean condition. The phoneme classification accuracy was tested after progressively adapting each layer to examine the cumulative effect on the output accuracy. We tested the network with and without feedback in a variety of noise types and SNR conditions.
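This per-layer scheme can be driven by a small loop: each hidden layer keeps the co-activation pattern measured in clean conditions, and at test time the weights feeding that layer are nudged down the gradient of that layer's own Frobenius error. The sketch below shares the assumptions of the previous examples (sigmoid layers, column-wise activations) and, for simplicity, treats every layer as sigmoid; all names are ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adapt_layer(weights, X, layer, C_targets, alpha=0.05):
    """One unsupervised update of the weights feeding hidden layer
    `layer` (1-based), using that layer's clean-condition statistics."""
    acts = [X]
    for W in weights:
        acts.append(sigmoid(W @ acts[-1]))   # forward pass, all-sigmoid for simplicity
    H, Y = acts[layer - 1], acts[layer]      # input to, and output of, the layer
    T = Y.shape[1]
    Delta = (Y @ Y.T) / T - C_targets[layer]
    dY = (4.0 / T) * (Delta @ Y)
    weights[layer - 1] -= alpha * ((dY * Y * (1 - Y)) @ H.T)
    return float(np.sum(Delta * Delta))      # this layer's Frobenius error

# Progressive adaptation, mirroring Table 1: first HL1, then HL1-2, ...
# C_targets[k] would be measured by passing clean speech through layer k.
# for depth in range(1, 6):
#     for layer in range(1, depth + 1):
#         E = adapt_layer(weights, X_noisy, layer, C_targets)
```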
Figure 4 shows how the error ($E$, Equation 1) is reduced in different layers of the network. The layers are adapted consecutively, so the reduction of error seen in all layers confirms the benefit of implementing feedback in every layer of the network.

Fig. 4. Distance between autocorrelation measures is reduced after adaptation in all hidden layers of the network.

Next, we tested the effect of the feedback on phoneme classification accuracy in noise and observed significant improvement in all conditions tested, for both noise types (Figure 5a) and SNRs (Figure 5b). Figure 5 shows increasingly improved phoneme classification accuracy, where the biggest improvement occurs after the adaptation of the first layer. Introducing feedback to the network improves performance particularly in cases with a low signal-to-noise ratio (SNR). The improvement in phoneme classification can be seen across a wide variety of noise types. Unsurprisingly, the improvement is less noticeable for the 'babble' noise condition than for any other type of noise, as babble noise results in an output correlation that is distinctly similar to that of speech and therefore cannot be as readily separated (Figure 5b). Figure 5 also suggests that adapting all the layers of the network achieves the best performance, justifying the computational cost. Lastly, Table 1 shows the phoneme classification accuracy averaged over all noise types and SNRs, demonstrating the overall effect of the feedback in various hidden layers.

Layers with feedback    Accuracy (%)
None                    34.90
Hidden Layer 1          40.70
Hidden Layers 1-2       41.84
Hidden Layers 1-3       42.48
Hidden Layers 1-4       42.82
Hidden Layers 1-5       42.87

Table 1. Phoneme classification accuracy for feedback in various layers.
Fig. 5. (a) Phoneme classification accuracy for each network layer averaged over noise types, as a function of signal-to-noise ratio. (b) Classification accuracy in different noise types (white, pink, babble, destroyer, F16), averaged over all SNR conditions.

4. CONCLUSION

We have shown the feasibility of integrating top-down, knowledge-based feedback into common deep neural network architectures to implement novel, task-related computations. This feedback methodology could theoretically be used for any application where the statistical structure of the desired network output is known, but a training set of known inputs and desired outputs may be impractical, unavailable, or limited. We have also demonstrated the benefit of this new dynamic network in phoneme classification accuracy in conditions not included in the training of the network. However, more realistic benchmarks will be required to more rigorously assess the added benefit. Future work also includes back-propagating the error signal to all layers of the network and examining the interactions between the error measures obtained in each layer. Previous studies have shown an increasingly complex selectivity to higher-order features of speech in deeper layers of neural network models [14]; we will therefore investigate whether the feedback in various layers is able to perform different types of computation.
5. REFERENCES

[1] N. Mesgarani and E. F. Chang, "Selective cortical representation of attended speaker in multi-talker speech perception," Nature, vol. 485, no. 7397, pp. 233-236, 2012.

[2] M. Elhilali, J. B. Fritz, T. S. Chi, and S. A. Shamma, "Auditory cortical receptive fields: stable entities with plastic abilities," J. Neurosci., vol. 27, no. 39, pp. 10372-10382, 2007.

[3] J. B. Fritz, M. Elhilali, S. V. David, and S. A. Shamma, "Auditory attention - focusing the searchlight on sound," Curr. Opin. Neurobiol., vol. 17, no. 4, pp. 437-455, 2007.

[4] S. Shamma and J. Fritz, "Adaptive auditory computations," Curr. Opin. Neurobiol., vol. 25, pp. 164-168, 2014.

[5] J. Fritz, S. Shamma, M. Elhilali, and D. Klein, "Rapid task-related plasticity of spectrotemporal receptive fields in primary auditory cortex," Nat. Neurosci., vol. 6, no. 11, pp. 1216-1223, 2003.

[6] J. Fritz, M. Elhilali, and S. Shamma, "Active listening: task-dependent plasticity of spectrotemporal receptive fields in primary auditory cortex," Hear. Res., vol. 206, no. 1-2, pp. 159-176, 2005.

[7] Y. Wang, H. O'Donohue, and P. Manis, "Short-term plasticity and auditory processing in the ventral cochlear nucleus of normal and hearing-impaired animals," Hear. Res., vol. 279, no. 1, pp. 131-139, 2011.

[8] N. Ding and J. Z. Simon, "Emergence of neural encoding of auditory objects while listening to competing speakers," Proc. Natl. Acad. Sci., vol. 109, no. 29, pp. 11854-11859, 2012.

[9] J. A. O'Sullivan, A. J. Power, N. Mesgarani, S. Rajaram, J. J. Foxe, B. G. Shinn-Cunningham, M. Slaney, S. A. Shamma, and E. C. Lalor, "Attentional selection in a cocktail party environment can be decoded from single-trial EEG," Cereb. Cortex, p. bht355, 2014.

[10] N. Mesgarani, S. Thomas, and H. Hermansky, "Adaptive stream fusion in multistream recognition of speech," 2011.

[11] N. Mesgarani, S. Thomas, and H. Hermansky, "A multistream multiresolution framework for phoneme recognition," in Proc. Interspeech, Makuhari, Japan, pp. 318-321, 2010.

[12] N. Mesgarani, S. Thomas, and H. Hermansky, "Toward optimizing stream fusion in multistream recognition of speech," J. Acoust. Soc. Am., vol. 130, no. 1, pp. EL14-EL18, 2011.

[13] S. A. Shamma, M. Elhilali, and C. Micheyl, "Temporal coherence and attention in auditory scene analysis," Trends Neurosci., 2010.

[14] T. Nagamine, M. L. Seltzer, and N. Mesgarani, "Exploring how deep neural networks form phonemic categories," in Proc. Interspeech, Dresden, Germany, 2015.

[15] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Commun., vol. 12, no. 3, pp. 247-251, 1993.