output layer, this update can be accomplished as:

$$ w_i = w_i - \alpha \, \frac{\partial E}{\partial Y} \, Y(1-Y)\, Y_i \qquad (5) $$

where $Y_i$ denotes the input at that layer, $\alpha$ is the learning rate, and $w_i$ is any weight in that layer. The weights can also be updated at any hidden layer by back-propagating the error vector further:

$$ w_i = w_i - \alpha \left( \sum_{n \in I_j} \delta_n w_n \right) Y_j (1-Y_j)\, Y_i \qquad (6) $$

where $I_j$ indexes the nodes in the following layer that receive input from node $j$, and $\delta_n$ is the back-propagated error term of node $n$.
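As a concrete illustration of equations 5 and 6, the following minimal NumPy sketch performs one update step for a two-layer sigmoid network. The layer sizes, the squared-error objective, and all variable names are assumptions chosen for illustration, not details from this paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy sizes for illustration only (not the paper's network).
rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input vector Y_i to the hidden layer
W1 = rng.normal(size=(4, 3))      # input -> hidden weights
W2 = rng.normal(size=(2, 4))      # hidden -> output weights
alpha = 0.01                      # learning rate (alpha in eqs. 5-6)
t = np.array([1.0, 0.0])          # target; assume E = 0.5 * ||Y - t||^2

# Forward pass.
Yj = sigmoid(W1 @ x)              # hidden activations Y_j
Y = sigmoid(W2 @ Yj)              # output activations Y
E_before = 0.5 * np.sum((Y - t) ** 2)

# Eq. (5): output layer,  w_i <- w_i - alpha * dE/dY * Y(1-Y) * Y_i.
delta_out = (Y - t) * Y * (1.0 - Y)       # dE/dY times sigmoid derivative
# Eq. (6): hidden layer, back-propagating the error through the
# (pre-update) output weights:
#   w_i <- w_i - alpha * (sum_n delta_n * w_n) * Y_j(1-Y_j) * Y_i.
delta_hid = (W2.T @ delta_out) * Yj * (1.0 - Yj)

W2 = W2 - alpha * np.outer(delta_out, Yj)
W1 = W1 - alpha * np.outer(delta_hid, x)

# A small gradient step should not increase the error.
E_after = 0.5 * np.sum((sigmoid(W2 @ sigmoid(W1 @ x)) - t) ** 2)
```

Note that the hidden-layer error term is computed with the pre-update output weights, as in standard back-propagation.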
3. RESULTS

3.1. An autoencoder neural network model with high-level feedback
To provide an intuitive account of how this process works in a neural network model, we first incorporate the proposed feedback in an autoencoder network trained to map the speech spectrogram onto itself.

Since the network has not been trained in any condition other than clean speech, any distortion in the input will also
be mapped to the output. However, if the feedback described by equation 6 is integrated into this network, with $C_O$ being the expected statistics of clean speech, the network will automatically change the synaptic weights from the hidden to the output layer to minimize the distance between the autocorrelation of output nodes in noise and the pattern measured in clean (Figure 3, a and b). As shown in Figure 3b, the autocorrelation of the speech output in jet noise at 5 dB SNR [15] significantly differs from the clean condition (Figure 3a). However, the network is able to recover the autocorrelation pattern as shown in Figure 3c. The reduced error for the autocorrelation statistics is quantified in Figure 3e, and correlates well with the reduction of error of the spectrogram, shown in red.

Fig. 3. An autoencoder network with feedback. (a-c) Spectrograms and the autocorrelation statistics in the clean, noisy, and recovered conditions. (d) Decreased autocorrelation and spectrogram errors with respect to clean as a function of iteration. (e) Quantifying the change in the hidden-to-output weight matrix.

3.2. Phoneme classification in a DNN with feedback

While the simulation results of the autoencoder network qualitatively demonstrate the computations that can be achieved with the proposed closed-loop framework, here we test this mechanism in a deep neural network model trained for phoneme recognition.
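The statistic that the feedback tries to restore can be sketched as follows. This is a minimal NumPy illustration assuming the co-activation is the time-averaged outer product of layer activations; the helper names (`coactivation`, `feedback_error`) are hypothetical, and synthetic activations stand in for real network responses.

```python
import numpy as np

def coactivation(Y):
    """Co-activation statistic of a layer: the time-averaged outer
    product of node activations.  Y has shape (nodes, frames)."""
    return (Y @ Y.T) / Y.shape[1]

def feedback_error(Y, C_clean):
    """Frobenius distance between the observed co-activation and the
    statistic measured on clean speech."""
    return np.linalg.norm(coactivation(Y) - C_clean, ord="fro")

# Illustrative check with synthetic activations: additive noise pushes
# the co-activation away from the clean statistic.
rng = np.random.default_rng(1)
C_clean = coactivation(rng.normal(size=(8, 2000)))   # clean reference
clean = rng.normal(size=(8, 2000))                   # new clean sample
noisy = clean + rng.normal(scale=2.0, size=(8, 2000))

e_clean = feedback_error(clean, C_clean)
e_noisy = feedback_error(noisy, C_clean)
```

Under this sketch, a noisy input produces a clearly larger Frobenius error than a fresh clean sample, which is the signal the feedback exploits.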
The network used in this study was trained for context-independent phoneme recognition on the clean training set of the WSJ Aurora 4 corpus. The input to the network consists of 11 shifted frames of 24-dimensional log Mel filter bank coefficients and their two temporal derivatives. There were five hidden layers utilizing a sigmoid nonlinearity (256 nodes each). The softmax output of the network has 41 nodes corresponding to the HMM emission probabilities of 40 English phonemes and silence. The model weights were initialized using unsupervised RBM layer-wise pretraining; parameters were then optimized using 25 epochs of back-propagation with a cross-entropy objective function.

This mechanism is performed on each layer of the network separately, meaning that each layer attempts to regulate itself by changing the weights connecting it to the previous layer. The autocorrelation values for each layer are estimated in clean, and the network adapts to the unseen noisy conditions not included in any of the training phases. During the test phase, the network monitors the autocorrelation statistics between the nodes, and adapts the weights between layers to reduce the Frobenius distance between the co-activation of nodes in each layer and the expected co-activation measured in the clean condition. The phoneme classification accuracy was tested after progressively adapting each layer to examine the cumulative effect on the output accuracy. We tested the network with and without feedback in a variety of noise types and SNR conditions.

Fig. 4. Distance between autocorrelation measures is reduced after adaptation in all hidden layers of the network.

Layers with feedback    Accuracy (%)
None                    34.90
Hidden Layer 1          40.70
Hidden Layers 1-2       41.84
Hidden Layers 1-3       42.48
Hidden Layers 1-4       42.82
Hidden Layers 1-5       42.87

Table 1. Phoneme classification accuracy for feedback in various layers.
Figure 4 shows how the error (E, equation 1) is reduced in different layers of the network. The layers are adapted consecutively, so the reduction of error seen in all layers confirms the benefit of implementing feedback in every layer of the network. Next, we tested the effect of the feedback on phoneme classification accuracy in noise and observed significant improvement in all conditions tested for both noise types (Figure 5a) and SNR (Figure 5b). Figure 5 shows increasingly improved phoneme classification accuracy, where the biggest improvement occurs after the adaptation of the first layer. Introducing feedback to the network improves performance particularly in cases with a low signal-to-noise ratio (SNR). The improvement in phoneme classification can be seen across a wide variety of noise types. Unsurprisingly, the improvement is less noticeable for the 'babble' noise condition than for any other type of noise, as babble noise results in an output correlation that is distinctly similar to speech and therefore cannot be as readily separated (Figure 5b). Figure 5 also suggests that adapting all the layers of the network achieves the best performance, thereby justifying its computational cost. Lastly, Table 1 shows the phoneme classification accuracy averaged over all noise types and SNRs, demonstrating the overall effect of the feedback in various hidden layers.

4. CONCLUSION

We have shown the feasibility of integrating top-down, knowledge-based feedback in common deep neural network architectures to implement novel, task-related computations. This feedback methodology could theoretically be used for any application where the statistical structure of the desired network output is known, but a training set of known inputs and desired outputs may be impractical, unavailable, or limited. We have also demonstrated the benefit of this new dynamic network in phoneme classification accuracy in conditions not included in the training of the network. However, more realistic benchmarks will be required to more rigorously assess the added benefit. Future work also includes back-propagating the error signal to all layers of the network, and examining the interactions between error measures obtained in each layer. Previous studies have shown an increasingly complex selectivity to higher-order features of speech in deeper layers of neural network models [14]; we will therefore investigate whether the feedback in various layers is able to perform different types of computation.
[Figure 5: Phoneme accuracy (%) with and without feedback, shown as a function of SNR (dB) and across noise types (White, Pink, Babble, Destroyer, F16).]