
End-to-End Automatic Speech Recognition
KUNAL DHAWAN, KUMAR PRIYADARSHI


SPEECH RECOGNITION WITH DEEP RECURRENT NEURAL NETWORKS

ALEX GRAVES, ABDEL-RAHMAN MOHAMED AND GEOFFREY HINTON


DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF TORONTO
ICASSP 2013
Highlights
 This paper investigates deep recurrent neural networks which
combine the multiple levels of representation that have proved so
effective in deep networks with the flexible use of long range
context that empowers RNNs.
 Compares the performance of CTC with RNN Transducer
 Presents an enhancement to the RNN Transducer method
(described earlier) that jointly trains two separate RNNs as acoustic
and linguistic models
 Achieves the best recorded score on the TIMIT phoneme
recognition benchmark at the time of publication
Motivation for Deep Network
 RNNs are inherently deep in time, since their hidden state is a
function of all previous hidden states. The question that inspired this
paper was whether RNNs could also benefit from depth in space; that is,
from stacking multiple recurrent hidden layers on top of each other, just
as feedforward layers are stacked in conventional deep networks.

 Deep bidirectional LSTM is the main architecture used in this paper.

This was the first time deep LSTM had been applied to speech
recognition, and it yielded a dramatic improvement over single-layer LSTM.
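As a rough illustration of this "depth in space" idea, a stack of bidirectional
LSTM layers can be sketched as below (a minimal PyTorch sketch with illustrative
sizes, e.g. 40 filterbank features and 62 output labels for the TIMIT phonemes
plus a CTC blank; not the authors' exact configuration):

import torch
import torch.nn as nn

class DeepBiLSTM(nn.Module):
    """Stack of bidirectional LSTM layers with a per-frame output layer."""
    def __init__(self, n_features=40, hidden=250, layers=5, n_outputs=62):
        super().__init__()
        # num_layers > 1 gives the "depth in space" discussed above;
        # bidirectional=True reads each utterance forwards and backwards.
        self.rnn = nn.LSTM(n_features, hidden, num_layers=layers,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_outputs)   # 2x hidden: both directions

    def forward(self, x):              # x: (batch, time, n_features)
        h, _ = self.rnn(x)             # h: (batch, time, 2*hidden)
        return self.out(h)             # per-frame logits, e.g. for a CTC loss

# Toy usage: a batch of 8 utterances, 300 frames of 40 features each
logits = DeepBiLSTM()(torch.randn(8, 300, 40))

Increasing the number of stacked layers is exactly the "stacking" the paper
studies, while the bidirectional recurrence provides long-range context in both
directions.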
Training
 The output distribution was defined in two ways for training the
network:
 Connectionist Temporal Classification (Described earlier)
 RNN Transducer (Described earlier)
 Enhancement: In the original formulation, Pr(k|t, u) was defined by taking
an ‘acoustic’ distribution Pr(k|t) from the CTC/Transcription network, a
‘linguistic’ distribution Pr(k|u) from the prediction network, then
multiplying the two together and renormalising. An improvement
introduced in this paper is to instead feed the hidden activations of both
networks into a separate feedforward output network, whose outputs are
then normalised with a softmax function to yield Pr(k|t, u). This allows a
richer set of possibilities for combining linguistic and acoustic information,
and appears to lead to better generalisation.
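A minimal sketch of this improved output network (the activation names f_t and
g_u for the transcription and prediction networks, and all layer sizes, are
illustrative assumptions rather than the paper's exact values):

import torch
import torch.nn as nn

class TransducerOutputNet(nn.Module):
    """Feed hidden activations of both RNNs into a feedforward net, then softmax."""
    def __init__(self, acoustic_dim=250, linguistic_dim=250, hidden=250, n_labels=62):
        super().__init__()
        self.hidden = nn.Linear(acoustic_dim + linguistic_dim, hidden)
        self.out = nn.Linear(hidden, n_labels)

    def forward(self, f_t, g_u):
        # f_t: transcription (CTC) network activations at input step t ("acoustic")
        # g_u: prediction network activations after output step u ("linguistic")
        joint = torch.tanh(self.hidden(torch.cat([f_t, g_u], dim=-1)))
        return torch.softmax(self.out(joint), dim=-1)   # Pr(k | t, u)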
Decoding and Regularization
 Decoding: RNN transducers can be decoded with beam search to yield
an n-best list of candidate transcriptions. In the past CTC networks have
been decoded using either a form of best-first decoding known as prefix
search, or by simply taking the most active output at every timestep. We
find beam search both faster and more effective than prefix search for
CTC.

 Regularisation: Two regularisers were used in this paper: early stopping
and weight noise (the addition of Gaussian noise to the network weights
during training; a minimal sketch follows this list)
 “Weight noise tends to ‘simplify’ neural networks, in the sense of reducing the
amount of information required to transmit the parameters, which
improves generalisation.”
 Phoneme recognition experiments were performed on the TIMIT corpus.
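A minimal sketch of the weight-noise regulariser for a PyTorch model (the noise
standard deviation and the save/restore bookkeeping are illustrative assumptions,
not the paper's exact recipe):

import torch

def add_weight_noise(model, std=0.075):
    """Add fresh zero-mean Gaussian noise to every weight before a training step.

    Returns the clean parameter values so they can be restored after the
    gradient update; the noise is resampled each time this is called.
    """
    clean = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            clean[name] = p.detach().clone()
            p.add_(torch.randn_like(p) * std)
    return clean

def restore_weights(model, clean):
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.copy_(clean[name])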
Results and Take-aways
 The advantage of deep networks is
immediately obvious, with the error rate
for CTC dropping from 23.9% to 18.4% as
the number of hidden levels increases
from one to five.
 (a) LSTM works much better than tanh
for this task, (b) bidirectional LSTM has a
slight advantage over
unidirectional LSTM and (c) depth is
more important than layer size (which
supports previous findings for deep
networks)
 The advantage of the transducer
over CTC is slight when the weights are
randomly initialised, but it becomes more
substantial when pretraining is used.
End-to-end Continuous Speech Recognition using
Attention-based Recurrent NN: First Results

JAN CHOROWSKI, KYUNGHYUN CHO, YOSHUA BENGIO

UNIVERSITY OF MONTREAL
ARXIV, 2014
Highlight of the paper:

 Utilizes an RNN transducer (unlike CTC, which is an acoustic-only
model, the RNN transducer has a second RNN that acts as a language
model; discussed in detail as the second paper in these slides)
 Model used in this paper: a bidirectional RNN encoder coupled to an
RNN decoder, where the alignment between the input and output
sequences is established using an attention mechanism (the decoder
emits each symbol based on a context created from a subset of
input symbols selected by the attention mechanism).
Final model
Advantage of the attention model
(the novel contribution of the paper)
 In sequence models, only a few localized inputs are responsible for
producing a given output. It is therefore a waste of resources to look at
the entire input sequence when estimating the output for a certain time
frame, and even worse to look only at the input corresponding to a
particular frame when estimating the output for that frame.
 Thus, we introduce parameters alpha, which give the contribution of
each previous cell when calculating the output of a cell in the layer above.
The important part in an attention model is that each decoder output
word now depends on a weighted combination of all the input states,
not just the last state. The alphas are weights that define how much of
each input state should be considered for each output. So, if the alpha
linking the third word of the target sentence to the second state of the
source sentence is large, the decoder pays a lot of attention to that
source state while producing that word. The alphas are typically
normalized to sum to 1 (so they form a distribution over the input states).
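A minimal NumPy sketch of this weighting step (the score values passed in are
hypothetical; in an attention model they would come from a small network that
scores each encoder state against the current decoder state):

import numpy as np

def attention_context(encoder_states, scores):
    """Weighted combination of encoder states for one decoder output step.

    encoder_states: (T, d) array, one hidden state per input frame
    scores: (T,) unnormalised relevance scores for the current output step
    """
    # Softmax turns the scores into alphas: positive weights that sum to 1
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()
    # Context vector: each input state contributes in proportion to its alpha
    return alphas @ encoder_states, alphas

# Toy usage: 4 encoder states of dimension 3
context, alphas = attention_context(np.random.randn(4, 3),
                                    np.array([0.1, 2.0, -1.0, 0.3]))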

A big advantage of attention is that it gives us the
ability to interpret and visualize what the model is
doing. For example, by visualizing the attention weight
matrix when a sentence is translated, we can
understand how the model is translating (in a machine
translation task).
Deep Speech: Scaling up end-to-end speech recognition
BAIDU RESEARCH
DECEMBER 2014
Major highlights

 Architecture significantly simpler than traditional speech recognition
systems
 Model itself learns background noise, reverberation and speaker
variations from the data; no need for hand-engineered features
 No concept of phonemes
 Uses novel data synthesis techniques that help them obtain a large
amount of data for training the model
 Outperforms traditional systems on the Switchboard dataset
 Made possible by recent advancements:
 CTC by Graves et al.
 Rapid training of large neural nets by multi-GPU computation, by
Coates et al.
Model Description
 Uses spectrograms as input
 Employs a character-level RNN (converts an input sequence x into a
sequence of character probabilities for the transcription y)

 RNN model: 5 layers; the first 3 layers are non-recurrent (their
per-layer operation is sketched below)
 4th layer is a bidirectional recurrent layer

 5th layer: non-recurrent

 Output layer is a softmax over characters (also sketched below)


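The per-layer operations can be written out as follows (a LaTeX sketch
reconstructed from the Deep Speech paper's notation as best recalled, with g a
clipped ReLU; treat it as a sketch rather than a verbatim copy):

% Layers 1-3: non-recurrent, applied at each time step t
h_t^{(l)} = g\left(W^{(l)} h_t^{(l-1)} + b^{(l)}\right), \quad l = 1,2,3,
\qquad g(z) = \min\{\max\{0, z\}, 20\}

% Layer 4: bidirectional recurrence, one forward and one backward pass
h_t^{(f)} = g\left(W^{(4)} h_t^{(3)} + W_r^{(f)} h_{t-1}^{(f)} + b^{(4)}\right),
\qquad
h_t^{(b)} = g\left(W^{(4)} h_t^{(3)} + W_r^{(b)} h_{t+1}^{(b)} + b^{(4)}\right),
\qquad h_t^{(4)} = h_t^{(f)} + h_t^{(b)}

% Layer 5 and the character softmax output
h_t^{(5)} = g\left(W^{(5)} h_t^{(4)} + b^{(5)}\right), \qquad
\Pr(c_t = k \mid x) = \operatorname{softmax}_k\left(W^{(6)} h_t^{(5)} + b^{(6)}\right)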
Overall model
Other important tricks used

 Because they have a lot of training data, it is important to prevent
overfitting -> introduce dropout between 5-10% (only in the feedforward
layers, not in the recurrent layer); a sketch of this placement follows below
 Use the concept of jittered inputs (as in computer vision)
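A minimal sketch of restricting dropout to the non-recurrent layers (PyTorch,
with hypothetical layer sizes; the recurrent layer simply omits the Dropout
module):

import torch
import torch.nn as nn

class FeedforwardBlock(nn.Module):
    """One non-recurrent layer with dropout on its output."""
    def __init__(self, d_in, d_out, p_drop=0.05):    # 5-10% dropout per the slide
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.drop = nn.Dropout(p_drop)               # not applied to the recurrent layer

    def forward(self, x):
        # Clipped ReLU keeps activations in [0, 20] before dropout is applied
        return self.drop(torch.clamp(self.linear(x), 0.0, 20.0))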
Language Model

 Mostly the RNN output is good, but it sometimes makes certain phonetic
mistakes – understandable because a spectrogram is being fed in and, due to
various phonetic effects, some phones can be confused -> thus a language
model is needed to disambiguate them
 They therefore use an n-gram language model, which is easily trained
from a huge unlabelled text corpus
 With the introduction of the language model, decoding now requires
finding a sequence c that maximises a combined objective (sketched
after this list):

 They have used a beam search algorithm to optimize the objective


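The combined objective can be sketched as follows (a reconstruction from the
Deep Speech paper; alpha and beta are tunable weights trading off the language
model score and a word-insertion bonus):

Q(c) = \log \Pr_{\mathrm{RNN}}(c \mid x)
       + \alpha \log \Pr_{\mathrm{LM}}(c)
       + \beta \, \mathrm{word\_count}(c)

Beam search then keeps only the highest-scoring partial character sequences at
each step while searching for the c that maximises Q(c).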
Results

Used a multi-GPU configuration (>= 5 GPUs) and performed
data and model parallelism.
Thank you!
