SPEECH RECOGNITION WITH DEEP RECURRENT NEURAL NETWORKS
ALEX GRAVES, ABDEL-RAHMAN MOHAMED AND GEOFFREY HINTON
DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF TORONTO
ICASSP 2013

Highlights
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long-range context that empowers RNNs.
Compares the performance of CTC with the RNN Transducer.
Presents an enhancement to the RNN Transducer method (described earlier) that jointly trains two separate RNNs as acoustic and linguistic models.
Achieved the best recorded score on the TIMIT phoneme recognition benchmark at its time of publication.

Motivation for Deep Networks
RNNs are inherently deep in time, since their hidden state is a function of all previous hidden states. The question that inspired this paper was whether RNNs could also benefit from depth in space; that is, from stacking multiple recurrent hidden layers on top of each other, just as feedforward layers are stacked in conventional deep networks.
Deep bidirectional LSTM is the main architecture used in this paper.
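A minimal sketch of such a stacked deep bidirectional LSTM, assuming PyTorch; the input size, layer width, layer count and output size below are illustrative placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DeepBiLSTM(nn.Module):
    def __init__(self, input_dim=123, hidden_dim=250, num_layers=5, num_classes=62):
        super().__init__()
        # Stacking recurrent layers adds depth "in space" on top of the
        # depth "in time" that recurrence already provides.
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True)
        # Linear output layer over labels (e.g. phonemes plus a CTC blank).
        self.out = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):            # x: (batch, time, input_dim)
        h, _ = self.lstm(x)          # h: (batch, time, 2 * hidden_dim)
        return self.out(h)           # per-frame logits for CTC / transducer training

model = DeepBiLSTM()
logits = model(torch.randn(2, 100, 123))   # -> shape (2, 100, 62)
```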
This was the first application of deep LSTM to speech recognition, and it yielded a dramatic improvement over single-layer LSTM.

Training
The output distribution was defined in two ways for training the network:
Connectionist Temporal Classification (described earlier)
RNN Transducer (described earlier)
Enhancement: in the original formulation, Pr(k|t, u) was defined by taking an ‘acoustic’ distribution Pr(k|t) from the CTC/transcription network and a ‘linguistic’ distribution Pr(k|u) from the prediction network, then multiplying the two together and renormalising. The improvement introduced in this paper is instead to feed the hidden activations of both networks into a separate feedforward output network, whose outputs are then normalised with a softmax function to yield Pr(k|t, u). This allows a richer set of possibilities for combining linguistic and acoustic information, and appears to lead to better generalisation (a sketch of this output network follows below).

Decoding and Regularization
Decoding: RNN transducers can be decoded with beam search to yield an n-best list of candidate transcriptions. In the past, CTC networks have been decoded using either a form of best-first decoding known as prefix search, or by simply taking the most active output at every timestep. The authors find beam search both faster and more effective than prefix search for CTC.
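A sketch of the improved transducer output network described above, assuming PyTorch; the concatenation, the tanh hidden layer, and all sizes and names are assumptions rather than the paper's exact design. It shows the key idea: combine the hidden activations of the transcription network (index t) and prediction network (index u) in a small feedforward net whose softmax yields Pr(k|t, u).

```python
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    def __init__(self, acoustic_dim, linguistic_dim, joint_dim, num_labels):
        super().__init__()
        self.hidden = nn.Linear(acoustic_dim + linguistic_dim, joint_dim)
        self.out = nn.Linear(joint_dim, num_labels)     # labels include the blank

    def forward(self, f_t, g_u):
        # f_t: (T, acoustic_dim) transcription-network activations
        # g_u: (U, linguistic_dim) prediction-network activations
        pairs = torch.cat([f_t.unsqueeze(1).expand(-1, g_u.size(0), -1),
                           g_u.unsqueeze(0).expand(f_t.size(0), -1, -1)], dim=-1)
        joint = torch.tanh(self.hidden(pairs))           # (T, U, joint_dim)
        return torch.log_softmax(self.out(joint), dim=-1)  # log Pr(k | t, u)

joint = JointNetwork(acoustic_dim=250, linguistic_dim=250, joint_dim=250, num_labels=62)
log_probs = joint(torch.randn(40, 250), torch.randn(10, 250))   # -> (40, 10, 62)
```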
Regularisation: two regularisers were used in this paper: early stopping and weight noise (the addition of Gaussian noise to the network weights during training). “Weight noise tends to ‘simplify’ neural networks, in the sense of reducing the amount of information required to transmit the parameters, which improves generalisation.” A training-loop sketch of weight noise follows the results below.
Phoneme recognition experiments were performed on the TIMIT corpus.

Results and Take-aways
The advantage of deep networks is immediately obvious, with the error rate for CTC dropping from 23.9% to 18.4% as the number of hidden levels increases from one to five. Further: (a) LSTM works much better than tanh units for this task, (b) bidirectional LSTM has a slight advantage over unidirectional LSTM, and (c) depth is more important than layer size (which supports previous findings for deep networks). The advantage of the transducer over CTC is slight when the weights are randomly initialised; it becomes more substantial when pretraining is used.
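A hedged training-loop sketch of the weight-noise regulariser mentioned above, assuming PyTorch; the loop structure, `loss_fn`, `batch` and the noise standard deviation are illustrative placeholders, not the paper's exact procedure. Noise is added to the weights before the forward/backward pass and removed before the update, so only the underlying clean weights are learned.

```python
import torch

def train_step_with_weight_noise(model, loss_fn, batch, optimizer, sigma=0.075):
    noises = []
    with torch.no_grad():
        for p in model.parameters():
            noise = torch.randn_like(p) * sigma   # sigma is an illustrative value
            p.add_(noise)                         # perturb weights for this step
            noises.append(noise)
    loss = loss_fn(model, batch)                  # forward pass with noisy weights
    optimizer.zero_grad()
    loss.backward()                               # gradients taken at noisy weights
    with torch.no_grad():
        for p, noise in zip(model.parameters(), noises):
            p.sub_(noise)                         # restore the clean weights
    optimizer.step()                              # update the clean weights
    return loss.item()
```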
JAN CHOROWSKI, KYUNGHYUN CHO, YOSHUA BENGIO
UNIVERSITY OF MONTREAL
ARXIV, 2014

Highlights of the paper:
Utilizes an RNN transducer (unlike CTC, which is an acoustic-only model, the RNN transducer has another RNN which acts as a language model; discussed in detail as the second paper in this slide).
Model used in this paper: a bidirectional RNN encoder coupled to an RNN decoder, where the alignment between the input and output sequence is established using an attention mechanism (the decoder emits each symbol based on a context created from a subset of input symbols selected by the attention mechanism).
Final model

Advantage of the attention model (the novel contribution of the paper)
In sequence models, only a few localized inputs are responsible for producing a given output. It is therefore a waste of resources to look at the entire input sequence when estimating the output for a certain time frame, and even worse to look only at the input corresponding to a particular frame to estimate the output for that frame. Thus the model introduces parameters alpha, which give the contribution of each input cell when calculating the output of a cell in the layer above. The important part of an attention model is that each decoder output word now depends on a weighted combination of all the input states, not just the last state. The alphas are weights that define how much of each input state should be considered for each output. So, if alpha_23 is a large number, this means the decoder pays a lot of attention to the second state in the source sentence while producing the third word of the target sentence. The alphas are typically normalized to sum to 1 (so they form a distribution over the input states).
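A minimal sketch of this attention step, assuming PyTorch; the additive scoring function, sizes and names are illustrative simplifications, not the paper's exact mechanism. For each decoder step, scores over the encoder states are normalised with a softmax to give the alpha weights, and the context is their weighted sum.

```python
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (T, enc_dim) encoder outputs; dec_state: (dec_dim,) decoder state
        scores = self.v(torch.tanh(self.W_enc(enc_states) + self.W_dec(dec_state)))
        alphas = torch.softmax(scores.squeeze(-1), dim=0)       # sum to 1 over T
        context = (alphas.unsqueeze(-1) * enc_states).sum(dim=0)
        return context, alphas   # alphas can be plotted to visualise the alignment

attn = SimpleAttention(enc_dim=256, dec_dim=256, attn_dim=128)
context, alphas = attn(torch.randn(50, 256), torch.randn(256))
```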
A big advantage of attention is that it gives us the ability to interpret and visualize what the model is doing. For example, by visualizing the attention weight matrix when a sentence is translated, we can understand how the model is translating (for a machine translation task).

Deep Speech: Scaling up end-to-end speech recognition
BAIDU RESEARCH, DECEMBER 2014

Major highlights
Architecture significantly simpler than traditional speech recognition systems.
The model itself learns background noise, reverberation and speaker variations from the data; no need for hand-engineered features.
No concept of phonemes.
Uses novel data synthesis techniques that help obtain a large amount of data for training the model.
Outperforms traditional systems on the Switchboard dataset.
Made possible by recent advancements: CTC by Graves et al., and rapid training of large neural nets by multi-GPU computation by Coates et al.

Model Description
Uses a spectrogram as input.
Employs a character-level RNN (converts an input sequence x into a sequence of character probabilities for the transcription y).
RNN model: 5 layers. The first 3 layers are non-recurrent, so each operates on a single frame (plus its context window) independently of the rest of the sequence.
The 4th layer is a bidirectional recurrent layer.
The 5th layer is non-recurrent.
The output layer is a softmax unit over characters (a sketch of the layer equations follows below).
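A hedged NumPy sketch of the layer equations summarised above, written from the Deep Speech paper's description: layers 1-3 apply a per-frame affine transform followed by a clipped ReLU g(z) = min(max(z, 0), 20); layer 4 runs a forward and a backward recurrent pass whose outputs are summed; layer 5 is another non-recurrent layer; the output is a per-frame softmax over characters. All weight arrays are placeholders supplied by the caller, so treat this as an illustration rather than the paper's exact code.

```python
import numpy as np

def clipped_relu(z):
    # g(z) = min(max(z, 0), 20)
    return np.minimum(np.maximum(z, 0.0), 20.0)

def deep_speech_forward(x, W, b, Wf, Wb):
    """x: (T, input_dim) spectrogram frames (with context); W, b: per-layer
    weights/biases; Wf, Wb: recurrent weights of layer 4's forward/backward parts."""
    h = x
    for l in range(3):                        # layers 1-3: h^(l)_t = g(W^(l) h^(l-1)_t + b^(l))
        h = clipped_relu(h @ W[l].T + b[l])
    T, d = h.shape[0], b[3].shape[0]
    hf, hb = np.zeros((T, d)), np.zeros((T, d))
    for t in range(T):                        # layer 4, forward-in-time part
        prev = hf[t - 1] if t > 0 else np.zeros(d)
        hf[t] = clipped_relu(h[t] @ W[3].T + prev @ Wf.T + b[3])
    for t in reversed(range(T)):              # layer 4, backward-in-time part
        nxt = hb[t + 1] if t + 1 < T else np.zeros(d)
        hb[t] = clipped_relu(h[t] @ W[3].T + nxt @ Wb.T + b[3])
    h4 = hf + hb                              # combine the two directions by summing
    h5 = clipped_relu(h4 @ W[4].T + b[4])     # layer 5: non-recurrent
    logits = h5 @ W[5].T + b[5]               # output layer
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)   # softmax over characters per frame
```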
Overall model

Other important tricks used
Because they have a lot of training data, it is important to prevent overfitting, so dropout of 5-10% is introduced (only in the feedforward layers, not in the recurrent layers); see the sketch below.
They also use the concept of jittered inputs (as in computer vision).
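A small sketch of that regularisation choice, assuming PyTorch: Dropout modules follow only the non-recurrent layers, while the recurrent layer gets none. The sizes, the 5% rate, and the use of Hardtanh(0, 20) as a stand-in for the clipped ReLU are illustrative assumptions.

```python
import torch.nn as nn

# Dropout only after the non-recurrent (feedforward) layers.
feedforward = nn.Sequential(
    nn.Linear(494, 1824), nn.Hardtanh(0, 20), nn.Dropout(p=0.05),
    nn.Linear(1824, 1824), nn.Hardtanh(0, 20), nn.Dropout(p=0.05),
    nn.Linear(1824, 1824), nn.Hardtanh(0, 20), nn.Dropout(p=0.05),
)
recurrent = nn.RNN(1824, 1824, bidirectional=True)   # no dropout inside the recurrence
```

Language Model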
Mostly the RNN output is good, but it sometimes tends to make certain phonetic mistakes. This is logical: because a spectrogram is being fed in, some phones can be confused due to various phonetic effects, so a language model is needed to disambiguate them.
They therefore use an n-gram language model, which is easily trained from a huge unlabelled text corpus.
With the introduction of the language model, it is now required to find a sequence c that maximises the combined objective sketched below.
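A hedged sketch of that combined decoding objective, following the Deep Speech paper's formulation: the RNN's log-probability of the character string, plus a weighted language-model log-probability, plus a word-count bonus. The function names and default weights below are placeholders.

```python
def combined_score(c, log_p_rnn, log_p_lm, alpha=1.0, beta=1.0):
    # Q(c) = log P_rnn(c | x) + alpha * log P_lm(c) + beta * word_count(c)
    # alpha and beta are placeholder values; the paper tunes them on held-out data.
    return log_p_rnn(c) + alpha * log_p_lm(c) + beta * len(c.split())
```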
They have used a beam search algorithm to optimize the objective
Results
Use a multiple GPU configuration ( >= 5 GPUs) and had