
Using SONIC to build a speech recognizer

Pellom & Hacioglu, ``SONIC: The University of Colorado Continuous
Speech Recognizer,'' Center for Spoken Language Research Technical
Report TR-CSLR-2001-01, University of Colorado, 2003

Presented by Yang Shao, CIS788K04 Wi04


Performance on standard tasks

 Results reported on a 1.7 GHz Pentium 4
Procedures
 Preparation
– identify the goal;
– decide the recognition unit: phoneme, syllable, word, etc.;
– prepare the corpus: training, development, and test sets;
– label part of the training data (optional);
– etc.
Procedures cont.
Ŵ = argmax_W p(O|W) P(W)
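This is the standard MAP decoding criterion; a short derivation via Bayes' rule (standard material, not specific to Sonic), in LaTeX notation:

\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \frac{p(O \mid W)\, P(W)}{p(O)}
        = \arg\max_{W} p(O \mid W)\, P(W)
% p(O) does not depend on W, so it drops out of the maximization;
% p(O|W) is the acoustic model score and P(W) is the language model score.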

 Training
– Acoustic model training;
– Language model training;
 Adaptation
– Speaker adaptation (VTLN, MLLR, MAP);
– Environment adaptation (mismatch between training and
testing conditions);
 Testing
Acoustic model training
 Feature extraction followed by iterative steps of Viterbi
state-based alignment and model estimation;
 Outputs a set of decision-tree state-clustered HMMs;
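To make the alignment step concrete, here is a minimal Python sketch of Viterbi forced alignment for one left-to-right state sequence; the function name and the per-frame log-likelihood input are illustrative assumptions, not the trainer's actual interface.

import numpy as np

def viterbi_align(frame_loglikes):
    # frame_loglikes: (T, S) log-likelihood of each frame under each HMM state
    # of the expanded state sequence, in order.  Left-to-right topology: at
    # each frame we either stay in the current state or advance by one.
    # Assumes T >= S so every state gets at least one frame.
    T, S = frame_loglikes.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = frame_loglikes[0, 0]          # alignment must start in state 0
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            advance = score[t - 1, s - 1] if s > 0 else -np.inf
            if stay >= advance:
                score[t, s], back[t, s] = stay, s
            else:
                score[t, s], back[t, s] = advance, s - 1
            score[t, s] += frame_loglikes[t, s]
    # trace back from the final state, where the alignment must end
    states = [S - 1]
    for t in range(T - 1, 0, -1):
        states.append(back[t, states[-1]])
    return states[::-1]                         # best state index per frame

# toy example: 6 frames aligned to 3 states
rng = np.random.default_rng(0)
print(viterbi_align(rng.standard_normal((6, 3))))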
Feature extraction (PMVDR)
 Perceptual Minimum Variance Distortionless
Response cepstral coefficients;
– fea [options] speechfile.raw featurefile.fea

 Dynamic features;
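As a concrete illustration of the dynamic features, a small Python sketch of the standard regression-based delta computation; the window size, edge padding, and names are assumptions, not the output format of the fea tool.

import numpy as np

def deltas(static, window=2):
    # static: (T, D) matrix of per-frame cepstral coefficients.
    # Standard regression formula: d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2*sum_n n^2),
    # with the first/last frame repeated at the edges.
    T, _ = static.shape
    padded = np.vstack([static[:1]] * window + [static] + [static[-1:]] * window)
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out = np.zeros_like(static, dtype=float)
    for n in range(1, window + 1):
        out += n * (padded[window + n:window + n + T] - padded[window - n:window - n + T])
    return out / denom

# hypothetical usage: 13 static coefficients per frame, plus delta and delta-delta
static = np.random.randn(100, 13)
d = deltas(static)
features = np.hstack([static, d, deltas(d)])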
Language Model I
 Finite state grammar in terms of a regular
expression;
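As a hedged illustration of the idea (not SONIC's grammar file syntax), a word-level regular expression can act as a small finite state grammar:

import re

# Illustrative only: a toy word-level "grammar as regular expression".
# Accepts e.g. "call john" or "please call mary smith".
grammar = re.compile(r"^(please )?call (john|mary)( smith)?$")
for utt in ["call john", "please call mary smith", "call bob"]:
    print(utt, "->", bool(grammar.match(utt)))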
Language model II
 Language model:
– P(W) = P(w1, w2, …, wm) gives the probability of a
given word sequence;
– expanded by the chain rule as
P(W) = P(w1) P(w2|w1) … P(wm|w1, …, wm-1);
– the N-gram approximation truncates the history to the
N-1 most recent words:
P(wi|w1, …, wi-1) ≈ P(wi|wi-N+1, …, wi-1);
– Calculated as a ratio of counts in the training text:
P(wi|wi-N+1, …, wi-1) = C(wi-N+1 … wi) / C(wi-N+1 … wi-1)
 Bigram example: P(Mary loves that person) =
P(Mary|<s>) P(loves|Mary) P(that|loves) P(person|that)
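To connect the count formula to code, here is a tiny Python sketch of maximum-likelihood bigram estimation; the toy corpus, function names, and lack of smoothing are illustrative assumptions, not how Sonic builds its language models.

from collections import Counter

def train_bigram(sentences):
    # Maximum-likelihood bigram counts: P(w | prev) = C(prev w) / C(prev).
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words[:-1])                 # history counts C(prev)
        bigrams.update(zip(words[:-1], words[1:]))  # pair counts C(prev w)
    def prob(word, prev):
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return prob

# toy corpus (hypothetical); a real model would add smoothing and backoff
p = train_bigram(["Mary loves that person", "Mary loves music"])
print(p("loves", "Mary"))   # 2/2 = 1.0 in this toy corpus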
Recognition overview
 Speech-enabled applications can be built by calling
functions within the Sonic API, or in batch mode from
the command line:
– Sonic_batch -c config.txt [-l]
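A hedged Python sketch that simply shells out to the command line above; it assumes Sonic_batch is on the PATH and that config.txt is a valid configuration file.

import subprocess

# Run the batch decoder with a configuration file; the optional -l flag
# shown above is omitted here.  Paths are assumptions for illustration.
result = subprocess.run(["Sonic_batch", "-c", "config.txt"],
                        capture_output=True, text=True, check=True)
print(result.stdout)   # whatever output the tool prints for the processed files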
Configuration file
 A text file in which each parameter is followed by its
arguments, establishing the basic settings of the
recognizer (see the parsing sketch after this list):
– location of the acoustic model files;
– location of the language model file;
– location of the pronunciation lexicon;
– recognizer settings such as search beams, pruning
settings, etc.;
– (optional) a pointer to a control file containing a list of audio
files to process.
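A minimal sketch of reading such a "parameter followed by arguments" text file; the whitespace splitting, "#" comments, and any key names are assumptions for illustration, not Sonic's actual configuration syntax.

def load_config(path):
    # Each non-empty, non-comment line: <parameter> <argument> [<argument> ...]
    settings = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, *args = line.split()
            settings[key] = args
    return settings

# e.g. settings might map "lexicon" -> ["/path/to/lexicon.txt"]  (hypothetical key)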
Components
 Audio file format:
– 16-bit linear PCM format (raw);
– sampling rate is configurable (8 kHz by default); a minimal
read sketch appears after this list;
 Phoneme configuration file format
– supports the 55-phoneme symbol set adopted by the
CMU Sphinx-II speech recognizer.
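A minimal read sketch for the raw audio format above, assuming little-endian 16-bit samples at the default 8 kHz rate (the file name is hypothetical):

import numpy as np

# Headerless (raw) 16-bit linear PCM: interpret the bytes as int16 samples.
samples = np.fromfile("utterance.raw", dtype="<i2").astype(np.float32) / 32768.0
duration_seconds = len(samples) / 8000.0   # assumes the default 8 kHz rate
print(f"{len(samples)} samples, {duration_seconds:.2f} s")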
Components cont.
 LM format
– supports up to a 4-gram language model
 Pronunciation lexicon format

 Acoustic model format


– binary files produced by the trainer function;
– naming convention: <phoneme>.<state>-<context>, e.g., AA.1-l;
Discussion
 Unlike HTK, the trainer code estimates
models for one base phone at a time.
Potential problem?
