
Pattern Recognition and Classification
(for speech recognition)

11-751 Speech Recognition
09-15-2008

Today:
- Course information & quick review
- The components of a modern ASR system
- Pattern Recognition / Classification
  (will continue with this topic on Wednesday)

Course Grading

30%  Homework Assignments
     - 4 assignments over the course

40%  Exam [12-Dec]
     - In-class final exam at the end of the course
     - Closed book, covers the material presented in the course

30%  Speech term project
     - Proposal (1-pager)             [due: 08-Oct]
     - Oral presentation (15 min)     [start of Dec]
     - Written report (10 pages max)  [due: 15-Dec]
     - Demonstration (if applicable)
     - Your ideas and creativity for projects are highly welcome
     - Details and project ideas given on Wednesday

Instructors

interACT, 2F, 407 S. Craig St.
(campus map reference: Newell-Simon Hall, Doherty Hall)

RM 203: Alex Waibel              (ahw@cs.cmu.edu)
RM 221: Ian Lane                 (ianlane@cs.cmu.edu)
RM 209: Yik-Cheung (Wilson) Tam  (yct@cs.cmu.edu)

What we have looked at so far

- Why speech recognition?
- Speech production
  - How humans generate speech
  - Vocal tract model of speech
- Features used for speech recognition
  - Spectral representation of speech
  - LPC (Linear Predictive Coding)
  - MFCC (Mel Frequency Cepstral Coefficients)
- Dynamic time warping and template matching
  - Isolated word recognition

Vocal Tract Model of Speech

[Figure: source-filter block diagram. An impulse train generator driven by the pitch period models voiced excitation, a random noise generator models unvoiced excitation; the selected excitation, scaled by a gain A, is passed through the vocal tract filter V (controlled by the vocal tract parameters) and the radiation model R to produce the speech signal.]

Sloppy Speech

Actual input:  "I have been I have been getting into ..."

Conversational speech - recognition: "and I am I being too yeah"
Read speech           - recognition: "I have been ties than getting into the"

Feature Extraction

Speech waveform
  -> FFT (FFT-based spectrum)
  -> Mel-scale triangular filters
  -> Log
  -> DCT
  -> 39-element acoustic vector

- Acoustic vectors are computed every 10 ms
- Mel-scale filters mimic auditory processing
- The DCT decorrelates the signal to improve statistical independence
- First and second differentials are appended to capture the dynamic information of the signal
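
A minimal sketch of this front-end pipeline, assuming numpy/scipy; the frame parameters (25 ms window, 10 ms shift, 26 filters, 13 cepstra plus deltas) and the filterbank construction are illustrative assumptions, not the exact settings used in the course.

```python
# Minimal MFCC-style front end (illustrative sketch; parameters are assumptions).
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters equally spaced on the mel scale
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fbank

def mfcc(signal, sr=16000, frame_ms=25, shift_ms=10, n_filters=26, n_ceps=13):
    frame_len = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)            # acoustic vector every 10 ms
    n_fft = 512
    frames = [signal[i:i + frame_len] * np.hamming(frame_len)
              for i in range(0, len(signal) - frame_len, shift)]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2            # FFT-based spectrum
    fbank_energies = power @ mel_filterbank(n_filters, n_fft, sr).T
    log_energies = np.log(fbank_energies + 1e-10)              # Log
    ceps = dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]  # DCT
    delta = np.gradient(ceps, axis=0)                          # first differential
    delta2 = np.gradient(delta, axis=0)                        # second differential
    return np.hstack([ceps, delta, delta2])                    # 39-element vectors

features = mfcc(np.random.randn(16000))   # 1 s of dummy audio
print(features.shape)                      # (n_frames, 39)
```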

Template Matching and DTW

First idea to overcome the varying length of utterances (Problem 2):
1. Normalize their length
2. Make a linear alignment

Linear alignment can handle the problem of different speaking rates between utterances.
But: it cannot handle the problem of varying speaking rates within the same utterance.
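
Dynamic time warping solves exactly this: a non-linear alignment found by dynamic programming. Below is a minimal sketch; the Euclidean frame distance, the step pattern, and the toy templates are illustrative assumptions.

```python
# Minimal DTW sketch: non-linear alignment of two feature sequences.
import numpy as np

def dtw_distance(X, Y):
    """X: (n, d) and Y: (m, d) acoustic vector sequences; returns alignment cost."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(X[i - 1] - Y[j - 1])   # frame-to-frame distance
            # horizontal, vertical and diagonal steps handle speaking-rate
            # variation *within* an utterance
            D[i, j] = local + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)   # length-normalized path cost

# Isolated word recognition: pick the reference template with minimal DTW cost
templates = {"hello": np.random.randn(80, 39), "world": np.random.randn(70, 39)}
test = np.random.randn(75, 39)
print(min(templates, key=lambda w: dtw_distance(test, templates[w])))
```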

Components of a Modern ASR System

Suggested reading:
S. Young, Large vocabulary continuous speech recognition: A review

ASR - the big picture

[Diagram: Input Speech -> ??? -> Output Text ("Hello world")]

ASR - the big picture

The purpose of signal preprocessing is:
1) Signal digitization (quantization and sampling):
   represent an analog signal in an appropriate form to be processed by the computer
2) Digital signal preprocessing (feature extraction):
   extract features that are suitable for the recognition process

[Diagram: Input Speech -> Front-end Processing -> ??? -> Output Text ("Hello world")]

Fundamental Equation

For an observed feature vector sequence x, find the most likely word sequence W:

  W^ = argmax_W P(W | x) = argmax_W [ P(W) P(x | W) / P(x) ] = argmax_W P(W) P(x | W)

Since P(x) does not depend on W, it can be dropped from the maximization.

[Diagram: Input Speech -> Front-end Processing -> ??? -> Output Text ("Hello world")]

Speech Recognition Decoding

For an observed feature vector sequence x, find the most likely word sequence W.
Search: how to efficiently find the maximizing W

  W^ = argmax_W P(W | x) = argmax_W [ P(W) P(x | W) / P(x) ] = argmax_W P(W) P(x | W)

[Diagram: Input Speech -> Front-end Processing -> decoder computing P(x|W) P(W) -> Output Text ("Hello world"); the Acoustic Model supplies P(x|W) and the Language Model supplies P(W)]

Acoustic Model

Given W, what is the likelihood of seeing the feature vector(s) x?

We need a representation for W in terms of feature vectors.
Usually a two-part representation:
- pronunciation dictionary: describes W as a concatenation of phones
- phone models that explain phones in terms of feature vectors

[Diagram: Input Speech -> Front-end Processing -> P(x|W) P(W) -> Output Text ("Hello world"); the Acoustic Model (phones) is combined with the Pronunciation Dictionary, which maps words to phone sequences, e.g. "I" -> /i/, "you" -> /j/ /u/, "we" -> /v/ /e/]

Why break words down into phones?

Problems with whole-word reference templates:
- Need a collection of reference patterns for each word
- High computational effort (especially for large vocabularies), proportional to vocabulary size
- Large vocabulary also means: a huge amount of training data is needed
- Difficult to train suitable references (or sets of references)
- Impossible to recognize untrained words
- Poor performance when the environment changes
- Works well only for speaker-dependent recognition (variations)
- Unsuitable where the speaker is unknown and no training is feasible
- Unsuitable for continuous speech (combinatorial explosion)
- Difficult to train/recognize subword units

Consequences:
- Replace whole words by suitable sub-word units
- Replace the template approach by a better modeling process

Speech Production as a Stochastic Process

- The same word / phoneme sounds different every time it is uttered
- Regard words / phonemes as states of a speech production process
- In a given state we can observe different acoustic sounds
  - Not all sounds are possible / likely in every state
  - We say: in a given state the speech process "emits" sounds according to some probability distribution
- The production process makes transitions from one state to another
  - Not all transitions are possible; they have different probabilities
- When we specify the probabilities for sound emissions (emission probabilities) and for the state transitions, we call this a model.

HMM Acoustic Modelling

[Figure: a left-to-right HMM with self-loop transition probabilities a22, a33, a44, forward transitions a12, a23, a34, a45, and output distributions b2, b3, b4 emitting the acoustic vector sequence Y = y1 y2 y3 y4 y5]

- A Hidden Markov Model is used for each phone or senone (context-dependent model)
- Transition probabilities a_ij model durational variability in speech
- Output distributions b_i(y_k) model spectral variability
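
To see how such a model scores an acoustic vector sequence, here is a minimal sketch of the forward algorithm computing P(Y | model); the toy transition matrix and unit-covariance Gaussian output densities are assumptions for illustration, not trained senone models.

```python
# Forward algorithm sketch for a small left-to-right HMM (toy parameters assumed).
import numpy as np
from scipy.stats import multivariate_normal

n_states, dim = 3, 2
A = np.array([[0.6, 0.4, 0.0],        # a_ij: durational variability
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])        # always start in the first state
means = np.random.randn(n_states, dim)
covs = [np.eye(dim)] * n_states       # b_i(y): spectral variability

def forward_loglik(Y):
    """Return log P(Y | HMM) via the scaled forward recursion."""
    B = np.array([[multivariate_normal.pdf(y, means[i], covs[i])
                   for i in range(n_states)] for y in Y])   # emission likelihoods
    alpha = pi * B[0]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for t in range(1, len(Y)):
        alpha = (alpha @ A) * B[t]    # sum over predecessor states
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()          # rescale to avoid underflow
    return loglik

print(forward_loglik(np.random.randn(5, dim)))   # e.g. Y = y1 ... y5
```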

Language Model

What is the likelihood of seeing word sequence W?
- the prior probability of W, independent of the observed evidence x

[Diagram: Input Speech -> Front-end Processing -> P(x|W) P(W) -> Output Text ("Hello world"); the Language Model supplies the likelihood of word sequences, e.g. p(you | how are), p(today | are you), p(world | Hello)]

Language Modelling

- P(W) is the a-priori probability of observing word sequence W, independent of the observed signal x
- An n-gram language model estimates the probability of each word w_k given the preceding n-1 words; typically n = 3 or 4
- For a trigram model:

    P(W) = prod_k P(w_k | w_{k-1}, w_{k-2})

- Smoothing is required to account for word sequences not seen during training
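
A minimal trigram estimation sketch on a toy corpus; add-one smoothing is an illustrative assumption (real systems use e.g. Kneser-Ney or Katz back-off).

```python
# Minimal trigram LM sketch with add-one smoothing (toy corpus assumed).
from collections import Counter
import math

corpus = "how are you today how are you doing".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))   # counts of (w_{k-2}, w_{k-1}, w_k)
bigrams = Counter(zip(corpus, corpus[1:]))                # counts of (w_{k-2}, w_{k-1})
vocab = set(corpus)

def logp_word(w, w1, w2):
    """log P(w | w2, w1) with add-one smoothing; w1 = previous word, w2 = one before."""
    num = trigrams[(w2, w1, w)] + 1
    den = bigrams[(w2, w1)] + len(vocab)
    return math.log(num / den)

def logp_sequence(words):
    return sum(logp_word(words[k], words[k - 1], words[k - 2])
               for k in range(2, len(words)))

print(logp_sequence("how are you today".split()))
```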

Decoding with Classifiers

[Diagram: Speech -> Feature extraction -> speech features -> Decision (apply trained classifiers) -> hypotheses (phonemes), e.g. /h/ /e/ /l/ /o/ ... /w/ /o/ /r/ /l/ /d/]

Training Classifiers

[Diagram: aligned speech (e.g. /h/ /e/ /l/ /o/) -> Feature extraction -> speech features -> Train classifier -> improved classifiers]

Use all aligned speech features (e.g. of phoneme /e/) to train the reference vectors of /e/ (= codebook), for example with:
- k-means
- LVQ
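
A minimal k-means codebook training sketch; the codebook size, iteration count, and dummy /e/ frames are illustrative assumptions.

```python
# Minimal k-means codebook training sketch for one phoneme's aligned frames.
import numpy as np

def train_codebook(features, n_codewords=4, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), n_codewords, replace=False)]
    for _ in range(n_iters):
        # assign every frame to its nearest reference vector
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # re-estimate each reference vector as the mean of its assigned frames
        for k in range(n_codewords):
            if np.any(labels == k):
                codebook[k] = features[labels == k].mean(axis=0)
    return codebook

frames_of_e = np.random.randn(500, 39)      # all frames aligned to /e/ (dummy data)
codebook_e = train_codebook(frames_of_e)
print(codebook_e.shape)                     # (4, 39)
```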

Pattern Recognition and Classification
(for speech recognition)

Suggested reading:
- X. Huang, A. Acero, H. Hon, Spoken Language Processing, Chapter 4
- R. Duda, P. Hart, D. Stork, Pattern Classification, John Wiley & Sons, 2000 (2nd edition)

Pattern Recognition Approaches

[Taxonomy diagram: pattern recognition approaches split into knowledge-based / connectionist and statistical; these are further divided along the axes supervised vs. unsupervised, parametric vs. non-parametric, and linear vs. non-linear.]

Pattern Recognition Approaches

Knowledge-based approaches:
- Compile knowledge
- Build decision trees

Connectionist approaches:
- Automatic knowledge acquisition, "black-box" behavior
- Simulation of biological processes

Statistical approaches:
- Build a statistical model of the "real world"
- Compute probabilities according to the models

Classification Trees

Simple binary decision tree for height classification:
T = tall, t = medium-tall, M = medium, m = medium-short, S = short

Goal: predict the height class of a new person

- Use the decision tree to predict someone's height class by traversing the tree and answering the yes/no questions
- The choice and order of questions is designed subjectively (knowledge-based)
- Classification and Regression Trees (CART) provide an automatic, data-driven framework to construct the decision process

The CART Algorithm

Step 1: Create a set of questions Q that consists of all possible questions
Step 2: Pick a splitting criterion that can evaluate all possible questions
Step 3: Create a tree with one root node consisting of all training samples
Step 4: Find the best composite question for each terminal node
        (the goal is classification, so the objective is to reduce uncertainty; using
        entropy H, find the question which gives the greatest reduction in H)
        - generate a tree with several simple-question splits
        - cluster the leaf nodes into two classes according to the splitting criterion
        - construct a corresponding composite question
Step 5: Split: take the split with the best criterion from Step 4
Step 6: Stop criterion: go to Step 7 if all leaf nodes contain data samples
        from the same class or if the improvements of all potential splits
        fall below a defined threshold
Step 7: Prune the tree to the optimal size using an independent test-sample
        estimate or cross-validation to prevent the tree from over-modeling
        the training data (i.e. allow generalization)

A sketch of the entropy-based question selection used in Step 4 is shown below.
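
The sketch below evaluates simple yes/no threshold questions on one feature and picks the one with the greatest entropy reduction; the toy height data and candidate thresholds are illustrative assumptions.

```python
# Sketch of entropy-based question selection (CART Step 4); toy data assumed.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(x, labels, candidate_thresholds):
    """Pick the yes/no question 'x < t?' giving the greatest entropy reduction."""
    base = entropy(labels)
    best = None
    for t in candidate_thresholds:
        yes, no = labels[x < t], labels[x >= t]
        if len(yes) == 0 or len(no) == 0:
            continue
        h_after = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
        gain = base - h_after          # reduction of uncertainty H
        if best is None or gain > best[1]:
            best = (t, gain)
    return best

heights = np.array([150, 160, 165, 172, 180, 185, 190, 200])
classes = np.array(["S", "m", "m", "M", "t", "t", "T", "T"])
print(best_split(heights, classes, candidate_thresholds=heights))
```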

Neural Net Approaches

- Parallel supercomputers: renewed interest in neural networks
- Appealing to ASR: parallel evaluation of many clues and facts
- Most common approach: Multi-Layer Perceptron (MLP)
- NNs attempt real-time response and human-like performance
- Many simple processing elements operate in parallel
- Most common training procedure: error back-propagation
  - a generalization of the MMSE (Minimum Mean Squared Error) algorithm
  - gradient search: minimize the difference between actual and desired output
- An MLP approximates the a-posteriori probabilities P(Class | Pattern)
- Common problem: if an output of 0 0 0 ... 0 1 0 ... 0 0 0 is desired,
  the net tends to produce 0 0 0 ... 0 for all inputs

A minimal back-propagation sketch follows below.
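
The network size, learning rate, squared-error loss, and toy two-class task in this sketch are illustrative assumptions; it only demonstrates the gradient-search idea behind back-propagation.

```python
# Minimal MLP with back-propagation (gradient descent on squared error).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))                     # input patterns
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]    # desired output (XOR-like task)

W1 = rng.standard_normal((2, 8)) * 0.5                # input -> hidden weights
b1 = np.zeros(8)
W2 = rng.standard_normal((8, 1)) * 0.5                # hidden -> output weights
b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(2000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: gradients of the mean squared error
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out / len(X)
    b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X)
    b1 -= lr * d_h.mean(axis=0)

print("training error:", np.mean((out > 0.5) != y))
```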

Pattern Recognition

Types of classifiers:
- Supervised vs. unsupervised classifiers
- Parametric vs. non-parametric classifiers
- Linear vs. non-linear classifiers

Classical statistical methods:
- Bayes classifier
- k-nearest neighbor

Connectionist methods:
- Perceptron
- Multi-layer perceptrons

Supervised - Unsupervised

Supervised training:
- The class to be recognized is known for each sample in the training data.
- Requires a priori knowledge of useful features and knowledge/labeling of each training token (cost!).

Unsupervised training:
- The class is not known; the structure is to be discovered automatically.
- Feature-space reduction; examples: clustering, auto-associative nets

Unsupervised Classification

[Scatter plot of unlabeled samples in the feature space (F1, F2)]

Classification when the classes are not known: find structure in the data, i.e. clustering.
Open questions: How to cluster? How many clusters?

Supervised Classification

[Scatter plot of samples in the feature space (Income, Age), labeled credit-worthy vs. non-credit-worthy]

Classification when the classes are known: creditworthiness yes/no
Features: income, age

Classification Problem

[Scatter plot in the (Income, Age) feature space with credit-worthy / non-credit-worthy samples and a new point x = (x1, x2): "Is Joe credit-worthy?"]

- Features: age, income
- Classes: credit-worthy, non-credit-worthy
- Problem: Given Joe's income and age, should a loan be made?
- Other classification problems: fraud detection, customer selection, ...

Parametric - Non-parametric

[Histogram: number of good loans and bad loans as a function of income]

Parametric:
- Minimum error criterion
- Assume an underlying probability distribution; estimate the parameters of this distribution.
- Example: "Gaussian classifier"

Non-parametric:
- Do not assume a distribution.
- Estimate the probability of error or the error criterion directly from the training data.
- Examples: Parzen window, k-nearest neighbor, perceptron, ...

Bayes Decision Theory

Bayes rule plays a central role in statistical pattern recognition.

Decision making is based on:
1) prior knowledge of the categories - the prior probability P(ω_j)
AND
2) knowledge gained from observing the data x - the posterior probability P(ω_j | x)

Bayes rule:

  P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x),   where   p(x) = Σ_j p(x | ω_j) P(ω_j)

The class-conditional probability density function p(x | ω_j) is also referred to as
the likelihood function: how likely is it that x is generated by class ω_j.

Minimum-Error-Rate Decision Rule

  P(error | x) = P(ω_1 | x)  if we decide ω_2
                 P(ω_2 | x)  if we decide ω_1

The classification error is minimized if we:
- Decide ω_1 if P(ω_1 | x) > P(ω_2 | x); ω_2 otherwise
- Equivalently: decide ω_1 if p(x | ω_1) P(ω_1) > p(x | ω_2) P(ω_2); ω_2 otherwise

For the multi-category case:
- Decide ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i
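
A minimal sketch of this decision rule for two classes; the priors and the one-dimensional class-conditional Gaussian densities are illustrative assumptions.

```python
# Minimum-error-rate decision sketch for two classes with assumed priors
# and class-conditional densities.
from scipy.stats import norm

priors = {"w1": 0.6, "w2": 0.4}                     # P(omega_j)
likelihoods = {"w1": norm(loc=0.0, scale=1.0),      # p(x | omega_j)
               "w2": norm(loc=2.0, scale=1.5)}

def decide(x):
    """Decide omega_i maximizing p(x | omega_i) P(omega_i)."""
    scores = {w: likelihoods[w].pdf(x) * priors[w] for w in priors}
    return max(scores, key=scores.get)

for x in (-1.0, 0.8, 1.2, 3.0):
    print(x, "->", decide(x))
```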

Classification Error

Bayes decision rule:
place the decision boundary such that each x is assigned to the class ω_i with the
maximum value of p(x | ω_i) P(ω_i). The tail integral area P(error) then becomes minimal.

Classifier Discriminant Functions

A decision problem is a pattern classification problem where unknown data x are
classified into known categories (e.g. classify sounds into phonemes).

Define discriminant functions g_i(x), i = 1, ..., c, and assign x to class ω_i
if g_i(x) > g_j(x) for all j ≠ i.

Minimum-error-rate classifier:

  g_i(x) = P(ω_i | x) = p(x | ω_i) P(ω_i) / Σ_{j=1}^{c} p(x | ω_j) P(ω_j)

Since the denominator is independent of the class i, equivalent discriminant
functions are:

  g_i(x) = p(x | ω_i) P(ω_i)
  g_i(x) = log p(x | ω_i) + log P(ω_i)

where p(x | ω_i) is the class-conditional probability density function and
P(ω_i) the a-priori probability.

Classifier Design in Practice

- Need the a-priori probability P(ω_i) (not too bad)
- Need the class-conditional PDF p(x | ω_i) (problematic - see below)

Problems:
- limited training data
- limited computation
- class labeling is potentially costly and prone to error
- classes may not be known
- good features are not known

Parametric solution:
- Assume that p(x | ω_i) has a particular parametric form
- Most common representative: the multivariate normal density

[Figure: three binomial distributions with different success probabilities p; E(X) = np, Var(X) = np(1-p)]

[Figure: three Gaussian distributions with the same mean but different variances (sigma)]

- The Gaussian (normal) distribution is the most important probability distribution,
  since random variables in physical experiments (including speech signals) often have
  distributions that are approximately Gaussian.
- X has a Gaussian distribution with mean μ and variance σ² if X has a continuous pdf of the form:

    f(x) = (1 / (σ sqrt(2π))) exp(-(x - μ)² / (2σ²))

[Figure: three Poisson distributions with different λ; E(X) = Var(X) = λ]

Mixtures of Gaussian Densities

- Often the shape of the set of vectors that belong to one class does not look like
  what can be modeled by a single Gaussian.
- A (weighted) sum of Gaussians can approximate many more densities.
- In general, a class can be modeled as a mixture of Gaussians:

    p(x | ω_i) = Σ_k c_k N(x; μ_k, Σ_k),   with   Σ_k c_k = 1
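
A minimal sketch of evaluating a feature vector under such a mixture; the mixture weights and component parameters are illustrative assumptions.

```python
# Sketch: likelihood of a feature vector under a Gaussian mixture (toy parameters).
import numpy as np
from scipy.stats import multivariate_normal

weights = np.array([0.5, 0.3, 0.2])                 # c_k, sum to 1
means = [np.array([0.0, 0.0]),
         np.array([3.0, 1.0]),
         np.array([-2.0, 2.0])]
covs = [np.eye(2), 2.0 * np.eye(2), np.diag([0.5, 1.5])]

def gmm_likelihood(x):
    """p(x | class) = sum_k c_k N(x; mu_k, Sigma_k)"""
    return sum(c * multivariate_normal.pdf(x, m, S)
               for c, m, S in zip(weights, means, covs))

print(gmm_likelihood(np.array([0.5, 0.2])))
```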

Gaussian Densities

- The most frequently used models for (preprocessed) speech signals are Gaussian densities.
- Often the "size" of the parameter space is measured in the "number of densities".
- A multivariate Gaussian density has the form:

    N(x; μ, Σ) = (2π)^(-d/2) |Σ|^(-1/2) exp( -1/2 (x - μ)ᵀ Σ⁻¹ (x - μ) )

- Its parameters are:
  - the mean vector μ (a vector with d coefficients)
  - the covariance matrix Σ (a symmetric d x d matrix);
    if the components of x are independent, Σ is diagonal
  - |Σ| is the determinant of the covariance matrix

Gaussian Classifier

For each class ω_i, we need to estimate from training data:
- the mean vector μ_i
- the covariance matrix Σ_i

Estimation of Parameters

- MLE, Maximum Likelihood Estimation: find the set of parameters that maximizes
  the likelihood of generating the observed data.
- If p(x | ω) is assumed to be Gaussian, then the parameters are the mean and the
  covariance matrix, with ML estimates:

    μ = (1/n) Σ_{k=1}^{n} x_k

    Σ = (1/n) Σ_{k=1}^{n} (x_k - μ)(x_k - μ)ᵀ
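
A minimal sketch that estimates these parameters per class from labeled data and then classifies with the log-discriminant rule from above; the toy two-class data is an illustrative assumption.

```python
# Sketch: ML estimation of a Gaussian classifier from labeled training data.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
data = {"w1": rng.normal([0, 0], 1.0, size=(300, 2)),     # toy class samples
        "w2": rng.normal([3, 2], 1.5, size=(300, 2))}

params = {}
for w, X in data.items():
    mu = X.mean(axis=0)                      # mu = (1/n) sum_k x_k
    Sigma = (X - mu).T @ (X - mu) / len(X)   # Sigma = (1/n) sum_k (x_k - mu)(x_k - mu)^T
    prior = len(X) / sum(len(v) for v in data.values())
    params[w] = (mu, Sigma, prior)

def classify(x):
    """Minimum-error-rate rule with g_i(x) = log p(x|w_i) + log P(w_i)."""
    scores = {w: multivariate_normal.logpdf(x, mu, S) + np.log(p)
              for w, (mu, S, p) in params.items()}
    return max(scores, key=scores.get)

print(classify(np.array([0.5, 0.3])), classify(np.array([2.8, 2.1])))
```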

Problems of Classifier Design

Features:
- What and how many features should be selected?
- Will any features do?
- The more the better?
- If additional features are not useful (same mean and covariance), will the
  classifier automatically ignore them?

Curse of Dimensionality

Adding more features:
- Adding independent features may help
- BUT: adding indiscriminant features may lead to worse performance!

Reason:
- training data vs. number of parameters: limited training data

Solution:
- select features carefully
- reduce dimensionality, e.g. with Principal Component Analysis (a sketch follows below)
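
A minimal PCA sketch; the dummy data and the number of retained components are illustrative assumptions.

```python
# Sketch: dimensionality reduction with Principal Component Analysis.
import numpy as np

def pca_reduce(X, n_components):
    """Project X (n_samples x d) onto its n_components leading principal axes."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / len(X)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]                           # directions of largest variance
    return Xc @ W, W, mean

X = np.random.randn(500, 39)                        # e.g. 39-dim acoustic vectors
X_reduced, W, mean = pca_reduce(X, n_components=12)
print(X_reduced.shape)                              # (500, 12)
```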

Trainability

- Two-phoneme classification example (Huang et al.); the phonemes are modeled by Gaussian mixtures
- The parameters are trained with a varied set of training samples

Problems

[Figure: a density f(x) that is clearly non-Gaussian]

- The normal distribution does not model this situation well.
- Other densities may be mathematically intractable.
- => non-parametric techniques
