
Pattern Recognition and Classification
(for speech recognition)

11-751 Speech Recognition
09-15-2008

Today:
- Course information & quick review
- The components of a modern ASR system
- Pattern Recognition / Classification
  (will continue with this topic on Wednesday)

Course Grading

30%  Homework Assignments
     - 4 assignments over the course

40%  Exam [12-Dec]
     - In-class final exam at the end of the course
     - Closed book, covers the material presented in the course

30%  Speech term project
     - Proposal (1-pager)             [due: 08-Oct]
     - Oral presentation (15 min)     [start of Dec]
     - Written report (10 pages max)  [due: 15-Dec]
     - Demonstration (if applicable)
     - Your ideas and creativity for projects are highly welcome
     - Details and project ideas given on Wednesday

Instructors

interACT, 2F, 407 S. Craig St.
(campus map reference: Newell-Simon Hall, Doherty Hall)

RM 203: Alex Waibel              (ahw@cs.cmu.edu)
RM 221: Ian Lane                 (ianlane@cs.cmu.edu)
RM 209: Yik-Cheung (Wilson) Tam  (yct@cs.cmu.edu)

What we have looked at so far

- Why speech recognition?
- Speech production
  - How humans generate speech
  - Vocal tract model of speech
- Features used for speech recognition
  - Spectral representation of speech
  - LPC (Linear Predictive Coding)
  - MFCC (Mel Frequency Cepstral Coefficients)
- Dynamic time warping and template matching
  - Isolated word recognition

Vocal Tract Model of Speech

[Figure: source-filter block diagram. An impulse train generator driven by the pitch period models voiced excitation, a random noise generator models unvoiced excitation; the selected excitation, scaled by a gain A, is passed through the vocal tract filter V (controlled by the vocal tract parameters) and the radiation model R to produce the speech signal.]

Sloppy Speech

Actual input:  "I have been I have been getting into ..."

Conversational speech - recognition: "and I am I being too yeah"
Read speech           - recognition: "I have been ties than getting into the"

Feature Extraction

Speech waveform
  -> FFT (FFT-based spectrum)
  -> Mel-scale triangular filters
  -> Log
  -> DCT
  -> 39-element acoustic vector

- Acoustic vectors are computed every 10 ms
- Mel-scale filters mimic auditory processing
- The DCT decorrelates the signal to improve statistical independence
- First and second differentials are appended to capture the dynamic information of the signal
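
A minimal sketch of this front-end pipeline, assuming numpy/scipy; the frame parameters (25 ms window, 10 ms shift, 26 filters, 13 cepstra plus deltas) and the filterbank construction are illustrative assumptions, not the exact settings used in the course.

```python
# Minimal MFCC-style front end (illustrative sketch; parameters are assumptions).
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters equally spaced on the mel scale
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fbank

def mfcc(signal, sr=16000, frame_ms=25, shift_ms=10, n_filters=26, n_ceps=13):
    frame_len = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)            # acoustic vector every 10 ms
    n_fft = 512
    frames = [signal[i:i + frame_len] * np.hamming(frame_len)
              for i in range(0, len(signal) - frame_len, shift)]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2            # FFT-based spectrum
    fbank_energies = power @ mel_filterbank(n_filters, n_fft, sr).T
    log_energies = np.log(fbank_energies + 1e-10)              # Log
    ceps = dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]  # DCT
    delta = np.gradient(ceps, axis=0)                          # first differential
    delta2 = np.gradient(delta, axis=0)                        # second differential
    return np.hstack([ceps, delta, delta2])                    # 39-element vectors

features = mfcc(np.random.randn(16000))   # 1 s of dummy audio
print(features.shape)                      # (n_frames, 39)
```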

Template Matching and DTW

First idea to overcome the varying length of utterances (Problem 2):
1. Normalize their length
2. Make a linear alignment

Linear alignment can handle the problem of different speaking rates between utterances.
But: it cannot handle the problem of varying speaking rates within the same utterance.
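
Dynamic time warping solves exactly this: a non-linear alignment found by dynamic programming. Below is a minimal sketch; the Euclidean frame distance, the step pattern, and the toy templates are illustrative assumptions.

```python
# Minimal DTW sketch: non-linear alignment of two feature sequences.
import numpy as np

def dtw_distance(X, Y):
    """X: (n, d) and Y: (m, d) acoustic vector sequences; returns alignment cost."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(X[i - 1] - Y[j - 1])   # frame-to-frame distance
            # horizontal, vertical and diagonal steps handle speaking-rate
            # variation *within* an utterance
            D[i, j] = local + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)   # length-normalized path cost

# Isolated word recognition: pick the reference template with minimal DTW cost
templates = {"hello": np.random.randn(80, 39), "world": np.random.randn(70, 39)}
test = np.random.randn(75, 39)
print(min(templates, key=lambda w: dtw_distance(test, templates[w])))
```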

Components of a Modern ASR System

Suggested reading:
S. Young, Large vocabulary continuous speech recognition: A review

ASR - the big picture

[Diagram: Input Speech -> ??? -> Output Text ("Hello world")]

ASR - the big picture

The purpose of signal preprocessing is:
1) Signal digitization (quantization and sampling):
   represent an analog signal in an appropriate form to be processed by the computer
2) Digital signal preprocessing (feature extraction):
   extract features that are suitable for the recognition process

[Diagram: Input Speech -> Front-end Processing -> ??? -> Output Text ("Hello world")]

Fundamental Equation

For an observed feature vector sequence x, find the most likely word sequence W:

  W^ = argmax_W P(W | x) = argmax_W [ P(W) P(x | W) / P(x) ] = argmax_W P(W) P(x | W)

Since P(x) does not depend on W, it can be dropped from the maximization.

[Diagram: Input Speech -> Front-end Processing -> ??? -> Output Text ("Hello world")]

Speech Recognition Decoding

For an observed feature vector sequence x, find the most likely word sequence W.
Search: how to efficiently find the maximizing W

  W^ = argmax_W P(W | x) = argmax_W [ P(W) P(x | W) / P(x) ] = argmax_W P(W) P(x | W)

[Diagram: Input Speech -> Front-end Processing -> decoder computing P(x|W) P(W) -> Output Text ("Hello world"); the Acoustic Model supplies P(x|W) and the Language Model supplies P(W)]

Acoustic Model

Given W, what is the likelihood of seeing the feature vector(s) x?

We need a representation for W in terms of feature vectors.
Usually a two-part representation:
- pronunciation dictionary: describes W as a concatenation of phones
- phone models that explain phones in terms of feature vectors

[Diagram: Input Speech -> Front-end Processing -> P(x|W) P(W) -> Output Text ("Hello world"); the Acoustic Model (phones) is combined with the Pronunciation Dictionary, which maps words to phone sequences, e.g. "I" -> /i/, "you" -> /j/ /u/, "we" -> /v/ /e/]

Why break words down into phones?

Problems with whole-word reference templates:
- Need a collection of reference patterns for each word
- High computational effort (especially for large vocabularies), proportional to vocabulary size
- Large vocabulary also means: a huge amount of training data is needed
- Difficult to train suitable references (or sets of references)
- Impossible to recognize untrained words
- Poor performance when the environment changes
- Works well only for speaker-dependent recognition (variations)
- Unsuitable where the speaker is unknown and no training is feasible
- Unsuitable for continuous speech (combinatorial explosion)
- Difficult to train/recognize subword units

Consequences:
- Replace whole words by suitable sub-word units
- Replace the template approach by a better modeling process

Speech Production as a Stochastic Process

- The same word / phoneme sounds different every time it is uttered
- Regard words / phonemes as states of a speech production process
- In a given state we can observe different acoustic sounds
  - Not all sounds are possible / likely in every state
  - We say: in a given state the speech process "emits" sounds according to some probability distribution
- The production process makes transitions from one state to another
  - Not all transitions are possible; they have different probabilities
- When we specify the probabilities for sound emissions (emission probabilities) and for the state transitions, we call this a model.

HMM Acoustic Modelling

[Figure: a left-to-right HMM with self-loop transition probabilities a22, a33, a44, forward transitions a12, a23, a34, a45, and output distributions b2, b3, b4 emitting the acoustic vector sequence Y = y1 y2 y3 y4 y5]

- A Hidden Markov Model is used for each phone or senone (context-dependent model)
- Transition probabilities a_ij model durational variability in speech
- Output distributions b_i(y_k) model spectral variability
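
To see how such a model scores an acoustic vector sequence, here is a minimal sketch of the forward algorithm computing P(Y | model); the toy transition matrix and unit-covariance Gaussian output densities are assumptions for illustration, not trained senone models.

```python
# Forward algorithm sketch for a small left-to-right HMM (toy parameters assumed).
import numpy as np
from scipy.stats import multivariate_normal

n_states, dim = 3, 2
A = np.array([[0.6, 0.4, 0.0],        # a_ij: durational variability
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])        # always start in the first state
means = np.random.randn(n_states, dim)
covs = [np.eye(dim)] * n_states       # b_i(y): spectral variability

def forward_loglik(Y):
    """Return log P(Y | HMM) via the scaled forward recursion."""
    B = np.array([[multivariate_normal.pdf(y, means[i], covs[i])
                   for i in range(n_states)] for y in Y])   # emission likelihoods
    alpha = pi * B[0]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for t in range(1, len(Y)):
        alpha = (alpha @ A) * B[t]    # sum over predecessor states
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()          # rescale to avoid underflow
    return loglik

print(forward_loglik(np.random.randn(5, dim)))   # e.g. Y = y1 ... y5
```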

Language Model

What is the likelihood of seeing word sequence W?
- the prior probability of W, independent of the observed evidence x

[Diagram: Input Speech -> Front-end Processing -> P(x|W) P(W) -> Output Text ("Hello world"); the Language Model supplies the likelihood of word sequences, e.g. p(you | how are), p(today | are you), p(world | Hello)]

Language Modelling

- P(W) is the a-priori probability of observing word sequence W, independent of the observed signal x
- An n-gram language model estimates the probability of each word w_k given the preceding n-1 words; typically n = 3 or 4
- For a trigram model:

    P(W) = prod_k P(w_k | w_{k-1}, w_{k-2})

- Smoothing is required to account for word sequences not seen during training
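
A minimal trigram estimation sketch on a toy corpus; add-one smoothing is an illustrative assumption (real systems use e.g. Kneser-Ney or Katz back-off).

```python
# Minimal trigram LM sketch with add-one smoothing (toy corpus assumed).
from collections import Counter
import math

corpus = "how are you today how are you doing".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))   # counts of (w_{k-2}, w_{k-1}, w_k)
bigrams = Counter(zip(corpus, corpus[1:]))                # counts of (w_{k-2}, w_{k-1})
vocab = set(corpus)

def logp_word(w, w1, w2):
    """log P(w | w2, w1) with add-one smoothing; w1 = previous word, w2 = one before."""
    num = trigrams[(w2, w1, w)] + 1
    den = bigrams[(w2, w1)] + len(vocab)
    return math.log(num / den)

def logp_sequence(words):
    return sum(logp_word(words[k], words[k - 1], words[k - 2])
               for k in range(2, len(words)))

print(logp_sequence("how are you today".split()))
```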

Decoding with Classifiers

[Diagram: Speech -> Feature extraction -> speech features -> Decision (apply trained classifiers) -> hypotheses (phonemes), e.g. /h/ /e/ /l/ /o/ ... /w/ /o/ /r/ /l/ /d/]

Training Classifiers

[Diagram: aligned speech (e.g. /h/ /e/ /l/ /o/) -> Feature extraction -> speech features -> Train classifier -> improved classifiers]

Use all aligned speech features (e.g. of phoneme /e/) to train the reference vectors of /e/ (= codebook), for example with:
- k-means
- LVQ
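
A minimal k-means codebook training sketch; the codebook size, iteration count, and dummy /e/ frames are illustrative assumptions.

```python
# Minimal k-means codebook training sketch for one phoneme's aligned frames.
import numpy as np

def train_codebook(features, n_codewords=4, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), n_codewords, replace=False)]
    for _ in range(n_iters):
        # assign every frame to its nearest reference vector
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # re-estimate each reference vector as the mean of its assigned frames
        for k in range(n_codewords):
            if np.any(labels == k):
                codebook[k] = features[labels == k].mean(axis=0)
    return codebook

frames_of_e = np.random.randn(500, 39)      # all frames aligned to /e/ (dummy data)
codebook_e = train_codebook(frames_of_e)
print(codebook_e.shape)                     # (4, 39)
```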

Pattern Recognition and Classification
(for speech recognition)

Suggested reading:
- X. Huang, A. Acero, H. Hon, Spoken Language Processing, Chapter 4
- R. Duda, P. Hart, D. Stork, Pattern Classification, John Wiley & Sons, 2000 (2nd edition)

Pattern Recognition Approaches

[Taxonomy diagram: pattern recognition approaches split into knowledge-based / connectionist and statistical; these are further divided along the axes supervised vs. unsupervised, parametric vs. non-parametric, and linear vs. non-linear.]

Pattern Recognition Approaches

Knowledge-based approaches:
- Compile knowledge
- Build decision trees

Connectionist approaches:
- Automatic knowledge acquisition, "black-box" behavior
- Simulation of biological processes

Statistical approaches:
- Build a statistical model of the "real world"
- Compute probabilities according to the models

Classification Trees

Simple binary decision tree for height classification:
T = tall, t = medium-tall, M = medium, m = medium-short, S = short

Goal: predict the height class of a new person

- Use the decision tree to predict someone's height class by traversing the tree and answering the yes/no questions
- The choice and order of questions is designed subjectively (knowledge-based)
- Classification and Regression Trees (CART) provide an automatic, data-driven framework to construct the decision process

The CART Algorithm

Step 1: Create a set of questions Q that consists of all possible questions
Step 2: Pick a splitting criterion that can evaluate all possible questions
Step 3: Create a tree with one root node consisting of all training samples
Step 4: Find the best composite question for each terminal node
        (the goal is classification, so the objective is to reduce uncertainty; using
        entropy H, find the question which gives the greatest reduction in H)
        - generate a tree with several simple-question splits
        - cluster the leaf nodes into two classes according to the splitting criterion
        - construct a corresponding composite question
Step 5: Split: take the split with the best criterion from Step 4
Step 6: Stop criterion: go to Step 7 if all leaf nodes contain data samples
        from the same class or if the improvements of all potential splits
        fall below a defined threshold
Step 7: Prune the tree to the optimal size using an independent test-sample
        estimate or cross-validation to prevent the tree from over-modeling
        the training data (i.e. allow generalization)

A sketch of the entropy-based question selection used in Step 4 is shown below.
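
The sketch below evaluates simple yes/no threshold questions on one feature and picks the one with the greatest entropy reduction; the toy height data and candidate thresholds are illustrative assumptions.

```python
# Sketch of entropy-based question selection (CART Step 4); toy data assumed.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(x, labels, candidate_thresholds):
    """Pick the yes/no question 'x < t?' giving the greatest entropy reduction."""
    base = entropy(labels)
    best = None
    for t in candidate_thresholds:
        yes, no = labels[x < t], labels[x >= t]
        if len(yes) == 0 or len(no) == 0:
            continue
        h_after = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
        gain = base - h_after          # reduction of uncertainty H
        if best is None or gain > best[1]:
            best = (t, gain)
    return best

heights = np.array([150, 160, 165, 172, 180, 185, 190, 200])
classes = np.array(["S", "m", "m", "M", "t", "t", "T", "T"])
print(best_split(heights, classes, candidate_thresholds=heights))
```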

Neural Net Approaches

- Parallel supercomputers: renewed interest in neural networks
- Appealing to ASR: parallel evaluation of many clues and facts
- Most common approach: Multi-Layer Perceptron (MLP)
- NNs attempt real-time response and human-like performance
- Many simple processing elements operate in parallel
- Most common training procedure: error back-propagation
  - a generalization of the MMSE (Minimum Mean Squared Error) algorithm
  - gradient search: minimize the difference between actual and desired output
- An MLP approximates the a-posteriori probabilities P(Class | Pattern)
- Common problem: if an output of 0 0 0 ... 0 1 0 ... 0 0 0 is desired,
  the net tends to produce 0 0 0 ... 0 for all inputs

A minimal back-propagation sketch follows below.
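
The network size, learning rate, squared-error loss, and toy two-class task in this sketch are illustrative assumptions; it only demonstrates the gradient-search idea behind back-propagation.

```python
# Minimal MLP with back-propagation (gradient descent on squared error).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))                     # input patterns
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]    # desired output (XOR-like task)

W1 = rng.standard_normal((2, 8)) * 0.5                # input -> hidden weights
b1 = np.zeros(8)
W2 = rng.standard_normal((8, 1)) * 0.5                # hidden -> output weights
b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(2000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: gradients of the mean squared error
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out / len(X)
    b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X)
    b1 -= lr * d_h.mean(axis=0)

print("training error:", np.mean((out > 0.5) != y))
```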

Pattern Recognition

Types of classifiers:
- Supervised vs. unsupervised classifiers
- Parametric vs. non-parametric classifiers
- Linear vs. non-linear classifiers

Classical statistical methods:
- Bayes classifier
- k-nearest neighbor

Connectionist methods:
- Perceptron
- Multi-layer perceptrons

Supervised - Unsupervised

Supervised training:
- The class to be recognized is known for each sample in the training data.
- Requires a priori knowledge of useful features and knowledge/labeling of each training token (cost!).

Unsupervised training:
- The class is not known; the structure is to be discovered automatically.
- Feature-space reduction; examples: clustering, auto-associative nets

Unsupervised Classification

[Scatter plot of unlabeled samples in the feature space (F1, F2)]

Classification when the classes are not known: find structure in the data, i.e. clustering.
Open questions: How to cluster? How many clusters?

Supervised Classification

[Scatter plot of samples in the feature space (Income, Age), labeled credit-worthy vs. non-credit-worthy]

Classification when the classes are known: creditworthiness yes/no
Features: income, age

Classification Problem

[Scatter plot in the (Income, Age) feature space with credit-worthy / non-credit-worthy samples and a new point x = (x1, x2): "Is Joe credit-worthy?"]

- Features: age, income
- Classes: credit-worthy, non-credit-worthy
- Problem: Given Joe's income and age, should a loan be made?
- Other classification problems: fraud detection, customer selection, ...

Parametric - Non-parametric

[Histogram: number of good loans and bad loans as a function of income]

Parametric:
- Minimum error criterion
- Assume an underlying probability distribution; estimate the parameters of this distribution.
- Example: "Gaussian classifier"

Non-parametric:
- Do not assume a distribution.
- Estimate the probability of error or the error criterion directly from the training data.
- Examples: Parzen window, k-nearest neighbor, perceptron, ...

Bayes Decision Theory

Bayes rule plays a central role in statistical pattern recognition.

Decision making is based on:
1) prior knowledge of the categories - the prior probability P(ω_j)
AND
2) knowledge gained from observing the data x - the posterior probability P(ω_j | x)

Bayes rule:

  P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x),   where   p(x) = Σ_j p(x | ω_j) P(ω_j)

The class-conditional probability density function p(x | ω_j) is also referred to as
the likelihood function: how likely is it that x is generated by class ω_j.

Minimum-Error-Rate Decision Rule

  P(error | x) = P(ω_1 | x)  if we decide ω_2
                 P(ω_2 | x)  if we decide ω_1

The classification error is minimized if we:
- Decide ω_1 if P(ω_1 | x) > P(ω_2 | x); ω_2 otherwise
- Equivalently: decide ω_1 if p(x | ω_1) P(ω_1) > p(x | ω_2) P(ω_2); ω_2 otherwise

For the multi-category case:
- Decide ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i
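
A minimal sketch of this decision rule for two classes; the priors and the one-dimensional class-conditional Gaussian densities are illustrative assumptions.

```python
# Minimum-error-rate decision sketch for two classes with assumed priors
# and class-conditional densities.
from scipy.stats import norm

priors = {"w1": 0.6, "w2": 0.4}                     # P(omega_j)
likelihoods = {"w1": norm(loc=0.0, scale=1.0),      # p(x | omega_j)
               "w2": norm(loc=2.0, scale=1.5)}

def decide(x):
    """Decide omega_i maximizing p(x | omega_i) P(omega_i)."""
    scores = {w: likelihoods[w].pdf(x) * priors[w] for w in priors}
    return max(scores, key=scores.get)

for x in (-1.0, 0.8, 1.2, 3.0):
    print(x, "->", decide(x))
```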

Classification Error

Bayes decision rule:
place the decision boundary such that each x is assigned to the class ω_i with the
maximum value of p(x | ω_i) P(ω_i). The tail integral area P(error) then becomes minimal.

Classifier Discriminant Functions

A decision problem is a pattern classification problem where unknown data x are
classified into known categories (e.g. classify sounds into phonemes).

Define discriminant functions g_i(x), i = 1, ..., c, and assign x to class ω_i
if g_i(x) > g_j(x) for all j ≠ i.

Minimum-error-rate classifier:

  g_i(x) = P(ω_i | x) = p(x | ω_i) P(ω_i) / Σ_{j=1}^{c} p(x | ω_j) P(ω_j)

Since the denominator is independent of the class i, equivalent discriminant
functions are:

  g_i(x) = p(x | ω_i) P(ω_i)
  g_i(x) = log p(x | ω_i) + log P(ω_i)

where p(x | ω_i) is the class-conditional probability density function and
P(ω_i) the a-priori probability.

Classifier Design in Practice

- Need the a-priori probability P(ω_i) (not too bad)
- Need the class-conditional PDF p(x | ω_i) (problematic - see below)

Problems:
- limited training data
- limited computation
- class labeling is potentially costly and prone to error
- classes may not be known
- good features are not known

Parametric solution:
- Assume that p(x | ω_i) has a particular parametric form
- Most common representative: the multivariate normal density

[Figure: three binomial distributions with different success probabilities p; E(X) = np, Var(X) = np(1-p)]

[Figure: three Gaussian distributions with the same mean but different variances (sigma)]

- The Gaussian (normal) distribution is the most important probability distribution,
  since random variables in physical experiments (including speech signals) often have
  distributions that are approximately Gaussian.
- X has a Gaussian distribution with mean μ and variance σ² if X has a continuous pdf of the form:

    f(x) = (1 / (σ sqrt(2π))) exp(-(x - μ)² / (2σ²))

[Figure: three Poisson distributions with different λ; E(X) = Var(X) = λ]

Mixtures of Gaussian Densities

- Often the shape of the set of vectors that belong to one class does not look like
  what can be modeled by a single Gaussian.
- A (weighted) sum of Gaussians can approximate many more densities.
- In general, a class can be modeled as a mixture of Gaussians:

    p(x | ω_i) = Σ_k c_k N(x; μ_k, Σ_k),   with   Σ_k c_k = 1
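
A minimal sketch of evaluating a feature vector under such a mixture; the mixture weights and component parameters are illustrative assumptions.

```python
# Sketch: likelihood of a feature vector under a Gaussian mixture (toy parameters).
import numpy as np
from scipy.stats import multivariate_normal

weights = np.array([0.5, 0.3, 0.2])                 # c_k, sum to 1
means = [np.array([0.0, 0.0]),
         np.array([3.0, 1.0]),
         np.array([-2.0, 2.0])]
covs = [np.eye(2), 2.0 * np.eye(2), np.diag([0.5, 1.5])]

def gmm_likelihood(x):
    """p(x | class) = sum_k c_k N(x; mu_k, Sigma_k)"""
    return sum(c * multivariate_normal.pdf(x, m, S)
               for c, m, S in zip(weights, means, covs))

print(gmm_likelihood(np.array([0.5, 0.2])))
```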

Gaussian Densities

- The most frequently used models for (preprocessed) speech signals are Gaussian densities.
- Often the "size" of the parameter space is measured in the "number of densities".
- A multivariate Gaussian density has the form:

    N(x; μ, Σ) = (2π)^(-d/2) |Σ|^(-1/2) exp( -1/2 (x - μ)ᵀ Σ⁻¹ (x - μ) )

- Its parameters are:
  - the mean vector μ (a vector with d coefficients)
  - the covariance matrix Σ (a symmetric d x d matrix);
    if the components of x are independent, Σ is diagonal
  - |Σ| is the determinant of the covariance matrix

Gaussian Classifier

For each class ω_i, we need to estimate from training data:
- the mean vector μ_i
- the covariance matrix Σ_i

Estimation of Parameters

- MLE, Maximum Likelihood Estimation: find the set of parameters that maximizes
  the likelihood of generating the observed data.
- If p(x | ω) is assumed to be Gaussian, then the parameters are the mean and the
  covariance matrix, with ML estimates:

    μ = (1/n) Σ_{k=1}^{n} x_k

    Σ = (1/n) Σ_{k=1}^{n} (x_k - μ)(x_k - μ)ᵀ
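
A minimal sketch that estimates these parameters per class from labeled data and then classifies with the log-discriminant rule from above; the toy two-class data is an illustrative assumption.

```python
# Sketch: ML estimation of a Gaussian classifier from labeled training data.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
data = {"w1": rng.normal([0, 0], 1.0, size=(300, 2)),     # toy class samples
        "w2": rng.normal([3, 2], 1.5, size=(300, 2))}

params = {}
for w, X in data.items():
    mu = X.mean(axis=0)                      # mu = (1/n) sum_k x_k
    Sigma = (X - mu).T @ (X - mu) / len(X)   # Sigma = (1/n) sum_k (x_k - mu)(x_k - mu)^T
    prior = len(X) / sum(len(v) for v in data.values())
    params[w] = (mu, Sigma, prior)

def classify(x):
    """Minimum-error-rate rule with g_i(x) = log p(x|w_i) + log P(w_i)."""
    scores = {w: multivariate_normal.logpdf(x, mu, S) + np.log(p)
              for w, (mu, S, p) in params.items()}
    return max(scores, key=scores.get)

print(classify(np.array([0.5, 0.3])), classify(np.array([2.8, 2.1])))
```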

Problems of Classifier Design

Features:
- What and how many features should be selected?
- Will any features do?
- The more the better?
- If additional features are not useful (same mean and covariance), will the
  classifier automatically ignore them?

Curse of Dimensionality

Adding more features:
- Adding independent features may help
- BUT: adding indiscriminant features may lead to worse performance!

Reason:
- training data vs. number of parameters: limited training data

Solution:
- select features carefully
- reduce dimensionality, e.g. with Principal Component Analysis (a sketch follows below)
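
A minimal PCA sketch; the dummy data and the number of retained components are illustrative assumptions.

```python
# Sketch: dimensionality reduction with Principal Component Analysis.
import numpy as np

def pca_reduce(X, n_components):
    """Project X (n_samples x d) onto its n_components leading principal axes."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / len(X)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]                           # directions of largest variance
    return Xc @ W, W, mean

X = np.random.randn(500, 39)                        # e.g. 39-dim acoustic vectors
X_reduced, W, mean = pca_reduce(X, n_components=12)
print(X_reduced.shape)                              # (500, 12)
```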

Trainability

- Two-phoneme classification example (Huang et al.); the phonemes are modeled by Gaussian mixtures
- The parameters are trained with a varied set of training samples

Problems

[Figure: a density f(x) that is clearly non-Gaussian]

- The normal distribution does not model this situation well.
- Other densities may be mathematically intractable.
- => non-parametric techniques
