
Lecture # 7 Session 2003

Pattern Classification

Introduction
Parametric classifiers
Semi-parametric classifiers
Dimensionality reduction
Significance testing


Pattern Classification
Goal: To classify objects (or patterns) into categories (or classes)

Observation $s$ → Feature Extraction → Feature Vector $x$ → Classifier → Class $\omega_i$

Types of Problems:

1. Supervised: Classes are known beforehand, and data samples of each class are available
2. Unsupervised: Classes (and/or number of classes) are not known beforehand, and must be inferred from the data


Probability Basics
Discrete probability mass function (PMF): $P(\omega_i)$, with $\sum_i P(\omega_i) = 1$

Continuous probability density function (PDF): $p(x)$, with $\int p(x)\,dx = 1$

Expected value: $E(x) = \int x\,p(x)\,dx$


Kullback-Leibler Distance

Can be used to compute a distance between two probability mass distributions, $P(z_i)$ and $Q(z_i)$:

$$D(P \| Q) = \sum_i P(z_i) \log \frac{P(z_i)}{Q(z_i)} \ge 0$$

Makes use of the inequality $\log x \le x - 1$:

$$\sum_i P(z_i) \log \frac{Q(z_i)}{P(z_i)} \le \sum_i P(z_i) \left(\frac{Q(z_i)}{P(z_i)} - 1\right) = \sum_i \big(Q(z_i) - P(z_i)\big) = 0$$

Known as relative entropy in information theory
The divergence of $P(z_i)$ and $Q(z_i)$ is the symmetric sum $D(P \| Q) + D(Q \| P)$
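As an illustrative sketch (not from the lecture), both quantities are a few lines of NumPy; the example PMFs below are made up:

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy D(P || Q); assumes Q > 0 wherever P > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(z_i) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def symmetric_divergence(p, q):
    """Symmetric sum D(P || Q) + D(Q || P)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q))         # >= 0, and 0 only when P == Q
print(symmetric_divergence(p, q))
```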


Bayes Theorem

[Figure: overlapping class-conditional PDFs $p(x|\omega_1)$ and $p(x|\omega_2)$]

Define:

$\{\omega_i\}$ : a set of $M$ mutually exclusive classes
$P(\omega_i)$ : a priori probability for class $\omega_i$
$p(x|\omega_i)$ : PDF for feature vector $x$ in class $\omega_i$
$P(\omega_i|x)$ : a posteriori probability of $\omega_i$ given $x$

From Bayes Rule:

$$P(\omega_i|x) = \frac{p(x|\omega_i)\,P(\omega_i)}{p(x)}$$

where

$$p(x) = \sum_{i=1}^{M} p(x|\omega_i)\,P(\omega_i)$$
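As a small numerical sketch (the Gaussian class-conditional densities and priors are invented for illustration), the posteriors are the joint terms $p(x|\omega_i)P(\omega_i)$ normalized by their sum:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.7, 0.3])            # P(w_1), P(w_2), made-up values
x = 1.2
lik = np.array([norm.pdf(x, 0.0, 1.0),   # p(x|w_1) ~ N(0, 1)
                norm.pdf(x, 2.0, 1.0)])  # p(x|w_2) ~ N(2, 1)

joint = lik * priors                   # p(x|w_i) P(w_i)
posterior = joint / joint.sum()        # P(w_i|x); the denominator is p(x)
print(posterior, "-> class", posterior.argmax() + 1)
```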


Bayes Decision Theory


The probability of making an error given $x$ is:

$$P(\text{error}|x) = 1 - P(\omega_i|x) \quad \text{if we decide class } \omega_i$$

To minimize $P(\text{error}|x)$ (and $P(\text{error})$):

Choose $\omega_i$ if $P(\omega_i|x) > P(\omega_j|x) \;\; \forall j \ne i$

For a two-class problem this decision rule means:

Choose $\omega_1$ if $\dfrac{p(x|\omega_1)\,P(\omega_1)}{p(x)} > \dfrac{p(x|\omega_2)\,P(\omega_2)}{p(x)}$; else choose $\omega_2$

This rule can be expressed as a likelihood ratio:

Choose $\omega_1$ if $\dfrac{p(x|\omega_1)}{p(x|\omega_2)} > \dfrac{P(\omega_2)}{P(\omega_1)}$; else choose $\omega_2$
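The same decision in likelihood-ratio form, reusing the invented two-class model from the sketch above:

```python
from scipy.stats import norm

p1, p2 = 0.7, 0.3                       # priors P(w_1), P(w_2)
x = 1.2
lik1 = norm.pdf(x, 0.0, 1.0)            # p(x|w_1)
lik2 = norm.pdf(x, 2.0, 1.0)            # p(x|w_2)

ratio = lik1 / lik2                     # likelihood ratio
threshold = p2 / p1                     # prior ratio
decision = 1 if ratio > threshold else 2
print(f"ratio {ratio:.3f} vs threshold {threshold:.3f} -> choose class {decision}")
```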


Bayes Risk
Define cost function $\lambda_{ij}$ and conditional risk $R(\omega_i|x)$:

$\lambda_{ij}$ is the cost of classifying $x$ as $\omega_i$ when it is really $\omega_j$
$R(\omega_i|x)$ is the risk for classifying $x$ as class $\omega_i$:

$$R(\omega_i|x) = \sum_{j=1}^{M} \lambda_{ij}\,P(\omega_j|x)$$

Bayes risk is the minimum risk which can be achieved:

Choose $\omega_i$ if $R(\omega_i|x) < R(\omega_j|x) \;\; \forall j \ne i$

Bayes risk corresponds to minimum $P(\text{error}|x)$ when:

All errors have equal cost ($\lambda_{ij} = 1$, $i \ne j$)
There is no cost for being correct ($\lambda_{ii} = 0$)

$$R(\omega_i|x) = \sum_{j \ne i} P(\omega_j|x) = 1 - P(\omega_i|x)$$
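A conditional-risk calculation under an assumed cost matrix might look like this sketch (costs and posteriors are illustrative only):

```python
import numpy as np

# Hypothetical costs: cost[i, j] = lambda_ij, deciding class i when truth is j
cost = np.array([[0.0, 1.0, 4.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])
posterior = np.array([0.5, 0.3, 0.2])   # P(w_j|x), made-up values

risk = cost @ posterior                 # R(w_i|x) = sum_j lambda_ij P(w_j|x)
print(risk, "-> choose class", risk.argmin() + 1)
# With 0-1 costs this reduces to picking the class with maximum posterior
```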

Discriminant Functions
Alternative formulation of Bayes decision rule

Define a discriminant function, $g_i(x)$, for each class $\omega_i$:

Choose $\omega_i$ if $g_i(x) > g_j(x) \;\; \forall j \ne i$

Functions yielding identical classification results:

$$g_i(x) = P(\omega_i|x)$$
$$g_i(x) = p(x|\omega_i)\,P(\omega_i)$$
$$g_i(x) = \log p(x|\omega_i) + \log P(\omega_i)$$

Choice of function impacts computation costs

Discriminant functions partition feature space into decision regions, separated by decision boundaries


Density Estimation
Used to estimate the underlying PDF $p(x|\omega_i)$

Parametric methods:
Assume a specific functional form for the PDF
Optimize PDF parameters to fit the data

Non-parametric methods:
Determine the form of the PDF from the data
Grow parameter set size with the amount of data

Semi-parametric methods:
Use a general class of functional forms for the PDF
Can vary the parameter set independently of the data
Use unsupervised methods to estimate parameters

Parametric Classifiers

Gaussian distributions
Maximum likelihood (ML) parameter estimation
Multivariate Gaussians
Gaussian classifiers


Gaussian Distributions
Gaussian PDFs are reasonable when a feature vector can be viewed as a perturbation around a reference

[Figure: one-dimensional Gaussian probability density curve]

Simple estimation procedures for model parameters
Classification often reduced to simple distance metrics
Gaussian distributions are also called Normal distributions

Gaussian Distributions: One Dimension


One-dimensional Gaussian PDFs can be expressed as:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \sim N(\mu, \sigma^2)$$

The PDF is centered around the mean:

$$\mu = E(x) = \int x\,p(x)\,dx$$

The spread of the PDF is determined by the variance:

$$\sigma^2 = E((x-\mu)^2) = \int (x-\mu)^2\,p(x)\,dx$$
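A direct NumPy transcription of this density (an illustrative sketch, not course code):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """One-dimensional Gaussian density N(mu, sigma^2) evaluated at x."""
    coeff = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The density integrates to ~1 over a wide enough interval
xs = np.linspace(-10.0, 10.0, 100001)
print(np.trapz(gaussian_pdf(xs, mu=0.0, sigma=1.0), xs))  # ~1.0
```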


Maximum Likelihood Parameter Estimation


Maximum likelihood parameter estimation determines an estimate $\hat\theta$ for parameter $\theta$ by maximizing the likelihood $L(\theta)$ of observing data $X = \{x_1, \ldots, x_n\}$:

$$\hat\theta = \arg\max_\theta L(\theta)$$

Assuming independent, identically distributed data:

$$L(\theta) = p(X|\theta) = p(x_1, \ldots, x_n|\theta) = \prod_{i=1}^{n} p(x_i|\theta)$$

ML solutions can often be obtained via the derivative:

$$\frac{\partial}{\partial\theta} L(\theta) = 0$$

For Gaussian distributions, $\log L(\theta)$ is easier to solve

Gaussian ML Estimation: One Dimension


The maximum likelihood estimate for $\mu$ is given by:

$$L(\mu) = \prod_{i=1}^{n} p(x_i|\mu) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\, e^{-\sum_i \frac{(x_i-\mu)^2}{2\sigma^2}}$$

$$\log L(\mu) = -\frac{1}{2\sigma^2} \sum_i (x_i-\mu)^2 - n \log \sqrt{2\pi}\,\sigma$$

$$\frac{\partial}{\partial\mu} \log L(\mu) = \frac{1}{\sigma^2} \sum_i (x_i-\mu) = 0 \implies \hat\mu = \frac{1}{n} \sum_i x_i$$

The maximum likelihood estimate for $\sigma^2$ is given by:

$$\hat\sigma^2 = \frac{1}{n} \sum_i (x_i - \hat\mu)^2$$
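These closed-form estimates are one line each in NumPy; the sketch below checks them on synthetic data, with values chosen to echo the duration example on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.120, scale=0.040, size=1000)  # synthetic durations (sec)

mu_hat = x.mean()                        # ML estimate of the mean
var_hat = ((x - mu_hat) ** 2).mean()     # ML estimate of the variance (1/n)
print(mu_hat, np.sqrt(var_hat))          # ~0.120, ~0.040
```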

Gaussian ML Estimation: One Dimension


[Figure: histogram of [s] duration (1000 utterances, 100 speakers) overlaid with the ML-fitted Gaussian PDF, $\hat\mu \approx 120$ ms, $\hat\sigma \approx 40$ ms; x-axis: Duration (sec), y-axis: Probability Density]

ML Estimation: Alternative Distributions


[Figure: [s] duration histogram with an ML-fitted Gamma distribution; x-axis: Duration (sec), y-axis: Probability Density]

ML Estimation: Alternative Distributions


[Figure: [s] log duration histogram with an ML-fitted Normal distribution; x-axis: log Duration (sec), y-axis: Probability Density]

Gaussian Distributions: Multiple Dimensions


A multi-dimensional Gaussian PDF can be expressed as:

$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^t \Sigma^{-1} (x-\mu)} \sim N(\mu, \Sigma)$$

$d$ is the number of dimensions
$x = \{x_1, \ldots, x_d\}$ is the input vector
$\mu = E(x) = \{\mu_1, \ldots, \mu_d\}$ is the mean vector
$\Sigma = E((x-\mu)(x-\mu)^t)$ is the covariance matrix, with elements $\sigma_{ij}$, inverse $\Sigma^{-1}$, and determinant $|\Sigma|$

$$\sigma_{ij} = \sigma_{ji} = E((x_i - \mu_i)(x_j - \mu_j)) = E(x_i x_j) - \mu_i \mu_j$$
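The same density written out in NumPy (a hedged sketch; for real use, scipy.stats.multivariate_normal computes the same values):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(mu, Sigma) at vector x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2.0 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 1.0]])
print(mvn_pdf(np.array([0.5, -0.5]), mu, Sigma))
```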


Gaussian Distributions: Multi-Dimensional Properties


If the $i$th and $j$th dimensions are statistically or linearly independent, then $E(x_i x_j) = E(x_i)E(x_j)$ and $\sigma_{ij} = 0$

If all dimensions are statistically or linearly independent, then $\sigma_{ij} = 0 \;\; \forall i \ne j$ and $\Sigma$ has non-zero elements only on the diagonal

If the underlying density is Gaussian and $\Sigma$ is a diagonal matrix, then the dimensions are statistically independent and

$$p(x) = \prod_{i=1}^{d} p(x_i) \qquad p(x_i) \sim N(\mu_i, \sigma_{ii}) \qquad \sigma_{ii} = \sigma_i^2$$


Diagonal Covariance Matrix: $\Sigma = \sigma^2 I$

$$\Sigma = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$$

[Figure: 3-dimensional PDF surface and circular PDF contours]

Diagonal Covariance Matrix: $\sigma_{ij} = 0,\; i \ne j$

$$\Sigma = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$$

[Figure: 3-dimensional PDF surface and axis-aligned elliptical PDF contours]

General Covariance Matrix: $\sigma_{ij} \ne 0$

$$\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$$

[Figure: 3-dimensional PDF surface and rotated elliptical PDF contours]

Multivariate ML Estimation
The ML estimates for parameters $\vec\theta = \{\theta_1, \ldots, \theta_l\}$ are determined by maximizing the joint likelihood $L(\vec\theta)$ of a set of i.i.d. data $X = \{x_1, \ldots, x_n\}$:

$$L(\vec\theta) = p(X|\vec\theta) = p(x_1, \ldots, x_n|\vec\theta) = \prod_{i=1}^{n} p(x_i|\vec\theta)$$

To find $\hat{\vec\theta} = \{\hat\theta_1, \ldots, \hat\theta_l\}$ we solve $\nabla_{\vec\theta}\, L(\vec\theta) = 0$, or $\nabla_{\vec\theta} \log L(\vec\theta) = 0$

The ML estimates of $\mu$ and $\Sigma$ are:

$$\hat\mu = \frac{1}{n} \sum_i x_i \qquad \hat\Sigma = \frac{1}{n} \sum_i (x_i - \hat\mu)(x_i - \hat\mu)^t$$
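The multivariate estimates are equally direct; this sketch verifies them on synthetic data (note the 1/n normalization, which is what np.cov(..., bias=True) computes):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[1.0, -1.0],
                            cov=[[2.0, 1.0], [1.0, 1.0]], size=5000)

mu_hat = X.mean(axis=0)              # (1/n) sum_i x_i
diff = X - mu_hat
Sigma_hat = diff.T @ diff / len(X)   # (1/n) sum_i (x_i - mu)(x_i - mu)^t
print(mu_hat)
print(Sigma_hat)                     # matches np.cov(X.T, bias=True)
```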

Multivariate Gaussian Classifier

$$p(x) \sim N(\mu, \Sigma)$$

Requires a mean vector $\mu_i$ and a covariance matrix $\Sigma_i$ for each of $M$ classes $\{\omega_1, \ldots, \omega_M\}$

The minimum error discriminant functions are of the form:

$$g_i(x) = \log P(\omega_i|x) = \log p(x|\omega_i) + \log P(\omega_i)$$

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t\,\Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

Classification can be reduced to simple distance metrics for many situations

Gaussian Classifier: $\Sigma_i = \sigma^2 I$

Each class has the same covariance structure: statistically independent dimensions with variance $\sigma^2$

The equivalent discriminant functions are:

$$g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \log P(\omega_i)$$

If each class is equally likely, this is a minimum distance classifier, a form of template matching

The discriminant functions can be replaced by the following linear expression:

$$g_i(x) = w_i^t x + \omega_{i0}$$

where

$$w_i = \frac{1}{\sigma^2}\,\mu_i \qquad \omega_{i0} = -\frac{1}{2\sigma^2}\,\mu_i^t\mu_i + \log P(\omega_i)$$
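A minimal sketch of this linear classifier, with made-up class means, priors, and variance:

```python
import numpy as np

means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # mu_i (hypothetical)
priors = np.array([0.5, 0.25, 0.25])                     # P(w_i)
sigma2 = 1.5                                             # shared variance

W = means / sigma2                                       # w_i = mu_i / sigma^2
b = -0.5 * (means ** 2).sum(axis=1) / sigma2 + np.log(priors)  # bias w_i0

x = np.array([1.0, 0.5])
g = W @ x + b                      # linear discriminants g_i(x)
print(g, "-> choose class", g.argmax() + 1)
```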


Gaussian Classifier: $\Sigma_i = \sigma^2 I$

For distributions with a common covariance structure the decision boundaries are hyper-planes

[Figure: contours of two Gaussian classes separated by a linear decision boundary]

Gaussian Classifier: $\Sigma_i = \Sigma$

Each class has the same covariance structure

The equivalent discriminant functions are:

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^t\,\Sigma^{-1}(x - \mu_i) + \log P(\omega_i)$$

If each class is equally likely, the minimum error decision rule is based on the squared Mahalanobis distance

The discriminant functions remain linear expressions:

$$g_i(x) = w_i^t x + \omega_{i0}$$

where

$$w_i = \Sigma^{-1}\mu_i \qquad \omega_{i0} = -\frac{1}{2}\,\mu_i^t\,\Sigma^{-1}\mu_i + \log P(\omega_i)$$
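The squared Mahalanobis distance behind this rule (also the decision metric in the Atal & Rabiner experiment below) can be sketched as follows; the covariance and means are invented:

```python
import numpy as np

def sq_mahalanobis(x, mu, Sigma_inv):
    """Squared Mahalanobis distance (x - mu)^t Sigma^-1 (x - mu)."""
    diff = x - mu
    return diff @ Sigma_inv @ diff

Sigma_inv = np.linalg.inv(np.array([[2.0, 1.0], [1.0, 1.0]]))  # shared cov
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]

x = np.array([1.5, 0.5])
d2 = [sq_mahalanobis(x, mu, Sigma_inv) for mu in means]
print(d2, "-> choose class", int(np.argmin(d2)) + 1)  # equal priors: min distance
```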

Gaussian Classifier: $\Sigma_i$ Arbitrary

Each class has a different covariance structure $\Sigma_i$

The equivalent discriminant functions are:

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^t\,\Sigma_i^{-1}(x - \mu_i) - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

The discriminant functions are inherently quadratic:

$$g_i(x) = x^t W_i\, x + w_i^t x + \omega_{i0}$$

where

$$W_i = -\frac{1}{2}\,\Sigma_i^{-1} \qquad w_i = \Sigma_i^{-1}\mu_i \qquad \omega_{i0} = -\frac{1}{2}\,\mu_i^t\,\Sigma_i^{-1}\mu_i - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$
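The full quadratic discriminant evaluated directly from per-class parameters (an illustrative sketch with invented numbers):

```python
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """g_i(x) for a Gaussian class with its own covariance Sigma_i."""
    Sigma_inv = np.linalg.inv(Sigma)
    diff = x - mu
    return (-0.5 * diff @ Sigma_inv @ diff
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Two hypothetical classes with different covariance structures
classes = [(np.array([0.0, 0.0]), np.array([[2.0, 1.0], [1.0, 1.0]]), 0.6),
           (np.array([2.0, 1.0]), np.array([[1.0, 0.0], [0.0, 3.0]]), 0.4)]

x = np.array([1.0, 0.5])
g = [quadratic_discriminant(x, mu, S, P) for mu, S, P in classes]
print(g, "-> choose class", int(np.argmax(g)) + 1)
```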


Gaussian Classifier: $\Sigma_i$ Arbitrary

For distributions with arbitrary covariance structures the decision boundaries are defined by hyper-quadrics

[Figure: contours of two Gaussian classes separated by a curved, quadratic decision boundary]

3 Class Classification (Atal & Rabiner, 1976)

Distinguish between silence, unvoiced, and voiced sounds

Use 5 features:
Zero crossing count
Log energy
Normalized first autocorrelation coefficient
First predictor coefficient
Normalized prediction error

Multivariate Gaussian classifier, ML estimation
Decision by squared Mahalanobis distance
Trained on four speakers (2 sentences/speaker), tested on 2 speakers (1 sentence/speaker)

Maximum A Posteriori Parameter Estimation


Bayesian estimation approaches assume the form of the PDF $p(x|\theta)$ is known, but the value of $\theta$ is not

Knowledge of $\theta$ is contained in:
An initial a priori PDF $p(\theta)$
A set of i.i.d. data $X = \{x_1, \ldots, x_n\}$

The desired PDF for $x$ is of the form:

$$p(x|X) = \int p(x, \theta|X)\,d\theta = \int p(x|\theta)\,p(\theta|X)\,d\theta$$

The value $\hat\theta$ that maximizes $p(\theta|X)$ is called the maximum a posteriori (MAP) estimate of $\theta$:

$$p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)} \propto \prod_{i=1}^{n} p(x_i|\theta)\,p(\theta)$$

Gaussian MAP Estimation: One Dimension


For a Gaussian distribution with unknown mean $\mu$:

$$p(x|\mu) \sim N(\mu, \sigma^2) \qquad p(\mu) \sim N(\mu_0, \sigma_0^2)$$

MAP estimates of $\mu$ and $x$ are given by:

$$p(\mu|X) \propto \prod_{i=1}^{n} p(x_i|\mu)\,p(\mu) \sim N(\mu_n, \sigma_n^2)$$

$$p(x|X) = \int p(x|\mu)\,p(\mu|X)\,d\mu \sim N(\mu_n, \sigma^2 + \sigma_n^2)$$

where

$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat\mu + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0 \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

As $n$ increases, $p(\mu|X)$ converges to $\hat\mu$, and $p(x|X)$ converges to the ML estimate $N(\hat\mu, \sigma^2)$
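A sketch of these updates on synthetic data, showing the MAP mean moving from the prior toward the ML estimate as $n$ grows (all numbers invented):

```python
import numpy as np

def map_mean(x, sigma2, mu0, sigma0_2):
    """MAP estimate mu_n and posterior variance sigma_n^2 for a Gaussian mean."""
    n = len(x)
    mu_ml = x.mean()
    denom = n * sigma0_2 + sigma2
    mu_n = (n * sigma0_2 / denom) * mu_ml + (sigma2 / denom) * mu0
    sigma_n2 = sigma0_2 * sigma2 / denom
    return mu_n, sigma_n2

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=1000)
for n in (2, 10, 1000):
    mu_n, s_n2 = map_mean(data[:n], sigma2=4.0, mu0=0.0, sigma0_2=1.0)
    print(n, round(mu_n, 3), round(s_n2, 5))  # mu_n -> ML estimate as n grows
```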

References

Huang, Acero, and Hon, Spoken Language Processing, Prentice-Hall, 2001.
Duda, Hart, and Stork, Pattern Classification, John Wiley & Sons, 2001.
Atal and Rabiner, "A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition," IEEE Trans. ASSP, 24(3), 1976.
