
Lecture # 7 Session 2003

Pattern Classification

Introduction
Parametric classifiers
Semi-parametric classifiers
Dimensionality reduction
Significance testing


Pattern Classification
Goal: To classify objects (or patterns) into categories (or classes)

Observation $s$ → Feature Extraction → Feature Vector $x$ → Classifier → Class $\omega_i$

Types of Problems:

1. Supervised: Classes are known beforehand, and data samples of each class are available
2. Unsupervised: Classes (and/or number of classes) are not known beforehand, and must be inferred from the data


Probability Basics
Discrete probability mass function (PMF): $P(\omega_i)$, with $\sum_i P(\omega_i) = 1$

Continuous probability density function (PDF): $p(x)$, with $\int p(x)\,dx = 1$

Expected value: $E(x) = \int x\,p(x)\,dx$


Kullback-Leibler Distance

Can be used to compute a distance between two probability mass distributions, $P(z_i)$ and $Q(z_i)$:

$$D(P \| Q) = \sum_i P(z_i) \log \frac{P(z_i)}{Q(z_i)} \ge 0$$

Makes use of the inequality $\log x \le x - 1$:

$$\sum_i P(z_i) \log \frac{Q(z_i)}{P(z_i)} \le \sum_i P(z_i) \left(\frac{Q(z_i)}{P(z_i)} - 1\right) = \sum_i \big(Q(z_i) - P(z_i)\big) = 0$$

Known as relative entropy in information theory
The divergence of $P(z_i)$ and $Q(z_i)$ is the symmetric sum $D(P \| Q) + D(Q \| P)$
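As an illustrative sketch (not from the lecture), both quantities are a few lines of NumPy; the example PMFs below are made up:

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy D(P || Q); assumes Q > 0 wherever P > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(z_i) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def symmetric_divergence(p, q):
    """Symmetric sum D(P || Q) + D(Q || P)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q))         # >= 0, and 0 only when P == Q
print(symmetric_divergence(p, q))
```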


Bayes Theorem

[Figure: overlapping class-conditional PDFs $p(x|\omega_1)$ and $p(x|\omega_2)$]

Define:

$\{\omega_i\}$ : a set of $M$ mutually exclusive classes
$P(\omega_i)$ : a priori probability for class $\omega_i$
$p(x|\omega_i)$ : PDF for feature vector $x$ in class $\omega_i$
$P(\omega_i|x)$ : a posteriori probability of $\omega_i$ given $x$

From Bayes Rule:

$$P(\omega_i|x) = \frac{p(x|\omega_i)\,P(\omega_i)}{p(x)}$$

where

$$p(x) = \sum_{i=1}^{M} p(x|\omega_i)\,P(\omega_i)$$
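As a small numerical sketch (the Gaussian class-conditional densities and priors are invented for illustration), the posteriors are the joint terms $p(x|\omega_i)P(\omega_i)$ normalized by their sum:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.7, 0.3])            # P(w_1), P(w_2), made-up values
x = 1.2
lik = np.array([norm.pdf(x, 0.0, 1.0),   # p(x|w_1) ~ N(0, 1)
                norm.pdf(x, 2.0, 1.0)])  # p(x|w_2) ~ N(2, 1)

joint = lik * priors                   # p(x|w_i) P(w_i)
posterior = joint / joint.sum()        # P(w_i|x); the denominator is p(x)
print(posterior, "-> class", posterior.argmax() + 1)
```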


Bayes Decision Theory


The probability of making an error given $x$ is:

$$P(\text{error}|x) = 1 - P(\omega_i|x) \quad \text{if we decide class } \omega_i$$

To minimize $P(\text{error}|x)$ (and $P(\text{error})$):

Choose $\omega_i$ if $P(\omega_i|x) > P(\omega_j|x) \;\; \forall j \ne i$

For a two-class problem this decision rule means:

Choose $\omega_1$ if $\dfrac{p(x|\omega_1)\,P(\omega_1)}{p(x)} > \dfrac{p(x|\omega_2)\,P(\omega_2)}{p(x)}$; else choose $\omega_2$

This rule can be expressed as a likelihood ratio:

Choose $\omega_1$ if $\dfrac{p(x|\omega_1)}{p(x|\omega_2)} > \dfrac{P(\omega_2)}{P(\omega_1)}$; else choose $\omega_2$
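The same decision in likelihood-ratio form, reusing the invented two-class model from the sketch above:

```python
from scipy.stats import norm

p1, p2 = 0.7, 0.3                       # priors P(w_1), P(w_2)
x = 1.2
lik1 = norm.pdf(x, 0.0, 1.0)            # p(x|w_1)
lik2 = norm.pdf(x, 2.0, 1.0)            # p(x|w_2)

ratio = lik1 / lik2                     # likelihood ratio
threshold = p2 / p1                     # prior ratio
decision = 1 if ratio > threshold else 2
print(f"ratio {ratio:.3f} vs threshold {threshold:.3f} -> choose class {decision}")
```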


Bayes Risk
Define cost function $\lambda_{ij}$ and conditional risk $R(\omega_i|x)$:

$\lambda_{ij}$ is the cost of classifying $x$ as $\omega_i$ when it is really $\omega_j$
$R(\omega_i|x)$ is the risk for classifying $x$ as class $\omega_i$:

$$R(\omega_i|x) = \sum_{j=1}^{M} \lambda_{ij}\,P(\omega_j|x)$$

Bayes risk is the minimum risk which can be achieved:

Choose $\omega_i$ if $R(\omega_i|x) < R(\omega_j|x) \;\; \forall j \ne i$

Bayes risk corresponds to minimum $P(\text{error}|x)$ when:

All errors have equal cost ($\lambda_{ij} = 1$, $i \ne j$)
There is no cost for being correct ($\lambda_{ii} = 0$)

$$R(\omega_i|x) = \sum_{j \ne i} P(\omega_j|x) = 1 - P(\omega_i|x)$$
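A conditional-risk calculation under an assumed cost matrix might look like this sketch (costs and posteriors are illustrative only):

```python
import numpy as np

# Hypothetical costs: cost[i, j] = lambda_ij, deciding class i when truth is j
cost = np.array([[0.0, 1.0, 4.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])
posterior = np.array([0.5, 0.3, 0.2])   # P(w_j|x), made-up values

risk = cost @ posterior                 # R(w_i|x) = sum_j lambda_ij P(w_j|x)
print(risk, "-> choose class", risk.argmin() + 1)
# With 0-1 costs this reduces to picking the class with maximum posterior
```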

Discriminant Functions
Alternative formulation of Bayes decision rule

Define a discriminant function, $g_i(x)$, for each class $\omega_i$:

Choose $\omega_i$ if $g_i(x) > g_j(x) \;\; \forall j \ne i$

Functions yielding identical classification results:

$$g_i(x) = P(\omega_i|x)$$
$$g_i(x) = p(x|\omega_i)\,P(\omega_i)$$
$$g_i(x) = \log p(x|\omega_i) + \log P(\omega_i)$$

Choice of function impacts computation costs

Discriminant functions partition feature space into decision regions, separated by decision boundaries


Density Estimation
Used to estimate the underlying PDF $p(x|\omega_i)$

Parametric methods:
Assume a specific functional form for the PDF
Optimize PDF parameters to fit the data

Non-parametric methods:
Determine the form of the PDF from the data
Grow parameter set size with the amount of data

Semi-parametric methods:
Use a general class of functional forms for the PDF
Can vary the parameter set independently of the data
Use unsupervised methods to estimate parameters

Parametric Classifiers

Gaussian distributions
Maximum likelihood (ML) parameter estimation
Multivariate Gaussians
Gaussian classifiers


Gaussian Distributions
Gaussian PDFs are reasonable when a feature vector can be viewed as a perturbation around a reference

[Figure: one-dimensional Gaussian probability density curve]

Simple estimation procedures for model parameters
Classification often reduced to simple distance metrics
Gaussian distributions are also called Normal distributions

Gaussian Distributions: One Dimension


One-dimensional Gaussian PDFs can be expressed as:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \sim N(\mu, \sigma^2)$$

The PDF is centered around the mean:

$$\mu = E(x) = \int x\,p(x)\,dx$$

The spread of the PDF is determined by the variance:

$$\sigma^2 = E((x-\mu)^2) = \int (x-\mu)^2\,p(x)\,dx$$
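A direct NumPy transcription of this density (an illustrative sketch, not course code):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """One-dimensional Gaussian density N(mu, sigma^2) evaluated at x."""
    coeff = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The density integrates to ~1 over a wide enough interval
xs = np.linspace(-10.0, 10.0, 100001)
print(np.trapz(gaussian_pdf(xs, mu=0.0, sigma=1.0), xs))  # ~1.0
```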


Maximum Likelihood Parameter Estimation


Maximum likelihood parameter estimation determines an estimate $\hat\theta$ for parameter $\theta$ by maximizing the likelihood $L(\theta)$ of observing data $X = \{x_1, \ldots, x_n\}$:

$$\hat\theta = \arg\max_\theta L(\theta)$$

Assuming independent, identically distributed data:

$$L(\theta) = p(X|\theta) = p(x_1, \ldots, x_n|\theta) = \prod_{i=1}^{n} p(x_i|\theta)$$

ML solutions can often be obtained via the derivative:

$$\frac{\partial}{\partial\theta} L(\theta) = 0$$

For Gaussian distributions, $\log L(\theta)$ is easier to solve

Gaussian ML Estimation: One Dimension


The maximum likelihood estimate for $\mu$ is given by:

$$L(\mu) = \prod_{i=1}^{n} p(x_i|\mu) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\, e^{-\sum_i \frac{(x_i-\mu)^2}{2\sigma^2}}$$

$$\log L(\mu) = -\frac{1}{2\sigma^2} \sum_i (x_i-\mu)^2 - n \log \sqrt{2\pi}\,\sigma$$

$$\frac{\partial}{\partial\mu} \log L(\mu) = \frac{1}{\sigma^2} \sum_i (x_i-\mu) = 0 \implies \hat\mu = \frac{1}{n} \sum_i x_i$$

The maximum likelihood estimate for $\sigma^2$ is given by:

$$\hat\sigma^2 = \frac{1}{n} \sum_i (x_i - \hat\mu)^2$$
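These closed-form estimates are one line each in NumPy; the sketch below checks them on synthetic data, with values chosen to echo the duration example on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.120, scale=0.040, size=1000)  # synthetic durations (sec)

mu_hat = x.mean()                        # ML estimate of the mean
var_hat = ((x - mu_hat) ** 2).mean()     # ML estimate of the variance (1/n)
print(mu_hat, np.sqrt(var_hat))          # ~0.120, ~0.040
```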

Gaussian ML Estimation: One Dimension


[Figure: histogram of [s] duration (1000 utterances, 100 speakers) overlaid with the ML-fitted Gaussian PDF, $\hat\mu \approx 120$ ms, $\hat\sigma \approx 40$ ms; x-axis: Duration (sec), y-axis: Probability Density]

ML Estimation: Alternative Distributions


[Figure: [s] duration histogram with an ML-fitted Gamma distribution; x-axis: Duration (sec), y-axis: Probability Density]

ML Estimation: Alternative Distributions


[Figure: [s] log duration histogram with an ML-fitted Normal distribution; x-axis: log Duration (sec), y-axis: Probability Density]

Gaussian Distributions: Multiple Dimensions


A multi-dimensional Gaussian PDF can be expressed as:

$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^t \Sigma^{-1} (x-\mu)} \sim N(\mu, \Sigma)$$

$d$ is the number of dimensions
$x = \{x_1, \ldots, x_d\}$ is the input vector
$\mu = E(x) = \{\mu_1, \ldots, \mu_d\}$ is the mean vector
$\Sigma = E((x-\mu)(x-\mu)^t)$ is the covariance matrix, with elements $\sigma_{ij}$, inverse $\Sigma^{-1}$, and determinant $|\Sigma|$

$$\sigma_{ij} = \sigma_{ji} = E((x_i - \mu_i)(x_j - \mu_j)) = E(x_i x_j) - \mu_i \mu_j$$
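The same density written out in NumPy (a hedged sketch; for real use, scipy.stats.multivariate_normal computes the same values):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(mu, Sigma) at vector x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2.0 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 1.0]])
print(mvn_pdf(np.array([0.5, -0.5]), mu, Sigma))
```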


Gaussian Distributions: Multi-Dimensional Properties


If the $i$th and $j$th dimensions are statistically or linearly independent, then $E(x_i x_j) = E(x_i)E(x_j)$ and $\sigma_{ij} = 0$

If all dimensions are statistically or linearly independent, then $\sigma_{ij} = 0 \;\; \forall i \ne j$ and $\Sigma$ has non-zero elements only on the diagonal

If the underlying density is Gaussian and $\Sigma$ is a diagonal matrix, then the dimensions are statistically independent and

$$p(x) = \prod_{i=1}^{d} p(x_i) \qquad p(x_i) \sim N(\mu_i, \sigma_{ii}) \qquad \sigma_{ii} = \sigma_i^2$$


Diagonal Covariance Matrix: $\Sigma = \sigma^2 I$

$$\Sigma = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$$

[Figure: 3-dimensional PDF surface and circular PDF contours]

Diagonal Covariance Matrix: $\sigma_{ij} = 0,\; i \ne j$

$$\Sigma = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$$

[Figure: 3-dimensional PDF surface and axis-aligned elliptical PDF contours]

General Covariance Matrix: $\sigma_{ij} \ne 0$

$$\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$$

[Figure: 3-dimensional PDF surface and rotated elliptical PDF contours]

Multivariate ML Estimation
The ML estimates for parameters $\vec\theta = \{\theta_1, \ldots, \theta_l\}$ are determined by maximizing the joint likelihood $L(\vec\theta)$ of a set of i.i.d. data $X = \{x_1, \ldots, x_n\}$:

$$L(\vec\theta) = p(X|\vec\theta) = p(x_1, \ldots, x_n|\vec\theta) = \prod_{i=1}^{n} p(x_i|\vec\theta)$$

To find $\hat{\vec\theta} = \{\hat\theta_1, \ldots, \hat\theta_l\}$ we solve $\nabla_{\vec\theta}\, L(\vec\theta) = 0$, or $\nabla_{\vec\theta} \log L(\vec\theta) = 0$

The ML estimates of $\mu$ and $\Sigma$ are:

$$\hat\mu = \frac{1}{n} \sum_i x_i \qquad \hat\Sigma = \frac{1}{n} \sum_i (x_i - \hat\mu)(x_i - \hat\mu)^t$$
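The multivariate estimates are equally direct; this sketch verifies them on synthetic data (note the 1/n normalization, which is what np.cov(..., bias=True) computes):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[1.0, -1.0],
                            cov=[[2.0, 1.0], [1.0, 1.0]], size=5000)

mu_hat = X.mean(axis=0)              # (1/n) sum_i x_i
diff = X - mu_hat
Sigma_hat = diff.T @ diff / len(X)   # (1/n) sum_i (x_i - mu)(x_i - mu)^t
print(mu_hat)
print(Sigma_hat)                     # matches np.cov(X.T, bias=True)
```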

Multivariate Gaussian Classifier

$$p(x) \sim N(\mu, \Sigma)$$

Requires a mean vector $\mu_i$ and a covariance matrix $\Sigma_i$ for each of $M$ classes $\{\omega_1, \ldots, \omega_M\}$

The minimum error discriminant functions are of the form:

$$g_i(x) = \log P(\omega_i|x) = \log p(x|\omega_i) + \log P(\omega_i)$$

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t\,\Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

Classification can be reduced to simple distance metrics for many situations

Gaussian Classifier: $\Sigma_i = \sigma^2 I$

Each class has the same covariance structure: statistically independent dimensions with variance $\sigma^2$

The equivalent discriminant functions are:

$$g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \log P(\omega_i)$$

If each class is equally likely, this is a minimum distance classifier, a form of template matching

The discriminant functions can be replaced by the following linear expression:

$$g_i(x) = w_i^t x + \omega_{i0}$$

where

$$w_i = \frac{1}{\sigma^2}\,\mu_i \qquad \omega_{i0} = -\frac{1}{2\sigma^2}\,\mu_i^t\mu_i + \log P(\omega_i)$$
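A minimal sketch of this linear classifier, with made-up class means, priors, and variance:

```python
import numpy as np

means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # mu_i (hypothetical)
priors = np.array([0.5, 0.25, 0.25])                     # P(w_i)
sigma2 = 1.5                                             # shared variance

W = means / sigma2                                       # w_i = mu_i / sigma^2
b = -0.5 * (means ** 2).sum(axis=1) / sigma2 + np.log(priors)  # bias w_i0

x = np.array([1.0, 0.5])
g = W @ x + b                      # linear discriminants g_i(x)
print(g, "-> choose class", g.argmax() + 1)
```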


Gaussian Classifier: $\Sigma_i = \sigma^2 I$

For distributions with a common covariance structure the decision boundaries are hyper-planes

[Figure: contours of two Gaussian classes separated by a linear decision boundary]

Gaussian Classifier: $\Sigma_i = \Sigma$

Each class has the same covariance structure

The equivalent discriminant functions are:

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^t\,\Sigma^{-1}(x - \mu_i) + \log P(\omega_i)$$

If each class is equally likely, the minimum error decision rule is based on the squared Mahalanobis distance

The discriminant functions remain linear expressions:

$$g_i(x) = w_i^t x + \omega_{i0}$$

where

$$w_i = \Sigma^{-1}\mu_i \qquad \omega_{i0} = -\frac{1}{2}\,\mu_i^t\,\Sigma^{-1}\mu_i + \log P(\omega_i)$$
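The squared Mahalanobis distance behind this rule (also the decision metric in the Atal & Rabiner experiment below) can be sketched as follows; the covariance and means are invented:

```python
import numpy as np

def sq_mahalanobis(x, mu, Sigma_inv):
    """Squared Mahalanobis distance (x - mu)^t Sigma^-1 (x - mu)."""
    diff = x - mu
    return diff @ Sigma_inv @ diff

Sigma_inv = np.linalg.inv(np.array([[2.0, 1.0], [1.0, 1.0]]))  # shared cov
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]

x = np.array([1.5, 0.5])
d2 = [sq_mahalanobis(x, mu, Sigma_inv) for mu in means]
print(d2, "-> choose class", int(np.argmin(d2)) + 1)  # equal priors: min distance
```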

Gaussian Classifier: $\Sigma_i$ Arbitrary

Each class has a different covariance structure $\Sigma_i$

The equivalent discriminant functions are:

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^t\,\Sigma_i^{-1}(x - \mu_i) - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

The discriminant functions are inherently quadratic:

$$g_i(x) = x^t W_i\, x + w_i^t x + \omega_{i0}$$

where

$$W_i = -\frac{1}{2}\,\Sigma_i^{-1} \qquad w_i = \Sigma_i^{-1}\mu_i \qquad \omega_{i0} = -\frac{1}{2}\,\mu_i^t\,\Sigma_i^{-1}\mu_i - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$
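The full quadratic discriminant evaluated directly from per-class parameters (an illustrative sketch with invented numbers):

```python
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """g_i(x) for a Gaussian class with its own covariance Sigma_i."""
    Sigma_inv = np.linalg.inv(Sigma)
    diff = x - mu
    return (-0.5 * diff @ Sigma_inv @ diff
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Two hypothetical classes with different covariance structures
classes = [(np.array([0.0, 0.0]), np.array([[2.0, 1.0], [1.0, 1.0]]), 0.6),
           (np.array([2.0, 1.0]), np.array([[1.0, 0.0], [0.0, 3.0]]), 0.4)]

x = np.array([1.0, 0.5])
g = [quadratic_discriminant(x, mu, S, P) for mu, S, P in classes]
print(g, "-> choose class", int(np.argmax(g)) + 1)
```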


Gaussian Classifier: $\Sigma_i$ Arbitrary

For distributions with arbitrary covariance structures the decision boundaries are defined by hyper-quadrics

[Figure: contours of two Gaussian classes separated by a curved, quadratic decision boundary]

3 Class Classification (Atal & Rabiner, 1976)

Distinguish between silence, unvoiced, and voiced sounds

Use 5 features:
Zero crossing count
Log energy
Normalized first autocorrelation coefficient
First predictor coefficient
Normalized prediction error

Multivariate Gaussian classifier, ML estimation
Decision by squared Mahalanobis distance
Trained on four speakers (2 sentences/speaker), tested on 2 speakers (1 sentence/speaker)

Maximum A Posteriori Parameter Estimation


Bayesian estimation approaches assume the form of the PDF $p(x|\theta)$ is known, but the value of $\theta$ is not

Knowledge of $\theta$ is contained in:
An initial a priori PDF $p(\theta)$
A set of i.i.d. data $X = \{x_1, \ldots, x_n\}$

The desired PDF for $x$ is of the form:

$$p(x|X) = \int p(x, \theta|X)\,d\theta = \int p(x|\theta)\,p(\theta|X)\,d\theta$$

The value $\hat\theta$ that maximizes $p(\theta|X)$ is called the maximum a posteriori (MAP) estimate of $\theta$:

$$p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)} \propto \prod_{i=1}^{n} p(x_i|\theta)\,p(\theta)$$

Gaussian MAP Estimation: One Dimension


For a Gaussian distribution with unknown mean $\mu$:

$$p(x|\mu) \sim N(\mu, \sigma^2) \qquad p(\mu) \sim N(\mu_0, \sigma_0^2)$$

MAP estimates of $\mu$ and $x$ are given by:

$$p(\mu|X) \propto \prod_{i=1}^{n} p(x_i|\mu)\,p(\mu) \sim N(\mu_n, \sigma_n^2)$$

$$p(x|X) = \int p(x|\mu)\,p(\mu|X)\,d\mu \sim N(\mu_n, \sigma^2 + \sigma_n^2)$$

where

$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat\mu + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0 \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

As $n$ increases, $p(\mu|X)$ converges to $\hat\mu$, and $p(x|X)$ converges to the ML estimate $N(\hat\mu, \sigma^2)$
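A sketch of these updates on synthetic data, showing the MAP mean moving from the prior toward the ML estimate as $n$ grows (all numbers invented):

```python
import numpy as np

def map_mean(x, sigma2, mu0, sigma0_2):
    """MAP estimate mu_n and posterior variance sigma_n^2 for a Gaussian mean."""
    n = len(x)
    mu_ml = x.mean()
    denom = n * sigma0_2 + sigma2
    mu_n = (n * sigma0_2 / denom) * mu_ml + (sigma2 / denom) * mu0
    sigma_n2 = sigma0_2 * sigma2 / denom
    return mu_n, sigma_n2

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=1000)
for n in (2, 10, 1000):
    mu_n, s_n2 = map_mean(data[:n], sigma2=4.0, mu0=0.0, sigma0_2=1.0)
    print(n, round(mu_n, 3), round(s_n2, 5))  # mu_n -> ML estimate as n grows
```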

References

Huang, Acero, and Hon, Spoken Language Processing, Prentice-Hall, 2001.
Duda, Hart, and Stork, Pattern Classification, John Wiley & Sons, 2001.
Atal and Rabiner, "A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition," IEEE Trans. ASSP, 24(3), 1976.
