
ECE 8443 – Pattern Recognition

LECTURE 14: SUFFICIENT STATISTICS

• Objectives:
Sufficient Statistics
Dimensionality
Complexity
Overfitting
• Resources:
DHS – Chap. 3 (Part 2)
Rice – Sufficient Statistics
Ellem – Sufficient Statistics
TAMU – Dimensionality

• URL: .../publications/courses/ece_8443/lectures/current/lecture_14.ppt
14: SUFFICIENT STATISTICS
DEFINITION
• Direct computation of p(D|θ) and p(θ|D) for large data sets
is challenging (e.g., neural networks).
• We need a parametric form for p(x|θ) (e.g., Gaussian).
• Gaussian case: the sample mean and covariance, which
were straightforward to compute, contained all the
information relevant to estimating the unknown population
mean and covariance.
• This property exists for other distributions.
• A sufficient statistic is a function s of the samples D that
contains all the information relevant to a parameter, θ.
• A statistic, s, is said to be sufficient for θ if p(D|s,θ) is
independent of θ:

$$p(\theta \mid s, D) = \frac{p(D \mid s, \theta)\, p(\theta \mid s)}{p(D \mid s)} = p(\theta \mid s)$$
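• A minimal numerical sketch of this independence (mine, not from the lecture; 1-D Gaussian with known variance): two datasets with the same sample mean yield log-likelihood curves that differ only by a θ-independent constant, which is the h(D) factor of the next slide.

import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                                   # known variance
D1 = rng.normal(2.0, sigma, size=50)
D2 = D1.mean() + 0.5 * (D1 - D1.mean())       # different samples, same mean

def log_lik(theta, D):
    # log p(D | theta) for N(theta, sigma^2), summed over the samples
    return (-0.5 * len(D) * np.log(2 * np.pi * sigma**2)
            - np.sum((D - theta) ** 2) / (2 * sigma**2))

thetas = np.linspace(0.0, 4.0, 9)
diffs = [log_lik(t, D1) - log_lik(t, D2) for t in thetas]
print(np.ptp(diffs))   # ~0: the gap between the curves does not depend on theta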
14: SUFFICIENT STATISTICS
FACTORIZATION THEOREM

• Theorem: A statistic, s, is sufficient for θ if and only if
p(D|θ) can be written as:

$$p(D \mid \theta) = g(s, \theta)\, h(D)$$
• There are many ways to formulate sufficient statistics
(e.g., define a vector of the samples themselves).
• This is useful only when the function g(⋅) and the sufficient
statistic are simple (e.g., the sample mean calculation).
• The factoring of p(D|θ) is not unique:

$$g'(s, \theta) = f(s)\, g(s, \theta), \qquad h'(D) = h(D) / f(s)$$
• Define a kernel density invariant to this scaling:

$$\tilde{g}(s, \theta) = \frac{g(s, \theta)}{\int g(s, \theta')\, d\theta'}$$
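• A small sketch of this invariance (mine; the θ grid and the factor f(s) = 123.4 are arbitrary choices): moving any f(s) from h into g leaves the normalized kernel unchanged.

import numpy as np

n, sigma, s = 50, 1.0, 2.1            # s is the sample mean (sufficient statistic)
theta = np.linspace(-2.0, 6.0, 2001)  # grid standing in for the theta axis
dtheta = theta[1] - theta[0]
g = np.exp(-0.5 * n * (theta - s) ** 2 / sigma**2)   # g(s, theta) up to constants

def normalize(k):
    # discrete version of dividing by the integral of g over theta
    return k / (k.sum() * dtheta)

g_tilde = normalize(g)
g_tilde_scaled = normalize(123.4 * g)   # g'(s, theta) = f(s) g(s, theta)
print(np.max(np.abs(g_tilde - g_tilde_scaled)))   # ~0: f(s) cancels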
14: SUFFICIENT STATISTICS
GAUSSIAN DISTRIBUTION
n 1 1 t 1
p( D |  )   exp[  ( x k   )  ( x k   )]
d 2 12
k 1( 2 )  2
1 1 n t 1 t 1 1
 exp[       2  x k  x t
k  xk ]
d 2 12
( 2 )  2 k 1

t 1  
n t 1 n
 exp[          x k  ]
2  k 1 
1 1 n t 1
 exp[   x k  x k ]
d 2 12
( 2 )  2 k 1
• This isolates the  dependence in the first term, and
hence, the sample mean is a sufficient statistic.
• The kernel is: ~ 1 1  1 1 
g( 
ˆ n , )  exp[ (   
ˆ n )t   (   
ˆ n) ]
1
12 2 n 
( 2 )d 2

n
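• A 1-D numerical check (my sketch, assuming known σ): normalizing the θ-dependent factor from the derivation over θ reproduces the closed-form kernel N(μ̂ₙ, σ²/n).

import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
D = rng.normal(3.0, sigma, size=25)
n, mu_hat = len(D), D.mean()

theta = np.linspace(mu_hat - 2.0, mu_hat + 2.0, 4001)
dtheta = theta[1] - theta[0]

# theta-dependent factor from the factorization (h(D) omitted: it cancels)
expo = -0.5 * (n * theta**2 - 2.0 * theta * D.sum()) / sigma**2
g = np.exp(expo - expo.max())          # subtract the max for numerical stability
g_tilde = g / (g.sum() * dtheta)       # normalize over theta

# closed-form kernel: N(mu_hat, sigma^2 / n)
var_n = sigma**2 / n
closed = np.exp(-0.5 * (theta - mu_hat) ** 2 / var_n) / np.sqrt(2 * np.pi * var_n)
print(np.max(np.abs(g_tilde - closed)))   # small: only grid/truncation error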
14: SUFFICIENT STATISTICS
EXPONENTIAL FAMILY
• This can be generalized to the exponential family:

$$p(x \mid \theta) = \alpha(x)\, \exp\left[a(\theta) + b(\theta)^t c(x)\right]$$

and:

$$p(D \mid \theta) = \exp\left[n\,a(\theta) + b(\theta)^t \sum_{k=1}^{n} c(x_k)\right] \prod_{k=1}^{n} \alpha(x_k) = g(s, \theta)\, h(D)$$

• Examples: many common densities belong to this family,
including the Gaussian, exponential, Poisson, binomial, and
gamma distributions.
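• A sketch for one member (mine, not from the lecture): the exponential density p(x|θ) = θe^(−θx) has α(x) = 1, a(θ) = ln θ, b(θ) = −θ, and c(x) = x, so s = Σxₖ is sufficient and the factored likelihood matches the direct product.

import numpy as np

rng = np.random.default_rng(2)
theta = 1.7
D = rng.exponential(1.0 / theta, size=40)   # numpy parameterizes by the mean

n, s = len(D), D.sum()                      # s = sum of c(x_k) = sum of x_k

direct = np.prod(theta * np.exp(-theta * D))      # product of the densities
factored = np.exp(n * np.log(theta) - theta * s)  # exp[n a(theta) + b(theta) s], h(D) = 1
print(np.isclose(direct, factored))               # True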
14: PROBLEMS OF DIMENSIONALITY
DIRECTIONS OF DISCRIMINATION
• If features are statistically independent, in theory we can
get excellent performance.
• Recall the Bayes error rate for a two-class multivariate
normal problem (equal priors, common covariance):

$$P(e) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du$$

where r² is the Mahalanobis distance between the means:

$$r^2 = (\mu_1 - \mu_2)^t \Sigma^{-1} (\mu_1 - \mu_2)$$
• For conditionally independent features (diagonal Σ):

$$r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2$$
• The most useful features are those for which the difference
of the means is large relative to the standard deviation.
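• A sketch of this computation (mine; the means and covariance are made-up numbers, and scipy.stats.norm.sf supplies the upper-tail integral):

import numpy as np
from scipy.stats import norm

mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

diff = mu1 - mu2
r2 = diff @ np.linalg.solve(Sigma, diff)  # (mu1-mu2)^t Sigma^-1 (mu1-mu2)
p_e = norm.sf(np.sqrt(r2) / 2.0)          # (1/sqrt(2 pi)) * integral from r/2 to inf of e^{-u^2/2} du
print(f"r = {np.sqrt(r2):.3f}, P(e) = {p_e:.4f}")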
14: PROBLEMS OF DIMENSIONALITY
COMPUTATIONAL COMPLEXITY
• “Big Oh” notation is used to describe complexity:
if f(x) = 2 + 3x + 4x², then f(x) has computational complexity O(x²).
• Recall the Gaussian discriminant function:

$$g(x) = -\frac{1}{2}(x - \hat{\mu})^t \hat{\Sigma}^{-1}(x - \hat{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega)$$

The per-term learning costs from n d-dimensional samples are,
left to right: O(dn) for the sample mean, O(nd²) for the
covariance estimate, O(1) for the constant, O(d²n) for the
determinant and inverse, and O(n) for the prior.

• Watch those constants of proportionality (e.g., O(nd²)).
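• A sketch of where those costs land (mine; the function names are hypothetical): training is dominated by the O(nd²) covariance estimate, while each evaluation of g(x) is O(d²) once Σ̂⁻¹ is cached.

import numpy as np

def train(X, prior):
    # X is n samples by d features: mean is O(dn), covariance is O(n d^2)
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    Sigma_inv = np.linalg.inv(Sigma)        # O(d^3), done once at training time
    _, logdet = np.linalg.slogdet(Sigma)    # ln |Sigma-hat|
    return mu, Sigma_inv, logdet, np.log(prior)

def g(x, mu, Sigma_inv, logdet, log_prior):
    d = len(mu)
    diff = x - mu
    quad = diff @ Sigma_inv @ diff          # O(d^2) per evaluation
    return -0.5 * quad - 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet + log_prior

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
params = train(X, 0.5)
print(g(np.zeros(5), *params))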


• If the number of data samples is inadequate, we can
experience overfitting (which implies poor generalization).
• Hence, later in the course, we will study ways to control
generalization and to smooth estimates of key parameters
such as the mean and covariance (see textbook).
