

Introduction to
Predictive Learning
Electrical and Computer Engineering
LECTURE SET 4
Statistical Learning Theory
OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)
Summary and Discussion
Objectives
Problems with philosophical approaches:
- they lack a quantitative description/characterization of ideas;
- no real predictive power (unlike the Natural Sciences);
- no agreement on basic definitions/concepts (unlike the Natural Sciences).

Goal: to introduce Predictive Learning as a scientific
discipline



Characteristics of Scientific Theory
Problem setting
Solution approach
Math proofs (technical analysis)
Constructive methods
Applications

Note: Problem Setting and Solution Approach are
independent (of each other)

History and Overview
SLT aka VC-theory (Vapnik-Chervonenkis)
Theory for estimating dependencies from finite
samples (predictive learning setting)
Based on the risk minimization approach
All main results were originally developed in the 1970s for
classification (pattern recognition). Why?
These results remained largely unknown.
Recent renewed interest due to practical success of
Support Vector Machines (SVM)
History and Overview (cont'd)
MAIN CONCEPTUAL CONTRIBUTIONS
Distinction between problem setting, inductive principle
and learning algorithms
Direct approach to estimation with finite data (KID
principle)
Math analysis of ERM (standard inductive setting)
Two factors responsible for generalization:
- empirical risk (fitting error)
- complexity (capacity) of approximating functions
Importance of VC-theory
Math results addressing the main question:
- under what general conditions does the ERM approach lead to
(good) generalization?
New approach to induction:
Predictive vs generative modeling (in classical statistics)

Connection to philosophy of science
- VC-theory developed for binary classification (pattern
recognition) ~ the simplest generalization problem
- natural sciences: from observations to scientific law
VC-theoretical results can be interpreted using general
philosophical principles of induction, and vice versa.
Inductive Learning Setting
The learning machine observes samples (x, y) and returns an
estimated response $\hat{y} = f(\mathbf{x}, w)$.
Two modes of inference: identification vs imitation.
Risk: $\int L(y, f(\mathbf{x}, w))\, dP(\mathbf{x}, y) \to \min$
The Problem of Inductive Learning
Given: finite training samples Z = {(x_i, y_i), i = 1, 2, ..., n},
choose from a given set of functions f(x, w) the one that best
approximates the true output (in the sense of risk minimization).
Concepts and Terminology:
- approximating functions f(x, w)
- (non-negative) loss function L(f(x, w), y)
- expected risk functional R(w)
Goal: find the function f(x, w_0) minimizing R(w) when the joint
distribution P(x, y) is unknown.
Empirical Risk Minimization
ERM principle in model-based learning:
- Model parameterization: f(x, w)
- Loss function: L(f(x, w), y)
- Estimate the risk from data:
$R_{emp}(w) = \frac{1}{n} \sum_{i=1}^{n} L(f(\mathbf{x}_i, w), y_i)$
- Choose w* that minimizes $R_{emp}$

Statistical Learning Theory developed from the theoretical analysis
of the ERM principle under finite-sample settings.
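As a concrete illustration (added here; the slides contain no code), a minimal numpy sketch of ERM for squared loss with a linear-in-parameters model. The data, seed, and function names are assumptions for the demo:

```python
import numpy as np

# Empirical risk for squared loss: R_emp(w) = (1/n) * sum_i L(f(x_i, w), y_i)
def empirical_risk(w, X, y):
    residuals = X @ w - y            # linear parameterization f(x, w) = <x, w>
    return np.mean(residuals ** 2)

# For squared loss and a linear model the ERM solution w* has a closed form
# (least squares); in general, w* is found by numerical optimization.
def erm_solution(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=30)

w_star = erm_solution(X, y)
print("R_emp(w*) =", empirical_risk(w_star, X, y))  # small, but a biased estimate of R(w*)
```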
Probabilistic Modeling vs ERM
Probabilistic Modeling vs ERM: Example
[Figure: two-class data in the (x1, x2) plane]
Known class distributions → the optimal decision boundary
Probabilistic Approach
Estimate the parameters of the Gaussian class distributions, and
plug them into the quadratic decision boundary:
[Figure: estimated quadratic decision boundary in the (x1, x2) plane]
ERM Approach
Quadratic and linear decision boundaries estimated via minimization
of squared loss; a sketch contrasting this with the probabilistic
approach follows below.
[Figure: quadratic and linear decision boundaries in the (x1, x2) plane]
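The following sketch (added for illustration) contrasts the two approaches on synthetic two-class Gaussian data: the probabilistic approach estimates class densities and plugs them into the decision rule, while ERM fits a linear decision function by minimizing squared loss directly. All parameters, seeds, and helper names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X0 = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], n)   # class -1
X1 = rng.multivariate_normal([3, 3], [[2, 1], [1, 2]], n)   # class +1
X = np.vstack([X0, X1])
y = np.hstack([-np.ones(n), np.ones(n)])

# Probabilistic approach: plug estimated Gaussian parameters into the
# (quadratic) discriminant g(x) = log p1(x) - log p0(x).
def gaussian_log_density(X, mu, cov):
    d = X - mu
    return (-0.5 * np.sum(d @ np.linalg.inv(cov) * d, axis=1)
            - 0.5 * np.log(np.linalg.det(cov)))

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
cov0, cov1 = np.cov(X0.T), np.cov(X1.T)
plugin_pred = np.sign(gaussian_log_density(X, mu1, cov1)
                      - gaussian_log_density(X, mu0, cov0))

# ERM approach: fit a linear decision function by least squares.
Xa = np.hstack([X, np.ones((2 * n, 1))])        # append a bias column
w, *_ = np.linalg.lstsq(Xa, y, rcond=None)
erm_pred = np.sign(Xa @ w)

print("plug-in (probabilistic) training error:", np.mean(plugin_pred != y))
print("ERM (linear, squared loss) training error:", np.mean(erm_pred != y))
```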
Estimation of multivariate functions
Is it possible to estimate a function from finite data?
Simplified problem: estimation of an unknown continuous function
from noise-free samples.
Many results from function approximation theory:
To estimate a d-dimensional function accurately, one needs on the
order of n^d data points.
For example, if 3 points are needed to estimate a 2nd-order
polynomial for d = 1, then 3^10 points are needed to estimate a
2nd-order polynomial in 10-dimensional space.
Similar results hold in signal processing.
There are never enough data points to estimate multivariate
functions in most practical applications (image recognition,
genomics, etc.).
For multivariate function estimation, the number of free parameters
increases exponentially with problem dimensionality
(the Curse of Dimensionality).
Properties of high-dimensional data
Sparse data looks like a porcupine: the volume of a unit sphere
inscribed in a d-dimensional cube gets smaller, even as the volume
of the d-cube grows exponentially.
- A point is closer to an edge than to another point.
- Pairwise distances between points become nearly the same.
- The intuition behind kernel (local) methods no longer holds
(both effects are illustrated in the sketch below).

How is generalization possible, in spite of the curse of
dimensionality?
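A small numeric sketch (added for illustration, not part of the slides) of these two effects, assuming uniform data in the unit cube; sample sizes and dimensions are arbitrary choices:

```python
import numpy as np
from math import gamma, pi

# Effect 1: the unit-radius ball inscribed in the cube [-1, 1]^d occupies a
# vanishing fraction of the cube's volume 2^d as d grows.
def ball_volume(d, r=1.0):
    return pi ** (d / 2) / gamma(d / 2 + 1) * r ** d

for d in (2, 5, 10, 20):
    print(f"d={d:2d}: ball/cube volume ratio = {ball_volume(d) / 2 ** d:.2e}")

# Effect 2: pairwise distances concentrate (relative spread shrinks with d).
rng = np.random.default_rng(2)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(50, d))
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dist = dist[np.triu_indices(50, k=1)]
    print(f"d={d:4d}: std/mean of pairwise distances = {dist.std() / dist.mean():.3f}")
```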

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)
Summary and Discussion
Keep-It-Direct Principle
The goal of learning is generalization rather than
estimation of true function (system identification)


Keep-It-Direct Principle (Vapnik, 1995)
Do not solve an estimation problem of interest by
solving a more general (harder) problem as an
intermediate step
Good predictive model reflects some properties of
unknown distribution P(x,y)
Since model estimation with finite data is ill-posed, one should
never try to solve a more general problem than required by the
given application.
Importance of formalizing application requirements as a learning
problem:
$\int L(y, f(\mathbf{x}, w))\, dP(\mathbf{x}, y) \to \min$
Learning vs System Identification
Consider the regression problem $y = g(\mathbf{x}) + \delta$,
where the unknown target function is $g(\mathbf{x}) = E(y|\mathbf{x})$.

Goal 1: Prediction
$R(w) = \int (y - f(\mathbf{x}, w))^2\, dP(\mathbf{x}, y) \to \min$

Goal 2: Function Approximation (system identification)
$R(w) = \int (f(\mathbf{x}, w) - g(\mathbf{x}))^2\, d\mathbf{x} \to \min$
or $f(\mathbf{x}, w) \approx E(y|\mathbf{x})$

Admissible models: algebraic polynomials.
Purpose of the comparison: to contrast goals (1) and (2).
NOTE: most applications assume Goal 2, i.e.,
Noisy Data ~ true signal + noise
Empirical Comparison
Target function: sine-squared, $g(x) = \sin^2(2\pi x)$, $x \in [0, 1]$
Input distribution: non-uniform Gaussian pdf
Additive Gaussian noise with standard deviation 0.1
[Figure: the target function $\sin^2(2\pi x)$ on [0, 1]]
Empirical Comparison (cont'd)
Model selection: use separate data sets
- training: for parameter estimation
- validation: for selecting the polynomial degree
- test: for estimating the prediction risk (MSE)

The validation set is generated differently in order to contrast (1) and (2):
- Predictive Learning (1) ~ Gaussian
- Function Approximation (2) ~ uniform fixed sampling

Training + test data ~ Gaussian
Training set size: 30; validation set size: 30
(a simulation sketch of this comparison follows below)
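A simulation sketch (added here) of this comparison under stated assumptions: the Gaussian input density is taken as N(0.5, 0.2²) clipped to [0, 1], the "approximation" validation set uses noise-free uniform fixed sampling, and all names and seeds are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def target(x):
    return np.sin(2 * np.pi * x) ** 2

def gaussian_x(n):                       # non-uniform input density on [0, 1]
    return np.clip(rng.normal(0.5, 0.2, n), 0.0, 1.0)

def val_mse(coef, x, y):
    return np.mean((np.polyval(coef, x) - y) ** 2)

x_tr = gaussian_x(30)
y_tr = target(x_tr) + 0.1 * rng.normal(size=30)

# Setting (1), prediction: validation data drawn from the same Gaussian input pdf.
x_v1 = gaussian_x(30); y_v1 = target(x_v1) + 0.1 * rng.normal(size=30)
# Setting (2), function approximation: uniform fixed sampling of the input domain.
x_v2 = np.linspace(0, 1, 30); y_v2 = target(x_v2)

for name, xv, yv in [("prediction (1)", x_v1, y_v1), ("approximation (2)", x_v2, y_v2)]:
    fits = {d: np.polyfit(x_tr, y_tr, d) for d in range(1, 11)}
    best = min(fits, key=lambda d: val_mse(fits[d], xv, yv))
    print(f"{name}: selected polynomial degree = {best}")
```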



Regression estimates (2 typical realizations of the data):
[Figure: regression estimates for two data realizations]
Dotted line ~ estimate obtained using predictive learning
Dashed line ~ estimate via function approximation setting


Conclusion
The goal of prediction (1) is different from (and less demanding
than) the goal of estimating the true target function (2)
everywhere in the input space.
The curse of dimensionality applies to the system identification
setting (2), but may not hold under the predictive setting (1).
The two settings coincide if the input distribution is uniform
(e.g., in signal and image denoising applications).
Philosophical Interpretation of KID
Interpretation of predictive models:
- Realism ~ objective truth (hidden in Nature)
- Instrumentalism ~ a creation of the human mind (imposed on the
data) → favored by KID
Objective evaluation is still possible (via a prediction risk
reflecting application needs) → Natural Science
Methodological implications:
- importance of good learning formulations (asking the right
question)
- this accounts for ~80% of success in applications

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)
Summary and Discussion
VC-theory has 4 parts:
1. Analysis of consistency/convergence of ERM:
$R_{emp}(w) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, w)) \to \min$
2. Generalization bounds
3. Inductive principles (for finite samples)
4. Constructive methods (learning algorithms) for implementing (3)
NOTE: (1) → (2) → (3) → (4)
Consistency/Convergence of ERM
The Empirical Risk is known, but the Expected Risk is unknown.
Asymptotic consistency requirement: under what (general) conditions
will models providing minimal Empirical Risk also provide minimal
Prediction Risk, as the number of samples grows large?

Why is asymptotic analysis needed?
- it helps to develop useful concepts;
- necessary and sufficient conditions ensure that VC-theory is
general and cannot be improved.
Consistency of ERM
Convergence of the empirical risk to the expected risk does not
imply consistency of ERM.
Models estimated via ERM (w*) are always biased estimates of the
functions minimizing the true risk:
$R_{emp}(w_n^*) < R(w_n^*)$
Conditions for Consistency of ERM
Main insight: consistency is not possible without restricting the
set of possible models.
Example: the 1-nearest-neighbor classification method.
- Is it consistent?

Consider binary decision functions (classification).
How can we measure their flexibility, or their ability to explain
the training data (for binary classification)?
This complexity index for indicator functions:
- is independent of the unknown data distribution;
- measures the capacity of a set of possible models, rather than
characteristics of the true model.

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)
Summary and Discussion
SHATTERING
Linear indicator functions can split 3 data points in 2D into all
2^3 = 8 possible binary partitions.

If a set of n samples can be separated by a set of functions in all
2^n possible ways, this sample is said to be shattered (by the set
of functions).
Shattering ~ a set of models can explain a given sample of size n
(for all possible labelings); see the sketch below.
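An illustrative check (added here, not from the slides) of the shattering claim: a labeling is linearly separable exactly when the linear program y_i (w · x_i + b) ≥ 1 is feasible. The point configurations and helper names are assumptions for the demo, and scipy is used for the LP:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, labels):
    # Feasibility LP: does some (w, b) satisfy y_i * (w . x_i + b) >= 1 ?
    n, d = X.shape
    A = -labels[:, None] * np.hstack([X, np.ones((n, 1))])  # -y_i [x_i, 1] <= -1
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

X3 = np.array([[0., 0.], [1., 0.], [0., 1.]])   # 3 points in general position
X4 = np.vstack([X3, [[1., 1.]]])                # 4 points (XOR configuration)

for X in (X3, X4):
    n = len(X)
    shattered = all(linearly_separable(X, np.array(s))
                    for s in itertools.product([-1.0, 1.0], repeat=n))
    print(f"{n} points shattered by linear indicator functions: {shattered}")
# Expected: 3 points -> True (all 8 labelings), 4 points -> False, i.e. h = 3
```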


VC DIMENSION
Definition: a set of functions has VC-dimension h if there exist h
samples that can be shattered by this set of functions, but there
are no h + 1 samples that can be shattered.

VC-dimension h = 3 for linear indicator functions in 2D
(h = d + 1 for linear functions in d dimensions).
The VC-dimension is a positive integer (a combinatorial index).
What is the VC-dimension of the 1-nearest-neighbor classifier?






VC-dimension and Consistency of ERM
The VC-dimension is infinite if, for every n, a sample of size n
can be split in all 2^n possible ways
(in this case, no valid generalization is possible).
A finite VC-dimension gives necessary and sufficient conditions for:
(1) consistency of ERM-based learning
(2) a fast rate of convergence
(these conditions are distribution-independent)

Interpretation of the VC-dimension via falsifiability: functions
with small VC-dimension can be easily falsified.

VC-dimension and Falsifiability
A set of functions has VC-dimension h if
(a) it can explain (shatter) a set of h samples
~ there exist h samples that cannot falsify it, and
(b) it cannot shatter h + 1 samples
~ any h + 1 samples falsify this set.

Finiteness of the VC-dimension is a necessary and sufficient
condition for generalization
(for any learning method based on ERM).

Recall Occam's Razor:
Main problem in predictive learning:
- complexity control (model selection)
- how to measure complexity?
Interpretation of Occam's razor (in Statistics):
- Entities ~ model parameters
- Complexity ~ degrees-of-freedom
- Necessity ~ explaining (fitting) the available data
Hence, model complexity = number of parameters (DoF).
This is consistent with the classical statistical view:
learning = function approximation / density estimation.

Philosophical Principle of VC-falsifiability
Occam's Razor: select the model that explains the available data
and has the smallest number of free parameters (entities).

VC-theory: select the model that explains the available data and
has low VC-dimension (i.e., can be easily falsified).
This yields a new principle of VC-falsifiability.
Calculating the VC-dimension

How to estimate the VC-dimension (for a given set of
functions)?

Apply the definition (via shattering) to derive analytic estimates
→ this works for simple sets of functions.

Generally, such analytic estimates are not possible for complex
nonlinear parameterizations (i.e., for practical machine learning
and statistical methods).


Example: VC-dimension of spherical indicator functions.
Consider spherical decision surfaces in a d-dimensional x-space,
parameterized by a center c and a radius r:
$f(\mathbf{x}, \mathbf{c}, r) = I(\|\mathbf{x} - \mathbf{c}\|^2 \le r^2)$
In a 2-dimensional space (d = 2) there exist 3 points that can be
shattered, but 4 points cannot be shattered → h = 3.

Example: VC-dimension of a linear combination of fixed basis
functions (e.g., polynomials, Fourier expansion, etc.).
Assuming the basis functions are linearly independent, the
VC-dimension equals the number of basis functions (free parameters).
Example: a single parameter but infinite VC-dimension (demonstrated
in the sketch below):
$f(x, w) = I(\sin(wx) > 0)$
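A small demonstration (added here, not part of the slides) of the classical construction behind this example: the points x_i = 10^(-i) can be assigned any labels by a suitable choice of the single frequency parameter w. The choice of n = 5 and the formula for w follow the standard textbook argument:

```python
import numpy as np
from itertools import product

# Points x_i = 10^-i are shattered by f(x, w) = I(sin(wx) > 0):
# for labels y_i in {0, 1}, choose w = pi * (1 + sum_i (1 - y_i) * 10^i).
n = 5
x = 10.0 ** -np.arange(1, n + 1)

ok = True
for y in product([0, 1], repeat=n):
    w = np.pi * (1 + sum((1 - yi) * 10 ** (i + 1) for i, yi in enumerate(y)))
    pred = (np.sin(w * x) > 0).astype(int)
    ok = ok and np.array_equal(pred, np.array(y))
print("all", 2 ** n, "labelings realized:", ok)  # one free parameter, yet h is infinite
```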

Example: wide linear decision boundaries.
Consider linear functions $D(\mathbf{x}) = (\mathbf{w} \cdot \mathbf{x}) + b$
such that the distance between $D(\mathbf{x})$ and the closest data
sample is larger than a given value $\Delta$.
Then the VC-dimension depends on the width parameter $\Delta$,
rather than on the dimensionality d (as in ordinary linear models):
$h \le \min\left(\left\lceil R^2 / \Delta^2 \right\rceil, d\right) + 1$
where R is the radius of the smallest sphere containing the data.
A linear combination of fixed basis functions
$f(\mathbf{x}, \mathbf{w}, w_0) = I\left(\sum_{i=1}^{m} w_i g_i(\mathbf{x}) + w_0 > 0\right)$
is equivalent to a linear function in an m-dimensional space
→ VC-dimension = m + 1
(this assumes linear independence of the basis functions)

In general, analytic estimation of the VC-dimension is hard.

The VC-dimension can be
- equal to DoF
- larger than DoF
- smaller than DoF
VC-dimension vs number of parameters
The VC-dimension can be equal to DoF (the number of parameters).
Example: linear estimators.

The VC-dimension can be smaller than DoF.
Example: penalized estimators.

The VC-dimension can be larger than DoF.
Examples: feature selection; $f(x, w) = I(\sin(wx) > 0)$.

VC-dimension for Regression Problems
The VC-dimension was defined for indicator functions.

It can be extended to real-valued functions, e.g., a third-order
polynomial for univariate regression:
$f(x, \mathbf{w}, b) = w_3 x^3 + w_2 x^2 + w_1 x + b$
linear parameterization → VC-dimension = 4

Qualitatively, the VC-dimension ~ the ability to fit (or explain)
finite training data for regression.
This can also be related to falsifiability.
Example: what is the VC-dimension of kNN regression?
Ten training samples from
$y = x^2 + 0.1x + N(0, \sigma^2)$, where $\sigma^2 = 0.25$
Using k-nn regression with k = 1 and k = 4 (see the sketch below):
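An illustrative numpy sketch (added here) of this example: k = 1 interpolates the ten training points exactly, i.e. it can explain any labeling of them (suggesting a very large capacity), while k = 4 averages neighbors and cannot. The target follows the slide's (partially reconstructed) formula; the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(size=10))
y = x ** 2 + 0.1 * x + rng.normal(0.0, 0.5, size=10)  # noise with sigma^2 = 0.25

def knn_predict(x_query, x_train, y_train, k):
    dist = np.abs(x_query[:, None] - x_train[None, :])
    idx = np.argsort(dist, axis=1)[:, :k]     # indices of the k nearest neighbors
    return y_train[idx].mean(axis=1)

for k in (1, 4):
    fit = knn_predict(x, x, y, k)
    print(f"k={k}: training MSE = {np.mean((fit - y) ** 2):.4f}")
# k=1 gives zero empirical risk (perfect interpolation); k=4 does not.
```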
OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)
Summary and Discussion
Recall consistency of ERM.
Two types of VC-bounds:
(1) How close is the empirical risk $R_{emp}(w^*)$ to the true risk $R(w^*)$?
(2) How close is the empirical risk to the minimal possible risk?
Generalization Bounds
Bounds for learning machines (implementing ERM) evaluate the
difference between the (unknown) true risk and the known empirical
risk, as a function of sample size n and of general properties of
the admissible models (their VC-dimension).

Classification: the following bound holds with probability $1 - \eta$
for all approximating functions:
$R(w) < R_{emp}(w) + \Phi\left(n/h, \ln\eta/n\right)$
where $\Phi$ is called the confidence interval.

Regression: the following bound holds with probability $1 - \eta$
for all approximating functions:
$R(w) < R_{emp}(w) \, / \, \left(1 - \sqrt{\varepsilon}\right)_{+}$
where
$\varepsilon = \varepsilon(n/h, \ln\eta/n) = a_1 \dfrac{h\left(\ln(a_2 n/h) + 1\right) - \ln(\eta/4)}{n}$
Practical VC Bound for regression
A practical regression bound is obtained by setting the confidence
level to $\eta = \min(4/\sqrt{n}, 1)$ and the theoretical constants
to $a_1 = a_2 = 1$:
$R(h) \le R_{emp}(h)\left(1 - \sqrt{p - p \ln p + \frac{\ln n}{2n}}\right)_{+}^{-1}$, where $p = h/n$
This bound can be used for model selection (examples given later; a
code sketch follows below).
Compare to the analytic bounds (SC, FPE) in Lecture Set 2.
Analysis (of the denominator) shows that
- h < 0.8 n for any estimator;
- in practice, h < 0.5 n for any estimator.
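A minimal sketch (added for illustration) of analytic model selection with this bound: the VC penalization factor multiplies the empirical MSE, and the polynomial degree minimizing the bound is selected. The function names, the synthetic data, and the use of h = degree + 1 for a fixed linear parameterization are illustrative assumptions:

```python
import numpy as np

def vc_penalty(h, n):
    # Practical VC factor: R(h) <= R_emp(h) / (1 - sqrt(p - p*ln(p) + ln(n)/(2n)))_+
    p = h / n
    arg = 1.0 - np.sqrt(p - p * np.log(p) + np.log(n) / (2 * n))
    return np.inf if arg <= 0 else 1.0 / arg

rng = np.random.default_rng(5)
n = 25
x = rng.uniform(size=n)                                  # 25 noisy samples
y = np.sin(2 * np.pi * x) ** 2 + 0.1 * rng.normal(size=n)

best = None
for deg in range(1, 11):
    coef = np.polyfit(x, y, deg)
    r_emp = np.mean((np.polyval(coef, x) - y) ** 2)      # empirical risk (MSE)
    bound = r_emp * vc_penalty(h=deg + 1, n=n)           # h = DoF for a fixed basis
    if best is None or bound < best[1]:
        best = (deg, bound)
print("degree selected by the VC bound:", best[0])       # no resampling needed
```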
VC Regression Bound for model selection
The VC-bound can be used for analytic model selection
(if the VC-dimension is known).
Example: polynomial regression for estimating the Sine-Squared
target function from 25 noisy samples.

Optimal model found: 6th-degree polynomial
(no resampling needed)
[Figure: 25 noisy samples, the target function, and the selected
6th-degree polynomial fit on [0, 1]]
Modeling pure noise with x in [0, 1] via polynomial regression;
sample size n = 30, noise $\sigma = 1$.

Comparison of different model selection methods (fpe, gcv, vc, cv):
- prediction risk (MSE)
- selected DoF (~ h)
[Figure: box plots of prediction risk (MSE, log scale) and selected
degrees of freedom for the fpe, gcv, vc, and cv methods]
OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)
Summary and Discussion
Structural Risk Minimization
Analysis of the generalization bound
$R(w) < R_{emp}(w) + \Phi(n/h, \ln\eta/n)$
suggests that when n/h is large, the term $\Phi$ is small, so that
$R(w) \approx R_{emp}(w)$.
This leads to the parametric modeling approach (ERM).
When n/h is not large (say, less than 20), both terms in the
right-hand side of the VC-bound need to be minimized
→ make the VC-dimension a controlling variable.
SRM = a formal mechanism for controlling model complexity.
The set of admissible models $f(\mathbf{x}, w)$ has a nested structure
$S_1 \subset S_2 \subset \dots \subset S_k \subset \dots$
such that $h_1 \le h_2 \le \dots \le h_k \le \dots$
→ the structure formally defines a complexity ordering.
Structural Risk Minimization
An upper bound on the true risk and the empirical risk, as a
function of VC-dimension h (for fixed sample size n)

SRM vs ERM modeling
SRM Approach
Use the VC-dimension as a controlling parameter for minimizing the
VC-bound:
$R(w) < R_{emp}(w) + \Phi(n/h)$

Two general strategies for implementing SRM:

1. Keep $\Phi(n/h)$ fixed and minimize $R_{emp}(w)$
(most statistical and neural network methods)

2. Keep $R_{emp}(w)$ fixed and minimize $\Phi(n/h)$
(Support Vector Machines)
Common SRM structures
Dictionary structure
A set of algebraic polynomials
$f_m(x, \mathbf{w}, b) = \sum_{i=1}^{m} w_i x^i + b$
is a structure, since $f_1 \subset f_2 \subset \dots \subset f_k \subset \dots$

More generally,
$f_m(\mathbf{x}, \mathbf{w}, V) = \sum_{i=1}^{m} w_i g(\mathbf{x}, \mathbf{v}_i) + b$
where $g(\mathbf{x}, \mathbf{v}_i)$ is a set of basis functions (a dictionary).

The number of terms (basis functions) m specifies an element of the
structure.
For fixed basis functions, VC-dimension ~ the number of parameters $w_i$.
Common SRM structures
Feature selection (aka subset selection)
Consider sparse polynomials of degree m:
for m = 1: $f(x, w, b, k_1) = w x^{k_1} + b$
for m = 2: $f(x, \mathbf{w}, b, k_1, k_2) = w_1 x^{k_1} + w_2 x^{k_2} + b$
etc.
Each monomial is a feature. The goal is to select the set of m
features providing minimal empirical risk (MSE).
This is a structure, since $f_1 \subset f_2 \subset \dots \subset f_m \subset \dots$

More generally,
$f_m(\mathbf{x}, \mathbf{w}, V) = \sum_{i=1}^{m} w_i g(\mathbf{x}, \mathbf{v}_i)$
where the m basis functions are selected from a (large) set of M
functions.
Note: this requires nonlinear optimization, and the VC-dimension is
unknown (a small selection sketch follows below).
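An illustrative sketch (added here) of the feature selection structure: for each m, the subset of m monomials with the smallest empirical MSE is found by exhaustive search. The synthetic target follows the (reconstructed) regression example later in this set; names and the seed are arbitrary:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
x = rng.uniform(size=40)
y = 0.8 * np.sin(2 * np.pi * x) + 0.2 * x ** 2 - 0.5 * x + 0.05 * rng.normal(size=40)

def subset_mse(x, y, powers):
    # Least-squares fit of b + sum_i w_i * x^{k_i} for the chosen set of powers.
    A = np.column_stack([x ** k for k in powers] + [np.ones_like(x)])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((A @ w - y) ** 2)

# Element m of the structure: the best subset of m monomials from degrees 1..5.
for m in (1, 2, 3):
    best = min(combinations(range(1, 6), m), key=lambda p: subset_mse(x, y, p))
    print(f"m={m}: selected powers {best}, MSE = {subset_mse(x, y, best):.5f}")
```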
Common SRM structures
Penalization
Consider algebraic polynomials of fixed degree:
$f(x, \mathbf{w}) = \sum_{i=0}^{10} w_i x^i$, where $\|\mathbf{w}\|^2 \le c_k$

For each (positive) value $c_k$, this set of functions specifies an
element of a structure:
$S_k = \{ f(x, \mathbf{w}) : \|\mathbf{w}\|^2 \le c_k \}$, with $c_1 < c_2 < c_3 < \dots$
Minimization of the empirical risk (MSE) on each element $S_k$ of
the structure is a constrained minimization problem.
This optimization problem can be equivalently stated as minimization
of the penalized empirical risk functional
$R_{pen}(w, \lambda_k) = R_{emp}(w) + \lambda_k \|\mathbf{w}\|^2$
where the choice of $\lambda_k \sim 1/c_k$.
Note: the VC-dimension is unknown (a ridge-style sketch follows below).
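A minimal sketch (added for illustration) of the penalization structure in its equivalent ridge form, assuming squared loss and a fixed degree-10 polynomial basis; the lambda grid, data, and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(size=40)
y = 0.8 * np.sin(2 * np.pi * x) + 0.2 * x ** 2 - 0.5 * x + 0.05 * rng.normal(size=40)

A = np.column_stack([x ** i for i in range(11)])   # fixed degree-10 polynomial

def penalized_erm(A, y, lam):
    # Minimize R_emp(w) + lam * ||w||^2  (ridge form of the penalized functional)
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * len(y) * np.eye(d), A.T @ y)

# Decreasing lambda ~ increasing c_k: each lambda picks one element S_k.
for lam in (1e-1, 1e-3, 1e-5):
    w = penalized_erm(A, y, lam)
    r_emp = np.mean((A @ w - y) ** 2)
    print(f"lambda={lam:g}: ||w||^2 = {w @ w:12.2f}, R_emp = {r_emp:.5f}")
```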
Example: SRM structures for regression
Regression data set:
- x-values ~ uniformly sampled in [0, 1]
- y-values ~ target function $y = 0.8\sin(2\pi x) + 0.2x^2 - 0.5x$
- additive Gaussian noise with standard deviation 0.05
Experimental set-up:
- training set ~ 40 samples
- validation set ~ 40 samples (for model selection)
SRM structures defined on algebraic polynomials:
- dictionary (polynomial degrees 1 to 10)
- penalization (fixed degree-10 polynomial)
- sparse polynomials (degree 1 to 5)
Estimated models using different SRM structures:
- dictionary:
$y = 0.4078 + 6.4198x - 68.2162x^2 + 163.7679x^3 - 158.3952x^4 + 55.9565x^5$
- penalization: $\lambda = 1.013 \times 10^{-5}$
- sparse polynomial:
$y = 0.6186 - 22.7337x^2 + 41.1772x^3 - 19.2736x^4$
Visual results: target function ~ red line, feature selection ~
black solid, dictionary ~ green, penalization ~ yellow line
[Figure: estimated models and the target function on [0, 1]]
SRM Summary
SRM structure ~ a complexity ordering on a set of admissible models
(approximating functions).
Many different structures can be defined on the same set of
approximating functions (possible models).
How to choose the best structure?
- depends on the application data
- VC-theory cannot provide the answer
SRM = a mechanism for complexity control:
- selecting optimal complexity for a given data set
- a new measure of complexity: the VC-dimension
- model selection using analytic VC-bounds

OUTLINE of Set 4
Objectives and Overview
Inductive Learning Problem Setting
Keep-It-Direct Principle
Analysis of ERM
VC-dimension
Generalization Bounds
Structural Risk Minimization (SRM)
Summary and Discussion
Summary and Discussion: VC-theory
Methodology:
- learning problem setting (the KID principle)
- concepts (risk minimization, VC-dimension, structure)
Interpretation/evaluation of existing methods
Model selection using VC-bounds
New types of inference (TBD later)
What the theory cannot do:
- provide a formalization (for a given application)
- select a good structure
- there is always a gap between theory and applications
