Ravi Kumar
M. Hanmandlu
I. INTRODUCTION
Classification algorithms are among the most widely used and actively researched topics in statistical learning theory, with applications to a wide range of problems. Broadly speaking, a decision or prediction is formed from the available information, and a classification method is then employed for discrimination in new circumstances. The goal of any learning algorithm is to formulate a rule that generalizes from the given data to new situations in a reasonable way. Specifically, given a sample of training vectors \{(x_i, y_i);\ i = 1, \dots, l\}, our aim is to find a function h(x) = f(x, \alpha) that minimizes the expected risk

R(\alpha) = \int L(y, f(x, \alpha)) \, dF(x, y)        (2)

where, for classification, the loss is the 0-1 loss

L(y, f(x, \alpha)) = \begin{cases} 0 & \text{if } y = f(x, \alpha) \\ 1 & \text{if } y \neq f(x, \alpha) \end{cases}        (3)
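As a small illustration of (2) and (3), the following sketch evaluates the empirical counterpart of the risk, i.e. the average 0-1 loss on a finite sample; the tiny arrays are illustrative values only, not data from the paper.

```python
import numpy as np

# Illustrative labels y_i and predictions f(x_i, alpha) on a small sample.
y      = np.array([ 1, -1,  1,  1, -1])
f_pred = np.array([ 1, -1, -1,  1,  1])

# 0-1 loss of eqn. (3) for each sample, and the empirical risk:
# the sample average that approximates the integral in eqn. (2).
loss = (y != f_pred).astype(float)
print("empirical risk =", loss.mean())   # 2 errors out of 5 -> 0.4
```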
A. Intuition

Consider a linear discriminant function g(x) = w^T x + b. The corresponding classifier assumes the value +1 if (w^T x_i + b) \geq 0 and -1 otherwise. The functional margin of a training example (x_i, y_i) with respect to (w, b) is

f.m.:  v_i = y_i (w^T x_i + b)        (4)

A large functional margin indicates a confident and correct prediction, but it can be made arbitrarily large simply by rescaling the parameters: the prediction p_{w,b}(x) does not change at all if we replace (w, b) by (w / \|w\|, b / \|w\|). It is therefore natural to impose a normalization condition such as

\|w\|^2 = 1        (5)

For the whole training set, the functional margin is the smallest functional margin of the individual examples:

f.m.:  v = \min_{i=1,\dots,l} v_i        (6)

For example, we consider Figure 2 (obtained from the method implemented in this paper), which shows the separating hyperplane together with the margins of the individual points. The geometric margin d_i of a point x_i is its distance from the hyperplane w^T x + b = 0. The point x_i - d_i w / \|w\| lies on the hyperplane, so

w^T (x_i - d_i w / \|w\|) + b = 0,

which gives

d_i = y_i \{ (w / \|w\|)^T x_i + b / \|w\| \}        (7)

Finally, for a training set \{(x_i, y_i);\ i = 1, \dots, l\}, we define the geometric margin of (w, b) with respect to the set as

G.M.:  d = \min_{i=1,\dots,l} d_i        (8)

The maximum-margin hyperplane is then obtained by solving

\max_{d, w, b} \; d
s.t.  y_i (w^T x_i + b) \geq d, \quad i = 1, \dots, l, \qquad \|w\| = 1        (9)

Dropping the non-convex constraint \|w\| = 1 and working with the functional margin v, this is equivalent to

\max_{v, w, b} \; v / \|w\|        (10)
s.t.  y_i (w^T x_i + b) \geq v, \quad i = 1, \dots, l        (11)

Since (w, b) can be rescaled freely, we may fix the functional margin to one, i.e.

w^T x + b \geq 1 for y = +1,  and  w^T x + b \leq -1 for y = -1        (12)

Maximizing 1 / \|w\| is then the same as minimizing (1/2) \|w\|^2, which gives the hard-margin problem

\min_{w, b} \; \frac{1}{2} \|w\|^2
s.t.  y_i (w^T x_i + b) \geq 1, \quad i = 1, \dots, l

Introducing slack variables \xi_i \geq 0 to allow for non-separable data gives the soft-margin formulation

\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i
s.t.  y_i (w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, l        (13)
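As an illustration of these definitions, the following sketch computes the functional and geometric margins of a candidate hyperplane and evaluates the soft-margin objective of (13) on a toy data set; the data values and the variable names (X, y, w, b, C) are assumptions made for this example only.

```python
import numpy as np

# Toy linearly separable data: two points per class (illustrative values only).
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# A candidate hyperplane w^T x + b = 0.
w = np.array([1.0, 1.0])
b = 0.0

# Functional margins v_i = y_i (w^T x_i + b), eqn. (4), and the set margin, eqn. (6).
v = y * (X @ w + b)
print("functional margin v =", v.min())

# Geometric margins d_i = y_i ((w/||w||)^T x_i + b/||w||), eqn. (7), and the set margin, eqn. (8).
d = v / np.linalg.norm(w)
print("geometric margin d =", d.min())

# Soft-margin objective (1/2)||w||^2 + C * sum(xi_i), eqn. (13),
# with slacks xi_i = max(0, 1 - y_i (w^T x_i + b)).
C = 1.0
xi = np.maximum(0.0, 1.0 - v)
print("soft-margin objective =", 0.5 * np.dot(w, w) + C * xi.sum())
```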
The VC dimension h of a set of indicator functions Q(z, \alpha) is the largest number of vectors x_1, \dots, x_h that can be separated into two classes in all 2^h possible ways by functions of the set. For an indicator function Q(z, \alpha) and labels \delta, define the expected error

p(\alpha) = E \, | \delta - Q(z, \alpha) |        (14)

and its empirical estimate on a sample of size l,

\nu_l(\alpha) = \frac{1}{l} \sum_{i=1}^{l} | \delta_i - Q(z_i, \alpha) |        (15)

Uniform convergence of the empirical error to the expected error,

\lim_{l \to \infty} P \{ \sup_\alpha ( p(\alpha) - \nu_l(\alpha) ) > \varepsilon \} = 0,        (16)

holds whenever the VC dimension of the set of functions is finite. For a set of real-valued functions Q(z, \alpha), a corresponding set of indicator functions is obtained by thresholding,

Q^*(z, \alpha, \beta) = \theta( Q(z, \alpha) - \beta ),        (17)

where \theta(\cdot) is the step function and \beta ranges over the values taken by Q. The equality

\Phi^*(l / h) = P \{ \sup_\alpha ( p^*(\alpha) - \nu^*_l(\alpha) ) > \varepsilon \},        (18)

with

p^*(\alpha) = E[ Q^*(z, \alpha) ]   and   \nu^*_l(\alpha) = \frac{1}{l} \sum_{i=1}^{l} Q^*(z_i, \alpha),        (19)

shows that the probability of a large deviation depends on the sample size only through the ratio l / h.
To estimate h experimentally, we measure the maximal difference between the error rates obtained on two half-samples of size l with randomly assigned labels \delta_i \in \{0, 1\}. Since the values \delta_i and Q(x_i, \alpha) are binary,

| \delta_i - Q(x_i, \alpha) | = \delta_i^2 - 2 \delta_i Q(x_i, \alpha) + Q(x_i, \alpha)^2,

the difference between the two empirical error rates can be written as

\frac{1}{l} \sum_{i=1}^{l} ( \delta_i^2 - 2 \delta_i Q(x_i, \alpha) + Q(x_i, \alpha)^2 ) - \frac{1}{l} \sum_{i=l+1}^{2l} ( \delta_i^2 - 2 \delta_i Q(x_i, \alpha) + Q(x_i, \alpha)^2 ).

The terms that do not involve the labels cancel on average, and the label-dependent part is

- \frac{2}{l} \Big[ \sum_{i=1}^{l} \delta_i Q(x_i, \alpha) - \sum_{i=l+1}^{2l} \delta_i Q(x_i, \alpha) \Big],

so that maximizing the difference between the error rates over \alpha amounts to minimizing the empirical risk R(\alpha) on the merged set of 2l samples in which the labels of the second half are flipped. For a learning machine whose output is a real-valued function f(x), the thresholded indicators of (17) can change only at the observed output values f(x_i); as candidate thresholds we therefore take, for each of the 2l sample outputs, the pair of values

f(x_i)  and  f(x_i) - \varepsilon, \quad i = 1, \dots, 2l.
E. Estimation of the Parameter \beta

For the given real-valued function, all we need to estimate is the value of the threshold \beta within the range of the function in order to form an indicator function; with this indicator function we can match the theoretical curve to our experimental measurements and so obtain the VC dimension. In order to estimate the parameter \beta, let us denote the sequence of 2l samples by Z^{2l}. From the set Z^{2l} we obtain a new sequence of candidate threshold values,

\bar{Z}^{2l} = \{ f(x_i), \; f(x_i) - \varepsilon : \; i = 1, \dots, 2l \},        (29)

where \varepsilon is a small value, smaller than the least difference between any two values of the real function. From this exhaustive search we can easily find an optimal value of \beta, which provides the indicator function needed for the calculation of the maximum difference between the error rates measured on the two separate sets.

Once we have this value of \beta, we can use it to calculate the error rates on the two halves by changing the values of the labels for the first-half set. With

\nu_1(Z^{2l}, \alpha) = \frac{1}{l} \sum_{i=1}^{l} | \delta_i - Q(x_i, \alpha) |,
\nu_2(Z^{2l}, \alpha) = \frac{1}{l} \sum_{i=l+1}^{2l} | \delta_i - Q(x_i, \alpha) |,        (30)

\beta is chosen so that the difference \nu_1(Z^{2l}, \alpha) - \nu_2(Z^{2l}, \alpha) attains its maximum value. Using the experimental measurements of this maximal difference, we can then fit them to an analytic function that depends only on the VC dimension.
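To make the exhaustive search concrete, the following sketch picks the threshold \beta that maximizes the difference between the error rates on the two half-samples; the function names and the way the candidate thresholds are built follow the description above but are otherwise assumptions made for this illustration.

```python
import numpy as np

def best_threshold(scores, labels, eps=1e-6):
    """Exhaustive search for the threshold beta described in Section E.

    scores : real-valued outputs f(x_i) on the full set of 2l samples.
    labels : the (random) labels delta_i in {0, 1}.
    eps    : a value smaller than the least gap between distinct outputs.
    Returns the beta maximizing nu_1 - nu_2, the difference between the
    error rates on the first and second halves (eqn. (30)).
    """
    l = len(scores) // 2
    # Candidate thresholds: each observed output and a value just below it, eqn. (29).
    candidates = np.concatenate([scores, scores - eps])
    best_beta, best_diff = None, -np.inf
    for beta in candidates:
        pred = (scores > beta).astype(int)          # indicator Q*(z, alpha, beta), eqn. (17)
        err = np.abs(labels - pred)                 # per-sample 0-1 errors
        nu1, nu2 = err[:l].mean(), err[l:].mean()   # error rates on the two halves
        if nu1 - nu2 > best_diff:
            best_diff, best_beta = nu1 - nu2, beta
    return best_beta, best_diff
```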
F. Experimental Procedure

We present an overall summary of the method used in the calculation of the VC dimension. The following procedure summarizes it:

1) Generate a randomly labeled set of size 2l,
   Z^{2l} = \{ (x_1, \delta_1), (x_2, \delta_2), \dots, (x_{2l}, \delta_{2l}) \}.
2) Split it into two half-sets,
   Z_1^{l} = \{ (x_1, \delta_1), \dots, (x_l, \delta_l) \},
   Z_2^{l} = \{ (x_{l+1}, \delta_{l+1}), \dots, (x_{2l}, \delta_{2l}) \}.
3) Flip the labels of the second half,
   \bar{Z}_2^{l} = \{ (x_{l+1}, 1 - \delta_{l+1}), \dots, (x_{2l}, 1 - \delta_{2l}) \}.
4) Merge the two sets and train our learning machine to get the value of \beta from the procedure described above.
5) Separate the sets and flip the labels on the second set back again to calculate the maximal difference between the error rates on the two separate data sets.
6) Measure the difference between the error rates on the two sets,
   \xi(l) = \sup_\alpha ( \nu_1(Z^{2l}, \alpha) - \nu_2(Z^{2l}, \alpha) ).

The measurement is repeated for several sample sizes n, from which the effective VC dimension h is estimated by fitting the measured deviations at design points n_1, n_2, \dots, n_d to the analytic function \Phi(n / h):

h^* = \arg\min_h \sum_{i=1}^{d} ( \xi(n_i) - \Phi(n_i / h) )^2.        (31)

G. Results

Here we discuss the results obtained by using the above method for the linear discrimination function; the theoretical estimate of the VC dimension with n inputs is n + 1. We first present the experimental setup used and then the results obtained.

Experimental setup: we have taken 25 different datasets (generating random samples in each dataset) to conduct this experiment.
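As an illustration of steps 1)-6) of the procedure in Section F, the following sketch performs one measurement of \xi(l) using an ordinary least-squares linear classifier as the learning machine; the choice of classifier, the Gaussian inputs and all variable names are assumptions made for this example, not the exact setup of the experiments reported here.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_linear(X, d):
    """Least-squares linear classifier, used here as a stand-in learning machine."""
    A = np.hstack([X, np.ones((len(X), 1))])             # append a bias column
    coef, *_ = np.linalg.lstsq(A, 2.0 * d - 1.0, rcond=None)
    return lambda Z: (np.hstack([Z, np.ones((len(Z), 1))]) @ coef >= 0).astype(int)

def xi_measurement(n_features, l):
    """One measurement of xi(l), following steps 1)-6) above."""
    X = rng.normal(size=(2 * l, n_features))              # 1) random inputs, 2l samples
    d = rng.integers(0, 2, size=2 * l)                    # 1) random labels delta_i
    d_train = d.copy()
    d_train[l:] = 1 - d_train[l:]                         # 3) flip labels of the second half
    predict = train_linear(X, d_train)                    # 4) train on the merged set
    err = (predict(X) != d).astype(float)                 # 5) errors w.r.t. the original labels
    nu1, nu2 = err[:l].mean(), err[l:].mean()             # 6) error rates on the two halves
    return abs(nu1 - nu2)                                 # maximal deviation for this sample

# Average the measurement over several random datasets at one design point n_i.
print(np.mean([xi_measurement(n_features=10, l=20) for _ in range(25)]))
```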
We consider the minimax problem

\min_{x \in X} \max_{y \in Y} \phi(x, y),        (32)

where X and Y are closed convex sets and \phi is a convex-concave function defined over X \times Y. In particular, \phi(\cdot, y) : X \to \mathbb{R} is convex for every fixed y \in Y, and \phi(x, \cdot) : Y \to \mathbb{R} is concave for every fixed x \in X. A pair (x^*, y^*) \in X \times Y satisfying

\phi(x^*, y) \leq \phi(x^*, y^*) \leq \phi(x, y^*) \quad \text{for all } x \in X, \; y \in Y        (33)

is called a saddle point of the function \phi. In the next section, we see how to obtain the saddle point solution using a subgradient method.

A. Subgradient Algorithm for Saddle Point

We first describe the notation used in this section:
\phi : the min-max objective. We consider the minimax problem described by (32), written from now on in the variables (w, \lambda) as

\min_{w \in W} \max_{\lambda \in \Lambda} \phi(w, \lambda).        (34)

\gamma : the step size.
\nabla_w \phi(w_k, \lambda_k) : a subgradient of the convex function \phi(\cdot, \lambda_k) in w at w = w_k.
\nabla_\lambda \phi(w_k, \lambda_k) : a subgradient of the concave function \phi(w_k, \cdot) with respect to \lambda at \lambda = \lambda_k.

Starting from initial points w_0 \in W and \lambda_0 \in \Lambda, the iterates are generated, for k = 0, 1, \dots, by

w_{k+1} = \Pi_W [ w_k - \gamma \nabla_w \phi(w_k, \lambda_k) ],
\lambda_{k+1} = \Pi_\Lambda [ \lambda_k + \gamma \nabla_\lambda \phi(w_k, \lambda_k) ],        (35)

where \Pi_W and \Pi_\Lambda denote the Euclidean projections onto W and \Lambda. Throughout, the subgradients are assumed to be uniformly bounded (Assumption 1): \| \nabla_w \phi(w, \lambda) \| \leq L and \| \nabla_\lambda \phi(w, \lambda) \| \leq L for all w \in W and \lambda \in \Lambda.

Lemma 1:
Let the sequences \{w_k\} and \{\lambda_k\} be generated by (35). Then:
(a) For any w \in W and k \geq 0,

\| w_{k+1} - w \|^2 \leq \| w_k - w \|^2 - 2 \gamma ( \phi(w_k, \lambda_k) - \phi(w, \lambda_k) ) + \gamma^2 \| \nabla_w \phi(w_k, \lambda_k) \|^2.        (36)

(b) For any \lambda \in \Lambda and k \geq 0,

\| \lambda_{k+1} - \lambda \|^2 \leq \| \lambda_k - \lambda \|^2 + 2 \gamma ( \phi(w_k, \lambda_k) - \phi(w_k, \lambda) ) + \gamma^2 \| \nabla_\lambda \phi(w_k, \lambda_k) \|^2.        (37)

Proof.
(a) We start with

\| w_{k+1} - w \| = \| \Pi_W [ w_k - \gamma \nabla_w \phi(w_k, \lambda_k) ] - w \| \leq \| w_k - \gamma \nabla_w \phi(w_k, \lambda_k) - w \|,

where we used the non-expansiveness of the projection \Pi_W. Expanding the square,

\| w_{k+1} - w \|^2 \leq \| w_k - w \|^2 - 2 \gamma \nabla_w \phi(w_k, \lambda_k)^T (w_k - w) + \gamma^2 \| \nabla_w \phi(w_k, \lambda_k) \|^2.

Since \nabla_w \phi(w_k, \lambda_k) is a subgradient of the convex function \phi(\cdot, \lambda_k) at w_k, we have the inequality \nabla_w \phi(w_k, \lambda_k)^T (w_k - w) \geq \phi(w_k, \lambda_k) - \phi(w, \lambda_k), which gives (36).

(b) Similarly, for any \lambda \in \Lambda,

\| \lambda_{k+1} - \lambda \|^2 = \| \Pi_\Lambda [ \lambda_k + \gamma \nabla_\lambda \phi(w_k, \lambda_k) ] - \lambda \|^2 \leq \| \lambda_k - \lambda \|^2 + 2 \gamma \nabla_\lambda \phi(w_k, \lambda_k)^T (\lambda_k - \lambda) + \gamma^2 \| \nabla_\lambda \phi(w_k, \lambda_k) \|^2.

Since \nabla_\lambda \phi(w_k, \lambda_k) is a subgradient of the concave function \phi(w_k, \cdot) at \lambda_k, we have for all \lambda, \nabla_\lambda \phi(w_k, \lambda_k)^T (\lambda - \lambda_k) \geq \phi(w_k, \lambda) - \phi(w_k, \lambda_k). Hence, for any \lambda \in \Lambda and k \geq 0,

\| \lambda_{k+1} - \lambda \|^2 \leq \| \lambda_k - \lambda \|^2 + 2 \gamma ( \phi(w_k, \lambda_k) - \phi(w_k, \lambda) ) + \gamma^2 \| \nabla_\lambda \phi(w_k, \lambda_k) \|^2,

which is (37).

Lemma 2:
Let the two sequences generated by eqn. (35) be denoted \{w_k\} and \{\lambda_k\}, and let Assumption 1 hold. Further, we let the averaged iterates be

\bar{w}_k = \frac{1}{k} \sum_{i=0}^{k-1} w_i, \qquad \bar{\lambda}_k = \frac{1}{k} \sum_{i=0}^{k-1} \lambda_i.

We then have, for any k \geq 1:
(a) For any w \in W,

\frac{1}{k} \sum_{i=0}^{k-1} \phi(w_i, \lambda_i) - \phi(w, \bar{\lambda}_k) \leq \frac{1}{2 \gamma k} \| w_0 - w \|^2 + \frac{\gamma L^2}{2}.

(b) For any \lambda \in \Lambda,

\phi(\bar{w}_k, \lambda) - \frac{1}{k} \sum_{i=0}^{k-1} \phi(w_i, \lambda_i) \leq \frac{1}{2 \gamma k} \| \lambda_0 - \lambda \|^2 + \frac{\gamma L^2}{2}.

Proof.
(a) By Lemma 1(a), for each i \geq 0 and any w \in W,

2 \gamma ( \phi(w_i, \lambda_i) - \phi(w, \lambda_i) ) \leq \| w_i - w \|^2 - \| w_{i+1} - w \|^2 + \gamma^2 \| \nabla_w \phi(w_i, \lambda_i) \|^2.

By adding these relations over i = 0, \dots, k-1 and using the bound \| \nabla_w \phi(w_i, \lambda_i) \| \leq L, we get

\frac{1}{k} \sum_{i=0}^{k-1} ( \phi(w_i, \lambda_i) - \phi(w, \lambda_i) ) \leq \frac{1}{2 \gamma k} \| w_0 - w \|^2 + \frac{\gamma L^2}{2}.

Since the function \phi(w, \cdot) is concave in \lambda, for any fixed w \in W there holds

\phi(w, \bar{\lambda}_k) \geq \frac{1}{k} \sum_{i=0}^{k-1} \phi(w, \lambda_i), \quad \text{with } \bar{\lambda}_k = \frac{1}{k} \sum_{i=0}^{k-1} \lambda_i.

Combining the preceding two relations, we get part (a).

(b) Similarly, by Lemma 1(b), for each i \geq 0 and any \lambda \in \Lambda,

2 \gamma ( \phi(w_i, \lambda) - \phi(w_i, \lambda_i) ) \leq \| \lambda_i - \lambda \|^2 - \| \lambda_{i+1} - \lambda \|^2 + \gamma^2 \| \nabla_\lambda \phi(w_i, \lambda_i) \|^2.

By adding these relations over i = 0, \dots, k-1 and using \| \nabla_\lambda \phi(w_i, \lambda_i) \| \leq L, we get

\frac{1}{k} \sum_{i=0}^{k-1} ( \phi(w_i, \lambda) - \phi(w_i, \lambda_i) ) \leq \frac{1}{2 \gamma k} \| \lambda_0 - \lambda \|^2 + \frac{\gamma L^2}{2}.

Because the function \phi(\cdot, \lambda) is convex in w for any fixed \lambda, we have

\phi(\bar{w}_k, \lambda) \leq \frac{1}{k} \sum_{i=0}^{k-1} \phi(w_i, \lambda), \quad \text{with } \bar{w}_k = \frac{1}{k} \sum_{i=0}^{k-1} w_i.

Combining the preceding two relations, we get part (b).

Proposition:
Let (w^*, \lambda^*) be a saddle point of \phi, and let Assumption 1 hold, L being the subgradient bound. Then:
(a) The running average of the objective values converges to the saddle value \phi(w^*, \lambda^*) at the rate 1/k, within an error level that depends on the step size and the bound of the subgradients:

\Big| \frac{1}{k} \sum_{i=0}^{k-1} \phi(w_i, \lambda_i) - \phi(w^*, \lambda^*) \Big| \leq \frac{1}{2 \gamma k} \max( \| w_0 - w^* \|^2, \| \lambda_0 - \lambda^* \|^2 ) + \frac{\gamma L^2}{2}.

(b) \phi(\bar{w}_k, \bar{\lambda}_k) converges to \phi(w^*, \lambda^*) within the same error level, which depends on the step size and the bound of the subgradients, at the rate 1/k.

Proof.
(a) In Lemma 2, we take w = w^* and \lambda = \lambda^* in parts (a) and (b), respectively, and obtain for any k \geq 1

\frac{1}{k} \sum_{i=0}^{k-1} \phi(w_i, \lambda_i) - \phi(w^*, \bar{\lambda}_k) \leq \frac{1}{2 \gamma k} \| w_0 - w^* \|^2 + \frac{\gamma L^2}{2},

\phi(\bar{w}_k, \lambda^*) - \frac{1}{k} \sum_{i=0}^{k-1} \phi(w_i, \lambda_i) \leq \frac{1}{2 \gamma k} \| \lambda_0 - \lambda^* \|^2 + \frac{\gamma L^2}{2}.

By the saddle point relation (33), \phi(w^*, \bar{\lambda}_k) \leq \phi(w^*, \lambda^*) \leq \phi(\bar{w}_k, \lambda^*), so the first inequality gives an upper bound and the second a lower bound on the running average, implying part (a).

(b) Taking w = \bar{w}_k and \lambda = \bar{\lambda}_k in Lemma 2, parts (a) and (b), we get

\frac{1}{k} \sum_{i=0}^{k-1} \phi(w_i, \lambda_i) - \phi(\bar{w}_k, \bar{\lambda}_k) \leq \frac{1}{2 \gamma k} \| w_0 - \bar{w}_k \|^2 + \frac{\gamma L^2}{2},

\phi(\bar{w}_k, \bar{\lambda}_k) - \frac{1}{k} \sum_{i=0}^{k-1} \phi(w_i, \lambda_i) \leq \frac{1}{2 \gamma k} \| \lambda_0 - \bar{\lambda}_k \|^2 + \frac{\gamma L^2}{2}.

Combining these bounds with part (a) shows that \phi(\bar{w}_k, \bar{\lambda}_k) converges to \phi(w^*, \lambda^*) within the error level, which depends on the step size and the bound of the subgradient, at the rate of 1/k.
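A minimal sketch of iteration (35) on a toy convex-concave objective; the bilinear Lagrangian used here, the constraint sets and all variable names are assumptions made for this illustration and are not the SVM problem treated in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy convex-concave objective: the Lagrangian of  min 0.5*||w||^2  s.t.  A w <= b,
#   phi(w, lam) = 0.5*||w||^2 + lam^T (A w - b),  with  W = R^5 and Lambda = {lam >= 0}.
A = rng.normal(size=(3, 5))
b = rng.normal(size=3)

def grad_w(w, lam):    # (sub)gradient of phi with respect to w
    return w + A.T @ lam

def grad_lam(w, lam):  # (sub)gradient of phi with respect to lam
    return A @ w - b

gamma, K = 0.05, 5000                    # step size and number of iterations
w, lam = np.zeros(5), np.zeros(3)        # w_0 and lam_0
w_avg, lam_avg = np.zeros(5), np.zeros(3)

for k in range(K):
    w_avg += w / K                       # running averages of Lemma 2
    lam_avg += lam / K
    gw, gl = grad_w(w, lam), grad_lam(w, lam)
    w = w - gamma * gw                               # eqn. (35): descent step in w (W = R^5, no projection needed)
    lam = np.maximum(0.0, lam + gamma * gl)          # eqn. (35): ascent step, projection onto Lambda = {lam >= 0}

print("averaged primal iterate:", np.round(w_avg, 3))
print("constraint violation    :", np.round(np.maximum(A @ w_avg - b, 0.0), 3))
```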
Then we apply the following classification rule to a test sample x_0 with decision value f(x_0):

a) If f(x_0) \geq 1, then sample x_0 is classified to the positive sample class.
b) If f(x_0) \leq -1, then sample x_0 is classified to the negative sample class.
c) If -1 < f(x_0) < 1, then sample x_0 lies in the non-separable domain and is classified according to the risk decision rule applied over the following intervals:

I1 : [\eta_0^*, 1),   I2 : [0, \eta_0^*),   I3 : [-\eta_0^*, 0),   I4 : (-1, -\eta_0^*),

where \eta_0^* is an optimal probability threshold based on the risk decision rule. Based on the risk decision rule of ERM, we then calculate the numbers n_{ij} (i = 1, 2) of positive and negative samples falling in the different intervals.

B. Optimization Threshold
The threshold \eta_0^* is a critical parameter in the decision rule for any given optimization problem. For a given threshold \eta_0^*, there exist two types of errors: Type I errors, which are negative samples wrongly classified to the positive class, and Type II errors, which are positive samples wrongly classified to the negative class. Each type of error occurs with a probability weighted by the prior probability of the corresponding class, and the risk decision rule of ERM chooses \eta_0^* so as to minimize the expectation of the two types of errors.
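As a minimal sketch of this rule, the following snippet classifies a sample from its decision value f(x_0): rules a) and b) are applied outside the non-separable domain, and inside it the sample is assigned from the statistics of its interval. The interval statistics are summarized here by a simple majority flag, which is a simplified stand-in for the ERM risk decision rule of [14]; all names are illustrative assumptions.

```python
def interval(f_x0, eta0):
    """Interval of the non-separable domain (-1, 1) that f(x0) falls in."""
    if eta0 <= f_x0 < 1:
        return "I1"
    if 0 <= f_x0 < eta0:
        return "I2"
    if -eta0 <= f_x0 < 0:
        return "I3"
    return "I4"   # (-1, -eta0)

def classify(f_x0, eta0, positive_majority):
    """Rules a)-c): outside [-1, 1] use the decision value directly; inside,
    decide from the interval statistics (a simplified stand-in for the ERM rule)."""
    if f_x0 >= 1:
        return +1                                    # rule a)
    if f_x0 <= -1:
        return -1                                    # rule b)
    return +1 if positive_majority[interval(f_x0, eta0)] else -1   # rule c)

# Example usage with illustrative interval statistics n_ij summarized as majority flags.
stats = {"I1": True, "I2": True, "I3": False, "I4": False}
print(classify(0.3, eta0=0.5, positive_majority=stats))    # -> 1
print(classify(-0.7, eta0=0.5, positive_majority=stats))   # -> -1
```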
C. Results
Table 3 shows the results on different datasets obtained using the subgradient method combined with the above risk decision rule. The results are compared with those of the linear C-SVM with C = 1 on the same data sets as used for the previous results in Section 4.
VII. REFERENCES
[1] H. Drucker, C.J.C. Burges, A.J. Smola, V. Vapnik, Support
Vector Regression Machines, NIPS, pp. 155-161 (1996).
[2] A. Chambolle, T. Pock, A First-Order Primal-Dual
Algorithm for Convex Problems with Applications to
Imaging, Journal of Mathematical Imaging and Vision
40(1): pp.120-145(2011).
[3] I.Y. Zabotin, A subgradient method for finding a saddle
point of a convex-concave function, Issledovania po
prikladnoi matematike 15, pp. 6-12(1988).
[4] A.S. Nemirovski and D.B. Judin, Cesari convergence of the
gradient method of approximating saddle points of
convex-concave functions, Doklady Akademii Nauk
SSSR 239, pp. 1056-1059 (1978).
[5] G.M. Korpelevich, The extragradient method for finding
saddle points and other problems, Matekon 13, pp.35-49
(1977).
[6] A. Nedic, A. E. Ozdaglar, Approximate Primal Solutions
and Rate Analysis for Dual Subgradient Methods, SIAM
Journal on Optimization 19(4): pp. 1757-1780 (2009).
[7] T. Larsson, M. Patriksson, and A. Stromberg, Ergodic
results and bounds on the optimal value in subgradient
optimization, Operations Research Proceedings (P.
Kleinschmidt et al., ed.), Springer, pp. 30-35 (1995).
[8] G. Schwarz, "Estimating the dimension of a model,"
Annals of Statistics 6, pp. 461-464 (1978).
[9] N.V. Smirnov, Theory of probability and mathematical
statistics (Selected works), Nauka, Moscow (1970).
[10] K.J. Arrow, L. Hurwicz, and H. Uzawa, Studies in linear
and non-linear programming, Stanford University Press,
Stanford, CA (1958).
[11] D. Maistroskii, Gradient methods for finding saddle
points, Matekon 13, pp. 3-22 (1977).
[12] Hamed Masnadi-Shirazi and Nuno Vasconcelos, Risk
minimization, probability elicitation, and cost-sensitive
SVMs, Proceedings of International Conference on
Machine Learning (ICML), 2010.
[13] H. Uzawa, Iterative methods in concave programming,
Studies in Linear and Nonlinear Programming (K. Arrow,
L. Hurwicz, and H. Uzawa, eds.), Stanford University
Press, pp. 154-165 (1958).
[14] Xuemei Zhang and Li Yang, Improving SVM through a
Risk Decision Rule Running on MATLAB, Journal of
Software, Vol. 7, No. 10, pp. 2252-2257, October 2012.
[15] Angelia Nedic and Asuman E. Ozdaglar, Subgradient methods
in network resource allocation: Rate analysis, CISS 2008,
pp. 1189-1194.
[16] R.J. Solomonoff, "A formal theory of inductive inference,"
Parts 1 and 2, Inform. Contr.,7, pp. 1-22, pp. 224-254
(1964).
[17] R.A. Tapia and J.R. Thompson, Nonparametric probability
density estimation, The Johns Hopkins University Press,
Baltimore (1978).
[18] E.G. Gol'shtein, A generalized gradient method for
finding saddle points, Matekon 10 (1974), 36-52.
[19] V. Vapnik, The Nature of Statistical Learning Theory,
Springer-Verlag, New York, 1995.
[20] V. Vapnik, Statistical Learning Theory, John Wiley and
Sons, Inc., New York, 1998.
[21] V. Vapnik, E. Levin, and Y. LeCun, Measuring the VC-Dimension
of a Learning Machine, Neural Computation 6, pp. 851-876 (1994).