Abstract

The application of kernel-based learning algorithms has, so far, largely been confined to real-valued data and a few special data types, such as strings. In this paper we propose a general method of constructing natural families of kernels over discrete structures, based on the matrix exponentiation idea. In particular, we focus on generating kernels on graphs, for which we propose a special class of exponential kernels, based on the heat equation, called diffusion kernels, and show that these can be regarded as the discretisation of the familiar Gaussian kernel of Euclidean space.

1. Introduction
Kernel-based algorithms, such as Gaussian processes (MacKay, 1997), support vector machines (Burges, 1998), and kernel PCA (Mika et al., 1998), are enjoying great popularity in the statistical learning community. The common idea behind these methods is to express our prior beliefs about the correlations, or more generally, the similarities, between pairs of points in data space in terms of a kernel function $K : X \times X \rightarrow \mathbb{R}$, and thereby to implicitly construct a mapping $\phi : X \rightarrow \mathcal{H}$ to a Hilbert space $\mathcal{H}$, in which the kernel appears as the inner product,

$$ K(x, x') \;=\; \langle \phi(x), \phi(x') \rangle . \qquad (1) $$
In addition to adequately expressing the known or hypothesized structure of the data space, the function must satisfy two mathematical requirements to be able to serve as a kernel: it must be symmetric ($K(x, x') = K(x', x)$) and positive semi-definite. Constructing appropriate positive definite kernels is not a simple task, and this has largely been the reason why, with a few exceptions, kernel methods have mostly been confined to Euclidean spaces, where several families of provably positive semi-definite and easily interpretable kernels are known. When dealing with intrinsically discrete data spaces, the usual approach has been either to map the data to Euclidean space first (as is commonly done in text classification, treating integer word counts as real numbers (Joachims, 1998)) or, when no such simple mapping is forthcoming, to forgo using kernel methods altogether. A notable exception to this is the line of work stemming from the convolution kernel idea introduced in (Haussler, 1999) and related but independently conceived ideas on string kernels first presented in (Watkins, 1999). Despite the promise of these ideas, relatively little work has been done on discrete kernels since the publication of these articles.
2. Exponential kernels

Recall that for a symmetric function $K$ to serve as a kernel, the matrices $K_{ij} = K(x_i, x_j)$ must be positive semi-definite,

$$ \sum_{i,j} v_i \, K_{ij} \, v_j \;\ge\; 0 \qquad (2) $$

for any choice of points $x_1, \ldots, x_m$ and any real vector $v$ (Haussler, 1999). The exponential of a square matrix $H$ is defined as

$$ e^{\beta H} \;=\; \lim_{n \rightarrow \infty} \left( I + \frac{\beta H}{n} \right)^{\! n} \qquad (3) $$

$$ \phantom{e^{\beta H}} \;=\; I + \beta H + \frac{\beta^2}{2!} H^2 + \frac{\beta^3}{3!} H^3 + \cdots . \qquad (4) $$

It is well known that any even power of a symmetric matrix is positive semi-definite, and that the set of positive semi-definite matrices is complete with respect to limits of sequences under the Frobenius norm. Taking $H$ to be symmetric and replacing $n$ by $2n$ in (3) shows that the exponential of any symmetric matrix is symmetric and positive semi-definite, hence it is a candidate for a kernel.
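To make this concrete, here is a minimal numerical sketch (assuming NumPy and SciPy are available; the generator and the value of $\beta$ are arbitrary illustrations) that exponentiates a symmetric matrix and verifies positive semi-definiteness:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

# An arbitrary symmetric generator H (any symmetric matrix will do).
A = rng.standard_normal((5, 5))
H = (A + A.T) / 2.0

beta = 0.7
K = expm(beta * H)          # matrix exponential, equations (3)/(4)

# K should be symmetric and positive semi-definite:
assert np.allclose(K, K.T)
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())  # > 0 up to round-off
```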
Such kernels form continuous families $\{K_\beta\}$, indexed by a real parameter $\beta$, and are related to infinitely divisible probability distributions, which are the limits of sums of independent random variables (Feller, 1971). The tautology $K_\beta = \bigl( K_{\beta/n} \bigr)^n$ becomes, as $n$ goes to infinity, the limit definition (3), suggesting that we construct kernels by choosing a symmetric generator $H$ and setting

$$ K_\beta \;=\; e^{\beta H}, \qquad (5) $$

guaranteeing positive definiteness without seriously restricting our choice of kernel. Furthermore, differentiating with respect to $\beta$ and examining the resulting differential equation

$$ \frac{d}{d\beta} K_\beta \;=\; H \, K_\beta \qquad (6) $$

shows how the generator $H$ controls the whole family $\{K_\beta\}$.

By a classical result of Schoenberg, positive definite kernels can also be obtained by componentwise exponentiation,

$$ K(x, x') \;=\; e^{\beta \, \kappa(x, x')}, \qquad (7) $$

where $\kappa$ is a conditionally positive semi-definite kernel; that is, it satisfies (2) under the additional constraint that $\sum_i v_i = 0$.¹ Whereas (5) involves matrix exponentiation via (3), formula (7) prescribes the more straightforward componentwise exponentiation.

¹ Instead of using the term conditionally positive definite, this type of object is sometimes referred to by saying that $-\kappa$ is negative definite. Confusingly, a negative definite kernel is then not the same as the negative of a positive definite kernel, so we shall avoid using this terminology.
" e r
)
d )
se
xd )
" f de c ed H~}
c
z ) d
e
|e{ u ' 20' @
d
Y ) d x
' @#' yX pg e
s v s v s)
but v pQpp w but f buthr
'
i q q) { q ' z q )
" q ' 5 HG 7 g h0
)
' ' q '
)
0 E
E
"
k&'5W P b' P P S$e'3W P b' P P )
z
E ' ' E )
c
{ z )
"
W
{ b'' ' $'' ' '
)
@4 c )
p '5Wfb' 0' f
)
q s
$'' 6" pQpp 6"' q 76"' s 20' f q ms
)
g Y q $" s tX Y q b" s QX
"' )
&$p# 0'
p $'' G )
' ' v0'
s h) s
)
Y g &' $" Qppp " q b" s !2tkX
e
' d q ' d s v' d
)
q
d
s s) s s s
q q g s g s g q q g s
oq) c
t s )
W V e)$#!
) '"
z
' s% b" s ! dP F 6tF { q z ' %q $" q 5 mP F F { s ) P F P F 7Qq F 7yF g{ q z
r&
)
s g
q
s ' b" q s !
Y ' d q tX Y ' s d s yX s
p q W s
" Dq q tpQp q q q q )
R
) '
r&!' R 5 Y v R X ' % ! P F HF tPF
On the other hand, conditionally positive definite matrices are somewhat elusive mathematical objects, and it is not clear where Schoenberg's beautiful result will find application in statistical learning theory. The advantage of our relatively brute-force approach to constructing positive definite objects is that it only requires that the generator be symmetric (more generally, self-adjoint) and guarantees the positive semi-definiteness of the resulting kernel $K$.
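As a small numerical illustration of the componentwise construction (7) (a sketch of our own; the choice $\kappa(x, x') = -|x - x'|$, a standard conditionally positive semi-definite kernel on the real line, is not taken from the text):

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 25)
kappa = -np.abs(x[:, None] - x[None, :])   # conditionally p.s.d. kernel

beta = 0.5
K = np.exp(beta * kappa)                   # componentwise exponentiation, eq. (7)
# e^{-beta |x - x'|} is the familiar Laplacian kernel; the smallest
# eigenvalue confirms positive semi-definiteness:
print(np.linalg.eigvalsh(K).min())         # >= 0 up to round-off
```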
Exponential kernels also behave naturally under products of the underlying spaces. For generators $H^{(1)}$ on $X^{(1)}$ and $H^{(2)}$ on $X^{(2)}$, consider

$$ H_{(i,j),(i',j')} \;=\; H^{(1)}_{i,i'} \, \delta_{j,j'} \;+\; \delta_{i,i'} \, H^{(2)}_{j,j'} , $$

where $\delta_{i,i'} = 1$ if $i = i'$ and $0$ otherwise. In other words, we take the generator over the product set $X^{(1)} \times X^{(2)}$ to be $H = H^{(1)} \otimes I^{(2)} + I^{(1)} \otimes H^{(2)}$, where $I^{(1)}$ and $I^{(2)}$ are the $n_1$- and $n_2$-dimensional diagonal kernels, respectively. Plugging into (6) shows that the corresponding kernels will be given simply by

$$ K_\beta \;=\; K^{(1)}_\beta \otimes K^{(2)}_\beta ; \qquad (8) $$

that is, any exponential kernel over length-$n$ sequences factorizes into a product of exponential kernels over the individual positions.
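The factorization (8) can be checked numerically; in this sketch (assuming SciPy, with arbitrary symmetric generators) the kernel generated on the product set coincides with the Kronecker product of the factor kernels:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
def sym(n):
    A = rng.standard_normal((n, n))
    return (A + A.T) / 2.0

H1, H2 = sym(3), sym(4)
I1, I2 = np.eye(3), np.eye(4)

beta = 0.3
H = np.kron(H1, I2) + np.kron(I1, H2)      # generator on the product set
K = expm(beta * H)

# Equation (8): the kernel factorizes as a Kronecker product.
K_factored = np.kron(expm(beta * H1), expm(beta * H2))
print(np.allclose(K, K_factored))           # True
```

This works because the two terms of $H$ commute, so $e^{\beta H}$ splits into the product of the two exponentials.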
3. Diffusion kernels on graphs

Consider an unweighted, undirected graph $G$ with vertex set $V$, and take as generator the matrix

$$ H_{ij} \;=\; \begin{cases} 1 & \text{for } i \sim j \\ -\,d_i & \text{for } i = j \\ 0 & \text{otherwise}, \end{cases} \qquad (9) $$

where $i \sim j$ means that $i$ and $j$ are connected by an edge and $d_i$ denotes the degree of vertex $i$. The negative of this matrix (sometimes up to normalization) is called the Laplacian of $G$, and it plays a central role in spectral graph theory (Chung, 1997). It is instructive to note that for any vector $f = (f_i)_{i \in V}$,

$$ f^{\top} H f \;=\; - \sum_{i \sim j} \left( f_i - f_j \right)^2 , $$

showing that $H$ is, in fact, negative semi-definite. Acting on functions $f$ by $(Hf)(i) = \sum_{j \sim i} f_j - d_i f_i$, $H$ can also be regarded as an operator. In fact, it is easy to show that on a square grid in $d$-dimensional Euclidean space with grid spacing $h$, $H$ is just the finite difference approximation (up to a factor of $1/h^2$) to the familiar continuous Laplacian $\Delta = \sum_{i=1}^{d} \partial^2 / \partial x_i^2$. Since equations of the form $\partial \psi / \partial \beta = \Delta \psi$ are used to describe the diffusion of heat and other substances through continuous media, our equation

$$ \frac{d}{d\beta} K_\beta \;=\; H \, K_\beta \qquad (10) $$

with $H$ as defined in (9) is called the heat equation on $G$, and the resulting kernels are called diffusion or heat kernels.
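A minimal sketch of the construction (assuming NumPy/SciPy; the 4-cycle is an arbitrary example graph) builds the generator (9) from an adjacency matrix and computes the resulting diffusion kernel:

```python
import numpy as np
from scipy.linalg import expm

# Adjacency matrix of a 4-cycle (arbitrary example graph).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

H = A - np.diag(A.sum(axis=1))   # generator (9): negative graph Laplacian

beta = 0.5
K = expm(beta * H)               # diffusion kernel, solution of (10)

print(np.linalg.eigvalsh(K).min() >= -1e-12)  # positive semi-definite
print(K.sum(axis=1))             # each row sums to 1: diffusion conserves mass
```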
3.1 A stochastic and a physical model
Expanding the exponential, the kernel can be written entrywise as

$$ K_\beta(i, j) \;=\; \bigl( e^{\beta H} \bigr)_{ij} \;=\; \sum_{k=0}^{\infty} \frac{\beta^k}{k!} \, \bigl( H^k \bigr)_{ij} \,. \qquad (11) $$

Since $(H^k)_{ij}$ is built up from walks of length $k$ between $i$ and $j$, this expansion, in which the sum effectively ranges over the set of all paths on $G$, gives a representation of the kernel in the space of linear combinations of paths.
It is well known that diffusion is closely related to random walks. A random walk on an unweighted graph $G$ is a stochastic process generating sequences $x_0, x_1, x_2, \ldots$ in such a way that $p(x_{t+1} = j \mid x_t = i) = 1 / d_i$ if $i \sim j$ and zero otherwise.
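The connection can be made concrete numerically: for large $n$, $n$ steps of the "lazy" walk matrix $I + (\beta/n)H$ approach the diffusion kernel, mirroring the limit (3). A sketch under the same assumptions as before:

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = A - np.diag(A.sum(axis=1))

beta, n = 1.0, 10_000
step = np.eye(4) + (beta / n) * H      # one small lazy-random-walk step
K_walk = np.linalg.matrix_power(step, n)

# The discrepancy is small and vanishes as n grows:
print(np.max(np.abs(K_walk - expm(beta * H))))
```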
Closely related to the above is an electrical model. Differentiating (11) with respect to $\beta$ yields the differential equations

$$ \frac{d}{d\beta} K_\beta(i, j) \;=\; \sum_{k \sim j} K_\beta(i, k) \;-\; d_j \, K_\beta(i, j) \,. $$

These equations are the same as those describing the relaxation of a network of capacitors of unit capacitance, where one plate of each capacitor is grounded, and the other plates are connected according to the graph structure, each edge corresponding to a connection of unit resistance. The $K_\beta(i, j)$ then measure the potential at each capacitor at time $\beta$. In particular, $K_\beta(i, j)$ is the potential at capacitor $i$, time $\beta$ after having initialized the system by discharging every capacitor, except for capacitor $j$, which starts out at unit potential.
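The relaxation dynamics can also be simulated directly. This sketch (our own illustration) integrates the differential equations above by explicit Euler steps from the initial condition $K_0 = I$, in which exactly one capacitor per column starts out charged, and compares with the exact kernel:

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = A - np.diag(A.sum(axis=1))

beta, steps = 0.8, 20_000
dt = beta / steps
K = np.eye(3)                    # at beta = 0 only the "own" capacitor is charged
for _ in range(steps):
    K = K + dt * (H @ K)         # d/dbeta K = H K, explicit Euler

print(np.max(np.abs(K - expm(beta * H))))   # small discretization error
```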
Using the tensor product of complete graphs approach developed above, it is easy to add an extra character to the alphabet to represent gaps. We can then map strings $s$ and $s'$ to a generalized hypercube of the appropriate dimensionality by mapping each string to the vertices corresponding to all its possible extensions by gaps. Let us represent an alignment between $s$ and $s'$ by the vector of its matching positions, and let $\mathcal{A}(s, s')$ be the set of all alignments between $s$ and $s'$. Assuming that all virtual strings are weighted equally, the resulting kernel will be a sum of hypercube kernels over the alignments in $\mathcal{A}(s, s')$.
For ease of experimentation we use a large margin classifier based on the voted perceptron, as described in (Freund & Schapire, 1999). In each set of experiments we compare models trained using a diffusion kernel, the Euclidean distance, and a kernel based on the Hamming distance. Data sets having a majority of categorical variables were chosen; any continuous features were ignored. The diffusion kernels used are given by the natural extension of the hypercube kernels given in Section 4.4, namely

$$ K(x, x') \;\propto\; \prod_{i} \left( \frac{1 - e^{-k_i \beta}}{1 + (k_i - 1) \, e^{-k_i \beta}} \right)^{\delta(x_i \neq x'_i)} , \qquad (17) $$

where $k_i$ is the number of values in the alphabet of the $i$-th attribute.
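A sketch of how such a kernel can be computed and used (our own illustration: the data are toy examples, and for brevity the classifier below is a plain kernel perceptron rather than the voted perceptron used in the experiments):

```python
import numpy as np

def categorical_diffusion_kernel(X1, X2, k_sizes, beta):
    """Diffusion kernel (17) on products of complete graphs.

    X1, X2 : integer-coded categorical data, shapes (n, d) and (m, d)
    k_sizes: alphabet size k_i of each attribute
    """
    K = np.ones((len(X1), len(X2)))
    for i, k in enumerate(k_sizes):
        ratio = (1.0 - np.exp(-k * beta)) / (1.0 + (k - 1.0) * np.exp(-k * beta))
        differ = X1[:, i][:, None] != X2[:, i][None, :]
        K *= np.where(differ, ratio, 1.0)   # factor per attribute that differs
    return K

# Toy data: 2 categorical attributes with alphabet sizes 3 and 2.
rng = np.random.default_rng(2)
X = rng.integers(0, [3, 2], size=(40, 2))
y = np.where(X[:, 0] == X[:, 1], 1, -1)     # arbitrary labels

K = categorical_diffusion_kernel(X, X, k_sizes=[3, 2], beta=0.3)

# Simple kernel perceptron (not the voted variant) as a usage example.
alpha = np.zeros(len(X))
for _ in range(10):
    for t in range(len(X)):
        if np.sign((alpha * y) @ K[:, t] + 1e-12) != y[t]:
            alpha[t] += 1.0
print("training error:", np.mean(np.sign((alpha * y) @ K) != y))
```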
Figure 2. The average error rate (left) and number of support vectors (right) as a function of the diffusion coefficient $\beta$ on one of the data sets. The horizontal line is the baseline performance using the Hamming kernel.
Table 1. Results on five UCI data sets. For each data set, only the categorical features are used. The column marked $k$ indicates the maximum number of values for an attribute; thus the data sets with $k = 2$ have binary attributes. Results are reported for the setting of the diffusion coefficient $\beta$ achieving the best error rate.

Data Set   #Attr    k     Hamming Distance        Diffusion Kernel      beta    Improvement
                          error       #SV         error       #SV               error    #SV
   ?          9     10     7.64%     387.0         3.64%      62.9      0.30     62%     83%
   ?         13      2    17.98%     750.0        17.66%     314.9      1.50      2%     58%
   ?         11     42    19.19%    1149.5        18.50%    1033.4      0.40      4%      8%
   ?         22     10     3.36%      96.3         0.75%      28.2      0.10     77%     70%
   ?         16      2     4.69%     286.0         3.91%     252.9      2.00     17%     12%
7. Conclusions
We have presented a natural approach to constructing kernels on graphs and related discrete objects by using the analogue on graphs of the heat equation on Riemannian manifolds. The resulting kernels are easily shown to satisfy the crucial positive semi-definiteness criterion, and they come with intuitive interpretations in terms of random walks, electrical circuits, and other aspects of spectral graph theory. We showed how the explicit calculation of diffusion kernels is possible for specific families of graphs, and how the kernels correspond to standard Gaussian kernels in a continuous limit. Preliminary experiments on categorical data, where standard kernel methods were previously not applicable, indicate that diffusion kernels can be effectively used with standard margin-based classification schemes. While the tensor product construction allows one to incrementally build up more powerful kernels from simple components, explicit formulas will be difficult to come by in general. Yet the use of diffusion kernels may still be practical when the underlying graph structure is sparse, by using standard sparse matrix techniques.

It is often said that the key to the success of kernel-based algorithms is the implicit mapping from a data space to some, usually much higher dimensional, feature space which better captures the structure inherent in the data. The motivation behind the approach to building kernels presented in this paper is the realization that the kernel is a general representation of this inherent structure, independent of how we represent individual data points. Hence, by constructing a kernel directly on whatever object the data points naturally lie on (e.g. a graph), we can avoid the arduous process of forcing the data through any Euclidean space altogether. In effect, the kernel trick is a method for unfolding structures in Hilbert space. It can be used to unfold nontrivial correlation structures between points in Euclidean space, but it is equally valuable for unfolding other types of structures which intrinsically have nothing to do with linear spaces at all.