
Diffusion Kernels on Graphs and Other Discrete Structures

Risi Imre Kondor                                        KONDOR@CMU.EDU
John Lafferty                                           LAFFERTY@CS.CMU.EDU

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA

Abstract

The application of kernel-based learning algorithms has, so far, largely been confined to real-valued data and a few special data types, such as strings. In this paper we propose a general method of constructing natural families of kernels over discrete structures, based on the matrix exponentiation idea. In particular, we focus on generating kernels on graphs, for which we propose a special class of exponential kernels, based on the heat equation, called diffusion kernels, and show that these can be regarded as the discretisation of the familiar Gaussian kernel of Euclidean space.

1. Introduction

Kernel-based algorithms, such as Gaussian processes (Mackay, 1997), support vector machines (Burges, 1998), and kernel PCA (Mika et al., 1998), are enjoying great popularity in the statistical learning community. The common idea behind these methods is to express our prior beliefs about the correlations, or more generally, the similarities, between pairs of points in data space in terms of a kernel function K: X × X → R, and thereby to implicitly construct a mapping φ: X → H to a Hilbert space H, in which the kernel appears as the inner product,

    K(x, x') = ⟨φ(x), φ(x')⟩                                    (1)

(Schölkopf & Smola, 2001). With respect to a basis of H, each datapoint then splits into (a possibly infinite number of) independent features, a property which can be exploited to great effect.

Graph-like structures occur in data in many guises, and in order to apply machine learning techniques to such discrete data it is desirable to use a kernel to capture the long-range relationships between data points induced by the local structure of the graph. One obvious example of such data is a graph of documents related to one another by links, such as the hyperlink structure of the World Wide Web. Other examples include social networks, citations between scientific articles, and networks in linguistics (Albert & Barabási, 2001).

In addition to adequately expressing the known or hypothesized structure of the data space, the function K must satisfy two mathematical requirements to be able to serve as a kernel: it must be symmetric (K(x, x') = K(x', x)) and positive semi-definite. Constructing appropriate positive definite kernels is not a simple task, and this has largely been the reason why, with a few exceptions, kernel methods have mostly been confined to Euclidean spaces, where several families of provably positive semi-definite and easily interpretable kernels are known. When dealing with intrinsically discrete data spaces, the usual approach has been either to map the data to Euclidean space first (as is commonly done in text classification, treating integer word counts as real numbers (Joachims, 1998)) or, when no such simple mapping is forthcoming, to forgo using kernel methods altogether. A notable exception to this is the line of work stemming from the convolution kernel idea introduced in (Haussler, 1999) and related but independently conceived ideas on string kernels first presented in (Watkins, 1999). Despite the promise of these ideas, relatively little work has been done on discrete kernels since the publication of these articles.

Graphs are also sometimes used to model complicated or only partially understood structures in a first approximation. In chemistry or molecular biology, for example, it might be anticipated that molecules with similar chemical structures will have broadly similar properties. While for two arbitrary molecules it might be very difficult to quantify exactly how similar they are, it is not so difficult to propose rules for when two molecules can be considered neighbors, for example, when they only differ in the presence or absence of a single functional group, movement of a bond to a neighbouring atom, etc. Representing such relationships by edges gives rise to a graph, each vertex corresponding to one of our original objects. Finally, adjacency graphs are sometimes used when data is expected to be confined to a manifold of lower dimensionality than the original space (Saul & Roweis, 2001; Belkin & Niyogi, 2001) and (Szummer & Jaakkola, 2002). In all of these cases, the challenge is to capture in the kernel the local and global structure of the graph.

In this paper we use ideas from spectral graph theory to propose a natural class of kernels on graphs, which we refer to as diffusion kernels. We start out by presenting a more general class of kernels, called exponential kernels, applicable to a wide variety of discrete objects. In Section 3 we present the ideas behind diffusion kernels and the interpretation of these kernels on graphs. In Section 4 we show how diffusion kernels can be computed for some special families of graphs, and these techniques are further developed in Section 5. Experiments using diffusion kernels for classification of categorical data are presented in Section 6, and we conclude and summarize our results in Section 7.

2. Exponential kernels

In this section we show how the exponentiation operation on matrices naturally yields the crucial positive-definiteness criterion of kernels, and describe how to build kernels on the direct product of graphs.

Recall that in the discrete case, positive semi-definiteness amounts to

    Σ_{i,j} c_i c_j K(x_i, x_j) ≥ 0                              (2)

for all sets of real coefficients c_1, c_2, ..., c_m, and in the continuous case,

    ∫∫ K(x, x') f(x) f(x') dx dx' ≥ 0

for all square integrable real functions f; the latter is sometimes referred to as Mercer's condition.

In the discrete case, for finite X, the kernel can be uniquely represented by an |X| × |X| matrix (which we shall denote by the same letter K) with rows and columns indexed by the elements of X, and related to the kernel by K_{ij} = K(x_i, x_j). Since this matrix, called the Gram matrix, and the function K: X × X → R are essentially equivalent (in particular, the matrix inherits the properties of symmetry and positive semi-definiteness), we can refer to one or the other as the kernel without risk of confusion.

The exponential of a square matrix H is defined as

    e^{βH} = lim_{n→∞} ( I + βH/n )^n,                           (3)

where the limit always exists and is equivalent to

    e^{βH} = I + βH + (β²/2!) H² + (β³/3!) H³ + ⋯ .              (4)

It is well known that any even power of a symmetric matrix is positive semi-definite, and that the set of positive semi-definite matrices is complete with respect to limits of sequences under the Frobenius norm. Taking H to be symmetric and replacing n by 2n in (3) shows that the exponential of any symmetric matrix is symmetric and positive semi-definite, hence it is a candidate for a kernel.

The above already suggests looking for kernels over finite sets in the form

    K = e^{βH},                                                  (5)

guaranteeing positive definiteness without seriously restricting our choice of kernel. Furthermore, differentiating with respect to β and examining the resulting differential equation

    d/dβ K_β = H K_β,                                            (6)

with the accompanying initial condition K_0 = I, lends itself naturally to the interpretation that K_β is the product of a continuous process, expressed by H, gradually transforming it from the identity matrix (β = 0) to a kernel with stronger and stronger off-diagonal effects as β increases. We shall see in the examples below that by virtue of this relationship, choosing H to express the local structure of the data space will result in its global structure naturally emerging in K_β. We call {K_β} an exponential family of kernels, with generator H and bandwidth parameter β.

Conversely, it is easy to show that any infinitely divisible kernel can be expressed in the exponential form (3). Infinite divisibility means that K can be written as an n-fold convolution K = K^{(1/n)} ∗ K^{(1/n)} ∗ ⋯ ∗ K^{(1/n)} for any n (Haussler, 1999). Such kernels form continuous families K_β, indexed by a real parameter β, and are related to infinitely divisible probability distributions, which are the limits of sums of independent random variables (Feller, 1971). The tautology K_β = (K_{β/n})^n then becomes, as n goes to infinity, equivalent to (3).

Note that the exponential kernel construction is not related to the result described in (Berg et al., 1984; Haussler, 1999) and (Schölkopf & Smola, 2001), based on Schoenberg's pioneering work in the late 1930s in the theory of positive definite functions (Schoenberg, 1938). This work shows that any positive semi-definite K can be written as

    K(x, x') = e^{β H(x, x')},                                   (7)

where H is a conditionally positive semi-definite kernel; that is, it satisfies (2) under the additional constraint that Σ_i c_i = 0.¹ Whereas (5) involves matrix exponentiation via (3), formula (7) prescribes the more straightforward component-wise exponentiation. On the other hand, conditionally positive definite matrices are somewhat elusive mathematical objects, and it is not clear where Schoenberg's beautiful result will find application in statistical learning theory. The advantage of our relatively brute-force approach to constructing positive definite objects is that it only requires that the generator H be symmetric (more generally, self-adjoint) and it guarantees the positive semi-definiteness of the resulting kernel K.

¹ Instead of using the term conditionally positive definite, this type of object is sometimes referred to by saying that −H is negative definite. Confusingly, a negative definite kernel is then not the same as the negative of a positive definite kernel, so we shall avoid using this terminology.
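To make the construction in equations (3) to (5) concrete, the following sketch builds an exponential kernel from an arbitrary symmetric generator and checks positive semi-definiteness and the limit definition numerically. It is only an illustration of the idea, assuming NumPy and SciPy; the function name and the toy generator are ours.

    import numpy as np
    from scipy.linalg import expm

    def exponential_kernel(H, beta):
        """Exponential kernel K = exp(beta * H) for a symmetric generator H, cf. (3)-(5)."""
        H = np.asarray(H, dtype=float)
        assert np.allclose(H, H.T), "the generator must be symmetric"
        return expm(beta * H)

    # A small arbitrary symmetric generator.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5))
    H = (A + A.T) / 2.0
    beta = 0.7

    K = exponential_kernel(H, beta)

    # K is symmetric and positive semi-definite (its eigenvalues are exponentials
    # of real numbers, hence strictly positive).
    print(np.linalg.eigvalsh(K).min() > 0)                  # True

    # The limit definition (3) agrees with the matrix exponential for large n.
    n = 10_000
    K_limit = np.linalg.matrix_power(np.eye(5) + beta * H / n, n)
    print(np.abs(K - K_limit).max())                        # small, shrinks as n grows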

There is a canonical way of building exponential kernels over direct products of sets, which will prove invaluable in what follows. Let K'_β be a family of kernels over the set X' with generator H', and let K''_β be a family of kernels over X'' with generator H''. To construct an exponential kernel over the pairs (x', x''), with x' ∈ X' and x'' ∈ X'', it is natural to use the generator

    H((x'_1, x''_1), (x'_2, x''_2)) = H'(x'_1, x'_2) δ(x''_1, x''_2) + δ(x'_1, x'_2) H''(x''_1, x''_2),

where δ(x, y) = 1 if x = y and 0 otherwise. In other words, we take the generator over the product set X' × X'' to be H = H' ⊗ I'' + I' ⊗ H'', where I' and I'' are the |X'|- and |X''|-dimensional identity (diagonal) kernels, respectively. Plugging into (6) shows that the corresponding kernels will be given simply by

    K_β((x'_1, x''_1), (x'_2, x''_2)) = K'_β(x'_1, x'_2) K''_β(x''_1, x''_2),      (8)

or, using the tensor product notation, K_β = K'_β ⊗ K''_β. In particular, we can lift any exponential kernel K_β on X to an exponential kernel over length-n sequences (x_1, x_2, ..., x_n) ∈ X^n by

    K_β^{(n)}((x_1, ..., x_n), (x'_1, ..., x'_n)) = Π_{i=1}^{n} K_β(x_i, x'_i).
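The product construction is easy to verify numerically: forming the generator H' ⊗ I'' + I' ⊗ H'' with Kronecker products and exponentiating it reproduces the factorized kernel of (8). The sketch below is illustrative only; it assumes NumPy and SciPy, and the generators are random symmetric matrices rather than graph generators.

    import numpy as np
    from scipy.linalg import expm

    def product_generator(H1, H2):
        """Generator on the product set: H1 (x) I + I (x) H2."""
        I1, I2 = np.eye(H1.shape[0]), np.eye(H2.shape[0])
        return np.kron(H1, I2) + np.kron(I1, H2)

    rng = np.random.default_rng(1)
    H1 = rng.standard_normal((3, 3)); H1 = (H1 + H1.T) / 2
    H2 = rng.standard_normal((4, 4)); H2 = (H2 + H2.T) / 2
    beta = 0.5

    K_product = expm(beta * product_generator(H1, H2))
    K_factored = np.kron(expm(beta * H1), expm(beta * H2))   # K' (x) K'' as in (8)

    print(np.allclose(K_product, K_factored))                # True

The identity holds because H' ⊗ I'' and I' ⊗ H'' commute, so the exponential of their sum factorizes.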

3. Diffusion kernels on graphs

An undirected, unweighted graph G is defined by a vertex set V and an edge set E, the latter being the set of unordered pairs {v_i, v_j}, where {v_i, v_j} ∈ E whenever the vertices v_i and v_j are joined by an edge (denoted v_i ∼ v_j). Equation (6) suggests using an exponential kernel with generator

    H_{ij} = 1       if v_i ∼ v_j,
           = −d_i    if i = j,                                   (9)
           = 0       otherwise,

where d_i is the degree of vertex v_i (the number of edges emanating from vertex v_i).

The negative of this matrix (sometimes up to normalization) is called the Laplacian of G, and it plays a central role in spectral graph theory (Chung, 1997). It is instructive to note that for any vector f ∈ R^{|V|},

    f⊤ H f = − Σ_{{v_i, v_j} ∈ E} (f_i − f_j)²,

showing that H is, in fact, negative semi-definite. Acting on functions f: V → R by (Hf)(i) = Σ_j H_{ij} f_j, H can also be regarded as an operator. In fact, it is easy to show that on a square grid in d-dimensional Euclidean space with grid spacing h, H is just the finite difference approximation to the familiar continuous Laplacian

    Δ = Σ_{i=1}^{d} ∂²/∂x_i²,

and that in the limit h → 0 this approximation becomes exact. In analogy with classical physics, where equations of the form ∂φ/∂t = Δφ are used to describe the diffusion of heat and other substances through continuous media, our equation

    d/dβ K_β = H K_β,                                            (10)

with H as defined in (9), is called the heat equation on G, and the resulting kernels K_β = e^{βH} are called diffusion or heat kernels.
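To spell out how (9) and (10) are used in practice, the sketch below builds the generator H = A − D from an adjacency matrix and exponentiates it. The helper function, the toy path graph and the choice of β are ours; this is an illustration of the construction, not code from the paper.

    import numpy as np
    from scipy.linalg import expm

    def diffusion_kernel(adjacency, beta):
        """Diffusion kernel exp(beta * H) with the generator H of (9): off-diagonal 1
        for neighbours, minus the degree on the diagonal, 0 otherwise."""
        A = np.asarray(adjacency, dtype=float)
        H = A - np.diag(A.sum(axis=1))
        return expm(beta * H)

    # A four-vertex path graph 0 - 1 - 2 - 3.
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    K = diffusion_kernel(A, beta=1.0)
    print(np.round(K, 3))                     # largest entries near the diagonal
    print(np.allclose(K.sum(axis=1), 1.0))    # rows sum to 1, since H has zero row sums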

3.1 A stochastic and a physical model

There is a natural class of stochastic processes on graphs whose covariance structure yields diffusion kernels. Consider the random field obtained by attaching independent, zero mean, variance σ² random variables Z_i(0) to each vertex v_i ∈ V. Now let each of these random variables send a fraction α of its value to each of its respective neighbors at discrete time steps t = 1, 2, 3, ...; that is, let

    Z_i(t+1) = Z_i(t) + α Σ_{j: v_j ∼ v_i} ( Z_j(t) − Z_i(t) ).

Introducing the time evolution operator I + αH, this can be written as

    Z(t) = (I + αH)^t Z(0).                                      (11)

The covariance of the random field at time t is

    Cov( Z_i(t), Z_j(t) ) = E[ Z_i(t) Z_j(t) ],

which simplifies to

    Cov( Z_i(t), Z_j(t) ) = σ² [ (I + αH)^{2t} ]_{ij}            (12)

by independence at time zero, E[Z_i(0) Z_j(0)] = δ_{ij} σ². Note that (12) holds regardless of the particular distribution of the Z_i(0), as long as their mean is zero and their variance is σ².

Now we can decrease the time step from Δt = 1 to Δt = 1/s by replacing α by α/s and t by st in (11),² giving

    Cov( Z_i(t), Z_j(t) ) = σ² [ (I + (α/s) H)^{2st} ]_{ij},

which, in the limit s → ∞, is exactly of the form (3). In particular, the covariance becomes the diffusion kernel Cov(Z_i(t), Z_j(t)) = σ² [e^{2αtH}]_{ij}. Since kernels are in some sense nothing but generalized covariances (in fact, in the case of Gaussian processes, they are the covariance), this example supports the contention that diffusion kernels are a natural choice on graphs.

Closely related to the above is an electrical model. Differentiating (11) with respect to t yields the differential equations

    d Z_i(t) / dt = α Σ_{j: v_j ∼ v_i} ( Z_j(t) − Z_i(t) ).

These equations are the same as those describing the relaxation of a network of capacitors of unit capacitance, where one plate of each capacitor is grounded, and the other plates are connected according to the graph structure, each edge corresponding to a connection of resistance 1/α. The Z_i(t) then measure the potential at each capacitor at time t. In particular, [e^{βH}]_{ij} is the potential at capacitor i, time t = β/α after having initialized the system by discharging every capacitor, except for capacitor j, which starts out at unit potential.

² Note that Δ is here used to denote infinitesimals and not the Laplacian.
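The covariance identity (12) can be reproduced by simulation: run the averaging process on many independent copies of the random field and compare the empirical covariance with σ² [(I + αH)^{2t}]. The following sketch is illustrative (NumPy only; the graph, α, t and the sample size are arbitrary choices of ours).

    import numpy as np

    rng = np.random.default_rng(2)

    # Path graph on 4 vertices, generator H = A - D as in (9).
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    H = A - np.diag(A.sum(axis=1))

    alpha, t, sigma, n_samples = 0.1, 5, 1.0, 200_000
    M = np.eye(4) + alpha * H                     # one step of the averaging process

    Z = rng.normal(0.0, sigma, size=(n_samples, 4))   # independent fields at time zero
    for _ in range(t):
        Z = Z @ M                                 # each variable gains alpha times the
                                                  # difference from each of its neighbours

    empirical = (Z.T @ Z) / n_samples
    theory = sigma**2 * np.linalg.matrix_power(M, 2 * t)    # equation (12)
    print(np.abs(empirical - theory).max())       # small, up to Monte Carlo error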

3.2 The continuous limit

As a special case, it is instructive to again consider the infinitely fine square grid on R^d. Introducing the similarity function k_{x'}(x) = K_β(x, x'), the heat equation (10) gives

    ∂/∂β k_{x'}(x) = Δ k_{x'}(x).

Since the Laplacian is a local operator, in the sense that Δf(x) is only affected by the behavior of f in the neighborhood of x, as long as K_β(x, x') is continuous in x', the above can be rewritten as simply

    ∂/∂β K_β(x, x') = Δ_x K_β(x, x'),

showing that similarity to any given point x', as expressed by the kernel, really does behave as some substance diffusing in space, and also that the familiar Gaussian kernel on R^d,

    K_β(x, x') = (4πβ)^{−d/2} exp( −‖x − x'‖² / (4β) ),

is just a diffusion kernel with the continuous Laplacian as its generator. It is easy to verify that this Gaussian is the solution of the equation above with Dirac spike initial conditions K_0(x, x') = δ(x − x'). In this sense, diffusion kernels can be regarded as a generalization of Gaussian kernels to graphs.

3.3 Relationship to random walks

It is well known that diffusion is closely related to random walks. A random walk on an unweighted graph G is a stochastic process generating sequences z_0, z_1, z_2, ..., where z_t ∈ V, in such a way that P(z_{t+1} = v_j | z_t = v_i) = 1/d_i if v_i ∼ v_j, and zero otherwise. A lazy random walk on G with parameter α is very similar, except that when at vertex v_i, the process will take each of the edges emanating from v_i with fixed probability α, i.e. P(z_{t+1} = v_j | z_t = v_i) = α for v_i ∼ v_j, and will remain in place with probability 1 − d_i α. Considering the distribution of z_N in the limit N → ∞ with α → 0 and Nα → β leads exactly to (3), showing that diffusion kernels are the continuous time limit of lazy random walks.

This analogy also shows that [e^{βH}]_{ij} can be regarded as a sum over paths from v_i to v_j, namely the sum of the probabilities that the lazy walk takes each path. For graphs in which every vertex is of the same degree d, mapping each vertex v_i to every path starting at v_i, weighted by the square root of the probability of a lazy random walk starting at v_i taking that path, gives a representation of the kernel in the space of linear combinations of paths, where the set of all paths on G, loops and their reverses included, serves as the index set. This does give a diagonal representation, but not a representation satisfying (1), because there are alternating signs on the diagonal.
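The statement that diffusion kernels are the continuous-time limit of lazy random walks can be checked directly: the N-step transition matrix of a lazy walk with α = β/N, which is (I + (β/N)H)^N, approaches e^{βH} as N grows. A short illustrative sketch (the graph and β are arbitrary; NumPy and SciPy assumed):

    import numpy as np
    from scipy.linalg import expm

    # A small graph: a triangle with a pendant vertex attached.
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    H = A - np.diag(A.sum(axis=1))
    beta = 1.5

    K = expm(beta * H)
    for N in (10, 100, 1000, 10_000):
        step = np.eye(4) + (beta / N) * H              # one lazy-walk step, alpha = beta / N
        K_walk = np.linalg.matrix_power(step, N)       # N-step transition probabilities
        print(N, np.abs(K_walk - K).max())             # the error shrinks roughly like 1/N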

3.4 Diffusion on weighted graphs

Finally, we remark that diffusion kernels are not restricted to simple unweighted graphs. For multigraphs or weighted symmetric graphs, all we need to do is to set H_{ij} to be the total weight of all edges between v_i and v_j and reweight the diagonal terms accordingly. The rest of the analysis carries through as before.

4. Some special graphs

In general, computing exponential kernels involves diagonalizing the generator, H = T^{−1} D T, which is always possible because H is symmetric, and then computing

    e^{βH} = T^{−1} e^{βD} T,

which is easy, because e^{βD} will be diagonal with [e^{βD}]_{ii} = e^{β D_{ii}}. The diagonalization process is computationally expensive, however, and the kernel matrix must be stored in memory during the whole time the learning algorithm operates. Hence there is interest in the few special cases for which the kernel matrix can be computed directly.

4.1 k-regular trees

An infinite k-regular tree is an undirected, unweighted graph with no cycles, in which every vertex has exactly k neighbors. Note that this differs from the notion of a rooted k-ary tree in that no special node is designated the root. Any vertex can function as the root of the tree, but then it too must have exactly k neighbors. Hence a 3-regular tree looks much like a rooted binary tree, except that at the root it splits into three branches and not two.

Since in this graph every vertex is created perfectly equal, K_β(x, x') can only depend on the relative positions of x and x', namely the length d(x, x') of the unique path between them. Chung and Yau (1999) give an explicit integral formula for this kernel, (13), which after some trigonometry translates into an expression, (14), involving only k, β and d(x, x').
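The general recipe described at the start of this section, diagonalizing the generator once and exponentiating its eigenvalues, is what one would implement for an arbitrary graph. A sketch under the stated assumptions (symmetric H, NumPy; the star graph and the β values are just examples of ours):

    import numpy as np

    def diffusion_kernels_by_eigendecomposition(H, betas):
        """Compute exp(beta * H) for several beta values from one eigendecomposition.

        H is symmetric, so H = T diag(d) T^T with orthogonal T, and
        exp(beta * H) = T diag(exp(beta * d)) T^T."""
        d, T = np.linalg.eigh(H)                          # one O(n^3) factorization
        return {beta: (T * np.exp(beta * d)) @ T.T for beta in betas}

    # Example: star graph on 5 vertices (vertex 0 joined to vertices 1..4).
    A = np.zeros((5, 5))
    A[0, 1:] = A[1:, 0] = 1.0
    H = A - np.diag(A.sum(axis=1))

    kernels = diffusion_kernels_by_eigendecomposition(H, betas=[0.1, 0.5, 1.0])
    print(np.round(kernels[0.5], 3))

Reusing the eigendecomposition across values of β is the main practical advantage; the cost is dominated by the single O(n³) factorization and the O(n²) memory for the Gram matrix, as noted above.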

4.2 Complete graphs

In the unweighted complete graph with n vertices, any pair of distinct vertices is joined by an edge, hence H_{ij} = 1 for i ≠ j and H_{ii} = −(n−1) for the diagonal elements. It is easy to verify that the corresponding solution to (10) is

    K_β(i, j) = [ 1 + (n−1) e^{−nβ} ] / n     for i = j,
    K_β(i, j) = [ 1 − e^{−nβ} ] / n           for i ≠ j,          (15)

showing that with increasing β, the kernel relaxes exponentially to the uniform kernel K_β(i, j) → 1/n. The asymptotically exponential character of this solution, and the convergence to the uniform kernel for finite n, are direct consequences of the fact that H is a linear operator, and we shall see this type of behavior recur in other examples.

4.3 Closed chains

When G is a single closed chain of length n, K_β(i, j) will clearly only depend on the distance |i − j| along the chain between v_i and v_j. Labeling the vertices consecutively from 0 to n−1, the similarity function at a particular vertex (without loss of generality, vertex zero) can be expressed in terms of its discrete Fourier transform. The heat equation implies that each Fourier coefficient decays independently of the others, at a rate set by the corresponding eigenvalue −2(1 − cos(2πν/n)) of H. Using the inverse Fourier transform, the solution corresponding to the initial condition K_0(0, j) = δ_{0,j} at β = 0, and hence the kernel itself, will be

    K_β(i, j) = (1/n) Σ_{ν=0}^{n−1} e^{−2β(1 − cos(2πν/n))} cos( (2πν/n)(i − j) ).
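Both closed forms can be checked against direct matrix exponentiation. The sketch below is illustrative (NumPy and SciPy; the helper names and test sizes are ours): it compares the complete-graph solution (15) and the Fourier sum for the closed chain with the exponential of the corresponding generators.

    import numpy as np
    from scipy.linalg import expm

    def complete_graph_kernel(n, beta):
        """Closed form (15) for the complete graph on n vertices."""
        off = (1.0 - np.exp(-n * beta)) / n
        diag = (1.0 + (n - 1) * np.exp(-n * beta)) / n
        return np.full((n, n), off) + (diag - off) * np.eye(n)

    def chain_kernel(n, beta):
        """Fourier-sum solution for a closed chain of length n."""
        nu = np.arange(n)
        decay = np.exp(-2.0 * beta * (1.0 - np.cos(2 * np.pi * nu / n)))
        i, j = np.meshgrid(nu, nu, indexing="ij")
        phases = 2 * np.pi * nu * (i[..., None] - j[..., None]) / n
        return (decay * np.cos(phases)).sum(axis=-1) / n

    for n, beta in [(6, 0.8), (9, 0.3)]:
        A_complete = np.ones((n, n)) - np.eye(n)
        A_chain = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
        for A, closed_form in [(A_complete, complete_graph_kernel(n, beta)),
                               (A_chain, chain_kernel(n, beta))]:
            H = A - np.diag(A.sum(axis=1))
            print(np.allclose(expm(beta * H), closed_form))   # True in all four cases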

4.4 The hypercube and tensor products of complete graphs

Kernels on the special graphs considered above can serve as building blocks for tensor product kernels, as in (8). For example, it is natural to identify binary strings x = (x_1, x_2, ..., x_d) of length d with the vertices of the d-dimensional hypercube. Constructing a diffusion kernel on the hypercube regarded as a graph amounts to asserting that two sequences are neighbors if they only differ in a single digit. From (15) and (8), the diffusion kernel on the hypercube will be

    K_β(x, x') ∝ ( (1 − e^{−2β}) / (1 + e^{−2β}) )^{d(x, x')} = (tanh β)^{d(x, x')},

which only depends on the Hamming distance d(x, x'), the number of character places at which x and x' differ, and is extremely easy to compute. Similarly, the diffusion kernel on strings over an alphabet of size k will be

    K_β(x, x') ∝ ( (1 − e^{−kβ}) / (1 + (k−1) e^{−kβ}) )^{d(x, x')}.
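The Hamming-distance form above means the kernel can be evaluated directly from the two strings, without ever forming the 2^d × 2^d (or k^d × k^d) Gram matrix of the underlying hypercube. The sketch below is an illustration under these assumptions (NumPy and SciPy; the function and index helpers are ours); it also cross-checks the closed form against the explicit tensor-product construction (8) on a small cube.

    import numpy as np
    from scipy.linalg import expm

    def string_diffusion_kernel(x, y, beta, k=2):
        """Diffusion kernel between strings over an alphabet of size k,
        up to normalization: ratio ** (Hamming distance), cf. Section 4.4."""
        x, y = np.asarray(x), np.asarray(y)
        ratio = (1.0 - np.exp(-k * beta)) / (1.0 + (k - 1) * np.exp(-k * beta))
        return ratio ** np.count_nonzero(x != y)

    # Cross-check against the explicit tensor-product kernel for d = 3 binary digits.
    beta = 0.6
    H1 = np.array([[-1.0, 1.0], [1.0, -1.0]])    # generator of the two-vertex complete graph
    K1 = expm(beta * H1)
    K_cube = np.kron(np.kron(K1, K1), K1)        # kernel on the 3-dimensional hypercube, via (8)

    def index(bits):                             # vertex index of a bit string
        return int("".join(map(str, bits)), 2)

    x, y = [0, 1, 1], [1, 1, 0]
    normalized = K_cube[index(x), index(y)] / K_cube[index(x), index(x)]
    print(np.isclose(normalized, string_diffusion_kernel(x, y, beta)))   # True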

5. Conjugacy, the method of images and string kernels

Conjugating the Gram matrix by a not necessarily square matrix T yields a new positive semi-definite kernel of the form

    K' = T K T⊤.

One application of this is in creating virtual data points. We have noted above that the distinction between k-regular trees and infinite rooted trees is that, arbitrarily designating a vertex in the former as the root, we find that it has an extra branch emanating from it (Figure 1). Taking for simplicity k = 3, the analytical formulae (13) and (14) are hence not directly applicable to binary rooted trees, because if we simply try to ignore this branch by not mapping any data points to it, in the language of the electrical analogy of Section 3.1, we find that some current will seep away through it.

The crucial observation is that the graph possesses mirror symmetry about the edge connecting this errant branch to the rest of the graph. Mapping each vertex x of the binary tree to the analogous vertex on one side of this plane of symmetry in the 3-regular tree and to its mirror image x̄ on the other side solves the problem, because, by symmetry, in the electrical analogy the flow of current across the critical edge connecting the two halves of the graph will be zero. This construction, called the method of images, corresponds to a transformation matrix T with one row per vertex of the binary tree, carrying equal weight on the corresponding pair of vertices x and x̄ of the 3-regular tree, and yields in analytical form the diffusion kernel for infinite binary trees as a sum of 3-regular-tree kernel evaluations between the images of the two arguments (16).

Figure 1. The three-regular tree (left), which extends to infinity in all directions. A little bending of the branches shows that it is isomorphic to two rooted binary trees joined at the root (right). The method of images enables us to compute the diffusion kernel between vertices of the binary tree by mapping each to a pair of vertices x and x̄ on the three-regular tree and summing the contributions as in (16).

Another application of conjugated diffusion kernels is in the construction of string kernels sensitive to matching non-contiguous substrings. The usual approach to this is to introduce blank characters into the strings s and s' to be compared, so that the characters of the common substring are aligned. Using the tensor product of complete graphs approach developed above, it is easy to add an extra character to the alphabet to represent the blank. We can then map s and s' to a generalized hypercube of the appropriate dimensionality by mapping each string to the vertices corresponding to all its possible extensions by blanks. Representing an alignment between s and s' by its vector of matches, and letting A(s, s') be the set of all alignments between s and s', and assuming that all virtual strings are weighted equally, the resulting kernel will be a sum over A(s, s') of hypercube kernel values weighted by a combinatorial factor (17). In the special case that this combinatorial factor becomes constant for all pairs of strings, (17) becomes computable by dynamic programming via a simple recursion. For the derivation of recursive formulae such as this, and comparison to other measures of similarity between strings, see (Durbin et al., 1998).
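The only property of conjugation that the constructions in this section rely on is that it preserves positive semi-definiteness, even for rectangular T. A minimal numerical check (illustrative; the random graph and the matrix T are arbitrary choices of ours):

    import numpy as np
    from scipy.linalg import expm

    rng = np.random.default_rng(3)

    # A diffusion kernel on a random 6-vertex graph.
    A = (rng.random((6, 6)) < 0.4).astype(float)
    A = np.triu(A, 1)
    A = A + A.T
    H = A - np.diag(A.sum(axis=1))
    K = expm(0.8 * H)

    # Conjugation by a rectangular matrix T gives a Gram matrix on 4 "virtual"
    # data points, each a linear combination of the original vertices.
    T = rng.standard_normal((4, 6))
    K_new = T @ K @ T.T

    print(np.all(np.linalg.eigvalsh(K_new) > -1e-10))   # still positive semi-definite

In the method of images, each row of T places equal weight on a vertex and its mirror image; in the string-kernel construction, each row spreads a string over its blank-extended images.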

6. Experiments on UCI datasets

In this section we describe preliminary experiments with diffusion kernels, focusing on the use of kernel-based methods for classifying categorical data. For such problems it is often quite unnatural to encode the data as vectors in Euclidean space to allow the use of standard kernels. However, as our experiments show, even simple diffusion kernels on the hypercube, as described in Section 4.4, can result in good performance for such data.

For ease of experimentation we use a large margin classifier based on the voted perceptron, as described in (Freund & Schapire, 1999).³ In each set of experiments we compare models trained using a diffusion kernel, the Euclidean distance, and a kernel based on the Hamming distance. Data sets having a majority of categorical variables were chosen; any continuous features were ignored. The diffusion kernels used are given by the natural extension of the hypercube kernels given in Section 4.4, namely

    K(x, x') ∝ Π_{i: x_i ≠ x'_i} (1 − e^{−k_i β}) / (1 + (k_i − 1) e^{−k_i β}),

where k_i is the number of values in the alphabet of the i-th attribute.

Table 1 shows sample results of experiments carried out on five UCI data sets having a majority of categorical features. In each experiment, a voted perceptron was trained, using 10 rounds, for each kernel. Results are reported for the setting of the diffusion coefficient β achieving the best cross-validated error rate. Since the Euclidean kernel performed poorly in all cases, the results for this classifier are not shown. The results are averaged across 50 random splits of the training and test data. In addition to the error rates, also shown is the average number of support vectors (or perceptrons) used. In general, we see that the best classifiers also have the sparsest representation. The reduction in error rate varies, but the simple diffusion kernel generally performs well.

The performance over a range of values of β on the Mushroom data set is shown in Figure 2. We note that this is a very easy data set for a symbolic learning algorithm, since it can be learned to an accuracy of about 99.50% with a few simple logical rules. However, standard kernels perform poorly on this data set, and the Hamming kernel has an accuracy of only 96.64%. The simple diffusion kernel brings the accuracy up to 99.23%.

Figure 2. The average error rate (left) and number of support vectors (right) as a function of the diffusion coefficient β on the Mushroom data set. The horizontal line is the baseline performance using the Hamming kernel.

³ SVMs were trained on some of the data sets and the results were comparable to what we report here for the voted perceptron.
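The kernel between two categorical examples therefore reduces to a product of per-attribute factors that depend only on whether the attribute values agree. One way such a Gram matrix might be assembled is sketched below (illustrative NumPy code under the stated assumptions; the alphabet sizes k_i are read off the data, and the voted perceptron itself is not shown).

    import numpy as np

    def categorical_diffusion_gram(X, beta):
        """Gram matrix of the diffusion kernel of Section 4.4 extended to categorical
        attributes with alphabet sizes k_i, up to a constant normalization."""
        X = np.asarray(X)
        n, d = X.shape
        k = np.array([len(np.unique(X[:, i])) for i in range(d)])           # alphabet sizes
        ratios = (1.0 - np.exp(-k * beta)) / (1.0 + (k - 1) * np.exp(-k * beta))
        K = np.ones((n, n))
        for i in range(d):
            disagree = X[:, i][:, None] != X[:, i][None, :]                 # pairs differing on attribute i
            K *= np.where(disagree, ratios[i], 1.0)
        return K

    # Toy data: three examples, two categorical attributes.
    X = np.array([["red", "small"],
                  ["red", "large"],
                  ["blue", "small"]])
    print(np.round(categorical_diffusion_gram(X, beta=0.5), 3))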

Table 1. Results on five UCI data sets. For each data set, only the categorical features are used. The column marked k indicates the maximum number of values for an attribute; thus the data sets with k = 2 have binary attributes. Results are reported for the setting of the diffusion coefficient β achieving the best error rate.

    Data Set        #Attr   k    Hamming Distance       Diffusion Kernel               Improvement
                                 error      # s.v.      error      # s.v.      β       error   # s.v.
    Breast Cancer      9    10   7.64%      387.0       3.64%      62.9        0.30    62%     83%
    Hepatitis         13     2   17.98%     750.0       17.66%     314.9       1.50    2%      58%
    Income            11    42   19.19%     1149.5      18.50%     1033.4      0.40    4%      8%
    Mushroom          22    10   3.36%      96.3        0.75%      28.2        0.10    77%     70%
    Votes             16     2   4.69%      286.0       3.91%      252.9       2.00    17%     12%

7. Conclusions

We have presented a natural approach to constructing kernels on graphs and related discrete objects by using the analogue on graphs of the heat equation on Riemannian manifolds. The resulting kernels are easily shown to satisfy the crucial positive semi-definiteness criterion, and they come with intuitive interpretations in terms of random walks, electrical circuits, and other aspects of spectral graph theory. We showed how the explicit calculation of diffusion kernels is possible for specific families of graphs, and how the kernels correspond to standard Gaussian kernels in a continuous limit. Preliminary experiments on categorical data, where standard kernel methods were previously not applicable, indicate that diffusion kernels can be effectively used with standard margin-based classification schemes. While the tensor product construction allows one to incrementally build up more powerful kernels from simple components, explicit formulas will be difficult to come by in general. Yet the use of diffusion kernels may still be practical when the underlying graph structure is sparse, by using standard sparse matrix techniques.

It is often said that the key to the success of kernel-based algorithms is the implicit mapping from a data space to some, usually much higher dimensional, feature space which better captures the structure inherent in the data. The motivation behind the approach to building kernels presented in this paper is the realization that the kernel is a general representation of this inherent structure, independent of how we represent individual data points. Hence, by constructing a kernel directly on whatever object the data points naturally lie on (e.g. a graph), we can avoid the arduous process of forcing the data through any Euclidean space altogether. In effect, the kernel trick is a method for unfolding structures in Hilbert space. It can be used to unfold non-trivial correlation structures between points in Euclidean space, but it is equally valuable for unfolding other types of structures which intrinsically have nothing to do with linear spaces at all.

References

Albert, R., & Barabási, A. (2001). Statistical mechanics of complex networks.

Belkin, M., & Niyogi, P. (2001). Laplacian eigenmaps for dimensionality reduction and data representation (Technical Report). Computer Science Department, University of Chicago.

Berg, C., Christensen, J., & Ressel, P. (1984). Harmonic analysis on semigroups: Theory of positive definite and related functions. Springer.

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121-167.

Chung, F. R. K. (1997). Spectral graph theory. No. 92 in Regional Conference Series in Mathematics. American Mathematical Society.

Chung, F. R. K., & Yau, S.-T. (1999). Coverings, heat kernels and spanning trees. Electronic Journal of Combinatorics, 6.

Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press.

Feller, W. (1971). An introduction to probability theory and its applications, vol. II. Wiley. Second edition.

Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37, 277-296.

Haussler, D. (1999). Convolution kernels on discrete structures (Technical Report UCSC-CRL-99-10). Department of Computer Science, University of California at Santa Cruz.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Proceedings of ECML-98, 10th European Conference on Machine Learning (pp. 137-142).

Mackay, D. J. C. (1997). Gaussian processes: A replacement for neural networks? NIPS tutorial.

Mika, S., Schölkopf, B., Smola, A., Müller, K., Scholz, M., & Rätsch, G. (1998). Kernel PCA and de-noising in feature spaces. Advances in Neural Information Processing Systems 11.

Saul, L. K., & Roweis, S. T. (2001). An introduction to locally linear embedding.

Schoenberg, I. J. (1938). Metric spaces and completely monotone functions. The Annals of Mathematics, 39, 811-841.

Schölkopf, B., & Smola, A. (2001). Learning with kernels. MIT Press.

Szummer, M., & Jaakkola, T. (2002). Partially labeled classification with Markov random walks. Advances in Neural Information Processing Systems.

Watkins, C. (1999). Dynamic alignment kernels. In A. J. Smola, B. Schölkopf, P. Bartlett, & D. Schuurmans (Eds.), Advances in kernel methods. MIT Press.
