
Nonlinear Time Series Filtering, Smoothing and Learning using the Kernel Kalman Filter

Liva Ralaivola, Laboratoire d'Informatique de Paris 6, 8 rue du Capitaine Scott, F-75015 Paris, France; Epigenomics Project, Genopole, 93 rue Henri Rochefort, F-91000 Evry, France. Florence d'Alché-Buc, Epigenomics Project, Genopole, 93 rue Henri Rochefort, F-91000 Evry, France.

LIVA.RALAIVOLA@LIP6.FR

DALCHE@EPIGENOMIQUE.ORG

Abstract
In this paper, we propose a new model, the Kernel Kalman Filter, to perform various nonlinear time series processing tasks. This model is based on the use of Mercer kernel functions within the framework of the Kalman Filter, or Linear Dynamical Systems. Thanks to the kernel trick, all the equations involved in filtering, smoothing and learning with our model only require matrix algebra, while providing the ability to model complex time series. In particular, it is possible to learn the dynamics of a nonlinear noisy time series with an exact EM procedure. When predictions in the original input space are needed, an efficient and original preimage learning strategy is proposed.

1. Introduction

Time series modeling has become an appealing field for machine learning approaches over the last decade. Time series analysis indeed has connections to economic forecasting, the study of biological sequences, speech recognition, video processing and the prediction of Web-user behavior. Faced with large amounts of data that often take the form of nonlinear time series, there is a need for transversal and flexible tools that can be engineered easily. In this paper, we propose a kernel-based approach to time series modeling enabling the implementation of prediction, denoising and learning tasks. On the one hand, our method presents the advantage of keeping the framework of linear dynamical systems usable and, on the other hand, the processing of non vectorial time series can be considered.

The paper is organized as follows. In section 2, we describe our Kernel Kalman Filter (KKF) model. Section 3 focuses on a few works which are closely related to ours, while section 4 reports empirical results achieved by our model on various time series processing tasks. Eventually, section 5 gives hints about future developments of KKF.

2. Kalman Filtering with Kernels

2.1. Kalman Filter Model

The Kalman filter is based on the idea that a dynamical process is hidden and can only be observed or measured through some time series. The dependency between two consecutive states of the process is assumed linear, as is the dependency between the measurements and the state process. Assume a sequence of observation vectors x_1, ..., x_T. The Kalman filter model (Kalman, 1960) assumes the following model for the series (x_t):

    s_t = A s_{t-1} + a + u_t    (1a)
    x_t = B s_t + b + v_t        (1b)

where u_t and v_t are zero-mean gaussian noise vectors of covariances \Sigma_u and \Sigma_v, respectively; s_t denotes the hidden state, A is a square state transition matrix, a an offset vector of the same dimension as the state, B a measurement matrix and b an offset vector of the same dimension as x_t. The state process (1a) is intended to model the actual transition between two states of the studied process, whereas the measurement process is (1b). It can be noticed that the studied series must have intrinsically linear dynamics to be modelled accurately by (1).

Two kinds of tasks are associated with the model (1). The first one encompasses filtering and smoothing, and deals with the estimation of the state vector when the parameters are known and a series is observed. The second one deals with the situation where the parameters of the model have to be learned: this task is classically addressed using an Expectation-Maximization (EM) algorithm (Dempster et al., 1977). A set of Kalman Filter and Kalman Smoother equations is available to perform the estimation of the states. The filtering equations provide an estimate of the current state from the observations seen so far; with suitable notation for the filtered means and covariances, they can be written as the classical forward recursions (2)-(6) (see e.g. Rosti & Gales, 2001).
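For reference, these forward recursions correspond to the standard Kalman filter predict/update steps. The following is a minimal NumPy sketch of them under model (1); all names (A, a, B, b, Su, Sv, m0, P0) are our own illustrative notation, not symbols taken from the paper.

    import numpy as np

    def kalman_filter(xs, A, a, B, b, Su, Sv, m0, P0):
        """Forward (filtering) pass for model (1): returns filtered means and covariances."""
        m, P = m0, P0
        means, covs = [], []
        for x in xs:
            # predict step (state process (1a))
            m_pred = A @ m + a
            P_pred = A @ P @ A.T + Su
            # update step (measurement process (1b))
            S = B @ P_pred @ B.T + Sv              # innovation covariance
            K = P_pred @ B.T @ np.linalg.inv(S)    # Kalman gain
            m = m_pred + K @ (x - (B @ m_pred + b))
            P = P_pred - K @ B @ P_pred
            means.append(m)
            covs.append(P)
        return means, covs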
The Kalman Smoother computes an estimate of each state from the entire observation sequence. It requires the filtering procedure to be performed first and implements a backward set of recursions, equations (7)-(12) in the classical formulation.
2.2. Kalman Filtering with Kernels

2.2.1. PROPOSED MODEL

In order to tackle nonlinear time series processing, we propose a kernelized version of the Kalman filter. To do so, instead of directly working with the nonlinear series (x_t), we focus on the associated series (\phi(x_t)), where \phi stands for a nonlinear mapping from the input space to a so-called feature space F, in which the inner product can be evaluated thanks to a Mercer kernel function k such that k(x, x') = <\phi(x), \phi(x')> (Boser et al., 1992). The generative model we propose to study is the following analogue of (1) in feature space:

    s_t = A s_{t-1} + a + u_t          (13a)
    \phi(x_t) = B s_t + b + v_t        (13b)

where all involved vectors are in F and where each matrix is of a size appropriate for multiplying a vector of F; u_t and v_t are zero-mean gaussian noise vectors. These noise vectors are assumed to live in the subspace spanned by the vectors \phi(x_1), ..., \phi(x_T), and their covariances are proportional to the identity matrix of this subspace. Although this might seem to be a very strong assumption, the efficiency of Kernel PCA for denoising (Mika et al., 1999) shows that such a noise model can cover a wide range of noise distributions in the input space.
As for all kernel methods, the key idea behind the model (13) is that a linear model defined in the feature space corresponds to a nonlinear one in the input space. The ability to fully exploit the model (13) is thus connected to the ability to address nonlinear time series processing. It can be noticed that the model (13) is a generalization of the model proposed by Ralaivola and d'Alché-Buc (2004), which corresponds to a specific, fixed choice of the measurement parameters.

2.2.2. KERNEL FILTERING AND SMOOTHING

Assume (see next subsection) that the parameters of (13) can be expressed in terms of an orthonormal basis of the subspace of F spanned by the mapped training points (thus of finite dimension). Such a basis can be obtained from the principal axes of a Kernel PCA (Mika et al., 1999) of the considered training series that have strictly positive eigenvalues.

If a noisy time series assumed to be generated from the same model as the training series is observed, it is possible to perform the filtering and the smoothing procedures, extending (2)-(12) to obtain the set of kernel filtering and kernel smoothing equations thanks to some matrix algebra (the details are omitted for the sake of clarity). Introducing the kernel evaluations between the observed series and the training series, the kernel filtering equations follow the same forward recursions, every quantity of interest being a finite-dimensional vector whose computation only depends on kernel evaluations. From these equations, and starting from the filtered estimates at the last time step, the kernel smoothing equations follow readily as backward recursions.

These procedures make it possible to estimate the most probable process states and to perform prediction or denoising tasks when nonlinear processes are studied. The filtered and the smoothed values of \phi(x_t) are obtained directly from these recursions as combinations of the mapped training points.
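Under the basis assumption above, the feature-space model reduces to an ordinary linear dynamical system in the finite-dimensional coordinate space of the Kernel PCA basis, so the classical recursions can be reused on kernel-derived coordinates. A sketch of this reduction, reusing the hypothetical helpers defined earlier (gaussian_gram, kpca_basis, kalman_filter); the reduced parameters At, at, Bt, bt are illustrative placeholders, not the paper's learned values:

    # Training series X (T x d) defines the feature subspace.
    X = np.random.randn(200, 3)                     # toy training series
    K = gaussian_gram(X, sigma=1.0)
    Alpha = kpca_basis(K)                           # basis coefficients, shape (T, r)

    def coords(Xnew, X, Alpha, sigma=1.0):
        """KPCA coordinates of new points: projections <phi(x_new), v_j> via kernel evaluations."""
        sq_new = np.sum(Xnew**2, axis=1)[:, None]
        sq_tr = np.sum(X**2, axis=1)[None, :]
        Kx = np.exp(-(sq_new + sq_tr - 2.0 * Xnew @ X.T) / (2.0 * sigma**2))
        return Kx @ Alpha                           # shape (n, r)

    r = Alpha.shape[1]
    # Illustrative parameters of the reduced r-dimensional linear dynamical system.
    At, at = 0.9 * np.eye(r), np.zeros(r)
    Bt, bt = np.eye(r), np.zeros(r)
    Su = Sv = 0.01 * np.eye(r)
    Z = coords(X, X, Alpha)                         # training series expressed in the basis
    means, covs = kalman_filter(Z, At, at, Bt, bt, Su, Sv, np.zeros(r), np.eye(r))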

` X a ` X F' a ` X F' & a a` X ' ` X ' & a $' '


G @ A9 8 2' 4 ( 2' )Q&  

EFD B

 

2 4

 3  2

 4 4R 2     2   R 

B 8

EFD 6

&

$ &

26 8

E FD

, the ltering and the smoothing pass dea basis of scribed above must rst be implemented. To perform the M step, it may be useful to compute the values , and the auxiliary vectors :

2.2.3. L EARNING

THE

PARAMETERS

Recall that the EM algorithm for linear dynamical systems aims at maximizing the likelihood of a training sequence under model (1). The EM algorithm to maximize this likelihood requires the implementation of the two following steps:

E step: perform the filtering and smoothing procedures with the current parameters on the observed (training) sequence;

M step: maximize the expectation of the complete-data log-likelihood with respect to the parameters, using the filtered and smoothed values and the associated second-order statistics (see Rosti & Gales, 2001, for more details).
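A compact view of the resulting procedure, in the reduced coordinate space used above. The M step shown here is a deliberately simplified least-squares update on the smoothed means; it stands in for the closed-form updates (14)-(16), which also involve the smoothed covariances and are not reproduced here:

    def em_learn(Z, r, n_iter=10):
        """Skeleton of the EM loop; the M step is a simplified placeholder."""
        At, at = np.eye(r), np.zeros(r)
        Bt, bt = np.eye(r), np.zeros(r)
        Su = Sv = np.eye(r)
        for _ in range(n_iter):
            # E step: filtering then smoothing with the current parameters.
            means, covs = kalman_filter(Z, At, at, Bt, bt, Su, Sv, np.zeros(r), np.eye(r))
            sm_means, sm_covs = kalman_smoother(At, at, Su, means, covs)
            S = np.array(sm_means)
            # M step (simplified): regress s_t on s_{t-1} to update the transition parameters.
            Prev = np.hstack([S[:-1], np.ones((len(S) - 1, 1))])
            sol, *_ = np.linalg.lstsq(Prev, S[1:], rcond=None)
            At, at = sol[:-1].T, sol[-1]
            Su = np.cov((S[1:] - Prev @ sol).T) + 1e-6 * np.eye(r)
        return At, at, Su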

This EM procedure remains valid when the learning of the parameters of a Kernel Kalman Filter is addressed. Given a basis of the feature subspace, the filtering and the smoothing passes described above must first be implemented. To perform the M step, it is useful to compute, from the filtered and smoothed quantities, a set of auxiliary vectors and second-order statistics whose evaluation again only involves kernel values.

Transposing the update formulas obtained for a linear dynamical system to this setting yields the kernel parameter updates (14)-(16), from which the transition and measurement parameters and the noise covariances follow. In order to cope with the possibly high complexity induced by the use of kernels, it is sensible to put a gaussian prior on the transition parameters, favoring matrices with small entries. Resorting to such complexity control, each occurrence of the corresponding matrix in (14) must be replaced by a regularized version; \lambda denotes the regularizing parameter of the resulting Equation (17). Finally, the noise covariances are given by trace expressions of the filtered and smoothed statistics.

These values rely on the assumption that each parameter vector of (1a), transposed to the feature space, is searched for as an element of the span of the mapped training points. This assumption is consistent with, and validates, the form used by the kernel filtering and smoothing procedures (cf. (14)-(16)).
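The complexity control above acts like ridge regularization: in matrix terms it shifts the spectrum of the matrix being inverted. A one-line illustration (the name K_reg and the choice of matrix are ours, not the paper's exact Equation (17)):

    lam = 1e-3                              # regularization coefficient (lambda)
    K_reg = K + lam * np.eye(K.shape[0])    # shifted matrix: better-conditioned inverse
    # e.g. np.linalg.solve(K_reg, y) instead of np.linalg.solve(K, y)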

2.3. Finding the Preimages through Learning

The model (13) only deals with vectors in the feature space F, whereas vectors from the input space are often needed. Given a vector in F, finding a vector in the input space whose image by \phi is as close as possible to it is known as the preimage problem. In order to solve this problem, we employ an original learning strategy presented in (Ralaivola & d'Alché-Buc, 2004): instead of solving a mathematical optimization problem each time a preimage is needed, we consider the training set, from which a regressor is learned. In other words, for each training point we learn the mapping from its coordinate vector with respect to the Kernel PCA basis back to the point itself. The learned regressor then provides an efficient tool to estimate preimages.
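A sketch of this preimage-by-regression idea: fit a kernel ridge regressor from KPCA coordinates back to input vectors on the training series, then apply it to the coordinates produced by filtering or smoothing. Names and hyperparameters are illustrative; the paper selects the regressor's kernel width on a test set.

    from sklearn.kernel_ridge import KernelRidge

    Z_train = coords(X, X, Alpha)                 # coordinates of the training points
    preimager = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.5)
    preimager.fit(Z_train, X)                     # learn: coordinates -> input vectors

    Z_filtered = np.array(means)                  # filtered coordinates from the KKF pass
    X_denoised = preimager.predict(Z_filtered)    # estimated preimages in input space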
3. Related Work

This work is related to the extended versions of the classical Kalman filter, namely the Extended Kalman Filter, or EKF (Anderson & Moore, 1979; Welch & Bishop, 1995), and the Unscented Kalman Filter, or UKF (Julier & Uhlmann, 1997; van der Merwe et al., 2001), and their variants. These algorithms can indeed be used in tasks involving prediction and/or noise removal when nonlinear processes are studied. However, as evidenced previously, our model does not need to resort to any approximate estimation strategy while, in addition, it makes it possible to learn the model parameters.

Two other works are closely related to ours. The first one is the work of Suykens et al. (2001), who tackled the so-called N-stage optimal control problem, a subproblem of the nonlinear filtering problem, using support vector regressors to model the transition process. The second one is that of Engel et al. (2002), who proposed to learn the weight vector of a kernel regressor using a Kalman filter with a fixed state transition process.

The topic of preimage determination has been an issue of importance since the appearance of the Kernel PCA algorithm. An algorithm has recently been proposed by Kwok and Tsang (2003) to determine preimages. It is based on the fact that some kernels can be inverted to recover the input-space distance between a pair of vectors from their kernel value, and it finds the location of the preimages from distance constraints in the feature space. Bakir et al. (2004) have simultaneously proposed to compute preimages by resorting to a learning strategy and have shown the efficiency of this approach.

Eventually, regarding the issue of parameter learning, our algorithm must be related to the work of Ghahramani and Roweis (1999) and Briegel and Tresp (1999). The algorithms presented in these papers however assume semi-parametric models whose parameters have to be learned. The approach that we have adopted is different in the sense that the proposed EM automatically determines the appropriate transition model directly through the update (14).

4. Experiments

4.1. Datasets and Protocol

The experiments presented in this section have been carried out on four datasets: the univariate time series Laser (see Figure 1, left) and Mackey-Glass MG30 (see Figure 1, right), and the multi-dimensional time series Ikeda and Lorenz. Two kinds of KKF performances are evaluated: the prediction capacity and the ability to process noisy nonlinear time series.

The Ikeda map (see Figure 2, left) is related to a laser dynamics. This time series is defined from a starting point by
    x1(t+1) = 1 + u [ x1(t) cos w(t) - x2(t) sin w(t) ]
    x2(t+1) = u [ x1(t) sin w(t) + x2(t) cos w(t) ]
    with w(t) = c1 - c2 / (1 + x1(t)^2 + x2(t)^2)

where u, c1 and c2 are real-valued parameters; we considered fixed values of these parameters for which the map exhibits its chaotic attractor.

A Lorenz attractor (see Figure 2, middle and right) is the solution of the system of differential equations

    dx/dt = sigma (y - x)
    dy/dt = rho x - y - x z
    dz/dt = x y - beta z

where sigma, rho and beta are real-valued parameters. We set the three parameters to the classical values producing chaotic behaviour.
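For concreteness, the following sketch generates an Ikeda-type series with the usual textbook parameterization (u = 0.9 and the classical phase constants); these particular values are our illustrative choice, not necessarily the ones used in the paper.

    def ikeda_series(n, u=0.9, x0=(0.1, 0.1)):
        """Generate n points of the Ikeda map (standard parameterization)."""
        pts = [np.array(x0, dtype=float)]
        for _ in range(n - 1):
            x1, x2 = pts[-1]
            w = 0.4 - 6.0 / (1.0 + x1**2 + x2**2)
            pts.append(np.array([1.0 + u * (x1 * np.cos(w) - x2 * np.sin(w)),
                                 u * (x1 * np.sin(w) + x2 * np.cos(w))]))
        return np.array(pts)

    ikeda = ikeda_series(300)    # 300 points, as in the dynamics extraction experiment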

The Laser and Mackey-Glass MG30 time series are univariate and, as proposed by Mukherjee et al. (1997) and Müller et al. (1997), we introduce embedding vectors built from lagged values of the series, the lag being a fixed step size; the series of embedding vectors is then used as the training series. For the Laser and MG30 time series, the embedding dimension and the step size are set to the values commonly used for these benchmarks. Conversely, Ikeda and Lorenz are multi-dimensional first-order series and the resort to embedding vectors (though possible) is not considered.
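A sketch of the embedding construction for a univariate series s; the embedding dimension d and step size tau below are placeholders to be set per dataset:

    def embed(s, d, tau):
        """Embedding vectors x_t = (s_t, s_{t-tau}, ..., s_{t-(d-1)tau}) for a 1-D series s."""
        start = (d - 1) * tau
        return np.array([s[t - np.arange(d) * tau] for t in range(start, len(s))])

    s = np.sin(0.3 * np.arange(500))    # toy univariate series
    X_embed = embed(s, d=4, tau=6)      # shape: (len(s) - (d-1)*tau, d)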


Figure 1. Left, the first 300 points of the Laser time series. Right, 300 points of the series MG30.

Figure 2. Left, the first 300 points of the series Ikeda (x2(t) is plotted as a function of x1(t)). Middle, the first 400 points of the Lorenz attractor: each of the coordinates x, y and z over time; right, the corresponding trajectory.

For all the datasets, the data are split into training data, test data used to choose the hyperparameters of the models, and validation data. Three training methods for the prediction of time series are tested: a multilayer perceptron (MLP), a support vector regressor (SVR) and a Kernel Kalman Filter (KKF). For the last two kernel machines, gaussian kernels have been used. The nonlinearity used in the MLPs is implemented by a standard activation function, and the code used to implement the MLP and the other filtering techniques is that of R. van der Merwe's ReBEL toolbox (http://choosh.ece.ogi.edu/rebel/index.html). The training of multi-dimensional time series with support vector regressors is done by decomposing the learning problem into as many training problems as the dimension of the time series.

For all the experiments, the regression machine used to determine the preimages is a Kernel Ridge Regressor with a gaussian kernel whose width is chosen according to the performance measured on the test set.

The measurement error used to evaluate the performance of the different algorithms is the mean squared error between the predictions of a machine and the corresponding series.

4.2. Prediction Accuracy of KKF

Two kinds of prediction capacities are evaluated for KKF. The first one is the one-step ahead prediction ability, which merely consists in predicting the observation at the next time step from the actual observation at the current time step. The second one is the multi-step ahead, or trajectory, prediction, where the model relies on its own output estimates to compute future observations.

The training sequences provided to the algorithms are clean (non noisy) sequences; for the prediction task, KKF is thus learned with no measurement noise. Table 1 reports the validation mean squared error obtained by the different models tested for the one-step prediction and the trajectory prediction. For KKF, the complexity control discussed above (see (17)) has been tested: the results are shown for different values of the regularization coefficient in order to assess its influence on the prediction quality.

The analysis of the results leads to the first conclusion that KKF learned with no measurement noise is able to provide a predictor of quality comparable with that of the MLP and the SVR. This prediction capacity is witnessed for all the datasets, even when the difficult task of trajectory prediction is considered. This observation is all the more striking as the determination of the preimages is an approximate process, and the accumulation of such approximation errors over iterated predictions might have been expected to deteriorate the overall performance.

A second observation relates to the dependence of our algorithm on the regularization: the unregularized method never obtains the best error, neither for the one-step ahead prediction nor for the trajectory prediction. This evidences the excessive complexity induced by the use of kernel functions and, as a possible consequence, the presence of an over-fitting phenomenon. Conversely, when the regularization is too strong (too large a coefficient), the model cannot reach interesting performances.

Incidentally, a strange result about the prediction of the series Ikeda by our algorithm may be noticed: the regularization parameter giving the best validation error for the trajectory prediction is the one which gives the worst results for the one-step prediction. Actually, if the forecasted trajectory provided by the corresponding model is compared to the true Ikeda trajectory, it appears that this low quadratic error does not correspond to an actual prediction ability; it is instead a sort of degenerate quasi-periodic trajectory passing through particular positions (see Figure 3). This phenomenon is also observed for two other values of the regularization coefficient, whereas for the remaining values the predicted trajectory is very similar to the real one (see Figure 4). The trajectory predictions for the other time series do not undergo such an atypical behaviour.
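The two evaluation modes can be sketched as follows, for any one-step predictor f mapping the current (embedded) observation to the next one; f here is a stand-in for the full KKF pipeline (filtering in feature space followed by preimage regression):

    def one_step_mse(f, series):
        """Mean squared one-step ahead prediction error on a validation series."""
        preds = np.array([f(series[t]) for t in range(len(series) - 1)])
        return np.mean(np.sum((preds - series[1:])**2, axis=-1))

    def trajectory(f, start, n_steps):
        """Multi-step (trajectory) prediction: the model is fed its own outputs."""
        traj = [start]
        for _ in range(n_steps):
            traj.append(f(traj[-1]))
        return np.array(traj)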



Figure 3. Trajectory prediction made by KKF. The triangles correspond to the KKF prediction and the crosses to the actual time series (see text for comments).

These prediction experiments validate both the efficiency of the Kernel Kalman Filter and the learning strategy used to compute the preimages.

4.3. Noise Extraction

The task tackled here consists in removing the noise from a series to which an observation noise has been added. For our experiments, we added a zero-mean gaussian noise to the validation series; the chosen noise variance corresponds to a fixed signal to noise ratio. The experiments described in this section have been carried out on the series MG30, Ikeda and Lorenz. The parameters of the KKF model are, for each time series, those which correspond to the best one-step ahead prediction results. The other filtering algorithms make use of the best MLP to model the transition between the states (see Table 1). Each experiment of noise removal is repeated 20 times and Table 2 reports the mean squared errors and standard deviations between the denoised signal and the original one.

The results of this table show that our filtering method is as effective as the usually implemented methods when the problem of nonlinear series denoising is tackled. In particular, it may be noticed that in the case of MG30, the measured mean squared error is lower than that provided by the other algorithms. Besides, in all cases, KKF produces results on average better than those of the usual Extended Kalman Filter procedure.
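A sketch of how a noisy validation series at a prescribed signal-to-noise ratio can be produced (the 3 dB target echoes the value used later in section 4.4; the helper name is ours):

    def add_noise_at_snr(series, snr_db, rng=np.random.default_rng(0)):
        """Add zero-mean gaussian noise so that 10*log10(signal_power/noise_power) = snr_db."""
        signal_power = np.mean(series**2)
        noise_power = signal_power / (10.0 ** (snr_db / 10.0))
        return series + rng.normal(scale=np.sqrt(noise_power), size=series.shape)

    noisy_val = add_noise_at_snr(ikeda, snr_db=3.0)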

Figure 4. Trajectory prediction made by KKF on the Ikeda series (above each figure, the value of the regularization coefficient).

Table 1. Mean squared errors for the one-step ahead prediction and for the multi-step (trajectory) prediction on the Laser, MG30, Ikeda and Lorenz series; the KKF rows correspond to different values of the regularizing parameter of Equation (17).

Table 2. Mean squared errors and standard deviation for the denoising task. CDKF differs from UKF by the strategy used to determine the so-called Sigma points.

Regarding the Lorenz series, a relatively important difference between the results of our method and the best results can be noticed; the reported standard deviation however prevents us from concluding that this reveals any limitation of KKF on the Lorenz problem. Finally, the main message of these noise extraction experiments is that our model makes it possible to carry out this task effectively using the matrix equations of recursive estimation given in section 2.2.2.

4.4. Dynamics Extraction

The last capacity of KKF we evaluate relates to the effectiveness of the training procedure described in section 2.2.3. We focus on the Ikeda problem and proceed as follows. First, we produce a series of 300 noisy training points from the Ikeda map, using its transition process together with an observation process in which both processes are corrupted by noise vectors. Then, we provide KKF with this series, from which the learning procedure determines the optimal parameters. Lastly, we measure the one-step ahead prediction capacity on a clean validation series of length 300. In order to show the genericity of our model, the kernel used here is no longer gaussian but polynomial of degree 5, and a fixed value of the regularization coefficient is used. For the experiments, the noise on the transition process is selected as a zero-mean gaussian vector whose variance corresponds to a signal to noise ratio of 3 dB with respect to the validation time series. Three types of noise are used for the observation process: a zero-mean gaussian noise (Figure 5, top), a uniform noise (Figure 5, middle) and a zero-mean Laplace noise (Figure 5, bottom) whose parameter is chosen so that its variance is 0.01.

Figure 5 illustrates the efficiency of our method for the extraction of complex dynamics from a noisy series. In particular, it must be noticed that our algorithm is able to extract dynamics very similar to the actual dynamics whichever noise model is considered. These results are all the more remarkable as the studied dynamics are very complex, which again validates our kernel extension of linear dynamical systems.
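A sketch of the ingredients of this last experiment: a degree-5 polynomial kernel and the three noise models. The constant term in the kernel and the exact noise parameters are illustrative assumptions (0.01 is the variance mentioned in the text):

    def poly_gram(X, Y=None, degree=5, c=1.0):
        """Polynomial kernel k(x, y) = (<x, y> + c)^degree."""
        Y = X if Y is None else Y
        return (X @ Y.T + c) ** degree

    rng = np.random.default_rng(0)
    var = 0.01                                      # target noise variance
    gaussian_noise = rng.normal(scale=np.sqrt(var), size=ikeda.shape)
    uniform_noise = rng.uniform(-np.sqrt(3 * var), np.sqrt(3 * var), size=ikeda.shape)
    laplace_noise = rng.laplace(scale=np.sqrt(var / 2.0), size=ikeda.shape)
    noisy_train = ikeda + gaussian_noise            # one of the three noisy training series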

5. Conclusion and Future Work


We have presented a new kernel method, the Kernel Kalman Filter, which is an extension of the Kalman filter and of linear dynamical systems. We have set up the filtering, smoothing and learning procedures for this model, showing that the equations involved are closely related to their classical linear counterparts. Through various experiments, we have shown the efficiency of KKF in tasks such as nonlinear time series denoising and prediction; the problem of extracting complex dynamics from noisy nonlinear series has also been successfully tackled by our approach. Lastly, the presented results confirm the ability of the preimage learning strategy to determine good preimages.


Figure 5. Left column: noisy training series. Right column: extracted dynamics. The first row corresponds to a uniform noise (validation error: 0.061), the second one to a gaussian noise (error: 0.061) and the last one to a Laplace noise (error: 0.064).


Several extensions to this work might be envisioned. For instance, the order of the matrices involved in the model scales linearly with the length of the studied time series, leading to possibly heavy computations and/or memory requirements; a sparse incremental greedy method could be a workaround for this problem. Besides, whereas working with a linear process in the state space allows some interpretability, this advantage is lost when working in the feature space. Thus, an interesting extension of this work would be to use kernels well suited for interpretability and to learn their parameters in order to extract valuable knowledge. Finally, the proposed algorithm offers an original framework to model time series of non vectorial objects, provided that a suitable kernel is available.

References

Anderson, B. D., & Moore, J. B. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice Hall.

Bakir, G. H., Weston, J., & Schölkopf, B. (2004). Learning to Find Pre-Images. Adv. in Neural Information Processing Systems.

Boser, B., Guyon, I., & Vapnik, V. (1992). A Training Algorithm for Optimal Margin Classifiers. Proc. of the 5th Workshop on Comp. Learning Theory.

Briegel, T., & Tresp, V. (1999). Fisher Scoring and a Mixture of Modes Approach for Approximate Inference and Learning in Nonlinear State Space Models. Adv. in Neural Information Processing Systems.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39.

Engel, Y., Mannor, S., & Meir, R. (2002). Sparse online greedy support vector regression. Proc. of the 13th European Conference on Machine Learning.

Ghahramani, Z., & Roweis, S. (1999). Learning nonlinear dynamical systems using an EM algorithm. Adv. in Neural Information Processing Systems.

Julier, S., & Uhlmann, J. (1997). A New Extension of the Kalman Filter to Nonlinear Systems. Int. Symp. Aerospace/Defense Sensing, Simul. and Controls.

Kalman, R. E. (1960). A New Approach to Linear Filtering and Prediction Problems. Transactions of the ASME - Journal of Basic Engineering, 82, 35-45.

Kwok, J. T., & Tsang, I. W. (2003). The pre-image problem in kernel methods. Proc. of the 20th International Conference on Machine Learning.

Mika, S., Schölkopf, B., Smola, A. J., Müller, K.-R., Scholz, M., & Rätsch, G. (1999). Kernel PCA and De-Noising in Feature Spaces. Adv. in Neural Information Processing Systems.

Mukherjee, S., Osuna, E., & Girosi, F. (1997). Nonlinear prediction of chaotic time series using support vector machines. Proc. of IEEE NNSP'97.

Müller, K., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., & Vapnik, V. (1997). Predicting Time Series with Support Vector Machines. Proc. of Int. Conf. on Artificial Neural Networks.

Ralaivola, L., & d'Alché-Buc, F. (2004). Dynamical Modeling with Kernels for Nonlinear Time Series Prediction. Adv. in Neural Information Processing Systems.

Rosti, A.-V., & Gales, M. (2001). Generalised linear Gaussian models (Technical Report CUED/F-INFENG/TR.420). Cambridge University Engineering Department.

Suykens, J. A. K., Vandewalle, J., & De Moor, B. (2001). Optimal control by least squares support vector machines. Neural Networks, 14, 23-35.

van der Merwe, R., de Freitas, N., Doucet, A., & Wan, E. (2001). The Unscented Particle Filter. Adv. in Neural Information Processing Systems.

Welch, G., & Bishop, G. (1995). An introduction to the Kalman filter (Technical Report TR 95-041). University of North Carolina.
