
3-D Kalman Filter for Image Motion Estimation

Jaemin Kim and John W. Woods


Abstract—

This paper presents a new 3-D Markov model for motion vector fields. The three dimensions consist of the two space dimensions plus a scale dimension. We use a compound signal model to handle motion discontinuities in this 3-D Markov random field. For motion estimation, we use an extended Kalman filter as a pel-recursive estimator. Since a single observation can be sensitive to local image characteristics, especially when the model is not accurate, we employ windowed multiple observations at each pixel to increase accuracy. These multiple observations use a different weighting value for each observation, since the uncertainty in each observation is different. Finally, we compare this 3-D model with earlier proposed 1-D (coarse-to-fine scale) and 2-D spatial compound models, in terms of motion estimation performance on a synthetic and a real image sequence.

Keywords—3-D Markov random field, pel-recursive motion estimation, extended Kalman filter, multiscale, compound model, multiple observation.

Jaemin Kim is with Samsung Semiconductor Inc., San Jose, CA, USA. John W. Woods is with the Electrical, Computer, and Systems Engineering Department, Rensselaer Polytechnic Institute, Troy, NY, USA.

I. INTRODUCTION

Motion estimation is formulated as the computation of a 2-D motion vector field. The motion vector field is defined as the set of motion vectors which denote the apparent motion of the image brightness pattern in a time-varying image sequence. This perceived motion has been called optical flow. The optical flow can be the true motion or just a change of illumination in the scene. Since motion vector fields relate successive images of a given scene, reliable estimation of these fields is important for image sequence processing tasks such as interpolation, coding, tracking, and filtering [1]. For certain visual tasks, such as video compression, it may not be necessary to compute true motion. However, other tasks such as temporal interpolation require true motion. There are two principal approaches to the estimation of motion vector fields: the feature-based matching approach and the gradient-based approach. The gradient-based approach allows the motion vector field to be dense, and does not have to find feature points with significant variations. In the gradient-based approach, the gray value variation at a pixel does not give enough information to determine the two components of the motion vector at that pixel; the problem is therefore called ill-posed or ill-conditioned. The well-known approach to solving ill-posed problems is to restrict the class of admissible solutions by using suitable a priori knowledge [2,3]. In an algebraic approach, Horn and Schunck [4] sought the solution which simultaneously satisfies the optical constraint (measurement) and minimizes a motion smoothness error. Hildreth [5] estimated the motion vector component orthogonal to the intensity contour while minimizing motion variation along the contour. Nagel and Enkelmann [6,7] and Fogel [8] varied the smoothness constraint based on the characteristics of the data structure.

Yuille and Grzywacz [9] proposed a theoretical model for motion coherence and chose the smoothing function so that the interaction between the motion vectors falls off like a Gaussian as a function of the distance between them. As a stochastic approach, vector field models have also been used for smoothing motion vectors. Stuller and Krishnamurthy [10] modeled each component of the motion vector field with a stationary 1-D autoregressive (AR) model. Driessen and Biemond [11] replaced the 1-D AR model with a 2-D one. Then Efstratiadis and Katsaggelos [12] used a nonstationary 2-D (spatial) AR model, and Brailean and Katsaggelos [13] used multiple 2-D AR models. Instead of generalized causal AR models, Konrad and Dubois [14] used noncausal AR models. They modeled each component of the motion vector field with a compound Gaussian Markov random field. Driessen [15] used a 3-D (spatiotemporal) Gaussian Markov random field model. A spatiotemporal model uses both the spatial and the temporal dependencies of the motion vector field. Namazi et al. [16,17] modeled the transform coefficients with a stationary 1-D AR model and used a 1-D Kalman filter for the estimation of the coefficients corresponding to low frequencies. As types of pel-recursive methods, a 1-D Kalman filter and a 2-D Kalman filter have been used for estimating motion vector fields [10,18,13]. In these methods, motion vector fields have been modeled with a single or compound generalized causal AR model. In addition to the above methods, multiscale (coarse-to-fine) methods have been used for solving the ill-posed problem [19,20]. In multiscale methods, the motion vector field is first estimated at the previous, coarser scales. This estimated vector field is propagated to the finer scale along with its estimation uncertainty. Since this propagation can be thought of as a 1-D (coarse-to-fine) Markov random sequence, and each observation at each scale has observation uncertainty, a 1-D Kalman filter is used over the scale. This multiscale method can be viewed as imposing an a priori smoothness constraint over the scale [19].

In this paper, we model the motion vector field with a compound 3-D generalized causal AR model. The 3-D space consists of the 2-D image space augmented with a 1-D multiscale dimension. When there is a discontinuity in the motion vector field, motion vectors on one side are weakly correlated with motion vectors on the other side. Therefore, causal AR models do not give good prediction there. A noncausal AR model gives better prediction than causal AR models, because the noncausal model uses information from motion vectors on both sides of the discontinuity. The proposed 3-D model uses information from motion vectors at the previous scales, motion vectors in the previous rows, and earlier motion vectors in the current row. The motion vectors at the previous scale carry information about motion vectors in the noncausal support region at the current scale. In the 1-D multiscale method, the estimate at the previous scale is coarser than the estimate at the current scale. When there is a line discontinuity in the 2-D motion vector field, a motion vector estimated in the causal region at the current scale can be more highly correlated with the motion vector at the current pixel than motion vectors estimated at the previous scale. When the prediction error is smaller, so is the linearization error, and gradient-based motion estimation then performs better. Therefore, the proposed 3-D model gives better prediction than the 2-D spatial model and the 1-D multiscale model. We will show that this 3-D model can handle motion discontinuities properly. We estimate the motion vector field from a measurement (the optical constraint) and the proposed motion model, using a 3-D extended Kalman filter.

This 3-D extended Kalman filter is a simple extension of the 2-D reduced order model Kalman filter (ROMKF) [21] and is derived in [22]. The paper is organized as follows: the 3-D motion vector field model is presented first in Section 2. This is followed by the observation model in Section 3. In Section 4, we explain how to obtain a set of model parameters for the proposed 3-D vector field model. In Section 5, we show a multi-model extension of the 3-D motion vector field model. In Section 6, we use a 3-D extended Kalman filter for motion estimation and present the model detection problem. Some experimental results are presented and discussed in Section 7. This is followed by Conclusions.

II. THREE-DIMENSIONAL MOTION VECTOR FIELD MODEL

We describe a 3-D Gaussian Markov random process for modeling a motion vector field, where the 3-D space consists of the 2-D space and the 1-D multiscale. The term motion vector here means "displacement vector". The image field at the current scale is horizontally and vertically decimated by two to generate the image field at the next coarser scale. When the image field s(n1, n2, n3 − 1) at the previous coarser scale n3 − 1 is generated as

s(n1, n2, n3 − 1) = (1/4) [ s(2n1, 2n2, n3) + s(2n1 + 1, 2n2, n3) + s(2n1, 2n2 + 1, n3) + s(2n1 + 1, 2n2 + 1, n3) ],   (1)

the unit distance, the distance between two neighboring pixels, at the scale n3 − 1 is twice that at the finer scale. This means that when an object moves one pixel at the previous coarser scale n3 − 1, the corresponding object moves two pixels at the finer scale n3. Therefore, the motion vector d(2n1, 2n2, n3) can be estimated by 2 d(n1, n2, n3 − 1). When the motion vector field at the previous coarser scale n3 − 1 is propagated to the finer scale n3, a parent motion vector at the previous coarser scale is propagated to four child motion vectors, as illustrated in Fig. 1. In this Markov random field sequence, we define the following ordering structure: at each scale n3, motion vectors are processed line-by-line, in raster scan order.
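As a concrete illustration of the dyadic pyramid of (1) (a sketch, not the authors' software; function names and array layout are my own assumptions):

```python
import numpy as np

def coarser_scale(s: np.ndarray) -> np.ndarray:
    """Average non-overlapping 2x2 blocks, as in Eq. (1)."""
    h, w = s.shape
    assert h % 2 == 0 and w % 2 == 0
    return 0.25 * (s[0::2, 0::2] + s[1::2, 0::2]
                   + s[0::2, 1::2] + s[1::2, 1::2])

def pyramid(s: np.ndarray, levels: int) -> list:
    """Return [finest, ..., coarsest]; a one-pixel motion at a coarse
    level corresponds to a two-pixel motion one level finer."""
    p = [s]
    for _ in range(levels - 1):
        p.append(coarser_scale(p[-1]))
    return p
```

Propagating a coarse estimate to the finer level then amounts to doubling each parent vector and copying it to its four children.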
Motion vectors at the finer scale n3 are processed after all motion vectors at the previous coarser scale n3 − 1 have been processed. In this way, we can divide the multiscale motion vector field into two parts, i.e., past and future. The generalized causal model support region illustrated in Fig. 1 is described by

R⁺⁺ = {k | k1 ≥ 0, k2 ≥ 0, k3 = 0, (k1, k2) ≠ (0, 0)} ∪ {k | k1 < 0, k2 ≥ 1, k3 = 0} ∪ {k | M3 ≥ k3 ≥ 1},   (2)

where k = (k1, k2, k3) is a spatial offset at scale offset k3, and the order of the proposed 3-D model is M3. This 3-D Markov random field is 2-D cyclostationary and thus has multiple phases. In this paper, we use the phase index α = [α1, α2] to emphasize that. The two components of the motion vector may be correlated, but the horizontal and the vertical motions are generally uncorrelated with each other. In the proposed 3-D Markov motion vector field model, we assume that the correlation structure for both components is the same [10]. Then a motion vector field d = [d1, d2] at spatial location (2^{M3} n1 + α1, 2^{M3} n2 + α2) at scale level n3 can be expressed as a 3-D Markov motion vector field of the form

d_i(2^{M3} n1 + α1, 2^{M3} n2 + α2, n3)
  = Σ_{k ∈ R⁺⁺} a(k1, k2, k3; α, n3) 2^{k3} d_i(2^{M3−k3} n1 + α1^{(M3−k3)} − k1, 2^{M3−k3} n2 + α2^{(M3−k3)} − k2, n3 − k3)
  + v_i(2^{M3} n1 + α1, 2^{M3} n2 + α2, n3),   i = 1, 2,
  α_i^{(M3−k3)} = 0, ..., 2^{M3−k3} − 1,   (3)
which can be written in the simplified notation

d(2^{M3} n + α) = Σ_{k ∈ R⁺⁺} A(k; α, n3) 2^{k3} d(2^{M3−k3} n + α^{(M3−k3)} − k) + v(2^{M3} n + α),   (4)

where A(k; α, n3) is a matrix representing an AR parameter set, which depends on (α1, α2, n3), and v(2^{M3} n + α) is the white vector field which generates the motion vector field. Each component v_i, i = 1, 2, is assumed to be Gaussian, zero-mean white noise with correlation function

Q(m, p) = σ²_{v_i}(m1, m2, m3) δ(m1 − p1, m2 − p2, m3 − p3),   (5)

where δ(m1 − p1, m2 − p2, m3 − p3) is a discrete delta function, and i = 1, 2. As the model support region R⁺⁺ increases, the proposed 3-D Markov random field can model motion vector fields more accurately, but its complexity also increases. In some papers on multiscale signal processing, a signal is defined on a 3-D pyramid and a tree-like structure is used [23]. In this paper, we are interested in the simplest 3-D Markov random field. In multiscale Markov fields, the 1st-order Markov structure has been used successfully for modeling various random fields [20]. Hence, we use the 3-D Markov random field whose scale component is 1st order (M3 = 1):

p{d(n) | D(n−1)} = p{d(n) | D11(n−1)},   (6)

where

n = (n1, n2, n3),   n − 1 ≜ (n1 − 1, n2, n3),
D(n−1) ≜ { d(1, 1, n3), ..., d(N1, 1, n3), ..., d(1, n2, n3), ..., d(n1 − 1, n2, n3), d(n3 − 1), d(n3 − 2), ..., d(1) },
D11(n−1) ≜ { d(1, 1, n3), ..., d(N1, 1, n3), ..., d(1, n2, n3), ..., d(n1 − 1, n2, n3), d(n3 − 1) },
d(n3 − 1) ≜ { d(1, 1, n3 − 1), ..., d(N1/2, 1, n3 − 1), ..., d(1, N2/2, n3 − 1), ..., d(N1/2, N2/2, n3 − 1) },

where N1 is the number of pixels in a row and N2 is the number of rows in a frame. D(n−1) includes everything before the current pixel in the current row, everything in all previous rows, and everything at all previous scales, while D11(n−1) includes everything before the current pixel in the current row, everything in all previous rows, and everything at the previous coarser scale n3 − 1 only.

A. Four Phases of the Nearest-Pixel Model

To predict the motion vector at the pixel currently being processed, we want to use the motion vectors in the smallest 3-D model support region. The smallest 3-D model support region consists of the parent pixel plus the (1,1)-order nonsymmetric half-plane (NSHP) support region at the current scale. This model support region, here termed the nearest-pixel support, is illustrated in Fig. 2 and is described by

R_nearest = {(1, 0, 0), (1, 1, 0), (0, 1, 0), (−1, 1, 0), (0, 0, 1)}.   (7)
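A minimal sketch of a prediction over the nearest-pixel support (7), for one motion component, assuming M3 = 1; the helper names, array layout, and dictionary of coefficients are my own, not from the paper:

```python
import numpy as np

# Nearest-pixel support of Eq. (7): four NSHP neighbors at the current
# scale (k3 = 0) and the parent at the coarser scale (k3 = 1).
R_NEAREST = [(1, 0, 0), (1, 1, 0), (0, 1, 0), (-1, 1, 0), (0, 0, 1)]

def predict(d_fine, d_coarse, m1, m2, a):
    """Predict one motion component at pixel (m1, m2).

    d_fine  : partially estimated field at the current scale (raster order)
    d_coarse: estimated field at the coarser scale
    a       : dict mapping each k in R_NEAREST to its AR coefficient
    """
    pred = 0.0
    for (k1, k2, k3) in R_NEAREST:
        if k3 == 0:
            pred += a[(k1, k2, k3)] * d_fine[m2 - k2, m1 - k1]
        else:  # parent vector: one pixel there is two pixels here
            pred += a[(k1, k2, k3)] * 2.0 * d_coarse[m2 // 2, m1 // 2]
    return pred
```

Note the factor 2 on the parent term, reflecting the 2^{k3} scaling of (4).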

From (4), (5), and the orthogonality condition [24], the parameter set for each component is determined by the following equation:

E{ [ d_i(2n + α) − Σ_{k ∈ R_nearest} a(k; α, n3) 2^{k3} d_i(2^{1−k3} n + α^{(1−k3)} − k) ] 2^{j3} d_i(2^{1−j3} n + α^{(1−j3)} − j) }
  = σ²_{v_i}(n) δ(j),   j ∈ R_nearest ∪ {0},   i = 1, 2,   (8)

where the arguments (2n + α) and (2^{1−k3} n + α^{(1−k3)} − k) are as defined in (3) and (4), and the order of the proposed model is one (M3 = 1). In (8), E[ d_i(2n1 + α1, 2n2 + α2, n3) d_i(2n1 + α1 − k1, 2n2 + α2 − k2, n3) ] is independent of (α1, α2). However, from (3), E[ d_i(2n1 + α1 − k1, 2n2 + α2 − k2, n3) d_i(n1, n2, n3 − 1) ] depends on (α1, α2), and so the nearest-pixel model has four different phases at each scale, that is, four different model parameter sets at each scale. These four phases are illustrated in Fig. 2. For the nearest-pixel model (M3 = 1), Eq. (4) simplifies to

d(2n + α) = Σ_{k ∈ R_nearest} A(k; α, n3) 2^{k3} d(2^{1−k3} n + α^{(1−k3)} − k) + v(2n + α),   (9)

where (2n + α) and (2^{1−k3} n + α^{(1−k3)} − k) are defined in (3) and (4), α denotes the four phases, and the n3 argument of A(·) indicates that the model is scale-variant.

III. OBSERVATION MODEL

In an intensity image sequence, the motion field is observed through displaced intensities. Two consecutive image frames are related by a displacement field. Introducing the new variable t to denote time, the observed pyramidal image frame is given by

r(n1, n2, n3, t) = s(n1, n2, n3, t) + w_r(n1, n2, n3, t),   (10)

where s(n1, n2, n3, t) is the original image intensity at spatial location (n1, n2), scale level n3, and time t; r(n1, n2, n3, t) is the observed image intensity; and the observation noise w_r(n1, n2, n3, t) is assumed to be zero-mean and Gaussian. Under the assumption that image intensity is constant along motion trajectories, two observed image frames are related by the displacement field as

r(n1, n2, n3, t) = r̃(n1 − d1(n1, n2, n3), n2 − d2(n1, n2, n3), n3, t − 1) + w_obs(n1, n2, n3, t),   (11)

where

r̃ ≜ { r, if d1 and d2 are integers; the interpolated value, otherwise,   (12)

and w_obs(n1, n2, n3, t) = w_r(n1, n2, n3, t) − w_r(n1, n2, n3, t − 1) + spatial interpolation error. We will use (11) as the observation equation for the motion vector field, and thereby assume that image intensity does not change along motion trajectories. Since (11) provides only a single linear constraint on the two unknown components of the motion vector, a stochastic model of the motion vector field is used to combine information from motion vectors in the neighborhood, thereby producing a unique solution. Even though a motion vector field model eliminates the ill-posed problem, the observation equation (11) can still lead to motion estimates that are sensitive to local characteristics of a noisy image, especially when the model does not represent the local characteristics of the motion vector field well.

A. Multi-Observation

At each pixel, the motion vector is usually fairly constant in a small region centered at the pixel. Hence, observations in a symmetric J × J window can be used for estimating the motion vector at that pixel. Assuming that motion is constant in the J × J window W, the observations within the window can be expressed as

r(n1 − j1, n2 − j2, n3, t)
  = r̃(n1 − j1 − d1(n), n2 − j2 − d2(n), n3, t − 1)
    + ∇_d r̃(n1 − j1 − d1(n), n2 − j2 − d2(n), n3, t − 1)ᵀ [ d(n1 − j1, n2 − j2, n3) − d(n) ]
    + higher-order terms + w_obs(n1 − j1, n2 − j2, n3, t)   (13)
  = r̃(n1 − j1 − d1(n), n2 − j2 − d2(n), n3, t − 1) + w_err(n1 − j1, n2 − j2, n3, t),   (14)

where w_err(n1 − j1, n2 − j2, n3, t) collects the second, third, and fourth terms on the right-hand side of (13). The error term w_err(n1 − j1, n2 − j2, n3) grows as the distance between the center of the window and the pixel (n1 − j1, n2 − j2, n3) increases. This error is treated as zero-mean Gaussian noise. The observation uncertainty is different at each pixel in the window. Hence, the observation at each pixel (n1 − j1, n2 − j2, n3) should be weighted differently when estimating d(n1, n2, n3). When the weighting value at each pixel (n1 − j1, n2 − j2, n3) is defined as

ω(j1, j2) ≜ variance of w_err(n1, n2, n3) / variance of w_err(n1 + j1, n2 + j2, n3),   (15)

the weighting value decreases as the distance between the center of the window and the pixel (n1 − j1, n2 − j2, n3) increases. Since the observation uncertainty at each pixel depends on the local properties of the image and the motion vector field, determining the weighting values exactly is quite complicated. Here we use a predetermined set of weighting values. From (14) and (15), a formula for determining the set of weighting values is

ω(j1, j2) = σ² / ( σ² + (j1² + j2²) ).   (16)

In Section 6, we will explain how this set of weighting values is used.
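For instance, under (16) the weights for a J × J window can be tabulated as follows (the value of σ² here is a free parameter, chosen only for illustration):

```python
import numpy as np

def window_weights(J: int, sigma2: float) -> np.ndarray:
    """Weights ω(j1, j2) = σ² / (σ² + j1² + j2²) of Eq. (16), J×J window."""
    half = J // 2
    w = np.empty((J, J))
    for j2 in range(-half, half + 1):
        for j1 in range(-half, half + 1):
            w[j2 + half, j1 + half] = sigma2 / (sigma2 + j1**2 + j2**2)
    return w

w = window_weights(3, sigma2=1.0)
# The center weight is 1; weights fall off with distance from the center.
```

The center observation always gets unit weight, and larger σ² flattens the profile, trading robustness for localization.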

IV. MODEL PARAMETER SETS AT MULTISCALE LEVELS

Our 3-D AR model can be interpreted as a smoothing constraint over space and scale. In the 2-D case, Yuille and Grzywacz [9] imposed two criteria on the smoothing function of the motion vector field: it must impose enough smoothness to make the problem well-posed, and the interactions between different measurements must fall off to zero at large distances. They chose the smoothing function such that the correlation between two motion vectors is a Gaussian function, for the following reasons: first, it meets the two criteria; second, it yields analytic solutions; and third, it has a natural spatial scale. When the correlation function is given by

f(n1, n2) = (1 / 2πσ²) exp( −(n1² + n2²)^{γ/2} / 2σ² ),   for 1 ≤ γ ≤ 2,   (17)

it satisfies the two criteria and may be able to account for a large part of the motion vector field. When γ = 2, the correlation is a 2-D Gaussian function, which corresponds to a separable 2-D AR random field. The value γ = 1 has been used for describing a 2-D isotropic random field [25]. By selecting the two parameters γ and σ, we choose the model parameter set for the motion vector field. In our proposed 3-D model for motion vector fields, it is very complicated to obtain analytic 3-D model parameters. However, a parameter set for the 3-D AR model at each level n3 can be obtained from the Yuille and Grzywacz model as follows:
1. By selecting γ and σ, determine the parameter set for a 2-D AR model.
2. With the chosen model parameter set, generate a 2-D random field, which becomes the random field at the finest scale (n3 = 4 in our case).
3. From the random field at the finer scale n3, generate a random field at the coarser scale n3 − 1.
4. From the random fields at the finer scale and the coarser scale, determine the parameter set for a 3-D AR model which satisfies the linear minimum mean square error (LMMSE) criterion (8).
5. At the next coarser scale n3 − 1, repeat steps 3-4, since the model parameters may be scale-variant.

In the above procedure, we do not use a real displacement field, but rather a 2-D random field that satisfies the given correlation function, for the following reasons. First, in a region where the motion flow is continuous in all directions, the motion vector is very strongly correlated with its neighboring motion vectors. In the multi-model approach of the next section, a 3-D isotropic model is used for motion vectors in such regions. By selecting γ and σ, we can choose a 2-D random field that models motion vectors in such regions accurately, and hence we can obtain the model parameter set for a 3-D isotropic model from the above procedure. Second, the purpose of using a stochastic model is to restrict the class of admissible solutions. Smoothness constraints and stochastic models are both used for smoothing the motion vector field. When we do not know the true random field, a random field that satisfies Yuille and Grzywacz's two criteria can be used for smoothing the motion vector field. Given γ and σ, the parameter sets at the different scale levels are listed in [22]. The 3-D AR model is scale-variant. As the scale level decreases (gets coarser), the spatial correlation decreases, and the prediction of the motion vector at the pixel currently being processed depends more on the parent motion vector. At the same time, as the scale level decreases, the magnitude of the motion vector (displacement vector) decreases. The overall result is that the model error variance decreases, even though the spatial correlation decreases. It turns out that the error variance of the 3-D AR model at the coarser scale n3 − 1 is about half that of the 3-D AR model at the finer scale n3.

V. MULTI-MODEL FOR MOTION VECTOR FIELDS

Motion vector fields may be assumed to be globally stationary random vector fields, but they clearly have nonstationary local structures such as motion discontinuities. To represent the nonstationary local structures of motion vector fields, a multi-model is used. Our multi-model consists of six different elementary models: one 3-D isotropic model, four spatial-directional models (0, 45, 90, and 135 degrees), and one spatially unpredictable model (a scale-directional model), in which only a motion vector at the previous coarser scale is used for predicting the current motion vector. For the isotropic model, we use the method proposed in Section 4. For the spatial-directional models, only a single motion vector in the generalized causal support region is used for predicting the motion vector at the current pixel; the prediction comes from a pixel in the given direction.

These directional models are simple and keep an estimated motion vector field sharp at motion discontinuities. When there are motion discontinuities among the pixel (2n + α) and the four pixels {(2n1 + α1, 2n2 + α2, n3), (2n1 + α1 + 1, 2n2 + α2, n3), (2n1 + α1, 2n2 + α2 + 1, n3), (2n1 + α1 + 1, 2n2 + α2 + 1, n3)} around the spatial generalized causal support region, motion vectors in the spatial support region cannot satisfactorily predict the motion vector at (2n + α). Since the 3-D generalized causal support region includes the previous scale, a motion vector from the previous coarser scale can be used for the prediction when the spatial-directional models do not help. Since the 3-D motion vector field has four phases, always using the parent motion vector for this prediction is not appropriate. When the phase at the current point is the 4th phase, (α1, α2) = (1, 1) (illustrated in Fig. 2), the parent motion vector d(n1, n2, n3 − 1) corresponds to the average value of d(2n1, 2n2, n3), d(2n1 + 1, 2n2, n3), d(2n1, 2n2 + 1, n3), and d(2n1 + 1, 2n2 + 1, n3). These four motion vectors are located at the four pixels in the spatial generalized causal support region. Hence, the parent motion vector is also inappropriate for predicting the motion vector at a motion discontinuity. In this case, we can instead use the motion vector at (n1 + 1, n2 + 1, n3 − 1). As a result, the selection of a motion vector at the coarser scale depends on the four phases:

select d(n1, n2, n3 − 1)           if α = (0, 0),
select d(n1 + 1, n2, n3 − 1)       if α = (1, 0),
select d(n1, n2 + 1, n3 − 1)       if α = (0, 1),
select d(n1 + 1, n2 + 1, n3 − 1)   if α = (1, 1).   (18)

This is illustrated in Fig. 3. For this multi-model, the underlying process which determines the elementary model is described by a Markov chain. Equation (9) then becomes

d(2n + α) = Σ_{k ∈ S_{A_{l(n)}}} A_{l(n)}(k; α, n3) 2^{k3} d(2^{1−k3} n + α^{(1−k3)} − k) + v_{l(n)}(2n + α),   (19)

where l(n) is the most likely elementary model at n, S_{A_{l(n)}} is the support region of that model, and (2n + α) and (2^{1−k3} n + α^{(1−k3)} − k) are defined in (3) and (4).

As described in the previous section, the motion vector field model is both scale-variant and space-variant. Under the Gaussian assumption, each model parameter set consists of a prediction parameter set and an error variance. For the isotropic model, the model parameter set can be calculated by the method proposed in the previous section. For the directional models, the model parameter sets consist of a single prediction parameter and an error variance. In general, the error variance of the spatially unpredictable model should be greater than that of the isotropic model. Since we employ a compound field model, we try to detect the most likely elementary model at each pixel while estimating the motion vector field. When the error variance of a model increases, the probability of selecting that model decreases. In a real motion vector field, most regions are smooth and only a small number of regions have sharp transitions. Based on these facts, we empirically chose the parameter set for each directional model. At each scale, the error variance for each spatial-directional model was set to one and a half (1.5) times that of the corresponding 3-D isotropic model. The error variance for the scale-directional model was in turn set to one and a half (1.5) times that of the spatial-directional models. The exact values for each scale are given in [22].

VI. RECURSIVE ESTIMATION

In the previous sections, we have described our compound multiscale model for the proposed 3-D motion vector field. In this section, we describe a recursive estimation method for this model. In the 3-D case, the global state for the Kalman filter is O(N1 N2 M3)-dimensional, where N1 is the width of the image frame, N2 is its height, and M3 is the order of the recursive model. Since updating all these points requires an enormous amount of computation, the reduced order model Kalman filter (ROMKF) and the reduced update Kalman filter (RUKF) were proposed as suboptimal filters [21,26]. In [22,27], based on the relationship between RUKF and ROMKF, the ROMKF was further improved. In motion estimation, the observation is a nonlinear function of the motion vector: at each pixel, the observation equation must be linearized about the predicted motion trajectory. Hence, we use the ROMKF because of its better adaptation capability in a nonlinear setting.

A. Extended Kalman Filter for Multi-Observations

Since the observation equation (13) is nonlinear, we use an extended Kalman filter for estimating motion vector fields. The 1-D extended Kalman filter was given in [28]. When the multi-observation equation within a 3 × 3 window W is linearized about the predicted displacement vector d̂(n) at each multiscale pixel n, Eq. (13) can be expressed simply as

z(n1 − j1, n2 − j2, n3) = fᵀ(n1 − j1 − d̂1(n), n2 − j2 − d̂2(n), n3) δd(n) + w(n1 − j1, n2 − j2, n3),   ∀(j1, j2) ∈ W,   (20)

where

z(n1 − j1, n2 − j2, n3) ≜ r(n1 − j1, n2 − j2, n3, t) − r̃(n1 − j1 − d̂1(n), n2 − j2 − d̂2(n), n3, t − 1),
f(k1, k2, k3) ≜ ∇_d r̃(k1, k2, k3, t − 1),
δd(n) ≜ d(n) − d̂(n),
w(n1 − j1, n2 − j2, n3) ≜ w_err(n1 − j1, n2 − j2, n3) + higher-order terms.

Expressing (20) in vector form, we obtain

z(n) = Fᵀ(n) δd(n) + w(n),   (21)

where

z(n) ≜ [z_{−1,−1}, ..., z_{j1,j2}, ..., z_{1,1}]ᵀ,   −1 ≤ j1, j2 ≤ +1,
     = [z(n1 − 1, n2 − 1, n3), ..., z(n1 + 1, n2 + 1, n3)]ᵀ,
F(n) ≜ [F_{−1,−1}, ..., F_{j1,j2}, ..., F_{1,1}]
     = [f(n1 − d̂1(n) − 1, n2 − d̂2(n) − 1, n3), ..., f(n1 − d̂1(n) + 1, n2 − d̂2(n) + 1, n3)],
w(n) ≜ [w_{−1,−1}, ..., w_{j1,j2}, ..., w_{1,1}]ᵀ
     = [w(n1 − 1, n2 − 1, n3), ..., w(n1 + 1, n2 + 1, n3)]ᵀ.

The state equation for the motion estimation and further details are given in [22]. The filtering for such a compound model can be interpreted as a two-step process: at each point, one first estimates the local state of the underlying Markov chain, and then uses the most likely model to estimate the higher-level random vector field. We defer the required model detection to Subsection 6.2.
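The construction of z(n) and F(n) in (20)-(21) can be sketched as follows. Bilinear sampling for r̃ and central-difference gradients are my own choices for this illustration; the paper does not fix them at this level of detail:

```python
import numpy as np

def bilinear(img, y, x):
    """Sample img at a non-integer location (y, x): the r~ of Eq. (12)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0, x0] + (1 - dy) * dx * img[y0, x0 + 1]
            + dy * (1 - dx) * img[y0 + 1, x0] + dy * dx * img[y0 + 1, x0 + 1])

def linearized_observations(r_t, r_prev, n1, n2, d_hat):
    """Build z(n) and F(n) of Eq. (21) over a 3x3 window W."""
    z, F = [], []
    for j2 in range(-1, 2):
        for j1 in range(-1, 2):
            y, x = n2 - j2 - d_hat[1], n1 - j1 - d_hat[0]
            pred = bilinear(r_prev, y, x)
            # central-difference gradient of the displaced previous frame
            g1 = 0.5 * (bilinear(r_prev, y, x + 1) - bilinear(r_prev, y, x - 1))
            g2 = 0.5 * (bilinear(r_prev, y + 1, x) - bilinear(r_prev, y - 1, x))
            z.append(r_t[n2 - j2, n1 - j1] - pred)
            F.append([g1, g2])
    return np.array(z), np.array(F)  # z: (9,), F: (9, 2)
```

When the predicted displacement is exact, z(n) vanishes and F(n) carries only the local image gradients.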

With the linearized observation equation, the 3-D Kalman filter is composed of two parts: a prediction part and an update part. The prediction part is straightforward. By a simple matrix manipulation, ref. (5.34) of [22], the update calculation for the state vector can be simplified to

X̂_a = X̂_b + P_b Hᵀ P11^{−1}
  [ Σ ω(j1,j2) F_{j1,j2,1} F_{j1,j2,1} + σ²_w / P11,1    Σ ω(j1,j2) F_{j1,j2,1} F_{j1,j2,2}
    Σ ω(j1,j2) F_{j1,j2,2} F_{j1,j2,1}                   Σ ω(j1,j2) F_{j1,j2,2} F_{j1,j2,2} + σ²_w / P11,2 ]^{−1}
  [ Σ ω(j1,j2) F_{j1,j2,1} z_{j1,j2}
    Σ ω(j1,j2) F_{j1,j2,2} z_{j1,j2} ],   (22)

where σ²_w is the error variance of the observation noise w at the center of the window, F_{j1,j2,i} denotes the i-th component of F_{j1,j2}, and P11 = H P_b Hᵀ, with diagonal entries P11,i, i = 1, 2. The update equations become better conditioned as σ²_w / P11,i, i = 1, 2, increases; that is, as P11,i gets smaller, the update becomes better conditioned. Since our proposed 3-D model results in lower prediction error than the 2-D spatial model or the 1-D multiscale model, the proposed 3-D model can be expected to result in lower estimation error as well. Likewise, as the weighting values ω(j1, j2), (j1, j2) ≠ (0, 0), increase, the update becomes better posed. By a similar matrix manipulation, ref. (5.34) of [22], the update calculation for the error covariance is simplified to

P_a(n1, n2, n3) =
  { I − [ Σ ω(j1,j2) F_{j1,j2,1} F_{j1,j2,1} + σ²_w / P11,1    Σ ω(j1,j2) F_{j1,j2,1} F_{j1,j2,2}
          Σ ω(j1,j2) F_{j1,j2,2} F_{j1,j2,1}                   Σ ω(j1,j2) F_{j1,j2,2} F_{j1,j2,2} + σ²_w / P11,2 ]^{−1}
        [ Σ ω(j1,j2) F_{j1,j2,1} F_{j1,j2,1}    Σ ω(j1,j2) F_{j1,j2,1} F_{j1,j2,2}
          Σ ω(j1,j2) F_{j1,j2,2} F_{j1,j2,1}    Σ ω(j1,j2) F_{j1,j2,2} F_{j1,j2,2} ] } P_b(n1, n2, n3).   (23)
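A simplified, plain weighted-least-squares reading of the update (22)-(23) can be sketched as follows; the variable names are mine, H is taken as the identity, and the ROMKF prefactor P_b Hᵀ P11^{−1} is omitted for clarity — the exact ROMKF bookkeeping is given in [22]:

```python
import numpy as np

def ekf_update(d_pred, P_b, z, F, w, sigma2_w):
    """Weighted multi-observation update in the spirit of Eqs. (22)-(23).

    d_pred : (2,)  predicted motion vector
    P_b    : (2,2) predicted error covariance
    z, F   : (N,), (N,2) from the linearized window, Eq. (21)
    w      : (N,)  window weights ω(j1, j2), Eq. (16)
    """
    P11 = np.diag(P_b).copy()                  # P11,i with H = I (assumed)
    Fw = F * w[:, None]                        # weighted observation matrix
    M = Fw.T @ F + np.diag(sigma2_w / P11)     # regularized 2x2 normal matrix
    delta = np.linalg.solve(M, Fw.T @ z)       # correction to the prediction
    P_a = (np.eye(2) - np.linalg.solve(M, Fw.T @ F)) @ P_b
    return d_pred + delta, P_a
```

The σ²_w / P11,i terms regularize the 2×2 solve exactly as discussed above: a confident prediction (small P11,i) keeps the system well conditioned even when the window gradients are nearly one-dimensional.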

These Kalman filtering equations are derived under the assumption that the most likely model has been chosen, i.e., the decision-directed approach [29].

B. Model Detection

At each pixel, a model detection algorithm uses the observations in the small window W to estimate the probability that the acting model is l_k. Using Bayes' rule, we obtain

P_a[L_k(n)] ≜ P[ L_k(n) | r(n1 − j1, n2 − j2, n3, t), ∀(j1, j2) ∈ W; r(n3, t−1); D11(n−1) ]
  = Π_{j1,j2} p[ r(n1 − j1, n2 − j2, n3, t) | r(n3, t−1), D11(n−1), L_k(n) ] P_b[L_k(n)]
    / Σ_q Π_{j1,j2} p[ r(n1 − j1, n2 − j2, n3, t) | r(n3, t−1), D11(n−1), L_q(n) ] P_b[L_q(n)],   (24)

where

P_b[L_k(n)] ≜ Σ_q P[ l_k | l_q(n − 1) ] P_a[L_q(n − 1)],
r(n3, t−1) ≜ { r(1, 1, n3, t−1), ..., r(N1, 1, n3, t−1), ..., r(1, N2, n3, t−1), ..., r(N1, N2, n3, t−1) },
L_k(n) ≜ { l(n) = k; L(n−1) },   k = 1, ..., 6.

From (14) and (20), the conditional probability of r(n1 − j1, n2 − j2, n3, t) can be expressed as

p[ r(n1 − j1, n2 − j2, n3, t) | r(n3, t−1), D11(n−1), L_k(n) ]
  = (2π σ²_k)^{−1/2} exp( −z²_k(n1 − j1, n2 − j2, n3) / 2σ²_k ),   (25)

where

d̂(n) ≜ E[ d(n) | D11(n−1), L_k(n) ],
σ²_k ≜ fᵀ(n1 − j1, n2 − j2, n3) P_{d̂} f(n1 − j1, n2 − j2, n3) + σ²_w(n1 − j1, n2 − j2, n3),

and z_k is the displaced frame difference of (20) computed with the model-k prediction d̂(n). Since calculating even an approximate error variance σ²_w is very complicated, we make use of the predetermined weighting values ω(j1, j2). The switching logic then simplifies as follows: select model l(n1, n2, n3) = k if

Σ_{j1,j2} ω(j1, j2) z²_k(n1 − j1, n2 − j2, n3) / 2σ²_k(n1, n2, n3) + c_k − log P[ L_k(n1, n2, n3) | L(n1 − 1, n2, n3) ]
  ≤ Σ_{j1,j2} ω(j1, j2) z²_q(n1 − j1, n2 − j2, n3) / 2σ²_q(n1, n2, n3) + c_q − log P[ L_q(n1, n2, n3) | L(n1 − 1, n2, n3) ],   (26)

for all q = 1, ..., 6, where

c_q = (1/2) log σ²_q(n1, n2, n3).
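The decision rule (26) can be sketched as follows; the six elementary models are abstracted into per-model squared DFDs, variances, and transition priors, with names and data layout that are mine:

```python
import numpy as np

def select_model(z_sq, sigma2, prior, weights):
    """Pick the elementary model minimizing the criterion of Eq. (26).

    z_sq   : (6, N) squared DFDs z_q^2 over the window, one row per model
    sigma2 : (6,)   model error variances σ_q^2
    prior  : (6,)   transition probabilities P[L_q(n) | L(n1-1, n2, n3)]
    weights: (N,)   window weights ω(j1, j2)
    """
    cost = (z_sq @ weights) / (2.0 * sigma2) \
           + 0.5 * np.log(sigma2) - np.log(prior)
    return int(np.argmin(cost))
```

The 0.5 log σ²_q term is the c_q of (26): a model with a large error variance pays a fixed penalty, so it wins only when it fits the window observations much better.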

In order to examine the behavior of the proposed 3-D motion vector eld model,we estimated motion vector elds for a set of synthetic and real image sequences. We compared the proposed 3-D compound model to the 1-D coarse-to- ne scale model and to the 2-D spatial compound model. The 1-D coarse-to- ne scale model was proposed by Simoncelli 19]. The 2-D spatial compound model was proposed by Brailean and Katsaggelos 13]. Since our purpose is to compare our proposed 3-D multi-model to these earlier models, we did not preprocess the test image sequences in obtaining local gradients. Thus we applied these three models to motion estimation under the same test conditions. In the case where the motion vector eld is known quantitatively, we can analyze the errors in our estimates. The measure is the mean squared error between the correct and estimated motion vectors. In the case where the motion vector eld is not known quantitatively, we used the estimated motion vector eld to calculate a motion-compensated prediction. The MSE between frame t and its motion-compensated prediction is then used as the performance measure. Since these two mean square errors do not show the local behavior of the estimate, we illustrated the estimation results by a vector diagram. To make the vector diagram clear, we scaled down (or scaled up) the magnitude of each motion vector. In the case of the synthetic image sequence, each vector represents a single motion vector per pixel. In the case of real image sequences, each vector represents the average value of motion vectors within a 8 8 local window. The uncertainty w in (21) includes the uncertainty due to measurement noise wr and other terms such as the uncertainty due to linearization of the nonlinear observation equation and uncertainty due to interpolation of r as indicated ~ earlier. 
The uncertainties due to linearization and interpolation depend on the local characteristics of the image and the motion vector field, and hence determining the error variance of w(n1, n2, n3) is very complicated. In this experiment, we treated the observation uncertainty w as Gaussian distributed. When we use multiple observations in a small window for estimating motion, we weight the observation at each pixel differently using (17). In predetermining these values, the weighting values should be large enough to make (23) well-posed, but small enough to keep the estimate accurate. At lower scale levels, the weighting values ω(j1, j2), (j1, j2) ≠ (0, 0), should be larger: since each motion vector is estimated again at the finer resolution level, we need robust estimation at the coarser scale levels. To decide the appropriate window size and weighting values, we tested various 3×3 and 5×5 weighted windows. The set of weighting values experimentally chosen is [0.15 0.19 0.15; 0.19 1.0 0.19; 0.15 0.19 0.15]. When the observation noise is very strong, a 5×5 window with large weighting values results in better estimation [22]. In real image sequences, the measurement noise variance can be estimated by calculating the mean square error between two consecutive frames in the still background, or in a region where the image intensity does not change.

A. Synthetic Image Sequence

We generated a synthetic test sequence for studying the behavior of the proposed algorithm. The input image SNR is 30 dB. Since we are interested in the behavior of the proposed algorithm in the neighborhood of a point of motion discontinuity, the test sequence consists of a moving circle and a moving background. The frame size is 64×64, and the diameter of the circle is 32. A frame from the sequence is shown in Fig. 4a. The circle moved two pixels to the left, and the background moved one pixel downwards per frame. The vector diagram of the actual motion is shown in Fig. 4b.
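For reference, the ground-truth motion field just described can be written down directly. In this sketch, the circle's center position and the (dy, dx) sign convention (positive dy is downward, positive dx is rightward) are our assumptions, not details from the paper:

```python
def true_motion_field(size=64, diameter=32):
    """Ground-truth motion for the synthetic test: a circle of the given
    diameter, assumed centered in the frame, moving 2 pixels left per
    frame over a background moving 1 pixel down per frame.

    Returns a 2-D list of (dy, dx) tuples, one per pixel.
    """
    r2 = (diameter / 2.0) ** 2
    cy = cx = (size - 1) / 2.0                        # assumed circle center
    field = [[(1, 0)] * size for _ in range(size)]    # background: 1 pixel down
    for i in range(size):
        for j in range(size):
            if (i - cy) ** 2 + (j - cx) ** 2 <= r2:
                field[i][j] = (0, -2)                 # circle: 2 pixels left
    return field
```

A field like this is what the MSE measures in Tables 1 and 2 compare the estimates against.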
These two random fields, the circle and the background, were generated with different 2-D AR models having 1×1-order NSHP support regions; their parameter sets are given in [22]. We compared our proposed 3-D algorithm with a 2-D spatial Kalman filter and a 1-D (coarse-to-fine) Kalman filter. The weighted 3×3 window gives the best result for our 3-D multi-model Kalman filter, while the weighted 5×5 window gives the best result for the 1-D (coarse-to-fine) and the 2-D multi-model Kalman filters. These results are shown in Tables 1 and 2, and the best result for each algorithm is shown in Fig. 4. With the 1-D (coarse-to-fine) Kalman filter, the estimated motion vectors are blurred around a motion discontinuity; this can be seen in the neighborhood of the circle boundary in Fig. 4c. With the 2-D spatial Kalman filter, the motion vector field was not estimated accurately at those pixels for which the motion vector could not be predicted accurately from the generalized causal support region; this is seen in the upper-left corner of the moving circle in Fig. 4d. The 3-D Kalman filter did not have these shortcomings, as seen in Fig. 4e. In the test sequence, an uncovered region occurred in the right part of the circle, because the circle moved to the left. Since the uncovered region has no corresponding region in the previous frame, the resulting motion vector field for this region is unreliable.

B. Real Image Sequence

Finally, we examined the behavior of our 3-D multi-model estimator on the table tennis video. We used the 29th and 30th frames, with parameter settings chosen from the synthetic sequence example, and estimated the motion vector field of the table tennis sequence. The MSE between the two frames is 496. The mean square errors of the displaced frame differences (DFDs) are shown in Table 3. The estimated motion vector fields and the displaced frame differences are shown in Fig. 5.
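The displaced frame difference (DFD) measure reported above can be sketched in a few lines. In this minimal Python illustration, the integer-valued vectors, the skipping of pixels displaced outside the frame, and the convention that the vector at a pixel points back to its source in the previous frame are our assumptions, not code from the paper:

```python
def mc_prediction_mse(prev, cur, dy, dx):
    """MSE of the motion-compensated prediction (the DFD measure).

    prev, cur : 2-D lists of gray values (frames t-1 and t)
    dy, dx    : 2-D lists of integer motion components, so that
                cur[i][j] is predicted by prev[i + dy[i][j]][j + dx[i][j]]
    """
    rows, cols = len(cur), len(cur[0])
    sse, count = 0.0, 0
    for i in range(rows):
        for j in range(cols):
            ii, jj = i + dy[i][j], j + dx[i][j]
            if 0 <= ii < rows and 0 <= jj < cols:   # skip out-of-frame sources
                d = cur[i][j] - prev[ii][jj]        # displaced frame difference
                sse += d * d
                count += 1
    return sse / count
```

A correct motion field drives this measure toward the noise floor; a wrong one leaves it near the plain frame-difference MSE.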

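As an aside on the windowed observations used throughout these experiments, the effect of weighting each observation in a 3×3 window can be illustrated outside the Kalman framework by a weighted least-squares solve of the gradient constraint. The weights below are the experimentally chosen set quoted earlier; the plain least-squares formulation (rather than the extended Kalman filter update actually used in the paper) is our simplification:

```python
# Experimentally chosen 3x3 observation weights (center pixel weighted 1.0).
W = [[0.15, 0.19, 0.15],
     [0.19, 1.00, 0.19],
     [0.15, 0.19, 0.15]]

def windowed_flow(gx, gy, gt, i, j):
    """Weighted least-squares motion estimate at pixel (i, j) from the
    gradient constraint gx*vx + gy*vy + gt = 0 over a 3x3 window.

    Solves (sum w g g^T) v = -(sum w g gt) for v = (vx, vy); the multiple
    weighted observations make the 2x2 system well-posed where a single
    observation would not determine both components.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            w = W[di + 1][dj + 1]
            x = gx[i + di][j + dj]
            y = gy[i + di][j + dj]
            t = gt[i + di][j + dj]
            a11 += w * x * x
            a12 += w * x * y
            a22 += w * y * y
            b1 -= w * x * t
            b2 -= w * y * t
    det = a11 * a22 - a12 * a12   # near-zero det: locally ill-posed
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)
```

When the nine constraints are consistent with a single translation, the weighted solve recovers it exactly regardless of the weights; the weights matter when the observations disagree, which is exactly the situation near a motion discontinuity.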

VIII. CONCLUSION

We proposed a 3-D compound AR model for motion vector fields. To apply the model to motion estimation, we used a 3-D extended Kalman filter of the ROMKF variety. We experimentally demonstrated that the 3-D compound model can perform significantly better than the 1-D (coarse-to-fine) and 2-D compound models proposed earlier. Since single observations can be sensitive to local image characteristics in the presence of noise, especially where an inappropriate local model is chosen, we employed multiple, windowed observations at each pixel. Since the uncertainty of each observation is different, we used predetermined weighting values for each relative position. We experimentally found that our proposed 3-D model using a weighted 3×3 window was best at handling motion discontinuity while simultaneously smoothing the motion vector field.
References
[1] I. Sezan and R. L. Lagendijk, eds., Motion Analysis and Image Sequence Processing. Kluwer Academic Publishers, 1993.
[2] J. Marroquin, S. Mitter, and T. Poggio, "Probabilistic Solution of Ill-posed Problems in Computational Vision," Journal of the American Statistical Association, Theory and Methods, pp. 76-89, Mar. 1987.
[3] M. Bertero, T. A. Poggio, and V. Torre, "Ill-posed Problems in Early Vision," Proceedings of the IEEE, vol. 76, pp. 869-889, Aug. 1988.
[4] B. K. P. Horn and B. G. Schunck, "Determining Optical Flow," Artificial Intelligence, vol. 17, pp. 185-203, 1981.
[5] E. C. Hildreth, "Computations Underlying the Measurement of Visual Motion," Artificial Intelligence, vol. 23, pp. 309-354, 1984.
[6] H. H. Nagel and W. Enkelmann, "An Investigation of Smoothness Constraints for the Estimation of Displacement Vector Fields from Image Sequences," IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 565-593, Sept. 1986.
[7] H. H. Nagel, "On the Estimation of Optical Flow: Relations Between Different Approaches and Some New Results," Artificial Intelligence, vol. 33, pp. 299-324, 1987.
[8] S. V. Fogel, "The Estimation of Velocity Vector Fields from Time-Varying Image Sequences," Computer Vision, Graphics, and Image Processing, vol. 53, pp. 253-287, May 1991.
[9] A. L. Yuille and N. M. Grzywacz, "A Mathematical Analysis of the Motion Coherence Theory," International Journal of Computer Vision, vol. 3, pp. 155-175, 1989.
[10] J. A. Stuller and G. Krishnamurthy, "Kalman Filter Formulation of Low-Level Television Image Motion Estimation," Computer Vision, Graphics, and Image Processing, vol. 21, pp. 169-204, 1983.
[11] J. N. Driessen and J. Biemond, "Reduced Resolution Motion Field Estimation by 2-D Kalman Filtering," Eleventh Symposium on Information Theory in the Benelux, J. van der Lubbe, ed., 1990.
[12] S. N. Efstratiadis and A. K. Katsaggelos, "Nonstationary AR Modeling and Constrained Recursive Estimation of the Displacement Field," IEEE Trans. Circuits and Systems for Video Technology, Dec. 1992.
[13] J. C. Brailean and A. K. Katsaggelos, "Displacement Field Estimation Using a Coupled Gauss-Markov Model," Proc. SPIE Conf. Imaging Technologies and Applications, vol. 1778, pp. 170-181, Chicago, 1992.
[14] J. Konrad and E. Dubois, "Bayesian Estimation of Motion Vector Fields," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 14, pp. 910-927, Sept. 1992.
[15] J. N. Driessen, Motion Estimation for Digital Video. PhD thesis, Delft University of Technology, Delft, The Netherlands, 1992.
[16] N. M. Namazi and J. I. Lipp, "Nonuniform Image Motion Estimation in Reduced Coefficient Transformed-Domain," IEEE Trans. Image Processing, vol. 2, pp. 236-246, Apr. 1993.
[17] N. M. Namazi, P. Penafiel, and C. M. Fan, "Nonuniform Image Motion Estimation Using Kalman Filtering," IEEE Trans. Image Processing, Special Issue on Image Sequence Compression, vol. 3, pp. 678-683, Sept. 1994.
[18] J. N. Driessen and J. Biemond, "Motion Field Estimation by 2-D Kalman Filtering," Signal Processing, pp. 975-978, 1990.
[19] E. P. Simoncelli, Distributed Representation and Analysis of Visual Motion. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1993.
[20] M. R. Luettgen, W. C. Karl, A. S. Willsky, and R. R. Tenney, "Multiscale Representations of Markov Random Fields," IEEE Trans. Signal Processing, vol. 41, pp. 3377-3396, Dec. 1993.
[21] D. L. Angwin and H. Kaufman, "Image Restoration Using Reduced Order Models," Signal Processing, vol. 16, pp. 21-28, Jan. 1988.
[22] J. Kim, 3-D Kalman Filter for Video and Motion Estimation. PhD thesis, Rensselaer Polytechnic Institute, Troy, NY, 1994.
[23] M. Basseville, A. Benveniste, and A. S. Willsky, "Multiscale Autoregressive Processes, Part I: Schur-Levinson Parametrizations," IEEE Trans. Signal Processing, vol. 40, pp. 1915-1934, Aug. 1992.
[24] A. K. Jain, Fundamentals of Digital Image Processing. Prentice-Hall, Englewood Cliffs, NJ, 1989.
[25] A. K. Jain, "Advances in Mathematical Models for Image Processing," Proceedings of the IEEE, vol. 69, pp. 502-528, 1981.
[26] J. W. Woods and V. K. Ingle, "Kalman Filtering in Two Dimensions: Further Results," IEEE Trans. Acoust., Speech, Signal Process., vol. 29, pp. 188-197, 1981.
[27] J. Kim and J. W. Woods, "A New Interpretation of ROMKF," IEEE Trans. Image Processing, to be published.
[28] A. P. Sage and J. L. Melsa, Estimation Theory with Applications to Communications and Control. McGraw-Hill, New York, 1971.
[29] J. W. Woods, S. Dravida, and R. Mediavilla, "Image Estimation Using Doubly Stochastic Gaussian Random Field Models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. PAMI-9, pp. 245-253, Mar. 1987.
[30] J. Kim and J. W. Woods, "Spatiotemporal Adaptive 3-D Kalman Filter for Estimating Video," IEEE Trans. Image Processing, to be published.


Table 1. Mean square error (MSE) of estimated motion vector fields, synthetic sequence.

                             Whole frame            At motion discontinuity
  Motion field model       1×1    3×3    5×5        1×1    3×3    5×5
  3-D multi-model          0.16   0.061  0.064      0.71   0.76   0.96
  2-D (space) multi-model  0.26   0.112  0.098      0.81   1.06   1.09
  1-D (scale)              none   0.160  0.127      none   1.52   1.37

Table 2. Mean square error (MSE) of motion-compensated frame differences, synthetic sequence.

                             Whole frame            At motion discontinuity
  Motion field model       1×1    3×3    5×5        1×1    3×3    5×5
  3-D multi-model          1.56   1.13   1.42       5.7    4.94   8.15
  2-D (space) multi-model  1.71   1.28   1.78       8.4    7.61   11.43
  1-D (scale)              none   4.26   2.98       none   20.62  22.84

Table 3. Mean square error (MSE) of motion-compensated frame differences, table tennis sequence.

  Motion field model                  3×3     5×5
  3-D (space and scale) multi-model   24.7    27.9
  2-D (space only) multi-model        43.0    41.9
  1-D (scale only)                    104.2   76.0

Fig. 1. The 3-D generalized causal support region, using a 1st-order Markov model for the coarse-to-fine multiscale field. The diagram spans scales n3 − 2, n3 − 1, and n3, and marks the causal support region and the pixel currently being processed.

Fig. 2. The nearest-pixel support, showing the four child phases. The diagram spans scales n3 − 1 and n3 and marks, for each of phases 1-4, the pixel currently being processed and the four pixels where the child displacement vectors are located.


Fig. 3. Selection of the motion vector at the previous (coarser) scale. For each of the four phases, the diagram marks the pixel currently being processed and the location of the displacement vector from the previous scale that is used to predict the displacement vector at that pixel.

Fig. 4. Vector diagrams of motion vectors: (a) a frame from the synthetic sequence; (b) the true motion; (c) 1-D coarse-to-fine model; (d) 2-D spatial multi-model; (e) 3-D multiscale multi-model.

Fig. 5. Vector diagrams of motion vectors and displaced frame differences: (a) a frame from the table tennis sequence and a frame difference; (b) 1-D coarse-to-fine model; (c) 2-D spatial multi-model; (d) 3-D multiscale multi-model.
