
Extraction of High-Resolution Frames from Video Sequences

Richard R. Schultz, Member, IEEE, and Robert L. Stevenson, Member, IEEE


Abstract: The human visual system appears to be capable of temporally integrating information in a video sequence in such a way that the perceived spatial resolution of a sequence appears much higher than the spatial resolution of an individual frame. While the mechanisms in the human visual system which do this are unknown, the effect is not too surprising given that temporally adjacent frames in a video sequence contain slightly different, but unique, information. This paper addresses how to utilize both the spatial and temporal information present in a short image sequence to create a single high-resolution video frame. A novel observation model based on motion compensated subsampling is proposed for a video sequence. Since the reconstruction problem is ill-posed, Bayesian restoration with a discontinuity-preserving prior image model is used to extract a high-resolution video still given a short low-resolution sequence. Estimates computed from a low-resolution image sequence containing a subpixel camera pan show dramatic visual and quantitative improvements over bilinear, cubic B-spline, and Bayesian single frame interpolations. Visual and quantitative improvements are also shown for an image sequence containing objects moving with independent trajectories. Finally, the video frame extraction algorithm is used for the motion compensated scan conversion of interlaced video data, with a visual comparison to the resolution enhancement obtained from progressively-scanned frames.

Keywords: Discontinuities, image enhancement, image sequence processing, interpolation, MAP estimation, scan conversion, stochastic image models, video sequence processing.

Accepted to appear in IEEE Transactions on Image Processing, Special Issue on Nonlinear Image Processing. This research was presented in part at the IEEE International Conference on Acoustics, Speech, and Signal Processing (Detroit, MI), May 8-12, 1995. Effort sponsored by Rome Laboratory, Air Force Materiel Command, USAF under grant number F30602-94-1-0017. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Rome Laboratory or the U.S. Government. Please address all correspondence to: Robert L. Stevenson, Department of Electrical Engineering, University of Notre Dame, Notre Dame, IN 46556, Phone: (219) 631-8308, Fax: (219) 631-4393, E-Mail: Stevenson.1@nd.edu, WWW: http://lisa.ee.nd.edu/rls/.

I. Introduction

Image interpolation involves the selection of data values between known pixel constraints. Image processing applications of interpolation include region-of-interest image magnification, subpixel image registration, and image decompression, among others. Single frame interpolation techniques [1]-[9] have been researched quite extensively, with the nearest-neighbor, bilinear, and various cubic spline interpolation methods providing progressively more accurate solutions. However, all of these methods are inherently limited by the number of constraints available within the data. For this reason, multiframe methods have been proposed [10]-[17] which use the additional data present within a sequence of temporally-correlated frames to improve resolution.

The sinc basis function provides a perfect reconstruction of a continuous function, provided that the data was obtained by uniform sampling at or above the Nyquist rate [3]. However, sinc interpolation does not give good results within an image processing environment, since image data is generally acquired at a much lower sampling rate. Even polynomial methods such as Lagrange interpolation do not perform satisfactorily for image data, since globally-defined polynomials do not model local image properties well. Rather, researchers have investigated piece-wise polynomial approaches to the interpolation problem. In the simplest method of image magnification, a zero-order hold of the low-resolution data is used to compute a high-resolution data set with a blocky appearance. This corresponds to a nearest-neighbor interpolation kernel [6]. Bilinear interpolation [18] and cubic spline interpolation [2], [3], [5], [6], [8], [9] have also received much attention. Hou and Andrews [3] originally applied the cubic B-spline basis function to the image interpolation problem. Keys [5] introduced a basis function similar to a windowed sinc function, termed the cubic convolution interpolation kernel. Through an analysis by Parker et al. [6], it was shown that frequency domain properties of the cubic convolution kernel correspond more closely to an ideal low-pass filter than the cubic B-spline. T. C. Chen and deFigueiredo [2] showed the correspondence between spline filters and partial differential equation (PDE) image models. They derived a spline filter pertaining to a noncausal image model with a seven sample region of support, also with the shape of a windowed sinc function. This kernel consistently produced interpolations with lower mean square error values than the cubic B-spline.

Similar to image restoration and reconstruction, image interpolation is an ill-posed inverse problem, since too few data points exist in an image frame to properly constrain the solution. Intuitively, the mapping between the unknown high-resolution image and the low-resolution observations is not invertible, and thus a unique solution to the inverse problem cannot be computed. Regularization techniques include prior knowledge about the data in order to compute an approximate solution [1], [4], [7], [19], [20]. A Tikhonov regularization approach to image interpolation was proposed by Karayiannis and Venetsanopoulos [4], in which a quadratic stabilizing functional added to a fidelity term for the constraints was defined. The resulting unconstrained optimization problem allowed for some noise within the data, since the minimal solution was not required to meet the constraint values exactly. A similar problem formulation was suggested by G. Chen and deFigueiredo [1], although as a constrained optimization problem. In this case, the solution was constrained to pass directly through the given image pixels.

Previously proposed methods, from bilinear and spline interpolation to quadratic functional minimization, result in smooth solutions to the image interpolation problem. Although smoothness in one-dimensional function interpolation is often acceptable, the human visual system is acutely aware of discontinuities within images. To better preserve edges, a cubic "spline-under-tension" kernel has been developed [9], with interpolation kernel weights adjusted adaptively according to edge information in a local neighborhood. Another approach is a stochastic regularization technique using a discontinuity-preserving prior model for the image data [7], resulting in a convex, although nonquadratic, constrained optimization problem. An unconstrained optimization was also derived in this research, to account for additive noise corrupting the data. Bayesian estimates computed by this method contained preserved edges, which is an advantage over the regularization methods characterized by quadratic stabilizing functionals.

Although the field of single frame image interpolation is far from mature, the quality of an estimate generated by any method is inherently limited by the amount of data available in the frame. To achieve significant improvements in this area, the next step requires the investigation of multiframe data sets, in which additional data constraints from sequential frames can be used.
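The single frame kernels surveyed in this introduction can be compared on a one-dimensional signal. The following is a minimal sketch, not code from the cited works: the function names are hypothetical, the cubic convolution kernel uses the commonly quoted parameter a = -0.5 (an assumption, since the text does not state a value), and samples beyond the signal border are replicated (also an assumption).

```python
import numpy as np

def nearest_kernel(x):
    """Zero-order hold (nearest-neighbor) kernel with support [-0.5, 0.5)."""
    x = np.asarray(x, dtype=float)
    return ((x >= -0.5) & (x < 0.5)).astype(float)

def linear_kernel(x):
    """Triangle kernel; applied separably in 2-D it gives bilinear interpolation."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.clip(1.0 - x, 0.0, None)

def keys_kernel(x, a=-0.5):
    """Cubic convolution kernel; a = -0.5 is an assumed parameter value."""
    x = np.abs(np.asarray(x, dtype=float))
    near = (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0          # |x| <= 1
    far = a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a   # 1 < |x| < 2
    return np.where(x <= 1.0, near, np.where(x < 2.0, far, 0.0))

def interpolate_1d(samples, q, kernel, support):
    """Magnify a 1-D signal by the integer factor q using kernel-weighted sums.

    `support` is the kernel half-width in samples (1 for nearest/linear, 2 for cubic).
    Indices outside the signal are clamped to the border (replication)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    t = np.arange(n * q) / q              # output positions on the input grid
    base = np.floor(t).astype(int)
    out = np.zeros(n * q)
    for offset in range(-support + 1, support + 1):
        idx = base + offset
        out += kernel(t - idx) * samples[np.clip(idx, 0, n - 1)]
    return out

ramp = np.arange(8.0)
blocky = interpolate_1d(ramp, 4, nearest_kernel, 1)   # staircase output
smooth = interpolate_1d(ramp, 4, keys_kernel, 2)      # reproduces the ramp away from borders
```

All three kernels pass through the original samples; they differ in how the values in between are filled, which is where the blocky or smooth appearance discussed above comes from.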
Multiframe image restoration was introduced by Tsai and Huang [17]. Their motivation came from generating a high-resolution frame from misregistered pictures obtained by a Landsat satellite. A frequency domain observation model was defined for this problem which considered only globally shifted versions of the same scene. Provided that enough frames are available with different subpixel shifts, the observation mapping becomes invertible and a unique solution may be computed. If this is not the case, a least squares approximation is computed through a pseudoinverse of the constraint mapping matrix. An extension of this algorithm for noisy data was provided by Kim et al. [13], resulting in a weighted least squares algorithm for computing the high-resolution estimate. Another approach to this problem involves mapping several low-resolution images onto a single high-resolution image plane, and then interpolating between the non-uniformly spaced samples. This requires knowledge of the exact image displacements, which happen to be available in images acquired with controlled subpixel camera displacement [12] and images captured by stereo cameras. Stark and Oskoui [15] formulated a projection onto convex sets (POCS) algorithm to compute an estimate from observations obtained by scanning

or rotating an image with respect to the CCD image acquisition sensor array. Tekalp et al. [16] then extended this POCS formulation to include sensor noise. Patti joined with Tekalp to account for time-varying motion blur within video sequences [14], and to accommodate interlaced frames and other video sampling patterns [21]. In the approach most related to this research, Cheeseman et al. [10] applied Bayesian estimation with a Gaussian prior model to the problem of integrating multiple satellite frames observed by the Viking Orbiter. However, all of their examples used a large number of frames which were slightly misregistered. This is a poor assumption if the multiframe algorithm is to be applied to an arbitrary video sequence.

Existing multiframe methods suffer from several impractical assumptions. Previous research deals with some type of global displacement or rotation occurring between frames. This is rather impractical if a multiframe technique is to be applied to a video sequence containing objects with independent motion trajectories. In real-life electronic imaging applications, the motion occurring between frames is not known exactly, since precise control over the data acquisition process is rarely available. Thus, motion estimates must be computed to determine pixel displacements between frames. Obviously, the quality of these motion estimates will have a direct effect on the quality of the enhancement algorithm. As stated previously, if enough frames with the correct subpixel displacements are available, then the image interpolation problem is no longer ill-posed. In other words, a unique solution is obtained by the direct inversion of the constraint mapping matrix. Generally these frames are not available, and therefore a suboptimal solution must be computed. Existing multiframe methods use either least squares techniques or POCS algorithms with quadratic constraint sets to regularize the interpolation problem.
These techniques result in smooth estimates of the high-resolution frame containing blurred edges.

In this paper, the problem of enhancing the definition of a single video frame using both the spatial and temporal information available in a video sequence is addressed. The multiframe interpolation problem is placed into a Bayesian framework, featuring a novel observation model for video sequence data. The resulting algorithm incorporates several ideas which enhance both the usability and the quality of the estimated image frame:

• Independent object motion in the video sequence will be assumed, rather than the simple cases of global displacement or rotation.
• Block matching methods [22], [23] will be used to estimate the subpixel displacement vectors between frames.
• Modeling error due primarily to inaccurate motion vector estimates will be represented by a probability density function included in the Bayesian estimation algorithm.
• Bayesian maximum a posteriori (MAP) estimation will be used to regularize the ill-posed interpolation problem. This stochastic regularization technique requires a density for the data known as a prior image model, representing the underlying distribution of the image pixel values.
• An edge-preserving image prior will be assumed for the data, resulting in an estimate of the high-resolution image containing distinct edges. This has the intent of improving upon least squares solutions and POCS solutions with quadratic constraint sets.

The multiframe algorithm proposed in this paper reduces to the Bayesian interpolation method presented previously [7] when only a single video frame is available. Increasing the number of low-resolution frames used as constraints should improve interpolation results, up to a practical limit. Applications of this research include Improved Definition Television (IDTV), video hardcopy and display, and preprocessing for image/video analysis, among others.

This paper will be organized as follows. Section II proposes an observation model for several frames from a low-resolution video sequence, including a subsampling model for the center frame of a particular sequence, and a motion compensated subsampling model for other frames within the multiframe model. In order to construct the motion compensated subsampling model, subpixel motion vectors must be computed from the low-resolution video frames. A hierarchical block matching technique is presented to estimate the displacement vector fields. The video frame extraction algorithm is formulated within a Bayesian framework in Section III, including a discontinuity-preserving prior model for the data and a probability density function for the motion error. Visual and quantitative simulation results are described in Section IV for a synthetically-generated sequence containing motion modeled with a camera pan and a video sequence containing independent object motion. A visual comparison of multiframe estimates computed from progressively-scanned and interlaced frames is also shown for an actual video sequence. Section V provides a brief summary, along with future research issues to be explored.
II. Video Sequence Observation Model

A novel observation model is proposed for a low-resolution video sequence. A subsampling matrix is defined to map the high-resolution data pixels into a low-resolution frame via spatial averaging. Motion compensated subsampling matrices incorporate object motion between frames in the model, with error in the estimated motion vectors assumed to be Gaussian-distributed. A hierarchical subpixel motion estimator is described to compute the displacement vectors required in constructing the motion compensated subsampling matrices.

A. Problem Statement

Assume that each frame in a low-resolution video sequence contains $N_1 \times N_2$ square pixels. Let $y^{(l)}_{i,j}$ represent the $(i,j)$th pixel of frame $l$, or equivalently the $(iN_2 + j - N_2)$th element of $y^{(l)}$. Consider a short low-resolution video sequence

$$\left\{ y^{(l)} \right\} \quad \text{for } l = k - \tfrac{M-1}{2}, \ldots, k, \ldots, k + \tfrac{M-1}{2}, \tag{1}$$

where $M$ represents an odd number of frames. A single high-resolution frame $z^{(k)}$ coincident with the center frame $y^{(k)}$ is to be estimated from the low-resolution sequence. This unknown high-resolution data consists of $qN_1 \times qN_2$ square pixels, where $q$ is an integer-valued interpolation factor in both the horizontal and vertical directions.

B. Subsampling Model for Center Frame

Subsampling for the center frame is accomplished by averaging a square block of high-resolution pixels,

$$y^{(k)}_{i,j} = \frac{1}{q^2} \left( \sum_{r=qi-q+1}^{qi} \; \sum_{s=qj-q+1}^{qj} z^{(k)}_{r,s} \right), \tag{2}$$

for $i = 1, \ldots, N_1$ and $j = 1, \ldots, N_2$. This models the spatial integration of light intensity over a square surface region performed by CCD image acquisition sensors [7]. The subsampling model for the center frame can be expressed in matrix notation as

$$y^{(k)} = A^{(k,k)} z^{(k)}, \tag{3}$$

where $A^{(k,k)} \in \mathbb{R}^{N_1 N_2 \times q^2 N_1 N_2}$ will be referred to as the subsampling matrix. Each row of $A^{(k,k)}$ maps a square block of $q \times q$ high-resolution samples into a single low-resolution pixel. If the video sequence is interlaced, this matrix maps only the even- or odd-numbered scan lines from the high-resolution data into the low-resolution frame.

C. Motion Compensated Subsampling Model

The second part of the observation model is defined to account for motion occurring within the sequence. The idea is to extract knowledge about the high-resolution center frame, $z^{(k)}$, from temporally neighboring low-resolution frames $y^{(l)}$, for $l \neq k$. This will be modeled as

$$y^{(l)} = A^{(l,k)} z^{(k)} + u^{(l,k)}, \tag{4}$$

for $l = k - \tfrac{M-1}{2}, \ldots, k-1$ and $l = k+1, \ldots, k + \tfrac{M-1}{2}$. In this expression, $A^{(l,k)}$ is the motion compensated subsampling matrix which models the subsampling of the high-resolution frame and accounts for object motion occurring between frames $y^{(l)}$ and $y^{(k)}$. For pixels in $z^{(k)}$ which are not observable in $y^{(l)}$, $A^{(l,k)}$ contains a column of zeros. Object motion will also cause pixels to be present in $y^{(l)}$ which are not in $z^{(k)}$. The vector $u^{(l,k)}$ accommodates these pixels with nonzero elements. Since $u^{(l,k)}$ is unknown, it is obviously difficult to utilize these nonzero rows. Rows of $A^{(l,k)}$ containing useful information are those for which elements of $y^{(l)}$ are observed entirely from motion compensated elements of $z^{(k)}$. Write these useful rows as the reduced set of equations

$$y'^{(l)} = A'^{(l,k)} z^{(k)}. \tag{5}$$

In practice, the motion compensated subsampling matrix must be estimated initially from the low-resolution frames; that is, an estimate $\hat{A}'^{(l,k)}$ must be computed from $y^{(l)}$ and $y^{(k)}$. Therefore, the relationship between $y^{(l)}$ and $z^{(k)}$ for $l \neq k$ will be defined as

$$y'^{(l)} = \hat{A}'^{(l,k)} z^{(k)} + n^{(l,k)}, \tag{6}$$

where $n^{(l,k)}$ is an additive noise term representing the error in estimating $\hat{A}'^{(l,k)}$. The additive noise is assumed to be independent and identically distributed (i.i.d.) Gaussian, although this may not represent the most accurate error distribution.

To construct the motion compensated subsampling matrix $A^{(l,k)}$, the exact vectors describing translational motion between frames $y^{(l)}$ and $y^{(k)}$ are required. Assume that intensity is constant along object trajectories in a video sequence, and that translational displacements are sufficient to describe motion over small time periods. This can be expressed by the relation

$$y^{(l)}_{i,j} = y^{(k)}_{i - v_i,\, j - v_j}, \tag{7}$$

where the translational displacement is denoted as $v^{(l,k)}_{i,j} = [\, v_i \;\; v_j \,]^t$. Vertical and horizontal displacements, represented as $v_i$ and $v_j$, respectively, may be fractional values to account for subpixel motion in the low-resolution data. Since a single displacement vector is used to describe the motion of each low-resolution pixel $y^{(l)}_{i,j}$, this vector will be used to describe the motion for all $q \times q$ high-resolution pixels contained within $y^{(l)}_{i,j}$. To construct $A^{(l,k)}$, consider the motion of the $(i,j)$th pixel between frames $l$ and $k$:

$$y^{(l)}_{i,j} = y^{(k)}_{i - v_i,\, j - v_j} = \frac{1}{q^2} \left( \sum_{r=q(i-v_i)-q+1}^{q(i-v_i)} \; \sum_{s=q(j-v_j)-q+1}^{q(j-v_j)} z^{(k)}_{r,s} \right) = \frac{1}{q^2} \left( \sum_{r=qi-q+1}^{qi} \; \sum_{s=qj-q+1}^{qj} z^{(k)}_{r - qv_i,\, s - qv_j} \right). \tag{8}$$

Thus, $A^{(l,k)}$ is similar in form to $A^{(k,k)}$, but with the summation over a shifted set of pixels. In (8), it is assumed that the shift $(qv_i, qv_j)$ is an integer number of pixels. This can be generalized to a fractional shift by adding a weighting term in the summation to compensate for the fractional part of the pixel in the shifted region. Recall that the only rows of $A^{(l,k)}$ containing useful information are those for which elements of $y^{(l)}$ are observed entirely from motion compensated elements of $z^{(k)}$. Pixels which are not completely observable must be detected so that the corresponding rows of $A^{(l,k)}$ can be deleted in the construction of the reduced matrix $A'^{(l,k)}$.

D. Hierarchical Subpixel Motion Estimation

Provided that the low-resolution pixel $y^{(l)}_{i,j}$ is observed in frame $y^{(k)}$, a block of $q \times q$ high-resolution pixels within $z^{(k)}$ is mapped to $y^{(l)}_{i,j}$. Within the video sequence observation model, this corresponds to a row of the center frame subsampling matrix $A^{(k,k)}$ being shifted to form the row with the same index within $A^{(l,k)}$. Explicitly, for row $iN_2 + j - N_2$, the magnitude and direction of this shift is given by the displacement $v^{(l,k)}_{i,j}$. Each motion vector $v^{(l,k)}_{i,j}$ has a resolution of $\tfrac{1}{q}$ pixels in the vertical and horizontal directions, with a single displacement vector describing the motion for all $q \times q$ high-resolution elements contained within $y^{(l)}_{i,j}$. Exact motion vectors are rarely available, so that an estimate of the displacement field $\hat{v}^{(l,k)}$ must be computed from the low-resolution frames $y^{(l)}$ and $y^{(k)}$.

The hierarchical block matching algorithm [22], [24] can be modified to quickly and accurately estimate subpixel displacement vectors [25], [26]. An integer-valued displacement $\hat{v}^{(l,k)}_{i,j}$ is estimated initially for each pixel $y^{(l)}_{i,j}$ using block matching with the mean absolute difference (MAD) criterion [23]. The center frame of the video sequence, $y^{(k)}$, is used as the reference frame in the block matching scheme. Each subsequent level of the hierarchy involves an up-sampling of the low-resolution frames through Bayesian MAP interpolation [7], with subpixel refinements computed for each motion vector. This is repeated in order to estimate the up-sampled vector $\hat{v}^{\uparrow(l,k)}_{i,j}$, with the corresponding subpixel displacement given as $\hat{v}^{(l,k)}_{i,j} = \tfrac{1}{q}\, \hat{v}^{\uparrow(l,k)}_{i,j}$.

An unobservable pel detection (UPD) algorithm is applied to the data at the finest level of the hierarchy to determine which motion vectors will be useful in the construction of $\hat{A}'^{(l,k)}$. Pixels entering frame $y^{(l)}$ from behind an object in $y^{(k)}$ or from the image boundaries are unobservable in the high-resolution frame $z^{(k)}$. These samples are not useful in the video observation model. To detect these pixels, an approach loosely based on the change detection algorithm in [27] will be used. The displaced frame difference,

$$DFD^{(l,k)}_{m,n} = y^{\uparrow(l)}_{m,n} - y^{\uparrow(k)}_{m - q\hat{v}_i,\, n - q\hat{v}_j}, \tag{9}$$

will serve as the criterion for determining whether $y^{\uparrow(l)}_{m,n}$ is observable in the up-sampled frame $y^{\uparrow(k)}$, and hence also in $z^{(k)}$. A large value for $DFD^{(l,k)}_{m,n}$ detects a pixel in $y^{\uparrow(l)}$ which is not present in $y^{\uparrow(k)}$, or a poor displacement vector estimate. In either case, including displacement $\hat{v}^{(l,k)}_{i,j}$ associated with $y^{\uparrow(l)}_{m,n}$ in the motion compensated subsampling model will not provide useful information. If $y^{\uparrow(l)}_{m,n}$ is detected as an unobservable pixel in $z^{(k)}$, then row $iN_2 + j - N_2$ within $\hat{A}^{(l,k)}$ is deleted in the construction of the reduced matrix $\hat{A}'^{(l,k)}$.

Figure 1 depicts the hierarchical subpixel motion estimator for $q = 4$. The motion estimator output, denoted as $\hat{v}'^{(l,k)}$, is the set of all displacements corresponding to pixels in $y^{(l)}$ which are observed entirely from elements in $z^{(k)}$.
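The subsampling model (2)-(3) and the shifted summation (8) can be checked numerically on a small frame. The sketch below is illustrative only: the function names are hypothetical, pixels are stacked in row-major order (an assumption about the vector ordering), and a single global displacement with integer-valued $qv$ is used instead of a per-pixel displacement field.

```python
import numpy as np

def subsample(z, q):
    """Eq. (2): average each q x q block of the high-resolution frame z."""
    n1, n2 = z.shape[0] // q, z.shape[1] // q
    return z[:n1 * q, :n2 * q].reshape(n1, q, n2, q).mean(axis=(1, 3))

def subsampling_matrix(n1, n2, q):
    """Eq. (3): dense A^(k,k) of size (n1*n2) x (q*q*n1*n2), row-major pixel order."""
    A = np.zeros((n1 * n2, q * q * n1 * n2))
    for i in range(n1):
        for j in range(n2):
            block = np.zeros((q * n1, q * n2))
            block[q * i:q * (i + 1), q * j:q * (j + 1)] = 1.0 / q**2
            A[i * n2 + j] = block.ravel()
    return A

def motion_compensated_subsample(z, q, v):
    """Eq. (8) for one global displacement v = (v_i, v_j) in low-resolution pixels,
    assuming q*v is integer-valued. Returns the predicted frame and a boolean mask
    of pixels observed entirely from z (the rows kept in the reduced model)."""
    n1, n2 = z.shape[0] // q, z.shape[1] // q
    di, dj = int(round(q * v[0])), int(round(q * v[1]))
    y = np.zeros((n1, n2))
    observable = np.zeros((n1, n2), dtype=bool)
    for i in range(n1):
        for j in range(n2):
            r0, s0 = q * i - di, q * j - dj   # top-left corner of the shifted block
            inside = (0 <= r0 and r0 + q <= z.shape[0]
                      and 0 <= s0 and s0 + q <= z.shape[1])
            if inside:
                y[i, j] = z[r0:r0 + q, s0:s0 + q].mean()
                observable[i, j] = True
    return y, observable

rng = np.random.default_rng(0)
z = rng.random((8, 12))
A = subsampling_matrix(2, 3, 4)
y_center = A @ z.ravel()                      # same values as subsample(z, 4).ravel()
y_shift, keep = motion_compensated_subsample(z, 4, (0.25, 0.0))
```

Rows whose shifted block leaves the frame are flagged as unobservable, which mirrors the deletion of rows when forming the reduced matrix; the sketch does not implement the DFD test in (9).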

III. Video Frame Extraction Algorithm

The problem of estimating the high-resolution frame $\hat{z}^{(k)}$ given the low-resolution sequence $\{y^{(l)}\}$ is ill-posed in the sense of Hadamard [28], since a number of solutions could satisfy the video sequence observation model constraints. A well-posed problem will be formulated using the stochastic regularization technique of Bayesian MAP estimation, resulting in a constrained optimization problem with a unique minimum. The gradient projection algorithm will be used to compute the estimate, with the projection operator structure described in detail.

A. Bayesian MAP Estimation

The MAP estimate is located at the maximum of the posterior probability $\Pr\left(z^{(k)} \mid \{y^{(l)}\}\right)$, or equivalently at the maximum of the log-likelihood function,

$$\hat{z}^{(k)} = \arg\max_{z^{(k)}} \log \Pr\left( z^{(k)} \,\middle|\, y^{(k - \frac{M-1}{2})}, \ldots, y^{(k)}, \ldots, y^{(k + \frac{M-1}{2})} \right). \tag{10}$$

Applying Bayes' theorem to the conditional probability results in the optimization problem

$$\hat{z}^{(k)} = \arg\max_{z^{(k)}} \left\{ \log \Pr\left(z^{(k)}\right) + \log \Pr\left( y^{(k - \frac{M-1}{2})}, \ldots, y^{(k)}, \ldots, y^{(k + \frac{M-1}{2})} \,\middle|\, z^{(k)} \right) \right\}. \tag{11}$$

Both the prior image model $\Pr\left(z^{(k)}\right)$ and the conditional density $\Pr\left(\{y^{(l)}\} \mid z^{(k)}\right)$ will be defined.

Bayesian estimation distinguishes between possible solutions through the use of a prior image model. Commonly, an assumption of global smoothness is made for the data, which is incorporated into the estimation problem through a Gaussian prior. The objective is to estimate a high-resolution frame by reconstructing the high-frequency components of the image lost through undersampling. By making an assumption of Gaussian-distributed data, edges are statistically unlikely to appear in the MAP estimate. Effectively, high-frequency components are suppressed by the image model, since smooth edges will be more highly probable than sharp discontinuities. A more reasonable prior assumption is that digitized data is piece-wise smooth; i.e., image data consists of smooth regions, with these regions separated by discontinuities. The Huber-Markov random field (HMRF) model [7] is a Gibbs prior which represents piece-wise smooth data, with the probability density defined as

$$\Pr\left(z^{(k)}\right) = \frac{1}{Z} \exp\left\{ -\frac{1}{2\beta} \sum_{c \in C} \rho_\alpha\left( d_c^t z^{(k)} \right) \right\}. \tag{12}$$

In this expression, $Z$ is a normalizing constant known as the partition function, $\beta$ is the "temperature" parameter for the Gibbs prior, and $c$ is a local group of pixels contained within the set of all image cliques $C$. The quantity $d_c^t z^{(k)}$ is a spatial activity measure within the data, with a small value in smooth image locations and a large value at edges. Four spatial activity measures are computed at each pixel in the high-resolution image, given by the following second-order finite differences:

$$d^t_{m,n,1} z^{(k)} = z^{(k)}_{m,n-1} - 2 z^{(k)}_{m,n} + z^{(k)}_{m,n+1} \tag{13}$$
$$d^t_{m,n,2} z^{(k)} = 0.5\, z^{(k)}_{m+1,n-1} - z^{(k)}_{m,n} + 0.5\, z^{(k)}_{m-1,n+1} \tag{14}$$
$$d^t_{m,n,3} z^{(k)} = z^{(k)}_{m-1,n} - 2 z^{(k)}_{m,n} + z^{(k)}_{m+1,n} \tag{15}$$
$$d^t_{m,n,4} z^{(k)} = 0.5\, z^{(k)}_{m-1,n-1} - z^{(k)}_{m,n} + 0.5\, z^{(k)}_{m+1,n+1} \tag{16}$$

These quantities approximate second-order directional derivatives computed at $z^{(k)}_{m,n}$, with directions selected to account for horizontal, vertical, and diagonal edge orientations. The likelihood of edges in the data is controlled by the Huber edge penalty function [7], [20],

$$\rho_\alpha(x) = \begin{cases} x^2, & |x| \le \alpha, \\ 2\alpha|x| - \alpha^2, & |x| > \alpha, \end{cases} \tag{17}$$

where $\alpha$ is a threshold parameter separating the quadratic and linear regions. A quadratic edge penalty,

$$\lim_{\alpha \to \infty} \rho_\alpha(x) = x^2,$$

characterizes the Gauss-Markov image model. Edges are severely penalized by the quadratic function, making discontinuities within the Gaussian image model unlikely. The threshold parameter $\alpha$ controls the size of discontinuities modeled by the prior [20] by providing a less severe edge penalty. A longer-tailed density than the Gaussian results, in which discontinuities are more probable.

The conditional density models the error in estimating displacement vectors used in the construction of the motion compensated subsampling matrices $\hat{A}^{(l,k)}$. The error between frames is assumed to be independent, so that the complete conditional density may be written as

$$\Pr\left( y^{(k - \frac{M-1}{2})}, \ldots, y^{(k)}, \ldots, y^{(k + \frac{M-1}{2})} \,\middle|\, z^{(k)} \right) = \prod_{l = k - \frac{M-1}{2}}^{k + \frac{M-1}{2}} \Pr\left( y^{(l)} \,\middle|\, z^{(k)} \right). \tag{18}$$

Since $A^{(k,k)}$ is known exactly,

$$\Pr\left( y^{(k)} \,\middle|\, z^{(k)} \right) = \begin{cases} 1, & \text{for } y^{(k)} = A^{(k,k)} z^{(k)}, \\ 0, & \text{otherwise.} \end{cases} \tag{19}$$

The error in estimating $\hat{A}'^{(l,k)}$, for $l \neq k$, is assumed to be i.i.d. Gaussian-distributed, with the zero-mean probability density given as

$$\Pr\left( y^{(l)} \,\middle|\, z^{(k)} \right) = \frac{1}{\left(2\pi\sigma^{(l,k)}\right)^{N_1 N_2 / 2}} \exp\left\{ -\frac{1}{2\sigma^{(l,k)}} \left\| y'^{(l)} - \hat{A}'^{(l,k)} z^{(k)} \right\|^2 \right\}, \tag{20}$$

for $l = k - \tfrac{M-1}{2}, \ldots, k-1$ and $l = k+1, \ldots, k + \tfrac{M-1}{2}$. Although the error variance $\sigma^{(l,k)}$ for each frame is unknown, it is assumed to be proportional to the frame index difference $|l - k|$.

To formulate the frame extraction optimization problem, the Huber-Markov prior model and the complete conditional density are incorporated into (11). The MAP estimate of the high-resolution data given the low-resolution sequence becomes

$$\hat{z}^{(k)} = \arg\min_{z^{(k)} \in \mathcal{Z}} \left\{ \sum_{m,n} \sum_{r=1}^{4} \rho_\alpha\left( d^t_{m,n,r} z^{(k)} \right) + \sum_{\substack{l = k - \frac{M-1}{2} \\ l \neq k}}^{k + \frac{M-1}{2}} \lambda^{(l,k)} \left\| y'^{(l)} - \hat{A}'^{(l,k)} z^{(k)} \right\|^2 \right\}, \tag{21}$$

where the set of constraints is defined as

$$\mathcal{Z} = \left\{ z^{(k)} : y^{(k)} = A^{(k,k)} z^{(k)} \right\}. \tag{22}$$

Each frame $y^{(l)}$, for $l \neq k$, has an associated confidence parameter,

$$\lambda^{(l,k)} = \frac{\beta}{\sigma^{(l,k)}},$$

proportional to the confidence in $\hat{A}'^{(l,k)}$. The optimization problem can be expressed more compactly as

$$\hat{z}^{(k)} = \arg\min_{z^{(k)} \in \mathcal{Z}} \left\{ \sum_{m,n} \sum_{r=1}^{4} \rho_\alpha\left( d^t_{m,n,r} z^{(k)} \right) + \left\| \Lambda^{1/2} \left( y' - \hat{A}' z^{(k)} \right) \right\|^2 \right\}, \tag{23}$$

where the block-diagonal matrix of confidence parameters is denoted as

$$\Lambda = \mathrm{diag}\left\{ \lambda^{(k - \frac{M-1}{2},\,k)} I, \ldots, \lambda^{(k-1,k)} I, \lambda^{(k+1,k)} I, \ldots, \lambda^{(k + \frac{M-1}{2},\,k)} I \right\},$$

the stacked observation vector is given as

$$y' = \left[ y'^{(k - \frac{M-1}{2})\,t}, \ldots, y'^{(k-1)\,t}, y'^{(k+1)\,t}, \ldots, y'^{(k + \frac{M-1}{2})\,t} \right]^t,$$

and the estimate of the motion-compensated subsampling matrix is represented as

$$\hat{A}' = \left[ \hat{A}'^{(k - \frac{M-1}{2},\,k)\,t}, \ldots, \hat{A}'^{(k-1,k)\,t}, \hat{A}'^{(k+1,k)\,t}, \ldots, \hat{A}'^{(k + \frac{M-1}{2},\,k)\,t} \right]^t.$$

B. Gradient Projection Algorithm

The objective function

$$f\left( z^{(k)}, \alpha, \Lambda \right) = \sum_{m,n} \sum_{r=1}^{4} \rho_\alpha\left( d^t_{m,n,r} z^{(k)} \right) + \left\| \Lambda^{1/2} \left( y' - \hat{A}' z^{(k)} \right) \right\|^2 \tag{24}$$

is convex, since the convex Huber function is used to penalize edges within the data. Gradient-based methods may be employed to compute $\hat{z}^{(k)}$, since a unique minimum exists.

Gradient optimization techniques converge to a local minimum of the objective function by following the trajectory defined by the negative gradient. A sequence of iterates $\{z^{(k)}_i\}_{i=0}^{K}$ is generated, in which states denoted by increasing iteration numbers more closely approximate the estimate $\hat{z}^{(k)}$. In constrained optimization problems such as (23), all iterates must belong to the constraint set $\mathcal{Z}$. To ensure that the constraints are met, the gradient projection technique [29] has been selected. This method maps the negative gradient of the objective function in (24) onto the constraint space at each iteration through a projection operator. Any starting point $z^{(k)}_0$ which is a member of $\mathcal{Z}$ is valid. A zero-order hold, or pel-wise replication, of the low-resolution data $y^{(k)}$ by a factor of $q$ in both directions satisfies the constraints, and it will be used as the initial condition $z^{(k)}_0 = q^2 A^{(k,k)t} y^{(k)}$. For each iteration, the gradient is computed as $g^{(k)}_i = \nabla f\left( z^{(k)}_i, \alpha, \Lambda \right)$, with the constraint space mapping denoted as $p^{(k)}_i = -P g^{(k)}_i$. The projection operator, $P \in \mathbb{R}^{q^2 N_1 N_2 \times q^2 N_1 N_2}$, is given by

$$P = I - A^{(k,k)t} \left( A^{(k,k)} A^{(k,k)t} \right)^{-1} A^{(k,k)} = I - q^2 A^{(k,k)t} A^{(k,k)}. \tag{25}$$

To minimize $f\left( z^{(k)}, \alpha, \Lambda \right)$, each iteration of the gradient method requires a movement in the descent direction $p^{(k)}_i$ with step size $\varepsilon_i$. By making a second-order Taylor series approximation to the objective function at the current state $z^{(k)}_i$, a quadratic step size approximation becomes

$$\varepsilon_i = \frac{ p^{(k)t}_i\, p^{(k)}_i }{ p^{(k)t}_i\, \nabla^2 f\left( z^{(k)}_i, \alpha, \Lambda \right) p^{(k)}_i }, \tag{26}$$

where $\nabla^2 f\left( z^{(k)}_i, \alpha, \Lambda \right)$ is the Hessian matrix of the objective function. The image is then updated as

$$z^{(k)}_{i+1} = z^{(k)}_i + \varepsilon_i\, p^{(k)}_i, \tag{27}$$

and convergence is achieved once the relative state change for a single iteration has fallen below a predetermined threshold $\epsilon$, such that

$$\frac{ \left\| z^{(k)}_{i+1} - z^{(k)}_i \right\| }{ \left\| z^{(k)}_i \right\| } \le \epsilon. \tag{28}$$

If this criterion is satisfied, the estimate is given as $\hat{z}^{(k)} = z^{(k)}_{i+1}$. The iterative technique is summarized in the Gradient Projection Algorithm listing at the end of this section.
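The gradient projection iteration can be sketched for the single frame case $M = 1$, where the motion term in the objective vanishes and only the Huber-Markov prior and the constraint $y^{(k)} = A^{(k,k)} z^{(k)}$ remain. This is a simplified illustration under stated assumptions, not the authors' implementation: function names are hypothetical, the activity measures are evaluated at interior pixels only, the projection is applied as "subtract the replicated block mean" (which equals $I - q^2 A^{(k,k)t} A^{(k,k)}$ for block averaging), and a guard against a vanishing curvature term is added.

```python
import numpy as np

# (offset, weight) pairs for the four second-order differences (13)-(16)
CLIQUES = [((0, 1), 1.0), ((-1, 1), 0.5), ((1, 0), 1.0), ((1, 1), 0.5)]

def diff(z, off, w):
    """Activity measure at interior pixels only (border cliques skipped: an assumption)."""
    (a, b), (H, W) = off, z.shape
    return w * (z[1 - a:H - 1 - a, 1 - b:W - 1 - b] - 2.0 * z[1:H - 1, 1:W - 1]
                + z[1 + a:H - 1 + a, 1 + b:W - 1 + b])

def diff_adjoint(e, off, w, shape):
    """Transpose of diff: scatter e back onto the three pixels of every clique."""
    (a, b), (H, W) = off, shape
    g = np.zeros(shape)
    g[1 - a:H - 1 - a, 1 - b:W - 1 - b] += w * e
    g[1:H - 1, 1:W - 1] -= 2.0 * w * e
    g[1 + a:H - 1 + a, 1 + b:W - 1 + b] += w * e
    return g

def huber(x, alpha):
    """Huber penalty (17)."""
    ax = np.abs(x)
    return np.where(ax <= alpha, x**2, 2.0 * alpha * ax - alpha**2)

def objective(z, alpha):
    """Prior term of (24); the motion term is absent because M = 1 here."""
    return sum(huber(diff(z, o, w), alpha).sum() for o, w in CLIQUES)

def block_mean(x, q):
    H, W = x.shape
    return x.reshape(H // q, q, W // q, q).mean(axis=(1, 3))

def replicate(y, q):
    """Zero-order hold; equals q^2 A^t y for the block-averaging matrix A."""
    return np.kron(y, np.ones((q, q)))

def gradient_projection(y, q, alpha, tol=1e-6, max_iter=500):
    z = replicate(y, q)                                     # Step 1
    for _ in range(max_iter):
        d = [diff(z, o, w) for o, w in CLIQUES]
        g = sum(diff_adjoint(2.0 * np.clip(di, -alpha, alpha), o, w, z.shape)
                for di, (o, w) in zip(d, CLIQUES))           # Step 2: rho'(x) = 2 clip(x)
        p = -(g - replicate(block_mean(g, q), q))           # Step 3: p = -P g
        curv = sum((2.0 * (np.abs(di) <= alpha) * diff(p, o, w)**2).sum()
                   for di, (o, w) in zip(d, CLIQUES))       # p^t (Hessian) p
        if curv <= 1e-12:                                   # guard, not part of the source
            break
        step = (p * p).sum() / curv                         # Step 4, eq. (26)
        z_new = z + step * p                                # Step 5, eq. (27)
        done = np.linalg.norm(z_new - z) <= tol * np.linalg.norm(z)
        z = z_new
        if done:                                            # Step 6, eq. (28)
            break
    return z

yy, xx = np.mgrid[0:16, 0:16]
z_true = np.sin(xx / 3.0) + np.cos(yy / 4.0)
y_low = block_mean(z_true, 2)
z_hat = gradient_projection(y_low, 2, alpha=1e6)            # large alpha: Gauss-Markov limit
```

Because every search direction has zero block mean, each iterate reproduces the low-resolution frame exactly when subsampled, which is the role of the projection operator in (25). Extending the sketch to $M > 1$ would add the gradient and curvature of the weighted motion term in (24).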

Gradient Projection Algorithm

1. Set iteration number i = 0. Initial condition z(0k) = q2A(k;k) y(k) is the zero-order hold of y(k).
t

2. Compute the gradient of the objective function,

with a zero-order hold of the center frame

3. Project the negative gradient onto the constraint space, $p_i^{(k)} = -P g_i^{(k)}$, using the projection operator defined in (25), where $g_i^{(k)} = \nabla f\left(z_i^{(k)}, \alpha, \lambda\right)$.
4. Compute the step size for iteration $i$ as defined in (26).
5. Update the state: $z_{i+1}^{(k)}$ is $z_i^{(k)}$ plus the step size times $p_i^{(k)}$.
6. If $\|z_{i+1}^{(k)} - z_i^{(k)}\| / \|z_i^{(k)}\|$ falls below the convergence threshold, then $\hat{z}^{(k)} = z_{i+1}^{(k)}$. Otherwise, increment $i$ and return to Step 2.

IV. Simulations

In order to show the resolution improvement achievable with this approach, three low-resolution video sequences were used in the simulations. The first image sequence, Airport, was synthetically generated from a single digital image to model a camera pan. The second data set, Mobile Calendar, is a digitized video sequence composed of several objects moving independently. Both of these sequences were first subsampled and then interpolated back to their original dimensions so that quantitative comparisons could be made using the quadratic signal-to-noise ratio (SNR). An interpretation of the results will be provided in both cases. A third sequence, Dome, is an actual image sequence acquired using a video camera. Progressively-scanned and interlaced versions of this data set were interpolated using the multiframe technique.

A. Visual and Quantitative Results for Subsampled Sequences

The original Airport test sequence consists of seven high-resolution frames, with the center frame $z^{(k)}$ shown in Figure 2a). It was generated by extracting subimages from a digitized image of an airport, shifting each successive frame seven horizontal pixels to the right and seven vertical pixels down. This video sequence simulates a diagonal panning of the scene acquired by a video camera mounted on an airplane flying overhead. Each low-resolution frame $y^{(l)}$ was generated by averaging $4 \times 4$ pixel blocks within each high-resolution frame $z^{(l)}$ and then subsampling by a factor of $q = 4$. The center low-resolution frame $y^{(k)}$ was expanded using the single frame techniques of bilinear interpolation [18], cubic B-spline interpolation [3], Bayesian MAP estimation assuming a Gauss-Markov image model with $\alpha = \infty$ [7], and Bayesian MAP estimation assuming a Huber-Markov image model with $\alpha = 1$ [7]. The two Bayesian estimates were computed to compare linear and nonlinear estimates. Table I provides a quantitative comparison of the estimates by showing the improved signal-to-noise ratio,

$$\Delta_{\mathrm{SNR}} = 10 \log_{10} \frac{\|z^{(k)} - z_0^{(k)}\|^2}{\|z^{(k)} - \hat{z}^{(k)}\|^2} \quad \text{(dB)}, \qquad (29)$$

with the zero-order hold of the center frame, $z_0^{(k)} = q^2 A^{(k,k)t} y^{(k)}$, selected as the reference. Next, the video frame extraction algorithm was used to estimate the center frame, given the low-resolution sequence. Displacement vectors were first estimated using the hierarchical subpixel motion estimation algorithm. A linear estimate was computed with parameter values $M = 7$, $\alpha = \infty$, and $\lambda^{(l,k)} = 10/|l-k|$, and then a nonlinear estimate was generated with $M = 7$, $\alpha = 1$, and $\lambda^{(l,k)} = 10/|l-k|$. Note that the relatively small value for $\lambda^{(l,k)}$ shows little confidence in the motion estimates. In this particular example, it is known that only panning occurs within the sequence. With this knowledge, greatly improved motion vector estimates can be recovered by averaging over the estimated displacement vector fields. This results in a significant improvement in both the linear and nonlinear estimates, as reported in the final two rows of Table I. The parameter value $\lambda^{(l,k)} = 1000/|l-k|$ represents a high confidence in the estimate of $\hat{A}'^{(l,k)}$. Figure 2 shows the original high-resolution test frame $z^{(k)}$, the center low-resolution frame of the sequence $y^{(k)}$, the best single frame MAP estimate, and the best video frame extraction from the Airport sequence.

The original Mobile Calendar test sequence consists of seven frames, composed of objects possessing fine detail. Within the sequence, a wall calendar moves with subpixel translational motion, a toy train engine moves with translational motion, and the train is pushing a ball undergoing rotational motion. Each of the high-resolution frames was subsampled by a factor of $q = 4$ in the same manner as the Airport sequence. Table II shows quantitative results for various Mobile Calendar frame interpolations. Figures 3 and 4 show the complete unprocessed and estimated frames, while Figure 5 shows details in a region of the wall calendar. Again, the video frame extraction algorithm with the Huber-Markov image model provides the best results, although the improvement is not as dramatic as in the previous sequence.

To minimize the number of computations required in estimating a high-resolution video still, the objective is to include the fewest number of frames in the multiframe model which will provide a significant improvement over single frame interpolation techniques. The number of low-resolution frames which should be included in the video observation model can be determined experimentally by plotting $\Delta_{\mathrm{SNR}}$ versus $M$. Figure 6a) shows this information for the Airport sequence for a number of different image and motion model combinations. Evidently, using $M = 5$ frames for computing the high-resolution estimate is sufficient for this sequence. Results for the Mobile Calendar sequence are shown in Figure 6b). It appears that several more frames could have been used from the Mobile Calendar sequence to improve resolution, although gains would likely be incremental at best. From experience, a multiframe model composed of $M = 5$ frames from an arbitrary image sequence generates a significantly higher resolution frame than any single frame interpolation technique. This relatively small number of frames provides a decent trade-off between video frame quality and computation time necessary to estimate the high-resolution data.
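The iteration in Steps 2 through 6 and the measure in (29) can be illustrated on a single frame. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses one frame only, replaces the Huber-Markov prior with a plain quadratic penalty on first differences, and takes the step size from an exact line search for that quadratic rather than from (26). All function names are hypothetical.

```python
import numpy as np

def block_average(z, q):
    """Average q x q blocks and subsample by q (the observation y = A z)."""
    h, w = z.shape
    return z.reshape(h // q, q, w // q, q).mean(axis=(1, 3))

def zero_order_hold(y, q):
    """Replicate each low-resolution pixel over a q x q block (equals q^2 A^t y)."""
    return np.kron(y, np.ones((q, q)))

def roughness(z):
    """Assumed stand-in prior: sum of squared vertical and horizontal differences."""
    return float((np.diff(z, axis=0) ** 2).sum() + (np.diff(z, axis=1) ** 2).sum())

def roughness_gradient(z):
    """Gradient of roughness(z) with respect to every pixel."""
    g = np.zeros_like(z)
    dv = np.diff(z, axis=0)
    dh = np.diff(z, axis=1)
    g[1:, :] += 2 * dv
    g[:-1, :] -= 2 * dv
    g[:, 1:] += 2 * dh
    g[:, :-1] -= 2 * dh
    return g

def project(g, q):
    """Apply I - q^2 A^t A: remove the part of g that would change block averages."""
    return g - zero_order_hold(block_average(g, q), q)

def gradient_projection(y, q, tol=1e-6, max_iter=500):
    """Minimize roughness(z) subject to block_average(z, q) == y."""
    z = zero_order_hold(y, q)                    # feasible starting point
    for _ in range(max_iter):
        p = -project(roughness_gradient(z), q)   # projected descent direction
        curvature = 2.0 * roughness(p)           # p^t H p for this quadratic
        if curvature == 0.0:
            break
        step = float((p * p).sum()) / curvature  # exact line search along p
        z_next = z + step * p
        done = np.linalg.norm(z_next - z) <= tol * np.linalg.norm(z)
        z = z_next
        if done:
            break
    return z

def improved_snr(z_true, z_ref, z_est):
    """Improved SNR in dB of z_est relative to the reference z_ref, as in (29)."""
    num = np.sum((z_true - z_ref) ** 2)
    den = np.sum((z_true - z_est) ** 2)
    return 10.0 * np.log10(num / den)

if __name__ == "__main__":
    i, j = np.mgrid[0:32, 0:32]
    z = np.sin(i / 5.0) + np.cos(j / 7.0)        # synthetic smooth test frame
    y = block_average(z, 4)
    z_hat = gradient_projection(y, 4)
    print(improved_snr(z, zero_order_hold(y, 4), z_hat))
```

Because every search direction is projected, each iterate keeps the block averages equal to the observed low-resolution frame, which is the role the constraint space plays in Step 3.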
B. Interpretation of Results

The video frame extraction algorithm performs extremely well given the image sequence with camera pan motion, with more moderate improvements shown for the digitized sequence with independent object motion. Several justifications are proposed for the performance of the algorithm in each case.

All rows within $A^{(k,k)}$ are linearly independent. To provide additional information in the video observation model, additional linearly independent rows must be available from the motion compensated subsampling matrices $\hat{A}'^{(l,k)}$. If a low-resolution pixel undergoes a subpixel shift, this will provide another useful constraint. Otherwise, the constraints will be redundant and will not provide more information than the pixel found within the center frame. In the Airport sequence, subpixel global motion occurs between each frame: the shift of seven high-resolution pixels corresponds to 7/4 = 1.75 low-resolution pixels for $q = 4$. This provides a large number of linearly independent constraints in the observation model, and substantial improvement over the low-resolution data is achieved with the video frame extraction algorithm. Independent object motion occurs in the Mobile Calendar sequence. In regions such as the wall calendar in Figure 4, resolution improvement is evident since the wall calendar undergoes a subpixel translation. However, much of the background is stationary, and in this region the motion compensated subsampling matrices provide little additional information in the observation model.

The diagonal shift of each frame within the Airport sequence results in the greatest number of linearly independent constraints. If an object is moving with subpixel motion, object features orthogonal to the motion will be enhanced by the frame extraction algorithm. For instance, if the motion is a horizontal panning of the scene rather than diagonal, vertical structures will be reconstructed with enhanced resolution, and horizontal structures will remain essentially as they were in the low-resolution data. In the Airport sequence, the orthogonal axis to the direction of motion can be described by the addition of vertical and horizontal vectors with equal magnitudes. This is why the Airport estimate in Figure 2d) contains both vertical and horizontal structures with high resolution, and serves as an "upper-bound" estimate of the video frame extraction algorithm.

A basic assumption for the motion occurring between frames is that intensity is constant along object trajectories. For the synthetic sequence this assumption holds exactly, since the sequence was generated from a single digitized image. This allows for the accurate estimation of displacement vectors within the Airport sequence. For actual digitized sequences such as Mobile Calendar, the constant intensity assumption often does not hold due to noise within each frame, and the motion estimator does not produce an accurate vector field. Furthermore, in the region containing the ball with pixels undergoing rotational motion, the motion estimator fails. Incorrect motion estimates can severely degrade the quality of the estimate, affecting the appearance of the overall frame extraction.

C. Visual Results for an Actual Video Sequence

The true indication as to whether a video processing algorithm is effective is to test it on an actual image sequence. Dome is a short video sequence of a landmark on the University of Notre Dame campus. Figure 7a) shows the center frame of the $M = 5$ frame sequence. Each frame was progressively-scanned, containing $160 \times 120$ pixels. The video frame extraction from these five frames, interpolated by a factor of $q = 4$, is shown in Figure 7b).

In order to show how the multiframe algorithm may be used for scan conversion, an interlaced version of the Dome sequence was created as well. Only the even- or odd-numbered scan lines from the progressively-scanned frames were used in alternating interlaced frames. A simplistic method employed in the generation of interlaced video hardcopy involves the integration of two fields by placing them together in the same frame. An image produced by this method is shown in Figure 8a). Note the severe motion artifacts between the even and odd fields. The deinterlaced frame generated from two even fields and three odd fields is depicted in Figure 8b), interpolated by a factor of $q = 4$. Since the multiframe technique uses motion compensation between frames, visual resolution is significantly improved. As expected, the frame generated from the progressively-scanned data shown in Figure 7b) has higher resolution than the frame computed from the interlaced sequence shown in Figure 8b), since twice as much information is present in the progressively-scanned sequence.

V. Conclusion

Single frame interpolation methods are inherently limited by the number of constraints available within a given image. Additional linearly independent constraints are available from the adjacent frames within a video sequence. A novel observation model was proposed for low-resolution video frames, which models the subsampling of the unknown high-resolution data and accounts for independent object motion occurring between frames. A hierarchical subpixel motion estimator was presented to estimate the displacement vectors required in constructing the observation model, and a Bayesian frame extraction algorithm was proposed to estimate a single high-resolution frame given a short low-resolution video sequence. Provided that the object motion has subpixel resolution, the estimate computed by the frame extraction algorithm has the potential to be substantially improved over single frame interpolations. Visual and quantitative simulation results were reported for a synthetically generated sequence containing a global camera pan, and an image sequence containing independent object motion. In the case of camera panning, the improvement provided by the algorithm was quite remarkable. More modest gains were visible for the sequence containing objects moving independently. Simulations were also conducted on an actual video sequence to compare the performance of the algorithm on progressively-scanned and interlaced frames.

A number of issues will be explored in future research. The most critical aspect of accurately modeling the video data is the accurate estimation of motion. Regularization techniques that are robust to spatial noise and sparsity (e.g., interlaced and compressed video frames) and to temporal discontinuities within the image sequence can be applied to the ill-posed inverse problem of motion estimation [30]. The improved displacement vector estimates should provide a more accurate estimate of $\hat{A}'^{(l,k)}$. To further improve the video observation model, a more accurate sensor model is under investigation which incorporates a realistic point spread function (PSF) for the electronic imaging system. Color video sequences contain spectral information in each frame which can further improve the quality of the high-resolution frame estimates. Spectral correlations can be included in the Huber-Markov image model through the addition of between-channel cliques [31]. Finally, the scan conversion capabilities of this algorithm will be investigated more fully.
References

[1] G. Chen and R. J. P. deFigueiredo, "A unified approach to optimal image interpolation problems based on linear partial differential equation models," IEEE Trans. Image Processing, vol. 2, no. 1, pp. 41–49, 1993.
[2] T. C. Chen and R. J. P. deFigueiredo, "Two-dimensional interpolation by generalized spline filters based on partial differential equation image models," IEEE Trans. Acoust., Speech, Signal Processing, vol. 33, no. 3, pp. 631–642, 1985.
[3] H. S. Hou and H. C. Andrews, "Cubic splines for image interpolation and digital filtering," IEEE Trans. Acoust., Speech, Signal Processing, vol. 26, no. 6, pp. 508–517, 1978.
[4] N. B. Karayiannis and A. N. Venetsanopoulos, "Image interpolation based on variational principles," Signal Processing, vol. 25, no. 3, pp. 259–288, 1991.
[5] R. G. Keys, "Cubic convolution interpolation for digital image processing," IEEE Trans. Acoust., Speech, Signal Processing, vol. 29, no. 6, pp. 1153–1160, 1981.
[6] J. A. Parker, R. V. Kenyon, and D. E. Troxel, "Comparison of interpolating methods for image resampling," IEEE Trans. Med. Imaging, vol. 2, no. 1, pp. 31–39, 1983.
[7] R. R. Schultz and R. L. Stevenson, "A Bayesian approach to image expansion for improved definition," IEEE Trans. Image Processing, vol. 3, no. 3, pp. 233–242, 1994.
[8] M. Unser, A. Aldroubi, and M. Eden, "Fast B-spline transforms for continuous image representation and interpolation," IEEE Trans. Patt. Anal. Mach. Intell., vol. 13, no. 3, pp. 277–285, 1991.
[9] K. Xue, A. Winans, and E. Walowit, "An edge-restricted spatial interpolation algorithm," J. Elec. Imaging, vol. 1, no. 2, pp. 152–161, 1992.
[10] P. Cheeseman, B. Kanefsky, R. Kraft, J. Stutz, and R. Hanson, "Super-resolved surface reconstruction from multiple images," Tech. Rep. FIA-94-12, NASA Ames Research Center, Moffett Field, CA, December 1994.
[11] M. Irani and S. Peleg, "Improving resolution by image registration," CVGIP: Graphical Models and Image Processing, vol. 53, no. 3, pp. 231–239, 1991.
[12] G. Jacquemod, C. Odet, and R. Goutte, "Image resolution enhancement using subpixel camera displacement," Signal Processing, vol. 26, no. 1, pp. 139–146, 1992.
[13] S. P. Kim, N. K. Bose, and H. M. Valenzuela, "Recursive reconstruction of high resolution image from noisy undersampled multiframes," IEEE Trans. Acoust., Speech, Signal Processing, vol. 38, no. 6, pp. 1013–1027, 1990.
[14] A. J. Patti, M. I. Sezan, and A. M. Tekalp, "High-resolution image reconstruction from a low-resolution image sequence in the presence of time-varying motion blur," in Proc. IEEE Int. Conf. Image Processing, (Austin, TX), pp. I-343 to I-347, 1994.
[15] H. Stark and P. Oskoui, "High-resolution image recovery from image-plane arrays, using convex projections," J. Opt. Soc. Am. A, vol. 6, no. 11, pp. 1715–1726, 1989.
[16] A. M. Tekalp, M. K. Ozkan, and M. I. Sezan, "High-resolution image reconstruction from lower-resolution image sequences and space-varying image restoration," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (San Francisco, CA), pp. III-169 to III-172, 1992.
[17] R. Y. Tsai and T. S. Huang, "Multiframe image restoration and registration," in Advances in Computer Vision and Image Processing (R. Y. Tsai and T. S. Huang, eds.), vol. 1, pp. 317–339, JAI Press Inc., 1984.
[18] A. K. Jain, Fundamentals of Digital Image Processing. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[19] D. J. C. MacKay, "Bayesian interpolation," Neural Comp., vol. 4, no. 3, pp. 415–447, 1992.
[20] R. L. Stevenson, B. E. Schmitz, and E. J. Delp, "Discontinuity preserving regularization of inverse visual problems," IEEE Trans. Syst., Man, Cybern., vol. 24, no. 3, pp. 455–469, 1994.
[21] A. J. Patti, M. I. Sezan, and A. M. Tekalp, "High-resolution standards conversion of low resolution video," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Detroit, MI), pp. 2197–2200, 1995.
[22] M. Bierling, "Displacement estimation by hierarchical blockmatching," in Proc. SPIE Conf. Visual Commun. Image Processing '88 (T. R. Hsing, ed.), vol. 1001, pp. 942–951, 1988.
[23] H. G. Musmann, P. Pirsch, and H.-J. Grallert, "Advances in picture coding," Proc. IEEE, vol. 73, no. 4, pp. 523–548, 1985.
[24] M. Bierling and R. Thoma, "Motion compensating field interpolation using a hierarchically structured displacement estimator," Signal Processing, vol. 11, no. 4, pp. 387–404, 1986.
[25] G. de Haan and P. W. A. C. Biezen, "Sub-pixel motion estimation with 3-D recursive search block-matching," Signal Processing: Image Commun., vol. 6, pp. 229–239, 1994.
[26] M. Orchard, "A comparison of techniques for estimating block motion in image sequence coding," in Proc. SPIE Conf. Visual Commun. Image Processing IV, vol. 1199, pp. 248–258, 1989.
[27] M. Hötter and R. Thoma, "Image segmentation based on object oriented mapping parameter estimation," Signal Processing, vol. 15, no. 3, pp. 315–334, 1988.
[28] J. Hadamard, Lectures on the Cauchy Problem in Linear Partial Differential Equations. New Haven, CT: Yale University Press, 1923.
[29] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables. Computer Science and Applied Mathematics, New York: Academic Press, 1970.
[30] D. Shulman and J.-Y. Hervé, "Regularization of discontinuous flow fields," in Proc. Workshop on Visual Motion, (Irvine, CA), pp. 81–86, 1989.
[31] R. R. Schultz and R. L. Stevenson, "Stochastic modeling and estimation of multispectral image data," IEEE Trans. Image Processing, vol. 4, no. 8, August 1995.
Richard R. Schultz was born on March 19, 1967, in Grafton, North Dakota. He received the B.S.E.E. degree (summa cum laude) from the University of North Dakota in 1990, and the M.S.E.E. and Ph.D. degrees from the University of Notre Dame in 1992 and 1995, respectively. He joined the faculty of the Department of Electrical Engineering at the University of North Dakota in the Fall of 1995, where he is currently an Assistant Professor. Dr. Schultz is a member of the IEEE, and a member of both Eta Kappa Nu and Tau Beta Pi. His current research interests include digital image and video processing and the analysis of biomedical imagery.

Robert L. Stevenson was born on December 20, 1963, in Ridley Park, Pennsylvania. He received the B.E.E. degree (summa cum laude) from the University of Delaware in 1986, and the Ph.D. in Electrical Engineering from Purdue University in 1990. While at Purdue he was supported by a National Science Foundation Graduate Fellowship and a graduate fellowship from the DuPont Corporation. He joined the faculty of the Department of Electrical Engineering at the University of Notre Dame in 1990, where he is currently an Assistant Professor. Dr. Stevenson is a member of the IEEE, Eta Kappa Nu, Tau Beta Pi, and Phi Kappa Phi. His research interests include multidimensional signal processing, image processing, and computer vision.

TABLE I
Comparison of Interpolation Methods on the Airport Sequence

Technique                                                                                        Improved SNR (dB)
Single Frame Bilinear Interpolation                                                              0.57
Single Frame Cubic B-Spline Interpolation                                                        1.25
Single Frame MAP Estimation, α = ∞                                                               1.43
Single Frame MAP Estimation, α = 1                                                               1.51
Video Frame Extraction with Motion Estimates, M = 7, α = ∞, λ^(l,k) = 10/|l−k|                   3.47
Video Frame Extraction with Motion Estimates, M = 7, α = 1, λ^(l,k) = 10/|l−k|                   5.48
Video Frame Extraction with Panning, M = 7, α = ∞, λ^(l,k) = 1000/|l−k|                          6.72
Video Frame Extraction with Panning, M = 7, α = 1, λ^(l,k) = 1000/|l−k|                          7.00

TABLE II
Comparison of Interpolation Methods on the Mobile Calendar Sequence

Technique                                                                                        Improved SNR (dB)
Single Frame Bilinear Interpolation                                                              0.24
Single Frame Cubic B-Spline Interpolation                                                        0.72
Single Frame MAP Estimation, α = ∞                                                               0.82
Single Frame MAP Estimation, α = 1                                                               1.05
Video Frame Extraction with Motion Estimates, M = 7, α = ∞, λ^(l,k) = 10/|l−k|                   1.27
Video Frame Extraction with Motion Estimates, M = 7, α = 1, λ^(l,k) = 10/|l−k|                   1.97

[Figure 1: block diagram with inputs y^(k) and y^(l), processing blocks MAD, ZOH-2, MAP-2, MAP-4, and UPD, and displacement fields v^(l,k).]

Fig. 1. Hierarchical subpixel motion estimator for q = 4. MAD represents the block matching motion estimator using the mean absolute difference criterion, ZOH-2 corresponds to a zero-order hold of the input field by a factor of two, MAP-n denotes an nth order up-sampler using the Bayesian MAP interpolation algorithm, and UPD corresponds to the unobservable pel detector. Triangles within each block represent initial conditions.
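The MAD block named in the caption can be illustrated with an exhaustive integer-pel search. This is a minimal sketch only: the block size, search range, and function name are assumptions made for illustration, and the hierarchy of Fig. 1 (the ZOH-2 and MAP-n up-samplers and the UPD stage) is not reproduced.

```python
import numpy as np

def mad_block_match(ref, tgt, top, left, block=8, search=4):
    """Find the displacement (dy, dx) that best maps a block of `ref` into `tgt`.

    The block has its upper-left corner at (top, left) in `ref`. Every integer
    displacement within +/- `search` pixels is tried, and the one with the
    smallest mean absolute difference (MAD) is returned with its MAD value.
    """
    patch = ref[top:top + block, left:left + block].astype(float)
    best_vector = (0, 0)
    best_mad = np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = top + dy, left + dx
            # Skip candidates that fall partly outside the target frame.
            if r < 0 or c < 0 or r + block > tgt.shape[0] or c + block > tgt.shape[1]:
                continue
            candidate = tgt[r:r + block, c:c + block]
            mad = float(np.abs(patch - candidate).mean())
            if mad < best_mad:
                best_mad = mad
                best_vector = (dy, dx)
    return best_vector, best_mad

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.random((32, 32))
    # Shift the frame 2 rows down and 3 columns left to simulate motion.
    shifted = np.roll(frame, shift=(2, -3), axis=(0, 1))
    print(mad_block_match(frame, shifted, top=12, left=12))
```

Applying such a search to frames that have first been up-sampled is one way to reach displacements finer than one low-resolution pixel, which is the purpose of the up-sampling stages in the figure.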

Fig. 2. Synthetic Airport sequence. (a) Original high-resolution frame z^(k). (b) Low-resolution frame y^(k). (c) Single frame MAP estimate, M = 1, α = 1. (d) Video frame extraction with averaged motion estimates, M = 7, α = 1, λ^(l,k) = 1000/|l−k|.

Fig. 3. Mobile Calendar sequence. (a) Original high-resolution frame z^(k). (b) Low-resolution frame y^(k).

Fig. 4. Mobile Calendar sequence. (a) Single frame MAP estimate, M = 1, α = 1. (b) Video frame extraction with motion estimates, M = 7, α = 1, λ^(l,k) = 10/|l−k|.

Fig. 5. Details of the Mobile Calendar sequence. (a) Original high-resolution frame z^(k). (b) Low-resolution frame y^(k). (c) Single frame MAP estimate, M = 1, α = 1. (d) Video frame extraction with motion estimates, M = 7, α = 1, λ^(l,k) = 10/|l−k|.

[Figure 6: plots of improved SNR (dB) versus number of frames. (a) Airport Sequence, with curves GMRF/Motion Vectors, HMRF/Motion Vectors, GMRF/Camera Pan, and HMRF/Camera Pan. (b) Mobile Calendar Sequence, with curves GMRF/Motion Vectors and HMRF/Motion Vectors.]

Fig. 6. Improved SNR versus number of frames in video observation model. (a) Airport sequence. (b) Mobile Calendar sequence. Prior image models include the Huber-Markov random field (HMRF) model and the Gauss-Markov random field (GMRF) model. "Motion Vector" corresponds to displacement vectors estimated independently by the hierarchical subpixel motion estimator, while "Camera Pan" designates a single motion vector for each frame, obtained by averaging each estimated displacement field.
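The "Camera Pan" setting in the caption reduces each estimated displacement field to one vector per frame. A minimal sketch of that averaging follows, assuming the field is stored as an array of shape (rows, cols, 2) holding (dy, dx) per pixel or block; the storage layout and the function name are illustrative assumptions.

```python
import numpy as np

def camera_pan_vector(displacement_field):
    """Average a field of (dy, dx) vectors into a single global pan vector."""
    field = np.asarray(displacement_field, dtype=float)
    return field.reshape(-1, 2).mean(axis=0)

if __name__ == "__main__":
    # A 4 x 4 field that mostly reports a shift of 1.75 low-resolution pixels
    # in each direction, with one outlying estimate.
    field = np.full((4, 4, 2), 1.75)
    field[0, 0] = (3.75, -0.25)
    print(camera_pan_vector(field))
```

Averaging only makes sense when the scene motion really is a global pan, as in the Airport sequence; for independently moving objects it would blend unrelated motions.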


Fig. 7. Progressively-scanned Dome Sequence. (a) Low-resolution frame y^(k) of progressively-scanned sequence. (b) Video frame extraction with averaged motion vectors, M = 5, α = 1, λ^(l,k) = 5/|l−k|.

Fig. 8. Interlaced Dome Sequence. (a) Low-resolution frames y^(k) (even field) and y^(k+1) (odd field) of interlaced sequence, combined into a single frame. (b) Video frame extraction with averaged motion vectors, M = 5 (2 even fields, 3 odd fields), α = 1, λ^(l,k) = 5/|l−k|.
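Panel (a) is produced by placing an even field and an odd field from consecutive frames into one frame. The sketch below shows that field extraction and weaving under simple assumptions: frames are 2-D arrays, row 0 counts as an even scan line, and the function names are hypothetical.

```python
import numpy as np

def split_fields(frame):
    """Return the even-numbered and odd-numbered scan lines of a frame."""
    return frame[0::2, :], frame[1::2, :]

def weave(even_field, odd_field):
    """Interleave an even field and an odd field into a single frame."""
    rows = even_field.shape[0] + odd_field.shape[0]
    cols = even_field.shape[1]
    frame = np.empty((rows, cols), dtype=np.result_type(even_field, odd_field))
    frame[0::2, :] = even_field
    frame[1::2, :] = odd_field
    return frame

if __name__ == "__main__":
    frame_k = np.arange(24).reshape(6, 4)   # frame at time k
    frame_k1 = frame_k + 100                # frame at time k + 1
    even_k, _ = split_fields(frame_k)
    _, odd_k1 = split_fields(frame_k1)
    print(weave(even_k, odd_k1))
```

When the scene moves between the two field times, the woven frame mixes two different instants on alternating lines, which is the source of the motion artifacts visible in panel (a).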
