

Simulating Low-Cost Cameras for Augmented Reality Compositing

Georg Klein, Member, IEEE, and David W. Murray, Member, IEEE

Abstract—Video see-through Augmented Reality adds computer graphics to the real world in real time by overlaying graphics onto a
live video feed. To achieve a realistic integration of the virtual and real imagery, the rendered images should have a similar appearance
and quality to those produced by the video camera. This paper describes a compositing method which models the artifacts produced
by a small low-cost camera, and adds these effects to an ideal pinhole image produced by conventional rendering methods. We
attempt to model and simulate each step of the imaging process, including distortions, chromatic aberrations, blur, Bayer masking,
noise, sharpening, and color-space compression, all while requiring only an RGBA image and an estimate of camera velocity as inputs.

Index Terms—Artificial, augmented, and virtual realities, visualization, compositing.

1 INTRODUCTION

AUGMENTED Reality inserts virtual graphics into the real world. Here, we consider video see-through AR and the insertion of graphics by blending some rendered image onto a video feed from a small hand-held or head-mounted camera. In some applications, it may be desirable to have the virtual graphics appear as if they were part of the real world, and to create this illusion requires surmounting a number of challenges: tracking should be accurate and jitter-free, so that the graphics appear glued in place in the real world; occlusions between real and virtual objects should be correct; the lighting of the virtual objects should match that of the real world; and the quality and texture of the rendered pixels should match that seen in the video feed.

In this paper, we address the last problem. Virtual graphics are usually rendered assuming that the camera is a perfect pinhole device, when in reality, the Webcams or other small devices often used for AR add many distortions and imperfections to the image. To convincingly blend the two images, there are then two options: One is to somehow remove the distortions and imperfections from the captured image, but this is unrealistically difficult. The other is to artificially introduce imperfections into the rendered images, and this is the approach taken here. More specifically, this paper seeks to emulate the imaging process which occurs in small cameras with wide-angle lenses, such as the Unibrain Fire-i. There has been some previous work on this topic: Watson and Hodges [17] have shown that lens distortion can be emulated or corrected using graphics hardware; more recently, Fischer et al. [4] have shown that the integration of rendered graphics can be improved by adding synthetic noise and motion blur, and by antialiasing the blending seams between graphics and video. Both of these methods exploited ever-increasing GPU bandwidth to achieve their aims, and we follow the same approach.

To enable easy integration with existing rendering methods, we cast our compositing method as a postrendering process, operating on an ideal rendered image of the virtual graphics such as is typically produced by OpenGL. This image is warped and blurred according to lens, motion, and sensor characteristics, and then resampled and degraded in accordance with sensor behavior. We then blend the degraded image with the captured video data to produce the composited image. We attempt a principled simulation of camera effects based on data either learned (preferentially) from device specifications, offline calibration, or guesswork (when necessary). We show that lens distortions, chromatic aberrations, vignetting, antialiasing filters, motion blur, Bayer interpolation, noise, sharpening, quantization, and color-space conversion can all be modeled on today's commodity graphics hardware.

The next section of this paper discusses previous related work. Section 3 discusses an a priori model of the imaging pipeline of a small camera. Section 4 describes methods of quantifying the effects of this pipeline as performed on the Fire-i camera, and in Section 5, we show how this pipeline can be simulated on the computer. Results are presented in Sections 6 and 7, and finally, Section 8 concludes the paper.

The authors are with the Department of Engineering Science, University of Oxford, Parks Road, Oxford OX1 3PJ, UK. E-mail: {gk, dwm}@robots.ox.ac.uk.

2 RELATED WORK

A previous short version of this paper [11] has appeared in the Proceedings of ISMAR 2008.

Grain matching (a simulation of the texture of film stock) is long established in the offline world of the movie industry, and the imaging process of digital cameras has been investigated in offline contexts (e.g., the work of Florin [6]). In AR, such sensor effects were only recently considered: Fischer et al. [4] add noise and motion blur to the rendered image, and antialias the boundaries where real and virtual images meet. This paper can be seen as an extension of that work in which we consider more sensor effects. However, our method is also more general in that it requires no special treatment of seams or transparent areas. Further, our more principled simulation of the imaging process means that some behaviors, such as the "splotchy" nature of image noise, emerge naturally and do not have to be simulated by hand.
Lens effects have been considered by a number of
researchers. Radial distortion is discussed by Watson and
Hodges [17] for counteracting the effects of HMD distor-
tions, and their method has been widely adopted for AR
compositing: either to undistort the camera image, and
then, render graphics on top, or to distort the rendered
graphics to match the distorted video feed. (It is notable that
the latter operation is still frequently performed poorly,
with dark or gray outlines appearing on the outside
boundaries of rendered graphics. This is due to improper
blending, as discussed in Section 5.1.) Okumura et al. [14]
demonstrate the real-time estimation and application of
motion and defocus blur, resulting in noticeably more
natural-looking graphics. Defocus blur is an important effect for cameras with large apertures and varying focus; it is less significant for small fixed-focus lenses such as used here. We do not consider it in this paper because it is difficult to simulate properly in a postprocessing step.

Fig. 1. A frame captured from a Unibrain Fire-i camera equipped with a 2.1-mm wide-angle lens exhibits various imperfections including (a) banding, (b) noise, (c) corner softness, (d) chromatic aberrations, (e) Bayer artifacts, and (f) very low horizontal chroma resolution. In this paper, we attempt to emulate these effects.
It is important to point out that this paper does not attempt to create a complete photo-realistic imaging process. To properly match rendered graphics with the real scene also requires accurate color matching and a continual estimation of the scene's lighting conditions. Neither of these things is attempted here, but these problems are not new to the AR literature: Jacobs and Loscos [9] provide a recent survey of illumination-matching work.

An alternative to producing the most realistic rendition of virtual graphics possible is to employ nonphoto-realistic rendering. Here, graphics are rendered in some stylized fashion, for example, with thick black edges and flat-shaded colors, resulting in a cartoon-like appearance. If this stylization is applied both to the rendered graphics and the source video, a seamless integration can be achieved. The challenge here is that adding a nonphoto-realistic stylization is easier for rendered images than for a video feed, due to the availability of 3D information, so care must be taken to achieve a good match. A number of demonstrations of nonphoto-realistic rendering in augmented reality have been presented: Fischer and Bartz apply a cartoon effect [5]; Haller et al. apply a variety of effects which emulate a paper sketch [7]; and recently, Chen et al. have demonstrated a causal AR implementation of a watercolor NPR effect which is normally only possible in noncausal postprocessing [2].

3 THE IMAGING PIPELINE OF A UNIBRAIN FIRE-I

This section describes in practical terms the image formation process which takes place on small cameras, with emphasis on those steps which cause visible distortions or nonideal effects to appear in the image. We use the Unibrain Fire-i (an IEEE-1394 camera with a Sony ICX098BQ CCD sensor) fitted with a 2.1 mm lens as an example. Not only is this a popular setup widely used for real-time tracking work, it also performs a large amount of image processing onboard and so produces complex image effects: this makes it a good example to use. (Later in the paper, we also consider the IDS µ-eye, a camera which performs less onboard processing; the imaging process for that camera is just a truncated version of the list below.)

Fig. 1 illustrates a 640×480 YUV-411 frame (converted to RGB) captured by a stationary camera pointed at a printed piece of paper. The image shows a number of effects which have been introduced by the camera, and these arise from the image formation process. We assume a five-step image formation process:

1. Lens effects: Incoming light is focused onto the image sensor by the 2.1 mm lens. This produces large barrel distortion, as well as some image softness (particularly in the image corners) and vignetting (darkening of the corners and image edges). Further, different wavelengths are refracted to various degrees, resulting in chromatic aberrations visible as the purple/yellow fringing in Fig. 1. Bright lights would cause flare, and very near objects would be blurred by defocus.

2. Bayer mask: The lens projects an image of the world onto a Bayer mask [1] (or color filter array, CFA) which forms part of the camera's sensor. Each of the sensor's photosites (or sensels) is masked by a color filter allowing either red, green, or blue light to pass. The color filters are not perfect bandpass filters, so substantial color crosstalk takes place, i.e., a red photosite will also capture some blue and green light. Many Bayer sensors also employ an antialiasing filter (a blurring light spreader) to avoid the formation of color Moiré patterns when observing high-frequency structure.

3. Image sensor: The Fire-i employs a CCD which converts to electrical charge the color-filtered light incident on each photosite during a finite exposure time. The integration time can be substantial (of the order of 33 ms indoors) and this gives rise to motion blur. Each photosite is further subject to thermal and shot noise. On the Fire-i, each photosite's integration period is simultaneous (global shutter), but on some
cameras with CMOS sensors, different rows integrate at different time periods (rolling shutter), which introduces image warping when motion is observed. The charge from each photosite is converted to a digital signal in an analog-to-digital converter (ADC), introducing quantization noise.

4. In-camera processing: On the Fire-i, the 10-bit ADC output is fed to a video processing chip which performs a variety of image operations on the raw Bayer signal. Unfortunately, documentation of these operations is not available, but they include sharpening, exposure control, Bayer interpolation (to reconstruct a full-color image from the Bayer signal), color processing (to counteract the color channel mixing described earlier), further quantization, and color-space conversion to YUV-411. The converted stream is sent to the host computer via an IEEE-1394 link.

5. Color-space conversion: The YUV-411 stream contains independent luminance information (Y) for each pixel, but groups of four horizontal pixels share the same color (U,V) information. The host computer converts this stream for processing and display, often producing a black-and-white image for tracking purposes and an RGB image for display.

4 CALIBRATION OF CAMERA EFFECTS

The preceding section outlines the process by which the camera image is formed. In this section, we describe measurement and calibration procedures by which these steps can be quantified.

4.1 Calibration of Lens Effects

An intrinsic camera parameter calibration, including radial distortion terms, is required for the operation of the visual tracking system used [10], and so we use that system's calibration method and mathematical lens model. The lens calibration uses images of a checkerboard pattern and performs damped Gauss-Newton minimization of reprojection error over the individual camera poses and joint camera parameters. The projection model used is the five-DOF model of Devernay and Faugeras [3], where barrel distortion is modeled by the arc-tangent function.¹

1. The calibration tool used in this work is part of the PTAM package available from the authors at http://www.robots.ox.ac.uk/~gk/PTAM/.
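As a brief illustration of the arc-tangent mapping involved, the Python sketch below implements the one-parameter "FOV" radial model of Devernay and Faugeras [3] as we understand it; the parameter name omega and the example value are our own choices for illustration, and this is not the PTAM calibration code.

```python
import numpy as np

def fov_distort_radius(r_u, omega):
    """Map an undistorted (pinhole) radius to a distorted radius using the
    arc-tangent 'FOV' model. r_u is the radius in normalized image
    coordinates; omega is the field-of-view distortion parameter (radians)."""
    return np.arctan(2.0 * r_u * np.tan(omega / 2.0)) / omega

def fov_undistort_radius(r_d, omega):
    """Inverse mapping: distorted radius back to the pinhole radius."""
    return np.tan(r_d * omega) / (2.0 * np.tan(omega / 2.0))

# Illustrative use with a made-up parameter: a wide-angle lens has a large
# omega, which gives strong barrel distortion toward the image corners.
omega = 1.6  # assumed value in radians; not a calibrated Fire-i parameter
r_u = np.linspace(0.0, 1.0, 5)
print(fov_distort_radius(r_u, omega))
```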
Vignetting is easily calibrated by pointing the camera at a uniformly lit white surface (i.e., a white wall), recording hundreds of gray-scale images of this surface, and then averaging these images together. The resulting image is blurred to eliminate irregularities in the sensor or uniform target, and subsampled to 80×60 pixels. Finally, the resulting image is normalized so that the brightest portion of the image is scaled to unit intensity. This image (shown in Fig. 2) can then be directly multiplied by the rendered luminance signal to vignette the corners of the rendered image.

Fig. 2. Vignetting texture of the 2.1 mm lens on the Fire-i.
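A minimal sketch of this averaging-and-normalizing procedure is given below in Python with NumPy and SciPy; the frame source, the blur radius, and the function name are illustrative assumptions rather than the authors' actual tooling.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def build_vignetting_texture(frames, out_shape=(60, 80), blur_sigma=5.0):
    """Estimate a vignetting texture from many gray-scale frames of a
    uniformly lit white surface.

    frames: iterable of 2D float arrays (e.g., 480x640 luminance images).
    Returns an array of shape out_shape with the brightest region at 1.0."""
    mean_img = None
    count = 0
    for f in frames:
        f = np.asarray(f, dtype=np.float64)
        mean_img = f.copy() if mean_img is None else mean_img + f
        count += 1
    mean_img /= count

    # Blur to suppress sensor/target irregularities, then subsample.
    smooth = gaussian_filter(mean_img, sigma=blur_sigma)
    small = zoom(smooth, (out_shape[0] / smooth.shape[0],
                          out_shape[1] / smooth.shape[1]), order=1)

    # Normalize so the brightest portion has unit intensity.
    return small / small.max()

# Example with synthetic frames (a real calibration would read camera images):
rng = np.random.default_rng(0)
fake_frames = [200.0 * np.ones((480, 640)) + rng.normal(0, 2, (480, 640))
               for _ in range(10)]
vignette = build_vignetting_texture(fake_frames)
print(vignette.shape, vignette.max())
```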
Chromatic aberration is calibrated using a printed sheet of gray-scale white noise. This is placed before the camera so as to cover the field of view, and then the differences between red, green, and blue channels are measured. Individual 30×30-pixel patches are extracted from the green channel, and the position of these patches in the red and blue channels is found by iterative image alignment (Lucas-Kanade tracking [13]). Sampling such patches in a grid across the image produces a vector image of chromatic aberration. This procedure is repeated across hundreds of frames while moving the calibration pattern to smooth out any spatial noise, and finally, the resulting vector image is smoothed with a Gaussian blur to eliminate any residual miscalibrations and resampled to a 24×18-cell grid.

4.2 Properties of the Bayer Mask

The specifications document of the Sony ICX098BQ image sensor [16] reveals that the Fire-i uses a Bayer mask of red, green, and blue filters. There are four common permutations of Bayer mask (GBRG, GRBG, RGGB, and BGGR), and the sensor specification suggests, at first glance, that the Fire-i uses a GBRG pattern. However, there are two sources of uncertainty: first, it is not clear if the arrangement in the data sheet refers to the physical sensor or the read-out image (which is rotated 180 degrees); second, the sensor has more than 640×480 pixels, so the effective mask anyway depends on which pixel is used for the top-left, which is unknown. Here, we assume that a GBRG mask is used.

It is not known if the sensor includes an antialiasing filter. The effect of this would anyway be indistinguishable from blur caused by the lens. The net blur of lens and possible antialiasing filter is tuned by hand in software by matching the appearance of rendered graphics to graphics observed in the video; we have found that a good match is obtained by assuming that no antialiasing filter is present and/or the lens is very sharp.

The sensor specification includes a spectral sensitivity chart for the Bayer filters in the sensor. Assuming that the rendered graphics' r, g, b components correspond to pure primary light of 700, 546, and 438 nm wavelength, respectively, the response r̂, ĝ, b̂ of the Bayer filter may be read off the chart and corresponds to

\[
\begin{bmatrix} \hat{r} \\ \hat{g} \\ \hat{b} \end{bmatrix} =
\begin{bmatrix} 0.95 & 0.04 & 0.01 \\ 0.07 & 0.89 & 0.04 \\ 0.00 & 0.06 & 0.94 \end{bmatrix}
\begin{bmatrix} r \\ g \\ b \end{bmatrix},
\]

where each output component has been scaled to sum to unity.

A more accurate calibration could be performed using calibration tools such as a color chart, but an accurate photometric calibration is well beyond the scope of this paper.
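To make the crosstalk concrete, the following Python sketch applies the mixing matrix above to an RGB image; treating the rendered primaries as exactly 700, 546, and 438 nm light is the paper's assumption, and the function name is ours.

```python
import numpy as np

# Bayer-filter spectral response read off the ICX098BQ data sheet,
# with each row scaled to sum to unity (see the matrix above).
BAYER_MIX = np.array([[0.95, 0.04, 0.01],
                      [0.07, 0.89, 0.04],
                      [0.00, 0.06, 0.94]])

def apply_color_crosstalk(rgb):
    """Apply the sensor's color mixing to an (H, W, 3) float RGB image.
    Returns the desaturated image the Bayer photosites would record."""
    return np.einsum('ij,hwj->hwi', BAYER_MIX, rgb)

# A pure red input picks up a little green response and loses saturation:
red = np.zeros((1, 1, 3))
red[..., 0] = 1.0
print(apply_color_crosstalk(red))   # approximately [[0.95, 0.07, 0.00]]
```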
4.3 Properties of the Image Sensor

The behavior of the CCD imager and its partner analog-to-digital converter is not fixed, but varies with parameters such as gain control, exposure time, and color balance
settings. In normal operation, we let the camera modify these parameters automatically. Unfortunately, the camera does not allow read-out of parameters in auto mode, so these settings are unknown at runtime. This renders the precise calibration of color responses and noise properties somewhat futile. Nevertheless, some information can be gleaned by observing the operation of the camera in a "standard" setting, i.e., in a lab environment.

For our purposes, the most significant effect the image sensor adds to the ideal signal is noise. A detailed characterization of the noise sources of a Unibrain Fire-i camera has recently been presented by Irie et al. [8]. The authors conclude that noise is dominated by photon shot noise, readout noise, and photo-responsive nonuniformity (PRNU). We have found that in indoor settings, shot noise and readout noise dominate so that no stationary pattern (PRNU) is perceptible, and we can assume that noise is independently distributed spatially and temporally. It is then necessary to determine the magnitude (standard deviation) of noise for each color channel as a function of illumination intensity.

To calibrate noise behavior, we mount the camera rigidly to observe an unchanging scene, and record 100 frames. For each pixel and each color channel, we calculate the mean intensity µ and standard deviation σ. A plot of their relationship is shown in Fig. 3. Irie et al. predict a relationship of the form σ = k_readout + k_shot·√µ, and indeed, this is apparent in the plots. However, it is faster at runtime to use a linear model, so we calculate the best-fit line for each color channel, disregarding outliers and values with µ > 220, which are clearly corrupted by saturation effects.

Fig. 3. Noise behavior of the Fire-i camera in indoor conditions.

We are interested in the noise values for each sensor element under the Bayer mask, but only have access to the Bayer-interpolated image for measurements. The measured values of σ must, therefore, be corrected to account for this. Irie et al. do this under the assumption of bilinear interpolation. Here, we have a different model for the Bayer interpolation (see Section 4.4) whereby each interpolated green value is the average of two sensor values, and hence, green noise is corrected by scaling with √2. Red and blue noise depend on only one sensor value, so noise for these channels is not scaled.
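The following Python sketch illustrates this per-pixel statistics and line-fitting step for one color channel; the µ > 220 saturation cutoff comes from the text, while the synthetic frames and the plain least-squares fit (without the paper's outlier rejection) are illustrative assumptions.

```python
import numpy as np

def fit_noise_model(frames, sat_threshold=220.0):
    """Fit a linear noise model sigma = a * mu + b for one color channel.

    frames: array of shape (N, H, W) holding N frames of a static scene.
    Pixels whose mean exceeds sat_threshold are discarded as saturated."""
    frames = np.asarray(frames, dtype=np.float64)
    mu = frames.mean(axis=0).ravel()        # per-pixel mean intensity
    sigma = frames.std(axis=0).ravel()      # per-pixel noise magnitude

    keep = mu <= sat_threshold
    a, b = np.polyfit(mu[keep], sigma[keep], deg=1)
    return a, b

# Synthetic example: noise standard deviation grows with intensity.
rng = np.random.default_rng(1)
base = rng.uniform(20, 240, size=(120, 160))             # static scene
noise_sd = 0.5 + 0.02 * base                             # ground-truth model
frames = base + rng.normal(0, 1, (100, 120, 160)) * noise_sd
a, b = fit_noise_model(frames)
print(f"sigma ~ {a:.3f} * mu + {b:.3f}")                 # close to 0.02, 0.5
```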
In indoor operation, the shutter-open time on the camera is mostly fixed at the full frame period of 33 ms. The images from the Fire-i also exhibit very noticeable quantization artifacts, equivalent to quantizing intensities to 6 bits. However, the Fire-i uses a 10-bit ADC, so these quantization effects must be introduced later in the processing pipeline.

When in-camera sharpening is set to zero, we observe that the images received from the camera are noticeably less sharp in the horizontal direction than vertically. This may be due to imperfections in the transfer of electrical charge from sensor to ADC. The resulting blur can be quantified by observing a checkerboard pattern, and manually applying Gaussian blur in the vertical direction so that the apparent resolution in both orientations matches. In this way, we estimate a horizontal Gaussian blur with σ = 0.7 pixels on each Bayer channel.²

2. If this blur is, in fact, caused by the analog transmission of sensor voltage, it is likely that this causes further desaturation, as pixels of different color are interleaved on readout. This has not been investigated here.

4.4 Reverse Engineering In-Camera Processing

The Fire-i contains a Texas Instruments TSB15LV01 video processor which receives the digitized sensor signal and performs a variety of video processing functions before sending a YUV-411 data stream to the computer. Detailed documentation of the process is not available. Unfortunately, this processor has a large effect on the appearance of the video produced by the Fire-i, and we seek to emulate this, so some reverse engineering is required. Here, we attempt to estimate the processor's methods for Bayer interpolation, image sharpening, and saturation control.

4.4.1 Bayer Interpolation

In a previous version of this work [11], we made the assumption that the Fire-i performs bilinear Bayer interpolation to reconstruct a full color image, and the same assumption is made by Irie et al. [8]. However, we noted that simulating bilinear interpolation does not produce images which match those of the Fire-i in areas of fine detail. Here, we examine the camera's output to derive a closer approximation of the camera's de-Bayering algorithm.

To determine the interpolation method, it would be useful to observe the effect which a signal on any one photosite has on the reconstructed image. The reconstructed image can be observed (albeit only after further in-camera processing), but manually producing a signal on one photosite would involve difficult electronic intervention. We instead rely on noise to produce this signal: we assume that noise (be it thermal or shot noise) in one photosite is independent of noise in any other, so that observing the correlations of noise in the reconstructed image should provide information on the reconstruction method.

Fig. 4 shows, for each color channel, the covariance observed in the noise present in a 4×4-pixel square of the video from the Fire-i, where pixels 1...16 are enumerated in scan order. (For this and many other calibration experiments, the Fire-i is set to 15 Hz RGB mode to avoid interference by the YUV-411 compression.) These plots can test the hypothesized de-Bayering method used. For example, the plot for the blue channel shows that pixels 2, 3, 6, and 7 are almost fully correlated, suggesting that they all depend on the same underlying sensor element, and likewise pixels 10, 11, 14, and 15 form another such clique. Both the red and blue channels appear to use a nearest-neighbor assignment, whereby 2×2-pixel blocks take the intensity of one single sensor value. This is supported by the close match of the observed correlations to those which
would be expected to arise from this mechanism, as shown in Fig. 4c. (The results expected from bilinear interpolation, which are very different, are shown in the second row.) The green channel is more complex, but appears to match the superposition of 2×2-block interpolation performed independently for the green pixels on the red and blue lines, so that each reconstructed pixel is the average of two sensor green values. This de-Bayering method is illustrated in Fig. 5, where Gr refers to green sensels on the same row as red sensels and Gb to those on the same row as blue sensels.

Fig. 4. Covariances within a 4×4-pixel block of the reconstructed image due to sensor noise. (a) The observed covariances. (b) The covariances expected using bilinear Bayer interpolation. (c) The covariances expected using 2×2-block demosaicing.

Fig. 5. Demosaicing method corresponding to the correlations in Fig. 4. The reconstructed red and blue channels have the same values within 2×2-pixel blocks. The green channel is the average of two 2×2 blocks formed by the Gr and Gb sensels.
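As a rough illustration of this noise-covariance test (not the authors' calibration code), the Python sketch below simulates independent photosite noise, reconstructs one channel with the hypothesized 2×2-block rule, and measures the 16×16 covariance of 4×4 tiles; the sensel offsets and tile extraction details are our own assumptions.

```python
import numpy as np

def demosaic_2x2(bayer, row0=0, col0=0):
    """Nearest-neighbor reconstruction of one channel: every 2x2 output block
    takes the value of its single underlying sensel at offset (row0, col0)
    within the block (the hypothesized Fire-i behavior for red and blue)."""
    sensels = bayer[row0::2, col0::2]
    return np.kron(sensels, np.ones((2, 2)))

def block_covariance(channel, block=4):
    """Covariance of pixels enumerated in scan order within block x block
    tiles, estimated over all tiles of a (noise-only) channel image."""
    h, w = channel.shape
    tiles = (channel[:h - h % block, :w - w % block]
             .reshape(h // block, block, w // block, block)
             .transpose(0, 2, 1, 3)
             .reshape(-1, block * block))
    return np.cov(tiles, rowvar=False)

# Independent noise on each photosite, then the hypothesized demosaic:
rng = np.random.default_rng(2)
noise = rng.normal(0.0, 1.0, (480, 640))
red = demosaic_2x2(noise)
cov = block_covariance(red)
print(cov.shape)        # (16, 16); fully correlated 2x2 cliques show up
```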
4.4.2 Saturation Control

The Fire-i provides a saturation control which can be adjusted by the user from zero, where a gray-scale image is received, to 255, at which point a severely processed, oversaturated image is produced. We may assume that the default value of 90 attempts to counteract the desaturating color mixing which occurs on the image sensor.

To calibrate the effect of this control, we remove the camera lens and directly expose the sensor to light, producing a uniformly colored image. Setting the camera's white balance to manual control and observing the variation in image color as the blue/U, red/V, gain, and saturation controls are modified allows the identification of the following linear model for saturation processing:

\[
\begin{bmatrix} r \\ g \\ b \end{bmatrix} =
\left(
\begin{bmatrix} 0.2915 & 0.5887 & 0.1113 \\ 0.2915 & 0.5887 & 0.1113 \\ 0.2915 & 0.5887 & 0.1113 \end{bmatrix}
+ \alpha
\begin{bmatrix} 0.0109 & -0.0090 & -0.0017 \\ -0.0045 & 0.0063 & -0.0017 \\ -0.0045 & -0.0090 & 0.0137 \end{bmatrix}
\right)
\begin{bmatrix} \hat{r} \\ \hat{g} \\ \hat{b} \end{bmatrix},
\]

where α ∈ [0, 255] is the numerical value of the saturation control. The identity transformation is achieved when α = 65.
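A small Python sketch of this model is given below; the numeric values are those above, with off-diagonal signs chosen so that the transform reduces to (approximately) the identity at a control value of 65, and the function name is ours.

```python
import numpy as np

# Saturation model identified for the Fire-i: output = (A + alpha * B) * input.
A = np.tile([0.2915, 0.5887, 0.1113], (3, 1))   # every row identical
B = np.array([[ 0.0109, -0.0090, -0.0017],
              [-0.0045,  0.0063, -0.0017],
              [-0.0045, -0.0090,  0.0137]])

def apply_saturation(rgb, alpha):
    """Apply the Fire-i's saturation processing to an (..., 3) RGB array.
    alpha is the camera's saturation control in [0, 255]; alpha = 0 yields
    gray scale and alpha = 65 is approximately the identity."""
    M = A + alpha * B
    return rgb @ M.T

print(np.round(A + 65 * B, 3))   # close to the 3x3 identity matrix
```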
4.4.3 Image Sharpening

The default power-on setting of the Fire-i's sharpness control applies significant (easily visible in the video image) sharpening in the horizontal direction. No vertical sharpening appears to be performed.

The effect of the sharpening control can be investigated by recording a stationary scene with varying levels of in-camera sharpening. We assume a linear transfer function between the unsharpened image and the sharpened version. We parameterize this transfer function by a 101-tap finite impulse response (FIR) filter, i.e., we assume that the sharpened image is produced by convolving the unsharpened image with a 101-tap signal independently for each row of the image. The 101 filter values can be found as the solution to a linear system of equations.
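One way to set up that linear system is as a least-squares problem over shifted row windows, sketched below in Python; the windowing and solver choice are illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np

def estimate_row_filter(unsharp, sharp, taps=101):
    """Estimate a length-'taps' FIR filter h such that, row by row,
    sharp is approximately unsharp convolved with h (h centered on the
    output pixel). unsharp and sharp are 2D arrays of the same scene
    captured with sharpening off and on."""
    half = taps // 2
    rows_A, rows_b = [], []
    for u, s in zip(unsharp, sharp):
        # Each interior output pixel gives one equation in the unknown taps.
        for x in range(half, len(u) - half):
            rows_A.append(u[x - half:x + half + 1][::-1])  # convolution order
            rows_b.append(s[x])
    A = np.asarray(rows_A, dtype=np.float64)
    b = np.asarray(rows_b, dtype=np.float64)
    h, *_ = np.linalg.lstsq(A, b, rcond=None)
    return h

# Sanity check on synthetic data with a known short filter:
rng = np.random.default_rng(3)
unsharp = rng.normal(size=(20, 400))
true_h = np.zeros(101)
true_h[49:52] = [-0.3, 1.6, -0.3]
sharp = np.stack([np.convolve(r, true_h, mode='same') for r in unsharp])
h = estimate_row_filter(unsharp, sharp)
print(np.round(h[48:53], 2))    # recovers ..., -0.3, 1.6, -0.3, ...
```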
Fig. 6 shows the frequency response of the transfer function estimated for a sharpness setting of 100 on the full-resolution 640×480-pixel image. As may be expected for a sharpening filter, the response has unit gain at DC, and then rises with increasing frequency. However, there is a low-gain notch present at the half-Nyquist frequency, which corresponds to the Nyquist frequency in the 320×240 image. From this, we infer that the in-camera sharpening takes place on the lower resolution image, perhaps on the 320×240 resolution signals of each of the sensor's R, B, Gr, and Gb color component outputs.

Fig. 6. Frequency response of the in-camera sharpening on a full-resolution image, with a notch at half the Nyquist frequency.

Examining the sharpening response on a 320×240-pixel subsampled image yields frequency responses as illustrated in Fig. 7 for various settings of the sharpness control, and these show an amplification of high-frequency components as expected from a sharpening filter. Also illustrated, in Fig. 8, is the impulse response of the calculated FIR filter with sharpening set to 100. It is notable that the filter is noncausal and quite long in extent, suggesting that the in-camera sharpening may be implemented as a forward-backward pass of IIR filters rather than a single FIR pass. However, we
cannot be certain of this: the impulse response has been calculated from a noisy, postprocessed version of inputs and outputs, so accuracy will be low. What is apparent in both the impulse response and the sharpened image is the asymmetric nature of the filter, which shifts the image sideways as more sharpening is applied.

Fig. 7. Frequency response of the in-camera sharpening on a 320×240 image.

Fig. 8. The impulse response measured from the Fire-i on the 320×240 image at a sharpness setting of 100.

5 IMPLEMENTATION

This section describes a method by which some elements of the above image formation process can be emulated on a computer. We start with a high-resolution image of virtual graphics rendered in OpenGL, and progressively downsample, blur, and degrade the image to produce the data which the camera would have measured at each Bayer photosite, and subsequently, blend and color-space-convert the image together with the video input feed to produce a final 640×480-pixel blended image. We implement everything using OpenGL and its shading language GLSL. The input image needs to have an alpha (transparency) channel: at this point, it is helpful to briefly review alpha blending and compositing in general.

5.1 A Note on Color Representation

We adopt the usual convention of an alpha value with a range from zero to unity indicating fractional pixel coverage: that is, an RGBA pixel encodes both an RGB color as well as a fraction of the pixel's area which is covered by that color. When performing calculations such as convolution and interpolation on RGBA data, it is, however, also important to store this data in an appropriate format; here, this means storing pixels using premultiplied alpha.

The use of premultiplied alpha has been commonplace in computer graphics since the compositing paper of Porter and Duff [15] but appears less well known in AR. In the premultiplied representation, each pixel stores not the usual quadruplet c = [r, g, b, α] but rather the values ĉ = [αr, αg, αb, α], with all values in the range 0-1. This has the advantage that interpolation and summation over pixels become trivial: the average of pixels ĉ1 and ĉ2 is simply (ĉ1 + ĉ2)/2, whereas (c1 + c2)/2 would yield incorrect results when α1 ≠ α2. Correct summation over pixels of different alpha values is crucial for the ability to handle transparent objects, allows simple convolution operations over pixels of varying alpha, and avoids the sometimes-seen black or gray outlines around virtual graphics.
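To illustrate why this matters, here is a small Python sketch (our own, not the paper's GLSL) comparing averaging in straight versus premultiplied alpha; the helper names are ours.

```python
import numpy as np

def premultiply(rgba):
    """Convert straight-alpha [r, g, b, a] to premultiplied [ar, ag, ab, a]."""
    out = np.array(rgba, dtype=np.float64)
    out[:3] *= out[3]
    return out

def unpremultiply(p):
    """Back to straight alpha (undefined where alpha == 0)."""
    out = np.array(p, dtype=np.float64)
    if out[3] > 0:
        out[:3] /= out[3]
    return out

# An opaque red pixel next to a fully transparent pixel:
c1 = np.array([1.0, 0.0, 0.0, 1.0])   # straight alpha
c2 = np.array([0.0, 0.0, 0.0, 0.0])

naive = 0.5 * (c1 + c2)                      # [0.5, 0, 0, 0.5]: a darkened red
correct = unpremultiply(0.5 * (premultiply(c1) + premultiply(c2)))
print(naive, correct)                        # the correct result keeps red at 1.0
```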
5.2 Processing

Our method requires three inputs per frame: a rendered pinhole image of the virtual graphics to be displayed, the input video image, and an estimate of the camera's rotational displacement during the frame's exposure. The rendered image should be stored in premultiplied alpha, requiring either a trivial modification of blending modes when rendering, or a preprocessing step to multiply each color by its alpha component. The image should also be sufficiently high-resolution to provide good detail across the distorted frame; with the wide-angle lens used here, this means that we use a 3,000×2,250 pixel rendering. (The 3,000 size is chosen by working backward from the 1,024×768 size later in the process. Scaling the image up by a factor of two to provide some antialiasing yields a width of 2,048, and this is multiplied by 1.5 to compensate for the change in pixel pitch in the center versus the edges of the distorted image.) The transparent background should be of color ĉ = [0, 0, 0, 0].

The rendered image is processed and blended with the video frame in 10 distinct steps, which are shown in a flowchart in Fig. 9. These steps are now described in detail.

Fig. 9. Postprocessing method employed for the Fire-i camera. The checkerboard background represents regions of transparency.

1. Radial distortion: The pinhole image is warped by rendering into a 2,048×1,536-pixel texture as a 24×18-cell³ grid, as described by Watson and Hodges [17]. Here, we use the "FOV" radial distortion model of Devernay and Faugeras [3], i.e., the same distortion parameters used by the visual tracking system employed. The field-of-view coverage of both the input image and the new distorted image is slightly larger than the video frame: this provides the margin of pixels required for later blur and aberration steps, and is illustrated in the flowchart by the black and white outlines (the white outline is the green channel's frame coverage).

3. The choice of 24×18 as a mesh size mirrors that of Watson and Hodges [17], who found that 400 vertices "struck a good balance between oversampling and overinterpolating."

2. Half sampling and color mixing: The distorted image is subsampled down to 1,024×768 pixels. We use a slightly enlarged 2×2 box filter to avoid aliasing (four taps using bilinear interpolation yield 16 source pixels for each pixel output). At the same time, the image is desaturated by mixing a small fraction of each color channel into the others. The sensor specifications suggest a mixing operation as described in Section 4.2. In practice, we find the desaturation produced by this transform is not as strong as that observed in the Fire-i's images, so we add an extra 5 percent of every color channel to every other.
3. Gaussian blur: The subsampled distorted image is blurred. We use a separable Gaussian filter, with a horizontal and vertical pass of a seven-tap convolution kernel. The blur kernel varies across the image with a 24×18-cell grid (and bilinear interpolation), allowing some regions to be blurred more than others. Local blur magnitude is calculated from the following considerations:

a. Antialiasing blur: A small quantity of blur may be applied to emulate limited lens resolution and any antialiasing filter the sensor might have.

b. Corner softness: We calculate image-varying softness by assuming that it varies with the separation of the red and blue portions of the image as calibrated for chromatic aberration. That is, the higher the color separation, the higher the blur.

c. Motion blur: We use a 16-tap motion blur kernel in the next step, and for large blur lengths, these taps can be more than one pixel apart. In such cases, we apply extra Gaussian blur to prevent visible banding.

4. Motion blur: On the same image grid as above, we estimate the local direction and magnitude of motion blur based on a tracking system's estimate of the rotational motion during exposure. This is done by projecting the motion which each grid point would undergo during 33 ms of the estimated rotational velocity. We consider only camera rotation and ignore translation and moving objects. Anything more general is very difficult to achieve as a postprocess, and would require multiple geometry rendering passes. We apply motion blur as a 16-tap filter, with the taps spaced uniformly along the tangent of the local blur direction in the source image.

5. Bayer sampling: The blurred image is subsampled with a simulated Bayer mask. Instead of producing a joint 640×480-pixel Bayer image (as delivered by the camera), we find it convenient to produce a separate intensity+alpha image for each channel, and indeed, to separate the green channel into its Gr and Gb components. We use four 326×242 images to store each of the R, Gr, Gb, and B channels; these images are larger than 320×240 so that border pixels are available for convolution and demosaicing.

Each channel samples individually from the motion-blurred image using a calibrated 24×18-cell grid and bilinear interpolation, which slightly distorts the sampling process to account for chromatic aberration. (The green channel is considered the baseline and is not distorted.) Each Bayer pixel is further darkened by multiplication with a calibrated vignetting texture, and corrupted with Gaussian noise: we use a noise texture to obtain independent normally distributed noise samples, and scale these according to the calibrated linear function of each pixel's intensity (cf. Section 4.3). The noise texture is generated only once, and we add a random offset to the texture lookup coordinates at every frame to make the samples time-varying. Making the noise texture much larger (1,024×1,024) than the 326×242 images that sample from it ensures that no fixed pattern is visible.

6. Horizontal blur: Each Bayer-mask image is blurred horizontally by convolution with a seven-tap Gaussian filter with σ = 0.7 pixels.

7. Sharpening and quantization: Each Bayer-mask image is sharpened by convolution with a sharpening mask.
The sharpening impulse response estimated in Section 4.4.3 is too long to be used directly, and an IIR filter is incompatible with the parallel nature of GPUs, so we perform sharpening using a symmetric⁴ seven-tap FIR filter. This filter F is constructed as a mixture of the identity impulse response I and a Gaussian low-pass filter L with standard deviation σ:

F = I + κβ(I − L),

where β is the value of the camera's sharpness control. The parameters σ and κ are set to roughly match the measured filter's frequency response (σ = 0.55 and κ = 0.1). A sketch of this construction is given after the step list below.

4. Even though the Fire-i's filter is known not to be symmetric, we use a symmetric filter here. This is because any image shift produced by the Fire-i is already compensated by the visual tracking system. Any extra shift introduced in the compositing stage would misalign the virtual graphics.

This convolution produces pixels with negative values. The premultiplied alpha representation still allows these pixels to be blended correctly. However, they are stored in an 8-bit unsigned integer value, so we add a +128 offset to prevent underflows. Prior to this, we also divide each value by 4. Upon storage in the 8-bit texture, this has the effect of quantizing each pixel to 6-bit resolution, to match the quantization artifacts observed in the Fire-i's output.

8. YUV blending: The rendered image is blended into the input video frame. At this stage, each rendered color channel has its own alpha channel, which is interpolated from the Bayer images alongside the color component. We perform the 2×2 box interpolation described in Fig. 5, with a slight modification: the interpolation pattern for the blue channel is harmonized to that of the red and green channels, i.e., the 2×2 box is placed so the physical sensel is in the top-right corner. This is done to avoid a 1-pixel shift of the blue channel in the reconstructed image, as any such shift produced by the Fire-i would already be reproduced by our method's chromatic aberration stage.

The blending procedure effectively converts the YUV frame to RGB, blends it with each rendered color channel in turn, and then converts back to YUV. Wary of finite numerical precision, we rearrange these operations to ensure that source video pixels not covered by graphics (α = 0) remain completely unchanged. The output is a composited 640×480-pixel image with a YUV triplet per pixel.

9. Chroma split and squash: To match the color resolution of the video input, the UV component of the blended image is horizontally subsampled by a factor of four using a uniform box filter. This splits the 640×480 YUV image into a 640×480 Y image and a 160×480 UV image.

10. Chroma recombine: The blended image is converted to RGB, using the full-resolution luminance (Y) information and the low-resolution chrominance (UV) without interpolation.
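As an illustration of step 7, the following Python sketch builds the seven-tap filter F = I + κβ(I − L) and applies the divide-by-4, +128 storage step; the Gaussian normalization and the 8-bit rounding details are our own assumptions about what is otherwise a GPU-shader implementation.

```python
import numpy as np

def gaussian_taps(sigma, taps=7):
    """Normalized 1D Gaussian low-pass filter with 'taps' samples."""
    x = np.arange(taps) - taps // 2
    g = np.exp(-0.5 * (x / sigma) ** 2)
    return g / g.sum()

def sharpening_filter(beta, sigma=0.55, kappa=0.1, taps=7):
    """Seven-tap sharpening filter F = I + kappa*beta*(I - L), where beta is
    the camera's sharpness control value and L a Gaussian low-pass filter."""
    identity = np.zeros(taps)
    identity[taps // 2] = 1.0
    low_pass = gaussian_taps(sigma, taps)
    return identity + kappa * beta * (identity - low_pass)

def sharpen_and_quantize_row(row, beta):
    """Sharpen one Bayer-channel row, then emulate storage in an 8-bit
    texture: divide by 4 and add a +128 offset (roughly 6-bit quantization)."""
    sharpened = np.convolve(row, sharpening_filter(beta), mode='same')
    return np.clip(np.round(sharpened / 4.0 + 128.0), 0, 255)

row = np.array([80.0] * 10 + [160.0] * 10)        # a step edge
print(sharpen_and_quantize_row(row, beta=80))     # over/undershoot at the edge
```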

Fig. 10. Compositing comparison with a "Cartman" figure inserted in the scene. The two large images show (a) the old and (b) the new compositing methods. (c) The top enlargement illustrates motion blur applied by the new method, which matches reasonably well the motion blur of the background. Visible in the second enlargement is virtual noise added by the method, some slight separation of the color channels (chromatic aberration), and the low chroma resolution of YUV-411.

Fig. 11. Compositing comparison with a digger inserted in the scene. Here, the enlargements show the graceful degradation of areas with intricate detail, and artificial sharpening artifacts at the junction of real and virtual graphics.

Fig. 12. Compositing comparison with a test pattern. Two test patterns are in each image, of which the left is an actual printout in the world and the right is the inserted virtual version. Our new compositing method makes the inserted version look more like the real pattern. Particularly noticeable are the chromatic aberrations introduced near the image edges. (a) Old method. (b) New method.

6 RESULTS

The results presented here are also included as losslessly compressed PNG images in the supplemental materials, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TVCG.2009.210.

We have integrated the above compositing method into our "Parallel Tracking and Mapping" markerless tracking system [10], [12]. On a desktop computer with a fast graphics card (NVIDIA GeForce 9800 GTX), the new compositing method adds approximately 6 ms of rendering overhead per frame compared to our previous method, which undistorted the video frame, drew graphics into it, and redistorted the composite. On more modest mobile hardware (NVIDIA GeForce 8600M GT), the overhead is substantial at 15 ms extra rendering cost per frame. In our experiments, this does not impact frame rate (the system maintains 30 Hz) since compositing takes place on the GPU while the CPU already tracks the next frame; however, it adds 15 ms of latency to the display.

Figs. 10, 11, and 12 show some side-by-side comparisons of the proposed compositing method with a standard compositing approach which blends the warped input texture onto the video frame directly. As well as some simple inserted objects, the results show a texture of some test patterns superimposed next to a printout of the same texture. In most instances, the new method provides superior integration of the real and virtual images.

Fig. 13. Two sharpening settings with a virtual Darth Vader figure. The compositing method reads the camera's sharpness setting and adjusts its own sharpening filter accordingly. At the default value of 80, sharpening artifacts are visible both in the real scene and the augmented graphics.

Fig. 14. Real text interleaved with virtually inserted text at different sharpening settings. The rendition of fine structures like text is a good test for the accuracy of our simulation of the whole processing pipeline, in particular the demosaicing method. A relatively good match is achieved, especially compared to a standard compositing approach, although the heavy coloration the Fire-i exhibits at low sharpness settings remains unmodeled.
An advance presented in this paper over our previous short paper [11] is more accurate simulation of in-camera processing, and of in-camera sharpening in particular. The most notable difference this has made is that the method presented here is effective with the Fire-i set to its rather aggressive power-on default sharpening of 80. In previous work, sharpening was lowered to 25 to avoid the obvious artifacts the higher setting produces. Fig. 13 shows a comparison between the two sharpness settings.

Fig. 14 displays more results at different sharpness settings, this time comparing text printed on a piece of paper to text in the same font inserted virtually. Ideally, there should be no visual difference between the real and virtual text. Again, the method does a reasonable job of emulating the sharpening artifacts which appear at higher sharpness settings. The fact that our method employs a symmetric sharpening filter, whereas the Fire-i's is biased to one side, is apparent on close inspection, but not glaringly intrusive to the casual observer. More obvious is the low-frequency red-blue banding visible only in the Fire-i image at low sharpness settings, which indicates that our understanding of the camera's imaging pipeline is far from perfect.

The compositing method presented here is not without drawbacks. Some problems arise from the decision to apply sensor effects as a postprocessing step: for example, objects which should appear stationary in the image are still blurred, which looks unnatural. Further, items which are very bright because they are user interface components (which should stand out from the background) are treated the same way as any other graphics, and the resulting look is dull. Solving these drawbacks would require modifications to the 3D rendering stage. Some of the problem cases are illustrated in Fig. 15.

Fig. 15. Drawbacks of the proposed method. (a) The method is indiscriminate and will degrade all graphics rendered in 3D; for example, here, the cross hair is dull. (b) The next image, showing the standard compositing method's output, shows how bright the cross hairs should be. (c) The same applies to motion blur, which is applied to all elements of the screen, even those like the cross hairs which always remain in the center of the display; here, the cross hairs have been blurred to the point of being unrecognizable. (d) Finally, the rendering of blur relies on accurate velocity measurements from the tracking system; the last image shows a misestimation of velocity, and hence, incorrect blur (heavily blurred virtual brown shapes near the top of the crop).

Further problems include misestimation of camera velocity by the tracking system, which is likely due to the often-false assumption of zero acceleration. This misestimation causes a motion blur mismatch between real and virtual objects. Beyond this, many sensor effects are not simulated correctly: for example, the real chromatic aberrations are more purple than the simulated versions, and sharpening artifacts appear more monochromatic on the Fire-i (perhaps in-camera sharpening is performed on Y rather than RGB). These small infidelities are due to imperfect understanding of the processes in the camera as well as implementation and calibration constraints.

7 RESULTS WITH A RAW BAYER CAMERA

Some of the imaging steps taken by the Fire-i camera are a nuisance. For computer vision (and AR) applications, it would often be preferable to have the camera data as soon as possible, so any further processing could be user-controlled on the host computer. Some new cameras now make the raw Bayer image available to the computer; one such camera is the IDS µ-eye. This is a USB CMOS device with a global shutter. It sends raw Bayer data over the USB bus, allowing the user to choose any appropriate de-Bayering method in software. In other respects (form factor and sensor size), it is similar to the Fire-i and could easily be used in the same applications.

To form composite images for this camera, we truncate the blending method of Fig. 9 after stage 5. Instead of performing blending and YUV conversion, the individual color channel images are blended directly using the appropriate pixels from the camera's Bayer image. The color channel images are finally converted to an RGB composite using bilinear interpolation.⁵ As for the Fire-i pipeline, all processing is performed on the graphics card; however, the truncated pipeline here requires only circa 4 ms on an NVIDIA GeForce 9800 GTX due to its shorter length.

5. For this camera, it is helpful to combine the Gr and Gb Bayer-mask images into a single green image rotated 45 degrees, as this allows the graphics hardware to perform bilinear interpolation directly, as described in [11].


Fig. 16 shows results obtained with this simplified method. The quality of images delivered by the µ-eye is higher than the quality obtained from the Fire-i due to the lack of destructive on-camera processing (and probably also due to more modern sensor design). At the same time, our compositing method has an easier time attempting to match the camera effects, resulting in a good match between real and virtual imagery. Mismatches at this stage are mostly due to incorrect parameters, e.g., for motion blur or the basic lens softness/antialiasing filter blur function, and the fact that our method does not model depth defocus.

Fig. 16. Virtual text superimposed onto a printout of real text. Ideally, the virtual text should look identical in character to the real text. These images are from the µ-eye camera; here, the availability of raw Bayer data means that our new compositing method can achieve good fidelity, especially when compared to the standard blending method.

8 CONCLUSIONS

This paper has presented a simulation of the behavior of small Webcams by postprocessing the ideal images produced by the standard OpenGL pipeline. Results show that the integration of real and virtual graphics can be improved by simulating some of the various artifacts that give the video image its characteristic look.

That said, the method as presented here is not currently flexible enough to allow the fine-tuning of the simulated effects on a per-object basis. Further, the method is by no means exhaustive: for example, we do not model gamma correction. The method requires powerful graphics hardware, so a more lightweight approach implementing only a subset of the steps described here could perhaps provide satisfactory fidelity while being easier to implement and faster. The best approach will likely vary with the hardware constraints and compositing requirements of any specific application.

Finally, the method presented here requires a number of time-consuming calibration steps. While a more automatic calibration method would be of value, it would be preferable if more camera manufacturers provided detailed specifications or allowed access to unprocessed images.

ACKNOWLEDGMENTS

This work was supported by EPSRC grant EP/D037077/1.

REFERENCES

[1] B.E. Bayer, "Color Imaging Array," US Patent 3971065, July 1976.

[2] J. Chen, G. Turk, and B. MacIntyre, "Watercolor Inspired Non-Photorealistic Rendering for Augmented Reality," Proc. ACM Symp. Virtual Reality Software and Technology (VRST '08), pp. 231-234, 2008.

[3] F. Devernay and O.D. Faugeras, "Straight Lines Have to Be Straight," Machine Vision and Applications, vol. 13, no. 1, pp. 14-24, 2001.

[4] J. Fischer, D. Bartz, and W. Strasser, "Enhanced Visual Realism by Incorporating Camera Image Effects," Proc. Int'l Symp. Mixed and Augmented Reality (ISMAR '06), pp. 205-208, Oct. 2006.

[5] J. Fischer and D. Bartz, "Stylized Augmented Reality for Improved Immersion," Proc. IEEE Conf. Virtual Reality (VR '05), pp. 195-202, 2005.

[6] T. Florin, "Simulation of a Digital Camera Pipeline," Proc. Int'l Symp. Signals, Circuits and Systems (ISSCS '07), July 2007.

[7] M. Haller, F. Landerl, and M. Billinghurst, "A Loose and Sketchy Approach in a Mediated Reality Environment," Proc. Third Int'l Conf. Computer Graphics and Interactive Techniques in Australasia and South East Asia (GRAPHITE '05), pp. 371-379, 2005.

[8] K. Irie, A.E. McKinnon, K. Unsworth, and I.M. Woodhead, "A Technique for Evaluation of CCD Video-Camera Noise," IEEE Trans. Circuits and Systems for Video Technology, vol. 18, no. 2, pp. 280-284, Feb. 2008.

[9] K. Jacobs and C. Loscos, "Classification of Illumination Methods for Mixed Reality," Computer Graphics Forum, vol. 25, no. 1, pp. 29-51, Mar. 2006.

[10] G. Klein and D. Murray, "Parallel Tracking and Mapping for Small AR Workspaces," Proc. Int'l Symp. Mixed and Augmented Reality (ISMAR '07), Nov. 2007.

[11] G. Klein and D. Murray, "Compositing for Small Cameras," Proc. Int'l Symp. Mixed and Augmented Reality (ISMAR '08), 2008.

[12] G. Klein and D. Murray, "Improving the Agility of Keyframe-Based SLAM," Proc. 10th European Conf. Computer Vision (ECCV '08), pp. 802-815, Oct. 2008.

[13] B.D. Lucas and T. Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision," Proc. Seventh Int'l Joint Conf. Artificial Intelligence, pp. 674-679, 1981.

[14] B. Okumura, M. Kanbara, and N. Yokoya, "Augmented Reality Based on Estimation of Defocusing and Motion Blurring from Captured Images," Proc. Int'l Symp. Mixed and Augmented Reality (ISMAR '06), pp. 219-225, Oct. 2006.

[15] T. Porter and T. Duff, "Compositing Digital Images," SIGGRAPH Computer Graphics, vol. 18, no. 3, pp. 253-259, 1984.

[16] Sony, ICX098BQ Diagonal 4.5 mm (Type 1/4) Progressive Scan CCD Image Sensor with Square Pixel for Color Cameras, http://www.unibrain.com/download/pdfs/Fire-i_Board_Cams/ICX098BQ.pdf, Aug. 2009.

[17] B. Watson and F. Hodges, "Using Texture Maps to Correct for Optical Distortion in Head-Mounted Displays," Proc. IEEE Virtual Reality Ann. Symp. (VRAIS '95), pp. 172-178, Mar. 1995.


Georg Klein received the doctorate degree from the University of Cambridge in 2006. He was a postdoctoral research assistant in Oxford's Active Vision Laboratory until August 2009. He now works at Microsoft Corporation. His research interest is mainly the development and application of visual tracking techniques for Augmented Reality. He is a member of the IEEE.

David W. Murray received the graduate degree with first class honors in physics and the doctorate degree in low-energy nuclear physics from the University of Oxford in 1977 and 1980, respectively. He was a research fellow in physics at the California Institute of Technology before joining the General Electric Company's research laboratories in London, where he developed research interests in motion computation, structure from motion, and active vision. He moved to the University of Oxford in 1989 as a university lecturer in engineering science and a fellow of St Anne's College, and was made a professor of engineering science in 1997. His research interests continue to center on active approaches to visual sensing, with applications in surveillance, telerobotics, navigation, and wearable computing. He is a fellow of the Institution of Electrical Engineers in the UK and is a member of the IEEE.

