
ECSE-626 Final Project: An Evaluation of Chris Stauffer and
W.E.L. Grimson's Method for Background Subtraction

Eric Thul
Abstract: The primary goal of this work is to implement the background subtraction method proposed by Chris Stauffer and W.E.L. Grimson. This method uses a mixture of Gaussians to model the background. The secondary goal is to compare the accuracy of the proposed method to two other background subtraction techniques: background averaging and a method that uses the previous frame for the background. In this report, we first describe the method by Stauffer and Grimson, second develop and test a method for comparison, and third implement an improvement to Stauffer and Grimson's original method.

I. INTRODUCTION

The topic of Stauffer and Grimson's work [4] is to develop an adaptive probabilistic model for a dynamic scene. The probabilistic model proposed is a mixture of Gaussians. For gray-scale, each Gaussian is uni-modal. For color, each Gaussian is multi-modal [4, 2.1]. An important aspect of the mixture of Gaussians is that the parameters for each Gaussian are updated over time. The research question posed is: how can a probabilistic model be used to separate the background from the foreground in a dynamic scene? The rationale behind this question lies in real-time tracking. The authors apply their proposed method to real-time tracking for indoor and outdoor settings. The main focus is outdoor settings, where the scene is generally more dynamic [4, 1]. Critical systems, such as video surveillance, can benefit from a stable and accurate implementation of real-time tracking. The authors plan to use such real-time tracking for perimeter patrol and urban security [1].

In short, the main goal of the work by Stauffer and Grimson is to develop a probabilistic background model of a changing scene, where each pixel is represented by a mixture of Gaussians. This model is used to determine whether a pixel is background or foreground. By using the proposed background model, separation of background and foreground pixels for real-time object tracking can be achieved.

II. METHODOLOGY

The method proposed by Stauffer and Grimson to develop a probabilistic background model for real-time tracking is composed of two main parts. The first part is the heart of their method, which initializes the Gaussian mixture, updates the model over time, and distinguishes the background. The second part uses the information from the first part and tracks the foreground objects discovered. This paper explores part one, but does not detail part two, since part one encompasses the authors' main contribution.

A. Part One: A Probabilistic Model

1) Initialization: The initialization consists of two parts. The first part is the construction of the Gaussian mixtures for each pixel. The second part is initializing the parameters for each of the Gaussians. Let us first explore the first part. Consider the first frame in a sequence of images. Let this be frame t = 1. We know nothing prior about the sequence of images. However, for each pixel in frame t, we have a set of Gaussians composing a mixture. Let K be the number of Gaussians at each pixel. Now, let X_t be the value of some pixel location (i_t, j_t) in frame t. The probability of observing the pixel value X_t is calculated with the following equations [4, 2.1]:

    P(X_t) = \sum_{k=1}^{K} \omega_{k,t} \, \eta(X_t, \mu_{k,t}, \Sigma_{k,t})    (1)

    \eta(X_t, \mu_{k,t}, \Sigma_{k,t}) = \frac{1}{(2\pi)^{n/2} |\Sigma_{k,t}|^{1/2}} \exp\left( -\tfrac{1}{2} (X_t - \mu_{k,t})^T \Sigma_{k,t}^{-1} (X_t - \mu_{k,t}) \right)    (2)

In equation 1, ω_{k,t} represents the weight of the k-th Gaussian in the mixture, and η(·) is the multi-modal Gaussian probability density function seen in equation 2, where n is the dimensionality of X_t. In equation 2, we have μ_{k,t} representing the mean of each k-Gaussian and Σ_{k,t} representing each covariance matrix. Note that the authors assume Σ_{k,t} = σ²_{k,t} I, where σ²_{k,t} is the variance and I is the identity matrix [4, 2.1]. This assumption is relevant to color frames; the implication is that the color channels are statistically independent [4, 2.1]. Gray-scale images use uni-modal Gaussians.

Now that we have our mixtures, we can move to initializing the parameters μ, σ, and ω. The authors do not present a method for generating initial values [4]. However, we sample from Gaussian distributions with the following parameters: means are set to the pixel values of the first frame, standard deviations are set to a moderate value determined by a test to obtain an average variance from a sequence of frames, and weights are set to a low value.
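To make this concrete, the following is a minimal Python sketch, not the authors' code, of how the per-pixel mixtures could be stored and how equation 1 could be evaluated for a gray-scale value; K and the initialization constants below are our assumptions.

    import numpy as np

    K = 5  # assumed number of Gaussians per pixel

    def init_mixture(first_frame, init_sigma=20.0, init_weight=0.05):
        # Per-pixel parameters for a gray-scale frame of shape (H, W).
        # Means start at the first frame's values; the sigma and weight
        # constants stand in for the "moderate" and "low" values above.
        h, w = first_frame.shape
        mu = np.repeat(first_frame[..., None].astype(float), K, axis=2)
        sigma = np.full((h, w, K), init_sigma)
        weight = np.full((h, w, K), init_weight)
        return mu, sigma, weight  # each of shape (H, W, K)

    def mixture_prob(x, mu, sigma, weight):
        # Equation 1 for a scalar value x against one pixel's K
        # uni-modal Gaussians (mu, sigma, weight are length-K arrays).
        norm = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
        density = norm * np.exp(-0.5 * ((x - mu) / sigma) ** 2)  # equation 2
        return float(np.sum(weight * density))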
2) Parameter Update: At each frame, the parameters of each Gaussian are updated. Consider an arbitrary frame t. The update process is divided into two parts. The first part is determining which parameters to update per Gaussian, and the second part is performing the update. To determine which parameters are updated for the Gaussians, a match matrix M is constructed. Each M_{i,j,k} represents whether the k-th Gaussian for pixel (i_t, j_t), at frame t, matched or not [4, 2.1]. A Gaussian is a match if the value X_t for pixel (i_t, j_t) is within 2.5 standard deviations [4, 2.1] of the k-th Gaussian distribution; otherwise we do not match. The value of a match is 1 and the value of not having a match is 0. Each pixel can have at most one matching Gaussian from its mixture. The authors indicate to pick the first matching k-Gaussian in the mixture [4, 2.1]; however, we pick the closest distribution instead.

The matches present three cases for the update. For each case, X_t is the value of pixel (i_t, j_t) in frame t. The first case is a matching k-Gaussian in the mixture for pixel (i_t, j_t). Here we update the weight, mean, and variance with the following equations given by Stauffer and Grimson [4, 2.1]:

    \omega_{k,t} = (1 - \alpha)\,\omega_{k,t-1} + \alpha\, M_{i,j,k}    (3)

    \mu_{k,t} = (1 - \rho)\,\mu_{k,t-1} + \rho\, X_t    (4)

    \sigma^2_{k,t} = (1 - \rho)\,\sigma^2_{k,t-1} + \rho\,(X_t - \mu_{k,t})^T (X_t - \mu_{k,t})    (5)

In equation 3, notice the variable α; this is the learning rate parameter. In equations 4 and 5, notice ρ. This is the Gaussian probability density function scaled by α, with the following parameters [4, 2.1]:

    \rho = \alpha\, \eta(X_t, \mu_{k,t}, \Sigma_{k,t})    (6)

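Before turning to the remaining cases, here is a hedged sketch of the matched-case update of equations 3 through 6 for one pixel's k-th Gaussian; the names are ours and gray-scale (uni-modal) values are assumed.

    import numpy as np

    def update_matched(x, k, mu, sigma, weight, alpha):
        # Matched case: M_{i,j,k} = 1, so the weight grows (equation 3).
        weight[k] = (1.0 - alpha) * weight[k] + alpha
        # rho scales alpha by the Gaussian density at x (equation 6).
        norm = 1.0 / (np.sqrt(2.0 * np.pi) * sigma[k])
        rho = alpha * norm * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
        mu[k] = (1.0 - rho) * mu[k] + rho * x                        # equation 4
        var = (1.0 - rho) * sigma[k] ** 2 + rho * (x - mu[k]) ** 2   # equation 5
        sigma[k] = np.sqrt(var)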
The second case is a non-matching k-Gaussian. Here, we only update the weight with equation 3; however, recall that M_{i,j,k} = 0, so the weight is decreased. The mean and variance parameters do not change.

The third case is when none of the K Gaussians match for our pixel (i_t, j_t). This implies that the pixel value X_t is not close to any of the distributions in the mixture, i.e., X_t is potentially a foreground pixel and is marked as such. Moreover, for each k-Gaussian in the mixture, we calculate η(X_t, μ_{k,t}, Σ_{k,t}). We then pick the least probable Gaussian. This least probable Gaussian then has its mean set to X_t, its variance set high, and its weight set low. The reason for this is that X_t may be a pixel we want to incorporate into the background [4, 2.1], so a distribution based on the value is placed into the mixture. Over time, if subsequent values are similar to X_t, then the distribution will slowly gain a higher weight and a smaller variance [4, 2.1]. This indicates that X_t is indeed a new object becoming part of the background.

3) Background Heuristic: Distributions having a high weight and a low variance are precisely the distributions we want to represent the background model. The authors propose a heuristic for locating such distributions in each mixture [4, 2.2]. There are three parts to this heuristic. For each part, consider an arbitrary pixel (i_t, j_t) in frame t. In the first part we calculate ω_{k,t}/σ_{k,t} for each k-Gaussian in the mixture at pixel (i_t, j_t) [4, 2.2]. This value will be large for those distributions which have a high weight and a low variance. This leads us to the second part, in which we simply order the K Gaussians that represent our mixture [4, 2.2], so the distribution with the largest ω_{k,t}/σ_{k,t} will be at the top. For the third part, we follow the ordering just generated and sum over each ω_{k,t} in our mixture of K Gaussians [4, 2.2]. We do this until the sum is larger than T, where T is the second parameter to the implementation (recall the learning rate α). The threshold T determines the number of Gaussians from the mixture that model the background pixel (i_t, j_t) [4, 2.2]. Determining these parameters will be covered in III-C. A sketch of this ranking follows below.
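A minimal sketch of this ranking for one pixel's mixture, under our naming and assuming the weights have been normalized to sum to one; it returns the indices of the Gaussians deemed background.

    import numpy as np

    def background_gaussians(weight, sigma, T):
        # Order by omega/sigma, largest first, then take the smallest
        # prefix whose cumulative weight exceeds the threshold T [4, 2.2].
        order = np.argsort(-(weight / sigma))
        cumulative = np.cumsum(weight[order])
        b = int(np.searchsorted(cumulative, T, side="right")) + 1
        return order[:b]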

B. Part Two: Real-Time Tracking

The goal of the second part is to use the information of the background and foreground separation for real-time tracking. To do this, the authors implement an algorithm for creating connected components of their foreground objects [4, 2.3]. They use an algorithm by Horn [2]. They also use the established technique of a Kalman filter to track each component [4, 2.4]. We do not go further into these topics, since they are not essential to understanding the background subtraction method [4, 2.4].

III. EXPERIMENTATION

The goal of our experimentation is to achieve a comparison between three background/foreground separation methods. The first method, by Stauffer and Grimson [4], has been detailed above. The second method is a background subtraction technique where the background is a running average of all previous frames [3, p.7]. The third method is a background subtraction approach where the background is considered to be the previous frame [3, p.5]. In order to compare the three different methods, we need to devise a fair measure. No measure has been provided by Stauffer and Grimson [4], so we will create one. However, all we need to do so is a true representation of the background. The following sections detail the chosen test data, elaborate on how the true background was established, and discuss the optimal system parameters for the three background methods.

A. Test Data

The test data chosen for the experimentation was a recording of four thousand frames (about 2.5 minutes) of traffic and people. The conditions were moderate snowfall and the time was mid-afternoon. The method proposed by Stauffer and Grimson [4] is claimed to work best for outdoor scenes with cloud cover and to be successful in snowfall [4, 4]. This is the reason we chose to record during snowfall at mid-afternoon. The location was the corner of Sherbrooke and University in Montreal, QC, CA, facing north. The camera used was a Sony Cybershot DSC-W100. This is not a video camera, but it does have video capability. The resolution used was 640 × 480. As a note, a close-up video was recorded, since the video recorded at a distance was poorer in quality.

B. True Background

The following method was devised to generate a representative true background. Since the background is a dynamic environment, we wish to generate an estimate based on our sequence of frames. To do this, we will make an assumption that the pixel values follow a Gaussian distribution. Based on this assumption, we will find means and standard deviations for each pixel. These Gaussians will then be sampled from to create the true background.

To find the means, we simply take an average across all the frames to arrive at a mean image, as depicted in figure 1. This mean was taken using the test data described in III-A. In figure 1, consider the highlighted region in red. This shows an area where we don't know what the true background pixels are, since objects frequently obstruct that region. In order to attain a value we can use for the mean values of that region, we replaced those pixels in the region with pixels from one of the frames which does not have any foreground objects in the red region.

Fig. 1. Average pixel values across all frames.

Fig. 2. Estimated true background.

Now that we have values representing the background means, we need to find standard deviations. To find a standard deviation value σ for a pixel in our frames {X_i}_{i=1}^{t}, we use our previously determined means. We pick the mean μ for that pixel and apply the following formula:

    \sigma = \sqrt{ \frac{1}{t} \sum_{i=1}^{t} (\mu - X_i)^2 }    (7)

By calculating the standard deviation for each pixel, we can now generate our true background representation. However, since we replaced the mean pixel values in the red region, we must replace the standard deviations as well. Since we don't have enough clean frames, i.e., frames with no objects moving, to calculate new standard deviations for those pixels in the red region, we take the average of all the standard deviations outside the red region. This average is used as the standard deviation of each pixel in the red region. With the means and standard deviations for each pixel of the background, we randomly generate values from the corresponding Gaussian distributions. The random values are averaged over 50,000 trials. This average is then used for the true background, represented in figure 2.
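A condensed sketch of this procedure under our assumptions: frames is the stack of gray-scale frames with shape (t, H, W), region_mask marks the red region, and clean_frame is the frame free of foreground objects used for the patch.

    import numpy as np

    def true_background(frames, region_mask, clean_frame, trials=50000, seed=0):
        rng = np.random.default_rng(seed)
        mu = frames.mean(axis=0)                     # mean image (figure 1)
        mu[region_mask] = clean_frame[region_mask]   # patch the occluded region
        acc = np.zeros_like(mu)
        for f in frames:                             # equation 7, frame by frame
            acc += (mu - f) ** 2
        std = np.sqrt(acc / len(frames))
        std[region_mask] = std[~region_mask].mean()  # average std for the region
        total = np.zeros_like(mu)
        for _ in range(trials):  # average of draws; converges toward mu,
            total += rng.normal(mu, std)             # so fewer trials also work
        return total / trials                        # figure 2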

C. System Parameters

1) Method One: Stauffer and Grimson: The learning parameter α and the background threshold T must be determined. To find the best value for the learning parameter, we computed an error value for the background and an error value for the foreground. Both were tested for α ∈ [0.01, 0.2]. The background error was determined by comparing the estimated true background from figure 2 to the modeled background at each frame. We calculated the error value with the sum-of-squares measure and then took the average across all of the frames. The foreground error was determined in two parts. The first part was to calculate a representation of the foreground at each frame. This was done by taking the absolute difference between the current frame and the estimated true background from figure 2. Once we had this representation, we compared it to the foreground modeled at each frame by the method. We again used the sum-of-squares measure to calculate the error, and then took the average across all of the frames. Figure 3 shows the best choice of α to be where the two error curves first intersect. This point is α = 0.1136.

Fig. 3. Error of the background and foreground estimation to determine α.
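The error measure behind these curves is simple; a minimal sketch, assuming the modeled backgrounds (or foregrounds) were stored for the sampled frames:

    import numpy as np

    def average_sse(images, reference):
        # Mean over frames of the sum-of-squares difference to the reference.
        return float(np.mean([np.sum((img.astype(float) - reference) ** 2)
                              for img in images]))

    def normalized(errors):
        errors = np.asarray(errors, dtype=float)
        return errors / errors.max()  # normalized error, as plotted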

The second parameter, T, did not require any experimentation. Since our experiments were carried out in gray-scale, i.e., we used uni-modal Gaussians, the authors state: "If a small value for T is chosen, the background model is usually unimodal. If this is the case, using only the most probable distribution will save processing time" [4, 2.2]. Hence, we chose a small value for T, which used the most probable distribution for the background.

2) Method Two: Background Average: For method two, where we assume the background is the average of all previous frames, we must determine a threshold parameter τ to filter out unwanted foreground pixels. The range tested was τ ∈ [100, 300]. In figure 4 below, we have a plot of the normalized error for each value of τ. From this, we have τ = 130 as the optimal choice for a threshold. Note that we need only consider the foreground error, since τ does not affect the background estimation.
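Both baselines reduce to thresholding an absolute difference; below is a minimal sketch of method two, together with the previous-frame variant used by method three (described next). The names and the streaming-mean formulation are ours.

    import numpy as np

    def running_average_masks(frames, tau=130):
        # Method two: the background is the running average of all
        # previous frames; pixels differing by more than tau are foreground.
        avg = frames[0].astype(float)
        masks = []
        for n, frame in enumerate(frames[1:], start=2):
            masks.append(np.abs(frame - avg) > tau)
            avg += (frame - avg) / n  # incremental mean over frames seen
        return masks

    def previous_frame_masks(frames, tau=51):
        # Method three: the background is simply the previous frame.
        return [np.abs(curr.astype(float) - prev) > tau
                for prev, curr in zip(frames, frames[1:])]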
Fig. 4. Error of the foreground estimation to determine τ.
3) Method Three: Previous Frame: We must also determine a threshold τ for method three, where we consider the background to be the previous frame. Again, we need only consider the foreground estimation error, since τ does not affect the background estimation. In figure 5 below we see a plot of the normalized error for τ ∈ [1, 300]. This is a curious plot, because as τ increases, the error continues to decrease. However, upon testing, once τ reaches a certain point, the foreground estimation reveals nothing in the foreground. The reason for this is that all the pixels get filtered out. By testing the values and viewing the foreground result, an appropriate value was picked: the threshold τ = 51.

Fig. 5. Error of the foreground estimation to determine τ.

IV. RESULTS

The following results present graphs using the same background and foreground error measure of III-C: the averaged sum-of-squares error compared to the true background representation. We show a comparison between the three methods of background subtraction previously discussed. Also, note that the compressed videos for the background and foreground of each method have been made available as attachments.

A. Comparison

Figure 6 presents the measured background error and figure 7 presents the measured foreground error for the test data, which is at 640 × 480 resolution. Each measurement was taken at every 100th frame over a span of 4000 frames.

Fig. 6. Background 640 × 480.

Fig. 7. Foreground 640 × 480.

As we can see in figure 6, method 2 (background averaging) has the clear advantage. Method 1, proposed by Stauffer and Grimson [4], did not perform as well as method 2, but did surpass method 3, which assumes the previous frame is the background. In figure 7, of the foreground, the error values were closer together than the error values of the background; however, method 1 by Stauffer and Grimson [4] does not perform as well as method 2 or method 3.

From these results, perhaps the determination of the true background gave a bias towards the background subtraction method using an average across all the frames. A reason for this might be that when generating the true background, the means for the Gaussian distributions were those of the overall mean image. However, this seems to be a fair way to represent a changing environment. Instead of re-evaluating the true background, problems are discussed regarding Stauffer and Grimson's method, and a promising solution to one of the problems is implemented and tested.

B. Discussion

From the results, we have seen that method 2, background averaging, consistently has the best performance. Let us explore problems encountered with method 1, which may have led to the above results.

1) Problems: For convenience, we will refer to the attached videos, which show the background and foreground results from the three methods compared. There are two main problems which become immediately apparent when viewing the resulting videos. The images in the appendix may also be helpful; please zoom in for a clearer view of the accuracy.

The first problem can be seen in fg-m1-project.mpg, where there is significant noise in the foreground. Also, the complete object is not detected as foreground; it looks more like an outline of the object in some cases. The authors admit that noise is a problem in their implementation as well [4, 2.1] and remove it by ignoring small patches of 1 or 2 pixels.

The second problem can be seen in bg-m1-project.mpg, where there are foreground objects which are not completely transparent to the background. This implies that the background model is matching the wrong distribution for the value of the background pixel. The reason for this seems to be related to the following problem; the two are discussed together in the next section.

The third problem is that of time. As time progresses, the error for the background increases significantly, as seen in figure 6. Also, the foreground error rises slightly after an initial decrease, as seen in figure 7.

2) Solutions: To improve the foreground results and to try to solve the first problem, we ran the foreground through a filter which examined the neighbors of each pixel, both foreground and non-foreground. This essentially ignores small patches of foreground pixels, but also improves the foreground object detection by taking neighbors into account. Since the foreground is a square tessellation, deciding on which pixels are neighbors becomes complex when considering 4-neighbors and 8-neighbors. A hexagonal tessellation provides a clearer definition of neighbor, where each pixel

has exactly 6 neighbors. Fortunately, this approach of 6-neighbors can actually be achieved on a square tessellation [2, p.69]. For each pixel, we consider the immediate north, north-west, west, south, south-east, and east pixels to be the 6-neighbors [2, p.69]. The decision rule is: if a foreground pixel (i, j) has fewer than 5 neighbors marked as foreground, then set (i, j) to a background pixel; and if a non-foreground pixel (k, l) has three or more neighbors which are foreground pixels, then set (k, l) to a foreground pixel. A sketch of this rule is given below. Figure 8 presents the measured foreground error with the new filter in place against the other methods.
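A sketch of this rule, assuming fg is a boolean foreground mask; the six offsets follow Horn's embedding of a hexagonal neighborhood on the square grid [2, p.69].

    import numpy as np

    def hex_neighbor_filter(fg):
        # Neighbors: north, north-west, west, south, south-east, east.
        offsets = [(-1, 0), (-1, -1), (0, -1), (1, 0), (1, 1), (0, 1)]
        h, w = fg.shape
        padded = np.pad(fg, 1).astype(int)
        counts = np.zeros((h, w), dtype=int)
        for di, dj in offsets:
            counts += padded[1 + di:1 + di + h, 1 + dj:1 + dj + w]
        out = fg.copy()
        out[fg & (counts < 5)] = False    # prune weakly supported foreground
        out[~fg & (counts >= 3)] = True   # fill strongly supported pixels
        return out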
Fig. 8. Foreground 640 × 480, with the neighbor filter.

We can see this filter offers a great improvement over the original method. Also, when viewing the video fg-m1-project-filter.mpg, we can clearly see an improvement compared to not using the neighbor filter in the video fg-m1-project.mpg.

The second and third problems were not solved in this work. It seems that the second is related to the third; however, the underlying cause is not known. One idea is that the parameters are changing too quickly, even with a small learning parameter: specifically the weight parameter and, in some cases, the variance. The weight changes rapidly even with a low learning parameter and can significantly affect the background heuristic, because the heuristic takes the weight divided by the standard deviation to determine a ranking of the Gaussians in the mixtures. Moreover, the standard deviation at times approached very close to zero when the background was very static. It was not clear how to adjust for this issue of time, and it has been left for future investigation.

C. Conclusions

The method proposed by Stauffer and Grimson [4], with a neighbor-checking filter as an addition, is an effective method for separating the background and foreground. Other problems that exist with this implementation are as yet unresolved, but progress has been made with respect to the foreground detection.

With the neighbor filter in place, weather effects seem to be minimized. Without the filter, the foreground displays small foreground objects which resemble excess noise. However, these are most likely due to the snowfall during the frame sequence. This weather effect was not clearly depicted in the background averaging method, but did show in method 3. Thus, with a minor enhancement to the method proposed by Stauffer and Grimson [4], we have seen that the results of the foreground detection improve in accuracy and also resist weather more effectively.

As a note, the following appendix contains images of frame 462 from the video sequences of the foreground and background movies, to show each method's performance at a given instant in time.

APPENDIX

Fig. 9. Method 1: Gaussians with Filter.

Fig. 10. Method 1: Gaussians.

Fig. 11. Method 2: Averaging.

Fig. 12. Method 3: Previous Frame.

Fig. 13. Method 1: Gaussians with Filter.

Fig. 14. Method 1: Gaussians.

Fig. 15. Method 2: Averaging.

Fig. 16. Method 3: Previous Frame.

REFERENCES

[1] W.E.L. Grimson, Chris Stauffer, Gideon Stein, Raquel Romano, Lily Lee, Paul Viola, and Olivier Faugeras. A forest of sensors: Project details. http://www.ai.mit.edu/projects/vsam/Overview2/index.html.
[2] B.K.P. Horn. Robot Vision. The MIT Press / McGraw-Hill Book Company, 1986.
[3] Massimo Piccardi. Background subtraction techniques: A review. April 2004.
[4] Chris Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time tracking. In Computer Vision and Pattern Recognition, volume 2, pages 252-258, 1999.

