

IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 7, NO. 2, APRIL 2010

Object Classification of Aerial Images With Bag-of-Visual Words


Sheng Xu, Student Member, IEEE, Tao Fang, Deren Li, and Shiwei Wang
Abstract: This letter presents a Bag-of-Visual Words (BOV) representation for object-based classification in land-use/cover mapping of high spatial resolution aerial photographs. The method is introduced to handle the special characteristics of aerial images, i.e., variability of spectral and spatial content. Specifically, patch detection and description are used to divide and represent various subregions of objects comprising multiple homogeneous components. Moreover, the BOV representation is constructed from the statistics of the occurrence of visual words, which are learned from the training data set. A combination of spectral and texture features is verified to be a satisfactory choice through the evaluation of various patch descriptors. Furthermore, a threshold-based method is employed to reduce the impact of outliers on classification in test data. Experiments on an aerial-image data set show that the proposed BOV representation yields better classification performance than low-level features, such as spectral and texture features.

Index Terms: Bag-of-Visual Words (BOV), composite object, low-level features, object-based classification.

I. INTRODUCTION

Very high spatial resolution (VHR) aerial images can provide abundant spatial and textural information for land-use/cover classification. However, due to the highly detailed information and the spectral heterogeneity even within the same class, we face critical challenges in the classification of VHR aerial images, which have not been addressed by conventional spectral-classification methods [1]. Various low-level object features have been proposed for object-based remote-sensing image analysis (OBIA) [2], such as spectral, texture, and structure features. In [3], texture is regarded as a very useful property of spatial structure information in high-resolution images. In addition to spectral and texture features, structural information extracted by mathematical morphological operations is used to detect urban areas [4]. Texture motifs are used for modeling and detecting compound geospatial objects by manual annotation [5]. Nevertheless, the aforementioned features are often useful for object-specific applications (e.g., object representation and classification).

Manuscript received June 15, 2009; revised October 7, 2009. Date of publication December 11, 2009; date of current version April 14, 2010. This work was supported in part by the National Key Basic Research and Development Program of the People's Republic of China (Grant 2006CB701303), and in part by the National High Technology Research and Development Program of the People's Republic of China (Grant 2006AA12Z105).
S. Xu, T. Fang, and S. Wang are with the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: alfredxn@gmail.com; tfang@sjtu.edu.cn; wish_von@sjtu.edu.cn).
D. Li is with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China (e-mail: drli@whu.edu.cn).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/LGRS.2009.2035644

In high spatial resolution remote sensing, different kinds of geospatial objects can be obtained by automatic or manual segmentation. However, how to represent these objects accurately for classification is still an open problem for OBIA. To facilitate better description and further analysis of these objects in OBIA, this letter categorizes them into three types of geospatial objects, namely, simple objects (i.e., homogeneous spectral and textural regions that are easily classified, such as water bodies), complex objects (i.e., heterogeneous and random spectral/textural regions that are difficult to classify, such as sparse woodland in land cover/use), and composite objects (i.e., compositions of several homogeneous components, such as a golf course comprising grass, shrub, water, and sand regions).

In this letter, we introduce a method called Bag-of-Visual Words (BOV) [6] to describe all the aforementioned objects in a unified framework, particularly the complex and composite objects in VHR aerial images. The BOV method has attracted much attention in the generic visual-categorization field as a way to construct midlevel representations instead of low-level features. Recently, Weizman and Goldberger [7] proposed an urban-area-segmentation approach based on a pixel-level variant of visual words. Unlike their pixel-level method, which generally provides no clue about the semantics of the image [8], this letter builds the visual vocabulary at the patch level for semantic understanding of geospatial objects and low computational complexity.

The contributions of this letter can be summarized as follows. First, the BOV representation is introduced for the representation and classification of remote-sensing imagery.
Second, various types of low-level features are evaluated to obtain satisfactory performance in BOV representation and object-based classification. Third, in generating new features of test data according to the constructed visual words, a threshold is used to produce a virtual word that reduces the impact on classification of outliers, which belong to an undefined class in a test data set. This threshold-based feature-extraction method helps to avoid misclassifying these outliers.

II. BOV METHOD

The BOV method was inspired by the bag-of-words (BOW) [9] approach for text categorization. In BOW, a text document is encoded as a histogram of the number of occurrences of each selected word. A text is represented as an unordered collection of words, disregarding grammar and even word order. Similarly,

1545-598X/$26.00 © 2009 IEEE


Fig. 1. BOV image-classification approach.

we can characterize an image by a histogram of visual-word counts [6]. The visual vocabulary provides a midlevel representation which helps to bridge the huge semantic gap between the low-level features extracted from an image and the high-level concepts to be categorized. However, the main difference between text categorization and visual categorization is that no visual vocabulary is given for the visual-categorization problem. Therefore, Zhu et al. [8] extended the codebook of keywords from the text domain to the image domain and introduced the vector quantization of small square image windows, called keyblocks, into remote-sensing image retrieval. They showed that this method can produce better semantics-oriented results than traditional low-level features such as spectral and texture features.

The BOV-based classification approach is shown in Fig. 1. Given an object from the sample data set, patch detection and description are used to form a set of feature vectors from the object. In the training phase, the k-means method is applied to learn k clusters, whose centers are named visual words. In the testing phase, a single virtual word is built to represent all implausible patches which are not close enough to warrant representation by any relevant visual word. Based on the visual words, a histogram is generated by counting their numbers of occurrences. This histogram is defined as the BOV representation, and this novel feature is then fed to the classifier.

A. Local Region Detection and Local Descriptors

Much research in computer vision shows that local features are commonly used in object recognition [10], [11] since they are robust to spatial variations. In particular, local features can also be employed to represent the patches in composite objects efficiently. Here, the local regions can be extracted in two different ways.
Evenly regular grid: Evenly regular grids are extracted at different scales, where each grid is spaced at 11 × 11 pixels for a given object. The size of each patch is randomly sampled between 10 and 30 pixels. A more detailed description can be found in [12].

Lowe's DoG detector: A set of local regions (patches) that are stable and affine invariant over different scales is extracted using the DoG detector [10]. Thus, the conspicuous points are located, and their neighborhoods are regarded as the detected patches for further description.

Traditionally, the BOV approach models the distribution of low-level local features such as the scale-invariant feature transform (SIFT) [10], which computes the orientation and gradient of the keypoints from gray-level information. However, this is not the case for a single object (homogeneous regions)

since it is insensitive to changes of orientation. Currently, the content of geospatial objects in OBIA is often described by using various kinds of spectral and texture features, which are also used here. In our experiment, we also present a combination of spectral and texture features on evenly distributed regular grids as the input of the BOV approach. The combined feature is composed of two categories of components: 1) the means and standard deviations of the three spectral bands (i.e., RGB) and 2) 48 texture features computed from 12 gray-level co-occurrence matrices (GLCMs), which are generated from four different directions (i.e., 45°, 90°, 135°, and 180°) on each spectral band. Note that four texture features are calculated for each GLCM. Therefore, a 54-dimensional feature vector (6 spectral and 48 texture components) is extracted from each patch. All features are linearly normalized to the interval [0, 1] by min–max normalization.

B. Visual-Vocabulary Construction

The key issue of the BOV method is how to construct the visual words automatically from a training set. The visual vocabulary offers a way to construct a novel feature vector for classification by relating new descriptors in query objects to low-level features from the training set. Owing to its simplicity, k-means [13], [14] is exploited in this letter for visual-word encoding [15]. Based on the local region detection and description, objects are separated into a set of patches, each of which is described by one feature vector. Thus, every object is described by a set of patch descriptors, and all patch descriptors from the training objects form a data set. Then, k clusters are learned from the training data set, and their centers are defined as the visual words. Note that the visual-vocabulary construction is an unsupervised method that makes no reference to the class label of each patch, which makes the generation of visual words label free.
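The normalization and vocabulary-construction steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`min_max_normalize`, `kmeans`, `nearest`), the toy two-dimensional descriptors, and the single seeded run are assumptions made for illustration, whereas the letter uses 54-dimensional descriptors and repeated runs with random initial values.

```python
import random

def min_max_normalize(vectors):
    """Linearly rescale every feature dimension to the interval [0, 1]."""
    dims = len(vectors[0])
    lo = [min(v[d] for v in vectors) for d in range(dims)]
    hi = [max(v[d] for v in vectors) for d in range(dims)]
    # A constant dimension carries no information; map it to 0.
    return [[(v[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
             for d in range(dims)] for v in vectors]

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, centers):
    """Index of the center closest to vec."""
    return min(range(len(centers)), key=lambda j: sq_dist(vec, centers[j]))

def kmeans(vectors, k, iters=50, seed=0):
    """Plain k-means; the returned centers play the role of visual words."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        labels = [nearest(v, centers) for v in vectors]
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:  # assignments no longer change
            break
        centers = new_centers
    return centers

# Hypothetical toy patch descriptors (two features per patch), unlabeled.
patches = [[10, 200], [12, 198], [11, 205], [90, 20], [95, 25], [92, 18]]
normalized = min_max_normalize(patches)
visual_words = kmeans(normalized, k=2)
```

No class labels enter the clustering, which is what makes the vocabulary label free. In practice, the run would be repeated with several seeds because each run only reaches a local optimum.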
Here, the k-means algorithm is implemented by an expectation-maximization (EM) scheme [16]. However, there are two problems during this process. First, the k-means algorithm converges only to a local optimum, and different initial settings lead to different solutions. Therefore, the k-means algorithm should be repeated with random initial values, and the approximate optimal solution is obtained by averaging over the results of each run. Second, the parameter k (the number of cluster centers) cannot be computed automatically. The choice of the parameter k is discussed in the experimental section.

C. Virtual Word and Histogram Computing

Based on the clustering algorithm, a visual vocabulary is constructed to describe the object contents. Thus, each patch


is assigned to the closest visual word by using the Euclidean distance, and an object can be represented as a histogram by counting the numbers of occurrences of the visual words. Generally, this method is based on the assumption that the test data must belong to one of the current training classes and that each patch is well represented by its single closest visual word. However, if the test data are outside the sample classes, many patches from these outliers cannot be assigned to any suitable candidate in the vocabulary. How can we determine their assignments? To answer this, a threshold is introduced to reduce the impact of outliers in the testing process: those patches whose distance to the closest visual word is greater than this adaptive threshold are assigned to a single virtual visual word. Using this threshold in histogram computation, the test data can generate distinct features. Specifically, the virtual word occurs much more frequently in outliers and less in other objects. Thus, when most patches of an object are assigned to the virtual word, the object is attributed to the undefined class. During training, based on the earlier assumption, the threshold of each cluster is set to the maximum distance from all patches in that cluster to its center. From the previous analysis, we can see that the new feature depends only on the visual words, which are learned from the training data. Based on this idea, the three different types of geospatial objects in a remote-sensing image can be described uniformly by this histogram representation.

III. EXPERIMENTAL RESULTS

In this letter, support vector machines (SVMs) [17] are employed for the classification based on the BOV method. The one-against-all method is used to resolve the multiclass problem, and the leave-one-out algorithm is used to select the kernel parameters of the radial basis function (RBF). Here, the multiclass SVM approach with an RBF kernel (C = 10 and γ = 0) was used to classify the data.
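The virtual-word assignment and histogram computation of Section II-C can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function names, the toy two-word vocabulary, and the majority rule in `is_undefined` (more than half of the patches on the virtual word) are hypothetical, and cluster membership of a training patch is taken to be its closest visual word.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_word(patch, words):
    """Return (index, distance) of the visual word closest to the patch."""
    dists = [euclidean(patch, w) for w in words]
    j = min(range(len(words)), key=dists.__getitem__)
    return j, dists[j]

def cluster_thresholds(train_patches, words):
    """Per-word threshold: the largest distance of any training patch to its word."""
    thresholds = [0.0] * len(words)
    for p in train_patches:
        j, d = closest_word(p, words)
        thresholds[j] = max(thresholds[j], d)
    return thresholds

def bov_histogram(patches, words, thresholds):
    """Histogram over the k visual words plus one final bin for the virtual word."""
    hist = [0] * (len(words) + 1)
    for p in patches:
        j, d = closest_word(p, words)
        # Patches farther away than the cluster threshold go to the virtual word.
        hist[j if d <= thresholds[j] else len(words)] += 1
    return hist

def is_undefined(hist):
    """Assumed rule: reject the object when most patches fall on the virtual word."""
    return hist[-1] > sum(hist) / 2

# Hypothetical vocabulary and training patches.
words = [[0.0, 0.0], [1.0, 1.0]]
train = [[0.1, 0.0], [0.0, 0.2], [0.9, 1.0], [1.0, 0.7]]
thresholds = cluster_thresholds(train, words)

known_object = [[0.05, 0.05], [0.95, 0.9], [0.1, 0.1]]
outlier_object = [[0.5, 0.5], [0.5, 0.4], [0.0, 0.1]]
known_hist = bov_histogram(known_object, words, thresholds)
outlier_hist = bov_histogram(outlier_object, words, thresholds)
```

The histogram has k + 1 bins, so an object from a trained class and an object from an undefined class produce visibly different feature vectors even before the classifier is applied.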
Additionally, two statistics based on the confusion matrix, overall accuracy (OA) and the Kappa coefficient, are used to evaluate the classification performance. In this experiment, the aerial images, acquired over Chongming County in Shanghai, China, in June 2006, consist of three multispectral bands (RGB) with 0.25-m resolution. We distinguished four land-cover types that dominate the study area: Crops, Woodland (WD), Pond, and Residential area (RA). In total, 1882 objects of these four classes are segmented by the eCognition software (the shape parameter is 0.5, and the others are the defaults). Manual interpretation was used as a reference for the classification algorithm. Objects within each class are randomly divided into a training set and a test set. We repeat each experiment for five random splits and report the results averaged over the five different test sets.

A. BOV-Based Classification Results

For convenience of comparison, the results for the whole data set based on both our approach and the baseline low-level features are listed in Table I. The baseline approach

TABLE I. Confusion Matrix of the Image-Categorization Experiments (Over Five Randomly Generated Test Sets). Each row lists the number of images (test images) in one category classified to each of the four classes. Numbers along the diagonal indicate correct allocations, and the off-diagonal entries are the errors. The compared results without using BOV are shown in parentheses.
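As a brief illustration of how the two evaluation statistics are derived from a confusion matrix of this form, the sketch below computes OA and Cohen's Kappa. The function names and the 2 × 2 matrix are hypothetical and chosen only to keep the arithmetic easy to follow; they are not values from Table I.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def kappa(cm):
    """Cohen's Kappa: agreement corrected for the agreement expected by chance."""
    total = sum(sum(row) for row in cm)
    p_observed = overall_accuracy(cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(col) for col in zip(*cm)]
    p_chance = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical matrix: rows are reference classes, columns are predicted classes.
example_cm = [[45, 5],
              [10, 40]]
oa = overall_accuracy(example_cm)
k = kappa(example_cm)
```

Kappa is lower than OA whenever some of the agreement could have arisen by chance, which is why the two statistics are reported together.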

Fig. 2. Experiment area, reference map, and classification map based on the BOV representation.

also uses the 54-dimensional combined features on the object level, rather than a histogram representation based on visual words. The classification results of the baseline approach are shown in parentheses.

An example of object classification in our test area is shown in Fig. 2. Fig. 2(a) shows the original aerial image (3000 × 3000 pixels). The reference map is shown in Fig. 2(b), and the classification result obtained with the BOV-based representation is shown in Fig. 2(c).

We make a closer analysis of the performance by looking at the classification results for every category in terms of the confusion matrix. It can be seen that the improvements in classification accuracy and Kappa coefficient occur mostly in two classes, Residential area and Woodland, while the other two classes have the same classification performance as with the compared low-level features. For example, there is a 2.4% improvement in the Residential area class and a 3.73% improvement in the Woodland class. These details verify that the improvement of OA is obvious


Fig. 3. Comparing the classification performance based on the BOV representation with that based on low-level features as the number of training objects varies from 50 to 250 per class. The experiment is performed on 1882 objects in four classes. The overall classification accuracies are computed over five randomly generated test sets.

Fig. 4. Comparing the classification performance based on the BOV representation with four different low-level features (the combination of spectral and texture features, spectral features, and the texture features GLCM and SIFT, respectively). The overall classification accuracies are computed over five randomly generated test sets.

in those complex and composite objects. We therefore believe that the proposed BOV representation can describe the complex and composite objects more effectively.

B. Influence of Size of Training Set

Although we have tested the performance of SVMs via BOV when the training set contains 1000 samples, the scalability of the BOV method remains a question: How does the performance vary when the size of the training data set decreases? For this purpose, we tested the sensitivity of the BOV classification results to the size of the training set in each class. Fig. 3 shows that the two curves of classification accuracy, based on the BOV representation and on the low-level features, have a similar declining tendency when the training sample size in each category is reduced from 250 to 50. However, the classification accuracy based on the BOV outperforms that of the low-level features by 1.81%, 1.48%, 1.01%, 0.54%, and 4.1%, respectively. In particular, the classification accuracy based on the BOV representation degrades more stably when the number of training data in each class is close to 50. This indicates that the performance based on the BOV representation is more reliable when the training data set is small.

C. Influence of Feature Selection in BOV Representation

As mentioned previously, our method combines the spectral and texture features to describe the patches and construct the histogram representation, rather than the SIFT features used in the current literature. For a thorough analysis of the effect of the various low-level features, we compare the performance of four different features, i.e., spectral features, GLCM, their combination, and SIFT, as shown in Fig. 4. It is seen that when the number of training objects per class is 250 and the size of the visual vocabulary is 450, the highest overall accuracies obtained by the BOV representation based on the combined features, spectral features, GLCM, and SIFT are 93.81%, 86.60%, 91.40%, and 87.61%, respectively. The results indicate that the
The results indicate that the

combination of spectral and texture features is the better choice in the proposed approach, whereas SIFT cannot obtain a satisfactory result. The weakness of SIFT is probably due to the fact that it only computes the orientation and gradient of salient points, which is not always useful for remote-sensing images. Similarly, spectral or texture features alone cannot offer sufficient information for exact classification. Thus, we conclude that the combined feature is more suitable in this experiment.

D. Influence of the Parameter k

Central to our proposed approach is the construction of the visual vocabulary using unsupervised clustering, such as the k-means clustering method used in this work. In this approach, each visual word is defined as one cluster center. Thus, the influence of the number of visual words on the BOV representation in classification is estimated, and we set the parameter k in the k-means (i.e., the size of the visual vocabulary) to 100, 150, 200, 250, 300, 350, 400, and 450, respectively. As shown in Fig. 5, when the number of cluster centers increases, the overall classification accuracy rises as well, whereas the results vary with different initial values. This instability arises because k-means is not guaranteed to converge to the best solution, and the visual words vary across different initial settings. However, it can be observed from Fig. 5 that the corresponding 95% confidence interval clearly narrows as the parameter k increases. This implies that the stability of the BOV representation in classification can be improved by using more visual words. In our experiment, the parameter k is set to 450 as a tradeoff between the stability of the classification and the computation cost.

E. Influence of Outliers

In some cases, we have to face the difficulty that there are some outliers in the sample data set. In this letter, we

intentionally add 20 objects from undefined classes (e.g., ten from an urban area and ten from a factory area in the experimental area) into the test data set to evaluate their influence on classification. The corresponding classification result is shown in Table II. It can be seen that only three outliers are misclassified and the other 17 implausible objects are rejected, while the performance on the test data in classes 1–4 remains approximately the same as that shown in Table I. This verifies that the threshold-based method can judge whether the current test data belong to one of the categories in the training sample set, namely, that our proposed representation has a high plausibility in classification.

Fig. 5. Comparing the classification performance based on the BOV representation with that based on low-level features as the number of visual words varies from 150 to 450 per class. The overall classification accuracies are computed over five randomly generated test sets.

TABLE II. Confusion Matrix of Image-Categorization Experiments (20 Outliers Are Added)

IV. DISCUSSION AND CONCLUSION

This letter has introduced a simple but useful representation for VHR aerial-image content. The BOV representation is generated to overcome the problem of how to accurately describe the complex and composite objects in very high resolution imagery. The experimental results suggest a good performance of this representation in VHR aerial images, as compared with the low-level features in the classification results. Furthermore, the study on aerial-image classification shows that the BOV representation based on a combination of spectral and texture features significantly outperforms the representation based on SIFT in classification accuracy. The experimental results also show that the BOV representation is insensitive to the impact of outliers. From the preceding discussion, it is clear that our method is useful for representing aerial images from the same data set. In our future work, we will investigate the impact of gray-level differences due to changes in time and illumination, when the samples are collected from different sensors.

ACKNOWLEDGMENT

The authors would like to thank Prof. F.-F. Li of Princeton University for the BOV code and F. Perronnin of Xerox Research Centre Europe for some discussions on our work.

REFERENCES

[1] Y. D. Zhao, L. P. Zhang, P. X. Li, and B. Huang, "Classification of high spatial resolution imagery using improved Gaussian Markov random-field-based texture features," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 5, pp. 1458–1468, May 2007.
[2] T. Blaschke, S. Lang, and G. J. Hay, Object-Based Image Analysis. New York: Springer-Verlag, 2008.
[3] S. E. Franklin, R. J. Hall, L. M. Moskal, A. J. Maudie, and M. B. Lavigne, "Incorporating texture into classification of forest species composition from airborne multispectral images," Int. J. Remote Sens., vol. 21, no. 1, pp. 61–79, Jan. 10, 2000.
[4] J. A. Benediktsson, P. H. Swain, and O. K. Ersoy, "Neural network approaches versus statistical-methods in classification of multisource remote-sensing data," IEEE Trans. Geosci. Remote Sens., vol. 28, no. 4, pp. 540–552, Jul. 1990.
[5] S. Bhagavathy and B. S. Manjunath, "Modeling and detection of geospatial objects using texture motifs," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 12, pp. 3706–3715, Dec. 2006.
[6] F. Perronnin, "Universal and adapted vocabularies for generic visual categorization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 7, pp. 1243–1256, Jul. 2008.
[7] L. Weizman and J. Goldberger, "Urban-area segmentation using visual words," IEEE Geosci. Remote Sens. Lett., vol. 6, no. 3, pp. 388–392, Jul. 2009.
[8] L. Zhu, A. B. Rao, and A. D. Zhang, "Theory of keyblock-based image retrieval," ACM Trans. Inf. Syst., vol. 20, no. 2, pp. 224–257, Apr. 2002.
[9] E. Wiener, J. O. Pedersen, and A. S. Weigend, "A neural network approach to topic spotting," in Proc. 4th Annu. Symp. Document Anal. Inf. Retrieval, Las Vegas, NV, 1995, pp. 317–332.
[10] D. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.
[11] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1615–1630, Oct. 2005.
[12] F.-F. Li and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2005, pp. 524–531.
[13] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in Proc. Eur. Conf. Comput. Vis. Workshop Stat. Learn. Comput. Vis., 2006, pp. 1–22.
[14] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in Proc. 9th IEEE Int. Conf. Comput. Vis., 2003, pp. 1470–1477.
[15] F. Jurie and B. Triggs, "Creating efficient codebooks for visual recognition," in Proc. 10th IEEE Int. Conf. Comput. Vis., 2005, pp. 604–610.
[16] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Stat. Soc., vol. 39, no. 1, pp. 1–38, 1977.
[17] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
