General Terms
Algorithms
Keywords
K-Nearest Neighbors Classification, Vertical Data Structure, Vertical Set Squared Distance.
1. INTRODUCTION
Classification on large datasets has become one of the most important research priorities in data mining. Given a collection of labeled (pre-classified) data objects, the classification task is to label a newly encountered, unlabeled object with a pre-defined class label. Classification algorithms such as k-nearest neighbors have shown good performance on various datasets. However, when the training set is very large, i.e. contains millions of objects, searching for the k-nearest neighbors becomes prohibitively expensive. In this paper, we propose SMART-TV, an algorithm that filters a small set of candidate neighbors by examining the absolute difference of total variation between the data points in the training set and the unclassified point, computed scalably using a vertical data structure, and then searches for the k-nearest neighbors only among those candidates. Our algorithm is fast and scalable to very large datasets. In
addition, the classification accuracy of our algorithm is high and comparable to the accuracy of KNN algorithm. The remainder of the paper is organized as follows. In section 2, we review some related works. In section 3, we present the vertical data structure and show the derivation of the vertical set squared distance function to vertically compute a total variation. In section 4, we discuss our approach. In section 5, we report the empirical results, and in section 6, we summarize our conclusions.
2. RELATED WORKS
Classification based on k-nearest neighbors was first investigated by Cover and Hart [3]. It can be summarized as follows: search for the k nearest neighbors of the unclassified sample and assign it a class label based on the plurality class of those neighbors. KNN is simple and robust. However, it has some drawbacks. First, KNN builds no model in advance, so the entire training set must be retained and scanned at classification time. Second, its classification time is linear in the size of the training set: the larger the training set, the longer it takes to search for the k nearest neighbors, and thus the more time is needed for classification. Third, the classification accuracy is very sensitive to the choice of the parameter k, and in most cases the user has no intuition regarding this choice.

An extension of KNN has been proposed to address the sensitivity to k [9]. The idea is to adjust k based on the local density of the training set. However, this approach improves the classification accuracy only marginally, and the classification time remains high when the training set is very large.

P-KNN [8] employs a vertical data structure to accelerate classification and uses the Higher Order Bit (HOB) similarity as its distance metric. HOB removes one bit at a time, starting from the least significant bit position, to search for the nearest neighbors. Experiments showed that P-KNN is fast and accurate on spatial data. However, the squared ring of HOB cannot expand evenly on both sides when a bit is removed, which can consequently reduce the classification accuracy.

Another improvement of KNN speeds up the search by replacing the linear scan with a k-d tree [5]. This approach reduces the search complexity from O(n) to O(log n). However, the O(log n) behavior is realized only when the data points are dense.

BOND [12] improves the KNN search by projecting each dimension into a separate table. Each dimension is scanned to estimate lower and upper bounds on the k-nearest neighbors, and the data points outside these bounds are discarded. This strategy significantly reduces the number of nearest-neighbor candidates. Shrinking the space in which the k-nearest neighbors are searched is also the main idea of the SMART-TV algorithm. However, SMART-TV selects its candidate nearest neighbors by examining the absolute difference of total variation between the data points in the training set and the unclassified point.
3. VERTICAL DATA STRUCTURE AND VERTICAL SET SQUARED DISTANCE
In our approach, a relational table is converted into P-trees by decomposing each attribute in the table into separate bit vectors, one for each bit position of the numeric values in that attribute, and storing them separately in a file. Each bit vector can then be kept as a 0-dimensional P-tree (an uncompressed vertical bit vector) or constructed into a 1-dimensional, 2-dimensional, or multi-dimensional P-tree in the form of a tree. A 1-dimensional P-tree, for example, is created by recursively dividing the bit vector into halves and recording the truth of the purely-1-bits predicate: a predicate value of 1 indicates that all bits at the subsequent level are 1, while 0 indicates otherwise. In this work, we use 0-dimensional P-trees, i.e. the uncompressed vertical bit vectors. The logical operations AND (∧), OR (∨), and NOT (′) are the main operations on this data structure. The advantage is gained when performing aggregate operations such as the root count, which counts the 1-bits of a basic P-tree or of the P-tree resulting from any sequence of logical operations. Refer to [4] for more information about the P-tree vertical data structure and its operations.
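To illustrate the idea of uncompressed vertical bit vectors (0-dimensional P-trees) and the root count operation, here is a minimal Python sketch. It is only illustrative: the helper names (build_bit_slices, root_count) and the use of NumPy boolean arrays are our own assumptions, not the paper's implementation or an actual P-tree library.

```python
import numpy as np

def build_bit_slices(column, bit_width=8):
    """Decompose a numeric column into vertical bit vectors (0-dimensional P-trees).
    Returns a list of boolean arrays, one per bit position, most significant first."""
    values = np.asarray(column, dtype=np.uint64)
    return [((values >> j) & 1).astype(bool) for j in range(bit_width - 1, -1, -1)]

def root_count(*bit_vectors):
    """Count the 1-bits of the AND of the given vertical bit vectors."""
    result = bit_vectors[0].copy()
    for bv in bit_vectors[1:]:
        result &= bv
    return int(result.sum())

# Example: one attribute of a tiny training set and a class mask.
attribute = [5, 3, 7, 2, 6]                          # values of one attribute
class_mask = np.array([1, 0, 1, 1, 0], dtype=bool)   # P_X: rows belonging to class X

slices = build_bit_slices(attribute, bit_width=3)    # P_i(b-1), ..., P_i0
print(root_count(class_mask, slices[0]))             # rows in X whose high bit is 1
```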
Each numeric value is represented in b bits. Let $x$ be the value of attribute $i$; its binary representation is

$$x_i = x_{i(b-1)} x_{i(b-2)} \cdots x_{i0} = \sum_{j=0}^{b-1} 2^{j} x_{ij}.$$
The first subscript of x refers to the attribute to which x belongs, and the second subscript refers to the bit position. The summation on the right-hand side is the actual value of x in base 10. Let x be a vector in a d-dimensional space; then x in b-bit binary representation can be written as:
$$x = \left( x_{1(b-1)} \cdots x_{10},\; x_{2(b-1)} \cdots x_{20},\; \ldots,\; x_{d(b-1)} \cdots x_{d0} \right)$$
Let X be a set of vectors in a relation R, let $x \in X$, and let $a$ be either a training point or an unclassified point. The total variation of X about $a$, TV(X, a), measures the cumulative squared separation of the vectors in X from $a$. TV(X, a) transforms the multi-dimensional feature space of $a$ into a single-dimensional total-variation space. The total variation can be measured efficiently and scalably using the vertical set squared distance.
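For reference, the total variation can be computed directly (horizontally) by scanning every point; a tiny Python sketch of this baseline is shown below (the function name is ours, not the paper's). The vertical derivation that follows reproduces the same quantity from root counts alone, without scanning the data at classification time.

```python
import numpy as np

def total_variation_horizontal(X, a):
    """Reference (horizontal) computation: TV(X, a) = sum over x in X of (x - a).(x - a)."""
    X = np.asarray(X, dtype=float)
    a = np.asarray(a, dtype=float)
    return float(((X - a) ** 2).sum())

# e.g. total_variation_horizontal([[5, 1], [3, 2]], [4, 2]) == 3.0
```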
The vertical set squared distance is derived as follows:

$$\mathrm{TV}(X, a) = (X - a) \circ (X - a) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2$$

$$= \sum_{x \in X} \sum_{i=1}^{d} x_i^2 \;-\; 2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i \;+\; \sum_{x \in X} \sum_{i=1}^{d} a_i^2 \;=\; T_1 - 2 T_2 + T_3.$$

Expanding each value into its bits, the first term becomes

$$T_1 = \sum_{x \in X} \sum_{i=1}^{d} x_i^2 = \sum_{x \in X} \sum_{i=1}^{d} \Bigl( \sum_{j=0}^{b-1} 2^{j} x_{ij} \Bigr)^2 = \sum_{x \in X} \sum_{i=1}^{d} \bigl( 2^{b-1} x_{i(b-1)} + \cdots + 2^{0} x_{i0} \bigr)\bigl( 2^{b-1} x_{i(b-1)} + \cdots + 2^{0} x_{i0} \bigr)$$

$$= \sum_{i=1}^{d} \sum_{j=0}^{b-1} 2^{2j} \sum_{x \in X} x_{ij} \;+\; \sum_{i=1}^{d} \sum_{j=1}^{b-1} \sum_{l=0}^{j-1} 2^{j+l+1} \sum_{x \in X} x_{ij} x_{il},$$

where the squared terms use $x_{ij}^2 = x_{ij}$ (each $x_{ij}$ is a bit) and the factor 2 of each cross term is absorbed into the exponent $j+l+1$.

Let $P_X$ be the P-tree class mask of the set X, which quickly masks all data points in X. The equation above can then be simplified by replacing $\sum_{x \in X} x_{ij}$ with $rc(P_X \wedge P_{ij})$ and $\sum_{x \in X} x_{ij} x_{il}$ with $rc(P_X \wedge P_{ij} \wedge P_{il})$, giving
$$T_1 = \sum_{i=1}^{d} \sum_{j=0}^{b-1} 2^{2j}\, rc(P_X \wedge P_{ij}) \;+\; \sum_{i=1}^{d} \sum_{j=1}^{b-1} \sum_{l=0}^{j-1} 2^{j+l+1}\, rc(P_X \wedge P_{ij} \wedge P_{il}).$$

Similarly,

$$T_2 = \sum_{x \in X} \sum_{i=1}^{d} x_i a_i = \sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{j}\, rc(P_X \wedge P_{ij}),
\qquad
T_3 = \sum_{x \in X} \sum_{i=1}^{d} a_i^2 = rc(P_X) \sum_{i=1}^{d} a_i^2.$$

Putting the three terms together,

$$(X - a) \circ (X - a) = T_1 \;-\; 2 \sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{j}\, rc(P_X \wedge P_{ij}) \;+\; rc(P_X) \sum_{i=1}^{d} a_i^2.$$

Note that the root count operations are independent of $a$. These include the root count of the P-tree class mask, $rc(P_X)$, and the root counts of the operands $rc(P_X \wedge P_{ij})$ and $rc(P_X \wedge P_{ij} \wedge P_{il})$ for $1 \le i \le d$ and $0 \le l < j \le b-1$. In classification problems, where the classes are predefined, this independence of the root count operations from the input vector $a$ is an advantage. It allows us to run the root count operations once in advance and maintain the resulting root count values. Later, when the total variation of a class X about a given point is measured, for example the total variation of class X about an unclassified point, the same root count values that have already been counted can be reused. This reusability expedites the computation of total variation significantly [1].

4. OUR APPROACH
SMART-TV approximates the nearest neighbors of an unclassified point by examining, within each class, the gap between the total variation of each training point and the total variation of the unclassified point. Although this examination of the gap cannot guarantee 100% that all points with a small gap are truly close to the unclassified point, it can quickly approximate a superset of the nearest neighbors in each class. In fact, we can increase the chance of including more nearest-neighbor points in the candidate sets by considering more data points with a small total variation gap. For this reason, a parameter hs is introduced, which specifies the number of points in each class that will be considered in the candidate set. Our empirical results show that with a relatively small hs, i.e. 25 ≤ hs ≤ 50, a high classification accuracy can still be obtained.

The SMART-TV algorithm consists of two phases: preprocessing and classifying. In the preprocessing phase, all steps are conducted only once, whereas in the classifying phase the steps are repeated for every classification. The preprocessing steps, sketched in code below, are:

1) The computation of the root counts of the P-tree operands of each class Cj, where 1 ≤ j ≤ the number of classes. The complexity of this computation is O(kdb²), where k is the number of classes, d is the number of dimensions, and b is the bit-width.

2) The computation of TV(Cj, xi) for every xi ∈ Cj, 1 ≤ j ≤ the number of classes. The complexity of this computation is O(n), where n is the cardinality of the training set.

The root count and total variation values generated in steps 1 and 2 are stored in files and loaded during classification.
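As an illustration of this preprocessing, the following Python sketch computes the class-wise root counts and then the total variation of a point about a class from those counts. It is a simplified sketch under our own assumptions: helper names such as precompute_root_counts and total_variation are not from the paper, and the bit vectors are plain uncompressed NumPy arrays rather than compressed P-trees.

```python
import numpy as np

def bit_slices(column, b):
    """Vertical bit vectors P_i(b-1) ... P_i0 of one attribute (most significant first)."""
    v = np.asarray(column, dtype=np.uint64)
    return [((v >> j) & 1).astype(bool) for j in range(b - 1, -1, -1)]

def precompute_root_counts(data, class_mask, b):
    """Root counts rc(P_X), rc(P_X ^ P_ij), rc(P_X ^ P_ij ^ P_il) for one class X."""
    d = data.shape[1]
    slices = [bit_slices(data[:, i], b) for i in range(d)]   # slices[i][b-1-j] is P_ij
    rc_X = int(class_mask.sum())
    rc1 = [[int((class_mask & slices[i][b - 1 - j]).sum()) for j in range(b)] for i in range(d)]
    rc2 = [[[int((class_mask & slices[i][b - 1 - j] & slices[i][b - 1 - l]).sum())
             for l in range(b)] for j in range(b)] for i in range(d)]
    return rc_X, rc1, rc2

def total_variation(rc_X, rc1, rc2, a, b):
    """TV(X, a) = T1 - 2*T2 + T3 from precomputed root counts (independent of a)."""
    d = len(a)
    t1 = sum(2 ** (2 * j) * rc1[i][j] for i in range(d) for j in range(b)) \
       + sum(2 ** (j + l + 1) * rc2[i][j][l] for i in range(d) for j in range(1, b) for l in range(j))
    t2 = sum(a[i] * sum(2 ** j * rc1[i][j] for j in range(b)) for i in range(d))
    t3 = rc_X * sum(ai * ai for ai in a)
    return t1 - 2 * t2 + t3

# Usage sketch: one class of 3 training points with 2 attributes, 3-bit values.
data = np.array([[5, 1], [3, 2], [6, 7]])
mask = np.array([True, True, False])      # P_X for class X = first two rows
rc_X, rc1, rc2 = precompute_root_counts(data, mask, b=3)
print(total_variation(rc_X, rc1, rc2, a=[4, 2], b=3))  # prints 3, matching the horizontal reference above
```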
The classifying phase consists of four steps. The first step is the filtering step. The second, third, and fourth steps are similar to the KNN algorithm; the difference is that the k-nearest neighbors are not searched in the original training set but among the candidate points filtered in step 1. The classifying steps are summarized as follows (a code sketch is given after the list):

1) For each class Cj, where 1 ≤ j ≤ the number of classes, do the following:

   a) Compute TV(Cj, a), where a is the unclassified point.

   b) Find the hs points in Cj whose total variation gaps to a are smallest, i.e. points x_p such that

   $$\forall p, q,\; 1 \le p \le hs,\; 1 \le q \le n_j:\;\; |TV(C_j, x_p) - TV(C_j, a)| \le |TV(C_j, x_q) - TV(C_j, a)|, \quad x_q \notin \{x_r : 1 \le r \le hs\},$$

   where $n_j$ is the number of points in class Cj.

   c) Store those hs points in an array TVGapList.

2) For each point x_k in TVGapList, 1 ≤ k ≤ Len(TVGapList), where Len(TVGapList) = hs × the number of classes, measure the Euclidean distance

   $$d_2(x_k, a) = \sqrt{\sum_{i=1}^{d} (x_{ki} - a_i)^2}.$$

3) Find the k points nearest to a from the list.

4) Vote a class label for a based on the plurality class of the k nearest points.
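To make the filtering-then-voting procedure concrete, here is a hedged Python sketch of the classifying phase. It continues the hypothetical helpers above (it calls total_variation from the preprocessing sketch), and classify_smart_tv and its arguments are our own illustrative names, not the paper's code.

```python
from collections import Counter
import numpy as np

def classify_smart_tv(a, classes, hs, k, b):
    """Classify point a. `classes` maps a label to (points, tv_of_points, root_counts),
    where tv_of_points is a NumPy array of TV(Cj, xi) values and
    root_counts = (rc_X, rc1, rc2) from the preprocessing sketch."""
    candidates = []                                    # step 1: filtering
    for label, (points, tv_points, (rc_X, rc1, rc2)) in classes.items():
        tv_a = total_variation(rc_X, rc1, rc2, a, b)   # TV(Cj, a)
        gaps = np.abs(tv_points - tv_a)                # |TV(Cj, xi) - TV(Cj, a)|
        for idx in np.argsort(gaps)[:hs]:              # hs smallest gaps per class
            candidates.append((points[idx], label))    # TVGapList
    # Step 2: Euclidean distance from a to every candidate.
    dists = [(np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(a, dtype=float)), label)
             for p, label in candidates]
    # Steps 3-4: the k nearest candidates vote by plurality.
    dists.sort(key=lambda t: t[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

Sorting the whole candidate list is acceptable here because TVGapList contains only hs × (number of classes) entries, which is small compared to the training set.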
5. EMPIRICAL RESULTS
We compared SMART-TV with the KNN and P-KNN algorithms. P-KNN has been developed in the DataMIME™ system [11]. We measured the classification accuracy using the F-score and used datasets of different cardinality to measure the scalability of the algorithms.
5.1 Datasets
We conducted the experiments on both real and synthetic datasets. The first dataset is the network intrusion dataset used in KDDCUP 1999 [6]. This dataset contains approximately 4.8 million records. We normalized both the training and testing sets by dividing each value in each attribute by the maximum of that attribute. We selected only six types of classes, normal, ipsweep, neptune, portsweep, satan, and smurf, each of which contains at least 10,000 records. We discarded the categorical attributes, because our method only deals with numerical attributes, and were left with 32 numerical attributes. To analyze the scalability of the algorithms with respect to dataset size, we generated four sub-sampled datasets of different cardinality by selecting records randomly while keeping the class distribution proportional. The cardinalities of these sampled datasets are 10,000 (SS-I), 100,000 (SS-II), 1,000,000 (SS-III), and 2,000,000 (SS-IV). We randomly selected 120 records, 20 records for each class, for the testing set.

The second dataset is the OPTICS dataset [2]. This dataset was originally used for clustering problems; to make it suitable for our classification task, we carefully added a class label to each data point based on the original clusters found in the dataset. The class labels are CL-1, CL-2, CL-3, CL-4, CL-5, CL-6, CL-7, and CL-8. The dataset originally contains 8,000 points in 2-dimensional space. We took 7,920 points for the training set and randomly selected 80 points, 10 from each class, for the testing set.

The third dataset is the IRIS dataset [7]. This dataset contains three classes: iris-setosa, iris-versicolor, and iris-virginica. IRIS is the smallest dataset, containing only 150 records with 4 attributes. Thirty points were selected randomly for the testing set, and the remaining 120 points were used for the training set. We measured the classification accuracy of the algorithms on this dataset because it contains two classes that are not linearly separable; only the iris-setosa class is linearly separable. We neither determined the scalability nor measured the speed of the algorithms on this dataset because it is very small.
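The normalization described above is a simple per-attribute max scaling; the following small Python sketch (our own helper, not the authors' code) shows one way it could be done.

```python
import numpy as np

def normalize_by_max(X):
    """Scale each attribute to [0, 1] by dividing it by its column maximum,
    as described for the network intrusion training and testing sets."""
    X = np.asarray(X, dtype=float)
    col_max = X.max(axis=0)
    col_max[col_max == 0] = 1.0   # guard against all-zero attributes
    return X / col_max
```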
Figure 1(a) compares the algorithms' running time with varying dataset cardinality. We found that SMART-TV is much faster than KNN and P-KNN. For example, for the second largest dataset, N = 2,000,000, with hs = 25, SMART-TV takes only 3.88 seconds on average to classify, P-KNN requires about 12.44 seconds, and KNN takes about 49.28 seconds. It is clear that with increasing cardinality, KNN scales super-linearly. Figure 1(b) shows the SMART-TV running time against varying hs on the sub-sampled dataset of size 1,000,000; we also used k = 5 in this observation. We found that the SMART-TV running time grows linearly in hs, but the increase is small. The higher the hs, the more points there are in the candidate set, and thus the more candidate points SMART-TV must examine to find the k-nearest neighbors.
Figure 1. (a) The algorithms' running time against varying cardinality; (b) SMART-TV running time against varying hs.

We also observed that the SMART-TV and P-KNN algorithms are scalable: they were able to classify the unclassified samples using the dataset of size 4,891,469. For this large dataset, SMART-TV takes about 9.27 seconds and P-KNN about 30.79 seconds on average to classify. Note that both algorithms use a vertical data structure as their underlying structure, which makes them scale well to very large datasets. In contrast, KNN uses a horizontal, record-based data structure and, while using the same resources, failed to load the 4,891,469 training points into memory and thus failed to classify. This demonstrates that KNN requires more resources to scale to a very large dataset.

We also found that the overall classification accuracy of SMART-TV approaches the classification accuracy of the KNN algorithm as hs is increased (see Table 1). For example, when using hs = 50 on the 2,000,000-record dataset (SS-IV), the difference between the overall accuracy of KNN and SMART-TV is only 1%. This difference indicates that most of the 300 candidate points filtered by SMART-TV are the right nearest points. In fact, the classification accuracy of KNN and SMART-TV became exactly the same when we used hs = 100. Similarly, for the OPTICS dataset, with hs = 25 SMART-TV already filtered most of the closest points. Furthermore, when we used the IRIS dataset, which contains only 120 training points, and specified hs = 100 or hs = 125, we did not actually filter out any points; we considered the entire dataset, because (hs × the number of classes) is greater than the number of points in the dataset. In this situation, SMART-TV is no different from the KNN algorithm, because the k-nearest neighbors are searched among all data points in the training set.
From this observation, we conclude that with a proper hs, the examination of the absolute difference of total variation between the data points and the unclassified point can be used to approximate the candidate nearest neighbors. Thus, when speed is the issue, our approach efficiently expedites the nearest-neighbor search.

Table 1. Classification accuracy of SMART-TV against varying hs compared to KNN. Both algorithms used k = 5.
                              SMART-TV
Dataset                   hs=25   hs=50   hs=75   hs=100   hs=125    KNN
Network Intrusion (NI)    0.93    0.93    0.94    0.96     0.96      NA
SS-I                      0.96    0.96    0.96    0.96     0.96      0.89
SS-II                     0.92    0.96    0.96    0.96     0.97      0.97
SS-III                    0.94    0.96    0.96    0.96     0.96      0.96
SS-IV                     0.92    0.96    0.96    0.97     0.97      0.97
OPTICS                    0.96    0.96    0.96    0.96     0.97      0.97
IRIS                      0.97    0.97    0.97    0.97     0.97      0.97
Figure 2 compares the algorithms' overall accuracy (the average F-scores) over all datasets. We were not able to show the comparison for KNN on the largest dataset (NI) because it terminated while loading the dataset into memory. In this comparison, we found that the classification accuracy of SMART-TV is high and very comparable to the accuracy of KNN; most classes have an F-score above 90%. The same phenomenon is also found in the other datasets.

[Figure 2. Comparison of the algorithms' overall classification accuracy.]
6. CONCLUSIONS
In this paper, we have presented a new classification algorithm that begins by filtering a set of candidate nearest neighbors, selected by examining the absolute difference of total variation between the data points in the training set and the unclassified point. The k-nearest neighbors are then searched among those candidates to determine the appropriate class label for the unclassified point. We have conducted performance evaluations in terms of speed, scalability, and classification accuracy. We found that the speed of our algorithm outperforms that of the KNN and P-KNN algorithms: our algorithm takes less than 10 seconds to classify a new sample using a training set of more than 4.8 million records. In addition, our method is scalable and classifies with high accuracy. In future work, we will devise a strategy for setting the hs value automatically, e.g. from inherent features of the training set.
7. REFERENCES
[1] Abidin, T., et al. Vertical Set Squared Distance: A Fast and Scalable Technique to Compute Total Variation in Large Datasets. Proceedings of the International Conference on Computers and Their Applications (CATA), 2005, 60-65.
[2] Ankerst, M., et al. OPTICS: Ordering Points to Identify the Clustering Structure. Proceedings of the ACM SIGMOD, 1999, 49-60.
[3] Cover, T.M. and Hart, P.E. Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 13, 1967, 21-27.
[4] Ding, Q., Khan, M., Roy, A., and Perrizo, W. The P-tree Algebra. Proceedings of the ACM SAC, 2002, 426-431.
[5] Grother, P.J., Candela, G.T., and Blue, J.L. Fast Implementations of Nearest Neighbor Classifiers. Pattern Recognition, 30, 1997, 459-465.
[6] Hettich, S. and Bay, S.D. The UCI KDD Archive, http://kdd.ics.uci.edu. University of California, Irvine, Department of Information and Computer Science, 1999.
[7] Iris Dataset, http://www.ailab.si/orange/doc/datasets/iris.htm.
[8] Khan, M., Ding, Q., and Perrizo, W. K-Nearest Neighbor Classification of Spatial Data Streams Using P-trees. Proceedings of the PAKDD, 2002, 517-528.
[9] Mitchell, H.B. and Schaefer, P.A. A Soft K-Nearest Neighbor Voting Scheme. International Journal of Intelligent Systems, 16, 2001, 459-469.
[10] Perrizo, W. Peano Count Tree Technology. Technical Report NDSU-CSOR-TR-01-1, 2001.
[11] Serazi, M., et al. DataMIME. Proceedings of the ACM SIGMOD, 2004, 923-924.
[12] Vries, A.P., et al. Efficient k-NN Search on Vertically Decomposed Data. Proceedings of the ACM SIGMOD, 2002, 322-333.