
FAST: A ROC-based Feature Selection Metric for Small Samples and Imbalanced Data Classification Problems

Xue-wen Chen
Department of Electrical Engineering and Computer Science, The University of Kansas, Lawrence, KS 66045, USA
xwchen@ku.edu

Michael Wasikowski
Department of Electrical Engineering and Computer Science, The University of Kansas, Lawrence, KS 66045, USA
mikewaz@ku.edu

ABSTRACT
The class imbalance problem is encountered in a large number of practical applications of machine learning and data mining, for example, information retrieval and filtering, and the detection of credit card fraud. It has been widely realized that this imbalance raises issues that are either nonexistent or less severe compared to balanced class cases and often results in a classifier's suboptimal performance. This is even more true when the imbalanced data are also high dimensional. In such cases, feature selection methods are critical to achieving optimal performance. In this paper, we propose a new feature selection method, Feature Assessment by Sliding Thresholds (FAST), which is based on the area under a ROC curve generated by moving the decision boundary of a single-feature classifier, with thresholds placed using an even-bin distribution. FAST is compared to two commonly-used feature selection methods, correlation coefficient and RELevance In Estimating Features (RELIEF), for imbalanced data classification. The experimental results obtained on text mining, mass spectrometry, and microarray data sets show that the proposed method outperformed both RELIEF and correlation methods on skewed data sets and was comparable on balanced data sets; when a small number of features is preferred, the classification performance of the proposed method was significantly better than that of the correlation and RELIEF-based methods.

1. INTRODUCTION
One of the greatest challenges in machine learning and data mining research is the class imbalance problem presented in real-world applications. The class imbalance problem refers to the issues that occur when a dataset is dominated by a class or classes that have significantly more samples than the other classes in the dataset. Imbalanced classes are seen in a variety of domains, many with major economic, commercial, and environmental stakes. Examples include text classification, risk management, web categorization, medical diagnosis/monitoring, biological data analysis, credit card fraud detection, and oil spill identification from satellite images. While the majority of learning methods are designed for well-balanced training data, data imbalance presents a unique and challenging problem for classifier design when the misclassification costs of the two classes are different (i.e., cost-sensitive classification); accordingly, the overall classification rate is not an appropriate performance measure. The class imbalance problem can hinder the performance of standard machine learning methods: for example, it is easy to achieve high classification accuracy by simply assigning all samples to the majority class. Practical applications of cost-sensitive classification arise frequently, for example, in medical diagnosis [1], in agricultural product inspection [2], in industrial production processes [3], and in automatic target detection [4]. Analyzing imbalanced data thus requires different methods from those used in the past.

The majority of current research on the class imbalance problem can be grouped into two categories: sampling techniques and algorithmic methods, as discussed in two workshops at the AAAI conference [5] and the ICML conference [6], and later in the sixth issue of SIGKDD Explorations (see, for example, a review by Weiss [7]). Sampling methods level the class samples so that they are no longer imbalanced. Typically, this is done by under-sampling the larger class [8-9], by over-sampling the smaller one [10-11], or by a combination of these techniques [12]. Algorithmic methods include adjusting the costs associated with misclassification so as to improve performance [13-15], shifting the bias of a classifier to favor the rare class [16-17], creating AdaBoost-like boosting schemes [18-19], and learning from one class [20].

Categories and Subject Descriptors


I.5.2 [Pattern Recognition]: Design Methodology - feature evaluation and selection.

General Terms
Algorithms.

Keywords
Feature selection, imbalanced data classification, ROC.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'08, August 24-27, 2008, Las Vegas, Nevada, USA. Copyright 2008 ACM 978-1-60558-193-4/08/08 $5.00.


The class imbalance problem is even more severe when the dimensionality is high. For example, in microarray-based cancer classification, the number of features is typically tens of thousands [21]; in text classification, the number of features in a bag of words is often more than an order of magnitude larger than the number of training documents [22]. Both sampling techniques and algorithmic methods may not work well for high-dimensional class imbalance problems. Indeed, van der Putten and van Someren analyzed the CoIL Challenge 2000 datasets and concluded that, to overcome overfitting problems, feature selection is even more important than the choice of classification algorithm [23]. A similar observation was made by Forman for highly imbalanced data classification problems [22]. As pointed out by Forman, no degree of clever induction can make up for a lack of predictive signal in the input space [22]. This holds even for the SVM, which is engineered to work with hyper-dimensional datasets. Forman [22] found that the performance of the SVM could be improved by the judicious use of feature selection metrics. It is thus critical to develop effective feature selection methods for imbalanced data classification, especially when the data are also high dimensional.

While feature selection has been extensively studied [24-30], its importance to class imbalance problems in particular was only recently recognized and has attracted increasing attention from the machine learning and data mining community. Mladenic and Grobelnik examined the performance of different feature selection metrics in classifying text mining data from the Yahoo hierarchy [31]. After applying one of nine different filters, they tested the classification power of the selected features using naive Bayes classifiers. Their results showed that the best metrics choose common features and consider the domain and the learning machine's inherent characteristics. Forman found improved results with the use of multiple different metrics, but the best performing results were those selected by metrics that focused primarily on the minority class [22]. Zheng, Wu, and Srihari empirically tested different ratios of features indicating membership in a class versus features indicating lack of membership in a class [32]. This approach resulted in better accuracy than using one-sided metrics, which solely score features indicating membership in a class, or two-sided metrics, which simultaneously score features indicating membership and lack of membership.

One common problem with standard evaluation statistics used in previous studies, like information gain and odds ratios, is that they depend on the counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN), which are determined by a preset decision threshold. Consider imbalanced data classification with two different feature sets. The first feature set may yield higher TP, but lower TN, than the second feature set. By varying the decision threshold, the second feature set may produce higher TP and lower TN than the first feature set. Thus, a single threshold cannot tell us which feature set is better. This is an artifact of using a parametric statistic to evaluate a classifier's predictive power [33]. If we vary the classifier's decision threshold, we can compute these statistics for each threshold and see how they change based on where the threshold is placed. A receiver operating characteristic (ROC) curve is one such non-parametric measure of a classifier's power that compares the true positive rate with the false positive rate. While the ROC curve has been extensively used for evaluating classification performance in class imbalance problems, it has not been directly applied to feature selection.
In this paper, we construct a new feature selection metric based on the ROC curve generated by an optimal simple linear discriminant for each feature, and we select the features with the highest area under the curve as the most relevant. Unlike other feature selection metrics, which depend on one particular decision boundary, our metric evaluates features in terms of their performance across multiple decision hyperplanes and is therefore better suited to class imbalance problems.

The rest of the paper is organized as follows. Section 2 provides a brief discussion of two commonly-used filter methods: the correlation coefficient (CC) and RELevance In Estimating Features (RELIEF). In Section 3, we describe the proposed method, Feature Assessment by Sliding Thresholds (FAST). In Section 4, we present results comparing the performance of linear support vector machine (SVM) and 1-nearest neighbor (1-NN) classifiers using features selected by each metric. These results are measured on two microarray, two mass spectrometry, and one text mining data sets. Finally, we give our concluding remarks in Section 5.

2. FEATURE SELECTION METHODS


In this section, we briefly review two commonly-used feature selection methods, CC and RELIEF.

2.1 Correlation Coefficient


The correlation coefficient is a statistical measure of the strength and direction of the relationship between two variables. Correlation coefficients range from -1 to 1. The absolute value of the coefficient gives the strength of the relationship; absolute values closer to 1 indicate a stronger relationship. The sign gives the direction: a positive sign indicates that the two variables increase or decrease together, and a negative sign indicates that one variable increases as the other decreases. In machine learning problems, the correlation coefficient is used to evaluate how accurately a feature predicts the target independent of the context of other features. The features are then ranked by their correlation scores [25]. For problems where the covariance cov(X_i, Y) between a feature X_i and the target Y and the variances var(X_i) and var(Y) are known, the correlation can be calculated directly:
R(i) = cov(X_i, Y) / sqrt( var(X_i) var(Y) )        (1)
Equation 1 can only be used when the true values of the covariance and variances are known. When these values are unknown, the correlation can be estimated using Pearson's product-moment correlation coefficient over a sample of the population. This formula only requires the mean of each feature and the mean of the target:

R(i) = [ Σ_k (x_{k,i} − x̄_i)(y_k − ȳ) ] / sqrt( Σ_k (x_{k,i} − x̄_i)² · Σ_k (y_k − ȳ)² ),   k = 1, ..., m        (2)
where m is the number of data points, x̄_i is the sample mean of feature i, and ȳ is the sample mean of the target. Correlation coefficients can be used with both regressors and classifiers. When the learning machine is a regressor, the target may take values on any ratio scale. When the learning machine is a classifier, we restrict the range of values for the target to ±1.


We then use the coefficient of determination, R(i)², to enforce a ranking of the features according to the goodness of linear fit between individual features and the target [25]. When using the correlation coefficient as a feature selection metric, we must remember that the correlation only captures linear relationships between a feature and the target. Thus, a feature and the target may be perfectly related in a non-linear manner, yet the correlation could be equal to 0. We may lift this restriction by applying simple non-linear preprocessing to the feature before calculating the correlation coefficient, establishing a goodness of non-linear fit between the feature and the target [25]. Another issue with correlation coefficients comes from how we rank features. If features are ranked solely on their signed value, with positively scored features picked first (or vice versa), we risk not choosing the features that have the strongest relationship with the target. Conversely, if features are chosen by absolute value, Zheng, Wu, and Srihari argue that we may not select the ratio of positive to negative features that gives the best results for the imbalance in the data [32]. Finding this optimal ratio requires empirical testing, but it can yield very strong results.
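As a concrete illustration (our own sketch, not code from the paper), the correlation-based ranking described above can be computed with a few lines of NumPy; the function name correlation_ranking and the argument layout are assumptions made for this example.

import numpy as np

def correlation_ranking(X, y):
    # X: (m samples x n features) matrix; y: length-m target vector (+1/-1 for classification).
    # Computes the Pearson correlation R(i) of equation (2) for every feature at once,
    # then ranks features by the coefficient of determination R(i)^2.
    Xc = X - X.mean(axis=0)                  # center each feature
    yc = y - y.mean()                        # center the target
    num = Xc.T @ yc                          # sum_k (x_ki - mean_i)(y_k - mean_y)
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = num / np.maximum(den, 1e-12)         # guard against constant features
    return np.argsort(-(r ** 2))             # feature indices, best first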

2.2 RELIEF
RELIEF is a feature selection metric based on the nearest neighbor rule, designed by Kira and Rendell [34]. It evaluates a feature based on how well its values distinguish between instances that are near each other. When RELIEF selects a specific instance, it searches for two nearest neighbors: one from the same class (the nearest hit) and one from the other class (the nearest miss). We then calculate the relevance of each attribute A by the rule:

W(A) = P(different value of A | nearest miss) − P(different value of A | nearest hit)        (3)

This is justified by the intuition that instances of different classes should have very different values for a relevant attribute, while instances of the same class should have very similar values. Because the true probabilities cannot be calculated, we must estimate the difference in equation 3. This is done by calculating the distance between randomly selected instances and their nearest hits and misses. For discrete variables, the distance is 0 if the values are the same and 1 if they are different; for continuous variables, we use the standard Euclidean distance. We may select any number of instances up to the number in the set, and more selections give a better approximation [35]. Algorithm 1 details the pseudo-code for implementing RELIEF.

Algorithm 1 (RELIEF):
  Set all W(A) = 0
  FOR i = 1 to m
    Select instance R randomly
    Find nearest hit H and nearest miss M
    FOR A = 1 to number of features
      W(A) = W(A) - dist(A, R, H)/m
      W(A) = W(A) + dist(A, R, M)/m

The original version of RELIEF suffered from several problems. First, the method searches for only one nearest hit and one nearest miss; noisy data can make this approximation inaccurate. Second, if there are instances with missing feature values, the algorithm will crash because it cannot calculate the distance between those instances. Kononenko created multiple extensions of RELIEF to address these issues [35]. RELIEF-A allows the algorithm to check multiple nearest hits and misses. RELIEF-B, C, and D give the method different ways to handle missing values. Finally, RELIEF-E and F find a nearest miss from each different class instead of just one and use this to better estimate the separability of an instance from all other classes. These extensions added to RELIEF's adaptability to different types of problems.
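For concreteness, the basic loop of Algorithm 1 can be written in a few lines of NumPy. This is an illustrative sketch under our own naming (relief_weights); it assumes continuous features with Euclidean distance and no missing values, and it is not the authors' implementation.

import numpy as np

def relief_weights(X, y, m=None, seed=0):
    # Basic RELIEF: one nearest hit and one nearest miss per sampled instance.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = n if m is None else m                 # number of randomly sampled instances
    W = np.zeros(d)
    for _ in range(m):
        r = rng.integers(n)                   # select instance R randomly
        dists = np.linalg.norm(X - X[r], axis=1)
        dists[r] = np.inf                     # exclude R itself
        same = (y == y[r])
        hit = np.argmin(np.where(same, dists, np.inf))    # nearest hit
        miss = np.argmin(np.where(~same, dists, np.inf))  # nearest miss
        W -= np.abs(X[r] - X[hit]) / m        # small difference to the hit is good
        W += np.abs(X[r] - X[miss]) / m       # large difference to the miss is good
    return W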

3. METHOD DESCRIPTION: FAST

In this section, we propose to assess features based on the area under a ROC curve, which is determined by training a simple linear classifier on each feature and sliding the decision boundary. The new metric is called FAST (Feature Assessment by Sliding Thresholds). Most single-feature classifiers set the decision boundary at the mid-point between the means of the two classes [25]. This may not be the best choice of decision boundary. By sliding the decision boundary, we can increase the number of true positives we find at the expense of classifying more false positives. Alternatively, we could slide the threshold to decrease the number of true positives found in order to avoid misclassifying negatives. Thus, no single choice of decision boundary may be ideal for quantifying the separation between two classes. We can avoid this problem by classifying the samples at multiple thresholds and gathering statistics about the performance at each boundary. If we calculate the true positive rate and false positive rate at each threshold, we can build a ROC curve and calculate the area under it. Because the area under the ROC curve is a strong predictor of performance, especially for imbalanced data classification problems, we can use this score as our feature ranking: we choose the features with the highest areas under the curve because they have the best predictive power for the dataset.

By using a ROC curve as the means to rank features, we have introduced another problem: deciding where to place the thresholds. If a large number of samples are clustered together in one region, we would like to place more thresholds between these points to determine how well separated the two classes are within this cluster. Likewise, if there is a region where samples are sparse and spread out, we want to avoid placing multiple, redundant thresholds between these points. One possible solution is to use a histogram to determine where to place the thresholds. A histogram fixes the bin width and varies the number of points in each bin. This does not accomplish the goals detailed above. A particular histogram may have multiple neighboring bins that contain very few points; we would prefer that these bins be joined so that their points fall into the same bin. Likewise, a histogram may have a bin that contains a significant proportion of the points; we would rather split this bin into multiple bins so that we can better differentiate inside this cluster of points. We use a modified histogram, or an even-bin distribution, to correct both of these problems.


Instead of fixing the bin width and varying the number of points in each bin, we fix the number of points that fall in each bin and vary the bin width. This even-bin distribution accomplishes both of the above goals: areas of the feature space that have few samples are covered by wide bins, and areas that have many samples are covered by narrow bins. We then take the mean of the samples in each bin as a threshold and classify each sample according to this threshold. Algorithm 2 details the pseudo-code for implementing FAST.

Algorithm 2 (FAST):
  K: number of bins
  N: number of samples in dataset
  M: number of features in dataset
  Split = 0 to N with a step size of N/K
  FOR i = 1 to M
    X is the vector of sample values for feature i
    Sort X
    FOR j = 1 to K
      bottom = round(Split(j)) + 1
      top = round(Split(j+1))
      mu = mean(X(bottom to top))
      Classify X using mu as the threshold
      tpr(i, j) = TP / # positives
      fpr(i, j) = FP / # negatives
    Calculate the area under the ROC curve from tpr, fpr

One potential issue with this implementation is how it compares to the standard ROC construction, which uses every possible threshold; the standard construction is simpler but requires more computation. We conducted a pilot study on the CNS dataset to measure the difference between the FAST algorithm and this standard. With K = 10, 99% of the FAST scores were within ±0.02 of the exact AUC score, and 50% were within ±0.005. Additionally, the FAST algorithm was nearly ten times as fast. We therefore concluded that the approximate scores were sufficient.

Note that the FAST method is a two-sided metric. The scores generated by FAST range between 0.5 and 1. If a feature is irrelevant to classification, its score will be close to 0.5. If a feature is highly indicative of membership in the positive class, the negative class, or both, it will have a score closer to 1. Thus, the method has the potential to select both positive and negative features for use in classification.
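The following NumPy sketch mirrors Algorithm 2 (even-bin thresholds, per-feature ROC, area under the curve). It is our own illustrative code, not the authors' implementation; in particular, classifying with x > threshold and folding the two-sided property in as max(AUC, 1 − AUC) reflect our reading of the description above.

import numpy as np

def fast_scores(X, y, k=10):
    # X: (n samples x m features), y: 0/1 labels; returns one FAST score per feature.
    n, m = X.shape
    n_pos, n_neg = np.sum(y == 1), np.sum(y == 0)
    split = np.round(np.linspace(0, n, k + 1)).astype(int)   # even-bin boundaries
    scores = np.zeros(m)
    for i in range(m):
        x_sorted = np.sort(X[:, i])
        tpr, fpr = [0.0], [0.0]                # anchor the curve at (0, 0)
        for j in range(k):
            lo, hi = split[j], split[j + 1]
            if hi <= lo:
                continue
            mu = x_sorted[lo:hi].mean()        # threshold = mean of the samples in bin j
            pred = X[:, i] > mu
            tpr.append(np.sum(pred & (y == 1)) / n_pos)
            fpr.append(np.sum(pred & (y == 0)) / n_neg)
        tpr.append(1.0)                        # anchor the curve at (1, 1)
        fpr.append(1.0)
        order = np.argsort(fpr)                # sweep the ROC curve from left to right
        auc = np.trapz(np.asarray(tpr)[order], np.asarray(fpr)[order])
        scores[i] = max(auc, 1.0 - auc)        # two-sided: strongly negative features also score high
    return scores

Features would then be ranked by score and the top ones kept, e.g. np.argsort(-fast_scores(X, y))[:30].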

4. EXPERIMENTAL RESULTS

4.1 Data Sets
We tested the effectiveness of correlation coefficient, RELIEF, and FAST features on five different data sets. Two of the data sets are microarray sets, two are mass spectrometry sets, and one is a bag-of-words set. Each of the microarray and mass spectrometry data sets has a small number of samples, a large number of features, and a significant imbalance between the two classes. The bag-of-words data set also has a small number of samples and a large number of features, but we artificially controlled the class skew to show differences in performance on highly imbalanced classes versus balanced classes. The microarray sets were not preprocessed. The mass spectrometry sets were minimally preprocessed by subtracting the baseline, reducing the amount of noise, trimming the range of inspected mass/charge ratios, and normalizing. The bag-of-words set was constructed using RAINBOW [36] to extract the word counts from text documents. These data sets are summarized in Table 1.

Because the largest data set has 320 samples, we used 10-fold cross-validation to evaluate the trained models. Each fold had a class ratio equal to that of the full set. The results for the folds were combined to obtain test results for the entire data set. To stabilize the results, we repeated the cross-validation 20 times and averaged over the trials.

Table 1. Data set descriptions

CNS: Central Nervous System Embryonal Tumor Data [37]. This data set contains 90 samples: 60 have medulloblastomas and 30 have other types of tumors or no cancer. There are 7129 genes in this data set.

LYMPH: Lymphoma Data [38]. This data set contains 77 samples: 58 are diffuse large B-cell lymphomas, and 19 are follicular lymphomas. There are 7129 genes in this data set.

OVARY: Ovarian Cancer Data [39]. This data set contains 66 samples: 50 are benign tumors, and 16 are malignant tumors. There are 6000 mass/charge ratios in this data set.

PROST: Prostate Cancer Data [40]. This data set contains 89 samples: 63 have no evidence of cancer, and 26 have prostate cancer. There are 6000 mass/charge ratios in this data set.

NIPS: NIPS Bag-of-Words Data [41]. This data set contains 320 documents: 160 cover neurobiology topics, and 160 cover various application topics. There are 13649 words in this data set. The set was rebalanced into five separate class ratios: 1:1, 1:2, 1:4, 1:8, and 1:16; the neurobiology class was shrunk to create these imbalances.
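As a sketch of the evaluation protocol just described (stratified 10-fold cross-validation repeated 20 times), the following uses scikit-learn; the tooling, the linear SVM choice, and the names repeated_cv and select_features are our assumptions for illustration, not the paper's code.

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import LinearSVC

def repeated_cv(X, y, select_features, metric, n_keep=30):
    # select_features(X_tr, y_tr) is assumed to return feature indices ranked best-first
    # (e.g. a FAST, CC, or RELIEF ranking); metric(y_true, y_pred) is the evaluation
    # statistic, e.g. the balanced error rate defined in Section 4.2 below.
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
    scores = []
    for train, test in cv.split(X, y):
        idx = select_features(X[train], y[train])[:n_keep]   # select on the training fold only
        clf = LinearSVC().fit(X[train][:, idx], y[train])
        scores.append(metric(y[test], clf.predict(X[test][:, idx])))
    return float(np.mean(scores))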

4.2 Evaluation Statistics
The standard accuracy and error statistics quantify the strength of a classifier over the whole data set. However, these statistics do not take the class distribution into account. Forman argued that this matters because a trivial majority classifier can score well on a very imbalanced distribution [22]. It is more important to classify samples in the minority class correctly, even at the potential expense of misclassifying majority samples. However, the converse is true as well: a trivial minority classifier will give excellent results for the minority class, but such a classifier would raise too many false alarms to be usable. An ideal classifier performs well on both the minority and the majority class. The balanced error rate (BER) statistic captures the performance of a classifier on both classes. It is defined as the average of the error rates of the two classes, as shown in equation 4. If the classes are balanced, the BER is equal to the global error rate. It is commonly used for evaluating imbalanced data classification [42]. We used this statistic to evaluate trained classifiers on test data.

BER = (1/2) ( FN / (TP + FN) + FP / (FP + TN) )        (4)
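In code, equation 4 amounts to averaging the two per-class error rates; a minimal sketch follows (the helper name balanced_error_rate is ours).

import numpy as np

def balanced_error_rate(y_true, y_pred):
    # Average of the error rate on the positive class and the error rate on the negative class.
    pos = (y_true == 1)
    neg = ~pos
    fn_rate = np.sum(pos & (y_pred != 1)) / np.sum(pos)   # misclassified positives
    fp_rate = np.sum(neg & (y_pred == 1)) / np.sum(neg)   # misclassified negatives
    return 0.5 * (fn_rate + fp_rate)

This could be passed as the metric argument of the cross-validation sketch in Section 4.1.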

4.3 Results
We evaluated the performance of FAST-selected features by comparing them with features chosen by correlation coefficients and RELIEF. Many researchers have used standard learning algorithms that maximize accuracy to evaluate imbalanced datasets. Zheng [32] used the naive Bayes classifier and logistic regression, and Forman [22] used the linear SVM and noted its superiority over decision trees, naive Bayes, and logistic regression. The object of study in these papers, and in our research, is the performance of the feature selection metrics, not the induction algorithms. Thus, we chose to evaluate the metrics using the performance of the linear SVM and 1-NN classifiers. These classifiers were chosen for their differing classification philosophies: the 1-NN method is a lazy algorithm that defers computation until classification, whereas the SVM computes a maximum separating hyperplane before classification.

The classification results are summarized in Figs. 1-10, where dashed lines with square markers indicate classifiers using RELIEF-selected features (with one nearest hit and miss), dashed lines with star markers indicate classifiers using correlation-selected features, and dashed lines with diamond markers indicate classifiers using FAST-selected features (with 10 bins). The solid black line indicates the baseline performance where all features are used for classification.

Figures 1 and 2 show the BER versus the number of selected features using a 1-NN classifier and a linear SVM for the CNS data, respectively. FAST features significantly outperformed RELIEF and correlation features when using the 1-NN classifier. When using the SVM classifier, FAST features performed best for fewer than 40 features; for more than 40 features, there was little difference between the feature sets. In all cases, using a small set of features outperforms the baseline with all of the original features. Similar results were obtained for the other datasets. For example, Figures 3 and 4 show the results for the LYMPH data with a 1-NN and a linear SVM, respectively. Due to page limits, we are not able to show the results for all four datasets; instead, we include the averaged results. Figures 5 and 6 show the BER scores averaged over the four datasets with a 1-NN classifier and a SVM, respectively. For comparison, the baseline performance of the classifier using all features is also included.

Another evaluation statistic commonly used on imbalanced datasets is the area under the ROC curve (AUC). This statistic is similar in nature to the BER in that it weights errors on the two classes differently. In this study, it also lines up well with the design philosophy of FAST: FAST selects features that maximize the AUC, so it is reasonable to expect that a learning method using FAST-selected features would also tend to maximize the AUC. We used this statistic as well to evaluate trained classifiers on test data. Figures 7 and 8 show the AUC scores averaged over the four datasets with a 1-NN classifier and a SVM, respectively. Not surprisingly, FAST outperforms CC and RELIEF.

Figure 1. BER for CNS using a 1-NN classifier

Figure 2. BER for CNS using a SVM classifier

Figure 3. BER for LYMPH using 1-NN classifiers


Figure 4. BER for LYMPH using a SVM classifier

Figure 7. AUC averaged over CNS, LYMPH, OVARY, and PROST using a 1-NN classifier

Figure 5. BER averaged over CNS, LYMPH, OVARY, and PROST using a 1-NN classifier

Figure 8. AUC averaged over CNS, LYMPH, OVARY, and PROST using a SVM classifier

Figure 6. BER averaged over CNS, LYMPH, OVARY, and PROST using a SVM classifier

Figure 9. AUC for CNS using a SVM classifier


Figure 10. AUC for PROST using a SVM classifier

Figure 11. Training data distribution of CNS with the two best RELIEF-selected features

Figure 12. Training data distribution of CNS with the two best correlation-selected features

Figure 13. Training data distribution of CNS with the two best FAST-selected features

The average results in Figures 6 and 8 agree with the belief that SVMs are robust for high-dimensional data. Up to 100 RELIEF-selected features did not improve the BER or the AUC of the SVM. Additionally, up to 100 correlation-selected features did not improve the BER. On the other hand, the SVM using more than 30 FAST-selected features did see a significant improvement in both BER and AUC. Thus, our results agree with the general finding that SVMs are resistant to feature selection, but also with the finding by Forman [22] that SVMs can benefit from prudent feature selection. Specific examples of this improvement on our datasets can be seen in Figures 2 and 4, which show the BER scores with FAST for the CNS and LYMPH datasets, respectively, and in Figures 9 and 10, which show the AUC scores with FAST for the CNS and PROST datasets, respectively. The results for the 1-NN classifiers, seen in Figures 5 and 7, are even more striking. Both RELIEF and correlation-selected features improved significantly on the baseline performance of the classifier once at least 45 features were selected. FAST-selected features saw a significant jump in performance over that of RELIEF and correlation-selected features; 1-NN classifiers using only 15 FAST-selected features beat the baseline.

Why would FAST features outperform correlation and RELIEF features by such a significant margin for both 1-NN and SVM classifiers? To answer this question, we visualized the features selected by the correlation, RELIEF, and FAST methods. We show the training data of the CNS dataset with the two best features for each method. Figures 11-13 show the data using the best two RELIEF features, the best two correlation features, and the best two FAST features, respectively. FAST features appear to separate the two classes and group them into smaller clusters better than correlation and RELIEF features. This may explain why FAST features perform better with both the SVM and 1-NN classifiers: SVMs try to maximize the distance between the two classes, and 1-NN classifiers give the best results when similar samples are clustered close together.

Finally, we show the effects of different class ratios on the performance of each feature selection metric. Figures 14 and 15 show the BER versus class ratio for the NIPS dataset with the SVM and 1-NN classifiers, respectively. Not surprisingly, as the class ratio increases, the BER tends to increase accordingly. For both the 1-NN and SVM classifiers, correlation and FAST features performed comparably well for class ratios up to 1:8. For the 1:16 ratio, FAST features performed significantly better than correlation features. RELIEF features did not perform well on this dataset for any of the class ratios.


Figure 14. BER for NIPS using SVM classifiers

Figure 15. BER for NIPS using 1-NN classifiers

5. CONCLUSION
Classification problems involving a small sample space and a large feature space are especially prone to overfitting. Feature selection methods are often used to increase the generalization potential of a classifier. However, when the dataset to be learned is imbalanced, the most commonly used metrics tend to select less relevant features. In this paper, we proposed and tested a feature selection metric, FAST, that evaluates the relevance of features using the area under the ROC curve obtained by sliding the decision boundary in a one-dimensional feature space. We compared the FAST metric with the commonly-used RELIEF and correlation coefficient scores on two mass spectrometry and two microarray datasets that have small sample sizes and imbalanced distributions. FAST features performed considerably better than RELIEF and correlation features; the increase in performance was magnified for smaller feature counts, and this makes FAST a practical candidate for feature selection.

We conclude that FAST features perform better than RELIEF and correlation features; this boost in performance is especially large when the selected feature set is small and when the classes are extremely imbalanced. Because using fewer features helps classifiers avoid overfitting when the sample space is small, we believe that the FAST metric is of interest for learning patterns from real-world datasets, especially those with imbalanced classes and high dimensionality.

One interesting finding from this research was that correlation features tended to outperform RELIEF features on class imbalance and small sample problems, especially when the SVM classifier was used. This may be because the correlation coefficient takes a global view of whether a feature accurately predicts the target; in contrast, RELIEF, especially when the number of nearest hits and misses is small, takes a local view of a feature's relevance to predicting the target. If there are small clusters of points that are near each other but far away from the main cluster of points, these points can act as each other's nearest hits while being a great distance from their nearest misses. Features with this quality can thus be scored rather highly when they are, in fact, largely irrelevant to classification. There is strong evidence for this claim in Fig. 11: there are multiple small clusters of points, some from the majority class and some from the minority class, that are close to each other but a significant distance away from the nearest miss. This would greatly inflate the scores of these two features and make them appear more relevant. Figures 5-8 clearly point to this deficiency, as the performance of both SVM and 1-NN classifiers using RELIEF features is only marginally better (or worse) than chance and significantly behind classifiers using correlation or FAST features.

Our future work will investigate the use of other metrics for feature evaluation. For example, researchers have recently argued that precision-recall curves are preferable when dealing with highly skewed datasets [43]. Whether precision-recall curves are also appropriate for small-sample and imbalanced data problems remains to be examined.

6. ACKNOWLEDGMENTS
This work is supported by the US National Science Foundation Award IIS-0644366. We would also like to thank the reviewers for their valuable comments.

7. REFERENCES
[1] Nunez, M. 1991. The use of background knowledge in decision tree induction. Machine Learning, 6, 231-250.

[2] Casasent, D. and Chen, X.-W. 2003. New training strategies for RBF neural networks for X-ray agricultural product inspection. Pattern Recognition, 36(2), 535-547.
[3] Verdenius, F. 1991. A method for inductive cost optimization. Proceedings of the Fifth European Working Session on Learning, EWSL-91, 179-191. New York: Springer-Verlag.
[4] Casasent, D. and Chen, X.-W. 2004. Feature reduction and morphological processing for hyperspectral image data. Applied Optics, 43(2), 1-10.
[5] Japkowicz, N., editor. 2000. Proceedings of the AAAI'2000 Workshop on Learning from Imbalanced Data Sets. AAAI Tech Report WS-00-05.
[6] Chawla, N., Japkowicz, N., and Kolcz, A., editors. 2003. Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Data Sets.
[7] Weiss, G. 2004. Mining with rarity: A unifying framework. SIGKDD Explorations, 6(1), 7-19.
[8] Kubat, M. and Matwin, S. 1997. Addressing the curse of imbalanced data set: One sided sampling. In Proc. of the Fourteenth International Conference on Machine Learning, 179-186.


[9] Chen, X., Gerlach, B., and Casasent, D. 2005. Pruning support vectors for imbalanced data classification. In Proc. of the International Joint Conference on Neural Networks, 1883-88.
[10] Kubat, M. and Matwin, S. 1997. Learning when negative examples abound. In Proceedings of the Ninth European Conference on Machine Learning, ECML'97, 146-153.
[11] Chawla, N., Bowyer, K., Hall, L., and Kegelmeyer, P. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
[12] Estabrooks, A., Jo, T., and Japkowicz, N. 2004. A multiple resampling method for learning from imbalanced data sets. Computational Intelligence, 20(1), 18-36.
[13] Domingos, P. 1999. MetaCost: a general method for making classifiers cost-sensitive. Proc. of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 155-164.
[14] Elkan, C. 2001. The foundations of cost-sensitive learning. Proc. of the Seventeenth International Joint Conference on Artificial Intelligence, 973-978.
[15] Fawcett, T. and Provost, F. 1997. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3), 291-316.
[16] Huang, K., Yang, H., King, I., and Lyu, M. 2004. Learning classifiers from imbalanced data based on biased minimax probability machine. Proc. of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2(27), II-558 - II-563.
[17] Ting, K. 1994. The problem of small disjuncts: its remedy on decision trees. Proc. of the Tenth Canadian Conference on Artificial Intelligence, 91-97.
[18] Chawla, N., Lazarevic, A., Hall, L., and Bowyer, K. 2003. SMOTEBoost: Improving prediction of the minority class in boosting. Principles of Knowledge Discovery in Databases, LNAI 2838, 107-119.
[19] Sun, Y., Kamel, M., and Wang, Y. 2006. Boosting for learning multiple classes with imbalanced class distribution. Sixth International Conference on Data Mining, 592-602.
[20] Raskutti, A. and Kowalczyk, A. 2004. Extreme rebalancing for SVMs: a SVM study. SIGKDD Explorations, 6(1), 60-69.
[21] Xiong, H. and Chen, X. 2006. Kernel-based distance metric learning for microarray data classification. BMC Bioinformatics, 7, 299.
[22] Forman, G. 2003. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3, 1289-1305.
[23] Van der Putten, P. and van Someren, M. 2004. A bias-variance analysis of a real world learning problem: the CoIL Challenge 2000. Machine Learning, 57(1-2), 177-195.
[24] Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. 2002. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3), 389-422.

[25] Guyon, I. and Elisseeff, A. 2003. An introduction to variable and feature selection. JMLR Special Issue on Variable and Feature Selection, 3, 1157-1182.
[26] Weston, J. et al. 2000. Feature selection for support vector machines. In Advances in Neural Information Processing Systems.
[27] Chen, X. and Jeong, J. 2007. Minimum reference set based feature selection for small sample classifications. Proc. of the 24th International Conference on Machine Learning, 153-160.
[28] Chen, X. 2003. An improved branch and bound algorithm for feature selection. Pattern Recognition Letters, 24, 1925-1933.
[29] Yu, L. and Liu, H. 2004. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5, 1205-1224.
[30] Pudil, P., Novovicova, J., and Kittler, J. 1994. Floating search methods in feature selection. Pattern Recognition Letters, 15, 1119-1125.
[31] Mladenic, D. and Grobelnik, M. 1999. Feature selection for unbalanced class distribution and naive Bayes. In Proc. of the 16th International Conference on Machine Learning, 258-267.
[32] Zheng, Z., Wu, X., and Srihari, R. 2004. Feature selection for text categorization on imbalanced data. SIGKDD Explorations, 6(1), 80-89.
[33] Lund, O., Nielsen, C., Lundegaard, C., and Brunak, S. 2005. Immunological Bioinformatics, 99-101. The MIT Press.
[34] Kira, K. and Rendell, L. 1992. The feature selection problem: Traditional methods and new algorithms. In Proc. of the 9th International Conference on Machine Learning, 249-256.
[35] Kononenko, I. 1994. Estimating attributes: Analysis and extension of RELIEF. In Proc. of the 7th European Conference on Machine Learning, 171-182.
[36] McCallum, A. 1996. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow.
[37] Pomeroy, S. et al. 2002. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415, 436-442.
[38] Shipp, M. et al. 2002. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine, 8, 68-74.
[39] Petricoin, E. et al. 2002. Use of proteomic patterns in serum to identify ovarian cancer. The Lancet, 359, 572-577.
[40] Petricoin, E. et al. 2002. Serum proteomic patterns for detection of prostate cancer. Journal of the National Cancer Institute, 94, 1576-1578.
[41] Roweis, S. 2008. http://www.cs.toronto.edu/~roweis.
[42] MPS. 2006. Performance prediction challenge evaluation. http://www.modelselect.inf.ethz.ch/evaluation.php.
[43] Davis, J. and Goadrich, M. 2006. The relationship between precision-recall and ROC curves. In Proc. of the 23rd International Conference on Machine Learning, 30-38.

