
Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005

A Threshold Varying Bisection Method for Cost Sensitive Learning in Neural Networks
Parag C. Pendharkar
Information Systems
School of Business Administration
777 West Harrisburg Pike
Middletown, PA 17057
E-mail: pxpl9@psu.edu

Abstract— We propose a bisection method for varying the classification threshold value for cost sensitive neural network learning. Using simulated data and different cost asymmetries, we test the proposed threshold varying bisection method and compare it with traditional fixed-threshold neural network learning. The results of our experiments illustrate that the proposed threshold varying bisection method performs better than the traditional fixed-threshold method.

I. INTRODUCTION

Recently, there has been considerable interest in the development and application of neural networks (NNs) for classification problems [10][8][9][3]. Further, a stream of literature has suggested that NN learning is sensitive to misclassification cost [3], and a few researchers have indicated the sensitivity of NN results to the cut-off threshold value, data distributions and NN architecture [15][4][12][11]. Prior research on cost sensitive learning using NNs has used penalty factors for minimizing root-mean-square (RMS) error [14]. A few researchers have used probabilistic NNs (PNNs) for cost sensitive learning. However, to our knowledge, there are no studies that have investigated the dynamic adjustment of the cut-off threshold to improve NN learning in an asymmetric misclassification cost environment. This paper attempts to fill this gap by proposing a bisection method that dynamically selects a classification threshold so that a NN learns to minimize misclassification cost.

The motivation for our research comes from the fact that a marginal improvement in misclassification cost can result in substantial savings. For example, West et al. [17] note that, in the US, the outstanding level of consumer debt is about $1.5 trillion, with high-interest credit card loans comprising $568.4 billion. Further, more than 4% of credit card loans are delinquent every year. Thus, a classification system that reduces the misclassification cost can reduce the percentage of delinquent cases and result in substantial savings. Among several different statistical and machine learning classification systems, several researchers have shown that NNs outperform traditional classification systems, such as Logit and statistical discriminant analysis [4][15]. Thus, we focus on improving NNs for minimizing misclassification cost.

The rest of the paper is organized as follows. First, we provide an overview of the NN for classification. Next, we describe the bisection method based procedure for varying the threshold, which is followed by the experimental design, data description, and the results of our experiments. Finally, we conclude the paper with a summary.

II. OVERVIEW OF NN FOR CLASSIFICATION

For classification problems, several NN architectures are available. Among the popular NN architectures are probabilistic neural networks and multi-layer feed-forward NNs (MLFFNNs). In this research, we use a MLFFNN for cost sensitive learning. For classification problems with k classes, a MLFFNN learns a classification function of the type f: X → Λ^k, where X represents an instance space of training cases and Λ^k represents a k-dimensional set containing only one element with value 1 and all others with value 0. Assume a data set S = {(x_1, s_1), ..., (x_a, s_a)} of a examples, where x_p is an n-dimensional vector of decision-making attributes and s_p ∈ Λ^k, ∀ p ∈ {1, ..., a}, is the known observed value of f(x_p). Assuming f(·) is a logistic function, k = 2, and a popular three-layer architecture [15] with five inputs (n = 5), five nodes in the hidden layer and one node in the output layer, a MLFFNN, illustrated in Figure 1, can be used for classification. Assume that w_ij is the connection weight for the connection from the ith input node to the jth hidden node, and q_j is the connection weight from the jth hidden node to the output node. For any given example p, the elements of the vector x_p can be represented as {x_p1, ..., x_pn}. Since f(·) is a logistic function, the output O_p for a given example can be written as follows:

O_p = f( Σ_{j=1}^{5} q_j · f( Σ_{i=1}^{5} w_ij · x_pi ) ), where f(z) = 1/(1 + e^(−z)).
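As an illustration of the notation above, the following sketch computes O_p for the 5-5-1 network. Because the scanned equation is only partly legible, it assumes that the logistic function is applied at both the hidden nodes and the output node, and the weights shown are arbitrary placeholders rather than learned values.

```python
import numpy as np

def logistic(z):
    """Logistic activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def mlffnn_output(x_p, W, q):
    """Output O_p of a three-layer MLFFNN with logistic hidden and output nodes.

    x_p : (n,) input vector for example p
    W   : (n, h) matrix, W[i, j] = w_ij, weight from input node i to hidden node j
    q   : (h,) vector, q[j] = weight from hidden node j to the output node
    """
    hidden = logistic(x_p @ W)      # activations of the hidden nodes
    return logistic(hidden @ q)     # O_p lies in (0, 1)

# Illustrative 5-input, 5-hidden-node network with random weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 5))
q = rng.normal(size=5)
x_p = rng.normal(size=5)
print(mlffnn_output(x_p, W, q))
```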

0-7803-9048-2/05/$20.00 ©2005 IEEE


The learning in the neural network begins with the random initialization of the weights w_ij and q_j, followed by periodic updates to the weights such that a certain error, e.g., the root-mean-square (RMS) error, is minimized for the training data. A popular algorithm called backpropagation is often used for minimizing the RMS error [13].

For a classification problem, the value of the output O_p and a threshold λ are used to predict the class of an example. Typically a threshold value of λ = 0.5 is used for classification [4][15]. The following rules can be used to predict the class of an example:

IF O_p > λ THEN Class = 1
ELSE Class = 0.

Different values of λ lead to different results and different misclassification costs. For example, if the value of λ is set equal to zero, all examples will be classified in class 1. Similarly, the value λ = 1 will lead to the prediction of all examples belonging to class 0. In reality, the misclassification cost minimizing value of λ lies in the interval (0, 1).

Fig. 1: A three-layer neural network (input layer, hidden layer, output layer)

In this research, we propose a dynamic approach for obtaining the appropriate misclassification cost minimizing value of λ.

III. BISECTION METHOD FOR VARYING THE THRESHOLD FOR A NN

The bisection method is a root finding method for non-linear equations [1]. The traditional bisection method assumes a continuous function g(z) on the interval [α, β] such that the function g(z) satisfies the condition g(α)·g(β) < 0. Figure 2 illustrates the four steps in the root finding bisection method. The variable ε is the error tolerance for the root of the function g(z).

Bisection (g, α, β, root, ε)
1. Define γ = (α + β)/2.
2. If β − γ ≤ ε then accept root = γ and exit.
3. If g(β)·g(γ) < 0, then α = γ; otherwise β = γ.
4. Go to Step 1.

Fig. 2. The root finding Bisection Method

The bisection method is guaranteed to converge, and the convergence speed is exponential [1]. When the function g(z) has multiple roots, the interval [α, β] is chosen so that there is only one root in the interval.

When the bisection method is used for varying the threshold λ, special care must be exercised by the decision-maker because the threshold varying problem is not a non-linear root finding problem. Since the misclassification cost function is unknown, it is important, at the beginning of the procedure, to test that the misclassification costs at the interval boundaries are greater than the misclassification cost at λ = 0.5. In other words, initially, three misclassification costs are computed, at λ = α, λ = β and λ = 0.5. Let Miss(λ = σ) represent the misclassification cost on a set of training examples when a threshold value of λ = σ is used. Thus, at the beginning of the procedure, a test is performed to check whether the following condition is satisfied:

Miss(λ = 0.5) ≤ Min(Miss(λ = α), Miss(λ = β)).

If the above condition is not satisfied then the bisection procedure is not used. Figure 3 illustrates the threshold varying bisection method for the cost sensitive NN classifier.

Threshold Varying Bisection Method (α, β, λ*, ε)
1. If Miss(λ = 0.5) ≤ Min(Miss(λ = α), Miss(λ = β)) then go to Step 2; otherwise λ* = 0.5 and exit.
2. If Miss(λ = α) ≤ Miss(λ = β) then β = 0.5; otherwise α = 0.5.
3. Define γ = (α + β)/2.
4. If Miss(λ = γ) ≤ Min(Miss(λ = α), Miss(λ = β)) then go to Step 5; otherwise set λ* to whichever of α and β has the lower misclassification cost and exit.
5. If β − γ ≤ ε then λ* = γ and exit; otherwise go to Step 6.
6. If Miss(λ = γ) ≤ Miss(λ = α) then β = γ; otherwise α = γ.
7. Go to Step 3.

Fig. 3. A Bisection Method for a Cost Sensitive NN Classifier

The NN classifier that uses the bisection method for varying the threshold works as follows.
1. The traditional backpropagation method is used to minimize the RMS error and learn the connection weights.
2. The bisection method is then used to find the appropriate threshold so that the total misclassification cost is minimized for the examples in the training data set.
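The sketch below ties these two steps together. It follows the Figure 3 procedure as reconstructed from the scanned text, so the tie-breaking details in Steps 2, 4 and 6 are an interpretation rather than a verbatim transcription, and the cost function Miss(λ) simply accumulates class-specific costs over the training examples.

```python
def miss_cost(outputs, labels, lam, cost_fp=1.0, cost_fn=1.0):
    """Total misclassification cost on training data for threshold lam.
    outputs: NN outputs O_p in (0, 1); labels: true classes (0 or 1)."""
    cost = 0.0
    for o, y in zip(outputs, labels):
        pred = 1 if o > lam else 0
        if pred == 1 and y == 0:
            cost += cost_fp          # false positive
        elif pred == 0 and y == 1:
            cost += cost_fn          # false negative
    return cost

def threshold_bisection(miss, alpha=0.3, beta=0.7, eps=0.01):
    """Threshold-varying bisection (Fig. 3, as reconstructed): returns lambda*."""
    if miss(0.5) > min(miss(alpha), miss(beta)):
        return 0.5                                   # Step 1: bisection not used
    if miss(alpha) <= miss(beta):                    # Step 2: keep the half with the better boundary
        beta = 0.5
    else:
        alpha = 0.5
    while True:
        gamma = (alpha + beta) / 2.0                 # Step 3
        if miss(gamma) > min(miss(alpha), miss(beta)):
            return min((alpha, beta), key=miss)      # Step 4: fall back to the best endpoint
        if beta - gamma <= eps:                      # Step 5: tolerance reached
            return gamma
        if miss(gamma) <= miss(alpha):               # Step 6: narrow the interval
            beta = gamma
        else:
            alpha = gamma

# Usage, after backpropagation has produced training outputs:
# lam_star = threshold_bisection(lambda lam: miss_cost(outputs, labels, lam, cost_fp=1, cost_fn=4))
```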
The selection of the interval [α, β] is very important, as the following Lemma illustrates that the misclassification cost function is not convex and multiple optima often exist.

Lemma 1: The misclassification cost minimizing function is not convex.

Proof: We consider a simple univariate case and the Yarnold and Soltysik [18] discriminant rule IF Variable A < θ THEN Group 1 ELSE Group 2. Assuming the data and group memberships given in Table 1, unit misclassification cost for each group, and values of θ of 1, 2, 3, and 4 respectively, it can be shown that the total misclassification costs for the different values of θ will be 2, 1, 2 and 1 respectively.

Table 1: Sample data and their group memberships

            Obs. 1   Obs. 2   Obs. 3   Obs. 4
Variable A    1        2        3        4
Grp. Mem.   Grp. 1   Grp. 2   Grp. 1   Grp. 2

A reader may plot θ against the total misclassification cost to check the non-convexity of the misclassification cost function. In addition to the misclassification cost function being non-convex, it can be seen that multiple optima exist. The existence of multiple optima for unit misclassification cost has been noted previously [18].

Given that multiple optima are likely to exist, it is important that [α, β] is chosen close to 0.5. In our experiments, we found that the interval [0.3, 0.7] works better than the interval [0.1, 0.9]. A smaller interval precludes the procedure from obtaining a threshold that is farther from 0.5. Thresholds that are too close to 0.1 or 0.9 work well for training data, but the holdout sample performance of these thresholds is usually not good. Thus, a threshold interval of [0.3, 0.7] works well for both training and holdout samples. We use the value ε = 0.01 in our experiments. According to Atkinson [1], the number of iterations n required for convergence of the bisection procedure is given by the following expression:

n ≥ ln((β − α)/ε) / ln 2.

Using the proposed interval and the value of ε, the number of iterations for convergence in our experiments is equal to six.

IV. EXPERIMENTAL DESIGN AND DATA

Our experimental design is governed by a few factors that are known to impact the performance of NNs. Among these factors are the design of a NN and the training data characteristics [7][4]. The NN design factors consist of the learning algorithm used to learn connection weights, the architecture of a NN, and the NN learning parameters (learning rates, stopping criterion, error tolerance, etc.). The training data set characteristics include the training data distribution and the training data bias [7][4][2].

The architecture of a NN is either a two-layer NN or a three-layer NN. A two-layer NN is suitable for learning linear or quadratic classification functions [6]. However, a three-layer NN can learn any desired classification function. It is possible to have more than three layers, but the literature suggests that having more than three layers may not result in significant performance improvements [5].

If a three-layer NN is chosen for learning the classification function, then a decision-maker has to select the number of hidden nodes in the hidden layer [7][12]. Increasing the number of hidden nodes in a NN increases the training performance of the NN, but often results in poor generalization [12][7]. For n inputs, Pendharkar [12] suggests trying at least three different configurations, with the number of hidden nodes equal to n+1, 2n+1, and 3n+1.

Several studies have investigated the performance of NNs under changing data distributions and report that NNs generally fare well under non-parametric distributions [4][12][11][10][9][7]. Further, given that NNs are prone to overfitting the training data, improved training performance is expected when the training data are biased, that is, when more than 50% of the examples belong to one particular class. Additionally, when misclassification costs are not equal, a few researchers have observed that a NN is sensitive to the misclassification cost matrices [3][11].

Thus, our experimental design consists of the following five factors (a sketch of the resulting factorial design follows the list).
1. Technique (T): we use two techniques, the traditional NN that uses a fixed threshold for classification, and the bisection method based threshold varying NN (BNN).
2. Size (S): the size of a NN is determined by the number of hidden nodes in the hidden layer. We use three different sizes: small, where the number of hidden nodes is equal to the number of inputs; medium, where the number of hidden nodes is equal to two times the number of inputs; and large, where the number of hidden nodes is equal to three times the number of inputs.
3. Data Distribution (DD): we consider three different data distributions for the two classes. These three data distributions are the exponential data distribution, the normal data distribution and the uniform data distribution.
4. Data Bias (DB): data bias represents the ratio of the number of examples belonging to class 1 to the number of examples belonging to class 2. We consider two data biases. The first type of data bias is equal, where the number of examples belonging to class 1 is equal to the number of examples belonging to class 2. The second type of data bias is unequal, where there is a 70:30 ratio of the number of examples belonging to class 1 to the number of examples belonging to class 2.
5. Misclassification Cost Asymmetry (MCA): misclassification cost asymmetry is the ratio of the misclassification cost of false positives to false negatives. We use three MCA matrices. The first matrix is equal MCA, where the ratio between the misclassification costs of false positives and false negatives is one. The second matrix is mild unequal MCA, where the ratio between the misclassification costs of false positives and false negatives is 0.5. The third matrix is unequal MCA, where the ratio between the misclassification costs of false positives and false negatives is 0.25.
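To make the factorial structure of these five factors concrete, the snippet below enumerates the 2 × 3 × 3 × 2 × 3 = 108 design cells. The level names come from the list above, while the data-structure layout is only an illustrative choice.

```python
from itertools import product

# Factor levels as described in Section IV (labels are illustrative).
technique = ["NN", "BNN"]                                   # T: fixed threshold vs. bisection threshold varying
size = {"small": 1, "medium": 2, "large": 3}                # S: hidden nodes = multiple of the number of inputs
distribution = ["exponential", "normal", "uniform"]         # DD
data_bias = {"equal": (0.5, 0.5), "unequal": (0.7, 0.3)}    # DB: class 1 : class 2 split
mca = {"equal": 1.0, "mild_unequal": 0.5, "unequal": 0.25}  # MCA: cost(FP) / cost(FN)

cells = list(product(technique, size, distribution, data_bias, mca))
print(len(cells))  # 2 * 3 * 3 * 2 * 3 = 108 factor-level combinations
```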
We use a five-factor analysis of variance (ANOVA) and simulated data to test the effect of the aforementioned five factors on the total misclassification cost.

We use part of the data from Pendharkar [8] and generate additional data using the methodology described in Vale and Maurelli [16]. Our final data set contained sixty data samples for a given data distribution. We considered three different data distributions: the exponential distribution, the uniform distribution and the normal distribution. Each data sample consisted of 300 examples of ten independent attributes (x_1, x_2, ..., x_10). For a given data distribution, the first 30 samples had an equal data bias, where the examples were equally split between the two groups, and the next 30 samples had an unequal data bias, where 210 examples belonged to group 1 and the remaining 90 examples belonged to group 2. For all sixty data samples for all the data distributions, the mean for group 1 was approximately set equal to zero, and the mean of group 2 was approximately set equal to 0.5. The standard deviation for all the attributes was approximately set equal to one, and the skewness was set equal to zero.

V. EXPERIMENTS AND RESULTS

For our experiments, we used bootstrap sampling and conducted 60 different training and holdout tests for each of the three data distributions. We used the traditional backpropagation algorithm for learning connection weights. After initial experimentation, the learning rate and stopping criterion for the backpropagation algorithm were fixed and set to 0.1 and 5,000 learning iterations respectively. Tables 2 and 3 provide the overall ANOVA summary tables for the training and holdout sample tests, respectively.

Table 2: The overall ANOVA Summary Table for Training Experiments

Source    Type III SS    Mean Sq.      F Val.   Sig.
Model     1869953.74     233744.22     307.44   0.00*
T         196373.51      196373.51     258.29   0.00*
S         45451.13       22725.56      29.89    0.00*
DD        392252.71      196126.36     257.96   0.00*
DB        535973.95      535973.95     704.96   0.00*
MCA       699902.44      349951.22     460.29   0.00*
R-Squared 0.432; * significant at 99%

Table 3: The Overall ANOVA Summary Table for Holdout Experiments

Source    Type III SS    Mean Sq.      F Val.   Sig.
Model     6298161.90     787270.24     223.80   0.00*
T         182640.21      182640.21     51.92    0.00*
S         45881.45       22940.73      6.52     0.00*
DD        674485.44      337242.72     95.87    0.00*
DB        1494423.78     1494423.78    424.82   0.00*
MCA       3900731.02     1950365.51    554.43   0.00*
R-Squared 0.357; * significant at 99%

Both the training and holdout sample results indicate that all five factors play a significant role in explaining the variance in misclassification cost. The F values indicate that MCA and DB play the most important role in explaining the variance in the misclassification cost, and S has the least significant impact. The misclassification costs were significantly lower when the technique used was BNN or when the data bias was unequal. Since the degrees of freedom for S, DD and MCA were more than one, we conducted post-hoc pairwise comparison tests using Tukey's pairwise comparison of means technique.

Tables 4 and 5 illustrate Tukey's post hoc pairwise comparison of means tests on NN size for the training and holdout experiments. The symbol μ represents the mean misclassification cost value for the particular size mentioned in the subscript. The results indicate that the large NN has a tendency to overfit the training data, as the misclassification cost for the large NN is lowest for the training experiments and highest for the holdout experiments. Both the small and medium sizes performed well in our experiments. This finding is consistent with the Bhattacharyya and Pendharkar [4] finding that suggests that medium size NNs work well for classification problems.

Table 4: Pairwise |difference-in-means| comparisons using Tukey's post hoc test for NN size for training experiments

          Small     Medium
Medium    6.06**
Large     8.99**    2.93**
** significant at 95%; μ_small = 68.05; μ_medium = 61.99; μ_large = 59.06; Sample Size = 1080

Table 5: Pairwise |difference-in-means| comparisons using Tukey's post hoc test for NN size for holdout experiments

          Small     Medium
Medium    1.61
Large     8.67**    7.05**
** significant at 95%; μ_small = 130.09; μ_medium = 131.70; μ_large = 138.76; Sample Size = 1080
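The paper does not list the software used for the post hoc tests. Purely as an illustration, comparisons of the kind reported in Tables 4 and 5 could be produced with a standard Tukey HSD routine such as the one in statsmodels; the column names and placeholder values below are assumptions rather than the study's actual data.

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# df is assumed to hold one row per experimental run, with the observed total
# misclassification cost and the NN size level ("small", "medium", "large").
df = pd.DataFrame({
    "size": ["small", "medium", "large"] * 360,        # 1080 observations, as in Tables 4-5
    "cost": pd.Series(range(1080)).astype(float),      # placeholder cost values for illustration
})

# Tukey's pairwise comparison of means at the 95% family-wise confidence level.
result = pairwise_tukeyhsd(endog=df["cost"], groups=df["size"], alpha=0.05)
print(result)   # reports the mean difference and significance flag for each pair of sizes
```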
Tables 6 and 7 illustrate the training and holdout experiments' Tukey's post hoc pairwise comparison tests for the three data distributions. The results indicate that exponential distributions are an easy problem domain for NNs. When compared to the normal and uniform distributions, NNs do not overfit the training data for the exponential distribution. Overfitting is more likely to happen when the training and holdout examples are drawn from uniform distributions.

Table 6: Tukey's post hoc test pairwise |difference-in-means| comparisons for data distribution in training experiments

          Exponential   Normal
Normal    25.88**
Uniform   19.46**       6.42**
** significant at 95%; μ_exponential = 47.92; μ_uniform = 67.38; μ_normal = 73.80; Sample Size = 1080

Table 7: Tukey's post hoc test pairwise |difference-in-means| comparisons for data distribution in holdout experiments

          Exponential   Normal
Normal    24.14**
Uniform   34.42**       10.29**
** significant at 95%; μ_exponential = 114.00; μ_uniform = 148.42; μ_normal = 138.13; Sample Size = 1080

Tables 8 and 9 illustrate the post hoc pairwise difference-in-means comparisons for MCA in the training and holdout experiments. As expected, the mean misclassification costs were lower when the MCA matrix was equal and higher when the MCA matrix was unequal. All three pairwise comparisons of means were significant for both the training and holdout experiments.

Table 8: Tukey's post hoc test pairwise |difference-in-means| comparisons for MCA in training experiments

          Mild-Unequal   Equal
Equal     12.27*
Unequal   23.17*         35.45*
* significant at 95%; μ_equal = 47.13; μ_mild-unequal = 59.40; μ_unequal = 82.57; Sample Size = 1080

Table 9: Tukey's post hoc test pairwise |difference-in-means| comparisons for MCA in holdout experiments

          Mild-Unequal   Equal
Equal     31.04**
Unequal   53.00**        84.04**
** significant at 95%; μ_equal = 95.16; μ_mild-unequal = 126.19; μ_unequal = 179.20; Sample Size = 1080

VI. SUMMARY AND CONCLUSIONS

We proposed a threshold varying bisection method for cost sensitive classification in a NN. Using simulated data, different misclassification cost asymmetries and different NN architectures, we tested the BNN and compared it with the traditional constant threshold NN. Our results indicate that the proposed BNN is very robust and outperforms the traditional constant threshold NN, and that NN classifiers generally fare well when the data distributions are exponential and the number of hidden nodes is twice the number of inputs. Our procedure uses the bisection method for varying the threshold. It is possible to replace the bisection method with other threshold varying procedures, such as a Fibonacci search.

In our study, we did not compare the threshold varying NN with cost sensitive probabilistic NNs or other cost sensitive NNs that use RMS penalty factors [14]. Thus, it is difficult to highlight the performance advantages of the threshold varying NN against other competing NN architectures. However, for large datasets, we believe that the threshold varying procedure may have a low memory requirement advantage over a cost sensitive PNN, because a cost sensitive PNN has to keep all the training examples in memory.
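As noted above, the bisection step could in principle be swapped for another interval-based search such as a Fibonacci-type procedure. The golden section sketch below illustrates that idea under the same assumptions as the earlier bisection sketch (a Miss(λ) cost routine and the interval [0.3, 0.7]); it is not a procedure evaluated in the paper.

```python
import math

def golden_section_threshold(miss, alpha=0.3, beta=0.7, eps=0.01):
    """Golden section search for a cost-minimizing threshold lambda in [alpha, beta].

    miss: callable returning the total misclassification cost for a threshold.
    Assumes the cost is roughly unimodal on the interval, which Lemma 1 warns
    need not hold globally; hence the narrow interval around 0.5.
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0     # 1 / golden ratio, about 0.618
    a, b = alpha, beta
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while (b - a) > eps:
        if miss(c) <= miss(d):                 # minimum lies in [a, d]
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:                                  # minimum lies in [c, b]
            a, c = c, d
            d = a + inv_phi * (b - a)
    return (a + b) / 2.0

# Usage (miss_cost as defined in the earlier sketch):
# lam_star = golden_section_threshold(lambda lam: miss_cost(outputs, labels, lam, cost_fp=1, cost_fn=2))
```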
REFERENCES

[1] K. E. Atkinson, An Introduction to Numerical Analysis, New York: John Wiley and Sons, 1993.
[2] P. A. Benedict, "The use of synthetic data in dynamic bias selection," Proceedings of the 6th Aerospace Applications of Artificial Intelligence Conference, Dayton, Ohio, 1990, pp. 34-42.
[3] V. L. Berardi and G. P. Zhang, "The effect of misclassification costs on neural network classifiers," Decision Sciences, vol. 30, pp. 659-682, 1999.
[4] S. Bhattacharyya and P. C. Pendharkar, "Inductive, evolutionary and neural techniques for discrimination: A comparative study," Decision Sciences, vol. 29, pp. 871-900, 1998.
[5] W. Y. Huang and R. P. Lippmann, "Comparisons between neural net and conventional classifiers," IEEE International Conference on Neural Networks, San Diego, CA, 1987, pp. 485-493.
[6] R. P. Lippmann, "An introduction to computing with neural nets," IEEE ASSP Magazine, vol. 4, pp. 2-22, 1987.
[7] E. Patuwo, M. Y. Hu, and M. S. Hung, "Two group classification using neural networks," Decision Sciences, vol. 24, pp. 825-846, 1993.
[8] P. C. Pendharkar, "Hybrid approaches for classification under information acquisition cost constraint," Decision Support Systems and Electronic Commerce, in press, 2004.
[9] P. C. Pendharkar, "A threshold-varying artificial neural network approach for classification and its application to bankruptcy prediction problem," Computers & Operations Research, in press, 2004.
[10] P. C. Pendharkar and J. A. Rodger, "An empirical study of impact of crossover operators on the performance of non-binary genetic algorithm based neural approaches for classification," Computers & Operations Research, vol. 31, pp. 481-498, 2004.
[11] P. C. Pendharkar and S. Nanda, "A misclassification cost minimizing evolutionary-neural classification approach," Working Paper, no. 03-6, Penn State Capital College, 2003.
[12] P. C. Pendharkar, "A computational study on the performance of ANNs under changing structural design and data distributions," European Journal of Operational Research, vol. 138, pp. 155-177, 2002.
[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in D. E. Rumelhart and R. J. Williams (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Cambridge, MA: MIT Press, 1986.
[14] E. W. Saad, D. V. Prokhorov, and D. C. Wunsch, "Comparative study of stock trend prediction using time delay, recurrent and probabilistic neural networks," IEEE Transactions on Neural Networks, vol. 9, pp. 1456-1470, 1998.
[15] L. M. Salchenberger, E. M. Cinar, and N. A. Lash, "Neural networks: A new tool for predicting thrift failures," Decision Sciences, vol. 23, pp. 899-916, 1992.
[16] C. D. Vale and V. A. Maurelli, "Simulating multivariate non-normal distributions," Psychometrika, vol. 48, pp. 465-471, 1983.
[17] D. West, S. Dellana, and J. Qian, "Neural network ensemble strategies for financial decision applications," Computers & Operations Research, in press, 2004.
[18] P. R. Yarnold and R. C. Soltysik, "Theoretical distributions of optima for univariate discrimination of random data," Decision Sciences, vol. 22, pp. 739-752, 1991.
