Abstract-- We propose a bisection method for varying the classification threshold value for cost sensitive neural network learning. Using simulated data and different cost asymmetries, we test the proposed threshold varying bisection method and compare it with the traditional fixed-threshold method based neural network learning. The results of our experiments illustrate that the proposed threshold varying bisection method performs better than the traditional fixed-threshold method.

I. INTRODUCTION

Recently, there has been considerable interest in the development and application of neural networks (NNs) for classification problems [10][8][9][3]. Further, a stream of literature has suggested that NN learning is sensitive to misclassification cost [3], and a few researchers have indicated the sensitivity of NN results to the cut-off threshold value, data distributions and NN architecture [15][4][12][11]. The prior research on cost sensitive learning using NNs has used penalty factors for minimizing root-mean-square (RMS) error [14]. A few researchers have used probabilistic NNs (PNNs) for cost sensitive learning. However, to our knowledge, there are no studies that have investigated the dynamic adjustment of the cut-off threshold to improve NN learning in an asymmetric misclassification cost environment. This paper attempts to fill this gap by proposing a bisection method that dynamically selects a classification threshold so that a NN learns to minimize misclassification cost.

The motivation for our research comes from the fact that a marginal improvement in misclassification cost can result in substantial savings. For example, West et al. [17] note that, in the US, the outstanding level of consumer debt is about $1.5 trillion, with high interest credit card loans comprising $568.4 billion. Further, more than 4% of credit card loans are delinquent every year. Thus, a classification system that reduces the misclassification cost can reduce the percentage of delinquent cases and result in substantial savings. Among several different statistical and machine learning classification systems, several researchers have shown that NNs outperform traditional classification systems, such as Logit and statistical discriminant analysis [4][15]. Thus, we focus on improving NNs for minimizing misclassification cost.

The rest of the paper is organized as follows. First, we provide an overview of the NN for classification. Next, we describe the bisection method based procedure for varying the threshold, which is followed by the experimental design, data description, and results of our experiments. Finally, we conclude the paper with a summary.

II. OVERVIEW OF NN FOR CLASSIFICATION

For classification problems, several NN architectures are available. Among the popular architectures are probabilistic neural networks and multi-layer feed forward NNs (MLFFNN). In this research, we use a MLFFNN for cost sensitive learning. For classification problems with k classes, a MLFFNN learns a classification function of type f: X -> A^k, where X represents an instance space of training cases and A^k represents the set of k-dimensional vectors containing exactly one element with value 1 and all others with value 0. Assume a data set S = {(x_1, s_1), ..., (x_a, s_a)} of a examples, where x_p is an n-dimensional vector of decision-making attributes and s_p, p in {1, ..., a}, is the known observed value of f(x_p). Assuming f(.) is a logistic function, k = 2 and a popular three-layer architecture [15] with five inputs (n = 5), five nodes in the hidden layer and one node in the output layer, a MLFFNN, illustrated in Figure 1, can be used for classification. Assume that w_ij is the connection weight for the connection from the ith input node to the jth hidden node and q_j is the connection weight from the jth hidden node to the output node. For any given example p, the elements of the vector x_p can be represented as {x_p1, ..., x_pn}. Since f(.) is a logistic function, the output O_p for a given example can be written as follows.

O_p = 1 / (1 + exp(-SUM_{j=1..5} q_j h_pj)), where h_pj = 1 / (1 + exp(-SUM_{i=1..5} w_ij x_pi)) is the output of the jth hidden node.
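The three-layer forward pass above, together with the threshold search the bisection procedure performs, can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names (`forward`, `total_cost`, `bisect_threshold`) and the interval-halving rule (retaining the half whose quarter-point gives the lower total misclassification cost) are our assumptions, since the opening of the bisection-method section is missing from this copy; the interval [0.3, 0.7], the tolerance ε = 0.01, and the resulting six iterations follow the values reported later in the paper.

```python
import math

def logistic(z):
    """Logistic activation f(.) used throughout the MLFFNN."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, q):
    """Output O_p of a three-layer MLFFNN for one example x.

    w[i][j] is the weight from input node i to hidden node j;
    q[j] is the weight from hidden node j to the single output node.
    """
    hidden = [logistic(sum(x[i] * w[i][j] for i in range(len(x))))
              for j in range(len(q))]
    return logistic(sum(q[j] * hidden[j] for j in range(len(q))))

def total_cost(threshold, outputs, labels, c_fp, c_fn):
    """Total misclassification cost when O_p >= threshold predicts class 1."""
    cost = 0.0
    for o, s in zip(outputs, labels):
        pred = 1 if o >= threshold else 0
        if pred == 1 and s == 0:
            cost += c_fp   # false positive
        elif pred == 0 and s == 1:
            cost += c_fn   # false negative
    return cost

def bisect_threshold(outputs, labels, c_fp, c_fn, a=0.3, b=0.7, eps=0.01):
    """Bisection-style search over [a, b] for a cost-minimizing threshold.

    Assumed halving rule: compare the cost at the midpoints of the two
    halves and keep the cheaper half. The iteration count follows
    ceil(ln((b - a)/eps) / ln 2), which is 6 for the default values.
    """
    n_iter = math.ceil(math.log((b - a) / eps) / math.log(2))
    for _ in range(n_iter):
        mid = (a + b) / 2.0
        left_mid = (a + mid) / 2.0
        right_mid = (mid + b) / 2.0
        if total_cost(left_mid, outputs, labels, c_fp, c_fn) <= \
           total_cost(right_mid, outputs, labels, c_fp, c_fn):
            b = mid
        else:
            a = mid
    return (a + b) / 2.0
```

In use, the network outputs O_p for the training examples would be computed with `forward` after backpropagation training, and `bisect_threshold` would then select the classification cut-off under the given cost asymmetry.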
memberships are given in Table 1, the unit misclassification cost for each group, and values of θ as 1, 2, 3, and 4 respectively, it can be shown that the total misclassification costs for the different values of θ will be 2, 1, 2 and 1 respectively.

Table 1: Sample data and their group memberships

             Obs. 1   Obs. 2   Obs. 3   Obs. 4
Variable A     1        2        3        4
Gp. Mem.     Grp. 1   Grp. 2   Grp. 1   Grp. 2

A reader may plot θ against total misclassification cost to check the non-convexity of the misclassification cost function. In addition to the misclassification cost function being non-convex, it can be seen that multiple optima exist. The existence of multiple optima for unit misclassification cost has been noted previously [18].

Given that multiple optima are likely to exist, it is important that the interval [a, b] is chosen so that it is close to 0.5. In our experiments, we found that the interval [0.3, 0.7] works better than the interval [0.1, 0.9]. A smaller interval precludes the procedure from obtaining a threshold that is farther from 0.5. Thresholds that are too close to 0.1 or 0.9 work well for training data, but the holdout sample performance of these thresholds is usually not good. Thus, a threshold interval of [0.3, 0.7] works well for both training and holdout samples. We use the value of ε = 0.01 in our experiments. According to Atkinson [1], the number of iterates, n, required for convergence of the bisection procedure is given by the following expression.

n = ln((b - a)/ε) / ln 2

Using the proposed interval and the value of ε, the number of iterations for convergence, in our experiments, is equal to six.

IV. EXPERIMENTAL DESIGN AND DATA

Our experimental design is governed by a few factors that are known to impact the performance of NNs. Among these factors are the design of a NN and the training data characteristics [7][4]. The NN design factors consist of the learning algorithm used to learn connection weights, the architecture of a NN, and the NN learning parameters (learning rates, stopping criterion, error tolerance, etc.). The training data set characteristics include the training data distribution and the training data bias [7][4][2].

The architecture of an NN consists of either a two-layer NN or a three-layer NN. A two-layer NN is suitable for learning linear or quadratic classification functions [6]. However, a three-layer NN can learn any desired classification function. It is possible to have more than three layers, but the literature suggests that having more than three layers may not result in significant performance improvements [5].

If a three-layer NN is chosen for learning a classification function, then a decision-maker has to select the number of hidden nodes in the hidden layer [7][12]. Increasing the number of hidden nodes in an NN increases the training performance of the NN, but often results in poor generalization [12][7]. For n inputs, Pendharkar [12] suggests trying at least three different configurations, with the number of hidden nodes equal to n+1, 2n+1, and 3n+1 respectively.

Several studies have investigated the performance of NNs under changing data distributions and report that NNs generally fare well under non-parametric distributions [4][12][11][10][9][7]. Further, given that NNs are prone to overfitting the training data, improved training performance is expected when training data are biased, that is, when more than 50% of the examples belong to any one particular class. Additionally, when misclassification costs are not equal, a few researchers have observed that a NN is sensitive to the misclassification cost matrices [3][11].

Thus, our experimental design consists of the following five factors.
1. Technique (T): we use two techniques; one is the traditional NN that uses a fixed threshold for classification, and the other is the bisection method based threshold varying NN (BNN).
2. Size (S): the size of a NN is determined by the number of hidden nodes in the hidden layer. We use three different sizes: small, where the number of hidden nodes is equal to the number of inputs; medium, where the number of hidden nodes is equal to two times the number of inputs; and large, where the number of hidden nodes is equal to three times the number of inputs.
3. Data Distribution (DD): we consider three different data distributions for the two classes. These three data distributions are the exponential data distribution, the normal data distribution and the uniform data distribution.
4. Data Bias (DB): data bias represents the ratio of the number of examples belonging to class 1 to the number of examples belonging to class 2. We consider two data biases. The first type of data bias is equal, where the number of examples belonging to class 1 is equal to the number of examples belonging to class 2. The second type of data bias is unequal, where there is a 70:30 ratio of the number of examples belonging to class 1 to the number of examples belonging to class 2.
5. Misclassification Cost Asymmetry (MCA): misclassification cost asymmetry is the ratio of the misclassification cost of false positives to false
negatives. We use three MCA matrices. The first matrix is equal MCA, where the ratio between the misclassification costs of false positives and false negatives is one. The second matrix is mild unequal MCA, where the ratio between the misclassification costs of false positives and false negatives is 0.5. The third matrix is unequal MCA, where the ratio between the misclassification costs of false positives and false negatives is 0.25.

We use a five-factor analysis of variance (ANOVA) and simulated data to test the impact of the aforementioned five factors on the total misclassification cost.

We use part of the data from Pendharkar [8] and generate additional data using the methodology described in Vale and Maurelli [16]. Our final data set contained sixty data samples for a given data distribution. We considered three different data distributions: the exponential distribution, the uniform distribution and the normal distribution. Each data sample consisted of 300 examples of ten independent attributes (x_1, x_2, ..., x_10). For a given data distribution, the first 30 samples had an equal data bias, where all the examples were equally split between the two groups, and the next 30 samples had an unequal data bias, where 210 examples belonged to group 1 and the remaining 90 examples belonged to group 2. For all sixty data samples for all the data distributions, the mean for group 1 was approximately set equal to zero, and the mean of group 2 was approximately set equal to 0.5. The standard deviation for all the attributes was approximately set equal to one, and the skewness was set equal to zero.

V. EXPERIMENTS AND RESULTS

For our experiments, we used bootstrap sampling and conducted 60 different training and holdout tests for each of the three data distributions. We use the traditional backpropagation algorithm for learning connection weights. After initial experimentation, the learning rate and stopping criterion for the backpropagation algorithm were fixed and set to 0.1 and 5,000 learning iterations respectively. Tables 2 and 3 provide the overall ANOVA summary results for both the training and holdout sample tests.

Table 2: The Overall ANOVA Summary Table for Training Experiments

Source    Type III SS    Mean Sq.      F Val.   Sig.
Model     1869953.74     233744.22     307.44   0.00*
T          196373.51     196373.51     258.29   0.00*
S           45451.13      22725.56      29.89   0.00*
DD         392252.71     196126.36     257.96   0.00*
DB         535973.95     535973.95     704.96   0.00*
MCA        699902.44     349951.22     460.29   0.00*
R-Squared 0.432; * significant at 99%

Table 3: The Overall ANOVA Summary Table for Holdout Experiments

Source    Type III SS    Mean Sq.      F Val.   Sig.
Model     6298161.90     787270.24     223.80   0.00*
T          182640.21     182640.21      51.92   0.00*
S           45881.45      22940.73       6.52   0.00*
DD         674485.44     337242.72      95.87   0.00*
DB        1494423.78    1494423.78     424.82   0.00*
MCA       3900731.02    1950365.51     554.43   0.00*
R-Squared 0.357; * significant at 99%

Both the training and holdout sample results indicate that all five factors play a significant role in explaining the variance in misclassification cost. The F values indicate that MCA and DB play the most important role in explaining the variance in the misclassification cost and that S has the least significant impact. The misclassification costs were significantly lower when the technique used was BNN or when the data bias was unequal. Since the degrees of freedom for S, DD and MCA were more than one, we conducted post hoc pairwise comparison tests using Tukey's pairwise comparison of means technique.

Tables 4 and 5 illustrate Tukey's post hoc pairwise comparison of means tests on NN size for the training and holdout experiments. The symbol μ represents the mean misclassification cost value for the particular size mentioned in the subscript. The results indicate that the large NN has a tendency to overfit the training data, as the misclassification cost for the large NN is lowest for the training experiments and highest for the holdout experiments. Both the small and medium sizes performed well in our experiments. This finding is consistent with Bhattacharyya and Pendharkar's [4] finding that suggests that medium size NNs work well for classification problems.

Table 4: Pairwise |difference-in-means| comparisons using Tukey's post hoc test for NN size for training experiments

          Small    Medium
Medium    6.06**
Large     8.99**   2.93**
** significant at 95%; μ_small = 68.05; μ_medium = 61.99; μ_large = 59.06; Sample Size = 1080

Table 5: Pairwise |difference-in-means| comparisons using Tukey's post hoc test for NN size for holdout experiments

          Small    Medium
Medium    1.61
Large     8.67**   7.05**
** significant at 95%; μ_small = 130.09; μ_medium = 131.70; μ_large = 138.76; Sample Size = 1080

Tables 6 and 7 illustrate the training and holdout experiments' Tukey's post hoc pairwise comparison tests
for the three data distributions. The results indicate that exponential distributions are an easy problem domain for NNs. When compared to the normal and uniform distributions, NNs do not overfit the training data for the exponential distribution. Overfitting is more likely to happen when the training and holdout examples are drawn from uniform distributions.

Table 6: Tukey's post hoc test pairwise |difference-in-means| comparisons for data distribution in training experiments

            Exponential   Normal
Normal        25.88**
Uniform       19.46**      6.42**
** significant at 95%; μ_exponential = 47.92; μ_uniform = 67.38; μ_normal = 73.80; Sample Size = 1080

Table 7: Tukey's post hoc test pairwise |difference-in-means| comparisons for data distribution in holdout experiments

            Exponential   Normal
Normal        24.14**
Uniform       34.42**     10.29**
** significant at 95%; μ_exponential = 114.0; μ_uniform = 148.42; μ_normal = 138.13; Sample Size = 1080

Tables 8 and 9 illustrate the post hoc pairwise difference-in-means comparisons for MCA in the training and holdout experiments. As expected, the mean misclassification costs were lower when the MCA matrix was equal and higher when the MCA matrix was unequal. All three pairwise comparisons in means were significant for both the training and holdout experiments.

Table 8: Tukey's post hoc test pairwise |difference-in-means| comparisons for MCA in training experiments

VI. CONCLUSION

In this paper, we proposed a bisection method based threshold varying NN (BNN). Using simulated data, different misclassification cost asymmetries and different NN architectures, we tested the BNN and compared it with the traditional constant threshold NN. Our results indicate that the proposed BNN is very robust and outperforms the traditional constant threshold NN, and that NN classifiers generally fare well when the data distributions are exponential and the number of hidden nodes is twice the number of inputs. Our procedure uses the bisection method for varying the threshold. It is possible to replace the bisection method with other threshold varying procedures such as Fibonacci search.

In our study, we did not compare the threshold varying NN with the cost sensitive probabilistic NN or with other cost sensitive NNs that use RMS penalty factors [14]. Thus, it is difficult to highlight the performance advantages of the threshold varying NN against other competing NN architectures. However, for large datasets, we believe that the threshold varying procedure may have a low memory requirement advantage over the cost sensitive PNN, because a cost sensitive PNN has to keep all the training examples in memory.

REFERENCES

[1] K. E. Atkinson, An Introduction to Numerical Analysis, New York: John Wiley and Sons, 1993.
[2] P. A. Benedict, "The use of synthetic data in dynamic bias selection," Proceedings of the 6th Aerospace Applications of Artificial Intelligence Conference, Dayton, Ohio, 1990, pp. 34-42.
[3] V. L. Berardi, and G. P. Zhang, "The effect of misclassification costs on neural network classifiers," Decision Sciences, vol. 30, pp. 659-682, 1999.
[4] S. Bhattacharyya, and P. C. Pendharkar, "Inductive, Evolutionary and Neural Techniques for Discrimination: A Comparative Study," Decision Sciences, vol. 29, pp. 871-
of non-binary genetic algorithm based neural approaches for classification," Computers & Operations Research, vol. 31, pp. 481-498, 2004.
[11] P. C. Pendharkar, and S. Nanda, "A misclassification cost minimizing evolutionary-neural classification approach," Working Paper, no. 03-6, Penn State Capital College, 2003.
[12] P. C. Pendharkar, "A computational study on the performance of ANNs under changing structural design and data distributions," European Journal of Operational Research, vol. 138, pp. 155-177, 2002.
[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in D. E. Rumelhart and R. J. Williams (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Cambridge: MIT Press, 1986.
[14] E. W. Saad, D. V. Prokhorov, and D. C. Wunsch, "Comparative Study of Stock Trend Prediction Using Time Delay, Recurrent and Probabilistic Neural Networks," IEEE Transactions on Neural Networks, vol. 9, pp. 1456-1470, 1998.
[15] L. M. Salchenberger, E. M. Cinar, and N. A. Lash, "Neural networks: A new tool for predicting thrift failures," Decision Sciences, vol. 23, pp. 899-916, 1992.
[16] C. D. Vale, and V. A. Maurelli, "Simulating multivariate non-normal distributions," Psychometrika, vol. 48, pp. 465-471, 1983.
[17] D. West, S. Dellana, and J. Qian, "Neural network ensemble strategies for financial decision applications," Computers & Operations Research, in press, 2004.
[18] P. R. Yarnold, and R. C. Soltysik, "Theoretical distributions of optima for univariate discrimination of random data," Decision Sciences, vol. 22, pp. 739-752, 1991.