The Computer Journal Advance Access published February 1, 2012

The Author 2012. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved.
For Permissions, please email: journals.permissions@oup.com
doi:10.1093/comjnl/bxs001

Nearest Neighbor Classifier Based on Nearest Feature Decisions

Alex Pappachen James1 and Sima Dimitrijev2
1 Machine Intelligence Group, School of Computer Science, Indian Institute of Information Technology and Management, Kerala, India
2 Queensland Micro- and Nanotechnology Centre and Griffith School of Engineering, Griffith University, Nathan, Australia
Corresponding author: apj@ieee.org

High feature dimensionality of realistic datasets adversely affects the recognition accuracy of nearest neighbor (NN) classifiers. To address this issue, we introduce a nearest feature classifier that shifts the NN concept from the global-decision level to the level of individual features. Performance comparisons with 12 instance-based classifiers on 13 benchmark University of California Irvine classification datasets show average improvements of 6% and 3.5% in recognition accuracy and area under the curve performance measures, respectively. The statistical significance of the observed performance improvements is verified by the Friedman test and by the post hoc Bonferroni–Dunn test. In addition, the application of the classifier is demonstrated on face recognition databases, a character recognition database and medical diagnosis problems for binary and multi-class diagnosis on databases including morphological and gene expression features.

Keywords: nearest neighbors; classification; local features; local ranking

Received 2 September 2011; revised 3 December 2011
Handling editor: Ethem Alpaydin

The Computer Journal, 2012
Downloaded from http://comjnl.oxfordjournals.org/ at Selcuk University on January 9, 2015

1. INTRODUCTION

Automatic classification of patterns has been continuously and rigorously investigated for the last 30 years. Simple classifiers, based on the nearest neighbor (NN) principle, have been used to solve a wide range of classification problems [1–5]. NN classification works by calculating global distances between patterns, followed by ranking to determine the NNs that best represent the class of a test pattern. Usually, distance metrics are used to compute the distances between feature vectors. The accuracy of the calculated distances is affected by the quality of the features, which can be degraded by natural variability and measurement noise. Furthermore, some distance calculations are affected by falsely assumed correlation between different features. For example, the Mahalanobis distance will include comparisons between poorly correlated or uncorrelated features. This problem is more pronounced when the number of features in a pattern is very large because the irrelevant distance calculations can accumulate to a large value (for example, there will be many false correlations in gene expression data, which can have a dimensionality higher than 10^4 features). In addition to this problem, a considerable increase in dimensionality complicates the classifier implementation, resulting in the curse of dimensionality, where convergence to a classification solution becomes very slow and inaccurate [6, 7]. The conventional solution to these problems is to rely on feature extraction and feature selection methods [8–10]. However, the unpredictability of natural variability in patterns makes feature-specific processing inapplicable to diverse pattern-recognition problems. Another approach to improving classifier performance is to use machine learning techniques to learn the distance metrics [11–13]. These methods attempt to reduce the inaccuracies that occur with distance calculations. However, this solution tends to involve optimization problems that suffer from high computational complexity and require reduced feature dimensionality, resulting in low accuracies when the feature vectors are highly dimensional and the number of intra-class gallery objects is low. Learning distance metrics can completely fail in high- and ultra-high-dimensional databases, where the relevance and redundancy of features often become impossible to trace even with feature weighting or selection schemes. Owing to these reasons, performance improvement of the NN classifier in tasks with high-dimensional data remains an open research problem.
In this work, we propose a simple classifier that applies the NN concept at the level of individual features. We demonstrate that the proposed framework is general enough to be applied to different pattern recognition problems involving feature vectors with both low and high dimensionality.

2. RELATED WORK

The accuracy of NN classifiers can depend very strongly on the specific selection of distance measures. Considerable research aimed at improving the NN methods has focused on finding new ways to determine distances by automatic learning and embedding techniques [11–16]. However, defining an accurate distance measure for an NN classification task is considered challenging and perhaps impossible in realistic classification problems. Learning a distance metric through global distances would require various levels of parametric optimization, which are often sensitive to inaccuracies and minor changes in highly dimensional data. The classification performance of such learning methods, when using highly dimensional data, is poor and not better than that of the basic k-NN classifier.
Several attempts have been made to improve the performance of NN classifiers: assigning different weights to every NN [17–19], reducing the effect of outliers by using a nearest local mean vector [19], selecting a subset of weighting features according to their discriminatory power [17, 20], optimizing the parameter k for NN classification by lowest estimated error rate [21], reducing implementation complexity by using Kd-trees and hashing functions [21–23], generalizing the basic k-NN classifier and introducing feature voting that considers each feature separately [24, 25]. It should be noted that the use of complex techniques for improving the NN classifier negates the simplicity of the basic classifier, which is one of its most important advantages. Furthermore, the use of learning and embedding techniques to improve classifier performance results in loss of generality and makes real-time implementations impractical. As distinct from such approaches, we propose a simple technique that is easy to implement, applicable to real-time tasks and that shows robust recognition performance across several databases.

3. NEAREST FEATURE APPROACH TO NN CLASSIFICATION

The main idea in the new approach is to extend the decision making in the NN classifier from the global level to the level of individual features. The result of the decision-making step at the global level is the classification of a number of objects as the NNs. Analogously, the result of the decision-making step at the feature level is the classification of a number of features as the nearest features (NFs). These local decisions (feature-level decisions) are then processed to make the global decision for the usual NN classification. Accordingly, the new approach can be described as the NN classifier based on NF decisions. For convenience, we shall also use a simplified term in this paper: NF classifier.
To explain the feature-level decisions in specific terms, assume that each object has M features and that the features are labeled by a number i = 1, ..., M. A comparison between two objects is performed so as to calculate M feature-to-feature distances that correspond to each of the object features. The extension of the decision making to the feature level means that the features corresponding to distances smaller than a threshold distance $\theta_i$ are defined as the NFs. In effect, the role of the parameter k at the global level of the k-NN classifier is played by the thresholds $\theta_i$ at the feature level. This means that there could be as many $\theta_i$ values to be optimized as the number of features. The selection of different values of $\theta_i$ for different features is not a desirable practical solution, especially when the dimensionality of feature vectors is high. To avoid such a complex and inefficient implementation, the statistics of the ith feature of all inter-class gallery objects are utilized to set a unique criterion that can be used to calculate all the threshold values $\theta_i$, which ensures that the NF classifier retains the single-parameter characteristic of the k-NN classifier. The unique threshold criterion is set as $\theta_i = w\sigma_i$, where $\sigma_i$ is the standard deviation of the inter-class feature distances associated with the ith feature and w is a unique weight parameter. The number of NFs per object is then used for the global similarity calculation, the ranking of the test objects and the global decision.
We shall use a simple example, based on the Iris plant database [25], to introduce the specific steps needed to make the NF decisions and then to use these decisions in the NN classifier. A numerical example using the Iris plant database is given in Table 1. In this example, the task is to identify the unknown class of a test object Z by comparing it with a gallery set of objects: $G \in \{S_1, S_2, VC_1, VC_2, VC_3, V_1, V_2\}$. The gallery consists of objects from three classes: (1) Setosa, (2) Versicolor and (3) Virginica. Each object has four features: (1) sepal length ($g_1$), (2) sepal width ($g_2$), (3) petal length ($g_3$) and (4) petal width ($g_4$). It can be seen from Table 1 that the gallery set is formed of two objects from the Setosa class, three objects from the Versicolor class and two objects from the Virginica class. For validation of the recognition results, it is known that the test object Z used in this example belongs to the Virginica class.

3.1. Calculation of threshold values, $\theta_i$

The idea of NFs is implemented as a conversion of the feature-to-feature distances ($d_i$) to NF decisions, labeled as NF votes ($v_i$) in Table 1. This conversion requires threshold values $\theta_i$, which are obtained from the standard deviations of feature-to-feature inter-class distances.
Consider a feature vector $G = \{g_i, i \in [1, M], M \in \text{Integer}\}$ in a training set having a class label c and feature number i. We can define the inter-class distances of a feature


TABLE 1. The proposed nearest feature vote calculation to identify the class of unknown object Z from Setosa, Versicolor and Virginica classes in the Iris plant database.

                  Feature values (cm)      Feature-to-feature distances (cm)   NF votes, v_i (w = 1)                                   Object
Class label       g1    g2    g3    g4     d1    d2    d3    d4                θ1 (0.4 cm)  θ2 (0.4 cm)  θ3 (1.4 cm)  θ4 (0.6 cm)  similarity score
Setosa S1         5.1   3.5   1.4   0.2    1.4   0.5   4.4   2.0               0            0            0            0            0
Setosa S2         4.9   3.0   1.4   0.2    1.6   0.0   4.4   2.0               0            1            0            0            1
Versicolor VC1    52    3.5   1.0   1.5    1.0   2.3   1.2   0.0               0            0            0            0            0
Versicolor VC2    5.9   3.0   4.2   1.5    0.6   0.0   1.6   0.7               0            1            0            0            1
Versicolor VC3    5.2   2.7   3.9   1.4    1.3   0.3   1.9   0.8               0            1            0            0            1
Virginica V1      6.3   3.3   6.0   2.5    0.2   0.3   0.2   0.3               1            1            1            1            4
Virginica V2      5.8   2.7   5.1   1.9    0.7   0.3   0.7   0.3               0            1            1            1            3
Unknown Z         6.5   3.0   5.8   2.2

$g_i(c)$ from a class $c$ within the training set as

$$ d_i(c, \bar{c}) = |g_i(c) - g_i(\bar{c})|, \qquad (1) $$

where $c \neq \bar{c}$ and $(c, \bar{c}) \in \text{ClassLabel}$. For example, consider the objects $S_1$ and $V_1$, which belong to two different classes. The feature-to-feature distance between these inter-class objects is defined as the inter-class distance $d_i(S_1, V_1) = |g_i(S_1) - g_i(V_1)|$. The use of the gallery objects in Table 1 results in a total of 16 $d_i$ values per feature. These 16 distances comprise 6 inter-class comparisons between objects from Setosa and Versicolor, 6 inter-class comparisons between objects from Versicolor and Virginica and 4 inter-class comparisons between objects from Setosa and Virginica.
The histogram of the inter-class distances for a feature i can be obtained by a function $q_i$ that counts the number of observations that fall into each of the disjoint categories called bins. The optimal bin width h of a histogram can be obtained by minimizing the cost function

$$ \min_h \frac{2\bar{q}_i - u_i}{h^2}, $$

where $\bar{q}_i = (1/k_1)\sum_{z=1}^{k_1} q_{i,z}$ and $u_i = (1/k_1)\sum_{z=1}^{k_1} (q_{i,z} - \bar{q}_i)^2$, and $k_1$ is the total number of bins. The standard deviation of the inter-class distance histogram, $\sigma_i = \sqrt{u_i}$, represents the relative measure of closeness of feature values from different objects in different classes for the ith feature in the feature vector G.
Definition 1. The standard deviation $\sigma_i = \sqrt{(1/k_1)\sum_{z=1}^{k_1} (q_{i,z} - \bar{q}_i)^2}$ of a distance histogram $q_i$ describes the relative inter-class closeness of the values of the ith feature.
In our example, the standard deviation $\sigma_i$ of the 16 feature-to-feature inter-class distances $d_i$ gives the measure of variability of the ith feature across the inter-class gallery objects. The numerical values of the standard deviation $\sigma_i$ obtained using 16 $d_i$ values per feature are as follows: $\sigma_1$ = 0.4 cm, $\sigma_2$ = 0.4 cm, $\sigma_3$ = 1.4 cm and $\sigma_4$ = 0.6 cm.
The threshold values $\theta_i$ can now be calculated as $\theta_i = w\sigma_i$, where the weight w is the unique optimization parameter. During the introduction of the new method, we will assume the value w = 1, as shown in Table 1.
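The threshold calculation of this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the toy gallery are hypothetical, and $\sigma_i$ is computed here directly as the standard deviation of the inter-class distances rather than from a binned histogram.

```python
from itertools import combinations

import numpy as np


def inter_class_distances(features, labels):
    """Eq. (1): |g_i(a) - g_i(b)| for every pair of gallery objects (a, b)
    that carry different class labels. Returns an array of shape (n_pairs, M)."""
    features = np.asarray(features, dtype=float)
    rows = [
        np.abs(features[a] - features[b])
        for a, b in combinations(range(len(labels)), 2)
        if labels[a] != labels[b]
    ]
    return np.array(rows)


def nf_thresholds(features, labels, w=1.0):
    """theta_i = w * sigma_i. ASSUMPTION: sigma_i is taken as the plain
    standard deviation of the inter-class distances of feature i."""
    distances = inter_class_distances(features, labels)
    return w * distances.std(axis=0)


# Hypothetical toy gallery: 2 + 3 + 2 objects from three classes, 2 features.
toy_labels = ["A", "A", "B", "B", "B", "C", "C"]
toy_features = [
    [5.0, 3.4], [4.8, 3.1],
    [5.9, 3.0], [5.6, 2.8], [6.1, 2.9],
    [6.4, 3.2], [6.7, 3.0],
]

distances = inter_class_distances(toy_features, toy_labels)
thresholds = nf_thresholds(toy_features, toy_labels, w=1.0)
print(distances.shape)  # one row per inter-class pair, one column per feature
print(thresholds)
```

With class sizes of 2, 3 and 2, the sketch produces 6 + 6 + 4 = 16 inter-class pairs per feature, the same count as in the Table 1 example.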
3.2. Determination of the NFs

The second step is to calculate the feature-to-feature distances, $d_i$, between the test object Z and each of the gallery objects. Table 1 shows the distance values for each of the four features and each gallery object (there are seven objects in the gallery, and so there are seven distances for each of the four features). When compared with the threshold values $\theta_i$, these distances are measures of closeness of each test feature to the corresponding gallery feature. This measure of closeness can be used to make a decision on the NNs in a classification or search task. The decision is essentially a process of thresholding.
Definition 2. A feature i, belonging to a set of M features (i = 1, ..., M), is classified as one of the NFs of a given gallery object if the distance between the feature values of the test and the gallery objects, $d_i$, is smaller than the corresponding threshold value $\theta_i = w\sigma_i$, where $\sigma_i$ is the standard deviation of the inter-class feature distances and w is a unique weight parameter.
As mentioned in the previous section, to illustrate the concept of NF detection based on the distances $d_i$ and the inter-class gallery thresholds $\theta_i = w\sigma_i$, we shall assume that w = 1. This means that the feature thresholds are $\theta_1$ = 0.4 cm, $\theta_2$ = 0.4 cm, $\theta_3$ = 1.4 cm and $\theta_4$ = 0.6 cm, as shown in Table 1.
Each of the graphs in Fig. 1a–d illustrates the comparisons of gallery and test values for each of the four features, sepal length, sepal width, petal length and petal width, respectively. The shaded region in each graph illustrates the corresponding inter-class gallery threshold $\theta_i$. It can be seen that the sizes of the shaded regions are different for each feature because the $\theta_i$ values are determined by the standard deviations $\sigma_i$ of the inter-class distances for each feature. The set of NFs for each gallery object is determined by the feature-to-feature distances that are within the tolerances defined by the threshold $\theta_i$. The features that fall within the shaded region are the nearest set of features. It can be seen that one NF from Virginica is detected for sepal length in Fig. 1a; two NFs from Virginica, two NFs from Versicolor and one NF from Setosa are detected for sepal width in Fig. 1b; and two NFs from Virginica are detected for petal length and petal width in Fig. 1c and d, respectively.
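The thresholding of Definition 2 can be checked numerically against Table 1. The sketch below is illustrative only: the helper name `nf_votes` is hypothetical, only a subset of the gallery rows of Table 1 is used, and a distance exactly equal to the threshold is counted as near, which is an assumption that does not affect this example.

```python
import numpy as np

# Thresholds theta_i for w = 1 and the test object Z, as listed in Table 1.
theta = np.array([0.4, 0.4, 1.4, 0.6])
z = np.array([6.5, 3.0, 5.8, 2.2])

# A subset of the gallery rows of Table 1 (feature values in cm).
gallery = {
    "S1": [5.1, 3.5, 1.4, 0.2],
    "S2": [4.9, 3.0, 1.4, 0.2],
    "VC2": [5.9, 3.0, 4.2, 1.5],
    "VC3": [5.2, 2.7, 3.9, 1.4],
    "V1": [6.3, 3.3, 6.0, 2.5],
    "V2": [5.8, 2.7, 5.1, 1.9],
}


def nf_votes(gallery_object, test_object, thresholds):
    """Return one 0/1 vote per feature: 1 if the feature-to-feature
    distance d_i is within the threshold theta_i, else 0."""
    d = np.abs(np.asarray(gallery_object, dtype=float) - test_object)
    return (d <= thresholds).astype(int)


votes = {name: nf_votes(g, z, theta) for name, g in gallery.items()}
# Number of NFs per gallery object (the object similarity score of Table 1).
scores = {name: int(v.sum()) for name, v in votes.items()}
print(scores)
```

For these rows the computed scores are 0, 1, 1, 1, 4 and 3, which agree with the object similarity scores listed in Table 1.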
The process of thresholding the distances to determine the NFs can be called local voting, where the vote values are

$$ v_i = \begin{cases} 1 & \text{if } d_i \le \theta_i, \\ 0 & \text{otherwise.} \end{cases} \qquad (2) $$

These votes represent the local decisions on the selection of NFs by numerical values and they are shown in Table 1.

FIGURE 1. Illustrations of the feature-to-feature comparisons between the test and gallery objects shown in Table 1 for the four features from the Iris plant database: (a) sepal length, (b) sepal width, (c) petal length and (d) petal width. The NFs are those that fall in the shaded regions, defined by the threshold distances $\theta_i$.

3.3. Global decisions

The votes representing the NFs for each object are added up to obtain the global similarity score for each of the objects, as illustrated in the column labeled as object similarity score in Table 1.
To make the global decision about the class of the test object Z, the top-rank method is attempted in the first instance. In this method, the object with the highest similarity score in each class (the top-ranked object in each class) is taken to represent the whole class. This means that the global similarity scores of the top-ranked objects in each class are taken as the values of the cumulative class similarity scores, $s_g(c)$, shown in the rank-1 columns in Table 2.
These similarity scores can be normalized into class likelihood similarity scores, $s_n$, by dividing by the maximum similarity score across the classes (Table 2). The decision on identifying the class c of the test object Z can be obtained by finding the class that has the maximum value of $s_n$ as follows:

$$ c = \arg\max_{c} s_n(c). \qquad (3) $$

Although the top-rank method is the simplest approach to make a global decision, there can be numerous situations when the top rank corresponds to multiple classes. An obvious addition that improves the global decisions is to take into account similarity values from lower ranks. To do this, the global similarity scores (shown in Table 1) of the first- and second-ranked objects in each class can be added to obtain the cumulative class similarity scores for rank 2 (shown in Table 2). This can be continued for rank 3, which is only applicable to the Versicolor class in this example. The cumulative class scores for each rank are then normalized into the class likelihood similarity scores, $s_n$, by dividing by the maximum similarity score across the classes (Table 2). So, if a prediction results in a tie after the top-rank classification, then $s_n$ values from the next lower rank can be added to represent the score for the class. By adopting lower ranks in score calculations, the indecisiveness of class selection resulting from ties can be resolved. Clearly, including lower ranks in this manner makes the global decision of the NN non-parametric. Furthermore, the maximum number of ranks that are available for a classification problem will be limited by the number of features in an object. This is because $s_n$ values originate from the integer scores that are limited by the total number of features in an object.
The use of classical majority voting [1, 2] can further improve the global decisions when the number of objects per class in the gallery is more than one. It should be noted that majority voting is only useful when there are one or more intra-class gallery objects that have an equal number of features with near-zero feature-to-feature distances. In classification problems, such situations occur more frequently for a large number of gallery objects with low-dimensional feature vectors and less frequently for a small number of gallery objects with high-dimensional feature vectors.

TABLE 2. Class similarity scores used for the global decision.

              Cumulative class similarity score     Class likelihood score
              for each rank, s_g(c)                 for each rank, s_n(c)
Class, c      Rank 1    Rank 2    Rank 3            Rank 1        Rank 2        Rank 3
Setosa        1         1         -                 1/4 = 0.25    1/7 = 0.14    -
Versicolor    1         2         2                 1/4 = 0.25    2/7 = 0.28    2/7 = 0.28
Virginica     4         7         -                 4/4 = 1.00    7/7 = 1.00    -

3.4. Method summary

The specific stages and steps of the NF classification algorithm are summarized in Algorithm 1.

Algorithm 1 NF classifier

Training Stage
Requires: A training set with M features per object, $G = \{g_i, i \in [1, M]\}$ (there is a total of c classes and a maximum of N objects per class).
Steps:
1. Calculate the inter-class feature-to-feature distances $d_i = |g_i(c) - g_i(\bar{c})|$, where $g_i(\bar{c})$ is a feature from any class other than c.
2. Calculate the standard deviation $\sigma_i$ from the histogram distribution of $d_i$.
3. Determine a threshold $\theta_i = w\sigma_i$, where w is found by maximization of the product of the inter-class and intra-class distance distributions originating from the training set.

Testing Stage
Requires: A training set with M features per object, $G = \{g_i, i \in [1, M]\}$ (there is a total of c classes and a maximum of N objects per class); a test object Z, whose class is unknown.
Steps:
1. Calculate the feature-to-feature distance $d_i = |g_i(c) - g_i(Z)|$, where $g_i(Z)$ is the ith feature of object Z and $g_i(c)$ is the ith feature of an object belonging to class c.
2. Find the nearest features by selecting those features whose distances $d_i$ fall within the threshold $\theta_i$. Each nearest feature of an object is assigned a vote $v_i = 1$.
3. Calculate the object similarity score of each training object by counting the total number of votes $v_i = 1$ (the nearest features) associated with the object.
4. Add the object similarity scores of the m top-ranked training objects in each class to obtain the cumulative class similarity score for the mth rank.
5. Normalize the cumulative class similarity scores by dividing by the rank-wise maximum score across classes to obtain the class likelihood scores $s_n$. The class that has the highest score $s_n$ at the top rank (m = 1) is assigned as the class of the object Z. In the event of an inconclusive decision for m = 1, $s_n$ for m = 2 is used for the decision.

4. AUTOMATIC THRESHOLD SELECTION

As specified in Definition 2, the NF decisions are based on the threshold value $\theta_i$, which is directly related to the feature values in the training set. Specifically, $\theta_i = w\sigma_i$, where $\sigma_i$ is the standard deviation of the inter-class distances of the ith feature in the training set. This means that the values of $\sigma_i$ are determined by the feature values in the database and the problem of optimizing $\theta_i$ becomes equivalent to optimizing the unique value of w. The possibility of optimizing the value of w is a new element in the proposed classification algorithm.
Consider the inter-class distances obtained from all the features in a training set as the values of a random variable X


with the probability density function $f_X(x)$ and the standard deviation $\sigma_e$. Analogously, consider the intra-class distances obtained from all the features as the values of a random variable Y with the probability density function $f_Y(y)$ and the standard deviation $\sigma_a$. Then $X/\sigma_e$ and $Y/\sigma_a$ are normalized inter-class and normalized intra-class distances, respectively, with the corresponding probability density functions $f_{X/\sigma_e}(x/\sigma_e)$ and $f_{Y/\sigma_a}(y/\sigma_a)$.
Given that $w = \theta_i/\sigma_i$, the weight w can be considered as the normalized threshold and can be related to the normalized distances $x/\sigma_e$ and $y/\sigma_a$. So, to investigate the impact of different values of w, we can consider the relationship between the inter-class and intra-class probability density functions for different values of the normalized distances.
As a numerical example, we can again use the Iris plant database that has 3 classes, 50 objects per class and 4 features per object. Assume that we want to perform a classification experiment on this dataset such that 25 objects from each class are used for the gallery and the remaining 25 objects from each class are used for testing. Only the objects from the gallery are used to determine the values of $\sigma_i$ and w. Accordingly, these objects are also referred to as the training set.
Figure 2a shows the inter-class and intra-class distributions of the normalized Euclidean distances for a training set of the Iris database. As expected, the inter-class distribution is shifted toward higher values of the normalized distance. The actual value of the separation between the inter- and intra-class distributions will be different in different databases. Obviously, larger separations will enable one to set a higher value of the normalized distance as the normalized threshold, or weight, w. The separation between the inter- and intra-class distributions can be quantified by the normalized distance that corresponds to the maximum of the product of the inter- and intra-class distributions, $f(x/\sigma_e)f(y/\sigma_a)$. This product for the Iris database is shown in Fig. 2b and this graph can be compared with the recognition accuracies, shown in Fig. 2c. It can be seen that the range of w values that corresponds to the highest recognition accuracies (Fig. 2c) is consistent with the range of w values that corresponds to the highest values of $f(x/\sigma_e)f(y/\sigma_a)$ (Fig. 2b). It can also be seen that the recognition accuracies in this range of w values are higher than or equal to the recognition accuracies of the NN classifier and that the variations of the recognition accuracies in this range are small. This means that any value of w from this range would be a good selection. However, to enable automatic selection of the value for w, we define the value of w that corresponds to the maximum of $f(x/\sigma_e)f(y/\sigma_a)$ as the normalized threshold $w = \theta_i/\sigma_i$.

FIGURE 2. (a) Inter-class and intra-class distributions of the normalized distances for a training set of the Iris database; (b) the product of inter- and intra-class distributions, illustrating that the normalized distance corresponding to the maximum value of this product is automatically selected as the normalized threshold, or weight parameter, w; (c) recognition accuracies for different values of w in comparison with the recognition accuracy of the NN classifier.

In the example of the Iris database, illustrated in Fig. 2, the value obtained by the described procedure for automatic selection is w = 0.9. This means that, for the ith feature, the threshold for the decision whether this feature would be classified as the NF is set as $\theta_i = 0.9\sigma_i$. The classification accuracy in the Iris database that corresponds to the automatically selected value w = 0.9 is 96%, which is higher than the accuracy of the NN classifier of 91%.
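The automatic selection of w described in this section can be sketched as follows. This is a rough illustration rather than the procedure used for the reported results: the function name `select_weight` is hypothetical, the two densities are estimated with simple histograms on a shared grid (an assumption, since the density estimator is not specified here), and the distances are synthetic values generated only to exercise the code.

```python
import numpy as np


def select_weight(inter, intra, bins=50):
    """Return the normalized distance at which the product of the
    inter-class and intra-class density estimates is largest."""
    inter = np.asarray(inter, dtype=float)
    intra = np.asarray(intra, dtype=float)
    x = inter / inter.std()  # normalized inter-class distances, X / sigma_e
    y = intra / intra.std()  # normalized intra-class distances, Y / sigma_a
    # Shared bin edges so that both densities are evaluated at the same points.
    edges = np.linspace(0.0, max(x.max(), y.max()), bins + 1)
    fx, _ = np.histogram(x, bins=edges, density=True)
    fy, _ = np.histogram(y, bins=edges, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[np.argmax(fx * fy)]


# Synthetic, purely illustrative distances (not taken from the Iris database).
rng = np.random.default_rng(0)
inter = np.abs(rng.normal(3.0, 1.0, 2000))  # inter-class: shifted to larger values
intra = np.abs(rng.normal(0.0, 1.0, 2000))  # intra-class: concentrated near zero
w = select_weight(inter, intra)
print(w)
```

The returned value depends on the number of bins, so in practice the grid resolution would need to be chosen with the same care as the histogram bin width discussed in Section 3.1.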

This method of automatic threshold selection is used for the results reported in Sections 5–7. Because this method for threshold detection utilizes the class information, this implementation of the classifier can be regarded as supervised. In unsupervised data classification situations, such as a clustering problem similar to k, the threshold $\theta_i$ can be selected by a trial and error approach. It should be further noted that, to have a fair comparison with the NN method, we select k in the NN by cross-validation that exclusively uses class labels in the training set.

5. EXPERIMENTS ON UCI DATASET

5.1. Comparison with instance-based NN classifiers

In this section, the proposed NF classifier is compared with several instance-based NN classifiers with the data from the commonly used University of California Irvine (UCI) machine learning datasets. The most widely used instance-based NN classifiers with demonstrated good recognition performance on the UCI dataset have been selected for the comparison with the NF classifier: k-nearest neighbor classifier (k-NN) [26], inverse distance-weighted k-nearest neighbor classifier (Ik-NN), distance-weighted k-NN (Dk-NN), K* classifier [27], locally weighted learning (LWL) classifier [28], voting feature interval (VFI) classifier [25], NN using generalized exemplars (NNge) [29], adaboost classifier (Adaboost) [30], bagging classifier [31] and logitBoost classifier. The parameters of the classifiers are obtained by cross-validation on the training data. In all the databases, a 60–40% random split of the available objects is used to form the training and test sets, respectively. For each database, 30 random splits are created and tested to determine the average performance measures.

5.1.1. Results
Tables 3 and 4 show the obtained recognition accuracies and areas under the curve (AUC), respectively. The recognition performance of the proposed NF classifier is better in most cases, whereas the AUC values are comparable. The average weight values for the threshold selection for each dataset are shown in Fig. 3. The wide variation in the weight values is due to the variability of the data in the databases.

TABLE 3. Recognition accuracies of the proposed method in comparison with 11 classifiers using 13 UCI databases.

Databases        NF          k-NN        Ik-NN       Dk-NN       NN          K*          LWL         VFI         NNge        Adaboost    Bagging     LogitBoost
Balance (4)      89.6(2.2)   89.2(1.3)   89.2(1.1)   89.2(1.0)   78.5(2.5)   87.5(1.3)   62.6(2.7)   68.5(15.7)  79.1(2.9)   74.6(3.5)   82.7(3.1)   86.7(1.7)
Breast (9)       97.2(0.9)   96.1(0.8)   96.2(0.8)   96.2(0.8)   95.8(0.8)   94.9(1.2)   92.0(1.8)   95.3(1.4)   95.9(1.0)   94.6(1.1)   95.3(1.2)   95.6(0.8)
Credit (15)      87.4(1.8)   85.8(1.7)   85.7(1.3)   86.2(1.4)   81.3(1.8)   78.6(1.9)   86.1(1.3)   85.2(1.9)   82.7(2.5)   85.6(1.8)   85.7(1.2)   86.4(1.8)
Pima (8)         77.7(1.8)   73.6(2.0)   74.3(2.1)   73.9(2.2)   70.4(2.2)   70.2(2.3)   71.9(3.5)   54.8(11.4)  72.7(2.4)   75.6(2.0)   76.2(2.1)   75.5(1.5)
Zoo (17)         96.9(4.9)   94.3(3.4)   94.5(3.8)   94.4(3.9)   95.4(3.2)   96.7(2.5)   86.3(2.8)   94.8(2.4)   94.2(2.9)   60.3(1.3)   42.2(1.2)   93.4(4.1)
Glass (9)        70.7(3.3)   66.3(4.2)   66.8(5.0)   66.5(5.8)   67.5(5.3)   73.1(4.0)   45.8(2.5)   55.6(6.0)   65.7(4.8)   44.8(1.5)   71.0(3.8)   68.2(4.1)
Statlog (13)     85.8(2.9)   79.7(3.4)   79.5(2.7)   80.4(2.1)   76.1(2.6)   75.7(2.4)   71.8(2.0)   77.9(2.3)   76.5(3.5)   80.7(2.5)   79.1(3.5)   80.4(2.8)
Vehicle (18)     84.7(1.6)   67.6(2.4)   69.3(2.2)   69.3(2.1)   68.7(1.9)   69.5(1.7)   45.2(2.1)   51.9(2.7)   60.3(3.8)   40.3(0.7)   72.1(1.8)   70.3(2.3)
Iris (4)         97.6(2.2)   95.3(2.3)   95.5(2.4)   95.3(2.2)   95.3(2.6)   94.3(2.3)   94.2(2.3)   95.8(2.7)   95.8(1.8)   94.7(2.4)   94.8(2.3)   95.3(1.7)
Hepatitis (19)   86.9(4.3)   82.5(3.4)   82.2(3.1)   82.0(2.6)   80.3(4.0)   80.7(4.0)   79.9(4.8)   82.6(4.0)   81.3(3.5)   80.8(3.1)   80.8(3.7)   80.9(3.8)
Vote (16)        98.5(1.6)   92.2(2.1)   92.4(2.0)   92.4(2.0)   91.8(2.0)   92.5(1.7)   95.3(1.2)   91.7(1.7)   94.8(1.5)   95.3(1.2)   95.3(1.2)   95.1(1.3)
Ionosphere (34)  98.2(2.5)   88.8(1.4)   85.8(2.4)   85.7(2.4)   86.1(2.1)   82.3(1.9)   83.0(3.3)   92.8(2.4)   88.8(2.8)   90.2(2.7)   91.0(3.0)   90.8(2.3)
Sonar (60)       98.1(1.5)   84.3(4.0)   84.4(4.4)   84.4(4.4)   84.5(4.3)   83.8(4.4)   71.6(4.3)   59.7(5.2)   68.1(6.6)   74.9(4.5)   76.3(5.3)   76.4(4.4)
The number next to each dataset name represents the number of features in each feature vector within the database.

5.1.2. Statistical significance tests
The Friedman test is a non-parametric test that makes no assumption about the distribution of the data. The test is applied to n = 13 datasets and k = 12 classifiers to examine whether the NF classifier performs better than the other classifiers. The large value of k (k > 4) enables one to use the $\chi^2$ distribution to obtain the relevant P-values that are used to either reject or accept the null hypothesis. In this case, the null hypothesis is that the mean accuracy of the NF classifier is equal to the mean accuracy of the compared classifier. An analogous null hypothesis is used for the AUC data. This hypothesis is rejected when the

TABLE 4. AUC of the proposed method in comparison with 11 methods using 13 UCI datasets.

Databases    NF          k-NN        Ik-NN       Dk-NN       NN          K*          LWL         VFI         NNge        Adaboost    Bagging     LogitBoost
Balance      0.90(0.01)  0.99(0.01)  0.99(0.01)  0.99(0.01)  0.86(0.02)  0.98(0.01)  0.83(0.03)  0.93(0.01)  0.86(0.03)  0.90(0.03)  0.95(0.02)  0.98(0.01)
Breast       0.96(0.01)  0.99(0.01)  0.99(0.01)  0.99(0.01)  0.95(0.01)  0.99(0.00)  0.98(0.01)  0.99(0.01)  0.96(0.01)  0.99(0.00)  0.99(0.01)  0.99(0.00)
Credit       0.84(0.01)  0.91(0.02)  0.91(0.01)  0.91(0.01)  0.81(0.02)  0.85(0.02)  0.92(0.01)  0.90(0.02)  0.82(0.03)  0.93(0.01)  0.91(0.02)  0.93(0.01)
Pima         0.72(0.02)  0.78(0.03)  0.79(0.02)  0.79(0.02)  0.67(0.02)  0.73(0.02)  0.78(0.02)  0.59(0.03)  0.69(0.03)  0.80(0.02)  0.82(0.02)  0.81(0.02)
Zoo          0.99(0.01)  1.00(0.00)  1.00(0.00)  1.00(0.00)  1.00(0.00)  1.00(0.00)  1.00(0.02)  1.00(0.00)  0.99(0.01)  1.00(0.02)  0.51(0.01)  1.00(0.00)
Glass        0.91(0.01)  0.80(0.07)  0.84(0.07)  0.81(0.07)  0.78(0.04)  0.91(0.03)  0.84(0.04)  0.78(0.06)  0.76(0.05)  0.71(0.02)  0.90(0.03)  0.87(0.03)
Statlog      0.82(0.03)  0.85(0.04)  0.86(0.04)  0.86(0.02)  0.76(0.03)  0.83(0.02)  0.86(0.02)  0.84(0.02)  0.76(0.04)  0.88(0.03)  0.87(0.02)  0.88(0.03)
Vehicle      0.90(0.01)  0.75(0.07)  0.79(0.04)  0.79(0.04)  0.66(0.02)  0.78(0.02)  0.76(0.02)  0.74(0.03)  0.61(0.04)  0.66(0.02)  0.85(0.01)  0.83(0.02)
Iris         0.99(0.01)  1.00(0.00)  1.00(0.00)  1.00(0.00)  1.00(0.00)  1.00(0.00)  1.00(0.00)  1.00(0.00)  1.00(0.00)  1.00(0.00)  1.00(0.00)  1.00(0.00)
Hepatitis    0.87(0.04)  0.78(0.06)  0.82(0.04)  0.82(0.04)  0.69(0.06)  0.81(0.06)  0.77(0.07)  0.85(0.06)  0.66(0.06)  0.81(0.05)  0.82(0.07)  0.81(0.05)
Vote         0.99(0.01)  0.97(0.02)  0.98(0.01)  0.97(0.01)  0.92(0.02)  0.98(0.01)  0.98(0.01)  0.98(0.01)  0.95(0.01)  0.99(0.00)  0.98(0.01)  0.99(0.02)
Ionosphere   0.99(0.02)  0.86(0.02)  0.97(0.01)  0.97(0.01)  0.82(0.03)  0.95(0.02)  0.94(0.02)  0.96(0.01)  0.88(0.02)  0.94(0.02)  0.95(0.02)  0.95(0.02)
Sonar        0.98(0.02)  0.85(0.05)  0.91(0.03)  0.91(0.03)  0.84(0.04)  0.93(0.03)  0.81(0.05)  0.64(0.06)  0.68(0.06)  0.83(0.04)  0.84(0.05)  0.83(0.04)

5.2. Comparison with kernel-based NN classifiers

In this section, the proposed classifier is compared with the following kernel-based methods for recognition performance: adaptive quasiconformal kernel NNs (i.e. AQK-k, AQK-e, AQK-, AQK-i as outlined in [32]), Machete [33], Scythe [33], discriminant adaptive NN (DANN) [34] and local Parzen windows. The procedural parameters for these methods are empirically determined through cross-validation. Consistent with the reported results in [32] for the Iris, Sonar, Vote, Ionosphere and Hepatitis datasets, 60% of the data points are used for training, whereas the remaining 40% are used as the

P-value is small (usually <0.05). Using the χ² distribution with k − 1 degrees of freedom, we obtain 1.2 × 10⁻⁷ and 4.2 × 10⁻⁷ as the P-values for the recognition accuracies and AUC data, respectively. These P-values are much smaller than the significance level of 0.05, and so the null hypothesis that the performance of the NF classifier is equal to that of the other classifiers is rejected.
The statistical significance can also be demonstrated by the post hoc Bonferroni–Dunn test that utilizes the critical values from the t distribution (after a Bonferroni adjustment to compensate for multiple comparisons). In this case, the significance level of α = 0.05 is commonly used. Again, two null hypotheses are tested: one is that the mean accuracies of the NF classifier are equal to those of all the other classifiers, and the other is that the mean AUC values of the NF classifier are equal to the mean AUC values obtained by the other classifiers. The results are shown in Table 5. It can be seen that the results obtained by the proposed NF classifier are statistically better in the case of eight accuracy comparisons and in the case of two AUC comparisons. These results confirm the statistical significance of the improvements that can be achieved by the NF classifier, especially in terms of the recognition accuracy.
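A rank-based analysis of this kind can be sketched as below. The sketch rests on two assumptions that should be read as such: the omnibus test is taken to be the Friedman test on average ranks (its statistic follows the χ² distribution with k − 1 degrees of freedom), and the Bonferroni–Dunn critical difference is computed from a Bonferroni-adjusted normal quantile, a common formulation, rather than from the t distribution mentioned above. All function names and the toy scores are hypothetical.

```python
import math
from statistics import NormalDist

def rank_row(row):
    """Rank one dataset's scores (1 = best, higher score is better); ties share the average rank."""
    order = sorted(range(len(row)), key=lambda c: -row[c])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution for integer df >= 1."""
    half_x = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half_x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half_x))
    # Recurrence Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1).
    while a < df / 2.0:
        q += half_x ** a * math.exp(-half_x) / math.gamma(a + 1.0)
        a += 1.0
    return q

def friedman_test(scores):
    """scores[d][c] is the score of classifier c on dataset d."""
    n, k = len(scores), len(scores[0])
    rank_rows = [rank_row(row) for row in scores]
    avg_ranks = [sum(r[c] for r in rank_rows) / n for c in range(k)]
    stat = 12.0 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)
    return avg_ranks, stat, chi2_sf(stat, k - 1)

def bonferroni_dunn_cd(n, k, alpha=0.05):
    """Critical difference in average rank against one control classifier (normal-quantile form)."""
    q = NormalDist().inv_cdf(1.0 - alpha / (2.0 * (k - 1)))
    return q * math.sqrt(k * (k + 1) / (6.0 * n))

# Hypothetical scores: 4 datasets (rows) x 3 classifiers (columns).
toy_scores = [[0.90, 0.80, 0.70],
              [0.95, 0.85, 0.75],
              [0.92, 0.81, 0.70],
              [0.88, 0.86, 0.60]]
avg_ranks, stat, p_value = friedman_test(toy_scores)
cd = bonferroni_dunn_cd(len(toy_scores), len(toy_scores[0]))
```

Under these assumptions, a classifier would be declared different from the control when their average ranks differ by more than `cd`.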

FIGURE 3. The average weight values (the optimization parameter of the NF classifier) obtained from 30 random splits of training and test data for each database.

Databases: Balance, Breast, Credit, Pima, Zoo, Glass, Statlog, Vehicle, Iris, Hepatitis, Vote, Ionosphere, Sonar.


TABLE 5. The results of the Bonferroni–Dunn test, where one symbol denotes rejection of the null hypothesis (no difference between NF and the compared classifier) and the other denotes that the null hypothesis is not rejected.

Classifier    Accuracy    AUC
k-NN
Ik-NN
Dk-NN
NN
K*
LWL
VFI
NNge
AdaBoost
Bagging
LogitBoost

test data. The classification errors are averaged over 20 different runs.

5.2.1. Results
The obtained recognition errors are shown in Table 6. It can be seen that the NF classifier significantly outperforms all the kernel-based NN classifiers for the Iris, Sonar, Vote and Ionosphere datasets, with slightly better performance on the Hepatitis dataset.
5.3. Computational complexity

The proposed algorithm has a computational complexity of O(M(N − N1)), where N1 is the total number of objects that do not exhibit an NF, N is the total number of objects and M is the total number of features per object. The decision-time complexity of the k-NN classifier is O(kMN). Partitioning techniques such as those based on kd-trees, aimed at reducing the computational complexity of the k-NN implementation, result in a complexity of O(MN log(M)). However, when the metric

FIGURE 4. The ratio of the number of NFs and the total number of features used for similarity calculation in 13 databases.

TABLE 6. Recognition performances of the NF classifier and five kernel-based NN classifiers using five UCI datasets.

Recognition error (%)

Classifier   Iris (4)     Sonar (60)    Vote (16)    Ionosphere (34)   Hepatitis (19)
NF           2.4 ± 2.2    1.9 ± 1.5     1.5 ± 1.6    1.8 ± 2.5         13.1 ± 4.3
AQK-k        5.5 ± 3.7    15.7 ± 2.8    6.0 ± 2.2    12.0 ± 1.8        14.3 ± 2.3
Machete      6.8 ± 4.2    26.6 ± 3.7    6.3 ± 2.7    19.4 ± 2.9        18.6 ± 4.0
Scythe       6.0 ± 3.5    22.4 ± 4.5    5.9 ± 2.8    12.8 ± 2.6        17.9 ± 4.3
DANN         6.9 ± 3.8    13.3 ± 3.7    5.4 ± 1.9    12.6 ± 2.3        15.1 ± 4.2
Parzen       6.4 ± 3.4    16.1 ± 3.5    9.3 ± 2.4    11.0 ± 2.7        19.4 ± 4.5

space dimensionality is increased, this complexity approaches the complexity of the exhaustive NN search. Application of the kernel trick in the NN [32] increases the computational complexity by Kn times, i.e. to O(kKnMN), where Kn is the scaled-up dimensionality of each feature in the feature space. Therefore, the computational complexity of the NF classifier is smaller than that of the conventional k-NN and higher than that of the techniques based on kd-trees at lower dimensionality. In highly dimensional data, the NF classifier is computationally simpler because it ignores all the features |gi (c)| > |i gi ()| from the similarity calculations, where gi (c) is a feature from a training object and gi () is a feature from a test object. Figure 4 shows the ratio of the average number of NFs and the total number of features in the training data. It can be seen that across all the databases, the number of NFs is lower for the inter-class objects. Given that there are more inter-class features in any multi-class classification problem, the removal of a larger number of inter-class features results in reduced computational complexity.
Another simplification that is achieved by the NF classifier is due to the fact that cross-validation is not required. The single optimization parameter of the NF classifier can simply be estimated from the difference distributions obtained from the training data.

6. APPLICATION TO IMAGE RECOGNITION

6.1. Face recognition

To demonstrate the NF classifier in a practical image-recognition application using pixels, we use well-known face-recognition databases: AR, PUT and EYaleB [35–37]. The example images from these databases, shown in Fig. 5, illustrate that there is a wide range of image variability, including variations in illumination conditions, occlusions and large changes in face pose.
6.1.1. Database organization
The AR database consists of face images of 126 persons with 26 images per person representing 13 conditions taken at two sessions with a time gap of two weeks.¹ The PUT database consists of 9971 images of 100 people.² The EYaleB database is the benchmark for testing performance under large illumination changes and consists of 5850 facial images of 10 persons, each with 585 images.³

¹ These images are localized by the location of the eye coordinates and then cropped to a size of 160 × 120 pixels. Also, the color information is converted into the corresponding grayscale levels. The gallery set is made up of seven images per person that have facial expression changes and illumination changes, but do not have occlusions. The test set consists of 19 images that have larger changes in facial expressions and illumination, and include facial occlusions. Some examples of test images in the AR database are shown in Fig. 5a.

² The face images are aligned and cropped based on the ground truth eye coordinates to a size of 360 × 360 pixels. The gallery set is made up of 2200 images (22 images per person) and the test set is made up of 1100 images (11 images per person). The gallery consists of images with limited and controlled pose variations such that the facial pose varies from left to right, right to left, top to bottom and bottom to top, keeping the center of the neutral face as the reference. The test set consists of images with larger variations in pose compared with the gallery images.

³ It is taken under 65 illumination conditions and in 9 different poses with an image resolution of 480 × 640 pixels. Only 45 out of the 65 conditions are usually reported in previous works. The database is usually divided into four subsets (1, 2, 3 and 4) based on the angle of the light source (12°, 25°, 50° and 77°) with 70, 120, 120 and 140 images per pose, respectively. The remaining

\[ n(i, j) = a_1 + \frac{g(i, j) - \bar{g}(i, j)}{b_1\, \sigma(i, j)}. \tag{4} \]

Here the scaling parameter a1 adjusts the offset and b1 compresses the range. For example, to scale the value of y within a range of [0, 1], the values of the scaling parameters are a1 = 0.5 and b1 = 6. These are determined under the assumption that, when a1 = 0 and b1 = 1, Equation (4) would result in normally distributed n with the mean equal to 0 and a standard deviation of 1. On the basis of the assumption of normal distribution, 99.73% of the values (or values within ±3σ) are normalized to the range of [−3, 3]. Clearly, b1 = 6 compresses the original range to [−0.5, 0.5], whereas a1 = 0.5 offsets the data by 0.5, resulting in the range [0, 1]. The remaining 0.27% of values in y can be treated as outliers and rounded to the nearest boundary limits. Figure 6 shows the histogram and the corresponding image before and after normalizing a test image from the EYaleB database, exhibiting large variation in illumination. It can be seen that normalization brings out details in the image that are otherwise not visible to the human eye.
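The local standardization of Equation (4) can be illustrated with a short sketch. It is not the authors' implementation: the square window, its truncation at the image borders, the small constant `eps` that guards against a zero local standard deviation, and the function and parameter names are all assumptions made for the example. Values outside [0, 1] are clipped to the nearest boundary, as described above for outliers.

```python
import math

def normalize_image(g, half_window=1, a1=0.5, b1=6.0, eps=1e-8):
    """Local standardization in the spirit of Equation (4):
    n(i, j) = a1 + (g(i, j) - local mean) / (b1 * local std), clipped to [0, 1]."""
    rows, cols = len(g), len(g[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Pixels in the window centred on (i, j); truncated at the borders (assumption).
            window = [g[r][c]
                      for r in range(max(0, i - half_window), min(rows, i + half_window + 1))
                      for c in range(max(0, j - half_window), min(cols, j + half_window + 1))]
            mean = sum(window) / len(window)
            std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
            n = a1 + (g[i][j] - mean) / (b1 * (std + eps))
            out[i][j] = min(1.0, max(0.0, n))  # outliers go to the nearest boundary
    return out

# A small hypothetical image and a uniformly brighter copy of it.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
brighter = [[v + 50 for v in row] for row in image]
normalized = normalize_image(image)
normalized_brighter = normalize_image(brighter)
flat = normalize_image([[10, 10], [10, 10]])
```

Because the local mean is subtracted, `normalized` and `normalized_brighter` agree, which is the intensity-offset compensation that motivates the normalization step.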
6.1.3. Recognition results
The performance of the NF classifier for face recognition using
the AR, PUT and EYaleB databases is compared with the
performance of the NN classifier in Table 7. It is obvious
that significant improvement in recognition performance can
be achieved by the NF method. It is also obvious that very high
images with larger variations than those of subsets 1–4 are denoted as subset 5. Subset 5, shown in Fig. 5c, contains the largest image variations. All 90 images (10 persons, 9 poses, 1 image) from subset 1 are used as the gallery set, while the remaining 5760 images (10 persons, 9 poses, 64 images) form the test set.

In the past, it has been observed that the use of extracted features in classifiers results in better recognition rates compared with the use of raw features, such as pixels. However, a general classifier should be able to utilize the features in both their original and extracted forms. Image classification is an example of a problem where either raw or extracted features can be employed. In this section, we select two representative image-recognition examples to demonstrate the performance of the proposed NF classifier: (1) face recognition using raw features and (2) handwritten character recognition using raw and extracted features.

6.1.2. Normalization of pixels
The raw pixels in their original form cannot be used for classification because of considerably large changes in illumination and intensity offsets between test and gallery images. Normalization is an essential preprocessing step that is required to bring the test and gallery pixels to the same reference point for comparison. Consider two images that are two-dimensional matrices of pixel intensity values, and suppose that the task prior to classification is to calculate the difference between the two images. Assume that the two compared images belong to the same class and that everything in the images looks the same except that one image is brighter than the other. This kind of illumination change is often the main cause of wrong distance calculations that adversely affect image classification. The most effective way to compensate for such changes is by applying normalization methods [38–46]. Perhaps the simplest way to implement normalization is by local standardization of the feature vectors [46]. Consider the raw feature vector g with a size of U × V pixels, with local standard deviation σ and local mean ḡ of the image pixels encompassed by a window w consisting of M × N pixels. Then the normalized feature n(i, j) is defined by Equation (4).

FIGURE 6. Illustration of the effect of normalization performed by Equation (4): (a) the distribution of pixel intensities of a raw image is
approximately exponential; (b) the distribution of pixel intensities of the normalized image is approximately Gaussian. The normalization brings
out features that are otherwise not visible to the human eye.

TABLE 7. Face recognition performance of the NF classifier.

               Images per class     Recognition accuracy (%)
Database       Gallery    Test      NF      k-NN
AR [35]        7          19        94      84
PUT [36]       22         11        90      83
EYaleB [37]    9          576       100     96

recognition accuracies are achieved across different databases covering a wide range of natural variability. This demonstrates the high level of robustness that is associated with the method of decision making at the feature level (the NF decisions).
6.2. Character recognition

The character recognition database described in [47] is a good example where either raw or transformed features can be used in classification.

6.2.1. Database organization
The character recognition database described in [47] consists of 200 objects per class, representing the numerical digits 0–9, as illustrated in Fig. 7.⁴

⁴ The images of the digits are extracted and cropped to 30 × 48 pixel regions to form the raw feature vectors. The transformations applied to the raw features are Zernike moments, Karhunen–Loève features and Fourier descriptors. The Zernike features consist of 47 rotationally invariant Zernike moments and 6 morphological features. The Karhunen–Loève transformation results in 64 features and the Fourier transformation results in 76 two-dimensional shape descriptors.

6.2.2. Recognition results
Table 8 shows the recognition performance of the NF classifier, in comparison with the NN classifier. These results are obtained by dividing the database into gallery and test sets by randomly selecting an equal number of gallery and test objects for each class. The reported recognition accuracies are the average values obtained for 100 repeated tests with random selections of test and gallery images. It can be seen that the implementation of the NF decisions significantly improves the recognition rates, irrespective of the feature type employed. It can also be seen

FIGURE 5. Examples of face images from (a) AR database [35], (b) PUT database [36] and (c) EYaleB database [37].



TABLE 9. Organization of the medical datasets used in our experiments.

Dataset    Class types            Number of objects per class    Number of features
WBCD       Benign                 458                            10
           Malignant              241                            10
WDBC       Benign                 357                            33
           Malignant              212                            33
Acute-1    Nephritis              70                             6
           No Nephritis           50                             6
Acute-2    Inflammation           61                             6
           No inflammation        59                             6
Colon      Normal                 40                             2000
           Tumor                  22                             2000
TumorC     Normal                 39                             7129
           Tumor                  21                             7129
Leukemia   Acute myeloid          25                             7129
           Acute lymphoblastic    47                             7129
GCM        Breast                 11                             16 063
           Prostate               10                             16 063
           Lung                   11                             16 063
           Colorectal             11                             16 063
           Lymphoma               14                             16 063
           Bladder                11                             16 063
           Melanoma               10                             16 063
           Uterus                 10                             16 063
           Leukemia               14                             16 063
           Renal                  11                             16 063
           Pancreas               11                             16 063
           Ovary                  11                             16 063
           Mesothelioma           11                             16 063
           CNS                    12                             16 063

FIGURE 7. Examples of images used in the handwritten character
recognition experiment. There are 10 classes, each representing a
handwritten integer.
TABLE 8. Character recognition performance of the NF classifier.

                Objects per class    Recognition accuracy (%)
Feature type    Gallery    Test      NF      k-NN
Zernike         100        100       93      76
KL              100        100       95      93
Fourier         100        100       96      82
Pixels          100        100       98      96

from Table 8 that the use of raw pixels results in comparatively higher recognition accuracies and robustness, which indicates the stability of the NF classifier when using a larger number of features.

7. APPLICATION TO MEDICAL DIAGNOSIS

The role of pattern classification methods in medical diagnosis is to provide a presumptive diagnosis of diseases or conditions affecting a patient. Automatic diagnosis is a difficult problem of high practical significance. Even the 100% detection of a disease does not completely ensure the correct diagnosis in real-life situations due to the changing nature of infections, accuracy of test methods and missing data. It should be noted that the main role of automatic detection is to provide a mandatory second opinion to the medical practitioner who is treating the patient. Knowledge accumulated by adding case studies of different patients, which increases the number of objects in the gallery, should improve the accuracy of the automatic diagnosis. The primary role of this section is to provide benchmark results for the NF classifier in medical diagnosis using publicly available datasets.

7.1. Two class problems

7.1.1. Database organization
To demonstrate the performance of the NF classifier, we use the databases from the UCI Machine Learning Repository described in Table 9: Wisconsin Breast Cancer Database (WBCD) [48, 49], Wisconsin Diagnostic Breast Cancer (WDBC) [50],⁵ Acute Inflammations Dataset⁶ [51] (Acute-1 and Acute-2), Colon Tissues Cancer Dataset (Colon) [52], Central Nervous System Embryonal Tumor Dataset (TumorC) [53] and Leukemia Dataset (Leukemia) [54].⁷

⁵ WBCD has a total of 699 objects that were collected between 1989 and 1991. Each object consists of 10 features that are diagnostic attributes of breast cancer. This database is suitable as a binary classification problem with 458 objects that represent measurements from benign cancer and 241 objects that represent measurements from malignant cancer. The WDBC database has 30 features that are computed from a digitized image of a fine needle aspirate of a breast mass. They describe characteristics of the cell nuclei present in the image. This database is used for a binary classification problem of benign and malignant cancers in the cells.

⁶ The acute inflammation database contains six features and can be used to make a presumptive diagnosis of two diseases of the urinary system. This would require two separate classifiers, one for making a decision on the presence or absence of inflammation of the urinary bladder and another for making a decision on the presence or absence of nephritis of renal pelvis origin. The former is shown as Acute-1 and the latter as Acute-2.

⁷ In the recent past, gene expressions have been extensively used in automatic binary diagnosis problems. Here, we use the Colon Tissues Cancer Dataset (Colon), Central Nervous System Embryonal Tumor Dataset (TumorC) and Leukemia Dataset (Leukemia) to demonstrate the recognition performance of our method. The colon dataset consists of 2000 features representing genes with the highest minimal intensity across the 62 tissues that are either tumor or normal colon tissues. The gene intensities that represent the features are derived from the 20 feature pairs that correspond to the gene on the chip and derived using the filtering process described in [54]. The leukemia database has 7129 features representing genes with 62 objects belonging to either acute myeloid leukemia or acute lymphoblastic leukemia conditions. The TumorC dataset consists of 7129 features representing genes and has 60 samples/objects belonging to either of two classes. These databases can be downloaded from the following repositories: http://tunedIT.org and http://archive.ics.uci.edu/ml/datasets.html.

7.1.2. Recognition results
Table 9 shows that the databases Colon, TumorC and Leukemia have more features and fewer gallery objects than WBCD, WDBC, Acute-1 and Acute-2. Classification on databases with a large number of features and a small number of gallery objects is more difficult than on those with fewer features and a large number of gallery objects. Table 10 shows the benchmark detection performance of the NF classifier in some of the examples of binary classification problems in medical diagnosis. Excluding the Leukemia dataset, which has predefined gallery and test sets, the test and gallery sets are randomly selected into equal class sizes. The recognition accuracies are calculated as the average of 100 random divisions of the gallery and test sets. It should be noted that, for each random selection of test and gallery sets, cross-validation is performed to find the optimal value of the threshold. It can be seen from Table 10 that the proposed NF classifier outperforms the NN classifier. The performance improvement by the NF approach is the most significant when the number of features in the database is large. This shows the robustness of the NF classifier when high-dimensional feature vectors are used.

TABLE 10. Medical diagnosis performance of the NF classifier.

                 Objects per set      Recognition accuracy (%)
Dataset          Gallery    Test      NF       k-NN
WBCD [48]        349        150       97.8     96.3
WDBC [50]        284        284       97.1     91.6
Acute-1 [51]     60         60        100.0    92.9
Acute-2 [51]     59         61        100.0    93.2
Colon [52]       31         31        84.1     74.7
Leukemia [54]    38         34        97.1     82.8
TumorC [53]      30         30        77.2     55.1
GCM [55]         144        46        54.3     41.6
7.2. Multiple class problems

Multiclass cancer detection using tumor samples having gene expressions as their features is a difficult and challenging classification problem [55]. The database described in [55] has gene expression profiles of 218 tumor samples. This database is known as Global Cancer Map (GCM) and consists of 14 classes of cancer described in Table 9. Automatic classification of cancer based on measurements from genes can be used as a mandatory second opinion in patient diagnosis. Such a multiclass problem needs special attention due to its clinical and practical importance. In our classification experiments, we use 144 samples to form the gallery and 46 samples to form the test sets.
7.2.1. Using direct multiclass classification
The direct approach to detect the cancer is by treating the
problem as a multiclass classification using a single classifier.
The top rank recognition performance of the NF classifier is
shown in Table 10. The overall recognition accuracy of 54.3%
at the top rank implies that many right detections would be in
the lower ranks. We find that including lower ranks, for example
rank 7 (to consider the top 50% of the class selections), the
recognition accuracy increases to 75%. However, since only
one in seven classes would give such a higher chance of correct
detection, the direct multiclass classification is not a viable
option in practical diagnosis.
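The difference between top-rank and rank-k recognition accuracy discussed above can be made concrete with a small sketch. The per-class scores, class names and function name below are hypothetical; the only assumption about the classifier is that it returns one similarity score per class, with a higher score meaning a better match.

```python
def rank_k_accuracy(score_rows, true_labels, k):
    """Fraction of test samples whose true class is among the k best-scoring classes."""
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        ranked = sorted(scores, key=scores.get, reverse=True)  # best class first
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

# Hypothetical scores of three test samples over three classes.
score_rows = [
    {"lung": 0.9, "breast": 0.5, "renal": 0.1},
    {"lung": 0.6, "breast": 0.5, "renal": 0.2},
    {"lung": 0.7, "breast": 0.4, "renal": 0.3},
]
true_labels = ["lung", "breast", "renal"]

top_rank = rank_k_accuracy(score_rows, true_labels, k=1)
rank_two = rank_k_accuracy(score_rows, true_labels, k=2)
rank_three = rank_k_accuracy(score_rows, true_labels, k=3)
```

Rank-k accuracy can only grow with k, which is why a high value at rank 7 out of 14 classes says little about how useful the top-ranked answer is in practice.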
7.2.2. Using one versus all NF classification
To improve the cancer detection accuracy of the NF classifier, the one versus all (OVA) classification framework is designed, as shown in Fig. 8. The multiclass problem having 14 classes is divided into a series of 14 OVA comparisons. Each test object/sample is compared and tested with 14 OVA classifiers, each of which either detects ("yes" decision) or rejects ("no" decision) the possibility of that specific cancer. Along with the decision, each OVA classifier also presents a global voting score that can be used to check the confidence of the decision with respect to the remaining OVA classifiers. The feature thresholds of the OVA classifiers are calculated individually from the 144 gallery objects using cross-validation by dividing it into a binary classification problem, e.g. breast cancer versus no breast cancer samples/objects. The values that best discriminate the correct detection of a class while rejecting the others are selected as the optimal thresholds for that OVA classifier.

7.2.3. Recognition results
Table 11 shows the detection performance of the presented OVA NF classifier. In the table, the average false detection is the percentage of samples that are wrongly detected as belonging to a particular class. For example, 0% of false detection for lung cancer would mean that all test samples belonging to the lung cancer class were correctly detected and none of them was detected as belonging to any other class. Using the 46 test samples and taking into account the average false detection percentages of each class, the overall prediction accuracy of the classifier is 83%. This is considerably higher than the 71% accuracy achieved by the NN classifier implemented in the OVA approach.

TABLE 11. OVA NF classifier performance on GCM dataset.

Class types     OVA detection accuracy (%)    Average false detection (%)
Breast          67                            8
Prostate        100                           43
Lung            100                           0
Colorectal      100                           19
Lymphoma        100                           17
Bladder         100                           21
Melanoma        100                           7
Uterus          100                           32
Leukemia        100                           0
Renal           100                           34
Pancreas        100                           21
Ovary           100                           40
Mesothelioma    100                           31
CNS             100                           0

8. CONCLUSIONS

In this paper, the concept of detecting the NFs is introduced and a general purpose NF classifier is presented. The proposed approach enables the use of highly dimensional feature vectors without compromising recognition accuracies. Robust recognition performance is demonstrated for different problems of image recognition and medical diagnosis using a wide range of benchmark databases.
The performance of the NF classifier is demonstrated for binary, multiclass and OVA classification problems. The method shows considerable improvement in recognition accuracies in comparison with its closest counterpart, the NN classifier without the detection of NFs. As opposed to the conventional classification approaches that rely on feature reduction methods, the results shown in this paper demonstrate that the NF classifier can successfully utilize the increased availability of features in problems involving gene sequences, high-resolution images and data for predictive analysis.

ACKNOWLEDGEMENTS
The authors would like to acknowledge the anonymous
reviewers for their constructive comments, which have resulted
in significant improvements of the presented research.

REFERENCES
[1] Dasarathy, B. (1990) Nearest Neighbor Classification Techniques. IEEE Press, Hoboken, NJ.
[2] Duda, R.O., Hart, P.G. and Stork, D.E. (2001) Pattern
Classification. Wiley, New York.


FIGURE 8. A multiclass classification scheme using the OVA NF classifier. Similar to the scheme in [55], the multiclass cancer detection problem is divided into 14 OVA problems, where each OVA problem (e.g. breast cancer versus no breast cancer) is addressed by a separate classifier. The optimal parameters of the 14 classifiers are determined separately by cross-validation using gallery data. The test sample is presented to all the classifiers and a positive "yes" response from a classifier indicates the possibility of that cancer. One or more classifiers can indicate a "yes" decision. The confidence of the "yes" decision can be detected by ranking the global scores associated with the "yes" decisions.
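The decision flow of this OVA scheme can be sketched in a few lines. The sketch shows only the combination step (collecting the "yes" decisions and ranking them by score); the binary detector used here is a hypothetical prototype-distance rule standing in for a trained OVA NF classifier, and all names, prototypes and thresholds are invented for the illustration.

```python
def ova_decide(sample, detectors):
    """Present the sample to every OVA detector and return the classes that
    answer 'yes', ordered from the highest to the lowest score."""
    positives = []
    for name, detector in detectors.items():
        detected, score = detector(sample)
        if detected:
            positives.append((score, name))
    positives.sort(reverse=True)
    return [name for score, name in positives]

def make_threshold_detector(prototype, threshold):
    """Hypothetical stand-in for one binary classifier: 'yes' when the sample
    lies within `threshold` (city-block distance) of a class prototype."""
    def detector(sample):
        distance = sum(abs(a - b) for a, b in zip(sample, prototype))
        score = 1.0 / (1.0 + distance)  # larger score = more confident
        return distance <= threshold, score
    return detector

detectors = {
    "breast": make_threshold_detector([0.0, 0.0], 1.0),
    "lung": make_threshold_detector([1.0, 1.0], 2.0),
    "renal": make_threshold_detector([5.0, 5.0], 1.0),
}

candidates = ova_decide([0.2, 0.1], detectors)   # two detectors answer "yes"
no_candidates = ova_decide([9.0, 9.0], detectors)  # every detector answers "no"
```

Returning a ranked list rather than a single label reflects the caption's point that more than one classifier may answer "yes", with the scores used to order the candidates.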


[19] Shakhnarovich, G., Darrell, T. and Indyk, P. (2006) Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. MIT Press, Cambridge.
[20] Short, R. and Fukunaga, K. (1984) The optimal distance measure for nearest neighbor classification. IEEE Trans. Inf. Theory, IT-27, 622–627.
[21] Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R. and Wu, A.Y. (1998) An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. ACM, 45, 891–923.
[22] Kleinberg, J. (1997) Two Algorithms for Nearest Neighbor Search in High Dimensions. Proc. 29th Annual ACM Symp. Theory of Computing, El Paso, TX, USA, May 4–6, pp. 599–608. ACM.
[23] Indyk, P. and Motwani, R. (1998) Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. Proc. 30th Symp. Theory of Computing, Dallas, TX, USA, May 24–26, pp. 604–613. ACM.
[24] Akkus, A. and Guvenir, A.H. (1996) K Nearest Neighbor Classification on Feature Projections. Proc. ICML, Bari, Italy, July 3–6, pp. 12–19. Morgan Kaufmann.
[25] Demiröz, G. and Guvenir, A.H. (1997) Classification by Voting Feature Intervals. Proc. ECML-97, Prague, Czech Republic, April 23–25, pp. 85–92. Springer.
[26] Aha, D. and Kibler, D. (1991) Instance-based learning algorithms. J. Mach. Learn., 6, 37–66.
[27] Cleary, J.G. and Trigg, L.E. (1995) K*: An Instance-Based Learner Using an Entropic Distance Measure. Proc. 12th Int. Conf. Machine Learning, Tahoe City, CA, USA, July 9–12, pp. 108–114. Morgan Kaufmann.
[28] Frank, E., Hall, M. and Pfahringer, B. (2003) Locally Weighted Naive Bayes. Proc. 19th Conf. Uncertainty in Artificial Intelligence, Acapulco, Mexico, August 7–10, pp. 249–256. Morgan Kaufmann.
[29] Freund, Y. and Schapire, R.E. (1996) Experiments with a New Boosting Algorithm. Proc. 13th Int. Conf. Machine Learning ICML 1996, Bari, Italy, July 3–6, pp. 148–156. Morgan Kaufmann.
[30] Breiman, L. (1996) Bagging predictors. J. Mach. Learn., 24, 123–140.
[31] Friedman, J., Hastie, T. and Tibshirani, R. (2000) Additive logistic regression: a statistical view of boosting. Ann. Stat., 28, 337–407.
[32] Peng, J., Heisterkamp, D.R. and Dai, H.K. (2004) Adaptive quasiconformal kernel nearest neighbor classification. IEEE Trans. Pattern Anal. Mach. Intell., 26, 656–661.
[33] Friedman, J.H. (1994) Flexible Metric Nearest Neighbour Classification. Technical Report, Department of Statistics, Stanford University.
[34] Hastie, T. and Tibshirani, R. (1996) Discriminant adaptive nearest neighbor classification. IEEE Trans. Pattern Anal. Mach. Intell., 18, 607–615.
[35] Martinez, A. and Benavente, R. (1998) The AR Face Database. Technical Report 24. CVC Technical Report.
[36] Kasinski, A., Florek, A. and Schmidt, A. (2008) The PUT face database. Image Process. Commun., 13, 59–64.
[37] Georghiades, A.S., Belhumeur, P.N. and Kriegman, D.J. (2001) From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intell., 23, 643–660.


[3] Cover, T. and Hart, P. (1967) Nearest neighbor pattern


classification. IEEE Trans. Inf. Theory, 13, 2127.
[4] Bailey, T. and Jain, A. (1978) A note on distance-weighted
k-nearest neighbor rules. IEEE Trans. Syst. Man Cybern., 8,
311–313.
[5] Dudani, S. (1976) The distance-weighted k-nearest-neighbor
rule. IEEE Trans. Syst. Man Cybern., 6, 325–327.
[6] Beyer, K., Goldstein, J., Ramakrishnan, R. and Shaft, U. (1999)
When is Nearest Neighbour Meaningful? Proc. 7th Int. Conf.
Database Theory ICDT 99, Lecture Notes in Computer Science,
Jerusalem, Israel, January 10–12, pp. 217–235. Springer.
[7] Houle, M.E., Kriegel, H.-P., Kröger, P., Schubert, E. and Zimek,
A. (2010) Can Shared-Neighbor Distances Defeat the Curse of
Dimensionality? Proc. SSDBM 2010, Lecture Notes in Computer
Science, Heidelberg, Germany, June 30–July 2, pp. 482–500.
Springer.
[8] Guyon, I. and Elisseeff, A. (2003) An introduction to variable and
feature selection. J. Mach. Learn. Res., 3, 1157–1182.
[9] Peng, H., Long, F. and Ding, C. (2005) Feature selection based on
mutual information: criteria of max-dependency, max-relevance,
and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell., 27,
1226–1238.
[10] Guyon, I. and Elisseeff, A. (2003) An introduction to variable and
feature selection. JMLR Spec. Issue Var. Feature Sel., 3,
1157–1182.
[11] Weinberger, K.Q. and Saul, L.K. (2009) Distance metric learning
for large margin nearest neighbor classification. J. Mach. Learn.
Res., 10, 207–244.
[12] Xing, E.P., Ng, A.Y., Jordan, M.I. and Russell, S. (2002) Distance
Metric Learning, with Application to Clustering with
Side-Information. Advances in Neural Information Processing Systems
NIPS 2001, Vancouver, Canada, December 10–12, pp. 521–528.
MIT Press.
[13] Goldberger, J., Roweis, S., Hinton, G. and Salakhutdinov, R.
(2004) Neighbourhood Component Analysis. Advances in Neural
Information Processing Systems NIPS 2004, Vancouver, BC,
Canada, December 13–18, pp. 513–520. MIT Press.
[14] Athitsos, V., Alon, J. and Sclaroff, S. (2005) Efficient Nearest
Neighbor Classification Using a Cascade of Approximate
Similarity Measures. IEEE Conf. Computer Vision and Pattern
Recognition CVPR 2005, San Diego, CA, USA, June 20–26, pp.
486–493. IEEE Computer Society.
[15] Athitsos, V., Alon, J., Sclaroff, S. and Kollios, G. (2004)
Boostmap: A Method for Efficient Approximate Similarity
Rankings. IEEE Conf. Computer Vision and Pattern Recognition
CVPR 2004, Washington, DC, USA, June 27–July 2, pp.
268–275. IEEE Computer Society.
[16] Athitsos, V. (2006) Learning embeddings for indexing, retrieval,
and classification, with applications to object and shape
recognition in image databases. Ph.D Thesis, Boston University.
[17] Denoeux, T. (1995) A k-nearest neighbor classification rule based
on Dempster-Shafer theory. IEEE Trans. Syst. Man Cybern., 25,
804–813.
[18] Zuo, W., Zhang, D. and Wang, K. (2008) On kernel
difference-weighted k-nearest neighbor classification. Pattern Anal. Appl.,
11, 247–257.



[47] van Breukelen, M., Duin, R., Tax, D. and den Hartog, J.
(1998) Handwritten digit recognition by combined classifiers.
Kybernetika, 34, 381–386.
[48] Mangasarian, O.L. and Wolberg, W.H. (1990) Cancer diagnosis
via linear programming. SIAM News, 23, 1–18.
[49] Wolberg, W.H. and Mangasarian, O. (1990) Multisurface method
of pattern separation for medical diagnosis applied to breast
cytology. Proc. Natl Acad. Sci. USA, 87, 9193–9196.
[50] Street, W., Wolberg, W. and Mangasarian, O. (1993) Nuclear
Feature Extraction for Breast Tumor Diagnosis. IS&T/SPIE
1993 Int. Symp. Electronic Imaging: Science and Technology,
San Jose, CA, USA, January 31–February 4, pp. 861–870. SPIE.
[51] Czerniak, J. and Zarzycki, H. (2003) Application of Rough Sets
in the Presumptive Diagnosis of Urinary System Diseases. Proc.
9th Int. Conf. Artificial Intelligence and Security in Computing
Systems, ACS2002, Międzyzdroje, Poland, October 23–25, pp.
41–51. Kluwer Academic Publishers.
[52] Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack,
D. and Levine, A.J. (1999) Broad patterns of gene expression
revealed by clustering of tumor and normal colon tissues probed
by oligonucleotide arrays. PNAS, 96, 6745–6750.
[53] Pomeroy, S.L. et al. (2002) Prediction of central nervous system
embryonal tumour outcome based on gene expression. Nature,
415, 436–442.
[54] Golub, T. et al. (1999) Molecular classification of cancer: class
discovery and class prediction by gene expression monitoring.
Science, 286, 531–537.
[55] Ramaswamy, S. et al. (2001) Multiclass cancer diagnosis using
tumor gene expression signatures. PNAS, 98, 15149–15154.


[38] Štruc, V. and Pavešić, N. (2011) Photometric normalization
techniques for illumination invariance. In Zhang, Y.-J. (ed.),
Advances in Face Image Analysis. IGI Global.
[39] Štruc, V. and Pavešić, N. (2009) Gabor-based kernel
partial-least-squares discrimination features for face recognition. Informatica
(Vilnius), 20, 115–138.
[40] Gross, R. and Brajovic, V. (2003) An Image Preprocessing
Algorithm for Illumination Invariant Face Recognition. Proc.
4th Int. Conf. Audio- and Video-Based Biometric Personal
Authentication, Guildford, UK, June 9–11, pp. 10–18. Springer.
[41] Park, Y., Park, S. and Kim, J. (2008) Retinex method based on
adaptive smoothing for illumination invariant face recognition.
Signal Process., 88, 1929–1945.
[42] Chen, W., Er, M. and Wu, S. (2006) Illumination compensation
and normalization for robust face recognition using discrete
cosine transform in logarithmic domain. IEEE Trans. Syst. Man
Cybern. B, 36, 458–466.
[43] Štruc, V., Žibert, J. and Pavešić, N. (2009) Histogram remapping
as a preprocessing step for robust face recognition. WSEAS Trans.
Inf. Sci. Appl., 6, 520–529.
[44] Jobson, D., Rahman, Z. and Woodell, G. (1997) A multiscale
retinex for bridging the gap between color images and the human
observations of scenes. IEEE Trans. Image Process., 6, 965–976.
[45] Štruc, V. and Pavešić, N. (2009) Illumination Invariant Face
Recognition by Non-local Smoothing. Proc. BIOID Multicomm,
Madrid, Spain, September 16–18, pp. 1–8. Springer.
[46] James, A.P. and Dimitrijev, S. (2010) Inter-image outliers and
their application to image classification. Pattern Recognit., 43,
4101–4112.