
International Journal on Computational Sciences & Applications (IJCSA) Vol. 5, No. 3, June 2015
DOI: 10.5121/ijcsa.2015.5305

Regularized Weighted Ensemble of Deep Classifiers
Shruti Asmita¹ and K.K. Shukla²

¹Department of Computer Science, Banasthali University, Jaipur-302001, Rajasthan, India
²Department of Computer Science and Engineering, Indian Institute of Technology, Banaras Hindu University, Varanasi-221005, Uttar Pradesh, India

ABSTRACT
An ensemble of classifiers improves classification performance because the decisions of many experts are fused to produce the final prediction. Deep learning is a classification approach in which, along with the basic learning step, fine-tuning is performed to improve the precision of the learned model. Ensembles of deep classifiers therefore offer good scope for research. Feature subset selection is another technique for creating the individual classifiers to be fused in ensemble learning. All of these ensemble techniques face the ill-posed problem of overfitting. A regularized weighted ensemble of deep support vector machines is applied to prediction on three UCI repository problems, IRIS, Ionosphere and Seed, thereby improving the generalization of the boundary between the classes of each data set. The singular value decomposition (SVD) reduced norm 2 regularization combined with the two-level deep classifier ensemble gives the best results in our experiments.

KEYWORDS
Deep learning, support vector machine, feature subset selection, singular value decomposition,
regularization

1. INTRODUCTION
Machine learning is a domain of computational statistics, a field specialized in prediction making. It aims at artificial learning, i.e. the construction of algorithms that are capable of learning from data [1]. Such learning is based on developing a model from training data and then making decisions with that model on test data. Supervised machine learning [2] is marked by the presence of a supervisor, in the sense that a training set comprising a number of inputs with their corresponding outputs (associated labels) is provided to the machine for initial learning and model building. With the help of this model, the required output is then generated for inputs not present in the training set. Unsupervised learning [2], on the other hand, has no such supervisor; it tries to find hidden relations between unlabelled data. Classification and regression are techniques of supervised learning, whereas clustering and self-organizing neural network maps are techniques of unsupervised learning. Other learning approaches include semi-supervised learning, reinforcement learning and developmental learning.
In classification [3], the training data is divided into two or more classes. A model must be built that can distinguish between these categories and place new input instances into the class to which they belong. The performance measure of classification is the classification accuracy, and the goal of any learning scheme is to achieve the best possible classification accuracy.
Several classification algorithms have been applied to various datasets, but there is always scope for improving performance through new techniques, and machine learning aims at obtaining high test accuracy. Popular classifiers in wide use include the k-nearest-neighbour classifier, decision tree classifier, frequent pattern classifier, Bayes classifier, rule-based classifier and support vector machine (SVM) classifier [4]. Among these, the SVM [5] is currently the most studied and implemented classifier because of its high accuracy and its exceptional ability to model complex non-linear decision boundaries by mapping non-linear data to higher dimensions. Hence both linear and non-linear data can be classified well by this classifier. Moreover, because the model is defined by its support vectors, an SVM classifier is very compact. Groups of people can often make better decisions than individuals [6]; likewise, an ensemble of classification models yields better classification accuracy than an individual classifier model.
A prediction task can be a time series problem, where the training data for model generation is recorded over a long span of time; in such cases batch learning is used [7]. In batch learning, the models generated on the individual batches up to the previous time unit are ensembled to form the resultant model for testing the present batch. A prediction task can also be non-time-series, where the training data contains various instances observed at one particular time. Batch learning is not applicable in such classifications since all the instances are equally related to each other. For such problems, the individual models of an ensemble can instead be generated through bagging [6] with bootstrap subsampling, deep learning, feature subset selection and similar techniques. These techniques aim at increasing the diversity of the ensemble of classifiers. Even an ensemble of classifiers, however, suffers from the ill-posed problem of overfitting. This problem can be handled through regularization: the vector norms applied in the regularization process control overfitting by reducing the mean squared distance between the training instances.
This paper deals with three prediction problems: first, the prediction of the type of iris plant from among Iris Setosa, Iris Versicolour and Iris Virginica; second, the prediction of a good or bad radar return from the ionosphere; and third, the prediction of the type of wheat kernel from among the Kama, Rosa and Canadian varieties. The predictions are made through a regularized weighted ensemble of deep support vector machine classifiers. The individual models for the ensemble learning are generated through feature subset selection and deep learning. Weights are assigned to each individual model by a majority voting technique and are then regularized through four variations, i.e. norm 1, norm 2, Tikhonov and singular value decomposition (SVD) reduced norm 2 regularization. This form of regularization reduces the curvature of each depression and convolution of the non-linear SVM boundary, so the loss function is modified to promote generalization and provide the essential curve fitting over the input feature vectors for classification. To the best of our knowledge, this technique of regularizing the weights, together with deep learning and such ensemble learning approaches, has not yet been applied to these prediction problems for dealing with classifier overfitting in a supervised machine learning task. In the remainder of the paper, the data sets and background concepts are first described; the algorithm, framework, experimental results and comparative analysis then follow.

2. DATA SET
The three prediction problems used in this paper are summarized in Table 1. The training set and test set comprise 70% and 30% of each database respectively. This 7:3 ratio is somewhat arbitrary, but it is chosen because it is a practical ratio used in most machine learning experiments.
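As an illustration only, such a 70/30 split can be reproduced with scikit-learn; the function, the stratification and the random seed below are our own choices rather than details reported in the paper.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the IRIS data (150 instances, 4 features, 3 classes).
X, y = load_iris(return_X_y=True)

# Hold out 30% of the instances for testing, stratified so that each class
# keeps the same proportion in both splits (105 training / 45 test instances).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```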
2.1. IRIS Dataset
The Iris database was created by R.A. Fisher and donated by Michael Marshall in July 1988 [8]. It is a popular dataset and has been used successfully in several prediction and pattern recognition problems. The data set contains 3 classes specifying the type of iris plant from among Iris Setosa, Iris Versicolour and Iris Virginica, with a total of 50 instances per class. The classification problem is the prediction of the category of iris plant. The four attributes (features) in each record of the dataset are sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm). Table 2 gives the number of instances of each class in the total, training and test data of the Iris data set. Table 3 summarizes major previous work reported on the Iris data.
Table 1. Instances distribution in training and test set of data

| S. No. | Dataset | Year of data set creation | Number of classes | Number of features | Total number of instances | Training number of instances | Test number of instances |
|---|---|---|---|---|---|---|---|
| 1 | IRIS | 1988 | 3 | 4 | 150 | 105 | 45 |
| 2 | Ionosphere | 1989 | 2 | 34 | 351 | 246 | 105 |
| 3 | Seed | 2012 | 3 | 7 | 210 | 147 | 63 |

Table 2. Number of instances of each class in total, training and test data set of Iris Data set

| S. No. | Class | Total number of instances | Training number of instances | Test number of instances |
|---|---|---|---|---|
| 1 | Iris Setosa | 50 | 35 | 15 |
| 2 | Iris Versicolour | 50 | 35 | 15 |
| 3 | Iris Virginica | 50 | 35 | 15 |

Table 3. Previous major experiments reported on Iris data set and classification accuracy achieved in each case

| S. No. | Year of research | Problem statement | Reported classification accuracy (%) |
|---|---|---|---|
| 1 | 2014 | Neuro-fuzzy classifier system [9] | 96.70 |
| 2 | 2013 | Evolving neural network ensembles using string genetic algorithms for pattern classification [10] | 93.30 |
| 3 | 2012 | Hybrid SVM and decision tree classifier [11] | 97.08 |
| 4 | 2012 | Classifier ensemble for SVM [12] | 95.00 |
| 5 | 2011 | One class SVM weighted bagging [13] | 92.00 |
| 6 | 2010 | Large margin classifier SVM [14] | 95.30 |
| 7 | 2010 | Feature subset selection in neural network classifier [15] | 97.00 |
| 8 | 2008 | SVM based semi supervised classification [16] | 95.00 |
| 9 | 2003 | SVM ensemble with majority voting [17]: SVM / Bagging / Boosting | 96.50 / 96.80 / 97.20 |
2.2. Ionosphere Dataset
The Ionosphere database was created at Johns Hopkins University and donated by Vince Sigillito in 1989 [18]. The dataset was collected with a radar system consisting of a phased array of 16 high-frequency antennas, which records the free electrons in the ionosphere. The two classes into which the returns are categorized are good and bad ionosphere returns. Predictions are made on the basis of 34 attributes; this large attribute set distinguishes this dataset from the other two datasets described in this section. Table 4 shows the number of instances of each class in the total, training and test data of the Ionosphere data set, and Table 5 lists previous major related contributions on the Ionosphere data set.
Table 4. Number of instances of each class in total, training and test data set of Ionosphere Data set

| S. No. | Class | Total number of instances | Training number of instances | Test number of instances |
|---|---|---|---|---|
| 1 | Good radar signal | 224 | 168 | 56 |
| 2 | Bad radar signal | 127 | 78 | 49 |

Table 5. Previous major experiments reported on Ionosphere data set and classification accuracy achieved in each case

| S. No. | Year of research | Problem statement | Reported classification accuracy (%) |
|---|---|---|---|
| 1 | 2014 | Classifier ensemble based on weighted accuracy and diversity [19] | 94.00 |
| 2 | 2014 | Weighted classifier ensemble SVM [20] | 94.00 |
| 3 | 2013 | Artificial immune recognition through SVM classification [21] | 93.00 |
| 4 | 2013 | One class ensemble classifier majority voting approach [22] | 89.80 |
| 5 | 2010 | Fast local radial basis function kernel SVM classification [23] | 93.72 |
| 6 | 2008 | Oblique decision tree embedded with SVM classification [24] | 92.59 |
| 7 | 2008 | SVM infinite ensemble learning [25] | 92.00 |
| 8 | 2006 | Evolving ensemble of classifiers with majority voting [26] | 81.00 |

2.3. Seed Dataset


The Seed database is comparatively new and therefore has very few previous experiments reported on it. The dataset records geometrical properties of the wheat kernel, which characterize and differentiate the wheat varieties Kama, Rosa and Canadian. The data were collected using X-ray techniques [27]. The seven kernel parameters that form the feature set are area (A), perimeter (P), compactness (C = 4*pi*A/P^2), length of kernel, width of kernel, asymmetry coefficient and length of kernel groove. Table 6 shows the number of instances of each class in the total, training and test sets of the Seed data.
Table 7 shows the major previous similar contribution on the Seed data. (To the best of our knowledge, this data set has so far been studied in similar settings only by its developers, so a single previous work is reported in Table 7.)
Table 6. Number of instances of each class in total, training and test data set of Seed Data set

| S. No. | Class | Total number of instances | Training number of instances | Test number of instances |
|---|---|---|---|---|
| 1 | Kama | 70 | 49 | 21 |
| 2 | Rosa | 70 | 49 | 21 |
| 3 | Canadian | 70 | 49 | 21 |

Table 7. Previous major experiments reported on seed data set and classification accuracy achieved in each case

| S. No. | Year of research | Problem statement | Reported classification accuracy (%) |
|---|---|---|---|
| 1 | 2012 | Complete gradient clustering with K Mean algorithm [28] | 92.00 |

3. BACKGROUND APPROACH
3.1. SVM Classifier
The origin of SVM classifiers lies in the VC dimension. The VC dimension is defined for a set of functions: it is the maximum number of points that can be separated in all possible ways by that set of functions. Non-linearly separable data are transformed to a higher dimension so that classification through SVM becomes possible (Figure 1). The margin between the classes can be a soft margin or a hard margin (Figure 2). In soft margin classification, the generated model compensates for misclassified instances. A hard margin, on the other hand, does not allow any misclassification; instead it plots a strict non-linear boundary to avoid misclassification. SVM classifies the data by optimizing a hinge loss function. Soft margin classification is more prevalent than hard margin classification since the latter suffers from a very high rate of overfitting.
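As a minimal sketch of the soft versus hard margin distinction (the experiments in this paper call libSVM directly; scikit-learn's SVC below wraps the same library, and the parameter values are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Scale features to [-1, +1], as is done for the experiments in this paper.
X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# Soft-margin SVM with an RBF kernel: a small C tolerates margin violations
# (a smoother boundary), while a very large C approximates a hard margin.
soft_margin = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)
near_hard_margin = SVC(kernel='rbf', C=1e6, gamma='scale').fit(X, y)

# The number of support vectors usually differs between the two settings.
print(len(soft_margin.support_), len(near_hard_margin.support_))
```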

Figure 1. SVM
3.2. Ensemble of Classifiers
Ensemble learning is the process of training multiple learning machines individually and then combining their outputs, much like a committee of decision makers. The principle behind this method is that the individual predictions, combined appropriately, should have better overall accuracy on average than any individual committee member [29]. The prime aggregation methods applied in ensemble learning are voting techniques such as majority voting, Borda count aggregation, behaviour-knowledge-based aggregation and dynamic classifier selection [30]. Of these, our proposed learning technique uses majority voting [31] aggregation. The three versions of majority voting are unanimous voting, simple voting and plurality voting, of which plurality voting is the most optimal form.
In the proposal of this paper, majority voting aims at giving higher weightage to the more qualified experts in the ensemble of classifiers, where expertise is inversely proportional to the classification error (a small sketch follows).
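A minimal sketch of weighted plurality voting under the assumption, stated above, that each classifier's weight is inversely proportional to its classification error; the function and variable names are our own:

```python
import numpy as np

def weighted_plurality_vote(predictions, errors, n_classes):
    """predictions: (n_classifiers, n_samples) array of predicted class labels.
    errors: per-classifier error rates; expertise ~ 1 / error."""
    weights = 1.0 / (np.asarray(errors, dtype=float) + 1e-12)
    weights /= weights.sum()                     # normalize the weights

    n_samples = predictions.shape[1]
    votes = np.zeros((n_samples, n_classes))
    for w, preds in zip(weights, predictions):
        votes[np.arange(n_samples), preds] += w  # accumulate weighted votes
    return votes.argmax(axis=1)                  # class receiving the plurality

# Three experts predicting five samples over three classes.
preds = np.array([[0, 1, 2, 1, 0],
                  [0, 2, 2, 1, 1],
                  [1, 1, 2, 0, 0]])
print(weighted_plurality_vote(preds, errors=[0.05, 0.20, 0.30], n_classes=3))
```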

Figure 2. Hard Margin SVM plot

3.3. Feature Subset Selection
Feature selection algorithms attempt to select features that are useful and to discard features that are unhelpful or even harmful to learning [32]. Feature subset selection is an important pre-processing phase in machine learning [33]. In this phase some features may be removed entirely, even though those removed features could become important in combination with other features. This disadvantage of feature selection can be removed by utilizing it within ensemble learning: several combinations of features are selected by some algorithm to form the individual models to be ensembled. Common selection algorithms are exhaustive selection (evaluation of all possible subsets of features), branch and bound selection (evaluation using a branch and bound algorithm), sequential forward selection (SFS) (select the best single feature and then add one feature at a time such that decision accuracy is maximized), sequential backward selection (SBS) (start with all features and remove one feature at a time such that decision accuracy is maximized) and best individual feature selection (evaluate all N features individually and take the best set), among others [34]. SFS is a bottom-up procedure and SBS is a top-down procedure. Exhaustive selection is the ideal approach but is feasible only when the number of attributes is small; otherwise the number of possible combinations grows exponentially and becomes impossible to handle.
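A short sketch of exhaustive subset selection on the four IRIS features (2^4 - 1 = 15 subsets); scoring each subset by cross-validated SVM accuracy is our illustrative choice, not a detail taken from the paper:

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

# Evaluate every non-empty feature subset with a 5-fold cross-validated SVM.
subset_scores = {}
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        score = cross_val_score(SVC(kernel='rbf'), X[:, list(subset)], y, cv=5).mean()
        subset_scores[subset] = score

best = max(subset_scores, key=subset_scores.get)
print(best, round(subset_scores[best], 3))
```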
3.4. Deep Learning
Deep SVM is inspired by the success of deep neural networks [35], deep belief networks [35] and deep Boltzmann machines [36]. A multilayer perceptron with many hidden layers is an example of deep learning. Deep learning is a class of machine learning techniques that learns multiple levels of representation in deep architectures [37]. Conventional classifiers risk becoming trapped in local optima of the objective function, whereas deep architectures learn feature representations through both supervised training and fine tuning at the deeper phases of learning. The first phase of a deep SVM is the standard training process; in the second phase, the kernel activations of the support vectors of the first phase are set as inputs for another SVM, and so on up to whatever level of tuning is required [38]. Usually the tuning starts to repeat after 3-4 levels of deep learning. This training procedure is greedy in nature, which makes the computation very efficient. Ensembling each phase of the deep learning further increases the precision of the model. However, even with fine tuning, the model function may still overfit the data points because of the non-linear kernel activation learning.
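A rough sketch of one deep-SVM step in the spirit of [38], under the assumption that the "kernel activations" are RBF kernel values between every sample and the support vectors of the previous level; the helper name and parameter values are ours:

```python
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def add_deep_level(X, y, gamma=0.5, C=1.0):
    """Train one SVM level and return the kernel activations of its support
    vectors, which serve as the input features of the next level."""
    model = SVC(kernel='rbf', gamma=gamma, C=C).fit(X, y)
    # Kernel activation of every sample against the level's support vectors.
    activations = rbf_kernel(X, model.support_vectors_, gamma=gamma)
    return model, activations

X, y = load_iris(return_X_y=True)
level1, X2 = add_deep_level(X, y)    # level 1: SVM on the raw features
level2, X3 = add_deep_level(X2, y)   # level 2: SVM on the kernel activations
print(X.shape, X2.shape, X3.shape)
```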

3.5. Regularization
The concept of regularization came into existence in the 1990s. In supervised machine learning problems, accurate prediction is more important than a close fit of the function to the data; generalization is therefore desirable, or in other words the overfitting of the function has to be checked. In Figure 3 the blue curve is a degree-2 curve, the red curve a degree-4 curve and the green curve a degree-8 curve, the highest degree of the three. The green curve plots the closest-fitting boundary between the two classes, but its test accuracy decreases; the blue curve shows the lowest training accuracy but the best prospects for test accuracy. The green curve thus marks overfitting, and overfitting occurs when generalization decreases. Regularization is a measure to check this overfitting and provides problem stability. Regularization restricts the hypothesis space to a linear function or to a polynomial of a particular degree according to the scenario, and smoothness of the function is obtained by placing the function in a reproducing kernel Hilbert space (RKHS) [39]. A regularization parameter associated with the regularization term of the optimization function controls the trade-off between stability and accuracy.

Figure 3. Fitting of classifier on the data set

In ensemble learning, regularization can be applied to the optimization of the loss function. Doing so reduces the degree of the best-fit polynomial and improves the test classification accuracy. Alternatively, overfitting can be handled by keeping the degree of the best-fit function constant and regularizing the weights associated with the individual classifiers participating in the ensemble. This reduces the curvature of each positive or negative depression in the curve without reducing the degree of the whole curve; the loss function is thereby modified to provide the boundary fitting over the input feature vectors.
Another statistical technique is bootstrap resampling, in which a new dataset DT' is drawn from the original dataset DT by random sampling with replacement. Bagging is performed by applying this over several iterations and then performing ensemble learning on the resulting models. For a large DT, the number of individual samples that are absent from any given bootstrapped dataset is large: the probability that the first training sample is not selected in one draw is (1 - 1/N), and the probability that it is not selected at all is (1 - 1/N)^N [1]. As N -> infinity, (1 - 1/N)^N -> 1/e ≈ 0.37, so only about 63% of the original training samples are represented in any bootstrapped set. Since bagging reduces variance, it provides an alternative approach to regularization [6]: even if each classifier is individually overfit, they are likely to be overfit to different things.
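The 63% figure quoted above is easy to verify numerically; the following small sketch assumes nothing beyond the standard definition of a bootstrap sample:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000                        # size of the original training set DT

# Draw one bootstrap sample of size N with replacement and count how many
# distinct original samples it actually contains.
sample = rng.integers(0, N, size=N)
fraction_included = np.unique(sample).size / N

print(round(fraction_included, 3))          # close to 0.632
print(round(1 - (1 - 1 / N) ** N, 3))       # analytic value 1 - (1 - 1/N)^N ≈ 1 - 1/e
```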

4. PROPOSED WORK
In our work, a regularized ensemble of deep SVM classifiers is used, which shows a marked improvement in the classification accuracy of the prediction problems. For training and optimization we use the popular library libSVM [40, 41]. The ensemble of deep classifiers is generated using the four frameworks shown in Figures 4-7. Figure 4 shows the ensemble of classifiers based on the feature subset selection framework, where the individual models are formed by training on different feature subsets; even features that do not contribute well in isolation or in the full feature set may work well in some combinations, and this framework explores the best possible decisions over such combinations. Figure 5 shows the level-1 ensemble of deep classifiers, where each individual model is generated by the training in one phase of deep learning; this improves the basic training through the fine tuning of the deep phases. Figure 6 shows the level-2 ensemble of deep classifiers, where fine tuning is carried one level further. Figure 7 combines the motives of Figures 4 and 6, i.e. ensemble learning of deep classifiers with feature subset selection.

Figure 4. Ensemble of classifiers based on feature subset selection framework


Figure 5. Ensemble of deep classifiers level 1 framework

Figure 6. Ensemble of deep classifiers level 2 framework

Figure 7. Ensemble of deep classifiers learning with feature subset selection

For SVM, the loss function optimized is the hinge loss L(f(x), y) = max(0, 1 - y·f(x)). We observe that the regularization technique producing the best accuracy for our proposed work combines the SVD-reduced weight matrix, with regularization parameter λ1, and the square of the norm 2 of the weight matrix, with regularization parameter λ2. The other regularization factors considered are norm 1, norm 2 and Tikhonov regularization. The objective function is described in equation (1):

    min over ω of  Σ_i L(f(x_i), y_i) + λ1·SVD(ω) + λ2·(||ω||2)²        (1)

Here the weights ω_i are obtained through regularized majority voting.
Algorithm 1: Regularized ensemble of classifiers using exhaustive feature subset selection
1: Start
2: Find all the possible combinations of features
3: Train an SVM classifier on each combination obtained in step 2
4: Estimate the weights {ω1 ... ωt} associated with each individual model through regularized majority voting
5: Evaluate the ensemble classifier model
6: Report the ensemble model, the classification accuracy on the test data set and the weights {ω1 ... ωt}
7: End
Algorithm 2: Regularized ensemble of classifiers using best N feature subset selection
1: Start
2: Train an SVM classifier on each individual feature
3: Record the accuracy obtained and sort the corresponding features in descending order of accuracy
4: Train SVM classifiers on Classifierset = {Best N, Best N-1, Best N-2, ..., Best 1}
5: Estimate the weights {ω1 ... ωt} associated with each member of Classifierset through the regularized majority voting technique
6: Evaluate the ensemble classifier model
7: Report the ensemble model, the classification accuracy on the test data set and the weights {ω1 ... ωt}
8: End
Algorithm 3: Regularized ensemble of deep classifiers
1: Start
2: for level = 1 : t
3:    Train an SVM classifier on data set D and record the model generated in [Model]
4:    Generate a new data set D' from the support vectors of the generated model
5:    D = D'
6: end for
7: Estimate the weights {ω1 ... ωt} associated with each member of [Model] through the regularized majority voting technique
8: Evaluate the ensemble classifier model
9: Report the ensemble model, the classification accuracy on the test data set and the weights {ω1 ... ωt}
10: End
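A compact sketch of the loop in Algorithm 3, assuming D' consists of the support vectors of the current level and, for brevity, weighting each level by its plain validation accuracy; the regularized weight estimation itself is described next and is not repeated here:

```python
import numpy as np
from sklearn.svm import SVC

def deep_classifier_ensemble(X, y, X_val, y_val, levels=2, gamma=0.5, C=1.0):
    """Train `levels` SVMs, each one on the support vectors of the previous
    level, and weight each level by its (unregularized) validation accuracy."""
    models, weights = [], []
    D_X, D_y = X, y
    for _ in range(levels):
        model = SVC(kernel='rbf', gamma=gamma, C=C).fit(D_X, D_y)
        models.append(model)
        weights.append(model.score(X_val, y_val))   # expertise estimate
        sv = model.support_                         # indices of the support vectors
        D_X, D_y = D_X[sv], D_y[sv]                 # D <- D' for the next level
    weights = np.asarray(weights)
    return models, weights / weights.sum()
```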
The regularization parameter associated with the regularization term is an important quantity that controls the trade-off between stability and accuracy. Many regularization techniques exist, and this remains a topic of ongoing research. L1 regularization is the norm 1 regularization factor, which penalizes all factors equally and therefore focuses on selecting only the relevant factors. Its numerical definition is λ1·||ω||1. The L1 penalty is linear, which tends to produce many points with zero curvature; a disadvantage of this regularizer is slow convergence on large-scale problems. Secondly, the L2 regularizer minimizes curvature at all points of the curve by applying a penalty that scales with the square of the curvature; its numerical definition is λ1·||ω||2, and the complexity of L2 regularization is greater than that of the L1 regularizer. Thirdly, the Tikhonov regularizer is a special case of L2 regularization, numerically defined by the term (λ1)²·(||ω||2)². Finally, the SVD reduced norm 2 regularization is represented as λ1·SVD(ω) + λ2·(||ω||2). SVD has multiple roles: it can be viewed as a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items, as a method for identifying and ordering the dimensions along which the data points exhibit the most variation, and as a method for data reduction by finding the best approximation of the original data points using fewer dimensions. The regularization path varies with the experimental conditions.
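For concreteness, the four regularization terms just listed can be written down for a weight matrix W; treating SVD(W) as the sum of the singular values of W is our own interpretation of the text:

```python
import numpy as np

def regularization_terms(W, lam1=0.1, lam2=0.1):
    """Return the four regularization terms discussed in the text for a
    weight matrix (or vector) W; lam1 and lam2 are illustrative values."""
    W = np.atleast_2d(np.asarray(W, dtype=float))
    norm2 = np.linalg.norm(W)                    # Euclidean / Frobenius norm
    l1 = lam1 * np.abs(W).sum()                  # Norm 1:    lam1 * ||W||_1
    l2 = lam1 * norm2                            # Norm 2:    lam1 * ||W||_2
    tikhonov = (lam1 ** 2) * norm2 ** 2          # Tikhonov:  lam1^2 * ||W||_2^2
    singular_values = np.linalg.svd(W, compute_uv=False)
    svd_l2 = lam1 * singular_values.sum() + lam2 * norm2   # SVD + Norm 2
    return {'L1': l1, 'L2': l2, 'Tikhonov': tikhonov, 'SVD+L2': svd_l2}

print(regularization_terms([0.4, 0.3, 0.2, 0.1]))
```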

5. EXPERIMENTS
In all the experiments listed below, an SVM classifier is used because it evaluates dot products of vectors in a higher dimension to construct the dividing boundary. The choice of kernel function depends on the model to be plotted: a polynomial kernel allows feature conjunctions to be modelled up to the order of the polynomial, a radial basis function (RBF) kernel allows circular boundaries to be plotted in higher dimensions, and a linear kernel gives linear boundaries in higher dimensions. Multiclass classification is best achieved through the RBF kernel. If γ is the kernel bandwidth parameter and (Xi, Xj) is the pair of vectors to be transformed to higher dimensions, equation (2) gives the RBF kernel:

    K(Xi, Xj) = exp(-γ·||Xi - Xj||²)        (2)
Another important algorithm used is grid search for parameter estimation. In v-fold cross-validation, the training set is divided into v subsets of equal size; classifiers are trained on v-1 subsets and tested on the remaining subset. Each instance is thus predicted once, and the cross-validation accuracy is the percentage of data that is correctly classified. The kernel parameters (C, γ) are estimated using cross-validation: various combinations of (C, γ) are tried, and the one with the best cross-validation accuracy is picked (a minimal grid-search sketch is given after the list of cases and settings below). In the experiments of our proposed work, the libSVM library [40, 41] is used to train multiclass SVMs with the RBF kernel. The features in the training and test datasets are scaled to the range [-1, +1]. Ten-fold cross-validation is used to choose the kernel bandwidth parameter γ and the SVM C parameter through grid search; the ranges of C and γ are [2^-10, 2^-9, ..., 2^5] and [2^-5, 2^-4, ..., 2^10] respectively, and the ranges of the regularization parameters are 0 < λ1 < 0.5 and 0 < λ2 < 0.5. Five cases of experiment are described below; results of the bagging technique are listed in Table 8 for comparison.
Case 1: Bagging Ensemble of classifiers
Case 2: Ensemble of classifiers based on feature subset selection
Case 3: Ensemble of classifiers in deep learning level 1
Case 4: Ensemble of classifiers in deep learning level 2
Case 5: Ensemble of classifiers in deep learning level 1 with the feature subset selection
Cases 2,3,4,5 have subcases for the following regularization schemes:
Setting 1: SVD reduced Norm 2 regularization
Setting 2: Norm 1 regularization
Setting 3: Norm 2 regularization
Setting 4: Tikhonov regularization
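The grid search referred to above can be sketched as follows with scikit-learn's libSVM-backed SVC, using the exponential ranges for C and γ stated in the text; the dataset and scaler below are illustrative assumptions rather than the paper's exact scripts:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)   # scale to [-1, +1]

# Exponential grids as stated in the text: C in 2^-10 .. 2^5, gamma in 2^-5 .. 2^10.
param_grid = {'C': 2.0 ** np.arange(-10, 6),
              'gamma': 2.0 ** np.arange(-5, 11)}

# 10-fold cross-validation picks the (C, gamma) pair with the best CV accuracy.
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=10).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```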
Table 8. Results of Bagging ensemble of classifiers in all the three datasets

| S. No. | Dataset | Classification accuracy (%) in Bagging ensemble of classifiers |
|---|---|---|
| 1 | IRIS Data set | 96.66 |
| 2 | Ionosphere Data set | 87.87 |
| 3 | Seed Data set | 95.38 |

For feature subset selection, the IRIS data set uses Algorithm 1, i.e. exhaustive feature subset selection, the most thorough selection algorithm. For the Ionosphere and Seed data sets the number of features is too large for all possible attribute combinations to be evaluated, so both use Algorithm 2, i.e. best N feature subset selection. Figure 8 presents the classification accuracy results of the experiments on the IRIS dataset, and Figures 9 and 10 show 2D and 3D scatter plots respectively, in which different colours mark the different class vectors. Similarly, Figures 11, 12 and 13 give the corresponding results on the Ionosphere dataset, and Figures 14, 15 and 16 the corresponding results on the Seed dataset.

Figure 8. Results of experiments on IRIS Dataset

Figure 9. 2D Scatter plot between all pair of attributes in IRIS dataset

Figure 10. 3D Scatter plot between all pair of attributes in IRIS dataset

Figure 11. Results of experiment on Ionosphere dataset

Figure 12. 2D Scatter plot on the best set of features in Ionosphere dataset.

Figure 13. 3D Scatter plot on the best set of features in Ionosphere dataset
Figure 14. Results of experiment on seed dataset

Figure 15. 2D Scatter plot on the best set of features in Seed dataset.

Figure 16. 3D Scatter plot on the best set of features in Seed dataset.

6. OBSERVATION
The results of all three sets of experiments show improved classification accuracy over the major previously reported results in the case of the level-2 ensemble of deep classifiers with SVD reduced norm 2 regularization, which reaches nearly 99%. The time taken in this particular case for the various datasets is reported in Table 9. Note that the time taken for the Ionosphere data is considerably larger than for the other two datasets because of its comparatively large number of features. Deep learning on the complete dataset generates better results than deep learning on the feature subset selection schemes, because fine tuning in the presence of all the features is better than fine tuning on a feature subset. The penalty in norm 1 regularization deletes many noisy features by estimating their coefficients to be zero, since it is not differentiable at zero, whereas the penalty in norm 2 regularization uses all the input features in classification because it is differentiable at all points of the function. Hence norm 2 regularization achieves higher-order smoothness for curve estimation.
Table 9. Time taken in deep learning level 2 with full feature set ensemble learning with SVD reduced Norm 2 regularization

| S. No. | Dataset | Time (sec) |
|---|---|---|
| 1 | IRIS Data set | 16.37 |
| 2 | Ionosphere Data set | 123.78 |
| 3 | Seed Data set | 32.78 |

Next, since the bagging model includes only about 63% of the original training samples in any bootstrapped set (as discussed in section 3.5), the regularization provided by this technique is not as smooth as that of the ensemble of deep classifiers. The regularizers applied above can also be analysed on the basis of worst-case time complexity. In norm 1 regularization, a total of (t-1) sum operations are computed at run time, giving a time complexity of O(t). In norm 2 regularization there are (t-1) sum operations, t operations to square all the elements and one square root operation, giving a time complexity of O(3t), with a first-degree regularization parameter applied.
In Tikhonov regularization the time complexity of O(3t) is the same as for L2 regularization, but a second-degree (squared) regularization parameter is applied. In (SVD + Norm 2) there are two expressions involved: O(t²) for the SVD computation plus O(3t) for the norm 2 computation, hence an overall time complexity of O(t²).

7. CONCLUSION
The deep learning approach to improving classification accuracy is very prevalent in the artificial neural network field, while the deep SVM classifier is still an emerging concept. The experiments here demonstrate good scope for deep learning with SVM classifiers, and regularization of the deep learning yields a further improvement in classification accuracy. Many other regularization techniques could be applied for comparison and possibly better results, and other feature selection strategies such as SFS and SBS could also be applied for feature subset selection.

REFERENCES
[1] Rob Schapire, Theoretical Machine Learning, COS 511, Lec No. 1, p. 1-6, 2008
[2] R. Sathya, Annamma Abraham, Comparison of Supervised and Unsupervised Learning Algorithms for Pattern Classification, IJARAI, Vol 2, No. 2, 2013
[3] D. Michie, D.J. Spiegelhalter, C.C. Taylor, Machine Learning, Neural and Statistical Classification, Tutorial section 2.1, p. 6-16, 1994
[4] S. B. Kotsiantis, Supervised Machine Learning: A Review of Classification Techniques, Informatica, Vol 31, p. 249-268, 2007
[5] Koby Crammer, Yoram Singer, On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, JMLR 2, p. 256-295, 2001
[6] Hal Daume III, A Course in Machine Learning, Ensemble Learning, CIML V0-8, Ch 1, p. 148-155, 2012
[7] A. Vergara, Shankar Vembu, Tuba Ayhan, Margaret A. Ryan, Margie L. Homer, Ramon Huerta, Chemical gas sensor drift compensation using classifier ensembles, Sensors and Actuators B, p. 166-167, 2012
[8] Fisher's IRIS dataset, UCI repository, https://archive.ics.uci.edu/ml/datasets/Iris, 1988
[9] Vaishali Arya, R.K. Rathy, An Efficient Neuro-Fuzzy Approach for Classification of Iris Dataset, International Conference on Reliability, Optimization and Information Technology, p. 161-165, 2014
[10] Xiaoyang Fu, Shuqing Zhang, Evolving Neural Network Ensembles Using Variable String Genetic Algorithm for Pattern Classification, Sixth International Conference on Computational Intelligence, p. 81-85, 2013
[11] Anshu Bharadwaj, Sonajharia Minz, Hybrid Approach for Classification using Support Vector Machine and Decision Tree, International Conference on Advances in Electronics, Electrical and Computer Science Engineering, p. 337-341, 2012
[12] Hamid Parvin, Sajad Parvin, Robust Classifier Ensemble for Improving the Performance of Classification, Eleventh Mexican International Conference on Artificial Intelligence, IEEE special session, Vol 11, p. 52-57, 2012
[13] Xue-Fang Chen, Hong-Jie Xing, Xi-Zhao Wang, A modified AdaBoost method for one-class SVM and its application to novelty detection, IEEE, Vol 11, p. 3506-3511, 2011
[14] Hakan Cevikalp, Bill Triggs, Hasan Serhan Yavuz, Yalc, Mahide, Atalay Barkana, Large margin classifiers based on affine hulls, Elsevier, Vol 73, p. 3160-3168, 2010
[15] A. Marcano-Cedeño, J. Quintanilla-Domínguez, M.G. Cortina-Januchs, D. Andina, Feature Selection Using Sequential Forward Selection and Classification Applying Artificial Metaplasticity Neural Network, IEEE, No. 36, p. 2845-2850, 2010
[16] Narendra S. Chaudhari, Aruna Tiwari, Jaya Thomas, Performance Evaluation of SVM Based Semi-supervised Classification Algorithm, 10th Intl. Conf. on Control, Automation, Robotics and Vision, No. 10, p. 1942-1947, 2008
[17] Hyun-Chul Kim, Shaoning Pang, Hong-Mo Je, Daijin Kim, Sung Yang Bang, Constructing support vector machine ensemble, The Journal of the Pattern Recognition Society, Vol 36, p. 2757-2767, 2003
[18] Vince Sigillito, Ionosphere Dataset , UCI repository, https://archive.ics.uci.edu/ml/datasets/Ionosphere,
1989
[19] Xiaodong Zeng, Derek F. Wong, Lidia S. Chao, Constructing Better Classifier Ensemble Based on
Weighted Accuracy and Diversity Measure, The Scientific World Journal, Volume 2014, Article No.
961747, p. 1-12, 2014
[20] Shasha Mao, Licheng Jiao, Lin Xiong, Shuiping Gou, Bo Chen, Sai-Kit Yeung, Weighted classifier ensemble based on quadratic form, Elsevier, Vol 48, Issue 5, p. 1688-1706, 2014
[21] Darwin Tay, Chueh Loo Poh, Richard I. Kitney, An Evolutionary Data-Conscious Artificial Immune
Recognition System , Proceedings of the 15th annual conference on Genetic and evolutionary
computation, p. 1101-1108, 2013
[22] Eitan Menahem, Lior Rokach, Yuval Elovici, Combining One-Class Classifiers via Meta Learning,
Proceedings of 22 ACM international conference on information & knowledge management, No. 22,
p. 2435-2440, 2013
[23] Nicola Segata, Enrico Blanzieri, Fast and Scalable Local Kernel Machines, JMLR, Vol 1, p. 1883-1926, 2010
[24] Vlado Menkovski, Ioannis T. Christou, Sofoklis Efremidis, Oblique Decision Trees Using
Embedded Support Vector Machines in Classifier Ensembles , Vol 11, p. 1-6, 2008
[25] Hsuan-Tien Lin , Ling Li, Support Vector Machinery for Infinite Ensemble Learning, JMLR , Vol
9, p. 285-312, 2008
[26] Albert Hung-Ren Ko, Robert Sabourin, Alceu de Souza Britto, Evolving Ensemble of Classifiers in
Random Subspace, Proceedings of the 8th annual conference on Genetic and evolutionary
computation, p. 1473-1480, 2006
[27] Gorzatas Seed Data set, UCI repository, https://archive.ics.uci.edu/ml/datasets/seeds, 2012
[28] M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak, A Complete Gradient Clustering Algorithm for Feature Analysis of X-ray Images, Information Technology in Biomedicine, Springer-Verlag, p. 15-24, 2010
[29] Gavin Brown, Encyclopaedia of Machine Learning Vol 1, p. 312-320, 2010
[30] Robi Polikar, Ensemble based systems in decision making, IEEE, Vol 6, Issue 3, p. 21-45
[31] Hyun-Chul Kim, Shaoning Pang, Hong-Mo Je, Daijin Kim, and Sung-Yang Bang, Support Vector
Machine Ensemble with Bagging, Springer, LNCS 2388, p. 397-408, 2002
[32] David W. Opitz, Feature Selection for Ensembles, American Association for Artificial Intelligence,
AAAI Proceeding No. 99, p.1-6, 1999
[33] Mohamed A. Aly, Novel Methods for the Feature Subset Ensembles Approach, International
Journal of Artificial Intelligence and Machine Learning, Vol. 6, No. 4, p. 1-7, 2006
[34] Anil K. Jain, Robert P.W. Duin, Jianchang Mao, Statistical Pattern Recognition: A Review,IEEE
transactions on pattern analysis and machine intelligence, Vol 22, Issue 1, p. 4-37, 2000
[35] Dong Yu and Li Deng, Deep Learning and Its Applications to Signal and Information Processing ,
IEEE processing magazine Vol 28, Issue 1, p. 145-154, 2011
[36] Nitish Srivastava, Ruslan Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, ICML, 25th Annual Conference on Learning Theory, No. 25, p. 1-9, 2012
[37] Xue-Wen Chen, Xiaotong Lin, Big Data Deep Learning: Challenges and Perspectives, IEEE Access, Vol 2, p. 514-525, 2014
[38] Azizi Abdullah, Remco C. Veltkamp, Marco A. Wiering, An Ensemble of Deep Support Vector
Machines for Image Categorization, International Conference of Soft Computing and Pattern
Recognition, p.301-306, 2009.
[39] Hal Daume III, From zero to reproducing kernel hilbert spaces in twelve pages or less, p.1-12, 2004
[40] C.-C. Chang, C.-J. Lin, LIBSVM: A Library for Support Vector Machines, Software. Available at
http://www.csie.ntu.edu.tw/cjlin/libsvm, 2001.
[41] Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, A Practical Guide to Support Vector Classification, Department of Computer Science, National Taiwan University, Taipei 106, Taiwan, p. 1-16, 2003

Authors
Ms. Shruti Asmita (B.Tech., 2013, KEC Ghaziabad, Uttar Pradesh Technical University, Lucknow) is an M.Tech. Computer Science scholar at Banasthali University, Jaipur, and is pursuing her research internship at IIT-BHU (CSE), Varanasi. Her research interests include data mining, image processing, machine learning and sensor networks.
Dr. K.K. Shukla (Ph.D., 1993, Institute of Technology (BHU), Varanasi) is professor and current head of the department at the Indian Institute of Technology (Banaras Hindu University), Varanasi, India. He was awarded a B.Tech. from APSU, Rewa in 1980, an M.Tech. from IT (BHU) in 1982 and a Ph.D. from IT (BHU) in 1993. He has 30 years of research and teaching experience, has published more than 120 research papers in reputed journals and conferences and has more than 90 citations. His present research collaborations in India include ISRO and TCS; his international research collaborations include INRIA, France and ETS, Canada. He has authored several popular books on subjects including neuro-computers, RTS scheduling, fuzzy modelling and image compression. His fields of research include image processing and pattern recognition, fuzzy logic, wireless sensor networks and machine learning.

