
Front. Comput. Sci., 2018, 12(6): 1140–1148
https://doi.org/10.1007/s11704-016-6107-0

Convolutional adaptive denoising autoencoders for hierarchical feature extraction

Qianjun ZHANG, Lei ZHANG

Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu 610065, China


© Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Abstract  Convolutional neural networks (CNNs) are typical structures for deep learning and are widely used in image recognition and classification. However, the random initialization strategy tends to become stuck at local plateaus or even diverge, which results in rather unstable and ineffective solutions in real applications. To address this limitation, we propose a hybrid deep learning CNN-AdapDAE model, which applies the features learned by the AdapDAE algorithm to initialize CNN filters and then trains the improved CNN for classification tasks. In this model, AdapDAE is proposed as a CNN pre-training procedure, which adaptively obtains the noise level based on the principle of annealing, by starting with a high level of noise and lowering it as the training progresses. Thus, the features learned by AdapDAE include a combination of features at different levels of granularity. Extensive experimental results on the STL-10, CIFAR-10, and MNIST datasets demonstrate that the proposed algorithm performs favorably compared to CNN (random filters), CNN-AE (pre-training filters by autoencoder), and a few other unsupervised feature learning methods.

Keywords  convolutional neural networks, annealing, denoising autoencoder, adaptive noise level, pre-training

Received February 22, 2016; accepted November 11, 2016
E-mail: leizhang@scu.edu.cn

1 Introduction

Deep learning methods, being among the most successful learning methods developed, have recently become widely used in image recognition, speech recognition, natural language processing, and other fields. Most deep learning methods are based on the encoder–decoder architecture; this architecture attempts to preserve information content by always being able to reconstruct its own input, can be stacked to obtain multiple layers, and is very effective at recognition tasks. Methods using this architecture include stacked autoencoders (AEs), stacked denoising AEs (DAEs), stacked sparse AEs (SparseAEs), deep belief networks (DBNs) [1], and deep Boltzmann machines (DBMs) [2].

In addition to the above-mentioned deep architectures, another typical deep learning architecture is the convolutional neural network (CNN). CNNs are hierarchical models whose convolutional layers alternate with subsampling layers. Since their introduction by LeCun et al. [3] in the early 1990s, CNNs have demonstrated remarkable performance at a number of tasks such as handwritten digit recognition, object recognition, and face detection. CNNs exhibit two key properties that make them very useful for numerous vision applications: spatially shared weights and spatial pooling. Spatially shared weights imply that the units of a layer share an identical set of weights across spatial locations, which reduces the number of learnable parameters, and the pooling layers are responsible for reducing the sensitivity of the output to marginal input shifts and distortions.

However, numerous researchers [4–6] have reported that when starting from a random initialization, the training procedure of deep multi-layer neural networks, including CNNs, tends to become stuck at local plateaus or even diverge, which results in rather ineffective solutions. To circumvent this obstacle, researchers have proposed a number of new methods.
Masci et al. [7] proposed stacked convolutional AEs for hierarchical feature learning, which initializes a CNN with filters from a trained convolutional AE. Tan and Li [4] presented stacked convolutional AEs for steganalysis of digital images, which use a convolutional AE in the pre-training procedure. Lee et al. [8] proposed convolutional DBNs, which combine convolutional layers with a DBN and achieve remarkable performance on several visual recognition tasks. Ji et al. [9] constructed a multiple CNN pre-trained by a sparse AE for preprocessing-free surface material classification. We note that all the aforementioned pre-training methods rely on AEs or DBNs and not DAEs.

In this paper, we address this limitation by (i) learning features through AdapDAE, which is an improved DAE with an adaptive noise level, and (ii) sampling filters from the learned features to initialize the CNN filters. The AdapDAE algorithm builds on the observation that when the input is marginally corrupted during training, the network tends to learn fine features, whereas when the input is heavily corrupted, the network tends to learn general features. This implies that we can encourage the learning of features at different scales by training the same network at multiple noise levels. In AdapDAE, the principle of annealing is used to compute the noise level of the input neurons for each epoch. The noise level is maintained high during the initial training phase; however, as the training progresses, the noise is gradually reduced. At the end of training, the network includes a combination of both general and fine features. The learned features are then used to initialize the CNN filters; this ensures that the CNN is initialized with both general and fine features.

Our contributions are summarized below.

1) The feature learning algorithm AdapDAE is proposed as a CNN pre-training procedure, combining the advantages of both SparseAE and DAE with an adaptive noise level. Moreover, it overcomes the limitation that the noise level in the DAE is determined by experience and remains unchanged throughout the training process.

2) The performance impact of different noise level strategies is analyzed, and a calculation formula for the adaptive noise level in AdapDAE is provided based on the principle of annealing, which starts with a high level of noise that decreases as the training progresses. Thus, the features learned by AdapDAE include a combination of features at different levels of granularity, where both global and fine-grained details are simultaneously exploited.

3) Another contribution of our work is the proposal of a hybrid model called CNN-AdapDAE. In this method, the features learned by AdapDAE are applied to initialize the CNN filters; this overcomes the limitation wherein a randomly initialized CNN becomes stuck at a local plateau or diverges.

Experimentally, our algorithm is rigorously evaluated on the STL-10 [10], CIFAR-10 [11], and MNIST [6] datasets and compared with previous unsupervised feature learning methods. We observe that the proposed CNN-AdapDAE method achieves higher classification performance, outperforming CNN (random filters), CNN-AE (pre-training filters by AE), and a few other unsupervised feature learning methods.

The remainder of the paper is organized as follows. Section 2 presents a brief overview of AEs, DAEs, SparseAEs, and CNNs. Section 3 describes the proposed algorithm. Section 4 presents the experiments and results. Finally, the concluding remarks are presented in Section 5.

2 Background

In this section, we introduce a few classical deep learning methods, namely AEs, DAEs, SparseAEs, and CNNs. These methods are also the foundation of our proposed algorithm.

2.1 Autoencoders

A classical AE [12] consists of an encoder and a decoder. The encoder maps an input x into a hidden representation using the function y = f(Wx + b), where f is the encoding function, which can be linear, sigmoidal, or tanh; the matrix W is the weights; and the vector b is the biases. The matrix W and the vector b are the parameters of the network. The decoder maps the hidden representation y back to the original input values using the function z = g(W'y + b'), where g is the decoding function. The parameters are optimized to minimize the mean square error ||x − z||^2. Typically, W and W' are constrained by W' = W^T, where (·)^T denotes transposition.
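For illustration only (this sketch is not part of the paper), the encoder–decoder pass of such a tied-weight AE and its squared reconstruction error can be written in Python/NumPy as follows; the sigmoid activations and the layer sizes N and H are assumptions.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def ae_forward(x, W, b, b_prime):
        # Encoder y = f(Wx + b); decoder z = g(W'y + b') with tied weights W' = W^T.
        y = sigmoid(W @ x + b)
        z = sigmoid(W.T @ y + b_prime)
        return y, z, np.sum((x - z) ** 2)      # squared reconstruction error ||x - z||^2

    # Hypothetical sizes: a 64-dimensional input and 32 hidden units.
    rng = np.random.default_rng(0)
    N, H = 64, 32
    W = 0.01 * rng.standard_normal((H, N))
    y, z, err = ae_forward(rng.standard_normal(N), W, np.zeros(H), np.zeros(N))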
2.2 Denoising autoencoders

DAEs [13] are straightforward modifications of the classical AEs, which are trained to reconstruct a clean version of an input from its corrupted version. Prior work has revealed that DAEs can be stacked to learn features that are useful for numerous tasks. For example, DAEs have been established to be an empirically successful alternative to restricted Boltzmann machines (RBMs) for pre-training deep networks.

A DAE is trained by first finding the latent representation y = f(W x̃ + b), from which the original input z = g(W'y + b') is to be reconstructed; here, x̃ is the corrupted version obtained by defining a noise distribution p(x̃ | x, v). The amount of corruption is regulated by the parameter v.
Common noise selections include additive isotropic Gaussian noise, salt-and-pepper noise for gray-scale images, and masking noise [4]. Masking noise has been used in most simulations. The objective function for a DAE can be expressed as L(x, z) = ||x − z||^2.

2.3 Sparse autoencoders

SparseAEs are another modification of the classical AEs, where a sparsity constraint is imposed on the hidden units. Sparsity has become a concept of interest since it was introduced in computational neuroscience in the context of sparse coding in visual systems [14]. It has been a key element of deep convolutional networks exploiting a variant of AEs [15] with a sparse distributed representation; it has also become a key ingredient in DBNs [16].

The objective for a SparseAE can be expressed as

L(x, z) = ||x − z||^2 + β Σ_j KL(ρ || ρ̂_j),   (1)

where β controls the weight of the sparsity penalty term; ρ is a sparsity parameter, which is typically a small value close to zero; ρ̂_j is the average activation of hidden unit j; and

KL(ρ || ρ̂_j) = ρ log(ρ/ρ̂_j) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_j))

is the Kullback–Leibler divergence, which is a standard function for measuring the difference between two distributions.
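For concreteness, the penalty term of Eq. (1) can be computed as in the following NumPy sketch (an illustration under assumed shapes, not code from the paper); rho_hat holds the average activation of each hidden unit over a batch of sigmoid activations.

    import numpy as np

    def kl_sparsity_penalty(hidden_activations, rho=0.05):
        # hidden_activations: (batch, H) sigmoid outputs, each value strictly in (0, 1).
        rho_hat = hidden_activations.mean(axis=0)     # average activation of each hidden unit
        kl = rho * np.log(rho / rho_hat) \
             + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))
        return kl.sum()                               # summed over hidden units j, as in Eq. (1)

    def sparse_ae_objective(x, z, hidden_activations, beta=3.0, rho=0.05):
        return np.sum((x - z) ** 2) + beta * kl_sparsity_penalty(hidden_activations, rho)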
2.4 CNNs

A CNN is a special type of multi-layer neural network. Similar to other neural networks, CNNs are trained with a version of the backpropagation algorithm; the difference lies in the structure.

Figure 1 shows a typical CNN structure with seven layers [17]. The first layer is the input layer, followed by a number of convolution-subsampling layers, and finally a full connection layer and a classifier layer. The C1-layer is a convolutional layer that contains N1 feature maps. Each convolutional layer is followed by a subsampling layer. Using local receptive fields and weight-tying in the convolutional layers reduces the number of learnable parameters and the computational complexity. The output of the convolutional layer is used as the input for the next subsampling layer. The S2-layer is a subsampling layer, where the number of feature maps is still N1; however, their size is reduced by pooling over neighboring units. The C3-layer and S4-layer are similar to the C1-layer and S2-layer, except that they contain N2 feature maps. There is a full connection between F5 and S4, wherein each feature map in the F5-layer is connected to all the N2 feature maps in the S4-layer. The output is obtained from the last full connection layer to complete the classification.

Fig. 1  Typical CNN structure with seven layers [17] (C1 is the first convolutional layer with N1 feature maps, and S2 is the first subsampling layer)

3 Proposed algorithm

In this section, we describe the proposed CNN-AdapDAE algorithm. The features are first learned using the AdapDAE algorithm by training on random patches; then, these features are applied to initialize the CNN filters. Finally, the improved CNN is trained for classification. Figure 2 shows our framework. The framework involves several stages and is similar to those used in computer vision [8, 18].

Our model performs the following steps to learn features:

• extract random patches from the original unlabeled training images;
• apply pre-processing to the patches;
• learn features using the AdapDAE algorithm.

Given the learned features and a set of labeled training images, we can perform initialization and classification:

• initialize the CNN filters from the above learned features;
• train the CNN and perform classification.

The framework and its parameters are detailed below.

3.1 Feature learning

3.1.1 Data

First, random patches are extracted from the unlabeled input images. Each patch has dimension w × w and d channels, and can therefore be represented as a vector in R^N using pixel intensity values, with N = w × w × d. A dataset is then constructed from m randomly sampled patches, X = {x(1), x(2), . . . , x(m)}, where x(i) ∈ R^N. To obtain a reasonable feature representation, we should ensure that there is an adequate number of patches. Finally, the patches are passed to the pre-processing and unsupervised learning steps.
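A possible implementation of this patch-extraction step is sketched below in NumPy (illustrative only; the random seed and array layout are assumptions):

    import numpy as np

    def extract_random_patches(images, m, w, seed=0):
        # images: (num_images, height, width, d) unlabeled training images.
        num_images, height, width, d = images.shape
        rng = np.random.default_rng(seed)
        X = np.empty((m, w * w * d))
        for i in range(m):
            img = images[rng.integers(num_images)]
            r = rng.integers(height - w + 1)
            c = rng.integers(width - w + 1)
            X[i] = img[r:r + w, c:c + w, :].reshape(-1)   # vector in R^N, N = w * w * d
        return X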
3.1.2 Pre-processing

Before running a learning algorithm on our input data points, it is useful to normalize the patches. Each patch x(i) is normalized by subtracting the mean and dividing by the standard deviation of its elements. After normalization, whitening is performed to de-correlate pixels and remove redundant features from the raw images [13]; this is an important pre-processing step for numerous unsupervised algorithms.
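These two pre-processing steps could be implemented along the following lines (a sketch only; the paper does not specify the whitening variant, so ZCA whitening and the regularization constants are assumptions):

    import numpy as np

    def normalize_patches(X, eps=1e-8):
        # Per patch: subtract the mean and divide by the standard deviation of its elements.
        mean = X.mean(axis=1, keepdims=True)
        std = X.std(axis=1, keepdims=True)
        return (X - mean) / (std + eps)

    def zca_whiten(X, eps=0.1):
        # De-correlate pixels via the eigendecomposition of the patch covariance matrix.
        cov = np.cov(X, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        zca = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
        return X @ zca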
3.1.3 AdapDAE algorithm

After pre-processing, the AdapDAE algorithm is used to learn features from the unlabeled data. The improved AE in AdapDAE combines an adaptive DAE and a sparse AE. The DAE remains robust to the input by first corrupting the original data and then minimizing the reconstruction loss function, while the SparseAE requires fewer active neurons in the hidden layer to learn meaningful, abstract, and robust features for classification. Our method thus integrates the advantages of both DAE and SparseAE.

The corrupted version x̃ of the original input x is obtained by defining a noise distribution p(x̃ | x, v), with the amount of corruption controlled by the parameter v. That is, for each i = 1, 2, . . . , m, we sample x̃_i independently as

x̃_i = 0 with probability v, and x̃_i = x_i otherwise.   (2)

In DAEs, the noise level v is kept fixed during the whole training process. Vincent et al. [13] noticed that using a low level of noise results in learning blob detectors, while increasing the noise results in obtaining detectors of strokes or parts of digits. They also recognized that either marginal or excessive noise impairs the learned representation. Regarding the selection of the noise level, researchers have proposed numerous new methods [19, 20].

In this paper, an adaptive noise level for the DAE is presented, which calculates the noise level for each epoch using the principle of annealing.

The noise level v for each epoch is gradually decreased from the noise hyper-parameter V1 to the noise hyper-parameter VE over E epochs, where V1 and VE are selected using the least classification error on validation data. The noise level for epoch e is calculated as

Ve = V1 − ΔVe,   (3)

where ΔVe = (V1 − VE) × (e − 1)/(E − 1) and V1 ≥ · · · ≥ Ve ≥ · · · ≥ VE. The corrupted version x̃ ∈ R^N of the original input x is then obtained using the noise level Ve, and x̃ is used as the input to AdapDAE.
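Read as code, Eqs. (2) and (3) correspond to the short NumPy sketch below (an illustration; V1 = 0.7 and VE = 0.1 mirror Table 1 for CIFAR-10/MNIST, and the remaining values are placeholders):

    import numpy as np

    def noise_level(e, E, V1=0.7, VE=0.1):
        # Eq. (3): linearly anneal the noise level from V1 down to VE over E epochs.
        return V1 - (V1 - VE) * (e - 1) / (E - 1)

    def masking_corruption(x, v, rng):
        # Eq. (2): set each input element to 0 with probability v, keep it otherwise.
        return x * (rng.random(x.shape) >= v)

    rng = np.random.default_rng(0)
    x = rng.standard_normal(64)                # one pre-processed patch (placeholder size)
    for e in range(1, 201):                    # E = 200 epochs, as in Section 4.2
        v_e = noise_level(e, 200)
        x_tilde = masking_corruption(x, v_e, rng)
        # x_tilde is fed to the sparse AE for this epoch (Eqs. (4) and (5) below).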
The encoder and decoder processes are as follows.

The encoder is a nonlinear transfer function f that transforms the corrupted input vector x̃ into a hidden representation y, expressed as

y = f(W x̃ + b),   (4)

where f is the encoding function, the matrix W ∈ R^(H×N) is the weights, H is the size of the hidden layer, and b ∈ R^H is the vector of biases. The matrix W and the vector b are the parameters of the network.

The decoder is a function g that reconstructs the input vector from the hidden representation and is expressed as

z = g(W'y + b'),   (5)

where g is the decoding function, z is the reconstructed value, and W and W' are constrained by W' = W^T.

The objective function is obtained using Eq. (1), and the weights are updated using stochastic gradient descent. After AdapDAE is trained, W ∈ R^(H×N) contains the learned features.

3.2 CNN-AdapDAE algorithm

Having obtained the learned features through the above steps, we apply them to initialize the CNN filters and then train the CNN for classification. The CNN structure in this paper is illustrated in Fig. 2 and includes the input layer, two convolutional-subsampling layers, one fully-connected layer, and finally the output layer. Each convolutional-subsampling layer is composed of a convolutional layer and a subsampling layer.

Fig. 2  The structure of the proposed method

(1) Convolutional layer

In our system, the CNN filters are initialized by sampling filters K from the learned features W. We use S to denote the filter bank size, and the CNN filters are initialized based on

[w_1^T  w_2^T  · · ·  w_i^T  · · ·  w_H^T] = W^T.   (6)
Typically, H is larger than or equal to S. When H is equal to S, the filters are set to K = W^T; meanwhile, when H is larger than S, we sample S vectors from W as follows.

First, the contribution c_i of the ith weight w_i to the activation value of the hidden neurons is computed:

c_i = Σ_j w_ij x_j.   (7)

The weights corresponding to the S largest c_i values are then selected as the CNN filters. That is, if c_1 is the largest contribution value, the filter k_1 is set to k_1 = w_1^T. The sampled filters can be expressed as

[k_1  k_2  · · ·  k_i  · · ·  k_S] = K.   (8)

The size of the filter k_i is identical to that of the input vector to AdapDAE. Each input image (n-by-n pixels) is then convolved with the S filters, resulting in S filter responses. Color images can be handled similarly.
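One way to realize Eqs. (6)–(8) is the following sketch (illustrative only; averaging the contributions c_i over a batch of patches is an assumption, since the paper does not state which input x is used in Eq. (7)):

    import numpy as np

    def sample_filters(W, X, S):
        # W: (H, N) learned features; X: (m, N) pre-processed patches; S: filter bank size.
        if W.shape[0] == S:
            return W                                  # H == S: use all features, K = W
        c = (W @ X.T).mean(axis=1)                    # Eq. (7): c_i = sum_j w_ij x_j, averaged
        top = np.argsort(c)[::-1][:S]                 # indices of the S largest contributions
        return W[top]                                 # Eq. (8): the selected rows form K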
(2) Subsampling layer

In each subsampling layer, a pooling operation is introduced to obtain translation-invariant representations; this implies that the same (pooled) feature will be active regardless of whether the image undergoes (marginal) translations. The number of input maps and output maps is identical; however, the size of the output maps is reduced. Generally, the mean value or maximum value of a particular feature is computed over a non-overlapping region of the image; these operations are called mean pooling and max pooling, respectively.

In this study, max pooling is used to introduce sparsity over the hidden representation by erasing all the non-maximal values in non-overlapping regions. This compels the feature detectors to become more broadly applicable. During the reconstruction phase, such a sparse latent code decreases the average number of filters contributing to the decoding of each pixel and compels the filters to be more general [7].
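Non-overlapping max pooling over p × p regions, as used here, can be written as the brief NumPy sketch below (illustrative only):

    import numpy as np

    def max_pool(feature_map, p=2):
        # feature_map: (height, width) response of one filter; height and width divisible by p.
        h, w = feature_map.shape
        blocks = feature_map.reshape(h // p, p, w // p, p)
        return blocks.max(axis=(1, 3))     # keep only the maximum of each p x p region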
After training the first convolutional layer, the output of the first subsampling layer is used as the input for further feature learning. The process of training the second convolutional layer is similar to that for the first convolutional layer. After training the multiple networks in a greedy fashion, the weights are fine-tuned using backpropagation, as the top-level activations can be used as feature vectors for support vector machines or other classifiers. Our CNN-AdapDAE algorithm is summarized as follows.

Algorithm  CNN-AdapDAE algorithm

1: Extract random patches of dimension w × w from the original unlabeled training images X = {x(1), x(2), . . . , x(m)}.
2: Apply a pre-processing stage to the patches by normalizing and whitening.
3: Learn features using AdapDAE.
   AdapDAE:
4: for epoch e = 1 to E do
5:   Calculate the noise level Ve for epoch e using Eq. (3).
6:   Obtain the corrupted version x̃ of the original input x using the noise level Ve: x̃ = 0 with probability Ve, and x̃ = x otherwise.   (9)
7:   Use x̃ as the input to the SparseAE and obtain the reconstructed version z using Eqs. (4) and (5).
8:   Obtain the objective function using Eq. (1).
9:   Perform the feedforward and backpropagation phases, and update the weights using stochastic gradient descent.
10: end for
    Initializing and training the CNN:
11: Initialize the CNN filters by sampling filters K of dimension w × w from the above learned features.
12: Train the CNN, and perform the classification.
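For illustration, steps 4–10 of the above pseudocode can be written as the following self-contained NumPy sketch; the tied-weight sigmoid network, the full-batch gradient step, and all hyper-parameter defaults are assumptions rather than the authors' settings.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def adapdae_train(X, H, E=200, V1=0.7, VE=0.1, beta=3.0, rho=0.05, lr=0.01, seed=0):
        # X: (m, N) pre-processed patches; returns the learned feature matrix W of shape (H, N).
        rng = np.random.default_rng(seed)
        m, N = X.shape
        W = 0.01 * rng.standard_normal((H, N))
        b, b_prime = np.zeros(H), np.zeros(N)
        for e in range(1, E + 1):
            v = V1 - (V1 - VE) * (e - 1) / (E - 1)        # step 5: noise level, Eq. (3)
            X_tilde = X * (rng.random(X.shape) >= v)      # step 6: masking corruption, Eq. (9)
            Y = sigmoid(X_tilde @ W.T + b)                # step 7: encoder, Eq. (4)
            Z = sigmoid(Y @ W + b_prime)                  # step 7: tied-weight decoder, Eq. (5)
            rho_hat = Y.mean(axis=0)
            # Steps 8-9: backpropagate Eq. (1) (squared error plus KL sparsity penalty).
            d_out = 2.0 * (Z - X) * Z * (1.0 - Z)
            kl_grad = beta * (-rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat)) / m
            d_hid = (d_out @ W.T + kl_grad) * Y * (1.0 - Y)
            grad_W = d_hid.T @ X_tilde + Y.T @ d_out      # encoder and decoder contributions
            W -= lr * grad_W / m
            b -= lr * d_hid.sum(axis=0) / m
            b_prime -= lr * d_out.sum(axis=0) / m
        return W

The returned matrix W then provides the rows from which the CNN filters are sampled in steps 11–12.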
3.3 Analysis of the algorithm

In this section, we analyze the effectiveness of the CNN-AdapDAE algorithm.

First, natural images exhibit the property of being stationary, implying that the statistics of one part of the image are identical to those of all other parts. This indicates that the features learned from one part of the image can be applied to other parts of the image and that identical features can be used at all locations. More precisely, having learned features over small (say 5 × 5) patches sampled randomly from the larger image, we can then apply this learned 5 × 5 feature detector anywhere in the image. Specifically, the learned 5 × 5 features are convolved with the larger image, thus obtaining different feature activation values at each location in the image.

Second, the features are learned by training the AdapDAE algorithm on the small random patches. In AdapDAE, the average noise level for each epoch is obtained by the principle of annealing. Initially, the average noise is kept at a high level; therefore, the input data are heavily corrupted and the AdapDAE algorithm learns global features. The noise is reduced as the training progresses, and the network is able to learn features for reconstructing the finer details of the training data. The AE incorporates new learning into the existing knowledge about the data during the training process. At the end of training, the network includes a combination of both general and fine features that are simultaneously exploited. The learned features are then used to initialize the CNN filters, ensuring that the CNN is initialized with both general and fine features.

Based on the above analysis, our algorithm is reasonable.
In the next section, we experimentally evaluate the performance of our algorithm.

4 Experiments

We evaluate the proposed algorithm on the STL-10 [10], CIFAR-10 [11], and MNIST datasets and compare it with previous unsupervised feature learning methods.

Our experiment consists of three parts. First, for the unsupervised learning algorithm, the AdapDAE algorithm is trained with two noise level strategies: a fixed noise level and the adaptive noise level. In this part, the effect of the noise level on the MNIST training set is evaluated. Second, the learned features are applied to initialize the CNN filters, and the improved CNN is then trained for classification as described in Section 3.2. Finally, our approach is compared with other unsupervised feature learning methods.

4.1 Experimental data

The MNIST handwritten digit dataset consists of 60,000 training images and 10,000 test images of size 28 × 28 pixels. The training set is randomly separated into 50,000 training cases and 10,000 validation cases, with the 10,000 test images used for testing.

The CIFAR-10 and STL-10 datasets are also used. The CIFAR-10 dataset consists of 50,000 training images and 10,000 test images of size 32 × 32 pixels, with each pixel having three color channels. The STL-10 dataset is inspired by the CIFAR-10 dataset, albeit with certain modifications; it consists of 500 training images and 800 test images per class, of size 96 × 96 pixels. In particular, each class has fewer labeled training examples than in CIFAR-10; however, 100,000 unlabeled images are provided for unsupervised learning.

4.2 Experimental setup

The CNN architecture in our experiment has five hidden layers: 1) a convolutional layer with 128 filters per input channel; 2) a max-pooling layer of size 2 × 2; 3) a convolutional layer with 256 filters per map; 4) a max-pooling layer of size 2 × 2; and 5) a fully connected layer of 512 hidden neurons. The output layer has a softmax activation function with one neuron per class. Dropout is applied to the fully connected layer. The learning rate and additional hyper-parameters are selected according to the errors on the validation set.
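As a rough sketch of this architecture (not the authors' implementation), the network could be expressed in PyTorch as follows; the 5 × 5 filter sizes follow the coding 128c5-256c5-512f in Table 4, while the 32 × 32 three-channel input, the sigmoid activations, and the dropout rate are assumptions.

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 128, kernel_size=5),    # 1) convolutional layer, 128 filters  -> 28 x 28
        nn.Sigmoid(),
        nn.MaxPool2d(2),                     # 2) 2 x 2 max pooling                 -> 14 x 14
        nn.Conv2d(128, 256, kernel_size=5),  # 3) convolutional layer, 256 filters  -> 10 x 10
        nn.Sigmoid(),
        nn.MaxPool2d(2),                     # 4) 2 x 2 max pooling                 ->  5 x  5
        nn.Flatten(),
        nn.Linear(256 * 5 * 5, 512),         # 5) fully connected layer, 512 neurons
        nn.Sigmoid(),
        nn.Dropout(0.5),                     # dropout on the fully connected layer
        nn.Linear(512, 10),                  # output layer, one neuron per class
    )
    probabilities = torch.softmax(model(torch.randn(1, 3, 32, 32)), dim=1)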
The AdapDAE network architecture in our experiment consists of an input layer, a hidden layer, and an output layer. The number of hidden neurons in AdapDAE for pre-training the first convolutional layer of the CNN is 128, and that for the second convolutional layer is 256. A sigmoid nonlinear function is used in the encoder and decoder, with masking noise used for image corruption and stochastic gradient descent (SGD) used for optimization. The number of epochs is fixed at 200. The learning rate and additional hyper-parameters are also selected according to the errors on the validation set.

The noise levels V1 and VE are selected by the least classification error on the validation data, with the resulting values presented in Table 1.

Table 1  Noise levels V1 and VE for various datasets

Level    STL-10    CIFAR-10    MNIST
V1       0.6       0.7         0.7
VE       0.3       0.1         0.1

4.3 Effect of noise level

In AdapDAE, the adaptive noise level for the DAE is presented to address the limitation wherein the noise level v is kept fixed during the whole training process. To understand the effect of the noise level, the DAE is trained on the MNIST dataset with different noise level strategies. To visualize the learned features more clearly, the DAE is trained on the full-sized MNIST images rather than on randomly extracted patches. In this experiment, the network architecture consists of an input layer, a hidden layer, and an output layer, and the number of hidden neurons is 1,000. First, the noise level v is fixed at 0.1 and 0.7; then, the adaptive noise level is used. Figure 3 shows a visualization of the features learned under the different noise level strategies.

Fig. 3  Features of AdapDAE for different noise level strategies, learned from full-sized MNIST with 1,000 hidden units. (a) DAE(0.1); (b) DAE(0.7); (c) adaptive

From Fig. 3, it is observed that for high noise levels (e.g., v = 0.7), the algorithm learns very global features; however, for lower noise levels (e.g., v = 0.1), the input is only marginally corrupted, and the features tend to be more local. For the adaptive noise level, the algorithm learns a combination of global and local features.
Figure 4 shows the test error rate of the CNN-AdapDAE algorithm with different noise level strategies on the MNIST dataset. It is observed that the classification performance using the adaptive noise level is higher than that using the other noise levels. It is also observed from the results that the performance is strongly correlated with the noise level.

Fig. 4  Test error for different noise levels on the MNIST dataset

4.4 Classification results

To demonstrate the effectiveness of the proposed algorithm, we compare its performance with several unsupervised feature learning methods.

First, the testing accuracy of our algorithm is compared with CNN (random filters drawn from a Normal(0, σ²) distribution, where σ = 0.01) and CNN-AE (pre-training filters by AE). To remove bias, the results are obtained for CNN-AdapDAE, CNN (random filters), and CNN-AE on an identical network architecture, the only difference being the CNN initialization. The results are presented in Table 2.

Table 2  Testing accuracies/% of several CNN methods

Model                   STL-10    CIFAR-10    MNIST
CNN-AdapDAE             63.6      81.9        99.28
CNN (random filters)    58.1      79.2        99.01
CNN-AE                  59.8      80.7        99.15

As observed from Table 2, our algorithm learns better representations and yields a higher classification accuracy. Both our algorithm and CNN-AE (pre-training filters by AE) are more effective than CNN (random filters), which establishes that pre-training a CNN is more effective than random initialization. Our algorithm makes further improvements upon CNN-AE and obtains the best classification result, demonstrating that reasonable CNN initialization aids classification and establishing the effectiveness of our algorithm.

The testing accuracy of our algorithm is then compared with those of other unsupervised feature learning methods. To further verify the effectiveness of our algorithm on a deeper network, a convolutional layer with 512 filters of size 4 × 4 pixels and a max-pooling layer of size 2 × 2 are added to the original network. We call this CNN-AdapDAE (large net). The results are presented in Table 3, where bold-faced values represent the optimum performance.

Table 3  Testing accuracies/% of all the methods

Model                                  STL-10    CIFAR-10    MNIST
CNN-AdapDAE                            63.6      81.9        99.28
CNN-AdapDAE (large net)                64.3      82.0        99.31
Convolutional K-means [21]             60.1      82.0        -
View-invariant K-means [22]            63.7      81.9        -
Discriminative CNN (small net) [23]    71.9      81.4        -
Discriminative CNN (large net) [23]    72.8      82.0        -
Convolutional net LeNet-5 [24]         -         -           99.05
Conv.DBN (two layers) [25]             -         78.90       -
DBM [2]                                -         -           99.05
DBN [12]                               -         -           98.87

Table 3 compares the testing accuracies of several existing unsupervised feature learning methods. The performance of our algorithm is higher than those of a few of these methods (such as Convolutional net LeNet-5 [24] and Conv.DBN [25]). These results further demonstrate the effectiveness of our algorithm. It is also observed that the larger network obtains better results, establishing that our algorithm is effective for deeper networks and that the classification accuracy is contingent upon the network architecture, with a larger network improving the classification accuracy. When the network of the discriminative CNN is larger than ours, its accuracy is higher; however, when the network is smaller, its result is almost identical to ours or possibly worse.

Table 4 lists the network architectures of the different models. The name coding for the network architectures is identical to that for the Discriminative CNN [23]: NcF stands for a convolutional layer with N filters of size F × F pixels, and Nf stands for a fully connected layer with N neurons. For example, 64c5-128f denotes a network with a convolutional layer containing 64 filters of size 5 × 5 pixels, followed by a fully connected layer with 128 neurons, whereas 64f-64f denotes two fully connected layers containing 64 neurons each. In this coding, the pooling layers are omitted. For clarity, the detailed experimental parameters, such as the learning rates, momenta, weight decay coefficients, and number of training epochs, are not presented for each model.
Table 4  Network architectures for the different models

Model                                  Network architecture
CNN-AdapDAE                            128c5-256c5-512f
CNN-AdapDAE (large net)                128c5-256c5-512c4-512f
Convolutional K-means [21]             1600c6-3200c2-3200f
View-invariant K-means [22]            -
Discriminative CNN (small net) [23]    64c5-64c5-128f
Discriminative CNN (large net) [23]    64c5-128c5-256c5-512f
Convolutional net LeNet-5 [24]         6c5-16c5-120f-84f
Conv.DBN (two layers) [25]             -
DBM [2]                                500f-1000f
DBN [12]                               1000f-500f-250f-30f

5 Conclusion

In this paper, we have proposed a hybrid CNN-AdapDAE model that applies the features learned by AdapDAE to initialize the CNN filters and trains the improved CNN for classification. In this model, AdapDAE is an improved DAE with an adaptive noise level, achieved by computing the noise level of the input neurons for each epoch based on the principle of annealing. The noise level is kept high during the initial training phase and is gradually reduced as the training progresses. At the end of training, the AdapDAE network includes a combination of both general and fine features that are useful for CNN initialization. The experimental results demonstrate that (i) the proposed algorithm outperforms CNN (random filters), CNN-AE (pre-training filters by AE), and a few other unsupervised feature learning methods; and (ii) the classification accuracy depends on the depth of the network architecture, with an increased network depth significantly improving the results, such that the method competes with a few state-of-the-art algorithms; this establishes the effectiveness of our algorithm for deeper networks.

Acknowledgements  This work was supported by the National Natural Science Foundation of China (Grant Nos. 61322203 and 61332002).

References

1. Hinton G E, Osindero S, Teh Y W. A fast learning algorithm for deep belief nets. Neural Computation, 2006, 18(7): 1527–1554
2. Salakhutdinov R, Larochelle H. Efficient learning of deep Boltzmann machines. In: Proceedings of the International Conference on Artificial Intelligence and Statistics. 2010, 693–700
3. LeCun Y, Boser B, Denker J S, Henderson D, Howard R E, Hubbard W, Jackel L D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989, 1(4): 541–551
4. Tan S Q, Li B. Stacked convolutional auto-encoders for steganalysis of digital images. In: Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. 2014, 1–4
5. Erhan D, Bengio Y, Courville A, Manzagol P A, Vincent P, Bengio S. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 2010, 11(3): 625–660
6. Bengio Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009, 2(1): 1–127
7. Masci J, Meier U, Ciresan D, Schmidhuber J. Stacked convolutional auto-encoders for hierarchical feature extraction. In: Proceedings of the 21st International Conference on Artificial Neural Networks. 2011, 52–59
8. Lee H, Grosse R, Ranganath R, Ng A Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Proceedings of the International Conference on Machine Learning. 2009, 609–616
9. Ji M Q, Fang L, Zheng H T, Strese M, Steinbach E. Preprocessing-free surface material classification using convolutional neural networks pre-trained by sparse autoencoder. In: Proceedings of the 25th IEEE International Workshop on Machine Learning for Signal Processing. 2015
10. Coates A, Ng A Y, Lee H. An analysis of single-layer networks in unsupervised feature learning. Journal of Machine Learning Research, 2011, 15: 215–223
11. Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. Technical Report, 2009
12. Hinton G E, Salakhutdinov R R. Reducing the dimensionality of data with neural networks. Science, 2006, 313(5786): 504–507
13. Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol P A. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 2010, 11(6): 3371–3408
14. Olshausen B A, Field D J. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 1997, 37(23): 3311–3325
15. Ranzato M, Boureau Y L, LeCun Y. Sparse feature learning for deep belief networks. Advances in Neural Information Processing Systems, 2007, 1185–1192
16. Lee H, Ekanadham C, Ng A Y. Sparse deep belief net model for visual area V2. Advances in Neural Information Processing Systems, 2008, 20: 873–880
17. Dahl J V, Koch K C, Kleinhans E, Ostwald E, Schulz G, Buell U, Hanrath P. Convolutional networks and applications in vision. In: Proceedings of the IEEE International Symposium on Circuits and Systems. 2010, 253–256
18. Agarwal A, Triggs B. Hyperfeatures – multilevel local coding for visual recognition. In: Proceedings of the European Conference on Computer Vision. 2006, 30–43
19. Geras K J, Sutton C. Scheduled denoising autoencoders. 2014, arXiv preprint arXiv:1406.3269
20. Chandra B, Sharma R K. Adaptive noise schedule for denoising autoencoder. In: Proceedings of the International Conference on Neural Information Processing. 2014, 535–542
21. Coates A, Ng A Y. Selecting receptive fields in deep networks. Advances in Neural Information Processing Systems, 2011, 2528–2536
22. Hui K Y. Direct modeling of complex invariances for visual object features. In: Proceedings of the International Conference on Machine Learning. 2013, 352–360
23. Dosovitskiy A, Springenberg J T, Riedmiller M, Brox T. Discriminative unsupervised feature learning with convolutional neural networks. Advances in Neural Information Processing Systems, 2014, 766–774
24. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, 86(11): 2278–2324
25. Krizhevsky A. Convolutional deep belief networks on CIFAR-10. Technical Report, 2010

Qianjun Zhang received her BS and MS degrees in computer science from Xidian University, China in 2008 and 2011, respectively. Currently, she is working toward a PhD degree at the Machine Intelligence Laboratory, College of Computer Science, Sichuan University, China. Her current research interests include big data, neural networks, and deep learning.

Lei Zhang received the BS and MS degrees in mathematics and the PhD degree in computer science from the University of Electronic Science and Technology of China, China in 2002, 2005, and 2008, respectively. She was a post-doctoral research fellow with the Department of Computer Science and Engineering, Chinese University of Hong Kong, China from 2008 to 2009. She is currently a professor with Sichuan University, China. Her current research interests include theory and applications of neural networks based on neocortex computing and big data analysis methods by infinity deep neural networks.
