November 9, 2018.
Digital Object Identifier 10.1109/ACCESS.2018.2875720
ABSTRACT Object detection in streaming images is a major step in different detection-based applica-
tions, such as object tracking, action recognition, robot navigation, and visual surveillance applications.
In most cases, image quality is noisy and biased, and as a result, the data distributions are disturbed and
imbalanced. Most object detection approaches, such as the faster region-based convolutional neural network
(Faster RCNN), the single shot multibox detector with 300×300 inputs (SSD300), and you only look once version 2
(YOLOv2), rely on simple sampling without considering the distortion and noise that arise in real-world changing
environments, or the poor quality of object labels. In this paper, we propose an incremental active semi-supervised
learning (IASSL) method for unseen object detection. It combines batch-based active learning (AL)
and bin-based semi-supervised learning (SSL) to leverage the strong points of AL’s exploration and SSL’s
exploitation capabilities. A collaborative sampling method is also adopted to measure the uncertainty and
diversity of AL and the confidence in SSL. Batch-based AL allows us to select more informative, confident,
and representative samples with low cost. Bin-based SSL divides streaming image samples into several bins,
and each bin repeatedly transfers the discriminative knowledge of convolutional neural network deep learning
to the next bin until the performance criterion is reached. The IASSL can overcome noisy and biased labels
in unknown, cluttered data distributions. We obtain superior performance compared with state-of-the-art
detectors such as Faster RCNN, SSD300, and YOLOv2.
INDEX TERMS Object detection, convolutional neural network, incremental deep learning, active learning,
semi-supervised learning.
samples are substantially imbalanced, and the collected samples are very often biased or badly labeled. Supervised learning-based detection methods are adopted by most state-of-the-art techniques, and they show promise for automatic object detection only under the assumption that the training and testing sets have the same data distribution. Most previous object detection approaches [11], [13], [14] rely on simple sampling strategies that do not consider the distortion and noise introduced by changing environments. Noisy sample selection and poor-quality labeling cause imbalanced data distributions and significant performance degradation when a conventional CNN is used. In a real-world situation, however, the operational environment of an individual system changes widely, so the static-world assumption is not valid when constructing an efficient object detector.

The active learning (AL) method adds more informative labeled data to the training set in each iteration from unlabeled or imperfect streaming samples. It selects the most uncertain samples (i.e., those with the highest disagreement among classifiers) according to uncertainty and diversity criteria, and it uses a selective sampling strategy followed by queries to achieve continuous, adaptive, and incrementally improved detection performance. The assumption of dependable human labeling, however, is not valid for streaming object detection: a high-quality labeling process becomes too expensive for real-time object detection, since strict labeling rules cannot be applied in AL. Labeling is instead performed by the current classification model under the assumption that samples in the same cluster are likely to belong to the same class. Because this clustering assumption is often wrong in real-world environments, semi-supervised learning (SSL) alone cannot produce better performance and can even degrade classification accuracy in streaming object detection. Stream-based sampling [26], adaptive sampling [27], [28], and many other approaches are often hampered by sample selection bias, where some samples are error-prone and may not satisfy the random-sampling assumption needed to avoid local overfitting. Furthermore, in many applications the collected dataset tends to be imbalanced in class distribution, and streaming samples used in an SSL approach then result in incorrect modeling and degraded system performance. Owing to the intrinsic complexity of fallible labeling and imbalanced data samples in object detection, we need a novel approach to sample selection and an adaptive learning method.

Researchers in the machine-learning community employ a collaborative strategy of exploring and exploiting noisy or unlabeled samples, using the discrimination capability of both humans and classifiers to label the samples [29]. Such collaborative sampling methods, combining AL and SSL, provide a promising direction; however, to the best of our knowledge, few researchers in object detection have pursued it.

AL and SSL methods work well for improving classification accuracy using both labeled and unlabeled data [30]–[32]. We adopt them to overcome the limitations of object detectors under dynamically varying environments.

Batch-mode AL is employed to learn from a whole group of samples at one time, rather than one sample at a time. A confident sample is defined as one with correct information on object category, attributes, position, and size (to be used in efficient learning), whereas a noisy sample lacks such information. For example, the attributes of a noisy region of interest (a ground-truth object bounding box) represented by a CNN deep feature may be ill-posed or too noisy to be modeled at training time. A dataset of noisy samples is biased and tends to have an imbalanced distribution, so a noisy sample should be handled differently from confident samples: imperfect samples do not contribute to, or might even degrade, detector performance. We first build an initial object-detection CNN model using a small number of confidently labeled samples. The noisy samples are then partitioned into several clusters using a k-means algorithm to form the batch pool of informative and representative samples for AL-based exploration. Finally, we construct an adaptive object detector based on bin-based SSL, where a bin is generated from a streaming image sequence in a real-world application. It begins with the batch-based CNN model and improves detection accuracy by increasing the number of confident samples through bin-based incremental active semi-supervised learning (IASSL) over the noisy streaming sample distribution. In this way, IASSL provides the means for both exploration and exploitation using combined informative and reliable sampling methods. The novelties of the proposed IASSL are summarized as follows.

1) A novel framework is proposed that significantly improves object detection performance by incrementally combining batch-based AL and bin-based SSL. IASSL thus takes advantage of both informative and reliable sampling properties. To the best of our knowledge, our work is the first to incorporate collaborative sampling and active semi-supervised learning in the area of object detection.

2) In real-world scenarios, an individual system changes widely due to noisy factors, so the static-world assumption is not suitable for efficient object detection. Our hierarchical object detection tree using IASSL partly solves this problem. Our method effectively leverages the discriminative capability of deep features with a collaborative sample-selection strategy that satisfies the uncertainty, diversity, and confidence requirements of both the AL and SSL methods.

3) A significant number of experiments were conducted on openly available benchmark datasets, such as PASCAL VOC and MS COCO, combined with a local dataset. Our method obtains a higher mean average precision and a reduced object-detection error rate.

The rest of this paper is organized as follows. We introduce related work in active learning, semi-supervised learning, and the combination of AL and SSL in Section II. In Section III, we propose the IASSL framework. The hierarchical object detection tree is discussed in Section IV, and incremental active semi-supervised learning is discussed in Section V. The experimental results are given in
Section VI. Finally, concluding remarks are drawn in Section VII.

II. RELATED WORK

A. ACTIVE LEARNING
Active learning leverages the known information of the testing data through human decisions that are most beneficial to the learning process of a classifier, by selecting the samples that are expected to improve the classifier's learning performance [29], [33]. Many researchers have investigated the active learning method for image classification [30]. Active learning has been employed for region labeling and image recognition tasks with less annotation effort and better annotation quality. Since annotation quality varies from person to person, some researchers have investigated automatic assurance of annotation precision [30], [31]. Assuming that a single prominent object of interest exists in an image, some approaches have tried to learn object models directly from noisy keyword search. Diverse studies have been performed on active learning frameworks such as real-time stream-based sampling and adaptive sampling.

For most active learning, only the uncertain and erroneous portions of the training data are used to make queries, which are annotated manually with minimum human effort. These annotated data are added to the training dataset to achieve the highest gain in classification accuracy. Therefore, the query/sampling strategy is the main consideration in active learning for reducing user involvement. In 2017, Gordon and Hernández-Lobato [34] and Zhang et al. [35] adopted an uncertainty sampling strategy to select samples close to the decision boundary. While uncertainty-based sampling has the advantage of exploring uncertain boundaries to accelerate convergence of the learning curve, it also has disadvantages: the selected sample points easily include mistaken outliers, and since many uncertain sample points are not good representatives of the whole distribution, both sampling quality and classifier performance degrade. Settles and Craven [36] considered both the informative and the representative characteristics of the data distribution in their information density framework. They developed a classifier on representative cluster sets, where the most representative samples were selected for label propagation within the same cluster. The query-by-committee (QBC) ensemble learning method relies on the differing hypotheses of a committee of classifiers, selecting the informative samples on which disagreement between the classifiers is maximal. A critical issue for QBC is constructing a committee that is both accurate and diverse. Considering this diversity, selecting a batch of samples in each iteration speeds up the learning process under different considerations, such as minimizing the margin and maximizing diversity [33]. Batch AL was introduced to treat unlabeled examples as part of the optimization of discriminative classifiers. The use of clustering in batch AL has been shown to improve the diversity of the sample selection [32], especially through the clustering structure, which avoids redundant uncertain sampling. The batch of samples is determined by the Fisher criterion, considering a tradeoff between uncertainty and diversity. Monte Carlo simulation [27] has been employed to select the best-matched sample distribution from a sequential policy and to query the samples. In 2017, Uçar et al. [19] introduced semi-supervised learning that leveraged classifier performance using both labeled and unlabeled samples, which applies when a huge amount of unlabeled or imperfectly labeled data is available; however, the amount of labeled data used in their learning method was relatively small. SSL uses unlabeled or noisy data directly in the training process without any human-labeling effort [28], [30], [31].

SSL approaches are divided into self-training, co-training, generative probabilistic models, and graph-based SSL. In self-training SSL, the classifier is first constructed with a small amount of labeled training samples; second, a portion of the unlabeled training dataset is labeled using the current classifier; and third, the most confident samples among the predicted labels are repeatedly added to the training dataset until convergence is achieved (a minimal illustrative sketch of this loop is given at the end of this section). The uncertainty sampling method in AL is a complementary approach, in which the least confident samples are selected for querying. In co-training, an ensemble method is employed: first, separate models are learned using independently labeled datasets; the current models then classify the unlabeled data, and the next models are learned using a few selected samples with the most confident predicted labels, which minimizes the version space.

B. COMBINATION OF AL AND SSL
AL and SSL try to solve the same problem from opposite points of view. They are based on different principles but share the goal of high classification accuracy with minimum human-labeling effort [30]. AL and SSL methods can be combined to exploit both labeled and tentatively labeled samples for classifier training and to explore new samples labeled manually. Combination methods of AL and SSL are divided into sequential combination, SSL embedded in AL, and collaborative labeling. Sequential combination recognizes that the initial training set is critical for SSL to converge to the target performance. This method employs AL to establish an effective initial training set in the first phase and then lets SSL improve accuracy using the unlabeled samples in the second phase: AL is applied iteratively to add labeled samples and constitute an informative training set, and SSL then improves classification accuracy by leveraging the information in unlabeled samples. For example, QBC has been combined with expectation-maximization SSL in a sequential manner and employed to assign labels to unlabeled samples.

SSL embedded in AL treats SSL as the classifier in an AL framework. Muslea et al. adopted this strategy by employing multiple views for both AL and SSL [32]. Exploiting unlabeled data in yet another manner allows both human experts and classifiers to collaboratively label the selected noisy samples.
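To make the self-training procedure described above concrete, the following is a minimal, generic sketch of such a loop. It is an illustration only, written with scikit-learn and a hypothetical conf_threshold parameter; it is not the IASSL implementation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def self_training(X_labeled, y_labeled, X_unlabeled,
                      conf_threshold=0.95, max_rounds=10):
        """Generic self-training SSL loop: train on the labeled pool,
        pseudo-label the most confident unlabeled samples, add them, repeat."""
        X_l, y_l = np.asarray(X_labeled), np.asarray(y_labeled)
        X_u = np.asarray(X_unlabeled)
        clf = LogisticRegression(max_iter=1000)
        for _ in range(max_rounds):
            clf.fit(X_l, y_l)                       # 1) train on labeled data
            if len(X_u) == 0:
                break
            proba = clf.predict_proba(X_u)          # 2) score the unlabeled pool
            confident = proba.max(axis=1) >= conf_threshold
            if not confident.any():                 # 3) stop when nothing is confident
                break
            pseudo = clf.classes_[proba[confident].argmax(axis=1)]
            X_l = np.vstack([X_l, X_u[confident]])  # 4) add confident pseudo-labels
            y_l = np.concatenate([y_l, pseudo])
            X_u = X_u[~confident]
        return clf

The complementary co-training variant would maintain two such classifiers trained on independent views and let each one pseudo-label data for the other.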
the number of channels, as shown in Fig. 2. We use 1 × 1 filters to compress the feature representation and batch normalization to train the model. The network has five max pooling layers and 19 convolutional layers.

FIGURE 2. The flowchart for feature extraction from the deep CNN architecture Darknet-19.

Hierarchical object detection based on IASSL is divided into three levels, as shown in Fig. 3: the super class, the object class, and the sub class.

FIGURE 3. Hierarchical object detection tree using IASSL.

In Fig. 3, the top of the hierarchical tree is the root, which contains commonly shared convolutional layers that learn deep representations common to all object classes. In the root, the super class detector (i.e., indoor objects, animals, persons, and vehicles) is built using convolutional layers and max pooling layers to learn the super classes. The first-level nodes of the hierarchical tree are associated with the object class detectors, and the second-level nodes with the sub class detectors. This level can be expanded to represent unseen objects and new classes of objects. The third level of the hierarchical tree is replaced with the softmax layer so that the deep CNN features are trained jointly, since each super class group may contain only a small number of existing object classes.

The prediction P(y, b|x), which depends on the conditional label probability distribution, is vulnerable to sample selection bias. In a streaming object detection system, the sampling method must therefore include mechanisms that minimize the effects of sampling bias and class imbalance to keep detection accuracy acceptable.

A. PROBLEM FORMALIZATION
In the incremental SSL step, the super class is decided by a CNN detector (C_1, C_2, ..., C_sup, D_sup), which is constrained by the super class prototype model. Here, C_1, C_2, and C_sup denote the first, second, and final super class convolution layers, and D_sup is the super class detector layer. The object class detector is associated with a super class node in the first level of the hierarchical tree. Each super class node builds a CNN detector (C_1, C_2, ..., C_obj, D_obj) that decides the object node by predicting the object class and bounding box. Here, C_1, C_2, and C_obj denote the first, second, and final object class convolution layers, and D_obj is the object class detector layer. The bounding box of feature vector FV, denoted BB(FV), indicates the bounding box region within which the best position of the deep feature descriptor FV can be found with high probability. The hierarchical tree has three types of object node, based on confusion table analysis.

We analyzed the performance of each object class and categorized the classes using the following three criteria.
i) Case 1: Existing object class, e.g., the Inha University Hi-tech Building table. If the class already exists in a benchmark dataset (e.g., PASCAL VOC), we consider it an existing object class.
ii) Case 2: Object class combined with one or more existing object classes, e.g., the sofa of the Inha hi-tech lobby combined with the PASCAL VOC sofa dataset. Because such a class has a strong likelihood against an existing class and its data size is not sufficient for an IASSL experiment, we combine the new object class data with the existing object class.
iii) Case 3: Local new object class treated as a new class, e.g., an arena chair. If the new class does not have a single likelihood value greater than the threshold, we consider it a new class.

Given a test scene image I and a training dataset D, assume that a prior distribution over FV exists. Then FV can be treated as a random variable in Bayesian statistics, and the posterior distribution of FV is

    p(FV | I, D) = \frac{p(I | FV, D) p(FV | D)}{\int p(I | FV, D) p(FV | D) dFV}    (1)

Since the hierarchical tree detector consists of super-class, object-class, and sub-class detectors, equation (1) can be rewritten as
    p(FV_{sup}, FV_{obj}, FV_{sub} | I, D)
      = \frac{p(I | FV_{sup}, D) p(FV_{sup} | D)}{\int p(I | FV_{sup}, D) p(FV_{sup} | D) dFV_{sup}}
        \times \frac{p(I | FV_{obj}, FV_{sup}, D) p(FV_{obj} | FV_{sup}, D)}{\int p(I | FV_{obj}, FV_{sup}, D) p(FV_{obj} | FV_{sup}, D) dFV_{obj}}
        \times \frac{p(I | FV_{sub}, FV_{obj}, FV_{sup}, D) p(FV_{sub} | FV_{obj}, FV_{sup}, D)}{\int p(I | FV_{sub}, FV_{obj}, FV_{sup}, D) p(FV_{sub} | FV_{obj}, FV_{sup}, D) dFV_{sub}}    (2)

where FV_sup, FV_obj, and FV_sub are localized feature descriptions of the super-class, object-class, and sub-class nodes, respectively. The maximum posterior of (FV_sup, FV_obj, FV_sub) is calculated as follows:

    (\widehat{FV}_{sup}, \widehat{FV}_{obj}, \widehat{FV}_{sub})_{MAP}(I, D)
      = \operatorname{argmax}_{FV_{sup}, FV_{obj}, FV_{sub}} [ p(I | FV_{sup}, D) p(FV_{sup} | D)
        \times p(I | FV_{obj}, FV_{sup}, D) p(FV_{obj} | FV_{sup}, D)
        \times p(I | FV_{sub}, FV_{obj}, FV_{sup}, D) p(FV_{sub} | FV_{obj}, FV_{sup}, D) ]    (3)

Since the object class is constrained by part-location search areas, each part_r is assumed to be conditionally independent. Given the priors, the hierarchical tree detector ensemble looks for optimal feature vectors FV_sup, FV_obj, and FV_sub satisfying the following:

    (\widehat{FV}_{sup}, \widehat{FV}_{obj}, \widehat{part}_1, \ldots, \widehat{part}_R)_{MAP}(I, D)
      = \operatorname{argmax}_{FV_{sup}, FV_{obj}, FV_{sub}} [ p(I | FV_{sup}, D) p(FV_{sup} | D)
        \times p(I | FV_{obj}, FV_{sup}, D) p(FV_{obj} | FV_{sup}, D)
        \times \prod_{r=1}^{R} p(I | part_r, FV_{obj}, FV_{sup}, D) p(part_r | FV_{obj}, FV_{sup}, D) ]    (4)

Note that p(I | FV_sup, D) and p(I | FV_obj, FV_sup, D) are the likelihood functions of FV_sup and FV_obj and are estimated by the super-class detector and the object-class detector discussed in the following subsections. We work with the logarithm of the posterior rather than maximizing equation (4) directly, as follows:

    (\widehat{FV}_{sup}, \widehat{FV}_{obj}, \widehat{part}_1, \ldots, \widehat{part}_R)_{MAP}(I, D)
      = \operatorname{argmax}_{FV_{sup}, FV_{obj}, FV_{sub}} [ \log p(I | FV_{sup}, D) + \log p(FV_{sup} | D)
        + \log p(I | FV_{obj}, FV_{sup}, D) + \log p(FV_{obj} | FV_{sup}, D)
        + \sum_{r=1}^{R} ( \log p(I | part_r, FV_{obj}, FV_{sup}, D) + \log p(part_r | FV_{obj}, FV_{sup}, D) ) ]    (5)

Algorithm 1 Block Coordinate Descent-Based Optimization
Input: image I, dataset D with the hierarchical tree, regularization thresholds (λ_1, λ_2)
Output: optimized (FV_sup, FV_obj, FV_sub)
1. Method:
2. Initialize FV_sup, FV_obj, and FV_sub
3. k ← 0
4. repeat
5.   FV_sup^{k+1} ⇐ argmax_B J_sup(FV_sup^k; FV_obj^k, FV_sub^k)
6.   FV_obj^{k+1} ⇐ argmax_C J_obj(FV_obj^k; FV_sup^k, FV_sub^k)
7.   FV_sub^{k+1} ⇐ argmax_D J_sub(FV_sub^k; FV_sup^k, FV_obj^k)
8. until convergence

B. FEATURE VECTOR OPTIMIZATION
We rewrite equation (5) in a block coordinate descent-based optimization formulation [40] as follows:

    (\widehat{FV}_{sup}, \widehat{FV}_{obj} | I, D)
      = \operatorname{argmin}_{FV_{sup}} f(FV_{sup}; FV_{obj}, FV_{sub}, I, D)
        + \operatorname{argmin}_{FV_{obj}} f(FV_{obj}; FV_{sup}, FV_{sub}, I, D)    (6)

The solution of equation (6) is nonconvex, but it is convex with respect to each of the optimization variables. We therefore adopted an optimization algorithm based on the block coordinate descent method [40], which minimizes equation (6) iteratively with respect to each variable while the remaining variables are fixed. The entire process is summarized in Algorithm 1, where J_sup, J_obj, and J_sub are the objective functions defined from equation (6), and B, C, and D are the optimization parameters for the feature vectors FV_sup, FV_obj, and FV_sub.
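As an illustration of the alternating scheme in Algorithm 1, the sketch below performs generic block coordinate ascent over three feature blocks with a user-supplied objective J. The function and parameter names are hypothetical stand-ins, not the authors' code, and a general-purpose optimizer replaces the detector-specific updates.

    import numpy as np
    from scipy.optimize import minimize

    def block_coordinate_ascent(J, fv_sup, fv_obj, fv_sub, max_iter=50, tol=1e-6):
        """Alternately maximize J over one block while the other two are fixed,
        in the spirit of Algorithm 1 (steps 5-7), until convergence (step 8)."""
        blocks = [np.asarray(fv_sup, float), np.asarray(fv_obj, float),
                  np.asarray(fv_sub, float)]
        prev = J(*blocks)
        for _ in range(max_iter):
            for i in range(3):
                # maximize J w.r.t. block i by minimizing its negative
                neg = lambda x, i=i: -J(*[x if j == i else blocks[j]
                                          for j in range(3)])
                blocks[i] = minimize(neg, blocks[i]).x
            cur = J(*blocks)
            if abs(cur - prev) < tol:
                break
            prev = cur
        return blocks

    # Toy usage with a separable quadratic objective (maximum at 1, 2, 3).
    J = lambda a, b, c: -(np.sum((a - 1) ** 2) + np.sum((b - 2) ** 2)
                          + np.sum((c - 3) ** 2))
    sup, obj, sub = block_coordinate_ascent(J, np.zeros(2), np.zeros(2), np.zeros(2))

Because each sub-problem is convex, cycling through the blocks in this way converges to a stationary point even though the joint problem is nonconvex.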
Nevertheless, the detection result with the maximum likelihood may not be a correct object detection, and it may not be consistent with other feature points. These kinds of errors occur in most object detectors that rely on a limited labeled training dataset. In many cases, detectors are unstable due to noise, pose variance, cluttered backgrounds, and illumination changes. The hierarchical object detection tree often under-fits due to the shortage of initial labeled training data; the class models are biased, and the classification boundaries determined by the hierarchical tree are often far from the best choice. In this context, we introduce a data-driven semi-supervised framework that can learn incrementally using both labeled and unlabeled datasets to minimize the effects of troublesome patterns by suppressing outliers.

C. INCREMENTAL ACTIVE SEMI-SUPERVISED LEARNING
The proposed IASSL combines the uncertainty and diversity properties of the AL paradigm with the confidence property of the incremental SSL paradigm. It minimizes not only the training time but also costly human intervention, and at the same time it keeps a high-quality training dataset, similar to [27]. Considering the uncertainty criterion of AL, the most uncertain samples are selected as the most useful training samples to add, since they are expected to be incorrectly classified by the current classification model with high probability. However, the uncertainty criterion may cause the selection of noisy or redundant samples. We therefore adapted a pool-based (batch or bin) AL framework combined with an incremental SSL philosophy, based on a collaborative sampling method that applies the uncertainty, confidence, and diversity criteria of AL and SSL; this is expected to select more informative training samples with low redundancy.

We use an AL batch cycle similar to our previous work [36] and add a bin cycle for incremental SSL. In the AL batch cycle, a training dataset is divided into well-defined labeled training samples, D_well, and weakly labeled or unlabeled training samples, D_tentative. IASSL processes them to increase the volume of D_well above that of D_tentative. Initially, the IASSL-based image processing system learns using the well-defined dataset, which is used to construct the pre-trained CNN. This dataset is assumed to be correctly pre-labeled, and it is sampled from the data used to construct the CNN's pre-trained model. A batch of samples is selected considering the distribution of the prototype models and class balancing. Confidence scores are then (re)assigned to the weak samples by the current CNN detector, and confident, well-defined samples are selected from the weak samples according to the confidence scores measured and ranked by the current detector. A subset of weak samples is selected using the collaborative sampling strategy, whereby the current detector reassigns new labels or assigns labels with high scores; some ambiguous samples are removed or relabeled by an oracle after being filtered by the uncertainty and diversity criteria. Note that human effort is minimized because only a small portion of the weakly labeled samples is explored, while classification accuracy can still be improved. More informative samples reflecting the diversity criterion of the active learning paradigm are mined for the current batch of samples. The selected confident samples are added to the current training dataset to generate a new well-defined training dataset with which the CNN detector is retrained.

Instead of selecting only one sample per iteration, IASSL minimizes the learning time by building a pool of D_1 samples based on the uncertainty, diversity, and confidence criteria, where D_1 is the final sample set determined at operation time and a sample pool can be a batch or a bin (see Fig. 1). For each class, a set of samples is selected, scored using the current detector, and added to the training set. We select η samples with object class scores f(x) close to each half of the margin, according to the uncertainty principle; these form what we call the candidate samples. From this total we select the η samples whose f(x) scores lie between 0 and 1. When the uncertainty parameter η decreases, the f(x) distance increases.

However, the uncertainty criterion cannot avoid the selection of similar samples. IASSL therefore also has the advantage of incorporating a diversity measure [33]. The candidate pool of samples under the diversity criterion is determined by selecting the ϑ most diverse samples from the η candidates, similar to [27]. The distribution of the remaining samples is analyzed by a k-means clustering algorithm to determine the uncertainty criterion: IASSL evaluates the distribution of the selected η samples using standard k-means clustering and removes outliers and near-duplicate samples. We define the candidate sample set D_diversity, which contains the more informative samples within rank ϑ measured by the deep confidence scores f(x):

    D_{diversity} = \{ X | X \in D_{tentative}, 0 \le f(X) \le 1 \}, \quad \text{s.t. } Rank(X) < \eta    (7)

where Rank(X) denotes the decreasing order of f(X) measured by the class score of the current object detector. We construct the batches of samples by incorporating this diversity measure.

Next, we apply the incremental SSL philosophy by initializing D_1 with the sample X_top = argmax_{x \in D_diversity} f(x), X_top \in D_diversity, using the confidence criterion parameter γ. At each step, our sampling strategy chooses a sample from D_diversity and adds it to D_1; the chosen sample X_top is the one in D_diversity that maximizes the pairwise distance to the samples already in D_1, i.e.,

    X_{top} = \operatorname{argmax}_{x \in D_{diversity}} \{ \max_{x_i, x_j \in D_1} d(x_i, x_j) \}    (8)

In (8), the Euclidean distance between two features is used to calculate d(x_i, x_j). When the cardinality of D_1 reaches γ, the sample selection process stops, and the final sample set is D_1. We retrain the CNN using this pool of samples, and the process is repeated until a convergence criterion is satisfied. The entire process is summarized in Algorithm 2.
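The following sketch illustrates how the uncertainty, diversity, and confidence criteria above (eqs. (7) and (8)) can be combined in code. The concrete choices here, such as keeping cluster-representative samples and growing D_1 by a farthest-point rule, are one plausible reading of the text, assumed for illustration rather than taken from a released implementation; eta, theta, and gamma mirror η, ϑ, and γ.

    import numpy as np
    from sklearn.cluster import KMeans

    def collaborative_sampling(feats, scores, eta=200, theta=50, gamma=20):
        """Select a pool D1 from weakly labeled samples using uncertainty
        (rank by detector score f(x)), diversity (k-means representatives),
        and confidence (greedy farthest-point growth until |D1| = gamma)."""
        # Uncertainty (eq. 7): keep the eta highest-ranked samples with f(x) in [0, 1].
        valid = np.where((scores >= 0.0) & (scores <= 1.0))[0]
        candidates = valid[np.argsort(-scores[valid])][:eta]

        # Diversity: cluster the candidates and keep one representative per
        # cluster, discarding near-duplicates and outliers.
        k = min(theta, len(candidates))
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats[candidates])
        d_diversity = []
        for c in range(k):
            members = candidates[km.labels_ == c]
            if len(members) == 0:
                continue
            centre = km.cluster_centers_[c]
            d_diversity.append(
                members[np.argmin(np.linalg.norm(feats[members] - centre, axis=1))])
        d_diversity = np.array(d_diversity)

        # Confidence (eq. 8): start from the top-scoring sample, then repeatedly
        # add the remaining candidate farthest from the current set until |D1| = gamma.
        d1 = [d_diversity[np.argmax(scores[d_diversity])]]
        remaining = [int(i) for i in d_diversity if i != d1[0]]
        while len(d1) < gamma and remaining:
            dists = [min(np.linalg.norm(feats[i] - feats[j]) for j in d1)
                     for i in remaining]
            d1.append(remaining.pop(int(np.argmax(dists))))
        return np.array(d1)

The returned index set plays the role of D_1: it is added to the well-defined training data before the detector is retrained in the next bin.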
VI. EXPERIMENTS
The main goal of the experiments is to confirm the efficiency of the IASSL framework. To achieve this goal, we conducted a number of experiments on benchmark datasets, such as PASCAL VOC, as well as on a local dataset, and compared the results with state-of-the-art detectors such as YOLOv2. All implementations run on a single server with cuDNN [41], a single NVIDIA TITAN X, and TensorFlow [42].

A. DATASET OVERVIEW

1) PASCAL VOC DATASET
The PASCAL VOC 2007 dataset [16] contains 20 classes in four categories: person, animal, vehicle, and indoor. Among them, our experimental subject is the indoor category, which includes bottle, chair, dining table, potted plant, sofa, and tv/monitor. It has a total of 9963 images (train/validation/test) with 24,640 annotated objects.

FIGURE 8. Data model for IASSL: the root node consists of PASCAL VOC 2007 and 2012 and the local dataset. The local datasets are captured.
FIGURE 10. The number of images sampled and labeled by the IASSL method during each sampling phase.
FIGURE 13. ASSL and SSL performance comparison.
TABLE 2. Comparison results of state-of-the-art object detection algorithms and our method on the COCO validation 2017 dataset.

has 100 labeled data items and 300 unlabeled data items. Thus, local chair, sofa, and table images were combined with the PASCAL VOC test dataset for a fair evaluation. Here, the black bounding boxes show outstanding performance when detecting chairs, sofas, and tables. In addition, IASSL performs well under various illumination changes. The Adam optimizer was used with a learning rate of 0.001 for parameter selection.
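For reference, the stated optimizer setting (Adam with a learning rate of 0.001) corresponds to a configuration like the following minimal TensorFlow/Keras sketch; the tiny placeholder model is assumed for illustration and is not the detector used in the paper.

    import tensorflow as tf

    # Placeholder classifier head (e.g., the six indoor classes); only the
    # optimizer setting mirrors the text.
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(6, activation="softmax", input_shape=(1024,))])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])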
This improves learning algorithms for a streaming object detector in the presence of changing and noisy environments. Active learning uses a collaborative sampling method to measure uncertainty and diversity, and it collaborates with semi-supervised learning based on a confidence criterion. Our proposed model produces higher performance, with fewer errors, higher accuracy, and less human effort, in comparison with semi-supervised learning methods. These achievements encourage further improvement of the proposed method. Future research directions include finding ways to deal more flexibly with the learning sequence, reducing both the number of erroneously labeled data and the number of discarded well-labeled data.

REFERENCES
[1] I. Dimitrovski, D. Kocev, I. Kitanovski, S. Loskovska, and S. Džeroski, "Improved medical image modality classification using a combination of visual and textual features," Comput. Med. Imag. Graph., vol. 39, pp. 14–26, Jan. 2015.
[2] Y. LeCun et al., "Backpropagation applied to handwritten zip code recognition," Neural Comput., vol. 1, no. 4, pp. 541–551, Dec. 1989.
[3] R. Benenson, M. Omran, J. Hosang, and B. Schiele, "Ten years of pedestrian detection, what have we learned?" in Computer Vision—ECCV 2014 Workshops (Lecture Notes in Computer Science), vol. 8926. Cham, Switzerland: Springer, 2014, pp. 613–627.
[4] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu, "Advanced deep-learning techniques for salient and category-specific object detection: A survey," IEEE Signal Process. Mag., vol. 35, no. 1, pp. 84–100, Jan. 2018.
[5] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 1440–1448.
[6] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. (2014). "OverFeat: Integrated recognition, localization and detection using convolutional networks." [Online]. Available: https://arxiv.org/abs/1312.6229
[7] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. NIPS, 2015, pp. 1–10.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2015.
[9] W. Liu et al., "SSD: Single shot multibox detector," in Computer Vision—ECCV (Lecture Notes in Computer Science), vol. 9905. Cham, Switzerland: Springer, 2016, pp. 21–37.
[10] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. (2015). "You only look once: Unified, real-time object detection." [Online]. Available: https://arxiv.org/abs/1506.02640
[11] J. Redmon and A. Farhadi. (2016). "YOLO9000: Better, faster, stronger." [Online]. Available: https://arxiv.org/abs/1612.08242
[12] J. Redmon and A. Farhadi. (2018). "YOLOv3: An incremental improvement." [Online]. Available: https://arxiv.org/abs/1804.02767
[13] J. Dai, Y. Li, K. He, and J. Sun. (2016). "R-FCN: Object detection via region-based fully convolutional networks." [Online]. Available: https://arxiv.org/abs/1605.06409
[14] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2999–3007.
[15] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[16] M. Everingham and J. Winn, "The PASCAL visual object classes challenge 2007 (VOC2007) development kit," in Proc. Challenge, 2007, pp. 1–23.
[17] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Computer Vision—ECCV (Lecture Notes in Computer Science), vol. 8693. Cham, Switzerland: Springer, 2014, pp. 740–755.
[18] G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han, "When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 5, pp. 2811–2821, May 2018.
[19] A. Uçar, Y. Demir, and C. Güzeliş, "Object recognition and detection with deep learning for autonomous driving applications," Simulation, vol. 93, no. 9, pp. 759–769, 2017.
[20] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[21] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010.
[22] K. Makantasis, A. Doulamis, N. Doulamis, and K. Psychas, "Deep learning based human behavior recognition in industrial workflows," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2016, pp. 1609–1613.
[23] M. Hasan and A. K. Roy-Chowdhury, "A continuous learning framework for activity recognition using deep hybrid feature models," IEEE Trans. Multimedia, vol. 17, no. 11, pp. 1909–1922, Nov. 2015.
[24] B. J. Lee et al., "Perception-action-learning system for mobile social-service robots using deep learning," in Proc. Assoc. Advancement Artif. Intell. Conf. Artif. Intell. (AAAI), Feb. 2018.
[25] J. Han, R. Quan, D. Zhang, and F. Nie, "Robust object co-segmentation using background prior," IEEE Trans. Image Process., vol. 27, no. 4, pp. 1639–1651, Apr. 2018.
[26] C. C. Loy, T. Xiang, and S. Gong, "Stream-based active unusual event detection," in Computer Vision—ACCV (Lecture Notes in Computer Science), vol. 6942. Cham, Switzerland: Springer, 2010, pp. 161–175.
[27] J. Kwon and K. M. Lee, "Tracking of a non-rigid object via patch-based dynamic appearance modeling and adaptive basin hopping Monte Carlo sampling," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1208–1215.
[28] Z. Lu, X. Wu, and J. C. Bongard, "Active learning through adaptive heterogeneous ensembling," IEEE Trans. Knowl. Data Eng., vol. 27, no. 2, pp. 368–381, Feb. 2015.
[29] P. K. Rhee, E. Erdenee, S. D. Kyun, M. U. Ahmed, and S. Jin, "Active and semi-supervised learning for object detection with imperfect data," Cogn. Syst. Res., vol. 45, pp. 109–123, Oct. 2017.
[30] B. Settles, "Active learning literature survey," Mach. Learn., vol. 15, no. 2, pp. 201–221, 2010.
[31] X. Zhu, "Semi-supervised learning literature survey contents," Sci. York, vol. 10, no. 1530, p. 10, 2008.
[32] I. Muslea, S. N. Minton, and C. A. Knoblock, "Active learning with strong and weak views: A case study on wrapper induction," in Proc. 18th Int. Joint Conf. Artif. Intell., 2003, pp. 415–420.
[33] C. Persello and L. Bruzzone, "Active and semisupervised learning for the classification of remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 11, pp. 6937–6956, Nov. 2014.
[34] J. Gordon and J. M. Hernández-Lobato. (2014). "Bayesian semisupervised learning with deep generative models." [Online]. Available: https://arxiv.org/abs/1706.09751
[35] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Process. Lett., vol. 23, no. 10, pp. 1499–1503, Oct. 2016.
[36] B. Settles and M. Craven, "An analysis of active learning strategies for sequence labeling tasks," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2008, pp. 1070–1079.
[37] L. Wan, K. Tang, M. Li, Y. Zhong, and A. K. Qin, "Collaborative active and semisupervised learning for hyperspectral remote sensing image classification," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 5, pp. 2384–2396, May 2015.
[38] A. Sorokin and D. Forsyth, "Utility data annotation with Amazon Mechanical Turk," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops (CVPR), Jun. 2008, pp. 1–8.
[39] D. Tuia, F. Ratle, F. Pacifici, M. F. Kanevski, and W. J. Emery, "Active learning methods for remote sensing image classification," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 7, pp. 2218–2232, Jul. 2009.
[40] J. Mareček, P. Richtárik, and M. Takáč, "Distributed block coordinate descent for minimizing partially separable functions," in Numerical Analysis and Optimization (Springer Proceedings in Mathematics & Statistics), vol. 134. Cham, Switzerland: Springer, 2015, pp. 261–288.
[41] S. Chetlur et al. (2014). "cuDNN: Efficient primitives for deep learning." [Online]. Available: https://arxiv.org/abs/1410.0759
[42] M. Abadi et al. (2016). "TensorFlow: Large-scale machine learning on heterogeneous distributed systems." [Online]. Available: https://arxiv.org/abs/1603.04467
[43] Y. Bengio, "Practical recommendations for gradient-based training of deep architectures," in Neural Networks: Tricks of the Trade (Lecture Notes in Computer Science), vol. 7700. Cham, Switzerland: Springer, 2012, pp. 437–478.
[44] D. P. Kingma and J. Ba. (2017). "Adam: A method for stochastic optimization." [Online]. Available: https://arxiv.org/abs/1412.6980
[45] C.-C. Kao, T.-Y. Lee, P. Sen, and M.-Y. Liu. (2018). "Localization-aware active learning for object detection." [Online]. Available: https://arxiv.org/abs/1801.05124
[46] K. Wang, L. Lin, X. Yan, Z. Chen, D. Zhang, and L. Zhang, "Cost-effective object detection: Active sample mining with switchable selection criteria," IEEE Trans. Neural Netw. Learn. Syst., to be published.
[47] K. Shmelkov, C. Schmid, and K. Alahari, "Incremental learning of object detectors without catastrophic forgetting," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 3420–3429.
[48] C. Mitash, K. E. Bekris, and A. Boularias, "A self-supervised learning system for object detection using physics simulation and multi-view pose estimation," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Sep. 2017, pp. 545–551.
[49] B. Yuan, J. Chen, W. Zhang, H. S. Tai, and S. McMains, "Iterative cross learning on noisy labels," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2018, pp. 757–765.

DONG KYUN SHIN received the B.S. degree in computer science and information engineering from Inha University, Incheon, South Korea, in 2017, where he is currently pursuing the master's degree majoring in computer engineering. His research interests include object detection and tracking, deep learning, machine learning, and computer vision.

MINHAZ UDDIN AHMED received the B.S. and M.S. degrees in computer science from National University, Bangladesh, in 2006 and 2010, respectively. He is currently pursuing the Ph.D. degree with the Intelligent Technology Laboratory, Inha University, South Korea. His research interests include human action recognition, facial expression recognition, machine learning, and deep learning.

PHILL KYU RHEE received the B.S. degree in electrical engineering from Seoul National University, Seoul, South Korea, in 1982, the M.S. degree in computer science from East Texas State University, Commerce, TX, USA, in 1986, and the Ph.D. degree in computer science from the University of Louisiana, Lafayette, LA, USA, in 1990. From 1982 to 1985, he was a Research Scientist with the Systems Engineering Research Institute, Seoul. In 1991, he joined the Electronic and Telecommunication Research Institute, Seoul, as a Senior Research Staff Member. From 1992 to 2001, he was an Associate Professor with the Department of Computer Science and Information Engineering, Inha University, Incheon, South Korea, where he has been a Professor since 2001. His current research interests are pattern recognition, machine intelligence, and autonomic cloud computing. He is a member of the IEEE Computer Society and the Korea Information Science Society.