Sie sind auf Seite 1von 5

Semantic Hierarchy Preserving Deep Hashing

for Large-scale Image Retrieval


Xuefei Zhe∗ Le Ou-Yang
xfzhe2-c@my.cityu.edu.hk leouyang@szu.edu.cn
Department of Electronic Engineering Guangdong key Laboratory of
City University of Hong Kong intelligent information Processing
Shenzhen University,Guangdong,China

Shifeng Chen Hong Yan


arXiv:1901.11259v2 [cs.CV] 12 Apr 2019

shifeng.chen@siat.ac.cn h.yan@cityu.edu.hk
Shenzhen Institutes of Advanced Technology Department of Electronic Engineering
CAS, China City University of Hong Kong

Otter Shark
Shark

Fish
Hamster

Shrew
Dolphin
Hamster Ray

Shrew
Ray

(a)Without Hierarchy (b)With Hierarchy

Figure 1: Feature spaces w/o and w/ Semantic hierarchy. (a): Feature space without hierarchy can only learn the similarity
within fine-level classes. (b): Using semantic hierarchy, the learnt feature space preserves the similarity between classes.
ABSTRACT semantic labels or convert them to similar/dissimilar labels for train-
Convolutional neural networks have been widely used in content- ing. The natural semantic hierarchy structures are ignored in the
based image retrieval. To better deal with large-scale data, the deep training stage of the deep hashing model. In this paper, we present
hashing model is proposed as an effective method, which maps an effective algorithm to train a deep hashing model that can pre-
an image to a binary code that can be used for hashing search. serve a semantic hierarchy structure for large-scale image retrieval.
However, most existing deep hashing models only utilize fine-level Experiments on two datasets show that our method improves the
fine-level retrieval performance. Meanwhile, our model achieves
state-of-the-art results in terms of hierarchical retrieval.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation KEYWORDS
on the first page. Copyrights for components of this work owned by others than the Deep neural networks, learn to hash, hierarchical image retrieval
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission ACM Reference Format:
and/or a fee. Request permissions from permissions@acm.org.
Xuefei Zhe, Le Ou-Yang, Shifeng Chen, and Hong Yan. 2019. Semantic
Conference’17, July 2017, Washington, DC, USA
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Hierarchy Preserving Deep Hashing for Large-scale Image Retrieval. In
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00 Proceedings of ACM Conference (Conference’17). ACM, New York, NY, USA,
https://doi.org/10.1145/nnnnnnn.nnnnnnn 5 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
Conference’17, July 2017, Washington, DC, USA Xuefei Zhe, Le Ou-Yang, Shifeng Chen, and Hong Yan

1 INTRODUCTION Although different loss functions are proposed to conduct simi-


It has been witnessed in past few years that deep learning has sig- larity learning, the above-mentioned methods can only handle the
nificantly improved the quality of content-based image retrieval single level semantic labels. The semantic labels can be naturally
(CBIR) [7, 10, 12, 15]. With the explosive growth of online visual organized with a hierarchical structure. The WordNet [6], as an
data, there is an urgent need to develop more efficient deep learning example, is a large lexical database that groups more than 80, 000
models. The deep hashing model has been proposed as a promising concepts in a hierarchical structure. ImageNet [18] provides image
method for large-scale image retrieval. It can project images to bi- data organized based on the WordNet hierarchy. Hundreds or even
nary codes that are used for effective approximate nearest neighbor thousands of images are contained in each node of the hierarchy.
search, which can save both storage and computation costs. This hierarchical-structured image dataset has significantly boosted
However, most existing deep hashing models [1, 13, 14, 19] only a wide range of research areas.
utilize single level semantic labels for training. Their training pro- Recently, the semantic hierarchy learning problem has been ad-
cesses are supervised with the fine-level labels or similar/dissimilar dressed in several works. The hierarchical semantic image retrieval
labels that are converted from fine-level labels. With such supervi- model is proposed in [4]. This work encodes hierarchy in semantic
sion, the deep hashing model can only learning semantic similarity similarity. By combining the coarse and fine level labels, work in [5]
within classes. The relationship between classes is not well pre- proves that the image classification performance can be improved
served as the semantic hierarchy structure, which not only limits further. A similar idea has been shared in [16], which considers
the deep hashing model to learn a better semantic hashing space using a hierarchical training strategy to handle the face recognition
but also causes difficulty for the hierarchical retrieval. For example, task. These two works are only considered the general Euclidean
as shown in Figure 1, without hierarchical information, the fea- space. Recent work in [16] integrates the semantic relationship
ture space can only guarantee that images of Otters has a shorter between class levels into deep learning. The hierarchical similar-
distance than images from other class but cannot keep them close ity learning problem has also been addressed for deep hashing
to the images of dolphins that belongs to the same parent class in SHDH [20]. SHDH tackles the semantic hierarchy learning by
(Aquatic Mammals). However, utilizing the semantic hierarchy la- proposing a weighted Hamming distance. However, SHDH also
bel, images of otters would gather together. Meanwhile, they would uses pair-wise relation, which has been proved not as efficient as
be closer to images of dolphins than images from other parent using class-level labels [22, 24].
classes
In this work, we propose an effective deep hashing model that 3 PROPOSED APPROACH
can preserve the semantic hierarchy structure in the learnt hash- In a hierarchically labeled dataset, every image X n is marked with
ing space. Our model defines a hierarchy loss function based on a K-level semantic label denoted with a K-dim vector Yn . The la-
the Gaussian distribution that can directly utilize hierarchy labels bel vector consists of the lowest category to the highest one in a
for end-to-end training. Experiments on two hierarchical datasets: semantic tree. For example, CIFAR-100 has two-level semantic struc-
CIFAR-100 [11] and the North America birds dataset (NABirds) ture. An image of a dolphin is labels with Y = {Dolphin, Aquatic
show that our method obtains much better hierarchical retrieval Mammals}.
performance along with the improvement of fine-level retrieval. Our goal is to find a mapping function, a CNN here, which can
project images to a special Hamming space. In this Hamming space,
for any semantic level, the distance between points from the same
class should be smaller than the one between points from different
2 RELATED WORKS classes.
The deep hashing method is one of the most promising supervised In many previous works, only the fine level semantic information
learning to hashing methods. In general, deep hashing models can (categories at leaf nodes) is utilized to train deep hashing models.
directly map an image to a binary code that can be used for content- With such supervision, deep models can only preserve the images
based hashing search. Currently, many deep hashing models rely from the same lowest class gathering together. However, the sim-
on the similar/dissimilar labels that are generated from class labels. ilarities between the parent classes are out-of-order. To keep a
Image pairs from the same class are denoted as similar, otherwise, semantic hierarchical structure in the Hamming space, we propose
they are dissimilar. a multi-level Gaussian model.
DSPH [14] firstly proposes to utilize pair-wise labels to train the Denote the output of the mapping function as r n = f (X n , Θ)
end-to-end deep hashing model. HashNet [2] defines a weighted and r n ∈ {−1, 1} L . We formulate our objective function as follows
maximum likelihood of a pairwise logistic loss function for bal-
N Õ
K exp{− 2σ1 2 d(r n , µynk )}
ancing the similar and dissimilar labels. DTSH [21] extends the Õ
pair-wise supervision to the triplet one for capturing the seman- min J =− log ÍC k
Θ, M 1
i=1 exp{− 2σ 2 d(r n , µ ki )}
k
n=1 k =1
tic information more effectively. To fully use the semantic class k (1)
labels, several works design loss functions directly based on the subject to r n = f (X n , Θ) ∈ {−1, 1} L ,
class labels. SSDH [22] utilizes the softmax classifier to train the
hashing model. DCWH [24] constructs a Gaussian distribution- µ ki ∈ {−1, 1} L ,
based objective function to take the advantage of the class-level where Θ are parameters of the CNN and µ ki is the class center
information. DSRH [23] and DSDH [13] both combine the pair-wise of class i in the k-th level of the semantic hierarchy. Ck is the
and class-level supervisions. number of classes at k-th level and σk is a parameters to control the
Semantic Hierarchy Preserving Deep Hashing
for Large-scale Image Retrieval Conference’17, July 2017, Washington, DC, USA
Table 1: Results on CIFAR-100

mAP@all mAHP@2.5k
Methods
32-bit 64-bit 128-bit 32-bit 64-bit 128-bit
DPSH 0.1657 0.1829 0.1767 0.2334 0.2506 0.2218
DSDH 0.1790 0.2128 0.1987 0.2532 0.3201 0.3301
DCWH 0.7680 0.8178 0.8008 0.4443 0.5066 0.4616
Ours 0.8067 0.8259 0.8144 0.8595 0.8667 0.8630

class gaps. d(·, ·) is the Hamming distance function. To solve this 4 EXPERIMENTS
discrete optimization problem, we follow the optimization method
4.1 Implementation Details
in DCWH [24] that firstly relaxes the r n to [−α, α] where α is set to
1.1 as [24]. Then the distance d(·, ·) can be replaced with a Euclidean We implement our model with an open source deep learning frame-
distance. The loss function can be derived as work, MXNet [3]. We use Resnet-50 [8] as our base neural networks.
The models are pre-trained on the ImageNet [18] and are down-
K exp{− 2σ1 2 ∥r n − µynk ∥}
Õ loaded from the official website of MXNet. The final parameters of
L =− log ÍC k
k
exp{− 1 ∥r , µ ∥} (2) fully-connected layer are initialized by the Xavier method with the
k =1 i=1 2σk 2 n ki
magnitude of 1. The mini-batch stochastic gradient descent (SGD)
+ {ReLU (−α − r n ) + ReLU (r n − α)}, with momentum is used for training. The learning rate is decayed
where η 1 is a regulation term. ReLU is the rectified linear unit from 10−4 to 10−5 . Other hyperparameters are set based on the
defined as ReLU (x) = max(0, x). This loss function is differentiable dataset specifically following the stand cross-validation. We con-
and the classical back-propagation can be used for optimizing the ducted experiments on CIFAR-100 [11] and North American birds
parameters of CNN, Θ datasets. Images are firstly resized to 240 × 240, and then randomly
We update the fine-level (leaf nodes) class centers {µ 0i } with the cropped to 224 × 224 as the input of networks. Experiments are
training data using the following formulation. run on a single Nvidia GTX-1080 GPU server. The results of DPSH,
DSDH, and DCWH are based on the codes given by the original
N 0i
1 Õ authors. Other results are cited from their own papers.
µ 0i = f (x n , Θ), (3)
N 0i n=1
4.2 Evaluation Metrics
where N 0i is the number of images X n that belong to the i-th class in
the lowest level of hierarchy. The upper class centers are calculated We present both results of the single level retrieval and hierarchi-
with their own lower level class centers as follows cal retrieval performance. The fine level retrieval performance is
Ck i evaluated with the mean average precision (mAP@all). As for the
1 Õ hierarchical retrieval performance, we calculate the mean average
µ ki = µ , (4)
Cki c=1 (k −1),c Hierarchical Precision (mAHP@K) and the normalized Discounted
where Cki is the number of level-k classes belonging to the i-th Cumulative Gain (nDCG) [9] measurements.
class at level k. By such a recursive calculation, we can save com- Considering xq is a query image with label yq , the ordered re-
putation cost and eliminate the influence of imbalanced training trieval results is denoted as {(x 1 , y1 ), (x 2 , y2 ), . . . , (x N , y N )}. The
data between classes. To ensure that the parent class gaps are larger hierarchical precision (HP) [4, 16] at N-th returning images is de-
than the child ones, the hyperparameters σk should be chosen to fined as Ín
i=1 Sim(yq , yi )
satisfy the following equation, HP@N = , (5)
maxo ni=1 Sim(yq , yoi )
Í
σk2 < σk2+1
where Sim is score function for measuring the similarity between
The two-stage strategy described in [24] is used to reduce the the classes based on the category hierarchy. The denominator is
quantization error. The whole algorithm is summarized in Algo- the largest summation that can be achieved by the best N database
rithm 1. items.
Algorithm 1 Algorithm
4.3 CIFAR-100
Initialize CNN parameters Θ with a pre-trained model CIFAR-100 comprses of 60, 00 tinny images. It has a two-level se-
repeat mantic hierarchy. Each fine class contains 500 training images and
1. Compute features r n = f (x n , Θ) with Θ 100 for testing. The 100 fine classes are grouped into 20 coarse cat-
2. Update the fine-level class centers {µ 0i} with Eq. 3 egories. Each coarse class is consisted by 20 fine ones. The official
3. Update upper-level class centers with Eq. 4 training set is used for training deep hashing models. During the
4. Using {µ ki } and loss function Eq. 2 to optimize Θ for N testing, the training set serves as the database, and the testing im-
iterations ages are used as query ones. DPSH, DSDH, and DCWH are trained
until Convergence with fine labels. Our method uses both fine and coarse labels for
training.
Conference’17, July 2017, Washington, DC, USA Xuefei Zhe, Le Ou-Yang, Shifeng Chen, and Hong Yan

Table 2: Hierarchical Retrieval performance on NABirds

mAHP@352 mAHP@1087
Methods
64-bit 128-bit 256-bit 64-bit 128-bit 256-bit
DCWH 0.5939 0.5940 0. 6511 0.5215 0.5110 5733
Ours 0.8323 0.8372 0.8275 0.8631 0.8806 0.8797

We present the mAP@all as the fine level retrieval perfor- Table 4: MAP of NABirds
mance. Considering each coarse categories has 2500 images, we eval-
uate the hierarchical retrieval performance with mAHP@2,500. mAP@all
Both results are presented in Table 1. It can be found that, for Methods
64-bit 128-bit 256-bit
general retrieval performance, our method is slightly higher than
DCWH. At 32-bit, the mAP of our method is 3.87% higher than DCWH 0.5113 0.5538 0.5957
the one of DCWH, and 0.8% and 1.36% higher for 64- and 128-bit, Ours 0.6425 0.6590 0.6358
respectively. However, our method achieves significantly better re-
sults than the previous best one. Our method obtains 0.8631 mAHP
in average, which is 39.22% higher than DCWH’s 0.4708. The huge to supervise the training of deep hashing models. The fine level
difference of mAP and mAHP of DCWH indicates that the semantic retrieval performance is measured with mAP@all and the results
hierarchical Hamming space is hard to learn if only use the fine are presented in Table 4. It can be observed that by using the se-
level class labels. mantic hierarchical information, our method significantly improves
the fine level retrieval performance. The best performance of our
Table 3: nDCG@100 on CIFAR-100 method is 0.6590 at 128-bit, which is 6.33% higher than the best
of DCWH at 256-bit. The average performance gain is 9.21% for
nDCG@100 tested three bit numbers.
Methods
32-bit 64-bit In order to evaluate the hierarchical retrieval performance better,
DPSH∗ 0.5650 0.5751 we present the mAHP at different numbers of return images, which
SHDH∗ 0.6141 0.6460 are the median value (352) and the mean value (1087) of images for
DCWH 0.7894 0.8022 the highest-level categories. Both results are shown in Table 2. For
Ours 0.8189 0.8242 mAHP@352, our method obtains the best performance, 0.8372 at
128-bit, which is 23.42% higher than DCWH’s. For mAHP@1, 087,
We also provide nDCG@100 for easy comparison with SHDH [20], the average performance gain achieved by our method is 36.7%. In
which also utilizes the hierarchical label information for deep hash- conclusion, by utilizing the hierarchical information our method not
ing models. We follow the testing protocol in SHDH and present only significantly improves the hierarchical retrieval performance
the nDCG@100 result of DCWH and ours in Table 3. The results but also obtains better fine level retrieval results.
of DPSH ∗ and SH DH ∗ are cited from [20]. It can be found from
the table that our method surpasses SHDH with a clear margin 5 CONCLUSION
under nDCG@100 metric. Our model surpasses SHDH by 21.48% In this work, we address the problem of deep hashing learning with
and 17.82% for 32-bit and 64-bit, respectively. semantic hierarchical labels. We propose a novel hierarchical deep
hashing model based on a multi-level Gaussian loss function The
4.4 North America Birds proposed loss function allows our model to utilize the semantic
The North America Birds dataset [17] collects more than 48, 000 hierarchical labels. Meanwhile, it takes advantage of class-level
images from more than 550 visual categories of North America birds similarity learning as deep class-wise hashing [24], which enables
species. These categories are organized taxonomically into a five- our method to achieve much better performance than SHDH [20]
level hierarchy. Because all the categories belong to one root node that is based on pair-wise similarity learning.
“bird” that does not provide additional information gain, we ignore Experiments on two important hierarchical image datasets, CIFAR-
the final level and only use lower four-level hierarchy. We use the 100 and NABirds, show that our method surpasses other state-of-
official split that has 23, 929 images for training and 24, 633 testing the-art deep hashing methods in terms of the fine-level retrieval
ones are used for query images. The bounding box annotations are performance. Moreover, with help of the semantic hierarchy, our
not used in this experiment. We directly resize the original images method obtains significantly better hierarchical retrieval perfor-
to 240 × 240 and then randomly crop them to 224 × 224 as inputs of mance, which indicates that our model can provide more user-
the network. Randomly mirroring and randomly jittering are used desired retrieval results.
for data augmentation.
Considering that the performances of DPSH and DSDH signifi- REFERENCES
cantly decrease when the number of categories increases, we only [1] Yue Cao, Mingsheng Long, Jianmin Wang, Han Zhu, and Qingfu Wen. 2016. Deep
Quantization Network for Efficient Image Retrieval.. In AAAI. 3457–3463.
present the results of DCWH and ours. The main difference between [2] Z. Cao, M. Long, J. Wang, and P. S. Yu. 2017. HashNet: Deep learning to hash by
DCWH and ours is that we utilize the multi-level class information continuation. ArXiv preprint arXiv:1702.00758 (2017).
Semantic Hierarchy Preserving Deep Hashing
for Large-scale Image Retrieval Conference’17, July 2017, Washington, DC, USA

[3] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Joint Conference on Artificial Intelligence (IJCAI). 1711–1717.
Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. Mxnet: A flexible and [15] Yeqing Li, Chen Chen, Wei Liu, and Junzhou Huang. 2014. Sub-Selective Quanti-
efficient machine learning library for heterogeneous distributed systems. arXiv zation for Large-Scale Image Search.. In AAAI. 2803–2809.
preprint arXiv:1512.01274 (2015). [16] Yuhao Ma, Meina Kan, Shiguang Shan, and Xilin Chen. 2018. Hierarchical
[4] Jia Deng, Alexander C Berg, and Li Fei-Fei. 2011. Hierarchical semantic indexing Training for Large Scale Face Recognition with Few Samples Per Subject. In 2018
for large scale image retrieval. (2011). 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2401–2405.
[5] Anuvabh Dutt, Denis Pellerin, and Georges Quénot. 2017. Improving Image [17] Cornell Lab of Ornithology. [n. d.]. NABirds Dataset. http://dl.allaboutbirds.org/
Classification using Coarse and Fine Labels. In Proceedings of the 2017 ACM on nabirds Accessed:2019-01-25.
International Conference on Multimedia Retrieval. ACM, 438–442. [18] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,
[6] Christiane Fellbaum. 2010. WordNet. In Theory and applications of ontology: Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C.
computer applications. Springer, 231–243. Berg, and Li Fei-Fei. 2015. ImageNet large scale visual recognition challenge.
[7] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. 2013. International Journal of Computer Vision 115, 3 (2015), 211–252.
Iterative quantization: A procrustean approach to learning binary codes for [19] Xiang Cheng Rong Bi Sen Su, Gang Chen. 2017. Deep Supervised Hashing with
large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Nonlinear Projections. In Proceedings of the International Joint Conference on
Intelligence 35, 12 (2013), 2916–2929. Artificial Intelligence (IJCAI). 2786–2792.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual [20] Dan Wang, Heyan Huang, Chi Lu, Bo-Si Feng, Liqiang Nie, Guihua Wen, and
learning for image recognition. In CVPR. 770–778. Xian-Ling Mao. 2017. Supervised Deep Hashing for Hierarchical Labeled Data.
[9] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation arXiv preprint arXiv:1704.02088 (2017).
of IR techniques. ACM Transactions on Information Systems 20, 4 (2002), 422–446. [21] Xiaofang Wang, Yi Shi, and Kris M Kitani. 2016. Deep supervised hashing with
[10] Mohammad Ahangar Kiasari, Minho Lee, and Jixiang Shen. 2016. Content-based triplet labels. In Asian Conference on Computer Vision. 70–84.
image retrieval by using deep kernel Machine with Gaussian Mixture Model. In [22] Huei-Fang Yang, Kevin Lin, and Chu-Song Chen. 2017. Supervised Learning
Neural Networks (IJCNN), 2016 International Joint Conference on. IEEE, 3371–3377. of Semantics-Preserving Hash via Deep Convolutional Neural Networks. IEEE
[11] Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features Transactions on Pattern Analysis and Machine Intelligence (2017). in press.
from tiny images. (2009). University of Toronto Technical Report. [23] Ting Yao, Fuchen Long, Tao Mei, and Yong Rui. 2016. Deep Semantic-Preserving
[12] Brian Kulis, Prateek Jain, and Kristen Grauman. 2009. Fast similarity search for and Ranking-Based Hashing for Image Retrieval.. In Proceedings of the Interna-
learned metrics. IEEE Transactions on Pattern Analysis and Machine Intelligence tional Joint Conference on Artificial Intelligence (IJCAI). 3931–3937.
31, 12 (2009), 2143–2157. [24] Xuefei Zhe, Shifeng Chen, and Hong Yan. 2018. Deep Class-Wise Hash-
[13] Qi Li, Zhenan Sun, Ran He, and Tieniu Tan. 2017. Deep Supervised Discrete ing: Semantics-Preserving Hashing via Class-wise Loss. arXiv preprint
Hashing. arXiv preprint arXiv:1705.10999 (2017). arXiv:1803.04137 (2018).
[14] Wu-Jun Li, Sheng Wang, and Wang-Cheng Kang. 2016. Feature learning based
deep supervised hashing with pairwise labels. In Proceedings of the International

Das könnte Ihnen auch gefallen