
Published in IET Image Processing

Received on 5th September 2010


Revised on 13th April 2011
doi: 10.1049/iet-ipr.2011.0005
ISSN 1751-9659
Research on pornographic images recognition method
based on visual words in a compressed domain
L. Sui J. Zhang L. Zhuo Y.C. Yang
Signal and Information Processing Laboratory, Beijing University of Technology, Beijing 100124,
People's Republic of China
E-mail: zhj@bjut.edu.cn
Abstract: In order to recognise and filter pornographic images, visual-word-based image representation has attracted more and
more attention. An image can be represented as a bag of visual words, which is analogous to the bag-of-words representation of
text documents. However, most of the existing approaches create visual words from images in the pixel domain, which requires
extra processing time to decompress images, since most images are stored in compressed formats. A novel pornographic image
recognition method based on visual words in a compressed domain is proposed in this study. There are four steps in this method:
(i) a low-resolution image is constructed from compressed data; (ii) scale-invariant feature transform (SIFT) descriptors are
extracted from this low-resolution image; (iii) a visual vocabulary is created based on SIFT descriptors; (iv) pornographic
images are identified by using a support vector machine (SVM) classifier. The experimental results indicate that the proposed
method can recognise pornographic images accurately with much less computational time.
1 Introduction
With the rapid development of multimedia and the Internet, web
pages including harmful content, such as pornographic
images, are widely available, which has done great harm
to social stability and people's physical and mental health,
especially for adolescents. Therefore, how to identify
pornographic images effectively and filter them automatically
is a challenging issue [1].
In recent years, pornographic image recognition based on
visual words has attracted more and more attention. Inspired
by text content analysis, researchers have viewed an image as
a combination of words (visual words), and applied text
analysis methods to image classification, scene analysis and
image semantic annotation, achieving remarkable results [2–5].
Visual words have since been introduced to pornographic image
recognition, in which the following three steps are involved
[6]: (i) scale-invariant feature transform (SIFT) descriptor
extraction; (ii) visual vocabulary creation; (iii) image
representation and recognition. In [6], SIFT descriptors
were extracted to create a visual vocabulary, and then the five
most sensitive regions were selected by analysing the
distribution of the image-sensitive words; finally, a C4.5
decision tree was used to recognise pornographic images.
Analogous to the word, phrase and paragraph in a text, a
three-level hierarchical strategy, consisting of visterm, visual
phrase and region of interest, was proposed to analyse the
image content in [7]. Deselaers et al. [8] introduced
principal component analysis (PCA) to reduce the
dimensionality of SIFT descriptors, and employed unsupervised
training of a Gaussian mixture model to create the visual
vocabulary. Based on SIFT descriptors, Lienhart and Hauke [9]
used probabilistic latent semantic analysis for pornographic
image recognition. Such methods can achieve better
recognition results for normal images with colour
backgrounds, without using any skin and/or shape models.
However, these approaches require the image data to be
decompressed completely, which results in extra computation
and slower recognition.
With the introduction of many image compression
standards, network images are stored and transmitted in
compressed formats such as JPEG. Since traditional
image-processing methods must fully decompress the compressed
data before working in the pixel domain, they are time
consuming [10]. An effective alternative is to process image
data in the compressed domain, which can make full use of the
image compression processing and the characteristics of the
compressed data. A novel skin detection method in the JPEG
compressed domain was proposed in our work [11]. Firstly,
colour and texture features of the image blocks were extracted
from the entropy-decoded discrete cosine transform (DCT)
coefficients. Secondly, a data mining method, for example, a
decision tree, was applied to establish the skin colour model.
Based on the research results in [11], a pornographic image
recognition method in the compressed domain was further
proposed in our work [12]. At first, various features in the
compressed domain were extracted, such as the skin feature,
global features (texture and colour) and the face feature; then
a multi-cost sensitive decision tree construction method was
used for pornographic image recognition.
In order to reduce the time of visual word creation, a
pornographic image recognition method based on visual
words in the compressed domain is proposed in this paper.
There are four steps in this method: (i) a low-resolution
image is constructed from compressed data; (ii) SIFT
descriptors are extracted from low-resolution images; (iii) a
visual vocabulary is created based on SIFT descriptors; (iv)
pornographic images are identified by using a support
vector machine (SVM) classifier.
IET Image Process., 2012, Vol. 6, Iss. 1, pp. 87–93
doi: 10.1049/iet-ipr.2011.0005 © The Institution of Engineering and Technology 2012
www.ietdl.org
The remainder of this paper is organised as follows:
Section 2 introduces the proposed method and describes
the components of SIFT descriptors extraction, visual
vocabulary creation, image representation and recognition.
Results and performance evaluation of the proposed method
are shown in Section 3. Finally, conclusions and future
work are given in Section 4.
2 Proposed pornographic images
recognition method
Image processing in the compressed domain refers to
operating on the compressed data directly without full
decoding. This approach greatly reduces the
computational cost and resource requirements by avoiding
the decompression steps, in whole or in part, and by
eliminating the need to store the reconstructed images
[13]. The image-processing scheme in the JPEG compressed
domain is shown in Fig. 1, in which the non-decompressed
process operates before the entropy decoder, and the partly
decompressed process operates after entropy decoding or before
de-quantising.
In the proposed method, we construct low-resolution
images from the compressed data before the inverse discrete
cosine transform (IDCT), as denoted by the partly
decompressed process in Fig. 1, and extract SIFT descriptors
from the low-resolution images. Fig. 2 illustrates the image
recognition process, which consists of three steps: (i)
low-resolution image construction; (ii) SIFT descriptor
extraction; (iii) image representation and recognition.
Low-resolution image construction has two advantages: (i) it
avoids the IDCT process, which reduces the computational
complexity and saves decoding time; (ii) SIFT descriptor
extraction from a low-resolution image saves further time.
2.1 Low-resolution image construction in a
compressed domain
The proposed method constructs low-resolution images
directly from the compressed image data in order to avoid
the IDCT process and to speed up SIFT descriptor
extraction. Here, two kinds of low-resolution images
(1/4 × 1/4 and 1/2 × 1/2 of the original image resolution)
can be obtained. For example, if the original image resolution
is 800 × 800, the 1/4 × 1/4 version of the original image is
200 × 200 and the 1/2 × 1/2 version is 400 × 400. In addition,
a 1/8 × 1/8 version of the original image can be constructed by
combining the DC coefficients of each 8 × 8 block.
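As a rough illustration of the 1/8 × 1/8 case, the sketch below turns the quantised DC coefficients of a greyscale image into a thumbnail. It assumes the usual JPEG conventions (an orthonormal 8 × 8 DCT, so that the block mean is the de-quantised DC value divided by 8, and a level shift of 128). The function name `dc_thumbnail`, its arguments and the sample values are illustrative and do not come from the paper.

```python
def dc_thumbnail(dc_quantised, q00, level_shift=128):
    """Build a 1/8 x 1/8 thumbnail from per-block quantised DC coefficients.

    dc_quantised: 2-D list holding one quantised DC value per 8x8 block.
    q00: quantisation step of the DC coefficient (assumed to be known).
    """
    thumbnail = []
    for row in dc_quantised:
        out_row = []
        for dc in row:
            mean = dc * q00 / 8.0 + level_shift            # block mean = de-quantised DC / 8
            out_row.append(min(255, max(0, round(mean))))  # clamp to the 8-bit range
        thumbnail.append(out_row)
    return thumbnail


blocks = [[16, -4], [0, 40]]        # hypothetical quantised DC values of four blocks
thumbnail = dc_thumbnail(blocks, q00=4)
```

Each output pixel costs one multiplication and one addition per 64 source pixels, which is why this version is the cheapest of the three.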
2.1.1 1/4 × 1/4 version of original image
construction: To construct a 1/4 × 1/4 version image, the
first four DCT coefficients of the zigzag sequence in every
8 × 8 block are first used to construct a 2 × 2
matrix A after inverse quantisation in the
decoder. Secondly, the 2 × 2 pixel matrix I is
calculated by (1). Finally, the matrices I of all blocks are
combined to construct the 1/4 × 1/4 version image
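$$ I = C A C^{T} \tag{1} $$

The expanded form of (1) is given by

$$
\begin{bmatrix} I_{0,0} & I_{0,1} \\ I_{1,0} & I_{1,1} \end{bmatrix}
=
\begin{bmatrix} c_0 & c_1 \\ c_0 & -c_1 \end{bmatrix}
\begin{bmatrix} A_{0,0} & A_{0,1} \\ A_{1,0} & A_{1,1} \end{bmatrix}
\begin{bmatrix} c_0 & c_0 \\ c_1 & -c_1 \end{bmatrix}
\tag{2}
$$

where

$$
\begin{aligned}
I_{0,0} &= (c_0 c_0 A_{0,0} + c_0 c_1 A_{0,1}) + (c_0 c_1 A_{1,0} + c_1 c_1 A_{1,1}) \\
I_{0,1} &= (c_0 c_0 A_{0,0} - c_0 c_1 A_{0,1}) + (c_0 c_1 A_{1,0} - c_1 c_1 A_{1,1}) \\
I_{1,0} &= (c_0 c_0 A_{0,0} + c_0 c_1 A_{0,1}) - (c_0 c_1 A_{1,0} + c_1 c_1 A_{1,1}) \\
I_{1,1} &= (c_0 c_0 A_{0,0} - c_0 c_1 A_{0,1}) - (c_0 c_1 A_{1,0} - c_1 c_1 A_{1,1})
\end{aligned}
\tag{3}
$$

where $c_0$ and $c_1$ are calculated by (4)

$$
c_0 = 1/\sqrt{8}, \qquad c_1 = \tfrac{1}{2}\, C_{4}^{1}\, C_{8}^{1}\, C_{16}^{1}
\tag{4}
$$

where $C_{N}^{n} = \cos(n\pi/N)$.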
Assuming that $Q_{0,0}$, $Q_{0,1}$, $Q_{1,0}$ and $Q_{1,1}$ are the first four
quantisation table elements and $A^{Q}_{i,j}$ is the DCT coefficient
before inverse quantisation, the 1/4 × 1/4 version image
construction procedure is shown in Fig. 3.
It can be seen from Fig. 3 that each pixel of the original image
requires only 0.125 additions and 0.0625
multiplications. In addition, only four DCT coefficients are
decoded in the proposed method, which greatly improves the
construction speed.
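The per-pixel cost quoted above can be checked with a small sketch. The function `quarter_block` below evaluates (3) with four multiplications and eight additions per 8 × 8 block, that is 4/64 = 0.0625 multiplications and 8/64 = 0.125 additions per original pixel. The reference function, which runs a full 8 × 8 inverse DCT on the same four coefficients and averages each 4 × 4 quadrant, is added here only as a cross-check; both function names and the sample coefficient values are assumptions made for illustration.

```python
import math

C0 = 1 / math.sqrt(8)
C1 = 0.5 * math.cos(math.pi / 4) * math.cos(math.pi / 8) * math.cos(math.pi / 16)


def quarter_block(a00, a01, a10, a11):
    """2x2 pixel block I = C A C^T, using 4 multiplications and 8 additions."""
    # In a decoder the constants could be merged with the quantisation steps,
    # so each coefficient needs a single multiplication.
    p, q = (C0 * C0) * a00, (C0 * C1) * a01
    r, s = (C0 * C1) * a10, (C1 * C1) * a11
    pq_sum, pq_diff = p + q, p - q
    rs_sum, rs_diff = r + s, r - s
    return [[pq_sum + rs_sum, pq_diff + rs_diff],
            [pq_sum - rs_sum, pq_diff - rs_diff]]


def alpha(u):
    """Normalisation factor of the orthonormal 8-point DCT."""
    return 1 / math.sqrt(2) if u == 0 else 1.0


def idct_quadrant_means(a00, a01, a10, a11):
    """Reference: full 8x8 inverse DCT of the four coefficients, averaged per 4x4 quadrant."""
    coeff = {(0, 0): a00, (0, 1): a01, (1, 0): a10, (1, 1): a11}
    pixels = [[0.25 * sum(alpha(u) * alpha(v) * value
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16)
                          for (u, v), value in coeff.items())
               for y in range(8)] for x in range(8)]
    return [[sum(pixels[x][y]
                 for x in range(4 * m, 4 * m + 4)
                 for y in range(4 * n, 4 * n + 4)) / 16
             for n in range(2)] for m in range(2)]


fast = quarter_block(80.0, -12.0, 6.0, 3.0)
reference = idct_quadrant_means(80.0, -12.0, 6.0, 3.0)
```

Under these assumptions the two results agree to rounding error, which suggests that each reconstructed pixel is the mean of a 4 × 4 quadrant of the block that a full decoder would produce from the same coefficients.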
Fig. 1 Image-processing scheme in the compressed domain
Fig. 2 Proposed method processing steps
Fig. 3 1/4 × 1/4 version image construction
2.1.2 1/2 × 1/2 version of original image
construction: Similar to the above method, we keep the
first 16 DCT coefficients of the zigzag sequence and employ a
4 × 4 matrix $I'$ to construct 1/2 × 1/2 version
images

$$ I' = C' A' C'^{T} \tag{5} $$
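where

$$
C' = E + F =
\begin{bmatrix}
c_0 & c_1 & c_2 & c_3 \\
c_0 & c_1 & -c_2 & -c_3 \\
c_0 & -c_1 & -c_2 & c_3 \\
c_0 & -c_1 & c_2 & -c_3
\end{bmatrix}
+
\begin{bmatrix}
0 & 0 & 0 & 0 \\
0 & c_3 - c_1 & 0 & c_3 - c_1 \\
0 & c_1 - c_3 & 0 & c_1 - c_3 \\
0 & 0 & 0 & 0
\end{bmatrix},
$$

$c_0 = 1/\sqrt{8}$, $c_1 = \tfrac{1}{2} C_{16}^{1} C_{16}^{2}$, $c_2 = \tfrac{1}{2} C_{16}^{2} C_{16}^{4}$ and $c_3 = \tfrac{1}{2} C_{16}^{3} C_{16}^{6}$.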
Hence, (5) can be rewritten as

$$
I' = (E + F) A' (E + F)^{T}
= E A' E^{T} + E A' F^{T} + F A' E^{T} + F A' F^{T}
= I1 + I2 + I3 + I4
\tag{6}
$$

The elements of I1, I2, I3 and I4 are $I1_{i,j}$, $I2_{i,j}$, $I3_{i,j}$ and
$I4_{i,j}$, respectively. In matrix $I'$, each element $I'_{i,j}$ is calculated
by (7).

$$
I'_{i,j} = I1_{i,j} + I2_{i,j} + I3_{i,j} + I4_{i,j}, \qquad 0 \le i, j \le 3
\tag{7}
$$
The matrix I1 calculation is shown in Fig. 4.
Fig. 4 Matrix I1 calculation
The matrices I2, I3 and I4 are sparse because F is a sparse matrix.
Hence, we only calculate the non-zero elements of these three
matrices as follows

$$
\begin{aligned}
e &= (c_1 - c_3)(A_{0,2} - A_{0,1}) \\
f &= (c_1 - c_3)(A_{1,2} - A_{1,1}) \\
m &= (c_1 - c_3)(A_{2,2} - A_{2,1}) \\
n &= (c_1 - c_3)(A_{3,2} - A_{3,1}) \\
I2_{0,1} &= -I2_{0,3} = e \\
I2_{1,1} &= -I2_{1,3} = f \\
I2_{2,1} &= -I2_{2,3} = m \\
I2_{3,1} &= -I2_{3,3} = n
\end{aligned}
\tag{8}
$$

where $A_{i,j}$ is as shown in Fig. 4.
$$
\begin{aligned}
e &= c_0 (c_1 - c_3)(A^{Q}_{3,0} Q_{3,0} + A^{Q}_{1,0} Q_{1,0}) \\
f &= c_1 (c_1 - c_3)(A^{Q}_{3,1} Q_{3,1} + A^{Q}_{1,1} Q_{1,1}) \\
m &= c_2 (c_1 - c_3)(A^{Q}_{3,2} Q_{3,2} + A^{Q}_{1,2} Q_{1,2}) \\
n &= c_3 (c_1 - c_3)(A^{Q}_{3,3} Q_{3,3} + A^{Q}_{1,3} Q_{1,3}) \\
e_1 &= e + f \\
f_1 &= e - f \\
m_1 &= m + n \\
n_1 &= m - n \\
I3_{2,0} &= -I3_{1,0} = e_1 + m_1 \\
I3_{2,1} &= -I3_{1,1} = e_1 - m_1 \\
I3_{2,2} &= -I3_{1,2} = f_1 - n_1 \\
I3_{2,3} &= -I3_{1,3} = f_1 + n_1
\end{aligned}
\tag{9}
$$
$$
\begin{aligned}
e &= (c_1 - c_3)(c_1 - c_3)(A^{Q}_{3,1} Q_{3,1} + A^{Q}_{1,1} Q_{1,1}) \\
f &= (c_1 - c_3)(c_1 - c_3)(A^{Q}_{3,3} Q_{3,3} + A^{Q}_{1,3} Q_{1,3}) \\
e_1 &= e + f \\
I4_{1,1} &= -I4_{1,2} = -I4_{2,1} = I4_{2,2} = e_1
\end{aligned}
\tag{10}
$$
It can be seen from Fig. 4 and (8)–(10) that each pixel of the
original-resolution image requires only 1.25 additions and
0.65625 multiplications. In addition, only 16 DCT coefficients
are decoded.
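The benefit of splitting the transform matrix into a dense part E and a sparse part F can be illustrated numerically. The sketch below checks that the four partial products of (6) add up to the full product, and that the products involving F on the left only touch the two middle rows. The sign patterns of `E` and `F` are an assumed reconstruction, and the coefficient matrix `A` is arbitrary test data, so this is a sanity check of the algebra rather than the decoder itself.

```python
import math


def cosine(n, big_n):
    """C_N^n = cos(n * pi / N)."""
    return math.cos(n * math.pi / big_n)


c0 = 1 / math.sqrt(8)
c1 = 0.5 * cosine(1, 16) * cosine(2, 16)
c2 = 0.5 * cosine(2, 16) * cosine(4, 16)
c3 = 0.5 * cosine(3, 16) * cosine(6, 16)
d = c1 - c3

# Assumed sign patterns: E is the dense butterfly part, F the sparse correction.
E = [[c0,  c1,  c2,  c3],
     [c0,  c1, -c2, -c3],
     [c0, -c1, -c2,  c3],
     [c0, -c1,  c2, -c3]]
F = [[0.0, 0.0, 0.0, 0.0],
     [0.0,  -d, 0.0,  -d],
     [0.0,   d, 0.0,   d],
     [0.0, 0.0, 0.0, 0.0]]


def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def transpose(x):
    return [list(column) for column in zip(*x)]


def add(*matrices):
    return [[sum(m[i][j] for m in matrices) for j in range(4)] for i in range(4)]


def nonzero_cells(matrix):
    return {(i, j) for i in range(4) for j in range(4) if abs(matrix[i][j]) > 1e-12}


A = [[float(4 * i + j + 1) for j in range(4)] for i in range(4)]   # arbitrary test coefficients

C = add(E, F)
full = matmul(matmul(C, A), transpose(C))
I1 = matmul(matmul(E, A), transpose(E))
I2 = matmul(matmul(E, A), transpose(F))
I3 = matmul(matmul(F, A), transpose(E))
I4 = matmul(matmul(F, A), transpose(F))
total = add(I1, I2, I3, I4)
```

Only I1 needs the full butterfly structure; with this F, I3 is confined to rows 1 and 2 and I4 to the central 2 × 2 cells, which is why the correction terms add few operations.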
2.2 SIFT descriptors extraction
After the low-resolution images are constructed from the compressed
data, the Difference-of-Gaussian (DoG) [14] method is used to
detect keypoints in the images, and SIFT descriptors are
calculated according to the scale and orientation of the keypoints.
SIFT descriptors are invariant to image rotation and scale,
and robust across a substantial range of affine distortion,
addition of noise and change in illumination [14].
SIFT descriptor extraction includes the following
steps: (i) building the DoG scale space, (ii) finding local
extreme points of the DoG scale space, (iii) calculating the
scales of the keypoints, (iv) assigning orientations to the
keypoints, (v) calculating SIFT descriptors. The DoG
functions are defined as follows
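$$
G_{\sigma}(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right)
\tag{11}
$$

$$
I_{k\sigma}(x, y, \sigma) = G_{\sigma}(x, y, k\sigma) * I(x, y)
\tag{12}
$$

$$
D_{\sigma} = I_{k\sigma} - I_{\sigma}
\tag{13}
$$

where I denotes an input image, $*$ denotes convolution, $k = \sqrt{2}$ and $\sigma = 1.6$.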
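A minimal sketch of (11)–(13) is given below, assuming a separable, truncated Gaussian kernel that is normalised by the sum of its samples (instead of the analytic factor in (11)) and edge replication at the image border. These choices, the helper names and the 15 × 15 test image are illustrative and are not taken from the paper.

```python
import math


def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(t * t) / (2 * sigma * sigma)) for t in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image, sigma):
    """Separable Gaussian blur with edge replication."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])

    def clamp(value, size):
        return min(max(value, 0), size - 1)

    rows = [[sum(kernel[t + radius] * image[y][clamp(x + t, width)]
                 for t in range(-radius, radius + 1))
             for x in range(width)] for y in range(height)]
    return [[sum(kernel[t + radius] * rows[clamp(y + t, height)][x]
                 for t in range(-radius, radius + 1))
             for x in range(width)] for y in range(height)]


def difference_of_gaussian(image, sigma=1.6, k=math.sqrt(2)):
    """D_sigma = I_{k sigma} - I_sigma, following (13)."""
    wide, narrow = blur(image, k * sigma), blur(image, sigma)
    return [[w - n for w, n in zip(wide_row, narrow_row)]
            for wide_row, narrow_row in zip(wide, narrow)]


# A single bright pixel on a dark background (hypothetical test image).
image = [[0.0] * 15 for _ in range(15)]
image[7][7] = 1.0
dog = difference_of_gaussian(image)
```

For this test image the response at the bright pixel is negative, because the wider Gaussian spreads the same energy over a larger area; a uniform image gives a response of zero everywhere.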
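The DoG scale space is constructed by calculating the
above functions. The local extreme points are elements
whose 3 × 3 × 3 neighbours (in space and scale) all have
smaller (or larger) values. Then all extreme points with the
absolute value $|D_{\sigma}(X)|$ less than 0.03 will be discarded. The
absolute value is defined as

$$
|D_{\sigma}(X)| = \left| D_{\sigma} + \frac{1}{2} \frac{\partial D_{\sigma}^{T}}{\partial X}
\left( -\left(\frac{\partial^{2} D_{\sigma}}{\partial X^{2}}\right)^{-1} \frac{\partial D_{\sigma}}{\partial X} \right) \right|
\tag{14}
$$

where $D_{\sigma}$ is the scale-space function at the extreme point and
$X = (x, y, \sigma)^{T}$ is the offset from the extreme point. Then those
extreme points with a strong response along edges are also
removed as follows: given an extreme point $(x, y, \sigma)$, a
Hessian matrix is calculated, and then the extreme points
meeting the following criterion are retained

$$
H(x, y, \sigma) =
\begin{bmatrix}
\dfrac{\partial^{2} D_{\sigma}}{\partial x^{2}} & \dfrac{\partial^{2} D_{\sigma}}{\partial x\,\partial y} \\
\dfrac{\partial^{2} D_{\sigma}}{\partial y\,\partial x} & \dfrac{\partial^{2} D_{\sigma}}{\partial y^{2}}
\end{bmatrix}
\tag{15}
$$

$$
\frac{(\operatorname{tr} H(x, y, \sigma))^{2}}{\det H(x, y, \sigma)} < \frac{(t_e + 1)^{2}}{t_e}
\tag{16}
$$

where $t_e$ is the edge threshold and equals 10.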
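The edge criterion (16) can be sketched in a few lines. The function below takes the three distinct entries of the Hessian in (15) and reports whether the extreme point is kept; rejecting points with a non-positive determinant follows the treatment in [14], and the function name and sample derivative values are made up for illustration.

```python
def passes_edge_test(dxx, dyy, dxy, edge_threshold=10.0):
    """Return True if the extreme point satisfies criterion (16)."""
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Principal curvatures of opposite sign (or a flat direction): discard.
        return False
    return trace * trace / det < (edge_threshold + 1) ** 2 / edge_threshold


blob_kept = passes_edge_test(dxx=-2.0, dyy=-1.5, dxy=0.1)    # similar curvature in both directions
edge_kept = passes_edge_test(dxx=-10.0, dyy=-0.1, dxy=0.0)   # strong curvature in one direction only
```

With the threshold of 10, the bound on the right-hand side of (16) is 12.1: the blob-like point has a ratio of about 4.1 and is kept, while the edge-like point has a ratio of about 102 and is removed.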
For the orientation assignment, a histogram of the gradient
orientations is calculated in a Gaussian window with a
standard deviation that is 1.5 times the scale of the
keypoint. The gradient magnitude $m(x, y)$ of each keypoint is
given by (17), and the orientation $\theta(x, y)$ is calculated
from pixel differences as in [14]

$$
\theta(x, y) = \tan^{-1}\left( \frac{I_{\sigma}(x, y + 1) - I_{\sigma}(x, y - 1)}{I_{\sigma}(x + 1, y) - I_{\sigma}(x - 1, y)} \right)
\tag{18}
$$

This histogram is then smoothed, and the maximum, together with
any other peak above 80% of the maximum, is retained as a keypoint
orientation. Finally, the SIFT descriptor of each keypoint is
extracted. SIFT descriptors are three-dimensional spatial
histograms (4 × 4 × 8) of the image gradients, formed by the
pixel location and the gradient orientation.
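The gradient computation behind the orientation histogram can be sketched as follows, using central differences as in [14]. The sketch uses `atan2` so that the full range of angles is covered, chooses 36 bins as in [14], and omits the Gaussian weighting and the smoothing step; these simplifications and the function names are assumptions made for brevity.

```python
import math


def gradient_at(image, x, y):
    """Central-difference gradient magnitude and orientation at pixel (x, y)."""
    dx = image[y][x + 1] - image[y][x - 1]
    dy = image[y + 1][x] - image[y - 1][x]
    magnitude = math.sqrt(dx * dx + dy * dy)
    orientation = math.atan2(dy, dx)          # radians in (-pi, pi]
    return magnitude, orientation


def orientation_histogram(image, bins=36):
    """Magnitude-weighted histogram of gradient orientations over the interior pixels."""
    histogram = [0.0] * bins
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            magnitude, theta = gradient_at(image, x, y)
            index = int((theta % (2 * math.pi)) / (2 * math.pi) * bins) % bins
            histogram[index] += magnitude
    return histogram


ramp = [[float(x) for x in range(5)] for _ in range(5)]   # brightness grows to the right
magnitude, orientation = gradient_at(ramp, 2, 2)
histogram = orientation_histogram(ramp)
```

For the horizontal ramp every interior pixel has the same gradient, so all of the weight falls into the first bin and the dominant orientation is unambiguous.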
2.3 Visual vocabulary creation
As SIFT descriptors normally have 128 dimensions, some
dimensionality reduction techniques are considered. One
common way is PCA. At first, we extract SIFT descriptors
to obtain the data matrix $X_{n \times 128}$ (n is the number of keypoints
in one image), and calculate the mean for each of the data
dimensions

$$
E(X_j) = \frac{\sum_{i=1}^{n} X_{i,j}}{n}, \qquad j = 1, \ldots, 128
\tag{19}
$$

Then all the possible covariance values between the different
dimensions are calculated and stored in the covariance matrix
$C_{128 \times 128}$, which is defined as

$$
C_{128 \times 128} = (c_{i,j},\; c_{i,j} = \operatorname{cov}(X_i, X_j)), \qquad i, j = 1, \ldots, 128,\; i \ne j
\tag{20}
$$

where

$$
\operatorname{cov}(X_i, X_j) = \frac{\sum_{k=1}^{n} (X_{k,i} - E(X_i))(X_{k,j} - E(X_j))}{n}
$$
$$
m(x, y) = \sqrt{(I_{\sigma}(x + 1, y) - I_{\sigma}(x - 1, y))^{2} + (I_{\sigma}(x, y + 1) - I_{\sigma}(x, y - 1))^{2}}
\tag{17}
$$

Since the covariance matrix is square, we can calculate its
eigenvectors and eigenvalues. Then, we choose only the
first 30 eigenvectors, and form a new feature vector matrix
A with these eigenvectors in the columns

$$
A_{128 \times 30} = (a_1, a_2, \ldots, a_{30})
\tag{21}
$$
Finally, we take the transpose of the feature vector matrix and
multiply it on the left of the original dataset

$$
D_{30 \times n} = (A_{128 \times 30})^{T} B_{128 \times n}
\tag{22}
$$

where $B_{128 \times n}$ is the mean-adjusted data transposed, that is, the
data items are in each column, with each row holding a
separate dimension.
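The PCA steps (19)–(22) can be sketched on a toy data set. The code below uses power iteration with deflation to find the leading eigenvectors, which is a simple stand-in chosen so that the example needs no external library; the paper does not say which eigen-solver was used. The data here are two-dimensional and reduced to one component, whereas the paper reduces 128 dimensions to 30; all names are illustrative.

```python
import math


def mean_adjust(data):
    """Subtract the per-dimension mean (19) from every row of an n x d data matrix."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]


def covariance(centred):
    """Covariance matrix of mean-adjusted data, dividing by n as in the text."""
    n, d = len(centred), len(centred[0])
    return [[sum(row[i] * row[j] for row in centred) / n for j in range(d)] for i in range(d)]


def top_eigenvectors(matrix, count, iterations=500):
    """Leading eigenvectors of a symmetric matrix by power iteration with deflation."""
    d = len(matrix)
    work = [row[:] for row in matrix]
    vectors = []
    for _ in range(count):
        start_norm = math.sqrt(sum((i + 1) ** 2 for i in range(d)))
        v = [(i + 1) / start_norm for i in range(d)]       # fixed, non-symmetric start vector
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(work[i][j] * v[j] for j in range(d)) for i in range(d))
        vectors.append(v)
        # Deflation: remove the component that has just been found.
        work = [[work[i][j] - eigenvalue * v[i] * v[j] for j in range(d)] for i in range(d)]
    return vectors


def project(centred, vectors):
    """Reduced descriptors: one row per data item, one column per kept eigenvector."""
    return [[sum(a * b for a, b in zip(vector, row)) for vector in vectors] for row in centred]


data = [[2.0, 1.9], [1.0, 1.1], [0.0, 0.0], [-1.0, -1.1], [-2.0, -1.9]]
centred = mean_adjust(data)
axes = top_eigenvectors(covariance(centred), count=1)
reduced = project(centred, axes)
```

Because the toy points lie close to the line y = x, the leading axis comes out close to the diagonal direction, and each point is reduced to a single coordinate along it.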
After PCA transform is used, we can obtain SIFT
descriptors for all keypoints of each image. Then all
keypoints are grouped into some clusters using the K-means
method. The centre of each cluster is viewed as a visual
word to represent an image.
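The clustering step, and the way an image is later mapped onto the resulting vocabulary, can be sketched as follows. The code runs plain Lloyd iterations seeded with the first k points, which is a simplification; the paper does not describe its initialisation. The two-dimensional "descriptors" are toy values, whereas the real ones have 30 dimensions after PCA.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centres):
    """Index of the closest cluster centre (visual word)."""
    return min(range(len(centres)), key=lambda c: squared_distance(point, centres[c]))


def k_means(points, k, iterations=20):
    """Plain Lloyd iterations; the first k points seed the centres (a simplification)."""
    centres = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centres)].append(p)
        for c, members in enumerate(groups):
            if members:                      # keep the old centre if a cluster becomes empty
                centres[c] = [sum(column) / len(members) for column in zip(*members)]
    return centres


def word_histogram(descriptors, centres):
    """Normalised occurrence frequency of each visual word in one image."""
    counts = [0] * len(centres)
    for descriptor in descriptors:
        counts[nearest(descriptor, centres)] += 1
    total = float(len(descriptors)) or 1.0
    return [count / total for count in counts]


training = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2], [4.9, 5.0]]
vocabulary = k_means(training, k=2)
histogram = word_histogram([[0.1, 0.1], [5.0, 5.0], [4.8, 5.2], [5.1, 4.7]], vocabulary)
```

In this toy case one of the four test descriptors falls on the first word and three on the second, so the image is represented by the vector [0.25, 0.75].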
2.4 Image representation and recognition
After the visual vocabulary is created, all images
are represented as bags of visual words. The features are
visual-word histograms in which each bin represents the
occurrence frequency of one visual word in an image.
Then an SVM classifier is applied to pornographic image
recognition [15]. The process includes three stages:
optimal parameter selection, model training and prediction.
For classification, the Radial Basis Function (RBF) kernel
($\gamma$ is the kernel parameter) is a reasonable choice for three
reasons: (i) this kernel is non-linear and can handle a
non-linear relation between class labels and attributes; (ii)
compared to the polynomial kernel, the RBF kernel has fewer
hyper-parameters, which reduces the complexity of model
selection; (iii) the RBF kernel has fewer numerical difficulties

$$
K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^{2}), \qquad \gamma > 0
\tag{23}
$$
Based on the above reasons, the RBF kernel is used, and a five-fold
cross-validation scheme is used for choosing the penalty parameter
C and the kernel parameter $\gamma$. After obtaining these parameters,
model training and prediction are completed by C-support
vector classification (C-SVC). Given training vectors
$x_i \in R^{d}$, $i = 1, \ldots, n$, $d = 30$, in two classes (pornographic
and non-pornographic), and a vector $y \in R^{l}$ such that
$y_i \in \{1, -1\}$, C-SVC solves the following problem

$$
\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \xi_i \\
\text{subject to} \quad & y_i (w^{T} \phi(x_i) + b) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \quad i = 1, \ldots, l
\end{aligned}
\tag{24}
$$
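Its dual is

$$
\begin{aligned}
\min_{\alpha} \quad & \frac{1}{2} \alpha^{T} Q \alpha - e^{T} \alpha \\
\text{subject to} \quad & y^{T} \alpha = 0, \\
& 0 \le \alpha_i \le C, \quad i = 1, \ldots, l
\end{aligned}
\tag{25}
$$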
where e is the vector of all ones, $C > 0$ is the upper bound, Q
is an $l \times l$ positive semi-definite matrix, $Q_{ij} \equiv y_i y_j K(x_i, x_j)$ and
$K(x_i, x_j)$ is the RBF kernel. Here the training vectors $x_i$ ($i = 1,
\ldots, n$) are mapped into a higher dimensional space by the
function $\phi$. Finally, we employ the following decision
function to predict the test data

$$
\operatorname{sgn}\left( \sum_{i=1}^{l} y_i \alpha_i K(x_i, x) + b \right)
\tag{26}
$$
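The kernel (23) and the decision function (26) translate directly into code. In the sketch below the support vectors, multipliers and bias are chosen by hand for illustration; in practice they would come from training with an SVM library such as the one cited in [15]. The function names are assumptions.

```python
import math


def rbf_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2), as in (23)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def predict(x, support_vectors, labels, alphas, bias, gamma):
    """Sign of the decision function (26); returns +1 or -1."""
    score = sum(y * a * rbf_kernel(sv, x, gamma)
                for sv, y, a in zip(support_vectors, labels, alphas)) + bias
    return 1 if score >= 0 else -1


# Hypothetical two-word histograms and multipliers, chosen by hand for illustration.
support_vectors = [[0.9, 0.1], [0.1, 0.9]]
labels = [1, -1]
alphas = [1.0, 1.0]          # satisfies the dual constraint y^T alpha = 0
first = predict([0.8, 0.2], support_vectors, labels, alphas, bias=0.0, gamma=2.0)
second = predict([0.2, 0.8], support_vectors, labels, alphas, bias=0.0, gamma=2.0)
```

A test histogram is assigned to the class of the support vectors it is closest to, with the kernel parameter controlling how quickly the influence of each support vector decays with distance.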
3 Experimental results and analysis
In order to evaluate the performance of the proposed method,
we collected 200 images from the web as the dataset. The
lowest resolution in the dataset is 240 × 105 and the highest
is 1451 × 2137. Examples of selected images
are shown in Fig. 5. Two groups of experiments were
conducted. In the first experiment, to test the feasibility
of our method and confirm its speed advantage, the proposed
method was compared with the traditional method [14]. In the
second experiment, recognition results obtained with visual
words built from images of different resolutions were compared
to verify the recognition performance. Through a large number of
experiments, we extract SIFT descriptors from 1/2 × 1/2
version images and obtain fewer than 600 visual words. The
experimental platform is a PC with a Pentium IV 3.00 GHz
CPU, 1 GB of memory, the Windows XP operating system and the
VC++ 6.0 programming environment.
Fig. 5 Examples of pornographic (first row) and non-pornographic images (second row) from image database
3.1 Comparison results of SIFT descriptors
extraction
Table 1 gives the processing time and shows that the
speed of our method is clearly improved. Table 2
compares the execution times of the algorithm on
low-resolution and high-resolution images. A comparison of
the number of decoded DCT coefficients is given in Table 3 for
the three image versions. The results indicate that constructing
low-resolution images directly from the compressed data is
efficient and rapid. As the low-resolution image constructed
from the compressed image data is smaller than the original
image, SIFT descriptors can be extracted quickly; that is, the
computational complexity for a 1/2 × 1/2 version image is
about a quarter of that for the original image. Overall, the
proposed method performs better in terms of speed.
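The SIFT extraction times reported in Tables 1 and 2 can be turned into speed-up factors with a few lines of arithmetic. The numbers below are copied from those tables; the dictionary keys and variable names are illustrative.

```python
# SIFT descriptor extraction times in milliseconds: (original image, 1/2 x 1/2 version).
sift_times = {
    "average (Table 1)": (4307, 1335),
    "533 x 800 (Table 2)": (6438, 1922),
    "1451 x 2137 (Table 2)": (24891, 7469),
}


def speedup(original_ms, low_resolution_ms):
    return original_ms / low_resolution_ms


ratios = {name: round(speedup(*pair), 2) for name, pair in sift_times.items()}
pixel_ratio = (1 / 2) * (1 / 2)     # a 1/2 x 1/2 image has a quarter of the pixels
```

The measured speed-up lies between about 3.2 and 3.4, which is of the same order as, though somewhat below, the factor of 4 suggested by the pixel count alone.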
3.2 Comparison results of visual words
In the second experiment, all images were manually divided
into two categories: pornographic and non-pornographic.
More than 180 000 keypoints (original images) and 50 000
keypoints (1/2 × 1/2 low-resolution images) were selected and
grouped into clusters using the K-means method, respectively.
In the experiments, five-fold cross-validation was used to keep
training and test data separate. The vocabulary size (k) in our
experiments was varied within the range (50, 600). For the
optimal parameter selection [15], the penalty of the SVM error
term (C) was varied on a logarithmic scale from $2^{-5}$ to
$2^{15}$, and the kernel parameter was varied between $2^{-15}$
and $2^{3}$. The C and kernel parameter values that achieved the
best recognition rate were selected. The experimental results
are given in Tables 4 and 5.
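The parameter search described above can be sketched as a grid over powers of two. The exponent step of 2 is an assumption: it matches the default grid of the guide cited in [15] and is consistent with the values reported in Tables 4 and 5, but it is not stated in the text. The scoring function is a stand-in so that the sketch runs without an SVM library; a real run would return the five-fold cross-validation accuracy instead.

```python
def power_of_two_grid(low_exponent, high_exponent, step=2):
    """Values 2**low_exponent, 2**(low_exponent + step), ..., up to 2**high_exponent."""
    return [2.0 ** e for e in range(low_exponent, high_exponent + 1, step)]


penalty_grid = power_of_two_grid(-5, 15)     # candidate values of C
kernel_grid = power_of_two_grid(-15, 3)      # candidate values of the kernel parameter


def grid_search(evaluate):
    """Return (score, C, gamma) for the best-scoring pair on the grid."""
    return max((evaluate(c, g), c, g) for c in penalty_grid for g in kernel_grid)


def toy_score(c, gamma):
    """Hypothetical score that peaks at C = 2 and gamma = 2**-13."""
    return -abs(c - 2.0) - abs(gamma - 2.0 ** -13)


best_score, best_c, best_gamma = grid_search(toy_score)
```

Under this assumption the grid holds 11 values of C and 10 values of the kernel parameter, and the kernel parameters listed in Tables 4 and 5 (3.0517578125e-05 and 0.0001220703125) are exactly $2^{-15}$ and $2^{-13}$.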
Using the parameters in Tables 4 and 5, trained models are
obtained to predict the test data. In Fig. 6, the best recognition
precision for all tested values of k is plotted for the two methods.
Owing to the lack of standard annotated databases for
pornographic image recognition, it is difficult to compare
results obtained by different methods. Nevertheless, by
drawing a coarse comparison, our method achieves high
recognition precision, which indicates it is effective and
efficient for pornographic image recognition. Note that the
results are achieved with possibly the simplest SVM classifier.
Table 1 Comparison on computational time

| Time, ms | Original SIFT descriptors | Improved SIFT descriptors |
|---|---|---|
| average time for image reconstruction | 41 | 31 |
| average time for SIFT descriptors extraction | 4307 | 1335 |
| average time for image recognition | 44 365 | 37 579 |

Table 2 Comparison on different resolution images

| Time, ms | Image resolution | Original image | 1/2 × 1/2 version image |
|---|---|---|---|
| time for image reconstruction | 533 × 800 | 47 | 32 |
| | 1451 × 2137 | 250 | 219 |
| time for SIFT descriptors extraction | 533 × 800 | 6438 | 1922 |
| | 1451 × 2137 | 24 891 | 7469 |
| time for image recognition | 533 × 800 | 43 831 | 25 721 |
| | 1451 × 2137 | 65 812 | 46 468 |

Table 3 Comparison on DCT coefficients decoded

| Number | Original image | 1/2 × 1/2 version image | 1/4 × 1/4 version image |
|---|---|---|---|
| number of DCT coefficients | 64 | 16 | 4 |

Table 4 Average recognition rate of original images achieved
for each vocabulary size on the model selection experiment

| Vocabulary size | Penalty parameter | Kernel parameter | Average recognition precision, % |
|---|---|---|---|
| 50 | 0.5 | 3.0517578125e-05 | 86.25 |
| 100 | 2.0 | 3.0517578125e-05 | 91.25 |
| 150 | 0.5 | 3.0517578125e-05 | 95.0 |
| 200 | 2.0 | 3.0517578125e-05 | 85.0 |
| 250 | 2.0 | 3.0517578125e-05 | 86.25 |
| 300 | 2.0 | 3.0517578125e-05 | 83.75 |
| 350 | 2.0 | 3.0517578125e-05 | 82.5 |
| 400 | 2.0 | 3.0517578125e-05 | 82.5 |
| 450 | 2.0 | 3.0517578125e-05 | 83.75 |
| 500 | 0.5 | 3.0517578125e-05 | 86.25 |
| 550 | 0.5 | 3.0517578125e-05 | 82.5 |
| 600 | 0.5 | 3.0517578125e-05 | 85.0 |

Table 5 Average recognition rate of 1/2 × 1/2 low-resolution
version images achieved for each vocabulary size on the model
selection experiment

| Vocabulary size | Penalty parameter | Kernel parameter | Average recognition precision, % |
|---|---|---|---|
| 50 | 0.5 | 0.0001220703125 | 95.0 |
| 100 | 2.0 | 3.0517578125e-05 | 91.25 |
| 150 | 8.0 | 3.0517578125e-05 | 91.25 |
| 200 | 2.0 | 0.0001220703125 | 93.75 |
| 250 | 0.5 | 0.0001220703125 | 91.25 |
| 300 | 0.5 | 0.0001220703125 | 95.0 |
| 350 | 0.5 | 0.0001220703125 | 92.5 |
| 400 | 0.5 | 0.0001220703125 | 95.0 |
| 450 | 0.5 | 0.0001220703125 | 95.0 |
| 500 | 0.5 | 0.0001220703125 | 91.25 |
| 550 | 0.5 | 0.0001220703125 | 90.0 |
| 600 | 0.5 | 0.0001220703125 | 90.0 |

4 Conclusions and future work
Different from the conventional method of creating visual words
in the pixel domain, this paper gives an effective pornographic
image recognition method based on visual words in the
compressed domain. Firstly, low-resolution images are
constructed from compressed data. Secondly, SIFT
descriptors are extracted from the low-resolution images and
visual words are obtained through the k-means clustering
algorithm. Finally, an SVM classifier is applied in our
pornographic image recognition system. The experimental
results show that the proposed method can improve the
speed of creating a visual vocabulary by reducing the
computational complexity, and achieve a higher recognition rate.
Future work will proceed along several directions:
1. Expanding the visual vocabulary. We can collect more images
from the web, build larger databases and increase the number
of visual words in the visual vocabulary.
2. Adding a tag filtering component to the visual vocabulary
creation process. We will address the problems of semantically
synonymous and polysemous visual words.
3. Building a pornographic image recognition system based
on multiple features. Besides the visual words introduced in
this paper, more features can be added, such as skin colour
regions, the results of image retrieval, face features and global
texture and colour features, which are extracted from the
compressed image.
4. We will define more than two kinds of image classes, such
as normal images, soft-porn images and pornographic images.
By adding more sophisticated classifiers and adopting a
pornographic value, the recognition process is expected to
achieve better results.
5 Acknowledgments
The work in this paper is supported by the National Science
Foundation of China (No.61003289, No.61100212), the
Natural Science Foundation of Beijing (No.4102008),
the Excellent Science Program for the Returned Overseas
Chinese Scholars of the Ministry of Human Resources and
Social Security of China, and the Science Research Foundation
for the Returned Overseas Chinese Scholars of MOE.
6 References
1 Chantrapornchai, C., Promsombat, C.: Experimental studies on
pornographic web filtering techniques. Fifth Int. Conf. Electrical
Engineering/Electronics, Computer, Telecommunications and
Information Technology, 2008. ECTI-CON 2008, vol. 1, pp. 109–112
2 Abdel-Hakim, E., Farag, A.A.: CSIFT: A SIFT descriptors with color
invariant characteristics. IEEE Conf. on Computer Vision and Pattern
Recognition, New York, USA, 2006, pp. 1978–1983
3 Bosch, A., Zisserman, A., Muñoz, X.: Scene classification using a hybrid
generative/discriminative approach, IEEE Trans. Pattern Anal. Mach.
Intell., 2008, 30, (4), pp. 712–727
4 van de Sande, K., Gevers, T., Snoek, C.: Evaluation of color descriptors
for object and scene recognition. IEEE Computer Society Conf. on
Computer Vision and Pattern Recognition (CVPR), 2008
5 Wang, M., Yang, K.Y., Hua, X.S., Zhang, H.J.: Visual tag dictionary:
interpreting tags with visual words. ACM, 2009, pp. 1–8
6 Wang, Y.S., Ning, L.Y., Gao, W.: Detecting pornographic images
with visual words, Trans. Beijing Inst. Technol., 2008, 28, (5),
pp. 410–413
7 Wang, Y.S., Huang, Q.M., Gao, W.: Pornographic image detection
based on multilevel representation, Int. J. Pattern Recognit., 2009,
23, (8), pp. 1633–1655
8 Deselaers, T., Pimenidis, L., Ney, H.: Bag-of-visual-words models for
adult image classification and filtering. Proc. Int. Conf. on Pattern
Recognition, Tampa, Florida, USA, 2008, pp. 1–4
9 Lienhart, R., Hauke, R.: Filtering adult image content with topic
models. Proc. IEEE Int. Conf. on Multimedia and Expo, New York
City, USA, 2009, pp. 1472–1475
10 Shen, L.S., Zhang, J., Li, X.G.: The overview of image processing
technology in compressed domain, in Yang, L. (Ed.): Image retrieval
and compressed domain processing (Post & Telecom Press, 2008),
Ch. 10, pp. 287–307
11 Zhao, S.W., Zhuo, L., Xiao, Z., Shen, L.S.: A data-mining based skin
detection method in JPEG compressed domain. Sixth Int. Conf. on
Fuzzy Systems and Knowledge Discovery, 2009, pp. 297–301
12 Zhao, S.W., Zhuo, L., Wang, S.Y., Xiao, Z., Li, X.G., Shen, L.S.:
Pornographic image recognition in compressed domain based on
multi-cost sensitive decision tree. Third IEEE Int. Conf. on Computer
Science and Information Technology, 2009
13 Chang, S.F.: Compressed-domain techniques for image/video indexing
and manipulation. IEEE Int. Conf. on Image Processing (ICIP'95),
1995, pp. 314–317
14 Lowe, D.G.: Distinctive image features from scale-invariant keypoints,
Int. J. Comput. Vis., 2004, 60, (2), pp. 91–110
15 Chang, C.-C., Lin, C.-J.: LIBSVM: a practical guide to support vector
classification, http://www.csie.ntu.edu.tw, accessed 2003
Fig. 6 Recognition precision of two images at different resolution levels achieved for different vocabulary sizes