I. INTRODUCTION
Vision is an important way for robots to perceive working targets. It has become one of the key technologies of intelligent robots and has made great progress in recent years. Robot vision for target recognition includes three steps: target image acquisition, feature extraction and target recognition. Target image acquisition is accomplished by visual sensors. Target feature extraction is a key step; there are three kinds of extraction methods for different target images: shape-based, color-based and texture-based. Target recognition needs a good model as a classifier, and most applications have proved that the neural network model is a very effective target classifier [1-4]. However, color-based, shape-based and texture-based models each have their own flaws when used for target recognition alone: when the illumination, background and orientation of the target change greatly, the recognition accuracy may decrease. For example, a color histogram discards the spatial information of the target, while a shape descriptor discards color and texture features. In this paper, single neural networks for color, shape and texture are combined to form a multi-neural-network-based robot vision model. The individual networks may give different, even conflicting, classification results for the identical target; D-S evidence theory is used to resolve such conflicts, so that more accurate target recognition can be obtained for robots in complex environments.

Fig.1. Target color reference template

After the color of the target image is converted into the colors of the standard reference template, the normalized color histogram of the target image is obtained as the expression of the target color features.

B. Neural network structure of color-based recognition

For each target image, a 16-color histogram is generated after color clustering. In the workspace of the robot, the proportion of background appearing in the target image changes greatly with the orientation of image acquisition, so the histogram of the background should be removed in image preprocessing; in fact, only 15 colors remain as the target feature. The classifier for color recognition uses a back-propagation neural network (BPNN) with a three-layer structure, i.e., input layer, hidden layer and output layer. The normalized color histogram is used as the input of the BPNN, forming a 15-dimensional input vector x_i, and the output dimension of the BPNN depends on the number of target categories to be classified. Here, five kinds of medicine packages are taken as examples to discuss the problem of robot target recognition, so the output of the neural network is the five-dimensional vector y_i, and the number of hidden-layer nodes is set to 20, as shown in Fig.2.
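The color feature pipeline above can be sketched as follows. This is a minimal illustration assuming a nearest-neighbor mapping of pixels to the 16 reference-template colors; the paper's own color-conversion step may differ, and the function and parameter names here are hypothetical:

```python
import numpy as np

def color_histogram_feature(image, palette, background_index=0):
    """Build the 15-dimensional color feature of a target image:
    map each pixel to its nearest reference-template color, histogram
    the labels, drop the background bin, and normalize."""
    pixels = image.reshape(-1, 3).astype(float)
    # Distance from every pixel to each of the 16 reference colors.
    dists = np.linalg.norm(pixels[:, None, :] - palette[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(palette)).astype(float)
    # Remove the background bin, leaving 15 color bins.
    hist = np.delete(hist, background_index)
    return hist / hist.sum()   # normalized histogram, the BPNN input
```

The resulting 15-dimensional vector sums to 1 and serves directly as the input vector x_i of the color BPNN.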
Fig.4. Color histogram of target image and train curve: (e) color histogram of target class 5; (f) training curve of the BPNN.
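As a sketch, the 15-20-5 color BPNN described above amounts to the following forward pass. The sigmoid activation and the random weights are illustrative assumptions only; in the paper the weights are learned by back propagation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative (untrained) weights for the 15-20-5 structure.
W1, b1 = rng.normal(size=(20, 15)), np.zeros(20)   # input -> hidden
W2, b2 = rng.normal(size=(5, 20)), np.zeros(5)     # hidden -> output

def bpnn_forward(x):
    """Forward pass of the color BPNN; returns the 5 output
    activations y_i, one per medicine-package class."""
    h = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ h + b2)

x = rng.random(15)
x /= x.sum()               # a normalized 15-bin color histogram
y = bpnn_forward(x)
label = int(np.argmax(y))  # predicted class index
```

The index of the largest output node is taken as the classification label, matching the decision rule used in the experiments below.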
A. The structure of CNN for robot texture recognition

A CNN is usually built by alternately connecting convolution layers C and pooling layers S, with the output finally formed by a fully connected BPNN. The convolution layer C is used to extract target features: each convolution layer is the result of convolving the previous layer with a convolution kernel, and except for the first layer, which is the original image, every layer consists of feature maps. Generally, each convolution layer has multiple feature maps. Nodes (neurons) in a convolution layer are associated only with the nodes covered by the convolution kernel, which is called local connection and greatly reduces the number of training parameters of the CNN. In order to further reduce the number of nodes in the feature maps while losing few features, a pooling layer S is introduced after each convolution layer to subsample the feature maps. Industrial robots require a highly real-time visual response, so target recognition should be fast enough. In addition, considering that working targets change from task to task, and that sampling and training of working targets should be completed in a relatively short time on a single CPU, the CNN cannot be too deep. In this paper, three pairs of C-S connections (C1-S1-C2-S2-C3-S3) form the locally connected part of the CNN for feature extraction of target texture. The last locally connected layer, S3, is stacked into a feature vector and input into a 3-layer fully connected (FC) BPNN. The fully connected BPNN consists of an input layer (FC1), a hidden layer (FC2) and an output layer (O), serving as the classifier of target recognition for robot vision. The whole 10-layer CNN structure used in this paper is shown in Fig.5.

B. Training of CNN for texture recognition

The training data set of the whole CNN consists of 1200 image samples and their class labels. The gradient descent method is used to train the CNN: the 1200 samples are divided into 24 batches of 50 samples each, and the network parameters are modified once per training iteration until the performance requirements are met. The performance measure is the sum of squares of the network output error E; here, we take the performance goal E <= 0.001. The training curve of the CNN is shown in Fig.6.

Fig.6. The training curve of CNN

IV. BP NEURAL NETWORK MODEL FOR SHAPE-BASED TARGET RECOGNITION

Shape is another important feature of a target and is often used in target recognition for robot vision [11]. Shape can be expressed by the external boundary of the target or by the region surrounded by that boundary. The region and boundary are usually converted into a binary image, as shown in Fig.7.

In this paper, a three-layer BP neural network is used as the classifier of target shape. Its structure is similar to that of the color-based BPNN, but the numbers of nodes in the input and hidden layers are different. In order to keep the original shape of the target unchanged, the target is normalized to 30 pixels in height with its width scaled proportionally, and the normalized target image is stacked into a feature vector as the input of the BPNN. The dimension of this vector is variable, while the BPNN requires a fixed input dimension, so we extend the input vector to 2700 dimensions and keep it fixed; for feature vectors whose dimension is less than 2700, the remaining portion is filled with 0, as shown in Fig.8.
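The fixed-dimension shape vector described above can be sketched as follows. This is a minimal illustration: the nearest-neighbor resampling and the helper name are our assumptions, since the paper does not specify its resampling method:

```python
import numpy as np

def shape_feature(binary_image, target_height=30, feature_dim=2700):
    """Resize a binary target image to a height of 30 pixels (width
    scaled to keep the aspect ratio), flatten it row by row, and
    zero-pad the result to a fixed 2700-dimensional vector."""
    h, w = binary_image.shape
    new_w = max(1, round(w * target_height / h))
    # Nearest-neighbor resample to target_height x new_w.
    rows = (np.arange(target_height) * h / target_height).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    resized = binary_image[np.ix_(rows, cols)]
    vec = resized.flatten().astype(float)
    if vec.size > feature_dim:
        raise ValueError("target too wide after height normalization")
    # Fill the remaining portion with 0, as described in the text.
    return np.pad(vec, (0, feature_dim - vec.size))
```

The zero padding leaves the leading entries of the vector identical regardless of target width, so the BPNN always sees a 2700-dimensional input.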
V. FUSION OF MULTI-NEURAL NETWORK FOR TARGET RECOGNITION

From a biological point of view, human visual perception of targets depends on shape, color and texture features. Based on this idea, the visual perception of robots can also be modeled by combining shape, color and texture. As mentioned above, a color-based BPNN, a shape-based BPNN and a texture-based CNN are designed respectively; after training with the same labeled samples, they are applied to recognize the working targets of the robot workspace, such as the medicine packages here. Of course, each neural network can be used independently for target recognition. However, the accuracy and robustness of a single neural network decrease when the illumination of the robot environment changes and the orientation of the camera differs. To solve this problem, the three neural networks of color, shape and texture are integrated for target recognition, forming a combined vision model of multi-neural networks.

Fig.9. Combination of color, shape and texture neural networks

For the combined neural networks, the first two are fully connected BP neural networks, while the third is a hybrid neural network, the CNN, with both local and full connections, as shown in Fig.9.

The three models can be used to identify the identical target at the same time; however, when the recognition results are inconsistent or conflicting, how does the robot give the right judgment? A better way to resolve evidence conflict is D-S evidence theory, which is used here for the fusion of the three neural networks.

D-S evidence theory, proposed by Dempster and Shafer, is an effective method to deal with multiple sources of evidence, especially when the evidence conflicts. This paper uses it to fuse the outputs of the combined neural networks. Firstly, a finite and complete universe set U is defined, in which the elements are mutually exclusive. On the power set of U, a mapping m: 2^U -> [0,1] is defined that satisfies \sum_{A \subseteq U} m(A) = 1 and m(\emptyset) = 0 (\emptyset is the empty set); m is called a basic belief assignment. Let m_1, m_2 and m_3 be the basic belief assignments (also called mass functions) corresponding to the evidence of the three neural networks, with propositions A, B and C respectively. For proposition A based on the color BPNN, the corresponding focal elements are {A_1, A_2, A_3, A_4, A_5} = {target1, target2, target3, target4, target5}; for proposition B based on the shape BPNN, the focal elements are {B_1, B_2, B_3, B_4, B_5} = {target1, target2, target3, target4, target5}; for proposition C based on the texture CNN, the focal elements are {C_1, C_2, C_3, C_4, C_5} = {target1, target2, target3, target4, target5}. Then the D-S synthesis rule can be expressed as follows:

m(D) = \begin{cases} \dfrac{\sum_{A_i \cap B_j \cap C_k = D} m_1(A_i)\, m_2(B_j)\, m_3(C_k)}{1 - K}, & D \subseteq U \text{ and } D \neq \emptyset \\ 0, & D = \emptyset \end{cases}    (1)

K = \sum_{A_i \cap B_j \cap C_k = \emptyset} m_1(A_i)\, m_2(B_j)\, m_3(C_k) < 1    (2)

m_1(A_i) = \dfrac{t_a(i)}{\sum_{p=1}^{5} t_a(p)}, \quad i = 1, 2, 3, 4, 5    (3)

m_2(B_j) = \dfrac{t_b(j)}{\sum_{p=1}^{5} t_b(p)}, \quad j = 1, 2, 3, 4, 5    (4)
m_3(C_k) = \dfrac{t_c(k)}{\sum_{p=1}^{5} t_c(p)}, \quad k = 1, 2, 3, 4, 5    (5)

It can be seen that the outputs of the above three neural networks, after normalization, obviously meet the requirements of a mass function:

\sum_{A \subseteq U} m_1(A) = 1, \; m_1(\emptyset) = 0; \quad \sum_{B \subseteq U} m_2(B) = 1, \; m_2(\emptyset) = 0; \quad \sum_{C \subseteq U} m_3(C) = 1, \; m_3(\emptyset) = 0    (6)

VI. EXPERIMENTAL RESULTS AND DISCUSSIONS

Taking the robot medicine package sorting system as an example, the proposed multi-neural-network fusion model is validated and discussed. The experimental device consists of an EPSON Scara robot, a vision sensor, a pneumatic sucker and a master computer; there are five kinds of medicine package to be classified on a workbench, as shown in Fig.10.

TABLE I. TEST RESULTS OF THREE NEURAL NETWORKS

Target         1#     2#     3#     4#     5#
Color BPNN   0.896  0.266  0.231  0.139  0.256
ta(i)        0.301  0.898  0.187  0.097  0.097
             0.097  0.092  0.685  0.649  0.688
             0.115  0.146  0.701  0.587  0.316
             0.413  0.203  0.441  0.331  0.655
T/F            T      T      F      T      F
Shape BPNN   0.772  0.477  0.211  0.118  0.592
tb(j)        0.652  0.704  0.235  0.136  0.255
             0.220  0.199  0.793  0.501  0.106
             0.201  0.205  0.541  0.562  0.118
             0.336  0.376  0.101  0.126  0.580
T/F            T      T      T      T      F
Texture CNN  0.907  0.396  0.219  0.205  0.321
tc(k)        0.761  0.919  0.088  0.112  0.101
             0.224  0.203  0.745  0.432  0.685
             0.087  0.234  0.522  0.786  0.211
             0.366  0.195  0.492  0.221  0.671
T/F            T      T      T      T      F
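Because every focal element in Section V is a singleton target class, the normalization (3)-(5) and the combination rule (1)-(2) reduce to an elementwise product over matching indices. The following sketch (function names are ours) reproduces the sample 5# fusion computed below, starting from the raw outputs in Table I:

```python
import numpy as np

def masses_from_outputs(t):
    """Eqs (3)-(5): normalize a network's 5 output values into a mass function."""
    t = np.asarray(t, dtype=float)
    return t / t.sum()

def ds_combine_singletons(m1, m2, m3):
    """D-S combination (eqs (1)-(2)) for singleton focal elements:
    only matching indices i = j = k intersect non-emptily, so the
    un-normalized fused mass is the elementwise product."""
    joint = m1 * m2 * m3
    K = 1.0 - joint.sum()          # amount of conflict, eq (2)
    return joint / (1.0 - K), K    # fused masses m(D_i), eq (1)

# Sample 5# raw outputs from Table I (color ta, shape tb, texture tc).
ta = [0.256, 0.097, 0.688, 0.316, 0.655]
tb = [0.592, 0.255, 0.106, 0.118, 0.580]
tc = [0.321, 0.101, 0.685, 0.211, 0.671]

m, K = ds_combine_singletons(masses_from_outputs(ta),
                             masses_from_outputs(tb),
                             masses_from_outputs(tc))
# K is about 0.9449 and the argmax of m is index 4, i.e. target5.
```

Even though conflict K is high, the normalization in eq (1) concentrates the fused belief on the class the three networks jointly support most.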
Fig.10. Robot vision system for medicine package sorting

When the vision sensor carried by the robot sweeps through the workspace, the target image, a medicine package, is captured and input to the master computer. The input image is automatically converted into the outputs of the trained color, shape and texture neural networks respectively, and the mass functions m1, m2 and m3 can be calculated according to (3), (4) and (5).

Table 1 gives the test results of the three neural networks, in which each column lists the output values of the 5 nodes of the three kinds of neural networks for one test sample. Five test samples are given, distinguished by the symbol #, and each sample represents a medicine packaging category. For each neural network, the subscript corresponding to the maximum value of the five nodes serves as the classification label. The symbol "T" indicates a correct classification, and the symbol "F" indicates a classification error.

TABLE II. MASSES OF THREE NEURAL NETWORKS (5 SAMPLES)

Target          1#      2#      3#      4#      5#
Color BPNN    0.4918  0.1657  0.1029  0.0771  0.1272
m1(Ai)        0.1652  0.5595  0.0833  0.0538  0.0482
              0.0532  0.0573  0.3051  0.3600  0.3419
              0.0631  0.0910  0.3122  0.3256  0.1571
              0.2267  0.1265  0.1964  0.1836  0.3255
T/F              T       T       F       T       F
Shape BPNN    0.3540  0.2432  0.1122  0.0818  0.3586
m2(Bj)        0.2989  0.3590  0.1249  0.0942  0.1545
              0.1009  0.1015  0.4216  0.3472  0.0642
              0.0922  0.1045  0.2876  0.3895  0.0715
              0.1541  0.1917  0.0537  0.0873  0.3513
T/F              T       T       T       T       F
Texture CNN   0.3868  0.2034  0.1060  0.1167  0.1614
m3(Ck)        0.3245  0.4720  0.0426  0.0638  0.0508
              0.0955  0.1043  0.3606  0.2460  0.3444
              0.0371  0.1202  0.2527  0.4476  0.1061
              0.1561  0.1002  0.2381  0.1259  0.3374
T/F              T       T       T       T       F
Table 2 gives the mass functions of each neural network corresponding to Table 1 (5 test samples). From the fifth column of Table 2, we can see that none of the color, shape and texture neural networks correctly identifies the working target; this is the worst result. We take the fifth sample as an example to discuss the problem of evidence synthesis using D-S evidence theory. Let the frame of evidence combination be D = {D1, D2, D3, D4, D5} = {target1, target2, target3, target4, target5}; the evidence m1, m2 and m3 given by the three neural networks of color, shape and texture is shown in Table 3.

TABLE III. MASSES OF THREE NETWORKS FOR SAMPLE 5#

m1           m2           m3
A1  0.1272   B1  0.3586   C1  0.1614
A2  0.0482   B2  0.1545   C2  0.0508
A3  0.3419   B3  0.0642   C3  0.3444
A4  0.1571   B4  0.0715   C4  0.1061
A5  0.3255   B5  0.3513   C5  0.3374

From Table 3, we can calculate the amount of conflict among the three pieces of evidence as follows:

K = \sum_{A_i \cap B_j \cap C_k = \emptyset} m_1(A_i)\, m_2(B_j)\, m_3(C_k) = 1 - \sum_{A_i \cap B_j \cap C_k \neq \emptyset} m_1(A_i)\, m_2(B_j)\, m_3(C_k) = 0.9449

According to the combination rule of D-S evidence theory, the mass of D1 in the set D can be expressed as

m(D_1) = \dfrac{\sum_{A_i \cap B_j \cap C_k = D_1} m_1(A_i)\, m_2(B_j)\, m_3(C_k)}{1 - K} = \dfrac{0.1272 \times 0.3586 \times 0.1614}{1 - 0.9449} = 0.1337

In the same way, we can calculate m(D2), m(D3), m(D4) and m(D5):

m(D_2) = \dfrac{0.0482 \times 0.1545 \times 0.0508}{1 - 0.9449} = 0.0069

m(D_3) = \dfrac{0.3419 \times 0.0642 \times 0.3444}{1 - 0.9449} = 0.1373

m(D_4) = \dfrac{0.1571 \times 0.0715 \times 0.1061}{1 - 0.9449} = 0.0216

m(D_5) = \dfrac{0.3255 \times 0.3513 \times 0.3374}{1 - 0.9449} = 0.7005

The calculation of the masses for the remaining four samples is similar to that of the fifth sample and is listed in Table 4.

TABLE IV. MASSES OF FINAL OUTPUT AFTER FUSION (5 SAMPLES)

          D-S evidence theory fusion
Target     1#      2#      3#      4#      5#
D1       0.7520  0.0765  0.0167  0.0081  0.1337
D2       0.1790  0.8845  0.0061  0.0036  0.0069
D3       0.0057  0.0057  0.6332  0.3394  0.1373
D4       0.0024  0.0107  0.3098  0.6266  0.0216
D5       0.0609  0.0227  0.0343  0.0223  0.7005
T/F         T       T       T       T       T

From the fifth column of Table 1, we can see that neither the color-based, the shape-based nor the texture-based neural network gives a correct recognition result for sample 5, but through the fusion of D-S evidence theory the correct judgment can be obtained. Table 4 gives the result of evidence fusion for all the samples in Table 1; all entries in the last (T/F) row are "T", which indicates that the recognition results of all samples are correct. This shows that, in robot vision applications, multi-neural-network fusion can improve the reliability of working target recognition.

VII. CONCLUSIONS

In this paper, color-based, shape-based and texture-based neural networks are established for working target recognition in robot vision applications. The color-based neural network uses the color histogram of the target image as its input feature vector, the shape-based neural network uses the binary image of the target region as its input, and texture recognition uses a 10-layer convolutional neural network. We discuss the problems of working target recognition using each of the three networks independently, then combine the three networks and use D-S evidence theory to fuse their outputs to obtain a better judgment. The classification test on five kinds of medicine packages proves that the proposed model can improve the reliability of target recognition in robot vision, and it can be used in automatic control systems such as feeding, assembly, sorting and tracking for industrial robots.
ACKNOWLEDGMENT

This work is supported by the Guangdong National Natural Science Foundation under Grant No.2016A030313003 and the Jiangmen Science and Technology Bureau under Grant No.20140060117111.

REFERENCES

[1] D. Ramachandram and M. Rajeswari, “Neural network-based robot visual positioning for intelligent assembly,” Journal of Intelligent Manufacturing, vol. 15, no. 2, pp. 219-231, 2004.
[2] I. Lenz, H. Lee and A. Saxena, “Deep learning for detecting robotic grasps,” Int. J. Robotics Res., vol. 34, pp. 705-724, 2015.
[3] M. Z. Alom, M. Hasan, C. Yakopcic et al., “Improved inception-residual convolutional neural network for object recognition,” Neural Computing and Applications, pp. 1-15, 2018.
[4] J. Redmon and A. Angelova, “Real-time robotic grasp detection using convolutional neural networks,” in Proc. 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 1316-1322, 2015.
[5] A. Anuse and V. Vyas, “A novel training algorithm for convolutional neural network,” Complex & Intelligent Systems, vol. 2, issue 3, pp. 221-234, 2016.
[6] S. Levine, P. Pastor, A. Krizhevsky et al., “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” Int. J. Robotics Res., vol. 37, pp. 421-436, 2018.
[7] J. Tompson, R. Goroshin, A. Jain et al., “Efficient object localization using convolutional networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, pp. 648-656, 2015.
[8] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. International Conference on Learning Representations (ICLR), San Diego, CA, USA, pp. 1-14, 2015.
[9] Y. LeCun, Y. Bengio and G. E. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[10] A. Gangopadhyay, S. M. Tripathi, I. Jindal et al., “Dynamic scene classification using convolutional neural networks,” arXiv preprint arXiv:1502.05243, 2015.
[11] M. Marszalek and C. Schmid, “Accurate object recognition with shape masks,” International Journal of Computer Vision, vol. 97, issue 2, pp. 191-209, 2012.
[12] N. R. Mani, D. M. Potukuchi and Ch. Satyanarayana, “A novel approach for shape-based object recognition with curvelet transform,” International Journal of Multimedia Information Retrieval, vol. 5, issue 4, pp. 219-228, 2016.