
Multi Person Pose Estimation with Attention

Jaekyu Sim, The School of Robotics, Kwangwoon University, Seoul, South Korea (worb1605@kw.ac.kr)
Kwanghyun Park, The School of Robotics, Kwangwoon University, Seoul, South Korea (akaii@kw.ac.kr)

Abstract—Existing pose estimation networks extract confidence maps and Part Affinity Fields (PAFs) from all pixels in an image. However, we found that the recognition rate can be decreased by the background. This limitation arises from the process of extracting feature maps from the image. To overcome it, we place an attention block in front of the backbone network. To show that our proposed network reduces the error and increases the precision, we compare PCKh and mAP with the existing pose estimation network.

Index Terms—attention, pose, estimation

I. INTRODUCTION

The Open Pose¹ [1] extracts confidence maps and part affinity fields (PAFs) from input data for pose estimation. Based on the confidence maps and PAFs, the locations of all body joints and limbs are extracted, respectively, for every person in the image. Then the pose of each person is estimated. The Open Pose achieved the top class in the COCO 2016 keypoint challenge and became the state-of-the-art network in the field of multi-person pose estimation. However, we have found that the common failures are mostly caused by the background of the image. To reduce the detection error due to the background, in this paper we adopt visual attention [2], which recognizes images in the same way as human visual perception. We place the attention block at the very beginning of our network to produce attention maps. Fig. 1 shows the structure of our network, which consists of an attention block, a VGG-19 network [3], and stages 1 to N divided into two branches that extract confidence maps and PAFs. In our network, the number of stages is 6 (N = 6). Fig. 2 shows the inputs and the output of the attention block. The inputs are RGB images, and the outputs indicate the probability that a body exists at each image pixel. Our network carries out a depthwise convolution of the input image and the attention output; this depthwise convolution can also be described as a pixelwise multiplication. It emphasizes the pixels containing a human body and darkens the background.

The output of the attention block is fed into the VGG-19 network, which extracts feature maps. The feature maps flow to Branch1 and Branch2 as shown in Fig. 1 and pass through stages 1 to N to extract confidence maps and PAFs. The joints in the confidence maps are connected by voting based on the PAFs, which express the directional vectors of the body limbs. The voting is performed by a greedy algorithm [4] and the Hungarian algorithm [5].

Fig. 1: The architecture of our proposed network. The gray block is the attention block, the green one is the VGG-19 network, and the blue and orange ones are Branch1 and Branch2.

¹ The network of "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", called Open Pose.

II. BACKGROUNDS

A. Pose Estimation

Pose estimation for a single person extracts only one instance of each body part from an image, and thus it is not necessary to know who owns each joint. Since the pairs of joints to be connected are determined in advance, we can obtain the complete pose estimation by extracting only the confidence map [6]. However, pose estimation for multiple persons must be able to determine who owns each joint. For example, if the pose estimation is performed using the MPII Human Pose Dataset [7], 16 joints are detected by using a 3D matrix with 16 channels. When an image with three people is entered, as shown in Fig. 3, there will be three points per channel in the PAFs, and we need to know who owns each joint to connect them. A greedy algorithm [4] and the Hungarian algorithm [5] are used to find the owner of each joint. Fig. 3 shows an input image (a) and the output (b) of our network. The blue box indicates the confidence maps of the right ankles and knees, and the orange box indicates the PAFs of the right shins. In this case, we should distinguish the owners of the right ankles and knees and connect each pair.

$$M_{\text{vector}}(x, y) = \begin{cases} 1 & \text{if } \text{Vector map}(x, y) \text{ is on } \text{PAFs}(x, y) \\ 0 & \text{if } \text{Vector map}(x, y) \text{ is not on } \text{PAFs}(x, y) \end{cases} \quad (1)$$



To know the owners, we make vector fields from a right ankle to a right knee considering the PAFs in the blue box. If a vector belongs to the PAFs, we assign 1 to the vector field, and 0 otherwise, as shown in Eq. (1). By obtaining vector fields marked with their owners, we can perform pose detection for multiple persons.

Fig. 3: Pose estimation example. (a) is an input image; (b) is the output of our network. The images in the blue box are the confidence maps of the right ankle and right knee. The image in the orange box is the PAFs of the right shin.

Fig. 2: Examples of the attention block's output. The top row shows input images, the middle row shows the attention block's output, and the bottom row shows the depthwise convolution of the input image and the output of the attention block.
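To make the ownership test of Eq. (1) concrete, the following is a minimal NumPy sketch: it samples points along a candidate ankle-to-knee connection and marks the candidate as belonging to one person only if the PAF is present (and roughly aligned) at those points. The sampling count, alignment threshold, and function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def on_paf(ankle, knee, paf, n_samples=10, align_thresh=0.5):
    """Eq. (1)-style test: returns 1 if the candidate vector field from `ankle`
    to `knee` lies on the PAF, 0 otherwise. `paf` is a [H, W, 2] field of unit
    limb vectors (zero on the background); `ankle`/`knee` are (x, y) pixel
    coordinates. Thresholds and sampling density are assumptions."""
    ankle = np.asarray(ankle, dtype=float)
    knee = np.asarray(knee, dtype=float)
    limb = knee - ankle
    norm = np.linalg.norm(limb)
    if norm == 0:
        return 0
    limb = limb / norm                          # unit vector of the candidate limb
    for t in np.linspace(0.0, 1.0, n_samples):
        x, y = np.round(ankle + t * (knee - ankle)).astype(int)
        v = paf[y, x]                           # PAF vector at the sampled pixel
        if np.dot(v, limb) < align_thresh:      # off the PAF or badly aligned
            return 0
    return 1
```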
B. Visual Attention

By performing visual attention, our proposed network constructs an attention map to perceive an image in the same way as human visual perception. When humans look at a picture or scene, they do not attend to every object in the image but focus on the main subject, as if the background were blurred by a Gaussian filter. This focus on the subject helps us remember the scene or subject. The attention block is trained with 2D gray-scale field images. When the subject of the image corresponds to the pre-selected domain, the network highlights the pixels of the subject. By using this attention method, we obtain a 2D field that is highlighted on the subject and darkened on the background, and thus reduce the distortion effect due to the background. To build the attention block, we use inception-v3 structures [8], [12]. However, unlike the existing method using the MIT saliency benchmark [9], we require a network with more features to find the attention map of the human body. We added three convolution layers to the inception-v3 structure to extract more features, and as a result we obtained the attention maps shown in Fig. 2. Then, our network performs a depthwise convolution with the input image and the output of the attention block to obtain the results in Fig. 2.

TABLE I: The attention block's layers. Our network uses the first seven layers of the inception-v3 network, and we added three convolution layers to extract human body features [2].

Type   patch size / stride   input size
conv   3x3 / 2               356x356x3
conv   3x3 / 1               175x175x32
conv   3x3 / 1               175x175x64
conv   3x3 / 1               175x175x64
conv   3x3 / 1               175x175x64
pool   3x3 / 2               87x87x64
conv   3x3 / 1               87x87x32
conv   3x3 / 1               87x87x32
conv   3x3 / 1               87x87x32
conv   3x3 / 2               44x44x1
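As a reference for Table I, here is a minimal PyTorch sketch of the layer stack. The channel counts and strides follow the table, while the padding, activation functions, sigmoid output, and PyTorch channel ordering ([batch, 3, H, W] rather than the paper's [batch, width, height, 3]) are our assumptions; the paper does not specify them.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Sketch of the attention block in Table I: inception-v3-style stem layers
    followed by extra 3x3 convolutions, ending in a single-channel attention
    map. Which exact layers are the three added ones is not spelled out in the
    paper; this stack simply mirrors the table rows."""

    def __init__(self):
        super().__init__()
        def conv(cin, cout, stride):
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, stride=stride, padding=1),
                nn.ReLU(inplace=True),
            )
        self.layers = nn.Sequential(
            conv(3, 32, 2),     # conv 3x3 / 2
            conv(32, 64, 1),    # conv 3x3 / 1
            conv(64, 64, 1),
            conv(64, 64, 1),
            conv(64, 64, 1),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),   # pool 3x3 / 2
            conv(64, 32, 1),
            conv(32, 32, 1),
            conv(32, 32, 1),
            nn.Conv2d(32, 1, kernel_size=3, stride=2, padding=1),  # 1-channel map
        )

    def forward(self, x):                      # x: [batch, 3, H, W]
        return torch.sigmoid(self.layers(x))   # attention map in [0, 1] (assumed)
```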
III. APPROACH

A. Network Architecture

Attention Block: The attention block receives data of size [batch, width, height, 3] and generates an attention map of size [batch, width', height', 1]. While the existing method uses an inception-v3 network for visual attention, our network adds more layers to extract human body limbs, as shown in Fig. 2, since inception-v3 lacks the capacity to extract them. The model of the attention block is given in Table I.

VGG-19 Network: We use VGG-19 as the backbone network to extract confidence maps and PAFs, like the Open Pose. Since the VGG-19 network was designed for image classification, we modified it to extract feature maps by removing its fully connected layers.

Branch1 & Branch2: The feature map is divided into two branches and passes through N stages in each branch. In Branch1, our network compares the output with the label confidence map S^t in Eq. (2). The output of Branch2 is compared with the PAFs L^t in Eq. (3).

$$S^{t} = \rho^{t}(F, S^{t-1}, L^{t-1}), \quad \forall t \ge 2, \quad (2)$$

$$L^{t} = \phi^{t}(F, S^{t-1}, L^{t-1}), \quad \forall t \ge 2, \quad (3)$$

Here, t is the stage index. In our network, the number of stages is 6 (N = 6) to allow comparison with the previous network.
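To illustrate Eqs. (2) and (3), the sketch below shows one refinement stage that takes the backbone features F together with the previous stage's confidence maps and PAFs and predicts the next ones, plus a loop over the remaining stages. The layer widths, the number of PAF channels, and the exact handling of the first stage are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One refinement stage t >= 2: rho^t for Branch1 (confidence maps S^t) and
    phi^t for Branch2 (PAFs L^t), both fed with [F, S^{t-1}, L^{t-1}].
    Widths and the PAF channel count are illustrative placeholders."""

    def __init__(self, feat_ch=128, n_joints=16, paf_ch=28):
        super().__init__()
        in_ch = feat_ch + n_joints + paf_ch
        self.rho = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, n_joints, 1))
        self.phi = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, paf_ch, 1))

    def forward(self, F, S_prev, L_prev):
        x = torch.cat([F, S_prev, L_prev], dim=1)
        return self.rho(x), self.phi(x)          # S^t, L^t

def refine(stages, F, S1, L1):
    """Run the later stages (Eqs. 2-3), keeping every intermediate output so
    that each stage can be supervised as in Eqs. (5)-(6)."""
    outputs = [(S1, L1)]
    S, L = S1, L1
    for stage in stages:                         # e.g. 5 Stage modules for N = 6
        S, L = stage(F, S, L)
        outputs.append((S, L))
    return outputs
```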

B. Depthwise Convolution

As shown in Fig. 2, the part of the image where the person appears is indicated in gray scale. To get rid of the background, a depthwise convolution [11] is carried out, which calculates the convolution in each spatial direction for each channel of the feature map, as shown in Fig. 4 and Eq. (4). The size of the input image and the size of the attention block's output can differ. Thus, after passing through the attention block, we resize the attention block's output to the size of the input image.

Fig. 4: The processing of the depthwise convolution.

$$\text{Depthwise Conv}(W, y)_{(i,j)} = \sum_{k,l}^{K,L} W_{(k,l)} \odot y_{(i+k,\,j+l)} \quad (4)$$

Here, W is the input image and y is the resized output of the attention block for each channel. As a result, the subject is emphasized and the background is darkened in the image.
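A minimal sketch of this step, assuming it amounts to resizing the one-channel attention map to the image resolution and multiplying it into each image channel (our reading of Eq. (4) and Fig. 4):

```python
import torch
import torch.nn.functional as F_nn

def apply_attention(image, attn_map):
    """Per-channel ("depthwise") weighting of the input image by the attention
    map, as described around Eq. (4). This interpretation is an assumption.

    image:    [batch, 3, H, W] RGB input
    attn_map: [batch, 1, h, w] attention block output
    """
    # Resize the attention map to the spatial size of the input image.
    attn = F_nn.interpolate(attn_map, size=image.shape[-2:],
                            mode="bilinear", align_corners=False)
    # Broadcast the 1-channel map over the 3 image channels, emphasizing the
    # body pixels and darkening the background.
    return image * attn
```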
C. Optimizer

Our proposed network allows one-shot training without pre-training the attention block. We need to optimize the attention block faster than the other parts and with the highest priority when extracting the confidence maps and PAFs from the input image, since the parameters of Branch1 and Branch2 can only be optimized properly after the parameters of the attention block have been optimized.

$$f_{S}^{t} = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \lVert S_{j}^{t}(p) - S_{j}^{*}(p) \rVert_{2}^{2}, \quad (5)$$

$$f_{L}^{t} = \sum_{c=1}^{C} \sum_{p} W(p) \cdot \lVert L_{c}^{t}(p) - L_{c}^{*}(p) \rVert_{2}^{2}, \quad (6)$$

$$f_{A}^{t} = \sum_{r=1}^{R} \sum_{p} W(p) \cdot \lVert A_{r}^{t}(p) - A_{r}^{*}(p) \rVert_{2}^{2}, \quad (7)$$

In Eqs. (5) and (6), S* and L* are the ground-truth confidence maps and PAFs. We also construct an attention loss f_A^t in Eq. (7) to express the attention block's loss, where the ground-truth attention map A* is built by adding all ground-truth PAFs. In Eq. (5), the confidence map loss is the l2 loss between S^t and S*. W(p) is 1 if a body part appears at pixel p and 0 otherwise; by including W(p), we concentrate on the body parts visible in the image. Eqs. (6) and (7) are obtained in the same way as Eq. (5).

We constructed the optimizer to let the visual attention block converge faster than the other parts. As shown in Eq. (8), our network optimizes the attention block N times more strongly than the other parts.

$$f_{\text{total}} = \sum_{t=1}^{T} \left( N \cdot f_{A}^{t} + f_{L}^{t} + f_{S}^{t} \right) \quad (8)$$
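As a compact illustration of Eqs. (5)-(8), the sketch below computes the masked L2 losses and the total objective; treating the attention loss as a per-stage term is our reading of the notation f_A^t, since the attention block itself produces a single map.

```python
import torch

def weighted_l2(pred, gt, W):
    """Masked L2 loss of Eqs. (5)-(7): W(p) is 1 where a body part appears and
    0 elsewhere, so only annotated regions contribute.
    pred/gt: [batch, C, H, W]; W: [batch, 1, H, W]."""
    return (W * (pred - gt) ** 2).sum()

def total_loss(S_list, L_list, A_list, S_gt, L_gt, A_gt, W, N=6):
    """Eq. (8): sum the stage losses, weighting the attention loss N times more
    than the branch losses. A_list may simply repeat the single attention map
    for every stage."""
    f_total = 0.0
    for S_t, L_t, A_t in zip(S_list, L_list, A_list):
        f_total = f_total + (N * weighted_l2(A_t, A_gt, W)
                             + weighted_l2(L_t, L_gt, W)
                             + weighted_l2(S_t, S_gt, W))
    return f_total
```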
IV. EXPERIMENTS

A. Dataset

Experiments were conducted using the MPII Human Pose Dataset. The MPII Human Pose Dataset [7] consists of 24,984 images and an annotation file containing joint information. The annotation contains 16 pieces of joint information as label data. To train our network, we constructed label confidence maps and label PAFs from the annotation information. We also prepared an attention map that forms a human silhouette by combining the limb data of the label PAFs.

B. Evaluation

In order to evaluate the performance of our network, we used mAP (mean Average Precision) and PCKh (head-normalized Percentage of Correct Keypoints), and compared them with the Open Pose, the state-of-the-art network in the field of multi-person detection. Table III shows the training and processing time of our network and the Open Pose network. Our network and the Open Pose were trained with the same number of epochs and the same hyperparameters to compare their performance. There was no significant difference in training time or processing time. We also compared PCKh and mAP to show accuracy. Table II shows the performance comparison between our network and the Open Pose. Comparing the PCKh numbers for each part, our network shows better performance on the ankle, knee, wrist and elbow, and weaker performance on the pelvis and head-top entries. The mAP shows our network is better than the Open Pose, getting 5% higher mAP@0.5. Our network achieved 89.598% and the Open Pose achieved 84.22% on the PCKh metric. Fig. 5 (a) shows the comparison between our network and the Open Pose in mAP per normalized distance. Furthermore, the processing time and training time of our proposed network are only very slightly slower than those of the Open Pose.

TABLE II: Results on the MPII Dataset.

            ankle      knee      hip      pelvis   ...   head top   wrist     elbow    shoulder   mAP
Ours        76.24445   87.2025   91.39    91.681   ...   94.768     80.9975   87.374   95.48      89.598
Open Pose   65.271     71.7105   89.973   92.846   ...   95.115     67.5755   82.4     92.4315    84.22

TABLE III: Training results of ours and the Open Pose.

                  Open Pose                    Ours
Training time     About 48 min / epoch         About 50 min / epoch
Processing time   12.091693 sec / 64 images    12.164722 sec / 64 images
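For reference, PCKh, used for the per-joint numbers in Table II, counts a predicted keypoint as correct when it lies within a threshold proportional to the annotated head segment size; the sketch below shows this standard computation and is not the paper's evaluation code.

```python
import numpy as np

def pckh(pred, gt, head_sizes, visible, alpha=0.5):
    """Generic PCKh@alpha: a joint counts as correct if the prediction lies
    within alpha * head_size of the ground truth.
    pred/gt: [N, J, 2] pixel coordinates, head_sizes: [N],
    visible: [N, J] boolean mask of labeled joints."""
    dist = np.linalg.norm(pred - gt, axis=-1)       # [N, J] joint-wise distances
    thresh = alpha * head_sizes[:, None]            # per-image threshold
    correct = (dist <= thresh) & visible
    return correct.sum() / max(visible.sum(), 1)    # fraction of correct joints
```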

Fig. 5: Comparison of outputs and mean Average Precision per normalized distance. (a) is a graph of mAP per normalized distance. (b) shows the outputs of ours and the Open Pose; the top row is the Open Pose's output and the bottom row is our network's output.

Fig. 6: Input image and our network's outputs. (a) input image, (b) attention map, (c) image summing all confidence maps, (d) image summing all PAF parts.

Fig. 7: Results for single-person and multi-person detection.


V. CONCLUSIONS

In this paper, we proposed a network for pose estimation with high accuracy through attention. In the experimental results, we showed that the extraction of each limb performs better than the Open Pose. As we can see in Table II, while the detection of the body trunk is similar to or slightly lower than the Open Pose, the accuracy of the extraction of each limb is higher in our network. These outputs are the result of removing background distortion with the attention block. We attach our network's outputs in Fig. 6 and Fig. 7, which show many cases of single-person and multi-person detection. In conclusion, by applying visual attention to human pose estimation, our network removes background distortion and achieves higher accuracy on the limbs.

REFERENCES

[1] Z. Cao, T. Simon, et al., "Realtime multi-person 2D pose estimation using part affinity fields," CVPR 2017.
[2] D. Zanca, "Visual attention driven by convolutional features," arXiv:1807.10576, 2018.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," ICLR 2015.
[4] D. Jungnickel, "The Greedy Algorithm," in Graphs, Networks and Algorithms, Algorithms and Computation in Mathematics, vol. 5, Springer, 1999.
[5] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Res. Logist. Quart., 1955.
[6] A. Bulat et al., "Human pose estimation via convolutional part heatmap regression," ECCV 2016.
[7] M. Andriluka, L. Pishchulin, et al., "2D human pose estimation: New benchmark and state of the art analysis," CVPR 2014.
[8] C. Szegedy, V. Vanhoucke, et al., "Rethinking the inception architecture for computer vision," CVPR 2016.
[9] Z. Bylinskii, T. Judd, et al., MIT Saliency Benchmark, http://saliency.mit.edu/.
[10] A. Zweng and M. Kampel, "Introducing confidence maps to increase the performance of person detectors," ISVC 2011, Lecture Notes in Computer Science, vol. 6939, Springer, Berlin, Heidelberg.
[11] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," CVPR 2017.
[12] C. Szegedy, W. Liu, et al., "Going deeper with convolutions," CVPR 2015, pp. 1-9.

